Evolution Towards Cloud: Overview of Next Generation Computing Architecture

by

Monowar Hasan (Student ID. 0605021)
&
Sabbir Ahmed (Student ID. 0605013)
A Thesis submitted to the Department of Computer Science and Engineering in partial fulfillment of the requirements for the degree of
Bachelor of Science (B.Sc.) in
Computer Science and Engineering
Thesis Supervisor: Dr. Md. Humayun Kabir

Bangladesh University of Engineering and Technology
Dhaka, Bangladesh
17 March 2012
Certification
The thesis titled “Evolution Towards Cloud: Overview of Next Generation Com-
puting Architecture”, submitted by Monowar Hasan, Student No. 0605021, Sabbir
Ahmed, Student No. 0605013, to the Department of Computer Science and Engi-
neering, Bangladesh University of Engineering and Technology, has been accepted as
satisfactory for the partial fulfillment of the requirements for the degree of Bachelor
of Science in Computer Science and Engineering.
Supervisor
Dr. Md. Humayun Kabir
Professor,
Department of Computer Science and Engineering,
Bangladesh University of Engineering and Technology,
Dhaka-1000, Bangladesh.
Declaration
We, hereby, declare that the work presented in this thesis is the outcome of the in-
vestigation performed by us under the supervision of Dr. Md. Humayun Kabir,
Professor, Department of Computer Science and Engineering, Bangladesh University
of Engineering and Technology. We also declare that no part of this thesis has been
submitted elsewhere for the award of any degree or diploma.
Signature of the Students
Monowar Hasan
Student No. 0605021
Sabbir Ahmed
Student No. 0605013
Abstract
Nowadays Cloud Computing has become a buzzword in distributed processing. Cloud
Computing originated from the ideas of concurrent processing in Computer Clusters.
It has enhanced the established architecture and standards of Grid Computing with
the ideas of Utility and Service-oriented Computing. Computing through the Cloud
supports a business model in the form of X-as-a-Service, where X stands for hardware,
software, a development platform or some storage medium. End-users can consume
any of these services from providers on a pay-as-you-go basis without knowing the
details of the underlying architecture. Hence, the Cloud offers layers of abstraction to
end-users and a scope to adjust application demand for end-users, developers and providers.
Acknowledgements
We are grateful to several people for this work, without whom it would not have been
successful. Our heartiest thanks go to our supervisor, Professor Dr. Md. Humayun Kabir,
for his support and valuable guidance. His continuous feedback and assistance helped
us to clarify our ideas and understanding of the topic.
Special thanks to Professor Dr. Hanan Lutfiyya of the University of Western Ontario,
Canada, and Professor Dr. Ivona Brandic of the Vienna University of Technology, Vienna,
Austria, for providing their research publications, which helped our thesis to progress.
The Department of Computer Science and Engineering, Bangladesh University of Engi-
neering and Technology, provided us with a sound working environment and helped us
to obtain electronic copies of the publications.
Last but not least, we acknowledge the contribution and support of our family
members for being with us and encouraging us all the way. Without their sacrifice this
work would not have been successful.
Table of Contents
Certification ii
Declaration iii
Abstract iv
Acknowledgments v
Table of Contents ix
List of Tables x
List of Figures xiii
1 Introduction 1
2 Computer Clusters 4
2.1 Architecture of Computer Clusters . . . . . . . . . . . . . . . . . . . 5
2.2 Cluster Interconnection . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Protocols for Cluster Communication . . . . . . . . . . . . . . . . . . 9
2.3.1 Internet Protocols . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.2 Low-latency Protocols . . . . . . . . . . . . . . . . . . . . . . 11
2.3.2.1 Active Messages . . . . . . . . . . . . . . . . . . . . 11
2.3.2.2 Fast Messages . . . . . . . . . . . . . . . . . . . . . . 12
2.3.2.3 VMMC . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.2.4 U-net . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.2.5 BIP . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.3 Standards for Cluster Communication . . . . . . . . . . . . . 14
2.3.3.1 VIA . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.3.2 InfiniBand . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4 Cluster Middleware . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4.1 Message-based Middleware . . . . . . . . . . . . . . . . . . . 20
2.4.2 RPC-based Middleware . . . . . . . . . . . . . . . . . . . . . 20
2.4.3 Object Request Broker . . . . . . . . . . . . . . . . . . . . . . 21
2.5 Single System Image (SSI) . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5.1 Benefits of SSI . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5.2 Features of SSI Clustering Systems . . . . . . . . . . . . . . . 23
2.5.3 Functional Relationship among Middleware SSI Modules . . . 24
2.5.3.1 Resource Management and scheduling (RMS) . . . . 24
2.6 Examples of Cluster implementation . . . . . . . . . . . . . . . . . . 25
2.6.1 Linux Virtual Server . . . . . . . . . . . . . . . . . . . . . . . 26
2.6.2 Windows Compute Cluster Server 2003 . . . . . . . . . . . . . 29
2.6.2.1 Compute Cluster Components . . . . . . . . . . . . . 30
2.6.2.2 Network Architecture . . . . . . . . . . . . . . . . . 30
2.6.2.3 Software Architecture . . . . . . . . . . . . . . . . . 31
2.6.2.4 Job Execution . . . . . . . . . . . . . . . . . . . . . 33
2.7 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3 Grid Computing : An Introduction 38
3.1 Grid Computing: definitions and overview . . . . . . . . . . . . . . . 39
3.2 Grids over Cluster Computing . . . . . . . . . . . . . . . . . . . . . . 41
3.3 An example of Grid Computing environment . . . . . . . . . . . . . . 43
3.4 Grid Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.4.1 Fabric Layer: Interfaces to Local Resources . . . . . . . . . . . 45
3.4.2 Connectivity Layer: Managing Communications . . . . . . . . 46
3.4.3 Resource Layer: Sharing of a Single Resource . . . . . . . . . 47
3.4.4 Collective Layer : Co-ordination with multiple resources . . . 47
3.4.5 Application Layer : User defined Grid Applications . . . . . . 48
3.5 Grid Computing with Globus . . . . . . . . . . . . . . . . . . . . . . 49
3.6 Resource Management in Grid Computing . . . . . . . . . . . . . . . 51
3.6.1 Resource Specification Language . . . . . . . . . . . . . . . . . 52
3.6.2 Globus Resource Allocation Manager (GRAM) . . . . . . . . . 53
3.7 Resource Monitoring in Grid Computing . . . . . . . . . . . . . . . . 54
3.8 Evolution towards Cloud Computing from Grid . . . . . . . . . . . . 61
3.9 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4 An overview of Cloud Architecture 63
4.1 Cloud Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2 Cloud Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.2.1 A layered model of Cloud architecture - Cloud ontology . . . . 66
4.2.2 Cloud Business Model . . . . . . . . . . . . . . . . . . . . . . 73
4.2.3 Cloud Deployment Model . . . . . . . . . . . . . . . . . . . . 74
4.3 Cloud Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.3.1 Infrastructure as a Service (IaaS) . . . . . . . . . . . . . . . . 78
4.3.2 Platform as a Service (PaaS) . . . . . . . . . . . . . . . . . . . 79
4.3.3 Software as a Service (SaaS) . . . . . . . . . . . . . . . . . . . 81
4.4 Virtualization on Cloud . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.4.1 Full virtualization . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.4.2 Paravirtualization . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.4.3 Motivations of Virtualization . . . . . . . . . . . . . . . . . . 87
4.5 Example of a Cloud Implementation . . . . . . . . . . . . . . . . . . 88
4.5.1 Amazon S3 Concepts . . . . . . . . . . . . . . . . . . . . . . . 88
4.5.2 Amazon S3 Data Consistency Model . . . . . . . . . . . . . . 91
4.5.3 Managing Concurrent Applications . . . . . . . . . . . . . . . 92
4.6 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5 Comparisons of Grid and Cloud : Similarities & Differences 95
5.1 Major Focus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.2 Points of Considerations . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.2.1 Business Model . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.2.2 Scalability issues . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.2.3 Multitasking and Availability . . . . . . . . . . . . . . . . . . 98
5.2.4 Resource Management . . . . . . . . . . . . . . . . . . . . . . 98
5.2.5 Application Model . . . . . . . . . . . . . . . . . . . . . . . . 102
5.2.6 Other issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.3 Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.3.1 Comparative results . . . . . . . . . . . . . . . . . . . . . . . 104
5.4 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6 Conclusion and Future works 106
Bibliography 107
List of Tables
2.1 Categories of Cluster Interconnection Hardware . . . . . . . . . . . . 7
4.1 Example of existing Cloud Systems with respect to classification into layers of
Cloud Ontology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.1 Comparative analysis between an existing Grid and Cloud implemen-
tation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
List of Figures
2.1 Architecture of a computer Cluster . . . . . . . . . . . . . . . . . . . 6
2.2 Traditional Protocol Overhead and Transmission Time. . . . . . . . . 10
2.3 The InfiniBand Architecture . . . . . . . . . . . . . . . . . . . . . . . 17
2.4 Functional Relationship Among Middleware SSI Modules . . . . . . 24
2.5 Resource Management and scheduling (RMS) . . . . . . . . . . . . . 25
2.6 Linux Virtual Server . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.7 Network Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.8 Software Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.9 Serial Task execution . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.10 Parallel Task execution . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.1 Evolution of Grid Computing . . . . . . . . . . . . . . . . . . . . . . 39
3.2 Serving job requests in traditional environment . . . . . . . . . . . . 41
3.3 Serving job requests in Grid environment . . . . . . . . . . . . . . . . 42
3.4 Google search architecture . . . . . . . . . . . . . . . . . . . . . . . . 43
3.5 Grid Protocol Architecture . . . . . . . . . . . . . . . . . . . . . . . . 45
3.6 Collective and Resource layer protocols are combined in various ways
to provide application functionality . . . . . . . . . . . . . . . . . . . 48
3.7 Programmers view of Grid Architecture. Thin lines denotes protocol
interactions where bold lines represent a direct call . . . . . . . . . . 49
3.8 A resource management architecture for Grid Computing environment 51
3.9 Globus GRAM Architecture . . . . . . . . . . . . . . . . . . . . . . . 54
3.10 Grid Monitoring Architecture Components . . . . . . . . . . . . . . . 55
3.11 Enhancement of generic Grid architecture to Service Oriented Grid . 61
4.1 Components of a Cloud Computing Solution . . . . . . . . . . . . . . 64
4.2 Hierarchical abstraction layers of Cluster, Grid and Cloud Computing 66
4.3 Cloud layered architecture : consists of five layers, figure represents
inter-dependency between layers . . . . . . . . . . . . . . . . . . . . . 67
4.4 Virtualization reduces number of servers . . . . . . . . . . . . . . . . 70
4.5 Cloud computing Business model . . . . . . . . . . . . . . . . . . . . 73
4.6 External or Public Cloud . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.7 Internal or Private Cloud . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.8 Example of Hybrid Cloud . . . . . . . . . . . . . . . . . . . . . . . . 78
4.9 Correlation between Cloud Architecture and Cloud Services . . . . . 79
4.10 Infrastructure as a Service . . . . . . . . . . . . . . . . . . . . . . . . 80
4.11 Platform as a Service . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.12 Software as a Service . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.13 Full virtualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.14 A Paravirtualized deployment where many OS can run simultaneously 85
4.15 Paravirtualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.16 Conceptual view of Amazon Simple Storage Service . . . . . . . . . . 89
4.17 Managing Concurrent Applications : W1 & W2 complete before the
start of R1 & R2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.18 Managing Concurrent Applications : W2 does not complete before the
start of R1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.19 Managing Concurrent Applications : W2 is performed before S3 returns
a ‘success’ for W1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.1 Motivation of Grid and Cloud . . . . . . . . . . . . . . . . . . . . . . 96
5.2 Comparison regarding performance, reliability and cost . . . . . . . . 97
Chapter 1
Introduction
Sometimes applications need more computing power than a sequential computer can
provide. A feasible and cost-effective solution is to connect multiple processors to-
gether and coordinate their computational powers. The resulting systems are popu-
larly known as parallel computers or Computer Clusters, and they allow the sharing
of a computational task among multiple processors. The components of a Cluster
are usually connected to each other through fast Local Area Networks. This is in
contrast to a traditional supercomputer, which has many processors connected by a
local high-speed computer bus. Each node in a Cluster runs its own instance of an
operating system. Traditionally, Computer Clusters run on separate physical comput-
ers with the same operating system; hence, the nodes in a Cluster are homogeneous
and tightly-coupled. The activities of the computing nodes are monitored by ‘Clustering
Middleware’, a software layer that sits atop the nodes and allows the users to view
the Cluster as a single computing unit, through a ‘Single System Image’ concept.
Computer Clusters are covered in detail in Chapter 2.
Computational Grids, another approach to distributed processing, also use many
nodes like Computer Clusters, but they form a more dynamic and usually heterogeneous
system. Heterogeneous pools of servers, storage systems and networks are brought together
in a virtualized system that is exposed to the user as a single computing entity. In
a Grid, a computing job uses one or a few nodes, with little or no inter-node commu-
nication. Job requests are first pooled and then allocated to the available processors in
an efficient way. ‘Grid middleware’ is specific software, which provides the necessary
functionality required to enable sharing of heterogeneous resources. Grid Computing
is realized by deploying such Grid middleware. Architectures and issues of Computer Grids are
covered in Chapter 3.
Cluster Grids (or Computer Clusters) are local resources that operate inside the fire-
wall and are controlled by a single administrative entity that has complete control
over each component. Thus, Clusters do not actually involve sharing of resources and
cannot be considered Grids in the narrow sense. The term Enterprise Grid refers
to the application of Grid Computing for sharing resources within the bounds of a single
company. All components of an Enterprise Grid operate inside the firewall of a com-
pany, but may be heterogeneous and physically distributed across multiple company
locations. A Grid that is owned and deployed by a third party service provider is
called a Utility Grid. The service being offered via a Utility Grid is utility computing,
i.e. compute capacity and storage in a pay-per-use manner. A Utility Grid operates
outside the firewall of the user. The growing trend toward Utility Grids popularized the
influential approach of Cloud Computing.
Cloud Computing, a relatively recent term, is a computing paradigm, where a large
pool of systems are connected in private or public networks, to provide dynamically
scalable infrastructure for application, data and file storage. It implies a service ori-
ented architecture, reduced information technology overhead for the end-user, great
flexibility, reduced total cost of ownership, on-demand service and many other things.
In the Cloud, applications are delivered as services over the Internet. Infrastructure re-
sources (hardware, storage and system software) and applications are provided in an
X-as-a-Service manner. When a Cloud is made available in a pay-as-you-go manner
to the general public, we call it a Public Cloud. We use the term Private Cloud to
refer to internal datacenters of a business or other organization, not made available
to the general public. Thus, Cloud Computing is the combination of SaaS and Utility
Computing, but does not include Private Clouds. A detailed overview of Cloud Com-
puting is presented in Chapter 4.
Cloud Computing is not entirely similar to Computer Grids or Utility Grids; the Cloud
differs from the Grid in various respects. Similarities and differences
between Grid and Cloud Computing are discussed in Chapter 5.
Chapter 2
Computer Clusters
A Cluster [1] is a type of parallel or distributed processing system. It consists of a
collection of interconnected stand-alone computers that work together as a
single, integrated computing resource. All the component subsystems of a Cluster are
supervised within a single administrative domain, usually residing in a single room
and managed as a single computer system. Cluster Computing can be used for load
balancing as well as for high availability [2]. Cluster Computing can also be used as
a relatively low-cost form of parallel processing for scientific and other applications
that lend themselves to parallel operations.
Some properties of Cluster Computing:
• Computers, also known as nodes, in a Cluster are networked in a tightly-coupled
fashion. They are all on the same subnet of the same domain and often net-
worked with very high bandwidth connections.
• Nodes of a Cluster are homogeneous. They all use the same hardware, run the
same software, and are generally configured identically. Each node in a Cluster
is a dedicated resource; generally, only Cluster applications run on a Cluster
node.
• Message Passing Interface (MPI) [3] is commonly used in Clusters; it is a programming
interface that allows distributed application instances to communicate with
each other and share information (a minimal example appears after this list).
• Dedicated hardware, high-speed interconnects, and MPI provide Clusters the
ability to work efficiently on fine-grained parallel problems where the subtasks
must communicate many times per second, including problems with short tasks,
some of which may depend on the results of previous tasks.
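The programming model mentioned above can be made concrete with a minimal MPI sketch in C. The program below is an illustrative sketch, not taken from any particular Cluster installation: process 0 sends a short greeting that process 1 receives.

/* hello_mpi.c -- minimal MPI send/receive sketch (assumes an MPI
 * implementation such as MPICH is installed). */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    int rank, size;
    char buf[64];

    MPI_Init(&argc, &argv);                 /* start the MPI runtime          */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I?            */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes in the job? */

    if (rank == 0 && size > 1) {
        strcpy(buf, "hello from node 0");
        MPI_Send(buf, (int)strlen(buf) + 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received: %s\n", buf);
    }

    MPI_Finalize();                         /* shut the runtime down cleanly  */
    return 0;
}

On a typical installation such a program would be compiled with mpicc and launched across the Cluster nodes with a command like mpiexec -n 2 ./hello_mpi; the mpiexec launcher reappears in the discussion of Windows Compute Cluster Server in Section 2.6.2.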
2.1 Architecture of Computer Clusters
In Cluster Computing a computer node can be a single or multiprocessor system[4].
The nodes can be PCs, workstations, or Symmetric Multiprocessors (SMP) with mem-
ory, I/O facilities, and an operating system. In Cluster Computing, two or more nodes
are connected together. These nodes can exist in a single cabinet or be physically
separated and connected via a LAN. This LAN-based inter-connected Cluster of com-
puters appears as a single system to the users and applications. Cluster Computing
can provide a cost-effective way to gain features and benefits like fast and reliable ser-
vices that previously could be found only on more expensive proprietary shared memory
systems. Typical architecture of a Cluster is shown in Figure 2.1.
In Cluster Computing, several high-performance networks or switches are used to
connect the nodes of the Cluster. Among them, Gigabit Ethernet and Myrinet are
the most common. Switched networks are preferred since they allow multiple simultane-
ous messages to be sent, which can improve overall application performance. Cluster
interconnects use Network Interface Cards. Interconnection technologies may be
classified into four categories, depending on whether the internal connection is from
the I/O bus or the memory bus, and depending on whether the communication between
the computers is performed primarily using messages or using shared storage.

Figure 2.1: Architecture of a computer Cluster
We will discuss Cluster Interconnection in Section 2.2. Several Fast Communication
Protocols and Services are used to communicate within nodes. We will discuss them
briefly in Section 2.3.
The operating system in the individual nodes of the Cluster provides the fundamental
system support for Cluster operations. Whether the user is opening files, sending mes-
sages, or starting additional processes, the operating system is always present. The
primary role of an operating system is to multiplex multiple processes onto hardware
components that comprise a system (resource management and scheduling), as well
as provide a high-level software interface for user applications. These services include
protection boundaries, process and thread co-ordination, inter-process communication
and device handling.
There is a Middleware layer which sits between the operating system and applications.
Middleware layers enable the seamless usage of heterogeneous components across the Cluster.
Middleware provides the system with a Single System Image (SSI) and a System Availability
Infrastructure. Cluster Middleware and Single System Image (SSI) are discussed in
Sections 2.4 and 2.5. Both sequential and parallel or distributed applications can be
run using Cluster Computing. For parallel applications, several parallel program-
ming environments and tools, such as compilers and MPI (Message Passing Interface), are
used. We will conclude the Chapter by describing two implementations of Cluster Computing:
Linux Virtual Server (LVS) in Section 2.6.1 and Windows Compute Cluster Server
2003 in Section 2.6.2.
2.2 Cluster Interconnection
In Cluster Computing the choice of interconnection technology is a key component.
We can classify the Interconnection technologies into four categories. These four cat-
egories depend on the internal connection and how the nodes communicate with each
other. The internal connection can be from the I/O bus or the memory bus and the
communication between the computers can be performed primarily using messages or
using shared storage [5]. Table 2.1 illustrates the four types of interconnection.
Type               Message Based                                    Shared Storage
I/O Attached       Most common type; includes most high-speed       Shared disk subsystems.
                   networks; VIA, TCP/IP.
Memory Attached    Usually implemented in software as               Global shared memory,
                   optimizations of I/O attached message-based.     Distributed shared memory.

Table 2.1: Categories of Cluster Interconnection Hardware
Among the four interconnection categories I/O attached message-based systems are
by far the most common. This system includes all commonly-used wide-area and
local-area network technologies. It also includes several recent products that are
specifically designed for Cluster computing. I/O attached shared storage systems in-
clude computers that share a common disk sub-system. Memory attached systems
are not as common as I/O attached systems, since the memory bus of an individual
computer generally has a design that is unique to that type of computer. However,
memory-attached systems have been implemented; most of the time they are
implemented in software or with memory-mapped I/O, such as Reflective Memory [6].
There are several Hybrid systems that combine the features of more than one category.
An example of a Hybrid system is the InfiniBand standard [7]. InfiniBand is an I/O
attached interconnection. It can be used to send data to a shared disk sub-system
as well as to send messages to another computer. There are many factors that affect
the choice of interconnect technology for a Cluster, such as compatibility with
the Cluster hardware and operating system, price, and performance. The performance of
a Cluster interconnect depends on its latency and bandwidth.
• Latency is the time needed to send data from one computer to another. Latency
also includes overhead for the software to construct the message as well as the
time to transfer the bits from one computer to another.
• Bandwidth is the number of bits per second that can be transmitted over the
interconnect hardware.
Applications that send small messages will have better performance mainly because
the latency is reduced, whereas applications that send large messages will have better
performance as the bandwidth increases. The latency is a function of
both the communication software and the network hardware.
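These two quantities can be combined into a simple first-order cost model. Assuming a fixed per-message latency L (software plus hardware overhead) and a link bandwidth B, the time to deliver an n-byte message is approximately

T(n) = L + n / B.

With illustrative (not measured) values of L = 50 microseconds and B = 125 MB/s (about 1 Gbit/s), a 100-byte message costs roughly 50.8 microseconds and is latency-dominated, while a 1 MB message costs roughly 8 ms and is bandwidth-dominated. This is why the low-latency protocols of Section 2.3.2 concentrate on reducing the software contribution to L.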
2.3 Protocols for Cluster Communication
A communication protocol defines a set of rules and conventions for communicat-
ing among the nodes in the Cluster [8]. Each protocol uses different technology to
exchange information. Communication protocols can be classified as:
• Connection oriented or connectionless.
• Offering various levels of reliability: a reliable protocol guarantees that messages
arrive, and arrive in order; an unreliable protocol makes no such guarantee.
• Unbuffered, which is synchronous, or buffered, which is asynchronous.
• By the number of intermediate data copies between buffers, which may be zero,
one or more.
Several protocols are used in Clusters. Initially, traditional Internet protocols were
used for Clustering. Later, several protocols were designed specifically for
Cluster communication. Finally, two new protocol standards have been specially de-
signed for use in Cluster Computing.
2.3.1 Internet Protocols
The Internet Protocol (IP) is the standard for networking worldwide. The Trans-
mission Control Protocol (TCP) and the User Datagram Protocol (UDP) are both
transport layer protocols built over the Internet Protocol. TCP and UDP protocols
and the de facto standard BSD sockets Application Programmer’s Interface (API) to
TCP and UDP were among the first messaging libraries used for Cluster Comput-
ing [9].
• Internet Protocol uses one or more buffers in system memory with the help of
operating system services.
• User application constructs the message in user memory, and then makes an
operating system request to copy the message into a system buffer.
• A system interrupt is required to send and receive the message.
With the Internet protocols, operating system overhead and the overhead for copies to and
from system memory are a significant portion of the total time to send a message. As
network hardware became faster during the 1990s, the overhead of the communica-
tion protocols became significantly larger than the actual hardware transmission time
for messages, as shown in Figure 2.2. This created the need for new types of
protocols for Cluster computing.
Figure 2.2: Traditional Protocol Overhead and Transmission Time.
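For reference, the following self-contained C sketch shows the BSD sockets style of messaging described above; every send() here passes through a system buffer in the kernel, which is exactly the per-message overhead illustrated in Figure 2.2. The peer address and port are made-up examples.

/* tcp_send.c -- minimal BSD sockets (TCP) sender; illustrative only.
 * The peer address 192.168.1.10:5000 is a made-up example. */
#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);   /* TCP socket */
    struct sockaddr_in peer;

    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port   = htons(5000);
    inet_pton(AF_INET, "192.168.1.10", &peer.sin_addr);

    if (connect(fd, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
        perror("connect");
        return 1;
    }

    const char *msg = "hello over TCP";
    /* The kernel copies msg into a system buffer before transmission --
     * the copy and system-call overhead discussed in this Section. */
    send(fd, msg, strlen(msg), 0);
    close(fd);
    return 0;
}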
2.3.2 Low-latency Protocols
To avoid operating system intervention in message transmission, several research
projects were carried out during the 1990s. These projects led to the development of low-
latency protocols. These protocols also provide user-level messaging services across
high-speed networks. Low-latency protocols developed during the 1990’s include Ac-
tive Messages, Fast Messages, the VMMC (Virtual Memory-Mapped Communication)
system, U-net, and Basic Interface for Parallelism (BIP), among others.
2.3.2.1 Active Messages
Active Messages was developed at the University of California, Berkeley. It provided the low-
latency communications library for the Berkeley Network of Workstations (NOW)
project [10, 11]. The short messages used in Active Messages are synchronous and based
on the concept of a request-reply primitive.
• The sending-side user-level application constructs a message in user memory. The
receiving process allocates a receive buffer in user memory on the receiving side
and sends a request to the sender.
• The sender replies by copying the message from the user buffer on the sending
side directly to the network buffer. No buffering in system memory is performed.
• Network hardware transfers the message to the receiver, and then the message
is transferred from the network buffer to the receive buffer in user memory.
User virtual memory on both the sending and receiving sides must be pinned
to an address in physical memory so that it cannot be paged out
during the network operation. Once the pinned user memory buffers are established,
no operating system intervention is required for a message to be sent. Since no copies
from user memory to system memory are used, this protocol is known as a zero-copy
protocol.
To support multiple concurrent parallel applications in a Cluster, Active Messages
was extended to Generic Active Messages (GAM). In GAM, a copy sometimes oc-
curs to a buffer in system memory on the receiving side so that user buffers can be
reused more efficiently. In this case, the protocol is referred to as a ‘one-copy’ protocol.
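The core idea of the request-reply primitive is that a message carries the identifier of a handler that the receiving side invokes when the message arrives. The following self-contained C sketch imitates that dispatch inside a single process; the handler table and am_deliver() are hypothetical names invented for illustration, not the actual Berkeley Active Messages API.

/* am_sketch.c -- conceptual Active-Message-style handler dispatch.
 * The handler table and am_deliver() are hypothetical, for illustration only. */
#include <stdio.h>

#define MAX_HANDLERS 8

typedef void (*am_handler_t)(const void *payload, int len);

static am_handler_t handler_table[MAX_HANDLERS];

/* A receiver would run this when a message arrives: the message names the
 * handler to run, so no explicit receive() call is needed. */
static void am_deliver(int handler_id, const void *payload, int len)
{
    if (handler_id >= 0 && handler_id < MAX_HANDLERS && handler_table[handler_id])
        handler_table[handler_id](payload, len);
}

static void on_request(const void *payload, int len)
{
    printf("request handler got %d bytes: %.*s\n", len, len, (const char *)payload);
}

int main(void)
{
    handler_table[0] = on_request;            /* register handler 0                 */
    const char msg[] = "compute block 42";    /* message built in user memory       */
    am_deliver(0, msg, (int)sizeof(msg) - 1); /* simulate arrival of an active message */
    return 0;
}

In a real implementation the handler index travels across the network with the payload, and the message is delivered directly from the pinned user buffers described above, without operating system intervention.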
2.3.2.2 Fast Messages
Fast Message was developed at the University of Illinois. It is similar to Active
Messages [12]. Fast Message extends Active Message by imposing stronger guarantees
on the underlying communication.
• Fast Message guarantees that all messages arrive reliably and in-order, even if
the underlying network hardware does not.
• Fast Message uses flow control to ensure that a fast sender cannot overrun a
slow receiver, thus causing messages to be lost. Flow control is implemented in
Fast Messages with a credit system that manages pinned memory in the host
computers.
2.3.2.3 VMMC
The Virtual Memory-Mapped Communication (VMMC) [13] system was developed
as a low-latency protocol for the Princeton SHRIMP project. One goal of VMMC
was to view messaging as reads and writes into the user-level virtual memory system.
• VMMC works by mapping a page of user virtual memory to physical memory.
It makes a correspondence between pages on the sending and the receiving sides.
• It uses specially designed hardware. This hardware allows the network interface
to snoop writes to memory on the local host and have these writes automatically
updated on the remote host’s memory. Various optimizations of these writes have
been developed that help to minimize the total number of writes and the network
traffic, and to improve overall application performance.
VMMC is an example of a paradigm known as distributed shared memory (DSM).
In DSM systems memory is physically distributed among the nodes in a system, but
processes in an application may view shared memory locations as identical and per-
form reads and writes to the shared memory locations.
2.3.2.4 U-net
The U-net network interface architecture [14] was developed at Cornell University.
U-net provides zero-copy messaging where possible.
• U-net adds the concept of a virtual network interface for each connection in a
user application, just as an application has a virtual memory address space
that is mapped to real physical memory on demand.
• Each communication endpoint of the application is viewed as a virtual network
interface mapped to a real set of network buffers and queues on demand.
The advantage of this architecture is that once the mapping is defined, each active
interface has direct access to the network without operating system intervention. The
result is that communication can occur with very low latency.
2.3.2.5 BIP
Basic Interface for Parallelism (BIP) is a low-latency protocol that was developed at
the University of Lyon [15].
• BIP is designed as a low-level message layer over which a higher-level layer such
as Message Passing Interface (MPI) [3] can be built. Programmers can use MPI
over BIP for parallel application programming.
• The initial BIP interface consisted of both blocking and non-blocking calls.
Later versions (BIP-SMP) provide multiplexing between the network and shared
memory under a single API for use on Clusters of symmetric multiprocessors.
BIP achieves low latency and high bandwidth by using different protocols, like Active
Messages and Fast Messages for various message sizes. It also provides a zero or single
memory copy of user data. To simplify the design and keep the overheads low, BIP
guarantees in-order delivery of messages, although some flow control issues for small
messages are passed to higher software levels.
2.3.3 Standards for Cluster Communication
Research on low-latency protocols had progressed sufficiently for a new
standard for low-latency messaging to be developed: the Virtual Interface Architec-
ture (VIA). Industrial researchers also worked on standards for shared storage subsystems.
The combination of the efforts of many researchers has resulted in the InfiniBand stan-
dard.
2.3.3.1 VIA
The Virtual Interface Architecture [16] is a communications standard that combines
many of the best features of various academic projects. A consortium of academic and
industrial partners, including Intel, Compaq, and Microsoft, developed the standard.
• VIA supported heterogeneous hardware and was available as of early 2001.
• It was based on the concept of a virtual network interface. Before a message
can be sent in VIA, send and receive buffers must be allocated and pinned to
physical memory locations.
• There was no need of system calls after the buffers and associated data structures
are allocated.
• A send or receive operation in a user application consists of posting a descriptor
to a queue. The application can choose to wait for a confirmation that the
operation has completed, or can continue host processing while the message is
being processed.
Several hardware vendors and some independent developers have developed VIA im-
plementations for various network [17][18] products. VIA implementations can be
classified as native or emulated.
• A native implementation of VIA off-loads a portion of the processing required
to send and receive messages to special hardware on the network interface card.
When a message arrives in a native VIA implementation, the network card
performs at least a portion of the work required to copy the message into user
memory.
• In an emulated VIA implementation, the host CPU performs the processing to
send and receive messages. Although the host processor is used in both cases,
an emulated implementation of VIA has less overhead than TCP/IP. However,
the services provided by VIA are different than those provided by TCP/IP, since
the communication may not be guaranteed to arrive reliably in VIA.
2.3.3.2 InfiniBand
The InfiniBand standard [19] is another standard for Cluster communication and was sup-
ported by a large consortium of industrial partners, including Compaq, Dell, Hewlett-
Packard, IBM, Intel, Microsoft and Sun Microsystems. The InfiniBand architecture
replaces the standard shared bus for I/O on current computers with a high-speed
serial, channel-based, message-passing, scalable, and switched fabric. There are two
types of adapters: host channel adapters (HCA) and target channel adapters (TCA).
All systems and devices attach to the fabric through an HCA or
a TCA, as shown in Figure 2.3. In InfiniBand, data is sent
as packets, and six types of transfer methods are available, including:
• Reliable and unreliable connections.
• Reliable and unreliable datagrams.
• Multicast connections.
• Raw packets.
InfiniBand supports remote direct memory access (RDMA) read or write operations.
This allows one processor to read or write the contents of memory at another processor,
and also directly supports IPv6 [20] messaging for the Internet. There are several
components of InfiniBand. They are:
Figure 2.3: The InfiniBand Architecture
• Host channel adapter (HCA): Host channel adapter is an interface that
resides within a server. HCA communicates directly with the server’s memory,
processor, target channel adapter or a switch. It guarantees delivery of data
and can recover from transmission errors.
• Target channel adapter (TCA): Target channel adapter enables I/O devices
to be located within the network independent of a host computer. It includes
an I/O controller that is specific to its particular device’s protocol. TCAs can
communicate with an HCA or a switch.
• Switch: A switch is virtually equivalent to a traffic policeman. It allows many HCAs
and TCAs to connect to it and handles network traffic. It offers higher availability,
higher aggregate bandwidth, load balancing, data mirroring and much more. It
looks at the “local route header” on each packet of data and forwards it to the
appropriate location. A group of switches is referred to as a fabric. If a host
computer is down, the switch still continues to operate. The switch also frees
up servers and other devices by handling network traffic.
• Router: A router forwards data packets from a local network (called a subnet)
to other external subnets. It reads the ‘global route header’ and forwards the
packet to the appropriate address. It rebuilds each packet with the proper local address header
as it passes it to the new subnet.
• Subnet Manager: It is an application responsible for configuring the local
subnet and ensuring its continued operation. Configuration responsibilities in-
clude managing switch and router setups and reconfiguring the subnet if a link
goes down or a new one is added.
The InfiniBand Architecture (IBA) comprises four primary layers that describe
communication devices and methodology.
• Physical Layer: Defines the electrical and mechanical characteristics of the
IBA, including the cables, connectors and hot-swap characteristics. IBA con-
nectors include fiber, copper and backplane connectors. There are three link
speeds specified as 1X, 4X and 12X. 1X link cable has four wires; two for each
direction of communication (read and write).
• Link Layer: Link Layer includes packet layout, point-to-point link instruction,
switching within a local subnet and data integrity. There are two types of packets: man-
agement and data. Management packets handle link configurations and main-
tenance. Data packets carry up to 4 kilobytes of transaction payload. Every
device in a local subnet has a local ID (LID) for forwarding data appropriately.
It handles data integrity by including variant and invariant cyclic redundancy
checking (CRC). The variant CRC checks fields that change from point-to-point
and the invariant CRC provides end-to-end data integrity.
• Network Layer: The network layer is responsible for routing packets from one
subnet to another. The global route header located within a packet includes an
IPv6 address for the source and destination of each packet. For single subnet
environments, the network layer information is not used.
• Transport Layer: The transport layer handles the order of packet delivery. It also
handles partitioning, multiplexing and transport services that determine reliable
connections.
2.4 Cluster Middleware
Middleware is the layer of software sandwiched between the operating system and
applications. It has re-emerged as a means of integrating software applications that
run in a heterogeneous environment. There is a large overlap between the infrastructure
that is provided to a Cluster by high-level Single System Image (SSI) services and
that provided by the traditional view of middleware. Middleware helps a developer
overcome three potential problems with developing applications on a heterogeneous
Cluster:
• Gives the ability to access software inside or outside the developer’s site.
• Helps to integrate software from different sources.
• Rapid application development.
The services that middleware provides are not restricted to application development.
Middleware also provides services for the management and administration of a het-
erogeneous system.
2.4.1 Message-based Middleware
Message-based middleware uses a common communication protocol to exchange data
between applications. The communication protocol hides many of the low-level mes-
sage passing primitives from the application developer. Message-based middleware
software can pass messages directly between applications, send messages via software
that queues waiting messages, or use some combination of the two. Examples of this
type of middleware are the three upper layers of the OSI model [21], the session,
presentation and applications layers.
2.4.2 RPC-based Middleware
There are many applications where the interactions between processes in a distributed
system are remote operations, often with a return value. For these applications Re-
mote Procedure Call (RPC) is used. The implementation of the client/server model in
terms of Remote Procedure Call (RPC) allows the code of the application to remain
the same whether the procedures are local or remote. Inter-process communication
mechanisms serve four important functions [22]:
• They offer mechanisms against failure. They also provide the means to cross
administrative boundaries.
• They allow communications between separate processes over a computer net-
work.
• They enforce clean and simple interfaces, thus providing a natural aid for the
modular structure of large distributed applications.
• They hide the distinction between local and remote communication, thus allow-
ing static or dynamic reconfiguration.
2.4.3 Object Request Broker
An Object Request Broker (ORB) is a type of middleware that supports the remote
execution of objects. An international ORB standard is CORBA (Common Object
Request Broker Architecture). It is supported by more than 700 groups and managed
by the Object Management Group (OMG) [23]. The OMG is a non profit-making
organization whose objective is to define and promote standards for object orienta-
tion in order to integrate applications based on existing technologies. The Object
Management Architecture (OMA) is characterized by the following:
• The Object Request Broker (ORB): It is the controlling element of the archi-
tecture and it supports the portability of objects and their interoperability in a
network of heterogeneous systems.
• Object services: These are specific system services for the manipulation of ob-
jects. Their goal is to simplify the process of constructing applications.
• Application services: They offer a set of facilities allowing applications to access
databases and printing services, to synchronize with other applications, and so on.
• Application objects: They allow the rapid development of applications. A new
application can be formed from objects in a combined library of application
services.
2.5 Single System Image (SSI)
SSI is the illusion, created by software or hardware, that presents a collection of com-
puting resources as one, more powerful unified resource [24]. In other words, it is the property of a
system that hides the heterogeneous and distributed nature of the available resources
and presents them to users and applications as a single unified computing resource.
SSI makes the Cluster appear like a single machine to the user, to applications, and to
the network. SSI Cluster-based systems are mainly focused on complete transparency
of resource management, scalable performance, and system availability in supporting
user applications. SSI is supported by a middleware layer that resides between the
OS and the user-level environment. Middleware consists essentially of two sub-layers of software
infrastructure.
• SSI infrastructure - Glues together the OSs on all nodes to offer unified access to
system resources.
• System availability infrastructure - Enables Cluster services such as check-
pointing, automatic failover, recovery from failure and fault-tolerant support
among all nodes of the Cluster.
2.5.1 Benefits of SSI
There are several benefits of SSI:
• Transparent use of system resources.
• Transparent process migration and load balancing across nodes.
• Improved reliability and higher availability.
• Improved system response time and performance.
• Simplified system management.
• Reduction in the risk of operator errors.
• No need to be aware of the underlying system architecture to use these machines
effectively.
2.5.2 Features of SSI Clustering Systems
• Single I/O Space: Any node can access any peripheral or disk devices without
the knowledge of physical location.
• Single Process Space: Any process on any node can create processes with Cluster-
wide process identifiers, and processes can communicate through signals, pipes, etc.,
as if they were on a single node.
• Single Global Job Management System: SSI provides single global job
management system. The manager node manages all the operations.
• Checkpointing : Some SSI systems allow checkpointing of running processes,
allowing their current state to be saved and reloaded at a later date. Check-
pointing can be seen as related to migration, as migrating a process from one
node to another can be implemented by first checkpointing the process, then
restarting it on another node. Alternatively checkpointing can be considered as
migration to disk (a tiny sketch of the idea follows this list).
• Process Migration: Many SSI systems provide process migration. Processes
may start on one node and be moved to another node, possibly for resource
balancing or administrative reasons. As processes are moved from one node to
another, other associated resources may be moved with them.
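As a deliberately tiny illustration of the checkpointing idea mentioned above, the following self-contained C sketch saves and restores a small, assumed "process state" struct. Real SSI checkpointing must capture the full memory image, registers and open files; the struct fields and the file name here are arbitrary examples.

/* ckpt_sketch.c -- naive checkpoint/restart of a tiny assumed "process state". */
#include <stdio.h>

struct state { long iteration; double partial_sum; };

static int save_checkpoint(const char *path, const struct state *s)
{
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    size_t ok = fwrite(s, sizeof(*s), 1, f);   /* persist the state         */
    fclose(f);
    return ok == 1 ? 0 : -1;
}

static int load_checkpoint(const char *path, struct state *s)
{
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    size_t ok = fread(s, sizeof(*s), 1, f);    /* reload a saved state      */
    fclose(f);
    return ok == 1 ? 0 : -1;
}

int main(void)
{
    struct state s = { 0, 0.0 };
    if (load_checkpoint("job.ckpt", &s) == 0)  /* restart (possibly on another node) */
        printf("restarting from iteration %ld\n", s.iteration);
    for (; s.iteration < 1000000; s.iteration++) {
        s.partial_sum += 1.0 / (double)(s.iteration + 1);
        if (s.iteration % 100000 == 0)
            save_checkpoint("job.ckpt", &s);   /* periodic checkpoint       */
    }
    printf("sum = %f\n", s.partial_sum);
    return 0;
}

Migrating a process can then be viewed as checkpointing it on one node and restarting it from the saved state on another, as described in the list above.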
Figure 2.4: Functional Relationship Among Middleware SSI Modules
2.5.3 Functional Relationship among Middleware SSI Mod-ules
Every SSI has a boundary. Single system support can exist at different levels within
a system, one able to be built on another. In SSI there can be three levels of ab-
stractions. They are application and subsystem level, operating system kernel level
and hardware level. In Figure 2.4 the functional relationship among middleware SSI
module is shown. Resource Management and Scheduling is done in subsystem level.
2.5.3.1 Resource Management and scheduling (RMS)
The RMS system is responsible for distributing applications among Cluster nodes. It
enables the effective and efficient utilization of the available resources. In RMS there
are two types of software components. The basic architecture of RMS, a client-server system,
is shown in Figure 2.5.
• Resource manager: Locating and allocating computational resource, authen-
tication, process creation and migration.
• Resource scheduler: Queuing applications, resource location and assignment.
It instructs the resource manager what to do and when (policy).

Figure 2.5: Resource Management and scheduling (RMS)
There are several services which are provided by RMS:
• Process Migration.
• Checkpointing.
• Fault Tolerance.
• Minimization of Impact on Users.
• Load Balancing.
• Multiple Application Queues.
2.6 Examples of Cluster implementation
In this Section, we will discuss two existing Cluster implementations: Linux Virtual
Server (LVS), an open source project which is an advanced load balancing solution
for Linux systems; and Windows Compute Cluster Server 2003, a commercial Cluster
server developed by Microsoft Corporation.
2.6.1 Linux Virtual Server
In this Section, we will briefly discuss Linux Virtual Server [25]. Linux Virtual Server
(LVS) is an advanced load balancing solution for Linux systems. It is an open source
project started by Wensong Zhang in May 1998. The mission of the project was
to build a high-performance and highly available server for Linux using Clustering
technology, which provides good scalability, reliability and serviceability. The Linux
Virtual Server directs clients’ network connection requests to multiple servers that
share their workload, which can be used to build scalable and highly available Inter-
net services.
The Linux Virtual Server directs clients’ network connection requests to the different
servers according to scheduling algorithms and makes the parallel services of the
Cluster appear as a single virtual service with a single IP address. The Linux
Virtual Server extends the TCP/IP stack of Linux kernel to support three IP load-
balancing techniques:
• NAT (Network Address Translation): Maps IP addresses from one group
to another. NAT is used when hosts in internal networks want to access the
Internet and be accessed in the Internet.
• IP tunneling: Encapsulates IP datagram within IP datagrams. This allows
datagrams destined for one IP address to be wrapped and redirected to another
IP address.
• Direct routing: Allows responses to be routed directly to the requesting user
machine instead of passing back through the load balancer.
The Linux Virtual Server also provides four scheduling algorithms for selecting servers
from the Cluster for new connections (a sketch of one of them follows this list):
• Round robin: Directs the network connections to the different servers in a
round-robin manner.
• Weighted round robin: Treats real servers according to their different processing capac-
ities. A scheduling sequence will be generated according to the server weights.
Clients’ requests are directed to the different real servers based on the scheduling
sequence in a round robin manner.
• Least-connection: Directs clients’ network connection requests to the server
with the least number of established connections.
• Weighted least-connection: A performance weight can be assigned to each
real server. The servers with a higher weight value will receive a larger percent-
age of live connections at any time.
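To make the last two policies concrete, the following self-contained C sketch selects a server using the weighted least-connection rule: it picks the server with the smallest ratio of active connections to weight. The server pool values are made-up examples, and this is an illustration of the idea rather than the actual LVS kernel code.

/* wlc_sketch.c -- weighted least-connection server selection, illustrative only. */
#include <stdio.h>

struct server {
    const char *addr;   /* real server address (example values)   */
    int active_conns;   /* currently established connections      */
    int weight;         /* administrator-assigned capacity weight */
};

/* Pick the server minimizing active_conns / weight; compared without
 * division as a.active * b.weight < b.active * a.weight. */
static int pick_server(const struct server *pool, int n)
{
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (pool[i].weight <= 0)
            continue;                       /* weight 0 means "drained" */
        if (best < 0 ||
            pool[i].active_conns * pool[best].weight <
            pool[best].active_conns * pool[i].weight)
            best = i;
    }
    return best;
}

int main(void)
{
    struct server pool[] = {
        { "10.0.0.1", 12, 1 },
        { "10.0.0.2", 30, 4 },   /* lowest connections-per-weight ratio */
        { "10.0.0.3", 25, 2 },
    };
    int idx = pick_server(pool, 3);
    printf("forward new connection to %s\n", pool[idx].addr);
    return 0;
}

Weighted round robin differs only in that the choice is driven by a precomputed sequence derived from the weights rather than by the live connection counts.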
Client applications interact with the Cluster as if it were a single server. The clients
are not affected by the interaction with the Cluster and do not need modification.
The application performance scalability is achieved by adding one or more nodes to
the Cluster. High availability is achieved by automatically detecting node or daemon failures
and reconfiguring the system appropriately. The Linux Virtual Server follows
a three-tier architecture, shown in Figure 2.6. The functionality of each tier is:
• Load Balancer: The front end to the service as viewed by connecting clients.
The load balancer directs network connections from clients who access a single
IP address for a particular service, to a set of servers that actually provide the
service.
• Server Pool: It consists of a Cluster of servers that implement the actual
services, such as Web, FTP, mail, DNS, and so on.
• Back-end Storage: It provides the shared storage for the servers, so that it is
easy for servers to keep the same content and provide the same services.

Figure 2.6: Linux Virtual Server
The load balancer handles incoming connections using IP load balancing techniques.
The Load balancer selects servers from the server pool, maintains the state of con-
current connections and forwards packets, and all the work is performed inside the
kernel, so that the handling overhead of the load balancer is low. The load balancer
can handle much larger numbers of connections than a general server; therefore it
can schedule a large number of servers without becoming a potential
bottleneck in the system.
The server nodes may be replicated for either scalability or high availability. When
the load on the system saturates the capacity of the current server nodes, more server
nodes can be added to handle the increasing workload. One of the advantages of
a Clustered system is that it can be built with hardware and software redundancy.
Detecting a node or daemon failure and then reconfiguring the system appropriately
so that its functionality can be taken over by the remaining nodes in the Cluster is
a means of providing high system availability. A Cluster-monitor-daemon can run on
the load balancer and monitor the health of server nodes. If a server node cannot be
reached by ICMP (Internet Control Message Protocol) ping or there is no response
of the service in the specified period, the monitor will remove or disable the server in
the scheduling table, so that the load balancer will not schedule new connections to
the failed one and the failure of a server node can be masked.
The back-end storage for this system is usually provided by a distributed and fault-
tolerant file system. Such a system also takes care of the availability and scalability
issues of file system accesses. The server nodes access the distributed file system in
a similar fashion to that of accessing a local file system. However, multiple identical
applications running on different server nodes may access shared data concurrently.
Any conflict among applications must be reconciled so that the data remains in a
consistent state.
2.6.2 Windows Compute Cluster Server 2003
In this Section, we will briefly discuss Windows Compute Cluster Server 2003 [26].
It is an integrated platform for running, managing, and developing high performance
computing applications.
2.6.2.1 Compute Cluster Components
Each Windows Compute Cluster Server 2003 Cluster consists of a head node and one
or more compute nodes. The head node mediates all access to the Cluster resources
and acts as a single point for Cluster deployment, management, and job scheduling.
A Cluster can consist of only a head node.
• Head node: The head node is responsible for providing user interface and
management services to the Cluster. The user interface consists of the Com-
pute Cluster Administrator, which is a Microsoft Management Console (MMC)
snap-in, the Compute Cluster Job Manager, which is a Win32 graphic user
interface, and a Command Line Interface (CLI). Management services include
job scheduling, job and resource management, node management, and Remote
Installation Services (RIS).
• Compute node: A compute node is a computer configured as part of a high
performance Cluster to provide computational resources for the end user. Com-
pute nodes on a Windows Compute Cluster Server 2003 Cluster must have a
supported operating system installed, but nodes within the same Cluster can
have different operating systems and different hardware configurations.
2.6.2.2 Network Architecture
Network configuration consists of a head node and a scalable number of compute
nodes. The nodes can be connected as part of a larger server network, or as a private
network with the head node serving as a gateway. Figure 2.7 shows both types of
arrangement. The networking medium can be Ethernet or it can be a high-speed
medium such as InfiniBand (typically used only for MPI or similar communication
among the nodes).
Figure 2.7: Network Architecture
2.6.2.3 Software Architecture
The software architecture consists of a user interface layer, a scheduling layer, and an
execution layer. The interface and scheduling layers reside on the head node. The
execution layer resides primarily on the compute nodes. The execution layer as shown
in Figure 2.8 includes the Microsoft implementation of MPI, called MS MPI, which
was developed for Windows and is included in the Microsoft Compute Cluster Pack.
• Interface layer: The user interface layer consists of the Compute Cluster Job
Manager, the Compute Cluster Administrator, and Command Line Interface
(CLI). The Compute Cluster Job Manager is a WIN32 graphic user interface to
the Job Scheduler that is used for job creation and submission. The Compute
Cluster Administrator is a Microsoft Management Console (MMC) snap-in that
is used for configuration and management of the Cluster. The Command Line
Interface is a standard Windows command prompt which provides a command-
line alternative to use of the Job Manager and the Administrator.
Figure 2.8: Software Architecture
• Scheduling layer: The scheduling layer consists of the Job Scheduler, which is
responsible for queuing the jobs and tasks, reserving resources, and dispatching
jobs to the compute nodes.
• Execution layer: The execution layer consists of the following components
replicated on each compute node: the Node Manager Service, the MS MPI
launcher mpiexec, and the MS MPI Service. The Node Manager is a service
that runs on all compute nodes in the Cluster. The Node Manager executes jobs
on the node, sets task environmental variables, and sends a heartbeat (health
check) signal to the Job Scheduler at specified intervals (the default interval is
one minute). mpiexec is the MPICH2-compatible multi-threading executable
within which all MPI tasks are run. The MS MPI Service is responsible for
starting the job tasks on the various processors.
2.6.2.4 Job Execution
Steps of job execution are as follows:
1. Creating and submitting jobs:
Creating a job is the first step in Cluster computing. It is a resource request con-
taining one or more computing tasks to be run in parallel. Each task may in turn
be parallel or it may be serial. One can create a job using the Job Manager or
the CLI. Creating a job means specifying the job priority, run time limit, number of processors required, specific nodes requested, and whether nodes will be reserved exclusively for the job. The tasks that the job will execute are then added. Each task’s
properties also include any input, output, and error files required, as well as a list
of any other tasks on which this task depends. After defining the job and its tasks,
the next step is to submit it to the Job Scheduler. After the job is submitted, it
takes its place in the job queue with the status Queued and waits its turn to be
activated.
2. Job Scheduler:
When a job is submitted, it is placed under the management of Job Scheduler. Job
Scheduler determines the job’s place in the queue and allocates resources to the
job when the job reaches the top of the queue and as resources become available.
Jobs are ordered in the queue according to a set of rules called scheduling policies.
Resource allocation is based on resource sorting. When the requested resources
have been allocated, the scheduler dispatches the job tasks to the compute nodes
and takes on a management and monitoring function. The scheduler manages jobs
by enforcing certain job and task options, as well as managing job or task status
changes. It monitors jobs by reporting on the status of the job and its tasks, as
well as the health of the nodes. Job Scheduler implements the following scheduling policies (a simplified, illustrative sketch of these policies is given at the end of this Section):
• Priority-based, first-come, first-served scheduling: Priority-based, first-
come, first-served (FCFS) scheduling is a combination of FCFS and priority-
based scheduling. Using priority-based FCFS scheduling, the scheduler places
a job into a higher or lower priority group depending on the job’s priority set-
ting, but always places that job at the end of the queue in that priority group
because it is the last submitted job.
• Backfilling: Backfilling maximizes node utilization by allowing a smaller
job or jobs lower in the queue to run ahead of a job waiting at the top of
the queue, as long as the job at the top is not delayed as a result. When a
job reaches the top of the queue, a sufficient number of nodes may not be
available to meet its minimum processors requirement. When this happens,
the job reserves any nodes that are immediately available and waits for the
job that is currently running to complete.
• Exclusive scheduling: By default, a job has exclusive use of the nodes
reserved by it. This can produce idle reserved processors on a node. Idle
reserved processors are processors that are not used by the job but are also
not available to other jobs. By turning off the exclusive property, the user
allows the job to share its unused processors with other jobs that have also
been set as nonexclusive. Therefore, non-exclusivity is a reciprocal agreement
among participating jobs, allowing each to take advantage of the other’s un-
used processors.
3. Task execution:
Job Scheduler dispatches tasks to the compute nodes in the order that they appear
in the task list. To dispatch the task, Job Scheduler passes the task to a desig-
nated node, which can be any of the compute nodes allocated to the job. Unless
dependencies have been specified, the tasks are dispatched on a first-come, first-served (FCFS) basis.
For serial tasks, the first two tasks will be dispatched to and run on the designated
node (assuming it has two processors), the next two tasks will be dispatched to
and run on a second designated node, and the sequence will repeat itself until
there are no more tasks or until all the processors in the Cluster are being used.
Any remaining tasks must wait for the next available processor and run when it
becomes available. The following Figure 2.9 shows this process. The file server
shown on the head node may not actually reside there. It can reside anywhere
in the external or internal network. An MSDE server stores the job specifications
and user log-on credentials. The task ID number, which also contains the job ID
number, allows Job Scheduler to keep track of the status of the task as part of the
job, displaying both job and task status to the user.
Figure 2.9: Serial Task execution
For parallel tasks, execution flow depends on the user application and the software that supports it. For jobs that are run using the Microsoft Message Passing Interface Service, tasks are executed as follows. The MS MPI executable mpiexec
is started on the designated node. mpiexec, in turn, starts all the task processes
through the node-specific MS MPI Service. If more than one node is required for
the task, additional instances of MS MPI, one per node, are spawned before the
task processes themselves are started. Parallel task flow is shown in Figure 2.10.
In the Figure, P0 through P5 represent the processes that are created, each part
of a single task. This illustration shows the most common case, in which only one
process, P0, handles all the standard input and output files.
Figure 2.10: Parallel Task execution
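As promised above, the following Python sketch illustrates, in a highly simplified form, how priority-based FCFS ordering and backfilling interact. It is not the Windows Compute Cluster Job Scheduler implementation; all class and field names are our own, exclusivity is not modeled, and a real backfilling scheduler would also guarantee that the job waiting at the head of the queue is never delayed.

# Illustrative sketch only -- not the actual Job Scheduler.
from collections import namedtuple

Job = namedtuple("Job", "name priority min_processors")

class SimpleScheduler:
    def __init__(self, free_processors):
        self.free = free_processors
        self.queue = []                     # waiting jobs, highest priority first

    def submit(self, job):
        # Priority-based FCFS: the job goes to the end of its priority group.
        pos = len(self.queue)
        while pos > 0 and self.queue[pos - 1].priority < job.priority:
            pos -= 1
        self.queue.insert(pos, job)

    def dispatch(self):
        started = []
        # Run jobs from the head of the queue while enough processors are free ...
        while self.queue and self.queue[0].min_processors <= self.free:
            started.append(self._start(self.queue.pop(0)))
        # ... otherwise backfill: let smaller jobs further down the queue run.
        for job in list(self.queue):
            if job.min_processors <= self.free:
                self.queue.remove(job)
                started.append(self._start(job))
        return started

    def _start(self, job):
        self.free -= job.min_processors
        return job.name

# Example: a large high-priority job waits while a small job is backfilled.
s = SimpleScheduler(free_processors=4)
s.submit(Job("big", priority=2, min_processors=8))
s.submit(Job("small", priority=1, min_processors=2))
print(s.dispatch())                         # -> ['small']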
2.7 Concluding Remarks
As the beginning of our work, we have studied the issues related to parallel computation, focusing on the architectures, protocols and standards of Computer Clusters. The motivation of distributed processing using Computer Clusters leads to a more advanced technology known as Grid Computing, which we discuss in the next Chapter.
Chapter 3
Grid Computing : An Introduction
Grid Computing, or more specifically a ‘Grid Computing System’, is a virtualized distributed environment. The Grid environment provides dynamic runtime selection, sharing and aggregation of geographically distributed resources based on the availability, capability, performance and cost of these computing resources. Fundamentally, Grid Computing is an advanced form of distributed processing that combines a decentralized architecture for managing computing resources with a layered hierarchical architecture for providing services to the user [27].
The rest of the Chapter is organized as follows. We begin with the definition of Grid Computing in Section 3.1 and compare the Grid with Computer Clusters in Section 3.2. Section 3.3 presents an example of a Grid Computing environment. In Sections 3.4 and 3.5 we consider the underlying layers of Grid Computing in detail. The resource management architecture is discussed in Section 3.6, and the protocol for resource management (GRAM) is discussed in Section 3.6.2. We also present a resource monitoring architecture for the Grid environment in Section 3.7. We conclude our discussion in Section 3.8 by introducing a new approach to distributed processing known as Cloud Computing.
3.1 Grid Computing: definitions and overview
The concept of the Grid was introduced in the early 1990s, when high performance computers were connected by fast data communication networks. The motivation of that approach was to support computation- and data-intensive scientific applications. Figure 3.1 [28] shows the evolution of the Grid over time.
Figure 3.1: Evolution of Grid Computing
The basic idea of the Grid is the co-allocation of distributed computational resources. The most cited definition of the Grid is [29]:
“A computational grid is a hardware and software infrastructure
that provides dependable, consistent, pervasive, and inexpensive
access to high-end computational capabilities.”
Again, according to IBM definition [30],
“A grid is a collection of distributed computing resources available
over a local or wide area network that appear to an end user or
application as one large virtual computing system. The vision is to
create virtual dynamic organizations through secure, coordinated
resource-sharing among individuals, institutions, and resources.”
A Grid Computing environment must include:
Coordinated resources: The Grid environment must provide the necessary infrastructure for coordination of resources based upon policies and service level agreements.
Open standard protocols and frameworks: Open standards can provide interoperability and integration facilities. These standards should be applied to resource discovery, resource access and resource coordination. The Open Grid Services Infrastructure (OGSI) [31] and the Open Grid Services Architecture (OGSA) [32] were published by the Global Grid Forum (GGF) as proposed recommendations for this approach.
Grid Computing can also be distinguished from High Performance Computing (HPC) and Clustered Systems in the following way: the Grid focuses on resource sharing and can result in HPC, whereas HPC does not necessarily involve sharing of resources [33]. The Grid enables the abstraction of distributed systems and resources, such as processing, network bandwidth and data storage, to create a Single System Image. Such abstraction provides continuous access to a large pool of IT capabilities. Figures 3.2 and 3.3 [28] compare the Grid environment with traditional computation. An organization-owned computational Grid is shown in Figure 3.3, where a scheduler sets policies and priorities for placing jobs in the Grid infrastructure.
Figure 3.2: Serving job requests in traditional environment
3.2 Grids over Cluster Computing
Computer Clusters, discussed in Chapter 2, are local to a domain. Clusters are designed to resolve the problem of inadequate computing power. They provide more computational power by pooling computational resources and parallelizing the workload. As Clusters provide dedicated functionality to a local domain, they are not a suitable solution for resource sharing between users of various domains. Nodes in a Cluster are controlled centrally, and the Cluster manager monitors the state of each node [34]. In brief, Cluster units provide only a subset of Grid functionality.
The big difference is that a Cluster is homogeneous while Grids are heterogeneous
[35]. The computers that are part of a Grid can run different operating systems and
have different hardware whereas the Cluster Computers all have the same hardware
and OS. A Grid can make use of spare computing power on a desktop computer while
the machines in a Cluster are dedicated to work as a single unit. Grids are inherently distributed over a LAN or WAN. The computers in a Cluster are normally contained in a single location.
Figure 3.3: Serving job requests in Grid environment
Clusters are configurable in Active-Active or Active-Passive modes. In an Active-Active configuration each computer runs its own set of services (say, one runs a SQL instance and the other runs a web server) and they share some resources such as storage. If one of the computers in the Cluster goes down, its services fail over to the other node and almost seamlessly continue running there. Active-Passive is similar, but only one machine runs the services and the other takes over only when there is a failure. Cluster components can be shared or dedicated. On the other hand, some Grid resources may be shared while others may be dedicated or reserved.
Another difference lies in the way resources are handled. In the case of a Cluster, all nodes present a single system view and resources are managed by a centralized resource manager. In the case of a Grid, every node is autonomous: it has its own resource manager and behaves like an independent entity.
3.3 An example of Grid Computing environment
Figure 3.4: Google search architecture
We consider searching the world wide web with Google as an example of a Grid Computing environment. Figure 3.4 shows an abstract view of the Google search architecture [36]. Google processes tens of thousands of queries per second. Each query is first received by one of the Web Servers, which then passes it to an array of Index Servers. Index Servers are responsible for keeping an index of the words and phrases found in websites. The servers are distributed over several machines and hence the searching runs concurrently. In a fraction of a second, the Index Servers perform a logical AND operation and return references to the websites containing the query (search phrase). The resulting references are then sent to the Store Servers. Store Servers maintain compressed copies of all the pages known to Google. These compressed copies are used to prepare page snippets, which are finally presented to the end user in a readable form.
Crawler machines continuously traverse the web and update the Google database of pages stored in the Index and Store Servers. Thus, the Store Servers contain relatively recent compressed copies of all the pages available on the web.
Grid Computing can facilitate the above scenario of efficient searching. As stated earlier, the servers are distributed and the searching should be parallel in order to achieve efficiency. The infrastructure also needs to scale with the growth of the web as the number of pages and indexes increases. Content from different organizations and numerous servers is shared with Google, which is allowed to copy that content and transform it into its local resources. The local resources comprise the keyword database of the Index Servers and the cached content in the database of the Store Servers. These resources are partially shared with end-users, who send queries through their browsers. Users can then directly contact the original servers to request the full content of a web page.
Google also shares computing cycles: it shares its computing resources, such as storage and computing capability, with the end-user by performing data caching, ranking and searching of queries.
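As a rough illustration of the Index Server step described above, the following Python sketch performs the logical AND over per-word posting lists held by several index shards. The data and the function name are hypothetical, and the sketch ignores ranking, snippet generation and fault tolerance.

# Illustrative sketch of the logical AND performed by the Index Servers.
# Each "index server" (shard) maps a word to the set of page identifiers containing it.
index_shards = [
    {"grid": {1, 2, 5}, "computing": {2, 5, 9}},
    {"grid": {11, 14}, "computing": {14, 20}},
]

def search(query_words):
    matches = set()
    # Each shard can be searched concurrently; here we simply iterate.
    for shard in index_shards:
        postings = [shard.get(w, set()) for w in query_words]
        if postings:
            matches |= set.intersection(*postings)   # logical AND within the shard
    return matches

print(sorted(search(["grid", "computing"])))         # -> [2, 5, 14]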
3.4 Grid Architecture
In this Section, we will discuss Grid architecture, which identifies the basic compo-
nents of a Grid system. It also defines the purpose and functions of such components.
This layered Grid architecture also indicates how these components interact with one another. Here, we present the Grid architecture described in [37]. Figure 3.5 shows the Grid layers from top to bottom.
Figure 3.5: Grid Protocol Architecture
3.4.1 Fabric Layer: Interfaces to Local Resources
The Fabric layer provides the resources that can be shared in the Grid environment. Examples of such resources are computational resources, storage systems, sensors and network systems. A resource may itself be a logical entity, such as a distributed file system, whose implementation involves its own internal protocols; such internal protocols are not the concern of the Grid architecture [37]. The computational resources represent multiple architectures such as clusters, supercomputers, servers and ordinary PCs running a variety of operating systems (such as UNIX variants or Windows) [38].
Components of the Fabric layer implement the local, resource-specific operations on specific resources, which may be physical or logical. Logical resources may include software components, policy files, workflow applications, etc. [39]. These resource-specific operations provide the functionality on which sharing operations at higher levels are built. In order to support sharing mechanisms, we need to provide [34]:
• an inquiry mechanism so that the components of the Fabric layer can discover the structure and state of resources and monitor them.
• appropriate resource management mechanisms (application dependent, unified, or both) to control the quality of service (QoS) delivered in the Grid environment.
3.4.2 Connectivity Layer: Managing Communications
The Connectivity layer defines the core communication and authentication protocols necessary for Grid networks. Communication protocols transfer data between Fabric layer resources. Authentication protocols build on these communication services to provide cryptographically secure mechanisms for Grid users and resources. The communication protocols can work with any networking protocols that support transport, routing and naming functionality; in computational Grids, the TCP/IP Internet protocol stack is commonly used [37].
3.4.3 Resource Layer: Sharing of a Single Resource
The Resource layer sits on top of the Connectivity layer and defines the protocols, along with APIs and SDKs, for secure negotiation, monitoring, initialization, control and payment of sharing operations on individual resources. The Resource layer uses Fabric layer interfaces and functions to access and control local resources. This layer deals only with local, individual resources and therefore ignores global resource management issues. To share a single resource, two classes of Resource layer protocols are distinguished [37]:
• Information protocols: Information protocols are used to discover information about the state and structure of a resource, for example its configuration, current load, usage policy or cost.
• Management protocols: Management protocols in the Resource layer are used to negotiate access to a shared resource. The protocols specify resource requirements, including advance reservation and QoS, and the operations to be performed on the resource, such as process creation or data access.
3.4.4 Collective Layer: Co-ordination with Multiple Resources
The Resource layer, described in Section 3.4.3, deals with the operation and management of a single resource (for example, a computational resource, a storage system or a network). The Collective layer of the Grid architecture, in contrast, contains protocols and services that are not associated with any one specific resource but are global in nature and handle interactions across collections of resources. This layer provides the APIs and SDKs needed to operate on collections of resources across the overall Grid environment.
Figure 3.6: Collective and Resource layer protocols are combined in various ways to provide application functionality
The implementation of Collective layer functions can be built on Resource layer or
other Collective layer protocols and APIs [37]. Figure 3.6 shows a Collective co-
allocation API and SDK that uses a Resource layer management protocol to control
resources. On top of this, a co-reservation service protocol and the service itself are defined. To implement co-allocation operations, the co-allocation API is called, which provides additional functionality such as authorization and fault tolerance. An application then uses the co-reservation service protocol to request and perform end-to-end reservations.
3.4.5 Application Layer: User-defined Grid Applications
The top layer of the Grid consists of user applications, which are constructed by utilizing the services defined at the lower layers. At each layer, we have well-defined protocols that provide access to useful services such as resource management, data access and resource discovery. Figure 3.7 shows the correlation between the different layers [37]. APIs are implemented by SDKs, which use Grid protocols to provide functionality to the end user. A higher level SDK can also provide functionality that is not directly mapped to a specific protocol, combining protocol operations with calls to additional APIs to implement local functionality. A minimal sketch of how these layers can compose is given after Figure 3.7.
Figure 3.7: Programmer’s view of Grid Architecture. Thin lines denote protocol interactions, while bold lines represent a direct call
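To give a feel for how the layers compose, the following is a minimal, illustrative Python sketch: a Fabric-level resource exposes an inquiry operation, a Resource-layer manager offers information and management operations on that single resource, and a Collective-layer co-allocator is built only on the Resource-layer interface. All class and method names are our own; real Grid middleware such as Globus realizes these interfaces through protocols (e.g. GRIP and GRAM) rather than local Python classes.

# Illustrative sketch of the Fabric / Resource / Collective layering.

class FabricResource:                       # Fabric layer: one local resource
    def __init__(self, name, processors):
        self.name, self.free = name, processors

    def state(self):                        # inquiry mechanism (structure and state)
        return {"name": self.name, "free_processors": self.free}

class ResourceManager:                      # Resource layer: access to a single resource
    def __init__(self, resource):
        self.resource = resource

    def info(self):                         # information protocol
        return self.resource.state()

    def allocate(self, processors):         # management protocol
        if processors > self.resource.free:
            raise RuntimeError("insufficient processors on " + self.resource.name)
        self.resource.free -= processors
        return (self.resource.name, processors)

class CoAllocator:                          # Collective layer: many resources at once
    def __init__(self, managers):
        self.managers = managers

    def co_allocate(self, request):         # request maps resource name -> processors
        return [m.allocate(request[m.resource.name])
                for m in self.managers if m.resource.name in request]

site_a = ResourceManager(FabricResource("site-a", 16))
site_b = ResourceManager(FabricResource("site-b", 32))
print(CoAllocator([site_a, site_b]).co_allocate({"site-a": 8, "site-b": 8}))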
3.5 Grid Computing with Globus
Globus [40] provides a software infrastructure that enables applications to handle distributed computing resources as a single virtual machine [41]. The Globus Toolkit, the core component of the infrastructure, defines the basic services and capabilities required for a computational Grid. Globus is designed as a layered architecture in which high-level global services are built on top of low-level local services. In this Section, we discuss how the Globus Toolkit protocols map onto the Grid layers.
• Fabric Layer:
The Globus Toolkit is designed to use existing Fabric components [37]. For example, enquiry software is provided for discovering structure and state information of various common resources, such as computers (e.g. OS version, hardware configuration), storage systems (e.g. available space), etc. In the higher level protocols (particularly at the Resource layer), the implementation of resource management is normally assumed to be the domain of local resource managers.
• Connectivity Layer:
Globus uses public-key based Grid Security Infrastructure (GSI) protocols [42,
43] for authentication, communication protection, and authorization. GSI ex-
tends the Transport Layer Security (TLS) protocols [44] to address the issues of single sign-on, delegation, and integration with various local security solutions.
• Resource Layer:
The Grid Resource Information Protocol (GRIP) [45] defines a standard resource information protocol. The HTTP-based Grid Resource Access and Management (GRAM) protocol [46] is used for the allocation of computational resources and for monitoring and controlling computation on those resources. An extended version of FTP, GridFTP [47], is used for partial file access and for managing parallelism in high-speed data transfers [37].
The Globus Toolkit defines client-side C and Java APIs and SDKs for these protocols. Server-side SDKs can also be provided for each protocol to support the integration of various resources (for example computational, storage and network resources) into the Grid [37].
• Collective Layer:
Grid Information Index Servers (GIISs) support arbitrary views on resource subsets; the LDAP information protocol is used to access resource-specific GRISs to obtain resource state, and the Grid Resource Registration Protocol (GRRP) is used for resource registration. In addition, replica catalog and replica management services are used to support the management of dataset replicas. An on-line credential repository service known as ‘MyProxy’ provides secure storage for proxy credentials [48]. The Dynamically-Updated Request Online Coallocator (DUROC) provides an SDK and API for resource co-allocation [49].
3.6 Resource Management in Grid Computing
In this Section, we discuss the resource management architecture described in [46], which is used at the Resource layer. A block diagram of the architecture is shown in Figure 3.8.
Figure 3.8: A resource management architecture for Grid Computing environment
To communicate resource requests between components, a Resource Specification Language (RSL) is used, which is described in detail in Section 3.6.1. Through a process called specialization, Resource Brokers transform a high level RSL specification into a concrete specification of resources. Such a fully specified request, called a ground request, is passed to a co-allocator, which is responsible for allocating and managing the resources at multiple sites. A multi-request is a request that involves resources at multiple sites; resource co-allocators can break such a multi-request into components and pass each component to the appropriate resource manager. The information service, sitting between the Resource Broker and the Co-allocator, is responsible for giving access to the availability and capability of resources.
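As a rough illustration of the co-allocator’s role, the following Python sketch splits a multi-request into its components and hands each one to the resource manager it names. The list-of-dictionaries representation is our own simplification of an already-parsed RSL multi-request, and the manager stubs are hypothetical.

# Illustrative sketch: decomposing a multi-request among resource managers.
multi_request = [
    {"resourcemanager": "rm1", "count": 80,  "executable": "my_executable"},
    {"resourcemanager": "rm2", "count": 256, "network": "atm",
     "executable": "my_executable"},
]

def co_allocate(components, managers):
    # Pass each component of the '+' multi-request to the named resource manager.
    return [managers[spec["resourcemanager"]](spec) for spec in components]

# Stub managers standing in for the per-site resource managers.
managers = {
    "rm1": lambda spec: "rm1 allocated %d nodes" % spec["count"],
    "rm2": lambda spec: "rm2 allocated %d nodes" % spec["count"],
}

print(co_allocate(multi_request, managers))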
3.6.1 Resource Specification Language
The Resource Specification Language (RSL) is a combination of parameter specifications joined by the following operators:
• & : conjunction of parameter specifications
• | : disjunction of parameter specifications
• + : combination of two or more requests into a single compound request, or multi-request
Resource brokers, co-allocators and resource managers each define a set of parameter names. Resource managers generally recognize two types of parameter names in order to communicate with local schedulers.
• MDS attribute names: used to express constraints on resources, for example memory>64 or network=atm.
• Scheduler parameters: used to communicate information related to the job, e.g. count (number of nodes required), max_time (maximum time required), executable, environment (environment variables), etc.
For example, the following simple specification, taken from [46],
&(executable=myprog)(|(&(count=5)(memory>=64))(&(count=10)(memory>=32)))
requests 5 nodes with at least 64 MB memory or 10 nodes with at least 32 MB memory. Here, executable and count are scheduler parameters.
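To show how such a conjunction/disjunction might be evaluated against a pool of nodes, here is a minimal Python sketch. It works on an already-parsed nested-tuple form of the specification rather than on the RSL text itself, handles only the count and memory>= constraints used in the example, and the function name is our own.

# Illustrative sketch: evaluating (|(&(count=5)(memory>=64))(&(count=10)(memory>=32))).
def eligible(spec, nodes):
    # Return the nodes that could serve the specification, or [] if it cannot be met.
    op, args = spec[0], spec[1:]
    if op == '|':                                 # disjunction: first branch that can be met
        for branch in args:
            chosen = eligible(branch, nodes)
            if chosen:
                return chosen
        return []
    if op == '&':                                 # conjunction of a count and attribute bounds
        constraints = dict(args)                  # e.g. {'count': 5, 'memory>=': 64}
        pool = [n for n in nodes if n['memory'] >= constraints.get('memory>=', 0)]
        need = constraints.get('count', 1)
        return pool[:need] if len(pool) >= need else []
    raise ValueError('unknown operator ' + op)

nodes = [{'memory': 64}] * 6 + [{'memory': 32}] * 6
spec = ('|', ('&', ('count', 5), ('memory>=', 64)),
             ('&', ('count', 10), ('memory>=', 32)))
print(len(eligible(spec, nodes)))                 # -> 5: the 64 MB branch can be met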
Again, the following is an example of multi-request:
+(&(count=80)(memory>=64)(executable=my_executable)(resourcemanager=rm1))
(&(count=256)(network=atm)(executable=my_executable)(resourcemanager=rm2))
Here, the two requests are combined by the + operator. This is also an example of a ground request, as every component of