Evolution Towards Cloud: Overview of Next Generation Computing Architecture

by

Monowar Hasan (Student ID. 0605021)
&
Sabbir Ahmed (Student ID. 0605013)
A Thesis submitted to the Department of Computer Science and Engineering in partial fulfillment of the requirements for the degree of
Bachelor of Science (B.Sc.) in
Computer Science and Engineering
Thesis Supervisor: Dr. Md. Humayun Kabir

Bangladesh University of Engineering and Technology
Dhaka, Bangladesh
17 March 2012
Certification
The thesis titled “Evolution Towards Cloud: Overview of Next Generation Com-
puting Architecture”, submitted by Monowar Hasan, Student No. 0605021, Sabbir
Ahmed, Student No. 0605013, to the Department of Computer Science and Engi-
neering, Bangladesh University of Engineering and Technology, has been accepted as
satisfactory for the partial fulfillment of the requirements for the degree of Bachelor
of Science in Computer Science and Engineering.
Supervisor
Dr. Md. Humayun Kabir
Professor,
Department of Computer Science and Engineering,
Bangladesh University of Engineering and Technology,
Dhaka-1000, Bangladesh.
Declaration
We, hereby, declare that the work presented in this thesis is the outcome of the in-
vestigation performed by us under the supervision of Dr. Md. Humayun Kabir,
Professor, Department of Computer Science and Engineering, Bangladesh University
of Engineering and Technology. We also declare that no part of this thesis has been
submitted elsewhere for the award of any degree or diploma.
Signature of the Students
Monowar Hasan
Student No. 0605021
Sabbir Ahmed
Student No. 0605013
Abstract
Nowadays Cloud Computing has become a buzzword in distributed processing. Cloud
Computing originated from the ideas of concurrent processing in Computer Clusters.
It has enhanced the established architecture and standards of Grid Computing with
the ideas of Utility and Service-oriented Computing. Computing through the Cloud
supports a business model in the form of X-as-a-Service, where X stands for hardware,
software, a development platform or some storage medium. End-users can consume
any of these services from providers on a pay-as-you-go basis without knowing the
details of the underlying architecture. Hence, the Cloud offers layers of abstraction to
end-users and a scope to adjust application demand for end-users, developers and providers.
Acknowledgements
We are grateful to several people for this work, without whom it would not have been
successful. Our heartiest thanks go to our supervisor, Professor Dr. Md. Humayun Kabir,
for his support and valuable guidance. His continuous feedback and assistance helped
us to clarify our ideas and understanding of the topic.
Special thanks to Professor Dr. Hanan Lutfiyya of the University of Western Ontario,
Canada, and Professor Dr. Ivona Brandic of the Vienna University of Technology, Vienna,
Austria, for providing their research publications, which helped our thesis to progress.
The Department of Computer Science and Engineering, Bangladesh University of Engi-
neering and Technology, provided us with a sound working environment and helped us
to obtain electronic copies of the publications.
Last but not least, we acknowledge the contribution and support of our family
members for being with us and encouraging us all the way. Without their sacrifice this
work would not have been successful.
Table of Contents
Certification ii
Declaration iii
Abstract iv
Acknowledgments v
Table of Contents ix
List of Tables x
List of Figures xiii
1 Introduction 1
2 Computer Clusters 4
2.1 Architecture of Computer Clusters . . . . . . . . . . . . . . . . . . . 5
2.2 Cluster Interconnection . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Protocols for Cluster Communication . . . . . . . . . . . . . . . . . . 9
2.3.1 Internet Protocols . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.2 Low-latency Protocols . . . . . . . . . . . . . . . . . . . . . . 11
2.3.2.1 Active Messages . . . . . . . . . . . . . . . . . . . . 11
2.3.2.2 Fast Messages . . . . . . . . . . . . . . . . . . . . . . 12
2.3.2.3 VMMC . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.2.4 U-net . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.2.5 BIP . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.3 Standards for Cluster Communication . . . . . . . . . . . . . 14
2.3.3.1 VIA . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.3.2 InfiniBand . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4 Cluster Middleware . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4.1 Message-based Middleware . . . . . . . . . . . . . . . . . . . 20
2.4.2 RPC-based Middleware . . . . . . . . . . . . . . . . . . . . . 20
2.4.3 Object Request Broker . . . . . . . . . . . . . . . . . . . . . . 21
2.5 Single System Image (SSI) . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5.1 Benefits of SSI . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5.2 Features of SSI Clustering Systems . . . . . . . . . . . . . . . 23
2.5.3 Functional Relationship among Middleware SSI Modules . . . 24
2.5.3.1 Resource Management and scheduling (RMS) . . . . 24
2.6 Examples of Cluster implementation . . . . . . . . . . . . . . . . . . 25
2.6.1 Linux Virtual Server . . . . . . . . . . . . . . . . . . . . . . . 26
2.6.2 Windows Compute Cluster Server 2003 . . . . . . . . . . . . . 29
2.6.2.1 Compute Cluster Components . . . . . . . . . . . . . 30
2.6.2.2 Network Architecture . . . . . . . . . . . . . . . . . 30
2.6.2.3 Software Architecture . . . . . . . . . . . . . . . . . 31
2.6.2.4 Job Execution . . . . . . . . . . . . . . . . . . . . . 33
2.7 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3 Grid Computing : An Introduction 38
3.1 Grid Computing: definitions and overview . . . . . . . . . . . . . . . 39
3.2 Grids over Cluster Computing . . . . . . . . . . . . . . . . . . . . . . 41
3.3 An example of Grid Computing environment . . . . . . . . . . . . . . 43
3.4 Grid Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.4.1 Fabric Layer: Interfaces to Local Resources . . . . . . . . . . . 45
3.4.2 Connectivity Layer: Managing Communications . . . . . . . . 46
3.4.3 Resource Layer: Sharing of a Single Resource . . . . . . . . . 47
3.4.4 Collective Layer : Co-ordination with multiple resources . . . 47
3.4.5 Application Layer : User defined Grid Applications . . . . . . 48
3.5 Grid Computing with Globus . . . . . . . . . . . . . . . . . . . . . . 49
3.6 Resource Management in Grid Computing . . . . . . . . . . . . . . . 51
3.6.1 Resource Specification Language . . . . . . . . . . . . . . . . . 52
3.6.2 Globus Resource Allocation Manager (GRAM) . . . . . . . . . 53
3.7 Resource Monitoring in Grid Computing . . . . . . . . . . . . . . . . 54
3.8 Evolution towards Cloud Computing from Grid . . . . . . . . . . . . 61
3.9 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4 An overview of Cloud Architecture 63
4.1 Cloud Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2 Cloud Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.2.1 A layered model of Cloud architecture - Cloud ontology . . . . 66
4.2.2 Cloud Business Model . . . . . . . . . . . . . . . . . . . . . . 73
4.2.3 Cloud Deployment Model . . . . . . . . . . . . . . . . . . . . 74
4.3 Cloud Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.3.1 Infrastructure as a Service (IaaS) . . . . . . . . . . . . . . . . 78
4.3.2 Platform as a Service (PaaS) . . . . . . . . . . . . . . . . . . . 79
4.3.3 Software as a Service (SaaS) . . . . . . . . . . . . . . . . . . . 81
4.4 Virtualization on Cloud . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.4.1 Full virtualization . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.4.2 Paravirtualization . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.4.3 Motivations of Virtualization . . . . . . . . . . . . . . . . . . 87
4.5 Example of a Cloud Implementation . . . . . . . . . . . . . . . . . . 88
4.5.1 Amazon S3 Concepts . . . . . . . . . . . . . . . . . . . . . . . 88
4.5.2 Amazon S3 Data Consistency Model . . . . . . . . . . . . . . 91
4.5.3 Managing Concurrent Applications . . . . . . . . . . . . . . . 92
4.6 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5 Comparisons of Grid and Cloud : Similarities & Differences 95
5.1 Major Focus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.2 Points of Considerations . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.2.1 Business Model . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.2.2 Scalability issues . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.2.3 Multitasking and Availability . . . . . . . . . . . . . . . . . . 98
5.2.4 Resource Management . . . . . . . . . . . . . . . . . . . . . . 98
5.2.5 Application Model . . . . . . . . . . . . . . . . . . . . . . . . 102
5.2.6 Other issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.3 Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.3.1 Comparative results . . . . . . . . . . . . . . . . . . . . . . . 104
5.4 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6 Conclusion and Future works 106
Bibliography 107
List of Tables
2.1 Categories of Cluster Interconnection Hardware . . . . . . . . . . . . 7
4.1 Example of existing Cloud Systems with respect to classification into layers of
Cloud Ontology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.1 Comparative analysis between an existing Grid and Cloud implemen-
tation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
List of Figures
2.1 Architecture of a computer Cluster . . . . . . . . . . . . . . . . . . . 6
2.2 Traditional Protocol Overhead and Transmission Time. . . . . . . . . 10
2.3 The InfiniBand Architecture . . . . . . . . . . . . . . . . . . . . . . . 17
2.4 Functional Relationship Among Middleware SSI Modules . . . . . . 24
2.5 Resource Management and scheduling (RMS) . . . . . . . . . . . . . 25
2.6 Linux Virtual Server . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.7 Network Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.8 Software Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.9 Serial Task execution . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.10 Parallel Task execution . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.1 Evolution of Grid Computing . . . . . . . . . . . . . . . . . . . . . . 39
3.2 Serving job requests in traditional environment . . . . . . . . . . . . 41
3.3 Serving job requests in Grid environment . . . . . . . . . . . . . . . . 42
3.4 Google search architecture . . . . . . . . . . . . . . . . . . . . . . . . 43
3.5 Grid Protocol Architecture . . . . . . . . . . . . . . . . . . . . . . . . 45
3.6 Collective and Resource layer protocols are combined in various ways
to provide application functionality . . . . . . . . . . . . . . . . . . . 48
3.7 Programmers view of Grid Architecture. Thin lines denotes protocol
interactions where bold lines represent a direct call . . . . . . . . . . 49
3.8 A resource management architecture for Grid Computing environment 51
3.9 Globus GRAM Architecture . . . . . . . . . . . . . . . . . . . . . . . 54
3.10 Grid Monitoring Architecture Components . . . . . . . . . . . . . . . 55
3.11 Enhancement of generic Grid architecture to Service Oriented Grid . 61
4.1 Components of a Cloud Computing Solution . . . . . . . . . . . . . . 64
4.2 Hierarchical abstraction layers of Cluster, Grid and Cloud Computing 66
4.3 Cloud layered architecture : consists of five layers, figure represents
inter-dependency between layers . . . . . . . . . . . . . . . . . . . . . 67
4.4 Virtualization reduces number of servers . . . . . . . . . . . . . . . . 70
4.5 Cloud computing Business model . . . . . . . . . . . . . . . . . . . . 73
4.6 External or Public Cloud . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.7 Internal or Private Cloud . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.8 Example of Hybrid Cloud . . . . . . . . . . . . . . . . . . . . . . . . 78
4.9 Correlation between Cloud Architecture and Cloud Services . . . . . 79
4.10 Infrastructure as a Service . . . . . . . . . . . . . . . . . . . . . . . . 80
4.11 Platform as a Service . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.12 Software as a Service . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.13 Full virtualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.14 A Paravirtualized deployment where many OS can run simultaneously 85
4.15 Paravirtualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.16 Conceptual view of Amazon Simple Storage Service . . . . . . . . . . 89
4.17 Managing Concurrent Applications : W1 & W2 complete before the
start of R1 & R2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.18 Managing Concurrent Applications : W2 does not complete before the
start of R1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.19 Managing Concurrent Applications : W2 is performed before S3 returns
a ‘success’ for W1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.1 Motivation of Grid and Cloud . . . . . . . . . . . . . . . . . . . . . . 96
5.2 Comparison regarding performance, reliability and cost . . . . . . . . 97
Chapter 1
Introduction
Sometimes applications need more computing power than a sequential computer can
provide. A feasible and cost-effective solution is to connect multiple processors to-
gether and coordinate their computational powers. The resulting systems are popu-
larly known as parallel computers or Computer Clusters, and they allow the sharing
of a computational task among multiple processors. The components of a Cluster
are usually connected to each other through fast Local Area Networks. This is in
contrast to a traditional supercomputer, which has many processors connected by a
local high-speed computer bus. Each node in a Cluster runs its own instance of an
operating system. Traditionally, Computer Clusters run on separate physical comput-
ers with the same operating system; hence, the nodes in a Cluster are homogeneous
and tightly-coupled. The activities of the computing nodes are monitored by ‘Clustering
Middleware’, a software layer that sits atop the nodes and allows the users to view
the Cluster as a single computing unit, through a ‘Single System Image’ concept.
Computer Clusters are covered in detail in Chapter 2.
Computational Grids, another approach to distributed processing, also use many
nodes like Computer Clusters, but they form a more dynamic and usually heterogeneous
system. Heterogeneous pools of servers, storage systems and networks are brought together
in a virtualized system that is exposed to the user as a single computing entity. In
a Grid, a computing job uses one or a few nodes, with little or no inter-node commu-
nication. Job requests are first pooled and then allocated to the available processors in
an efficient way. ‘Grid middleware’ is specific software, which provides the necessary
functionality required to enable sharing of heterogeneous resources. Grid Computing
is realized by deploying such Grid middleware. Architectures and issues of Computer Grids are
covered in Chapter 3.
Cluster Grids (or Computer Clusters) are local resources that operate inside the fire-
wall and are controlled by a single administrative entity that has complete control
over each component. Thus, Clusters do not actually involve sharing of resources and
cannot be considered Grids in the narrow sense. The term Enterprise Grid refers
to the application of Grid Computing for sharing resources within the bounds of a single
company. All components of an Enterprise Grid operate inside the firewall of a com-
pany, but may be heterogeneous and physically distributed across multiple company
locations. A Grid that is owned and deployed by a third party service provider is
called a Utility Grid. The service being offered via a Utility Grid is utility computing,
i.e. compute capacity and storage in a pay-per-use manner. A Utility Grid operates
outside the firewall of the user. The growing trend toward Utility Grids popularized the
influential approach of Cloud Computing.
Cloud Computing, a relatively recent term, is a computing paradigm, where a large
pool of systems are connected in private or public networks, to provide dynamically
scalable infrastructure for application, data and file storage. It implies a service ori-
ented architecture, reduced information technology overhead for the end-user, great
flexibility, reduced total cost of ownership, on-demand service and many other things.
In the Cloud, applications are delivered as services over the Internet. Infrastructure re-
sources (hardware, storage and system software) and applications are provided in an
X-as-a-Service manner. When a Cloud is made available in a pay-as-you-go manner
to the general public, we call it a Public Cloud. We use the term Private Cloud to
refer to internal datacenters of a business or other organization, not made available
to the general public. Thus, Cloud Computing is the combination of SaaS and Utility
Computing, but does not include Private Clouds. A detailed overview of Cloud Com-
puting is presented in Chapter 4.
Cloud Computing is not entirely similar to Computer Grids or Utility Grids; the Cloud
differs from the Grid in various respects. Similarities and differences
between Grid and Cloud Computing are discussed in Chapter 5.
Chapter 2
Computer Clusters
A Cluster [1] is a type of parallel or distributed processing system. It consists of a
collection of interconnected stand-alone computers that work together as a
single, integrated computing resource. All the component subsystems of a Cluster are
supervised within a single administrative domain, usually residing in a single room
and managed as a single computer system. Cluster Computing can be used for load
balancing as well as for high availability [2]. Cluster Computing can also be used as
a relatively low-cost form of parallel processing for scientific and other applications
that lend themselves to parallel operations.
Some properties of Cluster Computing:
• Computers, also known as nodes, in a Cluster are networked in a tightly-coupled
fashion. They are all on the same subnet of the same domain and often net-
worked with very high bandwidth connections.
• Nodes of a Cluster are homogeneous. They all use the same hardware, run the
same software, and are generally configured identically. Each node in a Cluster
is a dedicated resource; generally, only Cluster applications run on a Cluster
node.
• Message Passing Interface (MPI) [3] is commonly used in Clusters; it is a programming
interface that allows distributed application instances to communicate with
each other and share information (a minimal example appears after this list).
• Dedicated hardware, high-speed interconnects, and MPI provide Clusters the
ability to work efficiently on fine-grained parallel problems where the subtasks
must communicate many times per second, including problems with short tasks,
some of which may depend on the results of previous tasks.
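The programming model mentioned above can be made concrete with a minimal MPI sketch in C. The program below is an illustrative sketch, not taken from any particular Cluster installation: process 0 sends a short greeting that process 1 receives.

/* hello_mpi.c -- minimal MPI send/receive sketch (assumes an MPI
 * implementation such as MPICH is installed). */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    int rank, size;
    char buf[64];

    MPI_Init(&argc, &argv);                 /* start the MPI runtime          */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I?            */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes in the job? */

    if (rank == 0 && size > 1) {
        strcpy(buf, "hello from node 0");
        MPI_Send(buf, (int)strlen(buf) + 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received: %s\n", buf);
    }

    MPI_Finalize();                         /* shut the runtime down cleanly  */
    return 0;
}

On a typical installation such a program would be compiled with mpicc and launched across the Cluster nodes with a command like mpiexec -n 2 ./hello_mpi; the mpiexec launcher reappears in the discussion of Windows Compute Cluster Server in Section 2.6.2.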
2.1 Architecture of Computer Clusters
In Cluster Computing a computer node can be a single or multiprocessor system[4].
The nodes can be PCs, workstations, or Symmetric Multiprocessors (SMP) with mem-
ory, I/O facilities, and an operating system. In Cluster Computing, two or more nodes
are connected together. These nodes can exist in a single cabinet or be physically
separated and connected via a LAN. This LAN-based inter-connected Cluster of com-
puters appears as a single system to the users and applications. Cluster Computing
can provide a cost-effective way to gain features and benefits like fast and reliable ser-
vices that previously could be found only on more expensive proprietary shared memory
systems. Typical architecture of a Cluster is shown in Figure 2.1.
In Cluster Computing, several high-performance networks or switches are used to
connect the nodes of the Cluster. Among them, Gigabit Ethernet and Myrinet are
the most common. Switched networks are preferred since they allow multiple simultane-
ous messages to be sent, which can improve overall application performance. Cluster
interconnects use Network Interface Cards. Interconnection technologies may be
classified into four categories, depending on whether the internal connection is from
the I/O bus or the memory bus, and depending on whether the communication between
the computers is performed primarily using messages or using shared storage.

Figure 2.1: Architecture of a computer Cluster
We will discuss Cluster Interconnection in Section 2.2. Several Fast Communication
Protocols and Services are used to communicate within nodes. We will discuss them
briefly in Section 2.3.
The operating system in the individual nodes of the Cluster provides the fundamental
system support for Cluster operations. Whether the user is opening files, sending mes-
sages, or starting additional processes, the operating system is always present. The
primary role of an operating system is to multiplex multiple processes onto hardware
components that comprise a system (resource management and scheduling), as well
as provide a high-level software interface for user applications. These services include
protection boundaries, process and thread co-ordination, inter-process communication
and device handling.
There is a Middleware layer which sits between the operating system and applications.
Middleware layers enable the seamless usage of heterogeneous components across the Cluster.
Middleware provides the system with a Single System Image (SSI) and a System Availability
Infrastructure. Cluster Middleware and Single System Image (SSI) are discussed in
Sections 2.4 and 2.5. Both sequential and parallel or distributed applications can be
run using Cluster Computing. For parallel applications, several parallel program-
ming environments and tools, such as compilers and MPI (Message Passing Interface), are
used. We will conclude the Chapter by describing two implementations of Cluster Computing:
Linux Virtual Server (LVS) in Section 2.6.1 and Windows Compute Cluster Server
2003 in Section 2.6.2.
2.2 Cluster Interconnection
In Cluster Computing the choice of interconnection technology is a key component.
We can classify the Interconnection technologies into four categories. These four cat-
egories depend on the internal connection and how the nodes communicate with each
other. The internal connection can be from the I/O bus or the memory bus and the
communication between the computers can be performed primarily using messages or
using shared storage [5]. Table 2.1 illustrates the four types of interconnection.
Type               Message Based                                    Shared Storage
I/O Attached       Most common type; includes most high-speed       Shared disk subsystems.
                   networks; VIA, TCP/IP.
Memory Attached    Usually implemented in software as               Global shared memory,
                   optimizations of I/O attached message-based.     Distributed shared memory.

Table 2.1: Categories of Cluster Interconnection Hardware
Among the four interconnection categories I/O attached message-based systems are
by far the most common. This system includes all commonly-used wide-area and
local-area network technologies. It also includes several recent products that are
specifically designed for Cluster computing. I/O attached shared storage systems in-
clude computers that share a common disk sub-system. Memory attached systems
are not as common as I/O attached systems, since the memory bus of an individual
computer generally has a design that is unique to that type of computer. However,
memory-attached systems have been implemented; most of the time they are
implemented in software or with memory-mapped I/O, such as Reflective Memory [6].
There are several Hybrid systems that combine the features of more than one category.
An example of a Hybrid system is the InfiniBand standard [7]. InfiniBand is an I/O
attached interconnection. It can be used to send data to a shared disk sub-system
as well as to send messages to another computer. There are many factors that affect
the choice of interconnect technology for a Cluster, such as compatibility with
the Cluster hardware and operating system, price, and performance. The performance of
a Cluster interconnect depends on its latency and bandwidth.
• Latency is the time needed to send data from one computer to another. Latency
also includes overhead for the software to construct the message as well as the
time to transfer the bits from one computer to another.
• Bandwidth is the number of bits per second that can be transmitted over the
interconnect hardware.
Applications that send small messages will have better performance mainly because
the latency is reduced, whereas applications that send large messages will have better
performance as the bandwidth increases. The latency is a function of
both the communication software and the network hardware.
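These two quantities can be combined into a simple first-order cost model. Assuming a fixed per-message latency L (software plus hardware overhead) and a link bandwidth B, the time to deliver an n-byte message is approximately

T(n) = L + n / B.

With illustrative (not measured) values of L = 50 microseconds and B = 125 MB/s (about 1 Gbit/s), a 100-byte message costs roughly 50.8 microseconds and is latency-dominated, while a 1 MB message costs roughly 8 ms and is bandwidth-dominated. This is why the low-latency protocols of Section 2.3.2 concentrate on reducing the software contribution to L.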
2.3 Protocols for Cluster Communication
A communication protocol defines a set of rules and conventions for communicat-
ing among the nodes in the Cluster [8]. Each protocol uses different technology to
exchange information. Communication protocols can be classified as:
• Connection oriented or connectionless.
• Offering various levels of reliability: a reliable protocol guarantees that messages
arrive, and arrive in order; an unreliable protocol makes no such guarantee.
• Unbuffered, which is synchronous, or buffered, which is asynchronous.
• By the number of intermediate data copies between buffers, which may be zero,
one or more.
Several protocols are used in Clusters. Initially, traditional Internet protocols were
used for Clustering. Later, several protocols were designed specifically for
Cluster communication. Finally, two new protocol standards have been specially de-
signed for use in Cluster Computing.
2.3.1 Internet Protocols
The Internet Protocol (IP) is the standard for networking worldwide. The Trans-
mission Control Protocol (TCP) and the User Datagram Protocol (UDP) are both
transport layer protocols built over the Internet Protocol. TCP and UDP protocols
and the de facto standard BSD sockets Application Programmer’s Interface (API) to
TCP and UDP were among the first messaging libraries used for Cluster Comput-
ing [9].
• Internet Protocol uses one or more buffers in system memory with the help of
operating system services.
• User application constructs the message in user memory, and then makes an
operating system request to copy the message into a system buffer.
• A system interrupt is required to send and receive the message.
With the Internet protocols, operating system overhead and the overhead for copies to and
from system memory are a significant portion of the total time to send a message. As
network hardware became faster during the 1990s, the overhead of the communica-
tion protocols became significantly larger than the actual hardware transmission time
for messages, as shown in Figure 2.2. This created the need for new types of
protocols for Cluster computing.
Figure 2.2: Traditional Protocol Overhead and Transmission Time.
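For reference, the following self-contained C sketch shows the BSD sockets style of messaging described above; every send() here passes through a system buffer in the kernel, which is exactly the per-message overhead illustrated in Figure 2.2. The peer address and port are made-up examples.

/* tcp_send.c -- minimal BSD sockets (TCP) sender; illustrative only.
 * The peer address 192.168.1.10:5000 is a made-up example. */
#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);   /* TCP socket */
    struct sockaddr_in peer;

    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port   = htons(5000);
    inet_pton(AF_INET, "192.168.1.10", &peer.sin_addr);

    if (connect(fd, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
        perror("connect");
        return 1;
    }

    const char *msg = "hello over TCP";
    /* The kernel copies msg into a system buffer before transmission --
     * the copy and system-call overhead discussed in this Section. */
    send(fd, msg, strlen(msg), 0);
    close(fd);
    return 0;
}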
2.3.2 Low-latency Protocols
To avoid operating system intervention in message transmission, several research
projects were carried out during the 1990s. These projects led to the development of low-
latency protocols. These protocols also provide user-level messaging services across
high-speed networks. Low-latency protocols developed during the 1990’s include Ac-
tive Messages, Fast Messages, the VMMC (Virtual Memory-Mapped Communication)
system, U-net, and Basic Interface for Parallelism (BIP), among others.
2.3.2.1 Active Messages
Active Messages was developed at the University of California, Berkeley. It provided the low-
latency communications library for the Berkeley Network of Workstations (NOW)
project [10, 11]. The short messages used in Active Messages are synchronous and based
on the concept of a request-reply primitive.
• The sending-side user-level application constructs a message in user memory. The
receiving process allocates a receive buffer in user memory on the receiving side
and sends a request to the sender.
• The sender replies by copying the message from the user buffer on the sending
side directly to the network buffer. No buffering in system memory is performed.
• Network hardware transfers the message to the receiver, and then the message
is transferred from the network buffer to the receive buffer in user memory.
User virtual memory on both the sending and receiving sides must be pinned
to an address in physical memory so that it cannot be paged out
during the network operation. Once the pinned user memory buffers are established,
no operating system intervention is required for a message to be sent. Since no copies
from user memory to system memory are used, this protocol is known as a zero-copy
protocol.
To support multiple concurrent parallel applications in a Cluster, Active Messages
was extended to Generic Active Messages (GAM). In GAM, a copy sometimes oc-
curs to a buffer in system memory on the receiving side so that user buffers can be
reused more efficiently. In this case, the protocol is referred to as a ‘one-copy’ protocol.
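The core idea of the request-reply primitive is that a message carries the identifier of a handler that the receiving side invokes when the message arrives. The following self-contained C sketch imitates that dispatch inside a single process; the handler table and am_deliver() are hypothetical names invented for illustration, not the actual Berkeley Active Messages API.

/* am_sketch.c -- conceptual Active-Message-style handler dispatch.
 * The handler table and am_deliver() are hypothetical, for illustration only. */
#include <stdio.h>

#define MAX_HANDLERS 8

typedef void (*am_handler_t)(const void *payload, int len);

static am_handler_t handler_table[MAX_HANDLERS];

/* A receiver would run this when a message arrives: the message names the
 * handler to run, so no explicit receive() call is needed. */
static void am_deliver(int handler_id, const void *payload, int len)
{
    if (handler_id >= 0 && handler_id < MAX_HANDLERS && handler_table[handler_id])
        handler_table[handler_id](payload, len);
}

static void on_request(const void *payload, int len)
{
    printf("request handler got %d bytes: %.*s\n", len, len, (const char *)payload);
}

int main(void)
{
    handler_table[0] = on_request;            /* register handler 0                 */
    const char msg[] = "compute block 42";    /* message built in user memory       */
    am_deliver(0, msg, (int)sizeof(msg) - 1); /* simulate arrival of an active message */
    return 0;
}

In a real implementation the handler index travels across the network with the payload, and the message is delivered directly from the pinned user buffers described above, without operating system intervention.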
2.3.2.2 Fast Messages
Fast Message was developed at the University of Illinois. It is similar to Active
Messages [12]. Fast Message extends Active Message by imposing stronger guarantees
on the underlying communication.
• Fast Message guarantees that all messages arrive reliably and in-order, even if
the underlying network hardware does not.
• Fast Message uses flow control to ensure that a fast sender cannot overrun a
slow receiver, thus causing messages to be lost. Flow control is implemented in
Fast Messages with a credit system that manages pinned memory in the host
computers.
2.3.2.3 VMMC
The Virtual Memory-Mapped Communication (VMMC) [13] system was developed
as a low-latency protocol for the Princeton SHRIMP project. One goal of VMMC
was to view messaging as reads and writes into the user-level virtual memory system.
• VMMC works by mapping a page of user virtual memory to physical memory.
It makes a correspondence between pages on the sending and the receiving sides.
• It uses specially designed hardware. This hardware allows the network interface
to snoop writes to memory on the local host and have these writes automatically
updated on the remote host’s memory. Various optimizations of these writes have
been developed that help to minimize the total number of writes and the network
traffic, and to improve overall application performance.
VMMC is an example of a paradigm known as distributed shared memory (DSM).
In DSM systems memory is physically distributed among the nodes in a system, but
processes in an application may view shared memory locations as identical and per-
form reads and writes to the shared memory locations.
2.3.2.4 U-net
The U-net network interface architecture [14] was developed at Cornell University.
U-net provides zero-copy messaging where possible.
• U-net adds the concept of a virtual network interface for each connection in a
user application, just as an application has a virtual memory address space
that is mapped to real physical memory on demand.
• Each communication endpoint of the application is viewed as a virtual network
interface mapped to a real set of network buffers and queues on demand.
The advantage of this architecture is that once the mapping is defined, each active
interface has direct access to the network without operating system intervention. The
result is that communication can occur with very low latency.
2.3.2.5 BIP
Basic Interface for Parallelism (BIP) is a low-latency protocol that was developed at
the University of Lyon [15].
• BIP is designed as a low-level message layer over which a higher-level layer such
as Message Passing Interface (MPI) [3] can be built. Programmers can use MPI
over BIP for parallel application programming.
• The initial BIP interface consisted of both blocking and non-blocking calls.
Later versions (BIP-SMP) provide multiplexing between the network and shared
memory under a single API for use on Clusters of symmetric multiprocessors.
BIP achieves low latency and high bandwidth by using different protocols, like Active
Messages and Fast Messages for various message sizes. It also provides a zero or single
memory copy of user data. To simplify the design and keep the overheads low, BIP
guarantees in-order delivery of messages, although some flow control issues for small
messages are passed to higher software levels.
2.3.3 Standards for Cluster Communication
Research on low-latency protocols had progressed sufficiently for a new
standard for low-latency messaging to be developed: the Virtual Interface Architec-
ture (VIA). Industrial researchers also worked on standards for shared storage subsystems.
The combination of the efforts of many researchers has resulted in the InfiniBand stan-
dard.
2.3.3.1 VIA
The Virtual Interface Architecture [16] is a communications standard that combines
many of the best features of various academic projects. A consortium of academic and
industrial partners, including Intel, Compaq, and Microsoft, developed the standard.
• VIA supported heterogeneous hardware and was available as of early 2001.
• It was based on the concept of a virtual network interface. Before a message
can be sent in VIA, send and receive buffers must be allocated and pinned to
physical memory locations.
• There was no need of system calls after the buffers and associated data structures
are allocated.
• A send or receive operation in a user application consists of posting a descriptor
to a queue. The application can choose to wait for a confirmation that the
operation has completed, or can continue host processing while the message is
being processed.
Several hardware vendors and some independent developers have developed VIA im-
plementations for various network [17][18] products. VIA implementations can be
classified as native or emulated.
• A native implementation of VIA off-loads a portion of the processing required
to send and receive messages to special hardware on the network interface card.
When a message arrives in a native VIA implementation, the network card
performs at least a portion of the work required to copy the message into user
memory.
• In an emulated VIA implementation, the host CPU performs the processing to
send and receive messages. Although the host processor is used in both cases,
an emulated implementation of VIA has less overhead than TCP/IP. However,
the services provided by VIA are different than those provided by TCP/IP, since
the communication may not be guaranteed to arrive reliably in VIA.
2.3.3.2 InfiniBand
The InfiniBand standard [19] is another standard for Cluster communication and was sup-
ported by a large consortium of industrial partners, including Compaq, Dell, Hewlett-
Packard, IBM, Intel, Microsoft and Sun Microsystems. The InfiniBand architecture
replaces the standard shared bus for I/O on current computers with a high-speed
serial, channel-based, message-passing, scalable, and switched fabric. There are two
types of adapters: host channel adapters (HCA) and target channel adapters (TCA).
All systems and devices attach to the fabric through an HCA or
a TCA, as shown in Figure 2.3. In InfiniBand, data is sent
as packets, and six types of transfer methods are available, including:
• Reliable and unreliable connections.
• Reliable and unreliable datagrams.
• Multicast connections.
• Raw packets.
InfiniBand supports remote direct memory access (RDMA) read or write operations.
This allows one processor to read or write the contents of memory at another processor,
and also directly supports IPv6 [20] messaging for the Internet. There are several
components of InfiniBand. They are:
Figure 2.3: The InfiniBand Architecture
• Host channel adapter (HCA): Host channel adapter is an interface that
resides within a server. HCA communicates directly with the server’s memory,
processor, target channel adapter or a switch. It guarantees delivery of data
and can recover from transmission errors.
• Target channel adapter (TCA): Target channel adapter enables I/O devices
to be located within the network independent of a host computer. It includes
an I/O controller that is specific to its particular device’s protocol. TCAs can
communicate with an HCA or a switch.
• Switch: A switch is virtually equivalent to a traffic policeman. It allows many HCAs
and TCAs to connect to it and handles network traffic. It offers higher availability,
higher aggregate bandwidth, load balancing, data mirroring and much more. It
looks at the “local route header” on each packet of data and forwards it to the
appropriate location. A group of switches is referred to as a fabric. If a host
computer is down, the switch still continues to operate. The switch also frees
up servers and other devices by handling network traffic.
• Router: A router forwards data packets from a local network (called a subnet)
to other external subnets. It reads the ‘global route header’ and forwards the
packet to the appropriate address. It rebuilds each packet with the proper local address header
as it passes it to the new subnet.
• Subnet Manager: It is an application responsible for configuring the local
subnet and ensuring its continued operation. Configuration responsibilities in-
clude managing switch and router setups and reconfiguring the subnet if a link
goes down or a new one is added.
The InfiniBand Architecture (IBA) comprises four primary layers that describe
communication devices and methodology.
• Physical Layer: Defines the electrical and mechanical characteristics of the
IBA, including the cables, connectors and hot-swap characteristics. IBA con-
nectors include fiber, copper and backplane connectors. There are three link
speeds specified as 1X, 4X and 12X. 1X link cable has four wires; two for each
direction of communication (read and write).
• Link Layer: Link Layer includes packet layout, point-to-point link instruction,
switching within a local subnet and data integrity. There are two types of packets: man-
agement and data. Management packets handle link configurations and main-
tenance. Data packets carry up to 4 kilobytes of transaction payload. Every
device in a local subnet has a local ID (LID) for forwarding data appropriately.
It handles data integrity by including variant and invariant cyclic redundancy
checking (CRC). The variant CRC checks fields that change from point-to-point
and the invariant CRC provides end-to-end data integrity.
• Network Layer: The network layer is responsible for routing packets from one
subnet to another. The global route header located within a packet includes an
IPv6 address for the source and destination of each packet. For single subnet
environments, the network layer information is not used.
• Transport Layer: The transport layer handles the order of packet delivery. It also
handles partitioning, multiplexing and transport services that determine reliable
connections.
2.4 Cluster Middleware
Middleware is the layer of software sandwiched between the operating system and
applications. It has re-emerged as a means of integrating software applications that
run in a heterogeneous environment. There is a large overlap between the infrastructure
that is provided to a Cluster by high-level Single System Image (SSI) services and
that provided by the traditional view of middleware. Middleware helps a developer
overcome three potential problems with developing applications on a heterogeneous
Cluster:
• Gives the ability to access software inside or outside the developer’s site.
• Helps to integrate software from different sources.
• Rapid application development.
The services that middleware provides are not restricted to application development.
Middleware also provides services for the management and administration of a het-
erogeneous system.
2.4.1 Message-based Middleware
Message-based middleware uses a common communication protocol to exchange data
between applications. The communication protocol hides many of the low-level mes-
sage passing primitives from the application developer. Message-based middleware
software can pass messages directly between applications, send messages via software
that queues waiting messages, or use some combination of the two. Examples of this
type of middleware are the three upper layers of the OSI model [21], the session,
presentation and applications layers.
2.4.2 RPC-based Middleware
There are many applications where the interactions between processes in a distributed
system are remote operations, often with a return value. For these applications Re-
mote Procedure Call (RPC) is used. The implementation of the client/server model in
terms of Remote Procedure Call (RPC) allows the code of the application to remain
the same whether the procedures are local or remote. Inter-process communication
mechanisms serve four important functions [22]:
• They offer mechanisms against failure. They also provide the means to cross
administrative boundaries.
• They allow communications between separate processes over a computer net-
work.
• They enforce clean and simple interfaces, thus providing a natural aid for the
modular structure of large distributed applications.
• They hide the distinction between local and remote communication, thus allow-
ing static or dynamic reconfiguration.
2.4.3 Object Request Broker
An Object Request Broker (ORB) is a type of middleware that supports the remote
execution of objects. An international ORB standard is CORBA (Common Object
Request Broker Architecture). It is supported by more than 700 groups and managed
by the Object Management Group (OMG) [23]. The OMG is a non profit-making
organization whose objective is to define and promote standards for object orienta-
tion in order to integrate applications based on existing technologies. The Object
Management Architecture (OMA) is characterized by the following:
• The Object Request Broker (ORB): It is the controlling element of the archi-
tecture and it supports the portability of objects and their interoperability in a
network of heterogeneous systems.
• Object services: These are specific system services for the manipulation of ob-
jects. Their goal is to simplify the process of constructing applications.
• Application services: They offer a set of facilities allowing applications to access
databases and printing services, to synchronize with other applications, and so on.
• Application objects: They allow the rapid development of applications. A new
application can be formed from objects in a combined library of application
services.
2.5 Single System Image (SSI)
SSI is the illusion, created by software or hardware, that presents a collection of com-
puting resources as one, more powerful unified resource [24]. In other words, it is the property of a
system that hides the heterogeneous and distributed nature of the available resources
and presents them to users and applications as a single unified computing resource.
SSI makes the Cluster appear like a single machine to the user, to applications, and to
the network. SSI Cluster-based systems are mainly focused on complete transparency
of resource management, scalable performance, and system availability in supporting
user applications. SSI is supported by a middleware layer that resides between the
OS and the user-level environment. Middleware consists essentially of two sub-layers of software
infrastructure.
• SSI infrastructure - Glues together the OSs on all nodes to offer unified access to
system resources.
• System availability infrastructure - Enables Cluster services such as check-
pointing, automatic failover, recovery from failure and fault-tolerant support
among all nodes of the Cluster.
2.5.1 Benefits of SSI
There are several benefits of SSI:
• Transparent use of system resources.
• Transparent process migration and load balancing across nodes.
• Improved reliability and higher availability.
• Improved system response time and performance.
• Simplified system management.
• Reduction in the risk of operator errors.
• No need to be aware of the underlying system architecture to use these machines
effectively.
2.5.2 Features of SSI Clustering Systems
• Single I/O Space: Any node can access any peripheral or disk devices without
the knowledge of physical location.
• Single Process Space: Any process on any node can create processes with Cluster-
wide process identifiers, and processes can communicate through signals, pipes, etc.,
as if they were on a single node.
• Single Global Job Management System: SSI provides single global job
management system. The manager node manages all the operations.
• Checkpointing : Some SSI systems allow checkpointing of running processes,
allowing their current state to be saved and reloaded at a later date. Check-
pointing can be seen as related to migration, as migrating a process from one
node to another can be implemented by first checkpointing the process, then
restarting it on another node. Alternatively checkpointing can be considered as
migration to disk (a tiny sketch of the idea follows this list).
• Process Migration: Many SSI systems provide process migration. Processes
may start on one node and be moved to another node, possibly for resource
balancing or administrative reasons. As processes are moved from one node to
another, other associated resources may be moved with them.
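As a deliberately tiny illustration of the checkpointing idea mentioned above, the following self-contained C sketch saves and restores a small, assumed "process state" struct. Real SSI checkpointing must capture the full memory image, registers and open files; the struct fields and the file name here are arbitrary examples.

/* ckpt_sketch.c -- naive checkpoint/restart of a tiny assumed "process state". */
#include <stdio.h>

struct state { long iteration; double partial_sum; };

static int save_checkpoint(const char *path, const struct state *s)
{
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    size_t ok = fwrite(s, sizeof(*s), 1, f);   /* persist the state         */
    fclose(f);
    return ok == 1 ? 0 : -1;
}

static int load_checkpoint(const char *path, struct state *s)
{
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    size_t ok = fread(s, sizeof(*s), 1, f);    /* reload a saved state      */
    fclose(f);
    return ok == 1 ? 0 : -1;
}

int main(void)
{
    struct state s = { 0, 0.0 };
    if (load_checkpoint("job.ckpt", &s) == 0)  /* restart (possibly on another node) */
        printf("restarting from iteration %ld\n", s.iteration);
    for (; s.iteration < 1000000; s.iteration++) {
        s.partial_sum += 1.0 / (double)(s.iteration + 1);
        if (s.iteration % 100000 == 0)
            save_checkpoint("job.ckpt", &s);   /* periodic checkpoint       */
    }
    printf("sum = %f\n", s.partial_sum);
    return 0;
}

Migrating a process can then be viewed as checkpointing it on one node and restarting it from the saved state on another, as described in the list above.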
Figure 2.4: Functional Relationship Among Middleware SSI Modules
2.5.3 Functional Relationship among Middleware SSI Mod-ules
Every SSI has a boundary. Single system support can exist at different levels within
a system, one able to be built on another. In SSI there can be three levels of ab-
stractions. They are application and subsystem level, operating system kernel level
and hardware level. In Figure 2.4 the functional relationship among middleware SSI
module is shown. Resource Management and Scheduling is done in subsystem level.
2.5.3.1 Resource Management and scheduling (RMS)
The RMS system is responsible for distributing applications among Cluster nodes. It
enables the effective and efficient utilization of the available resources. In RMS there
are two types of software components. The basic architecture of RMS, a client-server system,
is shown in Figure 2.5.
• Resource manager: Locating and allocating computational resource, authen-
tication, process creation and migration.
• Resource scheduler: Queuing applications, resource location and assignment.
It instructs the resource manager what to do and when (policy).

Figure 2.5: Resource Management and scheduling (RMS)
There are several services which are provided by RMS:
• Process Migration.
• Checkpointing.
• Fault Tolerance.
• Minimization of Impact on Users.
• Load Balancing.
• Multiple Application Queues.
2.6 Examples of Cluster implementation
In this Section, we will discuss two existing Cluster implementations: Linux Virtual
Server (LVS), an open source project which is an advanced load balancing solution
for Linux systems; and Windows Compute Cluster Server 2003, a commercial Cluster
server developed by Microsoft Corporation.
2.6.1 Linux Virtual Server
In this Section, we will briefly discuss Linux Virtual Server [25]. Linux Virtual Server
(LVS) is an advanced load balancing solution for Linux systems. It is an open source
project started by Wensong Zhang in May 1998. The mission of the project was
to build a high-performance and highly available server for Linux using Clustering
technology, which provides good scalability, reliability and serviceability. The Linux
Virtual Server directs clients’ network connection requests to multiple servers that
share their workload, which can be used to build scalable and highly available Inter-
net services.
The Linux Virtual Server directs clients’ network connection requests to the different
servers according to scheduling algorithms and makes the parallel services of the
Cluster appear as a single virtual service with a single IP address. The Linux
Virtual Server extends the TCP/IP stack of Linux kernel to support three IP load-
balancing techniques:
• NAT (Network Address Translation): Maps IP addresses from one group
to another. NAT is used when hosts in internal networks want to access the
Internet and be accessed in the Internet.
• IP tunneling: Encapsulates IP datagram within IP datagrams. This allows
datagrams destined for one IP address to be wrapped and redirected to another
IP address.
• Direct routing: Allows responses to be routed directly to the requesting user
machine instead of passing back through the load balancer.
The Linux Virtual Server also provides four scheduling algorithms for selecting servers
from the Cluster for new connections (a sketch of one of them follows this list):
• Round robin: Directs the network connections to the different servers in a
round-robin manner.
• Weighted round robin: Treats real servers according to their different processing capac-
ities. A scheduling sequence will be generated according to the server weights.
Clients’ requests are directed to the different real servers based on the scheduling
sequence in a round robin manner.
• Least-connection: Directs clients’ network connection requests to the server
with the least number of established connections.
• Weighted least-connection: A performance weight can be assigned to each
real server. The servers with a higher weight value will receive a larger percent-
age of live connections at any time.
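To make the last two policies concrete, the following self-contained C sketch selects a server using the weighted least-connection rule: it picks the server with the smallest ratio of active connections to weight. The server pool values are made-up examples, and this is an illustration of the idea rather than the actual LVS kernel code.

/* wlc_sketch.c -- weighted least-connection server selection, illustrative only. */
#include <stdio.h>

struct server {
    const char *addr;   /* real server address (example values)   */
    int active_conns;   /* currently established connections      */
    int weight;         /* administrator-assigned capacity weight */
};

/* Pick the server minimizing active_conns / weight; compared without
 * division as a.active * b.weight < b.active * a.weight. */
static int pick_server(const struct server *pool, int n)
{
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (pool[i].weight <= 0)
            continue;                       /* weight 0 means "drained" */
        if (best < 0 ||
            pool[i].active_conns * pool[best].weight <
            pool[best].active_conns * pool[i].weight)
            best = i;
    }
    return best;
}

int main(void)
{
    struct server pool[] = {
        { "10.0.0.1", 12, 1 },
        { "10.0.0.2", 30, 4 },   /* lowest connections-per-weight ratio */
        { "10.0.0.3", 25, 2 },
    };
    int idx = pick_server(pool, 3);
    printf("forward new connection to %s\n", pool[idx].addr);
    return 0;
}

Weighted round robin differs only in that the choice is driven by a precomputed sequence derived from the weights rather than by the live connection counts.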
Client applications interact with the Cluster as if it were a single server. The clients
are not affected by the interaction with the Cluster and do not need modification.
The application performance scalability is achieved by adding one or more nodes to
the Cluster. High availability is achieved by automatically detecting node or daemon failures
and reconfiguring the system appropriately. The Linux Virtual Server follows
a three-tier architecture, shown in Figure 2.6. The functionality of each tier is:
• Load Balancer: The front end to the service as viewed by connecting clients.
The load balancer directs network connections from clients who access a single
IP address for a particular service, to a set of servers that actually provide the
service.
• Server Pool: It consists of a Cluster of servers that implement the actual
services, such as Web, FTP, mail, DNS, and so on.
• Back-end Storage: It provides the shared storage for the servers, so that it is
easy for servers to keep the same content and provide the same services.

Figure 2.6: Linux Virtual Server
The load balancer handles incoming connections using IP load balancing techniques.
The Load balancer selects servers from the server pool, maintains the state of con-
current connections and forwards packets, and all the work is performed inside the
kernel, so that the handling overhead of the load balancer is low. The load balancer
can handle much larger numbers of connections than a general server; therefore it
can schedule a large number of servers without becoming a potential
bottleneck in the system.
The server nodes may be replicated for either scalability or high availability. When
the load on the system saturates the capacity of the current server nodes, more server
nodes can be added to handle the increasing workload. One of the advantages of
a Clustered system is that it can be built with hardware and software redundancy.
Detecting a node or daemon failure and then reconfiguring the system appropriately
so that its functionality can be taken over by the remaining nodes in the Cluster is
a means of providing high system availability. A Cluster-monitor-daemon can run on
the load balancer and monitor the health of server nodes. If a server node cannot be
reached by ICMP (Internet Control Message Protocol) ping or there is no response
of the service in the specified period, the monitor will remove or disable the server in
the scheduling table, so that the load balancer will not schedule new connections to
the failed one and the failure of a server node can be masked.
The back-end storage for this system is usually provided by a distributed and fault-
tolerant file system. Such a system also takes care of the availability and scalability
issues of file system accesses. The server nodes access the distributed file system in
a similar fashion to that of accessing a local file system. However, multiple identical
applications running on different server nodes may access shared data concurrently.
Any conflict among applications must be reconciled so that the data remains in a
consistent state.
2.6.2 Windows Compute Cluster Server 2003
In this Section, we will briefly discuss Windows Compute Cluster Server 2003 [26].
It is an integrated platform for running, managing, and developing high performance
computing applications.
2.6.2.1 Compute Cluster Components
Each Windows Compute Cluster Server 2003 Cluster consists of a head node and one
or more compute nodes. The head node mediates all access to the Cluster resources
and acts as a single point for Cluster deployment, management, and job scheduling.
A Cluster can consist of only a head node.
• Head node: The head node is responsible for providing user interface and
management services to the Cluster. The user interface consists of the Com-
pute Cluster Administrator, which is a Microsoft Management Console (MMC)
snap-in, the Compute Cluster Job Manager, which is a Win32 graphic user
interface, and a Command Line Interface (CLI). Management services include
job scheduling, job and resource management, node management, and Remote
Installation Services (RIS).
• Compute node: A compute node is a computer configured as part of a high
performance Cluster to provide computational resources for the end user. Com-
pute nodes on a Windows Compute Cluster Server 2003 Cluster must have a
supported operating system installed, but nodes within the same Cluster can
have different operating systems and different hardware configurations.
2.6.2.2 Network Architecture
Network configuration consists of a head node and a scalable number of compute
nodes. The nodes can be connected as part of a larger server network, or as a private
network with the head node serving as a gateway. Figure 2.7 shows both types of
arrangement. The networking medium can be Ethernet or it can be a high-speed
medium such as InfiniBand (typically used only for MPI or similar communication
among the nodes).
Figure 2.7: Network Architecture
2.6.2.3 Software Architecture
The software architecture consists of a user interface layer, a scheduling layer, and an
execution layer. The interface and scheduling layers reside on the head node. The
execution layer resides primarily on the compute nodes. The execution layer as shown
in Figure 2.8 includes the Microsoft implementation of MPI, called MS MPI, which
was developed for Windows and is included in the Microsoft Compute Cluster Pack.
• Interface layer: The user interface layer consists of the Compute Cluster Job
Manager, the Compute Cluster Administrator, and Command Line Interface
(CLI). The Compute Cluster Job Manager is a WIN32 graphic user interface to
the Job Scheduler that is used for job creation and submission. The Compute
Cluster Administrator is a Microsoft Management Console (MMC) snap-in that
is used for configuration and management of the Cluster. The Command Line
Interface is a standard Windows command prompt which provides a command-
line alternative to use of the Job Manager and the Administrator.
Figure 2.8: Software Architecture
• Scheduling layer: The scheduling layer consists of the Job Scheduler, which is
responsible for queuing the jobs and tasks, reserving resources, and dispatching
jobs to the compute nodes.
• Execution layer: The execution layer consists of the following components
replicated on each compute node: the Node Manager Service, the MS MPI
launcher mpiexec, and the MS MPI Service. The Node Manager is a service
that runs on all compute nodes in the Cluster. The Node Manager executes jobs
on the node, sets task environmental variables, and sends a heartbeat (health
check) signal to the Job Scheduler at specified intervals (the default interval is
one minute). mpiexec is the MPICH2-compatible multi-threading executable
within which all MPI tasks are run. The MS MPI Service is responsible for
starting the job tasks on the various processors.
2.6.2.4 Job Execution
Steps of job execution are as follows:
1. Creating and submitting jobs:
Creating a job is the first step in Cluster computing. It is a resource request con-
taining one or more computing tasks to be run in parallel. Each task may in turn
be parallel or it may be serial. One can create a job using the Job Manager or
the CLI. Creating a job means specifying the job priority, run time limit, number of processors required, specific nodes requested, and whether nodes will be reserved exclusively for the job. The tasks that the job will execute are then added. Each task’s
properties also include any input, output, and error files required, as well as a list
of any other tasks on which this task depends. After defining the job and its tasks,
the next step is to submit it to the Job Scheduler. After the job is submitted, it
takes its place in the job queue with the status Queued and waits its turn to be
activated.
2. Job Scheduler:
When a job is submitted, it is placed under the management of Job Scheduler. Job
Scheduler determines the job’s place in the queue and allocates resources to the
job when the job reaches the top of the queue and as resources become available.
Jobs are ordered in the queue according to a set of rules called scheduling policies.
Resource allocation is based on resource sorting. When the requested resources
have been allocated, the scheduler dispatches the job tasks to the compute nodes
and takes on a management and monitoring function. The scheduler manages jobs
by enforcing certain job and task options, as well as managing job or task status
changes. It monitors jobs by reporting on the status of the job and its tasks, as
well as the health of the nodes. Job Scheduler implements the following scheduling policies (a simplified, illustrative sketch of these policies is given at the end of this Section):
• Priority-based, first-come, first-served scheduling: Priority-based, first-
come, first-served (FCFS) scheduling is a combination of FCFS and priority-
based scheduling. Using priority-based FCFS scheduling, the scheduler places
a job into a higher or lower priority group depending on the job’s priority set-
ting, but always places that job at the end of the queue in that priority group
because it is the last submitted job.
• Backfilling: Backfilling maximizes node utilization by allowing a smaller
job or jobs lower in the queue to run ahead of a job waiting at the top of
the queue, as long as the job at the top is not delayed as a result. When a
job reaches the top of the queue, a sufficient number of nodes may not be
available to meet its minimum processors requirement. When this happens,
the job reserves any nodes that are immediately available and waits for the
job that is currently running to complete.
• Exclusive scheduling: By default, a job has exclusive use of the nodes
reserved by it. This can produce idle reserved processors on a node. Idle
reserved processors are processors that are not used by the job but are also
not available to other jobs. By turning off the exclusive property, the user
allows the job to share its unused processors with other jobs that have also
been set as nonexclusive. Therefore, non-exclusivity is a reciprocal agreement
among participating jobs, allowing each to take advantage of the other’s un-
used processors.
3. Task execution:
Job Scheduler dispatches tasks to the compute nodes in the order that they appear
in the task list. To dispatch the task, Job Scheduler passes the task to a desig-
nated node, which can be any of the compute nodes allocated to the job. Unless
dependencies have been specified, the tasks are dispatched on a first-come, first-served (FCFS) basis.
For serial tasks, the first two tasks will be dispatched to and run on the designated
node (assuming it has two processors), the next two tasks will be dispatched to
and run on a second designated node, and the sequence will repeat itself until
there are no more tasks or until all the processors in the Cluster are being used.
Any remaining tasks must wait for the next available processor and run when it
becomes available. The following Figure 2.9 shows this process. The file server
shown on the head node may not actually reside there. It can reside anywhere
in the external or internal network. An MSDE server stores the job specifications
and user log-on credentials. The task ID number, which also contains the job ID
number, allows Job Scheduler to keep track of the status of the task as part of the
job, displaying both job and task status to the user.
Figure 2.9: Serial Task execution
For parallel tasks, execution flow depends on the user application and the software that supports it. For jobs that are run using the Microsoft Message Passing Interface Service, tasks are executed as follows. The MS MPI executable mpiexec
is started on the designated node. mpiexec, in turn, starts all the task processes
through the node-specific MS MPI Service. If more than one node is required for
the task, additional instances of MS MPI, one per node, are spawned before the
task processes themselves are started. Parallel task flow is shown in Figure 2.10.
In the Figure, P0 through P5 represent the processes that are created, each part
of a single task. This illustration shows the most common case, in which only one
process, P0, handles all the standard input and output files.
Figure 2.10: Parallel Task execution
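As promised above, the following Python sketch illustrates, in a highly simplified form, how priority-based FCFS ordering and backfilling interact. It is not the Windows Compute Cluster Job Scheduler implementation; all class and field names are our own, exclusivity is not modeled, and a real backfilling scheduler would also guarantee that the job waiting at the head of the queue is never delayed.

# Illustrative sketch only -- not the actual Job Scheduler.
from collections import namedtuple

Job = namedtuple("Job", "name priority min_processors")

class SimpleScheduler:
    def __init__(self, free_processors):
        self.free = free_processors
        self.queue = []                     # waiting jobs, highest priority first

    def submit(self, job):
        # Priority-based FCFS: the job goes to the end of its priority group.
        pos = len(self.queue)
        while pos > 0 and self.queue[pos - 1].priority < job.priority:
            pos -= 1
        self.queue.insert(pos, job)

    def dispatch(self):
        started = []
        # Run jobs from the head of the queue while enough processors are free ...
        while self.queue and self.queue[0].min_processors <= self.free:
            started.append(self._start(self.queue.pop(0)))
        # ... otherwise backfill: let smaller jobs further down the queue run.
        for job in list(self.queue):
            if job.min_processors <= self.free:
                self.queue.remove(job)
                started.append(self._start(job))
        return started

    def _start(self, job):
        self.free -= job.min_processors
        return job.name

# Example: a large high-priority job waits while a small job is backfilled.
s = SimpleScheduler(free_processors=4)
s.submit(Job("big", priority=2, min_processors=8))
s.submit(Job("small", priority=1, min_processors=2))
print(s.dispatch())                         # -> ['small']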
2.7 Concluding Remarks
As the beginning of our work, we have studied the issues related to parallel computation, focusing on the architectures, protocols and standards of Computer Clusters. The motivation of distributed processing using Computer Clusters leads to a more advanced technology known as Grid Computing, which we discuss in the next Chapter.
Chapter 3
Grid Computing : An Introduction
Grid Computing, or more specifically a ‘Grid Computing System’, is a virtualized distributed environment. The Grid environment provides dynamic runtime selection, sharing and aggregation of geographically distributed resources based on the availability, capability, performance and cost of these computing resources. Fundamentally, Grid Computing is an advanced form of distributed processing that combines a decentralized architecture for managing computing resources with a layered hierarchical architecture for providing services to the user [27].
The rest of the Chapter is organized as follows. We begin with the definition of Grid Computing in Section 3.1 and compare the Grid with Computer Clusters in Section 3.2. Section 3.3 presents an example of a Grid Computing environment. In Sections 3.4 and 3.5 we consider the underlying layers of Grid Computing in detail. The resource management architecture is discussed in Section 3.6, and the protocol for resource management (GRAM) is discussed in Section 3.6.2. We also present a resource monitoring architecture for the Grid environment in Section 3.7. We conclude our discussion in Section 3.8 by introducing a new approach to distributed processing known as Cloud Computing.
3.1 Grid Computing: definitions and overview
The concept of the Grid was introduced in the early 1990s, when high performance computers were connected by fast data communication networks. The motivation of that approach was to support computation- and data-intensive scientific applications. Figure 3.1 [28] shows the evolution of the Grid over time.
Figure 3.1: Evolution of Grid Computing
The basic idea of the Grid is the co-allocation of distributed computational resources. The most cited definition of the Grid is [29]:
“A computational grid is a hardware and software infrastructure
that provides dependable, consistent, pervasive, and inexpensive
access to high-end computational capabilities.”
Again, according to IBM definition [30],
“A grid is a collection of distributed computing resources available
over a local or wide area network that appear to an end user or
application as one large virtual computing system. The vision is to
create virtual dynamic organizations through secure, coordinated
resource-sharing among individuals, institutions, and resources.”
A Grid Computing environment must include:
Coordinated resources: The Grid environment must provide the necessary infrastructure for coordination of resources based upon policies and service level agreements.
Open standard protocols and frameworks: Open standards can provide interoperability and integration facilities. These standards should be applied to resource discovery, resource access and resource coordination. The Open Grid Services Infrastructure (OGSI) [31] and the Open Grid Services Architecture (OGSA) [32] were published by the Global Grid Forum (GGF) as proposed recommendations for this approach.
Grid Computing can also be distinguished from High Performance Computing (HPC) and Clustered Systems in the following way: the Grid focuses on resource sharing and can result in HPC, whereas HPC does not necessarily involve sharing of resources [33]. The Grid enables the abstraction of distributed systems and resources, such as processing, network bandwidth and data storage, to create a Single System Image. Such abstraction provides continuous access to a large pool of IT capabilities. Figures 3.2 and 3.3 [28] compare the Grid environment with traditional computation. An organization-owned computational Grid is shown in Figure 3.3, where a scheduler sets policies and priorities for placing jobs in the Grid infrastructure.
Figure 3.2: Serving job requests in traditional environment
3.2 Grids over Cluster Computing
Computer Clusters, discussed in Chapter 2, are local to a domain. Clusters are designed to resolve the problem of inadequate computing power. They provide more computational power by pooling computational resources and parallelizing the workload. As Clusters provide dedicated functionality to a local domain, they are not a suitable solution for resource sharing between users of various domains. Nodes in a Cluster are controlled centrally, and the Cluster manager monitors the state of each node [34]. In brief, Cluster units provide only a subset of Grid functionality.
The big difference is that a Cluster is homogeneous while Grids are heterogeneous
[35]. The computers that are part of a Grid can run different operating systems and
have different hardware whereas the Cluster Computers all have the same hardware
and OS. A Grid can make use of spare computing power on a desktop computer while
the machines in a Cluster are dedicated to work as a single unit. Grids are inherently distributed over a LAN or WAN. The computers in a Cluster are normally contained in a single location.
Figure 3.3: Serving job requests in Grid environment
Clusters are configurable in Active-Active or Active-Passive modes. In an Active-Active configuration each computer runs its own set of services (say, one runs a SQL instance and the other runs a web server) and they share some resources such as storage. If one of the computers in the Cluster goes down, its services fail over to the other node and almost seamlessly continue running there. Active-Passive is similar, but only one machine runs the services and the other takes over only when there is a failure. Cluster components can be shared or dedicated. On the other hand, some Grid resources may be shared while others may be dedicated or reserved.
Another difference lies in the way resources are handled. In the case of a Cluster, all nodes present a single system view and resources are managed by a centralized resource manager. In the case of a Grid, every node is autonomous: it has its own resource manager and behaves like an independent entity.
3.3 An example of Grid Computing environment
Figure 3.4: Google search architecture
We consider searching the world wide web with Google as an example of a Grid Computing environment. Figure 3.4 shows an abstract view of the Google search architecture [36]. Google processes tens of thousands of queries per second. Each query is first received by one of the Web Servers, which then passes it to an array of Index Servers. Index Servers are responsible for keeping an index of the words and phrases found in websites. The servers are distributed over several machines and hence the searching runs concurrently. In a fraction of a second, the Index Servers perform a logical AND operation and return references to the websites containing the query (search phrase). The resulting references are then sent to the Store Servers. Store Servers maintain compressed copies of all the pages known to Google. These compressed copies are used to prepare page snippets, which are finally presented to the end user in a readable form.
Crawler machines continuously traverse the web and update the Google database of pages stored in the Index and Store Servers. Thus, the Store Servers contain relatively recent compressed copies of all the pages available on the web.
Grid Computing can facilitate the above scenario of efficient searching. As stated earlier, the servers are distributed and the searching should be parallel in order to achieve efficiency. The infrastructure also needs to scale with the growth of the web as the number of pages and indexes increases. Content from different organizations and numerous servers is shared with Google, which is allowed to copy that content and transform it into its local resources. The local resources comprise the keyword database of the Index Servers and the cached content in the database of the Store Servers. These resources are partially shared with end-users, who send queries through their browsers. Users can then directly contact the original servers to request the full content of a web page.
Google also shares computing cycles: it shares its computing resources, such as storage and computing capability, with the end-user by performing data caching, ranking and searching of queries.
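As a rough illustration of the Index Server step described above, the following Python sketch performs the logical AND over per-word posting lists held by several index shards. The data and the function name are hypothetical, and the sketch ignores ranking, snippet generation and fault tolerance.

# Illustrative sketch of the logical AND performed by the Index Servers.
# Each "index server" (shard) maps a word to the set of page identifiers containing it.
index_shards = [
    {"grid": {1, 2, 5}, "computing": {2, 5, 9}},
    {"grid": {11, 14}, "computing": {14, 20}},
]

def search(query_words):
    matches = set()
    # Each shard can be searched concurrently; here we simply iterate.
    for shard in index_shards:
        postings = [shard.get(w, set()) for w in query_words]
        if postings:
            matches |= set.intersection(*postings)   # logical AND within the shard
    return matches

print(sorted(search(["grid", "computing"])))         # -> [2, 5, 14]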
3.4 Grid Architecture
In this Section, we will discuss Grid architecture, which identifies the basic compo-
nents of a Grid system. It also defines the purpose and functions of such components.
This layered Grid architecture also indicates how these components interact with one another. Here, we present the Grid architecture described in [37]. Figure 3.5 shows the Grid layers from top to bottom.
Figure 3.5: Grid Protocol Architecture
3.4.1 Fabric Layer: Interfaces to Local Resources
The Fabric layer provides the resources that can be shared in the Grid environment. Examples of such resources are computational resources, storage systems, sensors and network systems. A resource may itself be a logical entity, such as a distributed file system, whose implementation involves its own internal protocols; such internal protocols are not the concern of the Grid architecture [37]. The computational resources represent multiple architectures such as clusters, supercomputers, servers and ordinary PCs running a variety of operating systems (such as UNIX variants or Windows) [38].
Components of the Fabric layer implement the local, resource-specific operations on specific resources, which may be physical or logical. Logical resources may include software components, policy files, workflow applications, etc. [39]. These resource-specific operations provide the functionality on which sharing operations at higher levels are built. In order to support sharing mechanisms, we need to provide [34]:
• an inquiry mechanism so that the components of the Fabric layer can discover the structure and state of resources and monitor them.
• appropriate resource management mechanisms (application dependent, unified, or both) to control the quality of service (QoS) delivered in the Grid environment.
3.4.2 Connectivity Layer: Managing Communications
The Connectivity layer defines the core communication and authentication protocols necessary for Grid networks. Communication protocols transfer data between Fabric layer resources. Authentication protocols build on these communication services to provide cryptographically secure mechanisms for Grid users and resources. The communication protocols can work with any networking protocols that support transport, routing and naming functionality; in computational Grids, the TCP/IP Internet protocol stack is commonly used [37].
3.4.3 Resource Layer: Sharing of a Single Resource
The Resource layer sits on top of the Connectivity layer and defines the protocols, along with APIs and SDKs, for secure negotiation, monitoring, initialization, control and payment of sharing operations on individual resources. The Resource layer uses Fabric layer interfaces and functions to access and control local resources. This layer deals only with local, individual resources and therefore ignores global resource management issues. To share a single resource, two classes of Resource layer protocols are distinguished [37]:
• Information protocols: Information protocols are used to discover information about the state and structure of a resource, for example its configuration, current load, usage policy or cost.
• Management protocols: Management protocols in the Resource layer are used to negotiate access to a shared resource. The protocols specify resource requirements, including advance reservation and QoS, and the operations to be performed on the resource, such as process creation or data access.
3.4.4 Collective Layer: Co-ordination with Multiple Resources
The Resource layer, described in Section 3.4.3, deals with the operation and management of a single resource (for example, a computational resource, a storage system or a network). The Collective layer of the Grid architecture, in contrast, contains protocols and services that are not associated with any one specific resource but are global in nature and handle interactions across collections of resources. This layer provides the APIs and SDKs needed to operate on collections of resources across the overall Grid environment.
Figure 3.6: Collective and Resource layer protocols are combined in various ways to provide application functionality
The implementation of Collective layer functions can be built on Resource layer or
other Collective layer protocols and APIs [37]. Figure 3.6 shows a Collective co-
allocation API and SDK that uses a Resource layer management protocol to control
resources. On top of this, a co-reservation service protocol and the service itself are defined. To implement co-allocation operations, the co-allocation API is called, which provides additional functionality such as authorization and fault tolerance. An application then uses the co-reservation service protocol to request and perform end-to-end reservations.
3.4.5 Application Layer: User-defined Grid Applications
The top layer of the Grid consists of user applications, which are constructed by utilizing the services defined at the lower layers. At each layer, we have well-defined protocols that provide access to useful services such as resource management, data access and resource discovery. Figure 3.7 shows the correlation between the different layers [37]. APIs are implemented by SDKs, which use Grid protocols to provide functionality to the end user. A higher level SDK can also provide functionality that is not directly mapped to a specific protocol, combining protocol operations with calls to additional APIs to implement local functionality. A minimal sketch of how these layers can compose is given after Figure 3.7.
Figure 3.7: Programmer’s view of Grid Architecture. Thin lines denote protocol interactions, while bold lines represent a direct call
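To give a feel for how the layers compose, the following is a minimal, illustrative Python sketch: a Fabric-level resource exposes an inquiry operation, a Resource-layer manager offers information and management operations on that single resource, and a Collective-layer co-allocator is built only on the Resource-layer interface. All class and method names are our own; real Grid middleware such as Globus realizes these interfaces through protocols (e.g. GRIP and GRAM) rather than local Python classes.

# Illustrative sketch of the Fabric / Resource / Collective layering.

class FabricResource:                       # Fabric layer: one local resource
    def __init__(self, name, processors):
        self.name, self.free = name, processors

    def state(self):                        # inquiry mechanism (structure and state)
        return {"name": self.name, "free_processors": self.free}

class ResourceManager:                      # Resource layer: access to a single resource
    def __init__(self, resource):
        self.resource = resource

    def info(self):                         # information protocol
        return self.resource.state()

    def allocate(self, processors):         # management protocol
        if processors > self.resource.free:
            raise RuntimeError("insufficient processors on " + self.resource.name)
        self.resource.free -= processors
        return (self.resource.name, processors)

class CoAllocator:                          # Collective layer: many resources at once
    def __init__(self, managers):
        self.managers = managers

    def co_allocate(self, request):         # request maps resource name -> processors
        return [m.allocate(request[m.resource.name])
                for m in self.managers if m.resource.name in request]

site_a = ResourceManager(FabricResource("site-a", 16))
site_b = ResourceManager(FabricResource("site-b", 32))
print(CoAllocator([site_a, site_b]).co_allocate({"site-a": 8, "site-b": 8}))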
3.5 Grid Computing with Globus
Globus [40] provides a software infrastructure that enables applications to handle distributed computing resources as a single virtual machine [41]. The Globus Toolkit, the core component of the infrastructure, defines the basic services and capabilities required for a computational Grid. Globus is designed as a layered architecture in which high-level global services are built on top of low-level local services. In this Section, we discuss how the Globus Toolkit protocols map onto the Grid layers.
• Fabric Layer:
The Globus Toolkit is designed to use existing Fabric components [37]. For example, enquiry software is provided for discovering structure and state information of various common resources, such as computers (e.g. OS version, hardware configuration), storage systems (e.g. available space), etc. In the higher level protocols (particularly at the Resource layer), the implementation of resource management is normally assumed to be the domain of local resource managers.
• Connectivity Layer:
Globus uses public-key based Grid Security Infrastructure (GSI) protocols [42,
43] for authentication, communication protection, and authorization. GSI ex-
tends the Transport Layer Security (TLS) protocols [44] to address the issues of single sign-on, delegation, and integration with various local security solutions.
• Resource Layer:
The Grid Resource Information Protocol (GRIP) [45] defines a standard resource information protocol. The HTTP-based Grid Resource Access and Management (GRAM) protocol [46] is used for the allocation of computational resources and for monitoring and controlling computation on those resources. An extended version of FTP, GridFTP [47], is used for partial file access and for managing parallelism in high-speed data transfers [37].
The Globus Toolkit defines client-side C and Java APIs and SDKs for these protocols. Server-side SDKs can also be provided for each protocol to support the integration of various resources (for example computational, storage and network resources) into the Grid [37].
• Collective Layer:
Grid Information Index Servers (GIISs) support arbitrary views on resource subsets; the LDAP information protocol is used to access resource-specific GRISs to obtain resource state, and the Grid Resource Registration Protocol (GRRP) is used for resource registration. In addition, replica catalog and replica management services are used to support the management of dataset replicas. An on-line credential repository service known as ‘MyProxy’ provides secure storage for proxy credentials [48]. The Dynamically-Updated Request Online Coallocator (DUROC) provides an SDK and API for resource co-allocation [49].
3.6 Resource Management in Grid Computing
In this Section, we discuss the resource management architecture described in [46], which is used at the Resource layer. A block diagram of the architecture is shown in Figure 3.8.
Figure 3.8: A resource management architecture for Grid Computing environment
To communicate resource requests between components, a Resource Specification Language (RSL) is used, which is described in detail in Section 3.6.1. Through a process called specialization, Resource Brokers transform a high level RSL specification into a concrete specification of resources. Such a fully specified request, called a ground request, is passed to a co-allocator, which is responsible for allocating and managing the resources at multiple sites. A multi-request is a request that involves resources at multiple sites; resource co-allocators can break such a multi-request into components and pass each component to the appropriate resource manager. The information service, sitting between the Resource Broker and the Co-allocator, is responsible for giving access to the availability and capability of resources.
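As a rough illustration of the co-allocator’s role, the following Python sketch splits a multi-request into its components and hands each one to the resource manager it names. The list-of-dictionaries representation is our own simplification of an already-parsed RSL multi-request, and the manager stubs are hypothetical.

# Illustrative sketch: decomposing a multi-request among resource managers.
multi_request = [
    {"resourcemanager": "rm1", "count": 80,  "executable": "my_executable"},
    {"resourcemanager": "rm2", "count": 256, "network": "atm",
     "executable": "my_executable"},
]

def co_allocate(components, managers):
    # Pass each component of the '+' multi-request to the named resource manager.
    return [managers[spec["resourcemanager"]](spec) for spec in components]

# Stub managers standing in for the per-site resource managers.
managers = {
    "rm1": lambda spec: "rm1 allocated %d nodes" % spec["count"],
    "rm2": lambda spec: "rm2 allocated %d nodes" % spec["count"],
}

print(co_allocate(multi_request, managers))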
3.6.1 Resource Specification Language
The Resource Specification Language (RSL) is a combination of parameter specifications joined by the following operators:
• & : conjunction of parameter specifications
• | : disjunction of parameter specifications
• + : combination of two or more requests into a single compound request, or multi-request
Resource brokers, co-allocators and resource managers each define a set of parameter names. Resource managers generally recognize two types of parameter names in order to communicate with local schedulers.
• MDS attribute names: used to express constraints on resources, for example memory>64 or network=atm.
• Scheduler parameters: used to communicate information related to the job, e.g. count (number of nodes required), max_time (maximum time required), executable, environment (environment variables), etc.
For example, the following simple specification, taken from [46],
&(executable=myprog)(|(&(count=5)(memory>=64))(&(count=10)(memory>=32)))
requests 5 nodes with at least 64 MB memory or 10 nodes with at least 32 MB memory. Here, executable and count are scheduler parameters.
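To show how such a conjunction/disjunction might be evaluated against a pool of nodes, here is a minimal Python sketch. It works on an already-parsed nested-tuple form of the specification rather than on the RSL text itself, handles only the count and memory>= constraints used in the example, and the function name is our own.

# Illustrative sketch: evaluating (|(&(count=5)(memory>=64))(&(count=10)(memory>=32))).
def eligible(spec, nodes):
    # Return the nodes that could serve the specification, or [] if it cannot be met.
    op, args = spec[0], spec[1:]
    if op == '|':                                 # disjunction: first branch that can be met
        for branch in args:
            chosen = eligible(branch, nodes)
            if chosen:
                return chosen
        return []
    if op == '&':                                 # conjunction of a count and attribute bounds
        constraints = dict(args)                  # e.g. {'count': 5, 'memory>=': 64}
        pool = [n for n in nodes if n['memory'] >= constraints.get('memory>=', 0)]
        need = constraints.get('count', 1)
        return pool[:need] if len(pool) >= need else []
    raise ValueError('unknown operator ' + op)

nodes = [{'memory': 64}] * 6 + [{'memory': 32}] * 6
spec = ('|', ('&', ('count', 5), ('memory>=', 64)),
             ('&', ('count', 10), ('memory>=', 32)))
print(len(eligible(spec, nodes)))                 # -> 5: the 64 MB branch can be met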
Again, the following is an example of multi-request:
+(&(count=80)(memory>=64)(executable=my_executable)(resourcemanager=rm1))
(&(count=256)(network=atm)(executable=my_executable)(resourcemanager=rm2))
Here, the two requests are combined by the + operator. This is also an example of a ground request, as every component of