
First International Workshop on Operating Systems, Programming Environments and Management Tools

for High-Performance Computing on Clusters (COSET-1)

June 26th, 2004, Saint-Malo, France

Ville de Saint-Malo – Service Communication – Photos : Manuel CLAUZIER

Held in conjunction with the 2004 ACM International Conference on Supercomputing (ICS '04)

WORKSHOP PROCEEDINGS


First International Workshop on Operating Systems, Programming Environments and Management Tools for High-Performance Computing on Clusters

COSET-1

Clusters are not only the most widely used general-purpose platform for scientific computing; according to recent results on the top500.org site, they have also become the dominant platform for high-performance computing overall. While the cluster architecture is attractive with respect to price/performance, there is still great potential for efficiency improvements at the software level. System software requires improvements to better exploit the cluster hardware resources. Programming environments need to be developed with both cluster efficiency and programmer efficiency in mind. Administrative processes need refinement, for both efficiency and effectiveness, when dealing with numerous cluster nodes. The goal of this one-day workshop is to bring together a diverse community of researchers and developers from industry and academia to facilitate the exchange of ideas, to discuss the difficulties and successes in this area, and to discuss recent innovative results in the development of cluster-based operating systems and programming environments, as well as management tools for the administration of high-performance computing clusters.


COSET-1

Workshop co-chairs

Stephen L. Scott
Oak Ridge National Laboratory
P.O. Box 2008, Bldg. 5600, MS-6016
Oak Ridge, TN 37831-6016
email: [email protected]

Christine A. Morin
IRISA/INRIA
Campus universitaire de Beaulieu
35042 Rennes cedex, France
email: [email protected]

Program Committee

Ramamurthy Badrinath, HP, India
Amnon Barak, Hebrew University, Israël
Jean-Yves Berthou, EDF R&D, France
Brett Bode, Ames Lab, USA
Ron Brightwell, SNL, USA
Emmanuel Cecchet, INRIA, France
Toni Cortès, UPC, Spain
Narayan Desai, ANL, USA
Christian Engleman, ORNL, USA
Graham Fagg, University of Tennessee, USA
Paul Farrell, Kent State University, USA
Andrzej Goscinski, Deakin University, Australia
Liviu Iftode, Rutgers University, USA
Chokchai Leangsuksun, Louisiana Tech University, USA
Laurent Lefèvre, INRIA, France
John Mugler, ORNL, USA
Raymond Namyst, Université de Bordeaux 1, France
Thomas Naughton, ORNL, USA
Hong Ong, University of Portsmouth, UK
Rolf Riesen, SNL, USA
Michael Schoettner, University of Ulm, Germany
Assaf Schuster, Technion, Israël


COSET-1 Program

9:00-9:05   Opening

9:05-10:05  Session 1: Cluster Operating System Services
            Session chair: Christine Morin, INRIA
            Parallel File System for Networks of Windows Workstations
            Jose Maria Perez, Jesus Carretero, Felix Garcia, Jose Daniel Garcia, Alejandro Calderon, Universidad Carlos III de Madrid, Spain
            An Application-oriented Communication System for Clusters of Workstations
            Thiago Robert C. Santos and Antonio Augusto Frohlich, LISHA, Federal University of Santa Catarina (UFSC), Brazil

10:05-10:35 Session 2: Application Management
            Session chair: Christine Morin, INRIA
            A First Step toward Autonomous Clustered J2EE Applications Management
            Slim Ben Atallah, Daniel Hagimont, Sébastien Jean and Noël de Palma, INRIA Rhône-Alpes, France

10:35-11:00 Coffee break

11:00-12:30 Session 3: Highly Available Systems for Clusters
            Session chair: Stephen Scott, ORNL
            Highly Configurable Operating Systems for Ultrascale Systems
            Arthur B. Maccabe and Patrick G. Bridges, The University of New Mexico, USA; Ron Brightwell and Rolf Riesen, Sandia National Laboratories, USA; Trammell Hudson, Operating Systems Research, Inc., USA
            Cluster Operating System Support for Parallel Autonomic Computing
            A. Goscinski, J. Silcock, M. Hobbs, Deakin University, Australia
            Type-Safe Object Exchange Between Applications and a DSM Kernel
            R. Goeckelmann, M. Schoettner, S. Frenz and P. Schulthess, University of Ulm, Germany

12:30-14:30 Lunch


14:30-16:00 Session 4: Cluster Single System Image Operating Systems
            Session chair: Christine Morin, INRIA
            SGI's Altix 3700, a 512p SSI System: Architecture and Software Environment
            Jean-Pierre Panziera, SGI
            OpenSSI
            Speaker TBA
            SSI-OSCAR
            Geoffroy Vallée, INRIA

16:00-16:20 Coffee break

16:20-17:20 Session 4 (continued): Cluster Single System Image Operating Systems
            Session chair: Geoffroy Vallée, INRIA
            Millipede Virtual Parallel Machine for NT/PC Clusters
            Assaf Schuster, Technion
            Genesis Cluster Operating System
            Andrzej Goscinski, Deakin University

17:20-18:00 Panel: SSI: Software versus Hardware Approaches
            Moderator: Stephen Scott, ORNL


A Parallel File System for Networks of Windows Workstations

José María Pérez, Computer Science Department, Universidad Carlos III de Madrid

Av. de la Universidad, 30, Leganés 28911, Madrid, Spain

+34 91 624 91 04

[email protected]

Jesús Carretero, Computer Science Department, Universidad Carlos III de Madrid

Av. de la Universidad, 30, Leganés 28911, Madrid, Spain

+34 91 624 94 58

[email protected]

José Daniel García, Computer Science Department, Universidad Carlos III de Madrid

Av. de la Universidad Carlos III, 22 Colmenarejo 28270, Madrid, Spain

+34 91 856 13 16

[email protected]

ABSTRACT

The use of parallelism in file systems allows high-performance I/O to be achieved in clusters and networks of workstations. Traditionally, this kind of solution was only available for UNIX systems and required special servers and special APIs, which leads to the modification and/or recompilation of existing applications. This paper presents the first prototype of a parallel file system, called WinPFS, for the Windows platform. It is implemented as a new Windows file system and is integrated within the Windows kernel components, which means that no modification or recompilation of applications is needed to take advantage of parallel I/O. WinPFS uses shared folders (through the CIFS/SMB protocol) to access remote data in parallel. The proposed prototype has been developed under the Windows XP platform and has been tested with a cluster of Windows XP nodes and a Windows 2003 Server node.

Categories and Subject Descriptors: D.4.3 [Operating Systems]: File Systems Management - distributed file systems. C.2.4 [Computer-Communication Networks]: Distributed Systems - network operating systems.

Keywords: Parallel I/O, Cluster, Windows.

1. INTRODUCTION

In recent years, the need for high-performance data storage has grown along with the capacity of disks and the needs of applications [1][2][3]. One approach to overcome the bottleneck that characterizes typical I/O systems is parallel I/O [1]. This technique allows the creation of large storage systems, by joining several storage resources, to increase the scalability and performance of the I/O system and to provide load balancing.

The use of parallelism in file systems relies on the fact that a distributed and parallel system consists of several nodes with storage devices. Performance and bandwidth can be increased if data accesses are exploited in parallel. Parallelism in file systems is obtained by using several independent server nodes, each one supporting one or more secondary storage devices. Data are striped among those nodes and devices to allow parallel accesses to different files, as well as parallel accesses to the same file. Initially, this idea was used in RAID [4] (Redundant Array of Inexpensive Disks). However, when a RAID is used in a traditional file server, the I/O bandwidth is limited by the server memory bandwidth. If several servers are used in parallel, performance can be increased in two ways:

1. Allowing parallel access to different files by using several disks and servers.

2. Striping data using distributed partitions, allowing parallel access to the data of the same file.

However, current parallel file systems and parallel I/O libraries lack the generality and flexibility needed for general-purpose distributed environments. Furthermore, parallel file systems generally do not use standard servers, which makes it very difficult to integrate those systems into existing networks of workstations, due to the need to install new servers that are not easy to use and are only available for specific platforms (usually some UNIX flavour). Moreover, those systems are implemented outside the operating system, so a new I/O API is needed to take advantage of parallel I/O, which requires modifying existing applications.

Most of the software related to high-performance I/O is only available for UNIX environments, or has been created as UNIX middleware. The work presented in this paper addresses the lack of such systems in the Windows environment, presenting a way to achieve parallel I/O on Windows platforms.

In this paper, we present a parallel file system for Windows clusters and/or networks of workstations, called WinPFS. This system integrates the existing servers in an organization, using protocols like NFS, CIFS, or WebDAV to obtain parallel I/O without requiring complex installations, while providing support for existing applications, high performance, and low overhead thanks to its integration with the Windows kernel.

Section 2 presents work related to parallel I/O. Section 3 presents the design of WinPFS. Section 4 describes some evaluations of the first WinPFS prototype. Finally, Section 5 presents our conclusions and future work.

2. RELATED WORK

Three different parallel I/O software architectures can be distinguished:

• Application libraries basically consist of a set of highly specialized I/O functions. Those functions provide a powerful development environment for experts with specific knowledge of the problem being modeled. A representative example is MPI-IO [5], an I/O extension of the standardized message passing interface MPI.

• Parallel file systems operate independently from applications, thus allowing more flexibility and generality. Examples of parallel file systems are: Vesta [6], PIOUS [7], Galley [8], ParFiSys [9].

• Intelligent I/O systems hide the physical disk access from the application developer by providing a transparent logical I/O environment. The user describes what she wants and the system tries to optimize the I/O requests by applying optimization techniques. This approach is used in ViPIOS [10].

The main problem with parallel I/O software architectures and parallel I/O techniques is that they often lack generality and flexibility, because they provide only tailor-made software for specific problems. On the other hand, parallel file systems are specially conceived for multiprocessors and multicomputers, and do not integrate appropriately into general-purpose distributed environments such as clusters of workstations.

In recent years, some file systems have emerged, such as PVFS [11], which can be used in Linux clusters but need the installation of special servers. Other solutions, such as Expand [12], can use existing standard NFS servers to accomplish parallel I/O, which implies that no new servers are needed in a cluster beyond the standard Linux NFS server. Usually, the client side is implemented as a user-level library and designed with UNIX in mind.

Another important way to accomplish high-performance I/O is the use of MPI-IO [5]. Some implementations of MPI have been adapted to Windows: MPICH, MPI-PRO, and WMPI. However, the Windows I/O part is usually not optimized using parallel I/O techniques.

3. WINPFS DESIGN

The main motivation for the WinPFS design is to build a parallel file system for networks of Windows workstations using standard data servers. To satisfy this goal, we designed and implemented a parallel file system using CIFS/SMB servers. This paper describes the first prototype of WinPFS.

The goals of the proposed architecture are:

• To integrate existing storage resources using shared folders (CIFS, WebDAV, NFS, etc) rather than installing new servers. This is accomplished by using Windows Redirectors.

• To simplify setup: only a Windows driver is needed to make use of the system.

• To be easy to use: existing applications must work without modification or recompilation.

• To enhance the performance, scalability, and capacity of the I/O system through parallel and distributed file system mechanisms: request splitting, balanced data allocation, load balancing, etc.

Figure 1. WinPFS installed in an Intranet. (Diagram: client applications, using Win32 or MPI-IO over WinPFS, reach distributed partitions on several sites through redirectors such as NFS, CIFS, and HTTP-WebDAV.)

To accomplish most of these goals, the proposed design is based on a new Windows kernel component that implements the basis of the file system, isolating users from the parallel file system, and on the use of protocols to connect to different network file systems.

Figure 1 shows how a client application, using any Windows interface (for instance, Win32), can access distributed partitions in a cluster or network of workstations using WinPFS. Communication with the data servers can be performed with any available protocol through kernel components, called redirectors, which redirect requests to remote servers using a specific protocol. In our first prototype, we have only considered CIFS/SMB servers and the issues related to coordinating several of them.

The position of WinPFS within the Windows kernel can be seen in Figure 2.

Figure 2. WinPFS in the Windows I/O subsystem. (Diagram: the Win32, POSIX, and DOS subsystems sit on the native NT API; below it, the I/O Manager dispatches to WinPFS and to the CIFS, WebDAV, NetWare, NFS, and local file system drivers.)

A user application uses an available interface (Win32, POSIX, etc), whose calls are converted into system calls (Windows system services). The I/O Manager, which is in charge of identifying the driver that is going to deal with each request, receives the requests in kernel mode. WinPFS registers itself as a virtual remote file system, which allows the driver to receive the requests.

The next sections describe the remote data access and file striping techniques used in WinPFS, the I/O request management, and the use of WinPFS.

3.1 Remote Data Access and File Striping

From the user's point of view, the Windows operating system provides access to remote data storage nodes through shared folders (local folders exported to remote computers). WinPFS creates a new shared folder, called \\PFS, on the client side. Therefore, users can access parallel files through the shared folder mechanism.

From the kernel point of view, accesses to remote data are performed using several mechanisms: CIFS (Common Internet File System), also known as the SMB protocol; UNC (Universal Naming Convention); and a special class of drivers, called redirectors, which receive I/O requests from users and send them to the appropriate servers.

The parallel file system identifies all the requests that try to access a virtual path (\\PFS) and processes them. For example, to create a file in WinPFS, we must use something like CreateFile(\\PFS\file.txt) instead of CreateFile(C:\tmp\file.txt) or CreateFile(\\server1\tmp\file.txt).
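As an illustration of this point (a sketch, not code from the paper), the following user-mode program opens and writes a file through the \\PFS prefix with the ordinary Win32 API; no WinPFS-specific calls are involved. The file name results.dat is hypothetical.

    #include <windows.h>
    #include <stdio.h>

    int main() {
        // An ordinary Win32 call; the \\PFS prefix routes the request to
        // WinPFS, which stripes the file across the configured shared folders.
        HANDLE h = CreateFileW(L"\\\\PFS\\results.dat",
                               GENERIC_READ | GENERIC_WRITE,
                               0, NULL, CREATE_ALWAYS,
                               FILE_ATTRIBUTE_NORMAL, NULL);
        if (h == INVALID_HANDLE_VALUE) {
            fprintf(stderr, "CreateFile failed: %lu\n", GetLastError());
            return 1;
        }
        const char msg[] = "hello, parallel file system";
        DWORD written = 0;
        WriteFile(h, msg, sizeof msg, &written, NULL); // split into stripes by WinPFS
        CloseHandle(h);
        return 0;
    }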

To achieve high performance, load balancing, and higher storage capacity, a file is striped across several nodes. Our file system must coordinate the access to several of those remote folders in order to achieve load balancing and data distribution.

Striping leads to the creation of one or more requests to access data, based on the buffer size and the current offset. The requests are then sent to one or several redirectors in order to access remote storage nodes (see Figure 3).
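A minimal sketch of the round-robin stripe mapping implied here, assuming a fixed stripe unit and a fixed number of servers (the function name and the 64 Kbyte/4-server defaults are illustrative, taken from the read/write example in Section 3.2):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct SubRequest {
        std::size_t   server;  // index of the shared folder holding this stripe
        std::uint64_t offset;  // offset inside that server's stripe file
        std::uint32_t length;  // bytes covered by this subrequest
    };

    // Split a file-level (offset, length) access into per-server subrequests,
    // assuming round-robin striping with a fixed stripe unit.
    std::vector<SubRequest> split_request(std::uint64_t offset, std::uint64_t length,
                                          std::uint32_t stripe = 64 * 1024,
                                          std::size_t servers = 4) {
        std::vector<SubRequest> subs;
        while (length > 0) {
            std::uint64_t stripe_no = offset / stripe;  // global stripe index
            std::uint32_t in_stripe = static_cast<std::uint32_t>(offset % stripe);
            std::uint64_t chunk = std::min<std::uint64_t>(stripe - in_stripe, length);
            subs.push_back({static_cast<std::size_t>(stripe_no % servers),
                            (stripe_no / servers) * stripe + in_stripe,
                            static_cast<std::uint32_t>(chunk)});
            offset += chunk;
            length -= chunk;
        }
        return subs;
    }

For a 256 Kbyte read at offset 0 with a 64 Kbyte stripe unit and four shares, this yields exactly the four 64 Kbyte subrequests mentioned in Section 3.2, one per shared folder.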

3.2 I/O Request Management

The Windows NT family has a layered I/O model, which allows the definition of several layers to process a request in the I/O subsystem. Each of those layers is a driver that can receive a request and pass it (or additional requests) to lower layers (drivers) in the I/O stack. This model allows the insertion of new layers (drivers) in the path of an I/O request, for example for encryption or compression. WinPFS takes advantage of this mechanism to provide parallel I/O functionality.

Figure 3. Serving a request to several remote servers.

To support this model, the Windows I/O subsystem provides two major features: the I/O Manager and I/O request packets (IRPs). The I/O Manager is in charge of receiving requests from users in the form of NT services (system calls), creating an IRP to describe each request, and delivering it to the appropriate device, which has an associated driver (in our case, WinPFS). All I/O requests are delivered to drivers as I/O request packets (IRPs). That way, the I/O subsystem presents a consistent interface to all kernel-mode drivers. This interface includes the typical I/O operations: open, close, read, write, etc. [13].

Apart from creating the IRP, the I/O Manager must identify the device and kernel component that is going to complete a request. In the case of remote storage, this work is supported by the MUP (Multiple UNC Provider), which identifies the kernel component (network redirector, or WinPFS) in charge of a specific network name (Figure 4, steps 2-3). Once the driver is identified (in our case, WinPFS), the IRP is passed to it (Figure 4, step 4). WinPFS then creates one or several subrequests to be sent to remote servers and/or to access local information (cache, cached metadata, etc).

The way in which the subrequests are created depends on the kind of request. Requests can be classified into the following categories:

• Create: The requests (IRPs) are replicated and sent to each server on which the file is going to be distributed.

• Read, write: The main request is split into smaller subrequests that are sent to the corresponding servers. For example, if we want to read 256 Kbytes and the striping unit is 64 Kbytes, four subrequests of 64 Kbytes are created (one for each of four shared folders).

• Create directory: The requests are replicated across all the shared folders, in order to keep the directory tree consistent in all of them. This means that if we want to create a directory \\PFS\tmp, this directory is created in every shared folder: \\server1\share1\tmp, \\server1\share2\tmp, \\server2\share1\tmp, etc. (see the sketch after this list).

• Metadata management, control, security: This kind of request needs a different approach. Some of them do not require splitting requests and/or accessing the remote servers.
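A user-mode sketch of the directory replication described above, assuming the list of shared folders is known (the share names are illustrative; inside WinPFS this replication happens at the IRP level rather than through the Win32 API):

    #include <windows.h>
    #include <string>
    #include <vector>

    // Replicate a directory across every shared folder so that the
    // directory tree stays consistent on all servers.
    bool replicate_directory(const std::wstring& relative_path) {
        const std::vector<std::wstring> shares = {   // illustrative share list
            L"\\\\server1\\share1", L"\\\\server1\\share2", L"\\\\server2\\share1"};
        bool ok = true;
        for (const auto& share : shares) {
            std::wstring path = share + L"\\" + relative_path;
            // Creating \\PFS\tmp results in one CreateDirectory per share.
            if (!CreateDirectoryW(path.c_str(), NULL) &&
                GetLastError() != ERROR_ALREADY_EXISTS)
                ok = false;
        }
        return ok;
    }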

As an example, the basic steps to create/open a file are presented below (Figure 4):

0. The I/O Manager receives a request and creates an IRP. This request contains the file name in the form: \\PFS\...

1. The I/O Manager sends the IRP packet to the MUP.

2. The MUP module has to look for a redirector (network file system) that recognizes the \\PFS string as a network name. The MUP asks all the redirectors until one answers affirmatively.

3. The MUP module indicates to the I/O Manager that the request must go to the redirector that recognized the network name. For this reason, our driver is registered as a redirector that recognizes all requests with the prefix \\PFS.

4. The I/O Manager sends the request to WinPFS.

5. The request packet received is split into several parallel subrequests, also in the form of IRPs, which are sent to redirectors (CIFS, NFS, etc). In order to create the subrequests, the driver has to know where the data of the parallel file are stored, which servers are available, and which protocols can be used for each request.


Figure 4. WinPFS steps to serve a request.

6. The redirectors send the requests to the remote servers, sending/receiving the data stripes for each server. In the example, data is striped using a round-robin policy. However, several policies can be used to allocate parallel file data on the remote servers.

7. Once the requests are served, the driver must join the results, waiting on a kernel event for the completion of all the subrequests.

Steps 0...3 are only needed when a file is created or opened. Once these requests have finished, the user receives a file handle (associated with a File Object in kernel space) that allows direct access to WinPFS.
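A kernel-mode sketch of the join in step 7 above, assuming one completion routine per subrequest IRP and a counter of outstanding subrequests (the context structure and function names are illustrative, not WinPFS's actual code):

    #include <ntddk.h>

    typedef struct _JOIN_CONTEXT {
        KEVENT Done;        // signaled when the last subrequest completes
        LONG   Outstanding; // number of subrequest IRPs still in flight
    } JOIN_CONTEXT;

    static VOID InitJoin(JOIN_CONTEXT* join, LONG count)
    {
        KeInitializeEvent(&join->Done, NotificationEvent, FALSE);
        join->Outstanding = count;  // set before the subrequests are sent
    }

    // Completion routine attached to every subrequest IRP.
    static NTSTATUS SubRequestComplete(PDEVICE_OBJECT DeviceObject, PIRP Irp, PVOID Ctx)
    {
        UNREFERENCED_PARAMETER(DeviceObject);
        UNREFERENCED_PARAMETER(Irp);
        JOIN_CONTEXT* join = (JOIN_CONTEXT*)Ctx;
        if (InterlockedDecrement(&join->Outstanding) == 0)
            KeSetEvent(&join->Done, IO_NO_INCREMENT, FALSE); // last one signals
        return STATUS_MORE_PROCESSING_REQUIRED;  // we own the subrequest IRPs
    }

    // Wait until every subrequest has completed before joining the results.
    static VOID WaitForSubRequests(JOIN_CONTEXT* join)
    {
        KeWaitForSingleObject(&join->Done, Executive, KernelMode, FALSE, NULL);
    }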

Two other important requests in a file system are read and write. Those requests come from the user indicating a buffer where the data are sent/received. That buffer must be split so that each server receives its part. To optimize the file system, the implementation must avoid copying any buffer. The Windows kernel provides the mechanisms needed to perform those operations without copies. The buffers that come from users are received in kernel structures called MDLs (Memory Descriptor Lists). WinPFS splits the request into subrequests by creating a new MDL, called a partial MDL, for each subrequest; the buffers from the original MDL are mapped into the partial MDLs, so no copy is needed. Each new buffer (MDL) is then sent to the appropriate redirector, which sends/receives the data to/from the remote servers.
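A minimal kernel-mode sketch of this zero-copy split using partial MDLs, assuming the original MDL and the per-stripe offsets are already known (a sketch under those assumptions, not WinPFS's actual implementation):

    #include <ntddk.h>

    // Build a partial MDL that describes one stripe-sized window of the
    // user buffer, so the redirector can transfer it without any data copy.
    static PMDL BuildStripeMdl(PMDL SourceMdl, ULONG StripeOffset, ULONG StripeLength)
    {
        PUCHAR base = (PUCHAR)MmGetMdlVirtualAddress(SourceMdl) + StripeOffset;

        // Allocate an MDL large enough to describe the stripe window.
        PMDL partial = IoAllocateMdl(base, StripeLength, FALSE, FALSE, NULL);
        if (partial == NULL)
            return NULL;

        // Map the pages of the original MDL into the partial MDL; the user
        // buffer itself is never copied.
        IoBuildPartialMdl(SourceMdl, partial, base, StripeLength);
        return partial;
    }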

3.3 Using WinPFS

From the administration point of view, the installation of WinPFS only requires three steps:

• To install a driver on the client nodes.
• To share folders on the servers; this can be accomplished through the Explorer or the Windows administrative tools.
• To indicate, in the registry of the client nodes, the shared folders to be used by WinPFS.

From the user's point of view, using the parallel file system only requires that all paths be prefixed with \\PFS, but we plan to implement the mechanisms necessary to map the remote name (\\PFS) to common drive letters such as D:.

Apart from the detail of using the naming convention \\PFS\file, nothing more is needed. WinPFS can be used with the Win32 API, POSIX, Cygwin, or any other I/O API that ultimately uses Windows services.

Other important features to take into account are:


• Caching: WinPFS caching relies on the caching mechanisms of the redirectors. Therefore, caching is currently limited to the Windows caching model, but more advanced caching algorithms could be implemented in the future. In addition, the caching mechanisms can be disabled through the Win32 API.

• Security and authentication: Security and authentication issues are handled by the operating system. If we work in a Windows "domain", there are no authentication problems, because domain users can access all the resources. Of course, access to shared folders (including WinPFS parallel partitions) can be controlled by indicating which users can access the resource; this is accomplished with the Windows security model. If we want to use servers across several untrusted domains, or in workgroups, some changes must be made to our prototype to incorporate the authentication mechanisms.

• Data consistency between clients: Another important feature is consistency between several clients accessing a parallel/distributed file. At the moment, this is handled by the default mechanism used by the CIFS redirector. The CIFS protocol uses a mechanism called oplocks (opportunistic locks) that enables the protocol to maintain consistency between clients [14].

4. EVALUATION

To measure the performance of the first WinPFS prototype, we ran some evaluation tests. The test creates a file of 100 Mbytes that is written sequentially and then read sequentially using a static buffer size (several executions are performed with different buffer sizes). We disabled the client cache with the option FILE_FLAG_NO_BUFFERING when creating the file (CreateFile). We do not need to use special features, such as IOCTLs, of any kind; the Windows API provides this feature and others.

With this test, we want to measure the performance of access through the network to remote disks (writes) and to the servers' cache (reads). We write the file with WinPFS, so the file is striped and sent to several servers that write it to disk, and then we read it back from the servers, where the data remain in the caches.
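A user-mode sketch of the write part of this test under stated assumptions (100 MB file, unbuffered I/O, buffer size passed on the command line; the file name is hypothetical, error handling is trimmed, and with FILE_FLAG_NO_BUFFERING the buffer size must be a multiple of the sector size):

    #include <windows.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char** argv) {
        DWORD buf_size = (argc > 1) ? (DWORD)atoi(argv[1]) : 64 * 1024;
        const unsigned long long total = 100ULL * 1024 * 1024;  // 100 MB test file

        // Page-aligned (hence sector-aligned) buffer, required by
        // FILE_FLAG_NO_BUFFERING.
        void* buf = VirtualAlloc(NULL, buf_size, MEM_COMMIT | MEM_RESERVE,
                                 PAGE_READWRITE);

        HANDLE h = CreateFileW(L"\\\\PFS\\bench.dat", GENERIC_WRITE, 0, NULL,
                               CREATE_ALWAYS,
                               FILE_FLAG_NO_BUFFERING,  // bypass the client cache
                               NULL);
        if (h == INVALID_HANDLE_VALUE) return 1;

        DWORD t0 = GetTickCount();
        DWORD written;
        for (unsigned long long done = 0; done < total; done += buf_size)
            WriteFile(h, buf, buf_size, &written, NULL);  // sequential writes
        DWORD ms = GetTickCount() - t0;
        CloseHandle(h);

        // Bandwidth in Mbits/s, the unit used in Figures 6 and 7.
        printf("write: %.1f Mbits/s\n", (total * 8.0 / 1e6) / (ms / 1000.0));
        VirtualFree(buf, 0, MEM_RELEASE);
        return 0;
    }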

We joined two clusters of four nodes (see Figure 5). Each node is a dual-processor Pentium III at 1 GHz with 1 GByte of main memory, a 200 GByte disk, and a Gigabit Ethernet network, with two 3Com Gigabit Ethernet switches and four nodes connected to each of them.

Figure 5. Evaluation infrastructure: eight PCs, four attached to each of two Gigabit Ethernet switches.

From the operating system point of view, a Windows domain was created, with one computer running Windows 2003 Server and the other seven running Windows XP Professional.

The test was executed with 8 clients running the application simultaneously and with different configurations of the I/O system. First, the evaluation was done with the simple folder-sharing mechanism available by default in Windows (CIFS). This allowed us to evaluate the performance provided by one server with different numbers of clients. Then, we tested WinPFS with different numbers of servers in parallel: PFS88 (8 servers used in parallel), PFS44 (4 servers used in parallel), and PFS84 (4 servers used in parallel, selected from a set of 8 servers).

Figure 6 shows the results obtained in the write part of the test with 8 clients (one application running on each node). As can be seen, the performance obtained by WinPFS is higher than that of a single server attending 8 clients. With one single CIFS server, 40 Mbits/s are achieved with 64 KB and 128 KB application buffers. WinPFS obtains 150 Mbits/s with four servers (PFS44), 200 Mbits/s with four servers selected from eight (PFS84), and almost 250 Mbits/s using eight servers (PFS88). The reason is that the I/O requests are distributed among 8 servers instead of one; in the single-server case, the I/O throughput is that provided by one disk.

Figure 7 shows the performance obtained in the read part of the test. In this case, the performance is higher with WinPFS (1200 Mbits/s). The CIFS server obtains good results (600 Mbits/s) because all the data are in main memory, so no disk access is required.

Figure 6. Write results for 8 clients (write to remote disks).

Figure 7. Read results for 8 clients (read from servers' cache).

One thing that must be remarked is that the network connecting the two clusters imposed a limit of 1 Gbit/s (120 Mbytes/s) on the system. If we used one server in one subcluster and the client in the other subcluster, we could never exceed this limit. However, as can be seen in Figure 7, WinPFS can exceed this limit thanks to the use of several servers in parallel (some from one cluster and others from the second).

To clarify the results obtained, Figure 8 and Figure 9 show the speedup obtained with the PFS88 (8 servers in parallel) configuration with respect to the single CIFS server.

As can be seen, the speedup in the read part is smaller (less than 100% improvement). As noted above, this is because data were served from the server cache without disk accesses.

Figure 8. Write speedup (CIFS vs. WinPFS PFS88) for buffer sizes from 1K to 1M and 1, 2, 4, and 8 clients.

Figure 9. Read speedup (CIFS vs. WinPFS PFS88) for buffer sizes from 1K to 1M and 1, 2, 4, and 8 clients.

The write test takes into account the accesses to disks, and the results show that under those circumstances a speedup factor of 5 (500% improvement) is achievable, reaching almost 700% improvement with large buffer sizes. We believe the reads could achieve this level of improvement if the data were flushed to disk.

With one client, the speedup is about a factor of one for writes and nonexistent for reads; WinPFS even performs worse with larger application buffers. The latter is because WinPFS splits the buffer into 64 Kbyte stripes, so we are limited to the performance obtained with 64 Kbyte buffers. This might be solved with a larger striping unit, but that may limit the system's parallelism.

5. CONCLUSIONS AND FUTURE WORK

In this paper, we have described the design of a parallel file system for clusters of Windows workstations. This system provides parallel I/O features that allow the integration of existing storage resources by sharing folders and using a driver (WinPFS).

Our approach complements the Windows kernel by routing all requests for the network name \\PFS to a driver that splits the requests and uses several data servers in parallel. The integration of the file system into the kernel provides higher performance than solutions based on libraries, and it makes no difference from the user's point of view, so users can execute their applications without rewriting or recompiling them.

WinPFS achieves high scalability, availability, and performance by using several servers in parallel. It also allows us to build a high-capacity storage node from a set of workstations: in our tests, a 1.6 Terabyte system was built using 200 GByte disks.

In the tests, the system hit two performance limits: the disks for write operations, and the network bandwidth for read operations.

With the use of redirectors, a client can stripe files over CIFS, NFS, and WebDAV servers, regardless of whether those servers reside on Windows or UNIX machines, NAS devices, or any other storage accessible through a redirector-supported protocol.

Future work includes dynamically adding and removing storage nodes in the cluster; data allocation and load balancing for heterogeneous distributed systems; and the parallel use of heterogeneous resources and protocols (CIFS, NFS, WebDAV, etc) in a network of workstations, addressing the implications for performance, management, and security. In addition, we will use the Active Directory service provided by Windows to create a metadata repository, so that all clients can obtain a consistent image of the parallel files.

6. ACKNOWLEDGMENTS

This work has been partially supported by Microsoft Research Europe, by the Community of Madrid under contract 07T/0020/2003, and by the Spanish Ministry of Science and Technology under contract TIC2003-01730.

7. ADDITIONAL AUTHORS

Additional authors: Felix Garcia ([email protected]) and Alejandro Calderón ([email protected]).

8. REFERENCES

[1] Peter M. Chen, David A. Patterson. Maximizing Performance in a Striped Disk Array. Proceedings of the 17th Annual International Symposium on Computer Architecture, ACM SIGARCH Computer Architecture News, 1990.

[2] J. Gray. Data Management: Past, Present, and Future. IEEE Computer, Vol. 29, No. 10, 1996, pp. 38-46.

[3] A. Chervenak, I. Foster, C. Kesselman, C. Salisbury, S. Tuecke. The Data Grid: Towards an Architecture for the Distributed Management and Analysis of Large Scientific Datasets. Journal of Network and Computer Applications, 23: 187-200, 2001

[4] D. A. Patterson, G. Gibson, and R. H. Katz. A Case for Redundant Array of Inexpensive Disks (RAID). In Proceedings of ACM SIGMOD, pages 109-116, Chicago, IL, June 1988.

[5] MPI Forum. 1997. MPI-2: Extensions to the Message-Passing Interface. http://www.mpi-forum.org

[6] P. Corbett, S. Johnson, and D. Feitelson. Overview of the Vesta Parallel File System. ACM Computer Architecture News, vol. 21, no. 5, pp. 7-15, Dec. 1993.

[7] S. A. Moyer and V. S. Sunderam. PIOUS: A Scalable Parallel I/O System for Distributed Computing Environments. Proceedings of the Scalable High-Performance Computing Conference, 1994, pp. 71-78.

[8] N. Nieuwejaar and D. Kotz. The Galley Parallel File System. Proceedings of the 10th ACM International Conference on Supercomputing, May 1996.

[9] J. Carretero, F. Perez, P. de Miguel, F. Garcia, and L. Alonso. Performance Increase Mechanisms for Parallel and Distributed File Systems. Parallel Computing: Special Issue on Parallel I/O Systems, Elsevier, no. 3, pp. 525-542, Apr. 1997.

[10] Fuerle, T., Schikuta, E., and Wanek, H. 1999. Meta-ViPIOS: Harness Distributed I/O Resources with ViPIOS. Journal of Research Computing and Systems, 4(2):124-142.

[11] P. H. Carns, W. B. Ligon III, R. B. Ross, and R. Thakur. PVFS: A Parallel File System for Linux Clusters. Tech. Rep. ANL/MCS-P804-0400, 2000.

[12] F. Garcia, A. Calderon, J. Carretero, J. M. Perez, J. Fernandez. The Design of the Expand Parallel File System. International Journal of High Performance Computing Applications, 2003.

[13] Rajeev Nagar. Windows NT File System Internals: A Developer's Guide. O'Reilly, 1997, p. 158.

[14] SNIA (Storage Networking Industry Association). Common Internet File System (CIFS) Technical Reference, Revision 1.0, 2002, pp. 6-10.


An Application-Oriented Communication System for Clusters of Workstations

Thiago Robert C. Santos and Antonio Augusto Frohlich

Laboratory for Software/Hardware Integration (LISHA)
Federal University of Santa Catarina (UFSC)
PO Box 476, 88049-900 Florianopolis - SC - Brazil
Phone: +55 48 331-7552  Fax: +55 48 331-9770
{robert | guto}@lisha.ufsc.br
http://www.lisha.ufsc.br/~{robert | guto}

Abstract

This paper proposes an application-oriented communication sub-system to be used in SNOW, a high-performance, application-oriented parallel-programming environment for dedicated clusters. The proposed communication sub-system is composed of a baseline architecture and a family of lightweight network interface protocols. Each one of these protocols is built on top of the baseline architecture and can be tailored to satisfy the needs of specific classes of parallel applications. The family of lightweight protocols, along with the baseline architecture that supports it, constitutes a customizable component in EPOS, an application-oriented, component-based operating system that is the core of SNOW. The idea behind providing a set of low-level protocol implementations instead of a single monolithic protocol is that parallel applications running on clusters can improve their communication performance by using the most appropriate protocol for their needs.

Keywords: lightweight communication protocols, application-oriented operating systems, user-level communication.

1 Introduction

Clusters of commodity workstations are now commonplace in high-performance computing. In fact, commercial off-the-shelf processors and high-speed networks have evolved so much in recent years that most of the hardware features once used to characterize massively parallel processors (MPPs) are now available in clusters as well. Nonetheless, the majority of the clusters in use today rely on commodity run-time support systems (run-time libraries, operating systems, compilers, etc.) that have usually been designed in disregard of both parallel applications and hardware. In such systems, delivering a parallel API like MPI is usually achieved through a series of patches or middleware layers that invariably add overhead for applications. Therefore, it seems logical to suppose that a run-time support system specially designed to support parallel applications on clusters of workstations could considerably improve performance, as well as other software quality metrics (e.g. usability, correctness, adaptability).

Our supposition that ordinary run-time support systems are inadequate for high-performance computing is supported by a number of research projects in the field focusing on the implementation of message passing [8, 9] and shared memory [10, 11, 12] middleware and on user-level communication [4, 5, 6]. If ordinary operating systems could match parallel applications' needs, delivering adequate run-time support for the most traditional programming paradigms with minimum overhead, much of this research would be hard to justify outside the realm of operating systems.


Indeed, the way ordinary operating systems handle I/O is largely based on multitasking concepts such as domain protection and resource sharing. This impacts the way recurring operations like system calls, CPU scheduling, and application data management are implemented, leaving little room for novel technological features [2]. Not surprisingly, system designers often have to push the operating system out of the way in order to implement efficient schedulers and communication systems for clusters.

In addition, commodity operating systems usually target reconfigurability at standards conformance and device support, failing to comply with applications' requirements. Clusters have been quenching the industry's thirst for low-end supercomputers for years, as HPC service providers deploy cost-effective solutions based on cluster systems. There are all kinds of applications running on clusters today, ranging from communication-intensive distributed databases to CPU-hungry scientific applications. Having the chance to customize a cluster's run-time support system to satisfy particular applications' needs could improve the system's overall performance. Indeed, systems such as PEACE [13] and CHOICES [14] already confirmed this hypothesis in the 90s.

In this paper, we discuss the use of dedicated run-time support systems, or, more specifically, of dedicated communication systems, as effective alternatives to support communication-intensive parallel applications in clusters of workstations. The research that underlies this discussion was carried out in the scope of the SNOW project [19], which aims at developing a high-performance, application-oriented parallel-programming environment for dedicated clusters. SNOW's run-time system actually comes from another project, namely EPOS, which takes on a repository of software components, an adaptive component framework, and a set of tools to build application-oriented operating systems on demand [16].

The remainder of this paper is structured as follows. Section 2 gives an overview of EPOS. Section 3 presents a redesign of the EPOS communication system aimed at enhancing support for network interface protocols and describes the baseline architecture that supports these protocols. Section 4 elaborates on related work. Conclusions are presented in Section 5, along with directions for future work.

2 An Overview of EPOS

EPOS, the Embedded Parallel Operating System, is a highly customizable operating system developed using state-of-the-art software engineering techniques. EPOS consists of a collection of reusable and adaptable software components and a set of tools that support parallel application developers in "plugging" these components into an adaptive framework in order to produce a variety of run-time systems, including complete operating systems. Being the fruit of Application-Oriented System Design [15], a method that covers the development of application-oriented operating systems from domain analysis to implementation, EPOS can be customized to match the requirements of particular parallel applications. EPOS components, or scenario-independent system abstractions as they are called, are grouped into families and kept independent of the execution scenario by deploying aspect separation and other factorization techniques during the domain engineering process, illustrated in Figure 1. EPOS components can be adapted to be reused in a variety of execution scenarios. Usability is largely improved by hiding the details of a family of abstractions behind a hypothetical interface, called the family's inflated interface, and delegating the selection of proper family members to automatic configuration tools.
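A minimal C++ sketch of the inflated-interface idea under stated assumptions (the family, member, and trait names are illustrative, not EPOS's actual declarations): the application programs against the family's interface, and a configuration key binds it to a concrete family member at compile time.

    // Hypothetical family of Thread abstractions; EPOS's real headers differ.
    struct ExclusiveThread {           // member for single-tasking scenarios
        static void yield() { /* switch among cooperative threads */ }
    };
    struct ConcurrentThread {          // member for multi-tasking scenarios
        static void yield() { /* invoke the scheduler */ }
    };

    // Configuration key of the kind produced by the configurator tool.
    struct Traits {
        using Thread = ConcurrentThread;  // binding selected at configuration time
    };

    // The inflated interface: applications only ever name "Thread".
    using Thread = Traits::Thread;

    int main() {
        Thread::yield();  // resolved statically to the selected family member
        return 0;
    }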

An application written against inflated interfaces can be submitted to a tool that scans it searching for references to the interfaces, thus determining the features of each family that are necessary to support the application at run-time. This tool, the analyzer, outputs a specification of requirements in the form of partial component interface declarations, including the methods, types, and constants that were used by the application.

The primary specification produced by the analyzer is subsequently fed into a second tool, the configurator, which consults a build-up database to further refine the specification. This database holds information about each component in the repository, as well as dependencies and composition rules that are used by the configurator to build a dependency tree.


Figure 1: An overview of Application-Oriented System Design. (Diagram: a problem domain is factored into families of abstractions with inflated interfaces and members, configurable features, scenario aspects, adapters, and frameworks.)

Additionally, each component in the repository is tagged with a "cost" estimation, so that the configurator will choose the "cheapest" option whenever two or more components satisfy a dependency. The output of the configurator consists of a set of keys that define the binding of inflated interfaces to abstractions and activate the scenario aspects and configurable features identified as necessary to satisfy the constraints dictated by the target application or by the configured execution scenario.

Figure 2: An overview of EPOS generation tools. (Diagram: the analyzer, configurator, and generator transform an application program, inflated interfaces, components, adapters, and aspects into a tailored system instance.)

The last step in the run-time system generation process is accomplished by the generator. This tool translates the keys produced by the configurator into parameters for a statically metaprogrammed component framework and triggers the compilation of a tailored system instance. Figure 2 gives an overview of the whole procedure.


3 EPOS Communication System

The EPOS communication system is designed around three major families of abstractions: communicator, channel, and network. The communicator family encompasses communication end-points such as link, port, and mailbox, thus acting as the main interface between the communication system and application programs (see Note 1). The second family, channel, features communication protocols, so that application data fed into the communication system via a communicator gets delivered at the destination communicator accordingly. A channel implements a communication protocol that would be classified at level four (transport) according to the ISO/OSI reference model. The third family in the EPOS communication system, network, is responsible for abstracting distinct network technologies through a common interface (see Note 2), thus keeping the communication system itself architecture-independent and allowing for flexible combinations of protocols and network architectures.

Note 1: The component nature of EPOS enables individual elements of the communication system to be reused in isolation, even directly by applications. Therefore, the communicator is not the only visible interface in the communication system.
Note 2: Each member in the network family is allowed to extend this interface to account for advanced features.

Figure 3: An overview of the EPOS communication system (communicator, channel, and network).
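An illustrative C++ sketch of how the three families could compose statically, in the spirit of the description above (the class and template names are assumptions, not EPOS's actual interfaces):

    #include <cstddef>

    // Family: network -- abstracts a network technology behind a common interface.
    struct Myrinet {
        void send(const void* frame, std::size_t size) { /* NIC-specific send */ }
        std::size_t receive(void* frame, std::size_t max) { return 0; /* ... */ }
    };

    // Family: channel -- a transport-level protocol over some network.
    template <typename Network>
    struct Stream {
        void put(const void* data, std::size_t size) { net.send(data, size); }
        Network net;
    };

    // Family: communicator -- the end-point applications talk to.
    template <typename Channel>
    struct Port {
        void write(const void* data, std::size_t size) { channel.put(data, size); }
        Channel channel;
    };

    int main() {
        Port<Stream<Myrinet>> port;  // a flexible combination of family members
        const char msg[] = "hello";
        port.write(msg, sizeof msg);
        return 0;
    }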

Previous partial implementations of the EPOS communication system for the Myrinet high-speed network architecture confirmed the lightness of its core, delivering unprecedented bandwidth and latency to parallel applications running on SNOW [17]. Nonetheless, the original design of the EPOS communication system makes it hard to split the implementation of network interface protocols [1] between the processors in the host machine and in the network adapter. Besides, it is very difficult to specify a single network interface protocol that is optimal for all parallel applications, since different applications impose different traffic patterns on the underlying network. Instead of developing a single, highly complex, all-encompassing protocol, it appears more feasible to construct an architecture that permits fine-grain selection and dynamic configuration of precisely specified low-level lightweight protocol mechanisms. In an application-oriented environment, this set of low-level protocols can be used to customize communication according to applications' needs.

The EPOS design allows the several network interface protocols that arise from design decisions related to network features to be grouped into a software component with a defined interface, a family, that can be easily accessed by the communication system of the OS. The EPOS framework implements mechanisms for fine-grain selection of modules according to applications' needs. These same mechanisms can be used to select the low-level lightweight protocols that best satisfy the applications' communication requirements. Besides, an important step towards an efficient, application-oriented communication system for clusters is to better understand the relation between design decisions in low-level communication software and the performance of high-level applications. Grouping the different low-level implementations of communication protocols in a component coupled with the communication system has the additional advantage of allowing an application to experiment with different communication schemes, collecting metrics in order to identify the best scheme for its needs. In addition, structuring communication in such a modular fashion enhances maintainability and extensibility.

3.1 Myrinet baseline architecture

The baseline communication architecture that supports the low-level lightweight protocols for the Myrinet networking technology must be simple and flexible enough not to hinder the design and implementation of specific protocols. The highest possible bandwidth and lowest possible latency are desired, since complex protocol implementations will definitely affect both. User-level communication was academia's best answer to the lack of efficient communication protocols for modern, high-performance networks. The baseline architecture described in this section follows the general concepts behind successful user-level communication systems for Myrinet. Figure 4 exhibits this architecture, highlighting the data flow during communication as well as the host and NIC memory layout.

Figure 4: The Myrinet family baseline architecture. (Diagram: on each NIC, the Send Ring, Receive Ring, Tx/Rx FIFO Queues, and Tx/Rx DMA Requests; on each host running EPOS, the Unsolicited Ring in non-swappable physical memory of a flat address space; numbered arrows 1-4 mark the flow of messages and frames during communication.)

The NIC memory holds the six buffers that are used during communication. The Send Ring and Receive Ring are circular buffers that hold frames before they are accessed by the Network-DMA engine, which is responsible for sending/receiving frames to/from the Myrinet network. Rx DMA Requests and Tx DMA Requests are circular chains of DMA control blocks, used by the Host-DMA engine to transfer frames between host and NIC memory. The Rx FIFO Queue and Tx FIFO Queue are circular FIFO queues used by the host processor and the LANai, Myrinet's network interface processor, to signal to each other the arrival of a new frame. The size of these buffers affects communication performance and reliability, and the choice of their sizes is influenced by the host's run-time system, memory space considerations, and hardware restrictions.

Much of the overhead observed in traditional protocol implementations is due to memory copies during communication. Some network technologies provide host-to-network and network-to-host data transfers, but Myrinet requires that all network traffic go through NIC memory. Therefore, at least three copies are required for each message: from host memory to NIC memory on the sending side, from NIC to NIC, and from NIC memory to host memory on the receiving side. Write-combining, the start-up overhead of DMA transfers, and the fact that a DMA control block has to be written in NIC memory for each Host-DMA transaction make write PIO more efficient than DMA for small frames. The baseline architecture therefore copies data from host to NIC using programmed I/O for small frames (less than 512 bytes) and Host-NIC DMA for large frames. Since reads over the I/O bus are much slower than writes, the baseline architecture uses DMA for all NIC-to-host transfers.

During communication, messages are split into frames of fixed size that are pushed into the communication pipeline. The frame size that minimizes the transmission time of the entire message in the pipelined data transfer is calculated [18], and the baseline architecture uses this value to fragment messages. Besides, the maximum frame size (MTU) is dynamically configurable.


For each frame, the sender host processor uses write PIO to fill up an entry in the Tx DMA Requests (for large frames) or to copy the frame directly to the Send Ring in NIC memory (for small frames) (1). It then triggers a doorbell, creating a new entry in the Tx FIFO Queue and signaling to the LANai processor that a new frame must be sent. For large frames, the transmission of frames between host and NIC memory is carried out asynchronously by the Host/NIC DMA engine (1), and the frame is sent by the LANai as soon as possible after the corresponding DMA finishes (2). Small frames are sent as soon as the doorbell is rung, since at this point the frame is already in NIC memory. A similar operation occurs on the receiving side: when a frame arrives from the network, the LANai receives it and fills up an entry in the Rx DMA Requests chain. The message is assembled asynchronously in the Unsolicited Ring circular buffer in host memory (3). The receiving side is responsible for copying the whole message from the Unsolicited Ring before it is overwritten by other messages (4). Note that specific protocol implementations can avoid this last copy by using rendezvous-style communication, where the receiver posts a receive request and provides a buffer before the message is sent; a credit scheme, where the sender is required to have credits for the receiver before it sends a packet; or some other technique, achieving the optimal three copies.
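A simplified C++ sketch of the sender-side decision just described, assuming the 512-byte PIO/DMA threshold from the previous paragraph (the buffer names, slot layout, and doorbell function are illustrative stand-ins for the mapped NIC structures):

    #include <cstdint>
    #include <cstring>

    constexpr std::size_t PIO_THRESHOLD = 512;   // PIO beats DMA below this size
    constexpr std::size_t SLOT_SIZE     = 4096;  // illustrative frame slot size

    // Illustrative stand-ins for NIC-resident structures (mapped NIC memory).
    unsigned char send_ring[8 * SLOT_SIZE];
    struct DmaRequest { const void* host_addr; std::uint32_t length; };
    DmaRequest tx_dma_requests[8];

    void ring_doorbell(std::uint32_t slot) { /* enqueue on the Tx FIFO Queue */ }

    void send_frame(std::uint32_t slot, const void* frame, std::size_t size) {
        if (size < PIO_THRESHOLD) {
            // Small frame: copy straight into the Send Ring with programmed I/O;
            // the LANai can transmit it as soon as the doorbell rings.
            std::memcpy(&send_ring[slot * SLOT_SIZE], frame, size);
        } else {
            // Large frame: post a host-to-NIC DMA control block; the LANai
            // transmits the frame once the Host-DMA transfer completes.
            tx_dma_requests[slot] = {frame, static_cast<std::uint32_t>(size)};
        }
        ring_doorbell(slot);  // signal the LANai that a new frame is pending
    }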

The host memory layout is defined by the operating system being used. Besides, the Myrinet NIC imposes some constraints on the usage of its resources that must be addressed by the OS. The most critical one relates to the Host/NIC DMAs: the Host-DMA engine can only access contiguous pages pinned in physical memory. Most communication system implementations for Myrinet address this issue by letting applications pin/unpin the pages that contain their buffers on the fly during communication or by using a pinned copy block. The problem with these approaches is that they add extra overhead, since pinning/unpinning memory pages requires system calls, which imply context saving and switching, and using a pinned copy block adds extra data copies in host memory. In EPOS, where swapping can be left out of the run-time system by mapping logical address spaces contiguously in physical memory, this issue does not affect the overall communication.

Figure 5: The Myrinet family baseline architecture in a GNU/Linux host. (Diagram: host memory split into the application address space and a non-swappable copy block; Send and Receive Rings on the NICs; numbered arrows mark the flow of messages and frames.)

Figure 5 shows the memory layout and dynamic data flow of an implementation of the baseline architecture in a Myrinet GNU/Linux cluster. Issues such as address translation, kernel memory allocation, and memory pinning had to be addressed in this implementation.


Besides, a pinned copy block in kernel memory is used to interface the Host/NIC DMA transfers, which adds one extra copy for each message on the sending side. Figure 6 exhibits a performance comparison between the GNU/Linux baseline architecture and GM (version 1.6.5), the low-level driver provided by the manufacturer of Myrinet. A round-trip time test was performed in order to compare the two systems' latency.

Figure 6: Comparison between the baseline architecture's and GM's latency (in microseconds) for different frame sizes (in bytes).

Many Myrinet protocols assume that the Myrinet network is reliable and that, for that reason, no retransmission or time-out mechanism is needed. Indeed, the risk of a packet being lost or corrupted in a Myrinet network is so small that reliability mechanisms can safely be left out of the baseline architecture. Alternative implementations that assume unreliable network hardware and recover from lost, corrupted, and dropped frames by means of time-outs, retransmissions, and hardware-supported CRC checks are addressed by specific protocol implementations, since different application domains may need different trade-offs between reliability and performance.

The presented architecture may drop frames because of insufficient buffer space. The baseline architecture rests on a NIC hardware mechanism to partially solve this problem: backpressure, Myrinet's hardware link-level flow-control mechanism, is used to prevent overflow of network interface buffers, stalling the sender until the receiver is able to drain frames from the network. More sophisticated flow-control mechanisms must be provided by specific protocol implementations, since specialized applications may require only limited flow control from the network, performing some kind of control on their own.

In addition, the architecture supports only point-to-point messages. Multicast and broadcast are desirable, since they are fundamental components of collective communication operations. Lightweight protocols providing these features could easily be implemented on top of point-to-point messages or using more efficient techniques [3]. Finally, the proposed baseline architecture provides no protection, since a large number of parallel applications run on dedicated environments where protection is not needed.



3.2 Myrinet low-level lightweight protocols

While the baseline architecture is closely related to the underlying networking technology, low-level lightweight protocols are designed according to the communication requirements of specific classes of parallel applications. The lightweight protocols in the Myrinet family are divided into two categories: Infrastructure and High-Performance protocols.

Infrastructure protocols provide communication services that were left out of the baseline architecture: transparent multicasting, QoS, connection management, protection schemes, reliable delivery and flow-control mechanisms, among others. In order to keep latencies low, it would be desirable to efficiently execute the entire protocol stack, up to the transport layer, in hardware; programmable network interfaces can be used to achieve that goal. Infrastructure protocols exploit the network processor to the maximum, using more elaborate Myrinet control programs in order to offload a broader range of communication tasks to the LANai. Communication performance is affected due to the trade-off between performance and MCP complexity, but for some specific classes of applications this is a small price to pay for the communication services provided.

High-performance protocols deliver minimum latency and maximum bandwidth to the run-time system. They usually consist of minimal modifications to the baseline architecture that are required by applications, or of protocols that monitor traffic patterns and dynamically modify the baseline architecture's customization points in order to address dynamic changes in application requirements.

4 Related Work

There are several communication system implementations for Myrinet, such as AM, BIP, PM, and VMMC, to name a few. Although these communication systems share some common goals, performance being one of them, they have made very different decisions in both the communication model and the implementation, consequently offering different levels of functionality and performance. From the several published comparisons of these implementations, one can conclude that there is no single best low-level communication protocol, since the communication patterns of the whole run-time system (application and run-time support) influence the impact that the low-level implementation decisions of a given communication system have on application performance. Moreover, run-time system specifics greatly influence the functionality of communication system implementations.

While the Myrinet communication systems mentioned before try to deliver a generic, all-purpose solution for low-level communication, the main goal of the presented research is the customization of low-level communication software. The architecture we propose should be flexible enough to allow a broad range of the implementation decisions behind each of the several Myrinet communication systems to be supported as lightweight protocols.

Although our work has focused on Myrinet, there are other networks to which the same concepts can be applied. Cluster interconnection technologies that are also implemented with a programmable NIC able to execute a variety of protocol tasks include DEC's Memory Channel, the interconnection network in the IBM SP series, and Quadrics' QsNet.

5 Conclusions

The widespread use of cluster systems brings up the need for improvements in the software environment used in cluster computing. Cluster system software must be redesigned to better exploit clusters' hardware resources and to keep up with applications' requirements. Parallel-programming environments need to be developed with both cluster and application efficiency in mind.



In this paper we outlined the design of a communication sub-system based on low-level lightweight protocols, along with the design decisions related to this sub-system's baseline architecture for the Myrinet networking technology. Experiments are being carried out to determine the best values for the architecture's customization points under different traffic pattern conditions.

We intend to create an efficient, application-oriented communication system for clusters, and the redesign of the EPOS communication system was one more step towards that goal. We believe that it is necessary to better understand the relation between the design decisions in low-level communication software and the performance of high-level applications. The proposed lightweight communication protocols, along with the application-oriented run-time system provided by EPOS, will be used to evaluate how different low-level communication schemes impact parallel applications' performance.

References

[1] Raoul A. F. Bhoedjang, Tim Ruhl, and Henri E. Bal. User-level Network Interface Protocols. IEEE Computer, 31(11):53–60, November 1998.

[2] IEEE Task Force on Cluster Computing. Cluster Computing White Paper, Mark Baker, editor, online edition, December 2000. [http://www.dcs.port.ac.uk/˜mab/tfcc/WhitePaper].

[3] M. Gerla, P. Palnati, and S. Walton. Multicasting protocols for high-speed, wormhole-routing local area networks. In Proceedings of the SIGCOMM, pages 184–193, 1996.

[4] Loic Prylli and Bernard Tourancheau. BIP: a New Protocol Designed for High Performance Networking on Myrinet. In Proceedings of the International Workshop on Personal Computer based Networks of Workstations, Orlando, USA, April 1998.

[5] Steven S. Lumetta, Alan M. Mainwaring, and David E. Culler. Multi-Protocol Active Messages on a Cluster of SMPs. In Proceedings of Supercomputing'97, San Jose, USA, November 1997.

[6] Hiroshi Tezuka, Atsushi Hori, Yutaka Ishikawa, and Mitsuhisa Sato. PM: An Operating System Coordinated High Performance Communication Library. In High-Performance Computing and Networking, volume 1225 of Lecture Notes in Computer Science, pages 708–717. Springer, April 1997.

[7] T. von Eicken, A. Basu, V. Buch, and W. Vogels. U-Net: A User-Level Network Interface for Parallel and Distributed Computing. In Proceedings of the 15th ACM SOSP, pages 40–53, Copper Mountain, Colorado, December 1995.

[8] W. Gropp, E. Lusk, N. Doss, and A. Skjellum. A high-performance, portable implementation of the MPI message passing interface standard. Parallel Computing, 22(6):789–828, September 1996.

[9] Greg Burns, Raja Daoud, and James Vaigl. LAM: An Open Cluster Environment for MPI. In Proceedings of the Supercomputing Symposium, pages 379–386, 1994.

[10] Brian N. Bershad, Thomas E. Anderson, Edward D. Lazowska, and Henry M. Levy. User-level Interprocess Communication for Shared Memory Multiprocessors. ACM Transactions on Computer Systems, 9(2):175–198, May 1991.

[11] Jorg Cordsen. Virtuell Gemeinsamer Speicher. PhD thesis, Technical University of Berlin, Berlin, Germany, 1996.

[12] H. Hellwagner, W. Karl, M. Leberecht, and H. Richter. SCI-Based Local-Area Shared-Memory Multiprocessor. In Proceedings of the International Workshop on Advanced Parallel Processing Technologies - APPT'95, Beijing, China, September 1995.

[13] Wolfgang Schroder-Preikschat. The Logical Design of Parallel Operating Systems. Prentice-Hall, Englewood Cliffs, U.S.A., 1994.

[14] Roy H. Campbell, Nayeem Islam, and Peter Madany. Choices, Frameworks and Refinement. Computing Systems, 5(3):217–257, 1992.

[15] Antonio Augusto Frohlich. Application-Oriented Operating Systems. Number 17 in GMD Research Series. GMD - Forschungszentrum Informationstechnik, Sankt Augustin, August 2001.

[16] Antonio Augusto Frohlich and Wolfgang Schroder-Preikschat. High Performance Application-oriented Operating Systems – the EPOS Approach. In Proceedings of the 11th Symposium on Computer Architecture and High Performance Computing, pages 3–9, Natal, Brazil, September 1999.

[17] Antonio Augusto Frohlich and Wolfgang Schroder-Preikschat. On Component-Based Communication Systems for Clusters of Workstations. In Proceedings of the First IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGRID 2001), pages 640–645, Brisbane, Australia, May 2001.

[18] Antonio Augusto Frohlich, Gilles Pokam Tientcheu, and Wolfgang Schroder-Preikschat. EPOS and Myrinet: Effective Communication Support for Parallel Applications Running on Clusters of Commodity Workstations. In Proceedings of the 8th International Conference on High Performance Computing and Networking, pages 417–426, Amsterdam, The Netherlands, May 2000.

[19] Antonio Augusto Frohlich, Philippe Olivier Alexander Navaux, Sergio Takeo Kofuji, and Wolfgang Schroder-Preikschat. SNOW: a Parallel Programming Environment for Clusters of Workstations. In Proceedings of the 7th German-Brazilian Workshop on Information Technology, Maria Farinha, Brazil, September 2000.



A First Step towards Autonomous Clustered J2EE Applications Management

Slim Ben Atallah (1), Daniel Hagimont (2), Sébastien Jean (1), Noël de Palma (1)
(1) Assistant professor, (2) Senior researcher
INRIA Rhône-Alpes – Sardes project
655 avenue de l'Europe, Montbonnot Saint Martin
38334 Saint Ismier Cedex, France
Tel: 33 4 76 61 52 00, Fax: 33 4 76 61 52 52
[email protected]

ABSTRACT

A J2EE application server is composed of four tiers: a web front-end, a servlet engine, an EJB server and a database. Clusters allow for replication of each tier instance, thus providing an appropriate infrastructure for high availability and scalability. Clustered J2EE application servers are built from clusters of each tier and provide J2EE applications with the transparent view of a single server. However, such applications are complex to administrate and often lack deployment and reconfiguration tools. This paper presents JADE, a Java-based environment for clustered J2EE application deployment. JADE is a first attempt at providing a global environment that allows deploying J2EE applications on clusters. Beyond JADE, we aim to define an infrastructure that allows managing, as autonomously as possible, a wide range of clustered systems at different levels (from operating system to applications).

General Terms: Management, Experimentation.

Keywords: Clustered J2EE Applications, Deployment, Configuration.

1. INTRODUCTION
J2EE-driven architectures are now an increasingly convenient way to build efficient web-based e-commerce applications. Although this multi-tier model, as is, suffers from a lack of scalability, it benefits from clustering techniques that make it possible, by means of replication and consistency mechanisms, to increase application bandwidth and availability.

However, J2EE applications are not easy to manage. Their deployment process (installation and configuration) is as complex as it is tricky, no real execution monitoring mechanism exists, and dynamic reconfiguration remains a goal to achieve. This lack of manageability makes it very difficult to take full advantage of clustering capabilities, e.g., expanding or collapsing replica sets as needed.

This paper presents the first results of an ongoing project that aims to provide system administrators with a management environment that is as automated as possible. Managing a system means being able to deploy, monitor and dynamically reconfigure it. Our first experiments target the deployment (i.e., installation/configuration) of a clustered J2EE application. The contribution in this field is JADE, a Java-based application deployment environment that eases the administrator's job. We show how JADE allows deploying a real benchmark application called RUBiS.

The outline of the rest of this paper is as follows. Section 2 recalls the architecture and life cycle of clustered J2EE applications and shows the limits of existing deployment and configuration tools. Section 3 then presents JADE, a contribution that eases the management of such applications by providing automatic scripting-based deployment and configuration tools. Section 4 resets this work in a wider project that consists of defining a component-based framework for autonomous systems management. Finally, Section 5 concludes and presents future work.

2. ADMINISTRATION OF J2EE CLUSTERS: STATE-OF-THE-ART AND CHALLENGES
This introductory section recalls the architecture and life cycle of clustered J2EE applications before showing the limits of the associated management tools.

2.1 Clustered J2EE Applications and their Lifecycle


J2EE application servers [1], as depicted in Figure 1, are usually composed of four different tiers, running either on a single machine or on up to four:

- A web tier, such as a web server (e.g., Apache [2]), that manages incoming client requests and, depending on whether they relate to static or dynamic content, serves them directly or routes them to the presentation tier using the appropriate protocol (e.g., AJP13 for Tomcat).

- A presentation tier, such as a web container (e.g., Tomcat [3]), that receives the requests forwarded by the web tier, interacts with the business logic tier (using the RMI protocol) to get the related data, and finally generates a web document presenting the results to the end-user.

- A business logic tier, such as an Enterprise JavaBeans server (e.g., JoNAS [4]), that embodies the application logic components (providing them with non-functional properties), which mainly interact with the database storing the application data by sending SQL requests through the JDBC framework.

- A database tier, such as a database management system (e.g., a MySQL server [5]), that manages the application data.

The main motivations for clustering are scalability and fault-tolerance. Scalability is a key issue for web applications that must serve billions of requests a day. Fault-tolerance does not necessarily apply to popular sites, even if it is also required there, but rather to applications where information delivery is critical (commercial web sites, for example). Both scalability and fault-tolerance are offered through replication (combined, for the latter, with consistency management). In the case of J2EE applications, database replication provides the application with service availability when machine failures occur, as well as with efficiency by load-balancing incoming requests between replicas.

The global architecture of clustered J2EE applications is depicted in Figure 2 and detailed below in the case of an {Apache, Tomcat, JoNAS, MySQL} cluster.

Apache clustering is managed through HTTP load-balancing mechanisms that can involve hardware and/or software helpers. We cite below some well-known general-purpose techniques [6] that apply to any kind of web server:

• Level-4 switching, where a high-cost dedicated router can distribute up to 700,000 simultaneous TCP connections over the different servers.

• RR-DNS (Round-Robin DNS), where a DNS server periodically changes the IP address associated with the web site hostname.

• Microsoft's Network Load Balancing or Linux Virtual Server, which use modified TCP/IP stacks allowing a set of hosts to share the same IP address and cooperatively serve requests.

• TCP handoffs, where a front-end server establishes TCP connections and lets a chosen host directly handle the related communication.

Tomcat clustering is achieved using the load-balancing feature of Apache's mod_jk plugin. Each mod_jk instance can be configured to balance requests over all or a subset of the Tomcat instances, according to a weighted round-robin policy, as sketched below.

No common mechanism exists to manage business logic tier replicas, but ad hoc techniques have been defined. For example, JoNAS clustering can be achieved by using a dedicated "cluster" stub in Tomcat, instead of the standard RMI stub, in order to interact with the EJBs. This stub can be seen as a collection stub that manages load balancing, assuming that, whatever the JoNAS instance on which a bean has been created, its reference is bound in all JNDI registries.

Database clustering solutions often remain commercial, like Oracle RAC (Real Application Cluster) or DB2 cluster, and require using a set of homogeneous full replicas. We can however cite C-JDBC [7], an open-source JDBC clustering middleware that allows using heterogeneous partial replicas while providing consistency, caching and load balancing.

The life cycle of a J2EE application consists of three main steps that are detailed below: deployment, monitoring and reconfiguration.

Figure 1. J2EE Applications Architecture.


Deployment. At the deployment step, the tiers must first be installed on hosts and configured so as to be correctly bound to each other; then, the application logic and data can be initialized. Application tiers are often delivered as installable packages (e.g., RPMs), and the configuration is expressed in static configuration files that statically map components to resources.

Monitoring. Once the application has been deployed on the J2EE cluster, one needs to know both the system state and the application state in order to be aware of problems that may arise. The most common issues are due either to hardware faults, such as a node or network link failure, or to inappropriate resource usage, when a node or a tier of the application server becomes a bottleneck.

Reconfiguration. Once a decision has been taken (e.g., extending a J2EE tier to new nodes to handle an increased load), one must be able to perform the appropriate reconfiguration, avoiding as much as possible stopping the associated components.

2.2 Deployment, Monitoring and Reconfiguration Challenges
Currently, no integrated deployment environment exists for clustered J2EE applications. Each tier must be installed manually and independently. Likewise, the whole assembly, including the clustering middleware, must be configured manually, mainly through static configuration files (and there is no configuration consistency verification mechanism either). Consequently, the deployment and configuration process is a complex task to perform.

J2EE cluster monitoring is also only weakly supported. It is obviously possible to observe host load or to use SNMP to track failures, but this is not enough to get pertinent information about the application components. There is no way to monitor an Apache web server and, even though JoNAS offers JMX interfaces showing which applications are running, the cluster administrator cannot gather load evaluations at the application level (but only the amount of memory used by the JVM). Finally, database servers usually do not offer monitoring features, except for a few commercial products.

In terms of reconfiguration, no truly dynamic mechanism is offered. Only the Apache server is able to take configuration file changes into account dynamically; the other tiers need to be stopped and restarted in order to apply low-level modifications.

In this context, in order to alleviate the burden of the application administrator, to take advantage of clustering, and thus to be able to optimize performance and resource consumption, there is a crucial need for a set of tools:

- an automated deployment and configuration tool that allows an entire J2EE application to be deployed and configured easily,

- an efficient application monitoring service that automatically gathers, filters and notifies the events that are pertinent to the administrator,

- a framework for dynamic reconfiguration.

Research work directly related to this topic is provided by the Software Dock [8], a distributed, agent-based framework supporting the entire software deployment life cycle. One major aspect of the Software Dock research is the creation of a standard schema for the deployment process; the current prototype includes an evolving software description schema definition. Abstractly, the Software Dock provides infrastructure for housing software releases and their semantic descriptions at release sites, and infrastructure to deploy or "dock" software releases at consumer sites. Mobile agents interpret the semantic descriptions provided by the release site in order to perform the various software deployment life cycle processes. Initial implementations of generic agents performing configurable content install, update, adapt, reconfigure and remove operations have been created. However, the Software Dock deals neither with J2EE deployment nor with clustered environments.

3. JADE: J2EE APPLICATIONS DEPLOYMENT ENVIRONMENT
In this section, we present JADE, a deployment environment for clustered J2EE applications. We first give an overview of its architecture and then follow with the deployment of a benchmark application called RUBiS.

3.1 Architecture Overview
JADE is a component-based infrastructure that allows the deployment of J2EE applications in a cluster environment. As depicted in Figure 3, JADE is mainly composed of three levels, defined as follows:

Figure 2. Clustered J2EE Applications Architecture.

Page 28: WORKSHOP PROCEEDINGS - Engineering...Thiago Robert C. Santos and Antonio Augusto Frohlich, LISHA, Federal University of Santa Catarina (UFSC), Brazil 10:05-10:35 Session 2: Application

Konsole level

In order to deploy software components, JADE provides a configuration shell language. The language introduces a set of deployment commands, described as follows:

• "start daemon": starts a JADE daemon on a node
• "create": creates a new component manager
• "set": sets a component property
• "install": installs component directories
• "installApp": installs application code and data
• "start": starts a component
• "stop": stops a component.

The use of these configuration commands is illustrated by the RUBiS deployment use case in the Appendix.

The shell commands are interpreted by the command invoker, which builds deployment requests and submits them to the deployment engine. JADE provides a GUI konsole that allows deploying software components on cluster nodes. As shown in Figure 3, each started component is managed through its own GUI konsole. The GUI konsole also allows managing existing configuration shells.

Deployment level

This level comprises the component repository, the deployment engine and the component managers:

- The repository provides access to several software releases (Apache, Tomcat, ...) and to the associated component managers. It provides a set of interfaces for instantiating the deployment engine and the component managers.

- The deployment engine is the software responsible for performing the specific tasks of the deployment process on the cluster nodes. The deployment process is driven by the deployment tools through the interfaces provided by the deployment engine.

- The component manager allows setting component properties at launch time as well as at run time.

Figure 3. JADE Architecture Overview.


Cluster level

The cluster level comprises the components deployed and started on the cluster nodes. At this stage, the deployed components can be managed.

The JADE deployment engine is a component-based infrastructure. It provides the interfaces required to deploy the application on the required nodes, and is composed of a component factory and of a component deployer on each node involved. When a deployment shell runs a script, it begins with the installation of component factories on the required nodes and then interacts with the factories to create the component deployers. The shell can then execute the script by invoking the component deployers.

The component factory exposes an interface to remotely create and destroy component managers. Component deployers are wrappers that encapsulate legacy code and expose an interface that allows installing tiers from the repository onto the local node, configuring the local installation, loading the application from the repository onto the tiers, configuring the application, and starting/stopping the tiers and the application.

The JADE command invoker submits deployment and configuration requests to the deployment engine. Even though the requests are currently implemented as synchronous RMI calls to the deployment engine interface, other connectors (such as a MOM) could easily be plugged in in the future.

A standard deployment script can perform the following actions: install the tiers, configure a tier instance, load the application on the tiers, configure the application, and start the tiers. An example of such a deployment script is given in the Appendix. A standard undeployment script should stop the application and the tiers, and should uninstall all the artefacts previously installed.

3.2 RUBiS deployment scenario
RUBiS [9] provides a real-world example of the need for improved deployment support; this example is used to design a first basic deployment infrastructure. RUBiS is an auction site prototype, modelled after eBay.com, that is used to evaluate application design patterns and application server performance and scalability.

RUBiS offers different implementations of its application logic, which may take various forms: scripting languages such as PHP that execute as a module in a web server such as Apache, Microsoft Active Server Pages that are integrated with Microsoft's IIS server, Java servlets that execute in a separate Java virtual machine, and full application servers such as an Enterprise JavaBeans (EJB) server [22]. This study focuses on the Java servlets implementation.

Since we consider the use case of RUBiS in a cluster environment, we depict a load-balancing scenario. The Appendix presents a configuration involving two Tomcat servers and two MySQL servers. In this configuration, the Apache server is deployed on a node called sci40, the Tomcat servers on nodes sci41 and sci42, and the two MySQL servers on nodes sci43 and sci44.

4. TOWARDS A COMPONENT-BASED INFRASTRUCTURE FOR AUTONOMOUS SYSTEMS MANAGEMENT
The environment presented in the previous section is tailored to J2EE application deployment but, more generally, it can easily be adapted to system management at large.

4.1 Overview of System Management
Managing a computer system can be understood in terms of the construction of system control loops, as stated in control theory [10]. These loops are responsible for the regulation and optimization of the behavior of the managed system. They are typically closed loops, in that the behavior of the managed system is influenced both by operational inputs (provided by clients of the system) and by control inputs (provided by the management system in reaction to observations of the system behavior). Figure 4 depicts a general view of control loops, which can be divided into multi-tier structures including sensors, actuators, notification transport, analysis, decision, and command transport subsystems.

Sensors locally observe relevant state changes and event occurrences. These observations are gathered and transported by the notification transport subsystem to the appropriate observers, i.e., the analyzers. The analysis assesses and diagnoses the current state of the system. The diagnosis information is then exploited by the decision subsystem, which builds appropriate command plans, if necessary, to bring the managed system behavior within the required regime. Finally, the command transport subsystem orchestrates the execution of the commands required by the command plan, while the actuators implement the local commands.

Therefore, building an infrastructure for system management can be understood as providing support for implementing the lowest tiers of a system control loop, namely the sensors/actuators and the notification/command transport subsystems. We consider that such an infrastructure should not be sensitive to how the loop is closed at the top (the analysis and decision tiers), be it by a human being or by a machine (in the case of autonomous systems).

In practice, a control loop structure can merge different tiers, or have trivial implementations for some of them (e.g., a reflex arc that responds in a predefined way to the occurrence of an event). Also, in complex distributed systems, multiple control loops are bound to coexist. For instance, we need to consider horizontal coupling, whereby different control loops at the same level in a system hierarchy cooperate to achieve correlated regulation and optimization of the overall system behavior by controlling separate but interacting subsystems [11]. We also need to consider vertical coupling, whereby several loops participate, at different time granularities and system levels, in the control of a system (e.g., multi-level scheduling).

Figure 4. Overview of a Supervision Loop.

4.2 Beyond JADE
JADE is a first tool; it has to be complemented by others that provide the administration process with monitoring and reconfiguration.

Before this, a prerequisite is a cartography service that builds a comprehensive system model encompassing all the hardware and software resources available in the system. Instead of relying on a "manual" selection of the resources eligible for hosting tiers, the deployment process should dynamically map application components onto available resources by querying the cartography service, which maintains a coherent view of the system state. A component-based model is well suited to represent such a system model: each resource (node, software, ...) can be represented by a component, and composition manifests hierarchical and containment dependencies.

With such an infrastructure, a deployment description no longer needs to bind static resources; it only needs to define the set of required resources. The architecture description might include an exact set of resources or just define a minimal set of constraints to satisfy. The cartography service can then inspect the system representation to find the components that correspond to the resources needed by the application. The deployment process itself consists in inserting the application components into the node components that contain the required resources. Finally, the application components are bound to the resources via bindings, in order to reflect the resource usage in the cartography. The effective deployment of a component is then performed consecutively, as the component model allows some processing to be associated with the insertion or removal of a sub-component.

There is also a need for a monitoring service reporting the current cluster state, so that the appropriate actions can be taken. Such a monitoring infrastructure requires sensors and a notification transport subsystem.

Sensors can be implemented as components and component controllers that are dynamically deployed to reify the state of a particular resource (hardware or software). Some sensors can be generic and interact with resources through common protocols such as SNMP or JMX/RMI, while other probes are specific to a resource (e.g., a processor sensor). Deploying sensors optimally for a given set of observations is an issue: sensors monitoring physical resources may have to be deployed where the resource is located (e.g., to monitor resource usage) or on remote nodes (e.g., to detect node failures). Another direct concern about sensors is their intrusiveness on the system; for instance, the probing frequency must not significantly alter the system behavior. In the case of a J2EE cluster, we have to deal with different legacy software for each tier. Some software, such as web or database servers, does not provide monitoring interfaces, in which case we have to rely on wrapping and on indirect observations using operating system or physical resource sensors. However, J2EE containers usually provide JMX interfaces that offer a way to instrument the application server. Additionally, the application programmer can provide user-level sensors (e.g., in the form of JMX MBeans).

The notification transport is in charge of event and reaction dispatching. Once the appropriate sensors are deployed, they generate notifications to report the state of the resources they monitor. These notifications must be collected and transported to the observers and analyzers that have expressed interest in them. An observer can, for instance, be a monitoring console that displays the state of the system in a human-readable form. Different observers and analyzers may require different properties from the channel used to transport the notifications: an observer in charge of detecting a node failure may require a reliable channel providing a given QoS, while such properties are not required by a simple observer of the CPU load of a node. Therefore, the channels used to transport the notifications should be configured according to the requirements of the concerned observers and analyzers; typically, it should be possible to dynamically add, remove, or configure a channel between sensors and observers/analyzers.

To this effect, we have implemented DREAM (Dynamic REflective Asynchronous Middleware) [12], a Fractal-based framework for building configurable and adaptable communication subsystems, in particular asynchronous ones. DREAM components can be changed at runtime to accommodate new needs such as reconfiguring communication paths, adding reliability or ordering, inserting new filters, and so on. We are currently integrating various mechanisms and protocols into the DREAM framework to implement scalable and adaptable notification channels, drawing on recent results on publish-subscribe routing and epidemic protocols.

5. CONCLUSION AND FUTURE WORK
As the popularity of dynamic-content web sites increases rapidly, there is a need for maintainable, reliable and, above all, scalable platforms to host these sites. Clustered J2EE servers are a common solution used to provide reliability and performance. J2EE clusters may consist of several thousands of nodes; they are large and complex distributed systems, and they are challenging to administer and to deploy. Hence there is a crucial need for tools that ease the administration and deployment of these distributed systems. Our ultimate goal is to provide a reactive management system.

We propose JADE, a framework that eases J2EE application deployment. JADE provides automatic scripting-based deployment and configuration tools for clustered J2EE applications. We experimented with a simple configuration scenario based on a servlet version of an auction site (RUBiS). This experiment gave us the necessary feedback and a basic component for developing a reactive management system, and it shows the feasibility of the approach. JADE is a first tool providing a deployment facility, but it has to be completed to support a full administration process with monitoring and reconfiguration.


We are currently working on several open issues in the implementation of our architecture: the system model and instrumentation for resource deployment, scalability and coordination in the presence of failures in the transport subsystem, and automating the analysis and decision processes for our J2EE use cases. We plan to experiment with JADE in other J2EE scenarios, including EJB (the EJB version of RUBiS). Our deployment service is a basic block for an administration system; it will be integrated into the future system management service.

6. REFERENCES
[1] S. Allamaraju et al. Professional Java Server Programming, J2EE Edition. Wrox Press, ISBN 1-861004-65-6, 2000.

[2] http://www.apache.org

[3] http://jakarta.apache.org/tomcat/index.html

[4] http://jonas.objectweb.org/

[5] http://www.mysql.com/

[6] http://www.onjava.com/pub/a/onjava/2001/09/26/load.html

[7] Emmanuel Cecchet and Julie Marguerite. C-JDBC: Scalability and High Availability of the Database Tier in J2EE Environments. In the 4th ACM/IFIP/USENIX International Middleware Conference (Middleware), poster session, Rio de Janeiro, Brazil, June 2003.

[8] R. S. Hall et al. An Architecture for Post-Development Configuration Management in a Wide-Area Network. In the 1997 International Conference on Distributed Computing Systems.

[9] Emmanuel Cecchet, Anupam Chanda, Sameh Elnikety, Julie Marguerite and Willy Zwaenepoel. Performance Comparison of Middleware Architectures for Generating Dynamic Web Content. In Proceedings of the 4th ACM/USENIX International Middleware Conference (Middleware), Rio de Janeiro, Brazil, June 16-20, 2003.

[10] K. Ogata. Modern Control Engineering, 3rd ed. Prentice-Hall, 1997.

[11] Y. Fu et al. SHARP: An Architecture for Secure Resource Peering. In Proceedings of SOSP'03.

[12] Vivien Quéma, Roland Balter, Luc Bellissard, David Féliot, André Freyssinet and Serge Lacourte. Asynchronous, Hierarchical and Scalable Deployment of Component-Based Applications. In Proceedings of the 2nd International Working Conference on Component Deployment (CD'2004), Edinburgh, Scotland, May 2004.

7. APPENDIX

// start the daemons (i.e., the factories) on the five nodes used below
start daemon sci40
start daemon sci41
start daemon sci42
start daemon sci43
start daemon sci44

// create the managed components:
//     type   name    host
create apache apache1 sci40
create tomcat tomcat1 sci41
create tomcat tomcat2 sci42
create mysql  mysql1  sci43
create mysql  mysql2  sci44

// configure the Apache part
set apache1 DIR_INSTALL /users/hagimont/apache_install
set apache1 DIR_LOCAL /tmp/hagimont_apache_local
set apache1 USER hagimont
set apache1 GROUP sardes
set apache1 SERVER_ADMIN [email protected]
set apache1 PORT 8081
set apache1 HOST_NAME sci40
// bind to tomcat1
set apache1 WORKER tomcat1 8009 sci41 100
// bind to tomcat2
set apache1 WORKER tomcat2 8009 sci42 100
set apache1 JKMOUNT servlet
// configure the two Tomcats
set tomcat1 JAVA_HOME /cluster/java/j2sdk1.4.2_01
set tomcat1 DIR_INSTALL /users/hagimont/tomcat_install
set tomcat1 DIR_LOCAL /tmp/hagimont_tomcat_local
// provide the worker port
set tomcat1 WORKER tomcat1 8009 sci41 100
set tomcat1 AJP13_PORT 8009

set tomcat1 DataSource mysql1


set tomcat2 JAVA_HOME /cluster/java/j2sdk1.4.2_01
set tomcat2 DIR_INSTALL /users/hagimont/tomcat_install
set tomcat2 DIR_LOCAL /tmp/hagimont_tomcat_local
// provide the worker port
set tomcat2 WORKER tomcat2 8009 sci42 100
set tomcat2 AJP13_PORT 8009
set tomcat2 DataSource mysql2
// configure the two MySQL servers
set mysql1 DIR_INSTALL /users/hagimont/mysql_install
set mysql1 DIR_LOCAL /tmp/hagimont_mysql_local
set mysql1 USER root
set mysql1 DIR_INSTALL_DATABASE /tmp/hagimont_database
set mysql2 DIR_INSTALL /users/hagimont/mysql_install
set mysql2 DIR_LOCAL /tmp/hagimont_mysql_local
set mysql2 USER root
set mysql2 DIR_INSTALL_DATABASE /tmp/hagimont_database
// install the components
install tomcat1 {conf, doc, logs, webapps}
install tomcat2 {conf, doc, logs, webapps}
install apache1 {icons, bin, htdocs, cgi-bin, conf, logs}
install mysql1 {}
install mysql2 {}
// load the application part into the middleware
installApp mysql1 /tmp/hagimont_mysql_local ""
installApp mysql2 /tmp/hagimont_mysql_local ""
installApp tomcat1 /users/hagimont/appli/tomcat rubis
installApp tomcat2 /users/hagimont/appli/tomcat rubis
installApp apache1 /users/hagimont/appli/apache Servlet_HTML
// start all the components
start mysql1
start mysql2
start tomcat1
start tomcat2
start apache1


Highly Configurable Operating Systems for Ultrascale Systems ∗

Arthur B. Maccabe and Patrick G. Bridges
Department of Computer Science, MSC01-1130
1 University of New Mexico, Albuquerque, NM 87131-0001
[email protected]@cs.unm.edu

Ron Brightwell and Rolf Riesen
Sandia National Laboratories, PO Box 5800; MS 1110
Albuquerque, NM 87185-1110
[email protected]@cs.sandia.gov

Trammell Hudson
Operating Systems Research, Inc.
1729 Wells Drive NE
Albuquerque, NM 87112
[email protected]

ABSTRACT
Modern ultrascale machines have a diverse range of usage models, programming models, architectures, and shared services that place a wide range of demands on operating and runtime systems. Full-featured operating systems can support a broad range of these requirements, but sacrifice optimal solutions for general ones. Lightweight operating systems, in contrast, can provide optimal solutions at specific design points, but only for a limited set of requirements. In this paper, we present preliminary numbers quantifying the penalty paid by general-purpose operating systems and propose an approach to overcome the limitations of previous designs. The proposed approach focuses on the implementation and composition of fine-grained composable micro-services, portions of operating and runtime system functionality that can be combined based on the needs of the hardware and software. We also motivate our approach by presenting concrete examples of the changing demands placed on operating systems and runtimes in ultrascale environments.

1. INTRODUCTION
Due largely to the ASCI program within the United States Department of Energy, we have recently seen the deployment of several production-level terascale computing systems. These systems, for example ASCI Red, ASCI Blue Mountain, and ASCI White, include a variety of hardware architectures and node configurations. In addition to differing hardware approaches, a range of usage models (e.g., dedicated vs. space-shared vs. time-shared) and programming models (e.g., message-passing vs. shared-memory vs. global shared address space) have also been used for programming these systems.

∗This work was supported in part by Sandia National Laboratories. Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under contract DE-AC04-94AL85000.

In spite of these differences and other evolving demands, operating and runtime systems are expected to keep pace. Full-featured operating systems can support a broad range of these requirements, but sacrifice optimal solutions for general ones. Lightweight operating systems, in contrast, can provide optimal solutions at specific design points, but only for a limited set of requirements.

In this paper, we present an approach that overcomes the limitations of previous approaches by providing a framework for configuring operating and runtime systems tailored to the specific needs of the application and environment. Our approach focuses on the implementation and composition of micro-services, portions of operating and runtime system functionality that can be composed together in a variety of ways. By choosing appropriate micro-services, runtime and operating system functionality can be customized at build time or runtime to the specific needs of the hardware, system usage model, programming model, and application.
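To make the composition idea concrete, here is an illustrative sketch (in Java for brevity; the authors' framework is not specified in this form, and all interface and service names are ours) of selecting and composing micro-services at configuration time:

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of composing OS/runtime "micro-services":
// small units of functionality selected to match the application,
// usage model, and hardware.
interface MicroService {
    String name();
    void start();
}

class PortalsMessaging implements MicroService {
    public String name() { return "portals-messaging"; }
    public void start() { /* set up message transmission */ }
}

class CheckpointService implements MicroService {
    public String name() { return "checkpoint"; }
    public void start() { /* register quiescence hooks */ }
}

public class NodeRuntime {
    private final List<MicroService> services = new ArrayList<MicroService>();

    // Composition point: invoked at build time or runtime.
    public void compose(MicroService s) { services.add(s); }

    public void boot() {
        for (MicroService s : services) {
            System.out.println("starting " + s.name());
            s.start();
        }
    }

    public static void main(String[] args) {
        NodeRuntime rt = new NodeRuntime();
        rt.compose(new PortalsMessaging());   // a message-passing app needs this
        rt.compose(new CheckpointService());  // the environment requires checkpointing
        rt.boot();
    }
}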

The rest of this paper is organized as follows: Section 2 describes the motivation for our proposed system, including the hardware and software architectures of current terascale computing systems and the challenges faced by operating systems on these machines, and presents preliminary numbers and experiences to outline the scale of this problem. It also presents several motivating examples that are driving our design efforts. Section 3 describes the specific challenges faced by operating systems in ultrascale environments, and Section 4 presents our approach to addressing these challenges. Section 5 describes related operating system work, and Section 6 concludes.

2. MOTIVATION
2.1 Current and Future System Demands
Modern ultrascale systems, for example the various ASCI machines and the Earth Simulator, have widely varying system-level and node-level hardware architectures. The first terascale system, ASCI Red, is a traditional distributed-memory massively parallel processing machine: thousands of nodes, each with a small number of processors (2). In contrast, the ASCI Blue Mountain machine was composed of 128-processor nodes, while ASCI White employs 16-way SMP nodes. We also expect additional hardware advances such as multi-core chips and processor-in-memory chips to be available in similar systems in the near future.

In addition to the hardware, the approach from a programming model standpoint has varied as well. The lightweight compute node operating system on ASCI Red does not support a shared-memory programming model on individual compute nodes, while the other platforms support a variety of shared-memory programming constructs, such as threads and semaphores. This has led to the development of mixed-mode applications that combine MPI and OpenMP (or pthreads) to fully utilize the capabilities of systems with large numbers of processors per node. Applications have also been developed for these platforms that extend the boundaries of a traditional programming model; the distributed implementation of the Python scripting language is one such example [14]. Advanced programming models, such as the Global Address Space model, are also gaining support within the parallel computing community.

Even within the context of a specific programming model such as MPI, applications can have wide variations in the number and type of system services they require, and can also have varying requirements for the environment in which they run. For example, the Common Component Architecture assists in the development of MPI applications, but it requires dynamic library services to be available to the individual processes within the parallel job. Environmental services, such as system-level checkpoint/restart, are also becoming an expected part of the standard parallel application development environment.

The usage model of these large machines has also expanded. The utility of capacity computing, largely driven by the ubiquity of commodity clusters, has led to changes in the way in which large machines are partitioned and scheduled. Machines that were originally intended to run a single, large parallel simulation are being used more frequently for parameter studies that require thousands of small jobs.

2.2 Problems with Current Approaches
General-purpose operating systems such as Linux provide a wide range of services. These services and their associated kernel structures enable sophisticated applications with capabilities for visualization and inter-networking. This generality unfortunately comes at the cost of performance for all applications that use the operating system, because of the overheads of unnecessary services.

In an initial attempt to measure this performance difference, we compared the performance of the mg and cg NAS benchmarks (class B) on ASCI Red hardware [21] when running two different operating systems. We use Cougar, the productized version of the Puma operating system [26], as the specialized operating system, and Linux as the general-purpose operating system. To make the comparison as fair to Linux as possible, we ported the Cplant version of the Portals high-performance messaging layer [1] to the ASCI Red hardware; Cougar already utilizes Portals for message transmission.

Figure 1: CG Performance on Linux and Cougar on ASCI Red Hardware (millions of operations per second vs. number of processors)

Figure 2: MG Performance on Linux and Cougar on ASCI Red Hardware (millions of operations per second vs. number of processors)

Figures 1 and 2 show the performance of these benchmarks when running on the two operating systems. Linux outperforms Cougar on the cg benchmark with small numbers of nodes because Cougar uses older, less optimized compilers and libraries, but as the number of nodes increases, application performance on Linux falls off. Similar effects occur on the mg benchmark, though mg on Cougar outperforms mg on Linux even on small numbers of nodes, despite the older compilers and libraries. A variety of overheads cause Linux's performance problems on larger-scale systems, including the lack of a contiguous memory layout and the associated TLB overheads, and suboptimal node allocations due to limitations with Linux job launch on ASCI Red.

Such operating system problems have also been seen in other systems. Researchers at Los Alamos, for example, have shown that excess services can cause dramatic performance degradations [17]. Similarly, researchers at Lawrence Livermore National Laboratory have shown that operating system scheduling problems can have a large impact on application performance in large machines [13].

2.3 Motivating Examples
The changing nature of demands on large-scale systems presents some of the largest challenges to operating system design in this environment. We consider changing demands in several areas, along with specific examples from each area, to motivate our work.

2.3.1 Changing Usage Models.
As large-scale systems age, they frequently transition from specialized capability-oriented usage for a handful of applications to capacity usage for a wide range of applications. Operating systems for capability-oriented systems often provide a restricted usage model (dedicated or space-shared mode) and need to provide only minimal services, allowing more operating system optimizations. Operating systems for capacity-oriented systems, in contrast, generally support much more flexible usage models, such as timesharing, and must provide additional services including TCP/IP inter-networking and dynamic process creation.

2.3.2 Changing Application Demands.
Applications have varying demands for similar operating system services depending on their needs, and correctly customizing these services can have a large impact on application performance. As a concrete example, consider four different ways for a signal to be delivered to an application indicating the receipt of a network packet:

• Immediate delivery using interrupts (e.g., UNIX signals) for real-time or event-driven applications.

• Coalescing of multiple signals, waiting until some other activity (e.g., an explicit poll or quantum expiration) causes an entry into the kernel, thereby minimizing signal-handling overhead.

• Extending the kernel with application-specific handler code for performance-critical signals.

• Forking a new process to handle each new signal/packet (e.g., inetd in UNIX).

2.3.3 Changing Hardware Architecture.
Operating system structure can present barriers to hardware innovation for ultrascale systems. Operating systems must be customized to present novel architectural features to applications and to make effective use of new hardware features themselves. Existing operating systems such as Linux assume that each machine is similar to a standard architecture, the Intel x86 architecture in the case of Linux, and in doing so limit their ability to expose innovative architectural features to the application or to use such features to optimize operating system performance. The inability of current operating systems to do so presents a significant impediment to hardware innovation.

Consider, for example, operating system support for parcel-based processor-in-memory (PIM) systems [22]. Operating systems for such architectures must be flexible enough to perform scheduling and resource allocation on these architectures and to make effective use of this hardware for their own purposes. We specifically consider the use of a PIM as a dedicated OS file cache that makes its own prefetching, replacement, and I/O coalescing decisions. Processes that access files would send parcels to this PIM, which could immediately satisfy them from a local cache, coalesce small writes together before sending the request on to the main I/O system, or aggressively prefetch data based on observed access patterns. Doing such work in a dedicated PIM built for handling latency-sensitive operations would free the system's heavyweight (e.g., vector) processors from having to perform the latency-oriented services common in operating systems.

2.3.4 Changing Environmental Services

Finally, consider the variety of shared environmental services that operating systems must support, such as file systems and checkpointing functionality. New implementations of these services are continually being developed, and these implementations require changing operating system support. As just one example, the Lustre file system [2] is currently being developed to replace NFS in ultrascale systems. Lustre requires a specific message-passing layer from the operating system (i.e., Portals), in contrast to the general networking necessary to support NFS, but in return provides much better performance and scalability. Similarly, checkpointing services require a means to determine operating system state and network quiescence. Finally, these services are often implemented at user level in lightweight operating systems; in these cases, the operating system must provide a way to authenticate trusted shared services to applications and other system nodes.

3. CHALLENGES

The processing resources for ultrascale systems will likely be partitioned based on functional needs [7]. These partitions will most likely include: a service partition providing general services, including application launch and compilation; an I/O partition providing shared file systems; a network partition providing communication with other systems; and a compute partition providing the primary computational resources for an application. In this work, we are primarily interested in the operating system used in the compute partition, the compute node operating system.

Like any operating system, the compute node operating system provides a bridge between the application and the architecture used to run the application. That is, the operating system presents the application with abstractions of the resources provided by the computing system. The form of these abstractions will depend on the nature of the physical resources and the way in which these resources are used by the application.

The compute node operating system will also arbitrate access to shared resources on the compute nodes, resolving conflicts in the use of these resources as needed. The need for this mediation will depend on the way in which the compute nodes are used – the system usage model. It may also need to provide special services (e.g., authentication) to support access to shared services (e.g., file systems or network services) that reside in other partitions.

Figure 3 presents a graphical interpretation of the five primary factors that influence the design of compute node operating systems for ultrascale computing systems. We include history in addition to the four factors identified in the preceding paragraphs: application needs, system usage models, architectural models (both system-level and node-level architectures), and shared services.


Figure 3: Factors Influencing the Design of Operating Systems (application, architecture, system usage, shared services, and history)

3.1 History

Every operating system has a history, and this history may impact the feasibility of using the OS in new contexts. For example, as a Unix-like operating system, Linux assumes that all OS requests come from processes running on the local system. As the network has become a source of requests, Unix systems have adopted a daemon approach to handle them: a daemon listens for incoming requests and passes them to the operating system. In this context, inetd is a particularly interesting example. Inetd listens for connection requests. When it receives a new connection request, inetd examines the request and, based on the request, creates a new process to handle it. That is, the request is passed through the operating system to inetd, which calls the operating system to create a process to handle the request. While it might make more sense to modify Unix to handle network requests directly, this would represent a substantial overhaul of the basic Unix request model. A minimal sketch of the inetd pattern follows.
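For concreteness, the accept-and-fork idiom described above can be sketched in a few lines of C. This is our illustration of the general pattern, not inetd's actual source; the port and the handler program are placeholders:

    /* Sketch: inetd-style daemon, one process per connection (POSIX). */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int ls = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof addr);
        addr.sin_family      = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port        = htons(7777);      /* placeholder port */

        if (ls < 0 || bind(ls, (struct sockaddr *)&addr, sizeof addr) < 0 ||
            listen(ls, 16) < 0) {
            perror("socket/bind/listen");
            return 1;
        }
        signal(SIGCHLD, SIG_IGN);                /* children reaped automatically */

        for (;;) {
            /* The request arrives via the operating system... */
            int conn = accept(ls, NULL, NULL);
            if (conn < 0)
                continue;
            /* ...and the daemon calls back into the OS for a new process. */
            if (fork() == 0) {
                dup2(conn, STDIN_FILENO);
                dup2(conn, STDOUT_FILENO);
                close(ls);
                execl("/bin/cat", "cat", (char *)NULL); /* stand-in service */
                _exit(1);
            }
            close(conn);                         /* parent resumes listening */
        }
    }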

3.2 Application Needs

Applications present challenges at two levels. First, applications are developed in the context of a particular programming model. Programming models typically require a basic set of services; for example, in the explicit message-passing model, it is necessary to allow data to be moved efficiently between local memory and the network. Second, applications themselves may require extended functionality beyond the minimal set needed to support the programming model. For example, an application developed using a component architecture may require system services to enable the use of dynamic libraries.

While lightweight operating systems have been shown to support the development of scalable applications, this approach places an undue burden on the application developer. Given any feature typically associated with modern operating systems (e.g., Unix sockets), there is at least one application that could benefit from having the feature readily available. In the lightweight operating system approach, the application developer is required to either implement the feature or do without. In fact, this is the reason that many of the terascale operating systems today are based on full-featured operating systems. The real challenge is to provide features needed by a majority of applications without adversely affecting the performance and scalability of other applications that do not use these features.

Advanced programming models strive to provide a high-level abstraction of the resources provided by the computing system. Describing computations in terms of abstract resources enhances portability and can reduce the amount of effort needed to develop an application. While high-level abstractions offer significant benefits, application developers frequently need to bypass the implementations of these abstractions for the small parts of the code that are time critical. For example, while the vast majority of the code in an application may be written in a high-level language (e.g., FORTRAN or C), it is not uncommon for application developers to write a core calculation, such as a BLAS routine, in assembly language to ensure an optimal implementation. The crucial point is that the abstractions implemented to support advanced programming methodologies must allow application developers to drop through the layers of abstraction as needed to ensure adequate performance. Because we are interested in supporting resource-constrained applications, providing variable layers of abstraction is especially important.

Finally, because the development of new programming models is an ongoing activity, the operating and runtime system must be designed so that it is relatively easy to develop high-performance implementations of the features needed to support a variety of existing programming models as well as new models that may be developed.

3.3 System Usage Models

The system usage model defines the places where the principal computational resources can be shared by different users. Example usage models include: dedicated systems, in which these resources are not shared; batch dedicated systems, in which the resources are not shared while the system is being used but may be used by different users at different times; space-shared systems, in which parts of the system (e.g., compute nodes) are not shared, but multiple users may be using different parts of the system at the same time; and time-shared systems, in which the resources are used by multiple users at the same time.

Sharing requires that the operating system take on the role of arbiter, ensuring that all parties are given the appropriate degree of access to the shared resources – in terms of time, space, and privilege. The example usage models presented earlier are listed in roughly the order of operating system complexity needed to arbitrate the sharing: dedicated systems require almost no support for arbitrating access to resources, batch dedicated systems require that the usage schedule be enforced, space sharing requires that applications running on different parts of the system not be able to interfere with one another, and timesharing requires constant arbitration of the resources. In considering system usage models, the challenge is to provide mechanisms that can support a wide variety of sharing policies while ensuring that these mechanisms do not have any adverse impact on performance when they are not needed.

3.4 Architectures

Architectural models present challenges at two levels: the node level and the overall system level. An individual compute node may exhibit a wide variety of architectural features, including multiple processors, support for PIM, multiple network interfaces, programmable network interfaces, access to local storage, etc. The key challenge presented by different architectures is the need to build abstractions of the physical resources that match the resource abstractions defined by the programming model. If this is not done efficiently, it could easily inhibit application scaling.

Beyond the need to provide abstractions of physical resources, variations in system-level architectures may require different levels of operating system functionality on the compute nodes. In most cases, specialized hardware (e.g., PIMs) will require specialized OS functionality. However, hardware features may also simplify OS functionality. As an example, Blue Gene/L supports the partitioning of the high-speed communication network: compute nodes in different partitions cannot communicate with one another using the high-speed network. If the partitions correspond to different applications in a space-shared usage model, there is no need for the OS to arbitrate access to the high-speed network. As this example illustrates, the interactions between architecture and usage models may not be trivial.

3.5 Shared Services

Finally, applications will need access to shared services, e.g., file systems. Unlike the resource sharing that is arbitrated by the node operating system, access to the shared resources (e.g., disk drives) provided by a shared server is arbitrated by the server. In addition to arbitration, these servers may also require support for authentication and, using this authentication, provide access control to the logical resources (e.g., files) that they provide.

Here, the challenge is to provide the support required by the shared service. In some cases, this support may be negligible. In other cases, the server may require that it be able to reliably determine the source of a message. In still other cases, the shared server may rely on the operating system to maintain user credentials in a secure fashion while an application is running, so that these credentials can be trusted by the shared file system.

4. APPROACH

In the context of the challenges described in the previous section, a "lightweight operating system" reflects a minimal set of services that meets the requirements presented by a small set of applications, a single usage model, a single architecture, and a single set of shared services. The Puma operating system [27], for example, represents a lightweight operating system with the following bindings: application needs are limited to MPI and access to a shared file system, the system usage model is space sharing, the system architecture consists of thousands of simple compute nodes connected by a high-performance network, and the shared services include a parallel file system that relies on Cougar to protect user identification.

Our goal is to develop a framework for building operating and runtime systems that are tailored to the specific requirements presented by an application, the system usage model, the system architecture, and the shared services. Our approach is to build a collection of micro-services and tools that support the automatic construction of a lightweight operating system for a specific set of circumstances.

4.1 Micro-Services

At a minimum, each application will need micro-services for managing the primary resources: memory, processor, communication, and file system. We can imagine several implementations for each of these micro-services. One memory allocation service might perform simple contiguous allocation; another might map physical page frames to arbitrary locations in the logical address space of a process; another might provide demand page replacement; yet another might provide predictive page replacement. A processor management service may simply run a single process whenever a processor is available, while another might include thread scheduling.

There may be dependencies and incompatibilities among the micro-services. As an example, a communication micro-service that assumes that logically contiguous addresses are physically contiguous (thus reducing the size of a memory descriptor) would depend on a memory allocation service that provides this type of address mapping. There will also be dependencies between micro-services and system usage models. For example, a communication service that provides direct access to a network interface would not be compatible with a usage model that supports time sharing on a node. The sketch below illustrates how such implementation properties and dependency checks might be expressed.
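As an illustration only, consider the following hypothetical sketch of interchangeable memory micro-services with declared properties. The interface and property flags are our invention; the paper proposes the idea, not this API. A bump allocator stands in for a contiguous allocation service, and the general-purpose allocator stands in for a demand-paged one:

    /* Sketch: swappable memory micro-services with a dependency check. */
    #include <stdio.h>
    #include <stdlib.h>

    /* Properties an implementation declares, so composition tools can
       check the kind of dependency described above. */
    enum mem_props {
        MEM_PHYS_CONTIGUOUS = 1 << 0,  /* logical contiguity == physical */
        MEM_DEMAND_PAGED    = 1 << 1
    };

    struct mem_service {
        const char *name;
        unsigned    props;
        void      *(*alloc)(size_t bytes);
        void       (*release)(void *p);
    };

    /* Implementation 1: trivial contiguous bump allocator over an arena. */
    static unsigned char arena[1 << 16];
    static size_t arena_top;
    static void *bump_alloc(size_t n)
    {
        if (arena_top + n > sizeof arena)
            return NULL;
        void *p = arena + arena_top;
        arena_top += n;
        return p;
    }
    static void bump_release(void *p) { (void)p; /* arena reset as a whole */ }

    /* Implementation 2: general allocator, standing in for a paged one. */
    static void *gp_alloc(size_t n) { return malloc(n); }
    static void gp_release(void *p) { free(p); }

    static const struct mem_service services[] = {
        { "contiguous", MEM_PHYS_CONTIGUOUS, bump_alloc, bump_release },
        { "general",    MEM_DEMAND_PAGED,    gp_alloc,   gp_release   }
    };

    /* Composition check: pick the first allocator meeting a dependent
       micro-service's declared requirements. */
    static const struct mem_service *pick_allocator(unsigned required)
    {
        for (size_t i = 0; i < sizeof services / sizeof services[0]; i++)
            if ((services[i].props & required) == required)
                return &services[i];
        return NULL;  /* incompatible composition: reject at build time */
    }

    int main(void)
    {
        /* A communication micro-service that shrinks its memory descriptors
           would declare MEM_PHYS_CONTIGUOUS as a hard requirement. */
        const struct mem_service *m = pick_allocator(MEM_PHYS_CONTIGUOUS);
        printf("selected allocator: %s\n", m ? m->name : "(none)");
        return 0;
    }

A composition tool could apply exactly this kind of property check when wiring a communication micro-service to a memory micro-service.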

In addition to micro-services that provide access to primary resources, there will be higher-level services layered on top of the basic micro-services. As an example, one micro-service might provide garbage-collected dynamic allocation, while another might provide first-fit, explicit allocation and de-allocation (malloc and free) for dynamic memory management. Other examples include an RDMA service or a two-sided message service layered on top of a basic communication service.

Finally, we will need "glue" services: micro-services that enable combinations of other services. As an example, consider a usage model that supports general time-sharing among the applications on a node. Further, suppose that one of the applications to be run on the node requires a memory allocator that supports demand page replacement while another application requires a simple contiguous memory allocator. A memory compactor service would make it possible to run both applications on the same node.

4.2 Signal Delivery Example

To illustrate how our micro-services approach can be used to address the challenges presented by ultrascale systems, we consider the signal delivery example presented toward the end of Section 2. Because signal delivery may not be needed by all applications, micro-services associated with signal delivery would be optional and, as such, would have no performance impact on applications that do not need signal delivery.

For applications that do require signal delivery, we would need a collection of "signal detector" micro-services that are capable of observing the events of interest to the application (e.g., the reception of a message). These micro-services would most likely run as part of the operating system kernel. To ensure that they are run with sufficient frequency, the signal detector micro-services may place requirements on the micro-service used to schedule the processor.

The signal detector micro-services would then be tied to one of several specialized "signal delivery" micro-services.


The specific signal delivery micro-service used will depend on the needs of the application. An immediate delivery service would modify the control block for the target process so that the signal handler for the process is run the next time the process is scheduled for execution. A coalescing signal delivery service would simply record the relevant information and make it available to another micro-service that responds to explicit polling operations in the application. A user-defined signal delivery service could take a user-defined action whenever an event is detected. Finally, a message delivery service could convert the signal information into data and pass it to the micro-service that is responsible for delivering messages to application processes. The runtime level could then include a micro-service that would read these messages and fork the appropriate process. A sketch of this detector/delivery split follows.
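The detector/delivery split might be expressed as follows. This is a hypothetical user-level sketch with invented names; in the proposed system the detector would run in the kernel, and the binding would be fixed at composition time rather than reassigned at run time:

    /* Sketch: one detector bound to a pluggable delivery micro-service. */
    #include <stdio.h>

    struct event { int kind; };                 /* e.g., message arrival */

    typedef void (*delivery_fn)(const struct event *);

    /* One delivery micro-service per style from Section 2.3.2. */
    static void deliver_immediate(const struct event *e)
    {
        /* A real service would mark the process control block so the
           signal handler runs when the process is next scheduled. */
        printf("immediate: run handler for event %d\n", e->kind);
    }

    static int coalesced;                       /* drained at an explicit poll */
    static void deliver_coalesce(const struct event *e)
    {
        (void)e;
        coalesced++;
    }

    static void deliver_as_message(const struct event *e)
    {
        printf("message: convert event %d for the message service\n", e->kind);
    }

    /* The detector is bound to exactly one delivery service; applications
       that need no signals simply get no detector at all. */
    static delivery_fn delivery = deliver_coalesce;

    static void detector_on_packet(void)        /* the "signal detector" */
    {
        struct event e = { 1 };
        delivery(&e);
    }

    int main(void)
    {
        detector_on_packet();
        detector_on_packet();
        printf("poll point drains %d coalesced event(s)\n", coalesced);

        delivery = deliver_immediate;           /* a different application binding */
        detector_on_packet();
        delivery = deliver_as_message;          /* e.g., for a forking runtime */
        detector_on_packet();
        return 0;
    }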

4.3 Tools

We cannot burden application programmers with all of the micro-services that provide the runtime environment for their applications. Application programmers should only be concerned with the highest-level services that they need (e.g., MPI) and the general goals for lower-level services. We envision the need to develop a small collection of tools to analyze application needs and to combine and analyze combinations of micro-services. Figure 4 presents our current thinking regarding the required tools.

Figure 4: Building an Application-Specific OS/Runtime (an application analysis tool extracts application requirements and shared-resource requirements from the application; the OS/Runtime constructor combines these with the system usage model, the architecture, the available shared resources, and the micro-services to produce the OS/Runtime)

As shown in Figure 4, the tool chain takes several inputs and produces an application-specific OS/Runtime system. If the system usage model is timesharing, this OS/Runtime will be merged with the OS/Runtimes needed by other applications that share the computing resources (this merging will most likely be done on a node-by-node basis). For other usage models, the resulting OS/Runtime will be loaded with the application when the application is launched.

The application analysis tool extracts the application-specific requirements from an application. This tool will need to be cognizant of potential programming models and the optional features of these programming models. In addition, this tool will need to match the application's needs for shared services to the shared services that are available. The application analysis tool will produce two intermediate outputs: the application requirements and the requirements associated with the shared resources that are used by the application.

In a second step, these intermediate outputs will be combined with a specification of the system usage model, a specification of the underlying architecture, and the collection of micro-services to build an OS/Runtime that is tailored to the specific needs of the application. Here, we envision a tool that will take as input the set of top-level services used by an application and produce a directed graph of the permissible lower-level services for the required runtime environment. Nodes of this graph will be weighted by the degree to which the micro-service represented by the node meets the goals of the application developer. We plan to base some of our work on tools for composing micro-services on existing tools, such as the Knit composition tool developed at the University of Utah in the context of the Flux project [19]. Other tools will be needed to select particular services in the context of a system usage model. These tools will also need to ensure that the services selected meet the sharing requirements of the system. A flattened sketch of the weighted selection step appears below.
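Ignoring graph edges and inter-service dependencies, the weighted selection step might look like the following sketch; the services, implementations, and weights are invented for illustration:

    /* Sketch: pick the best-weighted implementation per required service. */
    #include <stdio.h>
    #include <string.h>

    struct candidate {
        const char *service;   /* what it provides, e.g., "memory" */
        const char *impl;      /* which micro-service implements it */
        double      weight;    /* how well it meets the developer's goals */
    };

    /* A flattened stand-in for the directed graph of permissible services. */
    static const struct candidate graph[] = {
        { "memory",    "contiguous",   0.9 },
        { "memory",    "demand-paged", 0.4 },
        { "transport", "rdma",         0.8 },
        { "transport", "two-sided",    0.6 }
    };

    static const struct candidate *best(const char *service)
    {
        const struct candidate *pick = NULL;
        for (size_t i = 0; i < sizeof graph / sizeof graph[0]; i++)
            if (strcmp(graph[i].service, service) == 0 &&
                (pick == NULL || graph[i].weight > pick->weight))
                pick = &graph[i];
        return pick;
    }

    int main(void)
    {
        /* The required services would come from the application analysis tool. */
        const char *needed[] = { "memory", "transport" };
        for (size_t i = 0; i < sizeof needed / sizeof needed[0]; i++) {
            const struct candidate *c = best(needed[i]);
            printf("%s -> %s (weight %.1f)\n", needed[i], c->impl, c->weight);
        }
        return 0;
    }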

5. RELATED WORK

A number of other configurable operating systems have been designed, including microkernel systems, library operating systems, extensible operating systems, and component-based operating systems. In addition, configurability has been designed into a variety of different runtime systems and system software subsystems, including middleware for distributed computing, network protocol stacks, and file systems.

5.1 Configurable Operating Systems

Most standard operating systems such as Linux include a limited amount of configuration that can be used to add or remove subsystems and device drivers from the kernel. However, this configurability does not generally extend to core operating system functions, such as the scheduler or virtual memory system. In addition, the configuration available in many subsystems, such as the network stack and the file system, is coarse-grained and limited; entire networking stacks and file systems can be added or removed, but these subsystems cannot generally be composed and configured at a much finer granularity. In Linux, for example, the entire TCP/IP or Bluetooth stack can be optionally included in the kernel, but more fine-grained information about exactly which protocols will be used cannot easily be used to customize system configuration.

Other operating systems allow more fine-grained configuration. Component-based operating systems such as the Flux OSKit [5], Scout [15], Think [4], eCos [18], and TinyOS [10] allow kernels to be built from a set of composable modules. Scout, for example, is built from a set of routers that can be composed together into custom kernels. The THINK framework is very similar to the framework we propose here. The primary differences are that we expect to build operating systems that are far more tailored to the needs of specific applications, and we do not expect to do much in the way of dynamic binding of services. eCos and TinyOS provide similar functionality in the context of embedded systems and sensor networks, respectively. The Flux OSKit provided a foundation for component-based OS development based on code from the Linux and BSD kernels, focusing particularly on allowing device drivers from these systems to be used in developing new kernels. Unlike our proposal, however, none of these systems has concentrated on customizing system functionality at the fine granularity necessary to take full advantage of new hardware environments or to optimize for the different usage models of ultrascale systems.

Microkernel and library operating systems such as L4 [8], exokernels [3], and Pebble [6] allow operating system semantics to be customized at compile time, boot time, or run time by changing the server or library that provides a service, though this composability is even more coarse-grained than in the systems described above. Such flexibility generally comes at a price, however; these operating systems may have to use more system calls and up-calls to implement a given service than a monolithic operating system, resulting in higher overheads. It can also result in a loss of cross-subsystem optimization opportunities. In contrast, our approach seeks to decompose functionality using more fine-grained structures and to preserve cross-subsystem optimization opportunities through tools designed explicitly for composing system functionality.

5.2 Configurable Runtimes and Subsystems

A variety of different systems have also been built that enable fine-grained configuration of system services, generally in the realm of protocol stacks and file systems. In contrast to our approach, none of these systems seeks to use configuration pervasively across an entire operating system.

Coarse-grained configuration of network protocol stacks has been explored in System V STREAMS [20], the x-kernel [12], and CORDS [23]. Composition in these systems is layer-based, with each component defining one protocol layer. Similar approaches have been used for building stackable file systems [9, 28].

More fine-grained composition of protocol semantics has been explored in the context of Cactus [11], Horus [24], Ensemble [25], and Rwanda [16]. Cactus's event-based composition model, in particular, has influenced our approach; in fact, we are using portions of the Cactus event framework to implement our system. To date, the Cactus project has focused primarily on using event-based composition in network protocols, not the more general operating system structures described in this paper.

6. CONCLUSIONS

In this paper, we have presented an argument for a framework for customizing an operating system and runtime environment for parallel computing. Based on the results of preliminary experiments, we conclude that the demands of current and future ultrascale systems cannot be addressed by a general-purpose operating system if high levels of performance and scalability are to be achieved and maintained. The current methods of using specialized lightweight approaches and generalized heavyweight approaches will not be sufficient given the challenges presented by current and future hardware platforms, programming models, usage models, and application requirements. To address this problem, we presented a design for a framework that uses micro-services and supporting tools to construct an operating system and associated runtime environment for a specific set of requirements. This approach minimizes the overhead of unneeded features, allows for carefully tailored implementations of required features, and enables the construction of new operating and runtime systems that adapt to evolving demands and requirements.

7. REFERENCES

[1] R. Brightwell, T. Hudson, R. Riesen, and A. B. Maccabe. The Portals 3.0 message passing interface. Technical Report SAND99-2959, Sandia National Laboratories, December 1999.

[2] Cluster File Systems, Inc. Lustre: A Scalable, High-Performance File System, November 2002. http://www.lustre.org/docs/whitepaper.pdf.

[3] D. Engler, M. Kaashoek, and J. O'Toole. Exokernel: An operating system architecture for application-level resource management. In Proceedings of the 15th ACM Symposium on Operating Systems Principles, pages 251–266, Copper Mountain Resort, CO, 1995.

[4] J.-P. Fassino, J.-B. Stefani, J. Lawall, and G. Muller. THINK: A software framework for component-based operating system kernels. In Proceedings of the 2002 USENIX Annual Technical Conference, June 2002.

[5] B. Ford, G. Back, G. Benson, J. Lepreau, A. Lin, and O. Shivers. The Flux OSKit: A substrate for kernel and language research. In Proceedings of the 16th ACM Symposium on Operating Systems Principles, pages 38–51, Saint-Malo, France, 1997.

[6] E. Gabber, C. Small, J. Bruno, J. Brustoloni, and A. Silberschatz. The Pebble component-based operating system. In Proceedings of the 1999 USENIX Annual Technical Conference, pages 267–282, Monterey, CA, 1999.

[7] D. S. Greenberg, R. Brightwell, L. A. Fisk, A. B. Maccabe, and R. Riesen. A system software architecture for high-end computing. In Proceedings of SC'97: High Performance Networking and Computing, San Jose, CA, November 1997. ACM Press and IEEE Computer Society Press.

[8] H. Härtig, M. Hohmuth, J. Liedtke, S. Schönberg, and J. Wolter. The performance of µ-kernel-based systems. In Proceedings of the 16th ACM Symposium on Operating Systems Principles, 1997.

[9] J. Heidemann and G. Popek. File-system development with stackable layers. ACM Transactions on Computer Systems, 12(1):58–89, 1994.

[10] J. Hill, R. Szewczyk, A. Woo, S. Hollar, D. E. Culler, and K. S. J. Pister. System architecture directions for networked sensors. In Architectural Support for Programming Languages and Operating Systems, pages 93–104, 2000.

[11] M. A. Hiltunen, R. D. Schlichting, X. Han, M. Cardozo, and R. Das. Real-time dependable channels: Customizing QoS attributes for distributed systems. IEEE Transactions on Parallel and Distributed Systems, 10(6):600–612, 1999.

[12] N. Hutchinson and L. L. Peterson. The x-kernel: An architecture for implementing network protocols. IEEE Transactions on Software Engineering, 17(1):64–76, 1991.

[13] T. Jones, W. Tuel, L. Brenner, J. Fier, P. Caffrey, S. Dawson, R. Neely, R. Blackmore, B. Maskell, P. Tomlinson, and M. Roberts. Improving the scalability of parallel jobs by adding parallel awareness to the operating system. In Proceedings of SC'03, 2003.

[14] P. Miller. Parallel, distributed scripting with Python. In Third Linux Clusters Institute Conference, October 2002.

[15] D. Mosberger and L. L. Peterson. Making paths explicit in the Scout operating system. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 153–168, 1996.

[16] G. Parr and K. Curran. A paradigm shift in the distribution of multimedia. Communications of the ACM, 43(6):103–109, 2000.

[17] F. Petrini, D. Kerbyson, and S. Pakin. The case of the missing supercomputer performance: Achieving optimal performance on the 8,192 processors of ASCI Q. In Proceedings of SC'03, 2003.

[18] Red Hat. eCos. http://sources.redhat.com/ecos/.

[19] A. Reid, M. Flatt, L. Stoller, J. Lepreau, and E. Eide. Knit: Component composition for systems software. In Proceedings of the 4th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 347–360, 2000.

[20] D. M. Ritchie. A stream input-output system. AT&T Bell Laboratories Technical Journal, 63(8):311–324, 1984.

[21] Sandia National Laboratories. ASCI Red, 1996. http://www.sandia.gov/ASCI/TFLOP.

[22] T. L. Sterling and H. P. Zima. The Gilgamesh MIND processor-in-memory architecture for petaflops-scale computing. In International Symposium on High Performance Computing (ISHPC 2002), volume 2327 of Lecture Notes in Computer Science, pages 1–5. Springer, 2002.

[23] F. Travostino, E. Menze III, and F. Reynolds. Paths: Programming with system resources in support of real-time distributed applications. In Proceedings of the IEEE Workshop on Object-Oriented Real-Time Dependable Systems, 1996.

[24] R. van Renesse, K. P. Birman, R. Friedman, M. Hayden, and D. A. Karr. A framework for protocol composition in Horus. In Proceedings of the 14th ACM Principles of Distributed Computing Conference, pages 80–89, 1995.

[25] R. van Renesse, K. P. Birman, M. Hayden, A. Vaysburd, and D. A. Karr. Building adaptive systems using Ensemble. Software Practice and Experience, 28(9):963–979, 1998.

[26] S. R. Wheat, A. B. Maccabe, R. Riesen, D. W. van Dresser, and T. M. Stallcup. PUMA: An operating system for massively parallel systems. In Proceedings of the Twenty-Seventh Annual Hawaii International Conference on System Sciences, pages 56–65. IEEE Computer Society Press, 1994.

[27] S. R. Wheat, A. B. Maccabe, R. Riesen, D. W. van Dresser, and T. M. Stallcup. PUMA: An operating system for massively parallel systems. Scientific Programming, 3:275–288, 1994.

[28] E. Zadok and I. Badulescu. A stackable file system interface for Linux. In Proceedings of the 5th Annual Linux Expo, pages 141–151, Raleigh, NC, 1999.


Cluster Operating System Support for Parallel Autonomic Computing

A. Goscinski, School of Information Technology, Deakin University, Geelong, Victoria 3217, Australia, +61 3 5227 2088, [email protected]

J. Silcock, School of Information Technology, Deakin University, Geelong, Victoria 3217, Australia, +61 3 5227 1378, [email protected]

M. Hobbs, School of Information Technology, Deakin University, Geelong, Victoria 3217, Australia, +61 3 5227 3342, [email protected]

ABSTRACT

The aim of this paper is to show a general design of autonomic elements and the initial implementation of a cluster operating system that moves parallel processing on clusters into the computing mainstream using the autonomic computing vision. The significance of this solution is as follows. Autonomic computing was identified by IBM as one of computing's Grand Challenges. The human body was used to illustrate an autonomic computing system that possesses self-knowledge, self-configuration, self-optimization, self-healing, self-protection, knowledge of its environment, and user friendliness. One of the areas that could benefit from the comprehensive approach created by the autonomic computing vision is parallel processing on non-dedicated clusters. Many researchers and research groups have responded positively to the challenge by initiating research around one or two of the characteristics identified by IBM as the requirements for autonomic computing. We demonstrate here that it is possible to satisfy all of the autonomic computing characteristics.

Categories and Subject Descriptors
D.4.7 [Operating Systems]: Organization and Design – Distributed systems.

General Terms
Management, Design, Reliability.

Keywords
Cluster Operating Systems, Parallel Processing, Autonomic Computing.

1. INTRODUCTION

There is a strong trend in parallel computing to move to cheaper, general-purpose distributed systems, called non-dedicated clusters, that consist of commodity off-the-shelf components such as PCs connected by fast networks. Many companies, businesses, and research organizations already have such "ready-made parallel computers", which are often idle and/or lightly loaded not only during nights and weekends but also during working hours.

A review by Goscinski [9] shows that none of the research performed thus far has addressed the problem of developing a technology that goes beyond high-performance execution and allows clusters and grids to be built that support unpredictable changes, provide services reliably to all users, and offer ease of use and ease of programming. Computer clusters, including non-dedicated clusters that allow the concurrent execution of both parallel and sequential applications, are seen as user-unfriendly due to their complexity. Parallel processing on clusters is not broadly accessible and is not used on a daily basis – it has not yet become a part of the computing mainstream. Many activities, e.g., selection of computers, allocation of computations to computers, and dealing with faults and changes caused by adding and removing computers to/from clusters, must be handled (programmed) manually by programmers. Ordinary engineers, managers, etc., do not have, and should not be required to have, the specialized knowledge needed to program operating-system-oriented activities. The deficiencies of current research in parallel processing, in particular on clusters, have also been identified in [19,6,33,2,31]. A similar situation exists in the area of Distributed Shared Memory (DSM); a need for an integrated approach to building DSM systems was advocated in [16]. We decided to demonstrate the possibility of addressing not only high performance but also ease of programming/use, reliability, and availability through proper reaction to unpredictable changes and through transparency, and we developed the GENESIS cluster operating system, which provides an SSI and offers services that satisfy these requirements [12]. However, up to the end of 2001 there was no wider response to satisfy these requirements.

A comprehensive program to re-examine the "obsession with faster, smaller, and more powerful" and "to look at the evolution of computing from a more holistic perspective" was launched by IBM in 2001 [16,15]. Autonomic computing is seen by IBM [16] as "the development of intelligent, open systems capable of running themselves, adapting to varying circumstances in accordance with business policies and objectives, and preparing their resources to most efficiently handle the workloads we put upon them".

As stated above, we have been carrying out research in the area of building new-generation non-dedicated clusters through the study of cluster operating systems supporting parallel processing. However, in order to achieve a truly effective solution, we decided to synthesize and develop an operating system for clusters rather than to exploit a middleware approach. Our experience with GENESIS [12], a predecessor of Holos, demonstrated that incorporating many services (currently provided by middleware) into a single comprehensive operating system that exploits the concept of a microkernel makes using the system easy and improves the overall performance of application execution. We are strongly convinced that the client-server and microkernel approaches lead to a better design of operating systems, which are not bloated, can be easily tailored to applications, and improve security and reliability. An identical line of thought was presented recently in [23]. As a natural progression of our work, we have decided to move toward autonomic computing on non-dedicated clusters.

The aim of this paper is to present the outcome of our work in the form of the designed services underlying autonomic non-dedicated clusters, and to show the Holos ('whole' in Greek) cluster operating system, the implementation of these services, which is built to offer autonomic parallel computing on non-dedicated clusters. The problem we faced was whether to present this new cluster operating system through its architectural vision or to introduce it from the perspective of how it matches the characteristics of autonomic computing systems. We decided on the latter because it "says" more to a wider audience. This approach also allows us to better convey the novelty and contribution of the proposed system through individual elements of the grid and clustering technologies.

This paper is organized as follows. Section 2 presents related work and, in particular, demonstrates that there is no project/system that addresses all characteristics of autonomic computing. Section 3 presents the logical design of the autonomic elements and their services that must be created to provide parallel autonomic computing on non-dedicated clusters. Section 4 introduces the autonomic elements presented in the previous section, implemented or being implemented as cooperating servers of the Holos cluster operating system. Section 5 concludes the paper and outlines future work.

2. RELATED WORK

IBM's Grand Challenge identifying autonomic computing as a priority research area has brought into focus research carried out for many years on self-regulating computers. We long ago identified lack of user friendliness as a major obstacle to the widespread use of parallel processing in distributed systems [10]. In 1993, Joseph Barrera discussed a framework for the design of self-tuning systems [3]. While IBM is advocating a "holistic" approach to the design of computer systems, much of the focus of researchers is upon failure recovery rather than uninterrupted, continuous, adaptable execution. The latter includes execution under varying loads as well as recovery from hardware and software failure.

A number of projects related to autonomic computing are mentioned by IBM in [16]. OceanStore (University of California, Berkeley) [29] is a persistent data store designed to provide continuous access to persistent information for an enormous number of users. The infrastructure is made up of untrusted servers, hence the data is protected using redundancy and cryptography. Any computer can join the infrastructure by subscribing to an OceanStore service provider. Data can be cached anywhere, anytime, to improve the performance of the system. Information gained and analysed by internal event monitors allows OceanStore to adapt to changes in its environment, such as regional outages and denial-of-service attacks.

The Recovery-Oriented Computing (ROC) project [30] is a joint Berkeley/Stanford research project investigating novel techniques for building highly dependable Internet services. ROC focuses on the recovery of the system from failures rather than on their avoidance.

Anthill (University of Bologna, Italy) [1] is a framework to support the design, implementation, and evaluation of peer-to-peer applications. Anthill exploits the analogy between Complex Adaptive Systems (CAS), such as biological systems, and the decentralized control and large-scale dynamism of P2P systems. An Anthill system consists of a dynamic network of peer nodes; societies of adaptive agents (ants) travel through this network, interacting with nodes and cooperating with other agents in order to solve complex problems. The types of P2P services constructed using Anthill show the properties of resilience, adaptation, and self-organization.

Neuromation [25], Edinburgh University's information structuring project, involves structuring information based on human memory. The structure used – simple, homogeneous, and self-referential – would be suited to organizing information in an autonomic architecture.

The University of Freiburg's Multiagent Systems Project [24] revolves around the self-organized coordination of multiagent systems. This topic has some connections with Grid computing, especially economic coordination issues as in Darwin, Radar, or Globus.

The Immunocomputing project [18] (International Solvay Institutes for Physics and Chemistry, Belgium) aims to use the principles of information processing by proteins and immune networks to solve complex problems while at the same time being protected from viruses, noise, errors, and intrusions.

A Grid scheduling system developed at Monash University, called Nimrod-G [28], has been built to provide tools and services for solving coarse-grain task-farming problems. Its resource broker/Grid scheduler has the ability to lease resources at runtime depending on their capability, cost, and availability.

In the Bio-inspired Approaches to Autonomous Configuration of Distributed Systems project [4] at University College London, bio-inspired approaches to the autonomous configuration of distributed systems (including a bacteria-inspired approach) are being explored.

While many of these systems engage with some aspects of autonomic computing, none undertakes research to develop a system that has all eight of the required characteristics. Furthermore, none of the projects addresses parallel processing, in particular parallel processing on non-dedicated clusters.

3. THE LOGICAL DESIGN OF AUTONOMIC ELEMENTS PROVIDING AUTONOMIC COMPUTING ON NON-DEDICATED CLUSTERS

According to Horn [15], an autonomic computing system can be described as one that possesses at least the following characteristics: it knows itself; configures and reconfigures itself under varying and unpredictable conditions; optimizes its working; performs something akin to healing; provides self-protection; knows its surrounding environment; exists in an open (non-hermetic) environment; and anticipates the optimized resources needed while keeping its complexity hidden.

An autonomic computing system is a collection of autonomic elements, which can function at many levels: computing components and services, clusters within companies, and grids within entire enterprises. Each autonomic element is responsible for its own state, behavior, and management, which satisfy the user objectives. These elements interact among themselves and with surrounding environments. The objectives of individual components must be consistent with the objective of the whole set of cooperating elements [19].

We proposed and designed a set of autonomic elements that must be provided to develop an autonomic computing system supporting parallel processing on non-dedicated clusters. These elements are described in the following subsections.

3.1 Cluster knows itself

To allow a system to know itself, there is a need for resource discovery. This autonomic element (service) is designed to run on each computer of the cluster and:

• Identifies its components, in particular computers, and their state;

• Acquires knowledge of static parameters of the whole cluster and of individual computers, such as processor type, memory size, and available software; and

• Acquires knowledge of dynamic parameters of cluster components, e.g., data about computers' load, available memory, and communication pattern and volume (a sketch of the record such a service might maintain follows the list).
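As an illustration only, the per-computer record such a service might maintain could look like the following sketch; the field names are ours, derived from the parameters listed above, and are not taken from Holos source code:

    /* Sketch: per-computer resource discovery record (hypothetical). */
    #include <stdint.h>

    struct static_params {            /* gathered once, when a computer joins */
        char     cpu_model[32];       /* processor type */
        uint64_t mem_total_bytes;     /* memory size */
        uint32_t software_mask;       /* available software, as feature bits */
    };

    struct dynamic_params {           /* refreshed periodically */
        double   cpu_load;            /* computation load */
        uint64_t mem_free_bytes;      /* available memory */
        uint64_t msgs_sent;           /* communication volume... */
        uint64_t bytes_sent;
        uint32_t busiest_peer;        /* ...and a crude communication pattern */
    };

    struct node_record {
        uint32_t node_id;
        struct static_params  s;
        struct dynamic_params d;
    };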

Figure 1 illustrates the outcome of the general design of this autonomic element (service). It depicts the Resource Discovery Service on each computer of the cluster obtaining information from the various local resources, such as processor loads, memory usage, and communication statistics between both local and remote Computational Elements (CEs, or processes).

3.2 Cluster configures and reconfigures itself

In a non-dedicated cluster, computers can become heavily loaded. On the other hand, there are time periods when some computers of a cluster are lightly loaded or even idle. Some computers cannot be used to support parallel processing of other users' applications because their owners have removed them from the shared pool of resources.

To allow a system to offer high availability, i.e., to configure and reconfigure itself under the varying and unpredictable conditions of adding and removing computers, the system was designed to:

• Adaptively and dynamically form a virtual cluster according to load and changing resources;

• Offer high availability of resources, in particular computers.

Figure 2 illustrates the outcome of the general design of this autonomic element. It shows how virtual clusters can change over time, with the virtual cluster expanding from t0 through t1 to t2, and contracting at t3.

3.3 Cluster should optimize its working

Computation elements of a newly created parallel (or sequential) application should be placed in an optimal manner on the computers of the virtual cluster formed for the application. Furthermore, if a new computer is added to the virtual cluster or the load of some of the computers in the cluster changes dramatically, load balancing should be employed to improve overall execution performance. When improving performance, not only computation load and available memory should be taken into consideration, but also communication costs, which in non-dedicated clusters are high. Thus, to optimize the cluster's working:

• Static allocation and load balancing are employed;

• Changing scheduling from static to dynamic, and from dynamic to static, is provided;

• Changing performance indices, which reflect user objectives, between computation-oriented and communication-oriented applications is provided;

• Computation element migration, creation, and duplication is exploited;

• Dynamic setting of computation priorities of parallel applications, which reflect user objectives, is provided.

Figure 1. Resource Discovery Service Design (CE – computation element)

Figure 2. Availability Service Design (RD – resource discovery element; virtual clusters expand and contract over time, t0 < t1 < t2 < t3)

The outcome of the general design of this autonomic element is shown in Figure 3. In this example, the static allocation component instantiates computational elements on selected computers (e.g., CE1 → C1, CE2 → C2), and the load balancing component migrates computational elements between computers (e.g., CEi from Cn to C3). The decisions of when, which, and where are made by higher-level services, such as a Global Scheduler.

3.4 Cluster should perform something akin to healing

Despite the fact that PCs and networks are becoming more reliable, hardware and software faults can occur in non-dedicated clusters. Failures currently lead to the termination of computations; many hours or even days of work can be lost if these computations have to be restarted from scratch. Thus, the system should be able to provide something akin to healing:

• Faults and their occurrence are identified and reported;

• Checkpointing of parallel applications is provided;

• Recovery from failures is employed;

• Migration of application computation elements from faulty computers to automatically located healthy computers is carried out;

• Redundant/replicated autonomic elements are provided.

An illustration of the outcome of the general design of this autonomic element is given in Figure 4. (Fault detection is not shown in this figure.) Checkpoints are stored in the main memories of other virtual clusters for performance and on disk for high reliability. A process is recovered after a fault by using one of the checkpoint copies on a selected computer or from disk. A minimal sketch of this recovery choice follows.
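The recovery choice described above can be sketched as a simple policy: prefer a surviving in-memory checkpoint copy on a peer computer, and fall back to the disk copy. All names in this sketch are illustrative:

    /* Sketch: pick a checkpoint copy for recovery (hypothetical). */
    #include <stdio.h>

    struct checkpoint { int valid; const char *where; };

    /* Prefer a surviving copy in a peer's main memory (fast), falling
       back to the copy on disk (reliable). */
    static const struct checkpoint *pick_copy(const struct checkpoint *mem,
                                              int n,
                                              const struct checkpoint *disk)
    {
        for (int i = 0; i < n; i++)
            if (mem[i].valid)
                return &mem[i];
        return disk;
    }

    int main(void)
    {
        struct checkpoint mem[] = { { 0, "memory of C2" },
                                    { 1, "memory of Cj" } };
        struct checkpoint disk  = { 1, "disk" };
        printf("recover CEi from %s\n", pick_copy(mem, 2, &disk)->where);
        return 0;
    }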

3.5 Cluster should provide self-protection

Computation elements of parallel and distributed applications run on a number of computers of a cluster. They communicate using messages. As such, they are subject to passive and active attacks. Thus, resources must be protected, applications/users must be authenticated and authorized, in particular when computation element migration is used, and communication security countermeasures must be applied. The design of an autonomic element providing self-protection includes:

• Virus detection and recovery;

• Resource protection based on access control lists and/or capabilities;

• Encryption, as a countermeasure against passive attacks;

• Authentication, as a countermeasure against active attacks.

This autonomic element is the subject of our current design and will be addressed in another report.

3.6 Cluster knows and works with its surrounding environment

There are applications that require more computation power, specialized software, unique peripheral devices, etc. Many owners of clusters cannot afford such resources. On the other hand, owners of other clusters and systems would be happy to offer their services and resources to appropriate users. Thus, to allow a system to know its surrounding environment, to prevent it from existing in a hermetic environment, and to benefit from existing unique resources and services:

• Resource discovery of other similar clusters is provided;

• Advertising services, to make the user's own services available to others, are in place;

• The system is able to communicate/cooperate with other systems;

• Negotiation with service providers is provided;

• Brokerage of resources and services is exploited;

• Resources should be made available/shared in a distributed/grid-like manner.

An example of a set of cooperating brokerage autonomic elements running on different clusters, illustrating some aspects of the designed autonomic element, is shown in Figure 5.

3.7 Cluster should anticipate the optimized resources needed while keeping its complexity hidden

Until now, the single factor limiting the harnessing of the computing power of non-dedicated clusters for parallel computing has been the scarcity of software assisting non-expert programmers. This implies a need for at least the following:

• Single System Image, in particular where transparency is offered;

• A programming environment that is simple to use and does not require the user to see distributed resources;

• Message passing and DSM programming are supported transparently.

Figure 3. High Performance Service Design (static allocation decides which computational elements go where; dynamic load balancing decides which, where, and when to migrate)

Figure 4. Self-Healing Service Design (coordinated checkpointing keeps copies of a checkpoint for CEi in the main memories of other computers and on disk; after a crash, CEi is recovered on a selected computer)

When these features are provided, the complexity of a cluster is greatly reduced from the perspective of both the programmer and the user, hiding the complexities of managing the resources of a non-dedicated cluster and relieving the programmer of many system-related functions.

4. THE HOLOS AUTONOMIC ELEMENTS FOR AUTONOMIC COMPUTING CLUSTERS

To demonstrate that it is possible to develop an easy-to-use autonomic non-dedicated cluster, we decided to implement the autonomic elements presented in Section 3 and build a new autonomic cluster operating system, called Holos. We decided to implement the autonomic elements as servers. Each computer of a cluster is a multi-process system with its objectives set by its owner, and the whole cluster is a set of multi-process systems with its objectives set by a super-user.

4.1 Holos architecture

Holos is being built as an extension of the GENESIS system [12]. Holos exploits the P2P paradigm and an object-based approach (where each entity has a name) supported by a microkernel [8]. The general architecture is shown in Figure 6. Holos uses a three-level hierarchy for naming: user names, system names, and physical locations. The system name is a data structure that allows objects in the cluster to be identified uniquely and serves as a capability for object protection [11]. A hypothetical sketch of such a naming hierarchy follows.
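As an illustration, the three-level naming might be represented as follows; the layout is invented for this sketch and is not the Holos data structure:

    /* Sketch: three-level naming with a capability-style system name. */
    #include <stdint.h>

    struct sys_name {                 /* unique, location-independent name */
        uint32_t creator_node;        /* where the object was created */
        uint32_t serial;              /* unique on that computer */
        uint32_t type;                /* process, space, file, ... */
        uint64_t rights;              /* capability bits protecting the object */
    };

    struct location {                 /* current physical placement; may
                                         change when, e.g., a process migrates */
        uint32_t node;
    };

    /* Resolution order: user name ("solver.job42") -> sys_name -> location. */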

The microkernel provides services such as local inter-process communication (IPC), basic paging operations, interrupt handling, and context switching. Other operating system services are provided by a set of cooperating processes. There are three groups of processes: kernel managers, system servers, and application processes. Whereas the kernel and system servers are stationary, application processes are mobile. All processes communicate using messages. Kernel managers are responsible for managing the resources of the operating system. The Process Manager, Space (Memory) Manager, and IPC Manager manage the Process Control Blocks (PCBs), memory regions, and IPC of processes, respectively. The Network Manager provides access to the underlying network and supports communication among remote processes. All the kernel managers support the system servers.

The servers, which form a basis of an autonomic operating system for non-dedicated clusters, are as follows:

• Resource Discovery Server – collects data about computation and communication load, and supports the establishment of a virtual cluster;

• Availability Server – dynamically and adaptively forms a virtual cluster for the application;

• Global Scheduling Server – maps application processes onto the computers that make up the virtual cluster for the application;

• Execution Server – coordinates the single, multiple, and group creation and duplication of application processes on both local and remote computers;

• Migration Server – coordinates moving an application process (or a set of application processes) from one computer to another computer or to a set of computers, respectively;

• DSM Server – hides the distributed nature of the cluster's memory and allows programmers to write their code as though using physically shared memory;

• Checkpoint Server – coordinates the creation of checkpoints for an executing application;

• Inter-Process Communication (IPC) Manager – supports remote inter-process communication and group communication within sets of application processes;

• File Server – supports both system- and user-level processes in accessing secondary storage, particularly the Execution Server in the creation of processes, the Checkpoint Server in the storage of checkpoint data, and the Space Manager in the provision of paging; and

Cluster 1 Cluster 2

Brokerage Service

Computational Services

Storage/Memory Services

Printer Services

Information Services

Advertisement

Exporting Services

Withdrawal Services

Import Requests

Brokerage Service

Cluster n Cluster 3

Brokerage Service

Brokerage Service

Figure 5. Grid-like Service Design

Figure 6. The Holos operating system



4.2 Holos possesses the autonomic computing characteristics

The sets of Holos servers that, individually and in cooperation, provide services satisfying IBM's Autonomic Computing requirements are specified in Table 1.

The following subsections present the servers whose services allow Holos to behave as an autonomic operating system and to support autonomic parallel computing on non-dedicated clusters. As inter-process communication in Holos is the basis of all services, and in particular of transparency, it is also presented.

4.3 Communication among parallel processes

To hide distribution and make remote inter-process communication look identical to communication between local application processes, we decided to build all of the operating system services of Holos around the inter-process communication facility. To programmers of standard and parallel applications, local and remote communication is indistinguishable, which forms a basis for complete transparency.

The IPC Manager is also responsible for both local and remote address resolution for group communication. Messages that are sent to a group require the IPC Manager to resolve the destination process location and provide the mechanism for the transport of the message to the requested group members. To support programmers, the Holos group communication facility allows processes to create, join, leave and kill a group, and supports different message delivery, response and message ordering semantics [31].
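The group operations listed above map naturally onto a small programming interface. The following Java sketch illustrates one plausible shape for such a facility; the interface name, method signatures and the delivery/ordering enums are our own illustrative inventions, not the actual Holos API.

```java
// Hypothetical sketch of a Holos-style group communication facility.
// All names and signatures are illustrative, not the actual Holos API.
public interface GroupComm {

    interface GroupHandle { }                            // opaque group identifier

    enum Delivery { BEST_EFFORT, K_DELIVERY, ATOMIC }    // delivery semantics
    enum Ordering { NONE, FIFO, CAUSAL, TOTAL }          // message ordering semantics

    GroupHandle create(String name, Delivery d, Ordering o);
    void join(GroupHandle g);                            // calling process joins
    void leave(GroupHandle g);                           // calling process leaves
    void kill(GroupHandle g);                            // destroy group for all members

    // Send to all members; for K_DELIVERY the call completes once at
    // least k members have acknowledged receipt of the message.
    void send(GroupHandle g, byte[] message, int k);
}
```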

4.4 Establishment of a virtual cluster for cluster self awareness

The Resource Discovery Server [12,26] and the Availability Server play a key role in the establishment of virtual clusters upon a cluster. The Resource Discovery Server identifies idle and/or lightly loaded computers and their resources (processor model, memory size, etc.); collects both the computational load and the communication patterns of each process executing on a given computer; and provides this information to the Availability Server, which uses it to establish a virtual cluster.

The virtual cluster changes dynamically over time: some computers are removed or become overloaded and can no longer be used as part of the execution environment for a given parallel application, while other computers are added or become idle/lightly loaded and can become a component of the virtual cluster. The dynamic nature of the virtual cluster creates an environment that can satisfy the requirements of an application that expands or shrinks during its execution.

The current Resource Discovery Server collects (using specially designed hooks installed in the microkernel and in the Process, Space and IPC Managers) static parameters, such as processor type and memory size, and dynamic parameters, such as computation load (the number of processes in the ready and blocked states), available memory, and communication pattern and volume. We are enhancing this server by studying how this data should be collected and processed. We are also concentrating our efforts on the Availability Server. We study the identification of events that report computer and software faults; the addition and removal of computers to/from the cluster by an administrator and/or user; changes in computation load (the completion of a process, the creation of a new process) and communication load (processes communicating intensively, computers/processes completing); and new requests to allocate/release computers for applications. This information is used in the development of adaptive algorithms for forming and reconfiguring virtual clusters.
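As a rough illustration, the per-computer record gathered by such a discovery service could look like the following sketch; the class and field names are our own assumptions, chosen to mirror the parameters listed above.

```java
// Illustrative record of the static and dynamic parameters a resource
// discovery service might collect per computer (all names are assumptions).
public class NodeLoadRecord {
    // Static parameters, collected once.
    String processorType;
    long   totalMemoryBytes;

    // Dynamic parameters, refreshed periodically via microkernel hooks.
    int  readyProcesses;         // computation load: runnable processes
    int  blockedProcesses;       // computation load: blocked processes
    long availableMemoryBytes;
    long bytesSentPerSecond;     // communication volume
    long bytesReceivedPerSecond;

    // A simple availability test an Availability Server might apply.
    boolean isLightlyLoaded(int maxReady, long minFreeBytes) {
        return readyProcesses <= maxReady
            && availableMemoryBytes >= minFreeBytes;
    }
}
```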

4.5 Mapping parallel processes to computers for cluster self optimization

Mapping parallel processes to the computers of a virtual cluster is performed by the Global Scheduling Server. This process combines static allocation and dynamic load balancing components, which allow the system to provide mapping by finding the best locations for the parallel processes of the application to be created remotely, and to react to large fluctuations in system load. The decision to switch between the static allocation and dynamic load balancing policies is dictated by the scheduling policy, which uses the information gathered by the Resource Discovery Server.
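A minimal sketch of this two-component scheduling structure is given below; the class names, the imbalance threshold and the policy test are hypothetical, intended only to show how static placement and dynamic rebalancing can be combined behind one mapping interface.

```java
// Hypothetical sketch of a global scheduler that places processes
// statically at creation time and rebalances when load diverges.
import java.util.Comparator;
import java.util.List;

public class GlobalScheduler {
    // Assumed helper type: one entry per virtual-cluster computer.
    public static class Node {
        final String name;
        int load; // e.g. number of ready processes
        Node(String name, int load) { this.name = name; this.load = load; }
    }

    // Static allocation: place a new process on the least loaded node.
    public Node placeNewProcess(List<Node> cluster) {
        Node best = cluster.stream()
                           .min(Comparator.comparingInt(n -> n.load))
                           .orElseThrow();
        best.load++;
        return best;
    }

    // Dynamic load balancing: triggered only on large load fluctuations.
    public boolean needsRebalancing(List<Node> cluster, int threshold) {
        int max = cluster.stream().mapToInt(n -> n.load).max().orElse(0);
        int min = cluster.stream().mapToInt(n -> n.load).min().orElse(0);
        return max - min > threshold; // then migrate from max to min node
    }
}
```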

Currently, the global scheduler is a centralized server. Our initial performance study of MPI computation- and communication-bound parallel applications on a 16-computer cluster shows that their concurrent execution with sequential applications (computation-bound, I/O-bound, and in between) leads to improved execution performance and better utilization of the whole cluster. Sequential applications demonstrated a very small slow-down, which in some cases could even go unnoticed and in other cases could be neglected [37,11]. For both kinds of parallel application, static allocation was used to place parallel processes initially on cluster computers, and dynamic load balancing was employed thereafter.

Table 1. Servers working together to carry out services of autonomic computing

Autonomic Computing Requirement – Cooperating Holos Servers (Relationships Among Autonomic Elements)

• To allow a system to know itself – Resource Discovery Server

• A system must configure and reconfigure itself under varying and unpredictable conditions – Resource Discovery, Global Scheduling, Migration, Execution, and Availability Servers

• A system must optimize its working – Global Scheduling, Migration, and Execution Servers

• A system must perform something akin to healing – Checkpoint, Migration, and Global Scheduling Servers

• A system must provide self-protection – Capabilities in the form of System Names

• A system must know its surrounding environment – Resource Discovery and Brokerage Servers

• A system cannot exist in a hermetic environment – Inter-Process Communication Manager and Brokerage Server

• A system must anticipate the optimized resources needed while keeping its complexity hidden (most critical for the user) – DSM and Execution Servers, DSM Programming Environment, Message Passing Programming Environment, PVM/MPI Programming Environment

4.6 Process creation

In Holos, each computer is provided with an EXecution (EX) Server, which is responsible for local process creation [13]. A local EX Server is capable of contacting a remote EX Server to create a remote process on its behalf. Currently, the remote process creation service offers multiple process creation, which concurrently creates n parallel processes on a single computer, and group process creation, which concurrently creates processes on m selected computers. These mechanisms are of great importance, for instance, for master-slave based applications, where a number of identical child processes is mapped to remote computers.

When a new process is to be created, the Global Scheduler instructs the EX Server to create the process locally or on a remote computer. In both instances, i.e., group remote process creation and individual creation, a list of destination computers is provided by the Global Scheduler and forwarded to the EX Servers on the respective destination computers. A process is created from an image stored in a file; this implies a need to employ the File Server to support this operation. To achieve high performance of the group process creation operation, a copy of the file that contains the child image is distributed to the selected computers by the group communication facility.
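The interplay just described could be sketched as follows; the class and method names (ExServer, createGroup, broadcastFile) are hypothetical, not taken from the Holos sources, and the sketch simply shows group creation fanning the image out once over the group communication facility.

```java
// Hypothetical sketch of group process creation via an EX Server.
// Names (ExServer, createGroup, broadcastFile) are illustrative only.
import java.util.List;

public class ExServer {

    // Create one process per destination computer, concurrently.
    public void createGroup(String imageFile, List<String> destinations) {
        // Distribute the executable image once, using group communication,
        // instead of sending it point-to-point to every destination.
        broadcastFile(imageFile, destinations);

        // Ask the remote EX Server on each destination to create the
        // process from the locally cached image (done in parallel).
        destinations.parallelStream()
                    .forEach(host -> remoteCreate(host, imageFile));
    }

    private void broadcastFile(String imageFile, List<String> destinations) {
        /* group communication: one multicast of the image file */
    }

    private void remoteCreate(String host, String imageFile) {
        /* message to the EX Server on 'host': create process from image */
    }
}
```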

4.7 Process duplication and migration

Parallel processes of an application can also be instantiated on selected computers of the virtual cluster by duplicating a process locally by the EX Server and, if necessary, migrating it to the selected computer(s) [14].

Migrating an application process involves moving the process state, address space, communication state, and any other associated resources. This implies that a number of kernel managers, such as the Process, Space, and IPC Managers, are involved in process migration; the Migration Server only plays a coordinating role [8]. Group process migration is also supported, i.e., a process can be concurrently migrated to n computers selected by the Global Scheduling Server.

4.8 Computation co-ordination for cluster self optimization

When a parallel application is processed on a virtual cluster, where parallel processes are executing remotely, application semantics require an operating system to transparently maintain: input and output to/from the user, the parent/child relationship, and any communication with remote processes. As all communication in Holos is transparent, input and output to/from a user and communication with remotely executing processes are transparent.

In Holos, the parent's origin computer manages all process "exits" and "waits" issued by the parent and its children. Furthermore, child processes in a parallel section of the program must co-ordinate their execution by waiting both for data allocation at the beginning of their execution and for the completion of the slowest process in the group, in order to preserve the correctness of the application implied by the data consistency requirement. In the Holos system, barriers are employed for this purpose.

4.9 Checkpointing for cluster self healing

Checkpointing and fault recovery have been selected to provide fault tolerance. Holos uses coordinated checkpointing, which requires that non-deterministic events, such as processes interacting with each other, the operating system, or the end user, be prevented during the creation of checkpoints. Under a microkernel-based architecture, operating system services are accessed by sending requests to operating system servers, rather than directly through system calls. This makes it possible to prevent non-deterministic events by stopping processes from communicating with each other or with operating system servers during the creation of checkpoints. The blocked messages are then included in the checkpoints of the sending processes to maintain the consistency of the checkpoints, and are dispatched to their destinations after all checkpoints have been created.

To improve the performance of checkpointing, Holos employs the main memory of other computers of the cluster, rather than a centralized disk. However, as these back-up computers can also fail, a checkpoint is stored on at least k computers, and the k-delivery group communication service is used to support this operation. Disk-based checkpointing is also used, but checkpoints are stored on disk much less frequently.
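A compressed sketch of this memory-based replication idea follows; the names and the use of an acknowledgement count are our own assumptions layered on the k-delivery semantics mentioned above.

```java
// Hypothetical sketch: replicate a checkpoint into the main memory of
// at least k peer computers before considering it stable.
import java.util.List;

public class MemoryCheckpointStore {
    private final GroupSender sender; // assumed k-delivery group facility

    public MemoryCheckpointStore(GroupSender sender) { this.sender = sender; }

    public void store(byte[] checkpoint, List<String> backups, int k) {
        if (backups.size() < k)
            throw new IllegalStateException("fewer than k back-up computers");
        // k-delivery: returns only once at least k peers hold the copy.
        sender.sendWithKDelivery(checkpoint, backups, k);
    }

    // Assumed interface onto the group communication facility.
    public interface GroupSender {
        void sendWithKDelivery(byte[] data, List<String> members, int k);
    }
}
```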

The creation of checkpoints is controlled by another Holos process, the Checkpoint Server, which is placed on each computer and invokes the kernel managers to create checkpoints of the processes on the same computer [32]. The coordinating Checkpoint Server (on the computer where the application was originally created) directs the creation of checkpoints for a parallel application by sending requests to the remote Checkpoint Servers to perform the operations relevant to the current stage of checkpointing. To create a checkpoint of a process, each of the kernel managers must be invoked to copy the resources under its control.

Currently, fault detection and fault recovery are the subject of our research. This research builds on liveness checking for detecting faults, and on process migration, which moves a process restored from a selected checkpoint to a specified computer, for recovery. We are also developing and studying methods of recording and using information about the location of checkpoints within the relevant virtual cluster.

4.10 Brokerage (toward grids) for cluster self and surroundings' awareness

Brokerage and resource discovery have been studied to build basic autonomic elements allowing Holos services and applications to be offered both to other users working with Holos and to users of other systems [26].

A copy of a brokerage process runs on each computer of the cluster. Each Holos broker is a process that preserves user autonomy as in a centralized environment, and supports sharing by advertising services to make a user's own services available to other users, by allowing objects to be exported to other clusters or to be withdrawn from service, and by allowing objects that have been exported by users from other clusters to be imported.

The Holos broker supports object sharing among homogeneous clusters [26] and grids [27]. This implies that resources should be made available/shared in a distributed/grid-like manner. The test version of the broker was developed based on attribute names, in order to allow users to access objects without knowing their precise names.

4.11 Programming interface for user friendliness

Holos provides transparent communication services of standard message passing (MP) and DSM as its integral components. In this sub-section we present details of these common parallel programming communication mechanisms and how they are integrated transparently into the Holos system. The logical design of the communication services, and how they interface to applications using MP and DSM, is shown in Figure 7. This figure also shows the hierarchical relationship between the communication services and the system and kernel services.

4.11.1 Holos message passing

The standard MP service within the Holos parallel execution environment is provided by the Local IPC component of the microkernel and the IPC Manager, which is supported by the Network Manager. These combine to provide a transparent local and remote MP service, which supports both various qualities of service and group communication mechanisms. Programmers are provided with the standard MP primitives (send and receive) and RPC primitives (call, receive and reply).
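To make the primitive set concrete, the sketch below shows one plausible Java rendering of these send/receive and call/receive/reply operations; the interface and its signatures are our illustration, not the actual Holos programmer interface.

```java
// Illustrative rendering of the MP and RPC primitive sets named above.
// All names and signatures are assumptions for exposition.
public interface HolosMessagePassing {

    interface ProcessId { }   // opaque, location-independent process name

    // MP primitives: location-transparent asynchronous messaging.
    void send(ProcessId destination, byte[] message);
    byte[] receive();                              // blocks until a message arrives

    // RPC primitives: client blocks in call(); server receives and replies.
    byte[] call(ProcessId server, byte[] request); // returns the server's reply
    Request receiveCall();                         // server side: next request
    void reply(Request r, byte[] response);        // server side: answer it

    interface Request {
        ProcessId caller();
        byte[] payload();
    }
}
```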

4.11.2 Holos PVM and MPI

PVM and MPI have been ported to Holos, as they allow programmers to exploit an advanced message-passing-based parallel environment [30,22]. Three modifications to PVM applications running on UNIX have been identified to improve performance: avoiding the use of XDR encoding where possible, using the direct IPC model instead of the default model, and balancing the load. The PVM communication service is transparently provided as a simple mapping of the standard PVM services onto the Holos communication services, and it benefits from additional services (for example, group process creation and process migration) that are not provided by operating systems such as Unix or Windows.

The move of MPI from UNIX to Holos has been achieved by replacing the two lower layers of MPICH with the services that Holos provides: group communication, group process creation, process migration, and global scheduling, including static allocation and dynamic load balancing. Incorporating these services into MPI has shown promising results and provided a better basis for implementing parallel programming tools.

4.11.3 Distributed Shared Memory

Holos DSM exploits the conventional memory sharing approach (writing shared memory code using concurrent programming skills) by using the basic concepts and mechanisms of memory management to provide DSM support [35].

One of the unique features of Holos DSM is that it is integrated into the memory management of the operating system. We decided to embody the DSM within the operating system in order to create a transparent, easy to use and easy to program environment, and to achieve high execution performance of parallel applications. The options for placing the DSM system within the operating system were either to build it as a separate server or to incorporate it into one of the existing servers. The first option was rejected because of a possible conflict between two servers (the Space Manager and the DSM system) both managing the same object type, i.e., memory; synchronised access to the memory to maintain its consistency would become a serious issue. Since DSM is essentially a memory management function, the Space Manager is the server into which we decided the DSM system should be integrated. This implies that programmers are able to use the shared memory as though it were physically shared; hence, the transparency requirement is met. Furthermore, because the DSM system is in the operating system itself and is able to use the low level operating system functions, the efficiency requirement can be met.

The granularity of a shared memory object is a critical issue in the design of a DSM system. As the DSM system is placed within the Space Manager and the memory unit of the Holos Space is a page, the most appropriate unit of sharing for the DSM system is a page. The Holos DSM system employs the release consistency model (the memory is made consistent only when a critical region is exited), which is implemented using the write-update model [34].

In Holos DSM, synchronisation of processes that share memory takes the form of semaphore-type synchronisation for mutual exclusion. Each semaphore is owned by the Space Manager on a particular computer. Because the ownership of the semaphore is controlled by the Space Manager on each computer, gaining ownership of the semaphore is still mutually exclusive when more than one DSM process exists on the same computer. Barriers are used in Holos to co-ordinate executing processes.
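Under release consistency, a programmer's view of such a system reduces to the familiar acquire/write/release pattern; the sketch below is hypothetical (the DsmSemaphore and DsmBarrier names are ours) and only illustrates when updates become visible to other processes.

```java
// Hypothetical illustration of the release-consistent, write-update DSM
// usage pattern described above. DsmSemaphore/DsmBarrier are assumed names.
public class DsmExample {

    public static void updateShared(DsmSemaphore mutex,
                                    int[] shared, DsmBarrier barrier) {
        mutex.acquire();          // enter critical region (mutual exclusion)
        shared[0] += 1;           // writes buffered on this node's pages
        mutex.release();          // exit: updates are propagated (write-update)
                                  // and memory becomes consistent cluster-wide

        barrier.await();          // co-ordinate with the other parallel processes
    }

    public interface DsmSemaphore { void acquire(); void release(); }
    public interface DsmBarrier  { void await(); }
}
```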

One of the most serious problems of current DSM systems is that they have to be initialised manually by programmers [5], [19]; transparency of this operation is not provided. In Holos, DSM is initialised automatically and transparently: machines are selected, processes are created, and data is distributed automatically.

Figure 7. Easy Programming Service Design


5. CONCLUSION

In this paper, autonomic computing has been shown to be feasible and able to move parallel computing on non-dedicated clusters into the computing mainstream. The autonomic elements have been designed and implemented as dedicated servers or as parts of other system servers. Together, the cooperating processes that employ these mechanisms offer self and surroundings discovery, reconfiguration, self-protection, self-healing, sharing, and ease of programming. The Holos autonomic operating system has been built as an enhancement of the Genesis system to offer an autonomic non-dedicated cluster. This system relieves developers from programming operating-system-oriented activities, and provides developers of parallel applications with both message passing and DSM. In summary, the development of the Holos cluster operating system demonstrates that it is possible to build an autonomic non-dedicated cluster.

This paper contributes to the area of autonomic computing, in particular parallel autonomic computing on non-dedicated clusters, by harnessing many technologies developed by the authors. It also contributes to the area of cluster operating systems through the development of a comprehensive cluster operating system that supports parallel computing and solves the problem of building virtual clusters that change dynamically and adaptively according to load and changing resources, in particular the addition and removal of computers to/from the cluster.

6. ACKNOWLEDGEMENTS

We would like to thank the anonymous COSET reviewers for the valuable feedback and comments they provided. These suggestions helped us greatly to improve this paper.

7. REFERENCES

[1] Anthill (University of Bologna, Italy), http://www.cs.unibo.it/projects/anthill/, (accessed 26 May 2004).

[2] Auban, J.M.B. and Khalidi, Y.A. (1997): Operating System Support for General Purpose Single System Image Cluster. Proc. Int’l Conf. Parallel and Distributed Processing Techniques and Applications. PDPTA’97, Las Vegas.

[3] Barrera, J. (1993) Self-tuning systems software. Proc. Fourth Workshop on Workstation Operating Systems. 194-197.

[4] Bio-inspired Approaches to Autonomous Configuration of Distributed Systems (University College London), http://www.btexact.com, (accessed 6 May 2003).

[5] Carter, J., Efficient Distributed Shared Memory Based on Multi-Protocol Release Consistency, Ph.D. Thesis, Rice University, 1993.

[6] Cluster, (2000): Cluster Computing White Paper, Version 2.0. M. Baker (Editor).

[7] De Paoli, D. and Goscinski, A. (1998): The RHODOS Migration Facility. Journal of Systems and Software 40:51-65.

[8] De Paoli, D. et al. (1995): The RHODOS Microkernel, Kernel Servers and Their Cooperation. Proc. First IEEE Int’l Conf. on Algorithms and Architectures for Parallel Processing, ICA3PP’95.

[9] Goscinski, A. (2000): Towards an Operating System Managing Parallelism of Computing on Clusters of Workstations. Future Generation Computer Systems: 293-314.

[10] Goscinski, A. and Haddock, A. (1994): A Naming and Trading Facility for a Distributed System. The Australian Computer Journal, No. 1.

[11] Goscinski, A. and Wong, A. (2004): The Performance of a Parallel Communication-Bound Application Executing Concurrently with Sequential Applications on a Cluster – Case Study. (To be submitted to) The 2nd Intl. Symposium on Parallel and Distributed Processing and Applications (ISPA-2004). Dec. 2004, Hong Kong, China.

[12] Goscinski, A., Hobbs, M. and Silcock, J. (2002): GENESIS: An Efficient, Transparent and Easy to Use Cluster Operating System. Parallel Computing.

[13] Hobbs, M. and Goscinski, A. (1999a): A Concurrent Process Creation Service to Support SPMD Based Parallel Processing on COWs. Concurrency: Practice and Experience. 11(13).

[14] Hobbs, M. and Goscinski, A. (1999b): Remote and Concurrent Process Duplication for SPMD Based Parallel Processing on COWs. Proc. Int’l Conf. on High Performance Computing and Networking, HPCN Europe'99. Amsterdam.

[15] Horn, P. (2001): Autonomic Computing: IBM’s Perspective on the State of Information Technology.

[16] IBM (2001): IBM Corporation, http://www.research.ibm.com/autonomic/research. (Accessed 26 May 2004).

[17] Iftode L. and Singh J. P. (1997): Shared Virtual Memory: Progress and Challenges, Technical Report, TR-552-97, Department of Computer Science, Princeton University, October.

[18] Immunocomputing (International Solvay Institutes for Physics and Chemistry, Belgium) http://solvayins.ulb.ac.be/fixed/ProjImmune.html, (accessed 6 May 2003).

[19] Keleher, P. Lazy Release Consistency for Distributed Shared Memory, PhD Thesis, Rice University, 1994.

[20] Kephart, J. and Chess D. (2003): The Vision of Autonomic Computing. Computer, Jan.

[21] Lottiaux, R. and MORIN, C. (2001): Containers: A Sound Basis for a True Single System Image. Proc. First IEEE/ACM Int’l Symp. on Cluster Computing and the Grid. Brisbane.

[22] Maloney, A., Goscinski, A. and Hobbs, M.: An MPI Implementation Supported by Process Migration and Load Balancing, Recent Advances in Parallel Virtual Machine and Message Passing Interface: Proc. of the 10th European PVM/MPI User's Group Meeting, pp. 414-423, Springer-Verlag.

[23] McGraw, G. and Hoglund, G. (2004) Dire Straits: The Evolution of Software Opens new vistas for Business and the Bad Guys. http://infosecuritymag.techtarget.com/ss/0,295796,sid6_iss366_art684,00.html, (accessed 26 May 2004).


[24] Multiagent Systems (Freiburg University) http://www.iig.uni-freiburg.de/~eymann/publications/, (accessed 26 May 2004).

[25] Neuromation (Edinburgh University) http://www.neuromation.com/, (accessed 26 May 2004).

[26] Ni, Y. and Goscinski, A. (1994): Trader Cooperation to Enable Object Sharing Among Users of Homogeneous Distributed Systems. Computer Communications. 17(3): 218-229.

[27] Ni, Y. and Goscinski, A. (1993): Resource and Service Trading in a Heterogeneous Distributed Systems. Proc. IEE Workshop on Advances in Parallel and Distributed Systems, Princeton.

[28] Nimrod-G (Monash University) http://www.gridbus.org/, (accessed 26 May 2004).

[29] OceanStore (Berkeley University of California). http://oceanstore.cs.berkeley.edu/, (accessed 26 May 2004).

[30] Recovery-Oriented Computing (Berkeley/Stanford). http://roc.cs.berkeley.edu/, (accessed 26 May 2004).

[31] Rough, J. and Goscinski, A. (1999): Comparison Between PVM on RHODOS and Unix. Proc. Fourth Int. Symp. on Parallel Architectures, Algorithms and Networks (I-SPAN’99). A. Zamoya et al. (Eds), Freemantle.

[32] Rough, J. and Goscinski, A. (2004): The Development of an Efficient Checkpointing Operating System of the GENESIS Cluster Operating System. Future Generation Computer Systems, 20(4):523-538.

[33] Shirriff, K. et al. (1997): Single-System Image: The Solaris MC Approach. Proc. Int’l Conference on Parallel and Distributed Processing Techniques and Applications, PDPTA’97. Las Vegas.

[34] Silcock, J. and Goscinski, A. (1997). Update-Based Distributed Shared Memory Integrated into RHODOS' Memory Management. in: Proc. Third Intl. Conference on Algorithms and Architecture for Parallel Processing ICA3PP'97, Melbourne, Dec. 1997, 239-252.

[35] Silcock, J. and Goscinski, A. (1999): A Comprehensive DSM System That Provides Ease of Programming and Parallelism Management. Distributed Systems Engineering, 6: 121-128.

[36] Walker, B. (1999): Implementing a Full Single System Image UnixWare Cluster: Middleware vs. Underware. Proc. Int’l Conf. on Parallel and Distributed Processing Techniques and Applications, PDPTA’99.

[37] Wong, A. and Goscinski, A. (2004) Scheduling of a Parallel Computation-Bound and Sequential Applications Executing Concurrently on a Cluster - Case Study. (Submitted to) IEEE Intl. Conference on Cluster Computing. Sept. 2004, San Diego, California.


Type-Safe Object Exchange Between Applications and a DSM Kernel

R. Goeckelmann, M. Schoettner, S. Frenz and P. Schulthess

Department of Distributed Systems, University of Ulm, 89075 Ulm, Germany

[email protected]

Phone: ++49 731 50 24238; Fax: ++49 731 50 24142

Abstract: The Plurix project implements an object-oriented Operating System (OS) for PC clusters. Communication is achieved via shared objects in a Distributed Shared Memory (DSM) - using restartable transactions and an optimistic synchronization scheme to guarantee memory consistency. We contend that coupling object orientation with the DSM property allows type-consistent system bootstrapping, quick system startup and simplified development of distributed applications. It also facilitates checkpointing of the system state. The OS (including kernel and drivers) is written in Java using our proprietary Plurix Java Compiler (PJC), translating Java source code directly into Intel machine instructions. PJC is an integral part of the language-based OS and tailor-made for compiling in our persistent DSM environment. In this paper we briefly illustrate the architecture of our OS kernel, which runs entirely in the DSM, and the resulting opportunities for checkpointing and communication between applications and OS. We present issues of memory management related to the DSM kernel and strategies to avoid false-sharing.

Keywords: Distributed Shared Memory, Object-Orientation, Reliability, Single System Image

1 Introduction

Typical cluster systems are built on top of traditional operating systems (OS) such as Linux or Microsoft Windows, and data is exchanged using message passing (e.g. MPI) or remote invocation (e.g. RPC, RMI) strategies. As each node in a cluster is running its own OS with different configurations, the migration of processes is difficult, as it is unknown which libraries and resources will be available on the next node. Additionally, if a process is migrated to another node, the entire context including relevant parts of the kernel state must be saved and transferred. Because these OSs are not designed for cluster operation it is difficult to migrate kernel contexts [Smile], and as a consequence cluster systems typically redirect calls of migrated processes back to the home node, e.g. Mosix [Mosix].

Plurix is an OS specifically tailored for cluster operation and avoids these difficulties. The Distributed Shared Memory (DSM) offers an elegant solution for distributing and sharing data among loosely coupled nodes [Keedy], [Li]. Applications running on top of the Plurix DSM are unaware of the physical location of objects. A reference can either point to a local or to a remote memory block. During program execution the OS detects a remote memory access and automatically fetches the desired memory block. Plurix extends the DSM to a distributed heap storage, providing the benefit that not only data but also the code segments of the programs are available on each node, as they are shared in the DSM.

One of our major research goals is to simplify the development of distributed applications. Typically, DSM systems use weak consistency models to guarantee the integrity of shared data. This makes the development of applications hard, as each programmer must explicitly manage the consistency of the data by using the offered synchronization mechanism [TreadMarks]. Plurix uses a strong consistency model, called transactional consistency [Wende02], relieving the programmer from explicit consistency management.


Single-System-Image (SSI) computing architectures have been the mainstay of high performance computing for many years. In a system implementing the SSI concept, each user gains a global and uniform view of available resources and programs. The system provides the same libraries and services on each node in the cluster, which is very important for load balancing and migration of processes. We extend the SSI concept by storing the OS, kernel, and all drivers in the DSM. As a consequence we can implement a type-safe kernel interface and at the same time simplify checkpointing and recovery.

In 1990 Fuchs introduced checkpointing and recovery for DSM systems [Fuchs90]. Numerous subsequent papers discuss the adaptation of checkpointing strategies designed for message-passing systems, ranging from globally coordinated solutions to independent checkpointing with and without logging [Morin97]. However, the more sophisticated solutions have not been evaluated in real implementations, because checkpointing is difficult to achieve in PC clusters even under global coordination. If a checkpoint needs to be saved, it is not sufficient to save the DSM context; the local kernel context needs to be saved as well - which is not trivial. Plurix avoids these drawbacks by storing OS and applications in the DSM.

The remainder of the paper is organized as follows. The design of Plurix is briefly presented in section 2. We then describe the advantages of a type-safe kernel interface. In the sequel we describe the benefits of running the kernel within the DSM. Extending the SSI provides additional advantages for checkpointing, which are described in section 5. Finally, we present measurements and give an outlook on future work.

2 Design of Plurix

Plurix implements SSI properties at the operating system level, using a page-based distributed shared memory. According to the SSI concept all programs and libraries must be available on all nodes in the cluster. Therefore Plurix uses a global address space shared by all nodes and organized as a distributed heap storage (DHS) containing both data and code. Sharing the programs in the DHS reduces redundancy concerning code segments and makes the administration of the system easier.

2.1 Java-based Kernel and Operating System

Plurix is entirely written in Java and works in a fully object oriented fashion. The development of an operating system requires access to device registers, which is not possible in standard Java. For this reason we have developed our own Plurix Java Compiler (PJC) with language extensions to support hardware-level programming. The compiler directly generates Intel machine instructions and initializes runtime structures and code segments in the heap. Traditional object-, symbol-, library- and exe-files are avoided. Each new program is compiled directly into the DHS and is thereby immediately available at each node.

Plurix is designed as a lean and high speed OS and is therefore able to start quickly. The start time of the primary node, which creates a new heap (installation of Plurix) or restarts a preexisting heap from the PageServer (see Section 5.1), is less than one second. Additional nodes, which only have to join the existing heap, can be started in approximately 250 ms. This quick boot function of Plurix helps guarantee fast node and cluster start-up times, which helps avoid long downtimes in case of critical errors.

2.2 Distributed Shared Memory

The transfer of DHS-objects from one cluster node to another is managed within the page-based distributed shared memory (DSM), and takes advantage of the Memory Management Unit (MMU) hardware. The MMU detects page faults, which are raised if a node requests an object on a page which is not locally present. Each page fault results in a separate network packet which contains the address of the missing page (PageRequest). This packet is broadcast to all nodes in the cluster (Fast Ethernet LAN) and only the current owner of the page sends it to the requesting node.

An important topic in distributed systems is the consistency of shared and replicated objects. In Plurix this is synonymous with the consistency of the DSM. Plurix offers a new consistency model, called transactional consistency, which is described in the following section.


2.3 Consistency and Restartability

Unlike traditional systems, Plurix does not burden the programmer with the consistency of the DHS-objects. All actions in Plurix are encapsulated in transactions. At the start of a transaction, write access to pages is prohibited. If a page is written, the system creates a shadow image of it and then enables write access. Additionally, the system logs the pages for which shadow images were created. At the end of a transaction (commit phase) the addresses of all modified pages are broadcast and all partner nodes in the cluster invalidate these pages. If there is a collision with a running transaction on another node, that transaction is aborted and eventually restarted.

In case of an abort, all modified pages are discarded. Since there is a shadow image for each modified page, the system can reconstruct the state of the node at the time just before the transaction was started. A token mechanism is used to ensure that only one node is in the commit phase at a time. The token is passed using a first-wins strategy. To improve fairness, further commit strategies will be developed.
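The shadow-image discipline described here can be condensed into a small control loop; the following Java sketch is our own schematic of that loop (the ShadowStore and CollisionAbort names are invented) and omits the network protocol entirely.

```java
// Schematic of Plurix-style restartable transactions with shadow paging.
// Helper names (ShadowStore, CollisionAbort, tryCommit) are illustrative only.
public class TransactionLoop {

    public void run(Runnable transaction, ShadowStore shadows) {
        while (true) {
            shadows.clear();              // no shadow images yet; pages write-protected
            try {
                transaction.run();        // first write to a page traps, creates a
                                          // shadow image, then enables write access
                if (tryCommit(shadows)) { // token holder broadcasts modified page
                    return;               // addresses; peers invalidate their copies
                }
            } catch (CollisionAbort e) {
                // a remote commit invalidated one of our pages
            }
            shadows.restoreAll();         // roll back to the pre-transaction state
        }                                 // and restart the transaction
    }

    private boolean tryCommit(ShadowStore s) { /* acquire token, broadcast */ return true; }

    public interface ShadowStore { void clear(); void restoreAll(); }
    static class CollisionAbort extends RuntimeException { }
}
```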

2.4 False Sharing and Backchain

All page-based DSM systems suffer from the notorious false-sharing syndrome. False-sharing occurs if two or more nodes access separate objects which nevertheless reside on the same page. If a node writes to such an object, all other nodes are forced to abort their current transaction and restart it later. As these objects are not shared, such an abort is semantically unjustified and unnecessarily slows down the entire cluster. To handle this problem, relocation of DSM-objects from one physical page to another is required. When an object is relocated, all pointers to this object are adjusted. Due to the substantial network latency in the cluster environment, it is not possible to inspect each object to see whether it contains a pointer to the relocated object. To adjust the affected references, Plurix uses the Backchain [Traub]. This concept links together all references to an object by recording the addresses of these pointers (see fig. 1). All references to a relocated DSM-object are found in the Backchain. To reduce invalidations of remote objects when a new Backchain entry is inserted, references on the stack are not tracked in the Backchain.
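One way to picture the Backchain is as an intrusive list threaded through the pointer slots that reference an object; the sketch below is our simplified model (real Plurix records raw slot addresses in the heap, not Java references).

```java
// Simplified model of the Backchain: every object knows the locations of
// all pointer slots that reference it, so relocation can patch them.
import java.util.ArrayList;
import java.util.List;

public class BackchainModel {

    static class Slot { Obj target; }        // models one pointer field in the heap

    static class Obj {
        final List<Slot> backchain = new ArrayList<>(); // slots pointing at us
    }

    static void setPointer(Slot slot, Obj target) {
        if (slot.target != null) slot.target.backchain.remove(slot);
        slot.target = target;
        target.backchain.add(slot);          // record the referencing slot
    }

    // Relocation: since all referencing slots are on the backchain,
    // they can be adjusted without scanning the whole heap.
    static void relocate(Obj obj, Obj copyAtNewAddress) {
        for (Slot s : obj.backchain) s.target = copyAtNewAddress;
        copyAtNewAddress.backchain.addAll(obj.backchain);
        obj.backchain.clear();
    }
}
```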

2.5 Garbage Collection

The previously described Backchain concept can also be used to simplify distributed garbage collection (GC). A Mark-and-Sweep algorithm should not be used in a DHS environment, because it is either very difficult to implement (incremental Mark-and-Sweep) or it would stop the entire cluster while collecting garbage. Copying GC algorithms would unduly reduce the available address space - only reference counting algorithms appear feasible. The Backchain can be used as a reference counter: if an object has an empty Backchain, no references to this object remain. This is equivalent to a reference counter of 0, so in this case the object is garbage and can be deleted. Because stack references are not included in the Backchain, the GC may only run when the stack is empty. Between two transactions this condition is always true, so the GC task can be run as a regular Plurix transaction.
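Continuing the model above, the reference-count reading of the Backchain amounts to a one-line garbage test, run only between transactions when no stack references exist; again, the names are ours.

```java
// Backchain-as-reference-counter: an empty backchain means the object is
// unreachable (stacks are empty between transactions), i.e. garbage.
public class BackchainGc {

    static boolean isGarbage(BackchainModel.Obj obj) {
        return obj.backchain.isEmpty();   // reference count == 0
    }

    // Run as a regular Plurix transaction between user transactions.
    static void sweep(Iterable<BackchainModel.Obj> heap,
                      java.util.function.Consumer<BackchainModel.Obj> free) {
        for (BackchainModel.Obj o : heap) {
            if (isGarbage(o)) free.accept(o);   // reclaim the object's memory
        }
    }
}
```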

Figure 1 The Backchain Concept

3 A Type-Safe Interface for a DSM-Kernel

The SSI concept requires that all nodes in the cluster have the same programs installed. In a distributed environment the easiest way to achieve this goal is to share not only the data but also the code of the programs; for this reason Plurix extends the DSM to the DHS. In this case it is mandatory to protect the code segments from unwanted modification, either by corrupted pointers or by malicious attacks. This can be achieved by using a type-safe language like Java. Language-based OS development has been successfully demonstrated by the Oberon system [Wirth]. The requirement for type safety in the DSM also affects the interface to the OS. As data in the DSM is represented by objects, and these data must be passed to the kernel, either the objects must be serialized before they are used as parameters or the kernel must be able to accept objects.

3.1 Traditional Kernel Interfaces

Traditionally, distributed systems are implemented as a middleware layer on top of a local OS, such as Linux or Mach, which are mostly written in C and therefore do not provide objects. The communication between the distributed system and the local OS takes place using primitive data types or structures. If the kernel cannot handle objects as such, they are serialized (and data items are copied) before being passed to the kernel. This kind of raw communication does not provide type checks for parameters and signatures by the runtime environment. Hence no type-safe calls of kernel methods are possible, and each kernel method has to check its parameters explicitly to avoid runtime errors.

3.2 Benefits of a Type-Safe Kernel Interface

To reduce programming complexity and to increase system performance we recommend passing typed objects to the kernel. This was part of the motivation to create Plurix as a stand-alone OS, not as a middleware layer. Since the kernel of Plurix is written in Java and easily handles objects, type-safe communication between the DSM applications and the OS is natural. All Java types and objects can be handed to the kernel methods. The programmer has no need to pay attention to the type of the passed object, because this is checked by the compiler and in some cases by the runtime environment. Furthermore, there is no reason to serialize objects which are used as parameters for kernel methods, so the performance of the entire system increases.

Another benefit of using objects as parameters is that the object (and the data it contains) need not be copied. The kernel method obtains a reference and accesses the object directly, which increases system performance further.
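The contrast between the two interface styles can be made concrete in a few lines of Java; both the kernel methods and the buffer type below are invented for illustration, not taken from Plurix.

```java
// Illustrative contrast (all names invented): a type-safe kernel method
// versus the traditional untyped, check-by-hand style of kernel entry.
public class KernelInterfaceExample {

    static class NetworkBuffer {       // a typed DSM object used as a parameter
        byte[] payload;
        int length;
    }

    // Type-safe style: wrong argument types are rejected at compile time;
    // no serialization or copying, the kernel accesses the object directly.
    static void kernelSendTyped(NetworkBuffer buf) {
        transmit(buf.payload, buf.length);
    }

    // Traditional style: a generic parameter must be checked and decoded
    // by hand inside every kernel method, at run time.
    static void kernelSendUntyped(Object param) {
        if (!(param instanceof NetworkBuffer buf))
            throw new IllegalArgumentException("expected NetworkBuffer");
        transmit(buf.payload, buf.length);
    }

    private static void transmit(byte[] data, int length) {
        /* device-specific transmission */
    }
}
```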

3.3 Inter Address Space Pointers

In traditional systems there are at least two different address spaces: one for the kernel and at least one for user applications. As the kernel methods are always needed on each node, the straightforward way of implementing the system would be to place the kernel in the local address space. These local addresses are not shared with other nodes, and each node in the cluster can use them in different ways. In this case a separation between the kernel and user address space would mean differentiating between the local (Non-DSM) and the DSM address space. If in such an environment objects are used as parameters, some references will point from the Non-DSM into the DSM address space. References which point from the Non-DSM into the DSM reduce the performance of the cluster, as they inhibit the relocation of objects, so that the avoidance of false-sharing and memory fragmentation is prevented. The reason is that the Backchain entries are no longer unambiguous when an object migrates to another node and is then relocated from one DSM address to another. If an object is referenced by a Non-DSM object, the Backchain leads into the local memory of the node. As addresses in the local memory are not unique, the pointer cannot be adjusted, as it is not possible to detect which local memory area is specified by this Backchain entry. The correct reference to this object cannot be found, and an adjustment of the memory location which is specified by the Backchain would lead to invalid pointers or even destroyed code segments (see figure 2).


As long as DSM-objects are relocatable, references from the Non-DSM into the DSM address space are not possible, as they could lead to dangling pointers or destroyed code segments. To solve this problem it would be possible to prevent relocation of DSM-objects which are referenced by Non-DSM objects. But as it is not possible to specify which objects are used as parameters for kernel methods, nearly all objects in the DSM could then not be relocated, and the performance of the cluster would be impaired, because false sharing and fragmentation of the memory could not be handled. Therefore direct pointers from the Non-DSM into the DSM address space must be avoided.

Another interesting question is how kernel methods can be called from DSM applications. Two alternative methods are conceivable:

1. Software Interrupt: As in most traditional systems, kernel methods may be called using kernel- or system-calls. These are software interrupts which request a specific function from the kernel. If kernel-calls are used to communicate between the DSM applications and the operating system there are no "address space spanning" pointers, but the question arises how to pass parameters from the DSM to the kernel, as the software interrupt itself cannot accept parameters. One possible solution is to pass data to the kernel through a fixed address. If an object is used as a parameter, this address would contain the pointer to the object which should be used. As each kernel method requires different parameters, this object must be of a generic type, so that any object can be passed. Each kernel method would have to check whether the given object is type compatible with the expected one, as this could not be handled by the runtime environment. This raises the complexity for system programmers and makes the system vulnerable to faults, while simultaneously reducing the performance and the possibilities of parameter passing.

2. Object oriented invocation: Kernel methods are invoked in an object oriented fashion via direct pointers to the requested kernel class. This implies that all kernel classes and their methods have to reside at the same addresses on each node in the cluster. This is necessary as each application can only have one pointer to a kernel class. Should they reside at different addresses, these references would point to invalid addresses and the corresponding kernel methods could not be called correctly on some nodes (see figure 3). If direct pointers are used, each node in the cluster must run the same kernel, and such a kernel can never be changed during runtime. Otherwise, all pointers in the applications which reference kernel methods would require adjustment. To achieve this, all kernel methods would have to contain a Backchain which points from the Non-DSM into the DSM, and thereby the problems described above would occur.

Figure 2 Migration and subsequent relocation of a DSM-object (panels: DSM-object referenced by a Non-DSM-object; DSM-object migrated to another node; DSM-object relocated to another address)


Both techniques give rise to an additional problem. The compiler is running in the DSM and any new program is automatically created in the DSM. If the new program is a device driver (which typically resides in kernel space), the code segments must be transferred from the DSM into the Non-DSM address space, and this must occur simultaneously on each node.

Our implemented solution, which solves all the challenges above, is to remove the kernel from the local memory address space and move it into the DSM. Further benefits of this approach are described in the following section.

4 Extending the Single System Image Concept

We elaborate the SSI concept by moving the OS and the kernel into the DSM. The local memory is only used for a few state variables for the network device drivers and protocol, and for the so-called Smart-Buffers, which help to bridge the gap between non-restartable interrupts and transactions [Bindhammer].

4.1 Benefits of a kernel running in the DSM

If the kernel runs in the DSM, parameter passing between applications and kernel is elegant and all objects can be used as parameters. Kernel methods are called directly, as described in section 3.3. There are no references pointing from one address space to the other. Since all device drivers now reside in the DSM, even the problem of transferring newly compiled drivers from the DSM into the kernel space vanishes. Because the code segments of the kernel methods are in the DSM, redundancy is avoided. Further benefits of this concept, especially for system checkpointing, are described in Chapter 5.

Some interesting questions surfaced when moving the kernel into the DSM, but before we describe these topics and our corresponding solutions, we describe the memory management of Plurix and the allocation mechanism for new objects, as this is important for our solution.

4.2 Distributed Heap Management

A basic design topic of the Plurix system is the page-based DSM, raising the false sharing problem. The allocation strategy of the memory management must try to avoid false sharing wherever possible. Furthermore, collisions during the allocation of objects in the DHS must be avoided, as such a collision will abort other transactions and thereby serialize all allocations in the cluster. To achieve those goals, Plurix uses a two stage allocation concept consisting of allocator-objects and a central memory manager. The latter is needed, as the memory has to be portioned among the different nodes in the cluster. This division must not be static, as this would reduce the maximum size of the objects.

The memory manager is used to create allocators and large objects. As an allocator must be at least the same size as the new object to be created, the usage of allocators for large objects would lead to large allocators and thereby to a static fragmentation of the heap. The alternative would be to limit the size of the allocators and thereby the maximum size of the DHS-objects, which is unacceptable.

Allocator-objects represent a portion of empty memory. The size of an allocator is reduced for each allocated object. If it is exhausted, the allocator is discarded and a new one is requested from the central memory manager.

Figure 3 Invalid reference to kernel methods

4.2.1 Allocation of Objects

If a new object is requested, the memory management first decides whether the object is created by the corresponding allocator or by the memory manager. This decision depends on the size of the object. Each object which is greater than 4 KB is directly allocated by the memory manager. To avoid false sharing on these objects, their size is increased to a multiple of 4 KB (the page granularity of the 32-bit Intel architecture). As all objects which are allocated by the memory manager have a size that is a multiple of 4 KB, each object starts at a page border and consumes N pages. Therefore these objects do not co-reside with other objects on the same page.
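The size-based split could be expressed as in the sketch below; the class and method names are our own, while the 4 KB threshold and the page rounding follow the rule just described.

```java
// Illustrative two-stage allocation: small objects come from the node's
// private allocator, large ones from the central memory manager,
// rounded up to whole 4 KB pages so they never share a page.
public class TwoStageAllocator {
    static final int PAGE = 4096;

    long allocate(int sizeBytes) {
        if (sizeBytes < PAGE) {
            return nodeLocalAllocator(sizeBytes);      // clustered per node,
        }                                              // collision-free
        int rounded = ((sizeBytes + PAGE - 1) / PAGE) * PAGE; // page multiple
        return centralMemoryManager(rounded);          // page-aligned, N pages
    }

    long nodeLocalAllocator(int size)   { /* bump pointer in allocator */ return 0; }
    long centralMemoryManager(int size) { /* may collide; rare */ return 0; }
}
```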

Objects which are smaller than 4 KB are created by an allocator. As each node has its own allocator, collisions can only occur if a large object is allocated or if an allocator is exhausted and a new one must be created. The measurements in section 6 show that most objects in Plurix are smaller than 4 KB, so large objects are rarely allocated. The collisions which occur during these allocations are tolerable most of the time.

The benefit of the two level allocation of objects is that small objects from one node are clustered in the memory. As a consequence, collisions do not occur during the allocation of small objects and are rare when large objects are allocated. As large objects are not allocated within the allocator, its size can be limited without limiting the maximum size of the objects. No static division of the memory is needed and therefore no static fragmentation is created.

4.2.2 Reduction of False Sharing

Generally speaking, objects can be divided into two categories: Read-Only (RO) and Read-Write (RW) objects. False-sharing on RW-objects is reduced by the mechanism described above. To further reduce false-sharing it is reasonable to make sure that RO-objects, like code segments and class descriptors without static variables, do not co-reside with RW-objects on the same page, as this would lead to unnecessary invalidations of the RO-objects due to false-sharing. Code segments are only written during compilation by the compiler. If these objects were indiscriminately allocated, they could reside on the same pages as the RW-objects of the node which is currently running the compiler. To avoid this, Plurix provides additional allocators for RO-objects.

4.3 Protection of SysObjects

If the entire system is running in the DSM, some code segments and instances of classes must be protected against invalidation, as these objects are vital for the system. The objects which must always be present on a node are called SysObjects. These are, nota bene, all classes and instances concerning the Page-Fault handler, the DSM protocol and the network device drivers.

As these objects reside in the DSM, they might be affected by the transaction mechanism, and in case of a collision on such a page, the page would be discarded and the system would hang, as the node would no longer be able to request missing pages.

Figure 4 Allocation of objects

The protection of SysObjects against invalidation is easy to achieve, just by defining two additional allocators. SysObjects are either code segments or instances of SysClasses. As described above, code segments are only written during compilation; otherwise they are read only. Additionally, code segments should not co-reside on the same pages as RW-objects, as this would lead to false-sharing, and therefore a special allocator is used. The compiler will create the new kernel classes in a different memory area. Afterwards, update messages need to be sent to all nodes in the cluster to replace old classes and instances by new ones. To achieve this, it is sufficient to make sure that such an allocator is only used by the current compilation, and after that the remaining part of the last used page is consumed by a Dummy-SysObject.

RW-SysObjects are instances of SysClasses which are meaningless for all nodes except the one that created the instance. For this reason RW-SysObjects are not published through the global name service, and therefore no other node can access a RW-SysObject. The only case where a RW-SysObject could be invalidated is as a result of false-sharing. To prevent this, each node acquires a SysRW-Allocator during the boot phase. All instances of SysClasses are allocated in this private allocator, so that only SysObjects from one node reside on the same page.

These two additional allocators, and the described techniques for using them, are sufficient to protect all SysObjects against invalidation at run time.

4.4 Local memory for State Variables

State variables of the DSM protocol and the network device drivers must outlast the abort mechanism, as these variables are needed to handle the abort itself. If they were reset, the current state of the protocol and the network adapter would be lost: the network device driver would never be able to receive the next packet, as the receive-buffer pointer would also be reset. The protocol also contains a sequence number for the messages, to make sure that no vitally important message is lost. If the state variables were reset, the protocol would receive messages from the future. In this case it would not be possible to decide whether this number is invalid due to an abort or whether the node has missed important network packets.

As the protocol is not a device driver, its current state variables cannot be read from hardware registers, as is possible for normal (non-network) device drivers. Hence these variables must be stored outside the DSM address space. For device drivers and the protocol, the kernel provides special memory areas in the local memory in which state variables are stored. To access these areas, Plurix provides "structs", allowing raw memory to be addressed much like the variables in an object. "structs" are also used to access the memory mapped registers of devices. As structs may not contain pointers and are not referenced by pointers, no problems with address space spanning pointers arise.

4.5 Restart of device drivers

In case of an abort, the state of the entire node is reset to the state just before the current transaction was started. Devices cannot be automatically reset, and the device driver programmer must implement an Undo-method, which is called by the system in case of an abort. This method has to ensure that both the state of the hardware and that of the state variables in the device driver object are reset. To make this possible, the state of all devices before the transaction must be conserved.

An example of such an Undo-method is the one for graphics controller devices. In this case the current On-screen and Off-screen memory areas on the display adapter must be reset. Since between two transactions the On-screen and Off-screen areas contain the same data, it is sufficient to reset the Off-screen memory and afterwards copy its contents to the On-screen area. This is easy to implement, as most graphics controllers contain substantial amounts of memory for textures and vertices; a small part of this memory can be used to save the committed state of the graphics controller. After the commit phase, the current On-screen area is copied into this separate memory area and can be restored if necessary.
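A minimal sketch of such an Undo-method for a display driver follows; the driver interface and the VRAM sizes are assumptions, since the paper only prescribes that an Undo-method exists and restores the committed device state.

/** Hypothetical commit/undo hooks for a transactional device driver. */
interface TransactionalDriver {
    void commit(); // conserve the device state after a successful commit
    void undo();   // restore the last committed device state on abort
}

class DisplayDriver implements TransactionalDriver {
    // Stand-ins for On-screen, Off-screen and spare VRAM areas.
    private final byte[] onScreen  = new byte[640 * 480];
    private final byte[] offScreen = new byte[640 * 480];
    private final byte[] committed = new byte[640 * 480]; // otherwise unused texture memory

    /** After the commit phase, save the now-valid screen contents to spare VRAM. */
    public void commit() {
        System.arraycopy(onScreen, 0, committed, 0, onScreen.length);
    }

    /** On abort, restore the Off-screen area and copy it back On-screen. */
    public void undo() {
        System.arraycopy(committed, 0, offScreen, 0, committed.length);
        System.arraycopy(offScreen, 0, onScreen, 0, offScreen.length);
    }
}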

The serial-line controller is more difficult to handle. This controller sends data as soon as it receives it from the system, and in case of an abort it is not possible to "undo" data that has already been sent. There are two possible solutions to this problem: either the affected application must be able to handle duplicated data, or the driver has to use smartbuffers. Data in this special buffer type is invisible to the device until the commit phase, so the device can only ever access committed, and therewith valid, data.
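A sketch of the smartbuffer idea, again with hypothetical names: data written during a transaction is staged locally and handed to the serial device only after the commit, so an abort simply discards it and no hardware undo is needed.

import java.util.ArrayDeque;
import java.util.Queue;

/** Stages outgoing data so the device sees only committed bytes. */
class SmartBuffer {
    private final Queue<byte[]> staged = new ArrayDeque<>();

    /** Called by the application inside the transaction. */
    void write(byte[] data) {
        staged.add(data.clone()); // invisible to the device for now
    }

    /** Called after a successful commit: only now may the device send. */
    void commit(SerialPort port) {
        for (byte[] chunk : staged) port.send(chunk);
        staged.clear();
    }

    /** Called on abort: nothing was sent, so nothing must be undone. */
    void abort() {
        staged.clear();
    }
}

interface SerialPort { void send(byte[] data); }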


5 Checkpointing and Recovery

State-of-the-art PC clusters are built using Linux or Microsoft Windows, but implementing checkpointing and recovery in these operating systems is difficult because it is not sufficient to save the process context; the local kernel context must be captured as well. The latter includes internal data structures, open files, used sockets, pending page requests, and so on, which can be read only at kernel level. Resetting the kernel and process context in case of a rollback is also challenging because of the complex OS architectures. As a consequence, taking a checkpoint is time consuming and checkpointing intervals are quite large, e.g. 15-120 minutes for the IBM LoadLeveler.

By extending the Single System Image concept we avoid these drawbacks. Storing the kernel and its contexts in the DSM makes it easy to save this data. Rollback in case of an error is no problem in Plurix because the OS and all applications are designed to be restartable anyway.

5.1 Current Implementation

A central fault-tolerant PageServer stores consistent heap images on disk in an incremental fashion. Between two checkpoints the PageServer uses a bus-snooping protocol to intercept transmitted and invalidated memory pages, which reduces the amount of data that must be retrieved from the cluster at the next checkpoint. When a checkpoint is to be saved, the cluster is stopped and the PageServer collects the invalidated pages that have not been transmitted since the last checkpoint. All memory pages are written to disk synchronously. We have implemented a highly optimized disc driver that is able to write about 45 MB/s. An early performance evaluation of our PageServer can be found in section 6.
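The bookkeeping behind this incremental scheme can be sketched as follows; the class and method names are assumptions, as the paper does not detail the PageServer interface.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Incremental checkpoint bookkeeping: pages snooped from the network
 * between checkpoints need not be fetched again; only pages that were
 * invalidated but never observed are collected while the cluster is
 * stopped.
 */
class PageServer {
    private final Map<Long, byte[]> captured = new HashMap<>(); // page no. -> last contents seen
    private final Set<Long> missing = new HashSet<>();          // invalidated, not yet seen

    /** Bus snooping: record every page transmission we overhear. */
    void onPageTransmitted(long pageNo, byte[] contents) {
        captured.put(pageNo, contents.clone());
        missing.remove(pageNo); // latest contents already on the server
    }

    /** Bus snooping: remember pages that were invalidated. */
    void onPageInvalidated(long pageNo) {
        missing.add(pageNo);
    }

    /** At checkpoint time only these pages must be requested from the cluster. */
    Set<Long> pagesToCollect() {
        return new HashSet<>(missing);
    }
}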

Because the kernel and its context reside in the DSM, we do not need to save node-local data. Furthermore, we have no long-running processes or threads with preemptive multitasking that need to be checkpointed. Currently, we use a cooperative multitasking scheme for executing short transactions: a transaction is executed by a command or is called periodically from the scheduler. Long-running computations have to be divided into sub-transactions manually (see the sketch below). In case of an error, a node can reboot and fetch the required memory pages, as of the last checkpoint, from the DSM again.
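As an illustration of the cooperative scheme, the following hypothetical sketch splits a long computation into bounded sub-transactions; the progress counters live in DSM objects and are therefore committed after every step, so a restart resumes from the last committed position.

/** A long computation broken into short, individually committed steps. */
class MatrixSumTask {
    private final int[] data = new int[1_000_000];
    private int pos = 0;   // progress, committed after each sub-transaction
    private long sum = 0;  // partial result, committed likewise

    /**
     * One sub-transaction: process a bounded chunk, then return to the
     * scheduler so the transaction can commit and other tasks may run.
     * Returns true while work remains.
     */
    boolean step() {
        int end = Math.min(pos + 10_000, data.length);
        for (; pos < end; pos++) sum += data[pos];
        return pos < data.length;
    }
}

The scheduler would invoke step() periodically, committing after each call; an abort rolls pos and sum back to their last committed values, and the chunk is simply re-executed.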

5.2 Fault-Tolerance

We support clusters running within a single Fast Ethernet LAN and assume fail-stop behavior of nodes. Most DSM systems use a reliable multicast or broadcast facility to avoid inconsistencies caused by lost network packets. Because of the low error probability of a LAN, we are not willing to impose the overhead of reliable communication during normal operation. Instead we rely on fast error detection, fast recovery, and the quick-boot option of our cluster OS.

As described in 2.x, our DSM implements transactional consistency, and committing transactions are serialized using a token. We introduce a logical global time (a 64-bit value) that is incremented each time a transaction commits. On each commit the new time is broadcast to the cluster, and each node updates its time variable; a node can therefore immediately detect that it has missed a commit and ask for recovery. If the commit message cannot be sent within one Ethernet frame, the commit number is incremented for each commit packet, so we also avoid inconsistencies if a node misses a single packet of a multi-packet commit. Furthermore, every page or token request includes the global time value of the requesting node. If such a request contains an out-of-date commit number it is not processed; instead, recovery is started. Thus a node that has missed a commit is not able to commit a transaction, because it is not granted the token.
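A sketch of this validity check, under assumed names: each node tracks the logical global time, detects missed commits when commit packets arrive, and refuses to serve stale page or token requests.

/** Tracks the 64-bit logical global time on one node. */
class CommitClock {
    private long globalTime; // incremented once per commit packet

    /** Applied when a commit broadcast (or one of its packets) arrives. */
    synchronized void onCommitPacket(long announcedTime) {
        if (announcedTime > globalTime + 1) {
            startRecovery(); // we missed at least one commit packet
        }
        globalTime = Math.max(globalTime, announcedTime);
    }

    /** Page/token requests carrying a stale time are not processed. */
    synchronized boolean admitRequest(long requesterTime) {
        if (requesterTime < globalTime) {
            startRecovery(); // recovery is initiated instead of serving the request
            return false;
        }
        return true;
    }

    private void startRecovery() {
        System.out.println("missed commit detected - recovery started");
    }
}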

If a single node fails temporarily, it can reboot and join the DSM again. If the PageServer detects, during the next checkpoint, missing pages that were lost because of a node failure, the cluster is reset to the last checkpoint. If multiple nodes fail temporarily or permanently, the same error detection scheme works as well.

The network might be partitioned temporarily into two or more segments. The single token and the single PageServer are each available in at most one of these segments. Nodes within the segments keep sending page and token requests; if such a request cannot be satisfied, the segment tries to recover by contacting the PageServer. Only the segment containing the PageServer can recover; the others have to wait until the PageServer becomes reachable again.

We plan to implement a distributed version of our PageServer to avoid a bottleneck, and to replicate the data stored on the PageServers in order to tolerate PageServer failures as well. We also plan to introduce an asynchronous checkpointing scheme to avoid stopping the cluster during the checkpointing operation. Dependency tracking will also be investigated, so that only the affected nodes need to be restarted in case of a failure.

6 Measurements

The performance evaluation was carried out on three PCs interconnected by a Fast Ethernet hub. Each node is equipped with an RTL8139 network card and an ATI Radeon graphics adapter. Only the first machine (with the Celeron CPU) features a hard disk (IBM 120 GB, maximum disc write throughput without network 45 MB/s) and acts as PageServer.

Table 1. Node configuration

Node  CPU                        RAM
1     Celeron 1.8 GHz            256 MB DDR RAM at 266 MHz
2     Athlon XP2.2+ at 1.8 GHz   256 MB DDR RAM at 333 MHz
3     Athlon XP2.0+ at 1.66 GHz  256 MB DDR RAM at 333 MHz

6.1 General System Measurements

We have measured the startup time of the cluster nodes described above. The results are split into the time the kernel itself needs and the time needed to detect and start hardware such as the hard disk, mouse and keyboard. The nodes were started with and without a hard disc; the difference of about 540 ms is the time spent waiting for the hard disc to answer.

Table 2. Startup times (in ms)

Node  Startup as Master  Kernel time  Startup as Slave  Kernel time
1     791                254          240               234
2     780                248          238               233
3     792                254          239               234

The kernel allocates 2787 objects when running as master and 518 objects when running as slave. It takes approximately 3 microseconds to allocate an object and an additional 0.5 microseconds to assign a pointer to an object. To get the kernel from the DHS, a slave node must request 284 pages.

To show the correlation between heap size, heap spreading and the time to save a checkpoint, ten measurements were made. Comparing several measurements is necessary to draw conclusions about the speed of the hard disk, the performance of the implemented software and the latency caused by the network. The following table gives, for each measurement, the configuration (single station or cluster) and the heap spreading. The PageServer creates consistent images of the complete heap containing both user data (node 1 - node 3) and the operating system; the latter is included in "saved data".

Table 3. Checkpoint measurements

#    Nodes    Node 1  Node 2  Node 3  Saved data  Time to save to disc  Throughput (resulting disc write bandwidth)
1    1        20 MB   -       -       21.4 MB     1639 ms               13.7 MB/s
2    1        40 MB   -       -       42.5 MB     2491 ms               17.5 MB/s
3    1        60 MB   -       -       63.0 MB     3371 ms               19.1 MB/s
4    1        80 MB   -       -       83.4 MB     4321 ms               19.7 MB/s
5    1, 2, 3  60 MB   0 MB    0 MB    63.1 MB     3422 ms               18.9 MB/s
6    1, 2, 3  20 MB   20 MB   20 MB   63.1 MB     4476 ms               14.4 MB/s
7    1, 2, 3  0 MB    28 MB   32 MB   63.1 MB     4971 ms               13.0 MB/s
8    1, 2, 3  40 MB   40 MB   40 MB   124.6 MB    8049 ms               15.8 MB/s
9    1, 2, 3  48 MB   48 MB   48 MB   149.1 MB    9540 ms               16.0 MB/s
10   1, 2, 3  60 MB   60 MB   60 MB   186.0 MB    11707 ms              16.3 MB/s

Comparing measurements 1-4, we see that throughput increases with the amount of data saved. Measurements 3 and 5-7 save the same amount of data, so the decreased throughput is attributable to network latency. Comparing measurements 6 and 8-10 confirms a nearly constant throughput; the slight improvement with increasing data size is due to the faster saving of local data.

7 Experiences and Future Work

Moving the kernel into the DHS, and thereby extending the SSI concept, made it possible to create a type-safe kernel interface and to solve the problem of address-space-spanning pointers. Additionally, checkpointing is made much easier, and the question of how kernel methods should be called is answered.

The current version of Plurix runs stably in the cluster environment, without collisions during allocation. The allocator strategy inhibits false sharing as long as no applications that share objects are running. As soon as objects are created by an application and shared with other nodes, the allocation mechanism can no longer prevent false sharing, but we are working on a monitoring tool to detect it. Relocation of objects to dissolve false sharing is already available.

Plurix uses a distributed garbage collection algorithm which is able to detect and collect garbage (including cyclic garbage) without stopping the cluster. The detection algorithm for cyclic garbage works error-free, but currently there is no information about which objects might be cyclic garbage, so every object in the DHS must be checked.

The consistency of the DHS is ensured by the PageServer, which uses a linear segment technique to save all changed pages. This covers the data and code objects of user applications as well as those of the OS. In the current implementation, the speed of saving the complete heap is limited by network throughput, not by the OS or the hard disc. It is therefore desirable to save the state of the cluster continuously, which could be achieved by some minor changes to the mechanism for detecting missing pages.

8 References

[Mosix] A. Barak and O. La'adan, "The MOSIX Multicomputer Operating System for High Performance Cluster Computing", Journal of Future Generation Computer Systems, Vol. 13, No. 4-5, pp. 361-372, March 1998.
[Wirth] N. Wirth and J. Gutknecht, "Project Oberon", Addison-Wesley, 1992.
[Traub] S. Traub, "Speicherverwaltung und Kollisionsbehandlung in transaktionsbasierten verteilten Betriebssystemen" (Memory Management and Collision Handling in Transaction-Based Distributed Operating Systems), PhD thesis, University of Ulm, 1996.
[TreadMarks] C. Amza, A. L. Cox, S. Dwarkadas, and P. Keleher, "TreadMarks: Shared Memory Computing on Networks of Workstations", Proceedings of the Winter 1994 Usenix Conference, pp. 115-131, January 1994.
[Fuchs90] K.-L. Wu and W. K. Fuchs, "Recoverable Distributed Shared Virtual Memory", IEEE Transactions on Computers, Vol. 39, No. 4, pp. 460-469, April 1990.
[Morin97] C. Morin and I. Puaut, "A Survey of Recoverable Distributed Shared Virtual Memory Systems", IEEE Transactions on Parallel and Distributed Systems, Vol. 8, No. 9, September 1997.
[Keedy] J. L. Keedy and D. A. Abramson, "Implementing a Large Virtual Memory in a Distributed Computing System", Proceedings of the 18th Annual Hawaii International Conference on System Sciences, 1985.
[Li] K. Li, "IVY: A Shared Virtual Memory System for Parallel Computing", Proceedings of the International Conference on Parallel Processing, 1988.
[Wende02] M. Wende, M. Schoettner, R. Goeckelmann, T. Bindhammer, and P. Schulthess, "Optimistic Synchronization and Transactional Consistency", Proceedings of the 4th International Workshop on Software Distributed Shared Memory, Berlin, Germany, 2002.
[Bindhammer] T. Bindhammer, R. Göckelmann, O. Marquardt, M. Schöttner, M. Wende, and P. Schulthess, "Device Programming in a Transactional DSM Operating System", Proceedings of the Asia-Pacific Computer Systems Architecture Conference, Melbourne, Australia, 2002.
[Simle] The SMiLE project, http://os.inf.tu-dresden.de/SMiLE/
