Client-independent Checkpoint/Restart of L4Re-server-applications

Johannes Kulik

12. November 2015


Faculty of Computer Science
Institute of Systems Architecture

Operating Systems Group

Client-independent Checkpoint/Restart of L4Re-server-applications

Johannes Kulik
Matriculation number: 3296089

1st Reviewer: Prof. Dr. Hermann Härtig, TU Dresden

2nd Reviewer: Dr.-Ing. Michael Roitzsch, TU Dresden

Supervisors: M.Sc. Tobias Stumpf and Dipl.-Inf. Jan Bierbaum

12. November 2015


Johannes Kulik
Client-independent Checkpoint/Restart of L4Re-server-applications
Introducing PointStart, 12. November 2015
Reviewers: Prof. Dr. Hermann Härtig and Dr.-Ing. Michael Roitzsch
Supervisors: M.Sc. Tobias Stumpf and Dipl.-Inf. Jan Bierbaum

TU Dresden
Operating Systems Group
Institute of Systems Architecture
Faculty of Computer Science
Noethnitzer Strasse 46
01187 Dresden


Abstract

The Fiasco.OC/L4Re system depends on multiple user-level servers to provide system services. These servers interact with clients via IPC messages, making them independent parts that can be restarted individually. They are therefore a suitable target for Checkpoint/Restart (C/R) to harden the system against transient and permanent hardware errors. Since current C/R systems lack portability, this work introduces PointStart, a new C/R system for Fiasco.OC/L4Re, which transparently handles L4Re-server-applications while being client-independent. Its performance is evaluated with benchmarks and its applicability is demonstrated on a real L4Re-server-application, the console multiplexer cons.


Declaration

I declare that I have written this thesis independently, indicating all quotations and using only the stated literature and resources.

Dresden, 12. November 2015

Johannes Kulik


Contents

1 Introduction
2 Foundations
  2.1 Fiasco.OC and L4Re
    2.1.1 Memory Management
    2.1.2 IPC
    2.1.3 Flexpages
    2.1.4 Dataspaces
    2.1.5 Capabilities
  2.2 C/R Systems
    2.2.1 Properties of C/R Systems
3 Comparison
  3.1 Examples for C/R for System Services
  3.2 Examples for C/R in Capability-Based Systems
  3.3 Review of C/R Systems
4 Design
  4.1 State and its Sources in Fiasco.OC/L4Re
    4.1.1 Capabilities
    4.1.2 Memory
    4.1.3 Thread State
  4.2 Components
  4.3 Procedures of Operation
    4.3.1 Startup
    4.3.2 Checkpoint
    4.3.3 Restart
  4.4 Security Considerations
  4.5 Special Cases/Restrictions
5 Implementation
  5.1 Task::cap_list()
  5.2 Startup
  5.3 Checkpointing
  5.4 ReLoader's Main Loop
  5.5 Factory Proxy
  5.6 Message Logging
    5.6.1 Kernel-Level
    5.6.2 User-Level
6 Evaluation
  6.1 Setup
    6.1.1 Benchmarks
    6.1.2 Procedure
  6.2 Results
    6.2.1 Checkpoint
    6.2.2 Restart
    6.2.3 cons
    6.2.4 Enhancement Process
7 Conclusion and Outlook
Bibliography


1 Introduction

Hardware errors, and especially bit flips, are a source of instability in a system. In the field of fault tolerance, different techniques have been developed to cope with these errors. While executing multiple instances of the same application in parallel can be used to detect and correct errors [Döb+12], there are costs in performance or system size. In distributed systems, complex algorithms like consensus need to be used to compute correct results. It has been proven that 2n + 1 executions are necessary to cope with n faults and achieve correct results [Sch90], while as few as 2 executions would be needed just to detect errors. To reduce the overhead of parallel execution, other solutions might be feasible. The simplest would be to restart the failing application or the whole system, losing all computed state. Micro-reboots [Can+04], which restart only the failing process or thread, preserve more state if the failing application can be divided into small parts. Going a little further than micro-reboots is C/R, where the current state of a system or application is saved so it can be restored after a restart. While checkpointing can be used for fault tolerance, it also has other applications such as process migration, reversible execution or backtracking [All94].

In a system running the Fiasco.OC microkernel with L4Re as the user-level runtime environment, most system services are implemented in user-level servers. Although these servers are independent parts and not implemented at kernel-level, they can be vital to the operation of the system. Therefore, the stability of the system could be improved by a C/R system capable of handling L4Re-server-applications. While there has already been an approach to C/R on Fiasco.OC/L4Re [Vog+10], that approach involved the clients of a checkpointed server in the C/R process. This work focuses on a client-independent approach, thereby removing the need to change and recompile clients.

In this work, I first give an overview of Fiasco.OC/L4Re and the basic properties of C/R systems. Thereafter, I evaluate multiple C/R systems for their portability to Fiasco.OC/L4Re. Since none of these systems completely matches the required properties, I designed PointStart, a new C/R system for Fiasco.OC/L4Re. After presenting its design and implementation, I evaluate its performance and finish with a conclusion and outlook.


2 Foundations

This chapter will give the reader a basic understanding of the technology this work is based upon.

2.1 Fiasco.OC and L4Re

Fiasco.OC is a microkernel. Therefore, only basic features of the system run in the privileged kernel-level. These features include message passing (IPC) and scheduling. Features like hardware drivers and memory management are implemented in user-level, using the kernel's message passing facilities to interact. While monolithic kernels still dominate desktop and server environments, the microkernel concept has an advantage regarding security [Tan+06], especially because of the small TCB (Trusted Computing Base). The TCB includes all components an application has to trust.
With only the bare features available in the kernel itself, other features need to be implemented in user-level. To ease the development of applications, an abstraction layer providing libraries and commonly used services is usually placed on top of a microkernel; for Fiasco.OC this usually is L4Re.
During the startup of a system using L4Re, Moe is used as the root Task, providing a memory allocator, a log, an object factory and a pager as services. Moe then usually starts Ned as the init process. Using Ned's Lua-based configuration file, multiple other applications can be started.

Figure 2.1: Application’s address space with multiple memory sources


2.1.1 Memory Management

In L4Re, memory is managed in a hierarchy of pagers, each pager being an independent server. The head of the hierarchy is called the root pager and owns all memory; for Fiasco.OC this is Sigma0. While applications can interact directly with Sigma0, L4Re provides its own memory allocation service, which requests memory from Sigma0 and can be used through Moe. This adds an abstraction layer to memory allocation and makes it possible to, e.g., provide copy-on-write support.
While there is a hierarchy of pagers, an application is not limited to a single parent pager. For example, one pager could provide the application with code allocated by the init process, another a memory-mapped file allocated by a file-system service, and hardware memory could be mapped directly into the process's address space (fig. 2.1). In this case, the application's pager has to keep track of which memory region is provided by which parent pager. These "pager proxies" are called region managers in L4Re. While directly using Moe's pager service as the region manager of an application is possible, usually every application is spawned with its own region manager by starting it on top of l4re_kernel.

2.1.2 IPC

With the functionality in a microkernel environment being provided by different servers, client-server interaction is an important part of the system. To enable this interaction, Fiasco.OC provides the ability to send IPC messages between threads. A message can be sent directly to a thread or through a kernel object called an IPC-Gate, to which a thread has to be bound in order to receive the message.
Sending and receiving messages is done synchronously, which means both parties have to rendezvous to exchange a message. This also means a thread cannot send multiple messages to another thread asynchronously, as there are no extra buffers in the kernel for keeping these messages. Only a part of kernel memory mapped into user-level is available for message passing. This memory is called the UTCB and limits a message's size.

Figure 2.2: IPC message passing between client and server (1: client writes its UTCB, 2a/2b: send and wait syscalls, 3: kernel copies the UTCB into the server's UTCB, 4: server reads its UTCB)

A thread trying to send a message will call into the kernel to try to deliver the message to the receiver (fig. 2.2). If the receiver is currently waiting for a message, the message will be delivered immediately. Otherwise, the sender will be added to the receiver's wait queue. Once the receiver tries to receive by calling into the kernel, it will pick the next sender from its wait queue to actively deliver its message and unblock the sending thread.


If the former sender wanted to perform an RPC (Remote Procedure Call), it will immediately block again, waiting for a message from its former receiver and thus becoming a receiver itself. This is the case because the client expects a reply to its request; from the server's perspective, the client is thus its caller.
Both send and receive operations are associated with timeouts that can be set by the sender or receiver from user-level. If the timeout is reached before a message could be sent or received, the IPC operation aborts with an error. Usually, an infinite timeout is set, as client and server depend on the operation.
A message consists of 64 message registers with a size of one machine word each. The first register contains the protocol for the interaction; in the case of region mapping this would be, e.g., L4Re::Rm::Protocol. The following message registers are then interpreted using that protocol. Additionally, there are 64 buffer registers for receiving flexpages, which I describe in the following section.
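To make this concrete, the following minimal sketch shows a client filling its UTCB message registers and performing a blocking call through an IPC-Gate using the l4sys C interface; the protocol value and payload word are placeholders, not part of this thesis.

    #include <l4/sys/ipc.h>
    #include <l4/sys/utcb.h>

    /* Minimal sketch; 'gate' is assumed to be a valid IPC-Gate capability. */
    long send_request(l4_cap_idx_t gate)
    {
      l4_msg_regs_t *mr = l4_utcb_mr();   /* the message registers live in the UTCB */
      mr->mr[0] = 0x42;                   /* payload word, interpreted via the protocol */

      /* Tag: protocol label 0, one untyped word, no typed items, no flags. */
      l4_msgtag_t tag = l4_msgtag(0, 1, 0, 0);

      /* call = send, then wait for the reply; blocks until the server answers. */
      tag = l4_ipc_call(gate, l4_utcb(), tag, L4_IPC_NEVER);
      return l4_ipc_error(tag, l4_utcb()); /* 0 on success */
    }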

2.1.3 Flexpages

Transferring access to memory or kernel objects in Fiasco.OC is done with flexpages passed as part of an IPC message. For memory, a flexpage describes a region in a Task (in Fiasco.OC an address space is called a Task). To request a memory mapping for a specific region, the client sends an IPC message that includes a flexpage describing this region to a memory mapping server. Thereby, the client opens a receive window in which the server is allowed to map memory. The server answers by sending a flexpage describing the memory to be mapped from its own address space, thus completing the mapping process. If a grant operation is used in the flexpage instead of a map operation, the memory is not only mapped to the client but also unmapped from the server's address space. Usually, the client also receives all rights to the memory that the server held. References to kernel objects can be transferred with flexpages as well.
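At the API level, such flexpages are small descriptors; the following lines are a hedged illustration (not thesis code) of how a memory flexpage and an object flexpage are built before being attached to a message as typed items.

    #include <l4/sys/consts.h>
    #include <l4/sys/types.h>

    /* Sketch only; 'gate_cap' is a placeholder capability index. */
    void build_flexpages(l4_cap_idx_t gate_cap)
    {
      /* One page of memory at address 0x400000, readable and writable. */
      l4_fpage_t mem_fp = l4_fpage(0x400000, L4_PAGESHIFT, L4_FPAGE_RW);

      /* A capability, passed on with read/write/special but without the delete right. */
      l4_fpage_t obj_fp = l4_obj_fpage(gate_cap, 0, L4_CAP_FPAGE_RWS);

      (void)mem_fp; (void)obj_fp;  /* would be placed into the message as typed items */
    }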

2.1.4 Dataspaces

For managing memory in L4Re there is an abstraction over the manual handling of flexpages, called a Dataspace. These Dataspaces, usually provided by Moe, describe a piece of allocated memory and can be attached to a region manager. The region manager will then call these Dataspaces to map memory on page faults. The underlying mechanism is an IPC-Gate bound to the memory provider, which handles the memory mapping requests. The region manager thus only holds a reference to a kernel object for each region of the Task's address space.
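As a hedged illustration of this interplay (not code from this thesis, and flag names differ slightly between L4Re versions), the following sketch allocates anonymous memory as a Dataspace from the environment's allocator and attaches it to the calling Task's region manager, which afterwards answers page faults in that region by calling the Dataspace.

    #include <l4/re/c/dataspace.h>
    #include <l4/re/c/mem_alloc.h>
    #include <l4/re/c/rm.h>
    #include <l4/re/c/util/cap_alloc.h>

    /* Sketch, assuming the L4Re C interface; error handling kept minimal. */
    void *map_anonymous(unsigned long size)
    {
      l4re_ds_t ds = l4re_util_cap_alloc();   /* slot for the Dataspace capability      */
      if (l4re_ma_alloc(size, ds, 0))         /* request memory from the allocator (Moe) */
        return 0;

      void *addr = 0;                         /* let the region manager pick the address */
      if (l4re_rm_attach(&addr, size, L4RE_RM_SEARCH_ADDR, ds, 0, L4_PAGESHIFT))
        return 0;
      return addr;                            /* page faults in this region are now
                                                 served by the Dataspace */
    }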

2.1.5 Capabilities

References to kernel objects are called capabilities in Fiasco.OC. There are currently eight types of kernel objects: Task, Factory, Thread, IPC-Gate, IRQ, Scheduler, Vlog and Semaphore. They are created using a Factory object, which is provided through the L4Re environment.


Since kernel objects reside at kernel-level, user-level code cannot access them directly. To interact with a kernel object, user-level code has to make an IPC call into the kernel, providing a capability. The type of interaction depends on the type of object referenced by the capability: an IPC-Gate, for example, is a generic way to pass messages between different threads, while a call on a Task capability can only use the Task protocol. To enforce restrictions on the communication with kernel objects, delete, write and special rights are available. A separate read right is not necessary, because a capability has to be mapped into a thread's Task before the thread can interact with it at all; read rights are thus represented by the mapping into the Task. The other rights are stored in a capability table, which also contains the references to the kernel objects. For security reasons, user-level code can only address its capabilities by their indices in the capability table, much like file descriptors in other systems. This also ensures anonymity between IPC partners.
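Since user-level code only ever handles such indices, obtaining a capability typically means asking the L4Re environment for a named slot. A small sketch (the name "svc" is a placeholder, not part of this thesis):

    #include <l4/re/env.h>
    #include <l4/sys/types.h>

    /* Sketch: look up a capability slot placed into this Task's environment
     * under the (hypothetical) name "svc". The returned value is only an
     * index into the Task's capability table. */
    l4_cap_idx_t lookup_service(void)
    {
      l4_cap_idx_t c = l4re_env_get_cap("svc");
      if (l4_is_invalid_cap(c))
        return L4_INVALID_CAP;    /* nothing mapped under that name */
      return c;                   /* can now be used as the target of an IPC call */
    }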

2.2 C/R Systems

The purpose of a C/R system is to provide the possibility to restart an application without losing all progress. Therefore, the application's state is saved, usually at intervals, which is called checkpointing. C/R is used for fault tolerance, but also for migrating processes. It can also be helpful in debugging long-running applications, as the full state before a fault is available in a checkpoint.
To evaluate whether existing C/R systems can be ported, the properties distinguishing C/R systems need to be understood. The following list of properties, based on [MG09], is not complete; it contains only the properties I identified as most important for client-independent C/R of L4Re-server-applications.

2.2.1 Properties of C/R Systems

Correctness It is essential for a C/R system to consider all resources an application might be using in the system. If even one resource is not checkpointed, the outcome of an application's computation might change completely after a restart. This is a hard problem, especially for C/R systems situated only in either user- or kernel-level. The following two examples [Bro+06] illustrate the problem.

If an application checkpointed by a user-level C/R system uses rand(), the seed for generating pseudo-random numbers has to be checkpointed. While this poses no problem for kernel-level C/R systems, user-level C/R systems usually face access restrictions that prevent them from setting the correct seed value.

Kernel-level checkpointing, on the other hand, can have problems correctly restoring an application that holds user-level references to kernel-level data. Looking at MPI, for example, one finds virtual process IDs that map to physical process IDs and locations. Since a user-level C/R system would not be able to set the physical process IDs, only the virtual IDs would be checkpointed. At kernel-level the physical IDs would also be checkpointed, but the mapping would no longer be correct if one system changes its location during recovery.


These examples show that neither kernel- nor user-level alone is better suited for achieving correctness in checkpointing, especially if the property of transparency is to be upheld.

Transparency An ideal C/R system should not place a burden on the user. A checkpointed program should thus not be aware of the fact that it is being checkpointed. This also means that the developer of a program does not have to change it to work with a C/R system. Since transparent C/R systems have to be generic to cope with many different types of applications, not every optimization can be used. The developer of an application has more information about the lifetime of data or the possibility to recompute data from a smaller checkpointed dataset, so explicit checkpointing by the application can be more efficient.

Autonomy Going in another direction than transparency is autonomy, which means that the user does not have to interact with the checkpointing services at runtime. This can be achieved by automating fault detection and restarts of the application.

Portability What kernel-level C/R systems generally lack is portability, since they are implemented in the system's core itself. It is thus hard to impossible to write kernel-level C/R code in a manner that allows it to be reused on different operating systems. User-level C/R systems, however, can at least in part be implemented mostly independently of the underlying OS. But even then, some information can only be checkpointed with special help from the kernel, because the information necessary to achieve correctness is located in the kernel and only available through system-specific interfaces. Another problem are operating-system-specific constructs. Capabilities, for example, are not available in all systems, or can only be compared to the simpler file descriptors. In contrast to file descriptors, capabilities are vital for every application and cannot be ignored.

Efficiency With portability being hard to achieve, one may not want to sacrifice efficiency for it, since efficiency is one of the most important properties of a C/R system. Checkpoints are created very often, and every overhead added to the application's runtime will make a user weigh the risk of running the application without checkpointing or of using a different, faster service. Although most effort is directed towards making the checkpointing process faster, nearly ignoring recovery as a rare event, there are applications that greatly benefit from a fast recovery process. A system service providing essential services to other applications is one such example: here the downtime has to be kept as short as possible for the system to work as expected.

Use of disks With efficiency in mind, the choice of storage technology for checkpoint data has to be considered. A common approach is to write all data to disk because of the protection against power failures. But writing to disk imposes a major overhead on the checkpointing process [Zhe+04]. To bypass this overhead, diskless checkpointing was introduced, mainly used in clusters where hosts hold the state of other hosts in memory. This accounts for host failures, but not for a power outage of the whole data center.


Even on single systems, though, diskless checkpointing offers greatly reduced checkpoint times and the possibility to avoid complex problems where state cannot easily be restored but only protected from deletion.

Message logging Another property a C/R system might need is message logging. Especially when dealing with external client applications, the internal state of a checkpointed server should be retained, and therefore logging communication from the clients is important.
Three types of message handling can be distinguished. First, messages might be ignored by the C/R system, so that no messages are logged. Depending on the application this can be fine, e.g. for computation-intensive workloads. The two other types are intra-application and inter-application messages: the former are sent inside the checkpointed application, the latter between the checkpointed application and other applications.
If messages are logged, either pessimistic, optimistic or causal logging can be used. While pessimistic logging has the highest overhead, it also has the simplest recovery procedure, as messages are stored to stable storage before the execution of any associated event. In contrast, with optimistic logging messages are stored to volatile storage asynchronously to the execution of an associated event. This achieves higher performance at the cost of more complex or even impossible recovery due to lost messages. By combining optimistic logging with the sharing of logs between processes, causal logging combines the performance of optimistic logging with higher survivability. The sharing of messages is done by piggy-backing on normal messages, which increases the message size and adds complexity to the recovery mechanism.
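The ordering difference between the first two schemes can be sketched as follows; this is purely illustrative, and Message, log_stable(), log_volatile_async() and handle() are hypothetical helpers, not part of this thesis.

    struct Message;                             // hypothetical message type
    void log_stable(const Message &);           // synchronous write to stable storage
    void log_volatile_async(const Message &);   // asynchronous write to volatile storage
    void handle(const Message &);               // the event associated with the message

    // Pessimistic: the message is on stable storage before it takes effect,
    // so a crash during handling can always be replayed from the log.
    void deliver_pessimistic(const Message &m)
    {
      log_stable(m);
      handle(m);
    }

    // Optimistic: logging happens asynchronously; faster, but a crash may
    // lose messages that were handled but not yet logged.
    void deliver_optimistic(const Message &m)
    {
      log_volatile_async(m);
      handle(m);
    }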


3 Comparison

Before presenting the design devised for this work, I will present related examples of C/R systems for system services and for capability-based systems in general. I will then review currently available C/R systems, examining whether any of them is suitable for porting to Fiasco.OC. Since the system services of L4Re usually do not use the file system, libraries like Metamori [Jey04] or libfcp [Wan+97], whose purpose is to make checkpointing of files possible, are not discussed in detail. Instead, the focus is on the checkpointing of allocated memory, thread state and possible features for checkpointing capabilities, as these constitute the state of an L4Re server.

3.1 Examples for C/R for System Services

As stated in the introduction, this work focuses on checkpointing the system servers found in L4Re for the Fiasco.OC kernel. In other microkernel-based systems this problem has already been addressed with different approaches, some of which I present here together with solutions for servers in general.

MINIX 3 One implementation for restarting servers in a microkernel environment can be found in the MINIX 3 OS [Tan+06]. Here, a so-called reincarnation server monitors device drivers, which are implemented as user-level servers, as well as other servers. If a monitored server does not respond to IPC calls or terminates, the reincarnation server restarts the failed component. For mostly stateless drivers this is no problem at all. If keeping state is necessary, an additional service called the data store can be used. This service has to be used manually by the server and thus might not save all state necessary for full recovery. This is why mainly stateless servers are supported in this system.

CuriOS In the CuriOS [Dav] operating system, the client-specific session state of a server is managed by the kernel. A server is only able to access this so-called server state region while a client is invoking the server; the rest of the time the memory is detached from the server's address space, and for the client it is not accessible at all. The reason for this isolation is that a corrupted server can only corrupt the state of the current client, while the other clients' state is safe. Additionally, the client state can survive the server's restart because it is not coupled with the server, so the restart is also transparent to the client.

ASSURE For a very general application of C/R in servers, ASSURE [Sid+09] was developed. Here, unknown errors are bypassed by a combination of exception handling and C/R mechanisms. If an error occurs, the nearest so-called rescue point is used on a replica of the application to re-run the event that caused the error.


Rescue points are found offline by fuzzy search and are fault handlers in the application itself. It is claimed that, as a result, no source code of the application is needed for successful recovery from unknown errors.

3.2 Examples for C/R in Capability-Based Systems

With Fiasco.OC being a microkernel, IPC is the main method of communication, as most services are implemented in separate servers. Since a service is called by invoking a capability, capabilities are used all over the system, which makes for a special case in checkpointing. But Fiasco.OC is not the only system using capabilities or similar constructs for communication, and solutions already exist, some of which I present below.

KeyKOS / EROS In the KeyKOS operating system [Lan92], which also uses capabilities, the problem is solved at the coarsest granularity: the whole system state is checkpointed regularly. The same is true for EROS [Sha+99], with the slight difference that consistency checks are performed before every checkpoint is written to ensure the correctness of the checkpoint's data.

Mach An equivalent to capabilities (ports) can be found in the Mach microkernel, which also supports Unix applications. For implementing task migration in this system, the ports abstraction of the microkernel was used, making it possible to create a task migration system in user-level with only slight changes to the underlying kernel [Mil+99]. This is not an explicit C/R system, but for task migration a checkpoint of the task's current state has to be taken in order to migrate it. Thus, task migration and C/R share common concepts and mainly differ in their goal: migration instead of fault tolerance.

L4ReAnimator There also exists an earlier attempt to implement checkpointing on Fiasco.OC, a project called L4ReAnimator [Vog+10]. It was a semi-transparent checkpointing library that depended on a capability fault handler on the client side. This fault handler was used to detect and handle stale capability entries of restarted servers. For convenience, the implementor of a server was expected to provide an implementation of such a fault handler for the client to use.

3.3 Review of C/R Systems

A C/R system for this work has to hold the following properties: correctness, transparency, autonomy and portability, and it should support diskless checkpointing. The reasoning for choosing correctness is that it is a property that always needs to hold, at least for applications with a defined set of properties; otherwise, using a C/R system could change the application's results, which must not happen. Checkpointing system services implies that autonomy is a vital property, because the system might not be able to carry on without that service, and waiting for a user's interaction is too costly in terms of time. Transparency is needed because it reduces the work to be done: all system services currently available or developed in the future can run on top of this C/R system without being changed.


If an available C/R system complies with these properties, it should also be easily portable, and since Fiasco.OC has special needs regarding its capabilities, these needs should be easy to incorporate. It should also not be complicated to extend the C/R system for diskless checkpointing, as in this work everything is supposed to be saved in memory.

Since there are more C/R systems than can be discussed in this work, I will present only a small subset in detail. For other C/R systems, I give only the name and a short description of why they are not suitable. First of all, there is Condor [Lit+97], which is too big a project and not transparent. CRAK [ZN01] is user-initiated and thus does not provide autonomy. Neither autonomy nor multi-threaded applications are supported by CHPOX [Sud+07]. FineFRC [Chr+06] is mainly aimed at clusters and relies on shared memory between the cluster nodes, so it does not seem suitable for a single host's system services. FTOP [Bad+02] can checkpoint communication channels and implements message routing for restarted distributed applications, but it depends on the process being integrated with PVM (Parallel Virtual Machine) [Gei94] and is not transparent.

libckpt

Although libckpt [Pla+94] is not transparent, as it requires the replacement of an application's main() function, and can only checkpoint single-threaded applications, it deserves a more detailed discussion because of its portability and the fact that it was one of the first C/R systems for Unix. While it requires relinking and recompiling an application, it also offers user-directed checkpointing via memory exclusion, with which a user can tell the C/R system about memory regions that need not be checkpointed. The portability of libckpt comes from the fact that it does not use many system-specific features. To find the stack and heap memory regions, it uses pointer arithmetic instead of system-specific information sources. The CPU state is saved with a combination of setjmp()/longjmp() calls, which are a feature of the C standard library: the former saves the processor state on the stack at the beginning of a checkpoint, and the latter is used to return to that point and resume normal processing when a checkpoint is restored.
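The following fragment shows the setjmp()/longjmp() idea in isolation; it is not libckpt's actual code, and write_context() is a hypothetical helper that stores the buffer as part of the checkpoint.

    #include <setjmp.h>
    #include <stddef.h>

    static jmp_buf chkpt_ctx;   /* register context, written out with the checkpoint */

    void write_context(const void *buf, size_t len);   /* hypothetical: store in checkpoint */

    void take_checkpoint(void)
    {
      if (setjmp(chkpt_ctx) == 0)
      {
        /* First return: the context is saved; write it out with the memory image. */
        write_context(&chkpt_ctx, sizeof chkpt_ctx);
      }
      /* Second "return", reached via longjmp() after a restart: just continue. */
    }

    void resume_after_restart(void)
    {
      /* Called once data memory (including chkpt_ctx) has been restored. */
      longjmp(chkpt_ctx, 1);
    }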

BLCR

A fully transparent C/R system is BLCR (Berkeley Lab Checkpoint/Restart) [Due05]. In this system, checkpointing is implemented as a kernel module on Linux. As mentioned in the last chapter, systems relying on kernel-level alone cannot always be correct; BLCR thus provides a set of callbacks for extending the checkpointing process from user-level. For the collection of thread state it uses a slightly modified version of vmadump, which was originally developed for BPROC [Hen99]. Since the collection is done at kernel-level, kernel structures are accessed directly to gather CPU, FPU and memory segment information. Other information saved by accessing kernel structures includes the tree of processes started by the application and a list of open file descriptors. With BLCR depending heavily on Linux kernel structures, only small parts of its code could be reused on Fiasco.OC.


CRIU

CRIU (Checkpoint/Restart In Userspace) [tea] is developed for Linux only and therefore relies heavily on OS-specific features. Memory regions, file descriptors and the dependency tree of a process are found by parsing information provided by the Linux kernel through the /proc file system. For all other functions the ptrace interface is used. Usually, this interface is used by debuggers and allows a process to control other processes; it can therefore be used from user-level to retrieve and set CPU/FPU state and to pause processes. As ptrace is a system call, kernel support is needed. Thus, CRIU can be called a hybrid between user- and kernel-level, storing information in user-level that is provided and retrieved at kernel-level. While it is not portable at all, because Fiasco.OC lacks a similarly powerful interface, its hybrid design can still serve as a blueprint for other approaches.

DMTCP

Another approach to implementing a user-level C/R system on Linux is DMTCP [Ans+09], which is based on MTCP [Rie+06]. It adds functionality for checkpointing distributed applications to MTCP, including the state of sockets. While it is to some extent portable, because it uses POSIX signals to stop the processes for checkpointing and a setjmp/longjmp pair as described in the libckpt section, it also uses Linux-specific features to find memory regions. A list of started processes is compiled by overriding the fork() library function and saving the information received from the application and the kernel. Since neither fork() nor the other system calls proxied by DMTCP are available on Fiasco.OC, only parts of this approach could be reused.

ZapC

An approach different from the ones discussed so far is ZapC [Laa+05]. Here, parts of the OS are virtualized so that processes can be encapsulated in a special container (pod, PrOcess Domain) that decouples the process from the host's OS. This provides a fully transparent way of checkpointing an application. But as ZapC is built on Zap [Osm+02], a successor to the aforementioned CRAK, it is implemented as a kernel module. While the project does not seem to be open source, a similar design could be implemented on top of Fiasco.OC vCPUs [Lac+10], making ZapC's approach viable in principle.

This overview shows that none of the discussed C/R systems could be ported directly, because of a lack of corresponding APIs, too tight an integration into the OS, unavailable source code, or the inability to fulfill all required properties. Still, parts of the designs of these C/R systems can serve as inspiration for implementing C/R on Fiasco.OC. I will discuss the design decisions for this work in the next chapter.


4 Design

In this chapter I present the design of PointStart, the C/R system I developed for Fiasco.OC/L4Re. Beginning with the possibilities to gather an application's state in Fiasco.OC, I then present the different components of PointStart, followed by a detailed rundown of startup, checkpointing and restart. Finally, security considerations and restrictions of PointStart are discussed. Details of the implementation follow in the next chapter.

4.1 State and its Sources in Fiasco.OC/L4Re

As mentioned before, the purpose of checkpointing is to save all the state an application needs to continue its task as if it had never been restarted. This means that for a specific operating system every source of application state has to be identified. Otherwise, the outcome of an application's computation can differ, which violates the property of correctness (see section 2.2.1). In Fiasco.OC, state can be divided into three categories: held capabilities, allocated memory and the state of each thread. These are discussed below in more detail.

4.1.1 Capabilities

First a small recap (for more see section 2.1.5): capabilities in Fiasco.OC are a means of controlling kernel objects and are used for communication between different threads and applications. Every Task has a capability table that holds references to kernel objects together with the rights this Task has with regard to each capability. For a capability created in the Task itself, this usually includes the right to delete the corresponding kernel object. The difference between the reference in the capability table and the kernel object itself is crucial for understanding the checkpointing design.

Since storing the checkpoints in memory is part of this work, the main problem in checkpointing capabilities is restoring a Task's capability table while keeping the kernel objects alive. As mentioned before, kernel objects can only be deleted if a Task holds the delete right for the capability. Therefore, PointStart remaps the capabilities into the Task without the delete right during the checkpointing phase. The Task can then only remove the reference from its capability table, while the kernel object is kept alive as long as there is still a reference to it in at least one capability table. To make use of this, all capabilities held by the checkpointed application are transferred to another Task, which keeps a reference.

With the deletion of objects prevented, there remains the question of how to enumerate all mapped capabilities. A Task's capability table has space for 2^20 entries.


Walking this table from user-level would require 2^20 IPC calls into the kernel to check every entry for validity (Task::cap_valid()), causing a lot of overhead. Walking the table at kernel-level and transferring only the result to user-level reduces this overhead, because the checks can be done in bulk. With the user-level caller allocating memory before the call and supplying the address to the kernel, the size limit of the UTCB does not apply. Thus, when enough memory is allocated to hold one byte per capability, only one IPC call has to be made.
Another way of gathering information on which capabilities are mapped into a Task is to intercept the creation or reception of capabilities. This can be done, e.g., with the vCPU feature available in Fiasco.OC, or to some extent by providing an own factory implementation as a proxy, since a factory is needed to create kernel objects. This has the advantage of working in user-level alone, but adds overhead by doubling the number of IPC calls through the added level of indirection. Using an own factory will also fail to recognize capabilities directly mapped or granted from another Task.
Because of the lower overhead and the simplicity of the approach, I chose to provide the list of capabilities from kernel-level, naming the call Task::cap_list(). This list also contains the type of each capability, because threads need to be identified, as they are handled specially. More on this topic is discussed in the Procedures of Operation (section 4.3).
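As a purely illustrative user-level view of this idea (the actual interface of the new call is defined in chapter 5; cap_list_wrapper() and the type codes below are placeholders), the bulk answer can be one type byte per capability slot in caller-provided memory, which is then scanned for threads:

    #include <vector>
    #include <l4/sys/consts.h>
    #include <l4/sys/types.h>

    // Hypothetical type codes returned per slot; 0 marks an empty slot.
    enum Cap_type : unsigned char { None = 0, Thread_obj = 1, Gate_obj = 2 /* ... */ };

    // Stand-in for invoking the new Task::cap_list() call on 'task'.
    void cap_list_wrapper(l4_cap_idx_t task, unsigned char *buf, unsigned long len);

    std::vector<l4_cap_idx_t> find_threads(l4_cap_idx_t task)
    {
      const unsigned long slots = 1UL << 20;        // 2^20 capability-table entries
      std::vector<unsigned char> types(slots);      // one byte per slot, caller-allocated
      cap_list_wrapper(task, types.data(), types.size());

      std::vector<l4_cap_idx_t> threads;
      for (unsigned long i = 0; i < slots; ++i)
        if (types[i] == Thread_obj)
          threads.push_back(i << L4_CAP_SHIFT);     // slot index -> capability index
      return threads;
    }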

4.1.2 Memory

For an application to continue processing in the state it was checkpointed in, all memory attached to it has to be reset to the state of the checkpoint. This includes all memory allocated by the application, be it heap, stack or explicitly allocated memory regions. As presented in chapter 3, a common approach to identifying the memory regions needed for a checkpoint is through information published by the kernel, e.g. via the /proc pseudo file system in Linux. With Fiasco.OC being a microkernel and memory management being done in user-level (see section 2.1), collecting information about an application's attached memory at kernel-level may not yield all of it. In Fiasco.OC, the only information about memory in the kernel are the page tables. But page tables only represent the current state of memory pages mapped into a Task: pages that have not yet been accessed, or that have been unmapped because of limited system memory, will not be found and thus not checkpointed. However, the user-level pager associated with a thread holds information for every region attached to the address space. Therefore, PointStart checkpoints memory completely in user-level.

4.1.3 Thread State

Unlike capabilities and memory, thread state is comprised of multiple parts, each of which is necessary for a thread to continue its work right where it was at checkpointing time. This means a copy of the CPU and FPU registers, the scheduler state and the IPC state has to be saved for each thread, of which an application might have several. Scheduler state includes the timeout a thread might be waiting for, the consumed CPU time of the thread, the UTCB and the current scheduling flags, e.g. whether the thread is ready to run or waiting for an exception or IPC. The information necessary to restore IPC state includes the list of senders waiting for the thread to become ready to receive and the current IPC partner of the thread.


The partner determines which thread is currently communicating with the receiver or, if explicitly set, which thread a receiver will accept messages from in the future. Additionally, a receiving thread might have a caller set for replying to another thread's request. A sending thread holds a reference to the receiver's list of senders it is waiting in.

There are multiple possibilities to gather thread state. The CPU registers could be collected in user-level by triggering an artificial exception in a thread through a kernel call (Thread::ex_regs()). This forces the thread into its exception handler, which receives an IPC message containing all CPU registers and inherits the FPU from the sending thread. However, triggering an artificial exception does not wake up threads blocked in IPC; these threads would not call the exception handler and thus not be checkpointed. This approach also fails to provide scheduler and IPC state.
Another way of gathering CPU and FPU registers, mentioned before in section 3.3 (libckpt), is to use custom assembler code. This method is hardly portable and provides neither scheduler nor IPC state.
Parts of the IPC state could be gathered by providing own versions of Fiasco.OC's IPC interface in user-level to be used by the application, so that information could be saved for every call into or out of the kernel. This could be implemented with the vCPU feature available in Fiasco.OC. While CPU and FPU registers could be collected with vCPUs, not all IPC and scheduler state can be gathered: keeping track of a thread's timeout would have to be implemented in user-level, and a thread's list of senders would not include partners external to the vCPU. This is because a sender entering the wait queue of a receiver causes no immediate event, and only events can be intercepted by a vCPU.
Therefore, I decided to collect a thread's state at kernel-level, where all the information is directly available. CPU registers are saved when a thread enters the kernel, and FPU registers can be saved by functions already needed for scheduling. For saving and restoring IPC and scheduler state I used direct access to a thread capability's private data. In the end, all of a thread's state is saved into memory provided by user-level with one kernel call (Thread::get_state()).
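As an illustration only (the real record filled by Thread::get_state() is an implementation detail described in chapter 5; field names and sizes are placeholders), the per-thread data listed above could be grouped roughly as follows:

    #include <l4/sys/ipc.h>
    #include <l4/sys/types.h>
    #include <l4/sys/utcb.h>

    // Hypothetical layout of a saved thread state; not the thesis's actual structure.
    struct Thread_checkpoint_state
    {
      l4_exc_regs_t cpu;              // CPU registers, exception-frame layout
      unsigned char fpu[512];         // FPU/SSE save area (size is a placeholder)

      // Scheduler state
      l4_cpu_time_t consumed_time;    // CPU time used so far
      l4_timeout_t  ipc_timeout;      // timeout the thread may be waiting on
      l4_umword_t   flags;            // ready / waiting for IPC or exception
      l4_addr_t     utcb;             // location of the thread's UTCB

      // IPC state
      l4_cap_idx_t  partner;          // current or expected IPC partner
      l4_cap_idx_t  caller;           // client to reply to, if any
      l4_umword_t   senders;          // length of the sender wait queue
    };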

Figure 4.1: Components in the system: ReLoader, the Snapshot Servers holding Snapshots, and the message logger/factory proxy, alongside the checkpointed application on top of Moe and the kernel


4.2 Components

PointStart is divided into three main components (fig. 4.1) with different responsibilities. With the code for storing checkpoints (the "Snapshot Server") split off from the code handling the application (the "ReLoader"), the storage server has a smaller code base and is thus less prone to errors. By starting multiple instances of the Snapshot Server, more robust storage can be achieved. The instances could be used as mirrors of each other, but because of the overhead in memory and the time needed to make two copies, I decided to use a round-robin approach. As a result, if one Snapshot Server crashes, the second-to-last checkpoint is still available.

The ReLoader handles the application and is responsible for its startup, checkpointing and restart. After startup, it wakes up periodically to create a new checkpoint. Additionally, checking for faults and for normal termination belongs to its responsibilities. For detecting normal termination, L4Re provides a Parent protocol that signals termination to a registered object via an IPC message. On starting the application, ReLoader registers itself as the Parent object of the application and terminates the corresponding Snapshot Servers on receiving a Parent signal. Fault detection being a subject of its own, PointStart currently relies on Fiasco.OC to provide information about a fault: ReLoader registers itself as the exception handler of every application thread. An exception handler is called by Fiasco.OC via IPC if unhandled page faults or CPU traps occur. When receiving such an exception IPC, ReLoader performs a restart of the application.

The third component combines a factory proxy with a message logger. It spawns two threads in a single Task: one is responsible for responding to client requests while the other only logs messages. This keeps the overhead of message logging small even with parallel requests to the factory proxy. Furthermore, the message logger receives messages with the original IPC-Gate's identifier; if both threads received messages through the same UTCB, an identifier created by the application could be the same as one used by the factory proxy.
The factory proxy is responsible for creating kernel objects on the application's behalf. Instead of directly requesting kernel objects, the client sends a request to the factory proxy, which replies by mapping a newly created kernel object. By proxying object creation requests, PointStart can set the message logger on every IPC-Gate with a newly introduced kernel call (Ipc_gate::logger()). Setting a message logger is needed because forwarding messages from the application itself would require changing the application, violating transparency (section 2.2.1). The registered logger receives all requests made through that IPC-Gate and saves them together with the time of reception.
In addition to logging messages, messages have to be replayed after a restart. This is done by another service running in the same thread as the factory proxy, thus sharing memory with the message logger. This service can replay messages for specific gates, or for all gates, starting from a given time. It is also responsible for blocking and unblocking all logged IPC-Gates via the Ipc_gate::logger() kernel call. This is needed to prohibit clients other than the replaying server from sending any messages; otherwise, the order of reception would not necessarily be the same as before the restart.
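A hedged impression of what a single log entry might contain (this is not the thesis's actual data structure): the gate's identifier, the time of reception and a verbatim copy of the untyped message words from the UTCB.

    #include <l4/sys/types.h>
    #include <l4/sys/utcb.h>

    // Hypothetical log record kept by the message logger for each request.
    struct Logged_message
    {
      l4_umword_t   gate_label;                       // identifier of the logged IPC-Gate
      l4_cpu_time_t received_at;                      // time of reception
      l4_msgtag_t   tag;                              // protocol, word and item counts
      l4_umword_t   words[L4_UTCB_GENERIC_DATA_SIZE]; // copy of the untyped message registers
    };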


4.3 Procedures of Operation

While running an application, ReLoader passes through three main phases, which are reached from its main loop: Startup, Checkpoint and Restart. I present each of these phases in more detail below.

4.3.1 Startup

Before running the given application itself, ReLoader starts two Snapshot Servers for storing the checkpoints. Additionally, ReLoader optionally starts the factory proxy and message logger. It then loads the application binary and prepares the environment of the application, which also includes providing an IPC-Gate connected to the factory proxy as the standard factory of the application. After giving the application a short time to fully start up, the newly added Task::cap_list() call is used on the application's Task to find all threads. For each of these threads, ReLoader sets itself as the exception handler, to be able to receive exception IPCs. As more threads can be spawned throughout the execution of the application, setting the exception handler is repeated in every checkpoint phase, which I describe in the following section.

4.3.2 Checkpoint

In the periodically reached checkpointing phase, ReLoader takes a new checkpoint of the application. To do this, it first has to choose a Snapshot Server; as mentioned before, this is done round-robin. On this Snapshot Server a new Snapshot is created for storing the checkpoint data. This might result in the Snapshot Server deleting an older checkpoint if the threshold of parallel Snapshots is reached. Currently, only one Snapshot per Snapshot Server is supported.

For a consistent checkpoint, the threads of the application have to be stopped. To find these threads, ReLoader uses Task::cap_list() to retrieve a list of capabilities. This list is then filtered for threads, ignoring the capabilities of the region manager, the Parent object and the exception handler. These are known to ReLoader because it originally mapped them into the Task. Since the threads will be stopped in the next step, not ignoring these capabilities would result in a lockup of ReLoader, as the Parent object and the exception handler are ReLoader itself; the region manager is also needed for retrieving memory regions. With this list of threads, each thread's exception handler is set to ReLoader, because new threads could have been created by the application after the startup phase. Then each thread is forced into the exception handler by an artificial exception. With ReLoader being the exception handler, no exception IPC message will be handled until the checkpointing is done, as ReLoader only retrieves IPC messages from the kernel in its main loop.

Since it is possible for the application to run between getting the list of capabilities and stopping all threads, ReLoader retrieves the list of capabilities again. The updated list is then used to save the capabilities to the created Snapshot.


As mentioned before, the delete right has to be revoked; this is done by granting the capability to the Snapshot and then mapping it back into the application's Task with reduced rights.
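Expressed as a sketch with hypothetical helpers (grant_cap() and map_cap() stand in for the underlying map and grant operations on the two Tasks; this is not PointStart's actual code):

    #include <vector>
    #include <l4/sys/types.h>

    // Hypothetical helpers wrapping the kernel's grant and map operations.
    void grant_cap(l4_cap_idx_t from_task, l4_cap_idx_t to_task, l4_cap_idx_t cap);
    void map_cap(l4_cap_idx_t from_task, l4_cap_idx_t to_task, l4_cap_idx_t cap,
                 unsigned char rights);

    void reduce_rights(l4_cap_idx_t app_task, l4_cap_idx_t snapshot_task,
                       const std::vector<l4_cap_idx_t> &caps)
    {
      for (l4_cap_idx_t cap : caps)
      {
        // Granting moves the full-rights reference into the Snapshot's Task ...
        grant_cap(app_task, snapshot_task, cap);
        // ... and mapping it back omits the delete (D) right, so the application
        // can drop its table entry but no longer destroy the kernel object.
        map_cap(snapshot_task, app_task, cap, L4_CAP_FPAGE_RWS);
      }
    }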

For checkpointing the memory of the application, ReLoader retrieves the list of regions attached to the region manager of the application. The Dataspace belonging to each region, the address the region was mapped to, the size of the region and the offset into the Dataspace the region is mapped from are transferred to the Snapshot Server. The Snapshot Server then copies the memory of the Dataspace into a new Dataspace and stores it together with the other information for later retrieval. There is no need to attach the new Dataspace to a region manager, as the memory does not have to be mapped into any address space, because it will not be accessed directly; the allocated memory remains available as long as the Dataspace is not destroyed.

For the last part of state retrieval, ReLoader filters the capability list for threads again. Except for the aforementioned region manager, Parent object and exception handler, the found threads' state is retrieved with the added Thread::get_state(), supplying the memory address of a previously allocated and locally attached Dataspace. After the call, the Dataspace is transferred to the Snapshot Server, which creates a copy of it and stores the new Dataspace together with the also transferred capability id of the thread for later retrieval.

Finally, ReLoader leaves the checkpointing phase and returns to its main loop. The threads waiting to send a message to ReLoader as their exception handler can then proceed, as ReLoader handles IPC message retrieval in the main loop. Thus the application can continue its computation where it left off at the beginning of the checkpointing phase.

4.3.3 Restart

As mentioned before, ReLoader inserts itself as the exception handler of the handled application's threads and currently identifies faults by receiving exception IPC messages from the kernel. If any non-artificial exception is received, ReLoader will trigger a restart of the application. The restart begins with stopping the threads of the application, as already explained for the checkpointing phase (section 4.3.2), with the same exceptions applying. If any thread continued its operation, the restored memory might get corrupted.

ReLoader then continues the restart by detaching all memory from the region manager of the application and unmapping the memory belonging to these regions from the application's Task. The only exceptions here are the regions reserved for the UTCB, because it is handled in the thread state, and the Kernel Info Page (KIP), which contains read-only information from the kernel. Unmapping the memory makes sure that a page fault will occur on the next access, which will then be served from the restored Dataspace. This might seem to be a problem, because the region manager of the application is still in use and not stopped, but the region manager has its own region manager, originally created by ReLoader. This other region manager makes sure that the pages necessary for running the application's region manager are mapped in again. After clearing the address space, ReLoader also clears the capability table by unmapping every capability found in the already retrieved capability list.
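
A rough sketch of this clearing step is given below, assuming the region list has already been fetched from the application's region manager and the UTCB and KIP regions have been skipped. Page-wise unmapping is used for simplicity; larger flexpages or batched calls are preferable.

    #include <l4/re/rm>
    #include <l4/re/dataspace>
    #include <l4/sys/task.h>
    #include <l4/sys/consts.h>

    struct Region { l4_addr_t addr; unsigned long size; };

    void clear_address_space(L4::Cap<L4Re::Rm> app_rm, l4_cap_idx_t app_task,
                             Region const *regions, unsigned n)
    {
      for (unsigned i = 0; i < n; ++i)
        {
          L4::Cap<L4Re::Dataspace> ds;
          app_rm->detach(regions[i].addr, &ds);  // drop the region from the Rm
          // Unmap the pages so that the next access raises a page fault, which
          // is then served from the restored Dataspace.
          for (l4_addr_t a = regions[i].addr;
               a < regions[i].addr + regions[i].size; a += L4_PAGESIZE)
            l4_task_unmap(app_task,
                          l4_fpage(a, L4_PAGESHIFT, L4_FPAGE_RWX),
                          L4_FP_ALL_SPACES);
        }
    }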


With only the empty shell of a Task remaining, the old state can be restored. Initially, the latest Snapshot ID has to be identified by querying the last used Snapshot Server for its Snapshot. ReLoader retrieves the list of capabilities mapped in the Snapshot and maps them into the application's Task, restoring the capability table to the checkpointed state. For restoring the memory state, ReLoader retrieves the list of Dataspaces stored in the Snapshot and the region information for each Dataspace. When queried for the Dataspace itself, the Snapshot Server returns a new Dataspace with the content copied from the stored one. Returning a copy ensures that the Snapshot is not modified, as it might be needed for another restart. Using the retrieved information, ReLoader then attaches the Dataspaces to the region manager of the application. Filtering the list of capabilities mapped in the Snapshot for threads, the thread state is retrieved. As before, the Snapshot Server returns only a copy of the stored Dataspace to ReLoader's query. Attaching the retrieved Dataspace to get a memory address, ReLoader then uses this address as a parameter to Thread::set_state() to set the thread state in the kernel.
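
For the memory part, re-attaching one restored region could look roughly as follows; this reuses the hypothetical Saved_region record from the checkpointing sketch, and the flags value 0 is assumed to mean attaching at the fixed address recorded at checkpoint time.

    #include <l4/re/rm>
    #include <l4/re/dataspace>

    void restore_region(L4::Cap<L4Re::Rm> app_rm, Saved_region const &r,
                        L4::Cap<L4Re::Dataspace> fresh_copy)
    {
      // 'fresh_copy' is the copy handed out by the Snapshot Server, so the
      // stored Snapshot itself stays untouched for further restarts.
      l4_addr_t addr = r.addr;
      app_rm->attach(&addr, r.size, 0 /* fixed address, no search */,
                     fresh_copy, r.offset);
    }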

The Thread::set_state() call sends the thread into the exception handler with an artificial exception after it has set the state. Thus, only after all the state is set and ReLoader returns to its main loop to handle IPC messages will the application continue from the checkpointed state.

However, ReLoader has to block all IPC-Gates known to the factory proxy before returning to the main loop. This ensures that messages can be replayed before the application continues its computation. If messages are blocked, ReLoader calls the message logging server right after replying to the exception IPC in the main loop. The message logging server replays the known messages received since the checkpoint was taken; the time of the used checkpoint is provided to the message logging server by ReLoader. Since setting the application's state is not completed until all messages are replayed, the IPC-Gates are not unblocked individually. Instead, ReLoader tells the message logging server to unblock all IPC-Gates after the replay has finished, and the application can continue receiving messages.

4.4 Security Considerations

With PointStart requiring full control over the state of an application, security needs to be considered for the handling of capabilities and for the features added to the kernel.

With capabilities being a means of controlling the rights of and access to kernel objects, no elevation of privileges should be possible. This property holds because the original capabilities are kept alive by granting them to the Snapshot Server. Because the capability is mapped back to the application without the delete right, there is no way the application could destroy it, so there is no need to create a new capability with the same properties. This prevents the further security problems that manifest when managing kernel memory in user-level as done in L4/strawberry [Hae03] and is possible because the checkpointing information is not stored on disk, but in RAM. Furthermore, storing the information in RAM ensures that it cannot be altered by other processes, and no extra security precautions like HMACs are necessary to prove the integrity of the saved data.


One of the calls added to the kernel is Task::cap_list(). The functionality of listing all capabilities and their type could be achieved in user-level, but at much greater cost. From a security point of view, though, it is only interesting whether new information is leaked. This is not the case, as Task::cap_valid() could be used to test each capability slot for a valid capability. Invoking each found capability with all the different protocols understood by kernel objects would also yield information about the type of an object, because kernel objects only understand one protocol. Invoking a capability with the wrong protocol raises an error (EBadproto), so identifying the capability type is eventually possible and no new information is leaked. The problem with optimizing this process by adding functionality to the kernel, though, is that more code runs with elevated privileges, thus adding to the code that has to be trusted.
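
The user-level alternative hinted at above could look like the following sketch. It is shown only to illustrate why it is expensive; the protocol probing would additionally need timeouts, since an IPC-Gate bound to a user-level server might never answer, and the error-return convention is an assumption.

    #include <l4/sys/task.h>
    #include <l4/sys/ipc.h>
    #include <l4/sys/err.h>
    #include <l4/sys/utcb.h>

    // Label of the returned tag is non-zero for a valid capability slot.
    bool slot_is_mapped(l4_cap_idx_t task, l4_cap_idx_t cap)
    {
      return l4_msgtag_label(l4_task_cap_valid(task, cap)) != 0;
    }

    // Send an empty message with the given protocol; a kernel object that does
    // not implement the protocol is assumed to answer with -L4_EBADPROTO.
    bool speaks_protocol(l4_cap_idx_t cap, long proto)
    {
      l4_msgtag_t tag = l4_ipc_call(cap, l4_utcb(),
                                    l4_msgtag(proto, 0, 0, 0), L4_IPC_NEVER);
      return l4_msgtag_label(tag) != -L4_EBADPROTO;
    }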

Another feature added to the kernel is getting and setting of thread state with Thread::get_state() and Thread::set_state(). While also adding code with elevated privileges and therefore code that has to be trusted, Thread::get_state() is a read-only function and can consequently only leak information. With Thread::set_state(), however, it is possible to set parameters of the thread and therefore of the kernel object that could not be set before. Taking a closer look at the data returned by Thread::get_state(), especially pointers to kernel objects are leaked, but also the kernel stack. Other data includes the CPU and FPU registers and the UTCB, but these can also be accessed from user-level. With Thread::set_state() being the reverse function, it allows setting the kernel stack. Therefore, a malicious application might be able to call kernel functions and alter its state to gain more privileges. Being able to set pointers to kernel objects for the sender list and IPC partners might even crash the system, as I found no way to validate the memory pointed to. Therefore, these functions reduce the security of a system until these problems are fixed. A possible fix could be signing the data returned by Thread::get_state() in the kernel and only allowing data with a valid signature to be used by Thread::set_state(). This leaves only the problem of stale pointers in the kernel data.

The last feature added to the kernel is used for message logging. To support this, the call Ipc_gate::logger() and code in Ipc_gate::invoke() were added to the kernel. Since the logger can only be set if the calling thread could also bind a thread to the IPC-Gate, the security is not weakened. With Ipc_gate::logger() being a simple setter like Ipc_gate::bind_thread(), it can also not be misused to gain additional privileges.

Because PointStart is the loader of the started application, it has full control over it from the beginning. The additional kernel functions only add to this control in a small way, since PointStart creates the Task and main thread of the application and can map arbitrary capabilities and memory to and from the application. With control over the thread, it can also set the instruction pointer and thus run any code in the context of the application. As a result, the application has to trust PointStart.

4.5 Special Cases/Restrictions

Even though PointStart aims at being as inclusive as possible, there are some special cases it cannot solve. Additionally, some restrictions apply to applications checkpointed by PointStart.


One special case is present in an L4Re server called Mag. It can be used to securely multiplex the graphics and input hardware of a system among multiple applications and multiple complete windowing environments. The problem with Mag is that it uses the kernel's reference counting to detect whether a window is still supposed to be shown. Because Mag removes its own reference from the counter when the object is created, the object is automatically deleted by the kernel once it no longer resides in any other capability table. Using a garbage collector, Mag detects missing kernel objects and deletes the corresponding windows. Since PointStart has to hold a reference to every capability of an application for checkpointing purposes, an application using Mag and running checkpointed by PointStart will experience a delay in the removal of windows. This delay depends on how long it takes for the last checkpoint holding the capability to be deleted by PointStart. A possible solution would be introducing a threshold for object references in Mag: if the reference counter drops below the threshold, Mag could delete the window while the capability stays alive in the Snapshot.

Another special case is the root task Moe. While it would in principle be possible to checkpoint, Moe provides the basic features of the L4Re environment used by PointStart. Therefore, PointStart would have to implement most of Moe's features itself, making it essentially a copy of Moe. With that much functionality added, PointStart itself would become a candidate for checkpointing. As a result, checkpointing Moe is not supported by PointStart.

As initially said, some restrictions apply to applications checkpointed by PointStart. These restrictions include that the application may not use its own exception handler. This is currently not possible, because ReLoader has to handle the exception IPC message to detect faults and to let the application threads continue after being stopped. It would be possible to implement chained exception IPC message passing from ReLoader to the original exception handler, but this has not been done. Another restriction is that applications spawning a lot of new threads throughout their runtime cannot be supported, because all threads have to be stopped to create a consistent checkpoint. Since none of these restrictions are common in L4Re server applications, they are usually no problem.

A restriction with more impact might apply to shared memory. Since PointStart detaches all memory at restart and replaces it with a copy of the data found at checkpointing, no region will be shared after a restart. Having no flag for shared memory in the region manager, PointStart is unable to recognize shared memory. But even if shared memory regions could be recognized, there is no good way to handle them, because sharing memory with a partner also means sharing state. If the partner is not restarted together with the application, either the application's state or the partner's state will be changed by a restart. Checkpointing and restarting multiple applications is not yet supported by PointStart, which is why PointStart should not be used with applications sharing memory.

Even though PointStart supports message logging, the logging mechanism is still in an early stage. It is suitable for logging messages containing simple data like text or numbers, but not memory or capability mappings. To support those, the messages would have to be parsed and acted upon; it is thus not suitable for factory servers like Moe. In its current implementation, it also only supports send operations, but extending it to support calls would be possible.


5 Implementation

While the discussion of the design presented in chapter 4 describes the general functionality, there are some parts where details of the implementation are not obvious. I will present these implementation details in this chapter.

5.1 Task::cap_list()

As already explained in chapter 4, I added a new function to the Task kernel object, named cap_list(). It accepts a user-level address as parameter, to which it writes the type of each mapped capability in the Task. With each capability using one byte of space, 1 MB of memory is needed for the full list of 2^20 capabilities. Theoretically, there is room for improvement: as there are currently only eight types of kernel objects available, four bits would suffice to encode them, so only 512 KB would be needed. Additionally, simple run-length compression could reduce the memory needs, especially for mainly empty capability tables or recurring object types. Since the compressed size cannot be known before the call, supporting multiple calls by passing a start parameter would be needed.
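
The run-length compression mentioned here is straightforward; a minimal user-level sketch of the encoding (the kernel-side encoder would work the same way) could look like this:

    #include <cstdint>
    #include <cstddef>
    #include <utility>
    #include <vector>

    // Encode runs of identical type bytes as (type, run length) pairs; a mostly
    // empty capability table collapses to a handful of runs.
    std::vector<std::pair<uint8_t, uint32_t>>
    rle_encode(uint8_t const *types, size_t slots)
    {
      std::vector<std::pair<uint8_t, uint32_t>> runs;
      for (size_t i = 0; i < slots; )
        {
          size_t j = i;
          while (j < slots && types[j] == types[i])
            ++j;
          runs.emplace_back(types[i], static_cast<uint32_t>(j - i));
          i = j;
        }
      return runs;
    }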

For getting the type of a kernel object, which is only present as the more general Kobject_Iface pointer in the capability table, earlier versions of Fiasco.OC provided Kobject_Iface::kobj_type(). This method used to return a string with the kernel object type. In newer versions, this method is not available anymore. Instead, cxx::dyn_typeid() can be used on a Kobject_Iface, which returns a structure containing the name. Using strcmp on this name, the object type is found.

One challenge with walking the capability table of a Task is the lookup of the capability's kernel-object pointer. The memory access of the lookup() function slows down the process. Thus, it is more efficient to check whether a capability is mapped in the entry before making the lookup. This is done by calling Obj_space::v_lookup() on the capability index, as Obj_space is one of the bases of a Task object.

For handling the list returned by the kernel in user-level, the application needs to include a header file defining the object types. This header also includes two functions for walking the list. get_next_cap() walks the list from a given starting point and returns the next non-empty capability index. If a type is also given as parameter, the function returns the next capability index of that type. The other utility function, get_next_free_cap(), returns the next empty capability index. This can be used for directly mapping a capability to another Task. But with L4Re keeping track of free capability slots in user-level, this usage is only advisable for the slots below 0x400, which are not tracked by L4Re::Util::cap_alloc.
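
The two helpers could be implemented along the following lines; the type encoding (0 for an empty slot) and the signatures are assumptions, and the actual header shipped with PointStart may differ.

    #include <cstddef>
    #include <cstdint>

    enum : uint8_t { CAP_TYPE_ANY = 0xff };  // assumed sentinel for "any type"

    // Next non-empty slot at or after 'start', optionally filtered by type;
    // returns 'slots' if nothing is found.
    size_t get_next_cap(uint8_t const *types, size_t slots, size_t start,
                        uint8_t type = CAP_TYPE_ANY)
    {
      for (size_t i = start; i < slots; ++i)
        if (types[i] != 0 && (type == CAP_TYPE_ANY || types[i] == type))
          return i;
      return slots;
    }

    // Next empty slot at or after 'start'; only advisable below 0x400, which is
    // not tracked by L4Re::Util::cap_alloc.
    size_t get_next_free_cap(uint8_t const *types, size_t slots, size_t start)
    {
      for (size_t i = start; i < slots; ++i)
        if (types[i] == 0)
          return i;
      return slots;
    }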


5.2 Startup

In the startup phase, PointStart acts as an application loader. Usually, when starting a server application, an IPC-gate is created by the default loader Ned. This IPC-gate is passed to the server and to the clients, which need to communicate with the server. With PointStart being a loader of its own, this functionality is not needed for starting the Snapshot Servers. Relying only on the functionality of L4Re's libloader, PointStart would theoretically be able to checkpoint Ned, as none of Ned's functionality is needed by PointStart.

Only by being the loader of the application is PointStart able to explicitly save Dataspaces attached to the application at load time. These Dataspaces are needed later while checkpointing the application, because the application's region manager cannot return a Dataspace for them, but rather points to its parent region manager for retrieval. Saving the Dataspaces in the beginning thus saves some IPC calls and complexity in the implementation.

5.3 Checkpointing

The reason for obtaining a Dataspace from the application's region manager, instead of, e.g., directly mapping the memory and using memcpy, is being able to call Dataspace::copy_in(). With this call, PointStart automatically benefits from Moe's copy-on-write mechanics, as Moe is the provider of the Dataspaces. This means that memory which has not changed does not have to be copied, reducing the overhead introduced by PointStart.
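
As a sketch, the copy step on the Snapshot Server side could use copy_in() roughly like this; capability allocation and error handling are simplified, and the function name is an assumption.

    #include <l4/re/dataspace>
    #include <l4/re/env>
    #include <l4/re/util/cap_alloc>

    L4::Cap<L4Re::Dataspace> snapshot_copy(L4::Cap<L4Re::Dataspace> src,
                                           unsigned long size)
    {
      L4::Cap<L4Re::Dataspace> copy =
        L4Re::Util::cap_alloc.alloc<L4Re::Dataspace>();
      L4Re::Env::env()->mem_alloc()->alloc(size, copy);
      // copy_in(dst_offset, source, src_offset, size): Moe can back unchanged
      // pages with copy-on-write mappings instead of physically copying them.
      copy->copy_in(0, src, 0, size);
      return copy;
    }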

Saving the capabilities of an application also adds to the overhead. Considering this and the complexity of garbage collection with multiple Snapshot Servers, I decided to reduce both by re-using an older checkpoint's Task for the new checkpoint. This means that when creating a new checkpoint, of which a Snapshot Server currently only holds one, only memory and thread state are cleared while the Task is kept. ReLoader then compares the capabilities currently set in the checkpoint's Task to the current capabilities in the application's Task, removes only unequal capabilities and maps new ones over. With many applications only changing a small number of capabilities between two checkpoints, this reduces the overhead for saving checkpoints.

But the same reasoning already laid out for introducing Task::cap_list() also applies here: the capability space is too big to make a kernel call for every possible capability. Using Task::cap_list() to filter for used capabilities would be a possibility, but I decided to introduce another kernel function (Task::cap_list_equal()) for the same reasons as presented for Task::cap_list(). Task::cap_list_equal() takes a user-level memory address and two Task capabilities as arguments and returns a list with one bit for each capability index. The bit is set to 1 if both Tasks have a capability mapped in this slot and the capabilities are the same; otherwise the bit is 0. This means 128 KB of memory have to be allocated to use this function. An integration into Task::cap_list() is not generally possible, because the invoked Task would need a capability to the Task it is compared with in its own capability table. This is not the case for the Snapshot Server's checkpoint, as it only contains the capabilities of the application. Thus a third Task has to be called for the comparison of both.
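
Evaluating the returned bitmap in user-level is then a simple loop; only slots whose bit is cleared need to be touched (remap_slot() is a placeholder for the unmap-and-map-over step, not an actual PointStart function):

    #include <cstddef>
    #include <cstdint>

    void update_changed_caps(uint8_t const *equal_bits, size_t slots,
                             void (*remap_slot)(size_t))
    {
      for (size_t i = 0; i < slots; ++i)
        {
          bool equal = equal_bits[i / 8] & (1u << (i % 8));
          if (!equal)
            remap_slot(i);  // unmap the stale capability, map the current one
        }
    }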


5.4 ReLoader’s Main Loop

Initially, ReLoader created a second thread only for handling the exception messages of the application. This was done to separate the reception of IPC messages from the code responsible for regular checkpointing and restarts. With the separation, two shared-memory variables were used between the exception handler and ReLoader to signal stopping of threads and received non-artificial exceptions. To ensure multi-core compatibility, changing one of the variables also meant flushing the cache for its address.

When I started to implement handling of Parent object signals, ReLoader either had to create another thread for listening to IPC messages or handle them in the main thread. My decision was to restructure ReLoader's main loop to use polling. This removed the need for shared-memory variables and for other threads. Instead, ReLoader now checks for IPC messages with a timeout of zero. Thereby, only threads already waiting in the list of senders can deliver their messages. These messages are checked for being either of the Parent protocol or an exception message and are handled appropriately. After receiving IPC messages, ReLoader checks whether the time interval between checkpoints has been reached. Then either the application is checkpointed or ReLoader sleeps for a short time and starts the main loop again.
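
In simplified form, the restructured main loop could look as follows; the handler functions and the interval bookkeeping are placeholders for ReLoader's actual code, not its real interface.

    #include <l4/sys/ipc.h>
    #include <l4/util/util.h>   // l4_sleep()

    // Placeholders for ReLoader's actual handlers:
    void handle_exception(l4_msgtag_t tag, l4_umword_t label);
    void handle_parent_request(l4_msgtag_t tag, l4_umword_t label);
    bool checkpoint_interval_elapsed();
    void take_checkpoint();

    void reloader_main_loop()
    {
      for (;;)
        {
          l4_umword_t label;
          // Zero receive timeout: only senders already queued can deliver.
          l4_msgtag_t tag = l4_ipc_wait(l4_utcb(), &label, L4_IPC_RECV_TIMEOUT_0);
          if (!l4_ipc_error(tag, l4_utcb()))
            {
              if (l4_msgtag_is_exception(tag))
                handle_exception(tag, label);       // artificial or real fault
              else
                handle_parent_request(tag, label);  // Parent protocol signal
            }

          if (checkpoint_interval_elapsed())
            take_checkpoint();
          else
            l4_sleep(10);  // back off briefly before polling again
        }
    }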

5.5 Factory Proxy

The factory proxy implemented by PointStart is based upon the libkproxy library included in L4Re. This library provides a scheduler proxy and a factory proxy. The factory proxy was out-of-date and had a bug already mentioned on the mailing list [L4h]. After fixing this and a similar bug, I added handlers for missing kernel objects to the library. I refactored the library a little, so that already implemented handlers can be reused in other implementations of the library's Factory_svr. Utilizing this feature, I only needed to implement the handler for IPC-Gates in PointStart's factory proxy. This handler saves all created IPC-Gates into memory shared with the message logger. For a simple implementation of shared memory using global variables, I chose to use two threads in a single Task.

5.6 Message Logging

Logging messages needs two components: a receiver that logs the messages and a mechanism to forward incoming messages to the receiver. I implemented the former in user-level and the latter in kernel-level and present details of both below.

5.6.1 Kernel-Level

For supporting message logging with PointStart, the kernel needed to be extended by two features: binding an IPC-Gate to another IPC-Gate, which acts as a message logger, and blocking messages to an IPC-Gate for replaying messages. Both can be done from user-level through the added kernel call Ipc_gate::logger(). It accepts an IPC-Gate as first and the logging state as second parameter. For binding the IPC-Gate, a pointer to the Ipc_gate object is saved to the invoked IPC-Gate object. The state is also saved to the IPC-Gate object and can be one of NoFrom, Send and Block, with Send being the default on newly created IPC-Gates. NoFrom is a special state used by the IPC-Gate bound as logger to signal that the sender of the message should not be changed to the IPC-Gate's ID. This is needed by the message logger in user-level to recognize the original IPC-Gate the message was sent to.

If the state is set to Block, only messages sent by one thread are allowed. This thread is bound to the IPC-Gate that got bound as a logger to the IPC-Gate the message is sent through. This is needed for replaying messages. All other threads are blocked with a timeout, checking every 10ms whether the gate's status has changed back to Send. The sending is done by first invoking the logger's IPC-Gate with the IPC operation set to send only, as no reply from the message logger is needed. Thereafter, the thread bound to the IPC-Gate is invoked with the original IPC operation.

5.6.2 User-Level

The user-level implementation consists of two parts: the actual logger and the server used for replaying messages. As the logger is kept simple, it only listens for any IPC message. On receiving a message, it copies the message tag and the whole UTCB for later replay. This process could be optimized for space by reading the message tag and copying only the number of words the UTCB should contain. As memory was not an issue in later experiments, this optimization has not been implemented yet. When replaying messages with the second part of the user-level implementation, the previously saved UTCB is copied into the thread's UTCB for every received message and sent through the IPC-Gate using the also saved message tag.
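
A sketch of the two steps, storing the message tag together with a full copy of the UTCB message registers and writing them back for replay; the container and the send-only replay reflect the current restrictions, and the names are assumptions rather than the actual PointStart code.

    #include <l4/sys/ipc.h>
    #include <l4/sys/utcb.h>
    #include <cstring>
    #include <vector>

    struct Logged_msg
    {
      l4_msgtag_t   tag;
      l4_msg_regs_t mr;   // full copy of the UTCB message registers
    };

    void log_message(std::vector<Logged_msg> &log, l4_msgtag_t tag)
    {
      Logged_msg m;
      m.tag = tag;
      std::memcpy(&m.mr, l4_utcb_mr(), sizeof(m.mr));
      log.push_back(m);
    }

    void replay_message(Logged_msg const &m, l4_cap_idx_t gate)
    {
      std::memcpy(l4_utcb_mr(), &m.mr, sizeof(m.mr));
      // Send-only operation, matching the currently supported send semantics.
      l4_ipc_send(gate, l4_utcb(), m.tag, L4_IPC_NEVER);
    }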


6 Evaluation

With PointStart having to run additional code for checkpointing and restart, it is important to measure the introduced overhead. Especially for L4Re servers used throughout the system, a low overhead in checkpointing as well as in restart is necessary to keep the system responsive. The following sections outline the test setup and present the results of the tests. Afterwards, problems found while running the tests are discussed.

6.1 Setup

The machine used for running the tests was equipped with 8 GB of RAM, of which 3018 MB could be used by Fiasco.OC because of 32-bit mode, an Intel(R) Core(TM) i3-4150T CPU @ 3.00 GHz and a PCI-e card adding a serial port to the system. Connecting another PC to the serial port, the output of the benchmarks was logged and the measured timings were processed with awk to calculate the average run-time and the corrected sample standard deviation. Since utilizing multiple CPU cores in Fiasco.OC can only be done by explicitly assigning a thread to a CPU, all running processes shared the same core.

6.1.1 Benchmarks

The benchmarks used for evaluating PointStart can be divided into two groups: synthetic benchmarks to measure and compare the overhead, and a real L4Re server as a specific use case. For the measurements of the overhead, I decided to use the MiBench benchmark suite [Gut+01], of which multiple benchmarks were already ported to L4Re: bitcount, qsort, susan, lame and dijkstra. The paper presenting MiBench displays the instruction distribution for every benchmark. bitcount uses mostly integer operations and few memory operations while comparing different techniques for counting bits. The instruction distribution is similar for qsort, which uses the qsort function implemented in the C library. As input for this benchmark I used a list of 160086 German words [Wor], shuffled and concatenated four times. susan as ported to L4Re consists of three chained benchmarks, which implement corner detection, edge detection and smoothing of an image. Because it processes an image, many memory load operations are executed. The example susan_small.pgm was used as input for this benchmark. With lame, a normal application for encoding mp3 files is also in the benchmark suite. For encoding mp3 files, the instructions are equally distributed between integer, floating point, memory load and memory store operations. The last one, dijkstra, implements Dijkstra's algorithm for finding the shortest path between randomly chosen nodes. It uses mainly integer and memory load operations while working on an example input file (dijkstra_in.dat). The parameters and achieved run-times can be found in table 6.1.


benchmark   parameters                                avg. run-time
bitcount    30,000,000 iterations                     9.7s
qsort       600,000 german words, 10 runs             30.9s
susan       MAX_CORNERS = 15,000, 10,000 runs         29.8s
lame        46 MB private wav recording (49 min)      8.9s
dijkstra    NUM_NODES = 12,500                        21.3s
Raytrace    -m128 -pX -a256 with X in 1,2,4,6,8       21.7s

Table 6.1: parameters and run-time of used benchmarks

Since all benchmarks presented until now only use a single thread, I additionally chose Raytrace from the SPLASH-2 benchmark suite of parallel applications [Woo+95] to test PointStart's behaviour on applications with multiple threads. This benchmark mainly uses integer operations and some memory reads, no memory barriers and a low number of locks. I decided to use this one because its runtime could be easily increased, while other benchmarks from SPLASH-2 produced segfaults with increased problem complexity, even when run without PointStart. The file car.env included in the benchmark was used as input, and one, two, four, six and eight threads were requested with command line parameters (table 6.1) to see the impact of a higher number of threads.

While benchmarks can give a good view of the overhead of checkpointing, they are not L4Re server-applications. That is why I chose to also run a test with cons, an output and input multiplexing server, to show the applicability for L4Re servers. For this, I start two clients, which print an individual message and the number of prints already done to their output every second. The output in this case is cons, which runs checkpointed. This setup is used for testing checkpointing over a longer period of time and thereby finding any memory leaks. Additionally, the restarts of this combination are more important than those with the benchmarks, because this work is about L4Re servers.

For measuring the overhead of message logging, I also chose to use cons as a server, because of its ease of use. Connecting a simple client to cons, I was able to generate messages using printf. Measuring the run-time of the simple client with and without message logging, I calculated the overhead. While printing messages as fast as possible is an unusual way of using cons, the overhead can be measured without interference. To further reduce interference, I only used the message logging component of PointStart for this benchmark.

6.1.2 Procedure

For running the benchmarks, I employed the scripting capabilities of L4Re's default init process Ned. Its configuration files are actually written in Lua, which provides the possibility to use a for-loop running every benchmark 50 times. Therefore, getting 50 samples for calculating the average run-time and the standard deviation for running without checkpointing and with 2s, 1s and 500ms intervals between checkpoints can be achieved easily.

For benchmarking restarts, the overall run-time of the application is not as important as the time it takes for the application to be responding again. This is also the time ReLoader needs between recognizing a fault and finishing the restart. Therefore, I changed ReLoader to trigger restarts at certain intervals and measured the time these restarts took, starting directly at trigger time and ending when ReLoader finished the restart, right before the application's exception IPC message is handled. This was done for all benchmarks and for the setups with cons.

[Figure 6.1: benchmark overhead of using 2s, 1s and 500ms ckpt. intervals]

Additionally, I used the setups with cons for evaluating message logging. While checkpointing, restarting and logging messages for cons, I manually checked the buffers held by cons for the two clients for missing messages after each restart. With only one client printing to cons, I measured the run-time with cons showing the messages on screen/serial port and also without showing the messages. In the second case, cons logs the messages into its buffer, which should be much faster, as output, especially through a serial port, is a bottleneck. Therefore, I ran the second case with 1 million printed messages and the first case with only 1000 printed messages.

The overhead is calculated by setting the average run-time of an application without PointStart as 100% and putting the average run-time with C/R or message logging in relation to that.
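
Written out, with t_base denoting the average run-time without PointStart and t_X the average run-time with checkpointing or message logging enabled, the figures below report

    relative run-time [%] = (t_X / t_base) * 100

so the overhead is the part exceeding 100%.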

6.2 Results

The following section first covers the results of the synthetic benchmarks for checkpointing and then for restarts. Afterwards, I evaluate the results of the setups with cons for checkpointing, restart and message logging, because with cons being non-synthetic, there is more to consider than just the run-time overhead.


dijkstra   Raytrace   susan    qsort   lame    bitcount
626 MB     285 MB     187 MB   86 MB   80 MB   0.6 MB

Table 6.2: maximal used address space by regions attached to region manager

6.2.1 Checkpoint

After presenting the procedures and parameters for running the benchmarks, I will now present the results obtained. Setting the average run-time without checkpointing as 100%, fig. 6.1 shows the relative run-time of running the benchmarks with different checkpointing intervals. As expected, the run-time increases with checkpointing enabled, and smaller intervals between checkpoints add more overhead. It is not clearly discernible whether the increase is linear or rather exponential, but checkpointing at 2s adds less than 15% to the average total run-time for every benchmark. For bitcount the average total run-time even decreases slightly with 2s intervals, but it is not clear why. What one can clearly see, though, is that the checkpointing interval should be selected according to the needs of the application. As seen for the 500ms interval, the overhead varies between 18% for bitcount and 121% for qsort and seems highly dependent on the application.

qsort

What fig. 6.1 also shows is a different behaviour of qsort compared to the other benchmarks, with over 121% overhead for the 500ms checkpointing interval. One can also see that qsort is the only benchmark with a big standard deviation. The high average run-time and standard deviation do not seem to be related to the amount of memory allocated by the benchmark. Table 6.2 shows the maximal amount of address space allocated by regions attached to the region manager at checkpoint time for each benchmark. There, one can see that qsort is at the lower end of the table.

What is different between qsort and the other benchmarks is the memory access pattern. While the other benchmarks either do not use much memory at all (bitcount) or have mainly read-heavy workloads (susan) with only an output file being written (lame), qsort sorts the strings in place. This means that instead of sequentially writing to memory, adding more used pages, qsort fills the memory once and then replaces certain parts, writing all pages multiple times.

This is a bad case for the copy-on-write mechanism automatically used by PointStart with the Dataspaces provided by Moe. It means that a lot of memory actually has to be copied to be checkpointed and that more memory is used overall. Therefore, many pages actually have to be freed after deleting a Snapshot, which led to pauses in the execution of up to 6s caused by the higher prioritized Moe. Going into the kernel debugger during some of these pauses showed Moe running code from List_alloc::merge(). This code gets called after freeing memory pages. While PointStart is influenced by Moe, the depths of Moe's behaviour are not part of this work.


[Figure 6.2: overhead when checkpointing with different number of threads; x-axis: number of threads (1 to 8), y-axis: run-time in % (100 to 200) relative to running without checkpointing, shown for no checkpointing and the 2s, 1s and 500ms intervals]

Raytrace with multiple threads

While not especially notable with only one thread in fig. 6.1, Raytrace also shows the behaviour of PointStart on applications with multiple threads in fig. 6.2. There, the average run-time of running Raytrace with 2s, 1s and 500ms checkpointing intervals is shown relative to running Raytrace without checkpointing. This is done for one, two, four, six and eight threads for each interval. As one can see, the run-time overhead increases only minimally with more threads for the 2s and 1s checkpointing intervals. The slightly bigger increase with a 500ms interval is linear.

This shows that PointStart scales with the number of threads, adding only linear overhead in run-time.

6.2.2 Restart

Since the run-time of an application which was restarted in the middle of its computation is not comparable to that of an application without restart, only the restart times were measured. As a comparison, I also measured the checkpoint times. Both are shown in fig. 6.3 for the single-threaded benchmarks and in fig. 6.4 for Raytrace running with one, two, four, six and eight threads.

As one can see in fig. 6.3, the restart time depends on the application. While a correlation can be seen between memory consumption and restart time, with bitcount allocating the least amount of memory and having the smallest restart time, this does not hold true for every application. For example, susan allocates less memory than Raytrace, but Raytrace needs less time for a restart.


[Figure 6.3: checkpoint and restart time in ms]

Figure 6.3 also shows two spikes: in restarting lame and in checkpointing qsort. The latter also shows a big standard deviation of over 800ms, while all other results had a considerably lower standard deviation. The cause of qsort's different behaviour and high standard deviation has already been discussed in section 6.2.1. Other applications showed a standard deviation in checkpointing as high as 50% of their average checkpointing time (dijkstra, Raytrace); this can be explained with the usage of memory over time. dijkstra, for example, first reads in a file, creating a graph of nodes in memory. Using more memory over time, its checkpoint time rises from around 18ms for the first checkpoints to over 60ms for the last ones. Another cause could be Moe forcing the application to pause while freeing memory, as also seen in qsort's benchmark results. In contrast to the checkpoint time, the restart time shows no high standard deviation. This could be the case because the application is restarted once per run at roughly the same time, while checkpoint times are collected from multiple checkpoints per run.

The other spike in fig. 6.3, seen in lame's restart time, is not related to the output lame generates. Re-running lame with deactivated output showed the same restart times. Analyzing the problem further, I recognized up to 3574 regions attached to the region manager at restart time. As each region gets unmapped, I changed ReLoader's code to unmap in batches instead of making a single call for every attached region. This resulted in changes of about 100ms, but the difference between checkpoint time and restart time was still very big. Inspecting lame, I found the source of this behaviour: lame uses L4Re's tmpfs package for writing the output file. For every 4 KB block of the output file written to the in-memory file-system created by tmpfs, a new region is attached. Since the implementation of tmpfs is not part of this work, I stopped further investigation.

Running Raytrace with multiple threads, the results depicted in fig. 6.4 show a linear increase in restart time with a rising number of threads. This is to be expected, as more thread state needs to be reinstated. The figure also shows the higher standard deviation for checkpointing times already mentioned above.


[Figure 6.4: checkpoint and restart time with multiple threads in ms]

6.2.3 cons

The overhead of checkpointing for cons is hard to measure, as cons has no finite run-time; cons is not a synthetic benchmark. As described in the setup, two setups with cons were used: one with two clients printing regularly and one with a single client printing as fast as possible.

The results measured with two clients attached to cons include the average checkpoint and restart time. While needing around 2 MB of memory, a checkpoint of cons took on average 10.6ms and a restart took on average 14.7ms. This should be hardly noticeable in real applications. The evaluation also showed that a long-running application can run on top of PointStart, as this setup with cons was kept running for 45 minutes until stopped manually. Taking a checkpoint every 2s and restarting every 11s, this means a total of 1350 checkpoints and 245 restarts took place without a fault.

While cons continued to work after restarts, state was lost. This is due to the fact that cons keeps a buffer of printed messages for every client. On restart, this buffer is reset to the last checkpoint's state, thus losing messages between the last checkpoint and the restart. For this specific setup, the problem could be approached by reducing the checkpoint interval to 500ms, but this approach is limited if the clients reduce their printing interval. Therefore, I reran the setup with message logging enabled. With message logging, cons' message buffers were refilled on restart, so this state was not lost.

Because the setup with two clients loses at most two messages per client on restart, no overhead was measurable when adding message logging. The second setup with only one client, however, was chosen specifically to provide a possibility for measuring overhead. The overhead when printing all messages to the serial port with message logging, compared to without, was only 0.2%. The cause is that the overhead of printing the messages to the serial port is much higher than the one introduced by message logging.


When not printing the messages to the serial port, message logging introduced the entire overhead of 84.4%. That this is not 100%, although every message has to be sent twice, is a great achievement.

6.2.4 Enhancement Process

The results presented in this chapter are the outcome of an enhancement process that took place over the time of writing this work. Running the experiments revealed some flaws in the design and implementation, for which I worked out solutions; I present these briefly below, together with the problems.

Running a benchmark multiple times as described in section 6.1.2, I discovered memory leaks. Memory was not freed, because the capabilities were mapped into the Snapshot on every checkpoint. This removed the delete rights from the capability table, so neither the capability nor the memory was deleted at program termination. To fix this, I added Task::cap_list_equal() and mapped only new capabilities. This also improved the performance after the first checkpoint. Taking dijkstra as an example, the first checkpoint took on average 18ms, while the second one for that Snapshot Server only took on average 16ms. The reason is that a smaller number of capabilities has to be mapped into the Snapshot from the second checkpoint onwards.

With tmpfs in the lame benchmark using many capabilities, some user-level capability space problems were found. Attaching multiple regions for one Dataspace also exposed a problem in the design of PointStart. Originally, PointStart made a copy of the Dataspace of every region, ignoring the fact that a Dataspace can be attached in parts as multiple regions. This led to a huge overhead in memory, which could be removed by only making one copy and using a different storage method in the Snapshot Server. As described in section 6.2.2, analyzing the benchmark results of lame also improved the restart time for applications using many regions.

Running cons, which was the only benchmark receiving messages from other clients, showed a flaw in PointStart's design when applied to L4Re-server-applications instead of synthetic benchmarks: losing state because of lost messages. By incorporating message logging, this problem could be overcome.


7 Conclusion and Outlook

In this work, I identified and presented the properties most important for client-independent C/R of L4Re-server-applications. Using these properties, current C/R systems were examined. The analyzed C/R systems, however, either did not fulfill all properties or could not be directly ported because of tight integration into the OS or a lack of corresponding APIs. Therefore, I designed and presented a new C/R system that incorporates message logging: PointStart.

With L4ReAnimator [Vog+10] there already existed a previous approach to C/R on Fiasco.OC/L4Re, but my approach differs in being client-independent. This independence was achieved by keeping created capabilities alive while restarting, instead of recreating them. Security problems with managing kernel memory in user-level could thereby be avoided for capabilities. Implementing PointStart required only small changes to the underlying kernel, which were needed for correctness and performance reasons as discussed in section 4.1. The transparent design of PointStart also allows unmodified servers to be run, as long as PointStart is used as the loader. There are, however, restrictions on a server-application, which were discussed in section 4.5.

Utilizing parts of the MiBench and SPLASH-2 benchmark suites, I was able to measure the overhead introduced by PointStart. The results differed from application to application but showed the applicability of PointStart. Especially the presented test with the console multiplexing server cons has shown the suitability of PointStart for L4Re-server-applications. This experiment also showed the importance of message logging for a C/R system.

With the benchmarks and experiments showing practical usage, there are still possibilities to enhance PointStart. The current implementation only supports one checkpoint per Snapshot Server. By implementing incremental checkpointing, a higher number of checkpoints could be stored. Thereby, a fault remaining unidentified for a longer period of time could be coped with while keeping the memory overhead at a minimum. The addition of signing the output of Thread::get_state() would be a good approach to increase the security of PointStart. Additionally, the storing of logged messages could be optimized by parsing the message tag, and handling of flexpages in messages would need to be implemented to support a greater range of servers. While some experiments could be done with compressing the memory of checkpoints, this will interfere with Moe's copy-on-write and might cause more overhead in terms of time. Still, it might be worth evaluating.


Bibliography

[All94] Colin Allison. „Wanted: an application aware checkpointing service“. In: Proceedings of the 6th workshop on ACM SIGOPS European workshop: Matching operating systems to application needs. ACM. 1994, pp. 178–183 (cit. on p. 1).

[Ans+09] Jason Ansel, Kapil Arya, and Gene Cooperman. „DMTCP: Transparent checkpointing for cluster computations and the desktop“. In: IEEE International Symposium on Parallel & Distributed Processing. IEEE. 2009, pp. 1–12 (cit. on p. 12).

[Bad+02] R Badrinath, Rakesh Gupta, and Nisheeth Shrivastava. „FTOP: A Library for Fault Tolerance in a Cluster.“ In: IASTED Conference on Parallel and Distributed Computing and Systems. Citeseer. 2002, pp. 515–520 (cit. on p. 11).

[Bro+06] Greg Bronevetsky, Rohit Fernandes, Daniel Marques, Keshav Pingali, and Paul Stodghill. „Recent advances in checkpoint/recovery systems“. In: IEEE International Symposium on Parallel & Distributed Processing. IEEE. 2006, 8–pp (cit. on p. 6).

[Can+04] George Candea, Shinichi Kawamoto, Yuichi Fujiki, Greg Friedman, and Armando Fox. „Microreboot - A Technique for Cheap Recovery.“ In: Symposium on Operating Systems Design and Implementation. Vol. 4. 2004, pp. 31–44 (cit. on p. 1).

[Chr+06] Rosalia Christodoulopoulou, Kaloian Manassiev, Angelos Bilas, and Cristiana Amza. „Fast and transparent recovery for continuous availability of cluster-based servers“. In: Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming. ACM. 2006, pp. 221–229 (cit. on p. 11).

[Dav] Francis M David. „Transparent Recovery from Operating System Errors“. In: Proceedings of DSN 2007, Edinburgh, pp. 25–28 (cit. on p. 9).

[Due05] Jason Duell. „The design and implementation of Berkeley Lab's Linux checkpoint/restart“. In: Lawrence Berkeley National Laboratory (2005) (cit. on p. 11).

[Döb+12] Björn Döbel, Hermann Härtig, and Michael Engel. „Operating system support for redundant multithreading“. In: Proceedings of the tenth ACM international conference on Embedded software. ACM. 2012, pp. 83–92 (cit. on p. 1).

[Gei94] Al Geist. PVM: Parallel virtual machine: a users' guide and tutorial for networked parallel computing. MIT press, 1994 (cit. on p. 11).

[Gut+01] Matthew R Guthaus, Jeffrey S Ringenberg, Dan Ernst, et al. „MiBench: A free, commercially representative embedded benchmark suite“. In: IEEE International Workshop on Workload Characterization. IEEE. 2001, pp. 3–14 (cit. on pp. 27, 28).


[Hae03] Andreas Haeberlen. „Managing kernel memory resources from user level“. Master's thesis. University of Karlsruhe, 2003 (cit. on p. 19).

[Hen99] Erik Hendriks. „BPROC: A distributed PID space for Beowulf clusters“. In: Proceedings of Linux Expo. Vol. 99. 1999 (cit. on p. 11).

[Jey04] Ashwin Raju Jeyakumar. „Metamori: A library for incremental file checkpointing“. PhD thesis. Citeseer, 2004 (cit. on p. 9).

[L4h] L4kproxy::Factory_svr. 2013. URL: http://os.inf.tu-dresden.de/pipermail/l4-hackers/2013/006027.html (visited on Nov. 10, 2015) (cit. on p. 25).

[Laa+05] Oren Laadan, Dan Phung, and Jason Nieh. „Transparent checkpoint-restart of distributed applications on commodity clusters“. In: IEEE International Conference on Cluster Computing. IEEE. 2005, pp. 1–13 (cit. on p. 12).

[Lac+10] Adam Lackorzynski, Alexander Warg, and Michael Peter. „Generic virtualization with virtual processors“. In: Twelfth Real-Time Linux Workshop. 2010 (cit. on p. 12).

[Lan92] Charles R Landau. „The checkpoint mechanism in KeyKOS“. In: Proceedings of the Second International Workshop on Object Orientation in Operating Systems. IEEE. 1992, pp. 86–91 (cit. on p. 10).

[Lit+97] Michael Litzkow, Todd Tannenbaum, Jim Basney, and Miron Livny. Checkpoint and migration of UNIX processes in the Condor distributed processing system. Computer Sciences Department, University of Wisconsin, 1997 (cit. on p. 11).

[MG09] Andrew Maloney and Andrzej Goscinski. „A survey and review of the current state of rollback-recovery for cluster systems“. In: Concurrency and Computation: Practice and Experience 21.12 (2009), pp. 1632–1666 (cit. on p. 6).

[Mil+99] Dejan S Milojicic, Wolfgang Zint, Andreas Dangel, and Peter Giese. „Task Migration on the top of the Mach Microkernel“. In: Mobility. ACM Press/Addison-Wesley Publishing Co. 1999, pp. 134–151 (cit. on p. 10).

[Osm+02] Steven Osman, Dinesh Subhraveti, Gong Su, and Jason Nieh. „The design and implementation of Zap: A system for migrating computing environments“. In: ACM Special Interest Group on Operating Systems Operating Systems Review 36.SI (2002), pp. 361–376 (cit. on p. 12).

[Pla+94] James S Plank, Micah Beck, Gerry Kingsley, and Kai Li. Libckpt: Transparent checkpointing under unix. Computer Science Department, 1994 (cit. on p. 11).

[Rie+06] Michael Rieker, Jason Ansel, and Gene Cooperman. „Transparent User-Level Checkpointing for the Native Posix Thread Library for Linux.“ In: Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications. Vol. 6. 2006, pp. 492–498 (cit. on p. 12).

[Sch90] Fred B Schneider. „Implementing fault-tolerant services using the state machine approach: A tutorial“. In: ACM Computing Surveys (CSUR) 22.4 (1990), pp. 299–319 (cit. on p. 1).

[Sha+99] Jonathan S Shapiro, Jonathan M Smith, and David J Farber. EROS: a fast capability system. Vol. 33. 5. ACM, 1999 (cit. on p. 10).

[Sid+09] Stelios Sidiroglou, Oren Laadan, Carlos Perez, et al. „Assure: automatic software self-healing using rescue points“. In: ACM SIGARCH Computer Architecture News 37.1 (2009), pp. 37–48 (cit. on p. 9).


[Sud+07] Oleksandr O Sudakov, Ievgenii S Meshcheriakov, and Yuriy V Boyko. „CHPOX: transparent checkpointing system for Linux clusters“. In: 4th IEEE Workshop on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications. 2007 (cit. on p. 11).

[Tan+06] Andrew S Tanenbaum, Jorrit N Herder, and Herbert Bos. „Can we make operating systems reliable and secure?“ In: Computer 39.5 (2006), pp. 44–51 (cit. on pp. 3, 9).

[tea] CRIU team. Checkpoint/Restart In Userspace. URL: https://web.archive.org/web/20150503040058/http://criu.org/Main_Page (visited on June 7, 2015) (cit. on p. 12).

[Vog+10] Dirk Vogt, Björn Döbel, and Adam Lackorzynski. „Stay strong, stay safe: Enhancing reliability of a secure operating system“. In: Proceedings of the Workshop on Isolation and Integration for Dependable Systems. 2010 (cit. on pp. 1, 10, 35).

[Wan+97] Yi-Min Wang, Pi-Yu Chung, Yennun Huang, and Elmootazbellah N Elnozahy. „Integrating checkpointing with transaction processing“. In: Twenty-Seventh Annual International Symposium on Fault-Tolerant Computing. IEEE. 1997, pp. 304–308 (cit. on p. 9).

[Woo+95] Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta. „The SPLASH-2 programs: Characterization and methodological considerations“. In: ACM Special Interest Group on Computer Architecture Computer Architecture News. Vol. 23. 2. ACM. 1995, pp. 24–36.

[Wor] words.german.Z. 1992. URL: https://web.archive.org/web/20151108144621/http://ftp.icm.edu.pl/packages/wordlists/german/ (visited on Sept. 8, 2015) (cit. on p. 27).

[Zhe+04] Gengbin Zheng, Lixia Shi, and Laxmikant V Kalé. „FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI“. In: International Conference on Cluster Computing. IEEE. 2004, pp. 93–103 (cit. on p. 7).

[ZN01] Hua Zhong and Jason Nieh. CRAK: Linux checkpoint/restart as a kernel module. Tech. rep. Citeseer, 2001 (cit. on p. 11).
