
VIRTUAL MEMORY MANAGEMENT IN THE

MOVE FROM 32 TO 64 BIT WITH REFERENCE

TO MAC OS X AND LINUX

BACHELOR-ARBEIT

KARL RITSON

15 July 2005

Technische Universität München, Fakultät für Informatik

Bachelor-Arbeit

Advisor: Univ.-Prof. Dr. Peter Paul Spies
Supervisor: Dipl.-Inform. Martin Uhl

Submission date: 15.07.2005


I affirm that I wrote this bachelor thesis independently and used only the sources and aids listed.

Date                    Signature


Abstract

The demand for 64-bit computing brings with it a whole new set of issues and challenges and provides the opportunity to reexamine fundamental operating system structure, specifically to change the way operating systems use address space. This bachelor thesis tackles the problems associated with administering one large, flat address space versus private address spaces and examines what sort of support will be necessary for a single address space.


Contents

1 The Past
   1.1 Introduction
   1.2 History of the Mac OS X Kernel - Mach 3.0
   1.3 History of Linux
   1.4 History of Virtual Memory

2 The Present
   2.1 Virtual Memory Today
      2.1.1 Paging
      2.1.2 Page Tables
      2.1.3 Translation Lookaside Buffer
      2.1.4 Page Replacement Algorithms
   2.2 Protection
      2.2.1 Protection Domains
      2.2.2 Access Control and the Capability System
   2.3 Mac OS X Software Architecture
      2.3.1 OS X Virtual Memory Architecture
   2.4 Linux Kernel Software Architecture
      2.4.1 The Kernel Layer
      2.4.2 Linux Virtual Memory Architecture
      2.4.3 Page Fault Handler
      2.4.4 Page Replacement
   2.5 Linux vs Mach
   2.6 Chip Architecture
      2.6.1 The PowerPC Processor
      2.6.2 Page Address Translation
      2.6.3 i386 Processor
   2.7 Chip Architecture - from 32 to 64 Bit on Intel and AMD Chips
      2.7.1 Intel
      2.7.2 AMD
   2.8 Personal Spaces vs Single Spaces
      2.8.1 The Single Address Space Approach
      2.8.2 Capabilities

3 The Future
   3.1 Introduction
   3.2 Conclusion and Outlook
   3.3 Future Work



Bibliography


List of Figures

2.1 Mapping from a virtual to a physical address
2.2 Paging
2.3 An example 64-bit page table entry
2.4 A 32-bit page table entry
2.5 Linear Page Table
2.6 Multilevel Page Table
2.7 Hashed Page Table
2.8 Clustered Page Table
2.9 An overview of the five layers of the Mac OS X architecture
2.10 IPC structure in Mach
2.11 The major subsystems of the Linux operating system
2.12 Linux Conceptual Decompositional Overview
2.13 Buddy1
2.14 Buddy2
2.15 Buddy3
2.16 Memory Manager Structure
2.17 MMU Conceptual Block Diagram - 64 Bit Implementation
2.18 Page Address Translation Overview - PowerPC 64 Bit
2.19 Page Address Translation Overview - PowerPC 32 Bit
2.20 386 Linear Address
2.21 386 Physical Address Computation
2.22 386 Page Table Entry
2.23 Itanium® address translation
2.24 Three choices for structuring cooperation between application components
2.25 A Capability
2.26 A Guarded Pointer



Chapter 1

The Past

1.1 Introduction

This paper looks at virtual memory management in two key operating system kernels: the XNU kernel from Darwin (Mac OS X) and the Linux kernel, the idea being that they constitute two fundamentally different kernel architectures, the micro- and the monolithic kernel. I begin with a brief history of the two and conclude this first chapter with a delve into the history of virtual memory itself.

1.2 History of the Mac OS X Kernel - Mach 3.0

In 1985, after having his responsibilities at Apple "removed", Steve Jobs went on to found NeXT Computer, Inc., which created an operating system called NEXTSTEP that used a port of CMU Mach 2.0 (with a 4.3BSD environment) as its kernel [KER00]. NeXT partnered with Sun Microsystems to jointly release specifications for OpenStep, an open platform comprising several APIs and frameworks that anybody could use to create their own implementation of an object-oriented operating system, running on any underlying core operating system. NeXT was taken over by Apple in 1997 and the operating system kernel, Mach, was used as a foundation of OS X. OpenStep was not the only Mach-based operating system to influence the development of Mac OS X. OpenStep was combined with BSD 4.4 to produce a system called Rhapsody, a POSIX-compatible, UNIX-based operating system, made by Apple Computer to run on Intel/Cyrix/AMD Pentium and Motorola/IBM PowerPC. A developer release of Rhapsody was forked to produce the first version of a further operating system, Darwin 0.1, in 1999. Darwin is in itself an operating system but can be viewed as a collection of technologies forming the central part of Mac OS X (Darwin 7.0.x consists of over 250 packages). Application environments such as Cocoa, Carbon and Aqua are not part of Darwin. The Mac OS X kernel, called XNU, contains code based on the Mach 3.0 microkernel. As Mach plays such a large role in the evolution of OS X, it is instructive to briefly discuss the development of the Mach kernel.

In 1975, a group of researchers at the University of Rochester in New York began development of an "intelligent" gateway system called RIG (Rochester Intelligent Gateway) to provide local and remote access to certain computing facilities. Its operating system, named Aleph, had a kernel structured around an inter-process communication facility whereby processes could send messages to each other by specifying the destination as a process and port number. A port was a message queue in the kernel, and a process could have any number of ports defined within it to receive messages on. A process X could "shadow" another process Y (X receives a copy of every message sent to Y), or X could "interpose" Y (X intercepts all messages sent to, or originating from, Y). However, RIG had serious shortcomings: a 2 KB limit on the size of a message (due to limited address space) and the resulting IPC inefficiency; no protection for ports (port numbers were global, and any process could create and use them, so any process could send a message to any other process); and no way to notify the failure of a process to another process that depended on it (without explicit registration of such dependencies). These problems meant that RIG was killed a few years later [KER00].

One of the people who worked on RIG was Richard Rashid. In 1979, he moved to Carnegie Mellon University, where he worked on Accent, a network operating system kernel. Like RIG, Accent used IPC as the basic system structuring tool. However, Accent addressed RIG's shortcomings: ports now had "capabilities", and copy-on-write memory mapping was used to facilitate large message transfers. Messages could be sent to processes on another machine through an intermediary process. Accent had flexible and powerful virtual memory management, which was integrated with IPC and file storage. The sequel to Accent was Mach.

Early versions of Mach had monolithic kernels (meaning drivers, system services, etc. executed in kernel space) using much of BSD's code; version 3.0 was the first microkernel in the sense that BSD ran as a user-space task. Mach started life as the "Multiprocessor Universal Communication Kernel" [1] at Carnegie Mellon University in 1985, and although research on Mach at Carnegie Mellon ended in 1994, the Open Software Foundation's Cambridge and Grenoble Research Institutes continue to refine Mach's structure. Mach was designed from the ground up as a multitasking, multithreaded operating system with very powerful communications facilities. It excels in multiprocessor and distributed environments [SCO97]. The virtual memory subsystem was designed to handle sparse address spaces, whereby regions of memory could be allocated from anywhere within the address space. This is in contrast to a contiguous virtual memory space where heap and stack grow towards one another.

Mac OS X nowadays is a hybrid between microkernel and monolithic kernel architectures. The Mach microkernel manages five fundamental abstractions: tasks, threads, ports, messages and memory objects. In what follows, I will focus on the way the Macintosh architecture, via Mach, allocates and manages virtual memory.

1.3 History of Linux

Linux is a Unix-like multitasking, multiuser 32- and 64-bit operating system for a variety of hardware platforms, licensed under an open source licence [WIR03]. Although Linux is neither pure SYSV [2] nor pure BSD, it is an (increasingly) POSIX-compliant Unix clone, and it inherits its multitasking and multiuser capabilities from its Unix lineage, which means it shares a number of technical features with Unix [3]. The Linux kernel was originally developed for the Intel 80386 by Linus Torvalds with the idea of learning the capabilities of the 80386 processor for task switching. It is traditionally a monolithic kernel, meaning, amongst other things, that any part of the kernel can call procedures in any other part directly.

[1] One of Richard Rashid's colleagues, the Italian Dario Giuse, mispronounced the unfortunate acronym "MUCK" as "MACH", and the name stuck.

[2] UNIX System V. There are essentially two "flavours" of UNIX: AT&T's System V and BSD.

[3] A thorough graphical timeline of the history of Unix and its derivatives can be found at http://www.levenez.com/unix/

Originally named Freax, Linux was first hosted using the Minix operating system. The first public release of the Linux kernel (v0.01) was in September 1991, containing just over 10,000 lines of code (the latest release has just under 6,000,000). The size of the tar.gz archive was 71 KB. This version ran only on Intel 80386 processors on a PC architecture. The only supported file system was MinixFS and there was no network support at all. It could handle multitasking, had a few drivers, and there was a basic virtual memory subsystem. In a now-famous posting on 25 August 1991, a date which can be considered the "birth" of Linux, Linus said:

it uses a MMU (Memory Management Unit), for both paging (not to disk yet) and segmentation. It's the segmentation that makes it REALLY 386 dependent (every task has a 64Mb segment for code & data - max 64 tasks in 4Gb. Anybody who needs more than 64Mb/task - tough cookies)

It quickly attracted the attention of other programmers. With the release of version 0.11 in December 1991 Linux became self-hosting (i.e. you could compile Linux 0.11 under Linux 0.11), and it wasn't until version 0.96, released in April 1992, that Linux could host the X Window System. Version 1.0 of the Linux kernel was released in March 1994. Since then, the kernel has gone through many development cycles, each culminating in a stable version. Each development cycle has taken a year or three, and has involved redesigning and rewriting large parts of the kernel to deal with changes in hardware (for example, new ways to connect peripherals, such as USB) and to meet increased speed requirements as people apply Linux to larger and larger systems (or smaller and smaller ones: embedded Linux is becoming a hot topic). Most of the code volume in Linux is device drivers; the core functionality, which implements multitasking and multiuser support, is small in comparison.

A good implementation of multitasking requires, among other things, proper memory management. The operating system must use the memory protection support in the processor to protect running programs from each other. Otherwise a buggy program (that is, almost any program) may corrupt the memory area of another program, or of the operating system itself, causing weird behaviour or a total system crash. Linux memory management is the subject of part of this paper.

1.4 History of Virtual Memory

An essential component for understanding memory usage is the division between main (or primary) and secondary memory. I will use the terms "main memory", "memory" or "RAM" to refer to main (or core) memory and "disk", "secondary memory" or "backing store" to refer to secondary memory. I forego an explanation of these types of memory on the assumption that the reader has a basic grounding in computing, adding only that primary memory is the only memory addressed by the CPU.

Early computers used a two-level storage system consisting of main memory, which was expensive, and secondary memory, which was comparatively cheap. But main memory was thin on the ground, and this lack of space had serious implications. What happens if a programme "grows" while it is being executed? Eventually it will run out of main memory to use. More to the point, what happens with multiple programmes running? Programmers used to have to rewrite their programmes to fit into the different amounts of memory on different machines. People began to look for some way to avoid this. The idea was that, where the combined size of programme, data and stack exceeds the physical amount of memory available for them, the slow secondary storage is used to store those parts of the programme not currently in use [TAN87].

An early solution, circa 1950-1960, was to divide programmes up into blocks or pages, the movement operations being called overlays [DEN96]. The first overlay would run until it was finished and would then call the second one. The overlays were kept on disk and were swapped in and out of memory as required. Splitting up the programmes into overlays was the task of the programmer. This splitting took a long time, and people looked elsewhere for a way to automate the process.

A revolutionary new concept was introduced by the Atlas team at the University of Manchester, who by 1959 had produced the first prototype of a virtual memory system, the "one-level storage system", an idea credited to Fotheringham in 1961. It was radical in that, for the first time, there was a distinction between "address" and "memory location". To achieve this, a special automatic combination of hardware and software was needed to

• translate each address generated by the processor to its current memory location

• move a missing page of data into main memory (demand paging, more on this later)

• move the least useful pages out of memory

Before virtual memory could be regarded as a stable entity, many models, experiments, and theories had to be developed to overcome its numerous problems, such as "thrashing", extended periods of paging that reduce performance. By the mid-70s the virtual memory idea had been perfected enough to use.

In 1965, hardware designer Maurice Wilkes [WIL65] proposed what he called "slave memory": a small, high-speed store, included in the processor, to hold a small number of recently used blocks of code. Wilkes argued that by eliminating many data transfers between the processor and main memory, the system would run to within a few per cent of the full processor speed, costing only a few per cent of main memory [DEN96]. The term "cache memory" later replaced "slave memory", and the idea paved the way for the modern Translation Lookaside Buffer (TLB).


Chapter 2

The Present

Before we examine what the new era of 64 bits can and cannot achieve, we need to look at the situation today in the 32-bit arena. This section of the paper describes inter-process communication, virtual memory and private address spaces in the 32-bit arena, comparing and contrasting the Mach 3.0 and Linux 2.4.x kernels, in order to consider what changes in the protection of and access to operating system primitives such as 'the process' could overcome the trade-off between performance and security in the jump to 64-bit architectures and single address spaces. The reader who has a thorough knowledge of 32-bit, private address space addressing mechanisms might find this part of my thesis repeating what (s)he already knows. In the following, and for the rest of this paper, I use "32-bit systems", "32-bit machines", "32-bit architectures" and "32-bit domains" to mean the same thing, namely computers with 32-bit wide address registers; ditto "64-bit". I also use "Memory Manager", "Memory Management Unit" and "Virtual Memory Manager" to mean the chip on the CPU card that translates virtual addresses into physical addresses. "PPC" means PowerPC, the CPU in a Macintosh.

2.1 Virtual Memory Today

Virtual memory has been a feature of the Macintosh operating system since System 7.0, released in May 1991, nine years before Mac OS X. The two VM systems, for OS 9 and OS X, had several differences, but the two main ones are these: OS 9 simulated a larger fixed address space by storing the entire address space on disk, and its VM system was switchable, i.e. you could turn it off and on. Under OS X it is not possible to switch VM off, and the entire address space is not stored on disk: memory is paged in as needed. Linux has had VM from its inception (September 1991). VM does more than just make your computer's memory go further. The memory management subsystem provides:

Large Address Spaces The operating system makes the system appear as if it has a larger amount of memory than it actually has. The virtual memory can be many times larger than the physical memory in the system. In 32-bit systems we have 2^32 bytes of virtual address space, or 4 gigabytes. The hard disk storage is sometimes called the "swap space" or "backing store" because of its use as storage for data being swapped in and out of memory. Unlike most UNIX-based operating systems, Mac OS X does not use a preallocated swap partition for virtual memory; instead, it uses all of the available space on the machine's boot partition [APP00].

Protection Each process [1] in the system has its own virtual address space. These virtual address spaces are completely separate from each other, so a process running one application cannot affect another. Also, the hardware virtual memory mechanisms allow areas of memory to be protected against writing. This protects code and data from being overwritten by rogue applications.

Resource virtualisation On a system with virtual memory, a process does not have to concern itself with the details of how much physical memory is available or which physical memory locations are already in use by some other process. In other words, virtual memory takes a limited physical resource (physical memory) and turns it into an abundant one (virtual memory).

Fault Isolation Processes with their own virtual address spaces cannot overwrite each other's memory. This greatly reduces the risk of a failure in one process triggering a failure in another. That is, when a process crashes, the problem is generally limited to that process alone and does not bring the entire machine down.

Shared Virtual Memory Although virtual memory gives processes separate address spaces, there are times when processes need to share memory. Dynamic libraries are a common example of executable code shared between several processes. Shared memory can also be used as an Inter-Process Communication (IPC) mechanism, with two or more processes exchanging information via memory common to all of them. Linux supports the Unix System V shared memory IPC.

For 32-bit applications, this virtual address space is 4 GB. For 64-bit applications, the theoretical maximum is 2^64 = 18,446,744,073,709,551,616 bytes: approximately 18 exabytes in decimal units (1 EB = 10^18 bytes), or exactly 16 exbibytes in binary units (1 EiB = 2^60 bytes).

2.1.1 Paging

When a process attempts to access memory, it does not know that the operating system will be interfering with the address it generates. As far as it is concerned, it just sees 4 GB and wants the contents of address XYZ so it can continue with its calculations. However, the operating system does interfere with the addresses, mainly because there is not 4 GB worth of RAM in the machine (there are other reasons, as outlined above). Hence the addresses generated are virtual addresses (i.e. not physical), and together they form what is called the virtual address space. Somehow these addresses have to be interpreted and converted into physical addresses. This is done by the Memory Management Unit, which sits between the processor and the memory bus.

The memory manager divides the virtual (or logical) address space into equally sized chunks called pages, usually some integer power of two in size, typically 4 KB. On Intel x86 and PPC systems the page size is 4 KB, but the PDP-11 uses 8 KB pages and the VAX page size is 512 bytes. Each page in virtual memory has a corresponding page frame in physical memory, page frames being the same size as pages. Each page frame is given a page frame number, and there is some mechanism to keep track of whether a page frame has been allocated or not, often by way of a present/absent bit (sometimes known as a valid bit) in the page table (see below).

[1] By "process" I understand a programme in execution ([TAN87]), consisting of the executable programme, the programme's data and stack, the current values of its programme counter, stack pointer and other registers, and all the other information needed to run the programme.

In this paged model, a virtual address is composed of two parts: an offset and a virtual page number. With a 4 KB page size, the low-order 12 bits of the address are the offset and the remaining high-order bits are the virtual page number. The PPC [2] numbers bits from the most significant downwards, so the offset occupies bits 20-31 and the virtual page number bits 0-19; the x86 numbers bits from the least significant upwards, so the offset occupies bits 0-11 and the virtual page number bits 12 and above. The virtual page number is used as an index into the page table, which gives us the corresponding page frame number; this, concatenated with the offset, produces the physical address.
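The split-and-recombine step just described can be sketched in C. This is an illustrative fragment, not code from either kernel: the constants assume a 32-bit machine with 4 KB pages, the function names are invented for the sketch, and the page table lookup itself is elided.

```c
#include <stdint.h>

/* Illustrative constants: 32-bit addresses, 4 KB pages (12 offset bits). */
#define PAGE_SHIFT  12u
#define PAGE_SIZE   (1u << PAGE_SHIFT)   /* 4096 */
#define OFFSET_MASK (PAGE_SIZE - 1)      /* 0xFFF */

/* Split a virtual address into its two components. */
static uint32_t vpn(uint32_t vaddr)      { return vaddr >> PAGE_SHIFT; }
static uint32_t page_off(uint32_t vaddr) { return vaddr & OFFSET_MASK; }

/* Recombine: the page frame number looked up in the page table,
   concatenated with the unchanged offset, is the physical address. */
static uint32_t phys_addr(uint32_t pfn, uint32_t off)
{
    return (pfn << PAGE_SHIFT) | off;
}
```

For example, the virtual address 0x00403ABC has virtual page number 0x403 and offset 0xABC; if the page table maps page 0x403 to frame 0x7, the physical address is 0x7ABC.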

Figure 2.1 explains the situation:

Figure 2.1: Mapping from a virtual to a physical address

If the translation from a logical page address to a physical address fails, a page fault occurs and a page-fault handler is invoked to stop executing the current code and attend to the fault. This handler finds a free page of physical memory, transfers the data from the disk to the RAM, and updates the page table. If no free pages are available, the handler has to release an existing page, first writing any modified data back to disk. This is known as paging. Figure 2.2 [3] illustrates the situation:
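The handler's decision flow (find a free frame, otherwise evict a victim, flushing dirty data first, then load the page and fix the table) can be simulated in a few lines of C. This is a toy model under stated assumptions: a handful of frames, FIFO victim selection rather than a real replacement algorithm, and stubbed-out disk I/O. All names (`handle_fault`, `frame_owner`) are invented for the sketch.

```c
#include <stdbool.h>

#define NFRAMES 4
#define NPAGES  16

struct pte { bool present; bool dirty; int frame; };

static struct pte page_table[NPAGES];
static int frame_owner[NFRAMES] = { -1, -1, -1, -1 };  /* -1 = frame free */
static int next_victim = 0;                            /* FIFO hand */

static void disk_write(int page) { (void)page; }  /* stub: write back */
static void disk_read(int page)  { (void)page; }  /* stub: page in */

/* Handle a fault on `page`; returns the frame it was loaded into. */
static int handle_fault(int page)
{
    int f;
    for (f = 0; f < NFRAMES; f++)            /* 1. look for a free frame */
        if (frame_owner[f] == -1)
            goto load;
    f = next_victim;                         /* 2. none free: pick a victim */
    next_victim = (next_victim + 1) % NFRAMES;
    if (page_table[frame_owner[f]].dirty)
        disk_write(frame_owner[f]);          /*    flush modified data first */
    page_table[frame_owner[f]].present = false;
load:
    disk_read(page);                         /* 3. transfer disk -> RAM */
    frame_owner[f] = page;
    page_table[page] = (struct pte){ .present = true, .dirty = false, .frame = f };
    return f;
}
```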

Figure 2.2: Paging

2.1.2 Page Tables

The purpose of the page table is to translate virtual pages into physical pages in RAM. The location of the page table in memory is stored in hardware (normally a register pointing to the address of the start of the page table for the current process). A page table stores translation, protection, attribute and status information for virtual addresses [HUC93, CHA88], so that each process has its own protected 4 GB address space.

[2] Actually, the PPC is "bytesexual" or "bi-endian", meaning it is an architecture that can be configured either way.

[3] Graphic taken from [HAR98] and modified.

Page tables were essentially designed for 32-bit address spaces. So how well do they perform in the 64-bit arena with sparse address spaces? In this section I examine several page table designs and compare their benefits for 32- and 64-bit machines.

TLB miss overhead correlates directly with page table performance [SZM00]. An ideal page table would facilitate a fast TLB miss handler, use little virtual or physical memory, and flexibly support the operating system in modifying the page table.

General

On a 32-bit machine with 32-bit addresses and 4 KB pages, there are 20 bits of page address and hence 2^20 page table entries. So if each page table entry is 32 bits (4 bytes) wide, the page table is 2^20 * 4 bytes = 4 MB. And that is just for one process! If several processes are running, they quickly eat up memory. The situation on a 64-bit machine is far worse: with 4 KB pages there are 2^52 possible entries, and with eight-byte entries a flat table would occupy 2^55 bytes, tens of petabytes. So some consideration as to how to construct page tables is called for.
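The arithmetic generalises: a flat table needs one entry per page of virtual address space. A small helper makes the comparison concrete (the function name is invented for this sketch; 4 KB pages are assumed):

```c
#include <stdint.h>

#define PAGE_SHIFT 12   /* 4 KB pages */

/* Bytes needed for a flat page table covering `addr_bits` of virtual
   address space with `pte_bytes` per entry: one entry per page. */
static uint64_t flat_table_size(unsigned addr_bits, unsigned pte_bytes)
{
    return ((uint64_t)1 << (addr_bits - PAGE_SHIFT)) * pte_bytes;
}
```

`flat_table_size(32, 4)` gives the 4 MB per process computed above, while `flat_table_size(64, 8)` gives 2^55 bytes, which is why nobody builds a flat table for a full 64-bit space.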

For all page table designs, 64-bit address mapping information will require eight bytes per entry [TAL95]. A typical page table entry will have one valid bit, a 28-bit physical page number, 12 bits of software and hardware attributes, and "PAD" bits reserved for future use [TAL95]; see figure 2.3.

A 32-bit page table entry, on the other hand, is only 4 bytes and might look something like figure 2.4:


Figure 2.3: An example 64-bit page table entry

Figure 2.4: A 32-bit page table entry


Linear Page Tables or Linear Virtual Arrays

Linear page tables store all page table entries for a process in a single array indexed by the virtual page number (see figure 2.5) [4]. Such page tables are very large and are hence stored in the virtual address space, only partially populated, and only backed by physical memory when required. So a linear page table can efficiently support the conventional address-space split by not loading the centre of the array [ELP99].

Figure 2.5: Linear Page Table

With linear tables, the MMU splits a virtual address into page number and page offset components. The page number is used to index into an array of page table entries. The actual physical address is the concatenation of the page frame number in the page table entry and the page offset of the virtual address. Page table entries consist of the physical page frame number for the corresponding virtual page, a flag indicating whether the entry is currently valid, a flag indicating whether the page may be written, and a bit indicating whether the page has been referenced.

TLB misses are handled by looking up the relevant entry in the table and loading the page. This lookup requires a simple calculation based on the virtual address and one memory reference. In theory it is quick and painless, but remember that we are accessing an array in virtual memory, which means that a TLB entry has to be present for the page table itself. If a nested TLB miss occurs whilst servicing a TLB miss, then either the nested miss is resolved elsewhere in the page table or it is eventually resolved at the root, which is in physical memory. So you can view a linear page table as a multilevel page table that is searched from the leaf nodes to the root rather than the other way round [ELP99]. Hence with 64 bits we would need several more levels, several TLB entries would be needed per page, and this sort of nesting is expensive. The benefits for 64 bits are: it is a straightforward mechanism; it is a practical design if a proper data structure is used to map the table itself; and part of the TLB can be used for mappings to the table, so the tree is rarely traversed. The disadvantage, though, is the high space overhead.
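A minimal sketch of that lookup in C, assuming the entry layout described above (a 20-bit frame number plus valid, write and reference bits; all names are invented for the sketch). A false return stands in for the trap to the page-fault handler:

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12u

/* One linear-table entry: frame number plus the flags described above. */
struct lpte {
    uint32_t frame : 20;
    uint32_t valid : 1;
    uint32_t writable : 1;
    uint32_t referenced : 1;
};

/* Index the flat array by virtual page number; concatenate frame and offset. */
static bool translate(struct lpte *table, uint32_t vaddr, uint32_t *paddr)
{
    struct lpte *e = &table[vaddr >> PAGE_SHIFT];
    if (!e->valid)
        return false;                   /* page fault */
    e->referenced = 1;
    *paddr = ((uint32_t)e->frame << PAGE_SHIFT)
           | (vaddr & ((1u << PAGE_SHIFT) - 1));
    return true;
}
```

In reality the table itself lives in virtual memory, so this single array access can itself miss in the TLB, which is exactly the nesting discussed above.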

Multilevel or Hierarchical Page Tables

The problem with simple arrays as page tables is that they are very wasteful. Typically, the virtual address space on a 32-bit machine starts the programme code of a process at address 0, then starts the heap at the first page boundary after the end of the programme, while the stack starts at the 4GB mark and works its way down. So there's a big

4Graphic taken from [TAL95]

Page 17: V MANAGEMENT IN THE 32 64 M OS X L · VIRTUAL MEMORY MANAGEMENT IN THE MOVE FROM 32 TO 64 BIT WITH REFERENCE TO MAC OS X AND LINUX BACHELOR-ARBEIT KARL RITSON 15 July 2005 Technische

CHAPTER 2. THE PRESENT 14

gap between the heap and the stack, and thus most of the page table entries are empty. So the idea with multilevel page tables is to keep only the sections of the page table that are needed in memory.

Figure 2.6: Multilevel Page Table

The virtual address is divided into several indices (I1, I2, I3, I4 etc.) into different page tables, plus an offset (see figure 2.6) 5. There is a single top-level page table for the entire virtual address space, and each entry in the top-level page table stores the address of a page table in the next level (the 2nd-level page table) and so on, creating a tree structure. Each index accesses one level of the tree. This system avoids keeping all the page tables in memory in two ways. Firstly, if a 1st-level page entry indicates an invalid address, then the corresponding 2nd-level page tables are not even allocated. Secondly, you normally end up allocating just three 2nd-level page tables, one each for the code, the heap and the stack. The second-level tree nodes associated with the unused centre region of the address space are not allocated until needed. During a TLB miss, the chain of tables is traversed and the address has to be fully resolved.

Different platforms use different numbers of levels. The architecture-independent parts of the Linux kernel source code are written as though there were always three levels in the page table structure.

On x86 processors, the multilevel page tables are actually only two levels deep, which is sufficient for addressing a 4GB address space with 4KB pages. The code that traverses the "middle level" of page tables does nothing on the x86 architecture: it gets preprocessed and compiled down to essentially nothing via platform-specific #ifdefs. This allows other code to be written as though all machines had three-level page tables. [WIL99]
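A minimal sketch (not from the thesis) of the two-level x86-style walk: the 32-bit address is split 10/10/12 into a directory index, a table index and an offset, and unmapped regions simply have no second-level table allocated.

```python
OFFSET_BITS = 12
LEVEL_BITS = 10   # 10 bits per level, two levels, 4 KB pages

def translate_2level(directory, vaddr):
    """Walk page directory -> 2nd-level page table -> physical frame."""
    i1 = (vaddr >> (OFFSET_BITS + LEVEL_BITS)) & 0x3FF  # top-level index
    i2 = (vaddr >> OFFSET_BITS) & 0x3FF                 # 2nd-level index
    offset = vaddr & 0xFFF
    second = directory.get(i1)   # sparse: unmapped slots have no table
    if second is None:
        return None              # page fault, no 2nd-level table exists
    frame = second.get(i2)
    if frame is None:
        return None              # page fault, entry invalid
    return (frame << OFFSET_BITS) | offset

# only one 2nd-level table exists: the one behind directory slot 0
directory = {0: {1: 7}}
```

The dictionaries stand in for the allocate-on-demand behaviour described above: directory slot 1 was never populated, so the whole 4 MB region it covers costs no page-table memory at all.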

However, multilevel page tables in the 64-bit domain involve either increasing the node size to translate more bits at each level and/or increasing the number of levels (typically between 4 and 7). The former increases memory consumption and the latter increases the number of memory references required for each page table look-up. In addition, multilevel page tables are impractical due to the huge overhead on a TLB miss [TAL95]. All in all, multilevel page tables are inefficient for large sparse address spaces because they support the translation of large address spaces poorly in either the

5Graphic taken from [TAL95]


space or time domain [ELP99]. Linux uses multilevel page tables on all platforms. [SZM00]

Inverted Page Tables

As mentioned above, a page table can be up to 4MB in size on a 32-bit machine. The situation on a 64-bit machine is much worse, as the maximum table size is 16 petabytes. One advantage of inverted page tables is their ability to compactly store sparse address spaces [ELP99]. In fact, page table size is approximately proportional to physical memory size, not virtual address space size. Another plus is that virtual sparsity presents no problem. The main disadvantages are that you can't share memory, because each physical address can only have one translation associated with it [ELP99], and the lookup time is typically very high because the whole table has to be linearly searched.

Whereas a normal page table is indexed by a virtual page number and stores a physical frame number per virtual page, an inverted page table is indexed by a physical frame number and stores a virtual page number per physical frame. There is one PTE per physical page frame. Each entry is numbered from 0 to n from the bottom to the top of the table. Unlike an ordinary or multilevel page table, there is only one inverted page table per system: all processes use the same page table. Each entry contains two fields: the identification of the process that owns the virtual page contained in that page frame, and the virtual page number of the page contained in that page frame. When a process requests memory, the page table is searched, usually with the aid of a hash table [CHA88]. The search looks for an entry that matches both the identification of the requesting process and the virtual page being requested. If such an entry is found, then the index of the entry where it was found becomes the physical page frame for the physical address. If no matching entry is found, then the virtual page is not in memory and a page fault occurs.
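The (process id, virtual page) search can be sketched as follows (a naive linear scan for clarity; real systems use the hash table mentioned above). The example table contents are hypothetical:

```python
def ipt_lookup(inverted_table, pid, vpn):
    """Linearly search the system-wide inverted table for (pid, vpn).
    The index of the matching entry *is* the physical frame number."""
    for frame, entry in enumerate(inverted_table):
        if entry == (pid, vpn):
            return frame
    return None   # not resident: page fault

# one entry per physical frame: (owning process, virtual page number)
ipt = [(1, 0), (2, 0), (1, 3), None]   # frame 3 is free
```

Note how the same virtual page number (0 here) can legitimately appear for two different processes, which is exactly why the process identification must be part of the match.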

The advantages and disadvantages of inverted page tables in the 32-bit and the 64-bit domains are the same. [ELP99]

Hashed Tables

Hashed page tables are similar to inverted page tables, but whereas the inverted page table contains one entry per physical frame, the hashed page table contains a hash table of virtual-to-physical translations (see figure 2.7) 6. The advantage of hashed page tables is the same as that of inverted page tables (namely, they are smaller than standard tables), but on top of that, hashed tables have a quicker lookup than inverted tables.

The virtual page number and the physical frame number are stored in the hash table. The virtual address consists of a virtual page number and an offset. The virtual page number is used as an entry point into the page table, which contains a chain of elements hashing to the same location. Virtual page numbers are compared along this chain in search of a match. If a match is found, the corresponding physical frame number is combined with the offset to form the physical address.
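A minimal sketch of the chain walk (illustrative; the bucket count and hash function — VPN modulo the number of buckets — are assumptions, not taken from [TAL95]):

```python
NBUCKETS = 4

def hpt_lookup(buckets, vpn):
    """Hash the VPN to a bucket, then walk the collision chain
    comparing virtual page numbers until a match is found."""
    chain = buckets[vpn % NBUCKETS]
    for entry_vpn, frame in chain:
        if entry_vpn == vpn:
            return frame
    return None   # no translation cached: page fault

# bucket 1 holds two translations whose VPNs hashed to the same slot
buckets = [[], [(1, 8), (5, 2)], [], []]
```

Unlike the inverted table's full linear search, the cost here is bounded by the length of one chain, which is why the lookup is quicker on average.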

For 64-bit address spaces, the hashed page table only has to be altered in one way: extend the size of the page table entries to accommodate the larger virtual page numbers. TLB misses have a fixed overhead, so hashed page tables are well-suited to sparse address spaces [TAL95]; however, that overhead is high.

6Graphic taken from [TAL95]


Figure 2.7: Hashed Page Table

Clustered page tables

Clustered page tables are hashed page tables that store mapping information for several consecutive pages (e.g. sixteen) with a single tag and a next pointer [TAL95], i.e. they are similar to hashed page tables with the difference that each entry stores mapping information for a block of consecutive page table entries. The number of pages per page block is called the subblocking factor [TAL95]. Virtual address translation is similar to that of a hashed page table. The virtual address is split into three parts: the virtual block number, the block offset and the page offset. The virtual block number acts as an entry point into the clustered page table, and the block offset is used to index and select a page table entry, which in turn is combined with the remaining offset to form the physical address.

Subblocking or clustering is effective when the address space of a programme consists of small clusters of contiguous pages, thus saving on space overhead. However, this efficiency is significantly reduced for address spaces consisting of isolated virtual pages [ELP99], because potentially more memory could be required due to the extreme sparsity of the address space (see figure 2.8) 7.

Figure 2.8: Clustered Page Table

Guarded Page Tables or Path Compression

Invented by Jochen Liedtke, guarded page tables are tree-structured, combining the advantages of hashed page tables and multilevel page tables. Guarded page tables attempt

7Graphic taken from [TAL95]


to avoid the problems associated with multilevel page tables and sparsity, namely that huge numbers of page table entries are required for non-mapped pages. If the second, third, fourth etc. table levels of a virtual-to-physical address mapping are so sparsely populated that there is only one path through them, we can omit these essentially "useless" 8 tables by checking whether the address bits after the first-level bits represent this path. In other words, if we augment each page table entry with a bit string g (whose length can vary from entry to entry), we can check whether g is, in fact, a prefix of the remaining virtual address. If it is, we skip the associated translation steps and the translation process continues with the remaining postfix or terminates.
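A toy sketch of the guard check (illustrative only: one-bit tables, addresses as bit strings, and the particular guards are all invented for the example):

```python
def guarded_lookup(node, bits):
    """Translate the bit string `bits` through a guarded page table.
    Each entry carries a guard: the prefix of the remaining address
    that a chain of degenerate one-path tables would otherwise resolve."""
    while isinstance(node, dict):          # inner table node
        head, rest = bits[0], bits[1:]
        guard, child = node[head]
        if not rest.startswith(guard):     # guard mismatch: fault
            return None
        bits = rest[len(guard):]           # skip the compressed levels
        node = child
    return node                            # leaf: the frame number

# one-bit tables; the guard "10" compresses two single-path levels
root = {"0": ("10", 42), "1": ("", 7)}
```

Address "010" matches the guard "10" after its first bit, so two table lookups are skipped and frame 42 is reached in a single step; "011" fails the prefix test and faults.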

2.1.3 Translation Lookaside Buffer

If the CPU had to look up a page table entry every time it had to translate an address, performance would suffer. All modern computers designed for virtual memory incorporate a special hardware cache called a Translation Lookaside Buffer (TLB), which caches page table entries so that the CPU usually doesn't have to probe the page table to find a page table entry for a translation [WIL99]. This hardware device is based on the observation that most processes exhibit locality of reference, meaning that at any phase in its execution, a process makes lots of references to only a few pages, so only a fraction of the pages are read heavily; the rest are hardly touched [TAN87]. What is needed is a fast, efficient cache. Let's look briefly at the kinds of memory a cache can be.

Cache Mapping and Associativity

The TLB is an n-way set associative memory cache. There are different ways to map storage in a cache to the main memory it serves; generally, this mapping can be done in three different ways. Take as an example a system with 512 KB of cache and 64 MB of main memory:

1. Direct Mapped Cache: The simplest way to allocate the cache to the system memory is to determine how many cache lines there are (16,384 in our example) and just chop the system memory into the same number of chunks. Then each chunk gets the use of one cache line. This is called direct mapping. So if we have 64 MB of main memory addresses, each cache line would be shared by 4,096 memory addresses (64 M divided by 16 K).

2. Fully Associative Cache: Instead of hard-allocating cache lines to particular memory locations, it is possible to design the cache so that any line can store the contents of any memory location. This is called fully associative mapping.

3. N-Way Set Associative Cache: "N" here is a number, typically 2, 4, 8 etc. This is a compromise between the direct mapped and fully associative designs. In this case the cache is broken into sets, where each set contains "N" cache lines, let's say 4. Then each memory address is assigned a set, and can be cached in any one of those 4 locations within the set that it is assigned to. In other words, within each set the cache is associative, hence the name. This design means that there are "N" possible places that a given memory location may be in the cache. The tradeoff is that there are "N" times as many memory locations competing for

8i.e. they are null-pointers; the address bits are not used to index any table


the same "N" lines in the set. Let's suppose in our example that we are using a 4-way set associative cache. So instead of a single block of 16,384 lines, we have 4,096 sets with 4 lines in each. Each of these sets is shared by 16,384 memory addresses (64 M divided by 4 K) instead of 4,096 addresses as in the case of the direct mapped cache. So there is more to share (4 lines instead of 1) but more addresses sharing it (16,384 instead of 4,096).

Conceptually, the direct mapped and fully associative caches are just special cases of the N-way set associative cache. You can set "N" to 1 to make a "1-way" set associative cache. If you do this, then there is only one line per set, which is the same as a direct mapped cache because each memory address is back to pointing to only one possible cache location. On the other hand, suppose you make "N" really large; say, you set "N" to be equal to the number of lines in the cache (16,384 in our example). Then you only have one set, containing all of the cache lines, and every memory location points to that huge set. This means that any memory address can be in any line, and you are back to a fully associative cache. [KOZ01]
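The running example's arithmetic can be checked with a short sketch (the mapping function, address modulo the line or set count, is one common choice and is assumed here for illustration):

```python
MEM_ADDRESSES = 64 * 2**20   # 64 MB of main memory addresses
CACHE_LINES = 16384          # the 512 KB cache of the running example
WAYS = 4
SETS = CACHE_LINES // WAYS   # 4,096 sets in the 4-way case

def direct_mapped_line(addr):
    """Direct mapping: every address is hard-wired to exactly one line."""
    return addr % CACHE_LINES

def set_for(addr):
    """4-way set associative: every address maps to one set of WAYS lines."""
    return addr % SETS

sharers_per_line = MEM_ADDRESSES // CACHE_LINES   # 4,096 addresses per line
sharers_per_set = MEM_ADDRESSES // SETS           # 16,384 addresses per set
```

Two addresses that collide on the same direct-mapped line (e.g. addresses 16,384 apart) would evict each other in turn; in the 4-way cache they land in the same set but can occupy two of its four lines simultaneously, which is the X/Y thrashing cure discussed below.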

The following text is taken from [LAY05]:

Comparison of Cache Mapping Techniques

There is a critical tradeoff in cache performance that has led to the creation of the various cache mapping techniques described in the previous section. In order for the cache to have good performance you want to maximize both of the following:

Hit Ratio: You want to increase as much as possible the likelihood of the cache containing the memory addresses that the processor wants. Otherwise, you lose much of the benefit of caching because there will be too many misses.

Search Speed: You want to be able to determine as quickly as possible if you have scored a hit in the cache. Otherwise, you lose a small amount of time on every access, hit or miss, while you search the cache.

Now let’s look at the three cache types and see how they fare:

Direct Mapped Cache: The direct mapped cache is the simplest form of cache and the easiest to check for a hit. Since there is only one possible place that any memory location can be cached, there is nothing to search; the line either contains the memory information we are looking for, or it doesn't.

Unfortunately, the direct mapped cache also has the worst performance, because again there is only one place that any address can be stored. Let's look again at our 512 KB cache and 64 MB of system memory. As you recall, this cache has 16,384 lines and so each one is shared by 4,096 memory addresses. In the absolute worst case, imagine that the processor needs 2 different addresses (call them X and Y) that both map to the same cache line, in alternating sequence (X, Y, X, Y). This could happen in a small loop if you were unlucky. The processor will load X from memory and store it in cache. Then it will look in the cache for Y, but Y uses the same cache line as X, so it won't be there. So Y is loaded from memory, and stored in the cache for future use. But then the processor requests X, and looks in the cache only to find Y. This conflict repeats over and over. The


net result is that the hit ratio here is 0%. This is a worst case scenario, but in general the performance is worst for this type of mapping.

Fully Associative Cache: The fully associative cache has the best hit ratio because any line in the cache can hold any address that needs to be cached. This means the problem seen in the direct mapped cache disappears, because there is no dedicated single line that an address must use.

However, this cache suffers from problems involving searching the cache. If a given address can be stored in any of 16,384 lines, how do you know where it is? Even with specialized hardware to do the searching, a performance penalty is incurred. And this penalty occurs for all accesses to memory, whether a cache hit occurs or not, because it is part of searching the cache to determine a hit. In addition, more logic must be added to determine which of the various lines to use when a new entry must be added (usually some form of least recently used algorithm is employed to decide which cache line to use next). All this overhead adds cost, complexity and execution time.

N-Way Set Associative Cache: The set associative cache is a good compromise between the direct mapped and fully associative caches. Let's consider the 4-way set associative cache. Here, each address can be cached in any of 4 places. This means that in the example described, where we accessed alternately two addresses that map to the same cache line, they would now map to the same cache set instead. This set has 4 lines in it, so one could hold X and another could hold Y. This raises the hit ratio from 0% to near 100%! Again an extreme example, of course. As for searching, since the set only has 4 lines to examine, this is not very complicated to deal with, although it does have to do this small search, and it also requires additional circuitry to decide which cache line to use when saving a fresh read from memory. Again, some form of least recently used algorithm is typically used. In a nutshell, the hit ratio gets better as N increases but the search speed gets worse.

After that brief aside, back to the TLB.

If the CPU looks in its TLB to find the right page table entry and finds what it's looking for, we have a TLB hit and the CPU reuses the entry without actually traversing the page table. Occasionally, the TLB doesn't hold the PTE it needs and we have a TLB miss, so the CPU loads the needed entry from the page table and caches it in the TLB.

Note that a TLB does not cache normal data; it only caches address translation information from the page table. A page table entry is very small and the TLB only caches a relatively small number of them (depending on the CPU, usually somewhere between 32 and 1024). This means that TLB misses are a couple of orders of magnitude more common than page faults: any time you touch a page you haven't touched fairly recently, you're likely to miss the TLB cache. This isn't usually a big deal, because TLB misses are many orders of magnitude cheaper than page faults; you only need to fetch a PTE from main memory, not fetch a page from disk.

A TLB is very fast on a hit and is able to translate addresses in a fraction of an instruction cycle. This translation can generally be overlapped with other parts of instruction setup, so the TLB hardware gives you virtual memory support at essentially zero time cost.


2.1.4 Page Replacement Algorithms

When a page fault occurs and memory is full, the operating system software has to decide which page to move out of memory in order to make room for the one that's going to be paged in. If the chosen page has been modified whilst in memory, its modified or "dirty" bit will have been set to 1 and it has to be re-written to the disk. If, however, it just contains programme code, then it doesn't need to be written back to disk; the new page can just write over it. Performance can vary considerably depending on the way in which this decision is made. If, for example, a heavily used page is selected to be paged out, then there's a good chance it will be selected again pretty soon to be paged back in. So preferably, an infrequently used page should be selected and, ideally, the page which will not be used for the longest period of time 9

[SIL88, TAN87]. I use the phrases "to swap out" and "to page out" interchangeably throughout this section to mean the same thing.

FIFO

Probably the simplest way to select a "victim" page is to throw out the oldest page, which is exactly what this algorithm does. It can be implemented using a list: new pages are inserted at the head of the list and the page at the tail is swapped out [WIE00]. Its performance is not good, however, as the page it swaps out could be some initialisation code not used for a long time, or it could very well contain a variable in constant use. 10
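A short sketch (illustrative, not from any cited source) of FIFO replacement as a fault counter; running it on the classic reference string also reproduces Belady's anomaly mentioned in the footnote:

```python
from collections import deque

def fifo_faults(refs, nframes):
    """Run FIFO replacement over a reference string; count page faults."""
    frames, queue, faults = set(), deque(), 0
    for page in refs:
        if page in frames:
            continue                         # hit: FIFO order unchanged
        faults += 1
        if len(frames) == nframes:
            frames.remove(queue.popleft())   # evict the oldest page
        frames.add(page)
        queue.append(page)
    return faults

# the textbook reference string exhibiting Belady's anomaly
refs = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
```

With 3 frames this string causes 9 faults, but with 4 frames it causes 10: more memory, more faults, contrary to intuition.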

Least Recently Used

One variant is to swap out pages depending not on when they were brought into memory (like FIFO, see section 2.1.4) but on when a page was last used. If we take the past as an approximation of the future, we would tend to want to replace the page that has not been used for the longest time. This is the principle behind the Least Recently Used algorithm: we record the time of the most recent reference to each page, and the page whose last reference lies furthest in the past is replaced.

LRU is effective but expensive, as the pages currently in memory have to be held in a linked list which needs updating with every memory reference. It is usually implemented with special hardware involving either counters or a list called the LRU stack or the paging stack (a data structure) [HAM00], the details of which are beyond the scope of this paper 11. It also suffers when every page is used heavily while a process is being initialised. [SIL88]
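The same fault-counting sketch for LRU (illustrative; timestamps stand in for the hardware counters the text mentions):

```python
def lru_faults(refs, nframes):
    """LRU replacement: evict the page whose last use is furthest back."""
    last_use, faults = {}, 0
    for t, page in enumerate(refs):
        if page not in last_use:
            faults += 1
            if len(last_use) == nframes:
                victim = min(last_use, key=last_use.get)  # oldest last use
                del last_use[victim]
        last_use[page] = t   # every reference refreshes the timestamp
    return faults
```

On the Belady reference string from the FIFO discussion, LRU behaves monotonically: 10 faults with 3 frames and 8 with 4, never more faults with more memory.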

9In most textbooks, this ideal is given a name: the Optimal Page Replacement Algorithm. It has also been noted in the same textbooks that it is difficult [SIL88], if not impossible [TAN87], to implement, as it requires a certain amount of soothsaying, namely future knowledge of which pages will be accessed and when. Its use is academic: it serves as a yardstick in page replacement algorithm comparisons. Hence this paper will not be dealing with it.

10There's not much else to say about this algorithm other than to mention Belady's anomaly, discovered in 1969, whereby it was found that, contrary to intuition, if you allocate 4 page frames to a programme whose pages are being swapped out via FIFO, more page faults can occur than if you allocate 3; the intuition being that the more memory a programme has, the fewer page faults occur. [TAN87, SIL88]

11An LRU implemented with a stack cannot suffer from Belady's anomaly, as it can be shown that the set of pages in memory for n frames is always a subset of the set for n+1 frames.


Second Chance or Clock

True LRU page replacement is rarely used, as the hardware required proves financially unviable. However, some systems provide a reference bit which is set whenever the page is referenced (i.e. on a read or a write anywhere in the page). These bits can be found either in the page table entry or in a separate register. The idea behind second chance replacement is the FIFO algorithm with the addition of an inspection of this reference bit. If the reference bit is 0, the page is replaced. If it is 1, then it gets set to 0, the arrival time is reset to the present time, the page is moved to the end of the list and we move on to the next page in the list. If a page is used often enough, its reference bit will keep getting set to 1 and it will stay in memory. As can be imagined, if all the pages have been referenced we find ourselves clearing all the reference bits and we are, in essence, back with our FIFO algorithm [TAN87, SIL88]. It's known as second chance because it gives a page a second chance to stay in memory: one more sweep cycle.

The clock algorithm is the second chance algorithm implemented with a circular list. The pointer acts as the clock's "hand", and whenever a page fault occurs the page being pointed to is inspected. If its reference bit is 0, the page is evicted and the new page is inserted in its place, whereas if it is 1, the bit is cleared and the "hand" is moved on to the next page in the list. The basic idea of the clock algorithm is that a slow incremental sweep repeatedly cycles through all of the cached (in-RAM) pages, noticing whether each page has been touched (and perhaps dirtied) since the last time it was examined. If a page's reference bit is set, the clock algorithm doesn't consider it for eviction in this cycle and continues its sweep, looking for a better candidate for eviction. Before continuing its sweep, however, it resets the reference bit in the page table entry [GOR04]. Resetting the reference bit ensures that the next time the page is reached in the cyclic sweep, it will indicate whether the page was touched since this time. Visiting all of the pages cyclically ensures that a page is only considered for eviction if it hasn't been touched for at least a whole cycle. Linux, as we shall see (see section 2.4.4), uses a simple second-chance (one-bit clock) algorithm, sort of, but with several elaborations and complications.
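One sweep of the clock hand can be sketched as follows (illustrative; the hand position and list representation are assumptions for the example):

```python
def clock_evict(pages):
    """One sweep over a circular list of (page, referenced_bit) pairs,
    starting at position 0. Referenced pages get their bit cleared (a
    second chance); the first page found with bit 0 is the victim.
    Returns (victim_page, updated_list); the caller would reuse the
    victim's slot for the incoming page."""
    pages = list(pages)
    hand = 0
    while True:
        page, ref = pages[hand]
        if ref:
            pages[hand] = (page, 0)          # clear the bit, move on
            hand = (hand + 1) % len(pages)
        else:
            return page, pages
```

If every page has been referenced, the hand clears all the bits, wraps around, and evicts the page it started at, which is exactly the degeneration to FIFO described above.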

Second chance/clock is a step up from FIFO but is considered inefficient due to its continually moving pages around in the list.

Aging Algorithms

The reference bit tells me, at any one time, only which pages have been referenced and which have not; we know nothing of the order of referencing. If I were to introduce more bits, say 8, that keep an eye on when a page was referenced, I would be introducing an idea called aging. Essentially it's this: I view my 8 bits like a bookshelf; I add bits at one end (the high-order bit end) and remove bits from the other. Every so often, say every 100ms, the process is interrupted and control transfers to the kernel. The operating system then copies the reference bit into the high-order bit of my byte, shifting all the other bits right by one. Hence I have a concise referencing history of each page for the last eight time periods. Viewing these bytes as unsigned integers [SIL88], the page with the smallest number gets paged out.
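The shift-and-insert tick can be sketched directly (illustrative; pages are named "A" and "B" for the example):

```python
def age_pages(counters, ref_bits):
    """One timer tick: shift each page's 8-bit counter right by one and
    copy the page's reference bit into the high-order bit."""
    return {page: ((counters[page] >> 1) | (ref_bits.get(page, 0) << 7)) & 0xFF
            for page in counters}

def aging_victim(counters):
    """The page whose counter, read as an unsigned integer, is smallest
    was referenced least recently and gets paged out."""
    return min(counters, key=counters.get)

# two ticks: A is referenced in the first period, B in the second
c = {"A": 0, "B": 0}
c = age_pages(c, {"A": 1})   # A -> 1000_0000, B -> 0000_0000
c = age_pages(c, {"B": 1})   # A -> 0100_0000, B -> 1000_0000
```

After the two ticks, A's more distant reference has been shifted towards the low-order end, so A is the eviction candidate even though both pages were referenced exactly once.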

There are lots of algorithms that can be used for page replacement. The ones I have looked at here are, in their purest form, demand paging algorithms. That means pages are fetched only when the CPU demands their contents. So when you start up your computer, the memory is empty and the CPU will demand lots of pages, all of which cause page faults. Eventually, the process has loaded most of the pages it requires


and runs with comparatively few page faults. This locality of reference (see section 2.1.3) allows us to define the fraction of pages that a process is using during these "quiet" times as its working set [DEN72]. If there isn't enough memory to hold all the pages in a process's working set, the process exhibits thrashing (see section 1.4). Moreover, in timesharing systems, when a process is removed to disk to let another process have the processor, at some point it will need to be moved back in again. It is preferable at this point to preload the working set before allowing the process to run. This idea of Denning's [DEN72] is called the Working Set Model and involves the operating system keeping tabs on which pages are in the working set of each process. An implementation of the working set model involves the use of the aging algorithm discussed above (section 2.1.4): all the pages with a 1 in the n highest-order bits are considered to be in the working set; all others are precluded.

Information about the working set can be used to improve our paging algorithms. Take the clock algorithm, for example. Normally, the page presently being pointed to is evicted if its reference bit is 0. If we first check whether it is part of the working set, and keep it if it is, we will generally improve the performance of the clock algorithm. This is known as the wsclock algorithm.

2.2 Protection

With the separation of addressing and protection via the use of segmentation and virtual addressing, how does the system keep tabs on which processes can do what? In other words: what forms of access control can an operating system offer for memory management? One form of access control is seen in CPU instructions that may only be executed in supervisor mode, which usually amounts to within the kernel. The division of virtual memory into kernel and user parts is also a form of access control. The Access Matrix model is the basis for several access control mechanisms. In this model, you have Objects: resources like hardware devices, data files, etc. that need access control, that is, must be accessed in a protected fashion; Subjects: active entities like user processes that access objects; and Rights: operations such as enable, disable, read, write, execute, etc. on objects, represented by access rights. We will examine two generalized forms of protection systems: the capability system and the access control list system.

2.2.1 Protection Domains

To be able to say which processes may or may not interact with other processes, and as a way to partially allow processes to interact with others (e.g. to read but not write file F), it is helpful to visualise a matrix with the rows i as subjects and the columns j as objects, where an element M[i,j] contains a set of rights. We talk of a protection domain as being a (subject, rights) tuple where each pair specifies a subject and some subset of operations that may be performed by it. Rights here are operations that may be performed on an object. [TAN87]

At any instant a process runs in some protection domain, i.e. it has a set of objects associated with it and, for each of these, some rights. Processes can switch from domain to domain during execution, and the rules governing this are dependent on the operating system. The classic division is between the USER domain and the KERNEL domain. From within the kernel, a process could potentially access all the physical pages in memory and all protected resources. A process in the user


domain cannot. Another example is the typical Unix set of rights READ, WRITE and EXECUTE displayed next to every file when you type ls -l at the command line in a terminal shell.

2.2.2 Access Control and The Capability System

Two popular implementation approaches for the access matrix are Access Lists and Capability Lists. An access list enumerates who may access a particular object. Thus, you need to know the identities of all subjects. The Unix file permission scheme is an example. Note that in the context of Unix file permissions, "others" is a wildcard (catch-all) for subjects whose identity is not known (they may not even exist yet). Access lists are more effective when it's easy to put subjects in groups [SIL88, TAN87, KER00]. So when an operation M on an object Oj is attempted in domain Di, we search the list of Oj looking for an entry < Di, Rk > with M ∈ Rk.
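The access-list check just described (search Oj's list for an entry granting operation M to domain Di) can be sketched as follows; the domain names and rights are hypothetical:

```python
def access_allowed(acl, domain, op):
    """Search an object's access list for an entry <Di, Rk> whose
    domain matches and whose rights set Rk contains the operation M."""
    for d, rights in acl:
        if d == domain and op in rights:
            return True
    return False

# access list for one file object Oj: domain D1 may read and write,
# domain D2 may only read
acl_Oj = [("D1", {"read", "write"}), ("D2", {"read"})]
```

The list is stored with the object, so revoking a domain's rights means editing one list; with capabilities, by contrast, the tickets are scattered across the domains that hold them.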

A capability list enumerates each object along with the operations allowed on it. This is a ticket-based scheme, where the possession of a capability (an unforgeable ticket) by a subject allows access to the object. In this scheme, it is extremely critical to protect the capability list, as capabilities must not propagate without proper authentication. Note that capabilities are finer-grained than access list entries, and can often replace the Unix all-or-nothing super-user model. For example, rather than making the ping command setuid root, it is more secure to have a capability representing raw access to the network, and providing only that to ping. A capability can be thought of as a protected name for an object in memory [LIN76]. Each capability contains separate read and write permissions, so that some processes may receive capabilities that permit reading and writing some segment, while others receive capabilities permitting only reading from that same segment [SCH74]. This capability system was first suggested by Dennis and Van Horn [DEN66] as a kind of secure pointer to meet the need for resource protection as multiprogrammed computer systems came of age. To execute operation M on an object Oj , the process passes the capability of object Oj

as a parameter. Simply possessing the capability means access is allowedA capability list is associated with a domain but is never accessible to any processes

within it. The protection provided by capability-based systems relies on the fact thatthe capabilities are kept in a protected memory object maintained by the operatingsystem and never allowed to migrate anywhere where a user process could access themdirectly and modify them. [SIL88]. Three methods are used to protect them:

1. Tagged architecture - a capability is marked with a 'tag' bit.

2. Keep the capability lists in the kernel and only let processes refer to them by their position in the list.

3. Keep the capability list in user space but encrypt it with a key.
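The second method can be sketched as follows. The sketch is purely illustrative (process IDs, object names and the grant/invoke interface are invented for the example): the capabilities live in a kernel-side table, and a process names a capability only by its index, so it can never forge or modify one.

```python
# Hypothetical sketch of method 2: capability lists are kept in kernel space and
# a process refers to a capability only by its position in its own list.

kernel_clists = {}   # pid -> list of (object, rights) capabilities

def grant(pid, obj, rights):
    caps = kernel_clists.setdefault(pid, [])
    caps.append((obj, frozenset(rights)))
    return len(caps) - 1          # the process only ever sees this index

def invoke(pid, cap_index, op):
    try:
        obj, rights = kernel_clists[pid][cap_index]
    except (KeyError, IndexError):
        return None               # no such capability
    return obj if op in rights else None

i = grant(7, "segment_0", {"read"})
assert invoke(7, i, "read") == "segment_0"
assert invoke(7, i, "write") is None    # right not contained in the capability
assert invoke(8, i, "read") is None     # another process cannot use the index
```

Because the index is meaningless outside the owning process's list, possession of the integer alone conveys nothing, which is the point of keeping the list in the kernel.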

If we compare capability lists to segment tables, we notice that both consist of pointers and rights, but there is a difference: entries in a segment table point to in-core structures (segments), whereas entries in a capability list point to secondary storage. It would therefore be possible to combine them and use capabilities for both in-core and secondary-storage objects. That is, associate a unique capability with any object, regardless of where it is in the system; the same one is used by the file manager and the memory manager.

CHAPTER 2. THE PRESENT 24

Another interesting note: with capabilities, you do not need privileged hardware states. Just take an object-oriented approach; to do something, you need an object (a thing to be manipulated) and an operation. The object is of a certain type. Example: consider scheduling and dispatching (which should only be done by the kernel). So "scheduling" and "dispatching" are operations defined for process objects, and undefined for all other objects. Just make sure that no user gets capabilities for objects of the types on which privileged operations work.

Why are capabilities not used more often? They are too expensive and such systemsexecute more slowly. 12

2.3 Mac OS X Software Architecture

Mac OS X includes preemptive multitasking, protected memory, dual-processor support, multithreading and, compared with OS 9.x, advanced virtual memory management [SYD02]. Figure 2.9 13 depicts the Macintosh operating system graphically.

Figure 2.9: An overview of the five layers of the Mac OS X architecture

As you move from an upper layer to a lower layer in the model pictured in figure 2.9, you move from more directly accessible code to less directly accessible code. So, should someone click a button in the user interface (Aqua), this invokes a function in the Carbon API, which in turn accesses QuickDraw code [SYD02]. The code in the lowest layer, the Kernel environment, produces Darwin, the Core OS layer of Mac OS X. It contains Mach, device drivers and low-level commands of the BSD environment. This is the heart of Mac OS X. Most of the technologies in this layer are also part of Darwin [APP00, SYD02].

The Core Services layer implements low-level features that can be used by most types of software. This layer includes features such as collection management, data formatting, memory management, string utilities, process management, file system management, stream-based I/O, and low-level network communication. Most of these features are included in the Core Foundation framework.

12 Theodore Linden [LIN76], p. 426, claims that calls to the system software needed to use a capability can take a millisecond or more.

13 Graphic taken from [SYD02].

The Application Services layer implements services that are mainly graphics-related. This layer is crucial to all application environments except BSD, because BSD is an environment used to write UNIX programmes that run in a terminal window and hence does not need access to graphics. This layer also includes features such as HTML rendering, disc recording, address book management, font management, and speech synthesis and recognition. Quartz is the primary technology used for 2D rendering as well as for window management. OpenGL is an implementation of the industry-standard API for rendering 3D graphics. QuickTime is Apple's technology for displaying video, audio, virtual reality and other multimedia-related information. Core Audio (not mentioned in figure 2.9 but present in this layer) is Apple's technology for managing audio software and hardware.

The Application Environments layer embodies the basic technologies for building applications. Classic runs legacy programmes from older versions of the Mac OS. The Carbon environment (more procedural) is an enhanced subset of the Toolbox API, an older programming environment, meaning that Apple threw out all the functions in Toolbox that could not be adapted to work in OS X and rewrote some of the code for other functions. The Cocoa environment (more object-oriented) creates only OS X applications. The Java and X11 environments are also found here.

The user interface layer is an abstraction that identifies methodologies for creating applications. Each methodology manifests itself as a set of guidelines, recommended technologies, or a combination of the two. For example, Aqua is a set of guidelines that describes the appearance of applications, whereas accessibility represents both a technology and a set of guidelines to support assistive technology devices such as screen readers [APP00].

2.3.1 OS X Virtual Memory Architecture

Overview

Although Apple released a machine with a Memory Management Unit (MMU) in 1987, the Mac operating systems did not make use of it until System 7 in 1991. Even then, virtual memory features such as protected address spaces, memory-mapped files, page locking and shared memory were not present [KER00]. Nowadays, Mach offers a powerful and portable memory management system with sophisticated mechanisms for dealing with these features. It features an object-oriented design approach, sophisticated memory management and message-based IPC [HOH96].

Virtual memory is handled differently in Mac OS X than it was in earlier versions of the Mac OS. In earlier versions, each program had an amount of RAM assigned to it that the program could use. Users could turn virtual memory on and off, and also specify how much main memory should optimally be allocated to each application and how much maximally. In Mac OS X, virtual memory is "on" all the time.

In Mac OS X, all processes are given their own private address space, which is allocated dynamically to the application on an as-needed basis. At the time of writing, the xnu kernel is 32-bit, but its VM system is 64-bit aware, so the kernel can address more than 4 GB of RAM.


One of the major goals of the design of the virtual memory management system in Mach was to provide a clean separation between the machine-independent and the machine-dependent portions of the virtual memory system, in stark contrast to the original 3BSD virtual memory system, which was specific to the VAX [THO05]. Mach's memory management is split into three parts:

• the machine-dependent pmap module 14, which manages the MMU and runs in the kernel. The pmap module is responsible for catching all page faults, performing any TLB operations needed and setting the MMU registers; it relies on the MMU architecture and is therefore machine-specific.

• machine-independent kernel code concerned with managing address maps and replacing pages.

• a memory manager, which provides the semantics associated with the act of referencing memory regions that have mapped data spaces backed by it. It provides (or denies) access to pages of data that have been referenced by user tasks [HOH96]. It is a server task that responds to specific messages from the kernel in order to handle memory management functions for the kernel [BRI01].

There is a default memory manager that runs as part of the kernel, but users can supply their own memory managers (running outside the kernel) for special situations. Communication between the kernel and the memory manager is well defined and documented (see [BRI01] amongst others), which makes it possible for users to write their own memory managers. On the one hand, this has the potential of making the kernel smaller by moving a large section of code out into user space, but it also has the potential of creating race conditions, as two active entities would be involved in handling memory [TAN87].

Throughout this chapter I will use the terms “process” and “task” interchangeably.Task is Mach-speak for what is essentially a Unix process.

Inter Process Communication

The Mach kernel provides message-oriented, capability-based interprocess communication. The interprocess communication (IPC) primitives efficiently support many different styles of interaction, including remote procedure calls (RPC), object-oriented distributed programming, streaming of data, and sending very large amounts of data [BRI01]. The IPC implementation makes use of the VM system to efficiently transfer large amounts of data. The message body can contain the address of a region in the sender's address space which should be transferred as part of the message. When a task receives a message containing an out-of-line region of data, the data appears in an unused portion of the receiver's address space. This transmission of out-of-line data is optimized so that sender and receiver share the physical pages of data copy-on-write, and no actual data copy occurs unless the pages are written. Regions of memory up to the size of a full address space may be sent in this manner.

Messages The message primitives let tasks send and receive messages. Messages are sent via unidirectional communication channels, so-called ports. Messages sent to a port are delivered reliably (messages may not be lost) and are received in the order in which they were sent; delivery is asynchronous, with messages stored in queues if they cannot be delivered immediately. Available message types are pure data (integers, floats, strings, etc.), port rights and memory regions.

14 A complete breakdown of the code in the pmap module can be found here: http://www.daemon-systems.org/man/pmap.9.html

Ports Ports hold a queue of messages (although the messages themselves are actually stored in a different kernel data structure - the port queue). Tasks operate on a port to send and receive messages by exercising capabilities for the port. Multiple tasks can hold send capabilities, or rights, for a port. Tasks can also hold send-once rights, which grant the ability to send a single message. Only one task can hold the receive capability, or receive right, for a port. If a thread from one process sends a message to another thread from a different process, the second thread has to use a different port to answer. Port rights can be transferred between tasks via messages. The sender of a message can specify in the message body that the message contains a port right. If a message contains a receive right for a port, then the receive right is removed from the sender of the message and transferred to the receiver. While the receive right is in transit, tasks holding send rights can still send messages to the port, and these are queued until a task acquires the receive right and uses it to receive the messages. A port has a single receiver and (potentially) multiple senders. When a port is created, 64 bytes of kernel space are allocated and maintained until the port is destroyed when, for example, all the processes using it have exited.

Port Queues The port queue is a protected data structure, only accessible via the kernel's exported message primitives.

Port Rights Port rights are a secure, location-independent way of naming ports. The ability to send messages to ports and to receive messages is stored in port rights. Port rights are kernel-protected capabilities [HOH96] and can only be manipulated by tasks via port names; there is no way for a malicious user task to guess a port name and send a message to a port to which it should not have access. Port rights do not carry any location information. When a receive right for a port moves from task to task, even between tasks on different machines, the send rights for the port remain unchanged and continue to function. There are three kinds of port rights:

1. Receive right - a receive right allows the task to receive messages from this port. Only one receive right exists per port.

2. Send right - a send right allows the task to send messages to this port. The number of send rights per port is not restricted.

3. Send-once right - a send-once right allows the task to send one message to this port. The send-once right is destroyed after the message is sent. [HOH96]
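The semantics of these three rights can be modelled in a short sketch. This is an illustrative model only, not Mach's implementation - real port rights live in kernel space and are named by integer port names, whereas here a plain dictionary stands in for a task's rights:

```python
# Hypothetical model of the three kinds of port rights: one receive right per
# port, any number of send rights, and send-once rights consumed after one use.

class Port:
    def __init__(self):
        self.queue = []

def send(rights, port, msg):
    kind = rights.get(port)
    if kind not in ("send", "send-once"):
        raise PermissionError("no send right for this port")
    port.queue.append(msg)
    if kind == "send-once":
        del rights[port]          # consumed after a single message

def receive(rights, port):
    if rights.get(port) != "receive":
        raise PermissionError("no receive right for this port")
    return port.queue.pop(0)      # FIFO: messages arrive in sending order

p = Port()
server = {p: "receive"}
client = {p: "send-once"}
send(client, p, "hello")
assert receive(server, p) == "hello"
try:
    send(client, p, "again")      # the send-once right is already gone
    ok = False
except PermissionError:
    ok = True
assert ok
```

The FIFO pop in receive mirrors the in-order delivery guarantee described above for queued messages.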

Port Sets A port set is a common access point for receiving messages from several ports. A receive operation retrieves a message from one of the queues of the member ports unless all of these queues are empty. A port cannot be a member of more than one port set. The port set abstraction allows a single thread to wait for a message from any of several ports. Tasks manipulate port sets with a capability, or port-set right, which is taken from the same space as the port capabilities. The port-set right may not be transferred in a message. A port set holds receive rights, and a receive operation on a port set blocks waiting for a message sent to any of the constituent ports. If a port is a member of a port set, the holder of the receive right cannot receive directly from the port. [BRI01, HOH96]


Figure 2.10: IPC structure in Mach


Regions and Memory Objects

Just as it was one of the initial goals of Mach to clearly separate the machine-independent code from the machine-dependent, so it was also a goal to fully support sparse address spaces. However, even with a multilevel page table, sparse usage of memory is expensive. Mach introduces a way to tell which addresses are in use and which are not by allocating and deallocating portions of the logical address space in regions. This is possible only because the address space is initially sparse. [APP00] Any given region is backed by a memory object [LOE92]. A memory manager task keeps a list of allocated regions, eliminating the need for a linear table [SCO97]. The memory manager provides the policy that governs the relationship between the image of a set of pages whilst cached in memory (the physical memory contents of a memory region) and the image of that set of pages when not so cached (the abstract memory object). The allocation function vm_allocate 15 allows specification of a base address plus a size, in which case the indicated region is allocated, or it can specify just a size, in which case the system finds a convenient slot. A virtual address is only valid if it falls in an allocated region [TAN87].
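The region semantics of vm_allocate can be sketched as follows. This is a simplified, illustrative model of the interface described above, not Mach's implementation - real regions also carry protections, inheritance attributes and backing objects:

```python
# Hypothetical model of vm_allocate-style region allocation: the caller may pass
# a base address plus a size, or just a size (the system then finds a free slot).
# A virtual address is valid only if it falls inside some allocated region.

regions = []   # list of (base, size)

def vm_allocate(size, base=None):
    if base is None:                       # find a convenient slot
        base = 0
        for b, s in sorted(regions):
            if base + size <= b:
                break
            base = max(base, b + s)
    for b, s in regions:                   # reject overlapping requests
        if base < b + s and b < base + size:
            raise MemoryError("region overlaps an existing allocation")
    regions.append((base, size))
    return base

def valid(addr):
    return any(b <= addr < b + s for b, s in regions)

a = vm_allocate(0x1000, base=0x4000)       # explicit base address
b = vm_allocate(0x1000)                    # system picks a slot
assert valid(0x4000) and valid(0x4fff)
assert not valid(0x5000)                   # just past the region: invalid
assert valid(b)
```

The validity check is the key point: references outside any allocated region fault, which is what makes sparse address spaces cheap to represent.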

Another high-level Mach abstraction is the memory object. The task that implements this object (that responds to messages sent to the port that names the object) is the memory manager. [LOE92] A memory object can be a single page, a group of pages, a file (through file mapping primitives, abstracted by BSD) or a tree, and all these objects can be mapped into the virtual address space of tasks, meaning that a file is paged just as a process would be. When a task opens a file, it is mapped into the task's address space as a memory object. Data is not transferred immediately; file blocks are paged in on demand - into space already set aside for them - and shared files are mapped to shared memory. [SCO97] Receiving a message also causes the creation of a memory object, as the message is mapped into the receiving thread's address space, which is much less taxing than the typical UNIX implementation, which involves physically copying the message from the sender into the kernel, and from there to the receiver. Any memory object can be inserted into the virtual memory space as a region and can be named either by the port specified to the kernel when mapping a memory object into a client task's virtual address space or by abstract memory object ports (the port the memory manager presents to the kernel when initializing a memory object). [HOH96] In addition, a task can change the protections of the objects in its address space and can share those objects with other tasks.

In a way, the kernel can be seen as using main memory as a cache for a variety of abstract memory objects, and virtual memory as just a collection of virtual objects, each with a particular owner and protections. [TAN87, HOH96, APP00, SCO97]

A further call is vm_map, which maps a region of virtual memory at the specified address, for which data is to be supplied by the given memory object, starting at the given offset within that object. [BRI01]

Pages

Mach allows the task to specify protection values for its virtual memory pages. Protection values can be any combination of read, write, and execute. These attributes can be used to control access to memory shared between tasks. Mach also provides very fine-grained control over memory inheritance. A task can assign an inheritance attribute for each allocated page in its address space. Pages marked copy are mapped into the child task's address space. However, they are not physically copied to another location until the child task attempts to write to that address. This is initially much faster and uses less memory than physically copying all of the parent task's allocated memory into the child task's address space. The share attribute allows true sharing between parent and child tasks - each can read and write the page, but the programmer is responsible for avoiding race conditions. If a page is marked absent, the child does not inherit that page from the parent. That address is left unallocated in the child's address space. [SCO97]

15 See the GNU Mach Reference Manual for version 1.2 of the GNU Mach microkernel, section 5.1, for exact details of this and all virtual memory interface functions.
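The per-page inheritance behaviour just described can be sketched in a small model. The page addresses, frame names and the "cow" marker below are invented for illustration; they stand in for the kernel's actual copy-on-write bookkeeping:

```python
# Hypothetical sketch of per-page inheritance at task creation: "share" pages
# are physically shared, "copy" pages are copied lazily (copy-on-write), and
# "absent" pages are simply left unallocated in the child.

def make_child(parent_pages, inheritance):
    child = {}
    for addr, frame in parent_pages.items():
        attr = inheritance.get(addr, "copy")
        if attr == "share":
            child[addr] = frame                      # same physical frame
        elif attr == "copy":
            child[addr] = ("cow", frame)             # copied only on first write
        # "absent": address left unallocated in the child
    return child

parent = {0x0: "frame7", 0x1: "frame8", 0x2: "frame9"}
child = make_child(parent, {0x0: "share", 0x1: "copy", 0x2: "absent"})
assert child[0x0] == "frame7"            # true sharing
assert child[0x1] == ("cow", "frame8")   # deferred copy
assert 0x2 not in child                  # not inherited
```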

Pages in Mach have a current and a maximum protection code, which are combinations of read, write and execute permissions. The page replacement algorithm used is a modified FIFO, or clock/second-chance algorithm (see section 2.1.4). The pageout daemon maintains three lists:

• Free list: page frames on this list can be used to satisfy page faults. These pages are clean.

• Active list: frames on this list hold pages that are in active use (a FIFO list).

• Inactive list: frames on this list hold pages that may or may not be in active use.

The pageout daemon runs when the size of the free list drops below some minimum value. It examines the inactive list in FIFO order. If the reference bit is on, it moves the frame to the end of the list and turns off the reference bit. If the reference bit is off, it cleans the page, adds it to the free list and invalidates the page in its page table. When the free list is full, it stops. Then it moves pages from the active list to the tail of the inactive list, turning off their reference bits.
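The inactive-list scan can be sketched as follows. This is an illustrative model of the second-chance pass described above (the page names and the `referenced` set standing in for per-frame reference bits are invented for the example):

```python
# Hypothetical sketch of the pageout daemon's FIFO/second-chance scan over the
# inactive list: referenced pages get a second chance; unreferenced pages are
# reclaimed onto the free list until it reaches its target size.

from collections import deque

def pageout(inactive, free, free_target, referenced):
    scans = len(inactive)
    while len(free) < free_target and scans > 0:
        page = inactive.popleft()
        scans -= 1
        if page in referenced:            # reference bit on: second chance
            referenced.discard(page)      # turn the reference bit off
            inactive.append(page)         # move frame to the end of the list
        else:                             # reference bit off: reclaim the frame
            free.append(page)             # (page cleaned and invalidated here)

inactive = deque(["a", "b", "c", "d"])
free, referenced = [], {"a", "c"}
pageout(inactive, free, free_target=2, referenced=referenced)
assert free == ["b", "d"]                 # unreferenced pages were reclaimed
assert list(inactive) == ["a", "c"]       # referenced pages got a second chance
assert not referenced                     # their reference bits were cleared
```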

Memory Maps

Each Mach task has its own two-level memory map. [APP00, YOU87] In Mach, this memory map takes the form of an ordered doubly linked list. Each of these objects contains a list of pages and shadow references to other objects. The vm_map_entry structure contains task-specific information about an individual mapping along with a reference to the backing object. It connects objects with maps.

A map entry may be a normal object or a submap. A submap is a collection of mappings that is part of a larger map. Submaps are often used to group mappings together for the purpose of sharing them among multiple Mach tasks, but they may be used for many purposes. What makes a submap particularly powerful is that when several tasks have mapped a submap into their address space, they can see each other's changes, not only to the contents of the objects in the map, but to the objects themselves. This means that as additional objects are added to or deleted from the submap, they appear in or disappear from the address spaces of all tasks that share that submap. [APP00]

A map has a protection field indicating what rights the task currently has for the object, and a max_protection field that contains the maximum access the current task can obtain for the object.

It would be a security hole if a task could arbitrarily increase its own permissions on a memory object, however. In order to preserve a reasonable security model, the task that owns a memory object must be able to limit the rights granted to a subordinate task. For this reason, a task is not allowed to increase its protection beyond the permissions granted in max_protection.


Finally, the use_pmap field indicates whether a submap's low-level mappings should be shared among all tasks into which the submap is mapped. If the mappings are not shared, then the structure of the map is shared among all tasks, but the actual contents of the pages are not. For example, shared libraries are handled with two submaps. The read-only shared code section has use_pmap set to true. The read-write (non-shared) section has use_pmap set to false, forcing a clean copy of the library's DATA segment to be mapped in from disk for each new task. [APP00]

Shared Memory

Portions of a tasks’s space may be shared through inheritance or external memorymanagement. A task is a fairly expensive entity. It exists to be a collection of resources.All of the threads in a task share everything whereas for a task to share memory withanother task requires explicit action to share the port. A task can allocate and de-allocate ranges of memory in it’s space and change protections on them. It can alsospecify inheritance properties for the ranges. A new task is created by specifying anexisting task as a base from which to construct the address space for the new task.The inheritance attribute of each range of the memory of the existing task determineswhether the new task has that range defined and whether that range is virtually copiedor shared with the existing task. [LOE92]. Inheritance attributes may be specified asshared, copy or none. Pages specified as shared are shared for read and write. [CHA94]

Most copy operations within Mach are achieved through copy-on-write optimisations (the inheritance value is copy), which is itself accomplished by protected sharing of the memory range via the function vm_inherit_share. Two tasks share the memory to be copied as read-only until one of them wishes to write to the range. That portion is then copied. This lazy evaluation optimises kernel performance. If a page is marked none, the child's page is left unallocated.

When a region is duplicated due to a copy-on-write operation, the kernel must keep track of which of the two regions has been modified and which has not. Mach creates objects with the sole purpose of holding modified pages that originally belonged to another object. [CHA94] A shadow object is created. A shadow object is initially empty, without a pager, and contains a reference to the shadowed object. When the contents of a page are modified, the page is copied from the parent object into the shadow object and then modified. When reading data from a page, if that page exists in the shadow object, the page listed in the shadow object is used. If the shadow object has no copy of that page, the original object is consulted. A series of shadow objects pointing to shadow objects or original objects is known as a shadow chain. Shadow chains can become arbitrarily long if an object is heavily reused in a copy-on-write fashion. However, since fork is frequently followed by exec, which replaces all of the material being shadowed, long chains are rare. Further, Mach automatically garbage collects shadow objects, removing any intermediate shadow objects whose pages are no longer referenced by any (non-defunct) shadow object. It is even possible for the original object to be released if it no longer contains pages that are relevant to the chain. [APP00]
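The read and write paths through a shadow chain can be sketched as follows. The class and page contents are illustrative only - a stand-in for the VM object structures described above, not Mach's actual code:

```python
# Hypothetical sketch of a shadow chain: reads consult the shadow object first
# and fall back down the chain to the shadowed object; a write copies the page
# into the shadow object before modifying it (copy-on-write).

class VMObject:
    def __init__(self, shadowed=None):
        self.pages = {}               # page number -> contents
        self.shadowed = shadowed      # next object down the chain

    def read(self, n):
        obj = self
        while obj is not None:        # walk the shadow chain
            if n in obj.pages:
                return obj.pages[n]
            obj = obj.shadowed
        raise KeyError(n)

    def write(self, n, data):
        if n not in self.pages:       # first write: page copied into the shadow
            try:
                self.pages[n] = self.read(n)
            except KeyError:
                pass
        self.pages[n] = data

original = VMObject()
original.pages = {0: "AAAA", 1: "BBBB"}
shadow = VMObject(shadowed=original)

assert shadow.read(0) == "AAAA"       # unmodified pages come from the original
shadow.write(1, "bbbb")
assert shadow.read(1) == "bbbb"       # modified page now lives in the shadow
assert original.read(1) == "BBBB"     # the original is untouched
```

Chaining a further VMObject onto `shadow` would give exactly the arbitrarily long shadow chains discussed above.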

The Mac OS X VM system provides an abstraction known as a named entry. A named entry is nothing more than a handle to a shared object or a submap. Shared memory support in Mac OS X is achieved by sharing objects between the memory maps of various tasks. Shared memory objects must be created from existing VM objects by calling vm_allocate to allocate memory in your address space and then calling mach_make_memory_entry_64 to get a handle to the underlying VM object.


The handle returned by mach_make_memory_entry_64 can be passed to vm_map to map that object into a given task's address space. The handle can also be passed via IPC or other means to other tasks so that they can map it into their address spaces. This provides the ability to share objects with tasks that are not in your direct lineage, and also allows you to share additional memory with tasks in your direct lineage after those tasks are created.

The other form of named entry, the submap, is used to group a set of mappings. The most common use of a submap is to share mappings among multiple Mach tasks. What makes a submap particularly useful is that when several tasks have mapped a submap into their address space, they can see each other's changes to both the data and the structure of the map. This means that one task can map or unmap a VM object in another task's address space simply by mapping or unmapping that object in the submap.

Working Set Detection Subsystem

To improve performance, Mach has a subsystem known as the working set detection subsystem (see section 2.1.4). This subsystem is called on a VM fault; it keeps a profile of the fault behaviour of each task from the time of its inception. In addition, just before a page request, the fault code asks this subsystem which adjacent pages should be brought in and then makes a single large request to the pager. Since files on disk tend to have fairly good locality, and since address space locality is largely preserved in the backing store, this provides a substantial performance boost. Also, since it is based upon the application's previous behaviour, it tends to pull in pages that would probably have been needed later anyway. This occurs for all pagers. The working set code works well once it is established; however, without help, its performance would remain at the baseline until a profile for a given application had been developed. To overcome this, the first time an application is launched in a given user context, the initial working set required to start the application is captured and stored in a file. From then on, when the application is started, that file is used to seed the working set. These working set files are only accessible by the super-user (and the kernel). [APP00]

Capabilities in Mach

As mentioned in section 2.2.2, one way to protect capabilities is to put them in the kernel. This is exactly what the designers of Mach have done with the way the kernel remembers which processes have access to which ports. It keeps a table in the kernel of which ports each process has access to, and processes refer to ports by their position in the table. The table is a capability list: one c-list per process. When a thread asks the kernel to create a port for it, the kernel does so and adds a capability to the capability list of the process that the thread belongs to. It returns to the thread a 32-bit integer with which it can access the capability: the capability identifier, or name. All threads in the process can access the capability.

Each capability contains a pointer to a port as well as a rights field containing one of the following three rights: RECEIVE, SEND and SEND-ONCE. The first of these, RECEIVE, means that the process with this right can read messages from the port. At any one instant only one process can read from a port; hence each port has only one receiver. The SEND right can be held by numerous processes simultaneously. SEND-ONCE is like SEND, but after the send is over, the kernel destroys the capability. Capability lists are destroyed once their process exits or is killed, even if the c-list has receive rights for some port. Such ports are also destroyed, as they are now useless. Should any capabilities exist with send rights to such a port, the kernel marks them as dead and returns an appropriate error code should any process attempt to send to them.

If a process is allocated the same capability several times, the kernel allocates only one capability but counts how many times the process has received it. Releasing the capability decrements the count by one. [TAN87] See also section 2.8.2 of this paper.
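This reference-counting behaviour can be sketched in a few lines. The capability names and the grant/release interface below are invented for illustration; they model the kernel-side bookkeeping, not Mach's actual data structures:

```python
# Hypothetical sketch of capability reference counting: granting the same
# capability to a process again only bumps a count, and a release decrements
# it; the capability disappears when the count reaches zero.

clist = {}   # capability name -> [port, right, refcount]

def grant(name, port, right):
    if name in clist:
        clist[name][2] += 1          # already held: just count it again
    else:
        clist[name] = [port, right, 1]

def release(name):
    clist[name][2] -= 1
    if clist[name][2] == 0:
        del clist[name]              # last reference: capability destroyed

grant("cap1", "portA", "SEND")
grant("cap1", "portA", "SEND")       # duplicate grant: refcount, not a new entry
assert len(clist) == 1 and clist["cap1"][2] == 2
release("cap1")
assert "cap1" in clist               # still one reference outstanding
release("cap1")
assert "cap1" not in clist           # now gone
```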

Summary

Mach is based on the concepts of processes, threads, ports, memory objects and message-based IPC. A Mach process is an address space and its accompanying threads. Threads occur within processes, with each process having a port to which it can write to have kernel calls carried out, eliminating the need for direct system calls [TAN87]. Memory objects are the means by which tasks take control over memory management.

Mach is based on an object-oriented design approach [HOH96] with communication channels as object references [LOE92] and is meant as a foundation for operating systems built on top of it. It is the goal of the Mach project designers to move more and more functionality out of the kernel until everything is done by user-mode tasks communicating via the kernel.

2.4 Linux Kernel Software Architecture

Trying to describe the most up-to-date Linux kernel code is a Herculean task because Linux is a moving target. It is continuously being modified and updated. What follows are basic aspects of the design of the Linux virtual memory system, circa version 2.4. At the time of writing, the latest stable version of the Linux kernel is 2.6.10. Linux itself is only the kernel of an operating system. The Linux kernel is useless by itself; it participates as one layer in the overall system. Figure 2.11 shows a decomposition of the system.

[Figure: four layers, top to bottom - User Applications; O/S Services; Linux Kernel; Hardware Controllers]

Figure 2.11: The major subsystems of the Linux operating system

Each subsystem layer can only communicate with the subsystem layers that are immediately adjacent to it. In addition, the dependencies between subsystems run from the top down: layers pictured near the top depend on lower layers, but subsystems nearer the bottom do not depend on higher layers. The Linux operating system is composed of four major subsystems [BOW98]:

User Applications the set of applications in use on a particular Linux system will be different depending on what the computer system is used for, but typical examples include a word-processing application and a web browser


CHAPTER 2. THE PRESENT 34

O/S Services these are services that are typically considered part of the operating system (a windowing system, command shell, etc.); also, the programming interface to the kernel (compiler tools and libraries) is included in this subsystem

Linux Kernel the kernel abstracts and mediates access to the hardware resources, including the CPU

Hardware Controllers this subsystem is comprised of all the possible physical devices in a Linux installation; for example, the CPU, memory hardware, hard disks, and network hardware are all members of this subsystem

2.4.1 The Kernel Layer

Within the kernel layer, Linux is composed of five major subsystems: the process scheduler, the memory manager, the virtual file system, the network interface, and the inter-process communication subsystem. Figure 2.12 (see footnote 16) shows the relationships between the subsystems:

Figure 2.12: Linux Conceptual Decompositional Overview

The process scheduler is the most central subsystem; all other subsystems depend on the process scheduler since all subsystems need to suspend and resume processes.

Of the other dependencies: the process-scheduler subsystem uses the memory manager to adjust the hardware memory map for a specific process when that process is resumed. The inter-process communication subsystem depends on the memory manager to support a shared-memory communication mechanism. This mechanism allows two

16 taken from [BOW98+]


processes to access an area of common memory in addition to their usual private memory.

The memory manager uses the virtual file system to support swapping; this is the only reason that the memory manager depends on the virtual file system. When a process accesses memory that is currently swapped out, the memory manager makes a request to the file system to fetch the memory from persistent storage, and suspends the process. [BOW98]

Memory Manager

The memory manager provides the capabilities already discussed to its clients: large address space, protection, memory mapping, fair access to physical memory and shared memory.

The memory manager subsystem is composed of three modules:

1. The architecture-specific module presents a virtual interface to the memory management hardware

2. The architecture-independent manager performs all of the per-process mapping and virtual memory swapping. This module is responsible for determining which memory pages will be evicted when there is a page fault – there is no separate policy module since it is not expected that this policy will need to change.

3. A system call interface is provided to give restricted access to user processes. This interface allows user processes to allocate and free storage, and also to perform memory-mapped file I/O.

The memory manager stores a per-process mapping of physical addresses to virtual addresses. This mapping is stored as a reference in the process scheduler's task list data structure. In addition to this mapping, additional details in the data block tell the memory manager how to fetch and store pages. For example, executable code can use the executable image as a backing store, but dynamically allocated data must be backed to the system paging file. Finally, the memory manager stores permissions and accounting information in this data structure to ensure system security.

The memory manager controls the memory hardware, and receives a notification from the hardware when a page fault occurs — this means that there is bi-directional data and control flow between the memory manager modules and the memory manager hardware. Also, the memory manager uses the file system to support swapping and memory-mapped I/O. This requirement means that the memory manager needs to make procedure calls to the file system to store and fetch memory pages from persistent storage. Because the file system requests cannot be completed immediately, the memory manager needs to suspend a process until the memory is swapped back in; this requirement causes the memory manager to make procedure calls into the process scheduler. Also, since the memory mapping for each process is stored in the process scheduler's data structures, there is a bi-directional data flow between the memory manager and the process scheduler. User processes can set up new memory mappings within the process address space, and can register themselves for notification of page faults within the newly mapped areas. This introduces a control flow from the memory manager, through the system call interface module, to the user processes. There is no data flow from user processes in the traditional sense, but user processes can retrieve


some information from the memory manager using select system calls in the system call interface module. [BOW98]

The memory manager provides two interfaces to its functionality: a system-call interface that is used by user processes and an interface that is used by other kernel subsystems to accomplish their tasks. 17

System Call Interface

• malloc() / free() - allocate or free a region of memory for the process’s use

• mmap() / munmap() / msync() / mremap() - map files into virtual memory regions

• mprotect() - change the protection on a region of virtual memory

• mlock() / mlockall() / munlock() / munlockall() - super-user routines to prevent memory being swapped

• swapon() / swapoff() - super-user routines to add and remove swap files for the system

Intra-Kernel Interface

• kmalloc() / kfree() - allocate and free memory for use by the kernel's data structures

• verify_area() - verify that a region of user memory is mapped with required permissions

• get_free_page() / free_page() - allocate and free physical memory pages

In addition to the above interfaces, the memory manager makes all of its data structures and most of its routines available within the kernel. Many kernel modules interface with the memory manager through access to the data structures and implementation details of the subsystem.

Since Linux supports several hardware platforms, there is a platform-specific part of the memory manager that abstracts the details of all hardware platforms into one common interface. All access to the hardware memory manager is through this abstract interface.

The memory manager uses the hardware memory manager to map virtual addresses to physical addresses. In addition, the memory manager uses a daemon (kswapd) for paging. Linux uses the term 'daemon' to refer to kernel threads; a daemon is scheduled by the process scheduler in the same way that user processes are, but daemons can directly access kernel data structures.

The kswapd daemon periodically checks to see if there are any physical memory pages that haven't been referenced recently and swaps them out. The memory manager subsystem takes special care to minimize the amount of disk activity that is required by attempting to avoid writing pages to disk if they could be retrieved another way.

The hardware memory manager detects page faults. It notifies the Linux kernel of this page fault and it is up to the memory manager subsystem to resolve the fault. If the memory manager detects an invalid memory access, it notifies the user process with a signal; if the process doesn't handle this signal, it is terminated.

17 A Java-based, interactive breakdown of the Linux kernel memory manager architecture can be found here: http://plg.uwaterloo.ca/~itbowman/pbs/mm.html


The following data structures are architecturally relevant:

vm_area - the memory manager stores a data structure with each process that records what regions of virtual memory are mapped to which physical pages. This data structure also stores a set of function pointers that allow it to perform actions on a particular region of the process's virtual memory. For example, the executable code region of the process does not need to be swapped to the system paging file since it can use the executable file as a backing store. When regions of a process's virtual memory are mapped (for example when an executable is loaded), a vm_area_struct is set up for each contiguous region in the virtual address space. Since speed is critical when looking up a vm_area_struct for a page fault, the structures are stored in an AVL tree.

mem_map - the memory manager maintains a data structure for each page of physical memory on a system. This data structure contains flags that indicate the status of the page (for example, whether it is currently in use). All page data structures are available in a vector (mem_map), which is initialized at kernel boot time. As page status changes, the attributes in this data structure are updated.

free_area - the free_area vector is used to store unallocated physical memory pages; pages are removed from the free_area when allocated, and returned when freed. The Buddy system [KNU73, TAN87] is used when allocating pages from the free_area.

Short aside: The Buddy System

As processes are loaded and killed, they leave behind free holes in memory. These holes, when kept on lists sorted by size, can be allocated quickly but deallocated only slowly, because all the lists have to be searched to find a segment's neighbours. The buddy system, which is credited to Knowlton (1965) and Knuth (1973) [TAN87], merges holes when a process is swapped out in a way that utilises binary blocks. The memory manager keeps lists of free blocks of each power-of-two size up to the size of memory: one list for blocks of 1 byte, another for blocks of 2 bytes, and one each for 4 bytes, 8 bytes, 16 bytes, etc. Initially all of memory is free, so with 1 MB of memory the 1 MB list has a single entry containing a single 1 MB hole; all the other lists are empty. As a process gets swapped in, it gets an area of memory in exactly one of these integer-power-of-two blocks. As it gets swapped out, merges of "buddy" blocks are made. So a 90 KB process gets a 128 KB chunk, because 64 KB is too small. However, there are no blocks of any size at initialisation other than the entire memory size - in this example 1 MB. So memory is split, in our case into two 512 KB blocks called buddies [TAN87],

Figure 2.13: Buddy1


one at address 0 and one at 512 KB, the first of which is split again into two 256 KB blocks. The first of these is split into two 128 KB blocks, and the first of those is allocated to our 90 KB process, labelled A in the figures.

Figure 2.14: Buddy2

If a 42 KB process were then to be swapped in, we check our list of 64 KB blocks and see there aren't any. So we check to see if there are any 128 KB blocks free; there's one, so we split it into two 64 KB blocks and allocate one of them to the 42 KB process, labelled B in the figures.

Figure 2.15: Buddy3

When B is killed, its 64 KB block is merged with its buddy (providing the buddy is still free at this point) to create a 128 KB block, and if that is still free when A is killed, the two 128 KB blocks would be merged.

The advantage of the buddy system over algorithms that allow the memory to be split arbitrarily is that when a block of size 2^k is freed, the memory manager only has to search the list of 2^k holes.

END OF ASIDE

The memory manager subsystem is composed of several source code modules; these can be decomposed by areas of responsibility into the following groups (shown in Figure 2.16):

• System Call Interface - this group of modules is responsible for presenting the services of the memory manager to user processes through a well-defined interface.


• Memory-Mapped Files (mmap) - this group of modules is responsible for supporting memory-mapped file I/O.

• Swapfile Access (swap) - this group of modules controls memory swapping. These modules initiate page-in and page-out operations.

• Core Memory Manager (core) - these modules are responsible for the core memory manager functionality that is used by other kernel subsystems.

• Architecture-Specific Modules - these modules provide a common interface to all supported hardware platforms. These modules execute commands to change the hardware MMU's virtual memory map, and provide a common means of notifying the rest of the memory-manager subsystem when a page fault occurs.

[BOW98+]

Figure 2.16: Memory Manager Structure


2.4.2 Linux Virtual Memory Architecture

Overview

There are basically two sections of memory in the Linux memory architecture: kernel and user space. The Linux kernel sees all of physical memory mapped starting at PAGE_OFFSET, which is typically 3 GB on x86.

In contrast, normal processes see memory via page tables that give them their own virtual address spaces, and the virtual address of a logical page has little to do with the physical address it resides in — it is wherever the kernel decided to put it.

On x86 processors, the top 1 GB of the address space is reserved for the kernel, and user processes cannot see that part of the address space at all — no pages are mapped there in user-process page tables. On entry to the kernel, the addressing mode is changed and the kernel can see physical RAM there. The kernel can also see the user space of the process it is executing on behalf of, through that process's page table. Most multi-purpose processors support the notion of a physical address mode as well as a virtual address mode. Physical addressing mode requires no page tables and the processor does not attempt to perform any address translations in this mode. The Linux kernel is linked to run in physical address space.

Memory in Linux is split logically into zones. Physical memory is dealt with via the Buddy allocator (so called because it uses the buddy algorithm), and in-kernel memory is dynamically allocated by the slab allocator, which uses the "slab allocation" algorithm. Slabs are built on top of physical pages allocated by the Buddy allocator, and the memory thereby allocated is contiguous. [KUM02] The slab allocator provides memory in 128 KB chunks. Should memory chunks of greater size be needed, this is done when the page tables are set up at boot time; such memory is still physically contiguous. If, however, memory chunks greater than 128 KB are needed and they don't have to be contiguous in memory, then vmalloc allocates virtually contiguous memory that might happen to be coincidentally physically contiguous, but it also might not be. The main kernel memory allocator is kmalloc. Kmalloc itself knows nothing about virtual memory. In Linux kernel version 2.2, kmalloc used the binary buddy system and only managed blocks of memory whose sizes are a power-of-two multiple of the page size. It seems the old buddy allocator had problems with memory fragmentation and had no simple way to distinguish between various kinds of memory. For this reason, the notion of zones was introduced in v2.4 (see section 2.4.2).

IPC

Linux supports a number of inter-process communication (IPC) mechanisms. Signals and pipes are two of them, but Linux also supports the System V IPC mechanisms such as queues, semaphores and shared memory. [MOH04, RUS99] Signals are used to signal asynchronous events to one or more processes, for example an error condition such as the process attempting to access a non-existent location in its virtual memory. They can only be processed when the process is in user mode. If a signal has been sent to a process that is in kernel mode, it is dealt with immediately on returning to user mode. With the exception of the kill and stop signals, a process can choose just how it wants to handle the various signals. Processes can block the signals and, if they do not block them, they can either choose to handle them themselves or allow the kernel to handle them. If the kernel handles the signals, it will do the default actions required for this signal. Not every process in the system can send signals to every other process; the kernel can, and super users can.

Page 44: V MANAGEMENT IN THE 32 64 M OS X L · VIRTUAL MEMORY MANAGEMENT IN THE MOVE FROM 32 TO 64 BIT WITH REFERENCE TO MAC OS X AND LINUX BACHELOR-ARBEIT KARL RITSON 15 July 2005 Technische

CHAPTER 2. THE PRESENT 41

Normal processes can only send signals to processes with the same user id and group id.

Signals are not presented to the process immediately they are generated; they must wait until the process is running again. Every time a process exits from a system call, its signal and blocked fields are checked and, if there are any unblocked signals, they can now be delivered. This might seem a very unreliable method, but every process in the system is making system calls, for example to write a character to the terminal, all of the time. [RUS99]

Pipes are unidirectional byte streams which connect the standard output from one process into the standard input of another process [RUS99]. Neither process is aware of this redirection and behaves just as it would normally. It is the shell which sets up these temporary pipes between the processes. As the writing process writes to the pipe, bytes are copied into the shared data page, and when the reading process reads from the pipe, bytes are copied from the shared data page.

Shared Memory

As mentioned above (section 2.4.2), one of the SYSV IPC mechanisms is that of shared memory. Shared memory allows one or more processes to communicate via memory that appears in all of their virtual address spaces via references in their respective page tables. A process may access shared memory by passing a unique reference identifier to the kernel via a system call. Access to shared memory objects is checked using access permissions, much like accesses to files are checked. The access rights to the shared memory object are set by the creator of the object via system calls. If the creator has enough access rights, it may also lock the shared memory into physical memory. The object's reference identifier is used by each mechanism as an index into a table of resources. Once the memory is shared, there are no checks on how the processes are using it. They must rely on other mechanisms, for example semaphores, to synchronize access to the memory. The process can choose where in its virtual address space the shared memory goes, or it can let Linux choose a free area large enough. Each newly created shared memory area is represented by a shmid_ds data structure. As we have seen, the first time that a process accesses one of the pages of the shared virtual memory, a page fault will occur. When Linux fixes up that page fault it finds the vm_area_struct data structure describing it. This contains pointers to handler routines for this type of shared virtual memory. The shared memory page fault handling code looks in the list of page table entries for this shmid_ds to see if one exists for this page of the shared virtual memory. If it does not exist, it will allocate a physical page and create a page table entry for it. As well as going into the current process's page tables, this entry is saved in the shmid_ds. This means that when the next process that attempts to access this memory gets a page fault, the shared memory fault handling code will use this newly created physical page for that process too [RUS99].

When processes no longer wish to share the virtual memory, they detach from it. So long as other processes are still using the memory, the detach only affects the current process. Its vm_area_struct is removed from the shmid_ds data structure and deallocated. The current process's page tables are updated to invalidate the area of virtual memory that it used to share. When the last process sharing the memory detaches from it, the pages of the shared memory currently in physical memory are freed, as is the shmid_ds data structure for this shared memory.


Zones

Regardless of what machine architecture you are running Linux on, the kernel needs to be able to describe memory. Some memory is geographically near the processor, like level 1 cache, and some is distant, like a bank of memory near device cards suitable for DMA. These different banks of memory are called nodes [GOR04], and each node is logically split into various zones. [KUM02, GOR04] The split is different on different machines and depends on various factors, including how much memory your machine has. Different ranges of physical pages may have different properties for the kernel's purposes. For example, Direct Memory Access, which allows peripherals to read and write data directly to RAM without the CPU's intervention, may only work for physical addresses less than 16 MB [KNA01, GOR04]. Some systems have more physical RAM than can be mapped between PAGE_OFFSET and 4 GB; those physical pages are not directly accessible to the kernel and so must be treated differently. The zone allocator (see section 2.4.2) handles such differences by dividing memory into a number of zones and treating each zone as a unit for allocation purposes. Any particular allocation request utilizes a list of zones from which the allocation may be attempted, in most-preferred to least-preferred order. For example, a request for a user page should be filled first from ZONE_NORMAL (16 MB - 896 MB); if that fails, from ZONE_HIGHMEM (896 MB - end); and if that fails, from ZONE_DMA (first 16 MB of memory). Thus, the zonelist for such allocations consists of the normal, HIGHMEM, and DMA zones, in that order. On the other hand, a request for a DMA page may only be fulfilled from the DMA zone, so the zonelist for such requests contains only the DMA zone. [KNA01, GOR04] Pages are mapped into the different zones via a global mem_map array.

Zones keep track of page use statistics, free area information and zone locks. Zones also have watermarks letting the system know when the available memory is low, that it should wake up the page daemon kswapd to start freeing pages, and how hard it should work to do that. [GOR04]

The Zone Allocator

The Zone allocator was introduced in version 2.4. The kernel page tables map as much of physical memory as possible into the address range starting at PAGE_OFFSET. The physical pages occupied by the kernel code and data are reserved and will never be allocated to any other purpose; all other physical pages are allocated to process virtual memory, the buffer cache, additional kernel virtual memory, and so forth, as necessary. In order to do this, we must have a way of keeping track of which physical pages are in use, and by whom. The zone allocator carves up the physical address space into a number of zones and allocates certain types of memory objects preferentially from appropriate zones. Thus, user memory (that is, memory to be mapped into a process address space) is allocated preferentially from the "cacheable" zone and only from the DMA or slow zones if no cached pages are available; requests for DMA are filled exclusively from the "DMA" zone; and certain other requests (e.g. page tables) are filled preferentially from the "slow" zone.

The zone allocator still uses the buddy system internally. The free_area lists and buddy bitmaps are maintained on a per-zone basis rather than globally.

In 2.2 and earlier kernels, the physical allocator did have a means of distinguishing DMA memory from other memory, but it involved maintaining separate global freelists for DMA and normal RAM. The zone allocator is a refinement of that system.


Pages and Page Table Management

The kernel memory map uses 4 MB pages on the Pentium architecture [KUM02]. Pages in Linux are 4 KB in size [MOH04, KUM02].

Linux layers the machine-independent/dependent code in an unusual manner in comparison to other operating systems [CRA99]. Mach, as mentioned above, has the pmap objects that manage the underlying physical pages. Linux instead maintains the concept of a three-level page table (see footnote 18) in the architecture-independent code even if the underlying architecture does not support it. [GOR04]

1. A top-level node is called the Page Global Directory (PGD) because it acts as the index to all the pages belonging to the process.

2. A middle-level node is called a Page Middle Directory (PMD)

3. A bottom level is called a "page table" because it holds actual PTEs describing particular (virtual) pages.

Each Page Table accessed contains the page frame number of the next level of Page Table. A virtual address can be broken into a number of fields, each field providing an offset into a particular Page Table. To translate a virtual address into a physical one, the processor must take the contents of each level field, convert it into an offset into the physical page containing the Page Table, and read the page frame number of the next level of Page Table. This is repeated three times until the page frame number of the physical page containing the virtual address is found. Now the final field in the virtual address, the byte offset, is used to find the data inside the page. More specifically, each active entry in the PGD table points to a page frame containing an array of PMD entries of type pmd_t, which in turn points to page frames containing Page Table Entries (PTEs) of type pte_t, which finally point to page frames containing the actual user data.

Each platform that Linux runs on must provide translation macros that allow the kernel to traverse the page tables for a particular process. This way, the kernel does not need to know the format of the page table entries or how they are arranged. [RUS99]

2.4.3 Page Fault Handler

In this section I will look at some code from the Linux kernel with respect to just how it handles page faults. I will look specifically at the two architecture-independent functions handle_mm_fault() and handle_pte_fault() from mm/memory.c, which represent the top-level pair of functions for the architecture-independent page fault handler. [GOR04]

The Function handle_mm_fault()

The function handle_mm_fault() allocates the Page Middle Directory and Page Table Entry necessary for the new Page Table Entry that is about to be allocated. It takes the necessary locks to protect the page tables before calling handle_pte_fault() to fault in the page itself.

18 In version 2.6.11 there is a four-level page table with a new level called "PUD". See http://lwn.net/Articles/117749/


handle_mm_fault()

1364 int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
1365                     unsigned long address, int write_access)
1366 {
1367     pgd_t *pgd;
1368     pmd_t *pmd;
1369
1370     current->state = TASK_RUNNING;
1371     pgd = pgd_offset(mm, address);
1372
1373     /*
1374      * We need the page table lock to synchronize with kswapd
1375      * and the SMP-safe atomic PTE updates.
1376      */
1377     spin_lock(&mm->page_table_lock);
1378     pmd = pmd_alloc(mm, pgd, address);
1379
1380     if (pmd) {
1381         pte_t *pte = pte_alloc(mm, pmd, address);
1382         if (pte)
1383             return handle_pte_fault(mm, vma, address, write_access, pte);
1384     }
1385     spin_unlock(&mm->page_table_lock);
1386     return -1;
1387 }

Commentary

1364 The parameters of the function are the following:

• mm is the mm_struct for the faulting process

• vma is the vm_area_struct managing the region the fault occurred in

• address is the faulting address

• write_access is 1 if the fault is a write fault

1370 Sets the current state of the process.
1371 Gets the page global directory entry from the top-level page table.
1377 Locks the mm_struct because the page tables will change.
1378 pmd_alloc() will allocate a pmd_t if one does not already exist.
1380 If the pmd has been successfully allocated, then...
1381 Allocates a page table entry for this address if one does not already exist.
1382-1383 Handles the page fault with handle_pte_fault() and returns the status code.
1385 Failure path; unlocks the mm_struct.
1386 Returns -1, which will be interpreted as an out-of-memory condition. This is correct because this line is only reached if a page middle directory or page table entry could not be allocated.


The Function handle_pte_fault()

The function handle_pte_fault() decides what type of fault this is and which function should handle it. do_no_page() is called if this is the first time a page is to be allocated. do_swap_page() handles the case where the page was swapped out to disk, with the exception of pages swapped out from tmpfs. do_wp_page() breaks copy-on-write pages. If none of them are appropriate, the PTE entry is simply updated. If it was written to, it is marked dirty, and it is marked accessed to show it is a young page.

handle_pte_fault()

1331 static inline int handle_pte_fault(struct mm_struct *mm,
1332     struct vm_area_struct *vma, unsigned long address,
1333     int write_access, pte_t *pte)
1334 {
1335     pte_t entry;
1336
1337     entry = *pte;
1338     if (!pte_present(entry)) {
1339         /*
1340          * If it truly wasn't present, we know that kswapd
1341          * and the PTE updates will not touch it later. So
1342          * drop the lock.
1343          */
1344         if (pte_none(entry))
1345             return do_no_page(mm, vma, address, write_access, pte);
1346         return do_swap_page(mm, vma, address, pte, entry, write_access);
1347     }
1348
1349     if (write_access) {
1350         if (!pte_write(entry))
1351             return do_wp_page(mm, vma, address, pte, entry);
1352
1353         entry = pte_mkdirty(entry);
1354     }
1355     entry = pte_mkyoung(entry);
1356     establish_pte(vma, address, pte, entry);
1357     spin_unlock(&mm->page_table_lock);
1358     return 1;
1359 }

1331 The parameters of the function are the same as those for handle_mm_fault() except that the page table entry (PTE) for the fault is included.
1337 Records the PTE.
1338 Handles the case where the PTE is not present.
1344 If the PTE has never been filled, this handles the allocation of the PTE with do_no_page()


1346 If the page has been swapped out to backing storage, this handles it with do_swap_page()
1349-1354 Handles the case where the page is being written to.
1350-1351 If the fault is a write but the PTE is not marked writable, it is a copy-on-write page, so handle it with do_wp_page()
1353 Otherwise, this simply marks the page as dirty.
1355 Marks the page as accessed.
1356 establish_pte() copies the PTE and then updates the TLB and MMU cache. This does not copy in a new PTE, but some architectures require the TLB and MMU update.
1357 Unlocks the mm_struct and returns that a minor fault occurred.

2.4.4 Page Replacement

Linux uses a Least Recently Used (LRU) (see section 2.1.4) page aging technique to fairly choose pages which might be removed from the system. The aging principle (see section 2.1.4) Linux uses to fairly share memory resources is that each physical page of data carries with it a measure of its frequency of use: its age. The more frequently a page is referenced, the less likely it is to be discarded by the kernel when memory becomes scarce. [KNA01] Memory in Linux is unified, that is, all the physical memory is on the same free list [RIE01] and can be allocated to any one of a number of caches on demand.

• The slab cache: this is the kernel's dynamically allocated heap storage. This memory is unswappable, but once all objects within one (usually page-sized) area are unused, that area can be reclaimed.

• The page cache: this cache is used to cache file data for both mmap() and read() and is indexed by (inode, index) pairs. No dirty data exists in this cache; whenever a program writes to a page, the dirty data is copied to the buffer cache, from where the data is written back to disk.

• The buffer cache: this cache is indexed by (block device, block number) tuples and is used to cache raw disk devices, inodes, directories and other filesystem metadata. It is also used to perform disk IO on behalf of the page cache and the other caches. For disk reads the page cache bypasses this cache, and for network filesystems it isn't used at all.

• The inode cache: this cache resides in the slab cache and contains information about cached files in the system. Linux 2.2 cannot shrink this cache, but because of its limited size it does need to reclaim individual entries.

• The dentry cache: this cache contains directory and name information in a filesystem-independent way and is used to look up files and directories. This cache is dynamically grown and shrunk on demand.

• Shared memory: the memory pool containing the shared memory segments is managed much like the page cache, but has its own infrastructure for doing things.

• Process-mapped virtual memory: this memory is administered in the process page tables. Processes can have page cache or shared memory segments mapped, in which case those pages are managed in both the page tables and the data structures used for the page cache or the shared memory code, respectively.


The drawback of using caches, hardware or otherwise, is that in order to save effort elsewhere Linux must spend more time and space maintaining these caches, and, if the caches become corrupted, the system will crash.

The page replacement of Linux 2.2 works as follows. When free memory drops below a certain threshold, the pageout daemon (kswapd) is woken up. Kswapd is an infinite loop, which incrementally scans all the normal VM pages subject to paging, then starts over. Kswapd generally does its clock sweeping in increments, and sleeps in between increments so that normal processes may run. The pageout daemon should usually be able to keep enough free memory, but if it isn't, user programs will end up calling the pageout code themselves. [RIE01] It starts by trying to free slabs for the kernel memory pool. Then it calls several functions in a loop until enough memory has been cleared. First, it calls a clock algorithm 19 loop, which loops over all physical pages to see whether they have been touched (and perhaps dirtied) lately, clearing referenced bits, queuing old dirty pages for I/O and freeing old clean pages. Then a function scans shared memory segments, swapping out those pages that haven't been referenced recently and which aren't mapped into any process. Then a function scans the virtual memory of all processes in the system, unmapping pages which haven't been referenced recently, starting swapout IO and placing those pages in the page cache. [RIE01] The idea is that functions get called with a certain priority argument and, if not enough memory is freed, they get called again with a higher priority. This means that if one memory pool is heavily used it won't give up its resources lightly, and one of the other memory pools will be forced to donate its memory.
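The priority escalation described above can be sketched as follows. This is an illustrative toy model, not kernel code: the function names, the pool model and all constants are inventions; only the escalation pattern (retry each shrink function with rising priority until the free target is met) reflects the mechanism.

```c
/* Toy sketch of the 2.2-era pageout escalation. */

#define FREE_TARGET 32
#define NR_POOLS 3

/* Pages each pool is willing to yield per call in this toy model. */
static int pool_pages[NR_POOLS] = { 1, 2, 20 };

static int shrink_pool(int pool, int priority)
{
    /* A pool resists low-priority requests and only donates memory once
     * the priority has been raised far enough. */
    int resistance = 2 - priority;

    if (resistance < 0)
        resistance = 0;
    return pool_pages[pool] > resistance ? pool_pages[pool] : 0;
}

static int try_to_free_pages_sketch(void)
{
    int freed = 0, priority, pool;

    /* Retry all pools with ever higher priority until enough is freed. */
    for (priority = 0; priority <= 6 && freed < FREE_TARGET; priority++)
        for (pool = 0; pool < NR_POOLS; pool++)
            freed += shrink_pool(pool, priority);
    return freed;
}
```

With the pool sizes above, the heavily used third pool donates immediately, while the lightly used pools only contribute once the priority rises, mirroring the "won't give up its resources lightly" behaviour.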

The clock algorithm proceeds in increments, usually sweeping a small fraction of the in-memory pages at a time, and keeps a record of its current position between increments of sweeping. This allows it to resume its sweeping from that page at the next increment. [GOR04]
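A minimal sketch of such an incremental sweep follows; the page count, the data layout and the return value are illustrative assumptions, but the saved hand position between calls is the point being demonstrated.

```c
/* Sketch of an incremental clock sweep: the hand position survives
 * between calls, so each call scans only a few pages and resumes where
 * the previous one stopped. */

#define NR_PAGES 8

static int page_referenced[NR_PAGES] = { 1, 1, 1, 1, 0, 0, 0, 0 };
static int clock_hand;                     /* saved between sweeps */

/* Sweep `count` pages: clear referenced bits and count pages that were
 * already unreferenced (the eviction candidates). */
static int clock_sweep(int count)
{
    int candidates = 0;

    while (count-- > 0) {
        if (page_referenced[clock_hand])
            page_referenced[clock_hand] = 0;   /* second chance */
        else
            candidates++;
        clock_hand = (clock_hand + 1) % NR_PAGES;
    }
    return candidates;
}
```

Because the hand wraps around, a page cleared on one pass becomes an eviction candidate on the next pass unless it is referenced again in between.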

kswapd isn't the only replacement algorithm, however — there are actually several interacting replacement algorithms in the Linux memory management system.

• Pages that are part of (System V IPC) shared memory segments are swept by a different clock algorithm, implemented by shm_swap()

• Page frames managed by the file buffering system are managed differently, for several reasons. (For example, file blocks may be smaller than the VM page size, and filesystem metadata are flushed to disk more often than normal blocks.)

Linux 2.4 introduces page aging in combination with multiple page lists, avoiding the problem that referenced clean pages were freed before old dirty pages [RIE01]. Page aging works as follows: for each physical page we keep a counter (called age in Linux, or act_count in FreeBSD) that indicates how desirable it is to keep this page in memory. When scanning through memory for pages to evict, we increase the page age (adding a constant) whenever we find that the page was accessed, and we decrease the page age (subtracting a constant) whenever we find that the page wasn't accessed. When the page age reaches zero, the page is a candidate for eviction.
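The aging rule just described can be written out directly. The constants and the structure below are illustrative, not the values Linux 2.4 actually uses; only the up-by-a-constant, down-by-a-constant behaviour is taken from the text.

```c
/* Sketch of the page-aging rule: age rises when the page was referenced
 * since the last scan, falls otherwise; age zero marks a candidate. */

#define AGE_UP   3
#define AGE_DOWN 1
#define AGE_MAX  20

struct page_age {
    int age;
    int referenced;
};

/* One aging scan step; returns 1 when the page has become an eviction
 * candidate (its age has dropped to zero). */
static int age_page(struct page_age *p)
{
    if (p->referenced) {
        p->age += AGE_UP;
        if (p->age > AGE_MAX)
            p->age = AGE_MAX;
        p->referenced = 0;        /* referenced bit cleared each scan */
    } else if (p->age > 0) {
        p->age -= AGE_DOWN;
    }
    return p->age == 0;
}
```

Capping the age (here at AGE_MAX) stops a once-hot page from surviving indefinitely after it goes cold, which is why real implementations bound the counter as well.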

Linux uses the basic flat model, where the operating system and application programs have access to a continuous, unsegmented address space. All segment registers point to the same segment descriptor. Each segment has full access to the whole memory space. Virtual memory (and memory protection) is implemented through paging.

19 Clock algorithms are commonly used because they provide a passable approximation of LRU replacement and are cheap to implement [GOR04]


Each process has its own page directory, but shares the kernel space, which is located in the fourth gigabyte.

2.5 Linux vs Mach

The Linux kernel and the Mach kernel differ in many ways. Mach doesn't provide a file system, for example [THO94].

To try and get round the traditional shortcomings of being monolithic, Linux now has the ability to load and unload modules dynamically into the kernel. This improves the modularity of the source code and the runtime memory footprint of the kernel, but still retains the defining characteristic of the monolithic kernel: all operating system code still runs in kernel mode. [CHI00]

Both Mach and Linux were designed for private address spaces, but Linux executes drivers and file systems in kernel space whereas Mach does so in user space.

2.6 Chip Architecture

2.6.1 The PowerPC processor

The Memory Management Unit (MMU) in a PowerPC processor handles two main types of accesses, generally speaking: instruction accesses, and data accesses generated by load and store instructions. It has been developed for 32 and 64 bits, with 32- and 64-bit registers and a 64-bit effective address. It uses hashed page tables and divides the memory up into 256 MB segments, with some 64-bit implementations having Segment Lookaside Buffer (SLB) hardware for quicker segment allocation. Address translation occurs via segment descriptors and page tables, whereby the former translates the logical address to an interim virtual address and the latter translates the virtual address to a physical address. The segment descriptors reside in on-chip segment registers in 32-bit implementations and as segment table entries (STEs) in memory in 64-bit implementations. The MMU of a 64-bit PPC processor provides an interim virtual address (80 or 64 bits) and hashed page tables in the generation of physical addresses ≤ 64 bits in length. In contrast, the MMU of a 32-bit PowerPC processor is similar except that it provides a 52-bit interim virtual address and physical addresses that are ≤ 32 bits in length. The model of the MMU allows for the concept not only of a virtual address that is bigger than the maximum physical memory, but of a virtual address space that is bigger than the effective address space. 64-bit processors can support either an 80-bit or a 64-bit virtual address range (the "interim" virtual address mentioned above). I assume the 80-bit and the 52-bit cases for this thesis. In addition to the 64- and 32-bit memory management models, the PowerPC defines a 32-bit mode of operation for 64-bit implementations. The 64-bit address is calculated as usual and then the high-order bits are treated as zero. This occurs for both instruction and data accesses. Figure 2.17 20 shows the conceptual organisation of the MMU in a 64-bit implementation. The instruction addresses shown in the figure are generated by the

20 graphic from [IBM03]


processor for sequential instruction fetches and addresses that correspond to a change of program flow. Memory addresses are generated by load and store instructions and by cache instructions. [IBM03]

Figure 2.17: MMU Conceptual Block Diagram - 64 Bit Implementation

As shown in figure 2.17, after an address is generated, the higher-order bits of the effective address are translated into physical address bits (beginning with 'P'), while the lower-order address bits are untranslated and therefore identical in the effective and the physical address. The MMU then passes the resulting 64-bit physical address to the memory subsystem.


2.6.2 Page Address Translation

Page address translation can be broken down as follows:

• A 64-bit effective address is converted into an 80-bit (or 64-bit) virtual address.

• This virtual address is then used to locate the PTE in the hashed page table in memory.

• The physical page number is then extracted from the PTE and used in the formation of the physical address of the access.

Here I will look at the 64-bit and the 32-bit versions in turn.

64-bit Page Address Translation Overview

Figure 2.18 shows an overview of the translation of an effective address to a physicaladdress for 64-bit implementations as follows:

• Bits 0–35 of the effective address comprise the effective segment ID, used to select a segment descriptor from which the virtual segment ID (VSID) is extracted.

• Bits 36–51 of the effective address correspond to the page number within the segment; these are concatenated with the VSID from the segment descriptor to form the virtual page number (VPN). The VPN is used to search for the PTE in either an on-chip TLB or the page table. The PTE then provides the physical page number (RPN). Note that bits 36–40 form the abbreviated page index (API), which is used to compare with page table entries during hashing.

• Bits 52–63 of the effective address are the byte offset within the page; these are concatenated with the RPN field of a PTE to form the physical address used to access memory.

The translation of effective addresses to physical addresses for 32-bit implementations is shown in figure 2.19 and differs from the 64-bit implementation in that 32-bit implementations index into an array of 16 on-chip segment registers, instead of segment tables in memory, to locate the segment descriptor; the address ranges are obviously different as well.

Thus, the address translation is as follows:

• Bits 0–3 of the effective address comprise the segment register number used to select a segment descriptor, from which the virtual segment ID (VSID) is extracted.

• Bits 4–19 of the effective address correspond to the page number within the segment; these are concatenated with the VSID from the segment descriptor to form the virtual page number (VPN). The VPN is used to search for the PTE in either an on-chip TLB or the page table. The PTE then provides the physical page number (RPN).

• Bits 20–31 of the effective address are the byte offset within the page; these are concatenated with the RPN field of a PTE to form the physical address used to access memory.
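The bit slicing in both variants can be written out directly. The sketch below uses the bit ranges listed above, with PowerPC's MSB-first bit numbering rewritten as ordinary right shifts; the helper names are mine, not architectural identifiers.

```c
#include <stdint.h>

/* 64-bit EA: ESID = bits 0-35, page index = bits 36-51, offset = 52-63
 * (PowerPC numbers bits from the most significant end). */
static uint64_t esid64(uint64_t ea)  { return ea >> 28; }
static uint64_t page64(uint64_t ea)  { return (ea >> 12) & 0xFFFF; }
static uint64_t off64(uint64_t ea)   { return ea & 0xFFF; }

/* 32-bit EA: segment register = bits 0-3, page index = bits 4-19,
 * offset = bits 20-31. */
static uint32_t sreg32(uint32_t ea)  { return ea >> 28; }
static uint32_t page32(uint32_t ea)  { return (ea >> 12) & 0xFFFF; }
static uint32_t off32(uint32_t ea)   { return ea & 0xFFF; }
```

Note that the page index and byte offset come out of the same positions in both models; only the segment-selection field at the top differs (4 bits selecting a segment register versus 36 bits selecting a segment table entry).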


Figure 2.18: Page Address Translation Overview - PowerPC 64 Bit

Figure 2.19: Page Address Translation Overview - PowerPC 32 Bit


The PowerPC also offers an alternative translation from logical to physical that bypasses the TLB/hash-table paging mechanism. When a logical address is referenced, the processor begins the page lookup and, in parallel, begins an operation called block address translation (BAT). Block address translation depends on eight BAT registers: four data and four instruction. The BAT registers associate virtual blocks of 128 KB or more with physical segments. Block address translation takes precedence over segmented address translation; i.e., if a mapping for a storage location is present in both a BAT entry and a page table entry, the block address translation will be used. [PPC96]

2.6.3 i386 Processor

The 80386 transforms logical addresses into physical addresses in two steps:

• Segment translation, in which a logical address (consisting of a segment selector and segment offset) is converted to a linear address.

• Page translation, in which a linear address is converted to a physical address by the paging unit. This step is optional, at the discretion of systems-software designers.

Paging on the 386

There are two levels of indirection in address translation by the paging unit. A page directory contains pointers to 1024 page tables. Each page table contains pointers to 1024 pages. A register containing the physical base address of the page directory is loaded on each task switch. A 32-bit linear address is divided up as shown in figure 2.20:

Figure 2.20: 386 Linear Address

The physical address is then computed (in hardware) as shown in figure 2.21:

Figure 2.21: 386 Physical Address Computation
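The hardware computation in figure 2.21 can be mimicked in a few lines: the top 10 bits of the linear address index the page directory, the next 10 bits index a page table, and the low 12 bits are the byte offset. The toy tables below are stand-ins for the real in-memory structures.

```c
#include <stdint.h>

#define DIR_IDX(la)  (((la) >> 22) & 0x3FFu)   /* bits 31-22 */
#define TBL_IDX(la)  (((la) >> 12) & 0x3FFu)   /* bits 21-12 */
#define PG_OFF(la)   ((la) & 0xFFFu)           /* bits 11-0  */

/* Resolve a linear address: page_dir[i] points to a page table whose
 * entries hold 4 KB-aligned physical frame bases. */
static uint32_t resolve(uint32_t *page_dir[], uint32_t la)
{
    uint32_t *page_table = page_dir[DIR_IDX(la)];

    return page_table[TBL_IDX(la)] + PG_OFF(la);
}
```

Each 10-bit index covers 1024 entries, so one directory entry spans 4 MB of linear address space and one table entry spans one 4 KB page.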


Page tables are aligned, so the lower 12 bits are used to store useful information about the page table (or page) pointed to by the entry. Figure 2.22 shows the format for page directory and page table entries.

Figure 2.22: 386 Page Table Entry

• D - 1 means page is dirty (undefined for page directory entry).

• R/W - 0 means read only for user.

• U/S - 1 means user page.

• P - 1 means page is present in memory.

• A - 1 means page has been accessed (set to 0 by aging).

• OS - bits can be used for LRU etc, and are defined by the OS.
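Decoding such an entry in software looks like this. The bit positions (P at bit 0, R/W at bit 1, U/S at bit 2, A at bit 5, D at bit 6, OS-available bits 9-11) follow the 386 layout; the helper names are illustrative, not the kernel's.

```c
#include <stdint.h>

#define PTE_P   (1u << 0)   /* present */
#define PTE_RW  (1u << 1)   /* writable by user */
#define PTE_US  (1u << 2)   /* user page */
#define PTE_A   (1u << 5)   /* accessed */
#define PTE_D   (1u << 6)   /* dirty */
#define PTE_OS  (7u << 9)   /* OS-defined bits, e.g. for LRU */

static int pte_is_present(uint32_t pte)  { return !!(pte & PTE_P); }
static int pte_is_dirty(uint32_t pte)    { return !!(pte & PTE_D); }

/* Pages are 4 KB-aligned, so the top 20 bits hold the physical base of
 * the page (or page table) the entry points to. */
static uint32_t pte_frame_base(uint32_t pte) { return pte & 0xFFFFF000u; }
```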

At each stage of the address translation, access permissions are verified; pages not present in memory and protection violations result in page faults. The fault handler then either brings in a new page or does whatever else needs to be done.

Segments in the 80386

Segment registers are used in address translation to generate a linear address from a logical (virtual) address: linear address = segment base + logical address. The linear address is then translated into a physical address by the paging hardware (see above). Each segment in the system is described by an 8-byte segment descriptor which contains all necessary information (base, limit, type, privilege).

The segments are:

Regular segments: code and data segments

System segments: task state segments and local descriptor tables

System segments are task specific. There is a Task State Segment associated with each task in the system. It contains all the information necessary to restart the task. The Local Descriptor Tables contain regular segment descriptors that are private to a task. In Linux there is one LDT per task.

To keep track of all these segments, the 386 uses a Global Descriptor Table that is set up in memory by the system. The GDT contains a segment descriptor for each task state segment, each local descriptor table and also for regular segments. The Linux GDT contains just two normal segment entries: the kernel code segment descriptor and the kernel data/stack segment descriptor. The rest of the GDT is filled with TSS and LDT system descriptors.


The kernel segments have base 0xc0000000, which is where the kernel lives in the linear view. Before a segment can be used, the contents of the descriptor for that segment must be loaded into the segment register. The 386 has a complex set of criteria regarding access to segments, so you can't simply load a descriptor into a segment register. The programmer loads one of these registers with a 16-bit value called a selector. The selector uniquely identifies a segment descriptor in one of the tables. Access is validated and the corresponding descriptor loaded by the hardware.

2.7 Chip architecture - from 32 to 64 bit on Intel and AMD Chips

There are two main 64-bit chip architectures available on the market today, one from Intel (Itanium) and another from AMD (Opteron). Of course, there are others like SPARC from Sun, but I will look at just these two in this paper. The Itanium chip uses a new instruction set called EPIC and is not directly backwards compatible with the x86 instruction set (although its cousin, Intel's Xeon EM64T, is), whereas the AMD chip is backwards compatible (the author is not aware whether AMD's Athlon chip is).

2.7.1 Intel

The Itanium can be viewed as the descendant of the x386. So how does it differ from its predecessor? Well, first of all, the Itanium can address the full 18 exabytes available directly. The address space in the 64-bit chip is divided into 8 equally sized sections called regions. Each region is tagged with a unique region identifier (RID), so the TLB can hold translations from many different address spaces concurrently and need not be flushed on address space switches. These regions provide the basic virtual memory architecture to support multiple address space operating systems. [INT02] In addition, every translation in the TLB contains a protection key that is matched against a set of protection key registers. These registers provide the basic virtual memory architecture to support single address space operating systems. [INT02] There are no Single Address Space Operating System (SASOS) or Multiple Address Space Operating System (MASOS) modes in the Itanium architecture. The processor behaves the same regardless of the address space model used by the OS. The difference between a SASOS and a MASOS is one of policy for the chip; specifically, how the RIDs and protection keys are managed by the OS and whether different processes are allowed to share the same RID for their code and data. Multiple, unrelated processes in a SASOS may share an RID for their private pages, and it is up to the OS to use protection keys and the protection key registers to enforce protection. Figure 2.23 demonstrates address translation in the MMU of the Itanium chip. 21
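The key-register check described above can be sketched roughly as follows. The register count, the structure layout and the two-right model are deliberate simplifications of the Itanium mechanism, not its actual register format; only the match-the-TLB-key-against-the-key-registers idea is taken from the text.

```c
#include <stdint.h>
#include <stddef.h>

#define NR_PKR 4    /* illustrative; real implementations have more */

struct pkr {
    uint32_t key;
    int can_read;
    int can_write;
};

/* An access succeeds only if some protection key register holds a key
 * matching the one in the TLB entry, with the required right set. */
static int access_ok(const struct pkr regs[NR_PKR], uint32_t tlb_key,
                     int is_write)
{
    size_t i;

    for (i = 0; i < NR_PKR; i++) {
        if (regs[i].key != tlb_key)
            continue;
        return is_write ? regs[i].can_write : regs[i].can_read;
    }
    return 0;   /* key miss: the OS must load a key register or fault */
}
```

This is what lets a SASOS share one RID among unrelated processes: the translations stay in the TLB, and per-process protection is enforced purely by which keys the OS loads into the key registers on a context switch.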

2.7.2 AMD

The typical x86 processor prior to the Opteron accessed memory via a memory controller located on a separate chip called the North Bridge, which also served as the connection to the Level 2 cache, the AGP slot, and some of the PCI devices. So, a lot of traffic was traveling over the connection between the processor and the North Bridge, a connection called the Front Side Bus. That connection became a bottleneck,

21 graphic taken from [CHA02]


Figure 2.23: Itanium® address translation

slowing down access to memory. In multi-processor systems, the processors shared theFront Side Bus, so the bottleneck became even more severe.

The memory controller is built into the AMD 64-bit processors, so data in memory does not have to travel to the processor via the North Bridge across the Front Side Bus. This enables the processor to address memory directly.

2.8 Personal Spaces vs Single Spaces

With the private or personal address spaces idea originally supported by Unix [LIN95], each process has its own address space containing (possibly shared) code and private data. A major advantage of this approach is that it provides automatic hardware-supported protection between processes. In a single address space operating system, all processes run within a single, global virtual address space. So instantly, taken from the point of view of the modern multiple space model, protection becomes an issue. How are we to stop malicious processes from accessing other areas of memory uninvited? Protection would potentially be provided not through conventional address space boundaries but through some other mechanism that dictates which pages of the global address space a process can reference.

Up to now we've seen that microkernels are a set of independent services that communicate with each other using microkernel primitives. The microkernel is executed in supervisor mode, while the rest of the tasks (servers) that are part of the OS are executed in their own address spaces, protected from each other. [CRE04] Monolithic kernels are compiled as a single big program, where all the tasks (threads, servers) are


executed in the same memory space and at the highest processor privilege level. This kind of operating system is easier to implement, and the kernel is faster because communication inside the kernel has less overhead (it can be done with simple function calls).

Memory protection can only be implemented if the microprocessor (or the memory management unit) provides some specific facilities to do it. It is not possible to implement memory protection without the help of the hardware. Memory protection mechanisms have been implemented jointly with the support of virtual memory. Protection information is stored in the same data structures used to translate memory addresses. Each address space has a single protection domain, shared by all threads that run within the process. A thread can only have a different protection domain if it runs in a different address space. Sharing is only possible where a single physical memory page can be mapped into two or more virtual address spaces. This has significant disadvantages when used for protected sharing. The interpretation of pointers depends on the addressing context, and any transfer of control between protected modules requires an expensive context switch [WIT02]. Sharing would be simplified if addresses could be transmitted between processes and used to access the shared data. [LEV84]

2.8.1 The Single Address Space Approach

The major advantages of private address spaces, namely

1. they increase the amount of address space available to all programmes

2. they provide hard memory protection boundaries

3. they permit easy cleanup when a programme exits

are at the same time sources of obstacles to efficient cooperation between application modules. The use of separate process address spaces, combined with multitasking, has led to a sharing paradigm based on a number of applications executing in separate processes and communicating via pipes, messages or other structures. [LIN95] In particular, pointers have no meaning beyond the boundary or lifetime of the process that created them; therefore pointer-based information is not easily shared, stored or transmitted. The primary cooperation mechanisms rely on copying data between private virtual memories, typically converting it to and from a neutral intermediate representation. [CHA94]

So, given that swapping between kernel space and user space, as Mach does, is expensive, and that in-kernel communication, as in Linux, is potentially a security risk, what structures do programmers have at their disposal when it comes to protecting shared memory? Chase [CHA94] suggests that programmers essentially have two choices: put application components in separate, independent processes that exchange data through pipes, files or messages and other information that maps easily to a byte stream (figure 2.24a), thereby sacrificing performance, or place all components in one process, sacrificing protection (figure 2.24b).

Even in operating systems like Opal [CHA94], which allow programmes to cooperate via shared, pointer-rich data structures (figure 2.24c) using overlapping protection domains, passing memory access rights on-the-fly from domain to domain, it is still the system rather than the applications coordinating address translations. And, more to the point, access rights are allocated statically.


Figure 2.24: Three choices for structuring cooperation between application compo-nents

2.8.2 Capabilities

One suggestion to combat these limitations has been an addressing mechanism called "capability-based addressing". Capability systems support an object-based approach to computing. Conceptually, a capability is a key that gives the processor permission to access an entity or object [LEV84]. A capability is implemented as a data structure that contains two items of information: a unique object identifier and access rights, shown in figure 2.25

Figure 2.25: A Capability

The identifier names an object in the computer system. An object can be a file, an array, a message port etc. The access rights define the operations that can be performed on that object, for example read-only access to a memory segment or send and receive access to a message port. Each process has access to a list of capabilities which identify all the objects that process is permitted to access. To specify an object, the user provides the index of a capability in the list; thus PUT(file capability "My Record") identifies "My Record" as the file, and the file capability checks to see whether PUT can


be performed on it. A programme cannot access an object unless its capability list contains a suitably privileged capability for the object [LEV84]. Hence, to maintain system integrity, a system must prevent programmes from directly modifying bits in a capability. This is usually done by allowing only the operating system to modify the c-list. Processes obtain new capabilities by executing operating system operations, whereby the operating system stores a fresh capability for an object in the process' capability list. A capability system also provides other capability operations to:

• Move capabilities to different places in the c-list

• Delete a capability

• Restrict the rights in a capability, producing a less privileged version

• Pass a capability as a parameter to a process

• Transmit a capability to another user in the system
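The operations above can be illustrated with a minimal capability type. The rights bits and helper names below are invented for illustration; the two-field structure (object identifier plus access rights) follows figure 2.25.

```c
#include <stdint.h>

#define CAP_READ   (1u << 0)
#define CAP_WRITE  (1u << 1)
#define CAP_SEND   (1u << 2)

struct capability {
    uint32_t object_id;   /* names the object (file, segment, port ...) */
    uint32_t rights;      /* operations permitted on that object */
};

/* Derive a less privileged copy: rights can only be removed, never added. */
static struct capability restrict_cap(struct capability c, uint32_t keep)
{
    c.rights &= keep;
    return c;
}

/* The check the system makes before performing an operation. */
static int has_right(struct capability c, uint32_t right)
{
    return (c.rights & right) == right;
}
```

A process wishing to grant a partner read-only access to a shared object would call restrict_cap() on its own capability and transmit the weakened copy, which is exactly the "restrict" and "transmit" operations from the list.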

So a programme controls the movement of capabilities and can share capabilities, and therefore objects, with other programmes. It is possible to have a tree of c-lists, the first one containing capabilities for secondary lists and so on.

For the jump into 64-bit, single address space computing, where we have processes placed sparsely, memory sharing could be sped up considerably by a system whereby a process dictates dynamically which rights another process has to it, rather than tying a section of memory to a particular protection domain.

Consider the conventional multiprogramming systems we have examined in this paper: Darwin and Linux. A programme executes within a single process and is divided into a collection of pages. When a programme is run, the operating system creates a page table available to the programme. The table is a list of descriptors that contain physical information about each page, including access rights governing who can access the page. If two processes wish to share a page, the page descriptors must be in the same location in both page tables. Any dynamic sharing of pages requires operating system intervention to load pages.

In a capability system each process has a c-list that defines which pages it can access. Instead of the page table descriptors available to conventional system hardware, the capability addressing system has a set of capability registers. [LEV84] The programme can execute hardware instructions to transfer capabilities between the capability list and the capability hardware registers. So the registers define a subset of the potentially accessible pages that can be physically addressed by the hardware. The capability system provides a dynamically changing address space which changes whenever the programme changes one of the capability registers. Capabilities are context independent, i.e. the page addressed by a capability is independent of the process using the capability. A process can share a page with another by duplicating or sending a capability from its list to the capability list of a partner.

Protection

A phenomenon that can be observed in the physical, biological and social sciencesis that an organising principle common to many complex systems is the creation ofa structured hierarchy by decomposing a larger system into many subsystems which


interact minimally with each other and which themselves are iteratively decompos-able. [SIM69] However, when large software systems are broken down, it is importantto limit the interaction of processes in a way that is not dependant on all the otherprocesses functioning properly otherwise the whole system crashes. Capabilities stopmalfunctions from spreading beyond the subsystem where they occurred [LIN76]. Theso-called ”principle of least privilege” means that, essentially, the less damage you cando. the better i.e. if you can’t access it, you can’t break it and small protection domainsaccomplish this. Of course, a procedure may malfunction and the system my not knowbut providing all the others behave in predictable ways the error is easier to find. Andcapabilities provide this small protection domain

In the present-day approach to private spaces, a programme executes in a process' address space, and every procedure accessed by that programme has access to that process' address space, including pages and files. Each procedure executes within an identical protection environment. With this protection model, there is no easy way to limit the access rights of specific subprogrammes executed on behalf of a process. While access rights can be increased or decreased, any such change is relatively permanent [LIN76].

In a capability system a procedure can only affect objects for which capability registers have been loaded. So different procedures called by one programme can have access to different pages. A procedure can thus protect its objects from malicious access by the processes that call it, just as a programme can protect its objects from access by called procedures [LEV84]. In effect, each procedure has its own address space, its own protection domain. Permission is given by passing a capability for an object as a parameter when a procedure is called. So every procedure is protected by a private capability list. Hence we no longer have processes executing within their own protection domains; instead, a single process is allowed to execute in many different, small protection domains.
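Passing a capability as a parameter can be sketched like this; the procedure's protection domain is exactly the set of capabilities handed to it (all names below are hypothetical):

```python
# Sketch: a called procedure can reach only the objects behind the
# capabilities the caller chose to delegate, with only those rights.

class Cap:
    def __init__(self, obj, rights):
        self.obj = obj
        self.rights = frozenset(rights)

def cap_read(cap):
    if "read" not in cap.rights:
        raise PermissionError("no read capability")
    return cap.obj["data"]

def untrusted_procedure(cap):
    # This procedure's entire "address space" is the one capability
    # it received as a parameter; nothing else is reachable.
    return cap_read(cap)

secret = {"data": 42}
ro_cap = Cap(secret, {"read"})       # caller delegates read-only access
assert untrusted_procedure(ro_cap) == 42
```

Revoking a right is as simple as constructing a capability without it: a `Cap(secret, set())` passed to the same procedure would raise `PermissionError` rather than expose the object.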

Linden [LIN76] maintains that capabilities can also be used to address any and all objects in the system; not only memory and I/O devices but also software-created virtual objects. What he suggests is that instead of making the system software support primitive object types, the kernel code could be modified (and reduced) by providing a mechanism for creating new object types. The kernel itself is freed of the necessity to support different object types and becomes an independent entity. It could be considered an application in memory. Application programmers could, he says, define protection mechanisms tailored to their applications by creating new object types. He, and others 22, call objects of a type not supported by the system extended type objects. He adds that extended type objects are "a convenient way of implementing certain security policies which can be enforced without depending on the discretion of other users of the system" ([LIN76], p 439), by creating a new extended type with the operations of that type programmed to enforce whatever access controls are to be maintained, e.g. a check on the user of an object can be programmed into the operations for it. Hence users can be given very limited access rights that allow them to perform only preprogrammed operations on objects.
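Linden's idea of an extended type whose operations enforce the access policy can be sketched as a small user-defined type; the policy below (a per-user check on one operation) is an invented example, not one from [LIN76]:

```python
# Sketch of an "extended type": the operations are the only way to
# touch the object, so the access policy travels with the type itself.

class AuditedCounter:
    """A user-defined object type whose operations enforce a policy:
    only listed users may increment, while anyone may read."""

    def __init__(self, allowed_users):
        self._value = 0
        self._allowed = set(allowed_users)

    def increment(self, user):
        # The user check is preprogrammed into the operation,
        # not left to the discretion of the object's users.
        if user not in self._allowed:
            raise PermissionError(f"{user} may not increment")
        self._value += 1

    def read(self):
        return self._value

c = AuditedCounter(allowed_users={"alice"})
c.increment("alice")
assert c.read() == 1
```

The kernel need know nothing about `AuditedCounter`; the type's author defined both the object and the security policy it enforces.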

It is the opinion of the author that special registers for capabilities would not be needed if there were some way of storing permission and segmentation information within a pointer object. This guarded pointer [CAR94] could reside in a general purpose register or in memory. And because memory can be accessed directly using a pointer, higher performance may be achieved than with traditional implementations of capabilities, since table lookups to translate capabilities to virtual addresses are not required. Figure 2.26 23 demonstrates the structure of a guarded pointer.

22 Linden names Gray, Jones, Wulf and Ferrie. Their literature was not used, however, in the research for this paper and hence they cannot be found in the bibliography. For more information the reader is referred to [LIN76], p 434

Figure 2.26: A Guarded Pointer

A guarded pointer identifies a byte in the virtual address space, the segment containing that byte, and the set of operations permitted on the segment. The permission field determines what operations may be performed using the pointer, and the segment length field separates the address into a fixed segment field and a variable offset field by specifying the base-2 logarithm of the length of the segment containing the address. Hence a guarded pointer specifies an address, the operations that can be performed using that address, and the segment containing the address. No capability tables are required. All processes can share the same address space safely, and the need for costly context switching is eliminated 24.
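The encoding can be made concrete with a few lines of bit manipulation. The field widths below are illustrative, chosen only to add up to a word; they are not Carter's exact layout from [CAR94]:

```python
# A minimal guarded-pointer encoding, loosely following [CAR94]:
# | permission (4 bits) | seg_len_log2 (6 bits) | address (54 bits) |

ADDR_BITS = 54
LEN_BITS = 6

def make_guarded(perm, seg_len_log2, addr):
    """Pack permission, log2(segment length) and address into one word."""
    return (perm << (LEN_BITS + ADDR_BITS)) | (seg_len_log2 << ADDR_BITS) | addr

def fields(ptr):
    """Unpack the three fields again."""
    addr = ptr & ((1 << ADDR_BITS) - 1)
    seg_len_log2 = (ptr >> ADDR_BITS) & ((1 << LEN_BITS) - 1)
    perm = ptr >> (LEN_BITS + ADDR_BITS)
    return perm, seg_len_log2, addr

def segment_base_and_offset(ptr):
    """The segment length field splits the address into a fixed
    segment field and a variable offset field."""
    _, log2len, addr = fields(ptr)
    mask = (1 << log2len) - 1
    return addr & ~mask, addr & mask  # (segment base, offset)

p = make_guarded(perm=0b0011, seg_len_log2=12, addr=0x1234ABC)
base, off = segment_base_and_offset(p)
assert (base, off) == (0x1234000, 0xABC)
```

No table lookup is involved: the segment bounds and permissions fall directly out of the pointer's bits, which is precisely why guarded pointers promise higher performance than table-based capabilities.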

23 Taken from [CAR94]

24 Carter talks about enter pointers allowing entry into a protected subsystem only at specific places. The details of Carter's suggestions are irrelevant for this paper. Interested readers should see [CAR94], p. 4


Chapter 3

The Future

3.1 Introduction

In this, the final chapter of my paper, I suggest that some form of object-oriented memory management along with dynamic allocation of linkage at runtime could optimise data sharing.

3.2 Conclusion and Outlook

Virtual memory is an important feature of operating systems. For applications currently sharing a single machine, virtual memory provides them with their own protected address space, which enhances security and reliability. Virtual memory also provides for application-transparent relocation of programs within physical memory and other forms of data storage such as disk storage. The 64-bit environment is particularly challenging for implementing efficient virtual memory.

64-bit computing essentially breaks the 4GB barrier. In 32-bit systems, any memory beyond the 4GB space must be accessed through Physical Address Extensions, a paging scheme which is not as efficient as a flat memory space. Moreover, the page table itself must be located within the first 4GB of address space, which places a cap on its scalability. For example, 32-bit Linux and most versions of Windows can only use physical address extensions to offer 16GB of address space; beyond that, the page table grows too large to accommodate. It is a huge turning point in memory management because it gives us the margin to experiment with ingrained concepts of "the computer as one entity" [UHL05] and to envision sequential object calling as taking over where the concept of "process" left off. Capabilities, together with the advent of 64-bit machines, provide an efficient means of security in single address spaces for a machine that needs to change protection domains frequently. A capability-based system allows fast multithreading among threads from different protection domains and allows for dynamic typing and linkage at runtime. In this paper, I have tried to suggest that by abandoning the basic 32-bit operating system building block of the private address space, we must also rethink our idea of disjoint processes. Protection can be achieved by inheritance rather than by segmentation. Memory objects are the new segregators.


3.3 Future Work

Communication in the single address space is eased because there are no more protected, disjoint address spaces, and the need for context switching between kernel and user spaces is reduced. The author believes that some mixture of Linden's extended types and Carter's guarded pointers would increase efficiency if we consider the concept of a process to have been replaced by small, capability-enforced private protection domains in the 64-bit model. So if capabilities are to be the objects in a system, some form of scheme would be necessary to keep track of interdependencies. It may be possible to "roll back" a sequence of object interactions if an error has occurred and to continue on without the programme crashing. Just how this could occur could be the object of any future work.

Another area of future work in the 64-bit arena that I have not examined in any great depth in this thesis would be page table model selection, particularly in the microkernel. The page table not only affects TLB-refill performance and kernel memory consumption, but also has a significant effect on task creation and destruction overheads.


Bibliography

[AMD04] “AMD Opteron™ Processor Datasheet”, Advanced Micro Devices Inc., Publication No. 23932, February 2004. Available from www.amd.com

[APP00] The Apple Developer Connection Documentation. Available from http://developer.apple.com/

[ARD66] “Program and Addressing Structure in a Time-Sharing Environment”, B. W. Arden, B. A. Galler, T. C. O’Brien, F. H. Westervelt, Computer Center, The University of Michigan, Ann Arbor, Michigan, Journal of the ACM (JACM), Volume 13, Issue 1, January 1966

[BAR90] “MACH Kernel Interface Manual”, Robert V. Baron et al., Carnegie-Mellon University, 1990. Available online from: http://www-2.cs.cmu.edu/afs/cs/project/mach/public/www/doc/publications.html

[BOW98] “Conceptual Architecture of the Linux Kernel”, Ivan Bowman, Jan 1998. Available at: http://plg.uwaterloo.ca/˜itbowman/CS746G/a1/

[BOW98+] “Concrete Architecture of the Linux Kernel”, Ivan Bowman, Saheem Siddiqi and Meyer C. Tanuan, Department of Computer Science, University of Waterloo, Canada, 1998. Available from http://plg.uwaterloo.ca/˜itbowman/CS746G/a2/

[BRI01] “External Memory Management”, Chapter 6 of The GNU Mach Reference Manual for Version 1.2 of the GNU Mach microkernel, Markus Brinkmann, 2001. Available online from: http://www.mirror5.com/software/hurd/gnumach-doc/mach 6.html

[CAR94] “Hardware Support for Fast Capability-based Addressing”, Nicholas P. Carter, Stephen W. Keckler, William J. Dally, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VI), 1994. Available from http://www.cs.utexas.edu/users/skeckler/pubs/asplos94.pdf

[CHA88] “801 Storage: Architecture and Programming”, A. Chang and M. Mergen, ACM Transactions on Computer Systems, Vol. 6, No. 1, February 1988, pp 32-33

[CHA94] “Sharing and Protection in a Single Address Space Operating System”, Jeffrey S. Chase, Henry M. Levy, Michael J. Feeley, Edward D. Lazowska, 1994. Available from http://citeseer.ist.psu.edu/chase94sharing.html


[CHA02] “Exploring the Itanium MMU”, Matt Chapman, University of New South Wales, 2002. Available online at: http://www.gelato.unsw.edu.au/2002workshop/gelato.pdf

[CHI00] “CSCI555 Reading Report 4: Kernel Architecture Comparison”, Yuan-Chun Chiu. Available from http://www-scf.usc.edu/˜yuanchuc/cs555 reading4.doc

[COS86] “Chapter 5: Memory Management”, Lance Costanzo, Intel 80386 Programmer’s Reference Manual, 1986. From http://www.logix.cz/michal/doc/i386/chp05-00.htm

[CRA99] “The UVM virtual memory system”, Charles D. Cranor and Gurudatta M. Parulkar, in Proceedings of the 1999 USENIX Annual Technical Conference (USENIX-99), pages 117-130, Berkeley, CA, 1999. USENIX Association.

[CRE04] “White Paper: Memory Protection in a Threaded System”, Alfons Crespo, 2004, ©2003 by OCERA Consortium. Available online from: http://www.ocera.org/archive/upvlc/public/reports/memory-protection/memory-protection.pdf

[DEN66] “Programming Semantics for Multiprogrammed Computations”, J. B. Dennis and E. C. van Horn, Communications of the ACM, Vol. 9, 1966

[DEN72] “Properties of the Working-Set Model”, P. J. Denning and S. C. Schwartz, Communications of the ACM, Vol. 15, No. 3, pp 191-198, 1972

[DEN96] “Before Memory was Virtual”, Peter J. Denning, George Mason University,October 1996. Available from http://cne.gmu.edu/pjd/PUBS/bvm.pdf

[DIL] “Design Elements of the FreeBSD VM System”, Matthew Dillon, Daemon News, 2003. Available from: http://www.daemonnews.org/200001/freebsd vm.html

[ELP99] “Virtual Memory Management In A 64-Bit Microkernel”, Kevin John Elphinstone, University of New South Wales, Computing Department

[GAR94] “An Introduction to Software Architecture”, David Garlan and Mary Shaw, School of Computer Science, Carnegie Mellon University, January 1994. Available at: http://www-2.cs.cmu.edu/afs/cs/project/able/www/paper abstracts/intro softarch.html

[GOL00] “Moving the default Memory Manager out of the Mach Kernel”, David B.Golub and Richard P. Draves, Carnegie Mellon University. Available online from:http://citeseer.ist.psu.edu/golub91moving.html

[GOR04] “Understanding the Linux Virtual Memory Manager”, Mel Gorman, Bruce Perens’ Open Source Series, Prentice Hall Professional Technical Reference, 2004, ISBN 0-13-145348-3

[HAM00] “Page Replacement Algorithms”, Professor Howard J. Hamilton, Lecture Notes No. 7: Memory Management, Department of Computer Science, University of Regina. Available online from: http://www2.cs.uregina.ca/˜hamilton/courses/330/notes/memory/page replacement.html


[HAR98] “Terminal Server Architecture”, Chapter 3 from Windows NT Terminal Server and Citrix MetaFrame by Ted Harwood, Pearson Education, 1998, ISBN 1562059440. Available online at: http://www.microsoft.com/technet/archive/winntas/maintain/termserv.mspx

[HOH96] “Großer Beleg: Steps Towards Porting a Unix Single Server to the L3 Microkernel”, Chapter 2: State of the Art, Michael Hohmuth, Sven Rudolph, TU Dresden, Dept. of Computer Science, OS Group, 1996. Available online from: http://os.inf.tu-dresden.de/˜hohmuth/prj/lites-on-l3/beleg/

[HUC93] “Architectural Support for Translation Table Management in Large Address Space Machines”, Jerry Huck and Jim Hays, Proc. of the 20th Annual International Symposium on Computer Architecture, pp. 39-50, May 1993

[IBM03] “PowerPC Microprocessor Family: Programming Environments Manual for 64 and 32-Bit Microprocessors, Version 2.0”, IBM: International Business Machines Corporation, 2003

[INT02] “Intel® Itanium® Architecture Software Developer’s Manual, Volume 2: System Architecture, Revision 2.1”, Document No. 245318-004, Oct 2002. Available from www.intel.com

[KER00] http://www.kernelthread.com/mac/osx/arch xnu.html

[KER01] “The Linux Kernel Archives” from http://www.kernel.org/

[KNA01] “Linux Memory Management”, Joseph Knapka, Andrea Russo, Alan Cudmore, Rik van Riel, 2001. Available from: http://home.earthlink.net/˜jknapka/linux-mm/vmoutline.html

[KNO01] “Introduction to Unix”, Dr William J. Knottenbelt, Imperial College London, 2001. Available from http://www.doc.ic.ac.uk/˜wjk/UnixIntro/Lecture1.html

[KNU73] “The Art of Computer Programming, Volume 1: Fundamental Algorithms”,D.E. Knuth, Addison-Wesley, 1973

[KOL92] “Architectural Support for Single Address Space Operating Systems”, Eric J. Koldinger, Jeffrey S. Chase, Susan J. Eggers, Department of Computer Science and Engineering, University of Washington, Technical Report 92-03-10, July 1992. Available from: http://citeseer.ist.psu.edu/chase92architectural.html

[KOZ01] “Cache Mapping and Associativity” by Charles M. Kozierok available fromhttp://www.pcguide.com/ref/mbsys/cache/funcMapping-c.html

[KUM02] “MMDOC: Linux Memory Management Documentation”, S. Mohan Kumar, 2002. Available from http://mmdoc.sourceforge.net

[LAY05] “Cache Mapping and Associativity”, Lay Networks, 2000 - 2005. Availablefrom http://www.laynetworks.com/Cache%20Mapping%20and%20Associativity.htm

[LEM00] “2-Level Mapping”, Bill Lemley, Tuncay Basar and Kyung Kim, The Core of Information Technology, George Mason University Hyper Learning Center. Available from: http://cne.gmu.edu/itcore/virtualmemory/vmideas.html


[LEV84] “Capability Based Computer Systems”, Henry M. Levy, Digital EquipmentCorporation, 1984, ISBN 0-932376-22-3

[LIN76] “Operating System Structures to Support Security and Reliable Software”, Theodore A. Linden, Institute for Computer Sciences and Technology, National Bureau of Standards, Computing Surveys, Vol. 8, No. 4, December 1976

[LIE94] “Page Table Structures for Fine-Grain Virtual Memory”, Jochen Liedtke, German National Research Centre for Computer Science (GMD), Technical Report No. 872, Oct 1994

[LIN95] “Grand Unified Theory of Address Spaces”, Anders Lindstrom and John Rosenberg, Department of Computer Science, University of Sydney, Australia, and Alan Dearle, Department of Computer Science, University of Stirling, 1995. Available from http://www.dcs.st-and.ac.uk/research/publications/download/LRD95.pdf

[LOE92] “Mach 3 Kernel Principles”, Keith Loepere, Open Software Foundation and Carnegie Mellon University, OSF Publishing, 1992. Available online from: http://www-2.cs.cmu.edu/afs/cs/project/mach/public/www/doc/osf.html

[MOH04] “The Linux Knowledge Base and Tutorial”, James Mohr, 2005. Availablefrom: http://sourceforge.net/projects/linkbat

[PEN95] “The PowerPC Architecture™: 64-bit Power with 32-Bit Compatibility”, C. Ray Peng, Thomas A. Peterson and Ron Clark, Proceedings of the 40th IEEE Computer Society International Conference, 1995. Available from: http://portal.acm.org/citation.cfm?id=527213.793553

[PPC96] “PowerPC Processor Binding to IEEE 1275-1994 Standard for Boot (Initialization, Configuration) Firmware”, Revision 2.1 (Approved Version), November 6, 1996. Available online from: http://playground.sun.com/1275/bindings/ppc/release/ppc-2 1.html#HDR6

[RAS87] “Machine-Independent Virtual Memory Management for Paged Uniprocessor and Multiprocessor Architectures”, Richard Rashid, Avadis Tevanian, Michael Young, David Golub et al., Dept. of Computer Science, Carnegie Mellon University, 2nd Symposium on Architectural Support for Programming Languages and Operating Systems, ACM, October 1987. Available from: www.cs.berkeley.edu/˜brewer/cs262/mach-vm.ps

[RIE98] “Linux-MM Documentation”, Rik van Riel, July 1998. Available from: http://linux-mm.org/docs/

[RIE01] “Page replacement in Linux 2.4 memory management”, Rik van Riel, 2001.Available from: http://www.surriel.com/lectures/linux24-vm.html

[RUS99] “The Linux Kernel”, Chapter 3: Memory Management, David A. Rusling,1996-1999. Available from http://www.tldp.org/LDP/tlk/tlk-title.html

[SCH74] “The Protection of Information in Computer Systems”, Jerome H. Saltzer, Senior Member, IEEE, and Michael D. Schroeder, Member, IEEE, Fourth ACM Symposium on Operating System Principles, October 1973. Revised version in Communications of the ACM 17, 7 (July 1974). Available from University of Virginia, Department of Computer Science: http://www.cs.virginia.edu/˜evans/cs551/saltzer/

[SCO97] “Meet Mach”, James Scott, 1997. Available from http://www.stepwise.com/articles/technical/meetmach.html

[SIL88] “Operating System Concepts”, Chapter 7, Abraham Silberschatz and James L. Peterson, Addison-Wesley Publishing, 1998, ISBN 0-201-18760-4

[SIM69] “The Sciences of the Artificial”, Herbert A. Simon, M.I.T. Press, 1969. ISBN9-780262-691918

[SMA99] “EELRU: Simple and Effective Adaptive Page Replacement”, Yannis Smaragdakis, Scott Kaplan and Paul Wilson, Dept. of Computer Science, University of Texas, 1999. Available from: http://www.cs.amherst.edu/˜sfkaplan/papers/index.html

[SYD02] “Mac® OS X Programming”, Dan Parks Sydow, New Riders, 2002, ISBN 0-7357-1168-2

[SZM00] “Kernel Implementation: Page Table Structures”, Cristan Szmadja, University of New South Wales, Computing Department. Available from http://www.cse.unsw.edu.au/˜cs9242/02/lectures/08vm.pdf

[TAN86] “Using Sparse Capabilities in a Distributed Operating System”, Andrew S. Tanenbaum, Dept. of Mathematics and Computer Science, Vrije Universiteit; Sape J. Mullender, Centre for Mathematics and Computer Science; Robbert van Renesse, Dept. of Mathematics and Computer Science, Vrije Universiteit, 1986. Available from: http://citeseer.ist.psu.edu/tanenbaum86using.html

[TAN87] “Operating Systems: Design and Implementation” A. Tanenbaum, Prentice-Hall International Editions, 1987, ISBN 0-13-637331-3

[TAL95] “A New Page Table for 64-bit Address Spaces”, M. Talluri, M. D. Hill, Y. A. Khalidi, Computer Science Department, University of Wisconsin, in 15th Symposium on Operating System Principles, 1995

[THO94] “Distributions of Mach”, Mary R. Thompson, Oct 1994. Available at:http://www-2.cs.cmu.edu/afs/cs/project/mach/public/FAQ/distribution.info

[THO05] “PMAP(9) Manual Page” from NetBSD Kernel Developer’s Manual, Jason R. Thorpe, 2005. Available online from http://www.daemon-systems.org/man/pmap.9.html

[TOP96] “A History of MTS – 30 Years of Computing Service”, Susan Topol, Information Technology Digest, May 13, 1996 (Vol. 5, No. 5), The University of Michigan. Available from http://www.itd.umich.edu/ doc/Digest/0596/feat02.html

[UHL05] “On Operating System Basis Building Blocks”, Martin Uhl, Institut für Informatik, Technische Universität München, The 2005 International Conference on Computer Design (CDES 05), 2005.


[WIE00] “Simulation of Page Replacement Algorithms”, Felix Wiemann, Facharbeit, date unknown. Available from http://www.ososo.de/pra-sim/pra-sim.pdf

[WIL65] “Slave Memories and Dynamic Storage Allocation”, M. V. Wilkes, Trans. IEEE, Vol. EC-14, p. 270, 1965

[WIL99] “The GNU/Linux 2.2 Virtual Memory System”, Paul Wilson, University of Texas at Austin, Computer Sciences Division. Available from http://home.earthlink.net/˜jknapka/linux-mm/vm paulwilson.html

[WIR03] “Linux: The Big Picture”, Lars Wirzenius, first appearing in an article in PC Update, April 28, 2003. Available online at http://liw.iki.fi/liw/texts/linux-the-big-picture.html

[WIT02] “Mondrian Memory Protection”, Emmett Witchel, Josh Cates, and Krste Asanovic, MIT Laboratory for Computer Science, 2002. Available online from: http://www.cag.lcs.mit.edu/scale/papers/mmp-asplos2002.pdf

[YOU87] “The Duality of Memory and Communication in the Implementation of a Multiprocessor Operating System”, Young et al., Stanford University, Proc. 11th Symposium on Operating System Principles, 1987. Available online from: http://www.stanford.edu/˜emrek/quals/summaries/The Duality of Memory and-Communication in the Implementation of a Multiprocessor Operating System-Young-1987.txt
