Chapter 2

Data Storage

One of the important ways that database systems are distinguished from other systems is the ability of a DBMS to deal with very large amounts of data efficiently. In this chapter and the next we shall learn the basic techniques for managing data within the computer. The study can be divided into two parts:

1. How does a computer system store and manage very large volumes of data?

2. What representations and data structures best support efficient manipulations of this data?

We cover the first issue in this chapter, while the second is the topic of Chapters 3 through 5.

This chapter begins with the technology used for physically storing massive amounts of data. We shall study the devices used to store information, especially rotating disks. We introduce the "memory hierarchy," and see how the efficiency of algorithms involving very large amounts of data depends on the pattern of data movement between main memory and secondary storage (typically disks) or even "tertiary storage" (robotic devices for storing and accessing large numbers of optical disks or tape cartridges). A particular algorithm — two-phase, multiway merge sort — is used as an important example of an algorithm that uses the memory hierarchy effectively.

We also consider, in Section 2.4, a number of techniques for lowering the time it takes to read or write data from disk. The last two sections discuss methods for improving the reliability of disks. Problems addressed include intermittent read- or write-errors, and "disk crashes," where data becomes permanently unreadable.

21


2.1 The Memory Hierarchy

A typical computer system has several different components in which data may be stored. These components have data capacities ranging over at least seven orders of magnitude and also have access speeds ranging over seven or more orders of magnitude. The cost per byte of these components also varies, but more slowly, with perhaps three orders of magnitude between the cheapest and most expensive forms of storage. Not surprisingly, the devices with smallest capacity also offer the fastest access speed and have the highest cost per byte. A schematic of the memory hierarchy is shown in Fig. 2.1.

Figure 2.1: The memory hierarchy

2.1.1 Cache

At the lowest level of the hierarchy is a cache. The cache is an integrated circuit ("chip"), or part of the processor's chip, that is capable of holding data or machine instructions. The data (including instructions) in the cache are a copy of certain locations of main memory, the next higher level of the memory hierarchy. Sometimes, the values in the cache are changed, but the corresponding change to the main memory is delayed. Nevertheless, each value in the cache at any one time corresponds to one place in main memory. The unit of transfer between cache and main memory is typically a small number of bytes.


We may therefore think of the cache as holding individual machine instructions, integers, floating-point numbers or short character strings.

Often, a machine's cache is divided into two levels. On-board cache is found on the same chip as the microprocessor itself, and additional level-2 cache is found on another chip.

When the machine executes instructions, it looks for both the instructions and the data used by those instructions in the cache. If it doesn't find them there, it goes to main-memory and copies the instructions or data into the cache. Since the cache can hold only a limited amount of data, it is usually necessary to move something out of the cache in order to accommodate the new data. If what is moved out of cache has not changed since it was copied to cache, then nothing more needs to be done. However, if the data being expelled from the cache has been modified, then the new value must be copied into its proper location in main memory.

When data in the cache is modified, a simple computer with a single processor has no need to update the corresponding location in main memory. However, in a multiprocessor system that allows several processors to access the same main memory and keep their own private caches, it is often necessary for cache updates to write through, that is, to change the corresponding place in main memory immediately.

Typical caches at the end of the millennium have capacities up to a megabyte. Data can be read or written between the cache and processor at the speed of the processor instructions, commonly 10 nanoseconds (10⁻⁸ seconds) or less. On the other hand, moving an instruction or data item between cache and main memory takes much longer, perhaps 100 nanoseconds (10⁻⁷ seconds).

2.1.2 Main Memory

In the center of the action is the computer's main memory. We may think of everything that happens in the computer — instruction executions and data manipulations — as working on information that is resident in main memory (although in practice, it is normal for what is used to migrate to the cache, as we discussed in Section 2.1.1).

In 1999, typical machines are configured with around 100 megabytes (10⁸ bytes) of main memory. However, machines with much larger main memories, 10 gigabytes (10¹⁰ bytes) or more, can be found.

Main memories are random access, meaning that one can obtain any byte in the same time.¹ Typical times to access data from main memories are in the 10-100 nanosecond range (10⁻⁸ to 10⁻⁷ seconds).

¹Although some modern parallel computers have a main memory shared by many processors in a way that makes the access time of certain parts of memory different, by perhaps a factor of 3, for different processors.


Computer Quantities are Powers of 2

It is conventional to talk of sizes or capacities of computer components as if they were powers of 10: megabytes, gigabytes, and so on. In reality, since it is most efficient to design components such as memory chips to hold a number of bits that is a power of 2, all these numbers are really shorthands for nearby powers of 2. Since 2¹⁰ = 1024 is very close to a thousand, we often maintain the fiction that 2¹⁰ = 1000, and talk about 2¹⁰ with the prefix "kilo," 2²⁰ as "mega," 2³⁰ as "giga," 2⁴⁰ as "tera," and 2⁵⁰ as "peta," even though these prefixes in scientific parlance refer to 10³, 10⁶, 10⁹, 10¹², and 10¹⁵, respectively. The discrepancy grows as we talk of larger numbers. A "gigabyte" is really 1.074 × 10⁹ bytes.

We use the standard abbreviations for these numbers: K, M, G, T, and P for kilo, mega, giga, tera, and peta, respectively. Thus, 16G bytes is sixteen gigabytes, or strictly speaking 2³⁴ bytes. Since we sometimes want to talk about numbers that are the conventional powers of 10, we shall reserve for these the traditional numbers, without the prefixes "kilo," "mega," and so on. For example, "one million bytes" is 1,000,000 bytes, while "one megabyte" is 1,048,576 bytes.
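The drift described in the box can be tabulated directly; a short Python sketch of the same arithmetic:

```python
# The power-of-2 prefixes from the box above, next to the powers of 10
# they approximate. Each step of 2^10 drifts a little further from the
# corresponding power of 1000.
prefixes = {"K": 10, "M": 20, "G": 30, "T": 40, "P": 50}

for name, exp in prefixes.items():
    binary = 2 ** exp                 # what "one <prefix>byte" really is
    decimal = 1000 ** (exp // 10)     # the scientific meaning
    print(f"{name}: 2^{exp} = {binary:,} bytes (vs {decimal:,})")

# "One megabyte" is 1,048,576 bytes; "one million bytes" is 1,000,000;
# a "gigabyte" is about 1.074 x 10^9 bytes, as the box states.
```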

2.1.3 Virtual Memory

When we write programs, the data we use — variables of the program, files read, and so on — occupies a virtual memory address space. Instructions of the program likewise occupy an address space of their own. Many machines use a 32-bit address space; that is, there are 2³², or about 4 billion, different addresses. Since each byte needs its own address, we can think of a typical virtual memory as 4 gigabytes.

Since a virtual memory space is much bigger than the usual main memory, most of the content of a fully occupied virtual memory is actually stored on the disk. We discuss the typical operation of a disk in Section 2.2, but for the moment we only need to be aware that the disk is divided logically into blocks. The block size on common disks is in the range 4K to 56K bytes, i.e., 4 to 56 kilobytes. Virtual memory is moved between disk and main memory in entire blocks, which are usually called pages in main memory. The machine hardware and the operating system allow pages of virtual memory to be brought into any part of the main memory and to have each byte of that block referred to properly by its virtual memory address. We shall not discuss the mechanisms for doing so in this book.
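The arithmetic of the paragraph above is easy to check: 2³² byte addresses give 4 gigabytes, carved into pages. The page sizes tried below are sample points in the 4K-56K range just mentioned.

```python
# A 32-bit virtual address space: 2^32 distinct byte addresses,
# i.e., 4 gigabytes, divided into block-sized pages.
space = 2 ** 32
print(space, "bytes =", space // 2**30, "gigabytes")

for page_size in (4 * 2**10, 8 * 2**10, 56 * 2**10):
    print(f"{page_size:>6}-byte pages: {space // page_size:,} per address space")
```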

The path in Fig. 2.1 involving virtual memory represents the treatment of conventional programs and applications. It does not represent the typical way data in a database is managed. However, there is increasing interest in main-memory database systems, which do indeed manage their data through


Moore's Law

Gordon Moore observed many years ago that integrated circuits were improving in many ways, following an exponential curve that doubles about every 18 months. Some of these parameters that follow "Moore's law" are:

1. The speed of processors, i.e., the number of instructions executed per second and the ratio of the speed to cost of a processor.

2. The cost of main memory per bit and the number of bits that can be put on one chip.

3. The cost of disk per bit and the number of bytes that a disk can hold.

On the other hand, there are some other important parameters that do not follow Moore's law; they grow slowly if at all. Among these slowly growing parameters are the speed of accessing data in main memory, or the speed at which disks rotate. Because they grow slowly, "latency" becomes progressively larger. That is, the time to move data between levels of the memory hierarchy appears to take progressively longer compared with the time to compute. Thus, in future years, we expect that main memory will appear much further away from the processor than cache, and data on disk will appear even further away from the processor. Indeed, these effects of apparent "distance" are already quite severe in 1999.

virtual memory, relying on the operating system to bring needed data into main memory through the paging mechanism. Main-memory database systems, like most applications, are most useful when the data is small enough to remain in main memory without being swapped out by the operating system. If a machine has a 32-bit address space, then main-memory database systems are appropriate for applications that need to keep no more than 4 gigabytes of data in memory at once (or less if the machine's actual main memory is smaller than 2³² bytes). That amount of space is sufficient for many applications, but not for large, ambitious applications of DBMS's.

Thus, large-scale database systems will manage their data directly on the disk. These systems are limited in size only by the amount of data that can be stored on all the disks and other storage devices available to the computer system. We shall introduce this mode of operation next.
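The doubling rule in the Moore's law box above can be made concrete: a quantity that doubles every 18 months grows by a factor of 2 raised to (months/18). A small sketch; the projection of the 1999 disk figure is purely illustrative.

```python
# Growth under "Moore's law": doubling every 18 months means a factor
# of 2^(months/18). Applied here to a 10-gigabyte 1999 disk as an
# illustration, not a prediction from the text.
def moore_factor(years, doubling_months=18):
    """Growth factor after `years` of doubling every `doubling_months`."""
    return 2.0 ** (years * 12 / doubling_months)

print(moore_factor(1.5))        # 2.0: one doubling period
print(10 * moore_factor(15))    # a 10-gigabyte disk after 15 years: 10240.0 GB
```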

2.1.4 Secondary Storage

Essentially every computer has some sort of secondary storage, which is a form of storage that is both significantly slower and significantly more capacious than


main memory, yet is essentially random-access, with relatively small differences among the times required to access different data items (these differences are discussed in Section 2.2). Modern computer systems use some form of disk as secondary memory. Usually this disk is magnetic, although sometimes optical or magneto-optical disks are used. The latter types are cheaper, but may not support writing of data on the disk easily or at all; thus they tend to be used only for archival data that doesn't change.

We observe from Fig. 2.1 that the disk is considered the support for both virtual memory and a file system. That is, while some disk blocks will be used to hold pages of an application program's virtual memory, other disk blocks are used to hold (parts of) files. Files are moved between disk and main memory in blocks, under the control of the operating system or the database system. Moving a block from disk to main memory is a disk read; moving the block from main memory to the disk is a disk write. We shall refer to either as a disk I/O. Certain parts of main memory are used to buffer files, that is, to hold block-sized pieces of these files.

For example, when you open a file for reading, the operating system might reserve a 4K block of main memory as a buffer for this file, assuming disk blocks are 4K bytes. Initially, the first block of the file is copied into the buffer. When the application program has consumed those 4K bytes of the file, the next block of the file is brought into the buffer, replacing the old contents. This process, illustrated in Fig. 2.2, continues until either the entire file is read or the file is closed.

Figure 2.2: A file and its main-memory buffer
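The buffering scheme just described can be sketched in a few lines of Python: one block-sized buffer, refilled with the next 4K piece of the file each time the previous piece has been consumed. The file name and block size are illustrative.

```python
# One block-sized buffer, refilled with successive 4K blocks of a file,
# as in the example above. Block size and file name are assumptions.
BLOCK_SIZE = 4 * 2**10   # 4K bytes

def read_in_blocks(path, block_size=BLOCK_SIZE):
    """Yield successive block-sized pieces of the file at `path`."""
    with open(path, "rb") as f:
        while True:
            buffer = f.read(block_size)   # refill the buffer
            if not buffer:                # entire file has been read
                break
            yield buffer

# for block in read_in_blocks("data.dat"): process(block)
```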

A database management system will manage disk blocks itself, rather than relying on the operating system's file manager to move blocks between main and secondary memory. However, the issues in management are essentially the same whether we are looking at a file system or a DBMS. It takes roughly 10-30 milliseconds (.01 to .03 seconds) to read or write a block on disk. In that time, a typical machine can execute perhaps one million instructions. As a result, it is common for the time to read or write a disk block to dominate the time it takes to do whatever must be done with the contents of the block. Therefore it is vital that, whenever possible, a disk block containing data we need to access should already be in a main-memory buffer. Then, we do not have to pay the cost of a disk I/O. We shall return to this problem in Sections 2.3 and 2.4, where we see some examples of how to deal with the high cost of moving data between levels in the memory hierarchy.

In 1999, single disk units have capacities in the range from 1 to over 10 gigabytes. Moreover, machines can use several disk units, so secondary-storage capacity of 100 gigabytes for a single machine is realistic. Thus, secondary memory is on the order of 10⁵ times slower but at least 100 times more capacious than typical main memory. Secondary memory is also significantly cheaper than main memory. In 1999, prices for magnetic disk units range from 5 to 10 cents per megabyte, while the cost of main memory is 1 to 2 dollars per megabyte.
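The imbalance between I/O time and compute time described above can be put in numbers. The 100 million instructions/second rate below is an assumed figure, chosen so that a 10 ms I/O corresponds to the text's "perhaps one million instructions."

```python
# Instructions forgone during one disk I/O, using the 10-30 ms per
# block figure from the text. The instruction rate is an assumption
# chosen to match "perhaps one million instructions" per 10 ms I/O.
INSTRUCTIONS_PER_SECOND = 100_000_000

for io_ms in (10, 30):
    lost = INSTRUCTIONS_PER_SECOND * io_ms // 1000
    print(f"a {io_ms} ms disk I/O costs about {lost:,} instructions")
```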

2.1.5 Tertiary Storage

As capacious as a collection of disk units can be, there are databases much larger than what can be stored on the disk(s) of a single machine, or even of a substantial collection of machines. For example, chains of retail stores retain terabytes of data about their sales. Data gathered from satellite images often measures in the terabytes, and satellites will soon return petabytes (10¹⁵ bytes) of information per year.

To serve such needs, tertiary storage devices have been developed to hold data volumes measured in terabytes. Tertiary storage is characterized by significantly higher read/write times than secondary storage, but also by much larger capacities and smaller cost per byte than is available from magnetic disks. While main memory offers uniform access time for any datum, and disk offers an access time that does not differ by more than a small factor for accessing any datum, tertiary storage devices generally offer access times that vary widely, depending on how close to a read/write point the datum is. Here are the principal kinds of tertiary storage devices:

1. Ad-hoc Tape Storage. The simplest — and in past years the only — approach to tertiary storage is to put data on tape reels or cassettes and to store the cassettes in racks. When some information from the tertiary store is wanted, a human operator locates and mounts the tape on a tape reader. The information is located by winding the tape to the correct position, and the information is copied from tape to secondary storage or to main memory. To write into tertiary storage, the correct tape and point on the tape is located, and the copy proceeds from disk to tape.

2. Optical-Disk Juke Boxes. A "juke box" consists of racks of CD-ROM's (CD = "compact disk"; ROM = "read-only memory." These are optical disks of the type used commonly to distribute software). Bits on an optical disk are represented by small areas of black or white, so bits can be read by shining a laser on the spot and seeing whether the light is reflected. A robotic arm that is part of the jukebox can quickly extract any one CD-ROM and move it to a CD reader. The CD can then have its contents, or part thereof, read into secondary memory. It is not normally possible to write onto CD's without using special equipment. Low-cost CD writers


are available, and it is likely that read/write tertiary storage based on optical disks will soon be economical.

3. Tape Silos. A "silo" is a room-sized device that holds racks of tapes. The tapes are accessed by robotic arms that can bring them to one of several tape readers. The silo is thus an automated version of the earlier ad-hoc storage of tapes. Since it uses computer control of inventory and automates the tape-retrieval process, it is at least an order of magnitude faster than human-powered systems.

The capacity of a tape cassette in 1999 is as high as 50 gigabytes. Tape silos can therefore hold many terabytes. CD's have a standard of about 2/3 of a gigabyte, with a next-generation standard of about 2.5 gigabytes becoming prevalent. CD-ROM jukeboxes in the multiterabyte range are also available.

The time taken to access data from a tertiary storage device ranges from a few seconds to a few minutes. A robotic arm in a jukebox or silo can find the desired CD-ROM or cassette in several seconds, while human operators probably require minutes to locate and retrieve tapes. Once loaded in the reader, any part of the CD can be accessed in a fraction of a second, while it can take many additional seconds to move the correct portion of a tape under the read-head of the tape reader.

In summary, tertiary storage access can be about 1000 times slower than secondary-memory access (milliseconds versus seconds). However, single tertiary-storage units can be 1000 times more capacious than secondary storage devices (gigabytes versus terabytes). Figure 2.3 shows, on a log-log scale, the relationship between access times and capacities for the four levels of memory hierarchy that we have studied. We include "zip" and "floppy" disks ("diskettes"), which are common storage devices, although not typical of secondary storage used for database systems. The horizontal axis measures seconds in exponents of 10; e.g., -3 means 10⁻³ seconds, or one millisecond. The vertical axis measures bytes, also in exponents of 10; e.g., 8 means 100 megabytes.

2.1.6 Volatile and Nonvolatile Storage

An additional distinction among storage devices is whether they are volatile or nonvolatile. A volatile device "forgets" what is stored in it when the power goes off. A nonvolatile device, on the other hand, is expected to keep its contents intact even for long periods when the device is turned off or there is a power failure. The question of volatility is important, because one of the characteristic capabilities of a DBMS is the ability to retain its data even in the presence of power failures.

Magnetic materials will hold their magnetism in the absence of power, so devices such as magnetic disks and tapes are nonvolatile. Likewise, optical devices such as CD's hold the black or white dots with which they are imprinted, even in the absence of power. Indeed, for many of these devices it is impossible


Figure 2.3: Access time versus capacity for various levels of the memory hierarchy

to change what is written on their surface by any means. Thus, essentially all secondary and tertiary storage devices are nonvolatile.

On the other hand, main memory is generally volatile. It happens that a memory chip can be designed with simpler circuits if the value of the bit is allowed to degrade over the course of a minute or so; the simplicity lowers the cost per bit of the chip. What actually happens is that the electric charge that represents a bit drains slowly out of the region devoted to that bit. As a result, a so-called dynamic random-access memory, or DRAM, chip needs to have its entire contents read and rewritten periodically. If the power is off, then this refresh does not occur, and the chip will quickly lose what is stored.

A database system that runs on a machine with volatile main memory must back up every change on disk before the change can be considered part of the database, or else we risk losing information in a power failure. As a consequence, query and database modifications must involve a large number of disk writes, some of which could be avoided if we didn't have the obligation to preserve all information at all times. An alternative is to use a form of main memory that is not volatile. New types of memory chips, called flash memory, are nonvolatile and are becoming economical. Another alternative is to build a so-called RAM disk from conventional memory chips by providing a battery backup to the main power supply.

2.1.7 Exercises for Section 2.1

Exercise 2.1.1: Suppose that in 1999 the typical computer has a processor that runs at 500 megahertz, has a disk of 10 gigabytes, and a main memory of 100 megabytes. Assume that Moore's law (these factors double every 18 months) continues to hold into the indefinite future.

* a) When will terabyte disks be common?

b) When will gigabyte main memories be common?

c) When will terahertz processors be common?

d) What will be a typical configuration (processor, disk, memory) in the year 2008?

! Exercise 2.1.2: Commander Data, the android from the 24th century on Star Trek: The Next Generation (but you knew that, didn't you?) once proudly announced that his processor runs at "12 teraops." While an operation and a cycle may not be the same, let us suppose they are, and that Moore's law continues to hold for the next 300 years. If so, what would Data's true processor speed be?

2.2 Disks

The use of secondary storage is one of the important characteristics of database management systems, and secondary storage is almost exclusively based on magnetic disks. Thus, to motivate many of the ideas used in DBMS implementation, we must examine the operation of disks in detail.

2.2.1 Mechanics of Disks

The two principal moving pieces of a disk drive are shown in Fig. 2.4; they are a disk assembly and a head assembly. The disk assembly consists of one or more circular platters that rotate around a central spindle. The upper and lower surfaces of the platters are covered with a thin layer of magnetic material, on which bits are stored. A 0 is represented by orienting the magnetism of a small area in one direction and a 1 by orienting the magnetism in the opposite direction. A common diameter for disk platters is 3.5 inches, although disks with diameters from an inch to several feet have been built.

The locations where bits are stored are organized into tracks, which are concentric circles on a single platter. Tracks occupy most of a surface, except for the region closest to the spindle, as can be seen in the top view of Fig. 2.5. A track consists of many points, each of which represents a single bit by the direction of its magnetism.

Tracks are organized into sectors, which are segments of the circle separated by gaps that are not magnetized in either direction.² The sector is an indivisible

²We show each track with the same number of sectors in Fig. 2.5. However, as we shall discuss in Example 2.1, the number of sectors per track may vary, with the outer tracks having more sectors than inner tracks.


Figure 2.4: A typical disk

unit, as far as reading and writing the disk is concerned. It is also indivisible as far as errors are concerned. Should a portion of the magnetic layer be corrupted in some way, so that it cannot store information, then the entire sector containing this portion cannot be used. Gaps often represent about 10% of the total track and are used to help identify the beginnings of sectors. The blocks, which, as we mentioned in Section 2.1.3, are logical units of data that are transferred between disk and main memory, consist of one or more sectors.

Figure 2.5: Top view of a disk surface

The second movable piece shown in Fig. 2.4, the head assembly, holds the disk heads. There is one head for each surface, riding extremely close to the


Sectors Versus Blocks

Remember that a "sector" is a physical unit of the disk, while a "block" is a logical unit, a creation of whatever software system — operating system or DBMS, for example — is using the disk. As we mentioned, it is typical today for blocks to be at least as large as sectors and to consist of one or more sectors. However, there is no reason why a block cannot be a fraction of a sector, with several blocks packed into one sector. In fact, some older systems did use this strategy.

surface, but never touching it (or else a "head crash" occurs and the disk is destroyed, along with everything stored thereon). A head reads the magnetism passing under it, and can also alter the magnetism to write information on the disk. The heads are each attached to an arm, and the arms for all the surfaces move in and out together, being part of the rigid head assembly.

2.2.2 The Disk Controller

One or more disk drives are controlled by a disk controller, which is a small processor capable of:

1. Controlling the mechanical actuator that moves the head assembly, to position the heads at a particular radius. At this radius, one track from each surface will be under the head for that surface and will therefore be readable and writable. The tracks that are under the heads at the same time are said to form a cylinder.

2. Selecting a surface from which to read or write, and selecting a sector from the track on that surface that is under the head. The controller is also responsible for knowing when the rotating spindle has reached the point where the desired sector is beginning to move under the head.

3. Transferring the bits read from the desired sector to the computer's main memory or transferring the bits to be written from main memory to the intended sector.

Figure 2.6 shows a simple, single-processor computer. The processor communicates via a data bus with the main memory and the disk controller. A disk controller can control several disks; we show three disks in this computer.

2.2.3 Disk Storage Characteristics

Disk technology is in flux, as the space needed to store a bit shrinks rapidly. In 1999, some of the typical measures associated with disks are:

Figure 2.6: Schematic of a simple computer system

• Rotation Speed of the Disk Assembly. 5400 RPM, i.e., one rotation every 11 milliseconds, is common, although higher and lower speeds are found.

• Number of Platters per Unit. A typical disk drive has about five platters and therefore ten surfaces. However, the common diskette ("floppy" disk) and "zip" disk have a single platter with two surfaces, and disk drives with up to 30 surfaces are found. "Single-sided floppies," with a single surface on one platter, are old-fashioned but may still be found.

• Number of Tracks per Surface. A surface may have as many as 10,000 tracks, although diskettes have a much smaller number; see Example 2.2.

• Number of Bytes per Track. Common disk drives have 10⁵ or more bytes per track, although diskettes' tracks hold less. As mentioned, tracks are divided into sectors. Figure 2.5 shows 12 sectors per track, but in fact as many as 500 sectors per track are found in modern disks. Sectors, in turn, hold perhaps 512 to 4096 bytes each.
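The rotation-speed figure in the first bullet follows from simple arithmetic: one revolution takes 60,000/RPM milliseconds, and a randomly requested sector waits half a revolution on average. A small sketch; the 7200 RPM figure is an additional illustrative speed, not from the text.

```python
# Rotation time implied by spindle speed: 60,000/RPM milliseconds per
# revolution, so 5400 RPM gives the "one rotation every 11 ms" above.
def rotation_ms(rpm):
    """Milliseconds for one full revolution at the given RPM."""
    return 60_000.0 / rpm

for rpm in (5400, 7200):
    full = rotation_ms(rpm)
    print(f"{rpm} RPM: {full:.2f} ms/rotation, {full / 2:.2f} ms average wait")
```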

Example 2.1: The Megatron 747 disk has the following characteristics, which are typical of a medium-size, vintage-1999 disk drive.

• There are four platters providing eight surfaces.

• There are 2¹³, or 8192, tracks per surface.

• There are (on average) 2⁸ = 256 sectors per track.

• There are 29 = 512 bytes per sector.


The capacity of the disk is the product of 8 surfaces, times 8192 tracks, times 256 sectors, times 512 bytes, or 2^33 bytes. The Megatron 747 is thus an 8-gigabyte disk. A single track holds 256 × 512 bytes, or 128K bytes. If blocks are 2^12, or 4096, bytes, then one block uses 8 sectors, and there are 256/8 = 32 blocks on a track.
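The capacity arithmetic can be checked with a few lines of Python (the numbers come from Example 2.1; the variable names are ours):

```python
# Megatron 747 geometry, from Example 2.1.
SURFACES = 8
TRACKS_PER_SURFACE = 2 ** 13      # 8192
SECTORS_PER_TRACK = 2 ** 8        # 256, on average
BYTES_PER_SECTOR = 2 ** 9         # 512
BLOCK_SIZE = 2 ** 12              # 4096-byte blocks

capacity = SURFACES * TRACKS_PER_SURFACE * SECTORS_PER_TRACK * BYTES_PER_SECTOR
bytes_per_track = SECTORS_PER_TRACK * BYTES_PER_SECTOR
sectors_per_block = BLOCK_SIZE // BYTES_PER_SECTOR
blocks_per_track = SECTORS_PER_TRACK // sectors_per_block

print(capacity == 2 ** 33)                  # True: an 8-gigabyte disk
print(bytes_per_track)                      # 131072, i.e., 128K bytes
print(sectors_per_block, blocks_per_track)  # 8 sectors per block, 32 blocks per track
```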

The Megatron 747 has surfaces of 3.5-inch diameter. The tracks occupy the outer inch of the surfaces, and the inner 0.75 inch is unoccupied. The density of bits in the radial direction is thus 8192 per inch, because that is the number of tracks.

The density of bits around the tracks is far greater. Let us suppose at first that each track has the average number of sectors, 256. Suppose that the gaps occupy 10% of the tracks, so the 128K bytes per track (or 1M bits) occupy 90% of the track. The length of the outermost track is 3.5π, or about 11 inches. Ninety percent of this distance, or about 9.9 inches, holds a megabit. Hence the density of bits in the occupied portion of the track is about 100,000 bits per inch.

On the other hand, the innermost track has a diameter of only 1.5 inches and would store the same one megabit in 0.9 × 1.5 × π, or about 4.2 inches. The bit density of the inner tracks is thus around 250,000 bits per inch.

Since the densities of inner and outer tracks would vary too much if the number of sectors and bits were kept uniform, the Megatron 747, like other modern disk drives, stores more sectors on the outer tracks than on inner tracks. For example, we could store 256 sectors per track on the middle third, but only 192 sectors on the inner third and 320 sectors on the outer third of the tracks. If we did, then the density would range from 114,000 bits to 182,000 bits per inch, at the outermost and innermost tracks, respectively. □
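The density estimates can be reproduced with a short sketch. The 10% gap figure and the track diameters are the assumptions stated in the example; the small differences from the rounded figures in the text come from using 2^20 bits for the megabit per track:

```python
import math

TRACK_BITS = 2 ** 20              # 128K bytes = 1 megabit per track (uniform sectors)
USEFUL_FRACTION = 0.9             # gaps take 10% of each track

def bits_per_inch(diameter_inches):
    """Linear bit density in the occupied (non-gap) portion of a track."""
    useful_length = USEFUL_FRACTION * math.pi * diameter_inches
    return TRACK_BITS / useful_length

print(round(bits_per_inch(3.5)))  # outermost track: ~106,000 (the text rounds to 100,000)
print(round(bits_per_inch(1.5)))  # innermost track: ~247,000 (the text rounds to 250,000)
```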

Example 2.2: At the small end of the range of disks is the standard 3.5-inch diskette. It has two surfaces with 40 tracks each, for a total of 80 tracks. The capacity of this disk, formatted in either the MAC or PC formats, is about 1.5 megabytes of data, or 150,000 bits (18,750 bytes) per track. About one quarter of the available space is taken up by gaps and other disk overhead in either format. □

2.2.4 Disk Access Characteristics

Our study of database management systems requires us to understand not only the way data is stored on disks but the way it is manipulated. Since all computation takes place in main memory or cache, the only issue as far as the disk is concerned is how to move blocks of data between disk and main memory. As we mentioned in Section 2.2.2, blocks (or the consecutive sectors that comprise the blocks) are read or written when:

a) The heads are positioned at the cylinder containing the track on which the block is located, and


b) The sectors containing the block move under the disk head as the entire disk assembly rotates.

The time taken between the moment at which the command to read a block is issued and the time that the contents of the block appear in main memory is called the latency of the disk. It can be broken into the following components:

1. The time taken by the processor and disk controller to process the request, usually a fraction of a millisecond, which we shall neglect. We shall also neglect time due to contention for the disk controller (some other process might be reading or writing the disk at the same time) and other delays due to contention, such as for the bus.

2. The time to position the head assembly at the proper cylinder. This time, called seek time, can be 0 if the heads happen already to be at the proper cylinder. If not, then the heads require some minimum time to start moving and stop again, plus additional time that is roughly proportional to the distance traveled. Typical minimum times, the time to start, move by one track, and stop, are a few milliseconds, while maximum times to travel across all tracks are in the 10 to 40 millisecond range. Figure 2.7 suggests how seek time varies with distance. It shows seek time beginning at some value x for a distance of one cylinder and suggests that the maximum seek time is in the range 3x to 20x. The average seek time is often used as a way to characterize the speed of the disk. We discuss how to calculate this average in Example 2.3.


Figure 2.7: Seek time varies with distance traveled

3. The time for the disk to rotate so the first of the sectors containing the block reaches the head. This delay is called rotational latency. A typical disk rotates completely about once every 10 milliseconds. On the average, the desired sector will be about half way around the circle when the


heads arrive at its cylinder, so the average rotational latency is around 5 milliseconds. Figure 2.8 illustrates the problem of rotational latency.

Figure 2.8: The cause of rotational latency

4. The transfer time, during which the sectors of the block, and any gaps between them, rotate past the head. Since a typical disk has about 100,000 bytes per track and rotates once in approximately 10 milliseconds, we can read from disk at about 10 megabytes per second. The transfer time for a 4096-byte block is less than half a millisecond.

Example 2.3: Let us examine the time it takes to read a 4096-byte block from the Megatron 747 disk. First, we need to know some timing properties of the disk:

• The disk rotates at 3840 rpm; i.e., it makes one rotation in 1/64th of a second.

• To move the head assembly between cylinders takes one millisecond to start and stop, plus one additional millisecond for every 500 cylinders traveled. Thus, the heads move one track in 1.002 milliseconds and move from the innermost to the outermost track, a distance of 8191 tracks, in about 17.4 milliseconds.

Let us calculate the minimum, maximum, and average times to read that 4096-byte block. The minimum time, since we are neglecting overhead and contention due to use of the controller, is just the transfer time. That is, the block might be on a track over which the head is positioned already, and the first sector of the block might be about to pass under the head.

Since there are 512 bytes per sector on the Megatron 747 (see Example 2.1 for the physical specifications of the disk), the block occupies eight sectors. The


Trends in Disk-Controller Architecture

As the cost of digital hardware drops precipitously, disk controllers are beginning to look more like computers of their own, with general-purpose processors and substantial random-access memory. Among the many things that might be done with such additional hardware, disk controllers are beginning to read and store in their local memory entire tracks of a disk, even if only one block from that track is requested. This capability greatly reduces the average access time for blocks, as long as we need all or most of the blocks on a single track. Section 2.4.1 discusses some of the applications of full-track or full-cylinder reads and writes.

heads must therefore pass over eight sectors and the seven gaps between them. Recall that the gaps represent 10% of the circle and sectors the remaining 90%. There are 256 gaps and 256 sectors around the circle. Since the gaps together cover 36 degrees of arc and sectors the remaining 324 degrees, the total degrees of arc covered by seven gaps and 8 sectors is:

36 × 7/256 + 324 × 8/256 = 11.109

degrees. The transfer time is thus (11.109/360)/64 seconds; that is, we divide by 360 to get the fraction of a rotation needed, and then divide by 64 because the Megatron 747 rotates 64 times a second. This transfer time, and thus the minimum latency, is about 0.5 milliseconds.
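The arc and transfer-time arithmetic can be sketched directly (a check of the numbers above, using the Megatron 747 parameters from Example 2.3):

```python
ROTATIONS_PER_SEC = 64            # Megatron 747: 3840 rpm
SECTORS = 256                     # sectors (and gaps) per track
GAP_DEGREES = 36.0                # gaps cover 10% of the circle
SECTOR_DEGREES = 324.0            # sectors cover the remaining 90%

# A 4096-byte block spans 8 sectors and the 7 gaps between them.
arc = GAP_DEGREES * 7 / SECTORS + SECTOR_DEGREES * 8 / SECTORS
transfer_ms = (arc / 360.0) / ROTATIONS_PER_SEC * 1000

print(round(arc, 3))              # 11.109 degrees
print(round(transfer_ms, 2))      # 0.48 ms, "about half a millisecond"
```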

Now, let us look at the maximum possible time to read the block. In the worst case, the heads are positioned at the innermost cylinder, and the block we want to read is on the outermost cylinder (or vice versa). Thus, the first thing the controller must do is move the heads. As we observed above, the time it takes to move the Megatron 747 heads across all cylinders is about 17.4 milliseconds. This quantity is the seek time for the read.

The worst thing that can happen when the heads arrive at the correct cylinder is that the beginning of the desired block has just passed under the head. Assuming we must read the block starting at the beginning, we have to wait essentially a full rotation, or 15.6 milliseconds (i.e., 1/64th of a second), for the beginning of the block to reach the head again. Once that happens, we have only to wait an amount equal to the transfer time, 0.5 milliseconds, to read the entire block. Thus, the worst-case latency is 17.4 + 15.6 + 0.5 = 33.5 milliseconds.

Last, let us compute the average time to read a block. Two of the components of the latency are easy to compute: the transfer time is always 0.5 milliseconds, and the average rotational latency is the time to rotate the disk half way around, or 7.8 milliseconds. We might suppose that the average seek time is just the time to move across half the tracks. However, that is not quite right, since


typically, the heads are initially somewhere near the middle and therefore will have to move less than half the distance, on average, to the desired cylinder.

A more detailed estimate of the average number of tracks the head must move is obtained as follows. Assume the heads are initially at any of the 8192 cylinders with equal probability. If at cylinder 1 or cylinder 8192, then the average number of tracks to move is (1 + 2 + ··· + 8191)/8192, or about 4096 tracks. If at cylinder 4096, in the middle, then the head is about equally likely to move in as out, and either way, it will move on average about a quarter of the tracks, or 2048 tracks. A bit of calculation shows that as the initial head position varies from cylinder 1 to cylinder 4096, the average distance the head needs to move decreases quadratically from 4096 to 2048. Likewise, as the initial position varies from 4096 up to 8192, the average distance to travel increases quadratically back up to 4096, as suggested in Fig. 2.9.


Figure 2.9: Average travel distance as a function of initial head position

If we integrate the quantity in Fig. 2.9 over all initial positions, we find that the average distance traveled is one third of the way across the disk, or 2730 cylinders. That is, the average seek time will be one millisecond, plus the time to travel 2730 cylinders, or 1 + 2730/500 = 6.5 milliseconds.3 Our estimate of the average latency is thus 6.5 + 7.8 + 0.5 = 14.8 milliseconds; the three terms represent average seek time, average rotational latency, and transfer time, respectively. □
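The one-third figure can also be checked exactly, without integration, by averaging the travel distance over all pairs of start and target cylinders (a sketch; the seek-time parameters are the Megatron 747 figures given above):

```python
N = 8192                                  # cylinders on the Megatron 747

# Average of |i - j| over all ordered pairs of cylinders: there are
# 2 * (N - d) ordered pairs at distance d, for d = 1, ..., N-1.
total = sum(2 * d * (N - d) for d in range(1, N))
avg_distance = total / (N * N)

avg_seek_ms = 1 + avg_distance / 500      # 1 ms start/stop + 1 ms per 500 cylinders
avg_rotational_ms = 1000 / 64 / 2         # half a rotation at 64 rotations/second
transfer_ms = 0.5                         # from the minimum-latency calculation

print(round(avg_distance))                # ~2731 cylinders; the text rounds to 2730
print(round(avg_seek_ms + avg_rotational_ms + transfer_ms, 1))   # 14.8 ms
```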

2.2.5 Writing Blocks

The process of writing a block is, in its simplest form, quite analogous to reading a block. The disk heads are positioned at the proper cylinder, we wait for the

3 Note that this calculation ignores the possibility that we do not have to move the head at all, but that case occurs only once in 8192 times, assuming random block requests. On the other hand, random block requests is not necessarily a good assumption, as we shall see in Section 2.4.


proper sector(s) to rotate under the head, but, instead of reading the data under the head, we use the head to write new data. The minimum, maximum, and average times to write would thus be exactly the same as for reading.

A complication occurs if we want to verify that the block was written correctly. If so, then we have to wait for an additional rotation and read each sector back to check that what was intended to be written is actually stored there. A simple way to verify correct writing by using checksums is discussed in Section 2.5.2.

2.2.6 Modifying Blocks

It is not possible to modify a block on disk directly. Rather, even if we wish to modify only a few bytes (e.g., a component of one of several tuples stored on the block), we must do the following:

1. Read the block into main memory.

2. Make whatever changes to the block are desired in the main-memory copy of the block.

3. Write the new contents of the block back onto the disk.

4. If appropriate, verify that the write was done correctly.

The total time for this block modification is thus the sum of the time it takes to read, the time to perform the update in main memory (which is usually negligible compared to the time to read or write to disk), the time to write, and, if verification is performed, another rotation time of the disk.4

2.2.7 Exercises for Section 2.2

Exercise 2.2.1: The Megatron 777 disk has the following characteristics:

1. There are ten surfaces, with 10,000 tracks each.

2. Tracks hold an average of 1000 sectors of 512 bytes each.

3. 20% of each track is used for gaps.

4. The disk rotates at 10,000 rpm.

5. The time it takes the head to move n tracks is 1 + 0.001n milliseconds.

4 We might wonder whether the time to write the block we just read is the same as the time to perform a "random" write of a block. If the heads stay where they are, then we know we have to wait a full rotation to write, but the seek time is zero. However, since the disk controller does not know when the application will finish writing the new value of the block, the heads may well have moved to another track to perform some other disk I/O before the request to write the new value of the block is made.


Answer the following questions about the Megatron 777.

* a) What is the capacity of the disk?

b) If all tracks hold the same number of sectors, what is the density of bits in the sectors of a track?

* c) What is the maximum seek time?

* d) What is the maximum rotational latency?

e) If a block is 16,384 bytes (i.e., 32 sectors), what is the transfer time of a block?

! f) What is the average seek time?

g) What is the average rotational latency?

! Exercise 2.2.2: Suppose the Megatron 747 disk head is at track 1024, i.e., 1/8 of the way across the tracks. Suppose that the next request is for a block on a random track. Calculate the average time to read this block.

*!! Exercise 2.2.3: At the end of Example 2.3 we computed the average distance that the head travels moving from one randomly chosen track to another randomly chosen track, and found that this distance is 1/3 of the tracks. Suppose, however, that the number of sectors per track were inversely proportional to the length (or radius) of the track, so the bit density is the same for all tracks. Suppose also that we need to move the head from a random sector to another random sector. Since the sectors tend to congregate at the outside of the disk, we might expect that the average head move would be less than 1/3 of the way across the tracks. Assuming, as in the Megatron 747, that tracks occupy radii from 0.75 inches to 1.75 inches, calculate the average number of tracks the head travels when moving between two random sectors.

!! Exercise 2.2.4: At the end of Example 2.1 we suggested that the maximum density of tracks could be reduced if we divided the tracks into three regions, with different numbers of sectors in each region. If the divisions between the three regions could be placed at any radius, and the number of sectors in each region could vary, subject only to the constraint that the total number of bytes on the 8192 tracks of one surface be 1 gigabyte, what choice for the five parameters (radii of the two divisions between regions and the numbers of sectors per track in each of the three regions) minimizes the maximum density of any track?

2.3 Using Secondary Storage Effectively

In most studies of algorithms, one assumes that the data is in main memory, and access to any item of data takes as much time as any other. This model


of computation is often called the "RAM model" or random-access model of computation. However, when implementing a DBMS, one must assume that the data does not fit into main memory. One must therefore take into account the use of secondary, and perhaps even tertiary, storage in designing efficient algorithms. The best algorithms for processing very large amounts of data thus often differ from the best main-memory algorithms for the same problem.

In this section, we shall consider primarily the interaction between main and secondary memory. In particular, there is a great advantage in designing algorithms that limit the number of disk accesses, even if the actions taken by the algorithm on data in main memory are not what we might consider the best use of the main memory. A similar principle applies at each level of the memory hierarchy. Even a main-memory algorithm can sometimes be improved if we remember the size of the cache and design our algorithm so that data moved to cache tends to be used many times. Likewise, an algorithm using tertiary storage needs to take into account the volume of data moved between tertiary and secondary memory, and it is wise to minimize this quantity even at the expense of more work at the lower levels of the hierarchy.

2.3.1 The I/O Model of Computation

Let us imagine a simple computer running a DBMS and trying to serve a number of users who are accessing the database in various ways: queries and database modifications. For the moment, assume our computer has one processor, one disk controller, and one disk. The database itself is much too large to fit in main memory. Key parts of the database may be buffered in main memory, but generally, each piece of the database that one of the users accesses will have to be retrieved initially from disk.

We shall assume that the disk is a Megatron 747, with 4K-byte blocks and the timing characteristics determined in Example 2.3. In particular, the average time to read or write a block is about 15 milliseconds. Since there are many users, and each user issues disk-I/O requests frequently, the disk controller will often have a queue of requests, which we initially assume it satisfies on a first-come-first-served basis. A consequence of this strategy is that each request for a given user will appear random (i.e., the disk head will be in a random position before the request), even if this user is reading blocks belonging to a single relation, and that relation is stored on a single cylinder of the disk. Later in this section we shall discuss how to improve the performance of the system in various ways. However, the following rule, which defines the I/O model of computation, continues to hold:

Dominance of I/O cost: If a block needs to be moved between disk and main memory, then the time taken to perform the read or write is much larger than the time likely to be used manipulating that data in main memory. Thus, the number of block accesses (reads and writes) is a good approximation to the time needed by the algorithm and should be minimized.


Example 2.4: Suppose our database has a relation R and a query asks for the tuple of R that has a certain key value k. As we shall see, it is quite desirable that an index on R be created and used to identify the disk block on which the tuple with key value k appears. However, it is generally unimportant whether the index tells us where on the block this tuple appears.

The reason is that it will take on the order of 15 milliseconds to read this 4K-byte block. In 15 milliseconds, a modern microprocessor can execute millions of instructions. However, searching for the key value k once the block is in main memory will only take thousands of instructions, even if the dumbest possible linear search is used. The additional time to perform the search in main memory will therefore be less than 1% of the block access time and can be neglected safely. □

2.3.2 Sorting Data in Secondary Storage

As an extended example of how algorithms need to change under the I/O model of computation cost, let us consider sorting when the data is much larger than main memory. To begin, we shall introduce a particular sorting problem and give some details of the machine on which the sorting occurs.

Example 2.5: Let us assume that we have a large relation R consisting of 10,000,000 tuples. Each tuple is represented by a record with several fields, one of which is the sort key field, or just "key field" if there is no confusion with other kinds of keys. The goal of a sorting algorithm is to order the records by increasing value of their sort keys.

A sort key may or may not be a "key" in the usual SQL sense of a primary key, where records are guaranteed to have unique values in their primary key. If duplicate values of the sort key are permitted, then any order of records with equal sort keys is acceptable. For simplicity, we shall assume sort keys are unique. Also for simplicity, we assume records are of fixed length, namely 100 bytes per record. Thus, the entire relation occupies a gigabyte.

The machine on which the sorting occurs has one Megatron 747 disk and 50 megabytes of main memory available for buffering blocks of the relation. The actual main memory is 64M bytes, but the rest of main memory is used by the system.

We assume disk blocks are 4096 bytes. We may thus pack 40 100-byte tuples or records per block, with 96 bytes left over that may be used for certain bookkeeping functions or left unused. The relation thus occupies 250,000 blocks. The number of blocks that can fit in 50M bytes of memory (which, recall, is really 50 × 2^20 bytes), is 50 × 2^20/2^12, or 12,800 blocks. □
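These sizes recur throughout the sorting analysis, so it is worth recomputing them from the example's assumptions (a sketch with our own variable names):

```python
TUPLES = 10_000_000
RECORD_BYTES = 100
BLOCK_BYTES = 4096
MEMORY_BYTES = 50 * 2 ** 20       # 50 megabytes available for buffering

records_per_block = BLOCK_BYTES // RECORD_BYTES   # 40, with 96 bytes left over
relation_blocks = TUPLES // records_per_block     # 250,000 blocks
memory_blocks = MEMORY_BYTES // BLOCK_BYTES       # 12,800 blocks fit in memory

print(records_per_block, relation_blocks, memory_blocks)   # 40 250000 12800
```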

If the data fits in main memory, there are a number of well-known algorithms that work well;5 variants of "Quicksort" are generally considered the

5 See D. E. Knuth, The Art of Computer Programming, Vol. 3: Sorting and Searching, 2nd Edition, Addison-Wesley, Reading, MA, 1998.


fastest. Moreover, we would use a strategy where we sort only the key fields with attached pointers to the full records. Only when the keys and their pointers were in sorted order would we use the pointers to bring every record to its proper position.

Unfortunately, these ideas do not work very well when secondary memory is needed to hold the data. The preferred approaches to sorting, when the data is mostly in secondary memory, involve moving each block between main and secondary memory only a small number of times, in a regular pattern. Often, these algorithms operate in a small number of passes; in one pass every record is read into main memory once and written out to disk once. In the next section, we shall consider one such algorithm.

2.3.3 Merge-Sort

The reader may be familiar with a sorting algorithm called Merge-Sort that works by merging sorted lists into larger sorted lists. To merge sorted lists, we repeatedly compare the smallest remaining keys of each list, move the record with the smaller key to the output, and repeat, until one list is exhausted. At that time, the output, in the order selected, followed by what remains of the nonexhausted list, is the complete set of records, in sorted order.

Example 2.6: Suppose we have two sorted lists of four records each. To make matters simpler, we shall represent records by their keys and no other data, and we assume keys are integers. One of the sorted lists is (1, 3, 4, 9) and the other is (2, 5, 7, 8). In Fig. 2.10 we see the stages of the merge process.

Figure 2.10: Merging two sorted lists to make one sorted list

At the first step, the head elements of the two lists, 1 and 2, are compared. Since 1 < 2, the 1 is removed from the first list and becomes the first element of the output. At step (2), the heads of the remaining lists, now 3 and 2, are compared; 2 wins and is moved to the output. The merge continues until step (7), when the second list is exhausted. At that point, the remainder of the


first list, which happens to be only one element, is appended to the output and the merge is done. Note that the output is in sorted order, as must be the case, because at each step we chose the smallest of the remaining elements. □
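The merge procedure of Example 2.6 is straightforward to write down; this sketch represents records by their integer keys, as in the example:

```python
def merge(left, right):
    """Merge two sorted lists by repeatedly moving the smaller head to the output."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    # One list is exhausted; append the remainder of the other.
    out.extend(left[i:])
    out.extend(right[j:])
    return out

print(merge([1, 3, 4, 9], [2, 5, 7, 8]))   # [1, 2, 3, 4, 5, 7, 8, 9]
```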

The time to merge in main memory is linear in the sum of the lengths of the lists. The reason is that, because the given lists are sorted, only the heads of the two lists are ever candidates for being the smallest unselected element, and we can compare them in a constant amount of time. The classic merge-sort algorithm sorts recursively, using log2 n phases if there are n elements to be sorted. It can be described as follows:

BASIS: If there is a list of one element to be sorted, do nothing, because the list is already sorted.

INDUCTION: If there is a list of more than one element to be sorted, then divide the list arbitrarily into two lists that are either of the same length, or as close as possible if the original list is of odd length. Recursively sort the two sublists. Then merge the resulting sorted lists into one sorted list.

The analysis of this algorithm is well known and not too important here. Briefly, T(n), the time to sort n elements, is some constant times n (to split the list and merge the resulting sorted lists) plus the time to sort two lists of size n/2. That is, T(n) = 2T(n/2) + an for some constant a. The solution to this recurrence equation is T(n) = O(n log n), that is, proportional to n log n.

2.3.4 Two-Phase, Multiway Merge-Sort

We shall use a variant of Merge-Sort, called Two-Phase, Multiway Merge-Sort, to sort the relation of Example 2.5 on the machine described in that example. It is the preferred sorting algorithm in many database applications. Briefly, this algorithm consists of:

• Phase 1: Sort main-memory-sized pieces of the data, so every record is part of a sorted list that just fits in the available main memory. There may thus be any number of these sorted sublists, which we merge in the next phase.

• Phase 2: Merge all the sorted sublists into a single sorted list.

Our first observation is that with data on secondary storage, we do not want to start with a basis to the recursion that is one record or a few records. The reason is that Merge-Sort is not as fast as some other algorithms when the records to be sorted fit in main memory. Thus, we shall begin the recursion by taking an entire main memory full of records, and sorting them using an appropriate main-memory sorting algorithm such as Quicksort. We repeat this process as many times as necessary:

1. Fill all available main memory with blocks from the original relation to be sorted.


2. Sort the records that are in main memory.

3. Write the sorted records from main memory onto new blocks of secondary memory, forming one sorted sublist.

At the end of this first phase, all the records of the original relation will have been read once into main memory, and become part of a main-memory-size sorted sublist that has been written onto disk.

Example 2.7: Consider the relation described in Example 2.5. We determined that 12,800 of the 250,000 blocks will fill main memory. We thus fill memory 20 times, sort the records in main memory, and write the sorted sublists out to disk. The last of the 20 sublists is shorter than the rest; it occupies only 6,800 blocks, while the other 19 sublists occupy 12,800 blocks each.

How long does this phase take? We read each of the 250,000 blocks once, and we write 250,000 new blocks. Thus, there are half a million disk I/Os. We have assumed, for the moment, that blocks are stored at random on the disk, an assumption that, as we shall see in Section 2.4, can be improved upon greatly. However, on our randomness assumption, each block read or write takes about 15 milliseconds. Thus, the I/O time for the first phase is 7500 seconds, or 125 minutes. It is not hard to reason that at a processor speed of tens of millions of instructions per second, the 10,000,000 records can be formed into 20 sorted sublists in far less than the I/O time. We thus estimate the total time for phase one as 125 minutes. □
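The phase-one accounting in Example 2.7 can be reproduced mechanically (a sketch; the 15 ms figure is the average random-I/O time assumed above):

```python
RELATION_BLOCKS = 250_000
MEMORY_BLOCKS = 12_800
AVG_IO_MS = 15                    # average time per random block read or write

# Ceiling division: how many memory-sized sublists phase 1 produces.
sublists = (RELATION_BLOCKS + MEMORY_BLOCKS - 1) // MEMORY_BLOCKS
last_sublist = RELATION_BLOCKS - (sublists - 1) * MEMORY_BLOCKS

ios = 2 * RELATION_BLOCKS         # read every block once, write every block once
phase1_minutes = ios * AVG_IO_MS / 1000 / 60

print(sublists, last_sublist)     # 20 sublists; the last holds 6800 blocks
print(phase1_minutes)             # 125.0 minutes
```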

Now, let us consider how we complete the sort by merging the sorted sublists. We could merge them in pairs, as in the classical Merge-Sort, but that would involve reading all data in and out of memory 2 log2 n times if there were n sorted sublists. For instance, the 20 sorted sublists of Example 2.7 would be read in and out of secondary storage once to merge into 10 lists; another complete reading and writing would reduce them to 5 sorted lists, a read/write of 4 of the five lists would reduce them to 3, and so on.

A better approach is to read the first block of each sorted sublist into a main-memory buffer. For some huge relations, there would be too many sorted sublists from phase one to read even one block per list into main memory, a problem we shall deal with in Section 2.3.5. But for data such as that of Example 2.5, there are relatively few lists, 20 in that example, and a block from each list fits easily in main memory.

We also use a buffer for an output block that will contain as many of the first elements in the complete sorted list as it can hold. Initially, the output block is empty. The arrangement of buffers is suggested by Fig. 2.11. We merge the sorted sublists into one sorted list with all the records as follows.

1. Find the smallest key among the first remaining elements of all the lists. Since this comparison is done in main memory, a linear search is sufficient, taking a number of machine instructions proportional to the number of sublists. However, if we wish, there is a method based on "priority


Figure 2.11: Main-memory organization for multiway merging

queues"6 that takes time proportional to the logarithm of the number ofsublists to find the smallest element.

2. Move the smallest element to the first available position of the output block.

3. If the output block is full, write it to disk and reinitialize the same buffer in main memory to hold the next output block.

4. If the block from which the smallest element was just taken is now exhausted of records, read the next block from the same sorted sublist into the same buffer that was used for the block just exhausted. If no blocks remain, then leave its buffer empty and do not consider elements from that list in any further competition for smallest remaining elements.
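Steps (1) through (4) describe a k-way merge. The following is a compact sketch in Python, using the standard heapq module for the "priority queue" refinement mentioned in step (1); each sorted sublist is modeled as an in-memory list, and read_block stands in for a disk read of its next block (names are illustrative, and the output is accumulated in memory rather than flushed block by block as in step (3)):

```python
import heapq

def multiway_merge(sublists, block_size=2):
    """Merge sorted sublists, holding one block per list in memory."""
    def read_block(lst, i):
        # stand-in for a disk read of block i of a sorted sublist
        return lst[i * block_size:(i + 1) * block_size]

    buffers, next_block, heap = [], [], []
    for i, lst in enumerate(sublists):
        buffers.append(read_block(lst, 0))
        next_block.append(1)
        if buffers[i]:
            heapq.heappush(heap, (buffers[i][0], i))   # front of each buffer

    output = []
    while heap:
        key, i = heapq.heappop(heap)   # step 1: smallest remaining key
        output.append(key)             # step 2: move it to the output
        buffers[i].pop(0)
        if not buffers[i]:             # step 4: refill an exhausted buffer,
            buffers[i] = read_block(sublists[i], next_block[i])
            next_block[i] += 1         # or leave it empty if no blocks remain
        if buffers[i]:
            heapq.heappush(heap, (buffers[i][0], i))
    return output

print(multiway_merge([[1, 4, 7, 9], [2, 3, 8], [5, 6]]))   # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```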

In the second phase, unlike the first phase, the blocks are read in an unpredictable order, since we cannot tell when an input block will become exhausted. However, notice that every block holding records from one of the sorted lists is read from disk exactly once. Thus, the total number of block reads is 250,000 in the second phase, just as for the first. Likewise, each record is placed once in

5See Aho, A. V. and J. D. Ullman, Foundations of Computer Science, Computer Science Press, 1992.


2.3. USING SECONDARY STORAGE EFFECTIVELY 47

How Big Should Blocks Be?

We have assumed a 4K-byte block in our analysis of algorithms using the Megatron 747 disk. However, there are arguments that a larger block size would be advantageous. Recall from Example 2.3 that it takes about half a millisecond for transfer time of a 4K block and 14 milliseconds for average seek time and rotational latency. If we doubled the size of blocks, we would halve the number of disk I/O's for an algorithm like the Multiway Merge-Sort described here. On the other hand, the only change in the time to access a block would be that the transfer time increases to 1 millisecond. We would thus approximately halve the time the sort takes.

If we doubled the block size again, to 16K, the transfer time would rise only to 2 milliseconds, and for a block size of 64K it would be 8 milliseconds. At that point, the average block access time would be 22 milliseconds, but we would need only 62,500 block accesses, for a speedup in sorting by a factor of 10.

There are reasons to keep the block size fairly small. First, we cannot effectively use blocks that cover several tracks. Second, small relations would occupy only a fraction of a block, and thus there could be much wasted space on the disk. There are also certain data structures for secondary-storage organization that prefer to divide data among many blocks and therefore work less well when the block size is too large. In fact, we shall see in Section 2.3.5 that the larger the blocks are, the fewer the number of records we can sort by the two-phase, multiway method described here. Nevertheless, as machines get faster and disks more capacious, there is a tendency for block sizes to grow.

an output block, and each of these blocks is written to disk. Thus, the number of block writes in the second phase is also 250,000. As the amount of second-phase computation in main memory can again be neglected compared to the I/O cost, we conclude that the second phase takes another 125 minutes, or 250 minutes for the entire sort.
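The effect of block size discussed in the box above can be tabulated (a sketch using the box's round figures: 14 ms of seek plus rotational latency per access, 0.5 ms of transfer per 4K transferred, and 1,000,000 total block reads and writes for the two-phase sort at 4K blocks):

```python
SEEK_ROT_MS = 14.0        # average seek time + rotational latency per access
DATA_IOS_4K = 1_000_000   # total 4K-block reads + writes in the two-phase sort

times = {}
for kb in (4, 8, 16, 64):
    transfer_ms = 0.5 * kb / 4           # transfer time scales with block size
    ios = DATA_IOS_4K * 4 // kb          # fewer, larger blocks
    times[kb] = ios * (SEEK_ROT_MS + transfer_ms) / 1000 / 60
    print(f"{kb:3}K blocks: {ios:8d} I/O's, {times[kb]:6.1f} minutes")
```

With 64K blocks the sort needs only 62,500 accesses and runs roughly ten times faster than with 4K blocks, matching the box's estimate.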

2.3.5 Extension of Multiway Merging to Larger Relations

The Two-Phase, Multiway Merge-Sort described above can be used to sort some very large sets of records. To see how large, let us suppose that:

1. The block size is B bytes.

2. The main memory available for buffering blocks is M bytes.

3. Records take R bytes.


48 CHAPTER 2. DATA STORAGE

The number of buffers available in main memory is thus M/B. On the second phase, all but one of these buffers may be devoted to one of the sorted sublists; the remaining buffer is for the output block. Thus, the number of sorted sublists that may be created in phase one is (M/B) − 1. This quantity is also the number of times we may fill main memory with records to be sorted. Each time we fill main memory, we sort M/R records. Thus, the total number of records we can sort is (M/R)((M/B) − 1), or approximately M²/RB records.

Example 2.8: If we use the parameters outlined in Example 2.5, then M = 50,000,000, B = 4096, and R = 100. We can thus sort up to M²/RB = 6.1 billion records, occupying six tenths of a terabyte.

Note that relations this size will not fit on a Megatron 747 disk, or on any reasonably small number of them. We would probably need to use a tertiary storage device to store the records, and we would have to move records from tertiary storage to a disk or disks, using a strategy like Multiway Merge-Sort, but with tertiary storage and secondary storage playing the roles we have ascribed to secondary storage and main memory, respectively. □

If we need to sort more records, we can add a third pass. Use the Two-Phase, Multiway Merge-Sort to sort groups of M²/RB records, turning them into sorted sublists. Then, in a third phase, we would merge up to (M/B) − 1 of these lists in a final multiway merge.

The third phase lets us sort approximately M³/RB² records occupying M³/B² blocks. For the parameters of Example 2.5, this amount is about 75 trillion records occupying 7.5 petabytes. Such an amount is unheard of today. Since even the 0.61-terabyte limit for the Two-Phase, Multiway Merge-Sort is unlikely to be carried out in secondary storage, we suggest that the two-phase version of Multiway Merge-Sort is likely to be enough for all practical purposes.
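Both capacity estimates are easy to check numerically (a sketch that uses the exact count (M/R)((M/B) − 1)^(k−1), of which M^k/RB^(k−1) is the approximation, with the parameters of Example 2.8):

```python
def kphase_capacity(M, B, R, k=2):
    """Records sortable by a k-phase, multiway merge sort:
    each merge phase fans in up to (M/B) - 1 sorted lists."""
    return (M // R) * ((M // B) - 1) ** (k - 1)

M, B, R = 50_000_000, 4096, 100
print(kphase_capacity(M, B, R, 2))   # about 6.1 billion records (Example 2.8)
print(kphase_capacity(M, B, R, 3))   # about 75 trillion records (third phase)
```

The k = 2 and k = 3 cases reproduce the two figures in the text, and the general formula answers Exercise 2.3.8's question in closed form.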

2.3.6 Exercises for Section 2.3

Exercise 2.3.1: Using Two-Phase, Multiway Merge-Sort, how long would it take to sort the relation of Example 2.5 if the Megatron 747 disk were replaced by the Megatron 777 disk described in Exercise 2.2.1, and all other characteristics of the machine and data remained the same?

Exercise 2.3.2: Suppose we use Two-Phase, Multiway Merge-Sort on the machine and relation R of Example 2.5, with certain modifications. Tell how many disk I/O's are needed for the sort if the relation R and/or machine characteristics are changed as follows:

* a) The number of tuples in R is doubled (and everything else remains the same).

b) The length of tuples is doubled to 200 bytes (and everything else remains as in Example 2.5).


2.4. IMPROVING THE ACCESS TIME OF SECONDARY STORAGE 49

* c) The size of blocks is doubled, to 8192 bytes (again, as throughout this exercise, all other parameters are unchanged).

d) The size of available main memory is doubled to 100 megabytes.

! Exercise 2.3.3: Suppose the relation R of Example 2.5 grows to have as many tuples as can be sorted using Two-Phase, Multiway Merge-Sort on the machine described in that example. Also assume that the disk grows to accommodate R, but all other characteristics of the disk, machine, and relation R remain the same. How long would it take to sort R?

* Exercise 2.3.4: Let us again consider the relation R of Example 2.5, but assume that it is stored sorted by the sort key (which is in fact a "key" in the usual sense, and uniquely identifies records). Also, assume that R is stored in a sequence of blocks whose locations are known, so that for any i it is possible to locate and retrieve the ith block of R using one disk I/O. Given a key value K, we can find the tuple with that key value by using a standard binary search technique. What is the maximum number of disk I/O's needed to find the tuple with key K?

!! Exercise 2.3.5: Suppose we have the same situation as in Exercise 2.3.4, but we are given 10 key values to find. What is the maximum number of disk I/O's needed to find all 10 tuples?

* Exercise 2.3.6: Suppose we have a relation whose n tuples each require R bytes, and we have a machine whose main memory M and disk-block size B are just sufficient to sort the n tuples using Two-Phase, Multiway Merge-Sort. How would the maximum n change if we made one of the following alterations to the parameters?

a) Double B.

b) Double R.

c) Double M.

! Exercise 2.3.7: Repeat Exercise 2.3.6 if it is just possible to perform the sort using Three-Phase, Multiway Merge-Sort.

*! Exercise 2.3.8: As a function of parameters R, M, and B (as in Exercise 2.3.6) and the integer k, how many records can be sorted using a k-phase, Multiway Merge-Sort?

2.4 Improving the Access Time of Secondary Storage

The analysis of Section 2.3.4 assumed that data was stored on a single disk and that blocks were chosen randomly from the possible locations on the disk. That



assumption may be appropriate for a system that is executing a large number of small queries simultaneously. But if all the system is doing is sorting a large relation, then we can save a significant amount of time by being judicious about where we put the blocks involved in the sort, thereby taking advantage of the way disks work. In fact, even if the load on the system is from a large number of unrelated queries accessing "random" blocks on disk, we can do a number of things to make the queries run faster and/or allow the system to process more queries in the same time ("increase the throughput"). Among the strategies we shall consider in this section are:

• Place blocks that are accessed together on the same cylinder, so we can often avoid seek time, and possibly rotational latency as well.

• Divide the data among several smaller disks rather than one large one. Having more head assemblies that can go after blocks independently can increase the number of block accesses per unit time.

• "Mirror" a disk: making two or more copies of the data on different disks. In addition to saving the data in case one of the disks fails, this strategy, like dividing the data among several disks, lets us access several blocks at once.

• Use a disk-scheduling algorithm, either in the operating system, in the DBMS, or in the disk controller, to select the order in which several requested blocks will be read or written.

• Prefetch blocks to main memory in anticipation of their later use.

In our discussion, we shall emphasize the improvements possible when the system is dedicated, at least momentarily, to doing a particular task such as the sorting operation we introduced in Section 2.3. However, there are at least two other viewpoints with which to measure the performance of systems and their use of secondary storage:

1. What happens when there are a large number of processes being supported simultaneously by the system? An example is an airline reservation system that accepts queries about flights and new bookings from many agents at the same time.

2. What do we do if we have a fixed budget for a computer system, or we must execute a mix of queries on a system that is already in place and not easily changed?

We address these questions in Section 2.4.6 after exploring the options.



2.4.1 Organizing Data by Cylinders

Since seek time represents about half the average time to access a block, there are a number of applications where it makes sense to store data that is likely to be accessed together, such as relations, on a single cylinder. If there is not enough room, then several adjacent cylinders can be used.

In fact, if we choose to read all the blocks on a single track or on a cylinder consecutively, then we can neglect all but the first seek time (to move to the cylinder) and the first rotational latency (to wait until the first of the blocks moves under the head). In that case, we can approach the theoretical transfer rate for moving data on or off the disk.

Example 2.9: Let us review the performance of the Two-Phase, Multiway Merge-Sort described in Section 2.3.4. Recall from Example 2.3 that we determined the average block transfer time, seek time, and rotational latency to be 0.5 milliseconds, 6.5 milliseconds, and 7.8 milliseconds, respectively, for the Megatron 747 disk. We also found that the sorting of 10,000,000 records occupying a gigabyte took about 250 minutes. This time was divided into four large operations, two for reading and two for writing. One read- and one write-operation was associated with each of the two phases of the algorithm.

Let us consider whether the organization of data by cylinders can improve the time of these operations. The first operation was the reading of the original records into main memory. Recall from Example 2.7 that we loaded main memory 20 times, with 12,800 blocks each time.

The original data may be stored on consecutive cylinders. Each of the 8,192 cylinders of the Megatron 747 stores about a megabyte; technically this figure is an average, because inner tracks store less and outer tracks more, but we shall for simplicity assume all tracks and cylinders are average. We must thus store the initial data on 1000 cylinders, and we read 50 cylinders to fill main memory. Therefore we can read one cylinder with a single seek time. We do not even have to wait for any particular block of the cylinder to pass under the head, because the order of records read is not important at this phase. We must move the heads 49 times to adjacent cylinders, but recall that a move of one track takes only one millisecond according to the parameters of Example 2.3. The total time to fill main memory is thus:

1. 6.5 milliseconds for one average seek.

2. 49 milliseconds for 49 one-cylinder seeks.

3. 6.4 seconds for the transfer of 12,800 blocks.

All but the last quantity can be neglected. Since we fill memory 20 times, the total reading time for phase 1 is about 2.15 minutes. This number should be compared with the hour that the reading part of phase 1 took in Example 2.7 when we assumed blocks were distributed randomly on disk. The writing part of phase 1 can likewise use adjacent cylinders to store the 20 sorted sublists



of records. They can be written out onto another 1000 cylinders, using the same head motions as for reading: one random seek and 49 one-cylinder seeks for each of the 20 lists. Thus, the writing time for phase 1 is also about 2.15 minutes, or 4.3 minutes for all of phase 1, compared with 125 minutes when randomly distributed blocks were used.

On the other hand, storage by cylinders does not help with the second phase of the sort. Recall that in the second phase, blocks are read from the fronts of the 20 sorted sublists in an order that is determined by the data and by which list next exhausts its current block. Likewise, output blocks, containing the complete sorted list, are written one at a time, interspersed with block reads. Thus, the second phase will still take about 125 minutes. We have consequently cut the sorting time almost in half, but cannot do better by judicious use of cylinders alone. □
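The accounting in Example 2.9 for the cylinder-organized phase 1 can be reproduced directly (a sketch using the Megatron 747 figures: 6.5 ms average seek, 1 ms per one-cylinder seek, 0.5 ms per block transferred):

```python
AVG_SEEK_MS, TRACK_SEEK_MS, TRANSFER_MS = 6.5, 1.0, 0.5
BLOCKS_PER_FILL, FILLS = 12_800, 20    # 50 cylinders per fill of main memory

# one average seek, 49 one-cylinder seeks, then pure transfer
fill_ms = AVG_SEEK_MS + 49 * TRACK_SEEK_MS + BLOCKS_PER_FILL * TRANSFER_MS
read_minutes = FILLS * fill_ms / 1000 / 60
phase1_minutes = 2 * read_minutes      # writing mirrors the reading
print(round(read_minutes, 2), round(phase1_minutes, 2))   # 2.15 4.3
```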

2.4.2 Using Multiple Disks

We can often improve the speed of our system if we replace one disk, with many heads locked together, by several disks with their independent heads. The arrangement was suggested in Fig. 2.6, where we showed three disks connected to a single controller. As long as the disk controller, bus, and main memory can handle the data transferred at a higher rate, the effect will be approximately as if all times associated with reading and writing the disk were divided by the number of disks. An example should illustrate the difference.

Example 2.10: The Megatron 737 disk has all the characteristics of the Megatron 747 outlined in Examples 2.1 and 2.3, but it has only one platter and two surfaces. Thus, each Megatron 737 holds 2 gigabytes. Suppose that we replace our one Megatron 747 by four Megatron 737's. Let us consider how the Two-Phase, Multiway Merge-Sort can be conducted.

First, we can divide the given records among the four disks; the data will occupy 1000 adjacent cylinders on each disk. When we want to load main memory from disk during phase 1, we shall fill 1/4 of main memory from each of the four disks. We still get the benefit observed in Example 2.9 that the seek time and rotational latency go essentially to zero. However, we can read enough blocks to fill 1/4 of main memory, which is 3,200 blocks, from a disk in about 1,600 milliseconds, or 1.6 seconds. As long as the system can handle data at this rate coming from four disks, we can fill the 50 megabytes of main memory in 1.6 seconds, compared with 6.4 seconds when we used one disk.

Similarly, when we write out main memory during phase 1, we can distribute each sorted sublist onto the four disks, occupying about 50 adjacent cylinders on each disk. Thus, there is a factor-of-4 speedup for the writing part of phase 1 too, and the entire phase 1 takes about a minute, compared with 4 minutes using only the cylinder-based improvement of Section 2.4.1 and 125 minutes for the original, random approach.

Now, let us consider the second phase of the Two-Phase, Multiway Merge-Sort. We must still read blocks from the fronts of the various lists in a seemingly



random, data-dependent way. If the core algorithm of phase 2 — the selection of smallest remaining elements from the 20 sublists — requires that all 20 lists be represented by blocks completely loaded into main memory, then we cannot use the four disks to advantage. Every time a block is exhausted, we must wait until a new block is read from the same list to replace it. Thus, only one disk at a time gets used.

However, if we write our code more carefully, we can resume comparisons among the 20 smallest elements as soon as the first element of the new block appears in main memory.7 If so, then several lists might be having their blocks loaded into main memory at the same time. As long as they are on separate disks, then we can perform several block reads at the same time, and we have the potential of a factor-of-4 increase in the speed of the reading part of phase 2. We are also limited by the random order in which blocks must be read; if the next two blocks we need happen to be on the same disk, then one has to wait for the other, and all main-memory processing stops until at least the beginning of the second arrives in main memory.

The writing part of phase 2 is easier to speed up. We can use four output buffers, and fill each in turn. Each buffer, when full, is written to one particular disk, filling cylinders in order. We can thus fill one of the buffers while the other three are written out.

Nevertheless, we cannot possibly write out the complete sorted list faster than we can read the data from the 20 intermediate lists. As we saw above, it is not possible to keep all four disks doing useful work all the time, and our speedup for phase 2 is probably in the 2-3 times range. However, even a factor of 2 saves us an hour. By using cylinders to organize data and four disks to hold data, we can reduce the time for our sorting example from 125 minutes for each of the two phases to 1 minute for the first phase and an hour for the second. □

2.4.3 Mirroring Disks

There are situations where it makes sense to have two or more disks hold identical copies of data. The disks are said to be mirrors of each other. One important motivation is that the data will survive a head crash by either disk, since it is still readable on a mirror of the disk that crashed. Systems designed to enhance reliability often use pairs of disks as mirrors of each other.

However, mirror disks can also speed up access to data. Recall our discussion of phase 2 of multiway merge-sorting in Example 2.10, where we observed that if we were very careful about timing, we could arrange to load up to four blocks from four different sorted lists whose previous blocks were exhausted. However, we could not choose which four lists would get new blocks. Thus, we could be

7We should emphasize that this approach requires extremely delicate implementation and should only be attempted if there is an important benefit to doing so. There is a significant risk that, if we are not careful, there will be an attempt to read a record before it actually arrives in main memory.



unlucky and find that the first two lists were on the same disk, or two of the first three lists were on the same disk.

If we are willing to waste disk space by making four copies of a single large disk, then we can guarantee that the system can always be retrieving four blocks at once. That is, no matter which four blocks we need, we can assign each one to any one of the four disks and have the block read off of that disk.

In general, if we make n copies of a disk, we can read any n blocks in parallel. If we have fewer than n blocks to read at once, then we can often obtain a speed increase by judiciously choosing which disk to read from. That is, we can pick the available disk whose head is closest to the cylinder from which we desire to read.
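Picking the mirror with the nearest head can be sketched in a line of code (illustrative only; heads holds the current cylinder position of each mirror's head assembly):

```python
def pick_mirror(heads, cylinder):
    """Index of the mirror whose head needs the shortest seek."""
    return min(range(len(heads)), key=lambda i: abs(heads[i] - cylinder))

print(pick_mirror([1000, 4000, 7500], 6800))   # 2: the head at 7500 is closest
```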

Using mirror disks does not speed up writing, but neither does it slow writing down, when compared with using a single disk. That is, whenever we need to write a block, we write it on all disks that have a copy. Since the writing can take place in parallel, the elapsed time is about the same as for writing to a single disk. There is a slight opportunity for differences among the writing times for the various mirror disks, because we cannot rely on them rotating in exact synchronism. Thus, one disk's head might just miss a block, while another disk's head might be about to pass over the position for the same block. However, these differences in rotational latency average out, and if we are using the cylinder-based strategy of Section 2.4.1, then the rotational latency can be neglected anyway.

2.4.4 Disk Scheduling and the Elevator Algorithm

Another effective way to speed up disk accesses in some situations is to have the disk controller choose which of several requests to execute first. This opportunity is not useful when the system needs to read or write disk blocks in a certain sequence, such as is the case in parts of our running merge-sort example. However, when the system is supporting many small processes that each access a few blocks, one can often increase the throughput by choosing which process's request to honor first.

A simple and effective way to schedule large numbers of block requests is known as the elevator algorithm. We think of the disk head as making sweeps across the disk, from innermost to outermost cylinder and then back again, just as an elevator makes vertical sweeps from the bottom to top of a building and back again. As heads pass a cylinder, they stop if there are one or more requests for blocks on that cylinder. All these blocks are read or written, as requested. The heads then proceed in the same direction they were traveling until the next cylinder with blocks to access is encountered. When the heads reach a position where there are no requests ahead of them in their direction of travel, they reverse direction.

Example 2.11: Suppose we are scheduling a Megatron 747 disk, which we recall has average seek, rotational latency, and transfer times of 6.5, 7.8, and



0.5, respectively. In this example, all times are in milliseconds. Suppose that at some time there are existing requests for block accesses at cylinders 1000, 3000, and 7000. The heads are located at cylinder 1000. In addition, there are three more requests for block accesses that come in at later times, as summarized in Fig. 2.12. For instance, the request for a block from cylinder 2000 is made at time 20 milliseconds.

We shall assume that each block access incurs time 0.5 for transfer and 7.8 for average rotational latency, i.e., we need 8.3 milliseconds plus whatever the seek time is for each block access. The seek time can be calculated by the rule for the Megatron 747 given in Example 2.3: 1 plus the number of tracks divided by 500. Let us see what happens if we schedule by the elevator algorithm. The first request at cylinder 1000 requires no seek, since the heads are already there. Thus, at time 8.3 the first access will be complete. The request for cylinder 2000 has not arrived at this point, so we move the heads to cylinder 3000, the next requested "stop" on our sweep to the highest-numbered tracks. The seek from cylinder 1000 to 3000 takes 5 milliseconds, so we arrive at time 13.3 and complete the access in another 8.3. Thus, the second access is complete at time 21.6. By this time, the request for cylinder 2000 has arrived, but we passed that cylinder at time 11.3 and will not come back to it until the next pass.

We thus move next to cylinder 7000, taking time 9 to seek and 8.3 for rotation and transfer. The third access is thus complete at time 38.9. Now, the request for cylinder 8000 has arrived, so we continue outward. We require 3 milliseconds for seek time, so this access is complete at time 38.9 + 3 + 8.3 = 50.2. At this time, the request for cylinder 5000 has been made, so it and the request at cylinder 2000 remain. We thus sweep inward, honoring these two requests. Figure 2.13 summarizes the times at which requests are honored.

Let us compare the performance of the elevator algorithm with a more naive approach such as first-come-first-served. The first three requests are satisfied in exactly the same manner, assuming that the order of the first three requests was 1000, 3000, 7000. However, at that point, we go to cylinder 2000, because that was the fourth request to arrive. The seek time is 11.0 for this request, since we travel from cylinder 7000 to 2000, more than half way across the disk. The fifth request, at cylinder 8000, requires a seek time of 13, and the last, at 5000, uses seek time 7. Figure 2.14 summarizes the activity caused by first-come-first-served scheduling. The difference between the two algorithms — 14 milliseconds — may appear not significant, but recall that the number of requests in this simple example is small and the algorithms were assumed not to deviate until the fourth of the six requests. □

If the average number of requests waiting for the disk increases, the elevator algorithm further improves the throughput. For instance, should the pool of waiting requests equal the number of cylinders, then each seek will cover but a few cylinders, and the average seek time will approximate the minimum. If the pool of requests grows beyond the number of cylinders, then there will typically be more than one request at each cylinder. The disk controller can then order



Cylinder        First time
of Request      available

  1000               0
  3000               0
  7000               0
  2000              20
  8000              30
  5000              40

Figure 2.12: Arrival times for six block-access requests

Cylinder          Time
of Request      completed

  1000              8.3
  3000             21.6
  7000             38.9
  8000             50.2
  5000             65.5
  2000             80.8

Figure 2.13: Finishing times for block accesses using the elevator algorithm

Cylinder          Time
of Request      completed

  1000              8.3
  3000             21.6
  7000             38.9
  2000             58.2
  8000             79.5
  5000             94.8

Figure 2.14: Finishing times for block accesses using the first-come-first-served algorithm
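The finishing times in Figs. 2.13 and 2.14 can be reproduced by a short simulation (a sketch that hard-codes the Megatron 747 seek rule of Example 2.3, namely 1 ms plus 1/500 ms per cylinder traveled, and the 8.3 ms of rotational latency plus transfer per access):

```python
def seek_ms(a, b):
    # Megatron 747 rule from Example 2.3: 1 ms plus 1/500 ms per cylinder
    return 0.0 if a == b else 1 + abs(a - b) / 500

ROT_XFER_MS = 7.8 + 0.5    # average rotational latency + transfer time

REQUESTS = [(1000, 0), (3000, 0), (7000, 0), (2000, 20), (8000, 30), (5000, 40)]

def elevator(requests, start=1000):
    t, pos, moving_up = 0.0, start, True
    pending, finish = list(requests), {}
    while pending:
        arrived = [r for r in pending if r[1] <= t]
        if not arrived:                      # idle until the next request arrives
            t = min(r[1] for r in pending)
            continue
        ahead = [r for r in arrived
                 if (r[0] >= pos if moving_up else r[0] <= pos)]
        if not ahead:                        # nothing ahead in this direction: reverse
            moving_up = not moving_up
            continue
        cyl, arr = min(ahead, key=lambda r: abs(r[0] - pos))
        t += seek_ms(pos, cyl) + ROT_XFER_MS
        pos = cyl
        pending.remove((cyl, arr))
        finish[cyl] = round(t, 1)
    return finish

def first_come_first_served(requests, start=1000):
    t, pos, finish = 0.0, start, {}
    for cyl, arr in sorted(requests, key=lambda r: r[1]):
        t = max(t, arr) + seek_ms(pos, cyl) + ROT_XFER_MS
        pos = cyl
        finish[cyl] = round(t, 1)
    return finish

print(elevator(REQUESTS))                  # matches Fig. 2.13
print(first_come_first_served(REQUESTS))   # matches Fig. 2.14
```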



Effective Latency of the Elevator Algorithm

Although we saw in Example 2.11 how the average time taken per disk access can be reduced, the benefit is not uniform among requests. For example, the request from cylinder 2000 is satisfied at time 58.2 using first-come-first-served, but time 80.8 using the elevator algorithm, as we find by examining Figs. 2.13 and 2.14. Since the request was issued at time 20, the apparent latency of the disk as far as the requesting process is concerned went from 38.2 to 60.8 milliseconds.

If there were many more disk-access requests waiting, then each sweep of the heads during the elevator algorithm would take a very long time. A process whose request just missed the "elevator" would see an apparent latency that was very high indeed. The compensation is that without using the elevator algorithm or another good scheduling approach, throughput would decrease, and the disk could not satisfy requests at the rate they were generated. The system would eventually experience arbitrarily long delays, or fewer processes per second could be served.

the requests around the cylinder, reducing the average rotational latency as well as the average seek time. However, should request pools grow that big, the time taken to serve any request becomes extremely large. An example should illustrate the situation.

Example 2.12: Suppose again we are operating a Megatron 747 disk, with its 8192 cylinders. Imagine that there are 1000 disk-access requests waiting. For simplicity, assume that these are all for blocks on different cylinders, spaced 8 apart. If we start at one end of the disk and sweep across, each of the 1000 requests has just a fraction more than 1 millisecond for seek time, 7.8 milliseconds for rotational latency, and 0.5 milliseconds for transfer. We can thus satisfy one request every 9.3 milliseconds, about 60% of the 14.8-millisecond average time for random block accesses. However, the entire 1000 accesses take 9.3 seconds, so the average delay in satisfying a request is half that, or 4.65 seconds, a quite noticeable delay.

Now, suppose the pool of requests is as large as 16,384, which we shall assume for simplicity is exactly two accesses per cylinder. In this case, each seek time is one millisecond, and of course the transfer time is half a millisecond. Since there are two blocks accessed on each cylinder, on average the further of the two blocks will be 2/3 of the way around the disk when the heads arrive at that track. The proof of this estimate is tricky; we explain it in the box entitled "Waiting for the Last of Two Blocks."

Thus the average latency for these two blocks will be half of 2/3 of the time for a single revolution, or (1/2) × (2/3) × 15.6 = 5.2 milliseconds. We have thus reduced the average time to access a block to 1 + 0.5 + 5.2 = 6.7 milliseconds, or less



Waiting for the Last of Two Blocks

Suppose there are two blocks at random positions around a cylinder. Let x₁ and x₂ be the positions, in fractions of the full circle, so these are numbers between 0 and 1. The probability that both x₁ and x₂ are less than some number y between 0 and 1 is y². Thus, the probability density for y is the derivative of y², or 2y. That is, the probability that y has a given value increases linearly, as y grows from 0 to 1. The average of y is the integral of y times the probability density of y, that is, ∫₀¹ 2y² dy, or 2/3.

than half the average time with first-come-first-served scheduling. On the otherhand, the 16,384 accesses take a total of 110 seconds, so the average delay insatisfying a request is 55 seconds. D
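The 2/3 estimate in the box can be checked numerically. The sketch below (not from the text) approximates the integral of y times its density 2y over [0, 1] with a midpoint rule:

```python
# Numeric check of the "Waiting for the Last of Two Blocks" estimate:
# the farther of two uniformly random positions on [0, 1] has density 2y,
# so its expected value is the integral of y * 2y, which should be 2/3.

def expected_farther_position(steps=100_000):
    """Midpoint-rule approximation of the integral of 2*y^2 over [0, 1]."""
    h = 1.0 / steps
    return sum(2.0 * ((i + 0.5) * h) ** 2 * h for i in range(steps))

print(expected_farther_position())  # close to 0.6667
```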

2.4.5 Prefetching and Large-Scale Buffering

Our final suggestion for speeding up some secondary-memory algorithms is called prefetching or sometimes double buffering. In some applications we can predict the order in which blocks will be requested from disk. If so, then we can load them into main memory buffers before they are needed. One advantage to doing so is that we are thus better able to schedule the disk, such as by using the elevator algorithm, to reduce the average time needed to access a block. We could gain the speedup in block access suggested by Example 2.12 without the long delay in satisfying requests that we also saw in that example.

Example 2.13: For an example of the use of double buffering, let us again focus on the second phase of the Two-Phase, Multiway Merge-Sort outlined in Section 2.3.4. Recall that we merged 20 sorted sublists by bringing into main memory one block from each list. If we had so many sorted sublists to merge that one block from each would fill main memory, then we could not do any better. But in our example, there is plenty of main memory left over. For example, we could devote two block buffers to each list and fill one buffer while records were being selected from the other during the merge. When a buffer was exhausted, we would switch to the other buffer for the same list, with no delay. □
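The buffer-switching idea can be sketched in code. The block source and threading details below are illustrative assumptions, not part of the text; the point is that the next block is being fetched while the current one is consumed, so switching buffers costs no delay:

```python
# A sketch of double buffering for one sorted sublist. The iterator of
# blocks stands in for reading disk blocks; a background thread plays the
# role of the disk filling the "other" buffer during the merge.
import threading

class DoubleBufferedList:
    def __init__(self, blocks):
        self._blocks = iter(blocks)   # hypothetical source of disk blocks
        self._next = None
        self._thread = None
        self._prefetch()

    def _prefetch(self):
        def load():
            self._next = next(self._blocks, None)
        self._thread = threading.Thread(target=load)
        self._thread.start()

    def next_block(self):
        self._thread.join()           # wait only if the prefetch is unfinished
        block, self._next = self._next, None
        if block is not None:
            self._prefetch()          # start filling the other buffer at once
        return block
```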

However, the scheme of Example 2.13 would still take as much time as is required to read all the blocks of the sorted sublists, which is 250,000 blocks. We could combine prefetching with the cylinder-based strategy of Section 2.4.1 if we:

1. Store the sorted sublists on whole, consecutive cylinders, with the blocks on each track being consecutive blocks of the sorted sublist.


2.4. IMPROVING THE ACCESS TIME OF SECONDARY STORAGE 59

2. Read whole tracks or whole cylinders whenever we need some more records from a given list.

Example 2.14: To appreciate the benefit of track-sized or cylinder-sized reads, again let us consider the second phase of the Two-Phase, Multiway Merge-Sort. We have room in main memory for two track-sized buffers for each of the 20 lists. Recall that a track of the Megatron 747 holds 128K bytes, so the total space requirement is about 5 megabytes of main memory. We can read a track starting at any sector, so the time to read a track is essentially the average seek time plus the time for the disk to rotate once, or 6.5 + 15.6 = 22.1 milliseconds. Since we must read all the blocks on 1000 cylinders, or 8000 tracks, to read the 20 sorted sublists, the total time for reading of all data is about 2.95 minutes.

We can do even better if we use two cylinder-sized buffers per sorted sublist, and fill one while the other is being used. Since there are eight tracks on a cylinder of the Megatron 747, we use 40 buffers of a megabyte each. With 50 megabytes available for the sort, we have enough room in main memory to do so. Using cylinder-sized buffers, we need only do a seek once per cylinder. The time to seek and read all eight tracks of a cylinder is thus 6.5 + 8 × 15.6 = 131.3 milliseconds. The time to read all 1000 cylinders is 1000 times as long, or about 2.19 minutes. □
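The arithmetic of Example 2.14 can be laid out explicitly, using the Megatron 747 figures from the text (6.5 ms average seek, 15.6 ms per revolution, 8 tracks per cylinder, and 1000 cylinders, i.e., 8000 tracks, of sorted-sublist data):

```python
# Timing comparison of track-sized vs. cylinder-sized reads in Example 2.14.
SEEK_MS, REV_MS, TRACKS_PER_CYL = 6.5, 15.6, 8

track_read_ms = SEEK_MS + REV_MS                 # 22.1 ms per track
track_total_min = 8000 * track_read_ms / 60_000  # ~2.95 minutes for 8000 tracks

cyl_read_ms = SEEK_MS + TRACKS_PER_CYL * REV_MS  # 131.3 ms per cylinder
cyl_total_min = 1000 * cyl_read_ms / 60_000      # ~2.19 minutes for 1000 cylinders

print(round(track_total_min, 2), round(cyl_total_min, 2))
```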

The ideas just discussed for reading have their analogs for writing. In the spirit of prefetching, we can delay the writing of buffered blocks, as long as we don't need to reuse the buffer immediately. This strategy allows us to avoid delays while we wait for a block to be written out.

However, much more powerful is the strategy of using large output buffers, track-sized or cylinder-sized. If our application permits us to write in such large chunks, then we can essentially eliminate seek time and rotational latency, and write to disk at the maximum transfer rate of the disk. For instance, if we modified the writing part of the second phase of our sorting algorithm so there were two output buffers of a megabyte each, then we could fill one buffer with sorted records, and write it to a cylinder at the same time we filled the other output buffer with the next sorted records. Then, the writing time would be 2.19 minutes, like the reading time in Example 2.14, and the entire phase 2 would take 4.3 minutes, just like the improved phase 1 of Example 2.9. In essence, a combination of the tricks of cylinderization and cylinder-sized buffering and prefetching has given us a sort in 8.6 minutes that takes over 4 hours by a naive disk-management strategy.

2.4.6 Summary of Strategies and Tradeoffs

We have seen five different "tricks" that can sometimes improve the performance of a disk system. They are:

1. Organizing data by cylinders.

2. Using several disks in place of one.



3. Mirroring disks.

4. Scheduling requests by the elevator algorithm.

5. Prefetching data in track- or cylinder-sized chunks.

We also considered their effect on two situations, which represent the extremes of disk-access requirements:

a) A very regular situation, exemplified by phase 1 of the Two-Phase, Multiway Merge-Sort, where blocks can be read and written in a sequence that can be predicted in advance, and there is only one process using the disk.

b) A collection of short processes, such as airline reservations or bank-account changes, that execute in parallel, share the same disk(s), and cannot be predicted in advance. Phase 2 of Two-Phase, Multiway Merge-Sort has some of these characteristics.

Below we summarize the advantages and disadvantages of each of these methods for these applications and those in between.

Cylinder-Based Organization

• Advantage: Excellent for type (a) applications, where accesses can be predicted in advance, and only one process is using the disk.

• Disadvantage: No help for type (b) applications, where accesses are unpredictable.

Multiple Disks

• Advantage: Increases the rate at which read/write requests can be satisfied, for both types of applications.

• Problem: Read or write requests for the same disk cannot be satisfied at the same time, so the speedup factor may be less than the factor by which the number of disks increases.

• Disadvantage: The cost of several small disks exceeds the cost of a single disk with the same total capacity.

Mirroring

• Advantage: Increases the rate at which read/write requests can be satisfied, for both types of applications; does not have the problem of colliding accesses mentioned for multiple disks.

• Advantage: Improves fault tolerance for all applications.

• Disadvantage: We must pay for two or more disks but get the storage capacity of only one.



Cylinder of Request    First time available

1000 66000 1500 10

5000 20

Figure 2.15: Arrival times for six block-access requests

Elevator Algorithm

• Advantage: Reduces the average time to read/write blocks when the accesses to blocks are unpredictable.

• Problem: The algorithm is most effective in situations where there are many waiting disk-access requests and therefore the average delay for the requesting processes is high.
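The elevator algorithm itself can be sketched briefly. The following function (an illustration, not from the text) services pending cylinder requests by sweeping the head in one direction and reversing only when no requests remain ahead of it:

```python
# A minimal sketch of the elevator algorithm over a fixed set of pending
# requests: service requests in cylinder order along the current sweep
# direction; reverse direction when nothing lies ahead.

def elevator_order(head, requests):
    """Return the order in which pending cylinder requests are serviced."""
    pending = sorted(set(requests))
    order = []
    up = True
    while pending:
        ahead = [c for c in pending if (c >= head if up else c <= head)]
        if not ahead:
            up = not up               # reverse direction at the end of a sweep
            continue
        nxt = min(ahead) if up else max(ahead)
        order.append(nxt)
        pending.remove(nxt)
        head = nxt
    return order

print(elevator_order(8000, [2000, 6000, 14000, 4000]))
```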

Prefetching/Double Buffering

• Advantage: Speeds up access when the needed blocks are known but the timing of requests is data-dependent, as in phase 2 of the multiway merge-sort.

• Disadvantage: Requires extra main-memory buffers. No help when accesses are random.

2.4.7 Exercises for Section 2.4

Exercise 2.4.1: Suppose we are scheduling I/O requests for a Megatron 747 disk, and the requests in Fig. 2.15 are made, with the head initially at track 4000. At what time is each request serviced fully if:

a) We use the elevator algorithm (it is permissible to start moving in either direction at first).

b) We use first-come, first-served scheduling.

*! Exercise 2.4.2: Suppose we use two Megatron 747 disks as mirrors of one another. However, instead of allowing reads of any block from either disk, we keep the head of the first disk in the inner half of the cylinders, and the head of the second disk in the outer half of the cylinders. Assuming read requests are on random tracks, and we never have to write:

a) What is the average rate at which this system can read blocks?



b) How does this rate compare with the average rate for mirrored Megatron 747 disks with no restriction?

c) What disadvantages do you foresee for this system?

! Exercise 2.4.3: Let us explore the relationship between the arrival rate of requests, the throughput of the elevator algorithm, and the average delay of requests. To simplify the problem, we shall make the following assumptions:

1. A pass of the elevator algorithm always proceeds from the innermost to outermost track, or vice-versa, even if there are no requests at the extreme cylinders.

2. When a pass starts, only those requests that are already pending will be honored, not requests that come in while the pass is in progress, even if the head passes their cylinder.8

3. There will never be two requests for blocks on the same cylinder waiting on one pass.

Let A be the interarrival rate, that is, the time between requests for block accesses. Assume that the system is in steady state, that is, it has been accepting and answering requests for a long time. For a Megatron 747 disk, compute as a function of A:

* a) The average time taken to perform one pass.

b) The number of requests serviced on one pass.

c) The average time a request waits for service.

*!! Exercise 2.4.4: In Example 2.10, we saw how dividing the data to be sorted among four disks could allow more than one block to be read at a time. On the assumption that the merging phase generates read requests for blocks that are on random disks, and that all read requests are honored unless they are for a disk that is currently serving another request, what is the average number of disks that are serving requests at any time? Note: there are two observations that simplify the problem:

1. Once a request to read a block is unable to proceed, then the merging must stop and generate no more requests, because there is no data from the exhausted sublist that generated the unsatisfied read request.

8The purpose of this assumption is to avoid having to deal with the fact that a typical pass of the elevator algorithm goes fast at first, as there will be few waiting requests where the head has recently been, and slows down as it moves into an area of the disk where it has not recently been. The analysis of the way request density varies during a pass is an interesting exercise in its own right.


2.5. DISK FAILURES 63

2. As soon as merging is able to proceed, it will generate a read request, since main-memory merging takes essentially no time compared with the time to satisfy a read request.

! Exercise 2.4.5: If we are to read k randomly chosen blocks from one cylinder, on the average how far around the cylinder must we go before we pass all of the blocks?

2.5 Disk Failures

In this and the next section we shall consider the ways in which disks can fail and what can be done to mitigate these failures.

1. The most common form of failure is an intermittent failure, where an attempt to read or write a sector is unsuccessful, but with repeated tries we are able to read or write successfully.

2. A more serious form of failure is one in which a bit or bits are permanently corrupted, and it becomes impossible to read a sector correctly no matter how many times we try. This form of error is called media decay.

3. A related type of error is a write failure, where we attempt to write a sector, but we can neither write successfully nor can we retrieve the previously written sector. A possible cause is that there was a power outage during the writing of the sector.

4. The most serious form of disk failure is a disk crash, where the entire disk becomes unreadable, suddenly and permanently.

In this section we consider a simple model of disk failures. We cover parity checks as a way to detect intermittent failures. We also discuss "stable storage," a technique for organizing a disk so that media decays or failed writes do not result in permanent loss. In Section 2.6 we examine techniques collectively known as "RAID" for coping with disk crashes.

2.5.1 Intermittent Failures

Disk sectors are ordinarily stored with some redundant bits, which we discuss in Section 2.5.2. The purpose of these bits is to enable us to tell whether what we are reading from the sector is correct or not; they similarly allow us to tell whether a sector we wrote has been written correctly.

A useful model of disk reads is that the reading function returns a pair (w, s), where w is the data in the sector that is read, and s is a status bit that tells whether or not the read was successful; i.e., whether or not we can rely on w being the true contents of the sector. In an intermittent failure, we may get a status "bad" several times, but if the read function is repeated enough times



(100 times is a typical limit), then eventually a status "good" will be returned, and we rely on the data returned with this status to be the correct contents of the disk sector. As we shall see in Section 2.5.2, there is a chance that we are being fooled; the status is "good" but the data returned is wrong. However, we can make the probability of this event as low as we like, if we add more redundancy to the sectors.

Writing of sectors can also profit by observing the status of what we write. As we mentioned in Section 2.2.5, we may try to read each sector after we write it and determine whether the write was successful. A straightforward way to perform the check is to read the sector and compare it with the sector we intended to write. However, instead of performing the complete comparison at the disk controller, it is simpler to attempt to read the sector and see if its status is "good." If so, we assume the write was correct, and if the status is "bad" then the write was apparently unsuccessful and must be repeated. Notice that, as with reading, we can be fooled if the status is "good" but the write was actually unsuccessful. Also as with reading, we can make the probability of such an event as small as we like.
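The write-verify discipline just described can be sketched as a retry loop. The `read_sector`/`write_sector` primitives below are hypothetical stand-ins for the disk controller's interface:

```python
# Sketch of read-after-write verification: after writing, re-read the
# sector and trust its status bit; retry a bounded number of times before
# declaring an apparent media failure.

MAX_TRIES = 100   # the text cites 100 as a typical retry limit

def reliable_write(disk, sector_no, data, max_tries=MAX_TRIES):
    for _ in range(max_tries):
        disk.write_sector(sector_no, data)
        w, status = disk.read_sector(sector_no)   # (contents, "good"/"bad")
        if status == "good":
            return True        # checksum verified; assume the write succeeded
    return False               # apparent media failure; caller must fix up
```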

2.5.2 Checksums

How a reading operation can determine the good/bad status of a sector may appear mysterious at first. Yet the technique used in modern disk drives is quite simple: each sector has some additional bits, called the checksum, that are set depending on the values of the data bits stored in that sector. If, on reading, we find that the checksum is not proper for the data bits, then we return status "bad"; otherwise we return "good." While there is a small probability that the data bits may be misread, but the incorrect bits happen to have the same checksum as the correct bits (and therefore incorrect bits will be given the status "good"), by using a sufficiently large number of checksum bits, we can reduce this probability to whatever small level we wish.

A simple form of checksum is based on the parity of all the bits in the sector. If there is an odd number of 1's among a collection of bits, we say the bits have odd parity, or that their parity bit is 1. Similarly, if there is an even number of 1's among the bits, then we say the bits have even parity, or that their parity bit is 0. As a result:

• The number of 1's among a collection of bits and their parity bit is always even.

When we write a sector, the disk controller can compute the parity bit and append it to the sequence of bits written in the sector. Thus, every sector will have even parity.

Example 2.15: If the sequence of bits in a sector were 01101000, then there is an odd number of 1's, so the parity bit is 1. If we follow this sequence by its parity bit we have 011010001. If the given sequence of bits were 11101110, we have an even number of 1's, and the parity bit is 0. The sequence followed by its parity bit is 111011100. Note that each of the nine-bit sequences constructed by adding a parity bit has even parity. □
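The parity computation of Example 2.15 is a one-liner in code:

```python
# Parity-bit computation matching Example 2.15: the parity bit is chosen
# so that the data bits plus the parity bit have an even number of 1's.

def parity_bit(bits: str) -> str:
    """Return '1' if the bit string has an odd number of 1's, else '0'."""
    return str(bits.count("1") % 2)

print(parity_bit("01101000"))  # 1, since there are three 1's
print(parity_bit("11101110"))  # 0, since there are six 1's
```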

Any one-bit error in reading or writing the bits and their parity bit results in a sequence of bits that has odd parity; i.e., the number of 1's is odd. It is easy for the disk controller to count the number of 1's and to determine the presence of an error if a sector has odd parity.

Of course, more than one bit of the sector may be corrupted. If so, the probability is 50% that the number of 1-bits will be even, and the error will not be detected. We can increase our chances of detecting even large numbers of errors if we keep several parity bits. For example, we could keep eight parity bits, one for the first bit of every byte, one for the second bit of every byte, and so on, up to the eighth and last bit of every byte. Then, on a massive error, the probability is 50% that any one parity bit will detect an error, and the chance that none of the eight do so is only one in 2^8, or 1/256. In general, if we use n independent bits as a checksum, then the chance of missing an error is only 1/2^n. For instance, if we devote 4 bytes to a checksum, then there is only one chance in about four billion that the error will go undetected.
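The eight interleaved parity bits described above amount to XORing the bytes of the sector together, since the j-th bit of an XOR accumulator is exactly the parity of the j-th bits of all the bytes:

```python
# Sketch of the interleaved checksum: one parity bit per bit position of
# the bytes in a sector, computed for all eight positions at once by XOR.

def interleaved_parity(sector: bytes) -> int:
    """Byte whose j-th bit is the parity of the j-th bits of all bytes."""
    checksum = 0
    for byte in sector:
        checksum ^= byte      # XOR accumulates per-position parity
    return checksum
```

A single-bit error in any byte changes the checksum and is caught; flipping the same bit position in two different bytes is missed, which is why more independent checksum bits lower the miss probability.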

2.5.3 Stable Storage

While checksums will almost certainly detect the existence of a media failure or a failure to read or write correctly, they do not help us correct the error. Moreover, when writing we could find ourselves in a position where we overwrite the previous contents of a sector and yet cannot read the new contents. That situation could be serious in an application where, say, we were adding a small increment to an account balance and have now lost both the original balance and the new balance. If we could be assured that the contents of the sector contained either the new or old balance, then we would only have to determine whether the write was successful or not.

To deal with the problems above, we can implement a policy known as stable storage on a disk or on several disks. The general idea is that sectors are paired, and each pair represents one sector-contents X. We shall refer to the pair of sectors representing X as the "left" and "right" copies, XL and XR. We continue to assume that the copies are written with a sufficient number of parity-check bits so that we can rule out the possibility that a bad sector looks good when the parity checks are considered. Thus, we shall assume that if the read function returns (w, good) for either XL or XR, then w is the true value of X. The stable-storage writing policy is:

1. Write the value of X into XL. Check that the value has status "good"; i.e., the parity-check bits are correct in the written copy. If not, repeat the write. If after a set number of write attempts, we have not successfully written X into XL, assume that there is a media failure in this sector. A fix-up such as substituting a spare sector for XL must be adopted.



2. Repeat (1) for XR.

The stable-storage reading policy is:

1. To obtain the value of X, read XL. If status "bad" is returned, repeat the read a set number of times. If a value with status "good" is eventually returned, take that value as X.

2. If we cannot read XL, repeat (1) with XR.
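The two policies can be sketched together. The sector interface below is a hypothetical stand-in: reads return a (contents, status) pair, and each logical value X is stored as a left/right pair of sectors:

```python
# Sketch of the stable-storage read and write policies described above.

RETRIES = 100   # bounded retry count for a "bad" status

def stable_read(read_sector, left, right):
    """Try X_L up to RETRIES times, then fall back to X_R."""
    for copy in (left, right):
        for _ in range(RETRIES):
            w, status = read_sector(copy)
            if status == "good":
                return w
    raise IOError("both copies unreadable")

def stable_write(write_and_verify, left, right, value):
    """Write and verify X_L first; only then write X_R the same way."""
    for copy in (left, right):
        if not write_and_verify(copy, value):
            raise IOError("media failure; substitute a spare sector")
```

Writing the left copy completely before touching the right copy is what guarantees that, after a crash, at least one copy holds either the old or the new value of X.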

2.5.4 Error-Handling Capabilities of Stable Storage

The policies described in Section 2.5.3 are capable of compensating for several different kinds of errors. We shall outline them here.

1. Media failures. If, after storing X in sectors XL and XR, one of them undergoes a media failure and becomes permanently unreadable, we can always read X from the other. If XR has failed but XL has not, then the read policy will correctly read XL and not even look at XR; we shall discover that XR has failed when we next try to write a new value for X. If only XL has failed, then we shall not be able to get a "good" status for X in any of our attempts to read XL (recall that we assume a bad sector will always return status "bad," even though in reality there is a tiny chance that "good" will be returned because all the parity-check bits happen to match). Thus, we proceed to step (2) of the read algorithm and correctly read X from XR. Note that if both XL and XR have failed, then we cannot read X, but the probability of both failing is extremely small.

2. Write failure. Suppose that as we write X, there is a system failure (e.g., a power outage). It is possible that X will be lost in main memory, and also the copy of X being written at the time will be garbled. For example, half the sector may be written with part of the new value of X, while the other half remains as it was. When the system becomes available and we examine XL and XR, we are sure to be able to determine either the old or new value of X. The possible cases are:

(a) The failure occurred as we were writing XL. Then we shall find that the status of XL is "bad." However, since we never got to write XR, its status will be "good" (unless there is a coincident media failure at XR, which we rule out as extremely unlikely). Thus, we can obtain the old value of X. We may also copy XR into XL to repair the damage to XL.

(b) The failure occurred after we wrote XL. Then we expect that XL will have status "good," and we may read the new value of X from XL. Note that XR may have status "bad," and we should copy XL into XR if so.


2.6. RECOVERY FROM DISK CRASHES 67

2.5.5 Exercises for Section 2.5

Exercise 2.5.1: Compute the parity bit for the following bit sequences:

* a) 00111011.

b) 00000000.

c) 10101101.

Exercise 2.5.2: We can have two parity bits associated with a string if we follow the string by one bit that is a parity bit for the odd positions and a second that is the parity bit for the even positions. For each of the strings in Exercise 2.5.1, find the two bits that serve in this way.

2.6 Recovery from Disk Crashes

In this section, we shall consider the most serious mode of failure for disks: the "head crash," where data is permanently destroyed. In this event, if data is not backed up on another medium, such as a tape backup system, or on a mirror disk as we discussed in Section 2.4.3, then there is nothing we can do to recover the data. This situation represents a disaster for major DBMS applications, such as banking and other financial applications, airline or other reservation-booking databases, inventory-management systems, and many others.

There are a number of schemes that have been developed to reduce the risk of data loss by disk crashes. They generally involve redundancy, extending the idea of parity checks, as discussed in Section 2.5.2, or duplicated sectors, as in Section 2.5.3. The common term for this class of strategies is RAID, or Redundant Arrays of Independent Disks.9 Here, we shall discuss primarily three schemes, called RAID levels 4, 5, and 6. These RAID schemes also handle failures in the modes discussed in Section 2.5: media failures and corruption of data in a single sector due to a temporary system failure.

2.6.1 The Failure Model for Disks

To begin our discussion of disk crashes, we need to consider first the statistics of such failures. The simplest way to describe failure behavior is through a measurement known as mean time to failure. This number is the length of time by which 50% of a population of disks will have failed catastrophically, i.e., had a head crash so they are no longer readable. For modern disks, the mean time to failure is about 10 years.

The simplest way to use this number is to assume that failures occur linearly. That is, if 50% have failed by 10 years, then 5% will fail in the first year, 5% in the second, and so on, for 20 years. More realistically, the survival

9Previously, the acronym RAID was translated as "Redundant Array of Inexpensive Disks," and this meaning may still appear in literature.



percentage of disks looks more like Fig. 2.16. As for most types of electronic equipment, many disk failures occur early in the life cycle, due to tiny defects in the manufacture of that disk. Hopefully, most of these are caught before the disk leaves the factory, but some do not show up for months. A disk that does not suffer an early failure will probably serve for many years. Later in the life cycle, factors such as "wear-and-tear" and the accumulated effects of tiny dust particles increase the chances of a crash.

Figure 2.16: A survival rate curve for disks

However, the mean time to a disk crash does not have to be the same as the mean time to data loss. The reason is that there are a number of schemes available for assuring that if one disk fails, there are others to help recover the data of the failed disk. In the balance of this section, we shall study the most common schemes.

Each of these schemes starts with one or more disks that hold the data (we'll call these the data disks) and adds one or more disks that hold information that is completely determined by the contents of the data disks. The latter are called redundant disks. When there is a disk crash of either a data disk or a redundant disk, the other disks can be used to restore the failed disk, and there is no permanent information loss.

2.6.2 Mirroring as a Redundancy Technique

The simplest scheme is to mirror each disk, as discussed in Section 2.4.3. We shall call one of the disks the data disk, while the other is the redundant disk; which is which doesn't matter in this scheme. Mirroring, as a protection against data loss, is often referred to as RAID level 1. It gives a mean time to memory loss that is much greater than the mean time to disk failure, as the following example illustrates. Essentially, with mirroring and the other redundancy schemes we discuss, the only way data can be lost is if there is a second disk crash while the first crash is being repaired.



Example 2.16: Suppose each disk has a 10 year mean time to failure. We shall use the linear model of failures described in Section 2.6.1, which means that the chance a disk will fail is 5% per year. If disks are mirrored, then when a disk fails, we have only to replace it with a good disk and copy the mirror disk to the new one. At the end, we have two disks that are mirrors of each other, and the system is restored to its former state.

The only thing that could go wrong is that during the copying the mirror disk fails. Now, both copies of at least part of the data have been lost, and there is no way to recover.

But how often will this sequence of events occur? Suppose that the process of replacing the failed disk takes 3 hours, which is 1/8 of a day, or 1/2920 of a year. Since we assume a failure rate of 5% per year, the probability that the mirror disk will fail during copying is (1/20) × (1/2920), or one in 58,400. If one disk fails every 10 years, then one of the two disks will fail once in 5 years on the average. One in every 58,400 of these failures results in data loss. Put another way, the mean time to a failure involving data loss is 5 × 58,400 = 292,000 years. □
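The arithmetic of Example 2.16 can be spelled out step by step:

```python
# Mean time to data loss for mirrored disks, as in Example 2.16: a
# 5%-per-year failure chance, and a 3-hour repair window during which
# the surviving mirror must also fail for data to be lost.

p_fail_per_year = 1 / 20            # linear failure model: 5% per year
repair_years = 1 / 2920             # 3 hours, as a fraction of a year

p_loss_given_failure = p_fail_per_year * repair_years   # 1 in 58,400

years_between_failures = 5          # one of two 10-year disks fails every 5 years
mean_years_to_data_loss = years_between_failures / p_loss_given_failure

print(round(mean_years_to_data_loss))  # 292000
```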

2.6.3 Parity Blocks

While mirroring disks is an effective way to reduce the probability of a disk crash involving data loss, it uses as many redundant disks as there are data disks. Another approach, often called RAID level 4, uses only one redundant disk no matter how many data disks there are. We assume the disks are identical, so we can number the blocks on each disk from 1 to some number n. Of course, all the blocks on all the disks have the same number of bits; for instance, the 4096-byte blocks in our Megatron 747 running example have 8 × 4096 = 32,768 bits. In the redundant disk, the ith block consists of parity checks for the ith blocks of all the data disks. That is, the jth bits of all the ith blocks, including both the data disks and the redundant disk, must have an even number of 1's among them, and we always choose the bit of the redundant disk to make this condition true.

We saw in Example 2.15 how to force the condition to be true. In the redundant disk, we choose bit j to be 1 if an odd number of the data disks have 1 in that bit, and we choose bit j of the redundant disk to be 0 if there are an even number of 1's in that bit among the data disks. The term for this calculation is the modulo-2 sum. That is, the modulo-2 sum of bits is 0 if there are an even number of 1's among those bits, and 1 if there are an odd number of 1's.

Example 2.17: Suppose for sake of an extremely simple example that blocks consist of only one byte (eight bits). Let there be three data disks, called 1, 2, and 3, and one redundant disk, called disk 4. Focus on, say, the first block of all these disks. If the data disks have in their first blocks the following bit sequences:



disk 1: 11110000
disk 2: 10101010
disk 3: 00111000

then the redundant disk will have in block 1 the parity check bits:

disk 4: 01100010

Notice how in each position, an even number of the four 8-bit sequences have 1's. There are two 1's in positions 1, 2, 4, 5, and 7, four 1's in position 3, and zero 1's in positions 6 and 8. □
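In code, the redundant block of Example 2.17 is simply the bitwise modulo-2 (XOR) sum of the data blocks, here represented as 8-bit integers:

```python
# The redundant block is the bitwise modulo-2 sum of the data blocks.

def parity_block(blocks):
    """Modulo-2 sum of equal-length blocks."""
    result = 0
    for b in blocks:
        result ^= b
    return result

d1, d2, d3 = 0b11110000, 0b10101010, 0b00111000
redundant = parity_block([d1, d2, d3])
print(format(redundant, "08b"))  # 01100010, as in the example
```

The same function also performs the recovery trick of Example 2.18: the modulo-2 sum of disks 2, 3, and the redundant block reproduces disk 1.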

Reading

Reading blocks from a data disk is no different from reading blocks from any disk. There is generally no reason to read from the redundant disk, but we could. In some circumstances, we can actually get the effect of two simultaneous reads from one of the data disks; the following example shows how, although the conditions under which it could be used are expected to be rare.

Example 2.18: Suppose we are reading a block of the first data disk, and another request comes in to read a different block, say block 1, of the same data disk. Ordinarily, we would have to wait for the first request to finish. However, if none of the other disks are busy, we could read block 1 from each of them, and compute block 1 of the first disk by taking the modulo-2 sum.

Specifically, if the disks and their first blocks were as in Example 2.17, then we could read the second and third data disks and the redundant disk, to get the following blocks:

disk 2: 10101010
disk 3: 00111000
disk 4: 01100010

If we take the modulo-2 sum of the bits in each column, we get

disk 1: 11110000

which is the same as block 1 of the first disk. □

Writing

When we write a new block of a data disk, we need not only to change that block, but we need to change the corresponding block of the redundant disk so it continues to hold the parity checks for the corresponding blocks of all the data disks. A naive approach would read the corresponding blocks of the n data disks, take their modulo-2 sum, and rewrite the block of the redundant disk. That approach requires n - 1 reads of the data blocks not being rewritten,

2.6. RECOVERY FROM DISK CRASHES 71

The Algebra of Modulo-2 Sums

It may be helpful for understanding some of the tricks used with parity checks to know the algebraic rules involving the modulo-2 sum operation on bit vectors. We shall denote this operation ⊕. For instance, 1100 ⊕ 1010 = 0110. Here are some useful rules about ⊕:

• The commutative law: x ⊕ y = y ⊕ x.

• The associative law: x ⊕ (y ⊕ z) = (x ⊕ y) ⊕ z.

• The all-0 vector of the appropriate length, which we denote 0, is the identity for ⊕; that is, x ⊕ 0 = 0 ⊕ x = x.

• ⊕ is its own inverse: x ⊕ x = 0. As a useful consequence, if x ⊕ y = z, then we can "add" x to both sides and get y = x ⊕ z.
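These rules can be checked directly in Python, where the ^ operator computes the modulo-2 sum of integers viewed as bit vectors (the particular values below are our own choice of example):

```python
# Bit vectors as integers; ^ is the modulo-2 sum.
x, y, z = 0b1100, 0b1010, 0b0110

assert x ^ y == y ^ x              # commutative law
assert x ^ (y ^ z) == (x ^ y) ^ z  # associative law
assert x ^ 0 == x                  # all-0 vector is the identity
assert x ^ x == 0                  # each vector is its own inverse
# Consequence: since x ^ y == z, "adding" x to both sides gives y == x ^ z.
assert x ^ y == z and y == x ^ z
print("all modulo-2 sum identities hold")
```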

a write of the data block that is rewritten, and a write of the block of the redundant disk. The total is thus n + 1 disk I/O's.

A better approach is to look only at the old and new versions of the data block i being rewritten. If we take their modulo-2 sum, we know in which positions there is a change in the number of 1's among the blocks numbered i on all the disks. Since these changes are always by one, any even number of 1's changes to an odd number. If we change the same positions of the redundant block, then the number of 1's in each position becomes even again. We can perform these calculations using four disk I/O's:

1. Read the old value of the data block being changed.

2. Read the corresponding block of the redundant disk.

3. Write the new data block.

4. Recalculate and write the block of the redundant disk.
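The recalculation in step 4 is itself a single modulo-2 sum: the new redundant block is the old redundant block ⊕ the old data block ⊕ the new data block. A Python sketch (blocks represented as integers; the function name is ours):

```python
def raid4_new_parity(old_data, new_data, old_parity):
    """New redundant block after one data block is rewritten.
    ^ is the modulo-2 sum; old_data ^ new_data marks the changed positions,
    and flipping those same positions of the parity block restores even parity."""
    return old_parity ^ old_data ^ new_data

# Example 2.19's numbers: disk 2 changes from 10101010 to 11001100,
# and the old redundant block is 01100010.
new_p = raid4_new_parity(0b10101010, 0b11001100, 0b01100010)
print(format(new_p, "08b"))  # → 00000100
```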

Example 2.19: Suppose the three first blocks of the data disks are as in Example 2.17:

disk 1: 11110000
disk 2: 10101010
disk 3: 00111000

Suppose also that the block on the second disk changes from 10101010 to 11001100. We take the modulo-2 sum of the old and new values of the block


on disk 2, to get 01100110. That tells us we must change positions 2, 3, 6, and 7 of the first block of the redundant disk. We read that block: 01100010. We replace this block by a new block that we get by changing the appropriate positions; in effect we replace the redundant block by the modulo-2 sum of itself and 01100110, to get 00000100. Another way to express the new redundant block is that it is the modulo-2 sum of the old and new versions of the block being rewritten and the old value of the redundant block. In our example, the first blocks of the four disks — three data disks and one redundant — have become

disk 1: 11110000
disk 2: 11001100
disk 3: 00111000
disk 4: 00000100

after the write to the block on the second disk and the necessary recomputation of the redundant block. Notice that in the blocks above, each column continues to have an even number of 1's.

Incidentally, notice that this write of a data block, like all writes of data blocks using the scheme described above, takes four disk I/O's. The naive scheme — read all but the rewritten block and recompute the redundant block directly — would also require four disk I/O's in this example: two to read data from the first and third data disks, and two to write the second data disk and the redundant disk. However, if we had more than three data disks, the number of I/O's for the naive scheme rises linearly with the number of data disks, while the scheme advocated here continues to require only four. □

Failure Recovery

Now, let us consider what we would do if one of the disks crashed. If it is the redundant disk, we swap in a new disk, and recompute the redundant blocks. If the failed disk is one of the data disks, then we need to swap in a good disk and recompute its data from the other disks. The rule for recomputing any missing data is actually simple, and doesn't depend on which disk, data or redundant, has failed. Since we know that the number of 1's among corresponding bits of all disks is even, it follows that:

• The bit in any position is the modulo-2 sum of all the bits in the corresponding positions of all the other disks.

If one doubts the above rule, one has only to consider the two cases. If the bit in question is 1, then the number of corresponding bits that are 1 must be odd, so their modulo-2 sum is 1. If the bit in question is 0, then there are an even number of 1's among the corresponding bits, and their modulo-2 sum is 0.

Example 2.20: Suppose that disk 2 fails. We need to recompute each block of the replacement disk. Following Example 2.17, let us see how to recompute


the first block of the second disk. We are given the corresponding blocks of the first and third data disks and the redundant disk, so the situation looks like:

disk 1: 11110000
disk 2: ????????
disk 3: 00111000
disk 4: 01100010

If we take the modulo-2 sum of each column, we deduce that the missing block is 10101010, as was initially the case in Example 2.17. □
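This recovery rule is one line of code: the missing block is the modulo-2 sum of all the surviving blocks. A Python sketch (blocks as integers; the function name is ours):

```python
def recover_block(surviving_blocks):
    """Reconstruct a missing block as the modulo-2 sum of all the others."""
    missing = 0
    for b in surviving_blocks:
        missing ^= b  # ^ is the modulo-2 sum
    return missing

# Example 2.20: disks 1, 3, and the redundant disk 4 survive; disk 2 is lost.
print(format(recover_block([0b11110000, 0b00111000, 0b01100010]), "08b"))
# → 10101010, as in Example 2.17
```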

2.6.4 An Improvement: RAID 5

The RAID level 4 strategy described in Section 2.6.3 effectively preserves data unless there are two, almost-simultaneous disk crashes. However, it suffers from a bottleneck defect that we can see when we re-examine the process of writing a new data block. Whatever scheme we use for updating the disks, we need to read and write the redundant disk's block. If there are n data disks, then the number of disk writes to the redundant disk will be n times the average number of writes to any one data disk.

However, as we observed in Example 2.20, the rule for recovery is the same for data disks and redundant disks: take the modulo-2 sum of corresponding bits of the other disks. Thus, we do not have to treat one disk as the redundant disk and the others as data disks. Rather, we could treat each disk as the redundant disk for some of the blocks. This improvement is often called RAID level 5.

For instance, if there are n + 1 disks numbered 0 through n, we could treat the ith cylinder of disk j as redundant if j is the remainder when i is divided by n + 1.

Example 2.21: If, as in our running example, n = 3 so there are 4 disks, the first disk, numbered 0, would be redundant for its cylinders numbered 4, 8, 12, and so on, because these are the numbers that leave remainder 0 when divided by 4. The disk numbered 1 would be redundant for blocks numbered 1, 5, 9, and so on; disk 2 is redundant for blocks 2, 6, 10, ...; and disk 3 is redundant for 3, 7, 11, ... .

As a result, the reading and writing load for each disk is the same. If all blocks are equally likely to be written, then for one write, each disk has a 1/4 chance that the block is on that disk. If not, then it has a 1/3 chance that it will be the redundant disk for that block. Thus, each of the four disks is involved in 1/4 + 3/4 × 1/3 = 1/2 of the writes. □
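Both the cylinder-to-redundant-disk mapping and the write-load calculation can be sketched in a few lines of Python (the names below are ours, chosen for illustration):

```python
from fractions import Fraction

n = 3  # as in the running example: n + 1 = 4 disks, numbered 0 through 3

def redundant_disk(cylinder):
    """Disk j is redundant for cylinder i when j = i mod (n + 1)."""
    return cylinder % (n + 1)

# Disk 0 is redundant for cylinders 0, 4, 8, ...; disk 1 for 1, 5, 9, ...
print([redundant_disk(c) for c in range(8)])  # → [0, 1, 2, 3, 0, 1, 2, 3]

# Fraction of writes touching any one disk: either the block lives on it
# (probability 1/4), or it doesn't (3/4) and this disk is the redundant
# one for that block (1/3 of the remaining three disks).
load = Fraction(1, 4) + Fraction(3, 4) * Fraction(1, 3)
print(load)  # → 1/2
```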

2.6.5 Coping With Multiple Disk Crashes

There is a theory of error-correcting codes that allows us to deal with any number of disk crashes — data or redundant — if we use enough redundant


disks. This strategy leads to the highest RAID "level," RAID level 6. We shall give only a simple example here, where two simultaneous crashes are correctable, and the strategy is based on the simplest error-correcting code, known as a Hamming code.

In our description we focus on a system with seven disks, numbered 1 through 7. The first four are data disks, and disks 5 through 7 are redundant. The relationship between data and redundant disks is summarized by the 3 × 7 matrix of 0's and 1's in Fig. 2.17. Notice that:

a) Every possible column of three 0's and 1's, except for the all-0 column, appears in the matrix of Fig. 2.17.

b) The columns for the redundant disks have a single 1.

c) The columns for the data disks each have at least two 1's.

Data Redundant

Disk number 1 2 3 4 5 6 7

1 1 1 0 1 0 0
1 1 0 1 0 1 0
1 0 1 1 0 0 1

Figure 2.17: Redundancy pattern for a system that can recover from two simultaneous disk crashes

The meaning of each of the three rows of 0's and 1's is that if we look at the corresponding bits from all seven disks, and restrict our attention to those disks that have 1 in that row, then the modulo-2 sum of these bits must be 0. Put another way, the disks with 1 in a given row are treated as if they were the entire set of disks in a RAID level 4 scheme. Thus, we can compute the bits of one of the redundant disks by finding the row in which that disk has 1, and taking the modulo-2 sum of the corresponding bits of the other disks that have 1 in the same row.

For the matrix of Fig. 2.17, this rule implies:

1. The bits of disk 5 are the modulo-2 sum of the corresponding bits of disks 1, 2, and 3.

2. The bits of disk 6 are the modulo-2 sum of the corresponding bits of disks 1, 2, and 4.

3. The bits of disk 7 are the modulo-2 sum of the corresponding bits of disks 1, 3, and 4.
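These three rules can be sketched in Python. Each tuple below lists, for one row of Fig. 2.17, the disks holding a 1 in that row, with the redundant disk written last (the names ROWS and fill_redundant are ours):

```python
# Rows of the Fig. 2.17 matrix, each given as the disks that hold a 1 there.
ROWS = [(1, 2, 3, 5), (1, 2, 4, 6), (1, 3, 4, 7)]

def fill_redundant(blocks):
    """Given blocks for data disks 1-4 (dict: disk -> int), fill in
    disks 5-7 so that each row's modulo-2 sum is 0."""
    for row in ROWS:
        *data, red = row  # the redundant disk is last in each row
        parity = 0
        for d in data:
            parity ^= blocks[d]
        blocks[red] = parity
    return blocks

blocks = fill_redundant({1: 0b11110000, 2: 0b10101010,
                         3: 0b00111000, 4: 0b01000001})
print(format(blocks[5], "08b"), format(blocks[6], "08b"),
      format(blocks[7], "08b"))  # → 01100010 00011011 10001001
```

The printed values match the redundant blocks of Fig. 2.18.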


We shall see shortly that the particular choice of bits in this matrix gives us a simple rule by which we can recover from two simultaneous disk crashes.

Reading

We may read data from any data disk normally. The redundant disks can be ignored.

Writing

The idea is similar to the writing strategy outlined in Section 2.6.4, but now several redundant disks may be involved. To write a block of some data disk, we compute the modulo-2 sum of the new and old versions of that block. These bits are then added, in a modulo-2 sum, to the corresponding blocks of all those redundant disks that have 1 in a row in which the written disk also has 1.
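A Python sketch of this write rule follows. The table REDUNDANT_FOR (our own naming) records, for each data disk, which redundant disks share a row with it in Fig. 2.17:

```python
# For each data disk of Fig. 2.17: the redundant disks sharing a row with it.
REDUNDANT_FOR = {1: (5, 6, 7), 2: (5, 6), 3: (5, 7), 4: (6, 7)}

def raid6_write(blocks, disk, new_block):
    """Rewrite one data block and update the affected redundant disks.
    blocks: dict disk -> int; ^ is the modulo-2 sum."""
    delta = blocks[disk] ^ new_block   # positions whose 1-count changes
    blocks[disk] = new_block
    for r in REDUNDANT_FOR[disk]:
        blocks[r] ^= delta             # keep each row's parity even

# The first blocks of Fig. 2.18; rewrite disk 2 as in Example 2.22.
blocks = {1: 0b11110000, 2: 0b10101010, 3: 0b00111000, 4: 0b01000001,
          5: 0b01100010, 6: 0b00011011, 7: 0b10001001}
raid6_write(blocks, 2, 0b00001111)
print(format(blocks[5], "08b"), format(blocks[6], "08b"))
# → 11000111 10111110 (disk 7 is untouched, since disk 2 has 0 in its row)
```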

Example 2.22: Let us again assume that blocks are only eight bits long, and focus on the first blocks of the seven disks involved in our RAID level 6 example. First, suppose the data and redundant first blocks are as given in Fig. 2.18. Notice that the block for disk 5 is the modulo-2 sum of the blocks for the first three disks, the sixth row is the modulo-2 sum of rows 1, 2, and 4, and the last row is the modulo-2 sum of rows 1, 3, and 4.

Disk | Contents
1) 11110000
2) 10101010
3) 00111000
4) 01000001
5) 01100010
6) 00011011
7) 10001001

Figure 2.18: First blocks of all disks

Suppose we rewrite the first block of disk 2 to be 00001111. If we sum this sequence of bits modulo-2 with the sequence 10101010 that is the old value of this block, we get 10100101. If we look at the column for disk 2 in Fig. 2.17, we find that this disk has 1's in the first two rows, but not the third. Since redundant disks 5 and 6 have 1 in rows 1 and 2, respectively, we must perform the sum modulo-2 operation on the current contents of their first blocks and the sequence 10100101 just calculated. That is, we flip the values of positions 1, 3, 6, and 8 of these two blocks. The resulting contents of the first blocks of all disks is shown in Fig. 2.19. Notice that the new contents continue to satisfy the constraints implied by Fig. 2.17: the modulo-2 sum of corresponding blocks that have 1 in a particular row of the matrix of Fig. 2.17 is still all 0's. □


Disk | Contents
1) 11110000
2) 00001111
3) 00111000
4) 01000001
5) 11000111
6) 10111110
7) 10001001

Figure 2.19: First blocks of all disks after rewriting disk 2 and changing the redundant disks

Failure Recovery

Now, let us see how the redundancy scheme outlined above can be used to correct up to two simultaneous disk crashes. Let the failed disks be a and b. Since all columns of the matrix of Fig. 2.17 are different, we must be able to find some row r in which the columns for a and b are different. Suppose that a has 0 in row r, while b has 1 there.

Then we can compute the correct b by taking the modulo-2 sum of corresponding bits from all the disks other than b that have 1 in row r. Note that a is not among these, so none of them have failed. Having done so, we must recompute a, with all other disks available. Since every column of the matrix of Fig. 2.17 has a 1 in some row, we can use this row to recompute disk a by taking the modulo-2 sum of bits of those other disks with a 1 in this row.

Example 2.23: Suppose that disks 2 and 5 fail at about the same time. Consulting the matrix of Fig. 2.17, we find that the columns for these two disks differ in row 2, where disk 2 has 1 but disk 5 has 0. We may thus reconstruct disk 2 by taking the modulo-2 sum of corresponding bits of disks 1, 4, and 6, the other three disks with 1 in row 2. Notice that none of these three disks has failed. For instance, following from the situation regarding the first blocks in Fig. 2.19, we would initially have the data of Fig. 2.20 available after disks 2 and 5 failed.

If we take the modulo-2 sum of the contents of the blocks of disks 1, 4, and 6, we find that the block for disk 2 is 00001111. This block is correct, as can be verified from Fig. 2.19. The situation is now as in Fig. 2.21.

Now, we see that disk 5's column in Fig. 2.17 has a 1 in the first row. We can therefore recompute disk 5 by taking the modulo-2 sum of corresponding bits from disks 1, 2, and 3, the other three disks that have 1 in the first row. For block 1, this sum is 11000111. Again, the correctness of this calculation can be confirmed by Fig. 2.19. □
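The whole two-crash recovery procedure can be sketched in Python. COL holds the columns of Fig. 2.17, indexed by disk number, and recover_two (our naming) follows exactly the two steps described above:

```python
from functools import reduce
from operator import xor

# Columns of the Fig. 2.17 matrix, indexed by disk number.
COL = {1: (1, 1, 1), 2: (1, 1, 0), 3: (1, 0, 1), 4: (0, 1, 1),
       5: (1, 0, 0), 6: (0, 1, 0), 7: (0, 0, 1)}

def recover_two(blocks, a, b):
    """Recover failed disks a and b in place (blocks: dict disk -> int)."""
    # Find a row where the columns for a and b differ; the failed disk
    # with a 1 there can be rebuilt first, since the other one sits it out.
    r = next(i for i in range(3) if COL[a][i] != COL[b][i])
    if COL[a][r]:
        a, b = b, a  # ensure b is the disk with 1 in row r
    blocks[b] = reduce(xor, (blocks[d] for d in COL
                             if COL[d][r] and d != b))
    r2 = COL[a].index(1)  # any row in which a participates
    blocks[a] = reduce(xor, (blocks[d] for d in COL
                             if COL[d][r2] and d != a))

# Fig. 2.19's first blocks with disks 2 and 5 lost, as in Example 2.23.
blocks = {1: 0b11110000, 2: None, 3: 0b00111000, 4: 0b01000001,
          5: None, 6: 0b10111110, 7: 0b10001001}
recover_two(blocks, 2, 5)
print(format(blocks[2], "08b"), format(blocks[5], "08b"))
# → 00001111 11000111
```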


Additional Observations About RAID Level 6

1. We can combine the ideas of RAID levels 5 and 6, by varying the redundant disks according to the block or cylinder number. Doing so will avoid bottlenecks when writing; the scheme described in Section 2.6.5 will cause bottlenecks at the redundant disks.

2. The scheme described in Section 2.6.5 is not restricted to four data disks. The number of disks can be one less than any power of 2, say 2^k - 1. Of these disks, k are redundant, and the remaining 2^k - k - 1 are data disks, so the redundancy grows roughly as the logarithm of the number of data disks. For any k, we can construct the matrix corresponding to Fig. 2.17 by writing all possible columns of k 0's and 1's, except the all-0's column. The columns with a single 1 correspond to the redundant disks, and the columns with more than one 1 are the data disks.
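The construction in observation 2 above is easy to mechanize. The Python sketch below (raid6_columns is a hypothetical helper of ours) enumerates all nonzero k-bit columns, splitting them into data and redundant columns:

```python
from itertools import product

def raid6_columns(k):
    """All nonzero columns of k bits: the redundancy pattern for a RAID
    level 6 scheme on 2**k - 1 disks, k of which are redundant."""
    cols = [c for c in product((0, 1), repeat=k) if any(c)]
    data = [c for c in cols if sum(c) > 1]        # data-disk columns
    redundant = [c for c in cols if sum(c) == 1]  # single-1 columns
    return data + redundant  # data disks first, then redundant

# k = 3 gives the 7-disk scheme of Fig. 2.17; k = 4 gives 15 disks
# (the generalization asked for in Exercise 2.6.9).
print(len(raid6_columns(3)), len(raid6_columns(4)))  # → 7 15
```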

Disk | Contents
1) 11110000
2) ????????
3) 00111000
4) 01000001
5) ????????
6) 10111110
7) 10001001

Figure 2.20: Situation after disks 2 and 5 fail

2.6.6 Exercises for Section 2.6

Exercise 2.6.1: Suppose we use mirrored disks as in Example 2.16, the failure rate is 4% per year, and it takes 8 hours to replace a disk. What is the mean time to a disk failure involving loss of data?

*! Exercise 2.6.2: Suppose disks have a failure rate of fraction F per year and it takes H hours to replace a disk.

a) If we use mirrored disks, what is the mean time to data loss, as a function of F and H?

b) If we use a RAID level 4 or 5 scheme, with N disks, what is the mean time to data loss?


Disk | Contents
1) 11110000
2) 00001111
3) 00111000
4) 01000001
5) ????????
6) 10111110
7) 10001001

Figure 2.21: After recovering disk 2

!! Exercise 2.6.3: Suppose we use three disks as a mirrored group; i.e., all three hold identical data. If the failure rate for one disk is F per year and it takes H hours to restore a disk, what is the mean time to data loss?

Exercise 2.6.4: Suppose we are using a RAID level 4 scheme with four data disks and one redundant disk. As in Example 2.17, assume blocks are a single byte. Give the block of the redundant disk if the corresponding blocks of the data disks are:

* a) 01010110, 11000000, 00111011, and 11111011.

b) 11110000, 11111000, 00111111, and 00000001.

Exercise 2.6.5: Using the same RAID level 4 scheme as in Exercise 2.6.4, suppose that data disk 1 has failed. Recover the block of that disk under the following circumstances:

* a) The contents of disks 2 through 4 are 01010110, 11000000, and 00111011, while the redundant disk holds 11111011.

b) The contents of disks 2 through 4 are 11110000, 11111000, and 00111111, while the redundant disk holds 00000001.

Exercise 2.6.6: Suppose the block on the first disk in Exercise 2.6.4 is changed to 10101010. What changes to the corresponding blocks on the other disks must be made?

Exercise 2.6.7: Suppose we have the RAID level 6 scheme of Example 2.22, and the blocks of the four data disks are 00111100, 11000111, 01010101, and 10000100, respectively.

a) What are the corresponding blocks of the redundant disks?

b) If the third disk's block is rewritten to be 10000000, what steps must be taken to change other disks?


Error-Correcting Codes and RAID Level 6

There is a broad theory that guides our selection of a suitable matrix, like that of Fig. 2.17, to determine the content of redundant disks. A code of length n is a set of bit-vectors (called code words) of length n. The Hamming distance between two code words is the number of positions in which they differ, and the minimum distance of a code is the smallest Hamming distance of any two different code words.

If C is any code of length n, we can require that the corresponding bits on n disks have one of the sequences that are members of the code. As a very simple example, if we are using a disk and its mirror, then n = 2, and we can use the code C = {00, 11}. That is, the corresponding bits of the two disks must be the same. For another example, the matrix of Fig. 2.17 defines the code consisting of the 16 bit-vectors of length 7 that have arbitrary values for the first four bits and have the remaining three bits determined by the rules for the three redundant disks.

If the minimum distance of a code is d, then disks whose corresponding bits are required to be a vector in the code will be able to tolerate d - 1 simultaneous disk crashes. The reason is that, should we obscure d - 1 positions of a code word, and there were two different ways these positions could be filled in to make a code word, then the two code words would have to differ in at most the d - 1 positions. Thus, the code could not have minimum distance d. As an example, the matrix of Fig. 2.17 actually defines the well-known Hamming code, which has minimum distance 3. Thus, it can handle two disk crashes.

Exercise 2.6.8: Describe the steps taken to recover from the following failures using the RAID level 6 scheme with seven disks:

* a) Disks 1 and 7.

b) Disks 1 and 4.

c) Disks 3 and 6.

Exercise 2.6.9: Find a RAID level 6 scheme using 15 disks, four of which are redundant. Hint: Generalize the 7-disk Hamming matrix.

Exercise 2.6.10: List the 16 code words for the Hamming code of length 7. That is, what are the 16 lists of bits that could be corresponding bits on the seven disks of the RAID level 6 scheme based on the matrix of Fig. 2.17?

Exercise 2.6.11: Suppose we have four disks, of which disks 1 and 2 are data disks, and disks 3 and 4 are redundant. Disk 3 is a mirror of disk 1. Disk 4 holds the parity check bits for the corresponding bits of disks 2 and 3.


a) Express this situation by giving a parity check matrix analogous to Fig. 2.17.

!! b) It is possible to recover from some but not all situations where two disks fail at the same time. Determine for which pairs it is possible to recover and for which pairs it is not.

*! Exercise 2.6.12: Suppose we have eight data disks numbered 1 through 8, and three redundant disks: 9, 10, and 11. Disk 9 is a parity check on disks 1 through 4, and disk 10 is a parity check on disks 5 through 8. If all pairs of disks are equally likely to fail simultaneously, and we want to maximize the probability that we can recover from the simultaneous failure of two disks, then on which disks should disk 11 be a parity check?

!! Exercise 2.6.13: Find a RAID level 6 scheme with ten disks, such that it is possible to recover from the failure of any three disks simultaneously. You should use as many data disks as you can.

2.7 Summary of Chapter 2

✦ Memory Hierarchy: A computer system uses storage components ranging over many orders of magnitude in speed, capacity, and cost per bit. From the smallest/most expensive to largest/cheapest, they are: cache, main memory, secondary memory (disk), and tertiary memory.

✦ Tertiary Storage: The principal devices for tertiary storage are tape cassettes, tape silos (mechanical devices for managing tape cassettes), and "juke boxes" (mechanical devices for managing CD-ROM disks). These storage devices have capacities of many terabytes, but are the slowest available storage devices.

✦ Disks/Secondary Storage: Secondary storage devices are principally magnetic disks with multigigabyte capacities. Disk units have several circular platters of magnetic material, with concentric tracks to store bits. Platters rotate around a central spindle. The tracks at a given radius from the center of a platter form a cylinder.

✦ Blocks and Sectors: Tracks are divided into sectors, which are separated by unmagnetized gaps. Sectors are the unit of reading and writing from the disk. Blocks are logical units of storage used by an application such as a DBMS. Blocks typically consist of several sectors.

✦ Disk Controller: The disk controller is a processor that controls one or more disk units. It is responsible for moving the disk heads to the proper cylinder to read or write a requested track. It also may schedule competing requests for disk access and buffers the blocks to be read or written.


✦ Disk Access Time: The latency of a disk is the time between a request to read or write a block, and the time the access is completed. Latency is caused principally by three factors: the seek time to move the heads to the proper cylinder, the rotational latency during which the desired block rotates under the head, and the transfer time, while the block moves under the head and is read or written.

✦ Moore's Law: A consistent trend sees parameters such as processor speed and capacities of disk and main memory doubling every 18 months. However, disk access times shrink little if at all in a similar period. An important consequence is that the (relative) cost of accessing disk appears to grow as the years progress.

✦ Algorithms Using Secondary Storage: When the data is so large it does not fit in main memory, the algorithms used to manipulate the data must take into account the fact that reading and writing disk blocks between disk and memory often takes much longer than it does to process the data once it is in main memory. The evaluation of algorithms for data in secondary storage thus focuses on the number of disk I/O's required.

✦ Two-Phase, Multiway Merge-Sort: This algorithm for sorting is capable of sorting enormous amounts of data on disk using only two disk reads and two disk writes of each datum. It is the sorting method of choice in most database applications.

✦ Speeding Up Disk Access: There are several techniques for accessing disk blocks faster for some applications. They include dividing the data among several disks (to allow parallel access), mirroring disks (maintaining several copies of the data, also to allow parallel access), organizing data that will be accessed together by tracks or cylinders, and prefetching or double buffering by reading or writing entire tracks or cylinders together.

✦ Elevator Algorithm: We can also speed accesses by queueing access requests and handling them in an order that allows the heads to make one sweep across the disk. The heads stop to handle a request each time they reach a cylinder containing one or more blocks with pending access requests.

✦ Disk Failure Modes: To avoid loss of data, systems must be able to handle errors. The principal types of disk failure are intermittent (a read or write error that will not reoccur if repeated), permanent (data on the disk is corrupted and cannot be properly read), and the disk crash, where the entire disk becomes unreadable.

✦ Checksums: By adding a parity check (an extra bit to make the number of 1's in a bit string even), intermittent failures and permanent failures can be detected, although not corrected.


✦ Stable Storage: By making two copies of all data and being careful about the order in which those copies are written, a single disk can be used to protect against almost all permanent failures of a single sector.

✦ RAID: There are several schemes for using an extra disk or disks to enable data to survive a disk crash. RAID level 1 is mirroring of disks; level 4 adds a disk whose contents are a parity check on corresponding bits of all other disks; level 5 varies the disk holding the parity bit to avoid making the parity disk a writing bottleneck. Level 6 involves the use of error-correcting codes and may allow survival after several simultaneous disk crashes.

2.8 References for Chapter 2

The RAID idea can be traced back to [6] on disk striping. The name and error-correcting capability is from [5].

The model of disk failures in Section 2.5 appears in unpublished work of Lampson and Sturgis [4].

There are several useful surveys of material relevant to this chapter. [2] discusses trends in disk storage and similar systems. A study of RAID systems is in [1]. [7] surveys algorithms suitable for the secondary storage model (block model) of computation.

[3] is an important study of how one optimizes a system involving processor, memory, and disk, to perform specific tasks.

1. P. M. Chen et al., "RAID: high-performance, reliable secondary storage," Computing Surveys 26:2 (1994), pp. 145-186.

2. G. A. Gibson et al., "Strategic directions in storage I/O issues in large-scale computing," Computing Surveys 28:4 (1996), pp. 779-793.

3. J. N. Gray and F. Putzolo, "The five minute rule for trading memory for disk accesses and the 10 byte rule for trading memory for CPU time," Proc. ACM SIGMOD Intl. Conf. on Management of Data (1987), pp. 395-398.

4. B. Lampson and H. Sturgis, "Crash recovery in a distributed data storagesystem," Technical report, Xerox Palo Alto Research Center, 1976.

5. D. A. Patterson, G. A. Gibson, and R. H. Katz, "A case for redundant arrays of inexpensive disks," Proc. ACM SIGMOD Intl. Conf. on Management of Data, pp. 109-116, 1988.

6. K. Salem and H. Garcia-Molina, "Disk striping," Proc. Second Intl. Conf. on Data Engineering, pp. 336-342, 1986.

7. J. S. Vitter, "External memory algorithms," Proc. Seventeenth Annual ACM Symposium on Principles of Database Systems, pp. 119-128, 1998.

