
This paper is included in the Proceedings of the 14th USENIX Conference on File and Storage Technologies (FAST ’16), February 22–25, 2016, Santa Clara, CA, USA. ISBN 978-1-931971-28-7. Open access to the proceedings is sponsored by USENIX.

https://www.usenix.org/conference/fast16/technical-sessions/presentation/shin


Isotope: Transactional Isolation for Block Storage

Ji-Yong Shin, Cornell University
Mahesh Balakrishnan, Yale University
Tudor Marian, Google
Hakim Weatherspoon, Cornell University

Abstract

Existing storage stacks are top-heavy and expect little from block storage. As a result, new high-level storage abstractions – and new designs for existing abstractions – are difficult to realize, requiring developers to implement from scratch complex functionality such as failure atomicity and fine-grained concurrency control. In this paper, we argue that pushing transactional isolation into the block store (in addition to atomicity and durability) is both viable and broadly useful, resulting in simpler high-level storage systems that provide strong semantics without sacrificing performance. We present Isotope, a new block store that supports ACID transactions over block reads and writes. Internally, Isotope uses a new multiversion concurrency control protocol that exploits fine-grained, sub-block parallelism in workloads and offers both strict serializability and snapshot isolation guarantees. We implemented several high-level storage systems over Isotope, including two key-value stores that implement the LevelDB API over a hashtable and B-tree, respectively, and a POSIX filesystem. We show that Isotope’s block-level transactions enable systems that are simple (100s of lines of code), robust (i.e., providing ACID guarantees), and fast (e.g., 415 MB/s for random file writes). We also show that these systems can be composed using Isotope, providing applications with transactions across different high-level constructs such as files, directories and key-value pairs.

1 Introduction

With the advent of multi-core machines, storage systems such as filesystems, key-value stores, graph stores and databases are increasingly parallelized over dozens of cores. Such systems run directly over raw block storage but assume very little about its interface and semantics; usually, the only expectations from the block store are durability and single-operation, single-block linearizability. As a result, each system implements complex code to layer high-level semantics such as atomicity and isolation over the simple block address space. While multiple systems have implemented transactional atomicity within the block store [18, 24, 46, 6, 19], concurrency control has traditionally been delegated to the storage system above the block store.

In this paper, we propose the abstraction of a transactional block store that provides isolation in addition to atomicity and durability. A number of factors make isolation a prime candidate for demotion down the stack.

1) Isolation is general; since practically every storage system has to ensure safety in the face of concurrent accesses, an isolation mechanism implemented within the block layer is broadly useful.

2) Isolation is hard, especially for storage systems that need to integrate fine-grained concurrency control with coarse-grained durability and atomicity mechanisms (e.g., see ARIES [40]); accordingly, it is better provided via a single, high-quality implementation within the block layer.

3) Block-level transactions allow storage systems to effortlessly provide end-user applications with transactions over high-level constructs such as files or key-value pairs.

4) Block-level transactions are oblivious to software boundaries at higher levels of the stack, and can seamlessly span multiple layers, libraries, threads, processes, and interfaces. For example, a single transaction can encapsulate an end application’s accesses to an in-process key-value store, an in-kernel filesystem, and an out-of-process graph store.

5) Finally, multiversion concurrency control (MVCC) [17] provides superior performance and liveness in many cases but is particularly hard to implement for storage systems since it requires them to maintain multiversioned state; in contrast, many block stores (e.g., log-structured designs) are already internally multiversioned.

Block-level isolation is enabled and necessitated by recent trends in storage. Block stores have evolved over time. They are increasingly implemented via a combination of host-side software and device firmware [9, 3]; they incorporate multiple, heterogeneous physical devices under a single address space [59, 56]; they leverage new NVRAM technologies to store indirection metadata; and they provide sophisticated functionality such as virtualization [9, 61], tiering [9], deduplication and wear-leveling. Unfortunately, storage systems such as filesystems continue to assume minimum functionality from the block store, resulting in redundant, complex, and inefficient stacks where layers constantly tussle with each other [61]. A second trend that argues for pushing functionality from the filesystem to a lower layer is the increasing importance of alternative abstractions that can be implemented directly over block storage, such as graphs, key-value pairs [8], tables, caches [53], tracts [42], byte-addressable [14] and write-once [15] address spaces, etc.

To illustrate the viability and benefits of block-level isolation, we built Isotope, a transactional block store that provides isolation (with a choice of strict serializability or snapshot isolation) in addition to atomicity and durability. Isotope is implemented as an in-kernel software module running over commodity hardware, exposing a conventional block read/write interface augmented with beginTX/endTX IOCTLs to demarcate transactions. Transactions execute speculatively and are validated by Isotope on endTX by checking for conflicts. To minimize the possibility of conflict-related aborts, applications can provide information to Isotope about which sub-parts of each 4KB block are read or written, allowing Isotope to perform conflict detection at sub-block granularity.

Internally, Isotope uses an in-memory multiversion index over a persistent log to provide each transaction with a consistent, point-in-time snapshot of a block address space. Reads within a transaction execute against this snapshot, while writes are buffered in RAM by Isotope. When endTX is called, Isotope uses a new MVCC commit protocol to determine if the transaction commits or aborts. The commit/abort decision is a function of the timestamp-ordered stream of recently proposed transactions, as opposed to the multiversion index; as a result, the protocol supports arbitrarily fine-grained conflict detection without requiring a corresponding increase in the size of the index. When transactions commit, their buffered writes are flushed to the log, which is implemented on an array of physical drives [56], and reflected in the multiversion index. Importantly, aborted transactions do not result in any write I/O to persistent storage.

Storage systems built over Isotope are simple, stateless, shim layers that focus on mapping some variable-sized abstraction – such as files, tables, graphs, and key-value pairs – to a fixed-size block API. We describe several such systems in this paper, including a key-value store based on a hashtable index, one based on a B-tree, and a POSIX user-space filesystem. These systems do not have to implement their own fine-grained locking for concurrency control and logging for failure atomicity. They can expose transactions to end applications without requiring any extra code. Storage systems that reside on different partitions of an Isotope volume can be composed with transactions into larger end applications.

Block-level isolation does have its limitations. Storage systems built over Isotope cannot share arbitrary, in-memory soft state such as read caches across transaction boundaries, since it is difficult to update such state atomically based on the outcome of a transaction. Instead, they rely on block-level caching in Isotope by providing hints about which blocks to cache. We found this approach well-suited for both the filesystem application (which cached inode blocks, indirection blocks and allocation maps) and the key-value stores (which cached their index data structures). In addition, information is invariably lost when functionality is implemented at a lower level of the stack: Isotope cannot leverage properties such as commutativity and idempotence while detecting conflicts.

This paper makes the following contributions:

• We revisit the end-to-end argument for storage stacks with respect to transactional isolation, in the context of modern hardware and applications.

• We propose the abstraction of a fully transactional block store that provides isolation, atomicity and durability. While others have explored block-level transactional atomicity [18, 24, 46, 19], this is the first proposal for block-level transactional isolation.

• We realize this abstraction in a system called Isotope via a new MVCC protocol. We show that Isotope exploits sub-block concurrency in workloads to provide a high commit rate for transactions and high I/O throughput.

• We describe storage systems built using Isotope transactions – two key-value stores and a filesystem – and show that they are simple, fast, and robust, as well as composable via Isotope transactions into larger end applications.

2 Motivation

Block-level isolation is an idea whose time has come. In the 90s, the authors of Rio Vista (a system that provided atomic transactions over a persistent memory abstraction) wrote in [36]: “We believe features such as serializability are better handled by higher levels of software... adopting any concurrency control scheme would penalize the majority of applications, which are single-threaded and do not need locking.” Today, applications run on dozens of cores and are multi-threaded by default; isolation is a universal need, not a niche feature.

Isolation is simply the latest addition to a long list of features provided by modern block stores: caching, tiering, mapping, virtualization, deduplication, and atomicity. This explosion of features has been triggered partly by the emergence of software-based block layers, ranging from flash FTLs [3] to virtualized volume managers [9]. In addition, the block-level indirection necessary for many of these features has been made practical and inexpensive by hardware advances in the last decade. In the past, smart block devices such as HP AutoRAID [65] were restricted to enterprise settings due to their reliance on battery-backed RAM; today, SSDs routinely implement indirection in FTLs, using supercapacitors to flush metadata and data on a power failure. Software block stores in turn can store metadata on these SSDs, on raw flash, or on derivatives such as flash-backed RAM [34] and Auto-Commit Memory [7].

What about the end-to-end argument? We argue that block-level isolation passes the litmus test imposed by the end-to-end principle [49] for pushing functionality down the stack: it is broadly useful, efficiently implementable at a lower layer of the stack with negligible performance overhead, and leverages machinery that already exists at that lower layer. The argument regarding utility is obvious: pushing functionality down the stack is particularly useful when it is general enough to be used by the majority of applications, which is undeniably the case for isolation or concurrency control. However, the other motivations for a transactional block store require some justification:

Isolation is hard. Storage systems typically implement pessimistic concurrency control via locks, opening the door to a wide range of aberrant behavior such as deadlocks and livelocks. This problem is exacerbated when developers attempt to extract more parallelism via fine-grained locks, and when these locks interact with coarse-grained failure atomicity and durability mechanisms [40]. Transactions can provide a simpler programming model that supplies isolation, atomicity and durability via a single abstraction. Additionally, transactions decouple the policy of isolation – as expressed through beginTX/endTX calls – from the concurrency control mechanism used to implement it under the hood.

Isolation is harder when exposed to end applications. Storage systems often provide concurrency control APIs over their high-level storage abstractions; for example, NTFS offers transactions over files, while Linux provides file-level locking. Unfortunately, these high-level concurrency control primitives often have complex, weakened, and idiosyncratic semantics [44]; for instance, NTFS provides transactional isolation for accesses to the same file, but not for directory modifications, while a Linux fcntl lock on a file is released when any file descriptor for that file is closed by a process [1]. The complex semantics are typically a reflection of a complex implementation, which has to operate over high-level constructs such as files and directories. In addition, composability is challenging if each storage system implements isolation independently: for example, it is impossible to do a transaction over an NTFS file and a Berkeley DB key-value pair.

Isolation is even harder when multiversion concurrency control is required. In many cases, pessimistic concurrency control is slow and prone to liveness bugs; for example, when locks are exposed to end applications directly or via a transactional interface, the application could hang while holding a lock. Optimistic concurrency control [35] works well in this case, ensuring that other transactions can proceed without waiting for the hung process. Multiversion concurrency control works even better, providing transactions with stable, consistent snapshots (a key property for arbitrary applications that can crash if exposed to inconsistent snapshots [31]); allowing read-only transactions to always commit [17]; and enabling weaker but more performant isolation levels such as snapshot isolation [16].

    /*** Transaction API ***/
    int beginTX();
    int endTX();
    int abortTX();
    //POSIX read/write commands

    /*** Optional API ***/
    //release ongoing transaction and return handle
    int releaseTX();
    //take over a released transaction
    int takeoverTX(int tx_handle);
    //mark byte range accessed by last read/write
    int mark_accessed(off_t blknum, int start, int size);
    //request caching for blocks
    int please_cache(off_t blknum);

Figure 1: The Isotope API.

However, switching to multiversion concurrency control can be difficult for storage systems due to its inherent need for multiversion state. High-level storage systems are not always intrinsically multiversioned (with notable exceptions such as WAFL [33] and other copy-on-write filesystems), making it difficult for developers to switch from pessimistic locking to a multiversion concurrency control scheme. Multiversioning can be particularly difficult to implement for complex data structures used by storage systems such as B-trees, requiring mechanisms such as tombstones [26, 48].

In contrast, multiversioning is relatively easy to implement over the static address space provided by a block store (for example, no tombstones are required since addresses can never be ‘deleted’). Additionally, many block stores are already multiversioned in order to obtain write sequentiality: examples include log-structured disk stores, shingled drives [11] and SSDs.

3 The Isotope API

The basic Isotope API is shown in Figure 1: applications can use standard POSIX calls to issue reads and writes to 4KB blocks, bookended by beginTX/endTX calls. The beginTX call establishes a snapshot; all reads within the transaction are served from that snapshot. Writes within the transaction are speculative. Each transaction can view its own writes, but the writes are not made visible to other concurrent transactions until the transaction commits. The endTX call returns true if the transaction commits, and false otherwise. The abortTX call allows the application to explicitly abort the transaction. The application can choose one of two isolation levels on startup: strict serializability or snapshot isolation.


The Isotope API implicitly associates transaction IDs with user-space threads, instead of augmenting each call signature in the API with an explicit transaction ID that the application supplies. We took this route to allow applications to use the existing, highly optimized POSIX calls to read and write data to the block store. The control API for starting, committing and aborting transactions is implemented via IOCTLs. To allow transactions to execute across different threads or processes, Isotope provides additional APIs via IOCTLs: releaseTX disconnects the association between the current thread and the transaction, and returns a temporary transaction handle. A different thread can call takeoverTX with this handle to associate itself with the transaction.
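As an illustration, the following sketch hands a transaction from one thread to another using releaseTX/takeoverTX. It is a minimal sketch, not code from Isotope: write_block, enqueue_for_worker, and dequeue_work are hypothetical stand-ins for the application’s own I/O and queueing logic.

    /* Thread A: start a transaction, do some work, then detach from it. */
    beginTX();
    write_block(blknum, buf);       /* hypothetical wrapper over a POSIX write */
    int handle = releaseTX();       /* transaction no longer bound to thread A */
    enqueue_for_worker(handle);     /* hand the handle to thread B */

    /* Thread B: attach to the released transaction and finish it. */
    int h = dequeue_work();
    takeoverTX(h);                  /* transaction now bound to thread B */
    write_block(blknum2, buf2);
    if (!endTX())
        retry_or_report();          /* hypothetical abort handling */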

Isotope exposes two other optional calls via IOCTLs. After reading or writing a 4KB block within a transaction, applications can call mark_accessed to explicitly specify the accessed byte range within the block. This information is key for fine-grained conflict detection; for example, a filesystem might mark a single inode within an inode block, or a single byte within a data allocation bitmap. Note that this information cannot be inferred implicitly by comparing the old and new values of the 4KB block; the application might have overwritten parts of the block without changing any bits. The second optional call is please_cache, which lets the application request Isotope to cache specific blocks in RAM; we discuss this call in detail later in the paper. Figure 2 shows a snippet of application code that uses the Isotope API (the setattr function from a filesystem).

If a read or write is issued outside a transaction, it is treated as a singleton transaction. In effect, Isotope behaves like a conventional block device if the reads and writes issued to it are all non-transactional. In addition, Isotope can preemptively abort transactions to prevent buggy or malicious applications from hoarding resources within the storage subsystem. When a transaction is preemptively aborted, any reads, writes, or control calls issued within it will return error codes, except for endTX, which will return false, and abortTX.

Transactions can be nested – i.e., a beginTX/endTX pair can have other pairs nested within it – with the simple semantics that the internal transactions are ignored. A nested beginTX does not establish a new snapshot, and a nested endTX always succeeds without changing the persistent state of the system. A nested abortTX causes any further activity in the transaction to return error codes until all the enclosing abortTX/endTX calls have been made. This behavior is important for allowing storage systems to expose transactions to end-user applications. In the example of the filesystem, if an end-user application invokes beginTX (either directly on Isotope or through a filesystem-provided API) before calling the setattr function in Figure 2 multiple times, the internal transactions within each setattr call are ignored and the entire ensemble of operations will commit or abort.

    isofs_inode_num ino;
    unsigned char *buf;
    //allocate buf, set ino to parameter
    ...
    int blknum = inode_to_block(ino);
    txbegin:
    beginTX();
    if (!read(blknum, buf)) {
        abortTX();
        return EIO;
    }
    mark_accessed(blknum, off, sizeof(inode));
    //update attributes
    ...
    if (!write(blknum, buf)) {
        abortTX();
        return EIO;
    }
    mark_accessed(blknum, off, sizeof(inode));
    if (!endTX()) goto txbegin;

Figure 2: Example application: setattr code for a filesystem built over Isotope.

3.1 Composability

As stated earlier, a primary benefit of a transactional block store is its obliviousness to the structure of the software stack running above it, which can range from a single-threaded application to a composition of multi-threaded application code, library storage systems, out-of-process daemons and kernel modules. The Isotope API is designed to allow block-level transactions to span arbitrary compositions of different types of software modules. We describe some of these composition patterns in the context of a simple photo storage application called ImgStore, which stores photos and their associated metadata in a key-value store.

In the simplest case, ImgStore can store images and various kinds of metadata as key-value pairs in IsoHT, which in turn is built over an Isotope volume using transactions. Here, a single transaction-oblivious application (ImgStore) runs over a single transaction-aware library-based storage system (IsoHT).

Cross-Layer: ImgStore may want to atomically update multiple key-value pairs in IsoHT; for example, when a user is tagged in a photo, ImgStore may want to update a photo-to-user mapping as well as a user-to-photo mapping, stored under two different keys. To do so, ImgStore can encapsulate calls to IsoHT within Isotope beginTX/endTX calls, leveraging nested transactions.

Cross-Thread: In the simplest case, ImgStore executes each transaction within a single thread. However, if ImgStore is built using an event-driven library that requires transactions to execute across different threads, it can use the releaseTX/takeoverTX calls.

Cross-Library: ImgStore may find that IsoHT works well for certain kinds of accesses (e.g., retrieving a specific image), but not for others such as range queries (e.g., finding photos taken between March 4 and May 10, 2015). Accordingly, it may want to spread its state across two different library key-value stores, one based on a hashtable (IsoHT) and another on a B-tree (IsoBT) for efficient range queries. When a photo is added to the system, ImgStore can transactionally call put operations on both stores. This requires the key-value stores to run over different partitions on the same Isotope volume.

Cross-Process: For various reasons, ImgStore may want to run IsoHT in a separate process and access it via an IPC mechanism; for example, to share it with other applications on the same machine, or to isolate failures in different codebases. To do so, ImgStore has to call releaseTX and pass the returned transaction handle via IPC to IsoHT, which then calls takeoverTX. This requires IsoHT to expose a transaction-aware IPC interface for calls that occur within a transactional context.
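To make the cross-layer pattern concrete, the sketch below shows how ImgStore might wrap two IsoHT puts in one Isotope transaction. This is a hypothetical sketch: kv_put and the key-construction helpers are our names, not IsoHT’s API. The transactions issued internally by each put are ignored as nested transactions.

    /* Tag a user in a photo: update both mappings atomically. */
    int tag_user_in_photo(uint64_t photo_id, uint64_t user_id)
    {
        do {
            beginTX();
            if (kv_put(photo_to_user_key(photo_id), user_id) < 0 ||
                kv_put(user_to_photo_key(user_id), photo_id) < 0) {
                abortTX();
                return -1;
            }
        } while (!endTX());    /* retry if aborted due to a conflict */
        return 0;
    }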

4 Design and Implementation

Figure 3 shows the major components of the Isotope design. Isotope internally implements an in-memory multiversion index (B in the figure) over a persistent log (E). Versioning is provided by a timestamp counter (A) which determines the snapshot seen by a transaction as well as its commit timestamp. This commit timestamp is used by a decision algorithm (D) to determine if the transaction commits or not. Writes issued within a transaction are buffered (C) during its execution, and flushed to the log if the transaction commits. We now describe the interaction of these components.

Figure 3: Isotope consists of (A) a timestamp counter, (B) a multiversion index, (C) a write buffer, (D) a decision algorithm, and (E) a persistent log.

When the application calls beginTX, Isotope creates an in-memory intention record for the speculative transaction: a simple data structure with a start timestamp and a read/write-set. Each entry in the read/write-set consists of a block address, a bitmap that tracks the accessed status of smaller fixed-size chunks or fragments within the block (by default, the fragment size is 16 bytes, resulting in a 256-bit bitmap for each 4KB block), and an additional 4KB payload only in the write-set. These bitmaps are never written persistently and are only maintained in memory for currently executing transactions. After creating the intention record, the beginTX call sets its start timestamp to the current value of the timestamp counter (A in Figure 3) without incrementing it.

Until endTX is called, the transaction executes speculatively against the (potentially stale) snapshot, without any effect on the shared or persistent state of the system. Writes update the write-set and are buffered in memory (C in Figure 3) without issuing any I/O. A transaction can read its own buffered writes, but all other reads within the transaction are served from the snapshot corresponding to the start timestamp using the multiversion index (B in Figure 3). The mark_accessed call modifies the bitmap for a previously read or written block to indicate which bits the application actually touched. Multiple mark_accessed calls have a cumulative effect on the bitmap. At any point, the transaction can be preemptively aborted by Isotope simply by discarding its intention record and buffered writes. Subsequent reads, writes, and endTX calls will be unable to find the record and return an error code to the application.

All the action happens on the endTX call, which consists of two distinct phases: deciding the commit/abort status of the transaction, and applying the transaction (if it commits) to the state of the logical address space. Regardless of how it performs these two phases, the first action taken by endTX is to assign the transaction a commit timestamp by reading and incrementing the global counter. The commit timestamp of the transaction is used to make the commit decision, and is also used as the version number for all the writes within the transaction if it commits. We use the terms timestamp and version number interchangeably in the following text.

4.1 Deciding Transactions

To determine whether the transaction commits or aborts, endTX must detect the existence of conflicting transactions. The isolation guarantee provided – strict serializability or snapshot isolation – depends on what constitutes a conflicting transaction. We first consider a simple strawman scheme that provides strict serializability and implements conflict detection as a function of the multiversion index. Here, transactions are processed in commit timestamp order, and for each transaction the multiversion index is consulted to check if any of the logical blocks in its read-set has a version number greater than the current transaction’s start timestamp. In other words, we check whether any of the blocks read by the transaction has been updated since it was read.

This scheme is simple, but suffers from a major drawback: the granularity of the multiversion index has to match the granularity of conflict detection. For example, if we want to check for conflicts at 16-byte grain, the index has to track version numbers at 16-byte grain as well; this blows up the size of the in-memory index by 256X compared to a conventional block-granular index. As a result, this scheme is not well-suited for fine-grained conflict detection.

Figure 4: Conflict detection under snapshot isolation: a transaction commits if no other committed transaction in its conflict window has an overlapping write-set.

To perform fine-grained conflict detection while avoiding this blow-up in the size of the index, Isotope instead implements conflict detection as a function over the temporal stream of prior transactions (see Figure 4). Concretely, each transaction has a conflict window of prior transactions between its start timestamp and its commit timestamp.

• For strict serializability, the transaction T aborts if any committed transaction in its conflict window modified an address that T read; else, T commits.

• For snapshot isolation, the transaction T aborts if any committed transaction in its conflict window modified an address that T wrote; else, T commits.

In either case, the commit/abort status of a transaction is a function of a window of transactions immediately preceding it in commit timestamp order.

When endTX is called on T, a pointer to its intention record is inserted into the slot corresponding to its commit timestamp in an in-memory array. Since the counter assigns contiguous timestamps, this array has no holes; each slot is eventually occupied by a transaction. At this point, we do not yet know the commit/abort status of T and have not issued any write I/O, but we have a start timestamp and a commit timestamp for it. Each slot is guarded by its own lock.

To decide if T commits or aborts, we simply look at its conflict window of transactions in the in-memory array (i.e., the transactions between its start and commit timestamps). We can decide T’s status once all these transactions have decided. T commits if each transaction in the window either aborts or has no overlap between its read/write-set and T’s read/write-set (depending on the transactional semantics). Since each read/write-set stores fine-grained information about which fragments of the block are accessed, this scheme provides fine-grained conflict detection without increasing the size of the multiversion index.
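A minimal sketch of this decision loop follows, assuming a txslot helper that returns the (decided) transaction occupying a given slot in the in-memory array, and a sets_overlap helper that checks two read/write-sets for a common block with intersecting fragment bitmaps; both names are ours, not Isotope’s.

    /* Returns true if transaction t commits. Under strict serializability,
     * t's read-set is checked against prior write-sets; under snapshot
     * isolation, t's write-set is checked instead. */
    bool decide(struct intention_record *t, bool strict)
    {
        for (uint64_t ts = t->start_ts + 1; ts < t->commit_ts; ts++) {
            struct intention_record *w = txslot(ts); /* waits until slot ts is decided */
            if (!w->committed)
                continue;                            /* aborted txns cannot conflict */
            struct rw_entry *set = strict ? t->read_set  : t->write_set;
            int n              = strict ? t->n_reads   : t->n_writes;
            if (sets_overlap(w->write_set, w->n_writes, set, n))
                return false;                        /* fragment-level overlap found */
        }
        return true;
    }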

Defining the commit/abort decision for a transaction as a function of other transactions is a strategy as old as optimistic concurrency control itself [35], but choosing an appropriate implementation is non-trivial. Like us, Bernstein et al. [48] formulate the commit/abort decision for distributed transactions in the Hyder system as a function of a conflict window over a totally ordered stream of transaction intentions. Unlike us, they explicitly make a choice to use the spatial state of the system (i.e., the index) to decide transactions. A number of factors drive our choice in the opposite direction: we need to support writes at arbitrary granularity (e.g., an inode) without increasing index size; our intention log is a local in-memory array and not distributed or shared across the network, drastically reducing the size of the conflict window; and checking for conflicts using read/write-sets is easy since our index is a simple address space.

4.2 Applying Transactions

If the outcome of the decision phase is commit, endTX proceeds to apply the transaction to the logical address space. The first step in this process is to append the writes within the transaction to the persistent log. This step can be executed in parallel for multiple transactions, as soon as each one’s decision is known, since the existence and order of writes on the log signifies nothing: the multiversion index still points to older entries in the log. The second step involves changing the multiversion index to point to the new entries. Once the index has been changed, the transaction can be acknowledged and its effects are visible.

One complication is that this protocol introduces a lost update anomaly. Consider a transaction that reads a block (say an allocation bitmap in a filesystem), examines and changes the first bit, and writes it back. A second transaction reads the same block concurrently, examines and changes the last bit, and writes it back. Our conflict detection scheme will correctly allow both transactions to commit. However, each transaction will write its own version of the 4KB bitmap, omitting the other’s modification; as a result, the transaction with the higher timestamp will destroy the earlier transaction’s modification. To avoid such lost updates, the endTX call performs an additional step for each transaction before appending its buffered writes to the log. Once it knows that the current transaction can commit, it scans the conflict window and merges updates made by prior committed transactions to the blocks in its write-set.
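The sketch below illustrates this merge step under the same assumptions as the earlier decision-loop sketch; find_entry and copy_unmodified_fragments are hypothetical helpers.

    /* Before logging t's writes, fold in fragments written by committed
     * transactions in t's conflict window that t itself did not touch. */
    void merge_prior_updates(struct intention_record *t)
    {
        for (uint64_t ts = t->start_ts + 1; ts < t->commit_ts; ts++) {
            struct intention_record *w = txslot(ts);
            if (!w->committed)
                continue;
            for (int i = 0; i < t->n_writes; i++) {
                struct rw_entry *mine = &t->write_set[i];
                struct rw_entry *theirs =
                    find_entry(w->write_set, w->n_writes, mine->blknum);
                if (theirs)  /* copy their fragments where my bitmap bit is 0 */
                    copy_unmodified_fragments(mine, theirs);
            }
        }
    }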

4.3 Implementation Details

Isotope is implemented as an in-kernel software module in Linux 2.6.38; specifically, as a device mapper that exposes multiple physical block devices as a single virtual disk, at the same level of the stack as software RAID. Below, we discuss the details of this implementation.

Log implementation: Isotope implements the log (i.e., E in Figure 3) over a conventional address space with a counter marking the tail (and additional bookkeeping information for garbage collection, which we discuss shortly). From a correctness and functionality standpoint, Isotope is agnostic to how this address space is realized. For good performance, it requires an implementation that works well for a logging workload where writes are concentrated at the tail, while reads and garbage collection can occur at random locations in the body. A naive solution is to use a single physical disk (or a RAID-0 or RAID-10 array of disks), but garbage collection activity can hurt performance significantly by randomizing the disk arm. Replacing the disks with SSDs increases the cost-to-capacity ratio of the array without entirely eliminating the performance problem [58].

As a result, we use a design where a log is chained across multiple disks or SSDs (similar to Gecko [56]). Chaining the log across drives ensures that garbage collection – which occurs in the body of the log/chain – is separated from the first-class writes arriving at the tail drive of the log/chain. In addition, a commodity SSD is used as a read cache with an affinity for the tail drive of the chain, preventing application reads from disrupting write sequentiality at the tail drive. In essence, this design ‘collars’ the throughput of the log, pegging write throughput to the speed of a single drive, but simultaneously eliminating the throughput troughs caused by concurrent garbage collection and read activity.

Garbage collection (GC): Compared to conventional log-structured stores, GC is slightly complicated in Isotope by the need to maintain older versions of blocks. Isotope tracks the oldest start timestamp across all ongoing transactions and makes a best-effort attempt to not garbage collect versions newer than this timestamp. In the worst case, any non-current versions can be discarded without compromising safety, by first preemptively aborting any transactions reading from them. The application can simply retry its transactions, obtaining a new, current snapshot. This behavior is particularly useful for dealing with the effects of rogue transactions that are never terminated by the application. The alternative, which we did not implement, is to set a flag that preserves a running transaction’s snapshot by blocking new writes if the log runs out of space; this may be required if it’s more important for a long-running transaction to finish (e.g., if it’s a critical backup) than for the system to be online for writes.

Caching: The please_cache call in Isotope allows the application to mark the blocks it wants cached in RAM. To implement caching, Isotope annotates the multiversion index with pointers to cached copies of block versions.

This call is merely a hint and provides no guarantees to the application. In practice, our implementation uses a simple LRU scheme to cache a subset of the blocks if the application requests caching indiscriminately.

Index persistence: Thus far, we have described the multiversion index as an in-memory data structure pointing to entries on the log. Changes to the index have to be made persistent so that the state of the system can be reconstructed on failures. To obtain persistence and failure atomicity for these changes, we use a metadata log. The size of this log can be limited via periodic checkpoints.

A simple option is to store the metadata log on battery-backed RAM, or on newer technologies such as PCM or flash-backed RAM (e.g., Fusion-io’s AutoCommit Memory [7]). In the absence of special hardware on our experimental testbed, we instead used a commodity SSD. Each transaction’s description in the metadata log is quite compact (i.e., the logical block address and the physical log position of each write in it, and its commit timestamp). To avoid the slowdown and flash wear-out induced by logging each transaction separately as a synchronous page write, we batch multiple committed transactions together [25], delaying the final step of modifying the multiversion index and acknowledging the transaction to the application. We do not turn off the write cache on the SSD, relying on its ability to flush data on power failures using supercapacitors.

Memory overhead: A primary source of memory overhead in Isotope is the multiversion index. A single-version index that maps a 2TB logical address space to a 4TB physical address space can be implemented as a simple array that requires 2GB of RAM (i.e., half a billion 4-byte entries), which can be easily maintained in RAM on modern machines. Associating each address with a version (without supporting access to prior versions) doubles the space requirement to 4GB (assuming 4-byte timestamps), which is still feasible. However, multiversioned indices that allow access to past versions are more expensive, due to the fact that multiple versions need to be stored, and because a more complex data structure is required instead of an array with fixed-size values. These concerns are mitigated by the fact that Isotope is not designed to be a fully-fledged multiversion store; it only stores versions from the recent past, corresponding to the snapshots seen by executing transactions.

Accordingly, Isotope maintains a pair of indices: a single-version index in the form of a simple array and a multiversion index implemented as a hashtable. Each entry in the single-version index either contains a valid physical address if the block has only one valid, non-GC’ed version, a null value if the block has never been written, or a constant indicating the existence of multiple versions. If a transaction issues a read and encounters this constant, the multiversion index is consulted. An address is moved from the single-version index to the multiversion index when it goes from having one version to two; it is moved back to the single-version index when its older version(s) are garbage collected (as described earlier in this section).

The multiversion index consists of a hashtable that maps each logical address to a linked list of its existing versions, in timestamp order. Each entry contains forward and backward pointers, the logical address, the physical address, and the timestamp. A transaction walks this linked list to find the entry with the highest timestamp less than its snapshot timestamp. In addition, the entry also has a pointer to the in-memory cached copy, as described earlier. If an address is cached, the first single-version index is marked as having multiple versions even if it does not, forcing the transaction to look at the hashtable index and encounter the cached copy. In the future, we plan on applying recent work on compact, concurrent maps [28] to further reduce overhead.
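A sketch of the resulting read path, with our own names for the sentinel values and index structures, looks roughly as follows.

    /* Resolve (blknum, snapshot timestamp) to a physical log position. */
    uint64_t lookup(off_t blknum, uint64_t snap_ts)
    {
        uint64_t phys = sv_index[blknum];       /* single-version array */
        if (phys == NEVER_WRITTEN)
            return NO_BLOCK;
        if (phys != MULTIVERSION)
            return phys;                        /* common case: one version */
        /* multiple versions: walk the hashtable's per-address list,
         * newest first, for the version visible to this snapshot */
        for (struct version *v = mv_lookup(blknum); v != NULL; v = v->prev)
            if (v->timestamp <= snap_ts)
                return v->phys_addr;
        return NO_BLOCK;                        /* all visible versions GC'ed */
    }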

Rogue Transactions: Another source of memory overhead in Isotope is the buffering of writes issued by in-progress transactions. Each write adds an entry to the write-set of the transaction containing the 4KB payload and a (4096/C)-bit bitmap, where C is the granularity of conflict detection in bytes (e.g., with 16-byte detection, the bitmap is 256 bits). Rogue transactions that issue a large number of writes are a concern, especially since transactions can be exposed to end-user applications. To handle this, Isotope provides a configuration parameter to set the maximum number of writes that can be issued by a transaction (set to 256 by default); beyond this, writes return an error code. Another parameter sets the maximum number of outstanding transactions a single process can have in-flight (also set to 256). Accordingly, the maximum memory a rogue process can use within Isotope for buffered writes is roughly 256MB (256 transactions × 256 writes × 4KB). When a process is killed, its outstanding transactions are preemptively aborted.

Despite these safeguards, it is still possible for Isotope to run out of memory if many processes are launched concurrently and each spams the system with rogue, never-ending transactions. In the worst case, Isotope can always relieve memory pressure by preemptively aborting transactions. Another option, which we considered, is to flush writes to disk before they are committed; since the metadata index does not point to them, they won’t be visible to other transactions. Given that the system is only expected to run out of memory in pathological cases where issuing I/O might worsen the situation, we didn’t implement this scheme.

Note that the in-memory array that Isotope uses for conflict detection is not a major source of memory overhead; pointers to transaction intention records are inserted into this array in timestamp order only after the application calls endTX, at which point it has relinquished control to Isotope and cannot prolong the transaction. As a result, the lifetime of an entry in this array is short and limited to the duration of the endTX call.

    Application | Original with locks | Basic APIs (lines modified) | Optional APIs (lines added)
    IsoHT       | 591                 | 591 (15)                    | 617 (26)
    IsoBT       | 1,229               | 1,229 (12)                  | 1,246 (17)
    IsoFS       | 997                 | 997 (19)                    | 1,022 (25)

Table 1: Lines of code for Isotope storage systems.

5 Isotope Applications

To illustrate the usability and performance of Isotope, we built four applications using Isotope transactions: IsoHT, a key-value store built over a persistent hashtable; IsoBT, a key-value store built over a persistent B-tree; IsoFS, a user-space POSIX filesystem; and ImgStore, an image storage service that stores images in IsoHT and a secondary index in IsoBT. These applications implement each call in their respective public APIs by following a simple template that wraps the entire function in a single transaction, with a retry loop in case the transaction aborts due to a conflict (see Figure 2).

5.1 Transactional Key-Value Stores

Library-based or ‘embedded’ key-value stores (such as LevelDB or Berkeley DB) are typically built over persistent, on-disk data structures. We built two key-value stores called IsoHT and IsoBT, implemented over an on-disk hashtable and B-tree data structure, respectively. Both key-value stores support basic put/get operations on key-value pairs, while IsoBT additionally supports range queries. Each API call is implemented via a single transaction of block reads and writes to an Isotope volume.

We implemented IsoHT and IsoBT in three stages. First, we wrote code without Isotope transactions, using a global lock to guard the entire hashtable or B-tree. The resulting key-value stores are functional but slow, since all accesses are serialized by the single lock. Further, they do not provide failure atomicity: a crash in the middle of an operation can catastrophically violate data structure integrity.

In the second stage, we simply replaced the acquisitions/releases on the global lock with Isotope beginTX/endTX/abortTX calls, without changing the overall number of lines of code. With this change, the key-value stores provide both fine-grained concurrency control (at block granularity) and failure atomicity. Finally, we added optional mark_accessed calls to obtain sub-block concurrency control, and please_cache calls to cache the data structures (e.g., the nodes of the B-tree, but not the values pointed to by them). Table 1 reports the lines of code (LOC) counts at each stage for the two key-value stores.
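The second-stage transformation follows the template below; this is a sketch with hypothetical kv_t and do_put names (the first-stage lock-based version is shown for contrast).

    /* Stage 1: global lock, no failure atomicity. */
    int put_locked(kv_t *kv, const char *key, const void *val, size_t len)
    {
        pthread_mutex_lock(&kv->lock);
        int r = do_put(kv, key, val, len);   /* reads/writes index and data blocks */
        pthread_mutex_unlock(&kv->lock);
        return r;
    }

    /* Stage 2: the lock becomes a transaction, with retry on conflict. */
    int put_tx(kv_t *kv, const char *key, const void *val, size_t len)
    {
        int r;
        do {
            beginTX();
            r = do_put(kv, key, val, len);
            if (r < 0) { abortTX(); return r; }
        } while (!endTX());
        return r;
    }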


5.2 Transactional Filesystem

IsoFS is a simple user-level filesystem built over Isotope, accessible via FUSE [2] and comprising 1K lines of C code. Its on-disk layout consists of distinct regions for storing inodes, data, and an allocation bitmap for each. Each inode has an indirect pointer and a double indirect pointer, both of which point to pages allocated from the data region. Each filesystem call (e.g., setattr, lookup, or unlink) uses a single transaction to access and modify multiple blocks. The only functionality implemented by IsoFS is the mapping and allocation of files and directories to blocks; atomicity, isolation, and durability are handled by Isotope.

IsoFS is stateless, caching neither data nor metadata across filesystem calls (i.e., across different transactions). Instead, IsoFS tells Isotope which blocks to cache in RAM. This idiom turned out to be surprisingly easy to use in the context of a filesystem; we ask Isotope to cache all bitmap blocks on startup, each inode block when an inode within it is allocated, and each data block that’s allocated as an indirect or double indirect block. Like the key-value stores, IsoFS was implemented in three stages and required few extra lines of code to go from a global lock to using the Isotope API (see Table 1).

IsoFS trivially exposes transactions over files and directories to end applications. For example, a user might create a directory, move a file into it, edit the file, and rename the directory, only to abort the entire transaction and revert the filesystem to its earlier state. One implementation-related caveat is that we were unable to expose transactions to end applications of IsoFS via the FUSE interface, since FUSE decouples application threading from filesystem threading and does not provide any facility for explicitly transferring a transaction handle on each call. Accordingly, we can only expose transactions to the end application if IsoFS is used directly as a library within the application’s process.

5.3 Experience

Composability: As we stated earlier, Isotope-based storage systems are trivially composable: a single transaction can encapsulate calls to IsoFS, IsoHT and IsoBT. To demonstrate the power of such composability, we built ImgStore, the image storage application described in Section 3. ImgStore stores images in IsoHT, using 64-bit IDs as keys. It then stores a secondary index in IsoBT, mapping dates to IDs. The implementation of ImgStore is trivially simple: to add an image, it creates a transaction to put the image in IsoHT, and then updates the secondary index in IsoBT. The result is a storage system that – in just 148 LOC – provides hashtable-like performance for gets while supporting range queries.

Isolation Levels: Isotope provides both strict serializability and snapshot isolation; our expectation was that developers would find it difficult to deal with the semantics of the latter. However, our experience with IsoFS, IsoHT and IsoBT showed otherwise. Snapshot isolation provides better performance than strict serializability, but introduces the write skew anomaly [16]: if two concurrent transactions read two blocks and each updates one of the blocks (but not the same one), they will both commit despite not being serializable in any order. The write skew anomaly is problematic for applications if a transaction is expected to maintain an integrity constraint that includes some block it does not write to (e.g., if the two blocks in the example have to sum to less than some constant). In the case of the storage systems we built, we did not encounter these kinds of constraints; for instance, no particular constraint holds between different bits on an allocation map. As a result, we found it relatively easy to reason about and rule out the write skew anomaly.

Randomization: Our initial implementations exhibited a high abort rate due to deterministic behavior across different transactions. For example, a simple algorithm for allocating a free page involved getting the first free bit from the allocation bitmap; as a result, multiple concurrent transactions interfered with each other by trying to allocate the same page. To reduce the abort rate, it was sufficient to remove the determinism in simple ways; for example, we assigned each thread a random start offset into the allocation bitmap.
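A minimal sketch of the randomized allocator, with hypothetical bit-manipulation helpers, is shown below; the only change from the deterministic version is the random starting bit.

    /* Allocate a free page by scanning the bitmap from a random offset. */
    int alloc_page(uint8_t *bitmap, int nbits)
    {
        int start = rand() % nbits;          /* breaks determinism across threads */
        for (int i = 0; i < nbits; i++) {
            int bit = (start + i) % nbits;
            if (!test_bit(bitmap, bit)) {
                set_bit(bitmap, bit);
                return bit;                  /* caller marks the bit via mark_accessed */
            }
        }
        return -1;                           /* no free page */
    }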

6 Performance Evaluation

We evaluate Isotope on a machine with an Intel Xeon CPU with 24 hyper-threaded cores, 24GB of RAM, three 10K RPM disks of 600GB each, a 128GB SSD for the OS, and two other 240GB SSDs with SATA interfaces. In the following experiments, we used two primary configurations for Isotope’s persistent log: a three-disk chained logging instance with a 32GB SSD read cache in front, and a 2-SSD chained logging instance. In some of the experiments, we compare against conventional systems running over RAID-0 configurations of 3 disks and 2 SSDs, respectively. In the chained logging configurations, all writes are logged to the single tail drive, while reads are mostly served by the other drives (and the SSD read cache for the disk-based configuration). The performance of this logging design under various workloads and during GC activity has been documented in [56]. In all our experiments, GC runs in the background, issuing I/Os to the drives in the body of the chain to compact segments without disrupting the tail drive.

Our evaluation consists of two parts. First, we focus on the performance and overhead of Isotope, showing that it exploits fine-grained concurrency in workloads and provides high, stable throughput. Second, we show that Isotope applications – in addition to being simple and robust – are fast, efficient, and composable into larger applications.


Figure 5: Without fine-grained conflict detection, Isotope performs well under low contention workloads. (Two panels, disk and SSD: throughput and goodput, in MB/sec and thousands of transactions/sec, versus accessed address space from 16KB to 256GB.)

6.1 Isotope Performance

To understand how Isotope performs depending on the concurrency present in the workload, we implemented a synthetic benchmark. The benchmark executes a simple type of transaction that reads three randomly chosen blocks, modifies a random 16-byte segment within each block (aligned on a 16-byte boundary), and writes the blocks back. This benchmark performs identically under strict serializability and snapshot isolation, since the read-set exactly matches the write-set.
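One iteration of this benchmark can be sketched as follows; read_block and write_block are hypothetical wrappers over POSIX pread/pwrite on the Isotope device, and prefix_blocks is the contention-controlling parameter described next.

    void bench_iter(off_t prefix_blocks)
    {
        char buf[4096];
        do {
            beginTX();
            for (int i = 0; i < 3; i++) {
                off_t blk = rand() % prefix_blocks;   /* pick from the prefix */
                read_block(blk, buf);
                int off = (rand() % 256) * 16;        /* aligned 16-byte segment */
                memset(buf + off, rand() & 0xff, 16); /* modify the segment */
                write_block(blk, buf);
                /* with fine-grained detection enabled (Figure 6):
                 * mark_accessed(blk, off, 16); */
            }
        } while (!endTX());   /* aborts count toward throughput, not goodput */
    }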

In the following experiments, we executed 64 instances of the microbenchmark concurrently, varying the size of the address space accessed by the instances to vary contention. The blocks are chosen from a specific prefix of the address space, which is a parameter to the benchmark; the longer this prefix, the bigger the fraction of the address space accessed by the benchmark, and the less skewed the workload. The two key metrics of interest are transaction goodput (measured as the number of successfully committed transactions per second, as well as the total number of bytes read or written per second by these transactions) and overall transaction throughput; their ratio is the commit rate of the system. Each data point in the following graphs is averaged across three runs; in all cases, the minimum and the maximum run were within 10% of the average.

Figure 5 shows the performance of this benchmark against Isotope without fine-grained conflict detection; i.e., the benchmark does not issue mark_accessed calls for the 16-byte segments it modifies. On the x-axis, we increase the fraction of the address space accessed by the benchmark. On the left y-axis, we plot the rate at which data is read and written by transactions; on the right y-axis, we plot the number of transactions/sec. On both disk and SSD, transactional contention cripples performance on the left part of the graph: even though the benchmark attempts to commit thousands of transactions/sec, all of them access a small number of blocks, leading to low goodput. Note that overall transaction throughput is very high when the commit rate is low: aborts are cheap and do not result in storage I/O.

Conversely, disk contention hurts performance on the right side of the left panel of Figure 5: since the blocks read by each transaction are distributed widely across the address space, the 32GB SSD read cache is ineffective in serving reads and the disk arm is randomized and seeking constantly. As a result, the system provides very few transactions per second (though with a high commit rate). In the middle of the graph is a sweet spot where Isotope saturates the disk at roughly 120 MB/s of writes: the blocks accessed are concentrated enough for reads to be cacheable in the SSD (which supplies 120 MB/s of reads, or 30K 4KB IOPS), while distributed enough for writes to not trigger frequent conflicts.

Figure 6: With fine-grained conflict detection, Isotope performs well even under high block-level contention.

We can improve performance on the left side of the graphs in Figure 5 via fine-grained conflict detection. In Figure 6, the benchmark issues mark_accessed calls to tell Isotope which 16-byte fragment it is modifying. The result is high, stable goodput even when all transactions are accessing a small number of blocks, since there is enough fragment-level concurrency in the system to ensure a high commit rate. Using the same experiment but with smaller and larger data access and conflict detection granularities than 16 bytes showed similar trends. Isotope’s conflict detection was not CPU-intensive: we observed an average CPU utilization of 5.96% without fine-grained conflict detection, and 6.17% with it.

6.2 Isotope Application Performance

As described earlier, we implemented two key-value stores over Isotope: IsoHT, which uses a hashtable index, and IsoBT, which uses a B-tree index. IsoBT exposes a fully functional LevelDB API to end applications; IsoHT does the same minus range queries. To evaluate these systems, we used the LevelDB benchmark [5] as well as the YCSB benchmark [21]. We ran the fill-random, read-random, and delete-random workloads of the LevelDB benchmark and YCSB workload-A traces (50% reads and 50% updates following a zipf distribution on keys). All these experiments are on the 2-SSD configuration of Isotope. For comparison, we ran LevelDB on a RAID-0 array of the two SSDs, in both synchronous mode ('LvlDB-s') and asynchronous mode ('LvlDB'). LevelDB was set to use no compression and the default write cache size of 8MB. For all the workloads, we used a value size of 8KB and varied the number of threads issuing requests from 4 to 128. Results with different value sizes (from 4KB to 32KB) showed similar trends.
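
To give a flavor of how these stores use Isotope, here is a hedged sketch of a put() in the spirit of IsoHT, reusing the declarations from the earlier sketches. The hash_bucket_block(), alloc_value_block(), and insert_entry() helpers are hypothetical stand-ins for the hashtable logic, not part of Isotope's interface.

    /* Hedged sketch of a hashtable put() over Isotope: the index-block
     * update and the value-block writes commit atomically, and Isotope's
     * concurrency control aborts conflicting puts. Helper functions are
     * hypothetical stand-ins for IsoHT's hashtable logic. */
    extern off_t hash_bucket_block(const char *key);   /* hypothetical */
    extern off_t alloc_value_block(size_t len);        /* hypothetical */
    extern void  insert_entry(char *idx, const char *key, off_t vblock);

    int isoht_put(int fd, const char *key, const void *val, size_t len)
    {
        char idx[BLKSZ];

        beginTX(fd);

        off_t vblock = alloc_value_block(len);  /* len assumed block-aligned */
        pwrite(fd, val, len, vblock);           /* write the value blocks */

        off_t bucket = hash_bucket_block(key);
        pread(fd, idx, BLKSZ, bucket);          /* read the index block */
        insert_entry(idx, key, vblock);         /* point the key at the value */
        pwrite(fd, idx, BLKSZ, bucket);

        return endTX(fd) == 0;                  /* all writes commit, or none */
    }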

Figure 7: IsoHT and IsoBT outperform LevelDB for data operations while providing stronger consistency. (Four panels, (a) Random fill, (b) Random read, (c) Random delete, and (d) YCSB workload-A, plot goodput in K ops/sec against the number of threads, from 4 to 128, for LvlDB-s, LvlDB, IsoHT, and IsoBT.)

For operations involving writes (Figure 7-(a), (c), and (d)), IsoHT and IsoBT goodput increases with the number of threads, but dips slightly beyond 64 threads due to an increased transaction conflict rate. For the read workload (Figure 7-(b)), throughput increases until the underlying SSDs are saturated. Overall, IsoHT has higher goodput than IsoBT, since it touches fewer metadata blocks per operation. We ran these experiments with Isotope providing snapshot isolation, since it performed better for certain workloads and gave sufficiently strong semantics for building the key-value stores. With strict serializability, for instance, the fill workload showed nearly identical performance, whereas the delete workload ran up to 25% slower.

LevelDB’s performance is low for fill operations dueto sorting and multi-level merging (Figure 7-(a)), and itsread performance degrades as the number of concurrentthreads increases because of the CPU contention in theskip list, cache thrashing, and internal merging opera-tions (Figure 7-(b)). Still, LevelDB’s delete is very effi-cient because it only involves appending a small deleteintention record to a log, whereas IsoBT/IsoHT has toupdate a full 4KB block per delete (Figure 7-(c)).

The point of this experiment is not to show that IsoHT/IsoBT is better than LevelDB, which has a different internal design and is optimized for specific workloads such as sequential reads and bulk writes. Rather, it shows that systems built over Isotope with little effort can provide equivalent or better performance than an existing system that implements its own concurrency control and failure atomicity logic.

6.2.1 Composability

To evaluate the composability of Isotope-based storage systems, we ran the YCSB workload on ImgStore, our image storage application built over IsoHT and IsoBT. In our experiments, ImgStore transactionally stored a 16KB payload (corresponding to an image) in IsoHT and a small date-to-ID mapping in IsoBT. To capture the various ways in which Isotope storage systems can be composed (see Section 3), we implemented several versions of ImgStore: cross-library, where ImgStore accesses the two key-value stores as in-process libraries, with each transaction executing within a single user-space thread; cross-thread, where ImgStore accesses each key-value store using a separate thread, and requires transactions to execute across them; and cross-process, where each key-value store executes within its own process and is accessed by ImgStore via socket-based IPC.

Figure 8: YCSB over different compositions of IsoBT and IsoHT. (Goodput in K ops/sec against the number of threads, from 4 to 128, for the cross-library, cross-thread, and cross-process versions.)

Figure 8 shows the resulting performance for all three versions. It shows that the cost of the extra takeoverTX/releaseTX calls required for cross-thread transactions is negligible. As one might expect, cross-process transactions are slower due to the extra IPC overhead. Additionally, ImgStore exhibits less concurrency than IsoHT or IsoBT (peaking at 32 threads), since each composite transaction conflicts if either of its constituent transactions conflicts.
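
The cross-thread handoff can be sketched as follows, reusing beginTX()/endTX() and isoht_put() from the earlier sketches. The takeoverTX()/releaseTX() signatures, the wake() helper, the thread handles, and the img_t type are all illustrative assumptions, not Isotope's exact interface.

    /* Hedged sketch of a cross-thread composite transaction in ImgStore:
     * one in-flight transaction migrates between threads via the
     * takeoverTX/releaseTX calls named above. */
    typedef struct {
        char  *id;      /* image key stored in IsoHT */
        char  *date;    /* date key stored in IsoBT  */
        void  *bytes;   /* 16KB image payload        */
        size_t len;
    } img_t;

    extern void takeoverTX(int fd);  /* adopt a released in-flight transaction */
    extern void releaseTX(int fd);   /* detach the transaction from this thread */
    extern void wake(void *thread);  /* hypothetical thread-signaling helper */
    extern void *isoht_thread, *isobt_thread;
    extern int  isobt_put(int fd, const char *key, const char *val);

    void imgstore_put_begin(int fd)             /* runs on ImgStore's thread */
    {
        beginTX(fd);
        releaseTX(fd);                          /* hand off to IsoHT's thread */
        wake(isoht_thread);
    }

    void imgstore_put_image(int fd, img_t *img) /* runs on IsoHT's thread */
    {
        takeoverTX(fd);
        isoht_put(fd, img->id, img->bytes, img->len);  /* 16KB payload */
        releaseTX(fd);                          /* hand off to IsoBT's thread */
        wake(isobt_thread);
    }

    int imgstore_put_index(int fd, img_t *img)  /* runs on IsoBT's thread */
    {
        takeoverTX(fd);
        isobt_put(fd, img->date, img->id);      /* small date-to-ID mapping */
        return endTX(fd) == 0;                  /* composite commits or aborts */
    }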

6.2.2 Filesystem Performance

Next, we compare the end-to-end performance of IsoFS running over Isotope using the IOZone [4] write/rewrite benchmark with 8 threads. Each thread writes to its own file using a 16KB record size until the file size reaches 256MB; it then rewrites the entire file sequentially, and then rewrites it randomly. IsoFS converted each 16KB write into a transaction involving four 4KB Isotope writes, along with metadata writes. We also ran ext2 and ext3 over Isotope; these issued solitary, non-transactional reads and writes, which were interpreted by Isotope as singleton transactions (in effect, Isotope operated as a conventional log-structured block store, so that ext2 and ext3 are not penalized for random I/Os). We ran ext3 in 'ordered' mode, where metadata is journaled but file contents are not.
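
A minimal sketch of this conversion, reusing the earlier declarations: the block_of() mapping helper and the inode layout are hypothetical, not IsoFS's actual code.

    /* Hedged sketch of IsoFS's write path: one 16KB record becomes a
     * single Isotope transaction covering four 4KB data-block writes
     * plus the inode block that maps them. */
    typedef struct {
        char  block[BLKSZ];   /* on-disk inode block, incl. block map */
        off_t disk_addr;      /* where the inode block lives on disk  */
    } inode_t;

    extern off_t block_of(inode_t *ino, off_t file_off);  /* file-to-disk map */

    int isofs_write16k(int fd, inode_t *ino, off_t file_off, const char *data)
    {
        beginTX(fd);
        for (int i = 0; i < 4; i++) {
            off_t blk = block_of(ino, file_off + i * BLKSZ);
            pwrite(fd, data + i * BLKSZ, BLKSZ, blk);      /* 4KB data write */
        }
        pwrite(fd, ino->block, BLKSZ, ino->disk_addr);     /* metadata write */
        return endTX(fd) == 0;   /* all five writes commit atomically */
    }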

Figure 9 plots the throughput observed by IOZone: on disk, IsoFS matches or slightly outperforms ext2 and ext3, saturating the tail disk on the chain. On SSD, IsoFS is faster than ext2 and ext3 for initial writes, but is bottlenecked by FUSE on rewrites. When we ran IsoFS directly using a user-space benchmark that mimics IOZone ('IsoFS-lib'), throughput improved to over 415MB/s. A secondary point made by this graph is that Isotope does not slow down applications that do not use its transactional features (the high performance is mainly due to the underlying logging scheme, but ext2 and ext3 still saturate disk and SSD for rewrites), satisfying a key condition for pushing functionality down the stack [49].

Figure 9: IOZone over IsoFS and ext2/ext3. (Throughput in MB/sec for seq-write, seq-rewrite, and rand-rewrite, on disk and SSD, for ext2, ext3, IsoFS, and IsoFS-lib (SSD only).)

7 Related Work

The idea of transactional atomicity for multi-block writes was first proposed in Mime [18], a log-structured storage system that provided atomic multi-sector writes. Over the years, multiple other projects have proposed block-level or page-level atomicity: the Logical Disk [24] in 1993, Stasis [54] in 2006, TxFlash [46] in 2008, and MARS [19] in 2013. RVM [51] and Rio Vista [36] proposed atomicity over a persistent memory abstraction. All these systems explicitly stopped short of providing full transactional semantics, relying on higher layers to implement isolation. To the best of our knowledge, no existing single-machine system has implemented transactional isolation at the block level, or indeed any concurrency control guarantee beyond linearizability.

On the other hand, distributed filesystems have often relied on the underlying storage layer to provide concurrency control. Boxwood [37], Sinfonia [12], and CalvinFS [62] presented simple NFS designs that leveraged transactions over distributed implementations of high-level data structures and a shared address space. Transaction isolation has been proposed for shared block storage accessed over a network [13] and for key-value stores [60]. Isotope can be viewed as an extension of similar ideas to single-machine, multi-core systems that does not require consensus or distributed rollback protocols. Our single-machine IsoFS implementation has much in common with the Boxwood, Sinfonia, and CalvinFS NFS implementations, which ran against clusters of storage servers.

Isotope also fits into a larger body of work on smart single-machine block devices, starting with Loge [27] and including HP AutoRAID [65]. Some of this work has focused on making block devices smarter without changing the interface [57], while other work has looked at augmenting the block interface [18, 64, 30], modifying it [67], and even replacing it with an object-based interface [38]. In a distributed context, Parallax [39] and Strata [23] provide virtual disks on storage clusters. A number of filesystems are multiversion, starting with WAFL [33] and including many others [50, 41, 22]. Underlying these systems is research on multiversion data structures [26]. Less common are multiversion block stores such as Clotho [29] and Venti [47].

A number of filesystems have been built over a full-fledged database. Inversion [43] is a conventional filesystem built over the POSTGRES database, while Amino [66] is a transactional filesystem (i.e., exposing transactions to users) built over Berkeley DB. WinFS [10] was built over a relational engine derived from SQL Server. This route requires storage system developers to adopt a complex interface, one that does not match or expose the underlying grain of the hardware, in order to obtain benefits such as isolation and atomicity. In contrast, Isotope retains the simple block storage interface while providing isolation and atomicity.

TxOS [45] is a transactional operating system that provides ACID semantics over syscalls, including file accesses. In contrast, Isotope is largely OS-agnostic and can be ported easily to commodity operating systems, or even implemented under the OS as a hardware device. In addition, Isotope supports the easy creation of new systems, such as key-value stores and filesystems, that run directly over block storage.

Isotope is also related to the large body of work on software transactional memory (STM) [55, 32] systems, which typically provide isolation but not durability or atomicity. Recent work has leveraged new NVRAM technologies to add durability to the STM abstraction: Mnemosyne [63] and NV-Heaps [20] with PCM, and Hathi [52] with commodity SSDs. In contrast, Isotope aims for transactional secondary storage, rather than transactional main memory.

8 Conclusion

We described Isotope, a transactional block store that provides isolation in addition to atomicity and durability. We showed that isolation can be implemented efficiently within the block layer, leveraging the inherent multi-versioning of log-structured block stores and application-provided hints for fine-grained conflict detection. Isotope-based systems are simple and fast, while obtaining database-strength guarantees on failure atomicity, durability, and consistency. They are also composable, allowing application-initiated transactions to span multiple storage systems and different abstractions such as files and key-value pairs.

Acknowledgments

This work is partially funded and supported by a Sloan Research Fellowship received by Hakim Weatherspoon, DARPA MRC and CSSG (D11AP00266), and NSF (1422544, 1053757, 0424422, 1151268, 1047540). We would like to thank our shepherd, Sage Weil, and the anonymous reviewers for their comments.


References

[1] fcntl man page.

[2] Filesystem in userspace. http://fuse.sourceforge.net.

[3] Fusion-io. www.fusionio.com.

[4] IOzone filesystem benchmark. http://www.iozone.org.

[5] LevelDB benchmarks. http://leveldb.googlecode.com/svn/trunk/doc/benchmark.html.

[6] SanDisk Fusion-io atomic multi-block writes. http://www.sandisk.com/assets/docs/accelerate-myql-open-source-databases-with-sandisk-nvmfs-and-fusion-iomemory-sx300-application-accelerators.pdf.

[7] SanDisk Fusion-io auto-commit memory. http://web.sandisk.com/assets/white-papers/MySQL_High-Speed_Transaction_Logging.pdf.

[8] Seagate Kinetic open storage platform. http://www.seagate.com/solutions/cloud/data-center-cloud/platforms/.

[9] Storage Spaces. http://technet.microsoft.com/en-us/library/hh831739.aspx.

[10] WinFS. http://blogs.msdn.com/b/winfs/.

[11] A. Aghayev and P. Desnoyers. Skylight: a window on shingled disk operation. In USENIX FAST, pages 135–149, 2015.

[12] M. K. Aguilera, A. Merchant, M. Shah, A. Veitch, and C. Karamanolis. Sinfonia: a new paradigm for building scalable distributed systems. ACM SIGOPS Operating Systems Review, 41(6):159–174, 2007.

[13] K. Amiri, G. A. Gibson, and R. Golding. Highly concurrent shared storage. In IEEE ICDCS, pages 298–307, 2000.

[14] A. Badam and V. S. Pai. SSDAlloc: hybrid SSD/RAM memory management made easy. In USENIX NSDI, 2011.

[15] M. Balakrishnan, D. Malkhi, V. Prabhakaran, T. Wobber, M. Wei, and J. D. Davis. CORFU: A shared log design for flash clusters. In USENIX NSDI, pages 1–14, 2012.

[16] H. Berenson, P. Bernstein, J. Gray, J. Melton, E. O'Neil, and P. O'Neil. A critique of ANSI SQL isolation levels. ACM SIGMOD Record, 24(2):1–10, 1995.

[17] P. A. Bernstein, V. Hadzilacos, and N. Goodman. Concurrency control and recovery in database systems, volume 370. Addison-Wesley, New York, 1987.

[18] C. Chao, R. English, D. Jacobson, A. Stepanov, and J. Wilkes. Mime: a high performance parallel storage device with strong recovery guarantees. Technical report, HPL-CSP-92-9, Hewlett-Packard Laboratories, 1992.

[19] J. Coburn, T. Bunker, R. K. Gupta, and S. Swanson. From ARIES to MARS: Reengineering transaction management for next-generation, solid-state drives. In SOSP, 2013.

[20] J. Coburn, A. M. Caulfield, A. Akel, L. M. Grupp, R. K. Gupta, R. Jhala, and S. Swanson. NV-Heaps: making persistent objects fast and safe with next-generation, non-volatile memories. ACM SIGARCH Computer Architecture News, 39(1):105–118, 2011.

[21] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. Benchmarking cloud serving systems with YCSB. In ACM SoCC, pages 143–154, 2010.

[22] B. Cornell, P. A. Dinda, and F. E. Bustamante. Wayback: A user-level versioning file system for Linux. In USENIX ATC, FREENIX Track, 2004.

[23] B. Cully, J. Wires, D. Meyer, K. Jamieson, K. Fraser, T. Deegan, D. Stodden, G. Lefebvre, D. Ferstay, and A. Warfield. Strata: scalable high-performance storage on virtualized non-volatile memory. In USENIX FAST, pages 17–31, 2014.

[24] W. De Jonge, M. F. Kaashoek, and W. C. Hsieh. The Logical Disk: A new approach to improving file systems. ACM SIGOPS Operating Systems Review, 27(5):15–28, 1993.

[25] D. J. DeWitt, R. H. Katz, F. Olken, L. D. Shapiro, M. R. Stonebraker, and D. A. Wood. Implementation techniques for main memory database systems. In ACM SIGMOD, pages 1–8, 1984.

[26] J. R. Driscoll, N. Sarnak, D. D. Sleator, and R. E. Tarjan. Making data structures persistent. In Proceedings of the Eighteenth Annual ACM Symposium on Theory of Computing, pages 109–121. ACM, 1986.


[27] R. M. English and A. A. Stepanov. Loge: a self-organizing disk controller. In USENIX Winter, 1992.

[28] B. Fan, D. G. Andersen, and M. Kaminsky. MemC3: Compact and concurrent memcache with dumber caching and smarter hashing. In USENIX NSDI, 2013.

[29] M. Flouris and A. Bilas. Clotho: Transparent data versioning at the block I/O level. In MSST, pages 315–328, 2004.

[30] G. R. Ganger. Blurring the line between OSes and storage devices. School of Computer Science, Carnegie Mellon University, 2001.

[31] R. Guerraoui and M. Kapalka. On the correctness of transactional memory. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 175–184. ACM, 2008.

[32] T. Harris, J. Larus, and R. Rajwar. Transactional Memory. Morgan and Claypool Publishers, 2010.

[33] D. Hitz, J. Lau, and M. A. Malcolm. File system design for an NFS file server appliance. In USENIX Winter, volume 94, 1994.

[34] J. Jose, M. Banikazemi, W. Belluomini, C. Murthy, and D. K. Panda. Metadata persistence using storage class memory: experiences with flash-backed DRAM. In Proceedings of the 1st Workshop on Interactions of NVM/FLASH with Operating Systems and Workloads, page 3. ACM, 2013.

[35] H.-T. Kung and J. T. Robinson. On optimistic methods for concurrency control. ACM Transactions on Database Systems (TODS), 6(2):213–226, 1981.

[36] D. E. Lowell and P. M. Chen. Free transactions with Rio Vista. ACM SIGOPS Operating Systems Review, 31(5):92–101, 1997.

[37] J. MacCormick, N. Murphy, M. Najork, C. A. Thekkath, and L. Zhou. Boxwood: Abstractions as the foundation for storage infrastructure. In OSDI, volume 4, pages 8–8, 2004.

[38] M. Mesnier, G. R. Ganger, and E. Riedel. Object-based storage. IEEE Communications Magazine, 41(8):84–90, 2003.

[39] D. T. Meyer, G. Aggarwal, B. Cully, G. Lefebvre, M. J. Feeley, N. C. Hutchinson, and A. Warfield. Parallax: virtual disks for virtual machines. ACM SIGOPS Operating Systems Review, 42(4):41–54, 2008.

[40] C. Mohan, D. Haderle, B. Lindsay, H. Pirahesh, and P. Schwarz. ARIES: a transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging. ACM Transactions on Database Systems (TODS), 17(1):94–162, 1992.

[41] K.-K. Muniswamy-Reddy, C. P. Wright, A. Himmer, and E. Zadok. A versatile and user-oriented versioning file system. In USENIX FAST, 2004.

[42] E. B. Nightingale, J. Elson, J. Fan, O. Hofmann, J. Howell, and Y. Suzue. Flat datacenter storage. In USENIX OSDI, 2012.

[43] M. A. Olson. The design and implementation of the Inversion file system. In USENIX Winter, pages 205–218, 1993.

[44] A. Pennarun. Everything you never wanted to know about file locking. http://apenwarr.ca/log/?m=201012#13.

[45] D. E. Porter, O. S. Hofmann, C. J. Rossbach, A. Benn, and E. Witchel. Operating system transactions. In SOSP, pages 161–176. ACM, 2009.

[46] V. Prabhakaran, T. L. Rodeheffer, and L. Zhou. Transactional flash. In USENIX OSDI, 2008.

[47] S. Quinlan and S. Dorward. Venti: A new approach to archival storage. In USENIX FAST, 2002.

[48] C. Reid, P. A. Bernstein, M. Wu, and X. Yuan. Optimistic concurrency control by melding trees. Proceedings of the VLDB Endowment, 4(11), 2011.

[49] J. H. Saltzer, D. P. Reed, and D. D. Clark. End-to-end arguments in system design. ACM Transactions on Computer Systems (TOCS), 2(4):277–288, 1984.

[50] D. S. Santry, M. J. Feeley, N. C. Hutchinson, A. C. Veitch, R. W. Carton, and J. Ofir. Deciding when to forget in the Elephant file system. ACM SIGOPS Operating Systems Review, 33(5):110–123, 1999.

[51] M. Satyanarayanan, H. H. Mashburn, P. Kumar, D. C. Steere, and J. J. Kistler. Lightweight recoverable virtual memory. ACM Transactions on Computer Systems (TOCS), 12(1):33–57, 1994.

[52] M. Saxena, M. A. Shah, S. Harizopoulos, M. M. Swift, and A. Merchant. Hathi: durable transactions for memory using flash. In Proceedings of the Eighth International Workshop on Data Management on New Hardware, pages 33–38. ACM, 2012.


[53] M. Saxena, M. M. Swift, and Y. Zhang. FlashTier: a lightweight, consistent and durable storage cache. In ACM EuroSys, pages 267–280, 2012.

[54] R. Sears and E. Brewer. Stasis: Flexible transactional storage. In USENIX OSDI, 2006.

[55] N. Shavit and D. Touitou. Software transactional memory. Distributed Computing, 10(2):99–116, 1997.

[56] J.-Y. Shin, M. Balakrishnan, T. Marian, and H. Weatherspoon. Gecko: Contention-oblivious disk arrays for cloud storage. In USENIX FAST, pages 213–225, 2013.

[57] M. Sivathanu, V. Prabhakaran, F. I. Popovici, T. E. Denehy, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. Semantically-smart disk systems. In USENIX FAST, 2003.

[58] D. Skourtis, D. Achlioptas, N. Watkins, C. Maltzahn, and S. Brandt. Flash on rails: consistent flash performance through redundancy. In USENIX ATC, pages 463–474, 2014.

[59] G. Soundararajan, V. Prabhakaran, M. Balakrishnan, and T. Wobber. Extending SSD lifetimes with disk-based write caches. In USENIX FAST, 2010.

[60] Y. Sovran, R. Power, M. K. Aguilera, and J. Li. Transactional storage for geo-replicated systems. In ACM SOSP, 2011.

[61] L. Stein. Stupid file systems are better. In HotOS, 2005.

[62] A. Thomson and D. J. Abadi. CalvinFS: Consistent WAN replication and scalable metadata management for distributed file systems. In USENIX FAST, pages 1–14, 2015.

[63] H. Volos, A. J. Tack, and M. M. Swift. Mnemosyne: Lightweight persistent memory. ACM SIGARCH Computer Architecture News, 39(1):91–104, 2011.

[64] R. Y. Wang, T. E. Anderson, and D. A. Patterson. Virtual log based file systems for a programmable disk. Operating Systems Review, 33:29–44, 1998.

[65] J. Wilkes, R. Golding, C. Staelin, and T. Sullivan. The HP AutoRAID hierarchical storage system. ACM Transactions on Computer Systems (TOCS), 14(1):108–136, 1996.

[66] C. P. Wright, R. Spillane, G. Sivathanu, and E. Zadok. Extending ACID semantics to the file system. ACM Transactions on Storage (TOS), 3(2):4, 2007.

[67] Y. Zhang, L. P. Arulraj, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. De-indirection for flash-based SSDs with nameless writes. In USENIX FAST, 2012.
