
LVD: Lean Virtual Disks

Gaurab Basu
[email protected]

Shripad Nadgowda
[email protected]

Akshat Verma
[email protected]

ABSTRACT

In this work, we present Lean Virtual Disks (LVD), a new virtual disk format for virtualized servers. LVD transparently consolidates duplicate blocks across virtual machines to create a lean disk image, leading to a merged datapath for all virtual machines. This merged datapath allows efficient storage usage, reduction in disk I/O (read/write) by eliminating I/O for the same content across VMs, and efficient host cache utilization. LVD is motivated by clouds, where VMs are created from golden masters and use standardized middleware and management tools, leading to high content similarity. We implement LVD as an extension of QCow2 and study its ability to improve common data center system management activities as well as the application performance of popular I/O benchmark workloads. We observed that LVD reduced disk space and disk I/O by 70%, making applications run faster by 25% on average.

1. INTRODUCTION

Figure 1: Storage Stack in Virtualized Environment

I/O performance in virtualized environments has traditionally been viewed as inferior to I/O performance in native environments. I/O processing requires an additional hop at the hypervisor, leading to both CPU overhead and higher I/O latency [5]. In this work, we conjecture that leveraging duplicate content across virtual machines can help us alleviate some of the performance overhead associated with I/O processing in virtualized environments. Towards this goal, we have designed Lean Virtual Disks (LVD), a new disk image format for virtualized environments.

LVD is based on the idea that a lot of common content across virtual machines is stored multiple times within each disk file [11]. Content similarity across VM images occurs because of the use of similar operating system instances as well as the widespread use of master images within publicly available virtual appliance libraries to create VMs [7, 27]. Duplicate content is stored (wasted storage space), read or written (wasted disk I/O), and cached in the buffer cache (wasted memory) multiple times.

Fig. 1 captures a typical storage stack in a virtualized environment. Duplicate elimination has focused on either merging duplicate content in the page cache or on secondary storage for saving space. The first approach does not save storage space or disk I/O. The second approach merges blocks at a logical level that the host is not aware of; hence, the host needs to issue duplicate I/O and maintain duplicate content in the buffer cache. Merging duplicate blocks at the image file level allows us to eliminate redundant storage, disk I/O, and cached pages.

LVD introduces the notion of a shared image file, which is a merged collection of disk blocks across the virtual machines. The LVD driver sits in the vDisk Emulator and translates requests from different virtual machines to a block on the shared image file. Merging duplicate blocks at the logical level of disk images ensures that the image data store keeps duplicate content only once. Further, since duplicate blocks across virtual images are merged at the level of the image file, the buffer cache on the host keeps a single copy of every unique content block, and I/O requests for duplicate content are not made to the image data store. Finally, the LVD driver only modifies the (existing) mapping of virtual to physical blocks maintained by the vDisk Emulator and does not require any modifications to the read path.

Implementing a host-aware duplicate elimination strategy presents its own challenges. Duplicate elimination techniques at the storage layer do not use host resources; they can maintain large hash indexes and perform out-of-band reading and merging of data blocks. LVD needs to ensure minimal and deterministic use of host resources so that applications are not impacted. Moreover, virtual disk drivers are designed with the assumption that they make exclusive use of the image file. Sharing a lean image disk across virtual machines with high concurrency, while maintaining isolation, presents significant technical challenges. We present several novel ideas, like index sampling, semi-private coarse-grained dynamic space allocation, a shared hash index with range locks, and leveraging copy-on-write mechanisms to enable sharing, to address these challenges. LVD is generic and we have implemented it as an extension of QCow2 [15]. We perform a detailed experimental evaluation with micro benchmarks, typical data center system management activities, and benchmark applications. We observed that LVD reduces space and disk I/O by more than 70% for a large number of benchmarks, making applications run faster by around 25% for many I/O intensive applications. Compared to LVD, host-level filesystem deduplication achieves 75% lower read throughput, 5% lower write throughput, uses 3X to 4X more memory and 1.1X to 3X more CPU.

2. LVD DESIGN OVERVIEW

We first present an overview of the high-level design choices available for designing virtual disks that eliminate duplicates across virtual machines.

2.1 Understanding the Design Space

Figure 2: Virtual to Physical Block Mapping in LVD

The goal of LVD is to remove redundancy in virtual disks, enabling higher efficiency in storage space, storage bandwidth (IOPS), and host page caches. A virtual disk consists of virtual blocks that are used to allocate storage to applications. These virtual blocks are mapped transparently to physical blocks allocated from physical storage. Most common virtual disk formats encapsulate the metadata (the virtual-to-physical disk mapping) as well as the data within one image file, and LVD merges blocks at this layer, avoiding any additional layers of indirection. Our goal is to identify duplicate blocks across virtual disks and merge them on physical storage. Fig. 2 captures one possible layout for lean virtual disks. In this example, Disk 1 and Disk 2 are identical in all blocks barring the second block (2). Hence, virtual blocks 1, 3, 4 and 5 are merged across the two disks, leading to the mapping depicted in the figure. Clearly, such a mapping violates the clean independent structure of virtual disks, which encapsulate metadata and data within independent image files.

Linking Instance-specific Disk Files Versus a Shared Disk File: There are two high-level design choices for the structure of LVD. The first option is to retain individual files for virtual disks and modify the metadata of one image to reflect that its physical blocks may be present in a different virtual disk. In our example above, the metadata of Disk 2 can link blocks 1, 3, 4 and 5 to Disk 1. The metadata may either capture the exact physical addresses of these blocks on the linked disk or redirect the metadata query to the linked disk. The first option suffers from consistency issues, as two VMs may attempt to modify blocks concurrently. This option also creates management issues due to the dependency between virtual disks: any change made by the master disk needs to be propagated to all dependent disks. The second high-level design option is to create an additional single shared file abstraction across all virtual disks. All requests across the disks are routed to this shared file, which maintains the metadata as well as the data for all virtual disks. This, we believe, is a cleaner design and is the one pursued by LVD.

Inline Deduplication Versus Scan-based Deduplication: The second key design decision for LVD is the choice between inline deduplication and scan-based deduplication. Scan-based deduplication does not introduce any latency during regular I/O operation. However, it does lead to additional I/Os to merge duplicate data blocks and update metadata, which can impact application performance on the host. We chose inline deduplication to avoid host contention and reduce disk I/Os. Building a virtual disk with a shared image file and inline deduplication raises some key issues that we discuss next.

2.2 Design Issues and Approach

Designing lean virtual disks that eliminate redundancy across storage space, storage bandwidth, and host caches requires dealing with significant challenges, which we enumerate next.

Bounded Impact on Host: LVD runs on a production host server and needs to ensure that the amount of host resources used is deterministic. Using in-band deduplication ensures that data blocks do not need to be read or written again for deduplication. Since the metadata (virtual-to-physical mapping) is typically in cache, the I/O overheads of LVD are minimal. However, we need to compute hashes (CPU resources) and store the hashes in an index (memory usage) on the host. LVD performs index sampling within a bounded amount of memory. Further, since deduplication is performed at the logical level of the image file, which is visible to the host, we avoid the need for another cache layer and directly use the host buffer cache. Finally, the read path is unmodified and hash computation is avoided for reads.

Decrease in Sequential Traffic: Traditional thick-provisioned virtual disk formats pre-allocate physical space for each virtual disk to ensure contiguous space for sequential access. The allocation of virtual blocks to applications is also done in a manner that encourages contiguous physical block allocation for semantically related data. Hence, workloads with sequential access patterns often lead to sequential disk activity, leading to improved performance. Merging physical blocks across disks and allocating space to multiple virtual disks may lead to increased randomness as perceived by the physical disks, leading to lower storage performance. We use semi-private coarse-grained dynamic space allocation to ensure that VMs maintain locality. When a VM needs space for its data (or metadata), physical space allocation is done in a semi-private area, allowing private sequential disk traffic from each virtual machine to remain sequential in most cases. Further, space is allocated in coarse-grained chunks (e.g., 1GB), ensuring that space is allocated infrequently, alleviating any performance impact due to locking during space allocation. Finally, we conjecture that shared segments across virtual machines are large in size and are accessed in the same order by all VMs accessing them; hence, locality is preserved for shared data as well. This is validated in our Patch experiment, where we observe that the average size of a shared segment is 19 clusters (76KB) and the maximum size is 4719 clusters (19MB). Fig. 3 captures the semi-private physical space allocation, which allows merged blocks in a disk's semi-private space to be accessed by other virtual machines as well.

Figure 3: Design overview. For simplicity, metadata is shown as 1-level. In practice, metadata will have multiple levels and all metadata other than Level-1 will be stored in the semi-private area along with the data to ensure locality between metadata and data.

Handling Parallel Access to the Shared Image File: There are two separate issues that need to be handled. First, LVD needs to ensure consistency without expensive locking that may impact performance. Different virtual machines may modify a merged block at the same time, and expensive locking of physical blocks across the disks may lead to significant performance problems. We leverage the copy-on-write mechanism typically used to implement snapshots in virtual disks. Any physical block that is deduplicated is marked as copy-on-write, allowing changes to those blocks to be trapped and handled appropriately. Copy-on-write support is implemented in most popular virtual disk implementations like QCow2 or vmdk, allowing LVD to be implemented as an extension to existing virtual disk formats. Fig. 3 illustrates the proposed strategy with 3 virtual disks. Second, a single disk file across multiple virtual disks also leads to namespace issues: different disks may have virtual addresses that conflict with each other and need to be resolved in LVD. In order to deal with this issue, we assign a unique identifier to each VM and use the first few bits of a virtual address to capture the virtual machine issuing an I/O request. This allows LVD to use the appropriate private metadata to serve the I/O request.

Maintaining Shared Index: All virtual disks need to identify if a block is already present in another virtual disk. This requires a shared index to be maintained, which will be queried and updated concurrently by all virtual disks. In this respect, LVD shares the problem of maintaining a shared index with decentralized deduplication systems like [6] and employs similar ideas. In order to ensure that the index can be updated by different virtual disks independently, we partition the hash index into buckets. Concurrent updates to a bucket are not allowed; however, different buckets can be updated concurrently. This 2-level index structure allows us to perform index lookups in an almost fully distributed manner. Further, in order to perform reverse lookups, we extend the metadata to also include the hash of the content. This allows us to update the old entry for a block without scanning the entire map.

Figure 4: Flow for a write request in LVD. (A, B, C, D) are the specific cases where the write is a write to a new block with unique content (A), a write to a new block with duplicate content (B), a rewrite with unique content (C), and a rewrite with duplicate content (D).

2.3 Design Overview

We next present the salient aspects of LVD.

We now describe an end-to-end flow for the LVD system. Read requests in LVD are handled without any change. If the requested block is present in the page cache, it is served directly. Otherwise, the request is served from the shared image file using the appropriate private metadata. If a read request from one VM loads a physical block into the host page cache, then subsequent reads for the same content across VMs are served from the page cache. Read requests do not require any locking, and concurrent reads to different content in the shared image file can be performed trivially.

In LVD, write requests are used for duplicate identification and differ significantly from write requests in traditional virtual disks. Fig. 4 captures how a write request flows through the system, either merging into an existing block or creating a new block, which can be leveraged by future writes. A write is first processed by the private file driver, where the logical address is masked to identify the VM making the request. The masked request is then forwarded to a shared driver. The shared driver first computes a hash for the content and queries a content index. If the content is already present, the write is marked as a duplicate write (cases B and D in Fig. 4). We then query the metadata manager, which uses the VM mask to identify the appropriate metadata for the request. The metadata manager identifies whether this is a fresh write or a rewrite of an existing block. Once the exact scenario (among A, B, C, and D) is identified, the rest of the flow is executed.


If there is new content (A or C), the content index needs to be updated. Similarly, if this was a rewrite, the older hash may need to be invalidated (C or D). The older hash is obtained from the metadata, which also contains the content hash. For updating the content index, we first get a range lock on the appropriate part of the content index and then update the index. The next step is to allocate space if the content is new (A or C). Space is pre-allocated in coarse-grained chunks, and the space manager typically returns pre-allocated blocks. If the space manager has run out of pre-allocated space, it locks the physical space allocation manager, reserves a large chunk (e.g., 1 GB), and allocates the required blocks from this chunk. This is followed by an update to the reference count table to capture the decrease in reference count for the older entry (C and D) and/or the increase in reference count for the new entry (A, B, C, D). Finally, the metadata is updated with the new content hash and the address of the physical block. For unique content (A and C), the last step is a request to the physical disk driver to write the content to the appropriate physical block.
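To make the case analysis concrete, the sketch below encodes the four write cases and the per-case actions described above as a small, self-contained C program. The enum names and the printed step list are illustrative only; they do not correspond to the actual QCow2/LVD source code.

#include <stdbool.h>
#include <stdio.h>

/* Four write cases from Fig. 4 (illustrative names, not from the LVD source). */
typedef enum { CASE_A, CASE_B, CASE_C, CASE_D } write_case;

/* A fresh write hits a block with no existing mapping; a rewrite hits a mapped
   block. "Duplicate" means the content hash was found in the shared index. */
static write_case classify_write(bool already_mapped, bool duplicate_content)
{
    if (!already_mapped)
        return duplicate_content ? CASE_B : CASE_A;
    return duplicate_content ? CASE_D : CASE_C;
}

/* Print the steps of the write flow that apply to a given case. */
static void describe(write_case c)
{
    bool new_content = (c == CASE_A || c == CASE_C); /* index update, space alloc, disk write */
    bool rewrite     = (c == CASE_C || c == CASE_D); /* old hash and old refcount handling    */

    printf("case %c:\n", "ABCD"[c]);
    if (new_content) printf("  - add new hash to content index (under range lock)\n");
    if (rewrite)     printf("  - invalidate old hash, decrement old block's refcount\n");
    if (new_content) printf("  - allocate a physical cluster from the VM's pre-allocated chunk\n");
    printf("  - increment refcount of the target physical block\n");
    printf("  - update private metadata (physical address + content hash)\n");
    if (new_content) printf("  - issue the physical write\n");
    else             printf("  - no physical data write needed (content already on disk)\n");
}

int main(void)
{
    describe(classify_write(false, false)); /* A: new block, unique content    */
    describe(classify_write(false, true));  /* B: new block, duplicate content */
    describe(classify_write(true,  false)); /* C: rewrite, unique content      */
    describe(classify_write(true,  true));  /* D: rewrite, duplicate content   */
    return 0;
}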

3. IMPLEMENTATION

We have designed LVD as a generic extension to any virtual disk format that supports copy-on-write. As a concrete implementation, we extended the QCow2 image format to support lean disks. Our modifications are less than 2000 lines of C code and can be easily extended to other similar image formats. The major code contributions are the implementation of deduplication (Section 3.4) and the L2 table schema changes (Section 3.1). We next discuss the specific aspects of QCow2 that we modified.

3.1 Logical to Physical Address Translation

The fundamental unit of physical storage in QCow2 is a cluster, and the addressing scheme translates a virtual address to a cluster and an offset within that cluster. Each cluster can be configured to a size between 512B and 2MB. Logical address translation is performed using a 2-level address lookup consisting of L1 tables and L2 tables (Figure 5). A virtual address is 64 bits long and consists of 3 parts. The least significant bits map to an offset within the cluster and are determined by the configured cluster size (e.g., for 4KB clusters, the 12 LSBs are cluster-bits). The next set of bits are the L2-bits, which are used as an index into an L2 table. Since an L2 table is a single cluster containing 8-byte entries, L2-bits = cluster-bits - 3. The remaining bits are the L1-bits, which index into the L1 table. L1 tables are pre-allocated at the start of the image file and are completely cached in main memory. L2 tables and data clusters are allocated dynamically to ensure data locality. A fixed-size LRU cache is also maintained for the L2 tables.
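As an illustration of this 2-level lookup, the following self-contained C snippet splits a 64-bit guest offset into its L1 index, L2 index, and in-cluster offset for a given cluster size; the function and variable names are ours, not QEMU's.

#include <stdint.h>
#include <stdio.h>

/* Split a 64-bit guest offset into (L1 index, L2 index, in-cluster offset)
   for a QCow2-style 2-level lookup. cluster_bits is log2(cluster size), and
   each L2 table is one cluster of 8-byte entries, so l2_bits = cluster_bits - 3.
   Illustrative code, not QEMU source. */
static void split_qcow2_offset(uint64_t offset, unsigned cluster_bits,
                               uint64_t *l1_index, uint64_t *l2_index,
                               uint64_t *in_cluster)
{
    unsigned l2_bits = cluster_bits - 3;

    *in_cluster = offset & ((1ULL << cluster_bits) - 1);
    *l2_index   = (offset >> cluster_bits) & ((1ULL << l2_bits) - 1);
    *l1_index   = offset >> (cluster_bits + l2_bits);
}

int main(void)
{
    uint64_t l1, l2, off;
    /* 4KB clusters: cluster_bits = 12, l2_bits = 9, l1_bits = 43. */
    split_qcow2_offset(0x123456789ULL, 12, &l1, &l2, &off);
    printf("L1=%llu L2=%llu offset=%llu\n",
           (unsigned long long)l1, (unsigned long long)l2,
           (unsigned long long)off);
    return 0;
}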

In QCow2, logical addresses across VMs may overlap without problems, as they are mapped to different disk image files. In LVD, we avoid cluster address collisions by masking the cluster address for each VM. When a VM is created or launched with LVD, an 8-bit unique identifier is persisted in the header of the VM's private file. Each read/write request coming from the VM is then translated by masking the 8 MSBs of the logical address with the respective VM's identifier. This ensures that the shared file sees different logical addresses for requests coming from different VMs and can appropriately resolve these addresses. The logical address in LVD is thus split into 4 parts: VM-bits, L1-bits, L2-bits, and cluster-bits. The use of VM-bits reduces the address space of the virtual disk; e.g., with the 8 VM-bits used in LVD, the shared disk can have a maximum size of 64PB and can be shared between 256 VMs on a single host.

Figure 5: L2 Table Design. With 4KB clusters, QCow2 splits a 64-bit address into 43 L1-bits, 9 L2-bits, and 12 cluster-bits, with 8-byte L2 entries (512 per cluster); LVD splits it into 8 VM-ID bits, 37 L1-bits, 7 L2-bits, and 12 cluster-bits, with 32-byte L2' entries (an 8-byte cluster offset plus a 24-byte hash and padding, 128 per cluster).
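A sketch of this address split is shown below; the constants follow the numbers above (8 VM-bits, 4KB clusters, 32-byte L2' entries, hence 7 L2-bits and 37 L1-bits), while the function and macro names are ours, not taken from the LVD code.

#include <stdint.h>
#include <stdio.h>

#define VM_BITS      8
#define CLUSTER_BITS 12                  /* 4KB clusters                          */
#define L2_BITS      (CLUSTER_BITS - 5)  /* 32-byte L2' entries -> 128 per cluster */
#define L1_BITS      (64 - VM_BITS - L2_BITS - CLUSTER_BITS)   /* = 37 */

/* Mask the VM identifier into the 8 MSBs of the logical address, as done by
   the private file driver. Illustrative, not the LVD implementation. */
static uint64_t lvd_mask_address(uint8_t vm_id, uint64_t logical)
{
    return ((uint64_t)vm_id << (64 - VM_BITS)) |
           (logical & ((1ULL << (64 - VM_BITS)) - 1));
}

int main(void)
{
    uint64_t addr = lvd_mask_address(3, 0x123456789ULL);

    uint64_t off = addr & ((1ULL << CLUSTER_BITS) - 1);
    uint64_t l2  = (addr >> CLUSTER_BITS) & ((1ULL << L2_BITS) - 1);
    uint64_t l1  = (addr >> (CLUSTER_BITS + L2_BITS)) & ((1ULL << L1_BITS) - 1);
    uint64_t vm  = addr >> (64 - VM_BITS);

    printf("VM=%llu L1=%llu L2=%llu offset=%llu\n",
           (unsigned long long)vm, (unsigned long long)l1,
           (unsigned long long)l2, (unsigned long long)off);
    return 0;
}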

The structure of the shared file leverages the ideas employed by QCow2 to ensure locality between metadata and data. We pre-allocate clusters to place the L1 tables for up to 256 VMs at the start of the shared file. Clusters for L2 tables and data clusters are allocated dynamically. Hence, L1 tables are cached in memory, whereas L2 tables and their corresponding data clusters are spatially close to each other. It is also important to note that we do not share metadata (either L1 or L2 tables) across images. This allows metadata updates across VMs to be performed completely concurrently.

3.2 Snapshot and Reference Counting

QCow2 allows VMs to start with a read-only shared Base Image (or Backing File) and their own clean Private image file. The Base Image is marked as copy-on-write, allowing any writes to the base image to be redirected to the private image file. Hence, the Private image only contains the clusters that have changed w.r.t. its Base Image.

QCow2 implements snapshots using a copy-on-write (COW) mechanism. Snapshots are linked in a chain with the current image as the head. When a cluster is not present at the head, the next snapshot in the chain is consulted. QCow2 also maintains a reference count for each cluster to track the number of snapshots using that cluster. The reference count is maintained in a table with a 2-byte entry for each physical cluster. A cluster with a reference count greater than 1 indicates one or more active snapshots for an image.

We piggyback on this chain-based redirection to redirect I/O requests for shared data to the shared file. Similarly, we use the copy-on-write mechanism to flag shared clusters: writes to a shared cluster trigger a copy-on-write. However, in LVD we cannot use the per-entry copy-on-write bit to indicate sharing. This is because the L2 table for each virtual machine is separate, and we want to avoid the overhead of updating the metadata of other VMs when we write a cluster in one VM. Instead, we use only the RefCount table to identify whether a cluster is shared.


In LVD, a cluster is marked copy-on-write only when it gets deduplicated and is shared across multiple VMs. We cannot exercise QCow2's optimization technique primarily because, when a cluster is deduplicated and is to be marked copy-on-write, we would need to find all the L2 entries pointing to that cluster, and this inverse mapping is not maintained. So we decided to implement a single globally synchronized RefCount table for the shared image file.

We cache this single copy of the RefCount table in shared memory so that the LVD driver of every VM can access it. Consistency of this table is maintained using range locks, as is done for the hash index (more details in Sec. 3.4). In order to improve cache hits, we also reduce the size of a RefCount table entry to 8 bits per cluster instead of the 16 bits used in QCow2.
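The sketch below illustrates this arrangement: an 8-bit-per-cluster reference count table placed in a shared memory mapping, protected by a small pool of locks indexed by cluster range, with sharing detected as refcount > 1. The sizes, names, and locking granularity are assumptions for illustration, not the actual LVD data structures.

#define _DEFAULT_SOURCE
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define NUM_CLUSTERS  (1u << 20)   /* illustrative: 1M clusters tracked   */
#define NUM_LOCKS     1024         /* illustrative range-lock pool size   */

static uint8_t        *refcount;             /* 8 bits per physical cluster   */
static pthread_mutex_t locks[NUM_LOCKS];     /* one lock per range of entries */

static pthread_mutex_t *range_lock(uint64_t cluster)
{
    return &locks[(cluster / (NUM_CLUSTERS / NUM_LOCKS)) % NUM_LOCKS];
}

/* A cluster is shared (and hence copy-on-write) iff more than one L2 entry
   references it. */
static int cluster_is_shared(uint64_t cluster)
{
    return refcount[cluster] > 1;
}

static void refcount_add(uint64_t cluster, int delta)
{
    pthread_mutex_t *l = range_lock(cluster);
    pthread_mutex_lock(l);
    refcount[cluster] = (uint8_t)(refcount[cluster] + delta);
    pthread_mutex_unlock(l);
}

int main(void)
{
    /* Shared anonymous mapping so that per-VM driver instances could share it. */
    refcount = mmap(NULL, NUM_CLUSTERS, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    memset(refcount, 0, NUM_CLUSTERS);
    for (int i = 0; i < NUM_LOCKS; i++)
        pthread_mutex_init(&locks[i], NULL);

    refcount_add(42, +1);                 /* first writer maps cluster 42     */
    refcount_add(42, +1);                 /* a duplicate write maps it too    */
    printf("cluster 42 shared? %d\n", cluster_is_shared(42));
    return 0;
}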

3.3 Semi-private Coarse-grained Dynamic Space Allocation

Figure 6: Dynamic Space Allocation

QCow2 is a sparse image format and does not preallocate any space for data clusters; during writes, free clusters are allocated on demand. The allocation algorithm makes the fundamental assumption that temporally correlated requests are related and ensures spatial locality for them. Sharing one disk file across multiple virtual machines breaks the assumption that temporally correlated write requests are logically related. Hence, write requests from different VMs may be allocated space that interleaves, leading to degraded I/O performance due to fragmentation. In order to deal with this issue, we change the allocation policy to allocate coarse-grained space for each request. In this coarse-grained provisioning model, we allocate a predefined number of clusters to each VM at the time of instantiation. By default, we pre-allocate 1GB for each VM. When the allocated space runs out, we allocate another 1GB from the next available location in the shared image file.

To avoid race conditions during allocation, we use spinlocks. Fig. 6 captures the space allocation process implemented by LVD. Coarse-grained dynamic allocation helps us achieve two goals simultaneously: (i) clusters for one VM are almost contiguous, and (ii) space allocation requests are infrequent, and hence locking does not lead to any performance issues. It is important to note that the space allocated for each VM is only semi-private: duplicate blocks are still merged and shared across virtual machines. This semi-private space allocation allows us to achieve the benefits of deduplication while ensuring that the performance of unique data is not impacted at all.
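A minimal sketch of such an allocator is given below, assuming a per-VM cursor into a pre-allocated chunk and a spinlock-protected global cursor into the shared file; the chunk size matches the 1GB default above, but all names and structures are illustrative rather than the LVD implementation.

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define CLUSTER_SIZE  4096ULL
#define CHUNK_SIZE    (1ULL << 30)   /* 1GB semi-private chunks, as in the paper */

/* Global cursor into the shared image file; taken under a spinlock only when a
   VM needs a fresh 1GB chunk, which is infrequent. */
static uint64_t           next_free_offset;
static pthread_spinlock_t alloc_lock;

/* Per-VM allocation state: current chunk and how much of it is used. */
struct vm_alloc {
    uint64_t chunk_start;
    uint64_t chunk_used;
};

static uint64_t alloc_cluster(struct vm_alloc *vm)
{
    if (vm->chunk_start == 0 || vm->chunk_used + CLUSTER_SIZE > CHUNK_SIZE) {
        /* Pre-allocated space exhausted (or first allocation): grab a new chunk. */
        pthread_spin_lock(&alloc_lock);
        vm->chunk_start = next_free_offset;
        next_free_offset += CHUNK_SIZE;
        pthread_spin_unlock(&alloc_lock);
        vm->chunk_used = 0;
    }
    uint64_t off = vm->chunk_start + vm->chunk_used;
    vm->chunk_used += CLUSTER_SIZE;
    return off;    /* physical offset of the newly allocated cluster */
}

int main(void)
{
    pthread_spin_init(&alloc_lock, PTHREAD_PROCESS_PRIVATE);
    next_free_offset = CHUNK_SIZE;   /* pretend the first 1GB holds metadata */

    struct vm_alloc vm1 = {0, 0}, vm2 = {0, 0};
    printf("vm1 cluster at %llu\n", (unsigned long long)alloc_cluster(&vm1));
    printf("vm2 cluster at %llu\n", (unsigned long long)alloc_cluster(&vm2));
    printf("vm1 cluster at %llu\n", (unsigned long long)alloc_cluster(&vm1));
    return 0;
}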

3.4 Distributed Index Lookup

One of the key operations in LVD is identifying duplicate blocks during writes. We implemented a distributed hash index (HashMap) using shared memory as an IPC mechanism. The HashMap maintains dedup metadata for a set of recently written clusters. Each entry in the map is 32 bytes and consists of the SHA-1 hash value (20 bytes), the physical address of the content (8 bytes), and padding (4 bytes). The default size of the index store is set to 2.5% of the available host memory (400MB on the hosts with 16GB memory in our environment).
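The 32-byte entry layout can be written down directly as a C struct. The sketch below is illustrative (the field names and ordering are ours; the padding is placed before the address so the struct packs to 32 bytes without compiler-specific attributes), but the sizes match the description above.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* One HashMap entry: the SHA-1 of the cluster content, the physical address of
   the cluster in the shared image file, and padding up to 32 bytes. */
struct hashmap_entry {
    uint8_t  sha1[20];
    uint8_t  pad[4];
    uint64_t phys_addr;
};

int main(void)
{
    /* 20 + 4 + 8 = 32 bytes, naturally aligned with no compiler padding. */
    assert(sizeof(struct hashmap_entry) == 32);

    /* With the default index budget of 400MB (2.5% of a 16GB host), the map
       can hold roughly 13 million recently written clusters. */
    uint64_t budget  = 400ULL * 1024 * 1024;
    uint64_t entries = budget / sizeof(struct hashmap_entry);
    printf("index capacity: %llu entries\n", (unsigned long long)entries);
    return 0;
}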

Figure 7: Update flow for HashMap

Multiple VMs can access this HashMap concurrently, leading to consistency issues. To address this, LVD defines a custom 2-level hash lookup with range locks, as shown in Fig. 7. The entire HashMap is divided into fixed-size hash buckets (1 million buckets by default). In this 2-level hash lookup, given the hash value, we first identify the bucket index using the first 20 bits of the hash. The content hash is then searched sequentially inside the bucket. For consistency, we maintain a pool of read/write spinlocks (1000 by default) and use the same bucket index to select and acquire the corresponding spinlock. Thus we can maintain HashMap consistency with up to 1000 concurrent update operations.
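The bucket and lock selection can be sketched as follows, using the defaults above (about one million buckets and a pool of 1000 locks); the digest is taken as an already-computed 20-byte SHA-1 value, and all names are illustrative rather than taken from the LVD code.

#include <stdint.h>
#include <stdio.h>

#define NUM_BUCKETS (1u << 20)   /* ~1 million fixed-size hash buckets      */
#define NUM_LOCKS   1000         /* pool of r/w locks shared by the buckets */

/* Bucket index: the first 20 bits of the SHA-1 digest of the cluster content. */
static uint32_t bucket_index(const uint8_t sha1[20])
{
    uint32_t top = ((uint32_t)sha1[0] << 16) |
                   ((uint32_t)sha1[1] << 8)  |
                    (uint32_t)sha1[2];
    return (top >> 4) & (NUM_BUCKETS - 1);   /* keep the leading 20 bits */
}

/* The same index, folded onto the lock pool, picks the r/w lock protecting the
   bucket; buckets mapping to different locks can be updated concurrently. */
static uint32_t lock_index(uint32_t bucket)
{
    return bucket % NUM_LOCKS;
}

int main(void)
{
    uint8_t digest[20] = { 0xde, 0xad, 0xbe, 0xef };   /* stand-in digest */
    uint32_t b = bucket_index(digest);
    printf("bucket %u, lock %u\n", b, lock_index(b));
    return 0;
}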

If a cluster is overwritten and its refcount drops to 0, the corresponding metadata needs to be deleted from the HashMap. However, a write request has only two parameters: the data to be written and the logical address. Since the old content of the cluster is not available in the request, we cannot invalidate the hash entry corresponding to the old content without scanning the entire HashMap. We address this problem by extending QCow2's L2 table in LVD to store the hash of the content (20 bytes) in each L2 entry. Since the L2 entry is accessed during address resolution for every request and is typically cached, we get the hash of the old content (almost) for free and use it to invalidate the old hash entry.


3.5 Recovery after Failure

We do not store the HashMap on disk, so any server shutdown or failure leads to the loss of the HashMap. However, this does not affect functionality during normal operation. (It would also be straightforward to store the HashMap on disk during a normal shutdown.) The RefCount table, though cached in memory, is stored on durable storage during normal shutdown/startup. We do not synchronize the RefCount table with durable storage, so it is lost in case of failure. However, the RefCount table can be recovered after a failure by scanning all L1 and L2 tables, which point to physical addresses on the image. (Note that the L1 tables have fixed positions in LVD, and the RefCount table stores the reference count of each physical address on the image.)
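The recovery procedure amounts to a scan of every VM's L1/L2 tables, incrementing a counter for each referenced physical cluster. The toy sketch below shows the idea over in-memory arrays only; the actual on-disk table formats (and the L1-to-L2 walk) are not reproduced here.

#include <stdint.h>
#include <stdio.h>

#define NUM_VMS       2
#define L2_ENTRIES    4      /* toy sizes; real tables span whole clusters */
#define NUM_CLUSTERS  16

/* Toy metadata: for each VM, one flat "L2" array of physical cluster numbers
   (0 means unallocated). Real LVD walks the L1 tables to find the L2 tables. */
static uint64_t l2[NUM_VMS][L2_ENTRIES] = {
    { 3, 5, 0, 7 },   /* VM 0 maps clusters 3, 5, 7            */
    { 3, 0, 7, 9 },   /* VM 1 shares clusters 3 and 7 with VM 0 */
};

int main(void)
{
    uint8_t refcount[NUM_CLUSTERS] = { 0 };

    /* Rebuild reference counts by scanning every mapping of every VM. */
    for (int vm = 0; vm < NUM_VMS; vm++)
        for (int i = 0; i < L2_ENTRIES; i++)
            if (l2[vm][i] != 0)
                refcount[l2[vm][i]]++;

    for (int c = 0; c < NUM_CLUSTERS; c++)
        if (refcount[c])
            printf("cluster %d: refcount %u%s\n", c, refcount[c],
                   refcount[c] > 1 ? " (shared, copy-on-write)" : "");
    return 0;
}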

4. EXPERIMENTAL SETUP

We conducted a large number of experiments to evaluate the performance of LVD. Since we have implemented LVD as an extension to QCow2, we compare its performance against a vanilla QCow2 deployment. We also compare against the raw image format. Raw images have a simple design tuned for performance and provide an upper performance limit for virtual images without deduplication.

4.1 Testbed Setup

We conducted our experiments on a server with an 8-Quad 2.0GHz Intel(R) Xeon(R) processor, 16GB of main memory, and two 250GB SATA disks. Both disks were configured with the default cfq (Completely Fair Queuing) I/O scheduler. The physical server ran RHEL 6.3 with an ext4 filesystem. For each experiment, we hosted 4 guest VMs on the host in our default configuration. We also experimented with 10 VMs on a host to study LVD performance at higher VM density. Each guest VM ran Ubuntu 10.04 and was configured with 1GB of memory and a 20GB virtual disk formatted with ext2. We selected a non-journaled filesystem for the VMs merely to ensure clarity in the illustration of our results; as stated before, LVD is oblivious to the guest filesystem and works with all filesystems. In the experiments with QCow2 images, all 4 VMs had a common backing image and a private image of their own. For LVD, all 4 VMs had a common backing image and a shared image. The shared image is sparsely allocated and can grow up to 64PB. The HashMap is configured to use up to 400MB of host memory and the RefCount table up to 30MB. The default cluster size used for both QCow2 and LVD is 4KB. This default configuration is used for all experiments unless otherwise mentioned.

4.2 Evaluation Objectives

We have designed our experiments to study the usefulness of LVD as well as any associated overheads. LVD is designed to reduce storage space, reduce the number of application I/Os by reusing I/Os across VMs, and improve end-to-end application performance by serving I/O requests from the host page cache. Hence, we measure the total space, total I/O (read and write), and the application performance improvement due to LVD. In order to measure any overheads due to LVD, we measure the CPU utilization and host memory usage.

We conducted two different sets of experiments to measure performance. Our first set of experiments was used to carefully study the performance of LVD for pure I/O workloads, each performing exactly one type of operation. These experiments, called micro experiments, were designed to understand the conditions under which LVD works well and those under which it does not. In fact, most of the tuning we did for LVD, including increasing the L2 cache and prefetching, was motivated by these micro experiments. Our second set of experiments was performed to study the impact of LVD on the performance of real-life workloads.

We also compare LVD with SDFS [20], a dedup filesystem. SDFS is the system closest to LVD, as it runs on the host server and performs inline deduplication on primary storage.

4.3 Applications Used

4.3.1 Micro Benchmarks

Suite  Profile  Operation                      Data/VM
S1     P1       iterative seq read             2GB
S1     P2       parallel seq read              2GB
S2     P3       iterative seq write            2GB
S2     P4       seq write custom               2GB
S2     P5       parallel seq write             2GB
S3     P6       iterative seq unique rewrite   2GB
S3     P7       parallel seq unique rewrite    2GB
S4     P8       iterative random read          500MB
S4     P9       parallel random read           500MB
S5     P10      iterative random write         500MB
S5     P11      parallel random write          500MB

Table 1: Micro Benchmark Operations. Benchmarks are sequential (seq) or random and are run by VMs in order (iterative) or concurrently (parallel). Unique workloads (P6, P7) write unique data on previously merged blocks, triggering copy-on-write, and are the worst case for LVD.

We used Filebench [24], a filesystem and storage benchmark tool, to generate customized I/O workloads inside the guest VMs. The workload profiles listed in Tab. 1 are executed.

4.3.2 Macro Benchmarks

We have categorized our macro test suite as follows.

a) System management operations: We experimented with the following system management activities that are performed quite often in data centers. These experiments were executed in parallel on all 4 guest VMs.
Download: Download of a combination of large and small files over the network, occupying a total space of 2GB on the system.
Decompression: We use 'tar' to untar a compressed Linux source package (2.6.34). During untar, many directories and small files are created. The tar file occupies 70MB, whereas after decompression it occupies 450MB.
Install: We installed CAD software for Linux (FreeCAD) of size 400MB on the system. Installation was done from a locally downloaded .deb package to avoid network delay during install.
Backup: We rsync the complete data directory to a remote NAS server.
AVG Scan: The AVG Antivirus protection suite for Linux scans (reads) all files on the disk to check for corruption or infection.
Patch: We apply a set of 12 security patches for Ubuntu. The total download size of the patches was around 160MB, with a maximum patch size of 100MB, while the others ranged from KBs to MBs. We used Tivoli Endpoint Manager [9], an endpoint management tool, to apply the patches in an automated fashion.

b) Clustered applications: We have experimented with the following common clustered applications.
Mysql Cluster: A Mysql cluster was set up without replication, with 1 management node and 4 data nodes. We use SysBench [25] to run the clustered database under an intensive OLTP workload. We inserted 10M tuples in the cluster. Then, we ran the OLTP Read+Write benchmark with 16 client threads making a total of 1.9M requests (the read-write ratio was 2.8).
Web Cluster: We had a typical 'Tomcat Load Balancing' setup with one node hosting the load balancer and 4 nodes hosting Tomcat servers. We used httperf to drive a web application hosting 1000 unique pages, with a request rate of 100 per second for a maximum of 100,000 requests.
Hadoop Map-Reduce: We set up a Hadoop cluster with 1 master and 4 slave nodes, with no replication. We experimented with the well-known TeraSort Hadoop benchmark. We measured application performance for each of the three stages of the benchmark, namely teragen, terasort, and teravalidate, on a workload of 5GB of data.

c) Cloud application/case study: In this set of experiments we used CloudSuite [2], a benchmark suite for scale-out applications based on real-world software stacks. We selected the Data Serving application for benchmarking, which covers large-scale web applications backed by a NoSQL datastore. During setup, we install the Cassandra [1] database on 4 VMs. Using the Yahoo! Cloud Serving Benchmark (YCSB) [3], a dataset with 5M entries is generated and populated into the database. Then, the YCSB client generates a read/update workload on the database. The workload contains a total of 1M read/update operations.

5. EXPERIMENTAL RESULTS

5.1 Micro Benchmark Results

We now study the performance of LVD as compared to qcow2 and raw.

Read Performance: We observed (P1, P2, P8, P9 in Fig. 8(a) and Fig. 8(b)) that LVD outperforms raw and qcow2 for all read benchmarks. The read path in LVD has no overheads and directly benefits from any duplicate data, leading to significant performance improvement. For iterative sequential reads, Fig. 8(e) shows the disk at the host being accessed only for the first VM, in contrast with qcow2, which accesses the disk for each VM. Fig. 8(f) also shows the host cache being populated by the first VM and then used by the other VMs. Raw images have disk and memory access patterns similar to qcow2. Clearly, LVD is very effective for read-intensive workloads.

Write Performance: We next study random write performance (P10, P11). Deduplication is able to save write I/Os significantly for random writes as well (Fig. 8(d)). However, random access significantly impacts caching of the L2 table. As we discussed in Section 4.4, LVD implements L2 table pre-fetching, i.e., 4 L2 table clusters are fetched for every L2 cache miss. This makes LVD read more data than qcow2. In comparison, raw does not store or read any metadata. Hence, raw performs better than both QCow2 and LVD (Fig. 8(b)). Sequential writes do not suffer from this caching problem. For sequential writes (P3, P5), we observe savings in disk I/O at the host (Fig. 8(d)). However, we do not see a corresponding performance benefit for LVD in Fig. 8(a). Filebench needs to ensure that data is not duplicated within each VM (only cross-VM duplication is permitted); hence, it generates random data, which creates an overhead. We separately benchmarked Filebench and observed that the test suite could only generate unique write content at up to 50MB/s. To validate this hypothesis, we created our own custom test suite, P4, which can generate random data at a faster rate, and observed significant performance improvement for LVD. For this experiment we ran the sequential write workload on 4 VMs iteratively. We observed the minimum throughput (51.2 MB/s) only for the first VM, while for the other 3 VMs the test suite reported a maximum throughput of 107 MB/s, averaging 79 MB/s across all 4 VMs on LVD. LVD also shows comparable write performance for unique content (P6, P7), where it cannot benefit from any deduplication but still incurs the overhead of duplicate identification and copy-on-write. In summary, we observed that LVD is suitable for sequential write workloads and performs similarly to QCow2 for workloads that perform random and/or unique writes.

Read Latency: Disk latency is an important parameter for some applications, so we study the average latency for each completed I/O request across all 4 VMs in Tab. 3. The latency for sequential reads is very low and is omitted. For random reads, we observe that LVD incurs significantly lower latency when the content is duplicated, and comparable latency for unique-content reads.

High VM Density: We ran P2 and P5 on a set of 10 VMs for LVD and raw. We observe that even at high VM density LVD outperforms the other disk formats. Further, eliminating duplicate I/O also helps LVD alleviate the overhead of I/O virtualization. As a consequence of the reduced I/O, the CPU usage of LVD is much lower than that of raw.

5.2 Dedup FS vs LVD

We compare LVD with the SDFS [20] filesystem. For this experiment, we mounted SDFS on the physical host and stored the raw disk images of 4 VMs on this filesystem. These VMs were subjected to the P2 and P5 workloads from Tab. 1. We monitored workload performance and host resource utilization and compared them with LVD, as shown in Fig. 9. A few important distinctions are: (i) for reads, LVD easily outperforms the filesystem dedup approach with almost 4x less CPU utilization; this is attributed to the fact that LVD requires no change in the read path, while filesystem deduplication needs additional computation; (ii) filesystem deduplication has a significantly higher memory requirement, since it maintains dedup metadata for every file block, while LVD manages that metadata within a fixed amount of memory, yet, as seen from the write throughput and disk I/O, LVD achieves equally effective deduplication; (iii) filesystem deduplication incurs additional disk I/O to persist duplicate block mappings, while in LVD this mapping is maintained inherently as part of the virtual disk metadata, so no additional storage is required.

Figure 8: Micro Experiments Evaluation. (a) Application Throughput (b) Application Throughput (on log scale) (c) Avg. Host Memory Use (d) Total Host Disk IO (e) Seq. Read Disk I/O Timeplot (f) Seq. Read Memory Usage Timeplot

Figure 9: Dedup FS Comparison. (a) Throughput and Disk IO (b) Host CPU/Memory Overhead

Figure 10: Cluster workloads (log scale)

In summary, we make the following observations from our micro benchmark experiments. LVD is eminently suitable for read-intensive workloads and works well for sequential write workloads. Its performance is comparable to QCow2 for random write workloads. Further, even when a workload has zero duplicates, LVD performs as well as QCow2 (which is the implementation baseline for LVD). Hence, our inline deduplication has low overhead for duplicate checks and incurs low latency during writes. LVD also outperforms deduplication at the filesystem level by avoiding additional redirection.

                     Iterative   Parallel
raw                  3.1         8.55
qcow2                3.2         13.4
lvd (S4)             1.22        3.8
lvd (no duplicates)  3.2         12.57

Table 3: Application latency (ms) for random reads

5.3 System Management Results

In this set of experiments we ran the most common system management operations performed in data centers.

Each of these operations was performed in parallel on 4 VMs provisioned with the different disk formats, namely raw, qcow2, and lvd. Fig. 11(a) reports the completion time for the slowest of the 4 VMs. Fig. 11(b), (c) and (d) capture the monitored disk I/O (read/write), memory, and CPU use at the host during the experiments. Overall, we observe that these operations complete faster with LVD and with lower resource utilization on the host.

We observe a significant reduction in activity completion time for the download, untar, and install operations. backup, scan, and patch exhibit only a modest reduction in completion time, even though disk I/O is significantly reduced for all 6 activities. The performance of the backup and patch operations is limited by the available network bandwidth to the backup and TEM servers, respectively. In our setup, we measured the maximum bandwidth to these servers to be around 100 MB/s, and we achieved a maximum network throughput of approximately 85MB/s with all image formats. This confirms our hypothesis that these activities were constrained by the network, so improved disk performance does not benefit them. We also observe a side effect of the reduced disk I/O with LVD: since the IOPS at the host disk are reduced, the CPU overhead of I/O processing is reduced, leading to lower CPU utilization for VMs running LVD (Fig. 11(d)).

       Sequential Read Experiment                Sequential Write Experiment
       IOPS(MBps)  IO(GB)  Mem(GB)  CPU(%)       IOPS(MBps)  IO(GB)  Mem(GB)  CPU(%)
raw    8.16        21.28   5.3      45.14        11          21.49   4.74     65.92
lvd    50.0        2.3     1.8      13.12        19.5        2.9     1.61     35.76

Table 2: Throughput, host IO, memory and CPU usage during the sequential read/write experiments on 10 VMs

5.4 Cluster Application Benchmarks

We finally report results for the experiments with clustered applications, namely Hadoop, Mysql-cluster, and the Apache web cluster. Figure 10 summarizes application throughput for each of these applications.

For the httperf workload, Tab. 4 reports the reply rate, reply time, and total number of requests processed. For LVD we observe a significant reduction in response time, an increase in reply rate, and increased throughput. The httperf workload primarily generates HTTP GET requests for different static HTML pages. The load balancer then distributes these requests across the web servers hosted on the VMs in the cluster. Thus, the workload mainly generates read requests on the VMs. But since the web servers on all VMs host identical pages, the data is deduplicated across them, so most of the requests across VMs are served from the host cache, saving disk I/O. This leads to lower read I/O as well as less memory required by LVD compared to the other two image formats.

We ran the OLTP workload on mysql-cluster. For this workload, we observed about 10x write I/Os saved at the host, leading to a throughput improvement of around 2.5x for LVD as compared to raw or qcow2, as shown in Tab. 5. This is significantly higher than the maximum inherent redundancy across VMs (4x). Mysql generates duplicate data within each VM due to three operations, namely logging, local checkpointing, and the use of the double-write buffer. Local checkpointing creates snapshots frequently and there is a lot of duplication between them. With mysql's double-write buffer, every write operation is performed twice (first in the double-write buffer, which is a file, and then in the final location) in order to maintain consistency. LVD transparently deduplicates within each VM without compromising any of these consistency operations. To illustrate this, consider the case where the application writes to the double-write buffer followed by a write to the final position. If a failure occurs in between, the two contents differ and we do not merge the two writes; we merge the writes only when both contents are the same, which is exactly when the copy in the buffer is redundant.

We next study Hadoop application performance with the well-known TeraSort benchmark. In the first stage, teragen generates (writes) random data as the input dataset to be used in the later stages. Clearly, this is the worst-case workload for LVD, as there is no duplicate content; hence, LVD performs as well as qcow2. In the second stage, terasort partitions the input dataset, sorts it using map-reduce, and writes the sorted result back to HDFS. Since the sorted data may share duplicate content with the input data during this stage, we observe LVD outperforming qcow2 as well as raw. The last stage, teravalidate, primarily performs reads, and LVD outperforms qcow2 and raw in this phase due to better utilization of the host cache.

5.5 Case Study: Data Serving

We study the performance of the Yahoo! Cloud Serving Benchmark (YCSB) on the Cassandra datastore. In this experiment, 5M entries are inserted into the datastore before running the benchmark. Then YCSB is run, consisting of 1M read/update operations. We observe that during the insert phase LVD saves 2GB (10%) of writes compared to the other image formats. (Read/write I/O is more or less equal across the different image formats during the benchmarking phase, so we do not report it here.) We obtain a performance improvement of about 20% over QCow2 in terms of throughput. Average latency is also lower for LVD than for the other image formats, while maximum latency is lowest for the raw image format compared to LVD and QCow2.

5.6 Image Management Operations

                        7GB    15GB
convert lvd to raw      171    392
convert raw to lvd      166    382
convert qcow2 to raw    141    353
convert raw to qcow2    135    317
delete lvd              29     51

Table 7: Virtual disk management operations (sec)

We also implemented image management operations in LVD and report the average completion times in Tab. 7. For LVD, these operations were performed when 4 VMs were sharing a single disk image, all the VMs were shut down, and the host cache was cleared. The size represents the approximate amount of data a VM had when it was subjected to the respective operation. For convert, which reads the source and writes the target image format, we observe that lvd-to-raw conversion is slower than qcow2-to-raw. This is because in LVD the data for a VM gets fragmented in the shared image file at the boundaries of the coarse-grained allocation chunks. Thus, in general, for disk-scanning workloads where the host cache is not hit, the fragmentation introduced by LVD becomes a bottleneck. Similarly, for the delete of a VM from a shared disk image, we need to clean up the deduplication metadata in the HashMap as well as the shared image's own metadata (L2 tables, RefCount) for this VM. As shown in Tab. 7, the delete operation in LVD takes more time than a simple file operation like rm would take for other disk formats. However, the overhead of LVD over QCow2 for these operations is marginal (20%). Since these operations are infrequent, we believe this minor overhead is more than offset by the significant increase in steady-state performance due to LVD.


Figure 11: Macro Experiments Evaluation. (a) Completion Time (normalized w.r.t. raw) (b) Total Host Disk IO (c) Avg. Host Memory Use (d) Avg. Host CPU Use

        minR   avgR   maxR   Replies   Response   Transfer   Read(GB)   Write(GB)   Memory(MB)
raw     3.0    14.6   19.0   15676     23.3       5.1        4.22       0.27        8382
qcow2   3.2    14.5   18.6   15623     22.9       5.4        4.15       0.50        8203
lvd     1.0    67.6   81.2   71331     5.87       1.34       1.37       0.23        6753

Table 4: Min, avg and max reply rate (per sec), total number of replies (out of 100,000 in the workload), reply time in sec (Response and Transfer), total read/write IO, and active memory for the web (httperf) workload

Our experiments conclusively establish that LVD improves application performance and host resource utilization for a large variety of workloads. We observed average disk I/O savings of around 70%, making applications run faster by around 25% across the real-life benchmarks studied.

6. RELATED WORK

In this work, we present LVD, a new virtual disk image format with the goal of reducing disk space, disk I/O, and host memory by eliminating duplicates. To the best of our knowledge, LVD is the first virtual disk format that supports deduplication across virtual machines. However, the goals of LVD - deduplication for space, I/O, or memory - have been a popular area of research. LVD differs from existing techniques in achieving all three goals at the same time without requiring any change in guests or hosts.

6.1 Memory Deduplication

Memory deduplication techniques merge identical memory pages to reduce host memory pressure. Kernel Samepage Merging (KSM) merges anonymous pages and is present in the mainline Linux kernel. VMWare ESX Server also implements memory page deduplication [28]. Difference Engine is a sophisticated merging algorithm that operates at a finer granularity [8]. All these techniques use expensive memory scanning to identify duplicate pages. Satori [18] avoids expensive memory scanning by identifying duplicate pages directly in the I/O buffer. However, all these techniques need to read content from the disk before merging duplicates, addressing only one of the goals. Further, they either miss short-term sharing opportunities (scanning-based) or require specific enlightenments (Satori). In comparison, LVD trivially allows page cache sharing without any changes in the operating system. On the flip side, LVD does not merge anonymous pages, which existing techniques do. Since LVD works at the block level, it can be complemented with existing solutions like KSM to achieve memory deduplication for both anonymous and non-anonymous pages.

6.2 Disk Space and I/O DeduplicationDeduplication to reduce space has been a very popular areaof research [6, 10, 14, 16, 22, 29, 23]. Identical content canoccur either in the form of whole or partial files [17] andtechniques to detect similar content have ranged from whole


        Total Time(s)   Req/s   Min/req(ms)   Avg/req(ms)   Max/req(ms)   Read(GB)   Write(GB)   Memory(MB)
raw              2100     904           140           336           570       0.02        35.6        14550
qcow2            2075     915           128           332           530       0.37        35.9        14139
lvd               822    2310            67           132           387       0.28        3.59        13438

Table 5: Total time taken (in sec) for 100K events, requests per sec, min, avg and max time per request (in ms), total read/write I/O (GB) and active memory (MB) for the mysql (OLTP) workload.

        Total Time(s)   Throughput(ops/sec)   Avg Lat(Read)   Max Lat(Read)   Min Lat(Read)   Avg Lat(Update)   Max Lat(Update)   Min Lat(Update)
raw              1109                   901             6.0            1229             0.8               2.6               141               0.6
qcow2            1009                   990             5.2            1642             0.8               2.6               854               0.6
lvd               804                  1243             4.7            4007             0.4               1.5               189               0.4

Table 6: Total time taken (in sec) for 1M events (500K reads and 500K updates), throughput (ops/sec), and avg, max and min latency per request (in ms) for reads and updates respectively.


Koller et al. present a content-based cache that uses duplicate content to reduce disk I/O [13]. LVD eliminates duplicates without introducing any extra cache layer or any overheads in the read path. Conceptually, some of the techniques used by LVD are similar to inline deduplication techniques [16, 26, 20]. Although SDFS [20] can perform inline deduplication, it incurs significant overhead in terms of CPU and memory. However, the isolation boundaries between virtual machines introduce additional challenges for us. Further, LVD can be applied in conjunction with any of the prior work.

6.3 Sharing in Image Disks
Traditional virtual disk formats (e.g., QCow2 [15]) also allow limited sharing of content across virtual machines using linked clones that are provisioned from a common base image. However, deduplicating beyond the base image can reduce storage space to one third of the space taken using linked clones [6]. Recently, a patch for QCow2 has been released that deduplicates content within one virtual machine. Deduplication across virtual machines, which LVD solves, is a much harder problem due to concurrency issues. Ventana is a virtualization-aware filesystem that allows administrators to define shared files across virtual machines [21]. However, its focus is on providing versioning, access control and mobility features for virtual disks. LVD works with the current virtual machine isolation model and needs to identify shared content across concurrent virtual machines transparently.
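The toy copy-on-write read path below illustrates why linked-clone sharing is limited: a clone's read falls through to the shared base image only for blocks it has never written, so blocks that become identical after provisioning are never merged. The structures here are hypothetical and are not the QCow2 on-disk format.

/* Toy copy-on-write read path for a linked clone: blocks written by the
 * clone come from its overlay, everything else falls through to the shared
 * base image. Sharing therefore stops at provisioning time. */
#include <stdbool.h>
#include <stdio.h>

#define NBLOCKS 8

struct image {
    bool          allocated[NBLOCKS];  /* does this layer own the block? */
    int           data[NBLOCKS];       /* stand-in for block contents    */
    struct image *backing;             /* base image, or NULL            */
};

static int read_block(const struct image *img, int blk)
{
    for (; img; img = img->backing)
        if (img->allocated[blk])
            return img->data[blk];
    return 0;                          /* unallocated blocks read as zero */
}

int main(void)
{
    struct image base  = { .allocated = { [0] = true, [1] = true },
                           .data      = { 11, 22 } };
    struct image clone = { .backing = &base };

    clone.allocated[1] = true;         /* copy-on-write: clone overrides block 1 */
    clone.data[1] = 99;

    printf("block 0 -> %d (served from base)\n",  read_block(&clone, 0));
    printf("block 1 -> %d (served from clone)\n", read_block(&clone, 1));
    return 0;
}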

7. CONCLUSION
We presented the design and implementation of LVD in this work. We showed that LVD reduces space and disk I/O, improves page cache efficiency and improves application performance.

We observed that selecting a common shared file allows related data across virtual machines to be allocated close to each other. This created a significant positive impact on clustered applications like mysql and HDFS, which was over and above the impact of duplicate elimination. Another interesting impact of duplicate elimination was reduced CPU usage. It has been observed that I/O activity in virtualized servers leads to CPU overhead [5]; the reduction in I/O with LVD reduces this CPU overhead. Similarly, for host memory, although LVD requires additional memory to cache the HashMap and the RefCount table, this is more than compensated by the savings in the host page cache from eliminating duplicates.

We would also like to note that LVD may increase the number of metadata updates. Writes to a virtual block in a virtual machine always lead to a change in the metadata. Since we are able to cache metadata well in our implementation, the metadata updates do not lead to any performance degradation. However, if the underlying virtual disk format does not allow metadata caching, LVD may not lead to savings in disk I/O.
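A minimal sketch of the bookkeeping this discussion refers to is shown below; the in-memory structures and function names are hypothetical stand-ins for LVD's HashMap, RefCount table and per-VM mapping metadata, not the actual implementation. Every guest write hashes the block, consults the hash map, adjusts reference counts and updates the virtual-to-physical mapping, which is why each write implies a metadata update and why caching that metadata matters.

/* Toy sketch of deduplicated-write bookkeeping: hash the block, look it up,
 * adjust reference counts and update the virtual-to-physical mapping.
 * All structures are hypothetical in-memory stand-ins. */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define BLOCK_SIZE 4096
#define MAX_BLOCKS 64
#define UNMAPPED   -1

static uint8_t  store[MAX_BLOCKS][BLOCK_SIZE];  /* shared image blocks    */
static uint64_t hashes[MAX_BLOCKS];             /* content hash per block */
static int      refcount[MAX_BLOCKS];           /* sharers per block      */
static int      nblocks;

static uint64_t block_hash(const uint8_t *b)    /* FNV-1a */
{
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < BLOCK_SIZE; i++) { h ^= b[i]; h *= 1099511628211ULL; }
    return h;
}

/* Write one virtual block for a VM; *map is that VM's virtual-to-physical table. */
static void write_block(int *map, int vblock, const uint8_t *buf)
{
    if (map[vblock] != UNMAPPED && --refcount[map[vblock]] == 0)
        printf("physical block %d became free\n", map[vblock]);

    uint64_t h = block_hash(buf);
    for (int p = 0; p < nblocks; p++) {
        if (hashes[p] == h && memcmp(store[p], buf, BLOCK_SIZE) == 0) {
            refcount[p]++;                      /* duplicate: no disk write */
            map[vblock] = p;                    /* metadata update only     */
            return;
        }
    }
    int p = nblocks++;                          /* new content: allocate    */
    memcpy(store[p], buf, BLOCK_SIZE);
    hashes[p] = h; refcount[p] = 1; map[vblock] = p;
}

int main(void)
{
    int vm1_map[4] = { UNMAPPED, UNMAPPED, UNMAPPED, UNMAPPED };
    int vm2_map[4] = { UNMAPPED, UNMAPPED, UNMAPPED, UNMAPPED };
    uint8_t buf[BLOCK_SIZE];

    memset(buf, 0x5A, sizeof buf);
    write_block(vm1_map, 0, buf);               /* first copy is stored           */
    write_block(vm2_map, 3, buf);               /* second copy only bumps refcount */
    printf("vm1[0]=%d vm2[3]=%d refcount=%d\n",
           vm1_map[0], vm2_map[3], refcount[vm1_map[0]]);
    return 0;
}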

8. REFERENCES
[1] Cassandra. http://cassandra.apache.org/.
[2] CloudSuite. http://parsa.epfl.ch/cloudsuite/cloudsuite.html.
[3] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. Benchmarking Cloud Serving Systems with YCSB. In Proc. SoCC, 2010.
[4] M. Burrows, C. Jerian, B. Lampson, and T. Mann. On-line Data Compression in a Log-structured File System. In Proc. ASPLOS, 1992.
[5] L. Cherkasova and R. Gardner. Measuring CPU Overhead for I/O Processing in the Xen Virtual Machine Monitor. In USENIX ATC, 2005.
[6] A. Clements, I. Ahmad, M. Vilayannur, and J. Li. Decentralized Deduplication in SAN Cluster File Systems. In USENIX ATC, 2009.
[7] CloudZoom. CloudZoom Virtual Appliance Marketplace. http://cloudzoom.com/.
[8] Diwaker Gupta, Sangmin Lee, Michael Vrable, Stefan Savage, Alex C. Snoeren, George Varghese, Geoffrey Voelker, and Amin Vahdat. Difference Engine: Harnessing Memory Redundancy in Virtual Machines. In Proc. of USENIX OSDI, December 2008.
[9] IBM. Tivoli Endpoint Manager. http://www-01.ibm.com/software/tivoli/solutions/endpoint.
[10] N. Jain, M. Dahlin, and R. Tewari. TAPER: Tiered Approach for Eliminating Redundancy in Replica Synchronization. In USENIX FAST, 2005.


[11] K. R. Jayaram, Chunyi Peng, Zhe Zhang, Minkyong Kim, Han Chen, and Hui Lei. An Empirical Analysis of Similarity in Virtual Machine Images. In ACM Middleware (Industry Track), 2011.
[12] K. Jin and E. Miller. The Effectiveness of Deduplication on Virtual Machine Disk Images. In Proc. of SYSTOR, 2009.
[13] Ricardo Koller and Raju Rangaswami. I/O Deduplication: Utilizing Content Similarity to Improve I/O Performance. In Proc. of USENIX FAST, 2010.
[14] P. Kulkarni, F. Douglis, J. D. LaVoie, and J. M. Tracey. Redundancy Elimination Within Large Collections of Files. In USENIX ATC, 2004.
[15] KVM. QEMU Copy-on-write Disk Image. http://www.linux-kvm.org/page/Qcow2, 2013.
[16] M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar, G. Trezise, and P. Camble. Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality. In USENIX FAST, 2009.
[17] Dutch T. Meyer and William J. Bolosky. A Study of Practical Deduplication. In Proc. of USENIX FAST, February 2011.
[18] G. Milos, D. G. Murray, S. Hand, and M. Fetterman. Satori: Enlightened Page Sharing. In Proc. of USENIX ATC, June 2009.
[19] Athicha Muthitacharoen, Benjie Chen, and David Mazieres. A Low-bandwidth Network File System. In Proc. of ACM SOSP, October 2001.
[20] OPENDEDUP. SDFS: Userspace Deduplication Based Filesystem. http://opendedup.org/.
[21] Ben Pfaff, Tal Garfinkel, and Mendel Rosenblum. Virtualization Aware File Systems: Getting Beyond the Limitations of Virtual Disks. In NSDI, 2006.
[22] S. Quinlan and S. Dorward. Venti: A New Approach to Archival Storage. In USENIX FAST, 2002.
[23] Zhiming Shen, Zhe Zhang, Andrzej Kochut, Alexei Karve, Han Chen, Minkyong Kim, Hui Lei, and Nicholas Fuller. VMAR: Optimizing I/O Performance and Resource Utilization in the Cloud. In Middleware 2013, pages 183–203, 2013.
[24] SourceForge.net. Filebench: File System and Storage Benchmark. http://sourceforge.net/projects/filebench.
[25] SourceForge.net. SysBench: A System Performance Benchmark. http://sysbench.sourceforge.net.
[26] Kiran Srinivasan, Tim Bisson, Garth Goodson, and Kaladhar Voruganti. iDedup: Latency-aware, Inline Data Deduplication for Primary Storage. In Proc. of USENIX FAST, 2012.
[27] VirtualBox. VirtualBox Virtual Appliances. http://virtualboximages.com/.
[28] C. Waldspurger. Memory Resource Management in VMware ESX Server. In OSDI, 2002.
[29] Benjamin Zhu, Kai Li, and Hugo Patterson. Avoiding the Disk Bottleneck in the Data Domain Deduplication File System. In USENIX FAST, 2008.

