Design Patterns for Tunable and Efficient SSD-based Indexes

Ashok Anand†, Aaron Gember*, Collin Engstrom*, Aditya Akella*
†Bell Labs-India  *University of Wisconsin-Madison

ABSTRACT

A number of data-intensive systems require using random hash-based indexes of various forms, e.g., hash tables, Bloom filters, and locality sensitive hash tables. In this paper, we present general SSD optimization techniques that can be used to design a variety of such indexes while ensuring higher performance and easier tunability than specialized state-of-the-art approaches.

We leverage two key SSD innovations: a) rearranging the data layout on the SSD to combine multiple read requests into one page read, and b) intelligent request reordering to exploit inherent parallelism in the architecture of SSDs. We build three different indexes using these techniques and conduct extensive studies showing their superior performance and tunability.

1. INTRODUCTION

Data-intensive systems are being employed in a wide variety of application scenarios today. For example, key-value storage systems are employed in cloud-based applications as diverse as e-commerce and business analytics systems, and picture stores; and large object stores are used in a variety of content-based systems such as network deduplication engines, storage deduplication, logging systems, and content similarity detection engines. To ensure high application performance these systems often rely on random hashing-based indexes whose specific design may depend on the system in question. For instance, WAN optimizers [6, 7], Web caches [5, 8], and video caches [2] employ large streaming hash tables. De-duplication systems [30, 35] employ Bloom filters. Content similarity engines and some video proxies [12, 2] employ locality sensitive hash (LSH) tables [26]. Given the volume of the underlying data, the indexes often span several 10s to 100s of GB, and they continue to grow in size.

Across these systems, the index is the most intricate in design. Heavy engineering is often devoted to ensure high index performance at low cost. Most state-of-the-art systems [21, 27, 14, 23] advocate using SSDs to store the indexes, given flash-based media's superior density, 8X lower cost (vs. DRAM), 25X better energy efficiency (vs. DRAM or disk), and high random read performance (vs. disk) [27]. However, the commonality ends here. The conventional wisdom, which universally dictates index design, is that domain- and operations-specific SSD optimizations are necessary to meet appropriate cost-performance trade-offs. This poses two problems: (1) Poor flexibility: Index designs often target a specific point in the cost-performance spectrum, severely limiting the range of applications that can use them. It also makes indexes difficult to tune, e.g., use extra memory for improved performance. Finally, the indexes are designed to work best under specific workloads; minor deviations can make performance quite variable. (2) Poor generality: The design patterns employed apply only to the specific data structure on hand, placing a high bar on innovation. In particular, it is difficult to employ different indexes in tandem (e.g., hash tables for cache lookup alongside LSH tables for content similarity detection over the same underlying content) as they may employ conflicting techniques that result in poor SSD I/O performance.

Our paper questions the conventional wisdom. We present different indexes that all leverage a common set of novel SSD optimizations, are easy to tune to achieve optimal performance under a given cost constraint, and support widely-varying workload patterns and applications with differing resource requirements; yet, they offer better IOPS, cost less, and consume lower energy than their counterparts with specialized designs.

We rely on two key innovations. (1) We leverage a unique feature of SSDs that has been overlooked by earlier proposals, namely, that the internal architecture of SSDs offers parallelism at multiple levels, e.g., channel-, package-, die-, and plane-level. Critically, the parallelism benefits are significant only under certain I/O workloads. Our key contribution lies in identifying these parallelism-friendly workloads and developing a set of design patterns for encapsulating the input workload for an index into SSD parallelism-friendly forms. (2) Based on the design patterns, we develop a new primitive called slicing which helps organize data on the SSD such that related entries are co-located. This allows us to combine multiple reads into a single "slice read" of related items, offering high read performance. We show how our design patterns inform slice size, the number of slices to co-locate at a particular SSD block, and the techniques to use for reading from and writing to slices. A key feature of slicing is that slice size/composition (i.e., how many elements constitute a slice) offer simple knobs to trade off I/O performance for the memory overhead of any index data structure.

In §4, we conduct several experiments to profile the internal parallelism behavior on a desktop-grade SSD to identify parallelism-friendly I/O patterns, and derive the appropriate design patterns that guide the composition, configuration and use of slices. Then, we present the design of three random-hash based indexes that leverage our design patterns and slicing: a streaming hash table called SliceHash (§5), large Bloom filters called SliceBloom, and locality-sensitive hash tables called SliceLSH (§6).

Our index designs can be sketched as follows: We use small in-memory data structures (hash tables, Bloom filters, or LSH tables, as the case may be) as buffers for insert operations to deal with the well-known problem of slow random writes on SSDs. When full, these are flushed to the SSD; each of these flushed data structures is called an "incarnation". We organize data on the SSD such that all related entries of different incarnations are located together in a slice, thereby optimizing lookup. Finally, based on an understanding of the SSD's writing policy, we appropriately reorder lookups, without violating application semantics, to distribute them uniformly across different channels and extract maximal parallelism benefits.

In addition to supporting high performance, our parallelism-centered design patterns and the slicing primitive together eliminate the need for maintaining complex metadata to aid index I/O operations; this is in contrast with state-of-the-art techniques, e.g., [27, 14], where the metadata imposes high memory overhead or CPU cost. We show that this frees memory and compute resources for use by higher layer applications. We show that our design techniques facilitate extending the indexes to use multiple SSDs on the same machine, offering linear scaling in performance while also lowering per-key memory overhead. State-of-the-art techniques cannot be scaled out in a similar fashion.

We build prototype indexes using a 128GB Crucial SSD and at most 4GB of DRAM. We conduct extensive experiments under a range of realistic workloads to show that our design patterns offer high performance, flexibility, and generality. Key findings from our evaluation are as follows: On a single SSD, SliceHash can provide 69K lookups/sec by intelligently exploiting parallelism, which can be 5.2X better than naively running multiple lookups in parallel. Lookup performance is preserved even with arbitrarily interleaved inserts, whereas state-of-the-art systems take up to a 25% performance hit. SliceHash has a low memory footprint and low CPU overhead, yet it provides high performance. Furthermore, SliceHash can be tuned to use progressively more memory (from 0.27B/entry to 1.1B/entry) to scale performance (from 70K to 110K ops/s for a mixed (50% lookup, 50% insert) workload). When leveraging 3 SSDs in parallel, SliceHash's throughput improves to between 207K (lookup-only) and 279K (lookup/insert) ops/sec. SliceBloom performs 15K ops/sec with a mixed lookup/insert workload, whereas the state-of-the-art [24] achieves similar performance on a high-end SSD that costs 30X more. SliceLSH performs 6.9K lookups/s.

2. DESIGN REQUIREMENTS AND EXISTING SYSTEMS

Our goal is to develop generic SSD design optimizations that can be applied nearly universally to a variety of random hash-based indexes that each have the following requirements:

Large scale: A number of data-intensive systems require large indexes. For example, WAN optimizer [7, 6] indexes are ≥32GB; data de-duplication indexes are ≥40 GB [4]. In keeping with the trend of growing data volumes, we target indexes that are an order of magnitude larger, i.e., a few hundred GB.

High performance and low cost: The index should provide high throughput, low per-operation latency, and a low overall cost, memory, and energy footprint. To apply to a wide variety of content-based systems, the index should provide good performance under both inserts/updates and reads. State-of-the-art techniques for hash tables offer 46K IOPS [27, 14]; those for Bloom filters offer 12-15K IOPS [24]. Our indexes should match or exceed this performance.

Metric                        FlashStore  SkimpyStash  BufferHash  SILT
Avg. lookup (# page reads)    ∼1          ∼5           ∼1          ∼1
Worst lookup (# page reads)   1           10           16          33
Memory (# bytes/entry)        ∼6          ∼1           ∼4          ∼0.7
CPU overhead                  Low         Low          Low         High

Table 1: Comparison of different SSD-based hash tables under different metrics. The worst-case lookups are based on the default prototype configurations of these systems.

Flexibility: This covers various aspects of how easy the index is to use, as we discuss below.

Applications leveraging a given index may require significant CPU and memory resources for their internal operations. For example, data de-duplication applications require CPU resources for computing SHA-1 hashes of fingerprints [13]. Various image and video search applications require CPU resources for computing similarity metrics after they find potential matches. Caching applications may want to use memory for caching frequently or recently accessed content. To ensure that the applications can flexibly use CPU and memory and that their performance does not suffer, the index should impose low CPU and memory overhead. Unfortunately, many prior index designs ignore the high CPU overhead they impose in their singular quest for, e.g., low memory footprint and high read performance (e.g., SILT [27]), which makes application design tricky. Equally importantly, application designers should be able to easily extend the index with evolving application requirements, e.g., add memory or CPU cores at a modest additional cost to obtain commensurately better performance. Finally, the index should work well under a variety of workload patterns.

In the rest of this section, we survey other related hash-based systems that employ flash storage. As stated earlier, none of these studies use techniques that are all generally applicable across different random hash-based indexes. Even ignoring this issue, all prior designs fall short on one or more of the above requirements.

2.1 Trade-Offs in SSD-Based Hash Tables

We start by reviewing a specific class of indexes, namely those based on hash tables. We review several prior systems, each designed for a specific application domain. We highlight the design choices made in each case and the restrictions they impose.

Many recent works [22, 23, 14, 21, 27] have proposed SSD-based indexes for large key-value stores. As Table 1 shows, each design optimizes for a subset of the metrics that matter in practice (i.e., high throughput, low latency, low memory footprint or low computation overhead). Unfortunately, these optimizations come at the expense of significantly impacting other metrics, and they may impact the applications that use the indexes, as we argue below.

FlashStore [22] stores key-value pairs in a log-structured fashion on SSD storage, and uses an in-memory hash table to index them. It requires one SSD read per lookup on average. SkimpyStash [23] uses a low amount of memory - 1 byte/key - to maintain a hash table with linear chaining on the SSD. Both approaches impose a low CPU overhead, but suffer from high average or worst-case I/O costs (SkimpyStash requires 5 page reads per lookup on average) or high memory overhead (FlashStore requires ∼6 bytes/key). They also work best for read-heavy workloads.

BufferHash [14] buffers all insertions in memory, and writes them in a batch to the SSD. It maintains in-memory Bloom filters [9] to avoid spurious lookups to any batch on the SSD. BufferHash requires ∼1 page read per lookup on average and works well across a range of workloads. However, it may need to read multiple pages in the worst case due to false positives of the Bloom filters. BufferHash also has a high memory overhead (∼4 bytes/key). Finally, BufferHash is difficult to tune: it requires a predetermined amount of memory (a function of SSD size) to ensure that the false positive rate is low and the worst-case lookup cost is small.

SILT [27] offers a better balance across the different metrics than any of the above systems. SILT achieves a low memory footprint (0.7 bytes/entry) and requires a single page lookup on average. However, SILT uses a much more complex design than the systems discussed above. It employs three data structures: one of them is highly optimized for a low memory footprint, and the others are more write-optimized but require more memory. SILT continuously moves data from the write-optimized data structures to the memory-efficient one. In doing so, SILT has to continuously sort newly written data and merge it with old data. This increases the computation overhead, which may impact the applications that use SILT. Furthermore, these background operations affect the performance of SILT under a workload of continuous inserts and lookups, as is common with, e.g., WAN optimizers. For example, the lookup performance drops by 21% for a 50% lookup-50% insert workload on 64B key-value pairs. While SILT is somewhat tunable – e.g., it is possible to tune the memory overhead between 0.7 and 2B per entry [27] – it does not permit configurations with an arbitrarily low memory footprint, in contrast to our index designs.

Other recent works, MicroHash [25] and FlashDB [32], also maintain hash tables on SSDs to reduce the memory footprint. However, these systems are designed for memory- and power-constrained devices. Unfortunately, they suffer from high lookup latencies: e.g., MicroHash requires looking up multiple pages to locate a key.

Also, none of the above systems are designed for exploiting the intrinsic parallelism of SSDs. As we show in §4, lookup performance can improve by 5.2X if the underlying parallelism is exploited optimally.

2.2 Other Indexes

Other hashing-based data structures have received less attention than hash tables. But there has been growing interest in using SSDs to support them when the scale is large, especially for Bloom filters.

Buffered Bloom Filter [18] is an approach for SSD-resident Bloom filters that targets the initial construction of Bloom filters while ensuring a low memory footprint. However, this data structure cannot handle updates over time. BloomFlash [24] is an approach for SSD-resident Bloom filters that optimizes for writes. BloomFlash buffers bit updates in DRAM to avoid random writes to the SSD. It also uses a hierarchical organization to manage writes. Neither approach leverages the parallelism intrinsic to SSDs. In particular, our experiments show that by adapting BloomFlash's design using our parallelism-centered patterns and techniques, we can achieve the same I/O performance using a commodity SSD that their design achieves with a high-end SSD costing 30X more.

The critical takeaways from the above discussion are that the individual designs are targeted to specific scenarios and workloads; they are often not easy to tune, e.g., to trade off performance for memory; they are CPU intensive; and techniques used in one often do not extend to another.

Our goal is to develop guidelines to design indexes that offer high I/O performance and low memory overhead, are easy to tune, work well under a variety of workloads, and apply to a variety of indexes based on random hashing, including hash tables, locality-sensitive hash tables, and Bloom filters.

Figure 1: SSD internal architecture. The SSD controller (host interface logic, RAM buffer, processor, buffer manager, and flash controller) connects over multiple channels to flash memory packages; each package contains dies, each die contains planes, and each plane holds a cache register, a data register, and blocks of pages.

3. PARALLELISM IN THE INTERNAL SSD ARCHITECTURE

To meet our goal, we must first understand key properties of SSDs that influence the design and performance of random hash-based indexes. To this end, we describe the internal architecture of SSDs. We then describe the different forms of parallelism available within SSD architectures.

Figure 1 shows an illustration of a SATA-based SSD architecture. SSDs provide logical block addresses (LBAs) as an interface to the host. All I/O requests for logical block addresses are processed by an SSD controller. The controller receives I/O requests from the host via an interface connection (i.e., the SATA interface). The controller uses the flash translation layer (FTL) to translate the logical pages of incoming requests to physical pages. It issues commands to flash packages via flash memory controllers. The flash memory controller connects to flash packages via multiple channels (generally 2-10).

Each package has two or more dies or chips. Each die is composed of two or more planes. On each plane, memory is organized into blocks; each block consists of many 2-4KB pages.

Each plane has a data register to temporarily store the data page during reads or writes. For a write command, the controller first transfers data to a data register on a channel, and then the data is written from the data register to the corresponding physical page. For a read command, the data is first read from the physical page to the data register, and then transferred to the controller on a channel.

Different Forms of Parallelism: The internal architecture of SSDs incorporates varying degrees and levels of parallelism. Each of an SSD's channels can operate in parallel and independently of the others. Thus, SSDs inherently have channel-level parallelism. Typically, the data transfers from/to the multiple packages on the same channel get serialized. However, data transfers can be interleaved with other operations (e.g., reading data from a page to the data register) on other packages sharing the same channel [11, 31]. This interleaving provides package-level parallelism. The FTL stripes consecutive logical pages across a gang of different packages on the same channel [11] to exploit package-level parallelism. Furthermore, the command issued to a die can be executed independently of the others on the same package. This provides die-level parallelism.

Multiple operations of the same type (read/write/erase) can happen simultaneously on different planes in the same die. Currently, a two-plane command is widely used for executing two operations of the same type on two different planes simultaneously [29]. This provides plane-level parallelism. Furthermore, the data transfers to/from the physical page can be pipelined for consecutive commands of the same type. This is achieved using the cache register in the plane: e.g., for consecutive write commands, the cache register stores the data temporarily until the previous data is written from the data register to the physical page.

4. PARALLELISM-FRIENDLY DESIGN PATTERNS

At the heart of our work lies a generic set of techniques for carefully extracting the above intrinsic parallelism of SSDs to ensure high performance without sacrificing generality and tunability. In what follows, we first outline known properties of SSD I/O, and techniques for accommodating them (§4.1). We then describe design patterns that help account for both the known properties as well as the available forms of parallelism (§4.2).

Before moving further, we note that while the above forms of parallelism have existed in most SSD designs for a while, support for concurrent I/O operations was not available [17], making it difficult to leverage the parallelism. Recently, SSDs have begun to support native command queuing (NCQ), which we leverage to extract parallelism. With NCQ, multiple I/O operations can execute concurrently, thereby helping leverage the inherent parallelism. For example, in the Crucial M4 SSD [3], NCQ allows up to 32 I/O requests to run in parallel.

4.1 Reads and Writes

The read/write properties of SSDs are well known. In particular, a page is the smallest unit of read or write operations, meaning that reading a 16B entry (such as a key-value pair in a hash table) is as costly as reading an entire page. Also, the performance of random page reads is comparable to that of sequential page reads. Thus, we arrive at design pattern DP1: Organize data on the SSD in such a way that multiple entries to be read reside on the same page. This can help reduce the lookup cost significantly.

Random writes require a block to be erased and written out sequentially with old and new data. Thus, SSDs show poor performance under a heavy random write workload [33]. Even the random read performance is affected in a mixed workload of continuous reads and writes [14]. A common design pattern, which we call DP2, used to accommodate this property is: Leverage a small amount of memory to buffer writes and flush data out to the SSD at a granularity lower bounded by the size of a block (typically 128KB); we adopt this in our design.

We now describe the benefits of, and techniques for, applying these insights along with leveraging SSD parallelism.

4.2 Extracting Parallelism

Channel-level Parallelism: Reading data from a physical page to the data register typically takes ∼25µs. Data transfers on the channel, in contrast, take roughly 100µs [11], making the channel the primary bottleneck for page reads. Arranging multiple related items on a page (e.g., 128 entries of size 16B each per 2KB page) and reading them all at once allows the per-item amortized lookup cost to be close to 1µs.

Furthermore, by leveraging channel-level parallelism, the throughput of page reads can be significantly improved as well. A simple way to extract the benefits of channel-level parallelism is to simply use multiple threads issuing requests in parallel. Unfortunately, this will not work in the general case: when a sudden skew in input keys forces all requests to go to the same channel, naive parallel lookups will obviously not provide any benefits.

To extract the benefits of parallelism under a wide range of workloads and workload variations, we need to ensure that the requests issued to the SSD are spread uniformly across the channels. This becomes possible if we know the mapping between pages and channels. Armed with this knowledge, we can then reorder lookup requests to ensure that those issued concurrently to the SSD are uniformly spread across channels.

However, the mapping is often internal to SSDs and not exposed by vendors. Luckily, the mapping can be reverse engineered, as shown in recent work [19]. As discussed earlier, the FTL stripes a group of consecutive logical pages across different packages on the same channel. The authors of [19] discuss a technique to determine the size of the group that gets contiguously allocated within a channel; they call this logical unit of data a chunk. They show how to determine the chunk size and the number of channels in an SSD. Using this, they also show how to derive the mapping policy. In particular, they discuss techniques for deriving two common mapping policies: (1) write-order mapping, where the ith chunk write is assigned the channel i % N, assuming N is the number of channels, and (2) LBA-based mapping, where the logical block address (LBA) is mapped to channel number LBA % N.
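To make the two policies concrete, the small sketch below computes the channel for a chunk under each; it is ours, not the authors' code, and the channel count and chunk size stand in for whatever the reverse-engineering step of [19] reports (32 channels and 8KB chunks are the estimates we obtain for the Crucial SSD below).

```python
# Sketch (ours) of the two common FTL mapping policies described above.
# N and CHUNK_SIZE are assumed to come from the reverse-engineering
# technique of [19]; the values here are the Crucial SSD estimates.
N = 32                  # number of channels (estimated)
CHUNK_SIZE = 8 * 1024   # bytes contiguously allocated within a channel (estimated)

def channel_write_order(global_chunk_write_index):
    """Write-order mapping: the i-th chunk ever written goes to channel i % N."""
    return global_chunk_write_index % N

def channel_lba_based(lba):
    """LBA-based mapping: the logical block address determines the channel."""
    return lba % N
```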

As an example, we applied the technique of [19] to a Crucial SSD. We estimated the chunk size and number of channels to be 8KB and 32, respectively. We further found that the Crucial SSD follows write-order mapping. Figure 2a shows the lookup performance of our channel-aware technique, which uses the above estimates of the SSD channel count and mapping policy, for different numbers of threads (labeled "Best"). We also show the worst case (labeled "Worst"), where we force requests to go to the same channel. We find that the gap between the two is quite substantial – nearly 5.2X. As a point of comparison, we also show the performance of simply issuing multiple requests using multiple threads without paying attention to channel-awareness (labeled "Rand"): we see that this is up to 1.5X worse for this workload.

Thus, we arrive at the following design pattern DP3: when performing lookups, rearrange them such that the requests issued to the SSD are evenly spread across channels.

We further investigate if issuing concurrent writes leads to similar benefits as concurrent reads; as stated above, each write should be at least the size of a block (DP2). Our experimental results for the Crucial SSD are shown in Figure 2b. We see that parallelism offers marginal improvement at best. The reason is that the Crucial SSD's write-order-based mapping assigns consecutive chunks (8KB) to different channels, and so, by default, any write larger than the chunk size is distributed over multiple channels. Thus, we arrive at DP4: simply issuing large bulk writes suffices – issuing writes concurrently is not essential to improving write throughput.

Package-level parallelism: Figure 2c shows the random read performance (in MB/s) for different read sizes. We observe high read bandwidth when large reads are issued. This is because reads up to the chunk size (8KB) can exploit package-level parallelism. Reads larger than the chunk size can exploit both channel-level and package-level parallelism. Thus, we have DP5: when possible, it helps to issue large reads.

Note that this design pattern cannot be used for regular lookups into an index data structure, as issuing large reads may retrieve useless data, resulting in low system goodput. However, as we will show later, this design pattern aids in implementing patterns 1–4.

Plane-level parallelism: Earlier works [19, 31] have shown that intermingling small reads and small writes affects plane-level parallelism, leading to a performance drop of up to 1.3X in throughput compared to issuing consecutive small reads followed by consecutive small writes. However, the above design patterns already dictate that we issue large writes (DP2) and small reads (small page reads for lookup requests; DP1), which already ensures that small reads and small writes are not intermingled by default. Thus, there are no further undesirable interactions with plane-level parallelism.

Figure 2: Concurrent I/O performance. (a) Concurrent 2KB reads: 2KB reads/sec vs. number of threads for the Best, Rand, and Worst request orderings. (b) Concurrent large writes: writes/sec vs. number of threads for 128KB, 256KB, and 512KB writes. (c) Concurrent large reads (32 threads): MB/s vs. request size (KB).

5. STREAMING HASH TABLES: SLICEHASH

In this section, we discuss how, using the design patterns above, we can develop techniques for building high-performance, large streaming hash tables where <key, value> pairs can be looked up, inserted and updated over time. We call our index SliceHash. We will describe how to build other index data structures in §6.

Our key innovation, which applies across all the data structures, is the use of a slicing primitive for storing multiple related entries on the same page (DP1), thereby helping combine multiple index lookups into one page read. We use known techniques for dealing with random writes (DP2) but adapt them to work with slicing (based on DP4 and DP5). Finally, we discuss how we implement support for concurrent I/O in SliceHash (based on DP3). We show how the design patterns influence key aspects of the configuration of the data structure as well as the techniques we use to read from and write to the SSD. We end with a simple analysis of SliceHash's performance as a function of its configuration, demonstrating that SliceHash is both high-performance and easy to tune to meet various cost-performance trade-offs.

5.1 Basic SliceHash

In a naive implementation of a hash table directly on the SSD, insertion of keys would require updating/writing to random SSD locations, which can be slow. To overcome this, SliceHash hierarchically organizes the hash table across DRAM and SSD: that is, we maintain an in-memory hash table and allow inserts to happen only into this in-memory table. After the in-memory table is full, it is written as an incarnation to the SSD. Over time, multiple incarnations are written to the SSD. We set the size of the in-memory hash table in multiples of a block's size (typically spanning 32 or 64 pages), thereby ensuring that the writes we issue to the SSD are large. These aspects of the design are motivated by DP2 and DP4.

Figure 3a shows this hierarchical organization of an in-memory hash table in DRAM and its multiple incarnations on the SSD. While this reorganization may appear similar to BufferHash [14], SliceHash differs in how the data in the incarnations is laid out on the SSD. In particular, SliceHash uses the idea of slicing to lay out the data in incarnations in the form of a slicetable on the SSD, as shown in Figure 3b. As we will argue shortly, this is key to ensuring good lookup performance.

Before describing how the slicetable is constructed, we define a few key terms:

• A slot is an index into the in-memory hash table or an on-SSD incarnation, where an entry (i.e., a key-value pair) is or can be stored.

• For a given slot, a slice is a list of all entries located at that slot within all on-SSD incarnations. In Figure 3b, the slice for slot-0, i.e., slice-0, contains the entries from slot-0 of each incarnation, e.g., <K01,V01> from incarnation-0, an empty entry from incarnation-1, and <KN0,VN0> from incarnation-N.

• A slicetable then refers to a sequential arrangement of slices on the SSD, each slice corresponding to a given slot. A slicetable can span multiple SSD blocks. In Figure 3b, the slicetable contains slice-0, slice-1, ..., slice-M.

• A SliceHash comprises both the in-memory hash table and the on-SSD slicetable. (A small layout sketch follows this list.)
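The following lines (ours, purely illustrative; the entry size and incarnation count are example values, not prescribed by the design) show the layout arithmetic implied by these definitions: slice i occupies one contiguous region of the slicetable holding slot i of every incarnation.

```python
# Illustrative slicetable layout arithmetic (names and sizes are ours).
ENTRY_SIZE = 16                  # bytes per <key, value> entry (8B key + 8B value)
K = 32                           # number of incarnations on the SSD
SLICE_SIZE = K * ENTRY_SIZE      # one slice = slot i of all K incarnations

def slice_offset(slot):
    """Byte offset of the slice for `slot`: slices are stored sequentially,
    so all entries that hash to the same slot are adjacent on the SSD and
    can be fetched with a single small read (DP1)."""
    return slot * SLICE_SIZE

def entry_offset(slot, incarnation):
    """Byte offset of the entry belonging to (slot, incarnation)."""
    return slice_offset(slot) + incarnation * ENTRY_SIZE
```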

Figure 4: Lookup using the slicetable. (a) Multiple lookups on incarnations: when incarnations are stored directly on the flash SSD, Hash(K) must be checked in every incarnation. (b) Single lookup using the slicetable: Hash(K) selects one slice, which is fetched with a single read.

Slicing improves lookups. The main advantage of using slicing is that lookup is vastly simplified and more efficient compared to storing incarnations directly on the SSD.

When incarnations are stored directly on the SSD (Figure 4a), we may have to examine all incarnations, since the key may be present in any of them. Since each incarnation occupies a different set of SSD pages, a key lookup may incur multiple SSD page reads.

In contrast, using slicing simplifies lookups: we hash a key to obtain the slot, and simply read the corresponding slice (Figure 4b). We then compare the input key against the entries in the slice to obtain the relevant value. If we limit the size of a slice to one or a small number of pages (DP1), then we can correspondingly limit the cost of lookup (to 1 page read or a small constant, respectively).

Impact on inserts. While slicing positively impacts lookups, it makes inserts complex. In particular, when flushing a full in-memory hash table to the SSD, we need to maintain the structure of the slicetable on the SSD. Because a slice has entries from all incarnations, we would need to modify each slice to include entries from the new incarnation. For example, in Figure 5, slice-i is modified to include <Ki,Pi> from the in-memory hash table H.

Figure 3: Basic SliceHash structure. (a) Hash table in DRAM and its incarnations on the SSD: the in-memory hash table is written to the SSD as an incarnation after it becomes full. (b) Physical layout of incarnations on the SSD: slicing transforms the logical view of incarnations into a slicetable in which slice-i holds slot-i of every incarnation.

Figure 5: Write overhead for the slicetable: writing hash table H as incarnation 1 requires updates at multiple slices. However, this overhead is amortized over multiple insert operations.

To avoid random writes (which occur when updating a slice only at a particular position), we read the slicetable from the SSD into memory; because a slicetable spans multiple blocks, this amounts to a "large" read and hence can be performed at high throughput (DP5). We then modify the slicetable at the appropriate positions for each slice, and write back the entire modified slicetable to the SSD; as DP4 indicates, such large writes help leverage channel-level parallelism.

While this imposes a high overhead, it is only incurred when the in-memory hash table is full. Since the vast majority of insert operations still happen in memory, the impact of this flush operation on an average insert is small.

To summarize, the basic operations in SliceHash are as follows:

Inserts and updates: Keys are inserted only into the in-memory hash table. When this becomes full, we flush it to the SSD while maintaining the slicetable structure. When the on-SSD slicetable becomes full, we employ the simple "eviction policy" of overwriting the oldest incarnation. For updates, we simply insert the new key-value pair into the in-memory hash table.

Lookups: We first look up the key in the in-memory hash table. If the key is not found, we read the corresponding slice from the SSD and scan the entries of all incarnations from the latest to the oldest. This ensures that the lookup does not return stale values in the face of updates.
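A minimal functional model of these operations is sketched below (ours, in Python; a dict stands in for one in-memory hash table and nested lists for its on-SSD slicetable, so page I/O and collisions are not modeled):

```python
# Minimal functional model of basic SliceHash operations (ours, illustrative).
NUM_SLOTS = 1024
K = 4                                    # incarnations kept on the "SSD"

in_memory = {}                           # slot -> (key, value); the DRAM table
slicetable = [[None] * K for _ in range(NUM_SLOTS)]   # slicetable[slot][incarnation]
next_incarnation = 0                     # incarnation to overwrite next (oldest)

def slot_of(key):
    return hash(key) % NUM_SLOTS

def insert(key, value):
    global next_incarnation
    in_memory[slot_of(key)] = (key, value)
    if len(in_memory) == NUM_SLOTS:      # in-memory table full: flush as an incarnation
        for slot in range(NUM_SLOTS):    # every slice is updated (one large write)
            slicetable[slot][next_incarnation] = in_memory.get(slot)
        next_incarnation = (next_incarnation + 1) % K   # overwrite the oldest next time
        in_memory.clear()

def lookup(key):
    slot = slot_of(key)
    entry = in_memory.get(slot)
    if entry and entry[0] == key:
        return entry[1]
    # one "slice read": scan incarnations from newest to oldest
    for age in range(1, K + 1):
        entry = slicetable[slot][(next_incarnation - age) % K]
        if entry and entry[0] == key:
            return entry[1]
    return None
```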

5.1.1 Partitioning SliceHash

Maintaining a single large slicetable spanning the entire SSD is not scalable: in particular, this can cause the flush of the in-memory hash table to take an undue amount of time, during which lookup operations can also be blocked (note that SSD I/Os are blocking). To mitigate this and control the worst-case insertion cost, we adopt a strategy similar to BufferHash: we partition the in-memory hash table into multiple small in-memory hash tables based on the first few bits of the keys' address space. We then maintain a separate slicetable for each in-memory hash table (shown in Figure 6). If an in-memory partition becomes full, we only need to update the corresponding (smaller) slicetable on the SSD. Henceforth, we assume a partitioned SliceHash is in use. Furthermore, we use the term "in-memory hash table" to refer to one of the partitions in memory (Figure 6).

Figure 6: Partitioned SliceHash. The key space is split across multiple in-memory hash tables in DRAM; each hash table has its own slicetable (holding K incarnations) on the flash SSD.

5.1.2 Some Optimizations

Hash table variants. Note that, for ease of explanation, we considered the case of a simple hash table with a single entry per slot. We can trivially support hash tables with a fixed-size bucket of entries for every slot; in this case, each slice will have buckets of entries from all incarnations for a given slot. We can also support an N-function Cuckoo hash table [34]; in this case, a key lookup may need to read up to N slots in the worst case (when the key is not found in the first N − 1 slots). Lookup cost is bounded by N page reads.

Leveraging available memory. SliceHash may require an SSD page read for a key lookup even if the key is not present in the entire data structure. Additional memory, if available, can be used to reduce such spurious lookups through the use of a summary data structure, such as a Bloom filter, for every slicetable. All lookups are first checked against the Bloom filters. SSD operations are issued only if the Bloom filters indicate that the key is present. Crucially, our design can use memory opportunistically: e.g., we can maintain Bloom filters only for certain partitions, for example, those that are accessed frequently. This gives SliceHash the ability to adapt to memory needs, while ensuring that in the absence of such additional memory, application performance targets are still met.
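A sketch (ours) of this opportunistic check follows; the Bloom filter object and the `read_slice` helper are hypothetical stand-ins for a summary structure and the normal SliceHash lookup path, respectively.

```python
# Sketch (ours) of the opportunistic per-partition summary check.
# `bloom_filters` holds an in-memory Bloom filter only for the partitions
# we chose to summarize; `read_slice` performs the usual SliceHash lookup.
def lookup_with_summary(partition_id, key, bloom_filters, read_slice):
    bf = bloom_filters.get(partition_id)       # may be absent for this partition
    if bf is not None and not bf.may_contain(key):
        return None                            # definitely absent: skip the SSD read
    return read_slice(partition_id, key)       # normal slice read on the SSD
```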

5.2 Adding concurrency to SliceHash

Figure 7: Adding concurrency to SliceHash. A scheduler picks requests from the request queue (insert, lookup, and update operations), processes in-memory requests directly, and assigns flash read requests to worker threads so as to exploit channel parallelism; flash write requests are likewise assigned to a worker.

In order to leverage the parallelism inherent to SSDs, I/O requests should be issued in such a manner that they are spread uniformly across channels (DP3). In SliceHash, we use two components to achieve this: (1) a scheduler for request selection, and (2) a worker for SSD reads/writes.

The scheduler processes requests in batches. It first processes all requests that can be instantly served in memory. Then, it processes lookup requests which need reading from the SSD. We have developed a channel estimator (described in the next section) to estimate the mapping between read requests and channels. Using these estimates, the scheduler finds a set of K requests (we choose K as the size of the SSD's NCQ).

We now describe the request selection algorithm. The algorithm ensures that the number of requests picked for any channel is minimized, the idea being that while we want to use as much concurrency as NCQ can provide, we also want to use it carefully to optimally exploit channel parallelism. To meet this objective, we maintain a "depth" for each channel, which estimates the number of selected requests for that channel. We take multiple passes over the request queue until we have selected K requests. In each pass, we select requests that would increase the depth of any channel by at most 1. In this manner, we first find the set of read requests to be issued.
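A sketch (ours) of this selection loop is shown below; `channel_of` stands in for the channel estimator of §5.2.1 and K for the NCQ depth.

```python
# Sketch (ours) of the multi-pass request-selection loop described above.
def select_requests(pending_reads, channel_of, K):
    """Pick up to K read requests while spreading them across channels:
    each pass over the queue adds at most one request per channel."""
    selected, remaining = [], list(pending_reads)
    while len(selected) < K and remaining:
        used_this_pass = set()                 # channels given a request in this pass
        still_pending = []
        for req in remaining:
            ch = channel_of(req)
            if ch not in used_this_pass and len(selected) < K:
                used_this_pass.add(ch)
                selected.append(req)
            else:
                still_pending.append(req)
        if not used_this_pass:                 # no progress possible; stop
            break
        remaining = still_pending
    return selected
```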

The scheduler then instructs the worker to process the chosen read requests in parallel. The worker simply employs multiple threads to issue requests to the SSD. Each thread is "associated" with a channel and is assigned requests that correspond to this channel. When a thread completes a request, it accepts new requests for the channel.

As the SSD page reads complete, the worker searches the entries of all incarnations on the pages for the input key. After processing lookups, the scheduler assigns SSD insert requests to the worker soon after an in-memory hash table fills up and needs to be flushed to the SSD. The worker accordingly reads/writes slicetables from/to the SSD.

Note that there may be consistency issues with reordering reads and writes. The scheduler handles such corner cases explicitly.

5.2.1 Channel Estimation

We now describe how to estimate the channels corresponding to the read requests issued to the SSD, which is a crucial component in performing concurrent I/O on the SSD (DP3). We focus on SSDs that use write-order mapping (the mapping strategy can be inferred using the techniques in [19], as mentioned in §4). Similar approaches can be employed for SSDs that use other write policies.

One approach that can help estimate channels is to maintain metadata that tracks how chunks are assigned to channels as they are written: when an in-memory hash table is flushed to the SSD, we can estimate the channel for the ith chunk written globally as i % N, and update the metadata (as discussed in §4, chunk writes in write-order mapping are striped across channels, i.e., the first write goes to the first channel, the second write goes to the second channel, and so on). The problem with this approach is that the metadata can be very large and consume a lot of memory.

Instead, we use a technique that does not require any metadata at all. First, we restrict the size of a slicetable to be a multiple of N × ChunkSize, where N is the number of channels. Thus, whenever a slicetable is written to the SSD, there will be N chunk writes, with the ith chunk write going to the ith channel. In other words, once we determine the relative chunk identifier (first, second, ..., or Nth) for an offset in the slicetable, we can determine the channel. The relative chunk identifier can be determined from the offset within the slicetable divided by the chunk size, modulo N. Although this is a heuristic, extensive experiments show that it is remarkably effective at helping the scheduler schedule requests across channels (§7).
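Our reading of this heuristic, in code (assumptions are ours: the channel count and chunk size are the values estimated in §4, and every slicetable is always rewritten in full so that its ith chunk always lands on channel i):

```python
# Sketch (ours) of metadata-free channel estimation under write-order mapping.
N = 32                  # channels (estimated in Section 4)
CHUNK_SIZE = 8 * 1024   # bytes per chunk (estimated in Section 4)

def channel_of_read(offset_in_slicetable):
    """Channel serving a read at a byte offset within a slicetable whose
    size is a multiple of N * CHUNK_SIZE."""
    relative_chunk = offset_in_slicetable // CHUNK_SIZE
    return relative_chunk % N
```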

5.3 Leveraging multiple SSDs

Due to its simple design and low resource footprint, SliceHash can be easily extended to run across multiple SSDs attached to a single machine. We elaborate below on two possibilities: one that offers high throughput and the other that offers a low memory footprint.

Throughput-oriented design. We can exploit multiple SSDs to increase parallelism and obtain high throughput. To do this, we partition the key space across multiple SSDs so that incoming requests are distributed across SSDs and can be processed by the SSDs in parallel. We have one scheduler-worker combination for each SSD, which handles the incoming requests and issues the corresponding I/Os. Similar to partitioned SliceHash, we maintain in-memory hash tables, but exactly one SSD has the slicetable for a given in-memory hash table.

Memory-oriented design. We can also exploit multiple SSDs to lower the memory footprint of SliceHash. In this design, the slicetable for an in-memory hash table expands across multiple SSDs, i.e., each slice has its entries stored across multiple SSDs. So the slicetable can contain a larger number of incarnations compared to the number of incarnations when using a single SSD. As the number of incarnations increases, the memory footprint is reduced (§7.4). Although more incarnations must be read, and from multiple SSDs, when reading a slice, lookups can be issued to multiple SSDs in parallel, avoiding any loss in performance.

5.4 Analysis

In this section, we analyze the I/O latency and the memory overhead of SliceHash. We also estimate the number of writes to the SSD per unit time, and its impact on SSD lifetime. Alongside, we illustrate the knobs SliceHash offers to easily control cost-performance trade-offs; such tunability is missing from almost all prior designs.

Table 2 summarizes the notation used.

Memory overhead per entry. We estimate the memory overhead per entry. The total number of entries in an in-memory hash table is H/s_eff, where H is the size of a single hash table and s_eff is the effective average space taken by a hash entry (actual size s / utilization u). The total number of entries overall in SliceHash for a given SSD size F is:

((F + M) / H) × (H / s_eff) = (F + M) / s_eff

Here, M is the total memory size. Hence, the memory overhead per entry is M / #entries, i.e., (M / (F + M)) × s_eff, or (1 / (k + 1)) × s_eff, where k is the number of incarnations.

For s = 16B (8-byte key, 8-byte value), u = 80%, M = 1GB, and F = 32GB, the memory overhead per entry is 0.6 bytes/entry.

Symbol   Meaning
M        Total memory size
N        Number of SSDs
n        Number of partitions
H        Size of a single hash table (= M/n)
s        Size taken by a hash entry
u        Utilization of the hash table
s_eff    Effective average space taken by a hash entry (= s/u)
k        Number of incarnations (= F/M)
F        Total SSD size
S        Size of a slicetable (= H × k)
P        Size of an SSD page/sector
B        Size of an SSD block
rp       Page read latency
rb       Block read latency
wb       Block write latency

Table 2: Notation used in the cost analysis.

In contrast, state-of-the-art approaches for SSD-based hash tables, e.g., SILT [27] and BufferHash [14], have memory overheads of 0.7 bytes/entry and 4 bytes/entry, respectively. The use of Bloom filters in BufferHash (used to prevent lookups from incurring too many SSD reads across multiple incarnations) is a key contributor to its high memory overhead.

When we leverage N SSDs using the scheme in §5.3, the number of incarnations per in-memory hash table becomes k × N. In turn, the memory overhead becomes (1 / (k × N + 1)) × s_eff. Specifically, using the same configuration outlined above but now with N = 4 SSDs, the memory overhead becomes just 0.15 bytes/entry.

Insertion cost. We estimate the average time taken for insert operations. We first calculate the time taken to read a slicetable and then write it back. This is given by (S/B) × rb + (S/B) × wb, where S is the size of the slicetable, B is the size of an SSD block, and rb and wb are the read and write latencies per block, respectively. This flushing happens after H/s_eff entries are inserted into the hash table; all insertions up to this point are made in memory. Hence, the average insertion cost is

((S/B) × rb + (S/B) × wb) × s_eff / H

Replacing S by H × k, we get (rb + wb) × s_eff × k / B, which is independent of the size of the hash table.

For a typical block read latency of 0.31 ms [15], a block write latency of 0.83 ms [15], s = 16B, M = 1GB, F = 32GB, and u = 80%, the average insertion cost is ∼5.7 µs. Similarly, the worst-case insertion cost of SliceHash is (0.31 + 0.83) × S/B ms. By configuring S to be the same size as B, we can limit the worst-case insertion cost to (0.31 + 0.83) = 1.14 ms.
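The few lines below (ours) simply re-derive the example numbers quoted in this subsection from the formulas above.

```python
# Reproduce the example numbers from the analysis above (ours).
s, u = 16, 0.80                   # entry size (bytes) and hash-table utilization
s_eff = s / u                     # 20 bytes effectively per entry
M, F = 1 * 2**30, 32 * 2**30      # 1GB of DRAM, 32GB SSD
k = F // M                        # 32 incarnations
B = 128 * 1024                    # 128KB SSD block
rb, wb = 0.31e-3, 0.83e-3         # block read/write latency in seconds [15]

print(s_eff / (k + 1))            # memory overhead: ~0.6 bytes/entry
print(s_eff / (k * 4 + 1))        # with N = 4 SSDs: ~0.15 bytes/entry
print((rb + wb) * s_eff * k / B)  # average insertion cost: ~5.6e-06 s (a few microseconds)
```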

In contrast, BufferHash has average and worst-case insertion latencies of ∼0.2µs and 0.83ms, both of which are better than SliceHash. We believe that the somewhat higher I/O costs are an acceptable trade-off for the much lower memory footprint in SliceHash.

Lookup cost. We consider a Cuckoo-hashing based hash table implementation with 2 hash functions. Suppose that the probability of success for the first lookup is p. For each lookup, a corresponding slice is read. Configuring H, the size of an in-memory hash table, to match that of a page, the average lookup cost becomes rp + (1 − p) × rp, or (2 − p) × rp, assuming that almost all of the lookups go to the SSD and only a negligible fraction are served by in-memory hash tables. For p = 0.9 and rp = 0.15 ms, the average lookup cost is 0.16 ms. SILT and BufferHash have a similar average lookup cost.

The worst case happens when we have to read both pages corresponding to the two hash functions. Thus, the worst-case lookup latency is 2 × rp. For rp = 0.15 ms, this cost is 0.3 ms. In contrast, BufferHash may have a very high worst-case lookup latency because it may have to scan all incarnations due to the false positives arising from the use of Bloom filters. For k = 32, this cost would be as high as 4.8 ms.

Frequency of SSD writes, and knobs for tunability. We estimate the ratio of the number of insertions to the number of block writes to the SSD; we denote this as r_write. A hash table becomes full after every H/s_eff inserts, after which the corresponding slicetable on the SSD is modified. The number of blocks occupied by a slicetable is S/B, or k × H/B. Thus,

r_write = (H / s_eff) × (B / (k × H)) = B / (k × s_eff)

Thus, by increasing the number of incarnations k, the frequency of writes to the SSD (which is inversely proportional to r_write) also increases. This in turn affects the overall performance.

Note, however, that increasing the number of incarnations also decreases the memory overhead, as shown earlier. We investigate this dependency in more detail in §7.4 and find that our design provides a smooth trade-off between memory overhead and performance, allowing designers the flexibility to pick a point in the design space that best fits their specific cost-performance profile.

Effect on SSD lifetime. SliceHash increases the number of writes to the SSD, which may impact its overall lifetime. We estimate the lifetime of an SSD as follows. For a given insert rate R, the number of block writes to the SSD per second is R/r_write, i.e., the average time interval between block writes is r_write/R. Say the SSD supports E erase cycles. Also, assume that the wear-leveling scheme for the SSD is perfect. Then, the lifetime T of the SSD can be approximately estimated as the number of blocks, F/B, times the erase cycles, E, times the average time interval between block writes, r_write/R, i.e.,

T = (F × E × r_write) / (R × B)

Consider a 256GB MLC SSD drive that supports 10,000 erase cycles [16]. We use SliceHash on this SSD with M = 4GB of DRAM, i.e., k = 64. With a 16B entry size and a utilization of 80%, the ratio r_write would be 102.4. Even with R = 10K inserts/sec (required, e.g., for a WAN optimizer connected to a 500 Mbps link), the SSD would last 6.8 years. Thus, despite the increase in writes to the SSD, its lifetime would still be reasonably long.
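Again purely as a check (ours), the quoted lifetime follows directly from the formula:

```python
# Reproduce the SSD-lifetime estimate above (ours).
s_eff = 16 / 0.80                # 20 bytes effectively per entry
B = 128 * 1024                   # 128KB block
k = 64                           # 256GB SSD with 4GB of DRAM
r_write = B / (k * s_eff)        # ~102.4 inserts per block write
F = 256 * 2**30                  # 256GB SSD
E = 10_000                       # erase cycles per block
R = 10_000                       # inserts per second
T = F * E * r_write / (R * B)    # lifetime in seconds
print(T / (365 * 24 * 3600))     # ~6.8 years
```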

In sum, our analysis shows that our design patterns help SliceHash reduce the memory overhead to 0.6 bytes/entry and limit the lookup cost to 1 page read on average, without significantly affecting the average insert performance or SSD lifetime. A simple knob – the number of incarnations – helps control the performance-cost trade-off in a fine-grained fashion. We empirically study the performance and flexibility benefits of SliceHash in §7.

Next, we discuss how our key techniques can also be applied to other (hashing-based) data structures.

6. GENERALITY

In this section, we discuss how the five design patterns discussed in §4 and the slicing primitive discussed in §5 can be used to design other hashing-based data structures. We focus on large Bloom filters and locality sensitive hashing (LSH)-based indexes. Many of the supporting design techniques we used in SliceHash—the use of incarnations, slices, slicetables, and optimizations for multiple SSDs—are derived directly from the design patterns and hence, as argued below, they also apply directly to alternate data structures.

[Figure 8: Extension to other hash-based systems. (a) SliceBloom: in-memory Bloom filters 0..N in DRAM, with corresponding slicefilters on the flash SSD, each holding slices 0..M. (b) SliceLSH: in-memory LSH hash tables 0..N in DRAM, with corresponding LSH slicetables on the flash SSD, each holding slices 0..M.]

Bloom Filters. Bloom filters have traditionally been used as in-memory data structures [9]. As some recent studies have observed [24, 18], with storage costs falling and data volumes growing into the peta- and exa-bytes, space requirements for Bloom filters constructed over such datasets are also growing commensurately. In limited memory environments, there is a need to maintain large Bloom filters on secondary storage. We show how we can apply our techniques for supporting Bloom filters on SSD storage in an efficient and high-performance fashion.

Figure 8a shows the overview of our SliceBloom data structure. Similar to SliceHash, we maintain several in-memory Bloom filters and corresponding slicefilters on the SSD; the in-memory Bloom filters are written to the SSD as incarnations. Each slice in a slicefilter contains the bits from all incarnations taken together for a given slot (Figure 8a).

In traditional Bloom filters, a key lookup requires computing multiple hash functions and reading entries corresponding to the bit positions computed by the hash functions. In our case, for each hash function we first look up the corresponding in-memory Bloom filter and then the corresponding slicefilter on the SSD.
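To make this concrete, the sketch below shows one plausible reading of the membership test, under the assumption that the slice read for a slot returns one bit per incarnation, and that a key is reported present if some single incarnation (or the current in-memory filter) has all h of its probed bits set. The names, sizes, and hash function are illustrative stand-ins, not the actual SliceBloom implementation.

#include <bitset>
#include <cstdint>
#include <functional>
#include <vector>

constexpr int    kIncarnations = 256;      // k: incarnations per slicefilter (illustrative)
constexpr int    kHashFuncs    = 3;        // h: hash functions
constexpr size_t kFilterBits   = 1 << 20;  // bits per Bloom filter (illustrative)

using IncarnationBits = std::bitset<kIncarnations>;

// Current in-memory Bloom filter, and a DRAM stand-in for the on-SSD
// slicefilter: slicefilter[slot] holds bit `slot` of every incarnation.
std::vector<bool>            in_memory(kFilterBits, false);
std::vector<IncarnationBits> slicefilter(kFilterBits);

uint64_t slot_for(int i, uint64_t key) {
    // Simple illustrative hash; the real system would use stronger functions.
    return std::hash<uint64_t>{}(key ^ (0x9E3779B97F4A7C15ULL * (i + 1))) % kFilterBits;
}

// In the real design each probed slot costs one flash page read; here we
// simply index the in-DRAM stand-in for the slicefilter.
bool slicebloom_lookup(uint64_t key) {
    IncarnationBits candidates;            // incarnations with every probed bit set so far
    candidates.set();
    bool in_memory_hit = true;
    for (int i = 0; i < kHashFuncs; ++i) {
        uint64_t slot = slot_for(i, key);
        in_memory_hit = in_memory_hit && in_memory[slot];
        candidates &= slicefilter[slot];   // AND the per-incarnation bit vectors
    }
    return in_memory_hit || candidates.any();
}

int main() {
    // Toy demonstration: mark key 42 as present in incarnation 7, then find it.
    for (int i = 0; i < kHashFuncs; ++i) slicefilter[slot_for(i, 42)].set(7);
    return slicebloom_lookup(42) ? 0 : 1;
}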

The number of hash functions would determine the number of page lookups, which could limit the throughput. We now argue how this cost can be controlled.

Since SSD storage is much cheaper than DRAM, we can use more space per entry on the SSD – i.e., use a large m/n, where m and n are the Bloom filter size and the number of unique elements, respectively; this allows us to use fewer hash functions (smaller h) while maintaining a similar overall false positive rate [1]. For example, for a target false positive rate of 0.0008, instead of using m/n = 15 and h = 8, we can use m/n = 32 and h = 3. By reducing h, we can reduce the number of page lookups and improve performance.
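This trade-off follows from the standard Bloom filter approximation p ≈ (1 − e^(−h·n/m))^h; the snippet below is an illustrative calculation (not part of SliceBloom) confirming that both configurations land near the 0.0008 target.

#include <cmath>
#include <cstdio>

// False positive rate of a Bloom filter with m/n bits per element and h hash functions.
static double fp_rate(double bits_per_entry, int h) {
    return std::pow(1.0 - std::exp(-h / bits_per_entry), h);
}

int main() {
    std::printf("m/n = 15, h = 8: p = %.5f\n", fp_rate(15, 8));  // ~0.00085
    std::printf("m/n = 32, h = 3: p = %.5f\n", fp_rate(32, 3));  // ~0.00072
    return 0;
}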

Our design patterns and the techniques we derive from them enable us to reduce the effective memory footprint per key (where a “key” refers to a unique element inserted into the Bloom filter) while achieving high performance, similar to the trade-offs we were able to achieve with SliceHash. For example, choosing m/n = 32, we can use a combination of a 256MB DRAM and a 64GB SSD (leading to 256 incarnations per Bloom filter) to store Bloom filters. This results in an effective memory overhead of 0.125 bits per entry and causes block writes to the SSD every 128 key insertions. Our evaluation in §7.5 shows that we achieve good throughput with this configuration.

LSH-based index. Locality sensitive hashing [26] is a technique used in the multimedia community [28, 10] for finding duplicate videos and images at large scale. LSH systems use multiple hash tables. For each key, the corresponding bucket in each hash table is looked up. Then, all entries in the buckets are compared with the key to find the nearest neighbor based on a certain metric (e.g., the Hamming distance or an L2 norm). We discuss how we can apply our design patterns and techniques to build LSH-based indexes efficiently and at large scale on SSDs.

Figure 8b shows the overview of SliceLSH’s design. Each LSH hash table is designed using SliceHash. When a query arrives, it is distributed to all SliceHash instances. Leveraging the design patterns, we can subtly tweak the data structure to more closely align with how LSH works and ensure improved I/O performance.

Specifically, when we write in-memory LSH hash tables to the SSD, we arrange them such that: (1) all chunks of each slicetable get mapped to the same channel (this is in contrast with SliceHash, where each chunk in a slicetable may go to a different channel), and (2) the chunks corresponding to different LSH hash tables map to different channels. For example, assume that an SSD has 10 channels, there are 10 LSH in-memory hash tables, there are 10 LSH slicetables on the SSD, and each LSH slicetable has 10 chunks. Now assume that all LSH hash tables get full simultaneously and are written to the SSD together. While writing to the SSD, we take the first chunk from each LSH slicetable and write these chunks in some order. Then, we take the second chunk from each LSH slicetable and write them in the same order. This results in the same number of chunks as channels (i.e., 10) being written between the first chunk write and the second chunk write of any LSH slicetable, so both chunks of the LSH slicetable get mapped to the same channel as per write-order mapping (described in §4.2). We repeat this method of writing for the remaining chunks of all LSH slicetables. As a result, all chunks of one LSH slicetable get mapped to the same channel.
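The following sketch walks through the write order just described for the 10-channel example; write_chunk() is a hypothetical placeholder for the actual SSD write, and the channel computation simply mirrors the write-order mapping of §4.2 (the i-th chunk written lands on channel i mod N).

#include <cstdio>

const int kChannels            = 10;  // also the number of LSH slicetables in the example
const int kChunksPerSlicetable = 10;

// Placeholder for the actual SSD write of one chunk of one LSH slicetable.
void write_chunk(int table, int chunk, int write_index) {
    int channel = write_index % kChannels;  // write-order mapping (§4.2)
    std::printf("slicetable %d, chunk %d -> channel %d\n", table, chunk, channel);
}

int main() {
    int write_index = 0;
    // Outer loop over chunk positions, inner loop over slicetables: chunk c of
    // every slicetable is written before chunk c+1 of any slicetable, so every
    // chunk of slicetable t lands on channel t.
    for (int c = 0; c < kChunksPerSlicetable; ++c)
        for (int t = 0; t < kChannels; ++t)
            write_chunk(t, c, write_index++);
    return 0;
}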

The benefit of this approach is that multiple LSH hash table lookups for a given key will be uniformly distributed over multiple channels. This helps us maximally leverage the intrinsic parallelism of SSDs, resulting in high lookup throughput (§7).

7. EVALUATION

In this section, we measure the effectiveness of our design patterns as applied to the three different indexes described above. For simplicity, a majority of our evaluation focuses on SliceHash.

Our goal is to answer the following key questions:

• Performance, workload variations: What is the lookup and insert performance of SliceHash? To what extent can SliceHash leverage the benefits of the intrinsic parallelism in flash storage? How does SliceHash perform under different mixes of read and write workloads? How does it compare with the state-of-the-art in these respects?
• Tunability: How much flexibility does SliceHash provide in terms of meeting different memory footprint (which we use as a proxy for cost) vs. performance trade-offs? How effectively can SliceHash leverage the scale offered by multiple SSDs without sacrificing index or application performance?
• Generality: How do our design choices improve the performance of other indexes?

7.1 Implementation and Configuration

We have implemented SliceHash in C++ using roughly 3000 lines of code. I/O concurrency is implemented using the pthread library. We use direct I/O for access to the SSD. We use the simple “noop” scheduler in the Linux kernel (which implements basic FIFO scheduling of I/O requests) for leveraging the intrinsic parallelism of SSDs.

Each hash table is implemented using Cuckoo hashing [34] with 2 hash functions and 3 entries per bucket, which corresponds to 86% space utilization. As mentioned in §5, we have multiple in-memory hash tables. The size of each of these is 128KB, so each can hold ∼7000 key-value entries of size 16B each. Slicetables corresponding to different in-memory hash tables are arranged across contiguous logical block addresses on the SSD.
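For concreteness, the constants below capture this configuration of a single in-memory hash table; the struct and names are illustrative, not taken from the SliceHash sources.

#include <cstddef>
#include <cstdint>

struct Entry {          // one 16B key-value entry
    uint64_t key;
    uint64_t value;
};
static_assert(sizeof(Entry) == 16, "entries are 16 bytes");

constexpr std::size_t kTableBytes       = 128 * 1024;  // 128KB per in-memory hash table
constexpr int         kHashFunctions    = 2;           // cuckoo hashing
constexpr int         kEntriesPerBucket = 3;
constexpr double      kUtilization      = 0.86;        // achievable occupancy
constexpr std::size_t kSlots            = kTableBytes / sizeof(Entry);                      // 8192 slots
constexpr std::size_t kUsableEntries    = static_cast<std::size_t>(kSlots * kUtilization);  // ~7000 entries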

We evaluate SliceHash on a 128GB Crucial M4 SSD attached to a desktop with dual 2.26 GHz quad-core Intel Xeon processors. We use 32 threads for issuing concurrent I/O requests, corresponding to the number of channels in the Crucial SSD. The size of the NCQ is also 32.

Unless otherwise specified, the size of each slicetable is 4096KB and the slicetable contains 32 incarnations of an in-memory hash table. This amounts to using 4GB DRAM in total toward the SliceHash data structure.

7.2 SliceHash Performance

We evaluate the lookup and insert performance of SliceHash, examining its throughput, memory footprint, and CPU overhead under different mixes of read and write workloads. We compare SliceHash with BufferHash and SILT.

Methodology. BufferHash [14] does not consider concurrent I/O access. For a fair comparison against SliceHash, we added concurrency to BufferHash using the pthread library. We also added locking mechanisms to ensure that no two threads access the same in-memory hash table of BufferHash at the same time. We use a configuration similar to [14], i.e., 16 incarnations and a 128 KB in-memory hashtable partition with a maximum of 4096 entries. We use 8GB DRAM for in-memory hashtables, 8GB DRAM for Bloom filters, and 128 GB of flash SSD. The memory footprint of this configuration is ∼4 bytes/entry.

SILT [27] considers concurrent access by default. We use 4 SILT instances with 16 client threads concurrently issuing requests. A merge operation is triggered when a partition has one or more HashStores; we do not limit the convert or merge rates. At the beginning of each experiment, we insert 100K random key-value pairs to “warm up” SILT’s stores.

We use YCSB [20] to generate uniformly random key-value workloads with varying lookup and insert ratios.¹ Each workload consists of 100K operations, unless otherwise noted.

Lookup performance. Figure 9a shows the performance of the three systems—SliceHash (SH), multi-threaded BufferHash (BH+MT), and SILT (SILT)—for a lookup-only workload. We observe that SliceHash achieves 69K lookups/sec while SILT and BH+MT achieve only 62K lookups/sec (10% lower) and 57K lookups/sec (12% lower), respectively. SliceHash achieves higher lookup performance because it exploits channel-level parallelism by running multiple threads accessing different channels in parallel. In contrast, neither SILT nor BH+MT is designed to exploit such channel-level parallelism.

Insert performance. We now study the insert throughput of SliceHash. Figure 9b shows the performance of the three systems for a continuous insert-only workload. We observe that SliceHash can achieve 125K inserts/sec. In contrast, BufferHash can achieve almost 1100K inserts/sec for the same configuration (i.e., 128 KB in-memory hash table), and SILT can achieve 254K inserts/sec.

Recall that each in-memory hash table in SliceHash holds ∼7K items. Flushing to the SSD happens once the hash table becomes full. At that time, SliceHash reads the corresponding slicetable from the SSD, modifies it, and then writes it back to flash. While this cost gets amortized over multiple insertions, it does affect the average SliceHash performance. In contrast to SliceHash, SILT achieves higher performance, but at the expense of an increase in memory footprint; SILT’s memory footprint increases approximately linearly due to a backlog of HashStores.

¹We use the upper 8 bytes of the SHA1 hash of each YCSB-generated key as our 8-byte key.

                     Memory Footprint (bytes/entry)
Percentage Inserts | SliceHash | BH+MT | SILT
0%                 | 0.6       | 4     | 0.21
50%                | 0.6       | 4     | 0.21–0.57
100%               | 0.6       | 4     | 0.21–1.46

Table 3: Memory footprint under various workloads

                     CPU Utilization (%)
Percentage Inserts | SliceHash | BH+MT | SILT
0%                 | 16        | 18    | 27
50%                | 12        | 24    | 67
100%               | 8         | 92    | 72

Table 4: CPU utilization under various workloads.

As shown in Table 3, SILT’s memory footprint increases from 0.21 bytes/entry to 1.46 bytes/entry over the course of the workload, while SliceHash’s memory footprint remains a constant 0.6 bytes/entry. One can bound SILT’s memory footprint to 0.6 bytes/entry, but this severely impacts the rate of inserts that can be handled (as shown by “SILT-cap” in Figure 9b, which achieves only 46K inserts/sec). Thus, under the same memory footprint, SliceHash is still better than SILT.

On the other hand, BufferHash is a highly write-optimized data structure; because BufferHash only has to write the buffer to the SSD when it gets full, a much higher write throughput is possible. We believe that this is an acceptable trade-off for the significantly lower memory overhead (Table 3) and better/more consistent lookup performance offered by SliceHash. Moreover, SliceHash can be augmented with a small write-optimized table (using a BufferHash-like data structure) for handling bursts of writes; this table can be written back to SliceHash during a period of low I/O activity.

SILT uses a similar idea; it uses a write-optimized data structure for handling writes, which is later merged into SILT’s read-optimized data structures. However, merging in SILT is far more compute-intensive (it needs sorting) than writing a hash table back to a slicetable with SliceHash, which just requires copying entries to appropriate positions. As shown in Table 4, the average CPU utilization² during an insert-only workload is 72% when running SILT and 8% when running SliceHash.

Mixed workload. Finally, we investigate how SliceHash performs under a continuous workload of 50% lookups and 50% inserts. Figure 9c shows the performance of the three systems in this mixed workload setting. We observe that SliceHash provides 94K ops/sec, versus 121K ops/sec for BH+MT and 92K ops/sec for SILT.

BH+MT only has to write the buffer to the SSD when the buffer becomes full, while SILT and SliceHash have to perform extra operations, which affects their performance. The performance of SliceHash and SILT is comparable, but SILT imposes high CPU overhead (67%) due to its background converting and merging operations, while SliceHash imposes very small overhead (12%) (Table 4).

7.3 Contribution of SliceHash Optimizations

We now study how the two main parallelism-centered optimizations—request-reordering and slice-based data layout—contribute to SliceHash’s performance.

Request-reordering. We study the extent to which reordering can be beneficial compared to a naive scheme of issuing requests in FIFO order. SliceHash-noCA does not consider the request-to-channel mapping when assigning requests to a thread; requests are simply assigned to threads in the order the requests are made.
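The contrast can be sketched as follows; page_to_channel() stands in for the write-order mapping of §4.2, and the per-channel queues stand in for the 32 I/O threads. All names here are illustrative assumptions, not the actual SliceHash code.

#include <cstdint>
#include <queue>
#include <vector>

const int kChannels = 32;  // one I/O thread (and queue) per channel

std::vector<std::queue<uint64_t>> queues(kChannels);

// Channel inferred from the page's position in write order (§4.2).
int page_to_channel(uint64_t page_addr) {
    return static_cast<int>(page_addr % kChannels);
}

// SliceHash: route each request to the thread that owns the target channel,
// so concurrent requests spread across channels.
void assign_channel_aware(uint64_t page_addr) {
    queues[page_to_channel(page_addr)].push(page_addr);
}

// SliceHash-noCA: hand requests to threads in arrival order, ignoring which
// channel the page lives on; requests for the same channel can end up in
// different threads and serialize at that flash channel.
void assign_fifo(uint64_t page_addr) {
    static int next_thread = 0;
    queues[next_thread].push(page_addr);
    next_thread = (next_thread + 1) % kChannels;
}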

We consider three types of workloads to study the impact: (1) Random: the keys are generated randomly, so the distribution of requests among channels is also random; (2) Skewed: the channel distribution is skewed, i.e., a certain number of requests (configured by the skew parameter S) go to the same channel, while the remaining requests are evenly distributed across channels; and (3) Ordered: the requests are uniformly distributed across channels, however their ordering is such that the first K requests go to the first channel, the second K requests go to the second channel, and the i-th set of K requests goes to channel i mod N (where N is the number of channels; N = 32 for the Crucial SSD). Essentially, if K = 1, even a FIFO scheme would have all 32 requests going to different channels (the best case), while if K = 32, all flash page read requests would go to the same channel (the worst case).

²Utilization is the sum of %user, %nice, and %system as reported by iostat at 1-second intervals.

[Figure 9: Performance under varying workloads and systems. Panels: (a) Lookup-only, (b) Insert-only, (c) Mixed (50% lookup / 50% insert); y-axes show throughput (K lookups/sec, K inserts/sec, K ops/sec) for SH, BH+MT, SILT, and (in (b)) SILT-cap.]

[Figure 10: Performance of SliceHash-noCA relative to SliceHash (percentage), for Random, Skewed(5), Skewed(10), Ordered(8), and Ordered(16) workloads.]

Figure 10 shows the performance of SliceHash-noCA relative to the performance of SliceHash. In the worst case (Ordered (K=16)), SliceHash-noCA can only achieve 42% of SliceHash’s performance. Under a small skew of 5 requests (S=5), the performance drops by 17%; a larger skew (S=10) deteriorates performance by almost 30%. Even with a random workload, where keys are likely to be evenly distributed across channels, we see a performance drop of 15%. These results indicate that channel-awareness is crucial to high performance in SliceHash.

Both BufferHash and SILT are oblivious to channel parallelism. §7.2 showed that, under a uniform random workload, SliceHash outperforms BufferHash and SILT. We expect that under skewed workloads (such as the ones presented here), the performance of BufferHash and SILT would be poor, and we find that this indeed is the case (results omitted for brevity).

Slicing. We showed in §5.4 that slicing significantly reduces SliceHash’s memory footprint compared to BufferHash’s use of Bloom filters. In principle, BufferHash could avoid using Bloom filters and maintain the same memory footprint as SliceHash while leveraging concurrency to obtain good performance. However, we show that doing so has a severe performance impact.

We use the lookup-only workload from §7.2 to measure the throughput. We observe that BufferHash without Bloom filters achieves very low performance, only 8K lookups/sec. In contrast, SliceHash-noCA achieved 57K lookups/sec. Since the central difference between SliceHash-noCA and BufferHash without Bloom filters is the use of slicing, this result shows that slicing is crucial for collectively achieving high performance and a low memory footprint.

# Incarnations | Insert-only (ops/sec) | Mixed (ops/sec) | Memory footprint (B/entry)
16             | 207K                  | 110K            | 1.1
32             | 139K                  | 93K             | 0.6
48             | 85K                   | 79K             | 0.38
64             | 66K                   | 70K             | 0.27

Table 5: Memory vs. performance trade-off with more incarnations

# SSDs | Lookup-only (ops/sec) | Mixed (ops/sec) | Memory footprint (B/entry)
1      | 69K                   | 93K             | 0.6
2      | 138K                  | 186K            | 0.3
3      | 207K                  | 279K            | 0.2

Table 6: Performance and memory footprint using multiple SSDs

Table 6: Performance and memory footprint using multiple SSDs

7.4 Tunability in SliceHash

SliceHash is highly flexible and can be tuned to match application requirements. SliceHash has a very small memory footprint (∼0.6 bytes/entry), and it can leverage additional memory to improve lookup performance, e.g., by using Bloom filters (§5.1.2). It also has a small CPU footprint, so it can easily be used with other applications requiring compute-intensive tasks. In contrast, BufferHash has a high memory footprint (Table 3), and SILT imposes high CPU overhead due to continuous sorting (Table 4); these aspects limit their suitability to a range of important applications.

In addition, SliceHash provides the flexibility to tune the memory footprint at the cost of performance, and it can scale to multiple SSDs without usurping memory/CPU, as we show below.

Memory footprint vs. performance. As discussed in §5.4, by increasing the number of incarnations, we can reduce the memory footprint of SliceHash. The side effect is that the number of block writes to flash SSDs is higher, which can affect performance. Table 5 shows this trade-off for mixed (50% lookup/50% insert) and insert-only workloads. SliceHash provides a throughput between 110K and 70K operations/sec for a mixed workload and between 207K and 66K operations/sec for an insert-only workload; SliceHash’s memory footprint ranges from 1.1 bytes/entry to 0.27 bytes/entry. The lookup-only workload is not shown here, as performance remains close to 69K lookups/sec regardless of the number of incarnations.

Scaling using multiple SSDs. We evaluate SliceHash on our Intel Xeon machine using up to 3 SSDs for both the high-throughput and low memory footprint configurations outlined in §5.3.

We find that SliceHash can provide linear scaling in performance with the throughput-oriented configuration (Table 6). With 3 SSDs, SliceHash offers 207K ops/sec for a lookup-only workload, and 279K ops/sec for a mixed workload. Because of its low CPU and memory footprint, SliceHash can easily leverage the multiple SSDs on a single physical machine to match higher data volumes and provide higher overall throughput without usurping the machine’s resources. Neither SILT nor BufferHash can scale in this fashion: the former due to high CPU overhead (67% CPU utilization when one SSD is used; 3 SSDs exceed the CPU budget for a single machine) and the latter due to high memory overhead (48 GB for 3 SSDs).

In the memory-oriented configuration, SliceHash’s memory overhead falls as the number of SSDs is increased, to 0.2 B/entry when 3 SSDs are used. But the throughput stays the same as with a single SSD. Neither SILT nor BufferHash can offer a similar scale-down of memory.

7.5 Generality: SliceBloom and SliceLSH

We now show how our general design patterns improve the performance of other indexes.

We evaluate SliceBloom on the 128GB Crucial SSD using 512 MB of DRAM. We use m/n = 32 and h = 3 hash functions, with a memory overhead of 0.1 bits/entry. Under a continuous mixed workload, our system can perform 15K ops/sec. With naive parallelism, the system performance can drop to 5K ops/sec, especially when all requests go to the same channel. In contrast, BloomFlash [24] achieves similar performance for a mixed workload, but on a high-end Fusion-io SSD (100,000 4KB I/Os per sec) that costs 30X more ($6K vs. $200). Furthermore, on a low-end Samsung drive, BloomFlash only provides 4-5K lookups/sec.

We also evaluate SliceLSH on the Crucial SSD. We use 10 hash tables, where each hash table uses 256MB in memory, and the corresponding slicetable occupies 8GB on flash. SliceLSH can perform 6.9K lookups/sec, as it has to look up each hash table. By design, SliceLSH can intrinsically exploit channel parallelism. Hence, our system consistently offers similar performance under various workload patterns (results omitted for brevity).

8. CONCLUSION

A key impediment in the design of emerging high-performance data-intensive systems is the design of large hash-based indexes that offer good throughput and latency under specific workloads and at specific cost points. Prior works have explored point solutions using SSDs that are each suited to a narrow setting and crucially lack flexibility and generalizability.

In this paper, we develop a set of general techniques for building large, efficient, and flexible hash-based systems by carefully leveraging unique properties of SSDs. Using these techniques, we first build a large streaming hash table, called SliceHash, that provides higher performance, while imposing low computation overhead and low memory overhead, compared to the state-of-the-art. Developers can easily tune SliceHash to meet performance goals under tight memory constraints and satisfy the diverse requirements of various data-intensive applications. The indexes also perform well under a range of workloads. We illustrate the generality of our ideas by showing that they can be applied to building other efficient and flexible hash-based indexes.

Additionally, our work shows the promise of adopting the design patterns and primitives we advocate to develop other general SSD-based indexes.

9. REFERENCES

[1] Bloom filter maths. http://cs.wisc.edu/~cao/papers/summary-cache/node8.html.
[2] BlueCoat video caching appliance. http://www.bluecoat.com/company/press-releases/blue-coat-introduces-carrier-caching-appliance-large-scale-bandwidth-savings.
[3] Crucial SSD. http://crucial.com/store/ssd.aspx.
[4] Disk backup and deduplication with DataDomain. http://datadomain.com.
[5] Memcached: A distributed memory object caching system. http://memchached.org.
[6] Peribit Networks (acquired by Juniper in 2005): WAN optimization solution. http://www.juniper.net/.
[7] Riverbed: WAN optimization. http://riverbed.com/solutions/wan optimization.
[8] A. Badam, K. Park, V. S. Pai, and L. Peterson. HashCache: Cache storage for the next billion. In NSDI, 2009.
[9] A. Broder and M. Mitzenmacher. Network applications of Bloom filters: A survey. Internet Mathematics, 1(4):485–509, 2005.
[10] A. Torralba, R. Fergus, and Y. Weiss. Small codes and large image databases for recognition. In CVPR, 2008.
[11] N. Agrawal, V. Prabhakaran, T. Wobber, J. D. Davis, M. S. Manasse, and R. Panigrahy. Design tradeoffs for SSD performance. In USENIX ATC, 2008.
[12] A. Anand, A. Akella, V. Sekar, and S. Seshan. A case for information-bound referencing. In HotNets, 2010.
[13] A. Anand, A. Gupta, A. Akella, S. Seshan, and S. Shenker. Packet caches on routers: The implications of universal redundant traffic elimination. In SIGCOMM, 2008.
[14] A. Anand, C. Muthukrishnan, S. Kappes, A. Akella, and S. Nath. Cheap and large CAMs for high performance data-intensive networked systems. In NSDI, 2010.
[15] A. Anand, S. Kappes, A. Akella, and S. Nath. Building cheap and large CAMs using BufferHash. Technical Report 1651, University of Wisconsin, 2009.
[16] M. Balakrishnan, A. Kadav, V. Prabhakaran, and D. Malkhi. Differential RAID: Rethinking RAID for SSD reliability. In EuroSys, 2010.
[17] M. Bjørling, L. L. Folgoc, A. Mseddi, P. Bonnet, L. Bouganim, and B. T. Jonsson. Performing sound flash device measurements: Some lessons from uFLIP. In SIGMOD Conference, pages 1219–1222, 2010.
[18] M. Canim, G. A. Mihaila, B. Bhattacharjee, C. A. Lang, and K. A. Ross. Buffered Bloom filters on solid state storage. In ADMS, 2010.
[19] F. Chen, R. Lee, and X. Zhang. Essential roles of exploiting internal parallelism of flash memory based solid state drives in high-speed data processing. In HPCA, pages 266–277, 2011.
[20] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. Benchmarking cloud serving systems with YCSB. In SoCC, 2010.
[21] D. Andersen, J. Franklin, M. Kaminsky, A. Phanishayee, L. Tan, and V. Vasudevan. FAWN: A fast array of wimpy nodes. In SOSP, 2009.
[22] B. K. Debnath, S. Sengupta, and J. Li. FlashStore: High throughput persistent key-value store. PVLDB, 3(2):1414–1425, 2010.
[23] B. K. Debnath, S. Sengupta, and J. Li. SkimpyStash: RAM space skimpy key-value store on flash-based storage. In SIGMOD, 2011.
[24] B. K. Debnath, S. Sengupta, J. Li, D. J. Lilja, and D. H.-C. Du. BloomFlash: Bloom filter on flash-based storage. In ICDCS, 2011.
[25] D. Zeinalipour-Yazti, S. Lin, V. Kalogeraki, D. Gunopulos, and W. A. Najjar. MicroHash: An efficient index structure for flash-based sensor devices. In FAST, 2005.
[26] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB, 1999.
[27] H. Lim, B. Fan, D. G. Andersen, and M. Kaminsky. SILT: A memory-efficient, high-performance key-value store. In SOSP, 2011.
[28] Q. Lv, M. Charikar, and K. Li. Image similarity search with compact data structures. In CIKM, 2004.
[29] S.-Y. Park, E. Seo, J.-Y. Shin, S. Maeng, and J. Lee. Exploiting internal parallelism of flash-based SSDs. Computer Architecture Letters, 9(1), 2010.
[30] S. Quinlan and S. Dorward. Venti: A new approach to archival storage. In FAST, 2002.
[31] H. Roh, S. Park, S. Kim, M. Shin, and S.-W. Lee. B+-tree index optimization by exploiting internal parallelism of flash-based solid state drives. PVLDB, 5, 2011.
[32] S. Nath and A. Kansal. FlashDB: Dynamic self-tuning database for NAND flash. In ACM/IEEE IPSN, 2007.
[33] S. Nath and P. B. Gibbons. Online maintenance of very large random samples on flash storage. In VLDB, 2008.
[34] U. Erlingsson, M. Manasse, and F. McSherry. A cool and practical alternative to traditional hash tables. In Workshop on Distributed Data and Structures (WDAS), 2006.
[35] B. Zhu, K. Li, and H. Patterson. Avoiding the disk bottleneck in the Data Domain deduplication file system. In FAST, 2008.
