Cache Craftiness for Fast Multicore Key-Value Storage


Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)

Let’s build a fast key-value store

• KV store systems are important
  – Google Bigtable, Amazon Dynamo, Yahoo! PNUTS

• Single-server KV performance matters
  – Reduce cost
  – Easier management

• Goal: fast KV store for a single multi-core server
  – Assume all data fits in memory
  – Redis, VoltDB

Feature wish list

• Clients send queries over network

• Persist data across crashes

• Range query

• Perform well on various workloads
  – Including hard ones!

Hard workloads

• Skewed key popularity
  – Hard! (load imbalance)

• Small key-value pairs
  – Hard!

• Many puts
  – Hard!

• Arbitrary keys
  – String (e.g. www.wikipedia.org/...) or integer
  – Hard!

First try: fast binary tree

[Chart: Throughput (req/sec, millions); 140M short KV, put-only, @16 cores]

• Network/disk not bottlenecks
  – High-BW NIC
  – Multiple disks

• 3.7 million queries/second!

• Can we do better?
  – What bottleneck remains? DRAM!

Cache craftiness goes 1.5X farther

[Chart: Binary tree vs. Masstree throughput (req/sec, millions); 140M short KV, put-only, @16 cores]

Cache-craftiness: careful use of cache and memory

Contributions

• Masstree achieves millions of queries per second across various hard workloads
  – Skewed key popularity
  – Various read/write ratios
  – Variable, relatively long keys
  – Data >> on-chip cache

• New ideas
  – Trie of B+trees, permuter, etc.

• Full system
  – New ideas + best practices (network, disk, etc.)

Experiment environment

• A 16-core server
  – Three active DRAM nodes

• Single 10Gb Network Interface Card (NIC)

• Four SSDs

• 64 GB DRAM

• A cluster of load generators

Potential bottlenecks in Masstree

[Diagram: clients reach a single multi-core server over the network; data lives in DRAM, logs are written to disk]

NIC bottleneck can be avoided

• Single 10Gb NIC
  – Multiple queues, scales to many cores
  – Target: 100B KV pairs => 10M req/sec

• Use the network stack efficiently
  – Pipeline requests
  – Avoid copying costs

Disk bottleneck can be avoided

• 10M puts/sec => 1GB of logs/sec!
  – Too much for a single disk

• Multiple disks: split the log
  – See paper for details

                       Write throughput     Cost
Mainstream disk        100-300 MB/sec       1 $/GB
High-performance SSD   up to 4.4 GB/sec     > 40 $/GB

DRAM bottleneck – hard to avoid

[Chart: Binary tree vs. Masstree throughput (req/sec, millions); 140M short KV, put-only, @16 cores]

Cache-craftiness goes 1.5X farther, including the cost of:
• Network
• Disk

DRAM bottleneck – w/o network/disk

[Chart: Binary, 4-tree, B+tree, +Prefetch, +Permuter, Masstree throughput (req/sec, millions); 140M short KV, put-only, @16 cores]

Cache-craftiness goes 1.7X farther!

DRAM latency – binary tree

[Chart: Binary tree throughput (req/sec, millions); 140M short KV, put-only, @16 cores]

[Diagram: binary tree with nodes B (children A, C) and Y (children X, Z); each level costs one DRAM fetch]

10M keys => ~log2(10M) ≈ 23 serial DRAM latencies per lookup!
2.7 us/lookup => 380K lookups/core/sec

DRAM latency – Lock-free 4-way tree

• Concurrency: same as binary tree
• One cache line per node => 3 KV / 4 children

[Diagram: one 4-tree node holding keys X, Y, Z with four children A, B, ...]

• ½ the levels of a binary tree => ½ the serial DRAM latencies
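A minimal sketch (my own, not the paper's code) of how a 4-way node could be packed into a single 64-byte cache line, assuming 8-byte integer keys; where the three values live (for example behind the child pointers at the leaves) is deliberately left out:

```cpp
#include <cstdint>

// One 4-tree node packed into one cache line: 3 keys + 4 child pointers
// + a key count, so a lookup touches exactly one line per level.
struct alignas(64) Node4 {
    uint64_t keys[3];       // up to 3 keys, kept sorted
    Node4*   child[4];      // subtrees for (<k0), [k0,k1), [k1,k2), (>=k2)
    uint8_t  nkeys;         // number of keys currently stored
    uint8_t  pad[7];        // pad to exactly 64 bytes
};
static_assert(sizeof(Node4) == 64, "one 4-tree node must fit in a single cache line");

// Descend one level. A full lookup first checks this node's keys for an
// exact match, then follows the child pointer for the key's range.
inline const Node4* descend(const Node4* n, uint64_t key) {
    int i = 0;
    while (i < n->nkeys && key > n->keys[i]) ++i;
    return n->child[i];
}
```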

4-tree beats binary tree by 40%

[Chart: Binary, 4-tree, B+tree, +Prefetch, +Permuter, Masstree throughput (req/sec, millions); 140M short KV, put-only, @16 cores]

4-tree may perform terribly!

• Unbalanced: serial DRAM latencies
  – e.g. sequential inserts

• Want balanced tree w/ wide fanout

[Diagram: sequential inserts produce a degenerate chain of nodes (A B C) -> (D E F) -> (G H I) -> ... : O(N) levels!]

B+tree – Wide and balanced

• Balanced!

• Concurrent main-memory B+tree [OLFIT]
  – Optimistic concurrency control: version technique
  – Lookup/scan is lock-free
  – Puts hold ≤ 3 per-node locks
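A hedged sketch of the version technique described above (illustrative only, not OLFIT's or Masstree's actual code): writers mark the node dirty and bump a version counter, while readers snapshot the version, read the node, and retry if the version changed underneath them:

```cpp
#include <atomic>
#include <cstdint>

// Versioned node for optimistic, lock-free reads.  Writers set the low
// "dirty" bit while modifying the node and increment the counter when done.
struct VersionedNode {
    std::atomic<uint64_t> version{0};   // bit 0 = dirty/locked, rest = counter
    // ... keys, values, child pointers ...

    // Reader side: wait until no writer holds the node, return that version.
    uint64_t stable_version() const {
        uint64_t v;
        do {
            v = version.load(std::memory_order_acquire);
        } while (v & 1);                // spin while a writer is in progress
        return v;
    }
    bool validate(uint64_t v) const {   // true if no writer intervened
        return version.load(std::memory_order_acquire) == v;
    }
};

// Lock-free read of a node: re-run the read if validation fails.
template <typename ReadFn>
auto optimistic_read(const VersionedNode& n, ReadFn read) {
    for (;;) {
        uint64_t v = n.stable_version();
        auto result = read(n);          // read keys/values out of the node
        if (n.validate(v)) return result;
    }
}
```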

Wide fanout B+tree is 11% slower!

[Chart: Binary, 4-tree, B+tree, +Prefetch, +Permuter, Masstree throughput (req/sec, millions); 140M short KV, put-only]

Fanout = 15: fewer levels than the 4-tree, but
• # cache lines fetched from DRAM >= 4-tree
  – 4-tree: each internal node is full
  – B+tree: nodes are ~75% full
• Serial DRAM latencies >= 4-tree

B+tree – Software prefetch

• Same as [pB+-trees]

• Masstree: B+tree w/ fanout 15 => each node spans 4 cache lines
• Always prefetch the whole node when it is accessed
• Result: one DRAM latency per node instead of 2, 3, or 4

[Diagram: with prefetch, fetching 4 cache lines takes about as long as fetching 1]
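One way the whole-node prefetch might look in code, as a sketch under the assumptions above (64-byte cache lines, a node spanning 4 of them; BPlusNode and prefetch_node are illustrative names, and __builtin_prefetch is the GCC/Clang intrinsic):

```cpp
#include <cstdint>

constexpr int kCacheLine = 64;
constexpr int kNodeLines = 4;        // a fanout-15 node spans 4 cache lines

// Hypothetical node layout, sized so the whole node is exactly 4 lines.
struct alignas(kCacheLine) BPlusNode {
    uint64_t keys[15];               // up to 15 8-byte key slices
    void*    children[16];           // 16 child pointers
    uint64_t meta;                   // version / key count packed into one word
};
static_assert(sizeof(BPlusNode) == kNodeLines * kCacheLine,
              "a fanout-15 node should span exactly 4 cache lines");

// Issue prefetches for every cache line of the node as soon as its pointer
// is known, so the 4 line fills overlap instead of arriving one by one.
inline void prefetch_node(const BPlusNode* n) {
    const char* p = reinterpret_cast<const char*>(n);
    for (int i = 0; i < kNodeLines; ++i)
        __builtin_prefetch(p + i * kCacheLine, /*rw=*/0, /*locality=*/3);
}
```

Calling prefetch_node right after loading a child pointer, before examining its keys, is what collapses the 2-4 serial line fetches into roughly one DRAM latency per level.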

B+tree with prefetch

[Chart: Binary, 4-tree, B+tree, +Prefetch, +Permuter, Masstree throughput (req/sec, millions); 140M short KV, put-only, @16 cores]

Beats the 4-tree by 9%. Balanced beats unbalanced!

Concurrent B+tree problem

• Lookups retry in case of a concurrent insert

• Lock-free 4-tree: not a problem
  – Keys do not move around
  – But unbalanced

[Diagram: insert(B) into a node holding A C D shifts keys to make room for B, exposing an intermediate state to concurrent lookups]

B+tree optimization - Permuter

• Keys are stored unsorted; a permuter in each tree node defines their order

• A concurrent lookup does not need to retry
  – Lookup uses the permuter to search keys
  – Insert appears atomic to lookups

[Diagram: insert(B) writes B into a free slot after A C D; the permuter changes from (0 1 2) to (0 3 1 2)]

Permuter: a 64-bit integer encoding the key order within a node
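A sketch of how a 64-bit permuter could be encoded (my own illustration; the exact bit layout is an assumption): 15 four-bit slot indices plus a 4-bit count fit in one word, so an insert builds a new permuter and publishes it with a single store:

```cpp
#include <cstdint>

// Permuter sketch: bits 0-3 hold the key count; the i-th 4-bit slot after
// that holds the physical index of the i-th key in sorted order.
struct Permuter {
    uint64_t x;

    int size() const { return x & 15; }
    int slot(int i) const { return (x >> (4 * (i + 1))) & 15; }

    // A key was written into physical slot `phys`; make it appear at logical
    // position `pos` by shifting the later slot indices up one nibble.
    Permuter inserted(int pos, int phys) const {
        uint64_t lo_mask = (uint64_t(1) << (4 * (pos + 1))) - 1;
        uint64_t lo  = x & lo_mask;             // count + slots before pos
        uint64_t hi  = (x & ~lo_mask) << 4;     // slots at/after pos, shifted up
        uint64_t mid = uint64_t(phys) << (4 * (pos + 1));
        return Permuter{ (lo + 1) | mid | hi }; // +1 bumps the count
    }
};
```

For the slide's example, a permuter encoding order (0 1 2) updated with inserted(1, 3) yields (0 3 1 2): key B, written into physical slot 3, now appears logically between A and C, and concurrent lookups see either the old or the new permuter, never a half-shifted node.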

B+tree with permuter

[Chart: Binary, 4-tree, B+tree, +Prefetch, +Permuter, Masstree throughput (req/sec, millions); 140M short KV, put-only, @16 cores]

Improves throughput by 4%

Performance drops dramatically when key length increases

[Chart: Throughput (req/sec, millions) vs. key length (8-48 bytes, keys differ in the last 8B); short values, 50% updates, @16 cores, no logging]

Why? The B+tree stores key suffixes indirectly, so each key comparison:
• compares the full key
• incurs an extra DRAM fetch

[Diagram: a trie of B+trees; the top layer is indexed by key bytes k[0:7], the next by k[8:15], then k[16:23], ...]

Masstree – Trie of B+trees

• Trie: a tree where each level is indexed by a fixed-length key fragment

• Masstree: a trie with fanout 2^64, but each trie node is a B+tree

• Compress key prefixes!
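An illustrative sketch of lookup over a trie of B+trees (names such as Layer, layer_get, and masstree_like_get are mine, and the toy std::map layer merely stands in for a real B+tree): each layer consumes one 8-byte slice of the key, and a slice shared by several keys leads into a deeper layer:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <map>
#include <string>

struct Layer;                           // one trie layer (a B+tree over 8B slices)
struct Value { std::string data; };     // application value

struct SliceResult {
    bool   found = false;
    Layer* next  = nullptr;             // non-null: several keys share this slice, descend
    Value* value = nullptr;             // non-null: the key terminates in this layer
};

struct Layer {
    std::map<uint64_t, SliceResult> slices;  // toy stand-in for one B+tree layer
};

// Assumed helper: search one layer for an 8-byte slice.
SliceResult layer_get(const Layer* layer, uint64_t slice) {
    auto it = layer->slices.find(slice);
    return it == layer->slices.end() ? SliceResult{} : it->second;
}

// Next 8 bytes of the key, big-endian and zero-padded, so that integer
// comparison of slices matches lexicographic comparison of keys.
uint64_t key_slice(const std::string& key, size_t off) {
    uint8_t buf[8] = {0};
    if (off < key.size()) {
        size_t n = std::min<size_t>(8, key.size() - off);
        std::memcpy(buf, key.data() + off, n);
    }
    uint64_t s = 0;
    for (int i = 0; i < 8; ++i) s = (s << 8) | buf[i];
    return s;
}

Value* masstree_like_get(const Layer* root, const std::string& key) {
    const Layer* layer = root;
    for (size_t off = 0; ; off += 8) {
        SliceResult r = layer_get(layer, key_slice(key, off));
        if (!r.found) return nullptr;
        if (r.next == nullptr) return r.value;  // key ends within this layer
        layer = r.next;                         // shared 8B slice: go one layer deeper
    }
}
```

This glosses over details the real structure handles (e.g. keys that end exactly at a slice boundary, and suffixes stored inside border nodes), but it shows why only 8-byte comparisons are ever needed inside a layer.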

Case study: keys share a P-byte prefix – better than a single B+tree

• ~P/8 trie levels cover the prefix, each with only one node
• Plus a single B+tree over the remaining 8B key slices

               Complexity     DRAM accesses
Masstree       O(log N)       O(log N)
Single B+tree  O(P log N)     O(P log N)

Masstree performs better for long keys with prefixes

[Chart: Masstree vs. B+tree throughput (req/sec, millions) vs. key length; short values, 50% updates, @16 cores, no logging. Masstree does 8B key comparisons vs. full key comparisons in the B+tree]

Does trie of B+trees hurt short key performance?

[Chart: Binary, 4-tree, B+tree, +Prefetch, +Permuter, Masstree throughput (req/sec, millions); 140M short KV, put-only, @16 cores]

8% faster! More efficient code – internal nodes handle 8B keys only

Evaluation

• How does Masstree compare to other systems?
• How does Masstree compare to partitioned trees?
  – How much do we pay for handling skewed workloads?
• How does Masstree compare with a hash table?
  – How much do we pay for supporting range queries?
• Does Masstree scale on many cores?

Masstree performs well even with persistence and range queries

[Chart: Throughput (req/sec, millions) for MongoDB, VoltDB, Redis, Memcached, and Masstree; 20M short KV, uniform dist., read-only, @16 cores, w/ network. The two lowest bars, MongoDB and VoltDB, reach 0.04 and 0.22]

Unfair: both (MongoDB and VoltDB) have a richer data and query model

Memcached: not persistent and no range queries

Redis: no range queries

Multi-core – Partition among cores?

• Multiple instances, one unique set of keys per instance
  – Memcached, Redis, VoltDB

• Masstree: a single shared tree
  – Each core can access all keys
  – Reduced imbalance

[Diagram: partitioned trees (one per core) vs. a single shared tree]

A single Masstree performs better for skewed workloads

[Chart: Throughput (req/sec, millions) of a single Masstree vs. 16 partitioned Masstrees as the skew parameter δ varies from 0 to 9; 140M short KV, read-only, @16 cores, w/ network]

One partition receives δ times more queries

Partitioned trees: no remote DRAM access, no concurrency control

Partitioned: 80% idle time (1 partition: 40%, 15 partitions: 4%)

Cost of supporting range queries

• Without range queries one could use a hash table (a sketch follows this list)
  – No resize cost: pre-allocate a large hash table
  – Lock-free: updates with cmpxchg
  – Only supports 8B keys: efficient code
  – 30% full, each lookup = 1.1 hash probes

• Measured in the Masstree framework
  – 2.5X the throughput of Masstree

• Range query support thus costs 2.5X in performance
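A hedged sketch of such a comparison hash table (my own code, not the authors' benchmark harness): pre-allocated and fixed-size, 8-byte keys, lock-free updates via compare-exchange, linear probing, no resize or delete:

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// Key 0 is reserved as "empty".  Size the table so it stays ~30% full.
class FixedHash {
public:
    explicit FixedHash(size_t capacity) : slots_(capacity) {}

    bool put(uint64_t key, uint64_t value) {
        for (size_t i = hash(key), n = 0; n < slots_.size();
             i = (i + 1) % slots_.size(), ++n) {
            uint64_t k = slots_[i].key.load(std::memory_order_acquire);
            if (k == key) {                      // key present: overwrite the value
                slots_[i].value.store(value, std::memory_order_release);
                return true;
            }
            if (k == 0) {                        // empty slot: claim it atomically
                uint64_t expected = 0;
                if (slots_[i].key.compare_exchange_strong(
                        expected, key, std::memory_order_acq_rel) ||
                    expected == key) {
                    // Note: a concurrent reader may briefly see value 0 here.
                    slots_[i].value.store(value, std::memory_order_release);
                    return true;
                }                                // lost the race: keep probing
            }
        }
        return false;                            // table full (avoided by sizing)
    }

    bool get(uint64_t key, uint64_t& out) const {
        for (size_t i = hash(key), n = 0; n < slots_.size();
             i = (i + 1) % slots_.size(), ++n) {
            uint64_t k = slots_[i].key.load(std::memory_order_acquire);
            if (k == key) { out = slots_[i].value.load(std::memory_order_acquire); return true; }
            if (k == 0) return false;            // empty slot: key not present
        }
        return false;
    }

private:
    struct Slot {
        std::atomic<uint64_t> key{0};
        std::atomic<uint64_t> value{0};
    };
    size_t hash(uint64_t key) const { return (key * 0x9E3779B97F4A7C15ULL) % slots_.size(); }
    std::vector<Slot> slots_;
};
```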

Scale to 12X on 16 cores

[Chart: per-core throughput (req/sec/core) vs. number of cores (1, 2, 4, 8, 16) for Get, with a perfect-scalability reference; short KV, w/o logging]

• Scales to 12X at 16 cores
• Put scales similarly
• Limited by the shared memory system

Related work

• [OLFIT]: optimistic concurrency control
• [pB+-trees]: B+tree with software prefetch
• [pkB-tree]: stores a fixed # of differentiating bits inline
• [PALM]: lock-free B+tree, 2.3X the throughput of [OLFIT]

• Masstree: the first system to combine these techniques, with new optimizations
  – Trie of B+trees, permuter

Summary

• Masstree: a general-purpose high-performance persistent KV store

• 5.8 million puts/sec, 8 million gets/sec
  – More comparisons with other systems in the paper

• Using cache-craftiness improves performance by 1.5X

Thank you!