SLM-DB: Single Level Merge Key-Value Store with Persistent Memory
Olzhas Kaiyrakhmet, Songyi Lee, Beomseok Nam, Sam H. Noh, Young-ri Choi
Outline
• Background
• Contributions
• Architecture
• Evaluation
• Conclusion
FAST 2019 2
Key-Value Databases
Key          Value
"100"        {[Green, Word, Gates]}
"html_doc"   <html><head>…</body></html>
"linux_logo" (image)
Log-Structured Merge (LSM) Tree
• Optimized for write-heavy workloads
• Designed for slow hard drives
• An in-memory buffer component C0 is merged into on-disk components C1 … CK
• All components are sorted
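The merge between sorted components can be illustrated with a plain two-way merge of sorted runs (a minimal sketch, not LevelDB or SLM-DB code; in a real store, entries from the newer component would win on duplicate keys):

```c
#include <stddef.h>

/* Two-way merge of sorted runs: the core operation an LSM-tree
 * performs when it pushes component C(i) into component C(i+1). */
void merge_runs(const int *a, size_t na,
                const int *b, size_t nb, int *out) {
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb)                 /* take the smaller head */
        out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na) out[k++] = a[i++];        /* drain leftovers       */
    while (j < nb) out[k++] = b[j++];
}
```

Because both inputs are sorted, the output is produced with one sequential pass, which is why LSM-trees turn random updates into sequential disk writes.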
LSM-tree: disadvantages
• A Get(key) may have to search the in-memory component C0 and then every on-disk component C1 … CK in turn
• Large overhead to locate the needed data
• High disk write amplification, since data is rewritten at every merge
State-of-the-art LSM-tree: LevelDB
• MemTable: an in-memory skiplist that buffers updates; marked Immutable when it becomes full
• WAL: Write-Ahead Log for the MemTable (no fsync)
• Flush: the Immutable MemTable is written sequentially to disk as Level-0 Sorted String Tables (SSTs)
• Levels 0, 1, 2, …: each level is 10x larger than the previous
• Compaction: merges SSTs from Level N into Level N+1
• MANIFEST: stores the file organization and metadata
LSM-tree optimizations
• Improve parallelism: RocksDB (Facebook), HyperLevelDB
• Reduce write amplification: PebblesDB (SOSP '17)
• Optimize for hardware (SSD): VT-tree (FAST '13), WiscKey (FAST '16)
New era: Persistent Memory
• Byte-addressable persistent storage
• Speed falls between DRAM (fast) and disk (slow)
Simple approach
Move the in-memory buffer component C0 from DRAM into persistent memory; the on-disk components C1 … CK and their merges stay as before.
Our approach
• C0 kept in persistent memory
• A single disk component C1 that does self-merging
• A B+-tree in persistent memory to manage the data stored on disk
Single-Level Merge DB (SLM-DB)
• MemTable / Immutable MemTable live in persistent memory, similar to LevelDB but with no WAL
• Flush writes the Immutable MemTable to a single level (Level 0) of SST files on disk
• A global B+-tree in persistent memory indexes every key stored on disk
• Compaction selects candidate files and merges them together
• MANIFEST stores the file organization and metadata, as in LevelDB
Contributions
• Persistent MemTable: no Write-Ahead Logging (WAL); stronger consistency than LevelDB
• Persistent B+-tree index: a per-key index for fast search; no multi-leveled merge structure
• Selective compaction, driven by three selection schemes: the live-key ratio of a Sorted-String Table, leaf node scans in the B+-tree, and the degree of sequentiality per range query
Persistent MemTable
• A skiplist kept in persistent memory; recoverable after failure
• Consistency is guaranteed for the bottom-level linked list; the upper skiplist levels carry no consistency guarantee and can be rebuilt
• Inserting key 4 into the list 0 1 2 3 5 6 7 8 9:
  (1) Create the new node
  (2) Assign its next pointer and clflush()
  (3) Atomically change the predecessor's next pointer
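The three insert steps can be sketched as follows (a minimal illustration; `pm_clflush` is a hypothetical stand-in for the real clflush instruction, and the skiplist is reduced to its bottom-level linked list):

```c
#include <stdlib.h>

/* Bottom-level linked list of the persistent MemTable (simplified). */
typedef struct node { int key; struct node *next; } node;

/* Hypothetical stand-in: on real PM this would issue clflush to push
 * the cache line out to persistence. */
static void pm_clflush(const void *addr) { (void)addr; }

/* Insert `key` immediately after `pred`, following the slide's steps. */
void pm_insert_after(node *pred, int key) {
    node *n = malloc(sizeof *n);   /* (1) create the node               */
    n->key = key;
    n->next = pred->next;          /* (2) assign its next pointer ...   */
    pm_clflush(n);                 /*     ... and flush the new node    */
    pred->next = n;                /* (3) atomically change the 8-byte  */
    pm_clflush(&pred->next);       /*     next pointer, then flush it   */
}
```

Because step (3) is a single atomic pointer store, a crash at any point leaves the list either without the new node or with it fully linked, so the structure is always recoverable.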
Flush
• Flushing the Immutable MemTable to disk pipelines two threads: the file-creation thread creates the SST file, the B+-tree insertion thread then inserts each key's index, and finally the result is saved to MANIFEST
• The persistent index is a FAST-FAIR B+-tree (Hwang et al., FAST '18)
Why do we need compaction?
• Example: File#0 holds keys {1, 10, 17}, File#1 {11, 13, 19}, File#2 {6, 14, 35}
• Flushing new File#3 {1, 11, 14} and File#4 {12, 17, 35} makes the older copies of keys 1, 11, 14, 17, and 35 obsolete
• Obsolete KV pairs accumulate on disk, so we need garbage collection (GC)
Why else?
• Example: RangeQuery(5, 12) over File#0 {1, 10, 17}, File#1 {11, 13, 19}, File#2 {6, 14, 35}, File#3 {14, 21, 32}, File#4 {2, 8, 17} must read from four of the five files
• We also need to improve the sequentiality of key-values on disk
Selective compaction
• Selectively pick SSTable files and mark them as compaction candidates
• Merge together the most-overlapping compaction candidates
• Selection schemes for compaction candidates:
  o Live-key ratio of an SSTable (for GC)
  o Leaf node scans in the B+-tree (for sequentiality) [see paper]
  o Degree of sequentiality per range query (for sequentiality) [see paper]
Live-key ratio selection
• Goal: garbage collection. Each SST file tracks the ratio of live (valid) keys to total keys; if the ratio falls below a threshold (50% in this example), the file is added to the compaction candidates
• Example: File 1 {1, 3, 5}, File 2 {1, 2, 4}, and File 3 {2, 6, 7} each start at a 66.6% live-key ratio
• Flushing a new File 4 {1, 2, 4} (ratio 100%) makes the older copies of keys 1, 2, and 4 obsolete: File 2's ratio drops to 0% and File 3's to 33.3%
• File 2 (0% live) can simply be deleted; File 3 (33.3% < 50%) is added to the compaction candidates
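The live-key ratio bookkeeping can be sketched like this (the names and per-file counters are illustrative; SLM-DB maintains these counts as part of its index updates):

```c
#define RATIO_THRESHOLD 0.5   /* 50%, matching the slide's example */

/* Per-SST-file statistics (hypothetical layout). */
typedef struct {
    int total_keys;   /* keys written into this file              */
    int live_keys;    /* keys whose newest version is still here  */
} sst_stats;

/* Called when a newer file overwrites a key stored in `f`. */
void invalidate_key(sst_stats *f) {
    if (f->live_keys > 0) f->live_keys--;
}

/* Live-key ratio selection: below the threshold => candidate. */
int is_compaction_candidate(const sst_stats *f) {
    return (double)f->live_keys < RATIO_THRESHOLD * (double)f->total_keys;
}
```

With File 2 from the slide (3 keys, 2 live = 66.6%), the file is not yet a candidate; once more of its keys are overwritten it drops below 50% and gets selected.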
Compaction
• Triggered when there are too many compaction candidate files
• Pick the most-overlapping candidate files and merge them
• The merge is pipelined like a flush: the file-creation thread creates each new file (File#7, then File#8), the B+-tree insertion thread inserts the new files' key indexes, and the result is saved to MANIFEST
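One simple way to quantify "most overlapping" is the key-range overlap between candidate files (an illustrative heuristic; the paper's exact overlap metric may differ):

```c
/* Overlap length of two key ranges [a_lo, a_hi] and [b_lo, b_hi];
 * zero when the ranges are disjoint. Candidate files whose ranges
 * overlap the most are merged together first. */
long range_overlap(long a_lo, long a_hi, long b_lo, long b_hi) {
    long lo = a_lo > b_lo ? a_lo : b_lo;   /* later start  */
    long hi = a_hi < b_hi ? a_hi : b_hi;   /* earlier end  */
    return hi > lo ? hi - lo : 0;
}
```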
General operations
• Put
• Put if exists / Put if not exists
• Get
• Scan
Put(key, value)
• The client writes the KV pair directly into the MemTable in persistent memory
• Because the MemTable is already persistent, no WAL write is needed
Put(key, value) if exists / if not exists
• Before performing the Put, verify that the condition holds: query the MemTable, then the Immutable MemTable, then the B+-tree index
• If the condition is fulfilled, write the KV pair into the MemTable
Get(key)
• Query the MemTable, then the Immutable MemTable
• If the key is not found there, query the B+-tree index; if the key exists, read the value from the SST file the index entry points to and return it to the client
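The Get lookup order can be sketched with toy tables standing in for the MemTable, the Immutable MemTable, and the B+-tree index (all names are illustrative, not SLM-DB's API):

```c
#include <stddef.h>

typedef struct { int key; const char *value; } entry;

/* Linear probe of a toy table; the real store searches a persistent
 * skiplist or the FAST-FAIR B+-tree instead. */
static const char *lookup(const entry *t, size_t n, int key) {
    for (size_t i = 0; i < n; i++)
        if (t[i].key == key) return t[i].value;
    return NULL;
}

/* Get(key): newest data first, falling back to the on-disk index. */
const char *get(const entry *mem, size_t nmem,
                const entry *imm, size_t nimm,
                const entry *idx, size_t nidx, int key) {
    const char *v;
    if ((v = lookup(mem, nmem, key))) return v;   /* MemTable           */
    if ((v = lookup(imm, nimm, key))) return v;   /* Immutable MemTable */
    return lookup(idx, nidx, key);                /* B+-tree -> SST     */
}
```

Checking the MemTable first matters for correctness: a stale copy of the same key may still sit in the Immutable MemTable or on disk.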
Scan(keyi, keyj)
• Create an iterator over the B+-tree, positioned at keyi
• Advance through the sorted leaf entries (Ki, Ki+1, Ki+2, …), reading each value from the SST file its index entry points to, until keyj is reached
Evaluation
• Intel Xeon E5-2640 v3; DRAM: 4GB; emulated PM: 7GB; Intel SSD DC S3520
• Ubuntu 18.04, kernel 4.15; DB size: 8GB/20GB; MemTable: 64MB
• PM write latency 500ns (5x the DRAM write latency); PM read latency & bandwidth same as DRAM's
• Intel's PMDK used to control the PM pool
db_bench microbenchmark
[Plots: random write, random read, and range query throughput; range size = 100. Highlights: steady performance increase for random writes, with the overhead amortized at large value sizes; low file-locating overhead for random reads; low sequentiality for range queries]
• ~2.56x less disk write amplification
• At most 700MB of PM used
PM sensitivity (db_bench)
• PM write latency sensitivity (random write benchmark): emulated by inserting a CPU pause after clflush()
• PM bandwidth sensitivity: emulated using thermal throttling
YCSB
[Workload mixes: 100% I; 50% R / 50% U; 95% R / 5% U; 95% R / 5% U; 100% I; 95% LR / 5% U; 95% S / 5% U; 50% R / 50% RMW]
• Better write performance; very fast on update operations
• Only the 1KB value case is slower
• On average, SLM-DB wins on every workload
• Up to 7.7x less disk write amplification
Conclusion
• A novel key-value store design built on persistent memory
• High write/read performance compared to LevelDB
• Comparable scan performance
• Low write amplification
• Near-optimal read amplification
Thanks! Questions?
db_bench microbenchmark (20GB)
[Plots: random write, random read, and range query throughput]
Effect of persistent MemTable
[Plots: random write performance and total disk write]
B+-tree leaf node scan
• Goal: increase the sequentiality of key-values. Leaf nodes of the B+-tree (here holding keys 1-16) are scanned in round-robin fashion
• If the number of unique files accessed by a scanned range of keys is above a threshold (here 2), the accessed files are added to the compaction candidates
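The selection test from the leaf-node scan can be sketched as counting distinct files over a scanned key range (a toy bound on file ids; the names are hypothetical):

```c
#include <stdbool.h>
#include <stddef.h>

#define SCAN_THRESHOLD 2   /* matching the slide's example */
#define MAX_FILES 64       /* toy bound on file ids        */

/* Count how many distinct SST files hold the keys of a scanned
 * leaf-node range, given each key's file id. */
int unique_file_accesses(const int *file_of_key, size_t nkeys) {
    bool seen[MAX_FILES] = {false};
    int count = 0;
    for (size_t i = 0; i < nkeys; i++)
        if (!seen[file_of_key[i]]) {
            seen[file_of_key[i]] = true;
            count++;
        }
    return count;
}

/* Above the threshold => add the range's files to the candidates. */
bool should_add_to_candidates(const int *file_of_key, size_t nkeys) {
    return unique_file_accesses(file_of_key, nkeys) > SCAN_THRESHOLD;
}
```

A range whose keys are scattered over many files is exactly the case where a future scan would pay many random reads, so merging those files improves sequentiality.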
Degree of sequentiality per range query
• Goal: increase the sequentiality of key-values along the ranges that queries actually touch
• During a range query, e.g. RangeQuery(7, 14), the accessed range is split into subranges; if the maximum number of unique file accesses within a subrange is above a threshold (here 2), the files are added to the compaction candidates