PebblesDB: Building Key-Value Stores using Fragmented Log-Structured Merge Trees
Pandian Raju (The University of Texas at Austin), Rohan Kadekodi (The University of Texas at Austin), Vijay Chidambaram (The University of Texas at Austin / VMware Research), Ittai Abraham (VMware Research)
What is a key-value store?
• Store any arbitrary value for a given key
• Insertions: put(key, value)
• Point lookups: get(key)
• Range queries: get_range(key1, key2)
• Example: key 123 -> {"name": "John Doe", "age": 25}, key 124 -> {"name": "Ross Gel", "age": 28}
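To make the API above concrete, here is a minimal in-memory sketch of the three operations in Python (illustrative only; the class name SimpleKVStore and its internals are hypothetical, not the API of any store discussed here):

import bisect

class SimpleKVStore:
    """Toy in-memory key-value store illustrating put/get/get_range."""
    def __init__(self):
        self.data = {}           # key -> value
        self.sorted_keys = []    # keys kept sorted for range queries

    def put(self, key, value):
        if key not in self.data:
            bisect.insort(self.sorted_keys, key)
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)

    def get_range(self, key1, key2):
        lo = bisect.bisect_left(self.sorted_keys, key1)
        hi = bisect.bisect_right(self.sorted_keys, key2)
        return [(k, self.data[k]) for k in self.sorted_keys[lo:hi]]

store = SimpleKVStore()
store.put(123, {"name": "John Doe", "age": 25})
store.put(124, {"name": "Ross Gel", "age": 28})
print(store.get(123))             # {'name': 'John Doe', 'age': 25}
print(store.get_range(123, 124))  # both entries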
Key-Value Stores - widely used
• Google's BigTable powers Search, Analytics, Maps and Gmail
• Facebook's RocksDB is used as the storage engine in the production systems of many companies
Write-optimized data structures
• Log Structured Merge Tree (LSM) is a write-optimized data structure used in key-value stores
• Provides high write throughput with good read throughput, but suffers high write amplification
• Write amplification - ratio of the amount of write IO to the amount of user data
• Example: a client writes 10 GB of user data to the KV store; if the total write I/O is 200 GB, the write amplification is 20
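The arithmetic from the example, as a tiny sketch:

user_data_gb = 10          # data the client actually wrote
total_write_io_gb = 200    # total bytes the KV store wrote to the device
print(total_write_io_gb / user_data_gb)  # write amplification = 20.0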
Write amplification in LSM based KV stores
• Inserted 500M key-value pairs
• Key: 16 bytes, Value: 128 bytes
• Total user data: ~45 GB
• [Chart: Write IO (GB)] RocksDB: 1868 GB (42x), LevelDB: 1222 GB (27x), PebblesDB: 756 GB (17x), user data: ~45 GB
Why is write amplification bad?
• Reduces the write throughput
• Flash devices wear out after a limited number of write cycles (an Intel SSD DC P4600 can last ~5 years assuming ~5 TB of writes per day)
• Writing ~500 GB of user data per day with RocksDB would wear out such an SSD in about 1.25 years
Data source: https://www.intel.com/content/www/us/en/products/memory-storage/solid-state-drives/data-center-ssds/dc-p4600-series/dc-p4600-1-6tb-2-5inch-3d1.html
PebblesDB
• High performance write-optimized key-value store
• Built using a new data structure, the Fragmented Log-Structured Merge Tree (FLSM)
• Achieves 3-6.7x higher write throughput and 2.4-3x lower write amplification compared to RocksDB
• Achieves the highest write throughput and the least write amplification when used as a backend store for MongoDB
Outline
• Log-Structured Merge Tree (LSM)
• Fragmented Log-Structured Merge Tree (FLSM)
• Building PebblesDB using FLSM
• Evaluation
• Conclusion
Log Structured Merge Tree (LSM)
• Data is stored both in memory and on storage
• Writes go directly to the in-memory structure: write(key, value)
• In-memory data is periodically written as files to storage (sequential I/O)
• Files on storage are logically arranged in different levels (level 0, level 1, ..., level n)
• Compaction pushes data to higher numbered levels
• Within a level, files are sorted and have non-overlapping key ranges (e.g. 1..12, 15..19, 25..75, 79..99), so a key can be located using binary search
• Level 0 can have files with overlapping (but internally sorted) key ranges (e.g. 2..57, 23..78); there is a limit on the number of level 0 files
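A small sketch of the binary search implied by the last two bullets: in levels 1..n, at most one file can contain a given key (the helper below is hypothetical, not LevelDB code):

import bisect

def find_candidate_file(files, key):
    """files: list of (min_key, max_key) sorted by min_key, non-overlapping.
    Returns the index of the single file that may contain key, or None."""
    min_keys = [f[0] for f in files]
    i = bisect.bisect_right(min_keys, key) - 1   # last file starting <= key
    if i >= 0 and files[i][0] <= key <= files[i][1]:
        return i
    return None

level1 = [(1, 12), (15, 19), (25, 75), (79, 99)]
print(find_candidate_file(level1, 30))  # 2   (file 25..75)
print(find_candidate_file(level1, 13))  # None (falls in a gap between files)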
Write amplification: Illustration
• Max files in level 0 is configured to be 2; files are immutable, and each level keeps sorted, non-overlapping files
• Level 1 starts with files 1..12, 15..25, 39..62 and 77..95 (level 1 re-write counter: 1); level 0 holds 2..37 and 23..48
• A memtable flush adds 58..68, so level 0 has 3 files (> 2), which triggers a compaction
• The level 0 files overlap the level 1 files covering 1..68; compacting level 0 with level 1 re-writes that data as 1..23, 24..46 and 47..68 (level 1 re-write counter: 2)
• After further writes, new level 0 files 10..33, 17..53 and 1..121 are flushed; compacting them with level 1 re-writes the level again as 1..30, 31..60, 62..90 and 92..121 (level 1 re-write counter: 3)
• The existing data has now been re-written to the same level (1) 3 times
Root cause of write amplification
• Re-writing data to the same level multiple times
• Done to maintain sorted, non-overlapping files in each level
Outline
• Log-Structured Merge Tree (LSM)
• Fragmented Log-Structured Merge Tree (FLSM)
• Building PebblesDB using FLSM
• Evaluation
• Conclusion
Naïve approach to reduce write amplification
• Just append the file to the end of the next level
• Results in many (possibly all) overlapping files within a level, e.g. level i: 1..89, 6..91, 5..65, 9..99, 1..102, 1..27, 18..95 (all files have overlapping key ranges)
• Affects read performance
Partially sorted levels
• Hybrid between all non-overlapping files and all overlapping files
• Inspired by the Skip-List data structure
• Concrete boundaries (guards) group together overlapping files
• Example level i with guards 13, 35, 70: 1..12 | 18..31, 13..34 | 42..65, 45..56, 40..47 | 72..87 (files within the same guard can have overlapping key ranges)
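A sketch of how guards partition a level, using the example above; each file is grouped under the guard region that contains its smallest key (the helper names are illustrative):

import bisect
from collections import defaultdict

def guard_region(guards, key):
    """guards: sorted guard keys. Region 0 holds keys before the first guard,
    region i holds keys from guards[i-1] up to (but not including) guards[i]."""
    return bisect.bisect_right(guards, key)

guards = [13, 35, 70]
files = [(1, 12), (18, 31), (13, 34), (42, 65), (45, 56), (40, 47), (72, 87)]
regions = defaultdict(list)
for f in files:
    regions[guard_region(guards, f[0])].append(f)
print(dict(regions))
# {0: [(1, 12)], 1: [(18, 31), (13, 34)], 2: [(42, 65), (45, 56), (40, 47)], 3: [(72, 87)]}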
Fragmented Log-Structured Merge Tree
• Novel modification of the LSM data structure
• Uses guards to maintain partially sorted levels
• Writes data only once per level in most cases
FLSM structure
• Files are logically grouped within guards
• Guards get more fine grained deeper into the tree
• Example: level 0: 2..37, 23..48; level 1 (guards 15, 70): 1..12 | 15..59 | 77..87, 82..95; level 2 (guards 15, 40, 70, 95): 2..8 | 15..23, 16..32 | 45..65 | 70..90 | 96..99
How does FLSM reduce write amplification?
• Max files in level 0 is configured to be 2; level 0 holds 2..37 and 23..48, and a memtable flush adds 30..68, triggering a compaction
• Compacting level 0: the data is combined, sorted and fragmented along the level 1 guards into 2..14 and 15..68
• The fragmented files are just appended to the next level; no level 1 data is re-written
• Later, guard 15 in level 1 (files 15..59 and 15..68) is to be compacted: the files are combined, sorted and fragmented along the level 2 guards into 15..39 and 40..68
• Again, the fragmented files are just appended to the next level
• FLSM maintains partially sorted levels to efficiently reduce the search space
• FLSM doesn't re-write data to the same level in most cases
• How does FLSM maintain read performance?
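A minimal sketch of the compaction step just described: the files in one guard are combined and sorted, then fragmented along the next level's guards so the fragments can simply be appended there (data layout and helper names are illustrative, not the PebblesDB implementation):

import bisect

def compact_guard(guard_files, next_level_guards):
    """guard_files: list of sorted key lists (the sstables in one guard).
    Returns one fragment (sorted key list) per non-empty region of the next level."""
    merged = sorted(k for f in guard_files for k in f)          # combine and sort
    fragments = [[] for _ in range(len(next_level_guards) + 1)]
    for key in merged:                                          # fragment along guards
        fragments[bisect.bisect_right(next_level_guards, key)].append(key)
    return [f for f in fragments if f]

# Guard 15 in level 1 holds keys spanning 15..68; level 2 has guards 15, 40, 70, 95
level1_guard_15 = [[15, 20, 59, 68], [16, 33, 45]]
print(compact_guard(level1_guard_15, [15, 40, 70, 95]))
# [[15, 16, 20, 33], [45, 59, 68]] -> appended under guards 15 and 40 of level 2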
Selecting Guards
• Guards are chosen randomly and dynamically
• Dependent on the distribution of the data (e.g. over a key space from 1 to 1e+9)
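One plausible way to choose guards "randomly and dynamically" while following the data distribution is to make each inserted key a guard with a small, level-dependent probability derived from a hash of the key, much like skip-list level selection. The sketch below assumes such a scheme and is not necessarily the exact rule PebblesDB uses:

import hashlib

MAX_LEVEL = 6          # assumed number of levels, for illustration

def is_guard(key, level, bits_per_level=2):
    """Deterministically decide whether `key` is a guard for `level`.
    Deeper (higher numbered) levels require fewer matching bits,
    so they end up with more, finer grained guards."""
    h = int(hashlib.sha1(str(key).encode()).hexdigest(), 16)
    required_zero_bits = bits_per_level * (MAX_LEVEL - level)
    return h & ((1 << required_zero_bits) - 1) == 0

keys = range(1, 1_000_000_000, 999_983)      # a sample of the 1..1e+9 key space
print(sum(is_guard(k, 1) for k in keys))     # few guards at level 1
print(sum(is_guard(k, 5) for k in keys))     # many more guards at level 5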
Operations: Write
• write(key, value), e.g. put(1, "abc"), goes to the in-memory structure of the FLSM
Operations: Get
• get(23) searches level by level, starting from memory
• All level 0 files need to be searched
• Level 1: only the file under guard 15 is searched
• Level 2: both files under guard 15 are searched
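A sketch of the get path just described: memory first, then every level 0 file, then the files under exactly one guard per deeper level (the data structures here are simplified stand-ins for sstables):

import bisect

def flsm_get(key, memtable, level0_files, levels):
    """memtable: dict; level0_files: list of dicts (newest first);
    levels: list of (guards, regions), where regions[i] is the list of
    file dicts in guard region i. Returns the value or None."""
    if key in memtable:                      # 1. in-memory structure
        return memtable[key]
    for f in level0_files:                   # 2. every level 0 file
        if key in f:
            return f[key]
    for guards, regions in levels:           # 3. one guard region per level...
        for f in regions[bisect.bisect_right(guards, key)]:
            if key in f:                     # ...but possibly several files in it
                return f[key]
    return None

level1 = ([15, 70], [[{1: "a"}], [{23: "b"}, {59: "c"}], [{82: "d"}]])
print(flsm_get(23, {}, [{30: "x"}], [level1]))  # "b"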
High write throughput in FLSM
• Compaction from memory to level 0 can stall (e.g. when level 0 is full), and writes to memory then stall as well
• If the rate of insertion is higher than the rate of compaction, write throughput depends on the rate of compaction
• FLSM has faster compaction because of less I/O, and hence higher write throughput
Challenges in FLSM
• Every read/range query operation needs to examine multiple files per level
• For example, if every guard has 5 files, read latency is increased by 5x (assuming no cache hits)
• Trade-off between write I/O and read performance
Outline
• Log-Structured Merge Tree (LSM)
• Fragmented Log-Structured Merge Tree (FLSM)
• Building PebblesDB using FLSM
• Evaluation
• Conclusion
PebblesDB
• Built by modifying HyperLevelDB (~9100 LOC) to use FLSM
• HyperLevelDB is built over LevelDB and provides improved parallelism and compaction
• API compatible with LevelDB, but not with RocksDB
Optimizations in PebblesDB
• Challenge (get/range query): multiple files in a guard
• Get() performance is improved using file level bloom filters, maintained in memory (one per sstable): "is key 25 present?" is answered with either "definitely not" or "possibly yes"
• With the filters, PebblesDB reads the same number of files as any LSM based store
• Range query performance is improved using parallel threads and better compaction
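A sketch of the file level bloom-filter check described above, with a tiny hand-rolled filter (illustrative only; PebblesDB's actual filter implementation differs):

import hashlib

class BloomFilter:
    """Tiny bloom filter: answers 'definitely not' or 'possibly yes'."""
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size, self.k, self.bits = size_bits, num_hashes, 0

    def _positions(self, key):
        for i in range(self.k):
            h = hashlib.sha1(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        return all(self.bits & (1 << p) for p in self._positions(key))

# One in-memory filter per sstable in a guard
sstable_keys = [[15, 23, 39], [16, 25, 32]]
filters = []
for keys in sstable_keys:
    bf = BloomFilter()
    for k in keys:
        bf.add(k)
    filters.append(bf)

# get(25): only sstables whose filter says "possibly yes" are actually read
print([i for i, bf in enumerate(filters) if bf.might_contain(25)])
# [1] with high probability (sstable 0 is skipped without any disk read)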
Outline
• Log-Structured Merge Tree (LSM)
• Fragmented Log-Structured Merge Tree (FLSM)
• Building PebblesDB using FLSM
• Evaluation
• Conclusion
Evaluation
• Micro-benchmarks
• Low memory, small dataset
• Crash recovery
• CPU and memory usage
• Aged file system
• Real world workloads - YCSB
• NoSQL applications
Real world workloads - YCSB
• Yahoo! Cloud Serving Benchmark - industry standard macro-benchmark
• Insertions: 50M, operations: 10M, key size: 16 bytes, value size: 1 KB
• Workloads: Load A - 100% writes; Run A - 50% reads, 50% writes; Run B - 95% reads, 5% writes; Run C - 100% reads; Run D - 95% reads (latest), 5% writes; Load E - 100% writes; Run E - 95% range queries, 5% writes; Run F - 50% reads, 50% read-modify-writes
• [Chart: throughput ratio w.r.t. HyperLevelDB across workloads; labels: Load A 35.08 Kops/s, Run A 25.8 Kops/s, Run B 33.98 Kops/s, Run C 22.41 Kops/s, Run D 57.87 Kops/s, Load E 34.06 Kops/s, Run E 5.8 Kops/s, Run F 32.09 Kops/s, Total IO 952.93 GB]
NoSQL stores - MongoDB
• YCSB on MongoDB, a widely used NoSQL store
• Inserted 20M key-value pairs with 1 KB value size; 10M operations
• Workloads: Load A - 100% writes; Run A - 50% reads, 50% writes; Run B - 95% reads, 5% writes; Run C - 100% reads; Run D - 95% reads (latest), 5% writes; Load E - 100% writes; Run E - 95% range queries, 5% writes; Run F - 50% reads, 50% read-modify-writes
• [Chart: throughput ratio w.r.t. WiredTiger across workloads; labels: Load A 20.73 Kops/s, Run A 9.95 Kops/s, Run B 15.52 Kops/s, Run C 19.69 Kops/s, Run D 23.53 Kops/s, Load E 20.68 Kops/s, Run E 0.65 Kops/s, Run F 9.78 Kops/s, Total IO 426.33 GB]
• PebblesDB combines the low write IO of WiredTiger with the high performance of RocksDB
Outline
• Log-Structured Merge Tree (LSM)
• Fragmented Log-Structured Merge Tree (FLSM)
• Building PebblesDB using FLSM
• Evaluation
• Conclusion
Conclusion
• PebblesDB: a key-value store built on Fragmented Log-Structured Merge Trees
• Increases write throughput and reduces write IO at the same time
• Obtains 6x the write throughput of RocksDB
• As key-value stores become more widely used, there have been several attempts to optimize them
• PebblesDB combines algorithmic innovation (the FLSM data structure) with careful systems building
Operations: Seek
• seek(target): returns the smallest key in the database which is >= target
• Used for range queries (for example, return all entries between 5 and 18)
• Example: Level 0 - 1, 2, 100, 1000; Level 1 - 1, 5, 10, 2000; Level 2 - 5, 300, 500; seek(200) returns 300
• In FLSM, seek(23) needs to search all levels as well as the memtable
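A sketch of seek(target): every component that must be examined (memtable, level 0 files, the guard's files in each level) contributes its smallest key >= target, and the overall minimum is the answer. The example reuses the keys from the slide:

def seek(target, components):
    """components: sorted key lists, one per memtable/file/level that must be
    examined. Returns the smallest key >= target across all of them, or None."""
    candidates = []
    for keys in components:
        nxt = next((k for k in keys if k >= target), None)  # per-component seek
        if nxt is not None:
            candidates.append(nxt)
    return min(candidates) if candidates else None

level0 = [1, 2, 100, 1000]
level1 = [1, 5, 10, 2000]
level2 = [5, 300, 500]
print(seek(200, [level0, level1, level2]))  # 300
print(seek(6, [level0, level1, level2]))    # 10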
Optimizations in PebblesDB
• Challenge with reads: multiple sstable reads per level
• Optimized using sstable level bloom filters, maintained in memory
• Bloom filter: determines if an element is in a set - "is key 25 present?" is answered with either "definitely not" or "possibly yes"
• Example: get(97) consults the per-file bloom filters under the guard and reads only the files whose filters return true
• PebblesDB reads at most one file per guard with high probability
Optimizations in PebblesDB
• Challenge with seeks: multiple sstable reads per level
• Parallel seeks: parallel threads seek() on the files within a guard (e.g. seek(85) on the two files under a guard in level 1, one thread per file)
• Seek based compaction: triggers compaction for a level during a seek-heavy workload, reducing the average number of sstables per guard and the number of active levels
• Seek based compaction increases write I/O, but as a trade-off to improve seek performance
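A sketch of the parallel-seek idea: the files within a guard are seeked on separate threads and the per-file results are combined (hypothetical helpers; Python's thread pool is used purely for illustration):

import bisect
from concurrent.futures import ThreadPoolExecutor

def seek_in_file(keys, target):
    """keys: sorted key list for one sstable. Smallest key >= target, or None."""
    i = bisect.bisect_left(keys, target)
    return keys[i] if i < len(keys) else None

def parallel_seek_in_guard(guard_files, target):
    with ThreadPoolExecutor(max_workers=len(guard_files)) as pool:
        results = pool.map(lambda f: seek_in_file(f, target), guard_files)
    hits = [r for r in results if r is not None]
    return min(hits) if hits else None

# A guard in level 1 holding files 77..97 and 82..95; seek(85)
guard_files = [[77, 80, 90, 97], [82, 85, 95]]
print(parallel_seek_in_guard(guard_files, 85))  # 85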
Tuning PebblesDB
• PebblesDB characteristics (the increase in write throughput, the decrease in write amplification, and the overhead of read/seek operations) all depend on one parameter, maxFilesPerGuard (default 2 in PebblesDB)
• Setting this to a very high value favors write throughput; setting it to a very low value favors read throughput
Horizontal compaction
• For the last two levels in PebblesDB, files are compacted within the same level
• Some optimizations prevent a huge increase in write IO
Experimental setup
• Intel Xeon 2.8 GHz processor
• 16 GB RAM
• Running Ubuntu 16.04 LTS with the Linux 4.4 kernel
• Software RAID0 over 2 Intel 750 SSDs (1.2 TB each)
• Datasets in experiments are 3x bigger than DRAM size
Write amplification
• Inserted different numbers of keys with key size 16 bytes and value size 128 bytes
• [Chart: write IO ratio w.r.t. PebblesDB vs. number of keys inserted; labels: 7.2 GB at 10M keys, 100.7 GB at 100M keys, 756 GB at 500M keys]
Micro-benchmarks
• Used the db_bench tool that ships with LevelDB
• Inserted 50M key-value pairs with key size 16 bytes and value size 1 KB
• Number of read/seek operations: 10M
• [Chart: throughput ratio w.r.t. HyperLevelDB; labels: Seq-Writes 239.05 Kops/s, Random-Writes 11.72 Kops/s, Reads 6.89 Kops/s, Range-Queries 7.5 Kops/s, Deletes 126.2 Kops/s]
Multi-threaded micro-benchmarks
• Writes - 4 threads, each writing 10M
• Reads - 4 threads, each reading 10M
• Mixed - 2 threads writing and 2 threads reading (each 10M)
• [Chart: throughput ratio w.r.t. HyperLevelDB; labels: Writes 44.4 Kops/s, Reads 40.2 Kops/s, Mixed 38.8 Kops/s]
Small cached dataset
• Insert 1M key-value pairs with 16-byte keys and 1 KB values
• Total data set (~1 GB) fits within memory
• PebblesDB-1: configured with a maximum of one file per guard
• [Chart: throughput ratio w.r.t. HyperLevelDB; labels: Writes 45.25 Kops/s, Reads 205.76 Kops/s, Range-Queries 205.34 Kops/s]
Small key-value pairs
• Inserted 300M key-value pairs
• Key: 16 bytes, value: 128 bytes
• [Chart: throughput ratio w.r.t. HyperLevelDB; labels: Writes 44.48 Kops/s, Reads 6.34 Kops/s, Range-Queries 6.31 Kops/s]
Aged FS and KV store
• File system aging: fill up 89% of the file system
• KV store aging: insert 50M, delete 20M, and update 20M key-value pairs in random order
• [Chart: throughput ratio w.r.t. HyperLevelDB; labels: Writes 17.37 Kops/s, Reads 5.65 Kops/s, Range-Queries 6.29 Kops/s]
Low memory micro-benchmark
• 100M key-value pairs with 1 KB values (~65 GB data set)
• DRAM was limited to 4 GB
• [Chart: throughput ratio w.r.t. HyperLevelDB; labels: Writes 27.78 Kops/s, Reads 2.86 Kops/s, Range-Queries 4.37 Kops/s]
Impact of empty guards
• Inserted 20M key-value pairs (0 to 20M) in random order with value size 512 bytes
• Incrementally inserted a new 20M keys after deleting the older keys
• Around 9000 empty guards at the start of the last iteration
• Read latency did not increase with the number of empty guards
NoSQL stores - HyperDex
• HyperDex - distributed key-value store from Cornell
• Inserted 20M key-value pairs with 1 KB value size; 10M operations
• Workloads: Load A - 100% writes; Run A - 50% reads, 50% writes; Run B - 95% reads, 5% writes; Run C - 100% reads; Run D - 95% reads (latest), 5% writes; Load E - 100% writes; Run E - 95% range queries, 5% writes; Run F - 50% reads, 50% read-modify-writes
• [Chart: throughput ratio w.r.t. HyperLevelDB across workloads; labels: Load A 22.08 Kops/s, Run A 21.85 Kops/s, Run B 31.17 Kops/s, Run C 32.75 Kops/s, Run D 38.02 Kops/s, Load E 7.62 Kops/s, Run E 0.37 Kops/s, Run F 19.11 Kops/s, Total IO 1349.5 GB]
CPU usage
• Median CPU usage while inserting 30M keys and reading 10M keys
• PebblesDB: ~171%
• Other key-value stores: 98-110%
• Higher CPU usage due to aggressive compaction and the work of merging multiple files in a guard
Memory usage
• 100M records (16-byte key, 1 KB value), a 106 GB data set: 300 MB of memory (0.3% of the data set size)
• Worst case: 100M records (16-byte key, 16-byte value), a ~3.2 GB data set: 9% of the data set size
Impact of different optimizations
• Sstable level bloom filters improve read performance by 63%
• PebblesDB without optimizations for seek - 66%