
PebblesDB: Building Key-Value Stores using Fragmented Log Structured Merge Trees

Pandian Raju [1], Rohan Kadekodi [1], Vijay Chidambaram [1,2], Ittai Abraham [2]

[1] The University of Texas at Austin  [2] VMware Research

What is a key-value store?

• Store any arbitrary value for a given key
• Insertions: put(key, value)
• Point lookups: get(key)
• Range queries: get_range(key1, key2)

Keys: 123, 124
Values: {"name": "John Doe", "age": 25}, {"name": "Ross Gel", "age": 28}
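The slides use an abstract API; as a rough illustration (my own example, not from the talk), the three operations map directly onto the LevelDB-style API that PebblesDB also exposes. The database path and keys below are made up.

#include <cassert>
#include <iostream>
#include <string>
#include "leveldb/db.h"

int main() {
  // Open (or create) a database; "/tmp/kvdemo" is an arbitrary example path.
  leveldb::DB* db;
  leveldb::Options options;
  options.create_if_missing = true;
  leveldb::Status s = leveldb::DB::Open(options, "/tmp/kvdemo", &db);
  assert(s.ok());

  // Insertion: put(key, value)
  s = db->Put(leveldb::WriteOptions(), "123", "{\"name\":\"John Doe\",\"age\":25}");

  // Point lookup: get(key)
  std::string value;
  s = db->Get(leveldb::ReadOptions(), "123", &value);
  if (s.ok()) std::cout << "123 -> " << value << std::endl;

  // Range query: get_range(key1, key2) is expressed with an iterator.
  leveldb::Iterator* it = db->NewIterator(leveldb::ReadOptions());
  for (it->Seek("100"); it->Valid() && it->key().ToString() <= "200"; it->Next()) {
    std::cout << it->key().ToString() << " -> " << it->value().ToString() << std::endl;
  }
  delete it;
  delete db;
  return 0;
}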

Key-Value Stores - widely used

• Google's BigTable powers Search, Analytics, Maps and Gmail
• Facebook's RocksDB is used as the storage engine in the production systems of many companies

7

Write-optimized data structures

• Log Structured Merge Tree (LSM) is a write-optimized data structure used in key-value stores
• Provides high write throughput with good read throughput, but suffers from high write amplification
• Write amplification: the ratio of the amount of write I/O to the amount of user data

Example: a client writes 10 GB of user data to the KV store; if the total write I/O is 200 GB, the write amplification is 20.

Write amplification in LSM based KV stores

• Inserted 500M key-value pairs
• Key: 16 bytes, Value: 128 bytes
• Total user data: ~45 GB

[Chart: Write IO (GB) per store. RocksDB: 1868 (42x), LevelDB: 1222 (27x), PebblesDB: 756 (17x), User Data: ~45]

Why is write amplification bad?

• Reduces the write throughput
• Flash devices wear out after a limited number of write cycles
  (Intel SSD DC P4600 – can last ~5 years assuming ~5 TB of writes per day)

With RocksDB's write amplification, writing ~500 GB of user data per day would wear this SSD out in about 1.25 years.

Data source: https://www.intel.com/content/www/us/en/products/memory-storage/solid-state-drives/data-center-ssds/dc-p4600-series/dc-p4600-1-6tb-2-5inch-3d1.html

PebblesDB

Built using a new data structure, the Fragmented Log-Structured Merge Tree (FLSM)

A high-performance, write-optimized key-value store

Achieves 3-6.7x higher write throughput and 2.4-3x lower write amplification compared to RocksDB

Achieves the highest write throughput and the least write amplification when used as a backend store for MongoDB

13

Outline

• Log-Structured Merge Tree (LSM)
• Fragmented Log-Structured Merge Tree (FLSM)
• Building PebblesDB using FLSM
• Evaluation
• Conclusion

Log Structured Merge Tree (LSM)

• Data is stored both in memory and on storage
• Writes, Write(key, value), are put directly into memory
• In-memory data is periodically written as files to storage (sequential I/O)
• Files on storage are logically arranged in different levels (Level 0, Level 1, ..., Level n)
• Compaction pushes data to higher numbered levels
• Within a level, files are sorted and have non-overlapping key ranges (e.g., 1…12 | 15…19 | 25…75 | 79…99), so a key can be found using binary search
• Level 0 alone can have files with overlapping (but sorted) key ranges (e.g., 2…57 and 23…78); there is a limit on the number of level 0 files

Write amplification: Illustration

• Max files in level 0 is configured to be 2; files are immutable, and each level keeps sorted, non-overlapping files
• Level 0 holds files 2…37 and 23…48; level 1 holds 1…12, 15…25, 39…62, 77…95 (level 1 re-write counter starts at 1)
• A new file 58…68 is flushed from memory, so level 0 has 3 files (> 2), which triggers a compaction
• The set of overlapping files between levels 0 and 1 (everything except 77…95) is merged and re-written: compacting level 0 with level 1 produces 1…23, 24…46, 47…68, 77…95 (level 1 re-write counter: 2)
• After some more write operations, data is flushed as level 0 files 10…33, 17…53, 1…121
• Compacting level 0 with level 1 again re-writes level 1, producing 1…30, 31…60, 62…90, 92…121 (level 1 re-write counter: 3)
• Existing data has now been re-written to the same level (1) 3 times

Root cause of write amplification

Rewriting data to the same level multiple times, in order to maintain sorted, non-overlapping files in each level
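To make the re-write cost concrete, here is a small self-contained sketch (my own illustration, not code from any of these stores) of what a level compaction does: all overlapping sorted runs are merged into fresh sorted, non-overlapping files, so every key in the touched level-1 files is written to storage again. Files are modeled as sorted vectors of integer keys.

#include <functional>
#include <iostream>
#include <queue>
#include <utility>
#include <vector>

// A "file" is modeled as a sorted run of keys (values and version handling omitted).
using Run = std::vector<int>;

// Merge the overlapping level-0 and level-1 runs into new sorted,
// non-overlapping runs of at most max_keys_per_file keys each.
// Every key that was already in level 1 gets written again; that re-write
// is exactly what drives LSM write amplification.
std::vector<Run> Compact(const std::vector<Run>& inputs, size_t max_keys_per_file,
                         size_t* keys_written) {
  using Item = std::pair<int, size_t>;  // (key, index of source run)
  std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
  std::vector<size_t> pos(inputs.size(), 0);
  for (size_t i = 0; i < inputs.size(); ++i)
    if (!inputs[i].empty()) heap.push({inputs[i][0], i});

  std::vector<Run> outputs(1);
  *keys_written = 0;
  while (!heap.empty()) {
    auto [key, src] = heap.top();
    heap.pop();
    if (outputs.back().size() == max_keys_per_file) outputs.emplace_back();
    outputs.back().push_back(key);  // key is (re)written to storage
    ++*keys_written;
    if (++pos[src] < inputs[src].size()) heap.push({inputs[src][pos[src]], src});
  }
  return outputs;
}

int main() {
  // Level-0 files and the overlapping level-1 files are merged; the level-1
  // keys are rewritten even though they did not change.
  std::vector<Run> overlapping = {{2, 10, 37}, {23, 30, 48}, {1, 5, 12}, {15, 20, 25}, {39, 50, 62}};
  size_t keys_written = 0;
  auto level1 = Compact(overlapping, 4, &keys_written);
  std::cout << "new level-1 files: " << level1.size()
            << ", keys written during compaction: " << keys_written << "\n";
}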

Outline

• Log-Structured Merge Tree (LSM)
• Fragmented Log-Structured Merge Tree (FLSM)
• Building PebblesDB using FLSM
• Evaluation
• Conclusion

36

Naïve approach to reduce write amplification

• Just append the file to the end of the next level
• Results in many (possibly all) overlapping files within a level
• Affects read performance

Level i: 1…89, 6…91, 5…65, 9…99, 1…102, 1…27, 18…95 (all files have overlapping key ranges)

Partially sorted levels

• Hybrid between all non-overlapping files and all overlapping files
• Inspired by the skip list data structure
• Concrete boundaries (guards) group together overlapping files

Level i with guards 13, 35, 70: [1…12] | guard 13: [18…31, 13…34] | guard 35: [42…65, 45…56, 40…47] | guard 70: [72…87]
(files within the same guard can have overlapping key ranges)

Fragmented Log-Structured Merge Tree

Novel modification of LSM data structure

Uses guards to maintain partially sorted levels

Writes data only once per level in most cases

39

FLSM structure

• Files are logically grouped within guards
• Guards get more fine-grained deeper into the tree

Memory: in-memory memtable
Level 0: 2…37, 23…48
Level 1 (guards 15, 70): [1…12] | 15: [15…59] | 70: [77…87, 82…95]
Level 2 (guards 15, 40, 70, 95): [2…8] | 15: [15…23, 16…32] | 40: [45…65] | 70: [70…90] | 95: [96…99]

How does FLSM reduce write amplification?

• Max files in level 0 is configured to be 2; a new memtable flush (30…68) makes level 0 exceed the limit
• Compacting level 0: the level 0 files (2…37, 23…48, 30…68) are merged, sorted, and fragmented along the level 1 guards into 2…14 and 15…68
• The fragmented files are just appended to the next level; no existing level 1 file is re-written
• Later, guard 15 in level 1 (files 15…59 and 15…68) is picked for compaction
• Its files are combined, sorted, and fragmented along the level 2 guards into 15…39 and 40…68
• Again, the fragmented files are just appended to the next level (level 2)

How does FLSM reduce write amplification?
FLSM doesn't re-write data to the same level in most cases.

How does FLSM maintain read performance?
FLSM maintains partially sorted levels to efficiently reduce the search space.

49

Selecting Guards

• Guards are chosen randomly and dynamically
• Dependent on the distribution of data

(Keyspace: 1 … 1e+9; guards are picked from the keys inserted across this range)
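The slides do not spell out the selection scheme, so the following is only a hedged sketch of one skip-list-style way to pick guards randomly from the inserted keys, with deeper levels getting more guards; the probabilities and bookkeeping in PebblesDB may differ.

#include <iostream>
#include <map>
#include <random>
#include <set>

// Skip-list-style guard selection sketch (illustrative only). Each inserted
// key becomes a guard of some level with a small probability; a guard of
// level i is also a guard of every deeper level, so deeper levels end up
// with more, finer-grained guards.
class GuardPicker {
 public:
  explicit GuardPicker(int num_levels, double base_prob = 1.0 / 32)
      : num_levels_(num_levels), base_prob_(base_prob), coin_(0.0, 1.0) {
    for (int l = 1; l <= num_levels_; ++l) guards_[l] = {};
  }

  void OnInsert(int key) {
    // Probability of being a guard shrinks geometrically toward level 1,
    // mirroring how a skip list picks node heights.
    double p = base_prob_;
    for (int level = num_levels_; level >= 1; --level, p *= base_prob_) {
      if (coin_(rng_) < p) {
        for (int l = level; l <= num_levels_; ++l) guards_[l].insert(key);
        break;
      }
    }
  }

  const std::set<int>& GuardsAt(int level) const { return guards_.at(level); }

 private:
  int num_levels_;
  double base_prob_;
  std::map<int, std::set<int>> guards_;
  std::mt19937 rng_{42};
  std::uniform_real_distribution<double> coin_;
};

int main() {
  GuardPicker picker(/*num_levels=*/3);
  std::mt19937 key_rng(7);
  std::uniform_int_distribution<int> keyspace(1, 1000000000);  // 1 .. 1e+9
  for (int i = 0; i < 100000; ++i) picker.OnInsert(keyspace(key_rng));
  for (int level = 1; level <= 3; ++level)
    std::cout << "level " << level << ": " << picker.GuardsAt(level).size() << " guards\n";
}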

Operations: Write

Write(key, value), e.g. Put(1, "abc"), goes directly into the in-memory memtable; the FLSM structure on storage (levels 0-2 with guards, as shown earlier) is only touched when the memtable is flushed.

Operations: Get

Get(23) on the FLSM structure shown earlier:
• Search level by level, starting from memory (the memtable)
• Level 0: all files need to be searched (2…37, 23…48)
• Level 1: only the file under guard 15 is searched (15…59)
• Level 2: both files under guard 15 are searched (15…23, 16…32)
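A minimal sketch of this lookup path (my own in-memory model with invented types, not PebblesDB code): memtable first, then every level 0 file, then only the files under the owning guard in each deeper level.

#include <iterator>
#include <map>
#include <optional>
#include <string>
#include <vector>

struct SSTable {
  int min_key, max_key;
  std::map<int, std::string> data;  // stand-in for the on-disk sorted file
};

struct Level {
  // guard key -> files under that guard; key 0 acts as the sentinel guard.
  std::map<int, std::vector<SSTable>> guards;
};

std::optional<std::string> SearchFiles(const std::vector<SSTable>& files, int key) {
  // Probe each file whose key range covers the key (newest first in a real store).
  for (const SSTable& f : files) {
    if (key < f.min_key || key > f.max_key) continue;
    auto it = f.data.find(key);
    if (it != f.data.end()) return it->second;
  }
  return std::nullopt;
}

std::optional<std::string> Get(const std::map<int, std::string>& memtable,
                               const std::vector<SSTable>& level0,
                               const std::vector<Level>& levels, int key) {
  // 1. Memtable first: it holds the most recent writes.
  if (auto it = memtable.find(key); it != memtable.end()) return it->second;
  // 2. Level 0: files may overlap, so every file has to be considered.
  if (auto v = SearchFiles(level0, key)) return v;
  // 3. Levels 1..n: locate the owning guard, then probe only its files.
  for (const Level& level : levels) {
    auto g = level.guards.upper_bound(key);
    if (g == level.guards.begin()) continue;
    if (auto v = SearchFiles(std::prev(g)->second, key)) return v;
  }
  return std::nullopt;
}

int main() {
  std::map<int, std::string> memtable;
  std::vector<SSTable> level0 = {{2, 37, {{23, "v0"}}}, {23, 48, {{40, "v1"}}}};
  std::vector<Level> levels(1);
  levels[0].guards[0] = {{1, 12, {{5, "v2"}}}};
  levels[0].guards[15] = {{15, 59, {{23, "old"}}}};
  auto v = Get(memtable, level0, levels, 23);
  return v && *v == "v0" ? 0 : 1;
}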

High write throughput in FLSM

• When compaction from memory to level 0 is stalled, writes to memory are also stalled
• If the rate of insertion is higher than the rate of compaction, write throughput depends on the rate of compaction
• FLSM has faster compaction because of lesser I/O, and hence higher write throughput

Challenges in FLSM

• Every read/range query operation needs to examine multiple files per level
• For example, if every guard has 5 files, read latency increases by 5x (assuming no cache hits)

Trade-off between write I/O and read performance

62

Outline

• Log-Structured Merge Tree (LSM)
• Fragmented Log-Structured Merge Tree (FLSM)
• Building PebblesDB using FLSM
• Evaluation
• Conclusion

63

PebblesDB

• Built by modifying HyperLevelDB (~9100 LOC) to use FLSM
• HyperLevelDB is built over LevelDB to provide improved parallelism and compaction
• API compatible with LevelDB, but not with RocksDB

64

Optimizations in PebblesDB

• Challenge (get/range query): multiple files in a guard
• Get() performance is improved using file-level bloom filters
  - A bloom filter answers "is key 25 present?" with either "definitely not" or "possibly yes"
  - One bloom filter per sstable, maintained in memory
  - With bloom filters, PebblesDB reads the same number of files as any LSM based store
• Range query performance is improved using parallel threads and better compaction
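To illustrate the file-level filter idea, here is a toy bloom filter (not the one PebblesDB ships) attached to each sstable in a guard and consulted before touching the file on disk.

#include <bitset>
#include <functional>
#include <iostream>
#include <string>

// Toy bloom filter: k hash probes into a fixed-size bit array.
// "false" means the key is definitely absent; "true" means possibly present.
class BloomFilter {
 public:
  void Add(const std::string& key) {
    for (size_t i = 0; i < kProbes; ++i) bits_.set(Hash(key, i));
  }
  bool MayContain(const std::string& key) const {
    for (size_t i = 0; i < kProbes; ++i)
      if (!bits_.test(Hash(key, i))) return false;
    return true;
  }

 private:
  static constexpr size_t kBits = 8192, kProbes = 4;
  size_t Hash(const std::string& key, size_t seed) const {
    return std::hash<std::string>{}(key + '#' + std::to_string(seed)) % kBits;
  }
  std::bitset<kBits> bits_;
};

struct SSTable {
  std::string name;
  BloomFilter filter;  // built from the file's keys, kept in memory
};

int main() {
  // Two sstables under the same guard; only one contains key "25".
  SSTable a{"15-39.sst"}, b{"40-68.sst"};
  a.filter.Add("25");
  b.filter.Add("55");
  const SSTable* guard_files[] = {&a, &b};
  for (const SSTable* t : guard_files) {
    if (t->filter.MayContain("25"))
      std::cout << "read " << t->name << " from disk\n";  // possibly present
    else
      std::cout << "skip " << t->name << "\n";             // definitely absent
  }
}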

Outline

• Log-Structured Merge Tree (LSM)
• Fragmented Log-Structured Merge Tree (FLSM)
• Building PebblesDB using FLSM
• Evaluation
• Conclusion

70

Evaluation

• Micro-benchmarks
• Low memory
• Small dataset
• Crash recovery
• CPU and memory usage
• Aged file system
• Real world workloads - YCSB
• NoSQL applications

Real world workloads - YCSB

• Yahoo! Cloud Serving Benchmark - industry standard macro-benchmark
• Insertions: 50M, Operations: 10M, key size: 16 bytes and value size: 1 KB

Workloads:
Load A - 100% writes
Run A - 50% reads, 50% writes
Run B - 95% reads, 5% writes
Run C - 100% reads
Run D - 95% reads (latest), 5% writes
Load E - 100% writes
Run E - 95% range queries, 5% writes
Run F - 50% reads, 50% read-modify-writes

[Chart: throughput ratio wrt HyperLevelDB per workload, plus total write IO. Labeled values: Load A 35.08 Kops/s, Run A 25.8 Kops/s, Run B 33.98 Kops/s, Run C 22.41 Kops/s, Run D 57.87 Kops/s, Load E 34.06 Kops/s, Run E 5.8 Kops/s, Run F 32.09 Kops/s, Total IO 952.93 GB]

NoSQL stores - MongoDB

• YCSB on MongoDB, a widely used NoSQL store
• Inserted 20M key-value pairs with 1 KB value size and 10M operations
• Same YCSB workloads as above (Load A through Run F)

[Chart: throughput ratio wrt WiredTiger per workload, plus total write IO. Labeled values: Load A 20.73 Kops/s, Run A 9.95 Kops/s, Run B 15.52 Kops/s, Run C 19.69 Kops/s, Run D 23.53 Kops/s, Load E 20.68 Kops/s, Run E 0.65 Kops/s, Run F 9.78 Kops/s, Total IO 426.33 GB]

PebblesDB combines the low write IO of WiredTiger with the high performance of RocksDB.

Outline

• Log-Structured Merge Tree (LSM)
• Fragmented Log-Structured Merge Tree (FLSM)
• Building PebblesDB using FLSM
• Evaluation
• Conclusion

85

Conclusion

• PebblesDB: a key-value store built on Fragmented Log-Structured Merge Trees
• Increases write throughput and reduces write IO at the same time
• Obtains 6x the write throughput of RocksDB

• As key-value stores become more widely used, there have been several attempts to optimize them
• PebblesDB combines algorithmic innovation (the FLSM data structure) with careful systems building

86

https://github.com/utsaslab/pebblesdb


Backup slides

89

Operations: Seek

• Seek(target): returns the smallest key in the database which is >= target
• Used for range queries (for example, return all entries between 5 and 18)

Example:
Level 0 – 1, 2, 100, 1000
Level 1 – 1, 5, 10, 2000
Level 2 – 5, 300, 500
Seek(200) consults every level; the per-level candidates are 1000, 2000 and 300, so the answer is 300.
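A seek is essentially a per-level lower_bound followed by taking the minimum of the per-level candidates. A minimal sketch for the example above (my own illustration), with each level modeled as one sorted vector:

#include <algorithm>
#include <iostream>
#include <optional>
#include <vector>

// Seek(target): smallest key in the store that is >= target.
// Each level contributes its own candidate via lower_bound; the overall
// answer is the minimum of the per-level candidates.
std::optional<int> Seek(const std::vector<std::vector<int>>& levels, int target) {
  std::optional<int> best;
  for (const auto& level : levels) {  // each level is kept sorted here
    auto it = std::lower_bound(level.begin(), level.end(), target);
    if (it != level.end() && (!best || *it < *best)) best = *it;
  }
  return best;
}

int main() {
  std::vector<std::vector<int>> levels = {
      {1, 2, 100, 1000},   // level 0
      {1, 5, 10, 2000},    // level 1
      {5, 300, 500}};      // level 2
  if (auto k = Seek(levels, 200)) std::cout << "Seek(200) -> " << *k << "\n";  // 300
}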

Operations: Seek

Seek(23) on the FLSM structure shown earlier: all levels and the memtable need to be searched.

Optimizations in PebblesDB

• Challenge with reads: multiple sstable reads per level
• Optimized using sstable-level bloom filters (maintained in memory)
• Bloom filter: determines if an element is in a set ("Is key 25 present?" -> "Definitely not" or "Possibly yes")
• Example: Get(97) on level 1 (guards 15, 70; files 1…12 | 15…39 | 77…97, 82…95) consults the bloom filters of the two files under guard 70; one returns False and the other True, so only one file is read

PebblesDB reads at most one file per guard with high probability.

Optimizations in PebblesDB

• Challenge with seeks: multiple sstable reads per level
• Parallel seeks: parallel threads to seek() on the files in a guard
  (e.g., Seek(85) on level 1, guard 70: thread 1 seeks in file 77…97 while thread 2 seeks in file 82…95)

99
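A hedged sketch of the parallel-seek idea (one std::thread per file in the guard, then a final minimum; this is not PebblesDB's actual threading code):

#include <algorithm>
#include <iostream>
#include <optional>
#include <thread>
#include <vector>

// Each sstable in a guard is modeled as a sorted vector; one thread performs
// the per-file seek, and the per-file results are combined with a final min.
std::optional<int> ParallelSeekInGuard(const std::vector<std::vector<int>>& files,
                                       int target) {
  std::vector<std::optional<int>> results(files.size());
  std::vector<std::thread> workers;
  for (size_t i = 0; i < files.size(); ++i) {
    workers.emplace_back([&, i] {
      auto it = std::lower_bound(files[i].begin(), files[i].end(), target);
      if (it != files[i].end()) results[i] = *it;  // each thread writes its own slot
    });
  }
  for (auto& t : workers) t.join();

  std::optional<int> best;
  for (const auto& r : results)
    if (r && (!best || *r < *best)) best = *r;
  return best;
}

int main() {
  // Guard 70 on level 1 from the slide: files 77..97 and 82..95.
  std::vector<std::vector<int>> guard_files = {{77, 80, 90, 97}, {82, 85, 95}};
  if (auto k = ParallelSeekInGuard(guard_files, 85))
    std::cout << "Seek(85) -> " << *k << "\n";  // 85
}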

Optimizations in PebblesDB

• Challenge with seeks: multiple sstable reads per level
• Parallel seeks: parallel threads to seek() on the files in a guard
• Seek-based compaction: triggers compaction for a level during a seek-heavy workload
  - Reduces the average number of sstables per guard
  - Reduces the number of active levels

Seek-based compaction increases write I/O, but as a trade-off to improve seek performance.

100

Tuning PebblesDB

• PebblesDB characteristics (the increase in write throughput, the decrease in write amplification, and the overhead of read/seek operations) all depend on one parameter, maxFilesPerGuard (default 2 in PebblesDB)
• Setting this to a very high value favors write throughput
• Setting this to a very low value favors read throughput

101

Horizontal compaction

• Files are compacted within the same level for the last two levels in PebblesDB
• Some optimizations prevent a huge increase in write IO

102

Experimental setup

• Intel Xeon 2.8 GHz processor
• 16 GB RAM
• Running Ubuntu 16.04 LTS with the Linux 4.4 kernel
• Software RAID0 over 2 Intel 750 SSDs (1.2 TB each)
• Datasets in experiments are 3x bigger than DRAM size

103

Write amplification

• Inserted different numbers of keys with key size 16 bytes and value size 128 bytes

[Chart: write IO ratio wrt PebblesDB for 10M, 100M and 500M inserted keys. Labeled values: 7.2 GB (10M), 100.7 GB (100M), 756 GB (500M)]

104

Micro-benchmarks

• Used the db_bench tool that ships with LevelDB
• Inserted 50M key-value pairs with key size 16 bytes and value size 1 KB
• Number of read/seek operations: 10M

[Chart: throughput ratio wrt HyperLevelDB per benchmark. Labeled values: Seq-Writes 239.05 Kops/s, Random-Writes 11.72 Kops/s, Reads 6.89 Kops/s, Range-Queries 7.5 Kops/s, Deletes 126.2 Kops/s]

106

Multi threaded micro-benchmarks

• Writes – 4 threads each writing 10M
• Reads – 4 threads each reading 10M
• Mixed – 2 threads writing and 2 threads reading (each 10M)

[Chart: throughput ratio wrt HyperLevelDB. Labeled values: Writes 44.4 Kops/s, Reads 40.2 Kops/s, Mixed 38.8 Kops/s]

107

Small cached dataset

• Insert 1M key-value pairs with 16 bytes key and 1 KB value
• Total data set (~1 GB) fits within memory
• PebblesDB-1: with a maximum of one file per guard

[Chart: throughput ratio wrt HyperLevelDB. Labeled values: Writes 45.25 Kops/s, Reads 205.76 Kops/s, Range-Queries 205.34 Kops/s]

Small key-value pairs

• Inserted 300M key-value pairs
• Key 16 bytes, value 128 bytes

[Chart: throughput ratio wrt HyperLevelDB. Labeled values: Writes 44.48 Kops/s, Reads 6.34 Kops/s, Range-Queries 6.31 Kops/s]

Aged FS and KV store

• File system aging: fill up 89% of the file system
• KV store aging: insert 50M, delete 20M and update 20M key-value pairs in random order

[Chart: throughput ratio wrt HyperLevelDB. Labeled values: Writes 17.37 Kops/s, Reads 5.65 Kops/s, Range-Queries 6.29 Kops/s]

110

Low memory micro-benchmark

• 100M key-value pairs with 1 KB values (~65 GB data set)
• DRAM was limited to 4 GB

[Chart: throughput ratio wrt HyperLevelDB. Labeled values: Writes 27.78 Kops/s, Reads 2.86 Kops/s, Range-Queries 4.37 Kops/s]

111

Impact of empty guards

• Inserted 20M key-value pairs (0 to 20M) in random order with value size 512 bytes
• Incrementally inserted 20M new keys after deleting the older keys
• Around 9000 empty guards at the start of the last iteration
• Read latency was unaffected by the increase in empty guards

112

NoSQL stores - HyperDex

• HyperDex – a distributed key-value store from Cornell
• Inserted 20M key-value pairs with 1 KB value size and 10M operations
• Same YCSB workloads as above (Load A through Run F)

[Chart: throughput ratio wrt HyperLevelDB per workload, plus total write IO. Labeled values: Load A 22.08 Kops/s, Run A 21.85 Kops/s, Run B 31.17 Kops/s, Run C 32.75 Kops/s, Run D 38.02 Kops/s, Load E 7.62 Kops/s, Run E 0.37 Kops/s, Run F 19.11 Kops/s, Total IO 1349.5 GB]

CPU usage

• Median CPU usage while inserting 30M keys and reading 10M keys
• PebblesDB: ~171%
• Other key-value stores: 98-110%
• Due to aggressive compaction and the extra CPU work for merging multiple files in a guard

114

Memory usage

• 100M records (16 bytes key, 1 KB value) – 106 GB data set
  - 300 MB memory space (0.3% of data set size)
• Worst case: 100M records (16 bytes key, 16 bytes value) – ~3.2 GB data set
  - memory space is 9% of data set size

115

Bloom filter calculation cost

• 1.2 sec per GB of sstable
• 3200 files – 52 GB – 62 seconds

116

Impact of different optimizations

• Sstable-level bloom filters improve read performance by 63%
• PebblesDB without optimizations for seek – 66%

117

Thank you! Questions?

118

