  • LightKV: A Cross Media Key Value Store with Persistent Memory to Cut Long Tail Latency

    Shukai Han, Dejun Jiang, Jin Xiong
    Institute of Computing Technology, Chinese Academy of Sciences

    University of Chinese Academy of Sciences

    MSST '20, October 29-30, 2020

  • MSST '20

    2

    Outline

    ✓ Background & Motivation

    • Design

    • Evaluation

    • Conclusion

  • MSST '20

    3

    Key-Value Store

    • Key-Value (KV) stores are widely deployed in data centers.

    • KV stores are latency-critical applications.

    – Workloads with a high percentage of small KV items [1]
    – Applications with low latency requirements

    [1] Berk, SIGMETRICS 2012

  • MSST '20

    Log-Structured Merge Tree (LSM-Tree)

    4

    [Figure: LSM-Tree structure. A KV pair is (1) appended to the WAL (log), (2) written to the MemTable, (3) flushed from the Immutable MemTable into a Level 0 SSTable, and (4) moved down through Level 1 .. Level k by compaction. Each SSTable contains sorted KV data plus metadata (bloom filter, index, etc.).]
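
    A minimal sketch of the write path above (1. WAL, 2. write, 3. flush, 4. compaction), assuming a toy single-level store; the names (SimpleLSM, FLUSH_THRESHOLD) are invented for illustration and this is not LevelDB's actual code.

        # Toy sketch of the LSM-Tree write path on this slide; real stores add
        # levels, bloom filters, manifest files, and background compaction threads.
        import os

        FLUSH_THRESHOLD = 4  # flush the MemTable after this many entries (toy value)

        class SimpleLSM:
            def __init__(self, data_dir):
                self.data_dir = data_dir
                os.makedirs(data_dir, exist_ok=True)
                self.wal = open(os.path.join(data_dir, "wal.log"), "a")
                self.memtable = {}   # LevelDB uses a skiplist here
                self.sstable_id = 0

            def put(self, key, value):
                # 1. WAL: append before updating the MemTable so the write survives a crash.
                self.wal.write(f"{key}\t{value}\n")
                self.wal.flush()
                # 2. Write: insert into the in-memory MemTable.
                self.memtable[key] = value
                # 3. Flush: once the MemTable is full, dump it as a sorted SSTable.
                if len(self.memtable) >= FLUSH_THRESHOLD:
                    self._flush()

            def _flush(self):
                path = os.path.join(self.data_dir, f"sst_{self.sstable_id:06d}.txt")
                with open(path, "w") as sst:
                    for key in sorted(self.memtable):   # KV data is written sorted
                        sst.write(f"{key}\t{self.memtable[key]}\n")
                self.sstable_id += 1
                self.memtable.clear()
                # 4. Compaction would later merge overlapping SSTables level by level.

        db = SimpleLSM("/tmp/toy_lsm")
        db.put("apple", "1")
        db.put("banana", "2")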

  • MSST '20

    Limitations of Persistent KV Store
    Inefficient indexing for cross-media

    5

    [Figure: the LSM-Tree structure from the previous slide, with reads served from both the in-memory MemTables and the on-disk SSTable levels.]

    ✓ On one hand, LSM-Tree adopts a skiplist to index in-memory data.

    ✓ On the other hand, LSM-Tree builds manifest files to record the key range of each on-disk SSTable.

  • MSST '20

    Limitations of Persistent KV Store
    High write amplification

    6

    [Figure: the LSM-Tree write path from the previous slides: WAL, MemTable write, flush to Level 0, and compaction across Level 1 .. Level k.]

    Writing the log and transferring data between levels both increase write amplification.

    Write (read) amplification is defined as the ratio between the amount of data written to (read from) the underlying storage device and the amount of data requested by the user.
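
    In symbols (the shorthand W_user, W_dev, R_user, R_dev is introduced here, not on the slide):

    \[ \mathrm{WA} = \frac{W_{\mathrm{dev}}}{W_{\mathrm{user}}}, \qquad \mathrm{RA} = \frac{R_{\mathrm{dev}}}{R_{\mathrm{user}}} \]

    For example, if an application writes 10 GB but the device absorbs 100 GB across the WAL, flushes, and compactions, then WA = 100 GB / 10 GB = 10.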

  • MSST '20

    Limitations of Persistent KV Store
    High write amplification

    7

    [Figure: write amplification (annotated "10X+") as the total data written grows from 10 GB to 100 GB.
    LevelDB: 8.80, 10.47, 11.42, 12.29, 12.90, 13.28, 13.57, 14.01, 14.41, 14.64
    HyperLevelDB: 6.84, 7.78, 8.69, 9.38, 10.14, 10.60, 10.75, 10.90, 11.23, 11.68
    RocksDB: 5.16, 6.30, 7.80, 9.22, 9.43, 9.72, 10.16, 10.56, 10.54, 10.95]

    The write amplification of LSM-Tree can reach 10x or more, and it continues to trend upward as the amount of data grows.

  • MSST '20

    Limitations of Persistent KV Store
    Heavy-tailed read latency under mixed workload

    8

    ✓ We first warm up LevelDB with 100 GB of data.

    ✓ We measure the average latency as well as the 99th and 99.9th percentile read latencies every 10 seconds.

    The maximum 99th and 99.9th percentile read latencies can reach 13x and 28x the average read latency.

    t1: Run a mixed workload that randomly reads 50 GB of existing data while randomly inserting another 50 GB of data.


  • MSST '20

    Limitations of Persistent KV Store
    Heavy-tailed read latency under mixed workload

    9

    t2: Run a read-only workload.

    After the compaction finishes, the read tail latency is significantly reduced.

    Reducing write amplification not only lowers the total amount of data written to disk and increases system throughput, but also helps reduce read tail latency under mixed read-write workloads.

  • MSST '20

    Non-Volatile Memory

    10

    • Non-Volatile Memories (NVMs) provide low latency and byte addressability.

    • Examples include 3D XPoint, Phase Change Memory (PCM), and Resistive Memory (ReRAM).

    • The first PM product, Intel Optane DC Persistent Memory (PM), was announced [19] in April 2019.

    1. NVM retains data after power is turned off.

    2. The write latency of Optane DC PM is close to that of DRAM, while its read latency is 3 to 4 times that of DRAM.

    3. The write and read bandwidths of Optane DC PM are around 2 GB/s and 6.5 GB/s, about 1/8 and 1/4 of DRAM's respectively.

  • MSST '20

    11

    Outline

    • Background & Motivation

    ✓ Design

    • Evaluation

    • Conclusion

  • MSST '20

    LightKV System Overview

    12

    [Figure: LightKV system overview. (1) A Radix Hash Tree (RH-Tree) index in DRAM; (2) a Persistent Write Buffer (PWB) of Segments in Persistent Memory; segments are flushed into (3) the Main Data Store on SSD, where SSTables are grouped into Partition1 .. PartitionN and compacted within each partition.]
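
    A minimal sketch of the data flow in this overview (DRAM index, PM write buffer, SSD partitions), assuming toy in-process data structures; the class and method names are invented and do not reflect LightKV's real implementation.

        # Toy sketch of the LightKV flow above: writes land in a Segment of the
        # Persistent Write Buffer, the DRAM index points at them, and full
        # segments are flushed to SSTables inside an SSD partition.
        SEGMENT_CAPACITY = 4   # toy value
        NUM_PARTITIONS = 4     # "Partition1 .. Partition N" in the figure

        class ToyLightKV:
            def __init__(self):
                self.index = {}        # stands in for the RH-Tree in DRAM
                self.segment = []      # current Segment of the Persistent Write Buffer (PM)
                self.partitions = [[] for _ in range(NUM_PARTITIONS)]  # SSTables on SSD

            def put(self, key, value):
                self.segment.append((key, value))                # write into the PM segment
                self.index[key] = ("pm", len(self.segment) - 1)  # index points into PM
                if len(self.segment) >= SEGMENT_CAPACITY:
                    self._flush()                                # full segment -> SSD SSTable

            def _flush(self):
                sstable = sorted(self.segment)
                part = hash(sstable[0][0]) % NUM_PARTITIONS      # toy partition choice
                self.partitions[part].append(sstable)
                for pos, (key, _) in enumerate(sstable):         # re-point index entries to SSD
                    self.index[key] = ("ssd", part, len(self.partitions[part]) - 1, pos)
                self.segment = []

            def get(self, key):
                loc = self.index[key]
                if loc[0] == "pm":
                    return self.segment[loc[1]][1]
                _, part, sst, pos = loc
                return self.partitions[part][sst][pos][1]

        kv = ToyLightKV()
        kv.put("k1", "v1")
        assert kv.get("k1") == "v1"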

  • MSST '20

    Challenges

    • How does the Radix Hash Tree index KV items across media?

    • How does the Radix Hash Tree balance performance and data growth?

    • How does the Radix Hash Tree conduct well-controlled data compaction to reduce write amplification?

    13

  • MSST '20

    Radix Hash Tree Structure

    14

    [Figure: RH-Tree structure. A Prefix Search Tree whose internal nodes cover key-prefix ranges (such as [0,64], [96,255], [0,32], [64,255], [128,255]) and whose leaves are HashTables; each HashTable entry points to a KV item in a Segment or SSTable. A 64 B HashTable bucket holds four slots: 4 x 4 B signatures, 4 x 4 B caches, and 4 x 8 B KV offsets.]
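
    The 64 B bucket layout in the figure (4 x 4 B signatures, 4 x 4 B caches, 4 x 8 B KV offsets) can be checked with Python's struct module; the packing below only illustrates how the 64 bytes add up and is not LightKV's actual memory layout.

        import struct

        # One hash bucket: 4 x 4 B signatures + 4 x 4 B caches + 4 x 8 B kv offsets
        # = 16 + 16 + 32 = 64 bytes. "<" means little-endian with no implicit padding.
        BUCKET_FORMAT = "<4I4I4Q"

        def pack_bucket(signatures, caches, kv_offsets):
            """Pack one 64-byte bucket from 4 signatures, 4 caches, and 4 kv offsets."""
            return struct.pack(BUCKET_FORMAT, *signatures, *caches, *kv_offsets)

        assert struct.calcsize(BUCKET_FORMAT) == 64
        bucket = pack_bucket([11, 22, 33, 44], [0, 0, 0, 0], [100, 200, 300, 400])
        assert len(bucket) == 64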

  • MSST '20

    RH-Tree split

    15

    [Figure: RH-Tree split. Normal split: a leaf node's range is halved, e.g. LN1 [0,127] under IN1 becomes LN1 [0,63] and LN2 [64,127]. Level split: a new internal node IN2 covering [0,127] is inserted under IN1, taking LN1 and LN2 as its children, with the remaining range [128,255] kept beside it.]
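
    A minimal sketch of the "normal split" above, assuming a toy leaf layout (a range plus a dict keyed by the first key byte); the data structure and overflow policy are assumptions, not LightKV's code.

        # Toy normal split: halve the leaf's range and redistribute entries by key
        # prefix, mirroring LN1 [0,127] -> LN1 [0,63] + LN2 [64,127] in the figure.
        def normal_split(leaf):
            """leaf = {'lo': int, 'hi': int, 'entries': {prefix_byte: value}}"""
            mid = (leaf["lo"] + leaf["hi"]) // 2
            left = {"lo": leaf["lo"], "hi": mid, "entries": {}}
            right = {"lo": mid + 1, "hi": leaf["hi"], "entries": {}}
            for prefix, value in leaf["entries"].items():
                (left if prefix <= mid else right)["entries"][prefix] = value
            return left, right

        ln1 = {"lo": 0, "hi": 127, "entries": {5: "a", 70: "b", 120: "c"}}
        left, right = normal_split(ln1)
        assert (left["lo"], left["hi"]) == (0, 63)
        assert (right["lo"], right["hi"]) == (64, 127)
        assert 5 in left["entries"] and 70 in right["entries"] and 120 in right["entries"]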

  • MSST '20

    Linked hash leaf node

    16

    [Figure: linked hash leaf node. Stage 1: leaf node LN1 in DRAM indexes Segment1 in PM. Stage 2: LN1 is persisted as LN1', and a new leaf LN2 indexing Segment2 is linked to it. Stage 3: Segment1 is flushed to an SSTable on SSD, which LN1' continues to index, while LN2 keeps indexing Segment2.]

  • MSST '20

    RH-Tree placement

    17

    [Figure: RH-Tree placement across DRAM, Persistent Memory, and SSD: the Prefix Search Tree in DRAM, with hash leaf nodes indexing Segments in the Persistent Write Buffer (PM) and SSTables in the Main Data Store partitions (Partition1 .. PartitionN) on SSD.]

  • MSST '20

    Partition-based data compaction

    18

    [Figure: partition-based data compaction with Compaction Size (CS) = 4, shown over steps t1..t6. Every four level-0 SSTables in a partition are merged into one level-1 SSTable (S1-S4 into S5, S6-S9 into S10, S11-S14 into S15, S16-S19 into S20), and the four level-1 SSTables S5, S10, S15, and S20 are then merged into the level-2 SSTable S21. The number in parentheses after each SSTable is its level.]
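
    A minimal sketch of the merge pattern in this figure (every CS same-level SSTables in a partition merge into one SSTable at the next level); the data structures are placeholders, not LightKV's real layout.

        # Toy partition compaction: when a partition holds CS SSTables of the same
        # level, merge them into one SSTable one level up (S1-S4 -> S5, ...,
        # then S5/S10/S15/S20 -> S21 in the figure). Stale duplicates are not dropped here.
        from collections import defaultdict

        CS = 4  # Compaction Size from the slide

        def maybe_compact(partition):
            """partition: list of (level, sorted list of (key, value)) SSTables."""
            by_level = defaultdict(list)
            for table in partition:
                by_level[table[0]].append(table)
            for level, tables in sorted(by_level.items()):
                if len(tables) >= CS:
                    merged = sorted(kv for _, data in tables[:CS] for kv in data)
                    for table in tables[:CS]:
                        partition.remove(table)
                    partition.append((level + 1, merged))
                    return True   # one merge per call, like one compaction step
            return False

        partition = [(0, [(i, str(i))]) for i in range(4)]   # four level-0 SSTables
        maybe_compact(partition)
        assert partition == [(1, [(0, "0"), (1, "1"), (2, "2"), (3, "3")])]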

  • MSST '20

    Recovery

    19

    [Figure: recovery. After a restart, the Radix Hash Tree in DRAM is rebuilt from the Segments of the Persistent Write Buffer in Persistent Memory and the SSTables of the Main Data Store partitions on SSD.]
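
    A minimal sketch of the rebuild step in this figure: scan what is already durable (PM segments, SSD SSTables) and repopulate the DRAM index; all structures and names here are illustrative assumptions.

        # Toy recovery: rebuild the DRAM index from the PM segments and the SSD
        # partitions, letting the newer PM data override older SSD locations.
        def rebuild_index(pm_segments, ssd_partitions):
            index = {}
            for part_id, sstables in enumerate(ssd_partitions):      # older data on SSD first
                for sst_id, sstable in enumerate(sstables):
                    for pos, (key, _) in enumerate(sstable):
                        index[key] = ("ssd", part_id, sst_id, pos)
            for seg_id, segment in enumerate(pm_segments):           # newer PM data overrides
                for pos, (key, _) in enumerate(segment):
                    index[key] = ("pm", seg_id, pos)
            return index

        segments = [[("k1", "new")]]
        partitions = [[[("k1", "old"), ("k2", "v2")]]]
        index = rebuild_index(segments, partitions)
        assert index["k1"] == ("pm", 0, 0)
        assert index["k2"] == ("ssd", 0, 0, 1)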

  • MSST '20

    20

    Outline

    • Background & Motivation

    • Design

    ✓ Evaluation

    • Conclusion

  • MSST '20

    Experiment Setup

    • System and hardware configuration

    – Two Intel Xeon Gold 5215 CPUs (2.5 GHz), 64 GB memory, and one 400 GB Intel DC P3700 SSD.

    – CentOS Linux release 7.6.1810 with the 4.18.8 kernel and the ext4 file system.

    • Compared systems
    – LevelDB, RocksDB
    – NoveLSM, SLM-DB

    • Workloads
    – db_bench as microbenchmark
    – YCSB as the real-world workload

    21

    YCSB workload descriptions:
    A: 50% reads and 50% updates
    B: 95% reads and 5% updates
    C: 100% reads
    D: 95% reads for latest keys and 5% inserts
    E: 95% scans and 5% inserts
    F: 50% reads and 50% read-modify-writes

  • MSST '20

    Reducing write amplification

    22

    The write amplification of LightKV is 7.1x, 5.1x, 2.9x, and 2.3x lower than that of LevelDB, RocksDB, NoveLSM, and SLM-DB, respectively.

    When the total amount of written data increases, the write amplification of LightKV remains stable (e.g. from 1.6 to 1.8 when the data amount increases from 50 GB to 100 GB).

  • MSST '20

    Basic Operations

    23

    Thanks to the global index and partition-based compaction, LightKV effectively reduces read and write amplification and improves read and write performance.

    13.5x, 8.3x, 5.0x, 4.0x; 4.5x, 1.9x, 4.2x, 1.3x

  • MSST '20

    Basic Operations

    24

    LightKV's short range query performance is lower, because a short range query has to search all SSTables in one or more partitions.

    reduced by 24.3% and 13.2%

  • MSST '20

    Tail latency under read-write workload

    25

    Thanks to lower write amplification and global indexing, LightKV provides lower and more stable read and write tail latencies.


    99th: 17.9x, 10.5x, 6.4x, 3.5x; 99.9th: 15.7x, 9.2x, 8.8x, 3.4x

  • MSST '20

    Results with YCSB

    26

    LightKV provides better throughput under these simulated real-world workloads.

  • MSST '20

    27

    Outline

    • Background & Motivation

    • Design

    • Evaluation

    ✓ Conclusion

  • MSST '20

    Conclusion

    • LSM-Tree based key-value stores on traditional storage devices face problems such as read and write amplification.

    • At the same time, the emergence of non-volatile memory provides both opportunities and challenges for building efficient key-value storage systems.

    • In this paper, we propose LightKV, a cross-media key-value store with persistent memory. LightKV effectively reduces the system's read and write amplification by building an RH-Tree and adopting partition-based data compaction.

    • The experiment results show that LightKV reduces write amplification by up to 8.1x and improves read performance by up to 9.2x. It also reduces read tail latency by up to 18.8x under read-write mixed workload.

    28

  • MSST '20

    29

    THANK YOU! Q & A

    Author Email: [email protected]

  • MSST '20

    Sensitivity analysis

    30

    As the maximum number of partitions increases, the read and write performance of LightKV improves, but the NVM capacity consumption also increases.

  • MSST '20

    Sensitivity analysis

    31

    As the compaction size increases, the merge frequency decreases and write amplification drops, which improves write performance but hurts read performance.

