HMVFS: A Hybrid Memory Versioning File System Shengan Zheng, Linpeng Huang, Hao Liu, Linzhu Wu, Jin Zha Department of Computer Science and Engineering Shanghai Jiao Tong University
Transcript
Page 1: HMVFS: A Hybrid Memory Versioning File System (storageconference.us/2016/Slides/ShenganZheng.pdf)

HMVFS: A Hybrid Memory Versioning File System

Shengan Zheng, Linpeng Huang, Hao Liu, Linzhu Wu, Jin Zha

Department of Computer Science and Engineering

Shanghai Jiao Tong University

Page 2

Outline

• Introduction

• Design

• Implementation

• Evaluation

• Conclusion

Page 3

Introduction

• Emerging Non-Volatile Memory (NVM)
  • Persistency as disk
  • Byte addressability as DRAM

• Current file systems for NVM
  • PMFS, SCMFS, BPFS
  • Non-versioning, unable to recover old data

• Hardware and software errors
  • Large dataset and long execution time
  • Fault tolerance mechanism is needed

• Current versioning file systems
  • BTRFS, NILFS2
  • Not optimized for NVM

Page 4

Design Goals

• Strong consistency
  • A Stratified File System Tree (SFST) represents the snapshot of the whole file system
  • Atomic snapshotting is ensured

• Fast recovery
  • Almost no redo or undo overhead in recovery

• High performance
  • Utilize the byte-addressability of NVM to update the tree metadata at the granularity of bytes
  • Log-structured updates to files balance the endurance of NVM
  • Avoid write amplification

• User friendly
  • Snapshots are created automatically and transparently

Page 5

Overview

• HMVFS is an NVM-friendly log-structured versioning file system

• Space-efficient file system snapshotting

• HMVFS decouples tree metadata from tree data

• High performance and consistency guarantee

• POSIX compliant

Page 6

Outline

• Introduction

• Design

• Implementation

• Evaluation

• Conclusion

Page 7

On-Memory Layout

• DRAM: cache and journal

• NVM:
  • Sequential write zone
    • File metadata and data
    • Tree data
  • Random write zone
    • File system metadata
    • Tree metadata

[Figure: on-memory layout. DRAM: Node Address Tree Cache (NAT Cache), Segment Information Table Journal, Checkpoint Information Tree (CIT). NVM: Superblock (two copies); Auxiliary Information with the Block Information Table (BIT) and Segment Information Table (SIT), random writes; Main Area (SFST) with Checkpoint Blocks (CP), Node Address Tree (NAT), Node Blocks, and Data Blocks, sequential writes.]

Page 8

Index Structure in traditional Log-structured File Systems

• Update propagation problem

[Figure: index structure. An inode block holds metadata, direct pointers or inline data, and single-, double-, and triple-indirect pointers, which lead through direct nodes and indirect nodes to data blocks. Updated blocks: data block, direct node, indirect node, inode.]

Page 9

Index Structure without write amplification

• Node Address Table

[Figure: index structure. An inode block holds metadata, direct pointers or inline data, and single-, double-, and triple-indirect pointers, which lead through direct nodes and indirect nodes to data blocks. Updated blocks: data block, direct node.]

Node Address Table

Node-ID   Address
…         …
n-1       0x38
n         0x42 (new address shown: 0x73)
n+1       0x24
…         …
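The effect of the Node Address Table can be illustrated with a small Python sketch. Every name here (`log`, `nat`, `append`, `write_node`, `update_data_block`) is hypothetical and the block layout is heavily simplified; it is not the HMVFS implementation. The point is only that parents refer to child nodes by node ID, so relocating a node in the log changes one table entry instead of rewriting every ancestor.

```python
# Hypothetical sketch: an append-only log plus a node address table (NAT).
log = []   # append-only "main area"; the index of a block is its address
nat = {}   # node ID -> current address of that node block


def append(block):
    """Write a block at the end of the log and return its address."""
    log.append(block)
    return len(log) - 1


def write_node(node_id, node):
    """Log-structured node update: write a new copy, then repoint the NAT entry."""
    nat[node_id] = append(node)


def read_node(node_id):
    """Resolve a node ID to its current block through the NAT."""
    return log[nat[node_id]]


def update_data_block(direct_id, slot, payload):
    """Rewrite one data block; only its direct node and one NAT entry change."""
    direct = dict(read_node(direct_id))
    direct["data"] = list(direct["data"])
    direct["data"][slot] = append(payload)
    write_node(direct_id, direct)


# An inode (node 1) refers to its direct node by ID (node 2), not by address.
data_addr = append(b"old data")
write_node(2, {"type": "direct", "data": [data_addr]})
write_node(1, {"type": "inode", "children": [2]})
inode_addr_before = nat[1]

update_data_block(2, 0, b"new data")
```

After the update, the inode block is still at its old address: the write produced one new data block and one new direct node, and the change stopped at the table.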

Page 10

Index Structure for versioning

• Node Address Table with the dimension of version.

[Figure: index structure. An inode block holds metadata, direct pointers or inline data, and single-, double-, and triple-indirect pointers, which lead through direct nodes and indirect nodes to data blocks. Updated blocks: data block, direct node.]

Node Address Table with Version

Node-ID   Version1   Version2   Version3
…         …          …          …
n-1       0x14       0x38       0x38
n         0x42       0x42       0x73
n+1       0x24       0x24       0x24
…         …          …          …
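A naive way to add the version dimension is to keep one complete table per version. The Python sketch below does exactly that with the addresses shown on this slide; `N`, `versions`, `lookup`, and `changed_entries` are hypothetical names, and `N = 100` merely stands in for the slide's symbolic node ID "n".

```python
# Hypothetical sketch: one full node address table per version (naive layout).
N = 100  # stands in for the slide's symbolic node ID "n"; the value is arbitrary

versions = {
    1: {N - 1: 0x14, N: 0x42, N + 1: 0x24},
    2: {N - 1: 0x38, N: 0x42, N + 1: 0x24},
    3: {N - 1: 0x38, N: 0x73, N + 1: 0x24},
}


def lookup(version, node_id):
    """Address of a node block as seen by the given version."""
    return versions[version][node_id]


def changed_entries(old, new):
    """Node IDs whose address differs between two versions."""
    return sorted(
        node_id
        for node_id, addr in versions[new].items()
        if versions[old].get(node_id) != addr
    )
```

In this example each new version changes a single entry, yet the naive layout stores every entry again, which is the space cost that a shared tree representation is meant to avoid.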

Page 11

How to store different trees space-efficiently

• Node Address Tree (NAT)
  • A four-level B-tree to store multi-version Node Address Table space-efficiently
  • Adopt the idea of CoW friendly B-tree
  • NAT leaves contain NodeID-address pairs
  • Other tree blocks in NAT contain pointers to lower level blocks

[Figure: Node Address Tree. NAT roots, NAT internal blocks, and NAT leaves above the node blocks (inode, indirect node, direct node).]

[Figure: CoW friendly B-tree, original and new, labeled with (block, reference count) pairs. Original: P,1; A,1 B,1 C,1 D,1; E,1 F,1. New: P,1; A,1 B,1 C,2 D,1; E,2 F,1; Q,1; D',1; F',1. A third label set also appears: P,0; A,1 B,1 C,1 D,0; E,1 F,0; Q,1; D',1; F',1.]
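The path-copying idea behind a copy-on-write tree can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the NAT code: `Block`, `build`, `cow_update`, and `lookup` are hypothetical names, the tree has three levels and a fanout of four to keep the demo small (the slide describes a four-level tree), and reference counts are updated eagerly rather than lazily.

```python
# Hypothetical sketch of path copying in a copy-on-write tree (not HMVFS code).
FANOUT = 4
LEVELS = 3  # the slide describes a four-level tree; three keeps the demo small


class Block:
    def __init__(self, slots):
        self.slots = slots   # child Blocks, or node addresses in a leaf
        self.refcount = 1    # number of parents pointing at this block


def build(level=LEVELS):
    """Create an empty tree with the given number of levels."""
    if level == 1:
        return Block([None] * FANOUT)
    return Block([build(level - 1) for _ in range(FANOUT)])


def cow_update(root, node_id, address, level=LEVELS):
    """Return a new root mapping node_id to address; untouched subtrees are shared."""
    span = FANOUT ** (level - 1)        # node IDs covered by each slot
    idx, rest = divmod(node_id, span)
    new = Block(list(root.slots))       # copy only the block on the update path
    if level == 1:
        new.slots[idx] = address
        return new
    for i, child in enumerate(root.slots):
        if i != idx:
            child.refcount += 1         # now shared by the old and new version
    new.slots[idx] = cow_update(root.slots[idx], rest, address, level - 1)
    return new


def lookup(root, node_id, level=LEVELS):
    """Walk from a version's root down to the address stored for node_id."""
    span = FANOUT ** (level - 1)
    idx, rest = divmod(node_id, span)
    if level == 1:
        return root.slots[idx]
    return lookup(root.slots[idx], rest, level - 1)
```

Each update allocates one new block per level and leaves the old root intact, so both versions stay readable while sharing every subtree off the update path.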

Page 12

Stratified File System Tree (SFST)

• Four different categories of blocks:
  • Checkpoint layer
  • Node Address Tree (NAT) layer
  • Node layer
  • Data layer

• All blocks from SFST are stored in the main area with log-structured writes
  • Balance the endurance of NVM media

• Each SFST represents a valid snapshot of file system
  • Share overlapped blocks to achieve space-efficiency

[Figure: Stratified File System Tree for an original snapshot and a new snapshot. Layers from top to bottom: checkpoint (CP blocks), Node Address Tree (NAT root, NAT internal, NAT leaf), node (inode, indirect node, direct node), data (data blocks).]

Page 13

Stratified File System Tree (SFST)

• The metadata of SFST
  • In auxiliary information zone
  • Random write updates

• Segment Information Table (SIT)
  • Contains the status information of every segment

• Block Information Table (BIT)
  • Keeps the information of every block
  • Update precisely at variable bytes granularity
  • Contains:
    • Start and end version number
    • Block type
    • Node ID
    • Reference count

[Figure: Stratified File System Tree for an original snapshot and a new snapshot. Layers from top to bottom: checkpoint (CP blocks), Node Address Tree (NAT root, NAT internal, NAT leaf), node (inode, indirect node, direct node), data (data blocks).]

Page 14

Garbage Collection in HMVFS

• Move all the valid blocks in the victim segment to the current segment

• When finished, update SIT and create a snapshot

• Handle block sharing problem

[Figure: block sharing across versions in Segment A and Segment B. Version 1: NAT block, node block 1, node block 2. Versions 2, 3, and 4: each a NAT block with node block 2.]
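The cleaning steps listed above can be sketched as follows. This is a minimal Python illustration, not the HMVFS collector: `clean_segment`, `segments`, `valid`, and `on_move` are hypothetical names, segments are plain lists, and the `on_move` callback stands in for whatever repoints the parents of a moved block, which is where the block sharing problem arises.

```python
# Hypothetical sketch of segment cleaning (not HMVFS code).
def clean_segment(segments, valid, victim, current, on_move):
    """Move the valid blocks of `victim` into `current`; return how many moved.

    segments: dict mapping a segment name to its list of blocks
    valid:    set of (segment, offset) addresses that are still referenced
    on_move:  callback(old_addr, new_addr) that repoints the block's parents
    """
    moved = 0
    for offset, block in enumerate(segments[victim]):
        old_addr = (victim, offset)
        if old_addr not in valid:
            continue                       # garbage: nothing to preserve
        segments[current].append(block)
        new_addr = (current, len(segments[current]) - 1)
        valid.discard(old_addr)
        valid.add(new_addr)
        on_move(old_addr, new_addr)        # a shared block may have several parents
        moved += 1
    segments[victim] = []                  # the victim segment is free again
    return moved
```

A caller would follow this with the SIT update and snapshot creation that the slide mentions; both are outside the sketch.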

Page 15

Outline

• Introduction

• Design

• Implementation

• Evaluation

• Conclusion

Page 16

Block Information Table (BIT)

• Block sharing problem
  • The corresponding pointer in the parent block must be updated if a new child block is written in the main area

• Node ID and block type
  • Used to locate parent node

Type of the block    Type of the parent    Node ID
Checkpoint           N/A                   N/A
NAT internal         NAT internal          Index code in NAT
NAT leaf             NAT internal          Index code in NAT
Inode                NAT leaf              Node ID
Indirect             NAT leaf              Node ID
Direct               NAT leaf              Node ID
Data                 Inode or direct       Node ID of parent node

Page 17

Block Information Table (BIT)

• Start and end version number
  • The first and last versions in which the block is valid
  • Operations like write and delete set these two variables to the current version number

• Reference count
  • The number of parent nodes which are linked to the block
  • Update with lazy reference counting
  • File level operations and snapshot level operations update the reference count
  • If the count reaches zero, the block will become garbage
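A per-block record with these fields might look like the Python sketch below. The field and function names (`BitEntry`, `on_write`, `on_delete`, `valid_in`, `is_garbage`) are hypothetical, and two readings are assumed rather than confirmed by the slides: a write sets the start version, a delete sets the end version, and the end version is inclusive. Lazy reference counting is not modeled.

```python
# Hypothetical sketch of a Block Information Table entry (not the HMVFS layout).
from dataclasses import dataclass
from typing import Optional


@dataclass
class BitEntry:
    block_type: str                    # e.g. "data", "inode", "NAT leaf"
    node_id: Optional[int]             # helps locate the parent; None if not applicable
    start_version: int                 # first version in which the block is valid
    end_version: Optional[int] = None  # last valid version; None while still live
    refcount: int = 1                  # number of parents linked to the block

    def valid_in(self, version):
        """True if the block belongs to the given version (end is inclusive here)."""
        if version < self.start_version:
            return False
        return self.end_version is None or version <= self.end_version

    def is_garbage(self):
        """A block with no remaining parent can be reclaimed."""
        return self.refcount == 0


def on_write(block_type, node_id, current_version):
    """Assumed behavior: a write stamps the start version."""
    return BitEntry(block_type, node_id, start_version=current_version)


def on_delete(entry, current_version):
    """Assumed behavior: a delete stamps the end version."""
    entry.end_version = current_version
```

Keeping the record this small is what makes byte-granularity updates attractive: changing a version number or a count touches a few bytes rather than a whole block.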

Page 18

Snapshot Creation

• Strong consistency is guaranteed

• Flush dirty NAT entries from DRAM to form a new Node Address Tree

• Follow the bottom-up procedure

• Status information is stored in the checkpoint block

• Space-efficient snapshot

• The atomicity of snapshot creation is ensured
  • Atomic update to the pointer in the superblock to announce the validity of the new snapshot

• A crash during snapshot creation can be recovered by undo or redo, depending on the validity of the new snapshot

[Figure: snapshot creation. The SuperBlock points to the checkpoint layer (CP blocks) of the original snapshot and the new snapshot; below it are the Node Address Tree (NAT root, NAT internal, NAT leaf), the node layer (inode, indirect node, direct node), and the data blocks.]
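The commit step can be illustrated with a small Python sketch in which a snapshot becomes valid only through a single pointer update. The names (`create_snapshot`, `store`, `superblock`, `valid_cp`, `Crash`) are hypothetical, and the NAT is flattened into a dictionary that is copied whole for brevity, which is not how a space-efficient tree would be written.

```python
# Hypothetical sketch: a snapshot becomes valid only through one pointer update.
class Crash(Exception):
    """Simulated failure between writing the blocks and committing them."""


def create_snapshot(store, superblock, dirty_nat, crash_before_commit=False):
    """Append a new NAT and checkpoint to `store`, then commit via the superblock."""
    old_cp = store[superblock["valid_cp"]]

    # Bottom-up: write the NAT contents first, then the checkpoint pointing at them.
    nat = dict(store[old_cp["nat"]])
    nat.update(dirty_nat)
    store.append(nat)
    nat_addr = len(store) - 1

    store.append({"version": old_cp["version"] + 1, "nat": nat_addr})
    cp_addr = len(store) - 1

    if crash_before_commit:
        raise Crash()                   # new blocks exist but nothing points at them

    superblock["valid_cp"] = cp_addr    # the single commit point
    return cp_addr
```

If the simulated crash fires, the superblock still names the previous checkpoint, so the old snapshot stays mountable and the orphaned blocks can simply be discarded.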

Page 19

Snapshot Deletion

• Deletion starts from the checkpoint block
  • Checkpoint cache is stored in DRAM

• Follows the top-down procedure to decrease reference counts

• Consistency is ensured by journaling

• Call garbage collection afterwards
  • Many reference counts have decreased to zero

[Figure: two trees labeled with (block, reference count) pairs. First: P,0; A,1 B,1 C,1 D,0; E,1 F,0; Q,1; D',1; F',1. Second: P,1; A,1 B,1 C,2 D,1; E,2 F,1; Q,1; D',1; F',1.]
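The top-down procedure can be sketched with plain reference counting. This Python example is an assumption-laden illustration: `Block` and `delete_snapshot` are hypothetical names, counts are maintained eagerly rather than lazily, and the journaling that protects the real operation is omitted.

```python
# Hypothetical sketch of top-down snapshot deletion by reference counting.
class Block:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.refcount = 0               # raised once by each parent that links here
        for child in self.children:
            child.refcount += 1


def delete_snapshot(root, garbage):
    """Drop one reference to `root`; descend only where a count reaches zero."""
    root.refcount -= 1
    if root.refcount > 0:
        return                          # still shared with another snapshot
    garbage.append(root.name)           # no parent left: the block is garbage
    for child in root.children:
        delete_snapshot(child, garbage)
```

The walk stops at the first shared block on each path, so deleting a snapshot touches only the blocks that belonged to it alone, and the collected names are what a later garbage collection pass would reclaim.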

Page 20

Crash Recovery

• Mount the writable last completed snapshot
  • No additional recovery overhead

• Mount the read-only old snapshots
  • Locate the checkpoint block of the snapshot
  • Retrieve files via SFST

[Figure: the superblock and four checkpoints, each with a NAT root.]

Page 21

Outline

• Introduction

• Design

• Implementation

• Evaluation

• Conclusion

Page 22

Evaluation

• Experimental Setup
  • A commodity server with 64 Intel Xeon 2GHz processors and 512GB DRAM
  • Performance comparison with PMFS, EXT4, BTRFS, NILFS2

• Postmark results
  • Different read bias numbers

[Figure: Transaction performance. Y axis: sec (0 to 180); X axis: Percentage of Reads (0% to 100%); series: HMVFS, BTRFS, NILFS2, EXT4, PMFS.]

[Figure: Snapshotting efficiency. Y axis: Efficiency (sec-1) (0 to 120); X axis: Percentage of Reads (0% to 100%); series: HMVFS, BTRFS, NILFS2.]

Slide annotation: 2.7x and 2.3x

Page 23

Evaluation

• Filebench results
  • Fileserver
  • Different numbers of files

[Figure: Throughput performance. Y axis: ops/sec (x1000) (0 to 25); X axis: Number of Files (2k, 4k, 8k, 16k); series: HMVFS, BTRFS, NILFS2, EXT4, PMFS.]

[Figure: Snapshotting efficiency. Y axis: Efficiency (sec-1) (0 to 40); X axis: Number of Files (2k, 4k, 8k, 16k); series: HMVFS, BTRFS, NILFS2.]

Slide annotation: 9.7x and 6.6x

Page 24

Evaluation

• Filebench results
  • Varmail
  • Different depths of directories

[Figure: Throughput performance. Y axis: ops/sec (x1000) (0 to 25); X axis: Directory Depth (0.7, 1.2, 1.4, 2.1); series: HMVFS, BTRFS, NILFS2, EXT4, PMFS.]

[Figure: Snapshotting efficiency. Y axis: Efficiency (sec-1) (0 to 12); X axis: Directory Depth (0.7, 1.2, 1.4, 2.1); series: HMVFS, BTRFS, NILFS2.]

Slide annotation: 8.7x and 2.5x

Page 25

Outline

• Introduction

• Design

• Implementation

• Evaluation

• Conclusion

Page 26

Conclusion

• HMVFS is the first file system to solve the consistency problem for NVM-based in-memory file systems using snapshotting.

• Metadata of the Stratified File System Tree (SFST) is decoupled from data and is updated at byte granularity.

• HMVFS stores snapshots space-efficiently with shared blocks in SFST and handles the write amplification and block sharing problems well.

• HMVFS exploits the structural benefit of the CoW-friendly B-tree and the byte-addressability of NVM to take frequent snapshots automatically.

• HMVFS outperforms traditional versioning file systems in snapshotting and performance while providing a strong consistency guarantee and having little impact on foreground operations.

Page 27

• Q & A

• Thank you

