Page 1: NOVA: A High -Performance, Hardened File System for …storageconference.us/2017/Presentations/Swanson.pdf · NOVA: A High -Performance, Hardened File System for Non-Volatile Main

1

NOVA: A High-Performance, Hardened File System for Non-Volatile Main Memories

Jian Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah, Amit Borase, Tamires Brito Da Silva, Andy Rudoff (Intel), Steven Swanson

Non-Volatile Systems Laboratory
Department of Computer Science and Engineering
University of California, San Diego

Page 2

NVDIMM Usage Models

• Legacy File IO Acceleration: fast and easy
  – Run existing IO-intensive apps on NVDIMMs
  – “Just works”
  – NOVA is 30% to 10x faster than Ext4 for write-intensive workloads
  – Need strong protections on data
• DAX mmap: maximum speed + programming challenges
  – Load-store access
  – You still need a strongly consistent file system
    • File system corruption can still destroy your data
    • NOVA is strongly consistent
  – Data protection is still critical

[Figure: Legacy IO Throughput. Y axis: ops per second (x1000), 0 to 450. Series: Ext4-datajournal, NOVA.]

Page 3

[Figure: disk-based file systems: XFS, EXT4, F2FS, BTRFS, NILFS]

Page 4

Disk-based file systems are inadequate for NVMM

• Disk-based file systems cannot exploit NVMM performance
• Performance optimization compromises consistency on system failure [1]

           Atomicity                                                      Data Protection
           1-Sector   1-Sector  1-Block    1-Block  N-Block    N-Block
           overwrite  append    overwrite  append   overwrite  append    Data  Metadata  Snapshots
Ext4wb     ✓          ✗         ✗          ✗        ✗          ✗         ✗     ✓         ✓
Ext4Order  ✓          ✓         ✗          ✓        ✗          ✓         ✗     ✓         ✓
Ext4Dataj  ✓          ✓         ✓          ✓        ✗          ✓         ✗     ✓         ✓
Btrfs      ✓          ✓         ✓          ✓        ✗          ✓         ✓     ✓         ✓
xfs        ✓          ✓         ✗          ✓        ✗          ✓         ✗     ✓         ✓
Reiserfs   ✓          ✓         ✗          ✓        ✗          ✓         ✗     ✓         ✓

[1] Pillai et al., All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications, OSDI '14.

Page 5

[Figure: NVMM file systems: XFS-DAX, EXT4-DAX, BPFS, SCMFS, PMFS, Aerie, M1FS]

Page 6

NVMM file systems don’t provide strong consistency or data protection

• DAX does not provide data atomicity guarantees
• Programming is more difficult

         Atomicity        Data Protection
         Metadata  Data   Data  Metadata  Snapshots
BPFS     ✓         ✓      ✗     ✗         ✗
PMFS     ✓         ✗      ✗     ✗         ✗
Ext4DAX  ✓         ✗      ✗     ✓         ✗
XFSDAX   ✓         ✗      ✗     ✓         ✗
SCMFS    ✗         ✗      ✗     ✗         ✗
Aerie    ✓         ✗      ✗     ✗         ✗

Page 8

NOVA provides strong atomicity guarantee

         Atomicity        Data Protection
         Metadata  Data   Data  Metadata  Snapshots
BPFS     ✓         ✓      ✗     ✗         ✗
PMFS     ✓         ✗      ✗     ✗         ✗
Ext4DAX  ✓         ✗      ✗     ✓         ✗
XFSDAX   ✓         ✗      ✗     ✓         ✗
SCMFS    ✗         ✗      ✗     ✗         ✗
Aerie    ✓         ✗      ✗     ✗         ✗
NOVA     ✓         ✓      ✓     ✓         ✓

           Atomicity                                                      Data Protection
           1-Sector   1-Sector  1-Block    1-Block  N-Block    N-Block
           overwrite  append    overwrite  append   overwrite  append    Data  Metadata  Snapshots
Ext4wb     ✓          ✗         ✗          ✗        ✗          ✗         ✗     ✓         ✓
Ext4Order  ✓          ✓         ✗          ✓        ✗          ✓         ✗     ✓         ✓
Ext4Dataj  ✓          ✓         ✓          ✓        ✗          ✓         ✗     ✓         ✓
Btrfs      ✓          ✓         ✓          ✓        ✗          ✓         ✓     ✓         ✓
xfs        ✓          ✓         ✗          ✓        ✗          ✓         ✗     ✓         ✓
Reiserfs   ✓          ✓         ✗          ✓        ✗          ✓         ✗     ✓         ✓
NOVA       ✓          ✓         ✓          ✓        ✓          ✓         ✓     ✓         ✓

Page 9

NOVA’s Key Features

• Features
  – High-performance
  – Strong consistency
  – Snapshot support
  – Data protection
• Usage models
  – open()/close(), read()/write()
  – DAX-mmap()

Page 10

NOVA’s Architecture

Page 11

Log Structure + copy-on-write + Journals

• One log per inode
• Non-contiguous
• Fast, simple atomic updates
• Metadata only

[Figure: Core NOVA Structures. Labels: File log, Tail.]

Page 12

Log Structure + copy-on-write + Journals

• Multi-page atomic update
• Fast allocation
• Instant data GC

[Figure: Core NOVA Structures. Labels: File log, Tail (old and new positions), data pages Data 0 and Data 1, new copies Data 1 and Data 2.]

Page 13

Log Structure + copy-on-write + Journals

• Small, fixed-size journals
• For complex ops.

[Figure: Core NOVA Structures. Labels: File log and Directory log with their Tail pointers, and a Journal holding the Dir tail and the File tail.]
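
The combination of a per-inode log, copy-on-write data pages, and a tail pointer can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the names `InodeLog`, `WriteEntry`, and `append_write` are hypothetical and do not come from the NOVA source, and a Python integer stands in for the tail pointer that the slides describe as being updated with an 8-byte atomic store.

```python
"""Toy model of a per-inode log with copy-on-write data and a tail commit.

All names here (InodeLog, WriteEntry, ...) are illustrative assumptions,
not identifiers from the NOVA source code.
"""
from dataclasses import dataclass, field


@dataclass
class WriteEntry:
    page_index: int      # which file page this entry describes
    data_page: bytes     # stands in for a pointer to a freshly allocated page


@dataclass
class InodeLog:
    entries: list = field(default_factory=list)  # log entries, oldest first
    tail: int = 0                                # committed entries are entries[:tail]

    def append_write(self, pages: dict, crash_before_commit: bool = False) -> None:
        """Update several pages atomically: copy-on-write, append, move the tail."""
        # 1. Write new data into new pages; old pages are left untouched.
        new_entries = [WriteEntry(i, bytes(d)) for i, d in sorted(pages.items())]
        # 2. Append the entries past the current tail (not yet visible).
        del self.entries[self.tail:]          # drop uncommitted leftovers, if any
        self.entries.extend(new_entries)
        if crash_before_commit:
            return                            # simulated power failure
        # 3. Commit with a single tail update.
        self.tail = len(self.entries)

    def read(self) -> dict:
        """Rebuild the file's page map by replaying only committed entries."""
        view = {}
        for entry in self.entries[: self.tail]:
            view[entry.page_index] = entry.data_page
        return view


log = InodeLog()
log.append_write({0: b"A0", 1: b"A1"})
log.append_write({1: b"B1", 2: b"B2"}, crash_before_commit=True)
after_crash = log.read()      # the torn two-page update is invisible
log.append_write({1: b"B1", 2: b"B2"})
after_commit = log.read()
```

In this model a multi-page update becomes visible all at once, because readers only replay entries below the tail; anything appended after the tail is ignored after a crash.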

Page 14

Supporting Backups with Snapshots

Page 15

Snapshots for Normal File Access

[Figure: a file log whose entries carry epoch IDs. The current epoch advances 0, 1, 2 as Snapshot 0 and Snapshot 1 are taken; file write entries point to data pages at 0x1000, 0x2000, 0x3000, and 0x4000. Legend: snapshot entry, file write entry, epoch ID, data in snapshot, reclaimed data, current data.]

Page 16

Corrupt Snapshots with DAX-mmap()

• Recovery invariant: if V == True, then D is valid
  – Incorrect: naively mark pages read-only one at a time

[Figure: timeline. The application executes D = 1; V = True; while a snapshot is taken. The page hosting D and the page hosting V change from R/W to RO at different times; a store to an RO page causes a page fault and copy-on-write. The snapshot captures D = ? and V = True.]

Page 17

Consistent Snapshots with DAX-mmap()

• Recovery invariant: if V == True, then D is valid
  – Correct: block page faults until all pages are read-only

[Figure: timeline. The application executes D = 1; V = True; while a snapshot is taken. The page fault on the page hosting D is blocked (RO, blocking) until the page hosting V is also RO, then copy-on-write proceeds. The snapshot captures D = ? and V = False.]
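
The difference between the two slides can be reproduced with a tiny scheduling model. The sketch below is an illustration only: the two-page layout, the worst-case interleaving (the application runs between each pair of page freezes), and the function names are assumptions, not NOVA code.

```python
"""Toy schedule showing why pages must not be frozen one at a time.

The page names, the store order, and the worst-case interleaving are
assumptions made for illustration; this is not NOVA's kernel code.
"""


def take_snapshot(block_faults: bool) -> dict:
    live = {"D": None, "V": False}        # contents of the two mmap'ed pages
    snapshot = {}                         # page contents frozen so far
    pending = [("D", 1), ("V", True)]     # application stores, in program order

    def run_application() -> None:
        # Let the application run until it finishes or a page fault is held.
        while pending:
            page, value = pending[0]
            frozen = page in snapshot
            snapshot_in_progress = len(snapshot) < len(live)
            if frozen and block_faults and snapshot_in_progress:
                return                    # fault blocked until every page is RO
            # A store to a frozen page is redirected to a copy (copy-on-write),
            # so the frozen contents in `snapshot` never change.
            live[page] = value
            pending.pop(0)

    for page in ("D", "V"):               # mark pages read-only one at a time
        snapshot[page] = live[page]
        run_application()                 # worst case: the app runs in between
    return snapshot


def invariant_holds(snap: dict) -> bool:
    """Recovery invariant from the slides: if V == True, then D is valid."""
    return (not snap["V"]) or snap["D"] is not None


naive = take_snapshot(block_faults=False)
blocking = take_snapshot(block_faults=True)
```

With `block_faults=False` the snapshot records V as True while D is still unset, which violates the invariant; with `block_faults=True` the application's first faulting store waits until both pages are read-only, so the snapshot records the consistent pre-update state.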

Page 18

Performance impact of snapshots

• Normal execution vs. taking snapshots every 10 s
  – Negligible performance loss through read()/write()
  – Average performance loss of 6.2% through mmap()

[Figure: two panels, “Conventional workloads” and “NVMM-aware workloads from WHISPER”.]

Page 19

Data Protection: Metadata

Page 20

NVMM Failure Modes: Media Failures

• Media errors
  – Detectable & correctable
    • Transparent to software
  – Detectable & uncorrectable
    • Affect a contiguous range of data
    • Raise machine check exception (MCE)
  – Undetectable
    • May consume corrupted data
• Software scribbles
  – Kernel bugs or own bugs
  – Transparent to hardware

[Figure: software reads NVMM data that contains a media error; the NVMM controller detects and corrects the error; software consumes good data.]

Page 21

NVMM Failure Modes: Media Failures

• Media errors
  – Detectable & correctable
    • Transparent to software
  – Detectable & uncorrectable
    • Affect a contiguous range of data
    • Raise machine check exception (MCE)
  – Undetectable
    • May consume corrupted data
• Software scribbles
  – Kernel bugs or own bugs
  – Transparent to hardware

[Figure: software reads NVMM data that contains a media error with a poison radius (PR), e.g. 512 bytes; the NVMM controller detects an uncorrectable error and raises an exception; software receives an MCE.]

Page 22

Detecting NVMM Media Errors

memcpy_mcsafe()
• Copy data from NVMM
• Catch MCEs and return failure

[Figure: MCE handling flowchart. Decision nodes: Recoverable or Unrecoverable; Whose access? (Kernel or User); Handler registered? (Yes or No). Outcomes: Process and return; SIGBUS; Kernel panic.]

Page 23

NVMM Failure Modes: Media Failures

• Media errors
  – Detectable & correctable
    • Transparent to software
  – Detectable & uncorrectable
    • Affect a contiguous range of data
    • Raise machine check exception (MCE)
  – Undetectable
    • Consume corrupted data
• Software scribbles
  – Kernel bugs or own bugs
  – Transparent to hardware

[Figure: software reads NVMM data that contains a media error; the NVMM controller sees no error; software consumes corrupted data.]

Page 24

NVMM Failure Modes: Scribbles

• Media errors
  – Detectable & correctable
    • Transparent to software
  – Detectable & uncorrectable
    • Affect a contiguous range of data
    • Raise machine check exception (MCE)
  – Undetectable
    • Consume corrupted data
• Software “scribbles”
  – Kernel bugs or NOVA bugs
  – NVMM file systems are highly vulnerable

[Figure: buggy code scribbles NVMM with a write; the NVMM controller updates ECC; the NVMM data now holds a scribble error.]

Page 25

NVMM Failure Modes: Scribbles

• Media errors
  – Detectable & correctable
    • Transparent to software
  – Detectable & uncorrectable
    • Affect a contiguous range of data
    • Raise machine check exception (MCE)
  – Undetectable
    • Consume corrupted data
• Software “scribbles”
  – Kernel bugs or NOVA bugs
  – NVMM file systems are highly vulnerable

[Figure: software reads NVMM data that holds a scribble error; the NVMM controller sees no error; software consumes corrupted data.]

Page 26

NOVA Metadata Protection

• Replicate everything
  – Inodes
  – Logs
  – Superblock
  – …
• CRC32 checksums everywhere

[Figure: an inode (Head, Tail, csum, H1, T1) and its replica inode’ (Head’, Tail’, csum’, H1’, T1’); a log of entries ent1 … entN with checksums c1 … cN and its replica ent1’ … entN’ with c1’ … cN’; data pages Data 1 and Data 2.]

Page 27

Defense Against Scribbles

• Tolerating larger scribbles
  – Allocate replicas far from one another
  – Can tolerate arbitrarily large scribbles to metadata
• Preventing scribbles
  – Mark all NVMM as read-only
  – Disable CPU write protection while accessing NVMM

Page 28

Data Protection: Data

Page 29

NOVA Data Protection

• Divide 4KB blocks into 512-byte stripes
• Compute a RAID 5-style parity stripe
• Compute and replicate checksums for each stripe

[Figure: one block divided into 512-byte stripe segments S0 S1 S2 S3 S4 S5 S6 S7, plus a parity stripe P.]

P = S0 ⊕ S1 ⊕ … ⊕ S7
Ci = CRC32C(Si), replicated
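
The striping scheme can be tried out with a short, self-contained sketch. It is not NOVA's implementation: the function names, the in-memory layout, and the choice to give the parity stripe its own checksum are assumptions made for illustration, and checksum replication is left out to keep the example short. CRC32C is computed bit by bit here purely for portability.

```python
"""Sketch of per-block protection: 8 data stripes, XOR parity, CRC32C each.

Assumptions for illustration: the function names, the in-memory layout, and
giving the parity stripe its own checksum. Checksum replication is omitted.
"""
import errno

BLOCK_SIZE = 4096
STRIPE_SIZE = 512


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def xor_stripes(stripes) -> bytes:
    """XOR equally sized stripes together, byte by byte."""
    out = bytearray(STRIPE_SIZE)
    for stripe in stripes:
        for i, byte in enumerate(stripe):
            out[i] ^= byte
    return bytes(out)


def protect(block: bytes):
    """Split a block into stripes and compute parity and checksums."""
    assert len(block) == BLOCK_SIZE
    stripes = [block[i:i + STRIPE_SIZE] for i in range(0, BLOCK_SIZE, STRIPE_SIZE)]
    parity = xor_stripes(stripes)
    checksums = [crc32c(s) for s in stripes + [parity]]
    return stripes, parity, checksums


def read_block(stripes, parity, checksums) -> bytes:
    """Verify every stripe; rebuild a single bad stripe from parity."""
    bad = [i for i, s in enumerate(stripes) if crc32c(s) != checksums[i]]
    if bad:
        parity_ok = crc32c(parity) == checksums[-1]
        if len(bad) > 1 or not parity_ok:
            raise OSError(errno.EIO, "more damage than one parity stripe can repair")
        lost = bad[0]
        survivors = [s for i, s in enumerate(stripes) if i != lost]
        rebuilt = xor_stripes(survivors + [parity])
        if crc32c(rebuilt) != checksums[lost]:
            raise OSError(errno.EIO, "rebuilt stripe fails its checksum")
        stripes = stripes[:lost] + [rebuilt] + stripes[lost + 1:]
    return b"".join(stripes)


block = bytes(range(256)) * 16
stripes, parity, checksums = protect(block)
stripes[3] = b"\xff" * STRIPE_SIZE          # simulate one damaged 512-byte stripe
recovered = read_block(stripes, parity, checksums)
```

A single damaged stripe is located by its checksum and rebuilt from the other seven stripes and the parity; damage to two stripes of the same block exceeds what one parity stripe can repair, so the read fails with EIO in this sketch.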

Page 30

File data protection with DAX-mmap

• With DAX-mmap(), file data changes are invisible to NOVA
• NOVA cannot protect mmap’ed file data
• NOVA logs mmap() and restores protection on munmap() or recovery

[Figure: user-space applications reach file data on the NVDIMMs either through NOVA’s read() and write() in kernel space (protected) or directly with load/store after mmap() (unprotected); the file log holds an mmap log entry.]

Page 31

File data protection with DAX-mmap

• NOVA cannot protect mmap’ed file data
  – User applications directly load/store the mmap’ed region
  – NOVA has to know what file pages are mmap’ed

[Figure: user-space applications, NOVA’s read() and write() in kernel space, the file log, and file data on the NVDIMMs; after mmap() and load/store, munmap() is called and protection is restored.]

Page 32

File data protection with DAX-mmap

• NOVA cannot protect mmap’ed file data
  – User applications directly load/store the mmap’ed region
  – NOVA has to know what file pages are mmap’ed

[Figure: user-space applications, NOVA’s read() and write() in kernel space, the file log, and file data on the NVDIMMs; the step shown after mmap() is System Failure + recovery.]
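
The mmap handling described on these slides (record which pages are mapped, skip checks while they are mapped, restore protection afterwards) can be modelled in a few lines. This is a toy under stated assumptions: the class and method names are hypothetical, `zlib.crc32` stands in for the file system's checksum, and a Python set stands in for the logged mmap entries.

```python
"""Toy model of suspending and restoring file data protection around mmap.

Assumptions for illustration: the class and method names, per-page
zlib.crc32 checksums, and a set standing in for the logged mmap entries.
"""
import errno
import zlib


class ProtectedFile:
    def __init__(self, pages):
        self.pages = [bytearray(p) for p in pages]
        self.checksums = [zlib.crc32(p) for p in self.pages]
        self.mmapped = set()              # stands in for mmap log entries

    def mmap(self, first: int, count: int) -> None:
        """Record that these pages are now written behind the file system's back."""
        self.mmapped.update(range(first, first + count))

    def store(self, page: int, offset: int, value: int) -> None:
        """A direct store: it bypasses the file system entirely."""
        self.pages[page][offset] = value

    def munmap(self, first: int, count: int) -> None:
        """Recompute checksums so the pages are protected again."""
        for page in range(first, first + count):
            self.checksums[page] = zlib.crc32(self.pages[page])
            self.mmapped.discard(page)

    def read(self, page: int) -> bytes:
        """read() path: verify the checksum unless the page is mmap'ed."""
        data = bytes(self.pages[page])
        if page not in self.mmapped and zlib.crc32(data) != self.checksums[page]:
            raise OSError(errno.EIO, "checksum mismatch")
        return data


f = ProtectedFile([b"aaaa", b"bbbb"])
f.mmap(1, 1)
f.store(1, 0, ord("X"))       # legitimate store through the mapping
during = f.read(1)            # no verification while the page is mapped
f.munmap(1, 1)
after = f.read(1)             # verified against the refreshed checksum
```

A store that hits a page outside any mapping, such as a scribble, is not followed by a checksum refresh, so the next read of that page fails verification.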

Page 33

Performance

Page 34

Performance Cost of Data Integrity

[Figure: bar chart, y axis 0 to 1.2. Workloads: Fileserver, Varmail, Webproxy, Webserver, RocksDB, MongoDB, Exim, TPCC, average. Series: xfs-DAX, ext4-DAX, ext4-dataj, Fortis baseline, w/ MP+WP, w/ MP+DP+WP.]

Page 35

Conclusion

• Existing file systems do not meet the requirements of applications on NVMM file systems
• NOVA’s multi-log design achieves high performance and strong consistency
• NOVA’s data protection features ensure data integrity
• NOVA outperforms existing file systems while providing stronger consistency and data protection guarantees

Page 36

Thank you!

Try NOVA!
https://github.com/NVSL/NOVA

Page 37

Backup Slides

Page 38

Protecting Against Scribbles

• Metadata allocator separates metadata replicas
  – Allocate primary and replica pages in opposite directions
  – Use allocator ‘dead-zone’ to guarantee minimal distance
  – Protect against scribbles from other kernel bugs and own bugs

[Figure: three layouts of primary log pages (log1, log2, …, logN) and replica log pages (log1’, log2’, …, logN’).]

• Simple allocation: a page-sized scribble can affect most pairs of replicated metadata pages
• Two-way allocation: a page-sized scribble can affect limited pairs of replicated metadata pages
• Dead-zone allocation (1 MB): a scribble less than 1 MB cannot corrupt any metadata
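
A minimal sketch of two-way allocation with a dead zone follows. It assumes a region that holds only these metadata pages, page-granular allocation, and hypothetical names (`TwoWayAllocator`, `alloc_pair`, `scribble_hits`); the 1 MB dead zone is the figure given on the slide.

```python
"""Sketch of two-way metadata allocation with a dead zone.

Assumptions for illustration: the class name, page-granular allocation, and
a region that holds only these metadata pages.
"""
PAGE_SIZE = 4096
DEAD_ZONE = 1 << 20                       # 1 MB, as on the slide


class TwoWayAllocator:
    def __init__(self, total_pages: int, dead_zone: int = DEAD_ZONE):
        self.dead_pages = -(-dead_zone // PAGE_SIZE)   # round up to whole pages
        self.next_primary = 0                          # grows upward
        self.next_replica = total_pages - 1            # grows downward

    def alloc_pair(self):
        """Return (primary_page, replica_page) or raise when the gap is used up."""
        gap = self.next_replica - self.next_primary - 1
        if gap < self.dead_pages:
            raise MemoryError("allocation would enter the dead zone")
        pair = (self.next_primary, self.next_replica)
        self.next_primary += 1
        self.next_replica -= 1
        return pair


def scribble_hits(page: int, start: int, length: int) -> bool:
    """True if the byte range [start, start + length) overlaps the page."""
    return start < (page + 1) * PAGE_SIZE and start + length > page * PAGE_SIZE


allocator = TwoWayAllocator(total_pages=1024)          # a 4 MB metadata region
pairs = []
while True:
    try:
        pairs.append(allocator.alloc_pair())
    except MemoryError:
        break
```

Because every primary page lies below the dead zone and every replica page lies above it, a contiguous scribble shorter than the dead zone cannot overlap both a primary page and a replica page in this model.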

Page 39

Minimize the chance of corruptions: x86 write protection

• Leverage x86 CPU’s write protection
  – CR0.WP enables or disables writes to read-only memory on each x86 core
  – Only enable writing when NOVA writes to NVMM
  – Protect against scribbles from other kernel bugs, not own bugs
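
The write-window discipline can be mimicked in user-level code, although Python obviously cannot toggle CR0.WP. The sketch below only models the idea that NVMM stays read-only except for the duration of the file system's own stores; `Nvmm`, `write_window`, and the other names are hypothetical.

```python
"""Toy model of a write window over read-only NVMM.

Python cannot toggle CR0.WP; the flag below only mimics the discipline of
enabling writes for the duration of the file system's own stores. All names
are assumptions made for illustration.
"""
from contextlib import contextmanager


class Nvmm:
    def __init__(self, size: int):
        self._mem = bytearray(size)
        self._writable = False            # NVMM is read-only by default

    @contextmanager
    def write_window(self):
        self._writable = True             # open the window
        try:
            yield self
        finally:
            self._writable = False        # always close it again

    def store(self, addr: int, data: bytes) -> None:
        if not self._writable:
            raise PermissionError("store to read-only NVMM")
        self._mem[addr:addr + len(data)] = data

    def load(self, addr: int, length: int) -> bytes:
        return bytes(self._mem[addr:addr + length])


nvmm = Nvmm(64)
with nvmm.write_window():
    nvmm.store(0, b"log entry")           # the file system's intended write
try:
    nvmm.store(0, b"scribble!")           # a stray store from other code
    blocked = False
except PermissionError:
    blocked = True
```

As the slide notes, this kind of window stops stray stores from other code but not bugs inside the code that opens the window.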

Page 40

Filebench throughput

• NOVA achieves high performance with strong data consistency

[Figure: Filebench throughput. Y axis: ops per second (x1000), 0 to 450. Workloads: Fileserver, Varmail, Webproxy, Webserver. Series: Ext4-datajournal, Ext4-DAX, m1fs, NOVA.]

Page 41

Tick-tock inode update

• Update tails of primary inode
• Update csum of primary inode
• Same procedure for inode’

[Figure: primary inode (Head, Tail, csum, H1, T1) and secondary inode (Head’, Tail’, csum’, H1’, T1’), with fields marked Old, Updating, or New.]
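
The update order above can be exercised with a short crash-injection sketch. It is an illustration only: the dict-based inodes, `zlib.crc32` as the checksum, and the recovery rule that prefers the primary when it verifies are assumptions, not details taken from the slides.

```python
"""Sketch of a tick-tock update on a replicated inode.

Assumptions for illustration: dict-based inodes, zlib.crc32 as the checksum,
and the recovery rule that prefers the primary when it verifies.
"""
import zlib


def checksum(head: int, tail: int) -> int:
    return zlib.crc32(f"{head}:{tail}".encode())


def make_inode(head: int, tail: int) -> dict:
    return {"head": head, "tail": tail, "csum": checksum(head, tail)}


def is_valid(inode: dict) -> bool:
    return inode["csum"] == checksum(inode["head"], inode["tail"])


def tick_tock_update(primary, replica, new_tail, crash_after=None):
    """Update the primary completely, then the replica, one store at a time."""
    stores = [
        (primary, "tail", new_tail),
        (primary, "csum", checksum(primary["head"], new_tail)),
        (replica, "tail", new_tail),
        (replica, "csum", checksum(replica["head"], new_tail)),
    ]
    for step, (inode, name, value) in enumerate(stores, start=1):
        inode[name] = value
        if step == crash_after:
            return                        # simulated power failure


def recover(primary, replica) -> int:
    """Copy whichever inode verifies over the other and return its tail."""
    if is_valid(primary):
        replica.update(primary)
    elif is_valid(replica):
        primary.update(replica)
    else:
        raise OSError("both inode copies are inconsistent")
    return primary["tail"]


outcomes = {}
for crash_after in (1, 2, 3, 4, None):
    primary, replica = make_inode(0, 10), make_inode(0, 10)
    tick_tock_update(primary, replica, 20, crash_after)
    outcomes[crash_after] = recover(primary, replica)
```

Because only one copy is ever mid-update, a crash at any step leaves at least one copy whose checksum verifies, and recovery lands on either the old tail or the new one.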

Page 42

Performance Cost of Data Integrity

[Figure: bar chart, y axis 0 to 1.2. Workloads: Fileserver, Varmail, Webproxy, Webserver, RocksDB, MongoDB, Exim, TPCC, average. Series: xfs-DAX, ext4-DAX, ext4-dataj, Fortis baseline, w/ MP, w/ MP+WP, w/ MP+DP, w/ MP+DP+WP.]

Page 43

Conclusions

• NVMM file systems need unique solutions for reliability
  – Error reporting mechanisms different than disks
  – DAX-mmap complicates designs
• Performance and storage penalties vary
  – Storage cost is modest for the presented hardening techniques
  – Performance impact is significant for some applications
• More knowledge is necessary to determine the trade-offs
  – Uncorrectable media errors in emerging NVMM technologies
  – The frequency and size of scribbles in kernel space
• NOVA provides all hardening techniques as mount options

Page 44

Performance impact of data integrity

• File operation latency

Page 45

Reliability evaluation: metadata pages at risk

• Scan an aged NOVA file system image
  – Examine distances between the primary and replica pages
  – Count vulnerable page pairs for a given scribble size

[Figure note: points with Y == 0 do not show in the log-log plot.]

Page 46

PMFS shortcomings

• No data atomicity support
• High consistency overhead with persistent B-tree
• Not scalable
  – Directory operations (linear search)
  – NVMM allocation (single allocator)
  – Single journal shared by all transactions
• Poor performance on large directories
• Intel has deprecated PMFS

Page 47

Ext4-DAX and xfs-DAX shortcomings

• No data atomicity support
• Single journal shared by all the transactions (JBD2-based)
• Poor performance

Page 48

Non-volatile main memory is about to happen

NVMM needs a new file system: PMFS, Ext4-DAX, SCMFS, Aerie, NOVA, …

Page 49

Why a new file system?

Source: Memory-Driven Computing, Kimberly Keeton, HP Labs

We need to reduce the software overhead.

Page 50

What Should a File System Provide?

• Performance: current focus (of all known efforts)
• Consistency
  – Atomic metadata operations
  – Atomic file updates
• Data Protection
  – Snapshots
  – Media error protection
  – Software error protection
• Cost optimizations
  – Compression
  – Deduplication

We need to study the impact of adding more file system services in the context of NVMM.

Page 51

Evaluation: Latency

• Intel PM Emulation Platform
  – Emulates different NVM characteristics
  – Emulates clwb/PCOMMIT latency
• NOVA provides low latency atomicity

[Figure: Operation latency. Y axis: latency (microsecond), 0 to 25. Operations: Create, Append (4KB), Delete. Series: Ext4-datajournal, Ext4-DAX, m1fs, NOVA.]

Page 52

NOVA design and in-NVMM data layout

• High performance
  – No page cache
  – Memory semantics
  – Segregated data structures
  – Per-CPU freelist
  – Per-inode logging
• Strong consistency
  – Copy-on-write file data
  – Using 8-byte atomic stores

[Figure: layout split between DRAM and NVMM. Labels: Superblock; Recovery inode; per-CPU (CPU 0, CPU 1, ...) Journal, Inode table, and Free list; an Inode with Head and Tail pointers into its Inode log.]

Page 53

NOVA design and in-NVMM data layout

• High performance
  – No page cache
  – Memory semantics
  – Segregated data structures
  – Per-CPU freelist
  – Per-inode logging
• Strong consistency
  – Copy-on-write file data
  – Using 8-byte atomic stores

[Figure: Per-CPU inode table; an Inode with Head and Tail pointers into its per-inode File log; data pages Data 0 and Data 1 with new copies Data 1 and Data 2. Annotations: writing to page 1 and 2; 64-bit tail ptr.]

Page 54

NVMM file systems should support snapshot

• Snapshot is essential for file system backup
• Available in file systems for block devices
  – ZFS, Btrfs, WAFL
• NOVA is the first NVMM file system providing snapshot
  – Efficient full-filesystem snapshot at minimal performance cost
  – Creating/deleting snapshots does not halt file system
  – Creating consistent snapshots with DAX-mmap enabled

Page 55

Enable snapshot in NOVA

• Maintain a current ‘epoch_id’ for the file system
  – Stored in the superblock
  – Incremented after every snapshot taken
• Add the ‘epoch_id’ to each log entry

[Figure: log entries, each labeled with its epoch_id.]
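
How an epoch ID on every log entry supports snapshots can be sketched with a single-file model. The names `EpochFS`, `write`, `take_snapshot`, and `read` are hypothetical, and the reclamation rule used here (data overwritten within the epoch that wrote it can be freed at once, since no snapshot can reference it) is an assumption drawn from the snapshot figures rather than a statement of NOVA's exact algorithm.

```python
"""Sketch of epoch-tagged log entries supporting snapshots.

Assumptions for illustration: the class and method names, a single file,
and the rule that data overwritten within the epoch that wrote it can be
reclaimed at once because no snapshot can reference it.
"""


class EpochFS:
    def __init__(self):
        self.epoch_id = 0                 # kept in the superblock per the slide
        self.log = []                     # entries: (epoch_id, page, data)
        self.reclaimed = []               # data freed immediately

    def write(self, page: int, data: str) -> None:
        for old_epoch, old_page, old_data in reversed(self.log):
            if old_page == page:          # newest existing entry for this page
                if old_epoch == self.epoch_id:
                    self.reclaimed.append(old_data)
                break
        self.log.append((self.epoch_id, page, data))

    def take_snapshot(self) -> int:
        snapshot_id = self.epoch_id
        self.epoch_id += 1                # incremented after every snapshot
        return snapshot_id

    def read(self, page: int, snapshot_id=None):
        """Read the current view, or the view as of a given snapshot."""
        limit = self.epoch_id if snapshot_id is None else snapshot_id
        for epoch, entry_page, data in reversed(self.log):
            if entry_page == page and epoch <= limit:
                return data
        return None


fs = EpochFS()
fs.write(0, "0x1000")
snap0 = fs.take_snapshot()
fs.write(0, "0x2000")
fs.write(0, "0x3000")
snap1 = fs.take_snapshot()
fs.write(0, "0x4000")
```

Reading with `snap0` returns the data written in epoch 0, reading with `snap1` returns the last data written in epoch 1, and the intermediate write within epoch 1 is the only one reclaimed immediately.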

Page 56

Taking snapshots

[Figure: a file log whose entries carry epoch IDs. The current epoch advances 0, 1, 2 as Snapshot 0 and Snapshot 1 are taken; file write entries point to data pages at 0x1000, 0x2000, 0x3000, and 0x4000. The Snapshot 0 log holds the entry (0x1000, 1) and the Snapshot 1 log holds the entry (0x3000, 2). Legend: snapshot entry, file write entry, epoch ID, data in snapshot, reclaimed data, current data.]

Page 57

Deleting snapshots

[Figure: the same file log, with epochs 0, 1, 2, Snapshot 0 and Snapshot 1, and data pages at 0x1000, 0x2000, 0x3000, and 0x4000. The Snapshot 0 log holds the entry (0x1000, 1) and the Snapshot 1 log holds the entry (0x3000, 2); data is reclaimed by Background GC. Legend: snapshot entry, file write entry, epoch ID, data in snapshot, reclaimed data, current data.]

Page 58

Mounting snapshots

[Figure: the same file log, with epochs 0, 1, 2, Snapshot 1, and data pages at 0x1000, 0x2000, 0x3000, and 0x4000. The Snapshot 1 log holds the entry (0x3000, 2); a log tail marker is shown. Legend: snapshot entry, file write entry, epoch ID, data in snapshot, reclaimed data, current data.]

Page 59

Snapshots with DAX-mmap

• Goal: Applications take snapshots and keep running
  – Virtual addresses do not change
  – Consistency must be guaranteed
• How: Set each mmap’ed page as read-only
  – Then do copy-on-write for new stores (detected by page fault)
• Caveat: Can only atomically set one page as read-only
  – What if the order of becoming read-only conflicts with consistency?

Page 60

NOVA (meta)data integrity features

• Detect (meta)data corruptions
  – Media errors: error code from memcpy_mcsafe(), and checksums
  – Software scribbles: checksums
• Repair (meta)data corruptions
  – Metadata: Fully replicated
  – File data: Stripe and parity-code each block
• Minimize scribbles
  – Leverage x86 CPU’s write protection (CR0.WP)
  – Metadata allocators separate replicas

Page 61

Metadata error detection and correction

• Use inode access as an example:
  Read inode (memcpy_mcsafe) → Read inode’ (memcpy_mcsafe) → Verify inode csum → Verify inode’ csum → memcmp(inode, inode’) → Use inode
• If any step raises an error:
  – Attempt to repair and retry
  – If recovery fails: return -EIO to user
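
The inode access flow can be sketched as a read path over two copies. Everything below is an assumption made for illustration: the dict-based store, the record format, `zlib.crc32` standing in for the file system's checksum, and the rule that the primary wins when both copies verify but differ. `read_copy()` only models the behaviour of memcpy_mcsafe() (return data or report a media error); it is not the kernel function.

```python
"""Sketch of the replicated-inode read path with checksum verification.

Assumptions for illustration: the dict-based store, the record format, and
zlib.crc32 standing in for the file system's checksum. read_copy() only
models memcpy_mcsafe(); it is not the kernel function.
"""
import errno
import zlib


class MediaError(Exception):
    """Models an uncorrectable media error reported during a copy."""


def seal(payload: bytes) -> bytes:
    """Append a 4-byte checksum to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "little")


def is_valid(record: bytes) -> bool:
    return len(record) >= 4 and seal(record[:-4]) == record


def read_copy(store: dict, key: str) -> bytes:
    record = store[key]
    if record is None:                    # None models a poisoned region
        raise MediaError(key)
    return record


def load_inode(store: dict) -> bytes:
    good = {}
    for key in ("inode", "inode_replica"):
        try:
            record = read_copy(store, key)
        except MediaError:
            continue
        if is_valid(record):
            good[key] = record
    if not good:
        raise OSError(errno.EIO, "both inode copies are damaged")
    # If both copies verify but differ, this sketch trusts the primary.
    chosen = good.get("inode", good.get("inode_replica"))
    store["inode"] = store["inode_replica"] = chosen   # repair the other copy
    return chosen[:-4]


store = {"inode": seal(b"head=0,tail=64"), "inode_replica": seal(b"head=0,tail=64")}
scribbled = bytearray(store["inode"])
scribbled[0] ^= 0xFF                      # software scribble on the primary
store["inode"] = bytes(scribbled)
payload = load_inode(store)
```

After the call, the damaged primary has been overwritten from the replica; if neither copy can be read and verified, the sketch reports EIO, matching the last bullet above.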

Page 62

File data error detection and correction

• Steps shown: read a strip (memcpy_mcsafe); check errors in checksums; verify strip’s checksum; copy data to user
• If any step raises an error:
  – Attempt to repair and retry
  – If recovery fails: return -EIO to user

Page 63

File data protection with DAX-mmap

• With DAX-mmap(), file data changes are invisible to NOVA

Following dax-mmap() semantics, NOVA doesn’t interfere with mmap’ed file data.

[Figure: user-space applications reach file data on the NVDIMMs through NOVA’s read() and write() in kernel space, or with load/store after mmap(); the file log and file data are shown, with both file data regions marked protected.]

Page 64

File data protection with DAX-mmap

• NOVA cannot protect mmap’ed file data
  – User applications directly load/store the mmap’ed region
  – NOVA has to know what file pages are mmap’ed

NOVA skips protection routines for reads and writes to the regions found in the vm area list.

[Figure: user-space applications, NOVA’s read() and write() in kernel space, the file log, file data on the NVDIMMs, and the vm area list built by mmap(); read() and load/store accesses to the mmap’ed region are marked “no protection”.]

Page 65

Performance impact of data integrity

• Latency breakdown on NVDIMM-N

[Figure legend: metadata protection, file data protection, x86 write protection.]

Page 66

Performance impact of data integrity

• Random read/write bandwidth on NVDIMM-N

Page 67

Storage utilization with data integrity

• Conceptual view of NOVA’s utilization of NVMM

• Actual usage of a practical workload: fileserver

Dead-zone (DZ) only virtually exists to separate metadata replicas. File data can still live inside.

Page 68

Differences from disk FS implementation

• Low-latency storage media
  – Need to choose fast methods for any involved computation
• Fine-grained random access
  – Need fine-grained checksum ranges, not per block (as in Btrfs, ZFS)
• Small atomicity guarantees (64-bit)
  – Need metadata replication to assist consistent updates
• Media errors cause machine check exceptions (MCEs)
  – Need awareness and mitigations
• Demands for DAX-mmap (no copy-on-write, no FS control)
  – Need awareness and lowering the protection level on demand

Recovery for snapshot metadata

• Snapshot metadata list resides in DRAM to reduce consistency overhead

• Clean unmount:
  – Finish background snapshot delete
  – Save snapshot lists to NVMM

• Power failure:
  – Snapshot transaction ID is persistent
  – Rebuild snapshot metadata lists during power failure recovery

NVMM (Meta)data corruptions

• Media errors
  – Detectable & correctable
    • Transparent to software
  – Detectable & uncorrectable
    • Affect a contiguous range of data
    • Raise machine check exception (MCE)
  – Undetectable
    • May consume corrupted data
• Software scribbles
  – Kernel bugs or own bugs
  – Transparent to hardware

[Figure (detectable & correctable): software reads NVMM data that contains a media error; the read passes through HW ECC in hardware and software receives good data.]

[Figure (detectable & uncorrectable): a read hits a media error with a poison radius (PR) of, e.g., 512 bytes; HW ECC raises an MCE. In user mode the result is SIGBUS; in kernel mode the kernel might panic.]

[Figure (undetectable): a read of NVMM data with a media error passes through HW ECC and software receives corrupted data.]

[Figure (software scribbles): NVMM data corrupted by software is transparent to hardware; a read passes through HW ECC and software receives corrupted data.]

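Metadata error detection and correction

• Use inode access as example

[Flowchart, as far as the slide labels allow:
  1. Read inode with memcpy_mcsafe, then read its replica inode’ with memcpy_mcsafe. Each read ends in OK or MCE; an MCE marks the poison radius PR(inode) or PR(inode’).
  2. MCE on both reads: -EIO to user.
  3. Otherwise verify the inode csum and the inode’ csum:
     – Both fail: -EIO to user
     – One fails: one copy is the good inode, the other the error inode
     – Both OK: memcmp(inode, inode’)
       • eq: Continue
       • neq: Goto (target not recoverable from the transcript)]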
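The inode flow on this slide can be sketched in a few lines. This is a simplified model under stated assumptions, not NOVA's implementation: `MachineCheck`, `mcsafe_read`, `make_rec`, and `read_inode` are hypothetical names, `zlib.crc32` stands in for the real checksum, a poisoned range is modeled as `None`, and MCEs and checksum failures are merged into one "bad copy" case. The slide does not show where the "neq" branch leads, so the sketch conservatively reports an error there.

```python
import zlib


class MachineCheck(Exception):
    """Stands in for an MCE caught by memcpy_mcsafe (hypothetical model)."""


def make_rec(payload):
    # A record is (payload, checksum of payload).
    return (payload, zlib.crc32(payload))


def mcsafe_read(store, key):
    rec = store[key]
    if rec is None:  # None models a poisoned NVMM range
        raise MachineCheck(key)
    return rec


def csum_ok(rec):
    payload, csum = rec
    return zlib.crc32(payload) == csum


def read_inode(store):
    """Return (status, payload); status is 'ok', 'repaired', or 'EIO'."""
    copies = {}
    for key in ("inode", "inode_replica"):
        try:
            copies[key] = mcsafe_read(store, key)
        except MachineCheck:
            copies[key] = None
    good = [k for k, rec in copies.items() if rec is not None and csum_ok(rec)]
    if not good:                      # both copies unusable
        return "EIO", None
    if len(good) == 1:                # one fails: repair it from the good copy
        src = good[0]
        dst = "inode_replica" if src == "inode" else "inode"
        store[dst] = copies[src]
        return "repaired", copies[src][0]
    if copies["inode"] != copies["inode_replica"]:
        return "EIO", None            # assumption: the slide's target is unknown
    return "ok", copies["inode"][0]
```

Keeping a checksum inside each copy is what lets the reader tell which of two differing copies is the damaged one; replication alone could only report that they disagree.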

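File data error detection and correction

• Detect and repair both data and checksum errors

[Flowchart, as far as the slide labels allow:
  1. Read a strip with memcpy_mcsafe. The read ends in OK or MCE.
  2. On OK, calculate csum and test "csum == csum0 or csum == csum1?".
     – Yes: copy data to user
     – The good csum and the error csum are judged by the majority among csum, csum0, csum1
  3. On No or MCE, read the other strips and the parity:
     – Any MCE: -EIO to user
     – All OK: repair the bad strip & verify csums
       • Success: copy data to user
       • Fail: -EIO to user
  A further decision, "csum0 == csum1?" (Yes/No), appears in the chart; its position is not recoverable from the transcript.]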
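The strip read path on this slide can be sketched as follows. This is a minimal model under stated assumptions, not NOVA's code: `make_stripe` and `read_strip` are hypothetical names, the parity is assumed to be a plain XOR across strips, `zlib.crc32` stands in for the real checksum, `csum0` and `csum1` are taken to be two stored copies of each strip's checksum, and a strip set to `None` models a read that raises an MCE.

```python
import errno
import zlib
from functools import reduce


def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def make_stripe(strips):
    # Equal-length strips, one XOR parity strip, two checksum copies per strip.
    csums = [zlib.crc32(s) for s in strips]
    return {"strips": list(strips), "parity": reduce(xor_bytes, strips),
            "csum0": list(csums), "csum1": list(csums)}


def read_strip(stripe, i):
    c0, c1 = stripe["csum0"][i], stripe["csum1"][i]
    data = stripe["strips"][i]
    if data is not None:                      # None models an MCE on the read
        csum = zlib.crc32(data)
        if csum == c0 or csum == c1:
            # Majority of (csum, csum0, csum1): fix the odd checksum copy.
            stripe["csum0"][i] = stripe["csum1"][i] = csum
            return data
    # Data is bad or unreadable: rebuild from the other strips and the parity.
    others = [s for j, s in enumerate(stripe["strips"]) if j != i]
    others.append(stripe["parity"])
    if any(s is None for s in others):        # any MCE -> give up
        raise OSError(errno.EIO, "cannot rebuild strip")
    rebuilt = reduce(xor_bytes, others)
    csum = zlib.crc32(rebuilt)
    if csum != c0 and csum != c1:             # repair failed verification
        raise OSError(errno.EIO, "rebuilt strip fails checksum")
    stripe["strips"][i] = rebuilt
    stripe["csum0"][i] = stripe["csum1"][i] = csum
    return rebuilt
```

Because the computed checksum is compared against two stored copies, a single damaged checksum copy is outvoted and rewritten, while damaged data is rebuilt from parity and accepted only if it matches a stored checksum.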

