BabuDB: Fast and Efficient File System Metadata Storage · 2010. 5. 4.

SNAPI 2010 · Jan Stender

BabuDB: Fast and Efficient File System Metadata Storage

Jan Stender, Björn Kolbeck, Mikael Högqvist

Zuse Institute Berlin

Felix Hupfeld

Google GmbH Zurich

Motivation

– Modern parallel / distributed file systems:

– Huge numbers of files and directories

– Many storage servers but few metadata servers

– Examples:

– Lustre, Panasas Active Scale, Google File System

– Metadata access is critical to system performance

– ~75% of all file system calls are metadata accesses

– Metadata servers are bottlenecks

Motivation

– B-tree-like data structures used for metadata storage

– ZFS, btrfs, Lustre, PVFS2

– Downsides:

– Hard to implement and test, high code complexity

– Multi-version B-trees even more complex

– On-disk re-balancing expensive

BabuDB

– Key-value store

– FS metadata: key-value pairs stored in DB indices

BabuDB: Index

Example

Example: Insertions

Example: Lookups

Example: Deletions

Example: Range Lookups

Example: Checkpoints

On-disk Index

– Sorted by Keys

– Block index in RAM, blocks mmap'ed
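A minimal sketch of how a lookup against such a sorted, block-structured index could work. The class name `BlockedIndex`, the `block_size` parameter, and the use of in-memory lists in place of mmap'ed file blocks are assumptions for illustration, not BabuDB's actual layout or API.

```python
from bisect import bisect_right


class BlockedIndex:
    """Immutable sorted index split into blocks (hypothetical illustration).

    The small list of first keys plays the role of the in-RAM block index;
    the blocks themselves stand in for mmap'ed regions of the index file.
    """

    def __init__(self, items, block_size=4):
        # The index is sorted by key, then cut into fixed-size blocks.
        pairs = sorted(items, key=lambda kv: kv[0])
        self.blocks = [pairs[i:i + block_size]
                       for i in range(0, len(pairs), block_size)]
        # Block index: the first key of every block.
        self.first_keys = [block[0][0] for block in self.blocks]

    def lookup(self, key):
        # Step 1: binary search in the block index for the candidate block.
        pos = bisect_right(self.first_keys, key) - 1
        if pos < 0:
            return None  # key sorts before the first block
        # Step 2: binary search inside that single block.
        block = self.blocks[pos]
        keys = [k for k, _ in block]
        i = bisect_right(keys, key) - 1
        if i >= 0 and keys[i] == key:
            return block[i][1]
        return None
```

With this structure a point lookup touches the block index plus one block, so at most one block has to be paged in from disk.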

BabuDB: Related Work

– Inspired by log-structured merge trees (LSM-trees)

– Only one on-disk index

– No "rolling merge"

– Made popular by Google Bigtable

– Insert/lookup/merge similar to Bigtable's tablets
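The figures for the insertion, lookup, deletion, range lookup and checkpoint examples are not part of this transcript. The following is a generic sketch of an LSM-tree-like index with an in-memory overlay and a single base index, under the assumption that deletions are recorded as tombstones. The class `TinyLsmIndex` and all of its method names are hypothetical, and a plain dict stands in for the on-disk index.

```python
from operator import itemgetter

_TOMBSTONE = object()  # marker for a deleted key (assumed mechanism)


class TinyLsmIndex:
    def __init__(self):
        self.overlay = {}  # recent changes, kept in memory
        self.base = {}     # stands in for the single immutable on-disk index

    def insert(self, key, value):
        # Insertions only touch the in-memory overlay.
        self.overlay[key] = value

    def delete(self, key):
        # A deletion hides any older value in the base index.
        self.overlay[key] = _TOMBSTONE

    def lookup(self, key):
        # Newest data wins: check the overlay first, then the base index.
        if key in self.overlay:
            value = self.overlay[key]
            return None if value is _TOMBSTONE else value
        return self.base.get(key)

    def range_lookup(self, start, end):
        # Merge both levels for keys in [start, end), overlay overriding base.
        merged = {k: v for k, v in self.base.items() if start <= k < end}
        for k, v in self.overlay.items():
            if start <= k < end:
                merged[k] = v
        return [(k, v) for k, v in sorted(merged.items(), key=itemgetter(0))
                if v is not _TOMBSTONE]

    def checkpoint(self):
        # Write a new base index from base + overlay, dropping tombstones.
        merged = dict(self.base)
        merged.update(self.overlay)
        self.base = {k: v for k, v in sorted(merged.items(), key=itemgetter(0))
                     if v is not _TOMBSTONE}
        self.overlay = {}
```

In this sketch the base index is only ever replaced as a whole at a checkpoint, which is one way to read "only one on-disk index" and "no rolling merge".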

BabuDB: Metadata Mapping

– Mapping a hierarchical directory tree to a flat database index:
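The mapping figure is missing from this transcript, so the exact key layout is unknown. One possible scheme, sketched here purely as an assumption, keys every entry by its parent directory ID followed by its name. All entries of a directory are then adjacent in key order, and listing a directory becomes a prefix scan. `make_key`, `FlatDirectoryIndex` and the metadata dictionaries are hypothetical.

```python
from bisect import bisect_left, insort


def make_key(parent_id, name):
    # Assumed layout: fixed-width parent directory ID, then the entry name.
    return f"{parent_id:016x}/{name}"


class FlatDirectoryIndex:
    """Directory tree stored in one flat, key-sorted index (illustration)."""

    def __init__(self):
        self.keys = []    # sorted keys, standing in for the sorted index
        self.values = {}  # key -> metadata record

    def create(self, parent_id, name, metadata):
        key = make_key(parent_id, name)
        if key not in self.values:
            insort(self.keys, key)
        self.values[key] = metadata

    def stat(self, parent_id, name):
        # Point lookup by (parent directory, name).
        return self.values.get(make_key(parent_id, name))

    def readdir(self, parent_id):
        # Prefix scan: all entries of one directory are contiguous.
        prefix = f"{parent_id:016x}/"
        start = bisect_left(self.keys, prefix)
        entries = []
        for key in self.keys[start:]:
            if not key.startswith(prefix):
                break
            entries.append((key[len(prefix):], self.values[key]))
        return entries
```

Because `readdir` returns the metadata together with each name, a combined readdir + stat needs only one sequential scan in this sketch.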

BabuDB: Advantages

– Why BabuDB for File System Metadata?

– Short-lived files

▪ 50% of all files deleted within 5 minutes

– Atomic file system operations w/o locking or transactions

▪ e.g. rename

– Directory content in contiguous disk regions

▪ Efficient readdir + stat

– Snapshots

▪ No need for multi-version data structures

BabuDB: Evaluation

– Linux kernel build

– ~10M calls: 44% stat, 40% open, 15% readlink, 1% others

– Dovecot mail server + imaptest

– ~2M calls: 51% stat, 48% open, 1% others

[Figure: run time in seconds, BabuDB vs. ext4. Panels: Dovecot test (axis 0 to 400), kernel build (axis 0 to 2000).]

BabuDB: Evaluation

– Listing directory content

Summary

– BabuDB is ...

– an efficient key-value store

– optimized for file system metadata but also suitable for other purposes

– suitable for large-scale databases

– available for Java and C++ under BSD license

– used in the XtreemFS metadata server

http://babudb.googlecode.com

http://www.xtreemfs.org

Thank you for your attention!

Background: XtreemFS

– XtreemFS: a distributed replicated Internet file system

– part of the XtreemOS research project

– developed since 2006 by partners from Germany, Spain and Italy

www.xtreemfs.org

– Object-based architecture:

– MRC stores metadata

– OSDs store pure file content as objects

– Clients provide POSIX file system interface

The XtreemOS Project

– Research project funded by the European Commission

– 19 partners from Europe and China

– XtreemFS is the data management component

– developed by ZIB, NEC HPC Europe, Barcelona Supercomputing Center and ICAR-CNR Italy

– ~ 3 years of development

– first public release in August 2008

XtreemFS: Overview

– What is XtreemFS?

– a distributed and replicated POSIX-compliant file system

– off-the-shelf servers – no expensive hardware

– servers in Java, run on Linux / OS X / Solaris

– client in C, runs on Linux / OS X / Windows

– secure (X.509 and SSL)

– easy to install and maintain

– open source (GPL)

File System Landscape

– PC: ext3, ZFS, NTFS

– Network FS / Centralized: NFS, SMB, AFS/Coda

– Cluster FS / Data Center: Lustre, Panasas, GPFS, CEPH...

– Internet: GDM "gridftp", Grid File System GFarm

