B-Tree File System Report

Date post: 28-Nov-2014
Upload: dinesh-gupta
Page 1: b tree file system report

1-Introduction to B-Trees and Shadowing

1.1- B-tree-

In computer science, a B-tree is a tree data structure that keeps data sorted and allows searches, sequential access, insertions, and deletions in logarithmic time. The B-tree is a generalization of a binary search tree in that a node can have more than two children (Comer 1979, p. 123). Unlike self-balancing binary search trees, the B-tree is optimized for systems that read and write large blocks of data. It is commonly used in databases and filesystems.

A B+-tree can be considered a B-tree variant, with the exception that in a B+-tree only the leaves contain data. In binary search trees, each node holds a single search-key, with the left sub-tree and right sub-tree containing all nodes whose search-keys are less than and greater than the parent's search-key, respectively. In a B+-tree, a node can have multiple search-keys and multiple child nodes.


In a BST, the distance of a leaf from the tree root is not fixed; it depends on the sequence of insertions. In B-trees and B+-trees, however, the insertion algorithm ensures that the distance between every leaf and the root is the same. Figure 1.2 shows a B+-tree in which the ordering of the words is alphabetical. The node size is 2, and any further insertion into a node already containing 2 search-keys will cause the node to split and trigger a rebalancing operation. In B+-trees, the leaves are chained together: since all search-keys in adjacent leaves are already in sorted order, chaining enables efficient sequential access to the data associated with the sorted keys in the leaves at the bottom.

In B-trees, internal (non-leaf) nodes can have a variable number of child nodes within some pre-defined range. When data is inserted into or removed from a node, its number of child nodes changes. In order to maintain the pre-defined range, internal nodes may be joined or split. Because a range of child nodes is permitted, B-trees do not need re-balancing as frequently as other self-balancing search trees, but may waste some space, since nodes are not entirely full. The lower and upper bounds on the number of child nodes are typically fixed for a particular implementation.

B+-Tree

B-trees ensure logarithmic-time key search, insert, and remove operations. B-trees can be used to represent files or directories in a file system. A file is typically represented by a b-tree that holds its disk extents, i.e., contiguous ranges of disk blocks, in its leaves. In the next section, we will cover the basic concept of shadowing.


1.2 Shadowing-

The shadowing scheme is also known as the copy-on-write (COW) scheme. The shadowing technique is used to ensure atomic updates to persistent data structures in a file system. In this scheme, we look at the storage in terms of fixed-size pages, and a page table holds a pointer to every valid page. Shadowing means that to update an on-disk page, the entire page is read into memory, modified, and later written to disk at an alternate location. All that then remains is to update the pointer in the page table to point to the new page on disk.

The byte-size of a pointer is small, so the pointer fits in one disk sector. Hard drives offer atomic sector updates, promising that after a crash the sector contains either all of the old data or all of the new data. This means you end up with either the old page or the newly written page, so atomic persistent updates are guaranteed by this scheme. It is a powerful mechanism for implementing crash recovery and snapshots.

1.3 Problems with conventional B-Trees-

The entire file-system tree on disk can be viewed as a collection of fixed-size pages. When a page is to be modified, it is read into memory, modified, and later written to some other location on disk. Now assume that a leaf in the b-tree shown below corresponds to one on-disk page. If we modify the leaf, the page corresponding to the leaf will be shadowed. The immediate ancestor of this node must then point to the new copy, which means we have to modify the ancestor as well. Again shadowing is involved, and this process continues recursively up to the root.

So the entire path up to the root needs to be shadowed; we will call this strict shadowing. An additional problem arises from the linking of the leaves: since the adjacent leaf must also point to the modified leaf, it too needs to be shadowed. This process leads to shadowing of the entire tree just because of a modification to one leaf. Remember, all of this happens on hard disks, so it leads to performance degradation. The root of the problem is leaf chaining.


To solve issues related to concurrency, we use mutex locks or semaphores. Assume for a while that there are no links between leaves. In a normal b-tree, if we need to modify a single node, we take a lock on it, make the changes, and then release the lock. But if the update method is shadowing, then changes propagate up to the root, making it necessary to take locks all the way up the tree. There is therefore a race to take the lock on the tree root. Waiting for a lock is time-consuming, hence the need for efficient synchronization.

Regular b-trees shuffle keys between neighboring nodes for re-balancing after a key insertion or deletion. If a leaf is modified, the path up to the root is shadowed by default. If an exchange of keys happens between nodes whose immediate ancestor is not the same, an additional path up to the tree root will have to be shadowed because of the modification caused by the key exchange.

Removing a key and the effects of re-balancing and shadowing.

So we can say that B-trees + shadowing is an expensive choice if conventional b-trees are used.


1.4 Modifications to the conventional B-tree-

Ohad Rodeh, an IBM researcher, has suggested modifications to the conventional b-tree and its algorithms for integrating b-tree schemes with the shadowing technique. We will cover a few of them, related to the problems discussed above.

1. To solve the problem of shadowing the whole tree, the links between the leaves are removed. As a result, only the path up to the tree root needs to be shadowed.

2. During a rebalancing operation, it is better to exchange keys between nodes whose immediate ancestor is the same, because this involves shadowing a single node, which is better than shadowing another path up to the root involving many nodes.


2-Introduction and History of BTRFS

2.1 Introduction-

Btrfs is a GPL-licensed copy-on-write file system for Linux. Its development began at Oracle Corporation in 2007, and its principal author is Chris Mason. Some general points about btrfs:

1. The core data structure of Btrfs is the copy-on-write B-tree, originally proposed by IBM researcher Ohad Rodeh in a presentation at USENIX 2007.

2. Btrfs 1.0 (with finalized on-disk format) was originally slated for a late 2008 release, and was finally accepted into the mainline kernel as of 2.6.29 in 2009.

3. Btrfs is intended to address the lack of pooling, snapshots, and checksums in Linux file systems.

4. The goal of btrfs was "to let Linux scale for the storage that will be available. Scaling is not just about addressing the storage but also means being able to administer and to manage it with a clean interface that lets people see what's being used and makes it more reliable."

5. Btrfs shares a number of design ideas with reiser3/4 (Chris Mason was working on ReiserFS before starting his work on btrfs).

The maximum number of files is 2^64, or 18,446,744,073,709,551,616. The maximum file-name length is 255 characters, and the theoretical maximum file size is 16 EiB (limited to about 8 EiB by the Linux VFS). The BTRFS file system also helps reduce fragmentation: storage devices usually show a loss of performance due to fragmentation (usually when fuller), and BTRFS allows online defragmentation.

Should disk space become full, it is possible to add space to an existing BTRFS volume; this method is referred to as online resize. The BTRFS file system does not need to be unmounted or taken offline, and devices can be added to, or removed from, the volume while it is mounted.

If a volume has an existing ext3 or ext4 file system, it can be converted to BTRFS. The conversion is an in-place conversion, meaning the existing data does not have to be removed before the file system is converted. It is still good practice to perform a backup beforehand in case the conversion fails.


2.2 History-

The core data structure of Btrfs, the copy-on-write B-tree, was originally proposed by IBM researcher Ohad Rodeh at a presentation at USENIX 2007. Chris Mason, an engineer working on ReiserFS for SUSE at the time, joined Oracle later that year and began work on a new file system based on these B-trees.

Btrfs 1.0 (with finalized on-disk format) was originally slated for a late 2008 release, and was finally accepted into the mainline kernel as of 2.6.29 in 2009. Several Linux distributions began offering Btrfs as an experimental choice of root file system during installation, including Arch Linux, openSUSE 11.3, SLES 11 SP1, Ubuntu 10.10, Sabayon Linux, Red Hat Enterprise Linux 6, Fedora 15, MeeGo, Debian, and Slackware 13.37. In summer 2012, several Linux distributions moved Btrfs from experimental to production/supported status, including SLES 11 SP2 and Oracle Linux 5 and 6 with the Unbreakable Enterprise Kernel Release 2.

In 2011, de-fragmentation features were announced for the Linux 3.0 kernel. Besides Mason at Oracle, Miao Xie at Fujitsu contributed performance improvements. In June 2012, Chris Mason left Oracle for Fusion-io, and in November 2013 he left Fusion-io for Facebook; he continues to work on Btrfs.

2.3 Why btrfs File-System?

The Linux kernel currently supports almost 140 file systems, most of which are generally very good. So why do we need a new file system when we already have so many? The reasons are:

1. This file system scales to very large storage: the maximum size of storage it can address is 16 EiB (2^64 bytes).

2. This file system is feature-focused, providing features the other file systems cannot.

3. Performance is important, but this file system does not intend to race with current file systems, because they are already good; it is the features that make btrfs stand out.

4. This file system is administrator-focused, so that it is easy to configure, and fault-tolerant.


3- Specifications and Features of BTRFS

3.1 Features-

As of version 3.12 of the Linux kernel mainline, Btrfs implements the following features:

1. Mostly self-healing in some configurations, due to the nature of copy-on-write

2. Online defragmentation

3. Online volume growth and shrinking

4. Online block device addition and removal

5. Online balancing (movement of objects between block devices to balance load)

6. Offline filesystem check

7. Online data scrubbing for finding errors and automatically fixing them for files with redundant copies

8. RAID 0, RAID 1, RAID 5, RAID 6 and RAID 10

9. Subvolumes (one or more separately mountable filesystem roots within each disk partition)

10. Transparent compression (zlib and LZO)

11. Snapshots (read-only or copy-on-write clones of subvolumes)

12. File cloning (copy-on-write on individual files, or byte ranges thereof)

13. Checksums on data and metadata (CRC-32C)

14. In-place conversion (with rollback) from ext3/4 to Btrfs

15. File system seeding (Btrfs on read-only storage used as a copy-on-write backing for a writeable Btrfs)

16. Block discard support (reclaims space on some virtualized setups and improves wear leveling on SSDs with TRIM)

17. Send/receive (saving diffs between snapshots to a binary stream)

18. Hierarchical per-subvolume quotas

19. Out-of-band data deduplication (requires user-space tools)


3.2 Planned features include:

1. In-band data deduplication

2. Online filesystem check

3. Very fast offline filesystem check

4. Object-level RAID 0, RAID 1, and RAID 10

5. Incremental backup

6. Ability to handle swap files and swap partitions

7. Encryption

In 2009, Btrfs was expected to offer a feature set comparable to ZFS, developed by Sun Microsystems. After Oracle's acquisition of Sun in 2009, Mason and Oracle decided to continue with Btrfs development.

Cloning-

Btrfs provides a clone operation which atomically creates a copy-on-write snapshot of a file.

Such cloned files are sometimes referred to as reflinks, in light of the associated Linux

kernel system calls.

By cloning, the file system does not create a new link pointing to an existing inode; instead, it creates a new inode that initially shares the same disk blocks as the original file. As a result, cloning works only within the boundaries of the same Btrfs file system, though since Linux kernel version 3.6 it may cross the boundaries of subvolumes. The actual data blocks are not duplicated; at the same time, due to the copy-on-write nature of cloning, modifications to any of the cloned files are not visible in the original file, and vice versa.

This should not be confused with hard links, which are directory entries that associate

multiple file names with actual files on a file system. While hard links can be taken as different

names for the same underlying group of disk blocks (known as a file), cloning in Btrfs provides

independent files that are sharing their disk blocks as a form of data deduplication on the disk

block level. Any later changes to the content of such "dependent" files invoke the copy-on-write

mechanism, which creates independent copies of all altered disk blocks.


Support for this Btrfs feature was added in version 7.5 of the GNU coreutils, via the --reflink option to the cp command.

Cloning can be especially effective when storing disk images of virtual machines or their snapshots. These are large files that differ only in small portions, so cloning provides both faster (instantaneous) copying and minimal consumption of storage space due to data deduplication.

3.3 Snapshots-

A snapshot is a read-only copy of a data set frozen at a particular point in time. Here we will consider the case of writable snapshots of the tree structure. In btrfs, the cloning/snapshot algorithm allows a theoretically unlimited number of snapshots.

In the example above, we have an initial tree Tp, with the reference count of each block shown. Initially, all tree blocks have a reference count of 1. Now we clone the b-tree, producing tree Tq. The root of Tq refers to the same blocks as that of Tp. As there are now two tree roots referencing the common blocks B and C, the reference count of these blocks is increased by one. So the cloning algorithm just sets the new root's pointers to the blocks referred to by the original tree root and increases the reference count of the referenced blocks. Hence we can have as many snapshots as we want, because a pointer occupies far less space than the actual data.


Now consider the process of editing shared blocks; Figure 4.2 shows an example. There are two tree roots, Tp and Tq. Suppose we are editing the snapshot with respect to Tq, and the leaf being modified is H. Node C is the immediate ancestor of leaf H and should point to the modified copy of the leaf. So block C is shadowed to C0, which points to the same blocks as C, and the reference count of C is decremented. Then leaf H is shadowed to H0, and the reference count of block H is decremented by one. Thanks to this kind of sharing, the space requirement is low compared to copying and modifying the entire tree.

3.4 Subvolumes-

A subvolume is a volume within a volume that can be mounted separately. The user sees subvolumes as directories. There are benefits to this: we can, for example, make a database directory a subvolume, which enables taking snapshots of it for use with backups. But unlike volumes in other file systems, a subvolume cannot be mounted just anywhere in the logical view of the directories; it has to be mounted under its parent directory.


A subvolume in Btrfs is quite different from the usual LVM logical volumes. With LVM, a

logical volume is a block device in its own right — while this is not the case with Btrfs. A Btrfs

subvolume is not a separate block device, and it cannot be treated or used that way.

Instead, a Btrfs subvolume can be thought of as a separate POSIX file namespace. This

namespace can be accessed either through the top-level subvolume of the file system, or it can be

mounted on its own and accessed separately by specifying the subvol or subvolid option

to mount. When accessed through the top-level subvolume, subvolumes are visible and accessed

as its subdirectories.

Subvolumes can be created at any place within the file system hierarchy, and they can also be nested. Nested subvolumes appear as subdirectories within their parent subvolumes, similar to the way the top-level subvolume presents its subvolumes as subdirectories. Deleting a subvolume

deletes all subvolumes below it in the nesting hierarchy, and for this reason the top-level

subvolume cannot be deleted.

Any Btrfs file system always has a default subvolume, which is initially set to be the top-level

subvolume, and it is mounted by default if no subvolume selection option is passed to mount. Of

course, the default subvolume can be changed as required.

3.5 Send/receive-

Given any pair of subvolumes (or snapshots), Btrfs can generate a binary diff between them

(by using the btrfs send command) that can be replayed later (by using btrfs receive), possibly on

a different Btrfs file system. The send/receive feature effectively creates (and applies) a set of

data modifications required for converting one subvolume into another.

The send/receive feature can be used with regularly scheduled snapshots for implementing a

simple form of file system master/slave replication, or for the purpose of performing incremental

backups.

3.6 Quota groups-


A quota group (or qgroup) imposes an upper limit to the space a subvolume or snapshot may

consume. A new snapshot initially consumes no quota because its data is shared with its parent,

but thereafter incurs a charge for new files and copy-on-write operations on existing files. When

quotas are active, a quota group is automatically created with each new subvolume or snapshot.

These initial quota groups are building blocks which can be grouped (with the btrfs

qgroup command) into hierarchies to implement quota pools.

Quota groups only apply to subvolumes and snapshots; enforcing quotas on individual subdirectories is not possible.

In-place ext2/3/4 conversion-

As a result of having very little metadata anchored in fixed locations, Btrfs can warp to fit unusual spatial layouts of the backend storage devices. The btrfs-convert tool exploits this ability to do an in-place conversion of any ext2/3/4 file system, by nesting the equivalent Btrfs metadata in its unallocated space, while preserving an unmodified copy of the original file system.

The conversion involves creating a copy of the whole ext2/3/4 metadata, while the Btrfs files

simply point to the same blocks used by the ext2/3/4 files. This makes the bulk of the blocks

shared between the two filesystems before the conversion becomes permanent. Thanks to the

copy-on-write nature of Btrfs, the original versions of the file data blocks are preserved during

all file modifications. Until the conversion becomes permanent, only the blocks that were

marked as free in ext2/3/4 are used to hold new Btrfs modifications, meaning that the conversion

can be undone at any time.

All converted files are available and writable in the default subvolume of the Btrfs. A sparse file

holding all of the references to the original ext2/3/4 filesystem is created in a separate

subvolume, which is mountable on its own as a read-only disk image, allowing both original and

converted file systems to be accessed at the same time. Deleting this sparse file frees up the

space and makes the conversion permanent.

3.7 Seed devices-

When creating a new Btrfs, an existing Btrfs can be used as a read-only "seed" file system.

The new file system will then act as a copy-on-write overlay on the seed. The seed can be later

detached from the Btrfs, at which point the rebalancer will simply copy over any seed data still

referenced by the new file system before detaching. Mason has suggested this may be useful for


a Live CD installer, which might boot from a read-only Btrfs seed on an optical disc, rebalance itself

to the target partition on the install disk in the background while the user continues to work, then

eject the disc to complete the installation without rebooting.

3.8 Encryption-

Though Chris Mason said in his interview in 2009 that encryption was planned for Btrfs, this

is unlikely to be implemented for some time, if ever, due to the complexity of implementation

and pre-existing tested and peer-reviewed solutions. The current recommendation for encryption

with Btrfs is to use a full-disk encryption mechanism such as dm-crypt/LUKS on the underlying

devices, and to create the Btrfs filesystem on top of that layer (and that if a RAID is to be used

with encryption, encrypting a dm-raid device or a hardware-RAID device gives much faster disk

performance than dm-crypt overlaid by Btrfs' own filesystem-level RAID features).

3.9 Checking and recovery-

Unix systems traditionally rely on "fsck" programs to check and repair filesystems.

The btrfsck program is now available but, as of May 2012, it is described by the authors as "relatively new code" that has "not seen widespread testing on a large range of real-life breakage" and "may cause additional damage in the process of repair".

There is another tool, named btrfs-restore, that can be used to recover files from an unmountable

filesystem, without modifying the broken filesystem itself (i.e., non-destructively).

In normal use, Btrfs is mostly self-healing and can recover from broken root trees at mount time,

thanks to making periodic data flushes to permanent storage every 30 seconds (which is the

default period). Thus, isolated errors will cause a maximum of 30 seconds of filesystem changes

to be lost at the next mount. This period can be changed by specifying a desired value (in

seconds) for the commit mount option.


4- Design

Ohad Rodeh's original proposal at USENIX 2007 noted that B+ trees, which are widely used as on-disk data structures for databases, could not efficiently support copy-on-write-based snapshots because their leaf nodes were linked together: if a leaf was copy-on-written, its siblings and parents would have to be as well, as would their siblings and parents, and so on, until the entire tree was copied. He suggested instead a modified B-tree with no leaf linkage, with a refcount associated with each tree node but stored in an ad-hoc free-map structure, and certain relaxations of the tree's balancing algorithms to make them copy-on-write friendly. The result would be a data structure suitable for a high-performance object store that could perform copy-on-write snapshots while maintaining good concurrency.

At Oracle later that year, Chris Mason began work on a snapshot-capable file system that

would use this data structure almost exclusively—not just for metadata and file data, but also

recursively to track space allocation of the trees themselves. This allowed all traversal and

modifications to be funneled through a single code path, against which features such as copy-on-

write, checksumming and mirroring needed to be implemented only once to benefit the entire file

system.

Btrfs is structured as several layers of such trees, all using the same B-tree implementation.

The trees store generic items sorted on a 136-bit key. The first 64 bits of the key are a

unique object id. The middle 8 bits are an item type field; its use is hardwired into code as an

item filter in tree lookups. Objects can have multiple items of multiple types. The remaining

right-hand 64 bits are used in type-specific ways. Therefore items for the same object end up

adjacent to each other in the tree, ordered by type. By choosing certain right-hand key values,

objects can further put items of the same type in a particular order.

Interior tree nodes are simply flat lists of key-pointer pairs, where the pointer is the logical

block number of a child node. Leaf nodes contain item keys packed into the front of the node and

item data packed into the end, with the two growing toward each other as the leaf fills up.


In this section, we will cover the basic data structures used in btrfs. Every tree block is either a node or a leaf, and every node and leaf begins with a header.

// Node:
struct btrfs_node {
	struct btrfs_header header;
	struct btrfs_key_ptr ptrs[];
};

// Leaf:
struct btrfs_leaf {
	struct btrfs_header header;
	struct btrfs_item items[];
};

// Header (present in both node and leaf):
struct btrfs_header {
	u8 csum[32];        /* checksum of the block contents        */
	u8 fsid[16];        /* uuid of the owning filesystem         */
	__le64 blocknr;     /* where this block is supposed to live  */
	__le64 generation;  /* transaction id that allocated the block */
	__le64 owner;
	__le16 nritems;
	__le16 flags;
	u8 level;           /* level of the block in the tree        */
};

// Key pointers (present in node):
struct btrfs_key_ptr {
	struct btrfs_disk_key key;
	__le64 blockptr;
	__le64 generation;
};

// Items (present in leaf):
struct btrfs_item {
	struct btrfs_disk_key key;
	__le32 offset;
	__le32 size;
};

// Key (present in items in the leaf):
struct btrfs_key {
	u64 objectid;
	u8 type;
	u64 offset;
};

Every tree block carries the header. The block header contains a checksum of the block contents, the uuid of the filesystem that owns the block, the level of the block in the tree, and the block number where the block is supposed to live. These fields allow the contents of the metadata to be verified when the data is read. The generation field corresponds to the transaction id that allocated the block. Nodes carry a pointer array that points to other nodes or leaves (i.e., blocks on disk) via the blockptr field of each key pointer.

Now let us look at the leaf structure in more detail. A leaf node contains the header and an array of items. Each logical object in the file system (e.g., a file or directory) comprises various items. The B-tree implementation stores these items in the leaf sorted on a 136-bit key (struct btrfs_key). The first 64 bits of the key are an objectid, a unique id for each logical object; this id is reported as the inode number. The types of items in a leaf can be inodes, directory entries, extents, and so on, associated with the object.

The next field in the btrfs key is the type, which identifies the kind of item associated with the object. The last field in the key is the offset, which is used in type-specific ways, while the offset in struct btrfs_item records the position of the item's data within the leaf. Interestingly, since the objectid forms the most significant bits of the key, all items related to an object end up adjacent to each other, i.e., they are automatically grouped together. This means the metadata, and optionally the data, associated with an object is grouped together, which results in compact packing of data and metadata. Suppose that there are N items in a leaf; then the data item


associated with item[X] is data-item[N-X]. This means the items and the data associated with the items grow towards

each other in the leaf; Figure 3.1 summarizes this paragraph.

Leaf structure in Btrfs

Now let us look at some information about the disk layout. The scheme of storing the items associated with an object together in a leaf is both space- and time-efficient. Normally, file systems put only one kind of data (bitmaps, or inodes, or directory entries) in any given file system block. This wastes disk space, since unused space in one kind of block cannot be used for any other purpose, and it wastes time, since getting to one particular piece of file data requires reading several different kinds of metadata, all located in different blocks in the file system. In btrfs, items are packed together (or pushed out to leaves) in arrangements that optimize both access time and disk space. You can see the difference in these (very schematic, very simplified) diagrams.

Old-school file systems tend to organize data as shown in the first diagram; Btrfs instead creates a disk layout that looks as shown in the second. As we can see, there is no fixed block for inodes, bitmaps, directory entries, file data, or block pointers; the blocks associated with these can overlap for the sake of compaction. The red arrows in the figures show the disk seeks needed to locate data or metadata, and the red portions of the blocks show unused or wasted space. As all metadata related to an object is closely packed, there are fewer disk seeks. Hence the scheme is both time- and space-efficient.


There are various b-trees in btrfs, and everything is stored in them. There is a single body of tree-manipulation code, and the trees do not care about the object types stored in them, so the same code is reused for all kinds of trees in btrfs. Hence the scheme is not only space- and time-efficient, but code-efficient too.

Data organization in old-school filesystems


Data organization in Btrfs

4.1 Root tree-

Every tree appears as an object in the root tree (or tree of tree roots). Some trees, such as file

system trees and log trees, have a variable number of instances, each of which is given its own

object id. Trees which are singletons (the data relocation, extent and chunk trees) are assigned

special, fixed object ids ≤256. The root tree appears in itself as a tree with object id 1.

Trees refer to each other by object id. They may also refer to individual nodes in other trees as a

triplet of the tree's object id, the node's level within the tree and its leftmost key value. Such

references are independent of where the tree is actually stored.
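As a sketch, such a location-independent node reference can be modeled as a small value type. The names below are illustrative only, not btrfs's actual on-disk structures:

```python
from typing import NamedTuple, Tuple

class NodeRef(NamedTuple):
    # Hypothetical field names; btrfs's real structs differ.
    tree_objectid: int             # which tree, e.g. 1 = root tree
    level: int                     # node level, 0 = leaf
    leftmost_key: Tuple[int, ...]  # first key in the referenced node

# Because no disk block address appears in the reference, it stays
# valid when copy-on-write relocates the node on disk.
ref = NodeRef(tree_objectid=1, level=0, leftmost_key=(256, 1, 0))
```

The point of the sketch is that identifying a node by (tree, level, leftmost key) rather than by block address is what lets shadowing move blocks freely.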

4.2 File system tree-

User-visible files and directories all live in a file system tree. There is one file system tree per subvolume. Subvolumes can nest, in which case they appear as a directory item (described below) whose data is a reference to the nested subvolume's file system tree.

Within the file system tree, each file and directory object has an inode item. Extended attributes and ACL entries are stored alongside in separate items.


Within each directory, directory entries appear as directory items, whose right-hand key

values are a CRC32C hash of their filename. Their data is a location key, or the key of the inode

item it points to. Directory items together can thus act as an index for path-to-inode lookups, but

are not used for iteration because they are sorted by their hash, effectively randomly

permuting them. This means user applications iterating over and opening files in a large

directory would thus generate many more disk seeks between non-adjacent files—a notable

performance drain in other file systems with hash-ordered directories such as ReiserFS, ext3
(with HTree indexes enabled) and ext4, all of which have TEA-hashed filenames. To avoid this,

each directory entry has a directory index item, whose right-hand value of the item is set to a per-

directory counter that increments with each new directory entry. Iteration over these index items

thus returns entries in roughly the same order as they are stored on disk.
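The hash used for directory item keys is CRC32C (the Castagnoli polynomial). A minimal bitwise sketch is shown below; btrfs's real implementation is table-driven or hardware-accelerated, and how it maps the hash into a key is simplified away here:

```python
def crc32c(data: bytes) -> int:
    """Bit-at-a-time CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Condition tests the bit *before* the shift.
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

# Standard check value for CRC-32C:
# crc32c(b"123456789") == 0xE3069283
```

Because two unrelated filenames can land anywhere in the 32-bit hash space, iterating directory items in key order visits entries in effectively random order, which is exactly the seek problem the index items solve.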

Besides inode items, files and directories also have a reference item whose right-hand key

value is the object id of their parent directory. The data part of the reference item is the filename

that inode is known by in that directory. This allows upward traversal through the directory

hierarchy by providing a way to map inodes back to paths.

Files with hard links in other directories have multiple reference items, one for each parent

directory. Files with hard links in the same directory pack all of the links' filenames into the

same reference item. This was a design flaw that limited the number of same-directory hard links

to however many could fit in a single tree block. (On the default block size of 4 KB, an average

filename length of 8 bytes and a per-filename header of 4 bytes, this would be less than 350.)
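The back-of-the-envelope limit quoted above follows directly from those numbers; a quick check (using the figures stated in the text, which are averages, not fixed on-disk sizes):

```python
block_size = 4096        # default tree block size, 4 KB
avg_filename_len = 8     # assumed average filename length, bytes
per_name_header = 4      # assumed per-filename header, bytes

# Roughly how many same-directory hard link names fit in one tree block.
max_links = block_size // (avg_filename_len + per_name_header)
```

With these assumptions the result is 341, consistent with the "less than 350" figure in the text.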

Applications which made heavy use of same-directory hard links, such as git, GNUS, GMame and BackupPC, were later observed to fail after hitting this limit. The limit was eventually removed (and, as of October 2012, the fix had been merged pending release in Linux) by introducing spillover extended reference items to hold hard link filenames which could not otherwise fit.

4.3 Relocation trees-

Defragmentation, shrinking and rebalancing operations require extents to be relocated. However,

doing a simple copy-on-write of the relocating extent will break sharing between snapshots and


consume disk space. To preserve sharing, an update-and-swap algorithm is used, with a

special relocation tree serving as scratch space for affected metadata. The extent to be relocated

is first copied to its destination. Then, by following back references upward through the affected

subvolume's file system tree, metadata pointing to the old extent is progressively updated to

point at the new one; any newly updated items are stored in the relocation tree. Once the update

is complete, items in the relocation tree are swapped with their counterparts in the affected

subvolume, and the relocation tree is discarded.
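The update-and-swap sequence can be sketched with a toy model, where a tree simply maps item keys to extent locations. All names and structures here are illustrative, not btrfs internals:

```python
def relocate_extent(tree, old_loc, new_loc, data_store):
    """Toy update-and-swap relocation preserving sharing."""
    # 1. Copy the extent's data to its destination first.
    data_store[new_loc] = data_store[old_loc]
    # 2. Walk items referencing the old extent; updated copies
    #    accumulate in a scratch "relocation tree".
    relocation = {k: new_loc for k, v in tree.items() if v == old_loc}
    # 3. Swap the updated items in for their counterparts.
    tree.update(relocation)
    # 4. The scratch tree is discarded; the old extent can be freed.
    del data_store[old_loc]

tree = {"file-A": 10, "file-B": 10, "file-C": 42}
store = {10: b"shared", 42: b"other"}
relocate_extent(tree, 10, 99, store)
# Both sharers of extent 10 now point at 99, so sharing is preserved.
```

The key property the sketch illustrates: every item that referenced the old extent ends up pointing at one and the same new extent, rather than each snapshot getting its own copy-on-write duplicate.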

Figure: (a) File system forest; (b) the changes that occur after modification.


4.4 Extents-

File data are kept outside the tree in extents, which are contiguous runs of disk blocks. Extent

blocks default to 4KiB in size, do not have headers and contain only (possibly compressed) file

data. In compressed extents, individual blocks are not compressed separately; rather, the

compression stream spans the entire extent.

Files have extent data items to track the extents which hold their contents. The item's right-hand

key value is the starting byte offset of the extent. This makes for efficient seeks in large files

with many extents, because the correct extent for any given file offset can be computed with just

one tree lookup.
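Keying extent data items by starting byte offset makes the lookup an ordered search: find the rightmost item whose key is not greater than the target offset. A sketch using a sorted list in place of the b-tree (extent tuples here are invented for illustration):

```python
import bisect

# Hypothetical extent data items: (start_offset, length, disk_location).
extents = [(0, 4096, "e0"), (4096, 12288, "e1"), (16384, 4096, "e2")]
starts = [e[0] for e in extents]

def extent_for(offset):
    # Rightmost extent whose start <= offset: one ordered lookup,
    # mirroring a single b-tree search in btrfs.
    i = bisect.bisect_right(starts, offset) - 1
    return extents[i]

extent_for(5000)  # falls inside the extent starting at 4096
```

This is why seeks in large, many-extent files stay cheap: the cost is one logarithmic search rather than a walk over block pointers.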

Snapshots and cloned files share extents. When a small part of a large such extent is overwritten,

the resulting copy-on-write may create three new extents: a small one containing the overwritten

data, and two large ones with unmodified data on either side of the overwrite. To avoid having to

re-write unmodified data, the copy-on-write may instead create bookend extents, or extents

which are simply slices of existing extents. Extent data items allow for this by including an offset

into the extent they are tracking: items for bookends are those with non-zero offsets.
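A partial overwrite of a shared extent can then be expressed as one fresh extent plus up to two bookends, where a bookend is just an (offset, length) slice of the existing extent. A toy sketch, with extents as (disk_location, extent_offset, length) tuples of my own invention:

```python
def overwrite(extent, off, new_data):
    """Split a shared extent on partial overwrite using bookends."""
    loc, _, length = extent
    left = (loc, 0, off)                       # bookend: unmodified head
    new = ("new", 0, len(new_data))            # freshly written CoW extent
    right = (loc, off + len(new_data),         # bookend: unmodified tail,
             length - off - len(new_data))     # non-zero extent offset
    return [e for e in (left, new, right) if e[2] > 0]

overwrite(("e0", 0, 100), 10, b"xxxxx")
```

No unmodified data is rewritten: the head and tail items merely point back into the original extent at different offsets.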

If the file data is small enough to fit inside a tree node, it is instead pulled in-tree and stored

inline in the extent data item. Each tree node is stored in its own tree block—a single

uncompressed block with a header. The tree block is regarded as a free-standing, single-block

extent.


5- Performance (Disk)


6- Limitations

Here we list a few of the problems still to be addressed.

1. Transactions

(a) Btrfs supports limited transactions, without full Atomicity-Consistency-Isolation-Durability (ACID) semantics.

(b) Only one transaction may run at a time, and it is not atomic with respect to storage.

2. Checking and recovery

(a) An fsck tool is available, but its use is not yet recommended.


7- Future Development

Some of the planned features are:

1. Encryption

2. Data Deduplication

3. Parity Based RAID(RAID5 and RAID6)

4. Ability to handle swap

5. Incremental dumps


8- References

[1] Ohad Rodeh, "B-trees, Shadowing, and Clones", Oracle, 2008.

[2] Kerner, Sean Michael, "A Better File System for Linux", InternetNews.com. Archived from the original on 24 June 2012. Retrieved 2008-10-30.

[4] Valerie Aurora, "A short history of btrfs", IEEE Magazine, 2011.

[5] Macedonia, M. R., "B-tree file system", IEEE Communications Magazine, vol. 47, pp. S30-S38, Mar. 2007.

[6] Mason, Chris, "Btrfs: a copy on write, snapshotting FS", 2007-06-12.

[7] Brown, Eric, "Linux 3.0 scrubs up Btrfs, gets more Xen", LinuxDevices (eWeek). Archived from the original on 2013-01-27. Retrieved 8 November 2011.
