11/20/2006 ecs150, Fall 2006 1
UCDavis, ecs150Fall 2006
ecs150 Fall 2006: Operating System
#5: File Systems (chapters: 6.4~6.7, 8)
Dr. S. Felix Wu
Computer Science Department
University of California, Davis
http://www.cs.ucdavis.edu/~wu/
File System Abstraction
Files
Directories
System-call interface
Active file entries
VNODE Layer or VFS
Local naming (UFS)
FFS
Buffer cache
Block or character device driver
Hardware
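DIR *dirp = opendir(filename);            /* filename is a const char * */
struct dirent *direntp = readdir(dirp);   /* next entry, or NULL at the end */

struct dirent {
    ino_t d_ino;                  /* i-node number */
    char  d_name[NAME_MAX + 1];   /* file name */
};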
[Figure: a directory holds dirent entries; each dirent pairs a file_name with an inode, and each inode refers to a file.]
Local versus Remote
System Call Interface
V-node
Local versus remote
– NFS or i-node
– Stackable File System
Hard-disk blocks
File-System Structure
File structure
– Logical storage unit
– Collection of related information
File system resides on secondary storage (disks).
File system organized into layers.
File control block: storage structure consisting of information about a file.
File to Disk
separate the disk into blocks
separate the file into blocks as well
paging from file to disk
blocks: 4 - 7 - 2 - 10 - 12
How to represent the file?
How to link these 5 pages together?
BitTorrent pieces
1 big file (X Gigabytes) with a number of pieces (5%) already in (and sharing with others).
How much disk space do we need at this moment?
Hard Disk
Track, Sector, Head
– Track + Heads → Cylinder
Performance
– seek time
– rotation time
– transfer time
LBA
– Logical Block Addressing
File to Disk blocks
file block    disk block
0             4
1             7
2             2
3             10
4             12
What are the disadvantages?
1. Disk access can be slow for “random access”.
2. How big is each block? 64 bytes? 68 bytes?
Kernel Hacking Session
This Friday from 7:30 p.m. until midnight, 3083 Kemper
– Bring your laptop
– And bring your mug…
A File System
[Figure: the disk is split into partitions; one partition holds "sb", the i-list (i-node, i-node, ..., i-node), and the directory and data blocks ("d").]
One Logical File to Physical Disk Blocks
efficient representation & access
An i-node
Typical: each block 8K or 16K bytes
??? entries in one disk block
A file
inode (index node) structure
meta-data of the file (field, size in bytes)
– di_mode 02
– di_nlinks 02
– di_uid 02
– di_gid 02
– di_size 04
– di_addr 39
– di_gen 01
– di_atime 04
– di_mtime 04
– di_ctime 04
struct ufs2_dinode {
    u_int16_t    di_mode;         /*   0: IFMT, permissions; see below. */
    int16_t      di_nlink;        /*   2: File link count. */
    u_int32_t    di_uid;          /*   4: File owner. */
    u_int32_t    di_gid;          /*   8: File group. */
    u_int32_t    di_blksize;      /*  12: Inode blocksize. */
    u_int64_t    di_size;         /*  16: File byte count. */
    u_int64_t    di_blocks;       /*  24: Bytes actually held. */
    ufs_time_t   di_atime;        /*  32: Last access time. */
    ufs_time_t   di_mtime;        /*  40: Last modified time. */
    ufs_time_t   di_ctime;        /*  48: Last inode change time. */
    ufs_time_t   di_birthtime;    /*  56: Inode creation time. */
    int32_t      di_mtimensec;    /*  64: Last modified time. */
    int32_t      di_atimensec;    /*  68: Last access time. */
    int32_t      di_ctimensec;    /*  72: Last inode change time. */
    int32_t      di_birthnsec;    /*  76: Inode creation time. */
    int32_t      di_gen;          /*  80: Generation number. */
    u_int32_t    di_kernflags;    /*  84: Kernel flags. */
    u_int32_t    di_flags;        /*  88: Status flags (chflags). */
    int32_t      di_extsize;      /*  92: External attributes block. */
    ufs2_daddr_t di_extb[NXADDR]; /*  96: External attributes block. */
    ufs2_daddr_t di_db[NDADDR];   /* 112: Direct disk blocks. */
    ufs2_daddr_t di_ib[NIADDR];   /* 208: Indirect disk blocks. */
    int64_t      di_spare[3];     /* 232: Reserved; currently unused */
};
struct ufs1_dinode {
    u_int16_t    di_mode;         /*   0: IFMT, permissions; see below. */
    int16_t      di_nlink;        /*   2: File link count. */
    union {
        u_int16_t oldids[2];      /*   4: Ffs: old user and group ids. */
    } di_u;
    u_int64_t    di_size;         /*   8: File byte count. */
    int32_t      di_atime;        /*  16: Last access time. */
    int32_t      di_atimensec;    /*  20: Last access time. */
    int32_t      di_mtime;        /*  24: Last modified time. */
    int32_t      di_mtimensec;    /*  28: Last modified time. */
    int32_t      di_ctime;        /*  32: Last inode change time. */
    int32_t      di_ctimensec;    /*  36: Last inode change time. */
    ufs1_daddr_t di_db[NDADDR];   /*  40: Direct disk blocks. */
    ufs1_daddr_t di_ib[NIADDR];   /*  88: Indirect disk blocks. */
    u_int32_t    di_flags;        /* 100: Status flags (chflags). */
    int32_t      di_blocks;       /* 104: Blocks actually held. */
    int32_t      di_gen;          /* 108: Generation number. */
    u_int32_t    di_uid;          /* 112: File owner. */
    u_int32_t    di_gid;          /* 116: File group. */
    int32_t      di_spare[2];     /* 120: Reserved; currently unused */
};
BitTorrent pieces
File size: 10 GB
Pieces downloaded: 512 MB
How much disk space do we need?
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>   /* sleep() */

int
main(void)
{
    FILE *f1 = fopen("./sss.txt", "w");
    int i;

    if (f1 == NULL)
        return 1;
    /* Write short strings at random offsets; the gaps in between become holes. */
    for (i = 0; i < 1000; i++) {
        fseek(f1, rand(), SEEK_SET);
        fprintf(f1, "%d%d%d%d", rand(), rand(), rand(), rand());
        if (i % 100 == 0)
            sleep(1);
    }
    fflush(f1);
    fclose(f1);
    return 0;
}

# ./t
# ls -l ./sss.txt
An i-node
Typical: each block 1K
??? entries in one disk block
A file
i-node
How many disk blocks can a FS have?
How many levels of i-node indirection will be necessary to store a file of 2G bytes? (I.e., 0, 1, 2 or 3)
What is the largest possible file size in i-node?
What is the size of the i-node itself for a file of 10GB with only 512 MB downloaded?
Answer
How many disk blocks can a FS have?
– $2^{64}$ or $2^{32}$: pointer (to blocks) size is 8/4 bytes.
How many levels of i-node indirection will be necessary to store a file of 2G ($2^{31}$) bytes? (I.e., 0, 1, 2 or 3)
– $12 \cdot 2^{10} + 2^{8} \cdot 2^{10} + 2^{8} \cdot 2^{8} \cdot 2^{10} + 2^{8} \cdot 2^{8} \cdot 2^{8} \cdot 2^{10} >? 2^{31}$
What is the largest possible file size in i-node?
– $12 \cdot 2^{10} + 2^{8} \cdot 2^{10} + 2^{8} \cdot 2^{8} \cdot 2^{10} + 2^{8} \cdot 2^{8} \cdot 2^{8} \cdot 2^{10}$
– $2^{64} - 1$
– $2^{32} \cdot 2^{10}$
You need to consider three issues and find the minimum!
Answer
How many pointers?
– 512MB divided by the block size (1K)
– 512K pointers times 8 (4) bytes = 4 (2) MB
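Written out step by step, and counting only the pointers to data blocks (the indirect blocks that hold those pointers are ignored here, as they appear to be on the slide):

```latex
\[
\frac{512\ \text{MB}}{1\ \text{KB}} = \frac{2^{29}}{2^{10}} = 2^{19} = 512\text{K pointers}
\]
\[
2^{19} \times 8\ \text{bytes} = 2^{22}\ \text{bytes} = 4\ \text{MB},
\qquad
2^{19} \times 4\ \text{bytes} = 2^{21}\ \text{bytes} = 2\ \text{MB}
\]
```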
FFS and UFS
/usr/src/sys/ufs/ffs/*
– Higher-level: directory structure
– Soft updates & Snapshot
/usr/src/sys/ufs/ufs/*
– Lower-level: buffer, i-node
# of i-nodes
UFS1: pre-allocation
– 3% of HD, about < 25% used.
UFS2: dynamic allocation
– Still limited # of i-nodes
di_size vs. di_blocks
???
di_size: logical size (fstat)
di_blocks: physical size (du)
Extended Attributes in UFS2
Attributes associated with the File
– di_extb[2];
– two blocks, but indirection if needed.
Format
– Length 4
– Name Space 1
– Content Pad Length 1
– Name Length 1
– Name mod 8
– Content variable
Applications: ACL, Data Labelling
Some thoughts….
What can you do with “extended attributes”?
How to design/implement?
– Should/can we do it with “Stackable File Systems”?
– Otherwise, the program to manipulate the EA’s will have to be very UFS2-dependent, or use FiST with a UFS2 optimization option.
Are there any counter examples?
– security and performance considerations.
struct dirent {
    ino_t d_ino;
    char  d_name[NAME_MAX + 1];
};
struct stat {
    …
    nlink_t st_nlink;   /* number of hard links */
    …
};
[Figure: a directory holds dirent entries; each dirent pairs a file_name with an inode, and each inode refers to a file.]
ln -s /usr/src/sys/sys/proc.h ppp.h
ln /usr/src/sys/sys/proc.h ppp.h
File System Buffer Cache
application: read/write files
OS: translate file to disk blocks
– maintains the buffer cache
– controls disk accesses: read/write blocks
hardware:
Any problems?
File System Consistency
To maintain file system consistency, the ordering of updates from buffer cache to disk is critical.
Example:
– if the directory block is written back before the i-node and the system crashes, the directory structure will be inconsistent
File System Consistency
File systems almost always use a buffer/disk cache for performance reasons.
Two copies of a disk block (buffer cache, disk): consistency problem if the system crashes before all the modified blocks are written back to disk.
This problem is critical especially for the blocks that contain control information: i-node, free-list, directory blocks.
Write back critical blocks from the buffer cache to disk immediately.
Data blocks are also written back periodically: sync
Two Strategies
Prevention
– Use un-buffered I/O when writing i-nodes or pointer blocks
– Use buffered I/O for other writes and force sync every 30 seconds
Detect and Fix
– Detect the inconsistency
– Fix them according to the “rules”
– Fsck (File System Checker)
File System Integrity
Block consistency:
– Block-in-use table
– Free-list table
File consistency:
– how many directories pointing to that i-node?
– nlink?
– three cases: D == L, L > D, D > L
What to do with the latter two cases?
Block-in-use table: 0 1 1 1 0 0 0 1 0 0 0 2
Free-list table:    1 0 0 0 1 1 1 0 1 0 2 0
File System Integrity
File system states
(a) consistent
(b) missing block
(c) duplicate block in free list
(d) duplicate data block
Metadata Operations
Metadata operations modify the structure of the file system
– Creating, deleting, or renaming files, directories, or special files
– Directory & I-node
Data must be written to disk in such a way that the file system can be recovered to a consistent state after a system crash
Metadata Integrity
FFS uses synchronous writes to guarantee the integrity of metadata
– Any operation modifying multiple pieces of metadata will write its data to disk in a specific order
– These writes will be blocking
Guarantees integrity and durability of metadata updates
Deleting a file (I)
[Figure: directory entries abc, def, ghi pointing to i-node-1, i-node-2, i-node-3]
Assume we want to delete file “def”
Deleting a file (II)
[Figure: directory entries abc, def, ghi; only i-node-1 and i-node-3 remain, so def points to a missing i-node (?)]
Cannot delete i-node before directory entry “def”
Deleting a file (III)
Correct sequence is
1. Write to disk directory block containing deleted directory entry “def”
2. Write to disk i-node block containing deleted i-node
Leaves the file system in a consistent state
Creating a file (I)
[Figure: directory entries abc, ghi pointing to i-node-1, i-node-3]
Assume we want to create new file “tuv”
Creating a file (II)
[Figure: directory entries abc, ghi, tuv; only i-node-1 and i-node-3 exist, so tuv points to a missing i-node (?)]
Cannot write directory entry “tuv” before i-node
Creating a file (III)
Correct sequence is
1. Write to disk i-node block containing new i-node
2. Write to disk directory block containing new directory entry
Leaves the file system in a consistent state
Synchronous Updates
Used by FFS to guarantee consistency of metadata:
– All metadata updates are done through blocking writes
Increases the cost of metadata updates
Can significantly impact the performance of whole file system
SOFT UPDATES
Use delayed writes (write back)
Maintain dependency information about cached pieces of metadata: This i-node must be updated before/after this directory entry
Guarantee that metadata blocks are written to disk in the required order
3 Soft Update Rules
Never point to a structure before it has been initialized.
Never reuse a resource before nullifying all previous pointers to it.
Never reset the old pointer to a live resource before the new pointer has been set.
Problem #1 with S.U.
Synchronous writes guaranteed that metadata operations were durable once the system call returned
Soft Updates guarantee that file system will recover into a consistent state but not necessarily the most recent one
– Some updates could be lost
We want to delete file “foo” and create new file “bar”
[Figure: Block A holds the directory entry foo and the NEW entry bar; Block B holds i-node-2 and the NEW i-node-3.]
What are the dependency relationships?
We want to delete file “foo” and create new file “bar”
[Figure: the same Block A and Block B, annotated “X-2nd” and “Y-1st”.]
Circular Dependency
Problem #2 with S.U.
Cyclical dependencies:
– Same directory block contains entries to be created and entries to be deleted
– These entries point to i-nodes in the same block
Brainstorming:
– How to resolve this issue in S.U.?
How to update? i-node first or directory block first?
Solution in S.U.
Roll back metadata in one of the blocks to an earlier, safe state
(Safe state does not contain new directory entry)
[Figure: Block A’, containing the entry def]
Write first block with metadata that were rolled back (block A’ of example)
Write blocks that can be written after first block has been written (block B of example)
Roll forward block that was rolled back
Write that block
Breaks the cyclical dependency but must now write block A twice
Before any Write Operation
– SU Dependency Checking (roll back if necessary)
After any Write Operation
– SU Dependency Processing (task list updating) (roll forward if necessary)
Two most popular approaches for improving the performance of metadata operations and recovery:
– Journaling
– Soft Updates
Journaling systems record metadata operations on an auxiliary log
Soft Updates uses ordered writes
JOURNALING
Journaling systems maintain an auxiliary log that records all meta-data operations
Write-ahead logging ensures that the log is written to disk before any blocks containing data modified by the corresponding operations.
– After a crash, can replay the log to bring the file system to a consistent state
JOURNALING
Log writes are performed in addition to the regular writes
Journaling systems incur log write overhead but
– Log writes can be performed efficiently because they are sequential (block operation consideration)
– Metadata blocks do not need to be written back after each update
JOURNALING
Journaling systems can provide
– same durability semantics as FFS if log is forced to disk after each meta-data operation
– the laxer semantics of Soft Updates if log writes are buffered until entire buffers are full
Soft Updates vs. Journaling
Advantages
Disadvantages
With Soft Updates??
CPU
Do we still need “FSCK” at boot time?
Recover the Missing Resources
In the background, in an active FS…
– We don’t want to wait for the lengthy FSCK process to complete…
A related issue:
– the virus scanning process
– what happens if we get a new virus signature?
Snapshot of the FS
backup and restore
dump reliably an active File System
– what will we do today to dump our 40GB FS “consistent” snapshots? (in the midnight…)
“background FSCK checks”
What is a snapshot? (I mean “conceptually”.)
Freeze all activities related to the FS.
Copy everything to “some space”.
Resume the activities.
How do we efficiently implement this concept such that the activities will only be blocked for about 0.25 seconds, and we don’t have to buy a really big hard drive?
Copy-on-Write
Snapshot: a file
Logical size versus physical size
Example
# mkdir /backups/usr/noon
# mount -u -o snapshot /usr/snap.noon /usr
# mdconfig -a -t vnode -u 0 -f /usr/snap.noon
# mount -r /dev/md0 /backups/usr/noon
/* do whatever you want to test it */
# umount /backups/usr/noon
# mdconfig -d -u 0
# rm -f /usr/snap.noon
A File System
??? entries in one disk block
A file
A Snapshot i-node
??? entries in one disk block
A file
Not used or not yet copied
Copy-on-write
??? entries in one disk block
A file
Not used or not yet copied
Multiple Snapshots
about 20 snapshots
Interactions/sharing among snapshots
VFS: the FS Switch
[Figure: user space sits above the syscall layer (file, uio, etc.); below it the Virtual File System (VFS) switches among NFS (over the network protocol stack, TCP/IP), FFS, LFS, and other file systems (*FS etc.), all above the device drivers.]
Sun Microsystems introduced the virtual file system interface in 1985 to accommodate diverse filesystem types cleanly.
VFS allows diverse specific file systems to coexist in a file tree, isolating all FS-dependencies in pluggable filesystem modules.
VFS was an internal kernel restructuring with no effect on the syscall interface.
Incorporates object-oriented concepts: a generic procedural interface with multiple implementations.
Based on abstract objects with dynamic method binding by type...in C.
Other abstract interfaces in the kernel: device drivers, file objects, executable files, memory objects.
vnode
In the VFS framework, every file or directory in active use is represented by a vnode object in kernel memory.
[Figure: the syscall layer above NFS and UFS vnodes, with a list of free vnodes]
Each vnode has a standard file attributes struct.
Vnode operations are macros that vector to filesystem-specific procedures.
Generic vnode points at filesystem-specific struct (e.g., inode, rnode), seen only by the filesystem.
Each specific file system maintains a cache of its resident vnodes.
vnode Operations and Attributes
directories only
vop_lookup (OUT vpp, name)
vop_create (OUT vpp, name, vattr)
vop_remove (vp, name)
vop_link (vp, name)
vop_rename (vp, name, tdvp, tvp, name)
vop_mkdir (OUT vpp, name, vattr)
vop_rmdir (vp, name)
vop_symlink (OUT vpp, name, vattr, contents)
vop_readdir (uio, cookie)
vop_readlink (uio)
files only
vop_getpages (page**, count, offset)
vop_putpages (page**, count, sync, offset)
vop_fsync ()
vnode attributes (vattr)
type (VREG, VDIR, VLNK, etc.)
mode (9+ bits of permissions)
nlink (hard link count)
owner user ID
owner group ID
filesystem ID
unique file ID
file size (bytes and blocks)
access time
modify time
generation number
generic operations
vop_getattr (vattr)
vop_setattr (vattr)
vhold()
vholdrele()
Network File System (NFS)
[Figure: on the client, user programs call the syscall layer and then VFS, which dispatches to UFS or to the NFS client; the NFS client talks over the network to the NFS server, which goes through the server's VFS to UFS.]
vnode Cache
HASH(fsid, fileid)
VFS free list head
Active vnodes are reference-counted by the structures that hold pointers to them.
- system open file table
- process current directory
- file system mount points
- etc.
Each specific file system maintains its own hash of vnodes (BSD).
- specific FS handles initialization
- free list is maintained by VFS
vget(vp): reclaim cached inactive vnode from VFS free list
vref(vp): increment reference count on an active vnode
vrele(vp): release reference count on a vnode
vgone(vp): vnode is no longer valid (file is removed)
struct vnode {
    struct mtx v_interlock;          /* lock for "i" things */
    u_long v_iflag;                  /* i vnode flags (see below) */
    int v_usecount;                  /* i ref count of users */
    long v_numoutput;                /* i writes in progress */
    struct thread *v_vxthread;       /* i thread owning VXLOCK */
    int v_holdcnt;                   /* i page & buffer references */
    struct buflists v_cleanblkhd;    /* i SORTED clean blocklist */
    struct buf *v_cleanblkroot;      /* i clean buf splay tree */
    int v_cleanbufcnt;               /* i number of clean buffers */
    struct buflists v_dirtyblkhd;    /* i SORTED dirty blocklist */
    struct buf *v_dirtyblkroot;      /* i dirty buf splay tree */
    int v_dirtybufcnt;
How Stacking Works
[Figure: a user process calls read(); the call crosses from USER to KERNEL through the System Call Interface and the File System Interface to ncryptfs_read() in NCryptfs, which calls ext2fs_read() in EXT2FS; data & error codes flow back up through each layer.]
FiST: File System Translator
Language + compiler
Code portability
Average code size over other stackable file-systems is reduced ten times.
Average development time is reduced seven times.
Developers need only to describe the core functionality of their file systems.
Basefs = minimalist template derived from Wrapfs
Extending platform-specific vnode interfaces in a platform independent way.
Transaction-based FS
Performance versus consistency
“Atomic Writes” on Multiple Blocks
– See the paper titled “Atomic Writes for Data Integrity and Consistency in Shared Storage Devices for Clusters” by Okun and Barak, FGCS, vol. 20, pages 539-547, 2004.
– Modify SCSI handling