CS 4284 Systems Capstone
Godmar Back
Disks & File Systems
Filesystems
CS 4284 Spring 2013
Files vs Disks
File Abstraction
• Byte oriented
• Names
• Access protection
• Consistency guarantees
Disk Abstraction
• Block oriented
• Block #s
• No protection
• No guarantees beyond block write
Filesystem Requirements
• Naming
– Should be flexible, e.g., allow multiple names for the same file
– Support hierarchy for ease of use
• Persistence
– Want to be sure data has been written to disk in case a crash occurs
• Sharing/Protection
– Want to restrict who has access to files
– Want to share files with other users
FS Requirements (cont’d)
• Speed & Efficiency for different access patterns
– Sequential access
– Random access
– Sequential is most common & random next
– Other pattern is keyed access (not usually provided by OS)
• Minimum Space Overhead
– Disk space needed to store metadata is lost for user data
• Twist: all metadata that is required to do translation must be stored on disk
– Translation scheme should minimize number of additional accesses for a given access pattern
– Harder than, say, page tables, where we assumed page tables themselves are not subject to paging!
Filesystems
Software Architecture (including in-memory data structures)
Overview
File Operations: create(), unlink(), open(), read(), write(), close()
• Uses names for files
• Views files as sequence of bytes
File System
• Must implement translation (file name, file offset) -> (disk id, disk sector, sector offset)
• Must manage free space on disk
Buffer Cache
Device Driver
• Uses disk id + sector indices
The Big Picture
[Figure: data structures involved in file access]
• Per-process file descriptor table (in PCB; entries 0, 1, 2, 3, 4, 5, …)
• Open file table: data structures to keep track of open files
– struct file: inode + position + …
– struct dir: inode + position
– struct inode
• Buffer Cache: cached data and metadata
• On-Disk Data Structures: Filesystem Information, File Descriptors (inodes), Directory Data, File Data
Steps in Opening & Reading a File
• Lookup (via directory)
– Find on-disk file descriptor’s block number
• Find entry in open file table (struct inode list in Pintos)
– Create one if none, else increment ref count
• Find where file data is located
– By reading on-disk file descriptor
• Read data & return to user
Open File Table
• inode – represents file
– At most 1 in-memory instance per unique file
– Number of openers & other properties
• file – represents one or more processes using a file
– With separate offsets for byte-stream
• dir – represents an open directory file
• Generally:
– None of the data in OFT is persistent
– Reflects how processes are currently using files
– Lifetime of objects determined by open/close
• Reference counting is used
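A minimal sketch of the open file table idea, not the Pintos implementation: one in-memory inode per unique file, reference counted, and one file object per opener with its own offset. The class and field names (OpenFileTable, open_cnt, pos) and the use of a sector number as the key are assumptions made for illustration.

```python
class Inode:
    """In-memory inode: at most one instance per unique on-disk inode."""
    def __init__(self, sector):
        self.sector = sector   # block number of the on-disk file descriptor
        self.open_cnt = 0      # number of openers (reference count)

class File:
    """One opener's view of a file, with its own byte offset."""
    def __init__(self, inode):
        self.inode = inode
        self.pos = 0

class OpenFileTable:
    def __init__(self):
        self.inodes = {}       # sector -> Inode; nothing here is persistent

    def open(self, sector):
        # Reuse the existing in-memory inode, or create one if none exists.
        inode = self.inodes.get(sector)
        if inode is None:
            inode = Inode(sector)
            self.inodes[sector] = inode
        inode.open_cnt += 1
        return File(inode)

    def close(self, file):
        # Drop one reference; the last close ends the inode's lifetime.
        inode = file.inode
        inode.open_cnt -= 1
        if inode.open_cnt == 0:
            del self.inodes[inode.sector]

oft = OpenFileTable()
f1 = oft.open(23)
f2 = oft.open(23)   # same file opened twice: shared inode, separate offsets
f1.pos = 100
```

After these calls f1 and f2 share one inode with two openers, while f2.pos is still 0.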
File Descriptors (“inodes”)
• Term “inode” can refer to 3 things:
1. in-memory inode
– Stores information about an open file, such as how many openers; corresponds to on-disk file descriptor
2. on-disk inode
– Region on disk, entry in file descriptor table, that stores persistent information about a file – who owns it, where to find its data blocks, etc.
3. on-disk inode, when cached in buffer cache
– A bytewise copy of 2. in memory
– Q.: Should in-memory inode store a pointer to cached on-disk inode? (Answer: No.)
Filesystems
On-Disk Data Structures and Allocation Strategies
Filesystem Information
• Contains “superblock”, which stores information such as size of entire filesystem, etc.
– Location of file descriptor table & free map
• Free Block Map
– Bitmap used to find free blocks
– Typically cached in memory
• Superblock & free map often replicated in different positions on disk
[Figure: Super Block, followed by Free Block Map 0100011110101010101]
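A minimal sketch of a free block map kept as an in-memory bitmap, seeded with the bit string shown on the slide. The slide does not say which bit value means "in use"; the sketch assumes 1 = used and 0 = free, and the names FreeMap, allocate, and release are hypothetical.

```python
class FreeMap:
    def __init__(self, bits):
        # Assumption: '1' marks a block in use, '0' marks a free block.
        self.used = [b == "1" for b in bits]

    def allocate(self):
        """Find the first free block, mark it used, and return its number."""
        for block, in_use in enumerate(self.used):
            if not in_use:
                self.used[block] = True
                return block
        raise OSError("no free blocks")

    def release(self, block):
        """Mark a block as free again."""
        self.used[block] = False

free_map = FreeMap("0100011110101010101")
first = free_map.allocate()    # block 0 is free under the assumption above
second = free_map.allocate()   # block 1 is used, so the next free one is 2
```

A real filesystem would also have to write the changed bitmap back to disk; the sketch covers only the cached in-memory copy.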
File Allocation Strategies
• Contiguous allocation
• Linked files
• Indexed files
• Multi-level indexed files
Contiguous Allocation
• Idea: allocate files in contiguous blocks
• File Descriptor = (first block, length)
• Good sequential & random access
• Problems:
– Hard to extend files
– May require expensive compaction
– External fragmentation
– Analogous to segmentation-based VM
• Pintos’s baseline implementation does this
[Figure: File A and File B, each occupying one contiguous run of blocks]
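With a (first block, length) descriptor, offset translation is a single addition, which is why random access is cheap. A minimal sketch, assuming 512-byte sectors and a length measured in blocks; the function name contiguous_lookup is hypothetical.

```python
SECTOR_SIZE = 512  # assumed sector size in bytes

def contiguous_lookup(first_block, length, offset):
    """Map a byte offset in a contiguous file to (sector, offset in sector)."""
    index = offset // SECTOR_SIZE          # which block of the file
    if offset < 0 or index >= length:
        raise ValueError("offset beyond end of file")
    return first_block + index, offset % SECTOR_SIZE

# A file of 4 blocks starting at block 10: byte 1300 lies in its third block.
location = contiguous_lookup(10, 4, 1300)
```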
Linked Files
• Idea: implement linked list
– either with variable sized blocks
– or fixed sized blocks (“clusters”)
• Solves fragmentation problem, but now
– need lots of seeks for sequential accesses and random accesses
– unreliable: lose first block, may lose file
• Solution: keep linked list in memory
– DOS: FAT (File Allocation Table)
[Figure: on-disk layout with interleaved blocks: File A Part 1, File B Part 1, File A Part 2, File B Part 2]
DOS FAT
• FAT stored at beginning of disk & replicated for redundancy
• FAT cached in memory
• Size: n-bit entries, 2^m-byte blocks -> 2^(m+n) byte limit
– n = 12, 16, 28
– m = 9 … 15 (0.5KB-32KB)
• As disk size grows, m & n must grow
– Growth of n means larger in-memory table

FAT (entry number and the next block it points to):
Entry:  1   2   3   4   5   6   7   8   9  10  11  12
Next:   6   0   5  -1   7  -1  11   0  -1   9  -1  10

Directory:
Filename  Length  First Block
“a”       2       1
“b”       4       3
“c”       3       12
“d”       1       4
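The example FAT and directory above can be walked mechanically. A minimal sketch, assuming -1 marks the end of a chain (entries holding 0 are never reached from any file, so they are presumably free blocks); the names FAT, DIRECTORY, and blocks_of are hypothetical.

```python
END_OF_FILE = -1

# Entry number -> next block, copied from the slide's example table.
FAT = {1: 6, 2: 0, 3: 5, 4: -1, 5: 7, 6: -1,
       7: 11, 8: 0, 9: -1, 10: 9, 11: -1, 12: 10}

# File name -> (length in blocks, first block), copied from the slide.
DIRECTORY = {"a": (2, 1), "b": (4, 3), "c": (3, 12), "d": (1, 4)}

def blocks_of(name):
    """Follow the FAT chain of a file and return its blocks in order."""
    length, block = DIRECTORY[name]
    chain = []
    while block != END_OF_FILE:
        chain.append(block)
        block = FAT[block]
    if len(chain) != length:
        raise ValueError("chain length disagrees with directory entry")
    return chain

chain_b = blocks_of("b")   # 3 -> 5 -> 7 -> 11
chain_c = blocks_of("c")   # 12 -> 10 -> 9
```

Reaching the k-th block of a file takes k lookups in the table, which is cheap only because the FAT is cached in memory.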
DOS FAT Scalability Limits
• FAT-12 uses 12-bit entries, max of 4096 clusters
– FAT-16: 65536 clusters; FAT-32 uses 28 bits, so theoretical max of 2^28 (256 Mi) clusters
• Floppy disk, say 1.4MB; FAT-12, 1K clusters, need 1,400 entries, 2 bytes each -> 2.8KB
• Modern disk, say ~2 TB (~2^41 bytes)
– At 4 KB cluster size, would need 2^29 entries. Each entry at 4 bytes, would need 2^31 bytes, or 2GB, RAM just to hold the FAT.
– At 32 KB cluster size, would need only 1/8, but still 256MB RAM to hold FAT; simple operations, such as determining how much space is free on disk, require reading entire FAT
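The table sizes above follow from one division and one multiplication. A minimal sketch that reproduces the arithmetic for the floppy and for a 2^41-byte disk; the function name fat_size_bytes is hypothetical, and the floppy figures use the slide's rounded decimal units (1.4 million bytes, 1000-byte clusters, 2 bytes per entry).

```python
def fat_size_bytes(disk_bytes, cluster_bytes, entry_bytes):
    """Bytes needed for a FAT with one entry per cluster."""
    entries = disk_bytes // cluster_bytes
    return entries * entry_bytes

floppy = fat_size_bytes(1_400_000, 1000, 2)       # 1,400 entries of 2 bytes
big_4k = fat_size_bytes(2**41, 4 * 1024, 4)       # 2^29 entries of 4 bytes
big_32k = fat_size_bytes(2**41, 32 * 1024, 4)     # 2^26 entries of 4 bytes
```

Here floppy is 2,800 bytes, big_4k is 2^31 bytes (2 GB), and big_32k is 2^28 bytes (256 MB), one eighth of the 4 KB case.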
Blocksize Trade-Offs
[Figure: chart of block size trade-offs]
• Chart above assumes all files are 2KB in size (observed median file size is about 2KB)
– Larger blocks: faster reads (because seeks are amortized & more bytes per transfer)
– More wastage (2KB file in 32KB block means 15/16th are unused)
• Source: Tanenbaum, Modern Operating Systems
Indexed Allocation
• Single-index: specify maximum filesize, create index array, then note blocks in index
– Random access ok – one translation step
– Sequential access requires more seeks – depending on contiguous allocation
• Drawback: hard to grow beyond maximum
[Figure: File A Index block pointing to File A Part 1, File A Part 2, and File A Part 3]
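Multi-Level Indices
• Used in Unix & (possibly) Pintos (P4)
[Figure: inode holding direct entries 1..N plus first-level (FLI), second-level (SLI), and third-level (TLI) index entries. Direct Blocks cover file blocks 1..N; the Indirect Block (index) covers N+1..N+I; the Double Indirect Block (index2) covers N+I+1..N+I+I^2; the Triple Indirect Block (index3) covers the blocks after N+I+I^2.]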
[Figure: Logical View (per file), indexed by offset in file, next to Physical View (on disk, ignoring other files), indexed by sector numbers on disk (0 … 35). The physical view marks which sectors hold the Inode, Data, Index, and Index2 blocks.]
Multi-Level Indices
• If filesz < N * BLKSIZE, can store all information in direct block array
– Biased in favor of small files (ok because most files are small…)
• Assume index block stores I entries
– If filesz < (I + N) * BLKSIZE, 1 indirect block suffices
• Q.: What’s the maximum size before we need triple-indirect block?
• Q.: What’s the per-file overhead (best case, worst case)?
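One way to work through the first question: a file needs no triple-indirect block as long as it fits in N + I + I^2 blocks. The sketch below classifies a 0-based file block number by index level and computes that bound. The concrete values (512-byte blocks, N = 12 direct entries, 4-byte index entries so I = 128) are assumptions for illustration, not parameters given in the slides; the function name locate is hypothetical.

```python
BLKSIZE = 512       # assumed block size in bytes
N = 12              # assumed number of direct entries in the inode
I = BLKSIZE // 4    # assumed 4-byte entries, so 128 entries per index block

def locate(block):
    """Return the index level and the entry indices for a file block."""
    if block < N:
        return ("direct", block)
    block -= N
    if block < I:
        return ("indirect", block)
    block -= I
    if block < I * I:
        return ("double", block // I, block % I)
    block -= I * I
    if block < I ** 3:
        return ("triple", block // (I * I), (block // I) % I, block % I)
    raise ValueError("file block out of range")

# Largest file size (in bytes) that avoids the triple-indirect block.
max_without_triple = (N + I + I * I) * BLKSIZE
```

Under these assumptions the bound is (12 + 128 + 16384) * 512 = 8,460,288 bytes, and a lookup costs one extra block read per index level.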
Extents
• Index-tree based scheme avoids external fragmentation, and is efficient for small files, but incurs relatively high meta-data overhead for large files
• Extents can improve that – store (bnum, length) pair to denote that file occupies blocks [bnum, … , bnum+length-1]
– But complicates offset -> sector translation
– Used in ext4.
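A minimal sketch of why extents complicate translation: instead of indexing an array, the lookup has to find the extent that contains the requested file block. It assumes the extents are kept as a list of (bnum, length) pairs in file order and scans them linearly; this is an illustration and not how ext4 stores or searches its extents. The function name extent_lookup is hypothetical.

```python
def extent_lookup(extents, block):
    """Map a 0-based file block number to a disk block number."""
    for bnum, length in extents:
        if block < length:
            return bnum + block      # inside this extent
        block -= length              # skip past this extent
    raise ValueError("block beyond end of file")

# A 5-block file stored as two extents: blocks 100..102, then 50..51.
extents = [(100, 3), (50, 2)]
fourth_block = extent_lookup(extents, 3)
```

Two pairs describe the whole file here, where an index-based scheme would need five block pointers.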
Storing Inodes
• Unix v7, BSD 4.3
[Figure: Superblock | I0 I1 I2 I3 I4 ….. | Rest of disk for files & directories]
• FFS (BSD 4.4)
[Figure: disk divided into cylinder groups CGi: SB1 | I0 I1 … | Files … , SB2 | I3 I4 ….. | Files … , SB3 | I8 I9 ….. | Files …]
• Cylinder groups have superblock + bitmap + inode list + file space
• Try to allocate file & inode in same cylinder group to improve access locality
Positioning Inodes
• Putting inodes in fixed place makes finding inodes easier
– Can refer to them simply by inode number
– After crash, there is no ambiguity as to what are inodes vs. what are regular files
• Disadvantage: limits the number of files per filesystem at creation time
– Use “df -ih” on Linux/ext3 to see how many inodes are used/free
Filesystems
Directories and Name Resolution
Directories
• Need to find file descriptor (inode), given a name
• Approaches:
– Single directory (old PCs), two-level approaches with 1 directory per user
• Now exclusively hierarchical approaches:
– File system forms a tree (or DAG)
• How to tell regular file from directory?
– Set a bit in the inode
• Data Structures
– Linear list of (inode, name) pairs
– B-Trees that map name -> inode
– Combinations thereof
Using Linear Lists
• Advantage: (relatively) simple to implement
• Disadvantages:
– Scan makes lookup (& delete!) really slow for large directories
– Could cause fragmentation (though not a problem in practice)
[Figure: directory file contents starting at offset 0: (inode # 23, “multi-oom”), (inode # 15, “sample.txt”)]
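A minimal sketch of a linear-list directory, using the two entries from the figure. Both lookup and delete have to scan from the start, which is what makes them slow for large directories. The in-memory list of (inode, name) pairs and the function names lookup and remove are assumptions for illustration; an on-disk directory would store fixed-size or length-prefixed records instead.

```python
def lookup(entries, name):
    """Scan the list for a name; return its inode number or None."""
    for inode, entry_name in entries:
        if entry_name == name:
            return inode
    return None

def remove(entries, name):
    """Scan for a name and delete its entry; return True if it was found."""
    for i, (_inode, entry_name) in enumerate(entries):
        if entry_name == name:
            del entries[i]
            return True
    return False

directory = [(23, "multi-oom"), (15, "sample.txt")]
found = lookup(directory, "sample.txt")
missing = lookup(directory, "no-such-file")
```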
Using B+-Trees
• Advantages:
– Scalable to large number of files: in growth, in lookup time
• Disadvantage:
– Complex
– Overhead for small directories (some filesystems switch to B+-Tree only for large directories)
• Note: some filesystems use B+-Tree not only for directory files, but for block indexes as well.
– HFS’s ‘catalog’ – single B+-Tree that stores inodes + directories.
– Also done in NTFS, XFS & Reiserfs, ZFS, and Btrfs
(Source: Wikipedia)
Absolute Paths
• How to resolve a path name such as “/usr/bin/ls”?
– Split into tokens using “/” separator
– Find inode corresponding to root directory
• (how? Use fixed inode # for root)
– (*) Look up “usr” in root directory, find inode
– If not last component in path, check that inode is a directory. Go to (*), looking for next component
– If last component in path, check inode is of desired type, return
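The resolution loop above can be sketched over a toy in-memory inode table. Everything concrete here is an assumption for illustration: the inode numbers, the choice of 1 as the fixed root inode, and the dictionary layout standing in for on-disk inodes and directory files. The final "desired type" check is left to the caller.

```python
ROOT_INODE = 1  # assumed fixed inode number for the root directory

# Toy inode table: inode number -> (type, directory entries or None).
INODES = {
    1: ("dir", {"usr": 2}),
    2: ("dir", {"bin": 3}),
    3: ("dir", {"ls": 4}),
    4: ("file", None),
}

def resolve(path):
    """Resolve an absolute path to an inode number."""
    if not path.startswith("/"):
        raise ValueError("not an absolute path")
    tokens = [t for t in path.split("/") if t]   # split on "/" separator
    inum = ROOT_INODE
    for token in tokens:
        kind, entries = INODES[inum]
        if kind != "dir":                        # only directories have entries
            raise NotADirectoryError(token)
        if token not in entries:
            raise FileNotFoundError(token)
        inum = entries[token]                    # step to the next component
    return inum

ls_inode = resolve("/usr/bin/ls")
```

Resolving a relative path would use the same loop, starting from the process's current directory inode instead of ROOT_INODE.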
Name Resolution
• Must have a way to scan an entire directory without other processes interfering -> need a “lock” function
– But don’t need to hold lock on /usr when scanning /usr/bin
• Directories can only be removed if they’re empty
– Requires synchronization also
• Most OSes cache translations in “namei” cache – maps absolute pathnames to inode
– Must keep namei cache consistent if files are deleted
Current Directory
• Relative pathnames are resolved relative to current directory
– Provides default context
– Every process has one in Unix/Pintos
• chdir(2) changes current directory
– cd tmp; ls; pwd vs (cd tmp; ls); pwd
• Lookup algorithm the same, except starts from current dir
– Process should keep current directory open
– Current directory inherited from parent
Hard & Soft Links
• Provides aliases (different names) for a file
• Hard links: (Unix: ln)
– Two independent directory entries have the same inode number, refer to same file
– Inode contains a reference count
– Disadvantage: alias only possible within the same filesystem
• Soft links: (Unix: ln -s)
– Special type of file (noted in inode); content of file is absolute or relative pathname – stored inside inode instead of direct block list
• Windows: “junctions” & “shortcuts”
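The difference between the two kinds of link can be observed from user space. A minimal sketch using Python's os module in a temporary directory, assuming a Unix-like system whose filesystem supports both link types; the file names are arbitrary.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "file.txt")
    hard = os.path.join(d, "hard.txt")
    soft = os.path.join(d, "soft.txt")
    with open(target, "w") as f:
        f.write("hello")

    os.link(target, hard)            # like: ln file.txt hard.txt
    os.symlink("file.txt", soft)     # like: ln -s file.txt soft.txt

    # Hard link: a second directory entry for the same inode.
    same_inode = os.stat(target).st_ino == os.stat(hard).st_ino
    link_count = os.stat(target).st_nlink     # reference count in the inode

    # Soft link: a separate file whose content is a pathname.
    soft_is_link = os.path.islink(soft)
    soft_content = os.readlink(soft)

    # Removing the original name leaves the hard link usable,
    # while the soft link now points at a name that no longer exists.
    os.unlink(target)
    with open(hard) as f:
        hard_still_readable = f.read() == "hello"
    soft_dangling = not os.path.exists(soft)
```

The link count reported for the inode is 2 after os.link, and the data survives until the last name is removed.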