Outline for Today’s Lecture

Date post: 10-Feb-2016
Upload: meagan
Page 1: Outline for Today’s Lecture

Outline for Today’s Lecture

Administrative:
– Midterm questions?

Objective:
– Beginning of I/O and File Systems

Page 2: Outline for Today’s Lecture

File System Issues
• What is the role of files? What is the file abstraction?
• File naming. How to find the file we want? Sharing files. Controlling access to files.
• Performance issues - how to deal with the bottleneck of disks? What is the “right” way to optimize file access?

Page 3: Outline for Today’s Lecture

Role of Files
• Persistence: long-lived data for posterity
  – non-volatile storage media
  – semantically meaningful (memorable) names

What are the challenges in delivering this functionality?

Page 4: Outline for Today’s Lecture

Abstractions

[Figure: layers of abstraction, top to bottom: User view (address book, record for Duke CPS), Application (addrfile; fid, byte range*), File System (bytes, fid, block#; device, block #), Disk Subsystem (surface, cylinder, sector)]

Page 5: Outline for Today’s Lecture

*File Abstractions
• UNIX-like files
  – Sequence of bytes
  – Operations: open (create), close, read, write, seek
• Memory mapped files
  – Sequence of bytes
  – Mapped into address space
  – Page fault mechanism does data transfer
• Named, possibly typed

Page 6: Outline for Today’s Lecture

Unix File Syscalls

int fd, num, success, bufsize;
char data[bufsize];
long offset, pos;

fd = open (filename, mode [,permissions]);
success = close (fd);
pos = lseek (fd, offset, mode);
num = read (fd, data, bufsize);
num = write (fd, data, bufsize);

open mode: O_RDONLY, O_WRONLY, O_RDWR, O_CREAT, O_APPEND, ...

permissions:  User  grp  others
              rwx   rwx  rwx
              111   100  000

lseek mode: relative to beginning, current position, end of file
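The permission example on this slide (111 100 000) is octal 0740. As a minimal sketch, assuming a hypothetical helper named mode_to_rwx (not a libc function), the 9 permission bits can be rendered as the familiar rwx string like this:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (illustration only): render the low 9 permission
 * bits of a mode as "rwxrwxrwx"-style text. out must hold 10 bytes. */
void mode_to_rwx(unsigned mode, char *out)
{
    static const char letters[] = "rwxrwxrwx";
    assert(out != NULL);
    for (int i = 0; i < 9; i++) {
        /* Bit 8 is user-read, bit 0 is others-execute. */
        unsigned bit = 1u << (8 - i);
        out[i] = (mode & bit) ? letters[i] : '-';
    }
    out[9] = '\0';
}
```

Calling mode_to_rwx(0740, buf) fills buf with the string for user rwx, group r--, others ---.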

Page 7: Outline for Today’s Lecture

UNIX File System Calls

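char buf[BUFSIZE];
int fd;
ssize_t n;

if ((fd = open("../zot", O_TRUNC | O_RDWR)) == -1) {
    perror("open failed");
    exit(1);
}
/* Copy stdin (descriptor 0) to the file; stop at EOF (0) or error (-1). */
while ((n = read(0, buf, BUFSIZE)) > 0) {
    /* Write only the bytes actually read, not the whole buffer. */
    if (write(fd, buf, n) != n) {
        perror("write failed");
        exit(1);
    }
}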

Pathnames may be relative to process current directory.

Process does not specify current file offset: the system remembers it.

Process passes status back to parent on exit, to report success/failure.

Open files are named by an integer file descriptor.

Standard descriptors (0, 1, 2) for input, output, error messages (stdin, stdout, stderr).

Page 8: Outline for Today’s Lecture

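Memory Mapped Files

fd = open (somefile, consistent_mode);
pa = mmap(addr, len, prot, flags, fd, offset);

prot: R, W, X, none
flags: Shared, Private, Fixed, Noreserve

[Figure: a region of len bytes starting at pa in the virtual address space (VAS) corresponds to len bytes of the file at fd + offset]

Reading performed by Load instr.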
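As a hedged sketch of the open/mmap sequence above, the following function maps a whole file read-only and sums its bytes with ordinary loads. The function name mapped_byte_sum and the choice of PROT_READ with MAP_PRIVATE are assumptions made for illustration, not something the slide prescribes.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Sum the bytes of a file by mapping it and reading it with plain loads.
 * Returns -1 on any error. The name is illustrative only. */
long mapped_byte_sum(const char *path)
{
    assert(path != NULL);
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    struct stat st;
    if (fstat(fd, &st) == -1) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {          /* a zero-length mapping is rejected */
        close(fd);
        return 0;
    }

    /* addr = NULL lets the kernel pick pa; offset = 0 maps from the start. */
    unsigned char *pa = mmap(NULL, (size_t)st.st_size, PROT_READ,
                             MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping remains valid after close */
    if (pa == MAP_FAILED)
        return -1;

    long sum = 0;
    for (off_t i = 0; i < st.st_size; i++)
        sum += pa[i];               /* no read() call: page faults bring data in */

    munmap(pa, (size_t)st.st_size);
    return sum;
}
```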

Page 9: Outline for Today’s Lecture

Functions of Device Subsystem

In general, deal with device characteristics
• Translate block numbers (the abstraction of device shown to file system) to physical disk addresses. Device specific (subject to change with upgrades in technology) intelligent placement of blocks.
• Schedule (reorder?) disk operations

Page 10: Outline for Today’s Lecture

Disk Devices

Page 11: Outline for Today’s Lecture

What to do about Disks?
• Disk scheduling
  – Idea is to reorder outstanding requests to minimize seeks.
• Layout on disk
  – Placement to minimize disk overhead
• Build a better disk (or substitute)
  – Example: RAID

Page 12: Outline for Today’s Lecture

Avoiding the Disk -- Caching

Page 13: Outline for Today’s Lecture

File Buffer Cache
• Avoid the disk for as many file operations as possible.
• Cache acts as a filter for the requests seen by the disk - reads served best.
• Delayed writeback will avoid going to disk at all for temp files.

[Figure: Proc, Memory, File cache]

Page 14: Outline for Today’s Lecture

Handling Updates in the File Cache

1. Blocks may be modified in memory once they have been brought into the cache.
   Modified blocks are dirty and must (eventually) be written back.

2. Once a block is modified in memory, the write back to disk may not be immediate (synchronous).
   Delayed writes absorb many small updates with one disk write.
   How long should the system hold dirty data in memory?
   Asynchronous writes allow overlapping of computation and disk update activity (write-behind).
   Do the write call for block n+1 while transfer of block n is in progress.

Page 15: Outline for Today’s Lecture

Linux Page Cache
• Page Cache is the disk cache for all page-based I/O – subsumes file buffer cache.
  – All page I/O flows through page cache
• pdflush daemons – write back to disk any dirty pages/buffers.
  – When free memory falls below threshold, wake up daemon to reclaim free memory
    • Specified number written back
    • Free memory above threshold
  – Periodically, to prevent old data not getting written back, wake up on timer expiration
    • Writes all pages older than specified limit.

Page 16: Outline for Today’s Lecture

Disk Scheduling – Seek Opt.

Page 17: Outline for Today’s Lecture

Rotational Media

[Figure: disk geometry, labeled Sector, Track, Cylinder, Head, Platter, Arm]

Access time = seek time + rotational delay + transfer time

seek time = 5-15 milliseconds to move the disk arm and settle on a cylinder
rotational delay = 8 milliseconds for full rotation at 7200 RPM: average delay = 4 ms
transfer time = 1 millisecond for an 8KB block at 8 MB/s
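The access-time formula can be checked with a few lines of arithmetic. This is a sketch under the slide's numbers; the function names are made up for illustration, and taking the average rotational delay as half a rotation is the usual assumption.

```c
#include <assert.h>

/* Average rotational delay is half a rotation; rpm is rotations per minute. */
double avg_rotational_delay_ms(double rpm)
{
    assert(rpm > 0.0);
    return 0.5 * 60000.0 / rpm;
}

/* Transfer time in milliseconds for `bytes` at `bytes_per_sec`. */
double transfer_time_ms(double bytes, double bytes_per_sec)
{
    assert(bytes_per_sec > 0.0);
    return 1000.0 * bytes / bytes_per_sec;
}

/* Access time = seek time + rotational delay + transfer time (all in ms). */
double access_time_ms(double seek_ms, double rpm,
                      double bytes, double bytes_per_sec)
{
    return seek_ms + avg_rotational_delay_ms(rpm)
                   + transfer_time_ms(bytes, bytes_per_sec);
}
```

With a 10 ms seek, 7200 RPM, and an 8KB block at 8 MB/s, the three terms come to roughly 10 ms, 4.2 ms, and 1 ms, so positioning dominates the cost of a single-block access.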

Page 18: Outline for Today’s Lecture

Disk Scheduling
• Assuming there are sufficient outstanding requests in request queue
• Focus is on seek time - minimizing physical movement of head.
• Simple model of seek performance

Seek Time = startup time (e.g. 3.0 ms) + N (number of cylinders) * per-cylinder move (e.g. .04 ms/cyl)

Page 19: Outline for Today’s Lecture

“Textbook” Policies
• Generally use FCFS as baseline for comparison
• Shortest Seek First (SSTF) - closest
  – danger of starvation
• Elevator (SCAN) - sweep in one direction, turn around when no requests beyond
  – handle case of constant arrivals at same position
• C-SCAN - sweep in only one direction, return to 0
  – less variation in response

[Figure: service order for the request queue 1, 3, 2, 4, 3, 5, 0 under FCFS, SSTF, SCAN, and CSCAN]
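To make the comparison concrete, here is a small sketch that totals head movement for FCFS and SSTF over a request queue such as 1, 3, 2, 4, 3, 5, 0. The starting head position, the tie-breaking rule (earliest arrival wins), and the 64-request limit are assumptions of this sketch, not part of the slide.

```c
#include <assert.h>
#include <stdlib.h>

/* Total head movement (in cylinders) when requests are served in arrival order. */
int fcfs_distance(int head, const int *req, int n)
{
    int total = 0;
    for (int i = 0; i < n; i++) {
        total += abs(req[i] - head);
        head = req[i];
    }
    return total;
}

/* Total head movement when the closest pending request is always served next.
 * Ties go to the request that arrived first. Handles up to 64 requests. */
int sstf_distance(int head, const int *req, int n)
{
    assert(n >= 0 && n <= 64);
    int done[64] = {0};
    int total = 0;
    for (int served = 0; served < n; served++) {
        int best = -1;
        for (int i = 0; i < n; i++) {
            if (done[i])
                continue;
            if (best == -1 || abs(req[i] - head) < abs(req[best] - head))
                best = i;
        }
        done[best] = 1;
        total += abs(req[best] - head);
        head = req[best];
    }
    return total;
}
```

Starting with the head at cylinder 0, the queue above costs 14 cylinders of movement under FCFS and 5 under SSTF. SSTF achieves that by always serving nearby requests first, which is also why a distant request can starve.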

Page 20: Outline for Today’s Lecture

Sector Scheduling

Page 21: Outline for Today’s Lecture

Linux Disk Schedulers
• Linus Elevator
  – Merging and sorting: when new request comes in
    • Merge with any enqueued request for adjacent sector
    • If any request is too old, put new request at end of queue
    • Sort by sector location in queue (between existing requests)
    • Otherwise at end
• Deadline – each request placed on 2 of 3 queues
  – sector-wise – as above
  – read FIFO and write FIFO – whenever expiration time exceeded, service from here
• Anticipatory
  – Hang around waiting for subsequent request just a bit

Page 22: Outline for Today’s Lecture

Disk Layout

Page 23: Outline for Today’s Lecture

Layout on Disk
• Can address both seek and rotational latency
• Cluster related things together (e.g. an inode and its data, inodes in same directory (ls command), data blocks of multi-block file, files in same directory)
• Sub-block allocation to reduce fragmentation for small files
• Log-Structured File Systems

Page 24: Outline for Today’s Lecture

The Problem of Disk Layout
• The level of indirection in the file block maps allows flexibility in file layout.
• “File system design is 99% block allocation.” [McVoy]
• Competing goals for block allocation:
  – allocation cost
  – bandwidth for high-volume transfers
  – efficient directory operations
• Goal: reduce disk arm movement and seek overhead.
• Metric of merit: bandwidth utilization

Page 25: Outline for Today’s Lecture

UNIX Inodes

[Figure: an inode holding File Attributes and a Block Addr array; its entries lead to the data blocks either directly or through blocks of Data Block Addr entries, with levels labeled 1, 2, and 3]

Decoupling meta-data from directory entries

Page 26: Outline for Today’s Lecture

FFS Cylinder Groups
• FFS defines cylinder groups as the unit of disk locality, and it factors locality into allocation choices.
  – typical: thousands of cylinders, dozens of groups
  – Strategy: place “related” data blocks in the same cylinder group whenever possible.
    • seek latency is proportional to seek distance
  – Smear large files across groups:
    • Place a run of contiguous blocks in each group.
  – Reserve inode blocks in each cylinder group.
    • This allows inodes to be allocated close to their directory entries and close to their data blocks (for small files).

Page 27: Outline for Today’s Lecture

FFS Allocation Policies

1. Allocate file inodes close to their containing directories.
   For mkdir, select a cylinder group with a more-than-average number of free inodes.
   For creat, place inode in the same group as the parent.

2. Concentrate related file data blocks in cylinder groups.
   Most files are read and written sequentially.
   Place initial blocks of a file in the same group as its inode.
   How should we handle directory blocks?
   Place adjacent logical blocks in the same cylinder group.
   Logical block n+1 goes in the same group as block n.
   Switch to a different group for each indirect block.

Page 28: Outline for Today’s Lecture

Allocating a Block

1. Try to allocate the rotationally optimal physical block after the previous logical block in the file.
   Skip rotdelay physical blocks between each logical block.
   (rotdelay is 0 on track-caching disk controllers.)

2. If not available, find another block at a nearby rotational position in the same cylinder group.
   We’ll need a short seek, but we won’t wait for the rotation.
   If not available, pick any other block in the cylinder group.

3. If the cylinder group is full, or we’re crossing to a new indirect block, go find a new cylinder group.
   Pick a block at the beginning of a run of free blocks.

Page 29: Outline for Today’s Lecture

Clustering in FFS
• Clustering improves bandwidth utilization for large files read and written sequentially.
• Allocate clumps/clusters/runs of blocks contiguously; read/write the entire clump in one operation with at most one seek.
  – Typical cluster sizes: 32KB to 128KB.
• FFS can allocate contiguous runs of blocks “most of the time” on disks with sufficient free space.
  – This (usually) occurs as a side effect of setting rotdelay = 0.
• Newer versions may relocate to clusters of contiguous storage if the initial allocation did not succeed in placing them well.
  – Must modify buffer cache to group buffers together and read/write in contiguous clusters.

Page 30: Outline for Today’s Lecture

Effect of Clustering

Access time = seek time + rotational delay + transfer time

average seek time = 2 ms for an intra-cylinder group seek, let’s say
rotational delay = 8 milliseconds for full rotation at 7200 RPM: average delay = 4 ms
transfer time = 1 millisecond for an 8KB block at 8 MB/s

8 KB blocks deliver about 15% of disk bandwidth.
64KB blocks/clusters deliver about 50% of disk bandwidth.
128KB blocks/clusters deliver about 70% of disk bandwidth.

Actual performance will likely be better with good disk layout, since most seek/rotate delays to read the next block/cluster will be “better than average”.
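The percentages follow from dividing transfer time by total access time. A minimal sketch, assuming one average seek and one average rotational delay per cluster and using the slide's 2 ms, 4 ms, and 1 ms per 8KB figures (the function name is illustrative):

```c
#include <assert.h>

/* Fraction of disk bandwidth delivered when each cluster of `blocks`
 * 8KB blocks costs one average seek plus one average rotational delay. */
double bandwidth_utilization(int blocks)
{
    assert(blocks > 0);
    const double seek_ms = 2.0;            /* intra-cylinder-group seek */
    const double rot_ms = 4.0;             /* average rotational delay  */
    const double xfer_ms_per_block = 1.0;  /* 8KB at 8 MB/s             */
    double xfer = blocks * xfer_ms_per_block;
    return xfer / (seek_ms + rot_ms + xfer);
}
```

For 1, 8, and 16 blocks (8KB, 64KB, 128KB) this simple model gives about 14%, 57%, and 73%, in the same ballpark as the rounded figures quoted on the slide.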

Page 31: Outline for Today’s Lecture

Disk Alternatives

Page 32: Outline for Today’s Lecture

Build a Better Disk?
• “Better” has typically meant density to disk manufacturers - bigger disks are better.
• I/O Bottleneck - a speed disparity caused by processors getting faster more quickly
• One idea is to use parallelism of multiple disks
  – Striping data across disks
  – Reliability issues - introduce redundancy

Page 33: Outline for Today’s Lecture

RAID
Redundant Array of Inexpensive Disks

[Figure: striped data disks and a parity disk (RAID Levels 2 and 3)]

Page 34: Outline for Today’s Lecture

MEMS-based Storage
Griffin, Schlosser, Ganger, Nagle

• Paper in OSDI 2000 on OS Management
• Comparing MEMS-based storage with disks
  – Request scheduling
  – Data layout
  – Fault tolerance
  – Power management

Page 35: Outline for Today’s Lecture

• Settling time after X seek
• Spring factor - non-uniform over sled positions
• Turnaround time

Page 36: Outline for Today’s Lecture

Data on Media Sled

Page 37: Outline for Today’s Lecture

Disk Analogy

• 16 tips
• MxN = 3 x 280
• Cylinder – same x offset
• 4 tracks of 1080 bits, 4 tips
• Each track – 12 sectors of 80 bits (8 encoded bytes)
• Logical blocks striped across 2 sectors

Page 38: Outline for Today’s Lecture

Logical Blocks and LBN
• Sectors are smaller than disk
• Multiple sectors can be accessed concurrently
• Bidirectional access

Page 39: Outline for Today’s Lecture

Comparison

MEMS
• Positioning – X and Y seek (0.2-0.8 ms)
• Settling time 0.2ms
• Seeks near edges take longer due to springs, turnarounds depend on direction – it isn’t just distance to be moved.
• More parts to break
• Access parallelism

Disk
• Seek (1-15 ms) and rotational delay
• Settling time 0.5ms
• Seek times are relatively constant functions of distance
• Constant velocity rotation occurring regardless of accesses

Page 40: Outline for Today’s Lecture

File System

Page 41: Outline for Today’s Lecture

Functions of File System
• (Directory subsystem) Map filenames to fileids - open (create) syscall. Create kernel data structures. Maintain naming structure (unlink, mkdir, rmdir)
• Determine layout of files and metadata on disk in terms of blocks. Disk block allocation. Bad blocks.
• Handle read and write system calls
• Initiate I/O operations for movement of blocks to/from disk.
• Maintain buffer cache

Page 42: Outline for Today’s Lecture

File System Data Structures

[Figure: the process descriptor holds a per-process file ptr array (stdin, stdout, stderr, ...); entries point into the system-wide open file table (r-w pos, mode), which points into the system-wide file descriptor table (in-memory copy of inode, ptr to on-disk inode); pos marks a position in the file data]

Page 43: Outline for Today’s Lecture

UNIX Inodes

[Figure: an inode holding File Attributes and a Block Addr array; its entries lead to the data blocks either directly or through blocks of Data Block Addr entries, with levels labeled 1, 2, and 3]

Decoupling meta-data from directory entries

Page 44: Outline for Today’s Lecture

File Sharing Between Parent/Child

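int main(int argc, char *argv[]) {
    char c;
    int fdrd, fdwt, fdpriv;

    /* Opened before fork: both processes share these open file instances. */
    if ((fdrd = open(argv[1], O_RDONLY)) == -1)
        exit(1);
    if ((fdwt = creat(argv[2], 0666)) == -1)
        exit(1);

    fork();

    /* Opened after fork: each process gets its own private instance. */
    if ((fdpriv = open(argv[3], O_RDONLY)) == -1)
        exit(1);

    while (1) {
        if (read(fdrd, &c, 1) != 1)
            exit(0);
        write(fdwt, &c, 1);
    }
}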

Page 45: Outline for Today’s Lecture

File System Data Structures

[Figure: the same structures as before (per-process file ptr array with stdin, stdout, stderr; system-wide open file table with r-w pos, mode; system-wide file descriptor table with in-memory copy of inode and ptr to on-disk inode), now with the forked process’s process descriptor added and an entry marked “open after fork”]

Page 46: Outline for Today’s Lecture

Sharing Open File Instances

[Figure: parent and child process objects (each with user ID, process ID, process group ID, parent PID, signal state, siblings, children) whose process file descriptors refer to the same entry in the system open file table; the shared seek offset is in the shared file table entry, which refers to the shared file (inode or vnode)]
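The shared seek offset can be observed directly. In this sketch (the function name and the 64-byte limit are illustrative assumptions), the child reads n bytes from a descriptor opened before fork, and the parent, which never reads, then asks for its current offset:

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Open `path`, fork, let the child read `n` bytes, and return the offset the
 * parent then observes on the same descriptor. Returns -1 on error. */
long parent_offset_after_child_read(const char *path, size_t n)
{
    char buf[64];
    assert(n <= sizeof buf);

    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(fd);
        return -1;
    }
    if (pid == 0) {
        /* Child: the inherited descriptor names the same open file table entry. */
        ssize_t got = read(fd, buf, n);
        _exit(got == (ssize_t)n ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) == -1 ||
        !WIFEXITED(status) || WEXITSTATUS(status) != 0) {
        close(fd);
        return -1;
    }

    /* Parent never read, yet its offset has moved by the child's read. */
    off_t pos = lseek(fd, 0, SEEK_CUR);
    close(fd);
    return (long)pos;
}
```

A descriptor opened after the fork, like fdpriv in the slide's program, would have its own table entry and therefore its own offset.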

Page 47: Outline for Today’s Lecture

Goals of File Naming
• Foremost function - to find files (e.g., in open()). Map file name to file object.
• To store meta-data about files.
• To allow users to choose their own file names without undue name conflict problems.
• To allow sharing.
• Convenience: short names, groupings.
• To avoid implementation complications

Page 48: Outline for Today’s Lecture

Pathname Resolution

“cps110/current/Proj/proj3”

[Figure: resolution starts at the index node of wd; each directory node (cps110, current, Proj) supplies the inode# of the next component (current, Proj, proj3), whose File Attributes lead to the next directory node, ending at the proj3 data file]

Page 49: Outline for Today’s Lecture

Linux dcache

[Figure: dentries for cps210, spr04, Proj, and proj1, each with an inode object, reached through a hash table]

Page 50: Outline for Today’s Lecture

Naming Structures
• Flat name space - 1 system-wide table
  – Unique naming with multiple users is hard. Name conflicts.
  – Easy sharing, need for protection
• Per-user name space
  – Protection by isolation, no sharing
  – Easy to avoid name conflicts
  – Register identifies with directory to use to resolve names, possibility of user-settable (cd)

Page 51: Outline for Today’s Lecture

Naming Structures

Naming network
• Component names - pathnames
  – Absolute pathnames - from a designated root
  – Relative pathnames - from a working directory
  – Each name carries how to resolve it.
• Short names to files anywhere in the network produce cycles, but convenience in naming things.

Page 52: Outline for Today’s Lecture

Full Naming Network*
• /Jamie/lynn/project/D
• /Jamie/d
• /Jamie/lynn/jam/proj1/C
• (relative from Terry) A
• (relative from Jamie) d

[Figure: naming graph with nodes root, Lynn, Jamie, Terry, project, proj1, TA, grp1 and files A, B, C, D, E; edges are labeled project, D, d, lynn, jam]

* not Unix

Page 53: Outline for Today’s Lecture

Full Naming Network*
• /Jamie/lynn/project/D
• /Jamie/d
• /Jamie/lynn/jam/proj1/C
• (relative from Terry) A
• (relative from Jamie) d

[Figure: naming graph with nodes root, Lynn, Jamie, Terry, project, proj1, TA, grp1 and files A, B, C, D, E; edges are labeled project, D, d, lynn, jam]

* Unix
Why?

Page 54: Outline for Today’s Lecture

Meta-Data
• File size
• File type
• Protection - access control information
• History: creation time, last modification, last access.
• Location of file - which device
• Location of individual blocks of the file on disk.
• Owner of file
• Group(s) of users associated with file

Page 55: Outline for Today’s Lecture

Restricting to a Hierarchy
• Problems with full naming network
  – What does it mean to “delete” a file?
  – Meta-data interpretation

Page 56: Outline for Today’s Lecture

Operations on Directories (UNIX)

• link (oldpathname, newpathname) - make entry pointing to file

• unlink (filename) - remove entry pointing to file

• mknod (dirname, type, device) - used (e.g. by mkdir utility function) to create a directory (or named pipe, or special file)

• getdents(fd, buf, structsize) - reads dir entries

Page 57: Outline for Today’s Lecture

Reclaiming Storage

[Figure: naming graph (root, Jo, Jamie, Terry, project, proj1, TA, grp1, files A, B, C, D, E, links project, D, d, joe, jam) with three links crossed out (X) by a series of unlinks]

What should be dealloc?

Page 58: Outline for Today’s Lecture

Reclaiming Storage

[Figure: naming graph (root, Jo, Jamie, Terry, project, proj1, TA, grp1, files A, B, C, D, E, links project, D, d, joe, jam) with three links crossed out (X) by a series of unlinks]

Page 59: Outline for Today’s Lecture

Reference Counting?

[Figure: the same naming graph after the series of unlinks (X), with nodes annotated with reference counts (1, 2, 3)]

Page 60: Outline for Today’s Lecture

Garbage Collection

[Figure: the same naming graph after the series of unlinks (X), with reachable nodes marked *]

* Phase 1 marking
Phase 2 collect

Page 61: Outline for Today’s Lecture

Restricting to a Hierarchy
• Problems with full naming network
  – What does it mean to “delete” a file?
  – Meta-data interpretation
• Eliminating cycles
  – allows use of reference counts for reclaiming file space
  – avoids garbage collection

Page 62: Outline for Today’s Lecture

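Given: Naming Hierarchy (because of implementation issues)

[Figure: file tree rooted at / with tmp, usr, etc, bin, vmunix; below them ls, sh, project, users, and packages; a mount point leads to a (volume root) containing tex and emacs; one node is marked as a leaf]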

Page 63: Outline for Today’s Lecture

A Typical Unix File Tree

[Figure: file tree rooted at / with tmp, usr, etc, bin, vmunix; below them ls, sh, project, users, packages, and coveredDir, which is marked as the mount point]

File trees are built by grafting volumes from different devices or from network servers.

Each volume is a set of directories and files; a host’s file tree is the set of directories and files visible to processes on a given host.

In Unix, the graft operation is the privileged mount system call, and each volume is a filesystem.

mount (coveredDir, volume)
    coveredDir: directory pathname
    volume: device

volume root contents become visible at pathname coveredDir

Page 64: Outline for Today’s Lecture

A Typical Unix File Tree

[Figure: the same file tree after the mount; the (volume root), containing tex and emacs, now appears at coveredDir, so emacs is reachable as /usr/project/packages/coveredDir/emacs]

File trees are built by grafting volumes from different devices or from network servers.

Each volume is a set of directories and files; a host’s file tree is the set of directories and files visible to processes on a given host.

In Unix, the graft operation is the privileged mount system call, and each volume is a filesystem.

mount (coveredDir, volume)

Page 65: Outline for Today’s Lecture

Reclaiming Convenience
• Symbolic links - indirect files: filename maps, not to file object, but to another pathname
  – allows short aliases
  – slightly different semantics
• Search path rules

Page 66: Outline for Today’s Lecture

Unix File Naming (Hard Links)

[Figure: directory A (entries 0, rain: 32, hail: 48) and directory B (entries 0, wind: 18, sleet: 48); hail and sleet both name inode 48, whose inode link count = 2]

A Unix file may have multiple names.

link system call
    link (existing name, new name)
    create a new name for an existing file
    increment inode link count

unlink system call (“remove”)
    unlink(name)
    destroy directory entry
    decrement inode link count
    if count = 0 and file is not in active use
        free blocks (recursively) and on-disk inode

Each directory entry naming the file is called a hard link.

Each inode contains a reference count showing how many hard links name it.
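The link count is visible from user space through stat. A small sketch using the POSIX link, unlink, and stat calls; the helper names link_count, same_file, and create_with_alias are made up for this example:

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Return the inode link count of `path`, or -1 if stat fails. */
long link_count(const char *path)
{
    struct stat st;
    assert(path != NULL);
    if (stat(path, &st) == -1)
        return -1;
    return (long)st.st_nlink;
}

/* Return 1 if both names refer to the same inode on the same device. */
int same_file(const char *a, const char *b)
{
    struct stat sa, sb;
    if (stat(a, &sa) == -1 || stat(b, &sb) == -1)
        return 0;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}

/* Create `name` as an empty file, then add `alias` as a second hard link.
 * Returns the resulting link count (expected 2), or -1 on error. */
long create_with_alias(const char *name, const char *alias)
{
    int fd = open(name, O_CREAT | O_EXCL | O_WRONLY, 0600);
    if (fd == -1)
        return -1;
    close(fd);
    if (link(name, alias) == -1)
        return -1;
    return link_count(name);
}
```

After create_with_alias succeeds, unlinking either name drops the count to 1 and the file stays reachable through the other name; the blocks are freed only when the last name goes away and no process has the file open.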

Page 67: Outline for Today’s Lecture

Unix Symbolic (Soft) Links• Unix files may also be named by symbolic (soft) links.

– A soft link is a file containing a pathname of some other file.

[Figure: directory A (entries 0, rain: 32, hail: 48), where hail names inode 48 with inode link count = 1; directory B (entries 0, wind: 18, sleet: 67), where sleet names inode 67, a symlink whose contents are ../A/hail/0]

symlink system call
    symlink (existing name, new name)
    allocate a new file (inode) with type symlink
    initialize file contents with existing name
    create directory entry for new file with new name

The target of the link may be removed at any time, leaving a dangling reference.

How should the kernel handle recursive soft links?

Convenience, but not performance!
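A dangling reference can be detected by comparing lstat, which examines the link itself, with stat, which follows it. A minimal sketch; the function name is_dangling_symlink is an assumption for illustration:

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <sys/stat.h>
#include <unistd.h>

/* Return 1 if `path` is a symbolic link whose target cannot be resolved,
 * 0 if it is not a symlink or resolves fine, -1 if `path` itself is missing. */
int is_dangling_symlink(const char *path)
{
    struct stat st;
    assert(path != NULL);
    if (lstat(path, &st) == -1)     /* lstat examines the link itself */
        return -1;
    if (!S_ISLNK(st.st_mode))
        return 0;
    return stat(path, &st) == -1;   /* stat follows the link to its target */
}
```

Removing the target does not touch the link: the link's own inode and contents survive, which is exactly why the reference can dangle.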

Page 68: Outline for Today’s Lecture

Soft vs. Hard Links
What’s the difference in behavior?

Page 69: Outline for Today’s Lecture

Soft vs. Hard Links
What’s the difference in behavior?

[Figure: tree rooted at / with Terry, Lynn, and Jamie]

Page 70: Outline for Today’s Lecture

Soft vs. Hard Links
What’s the difference in behavior?

[Figure: tree rooted at / with Terry, Lynn, and Jamie; one link is crossed out (X)]

Page 71: Outline for Today’s Lecture

Soft vs. Hard Links
What’s the difference in behavior?

[Figure: tree rooted at / with Terry, Lynn, and Jamie; one link is crossed out (X)]

Page 72: Outline for Today’s Lecture

After Resolving Long Pathnames

OPEN(“/usr/faculty/carla/classes/cps110/spring02/lectures/lecture13.ppt”,…)

Finally Arrive at File
• What do users seem to want from the file abstraction?
• What do these usage patterns mean for file structure and implementation decisions?
  – What operations should be optimized 1st?
  – How should files be structured?
  – Is there temporal locality in file usage?
  – How long do files really live?

Page 73: Outline for Today’s Lecture

Know your Workload!
• File usage patterns should influence design decisions. Do things differently depending:
  – How large are most files? How long-lived? Read vs. write activity. Shared often?
  – Different levels “see” a different workload.
• Feedback loop

[Figure: feedback loop between “Usage patterns observed today” and “File System design and impl”]

Page 74: Outline for Today’s Lecture

Generalizations from UNIX Workloads

• Standard Disclaimers that you can’t generalize…but anyway…

• Most files are small (fit into one disk block) although most bytes are transferred from longer files.

• Most opens are for read mode, most bytes transferred are by read operations

• Accesses tend to be sequential and 100% (that is, the whole file is read or written)

Page 75: Outline for Today’s Lecture

More on Access Patterns
• There is significant reuse (re-opens): most opens go to files repeatedly opened & quickly. Directory nodes and executables also exhibit good temporal locality.
  – Looks good for caching!
• Use of temp files is significant part of file system activity in UNIX: very limited reuse, short lifetimes (less than a minute).

Page 76: Outline for Today’s Lecture

File Structure Implementation: Mapping File Block
• Contiguous
  – 1 block pointer, causes fragmentation, growth is a problem.
• Linked
  – each block points to next block, directory points to first, OK for sequential access
• Indexed
  – index structure required, better for random access into file.
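The difference between linked and indexed mapping shows up in how logical block n is found. A toy sketch, assuming block pointers are kept in plain in-memory arrays and that -1 marks end of file (both are illustrative choices, not from the slide):

```c
#include <assert.h>

#define END_OF_FILE (-1)

/* Linked allocation: next[b] is the block that follows block b.
 * Finding logical block n costs n pointer hops from the first block. */
int linked_lookup(const int *next, int first, int n)
{
    assert(n >= 0);
    int b = first;
    while (n-- > 0 && b != END_OF_FILE)
        b = next[b];
    return b;
}

/* Indexed allocation: index[n] holds the disk block of logical block n. */
int indexed_lookup(const int *index, int nblocks, int n)
{
    assert(n >= 0);
    return n < nblocks ? index[n] : END_OF_FILE;
}
```

For a file stored in disk blocks 4, 7, 2, the linked lookup walks 4 to 7 to 2, while the indexed lookup reads one array entry, which is why indexing suits random access.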

Page 77: Outline for Today’s Lecture

UNIX Inodes

[Figure: an inode holding File Attributes and a Block Addr array; its entries lead to the data blocks either directly or through blocks of Data Block Addr entries, with levels labeled 1, 2, and 3]

Decoupling meta-data from directory entries

Page 78: Outline for Today’s Lecture

File Allocation Table (FAT)

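[Figure: directory entries Lecture.ppt, Pic.jpg, and Notes.txt each lead into a chain of entries in the file allocation table; each chain ends with eof]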

Page 79: Outline for Today’s Lecture

Meta-Data
• File size
• File type
• Protection - access control information
• History: creation time, last modification, last access.
• Location of file - which device
• Location of individual blocks of the file on disk.
• Owner of file
• Group(s) of users associated with file

Page 80: Outline for Today’s Lecture

File Access Control

Page 81: Outline for Today’s Lecture

Access Control for Files
• Access control lists - detailed list attached to file of users allowed (denied) access, including kind of access allowed/denied.
• UNIX RWX - owner, group, everyone

Page 82: Outline for Today’s Lecture

UNIX access control
• Each file carries its access control with it.

  rwx          rwx          rwx               setuid
  Owner UID    Group GID    Everybody else

  setuid: when bit set, it allows process executing object to assume UID of owner temporarily - enter owner domain (rights amplification)

• Owner has chmod, chgrp rights (granting, revoking)

Page 83: Outline for Today’s Lecture

The Access Model
• Authorization problems can be represented abstractly by use of an access model.
  – each row represents a subject/principal/domain
  – each column represents an object
  – each cell: accesses permitted for the {subject, object} pair
    • read, write, delete, execute, search, control, or any other method
• In real systems, the access matrix is sparse and dynamic.
• need a flexible, efficient representation

Page 84: Outline for Today’s Lecture

Two Representations
• ACL - Access Control Lists
  – Columns of previous matrix
  – Permissions attached to Objects
  – ACL for file hotgossip: Terry, rw; Lynn, rw
• Capabilities
  – Rows of previous matrix
  – Permissions associated with Subject
  – Tickets, Namespace (what it is that one can name)
  – Capabilities held by Lynn: luvltr, rw; hotgossip, rw

Page 85: Outline for Today’s Lecture

Access Control Lists
• Approach: represent the access matrix by storing its columns with the objects.
• Tag each object with an access control list (ACL) of authorized subjects/principals.
• To authorize an access requested by S for O
  – search O’s ACL for an entry matching S
  – compare requested access with permitted access
  – access checks are often made only at bind time
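The authorization steps above can be sketched as a lookup over an object's ACL. The struct layout, the ACC_* bit names, and the deny-by-default rule for unlisted subjects are assumptions of this sketch:

```c
#include <assert.h>
#include <string.h>

enum { ACC_R = 1, ACC_W = 2, ACC_X = 4 };

struct acl_entry {
    const char *subject;   /* principal name */
    unsigned    rights;    /* bitmask of ACC_* values */
};

/* Return 1 if `subject` may perform every access in `requested` according
 * to the object's ACL, else 0. A subject with no entry is denied. */
int acl_allows(const struct acl_entry *acl, int n,
               const char *subject, unsigned requested)
{
    assert(subject != NULL);
    for (int i = 0; i < n; i++) {
        /* Step 1: search the ACL for an entry matching the subject. */
        if (strcmp(acl[i].subject, subject) == 0)
            /* Step 2: compare requested access with permitted access. */
            return (acl[i].rights & requested) == requested;
    }
    return 0;
}
```

With an ACL for hotgossip of Terry, rw and Lynn, rw, a read request by Lynn is allowed, while an execute request by Terry or any request by an unlisted subject is refused.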

Page 86: Outline for Today’s Lecture

Capabilities
• Approach: represent the access matrix by storing its rows with the subjects.
• Tag each subject with a list of capabilities for the objects it is permitted to access.
  – A capability is an unforgeable object reference, like a pointer.
  – It endows the holder with permission to operate on the object
    • e.g., permission to invoke specific methods
  – Typically, capabilities may be passed from one subject to another.
    • Rights propagation and confinement problems

Page 87: Outline for Today’s Lecture

Dynamics of Protection Schemes

• How to endow software modules with appropriate privilege?
  – What mechanism exists to bind principals with subjects?
    • e.g., setuid syscall, setuid bit
  – What principals should a software module bind to?
    • privilege of creator: but may not be sufficient to perform the service
    • privilege of owner or system: dangerous

Page 88: Outline for Today’s Lecture

Dynamics of Protection Schemes
• How to revoke privileges?
• What about adding new subjects or new objects?
• How to dynamically change the set of objects accessible (or vulnerable) to different processes run by the same user?
  – Need-to-know principle / Principle of minimal privilege
  – How do subjects change identity to execute a more privileged module?
    • protection domain, protection domain switch (enter)

Page 89: Outline for Today’s Lecture

Protection Domains
• Processes execute in a protection domain, initially inherited from subject
• Goal: to be able to change protection domains
• Introduce a level of indirection
• Domains become protected objects with operations defined on them: owner, copy, control

[Figure: access matrix with row labels TA grp, Terry, Lynn, Domain0 and column labels gradefile, solutions, proj1, luvltr, hotgossip, Domain0; cells hold rights such as r, rw, rwx, rwo, rxc, ctl, and enter]

Page 90: Outline for Today’s Lecture


• If domain contains copy on right to some object, then it can transfer that right to the object to another domain.

• If domain is owner of some object, it can grant that right to the object, with or without copy to another domain

• If domain is owner or has ctl right to a domain, it can remove right to object from that domain

• Rights propagation.

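[Figure: the access matrix again, with row labels TA grp, Terry, Lynn, Domain0 and column labels gradefile, solutions, proj1, luvltr, hotgossip, Domain0; cells hold rights such as r, rc, rw, rwo, ctl, and enter, including copied rights (rc)]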

Page 91: Outline for Today’s Lecture

Distributed File Systems
• Naming
  – Location transparency/independence
• Caching
  – Consistency
• Replication
  – Availability and updates

[Figure: clients and servers connected by a network]

Page 92: Outline for Today’s Lecture

Naming
• \\His\d\pictures\castle.jpg
  – Not location transparent - both machine and drive embedded in name.
• NFS mounting
  – Remote directory mounted over local directory in local naming hierarchy.
  – /usr/m_pt/A
  – No global view

[Figure: four trees. Her local directory tree (usr, m_pt); his local dir tree (for_export with A, B; usr, m_pt); her local tree after mount (usr, m_pt with A, B); his after mount on B (usr, m_pt, A, B)]

Page 93: Outline for Today’s Lecture

Global Name Space
Example: Andrew File System

[Figure: tree rooted at / with afs, tmp, bin, lib; part of the tree holds local files, and the shared files subtree looks identical to all clients]

Page 94: Outline for Today’s Lecture

VFS: the Filesystem Switch

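[Figure: user space above the syscall layer (file, uio, etc.); below it the Virtual File System (VFS), with NFS, FFS, LFS, *FS, etc. plugged in underneath; NFS uses the network protocol stack (TCP/IP) and the disk file systems use device drivers]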

Sun Microsystems introduced the virtual file system framework in 1985 to accommodate the Network File System cleanly.

• VFS allows diverse specific file systems to coexist in a file tree, isolating all FS-dependencies in pluggable filesystem modules.

VFS was an internal kernel restructuring with no effect on the syscall interface.

Incorporates object-oriented concepts: a generic procedural interface with multiple implementations.

Other abstract interfaces in the kernel: device drivers, file objects, executable files, memory objects.

Page 95: Outline for Today’s Lecture

Vnodes

In the VFS framework, every file or directory in active use is represented by a vnode object in kernel memory.

[Figure: syscall layer above NFS and UFS vnodes, plus a list of free vnodes]

Active vnodes are reference-counted by the structures that hold pointers to them, e.g., the system open file table.

Each vnode has a standard file attributes struct.

Vnode operations are macros that vector to filesystem-specific procedures.

Generic vnode points at filesystem-specific struct (e.g., inode, rnode), seen only by the filesystem.

Each specific file system maintains a hash of its resident vnodes.

Page 96: Outline for Today’s Lecture

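Example: Network File System (NFS)

[Figure: on the client, user programs call the syscall layer, then VFS, which leads to UFS or the NFS client; the NFS client reaches the NFS server over the network; on the server, the NFS server sits beside the syscall layer above VFS and UFS]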

Page 97: Outline for Today’s Lecture

Vnode Operations and Attributes

directories only
    vop_lookup (OUT vpp, name)
    vop_create (OUT vpp, name, vattr)
    vop_remove (vp, name)
    vop_link (vp, name)
    vop_rename (vp, name, tdvp, tvp, name)
    vop_mkdir (OUT vpp, name, vattr)
    vop_rmdir (vp, name)
    vop_readdir (uio, cookie)
    vop_symlink (OUT vpp, name, vattr, contents)
    vop_readlink (uio)

files only
    vop_getpages (page**, count, offset)
    vop_putpages (page**, count, sync, offset)
    vop_fsync ()

vnode/file attributes (vattr or fattr)
    type (VREG, VDIR, VLNK, etc.)
    mode (9+ bits of permissions)
    nlink (hard link count)
    owner user ID
    owner group ID
    filesystem ID
    unique file ID
    file size (bytes and blocks)
    access time
    modify time
    generation number

generic operations
    vop_getattr (vattr)
    vop_setattr (vattr)
    vhold()
    vholdrele()

Page 98: Outline for Today’s Lecture

Pathname Traversal

• When a pathname is passed as an argument to a system call, the syscall layer must “convert it to a vnode”.

• Pathname traversal is a sequence of vop_lookup calls to descend the tree to the named file or directory.

open("/tmp/zot")
vp = get vnode for / (rootdir)
vp->vop_lookup(&cvp, "tmp");
vp = cvp;
vp->vop_lookup(&cvp, "zot");

Issues:
1. crossing mount points
2. obtaining root vnode (or current dir)
3. finding resident vnodes in memory
4. caching name->vnode translations
5. symbolic (soft) links
6. disk implementation of directories
7. locking/referencing to handle races with name create and delete operations
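The traversal loop can be sketched as a runnable toy in C. It walks an absolute path one component at a time and calls a per-directory lookup for each component, mirroring the sequence of vop_lookup calls above. The names `toy_vnode`, `toy_lookup`, and `toy_namei` are hypothetical, the tree lives entirely in memory, and none of the listed issues (mount points, symbolic links, name caching, locking) are handled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_CHILDREN 8
#define MAX_NAME 32

/* Toy in-memory directory tree node standing in for a vnode. */
struct toy_vnode {
    const char *name;
    struct toy_vnode *children[MAX_CHILDREN];
    int nchildren;
};

/* Stand-in for vop_lookup: find one name in one directory. Returns 0 on success. */
static int toy_lookup(struct toy_vnode *dvp, struct toy_vnode **vpp, const char *name)
{
    assert(dvp != NULL && vpp != NULL);
    for (int i = 0; i < dvp->nchildren; i++) {
        if (strcmp(dvp->children[i]->name, name) == 0) {
            *vpp = dvp->children[i];
            return 0;
        }
    }
    return -1; /* no such entry */
}

/* Walk an absolute path from the root vnode; returns NULL if any component is missing. */
static struct toy_vnode *toy_namei(struct toy_vnode *root, const char *path)
{
    struct toy_vnode *vp = root;
    char comp[MAX_NAME];

    while (*path != '\0') {
        while (*path == '/')
            path++;                      /* skip separators */
        if (*path == '\0')
            break;
        size_t len = strcspn(path, "/"); /* length of the next component */
        if (len >= sizeof comp)
            return NULL;                 /* component too long for this sketch */
        memcpy(comp, path, len);
        comp[len] = '\0';

        struct toy_vnode *cvp;
        if (toy_lookup(vp, &cvp, comp) != 0)
            return NULL;
        vp = cvp;                        /* descend one level */
        path += len;
    }
    return vp;
}
```

A real implementation would also take and release vnode references at each step, which is where issue 7 (races with create and delete) comes in.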

Page 99: Outline for Today’s Lecture

Hints

• A valuable distributed systems design technique that can be illustrated in naming.

• Definition: information that is not guaranteed to be correct. If it is, it can improve performance. If not, things will still work OK. Must be able to validate information.

• Example: Sprite prefix tables

Page 100: Outline for Today’s Lecture

Prefix Tables

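[Figure: directory tree / → A → m_pt1 → usr, with m_pt2 and B below usr; m_pt1 and m_pt2 are mount points, and each subtree is colored by the server that holds it.]

Prefix table (prefix, server color in the figure):
/A/m_pt1/usr/m_pt2    pink
/A/m_pt1              blue
/A/m_pt1/usr/B        pink

Example name to resolve: /A/m_pt1/usr/m_pt2/stuff.below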
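A prefix-table lookup can be sketched as a longest-prefix match over the table entries. The code below is an illustrative sketch, not Sprite's implementation: `prefix_entry` and `prefix_lookup` are hypothetical names, and the server labels simply reuse the colors from the slide. Because an entry is only a hint, a real client would still need the chosen server to confirm that it owns the name and would fall back to another mechanism if it does not; that validation step is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct prefix_entry {
    const char *prefix; /* path prefix, no trailing slash */
    const char *server; /* hint: which server is believed to hold this subtree */
};

/* Returns 1 if prefix matches path on a component boundary. */
static int prefix_matches(const char *prefix, const char *path)
{
    size_t n = strlen(prefix);
    if (strncmp(prefix, path, n) != 0)
        return 0;
    return path[n] == '\0' || path[n] == '/';
}

/* Longest matching prefix wins. On success *rest_out points at the remainder
 * of the path, which would be sent to the chosen server. */
static const struct prefix_entry *
prefix_lookup(const struct prefix_entry *tab, size_t ntab,
              const char *path, const char **rest_out)
{
    assert(tab != NULL || ntab == 0);
    const struct prefix_entry *best = NULL;
    size_t best_len = 0;

    for (size_t i = 0; i < ntab; i++) {
        size_t n = strlen(tab[i].prefix);
        if (prefix_matches(tab[i].prefix, path) && (best == NULL || n > best_len)) {
            best = &tab[i];
            best_len = n;
        }
    }
    if (best != NULL && rest_out != NULL)
        *rest_out = path + best_len;
    return best;
}
```

With the three entries from the slide, the example name matches both /A/m_pt1 and /A/m_pt1/usr/m_pt2, and the longer prefix is the one selected.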

Page 101: Outline for Today’s Lecture

Distributed File Systems

• Naming
– Location transparency/independence

• Caching
– Consistency

• Replication
– Availability and updates

[Figure: several clients and servers connected by a network.]

Page 102: Outline for Today’s Lecture

Caching was “The Answer”

• Avoid the disk for as many file operations as possible.

• Cache acts as a filter for the requests seen by the disk; reads are served best.

• Delayed writeback will avoid going to disk at all for temp files.

[Figure: a process (Proc) accessing the file cache held in memory.]

Page 103: Outline for Today’s Lecture

Caching in Distributed F.S.

• Location of cache on client - disk or memory

• Update policy
– write through
– delayed writeback
– write-on-close

• Consistency
– Client does validity check, contacting server
– Server call-backs

[Figure: several clients and servers connected by a network.]

Page 104: Outline for Today’s Lecture

File Cache Consistency

Caching is a key technique in distributed systems.

The cache consistency problem: cached data may become stale if the data is updated elsewhere in the network.

Solutions:

Timestamp invalidation (NFS).
– Timestamp each cache entry, and periodically query the server: “has this file changed since time t?”; invalidate cache if stale.

Callback invalidation (AFS).
– Request notification (callback) from the server if the file changes; invalidate cache on callback.

Leases (NQ-NFS) [Gray&Cheriton89]

Page 105: Outline for Today’s Lecture


Sun NFS Cache Consistency

• Server is stateless

• Requests are self-contained.

• Blocks are transferred and cached in memory.

• Timestamp of last known mod kept with cached file, compared with “true” timestamp at server on Open. (Good for an interval)

• Updates delayed but flushed before Close ends.

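[Figure: clients and servers connected by a network. On open, a client compares its cached timestamp ti with the server's timestamp tj (ti == tj ?); modified data goes back to the server on write/close.]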
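The open-time check can be sketched as follows. This is a simplified illustration under stated assumptions, not the NFS client code: `cached_file`, `validate_on_open`, and `toy_server_mtime` are hypothetical names, the function pointer stands in for an attribute-fetching RPC, and `fresh_secs` models the interval for which a previous check is trusted.

```c
#include <assert.h>
#include <time.h>

struct cached_file {
    time_t cached_mtime;   /* server modification time when the data was cached (ti) */
    time_t last_validated; /* when the client last asked the server */
    int valid;             /* 0 once the cached blocks are known to be stale */
};

/* Hypothetical stand-in for an RPC that fetches the file's attributes. */
typedef time_t (*get_server_mtime_fn)(void *server_ctx);

/* Toy server for illustration: the context is a pointer to the current mtime. */
static time_t toy_server_mtime(void *ctx) { return *(time_t *)ctx; }

/* Called on open. Returns 1 if cached data may be used, 0 if it must be refetched. */
static int validate_on_open(struct cached_file *cf, time_t now, time_t fresh_secs,
                            get_server_mtime_fn get_mtime, void *server_ctx)
{
    assert(cf != NULL && get_mtime != NULL);
    if (cf->valid && now - cf->last_validated < fresh_secs)
        return 1;                     /* still inside the trusted interval: no RPC */

    time_t server_mtime = get_mtime(server_ctx); /* the "true" timestamp (tj) */
    cf->last_validated = now;
    if (cf->valid && server_mtime == cf->cached_mtime)
        return 1;                     /* ti == tj: cache is current */

    cf->cached_mtime = server_mtime;  /* caller refetches blocks, then sets valid */
    cf->valid = 0;
    return 0;
}
```

The trusted interval is what makes the scheme cheap and also what makes it weak: a client can keep using stale data until the interval runs out.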

Page 106: Outline for Today’s Lecture


Cache Consistency for the Web

• Time-to-Live (TTL) fields - HTTP “expires” header

• Client polling - HTTP “if-modified-since” request headers
– polling frequency? possibly adaptive (e.g. based on age of object and assumed stability)

[Figure: clients on a LAN reach the Web server across the network through a proxy cache.]

Page 107: Outline for Today’s Lecture


AFS Cache Consistency

• Server keeps state of all clients holding copies (copy set)

• Callbacks when cached data are about to become stale

• Large units (whole files or 64K portions)

• Updates propagated upon close

• Cache on local disk or in memory

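[Figure: clients c0, c1, and c2 and servers connected by a network. The server records the copy set {c0, c1} for a file; a close by one holder causes the server to send a callback to the other.]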

• If a client crashes, it revalidates its cache on recovery, since callbacks may have been lost while it was down.
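The server-side bookkeeping for callbacks can be sketched with a bitmask as the copy set. This is a toy model under stated assumptions, not the AFS server: `afs_file_state`, `server_fetch`, and `server_store_on_close` are hypothetical names, clients are identified by small integers, and actually delivering the callback messages is left to the caller.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CLIENTS 32

struct afs_file_state {
    uint32_t copy_set; /* bit i set: client i holds a cached copy and a callback promise */
};

/* A client fetches the file: the server adds it to the copy set. */
static void server_fetch(struct afs_file_state *f, int client)
{
    assert(client >= 0 && client < MAX_CLIENTS);
    f->copy_set |= (uint32_t)1 << client;
}

/* A client closes the file after writing it. Returns the set of other clients
 * whose callbacks must be broken, and leaves only the writer in the copy set. */
static uint32_t server_store_on_close(struct afs_file_state *f, int writer)
{
    assert(writer >= 0 && writer < MAX_CLIENTS);
    uint32_t me = (uint32_t)1 << writer;
    uint32_t to_break = f->copy_set & ~me;
    f->copy_set = me;
    return to_break;
}
```

Unlike the NFS scheme, the server here must remember who holds copies, so a server crash or a lost callback message has to be handled by revalidation.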

Page 108: Outline for Today’s Lecture

NQ-NFS Leases

In NQ-NFS, a client obtains a lease on the file that permits the client’s desired read/write activity.

“A lease is a ticket permitting an activity; the lease is valid until some expiration time.”

– A read-caching lease allows the client to cache clean data.
  Guarantee: no other client is modifying the file.
– A write-caching lease allows the client to buffer modified data for the file.
  Guarantee: no other client has the file cached.

Leases may be revoked by the server if another client requests a conflicting operation (server sends eviction notice).

Since leases expire, losing “state” of leases at server is OK.
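The lease rules above can be sketched as a small grant check on the server. This is an illustrative sketch, not the NQ-NFS implementation: `lease_grant`, `lease_live`, and the fixed-size per-client table are assumptions made for the example. A refused request here simply returns 0, whereas a real server would send eviction notices to the conflicting holders or wait for their leases to expire.

```c
#include <assert.h>
#include <time.h>

#define MAX_CLIENTS 8

enum lease_kind { LEASE_NONE, LEASE_READ, LEASE_WRITE };

struct lease {
    enum lease_kind kind;
    time_t expires; /* lease is valid while now < expires */
};

/* Per-file lease table, one slot per client (a simplification for the sketch). */
struct file_leases {
    struct lease by_client[MAX_CLIENTS];
};

static int lease_live(const struct lease *l, time_t now)
{
    return l->kind != LEASE_NONE && now < l->expires;
}

/* Returns 1 and records the lease if no other client holds a live conflicting
 * lease; returns 0 otherwise. */
static int lease_grant(struct file_leases *f, int client, enum lease_kind want,
                       time_t now, time_t duration)
{
    assert(client >= 0 && client < MAX_CLIENTS && want != LEASE_NONE);
    for (int i = 0; i < MAX_CLIENTS; i++) {
        if (i == client || !lease_live(&f->by_client[i], now))
            continue;
        /* Read leases coexist; any combination involving a write lease conflicts. */
        if (want == LEASE_WRITE || f->by_client[i].kind == LEASE_WRITE)
            return 0;
    }
    f->by_client[client].kind = want;
    f->by_client[client].expires = now + duration;
    return 1;
}
```

Because every lease carries an expiration time, a server that loses this table in a crash can recover by refusing new leases until the longest possible lease has run out.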

Page 109: Outline for Today’s Lecture

NFS Protocol

NFS is a network protocol layered above TCP/IP.

– Original implementations (and most today) use UDP datagram transport for low overhead.

• Maximum IP datagram size was increased to match FS block size, to allow send/receive of entire file blocks.

• Some newer implementations use TCP as a transport.

NFS protocol is a set of message formats and types.

• Client issues a request message for a service operation.

• Server performs requested operation and returns a reply message with status and (perhaps) requested data.

Page 110: Outline for Today’s Lecture

File Handles

Question: how does the client tell the server which file or directory the operation applies to?

– Similarly, how does the server return the result of a lookup?

• More generally, how to pass a pointer or an object reference as an argument/result of an RPC call?

In NFS, the reference is a file handle or fhandle, a 32-byte token/ticket whose value is determined by the server.

– Includes all information needed to identify the file/object on the server, and get a pointer to it quickly.

[Figure: fhandle layout: volume ID | inode # | generation #]
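The three fields in the figure suggest how a server can turn a handle back into a file and detect stale handles. The sketch below assumes a toy volume with a fixed inode table; `fhandle`, `toy_volume`, and `fh_to_inode` are hypothetical names, and the struct carries only the three fields rather than the full 32-byte opaque token. The generation number is what catches a handle that refers to an inode freed and reused since the handle was issued.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative handle: only the three fields shown on the slide. */
struct fhandle {
    uint32_t volume_id;
    uint32_t inode_no;
    uint32_t generation;
};

struct toy_inode {
    int in_use;
    uint32_t generation; /* bumped each time the inode number is reused */
};

#define TOY_NINODES 16

struct toy_volume {
    uint32_t volume_id;
    struct toy_inode inodes[TOY_NINODES];
};

/* Resolve a handle to an inode. NULL means the handle is stale or invalid. */
static struct toy_inode *fh_to_inode(struct toy_volume *vol, const struct fhandle *fh)
{
    assert(vol != NULL && fh != NULL);
    if (fh->volume_id != vol->volume_id || fh->inode_no >= TOY_NINODES)
        return NULL;
    struct toy_inode *ip = &vol->inodes[fh->inode_no];
    if (!ip->in_use || ip->generation != fh->generation)
        return NULL; /* inode was freed, or reused for a different file */
    return ip;
}
```

Since the handle is self-describing, the server needs no per-client table to interpret it, which fits the stateless server design.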

Page 111: Outline for Today’s Lecture

NFS: From Concept to Implementation

Now that we understand the basics, how do we make it work in a real system?

– How do we make it fast?
  • Answer: caching, read-ahead, and write-behind.
– How do we make it reliable? What if a message is dropped? What if the server crashes?
  • Answer: client retransmits request until it receives a response.
– How do we preserve file system semantics in the presence of failures and/or sharing by multiple clients?
  • Answer: well, we don’t, at least not completely.
– What about security and access control?

