Hard State Revisited: Network Filesystems (Jeff Chase, CPS 212, Fall 2000)
Transcript
  • Hard State Revisited: Network Filesystems

    Jeff Chase

    CPS 212, Fall 2000

  • Network File System (NFS)

    [Diagram: on the client, user programs enter the syscall layer and VFS, which dispatches to either a local *FS or the NFS client; the NFS client talks via RPC over UDP or TCP to the NFS server, which calls through the server's syscall layer/VFS into its local *FS.]

  • NFS Vnodes

    [Diagram: on the client, the syscall layer and VFS dispatch vnode operations through nfs_vnodeops to the NFS client stubs and an nfsnode; the stubs issue RPCs across the network to the NFS server, which sits above the server's VFS and local *FS.]

    The nfsnode holds client state needed to interact with the server to operate on the file.

    struct nfsnode* np = VTONFS(vp);

    The NFS protocol has an operation type for (almost) every vnode operation, with similar arguments/results.
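
    A minimal sketch of the kind of per-file client state an nfsnode might hold, with VTONFS written as the cast the code above implies; the field names and layout are illustrative assumptions, not the actual BSD source.

    struct vnode { void *v_data; };             /* generic VFS object (simplified) */
    struct nfsfh { unsigned char data[32]; };   /* opaque server-issued file handle */

    struct nfsnode {                    /* client state for one remote file */
        struct vnode *n_vnode;          /* back pointer to the generic vnode */
        struct nfsfh  n_fh;             /* names the file/object on the server */
        long          n_attrstamp;      /* when cached attributes were last fetched */
    };

    /* VTONFS: recover the per-file NFS client state from a generic vnode. */
    #define VTONFS(vp) ((struct nfsnode *)((vp)->v_data))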

  • File Handles

    Question: how does the client tell the server which file or directory the operation applies to?

    • Similarly, how does the server return the result of a lookup?

    More generally, how to pass a pointer or an object reference as an argument/result of an RPC call?

    In NFS, the reference is a file handle or fhandle, a token/ticket whose value is determined by the server.

    • Includes all information needed to identify the file/object on the server, and find it quickly.

    fhandle contents: volume ID, inode #, generation #
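
    A hedged sketch of the information an fhandle packs together; the real handle is an opaque token whose exact layout is chosen by the server, so these fields are illustrative only.

    #include <stdint.h>

    /* Illustrative fhandle layout: clients must treat it as opaque bytes. */
    struct fhandle {
        uint32_t volume_id;    /* which exported volume (filesystem) on the server */
        uint32_t inode_num;    /* which file within that volume (fast to find) */
        uint32_t generation;   /* detects reuse of the inode for a different file */
    };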

  • NFS: From Concept to Implementation

    Now that we understand the basics, how do we make it fast?

    • caching

    data blocks

    file attributes

    lookup cache (dnlc): name->fhandle mappings

    directory contents?

    • read-ahead and write-behind

    file I/O at wire speed

    And of course we want the full range of other desirable “*ility” properties....

  • NFS as a “Stateless” Service

    A classical NFS server maintains no in-memory hard state. The only hard state is the stable file system image on disk.

    • no record of clients or open files

    • no implicit arguments to requests

    E.g., no server-maintained file offsets: read and write requests must explicitly transmit the byte offset for each operation.

    • no write-back caching on the server

    • no record of recently processed requests

    • etc., etc....

    Statelessness makes failure recovery simple and efficient.

  • Recovery in Stateless NFS

    If the server fails and restarts, there is no need to rebuild in-memory state on the server.

    • Client reestablishes contact (e.g., TCP connection).

    • Client retransmits pending requests.

    Classical NFS uses a connectionless transport (UDP).

    • Server failure is transparent to the client; no connection to break or reestablish.

    A crashed server is indistinguishable from a slow server.

    • Sun/ONC RPC masks network errors by retransmitting a request after an adaptive timeout.

    A dropped packet is indistinguishable from a crashed server.

  • Drawbacks of a Stateless Service

    The stateless nature of classical NFS has compelling design advantages (simplicity), but also some key drawbacks:

    • Recovery-by-retransmission constrains the server interface.

    ONC RPC/UDP has execute-at-least-once semantics (“send and pray”), which compromises performance and correctness.

    • Update operations are disk-limited.

    Updates must commit synchronously at the server.

    • NFS cannot (quite) preserve local single-copy semantics.

    Files may be removed while they are open on the client.

    Server cannot help in client cache consistency.

    Let’s explore these problems and their solutions...

  • Problem 1: Retransmissions and Idempotency

    For a connectionless RPC transport, retransmissions can saturate an overloaded server.

    Clients “kick ‘em while they’re down”, causing a steep hockey-stick response curve.

    Execute-at-least-once constrains the server interface.

    • Service operations should/must be idempotent.

    Multiple executions should/must have the same effect.

    • Idempotent operations cannot capture the full semantics we expect from our file system.

    remove, append-mode writes, exclusive create

  • Solutions to the Retransmission Problem

    1. Hope for the best and smooth over non-idempotent requests. E.g., map ENOENT and EEXIST to ESUCCESS.

    2. Use TCP or some other transport protocol that produces reliable, in-order delivery.

    higher overhead...and we still need sessions.

    3. Implement an execute-at-most-once RPC transport. TCP-like features (sequence numbers)...and sessions.

    4. Keep a retransmission cache on the server [Juszczak90]. Remember the most recent request IDs and their results, and just resend the result (see the sketch below).

    ...does this violate statelessness?

    DAFS persistent session cache.
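
    A hedged sketch of a server-side retransmission (duplicate request) cache in the spirit of [Juszczak90]: remember recent request IDs and their replies, and resend the saved reply instead of re-executing. All names here are hypothetical.

    #include <stdint.h>

    #define CACHE_SLOTS 1024

    struct rpc_reply { uint32_t status; char data[512]; };

    struct cache_entry {
        uint32_t xid;        /* RPC transaction ID chosen by the client */
        uint32_t client_ip;  /* disambiguates equal XIDs from different clients */
        int      valid;
        struct rpc_reply reply;
    };

    static struct cache_entry cache[CACHE_SLOTS];

    /* Return the saved reply for a retransmitted request, or NULL if the
       request is new and must be executed. */
    struct rpc_reply *lookup_dup(uint32_t xid, uint32_t client_ip)
    {
        struct cache_entry *e = &cache[xid % CACHE_SLOTS];
        if (e->valid && e->xid == xid && e->client_ip == client_ip)
            return &e->reply;
        return NULL;
    }

    /* Record the reply after executing a request, overwriting older entries. */
    void remember_reply(uint32_t xid, uint32_t client_ip, const struct rpc_reply *r)
    {
        struct cache_entry *e = &cache[xid % CACHE_SLOTS];
        e->xid = xid;
        e->client_ip = client_ip;
        e->reply = *r;
        e->valid = 1;
    }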

  • Problem 2: Synchronous Writes

    Stateless NFS servers must commit each operation to stable storage before responding to the client.

    • Interferes with FS optimizations, e.g., clustering, LFS, and disk write ordering (seek scheduling).

    Damages bandwidth and scalability.

    • Imposes disk access latency for each request.

    Not so bad for a logged write; much worse for a complex operation like an FFS file write.

    The synchronous update problem occurs for any storage service with reliable update (commit).

  • Speeding Up Synchronous NFS Writes

    Interesting solutions to the synchronous write problem, used in high-performance NFS servers:

    • Delay the response until convenient for the server.

    E.g., NFS write-gathering optimizations for clustered writes (similar to group commit in databases).

    Relies on write-behind from NFS I/O daemons (iods).

    • Throw hardware at it: non-volatile memory (NVRAM).

    Battery-backed RAM or UPS (uninterruptible power supply).

    Use as an operation log (Network Appliance WAFL)...

    ...or as a non-volatile disk write buffer (Legato).

    • Replicate server and buffer in memory (e.g., MIT Harp).

  • NFS V3 Asynchronous Writes

    NFS V3 sidesteps the synchronous write problem by adding a new asynchronous write operation.

    • Server may reply to client as soon as it accepts the write, before executing/committing it.

    If the server fails, it may discard any subset of the accepted but uncommitted writes.

    • Client holds asynchronously written data in its cache, and reissues the writes if the server fails and restarts.

    When is it safe for the client to discard its buffered writes?

    How can the client tell if the server has failed?

  • NFS V3 Commit

    NFS V3 adds a new commit operation to go with async-write.

    • Client may issue a commit for a file byte range at any time.

    • Server must execute all covered uncommitted writes before replying to the commit.

    • When the client receives the reply, it may safely discard any buffered writes covered by the commit.

    • Server returns a verifier with every reply to an async write or commit request.

    The verifier is just an integer that is guaranteed to change if the server restarts, and to never change back.

    • What if the client crashes?
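
    A hedged sketch of the client-side logic for async write plus commit: buffered writes are held until a commit succeeds under an unchanged write verifier; if the verifier changes, the server restarted and the writes are reissued. The RPC stubs here (nfs3_write_async, nfs3_commit) are assumed names, not a real API.

    #include <stdint.h>

    struct buffered_write { uint64_t offset; uint32_t length; const void *data; };

    /* Assumed RPC stubs; each returns the server's current write verifier. */
    uint64_t nfs3_write_async(const struct buffered_write *w);
    uint64_t nfs3_commit(uint64_t offset, uint32_t count);  /* count 0 = to end of file */

    void flush_file(struct buffered_write *writes, int n)
    {
        uint64_t verifier = 0;
        int have_verifier;

    retry:
        have_verifier = 0;
        /* Issue (or reissue) the buffered writes asynchronously. */
        for (int i = 0; i < n; i++) {
            uint64_t v = nfs3_write_async(&writes[i]);
            if (have_verifier && v != verifier)
                goto retry;           /* verifier changed: server restarted, reissue all */
            verifier = v;
            have_verifier = 1;
        }

        /* Commit the whole range; only after this is it safe to discard the buffers. */
        if (nfs3_commit(0, 0) != verifier)
            goto retry;               /* uncommitted data may have been lost */

        /* Safe to free the buffered writes here. */
    }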

  • Problem 3: File Cache Consistency

    Problem: Concurrent write sharing of files.

    Contrast with read sharing or sequential write sharing.

    Solutions:

    • Timestamp invalidation (NFS).

    Timestamp each cache entry, and periodically query the server: “has this file changed since time t?”; invalidate the cache if stale (see the sketch after this list).

    • Callback invalidation (AFS, Sprite, Spritely NFS).

    Request notification (callback) from the server if the file changes; invalidate cache and/or disable caching on callback.

    • Leases (NQ-NFS) [Gray&Cheriton89, Macklem93, NFS V4]

    • Later: distributed shared memory
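
    A hedged sketch of NFS-style timestamp invalidation: cached data is trusted only while the cached attributes are fresh; otherwise the client refetches attributes and compares the modify time. The helper nfs_getattr_mtime is an assumed stub, not a real call.

    #include <stdbool.h>
    #include <time.h>

    struct cached_file {
        time_t attr_fetch_time;   /* when we last asked the server */
        time_t server_mtime;      /* modify time the server reported then */
        /* ... cached data blocks ... */
    };

    /* Assumed RPC stub: fetch the server's current modify time for the file. */
    time_t nfs_getattr_mtime(const char *path);

    /* Return true if the cache may be used, false if it must be invalidated. */
    bool cache_is_valid(struct cached_file *cf, const char *path, int timeout_secs)
    {
        time_t now = time(NULL);

        if (now - cf->attr_fetch_time <= timeout_secs)
            return true;                        /* attributes still fresh enough */

        time_t mtime = nfs_getattr_mtime(path); /* "has this file changed since time t?" */
        cf->attr_fetch_time = now;
        if (mtime != cf->server_mtime) {
            cf->server_mtime = mtime;
            return false;                       /* stale: invalidate cached blocks */
        }
        return true;
    }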

  • File Cache Example: NQ-NFS Leases

    In NQ-NFS, a client obtains a lease on the file that permits the client’s desired read/write activity.

    “A lease is a ticket permitting an activity; the lease is valid untilsome expiration time.”

    • A read-caching lease allows the client to cache clean data.

    Guarantee: no other client is modifying the file.

    • A write-caching lease allows the client to buffer modified data for the file.

    Guarantee: no other client has the file cached.

    Allows delayed writes: client may delay issuing writes to improve write performance (i.e., client has a writeback cache).

  • Using NQ-NFS Leases

    1. Client NFS piggybacks lease requests for a given file on I/O operation requests (e.g., read/write).

    NQ-NFS leases are implicit and distinct from file locking.

    2. The server determines if it can safely grant the request, i.e., does it conflict with a lease held by another client (see the sketch below).

    read leases may be granted simultaneously to multiple clients

    write leases are granted exclusively to a single client

    3. If a conflict exists, the server may send an eviction notice to the holder of the conflicting lease.

    If a client is evicted from a write lease, it must write back.

    Grace period: server grants extensions while the client writes.

    Client sends a vacated notice when all writes are complete.
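
    A hedged sketch of the server's grant decision from step 2: read leases may be shared, a write lease must be exclusive. Types and names are hypothetical, not the NQ-NFS source.

    #include <stdbool.h>

    enum lease_kind { LEASE_READ, LEASE_WRITE };

    struct lease {
        int             client_id;
        enum lease_kind kind;
        long            expires;     /* lease term end, on the server's clock */
        struct lease   *next;
    };

    /* Return true if the requested lease can be granted now; otherwise the
       caller sends eviction notices to the conflicting holders. */
    bool can_grant(const struct lease *holders, int client_id,
                   enum lease_kind want, long now)
    {
        for (const struct lease *l = holders; l != NULL; l = l->next) {
            if (l->expires <= now || l->client_id == client_id)
                continue;                 /* expired, or our own lease: no conflict */
            if (want == LEASE_WRITE)
                return false;             /* write leases are exclusive */
            if (l->kind == LEASE_WRITE)
                return false;             /* another client holds a write lease */
        }
        return true;                      /* read/read sharing is fine */
    }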

  • NQ-NFS Lease Recovery

    Key point: the bounded lease term simplifies recovery.

    • Before a lease expires, the client must renew the lease.

    • What if a client fails while holding a lease?

    Server waits until the lease expires, then unilaterally reclaims the lease; client forgets all about it.

    If a client fails while writing on an eviction, server waits for write slack time before granting a conflicting lease.

    • What if the server fails while there are outstanding leases?

    Wait for lease period + clock skew before issuing new leases.

    • Recovering server must absorb lease renewal requests and/orwrites for vacated leases.

  • NQ-NFS Leases and Cache Consistency

    • Every lease contains a file version number.

    Invalidate the cache iff the version number has changed.

    • Clients may disable client caching when there is concurrent write sharing.

    no-caching lease

    • What consistency guarantees do NQ-NFS leases provide?

    Does the server eventually receive/accept all writes?

    Does the server accept the writes in order?

    Are groups of related writes atomic?

    How are write errors reported?

    What is the relationship to NFS V3 commit?

  • The Distributed Lock Lab

    The lock implementation is similar to DSM systems, with reliability features similar to distributed file caches.

    • use Java RMI

    • lock token caching with callbacks

    lock tokens passed through server, not peer-peer as DSM

    • synchronizes multiple threads on same client

    • state bit for pending callback on client

    • server must reissue callback each lease interval (or use RMI timeouts to detect a failed client)

    • client must renew token each lease interval

  • Background: Unix Filesystem Internals

  • A Typical Unix File Tree

    [Diagram: a file tree rooted at / containing etc, tmp, vmunix, bin (ls, sh), and usr (project, users, packages: tex, emacs); one subtree is marked as a grafted volume root.]

    File trees are built by grafting volumes from different devices or from network servers.

    Each volume is a set of directories and files; a host’s file tree is the set of directories and files visible to processes on a given host.

    In Unix, the graft operation is the privileged mount system call, and each volume is a filesystem.

    mount(coveredDir, volume)

    coveredDir: directory pathname (the mount point)
    volume: device specifier or network volume

    The volume root's contents become visible at pathname coveredDir.
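
    An illustration of the graft operation using the Linux mount(2) signature (the slide's mount(coveredDir, volume) is schematic, and real signatures differ across systems); the device and directory names below are made up.

    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        /* Graft the volume on /dev/sdb1 into the file tree at /usr/project:
           the volume root's contents become visible at the covered directory. */
        if (mount("/dev/sdb1", "/usr/project", "ext4", MS_RDONLY, NULL) != 0)
            perror("mount");
        return 0;
    }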

  • Filesystems

    Each file volume (filesystem) has a type, determined by its disk layout or the network protocol used to access it.

    ufs (ffs), lfs, nfs, rfs, cdfs, etc.

    Filesystems are administered independently.

    Modern systems also include “logical” pseudo-filesystems in the naming tree, accessible through the file syscalls.

    procfs: the /proc filesystem allows access to process internals.

    mfs: the memory file system is a memory-based scratch store.

    Processes access filesystems through common system calls.

  • VFS: the Filesystem Switch

    [Diagram: user space sits above the syscall layer (file, uio, etc.); the Virtual File System (VFS) dispatches to NFS, FFS, LFS, and other *FS modules; NFS rides on the network protocol stack (TCP/IP), while local filesystems sit above the device drivers.]

    Sun Microsystems introduced the virtual file system interface in 1985 to accommodate diverse filesystem types cleanly.

    VFS allows diverse specific file systems to coexist in a file tree, isolating all FS dependencies in pluggable filesystem modules.

    VFS was an internal kernel restructuring with no effect on the syscall interface.

    Incorporates object-oriented concepts: a generic procedural interface with multiple implementations.

    Based on abstract objects with dynamic method binding by type...in C.

    Other abstract interfaces in the kernel: device drivers, file objects, executable files, memory objects.

  • Vnodes

    In the VFS framework, every file or directory in active use is represented by a vnode object in kernel memory.

    [Diagram: the syscall layer references active vnodes, which belong to specific filesystems such as NFS or UFS; inactive vnodes sit on a free list.]

    Each vnode has a standard file attributes struct.

    Vnode operations are macros that vector to filesystem-specific procedures (see the sketch below).

    Generic vnode points at a filesystem-specific struct (e.g., inode, rnode), seen only by the filesystem.

    Each specific file system maintains a cache of its resident vnodes.
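
    A minimal sketch of the object-oriented idea in C, as referenced above: each vnode carries a pointer to a per-filesystem operations table, and generic operations vector through it. Names are illustrative, not the actual BSD/Sun definitions.

    struct vattr;                      /* standard file attributes (opaque here) */
    struct vnode;

    struct vnodeops {                  /* one table per filesystem type */
        int (*vop_lookup)(struct vnode *dvp, struct vnode **vpp, const char *name);
        int (*vop_getattr)(struct vnode *vp, struct vattr *va);
        /* ... one entry per vnode operation ... */
    };

    struct vnode {
        const struct vnodeops *v_op;   /* dynamic method binding by FS type */
        void                  *v_data; /* FS-specific struct: inode, rnode, nfsnode, ... */
    };

    /* Generic operation: callers never know which filesystem they hit. */
    #define VOP_LOOKUP(dvp, vpp, name)  ((dvp)->v_op->vop_lookup((dvp), (vpp), (name)))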

  • Vnode Operations and Attributes

    directories only:
    vop_lookup (OUT vpp, name)
    vop_create (OUT vpp, name, vattr)
    vop_remove (vp, name)
    vop_link (vp, name)
    vop_rename (vp, name, tdvp, tvp, name)
    vop_mkdir (OUT vpp, name, vattr)
    vop_rmdir (vp, name)
    vop_symlink (OUT vpp, name, vattr, contents)
    vop_readdir (uio, cookie)
    vop_readlink (uio)

    files only:
    vop_getpages (page**, count, offset)
    vop_putpages (page**, count, sync, offset)
    vop_fsync ()

    vnode attributes (vattr):
    type (VREG, VDIR, VLNK, etc.)
    mode (9+ bits of permissions)
    nlink (hard link count)
    owner user ID
    owner group ID
    filesystem ID
    unique file ID
    file size (bytes and blocks)
    access time
    modify time
    generation number

    generic operations:
    vop_getattr (vattr)
    vop_setattr (vattr)
    vhold()
    vholdrele()

  • V/Inode Cache

    [Diagram: vnodes are hashed by HASH(fsid, fileid); inactive vnodes hang off the VFS free list head.]

    Active vnodes are reference-counted by the structures that hold pointers to them:

    - system open file table

    - process current directory

    - file system mount points

    - etc.

    Each specific file system maintains its own hash of vnodes (BSD).

    - specific FS handles initialization

    - free list is maintained by VFS

    vget(vp): reclaim cached inactive vnode from VFS free list
    vref(vp): increment reference count on an active vnode
    vrele(vp): release reference count on a vnode
    vgone(vp): vnode is no longer valid (file is removed)
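
    A hedged sketch of vnode reference counting and the VFS free list, matching the vref/vrele operations listed above; the layout is hypothetical.

    struct vnode {
        int           v_usecount;   /* holders: open files, current dirs, mount points, ... */
        struct vnode *v_freenext;   /* link on the VFS free list when inactive */
    };

    static struct vnode *vfs_free_list;   /* head of the free list kept by VFS */

    void vref(struct vnode *vp)
    {
        vp->v_usecount++;                 /* another structure now points at it */
    }

    void vrele(struct vnode *vp)
    {
        if (--vp->v_usecount == 0) {
            /* Inactive but still cached: push onto the free list so a later
               vget() can reclaim it by (fsid, fileid). */
            vp->v_freenext = vfs_free_list;
            vfs_free_list = vp;
        }
    }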

  • Pathname Traversal

    When a pathname is passed as an argument to a system call,the syscall layer must “convert it to a vnode”.

    Pathname traversal is a sequence of vop_lookup calls to descend the tree to the named file or directory.

    open("/tmp/zot")
      vp = get vnode for / (rootdir)
      vp->vop_lookup(&cvp, "tmp");
      vp = cvp;
      vp->vop_lookup(&cvp, "zot");

    Issues:
    1. crossing mount points
    2. obtaining root vnode (or current dir)
    3. finding resident vnodes in memory
    4. caching name->vnode translations
    5. symbolic (soft) links
    6. disk implementation of directories
    7. locking/referencing to handle races with name create and delete operations
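
    A hedged sketch of the traversal loop that the unrolled example above performs for "/tmp/zot": walk the path one component at a time, looking each name up in the vnode returned by the previous step. The helpers get_root_vnode and vop_lookup are assumed stubs.

    #include <string.h>

    struct vnode;
    struct vnode *get_root_vnode(void);
    int vop_lookup(struct vnode *dvp, struct vnode **vpp, const char *name);

    struct vnode *namei(const char *path)
    {
        char buf[1024];
        strncpy(buf, path, sizeof buf - 1);
        buf[sizeof buf - 1] = '\0';

        struct vnode *vp = get_root_vnode();   /* or the current dir for relative paths */
        for (char *name = strtok(buf, "/"); name != NULL; name = strtok(NULL, "/")) {
            struct vnode *cvp;
            if (vop_lookup(vp, &cvp, name) != 0)
                return NULL;                   /* component not found */
            vp = cvp;                          /* descend one level */
        }
        return vp;                             /* vnode for the named file or directory */
    }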

  • NFS Protocol

    NFS is a network protocol layered above TCP/IP.

    • Original implementations (and most today) use UDP datagram transport for low overhead.

    Maximum IP datagram size was increased to match FS block size, to allow send/receive of entire file blocks.

    Some implementations use TCP as a transport.

    • The NFS protocol is a set of message formats and types.

    Client issues a request message for a service operation.

    Server performs the requested operation and returns a reply message with status and (perhaps) requested data.
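
    A hedged, schematic sketch of what an NFS read exchange carries (not the actual XDR wire encoding): the request names the file by fhandle and gives an explicit offset and count, since the stateless server keeps no open-file state; the reply returns status and data.

    #include <stdint.h>

    struct fhandle { unsigned char data[32]; };   /* opaque server-issued token */

    struct nfs_read_request {
        struct fhandle fh;        /* which file */
        uint64_t       offset;    /* explicit byte offset for every operation */
        uint32_t       count;     /* how many bytes to read */
    };

    struct nfs_read_reply {
        uint32_t      status;     /* NFS status code, 0 on success */
        uint32_t      count;      /* bytes actually returned */
        unsigned char data[8192]; /* the requested file data (one FS block) */
    };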

