ecs251 Fall 2007: Operating System Models
#1: File Systems
09/27/2007

Dr. S. Felix Wu
Computer Science Department
University of California, Davis
http://www.cs.ucdavis.edu/~wu/
[email protected]
File Systems

Information organization, access, and storage
Reliability
Content-centric computing & networking
– Database, Web/Google, Bittorrent, emails

System-call interface

Active file entries

VNODE Layer or VFS

Local naming (UFS)

FFS

Buffer cache

Block or character device driver

Hardware

File → Disk

Separate the disk into blocks.
Separate the file into blocks as well.
“Paging” from file to disk.

Blocks: 4 - 7 - 2 - 10 - 12

How to represent the file?
How to link these 5 pages together?

Bit Torrent pieces

1 big file (X Gigabytes) with a number of pieces (5%) already in (and sharing with others).

How much disk space do we need at this moment?

File → Disk blocks

file block 0 → disk block 4
file block 1 → disk block 7
file block 2 → disk block 2
file block 3 → disk block 10
file block 4 → disk block 12 → 0 (end of list)

What are the disadvantages?
1. Disk access can be slow for “random access”.
2. How big is each block? 2^X bytes? 2^X + 8 bytes?
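The first disadvantage can be made concrete with a small sketch. It assumes the blocks are chained by a "next" pointer stored with each block, which is what the 2^X + 8 question hints at. The array `next_block`, the constant `END_OF_FILE`, and the function `disk_block_for` are hypothetical names used only for illustration.

```c
#include <stdint.h>

#define NBLOCKS 16
#define END_OF_FILE 0   /* assumption: 0 marks the end of the chain */

/* Hypothetical stand-in for the per-block "next" pointers on disk:
 * 4 -> 7 -> 2 -> 10 -> 12 -> end, as in the slide. */
static const uint64_t next_block[NBLOCKS] = {
    [4] = 7, [7] = 2, [2] = 10, [10] = 12, [12] = END_OF_FILE
};

/* Return the disk block holding file block k (0-based), or END_OF_FILE
 * if k is past the end. *reads counts how many pointers were followed;
 * on a real disk each one would be a separate block read. */
uint64_t disk_block_for(uint64_t first, unsigned k, unsigned *reads)
{
    uint64_t b = first;

    *reads = 0;
    while (k > 0 && b != END_OF_FILE) {
        b = next_block[b];
        (*reads)++;
        k--;
    }
    return b;
}
```

Reaching file block k costs k pointer reads, so a seek to the middle of a large file touches every block before it.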

One Logical File → Physical Disk Blocks

efficient representation & access

An i-node

??? entries in one disk block

A file

struct ufs2_dinode {
    u_int16_t    di_mode;          /*   0: IFMT, permissions; see below. */
    int16_t      di_nlink;         /*   2: File link count. */
    u_int32_t    di_uid;           /*   4: File owner. */
    u_int32_t    di_gid;           /*   8: File group. */
    u_int32_t    di_blksize;       /*  12: Inode blocksize. */
    u_int64_t    di_size;          /*  16: File byte count. */
    u_int64_t    di_blocks;        /*  24: Bytes actually held. */
    ufs_time_t   di_atime;         /*  32: Last access time. */
    ufs_time_t   di_mtime;         /*  40: Last modified time. */
    ufs_time_t   di_ctime;         /*  48: Last inode change time. */
    ufs_time_t   di_birthtime;     /*  56: Inode creation time. */
    int32_t      di_mtimensec;     /*  64: Last modified time. */
    int32_t      di_atimensec;     /*  68: Last access time. */
    int32_t      di_ctimensec;     /*  72: Last inode change time. */
    int32_t      di_birthnsec;     /*  76: Inode creation time. */
    int32_t      di_gen;           /*  80: Generation number. */
    u_int32_t    di_kernflags;     /*  84: Kernel flags. */
    u_int32_t    di_flags;         /*  88: Status flags (chflags). */
    int32_t      di_extsize;       /*  92: External attributes block. */
    ufs2_daddr_t di_extb[NXADDR];  /*  96: External attributes block. */
    ufs2_daddr_t di_db[NDADDR];    /* 112: Direct disk blocks. */
    ufs2_daddr_t di_ib[NIADDR];    /* 208: Indirect disk blocks. */
    int64_t      di_spare[3];      /* 232: Reserved; currently unused */
};

struct ufs1_dinode {
    u_int16_t    di_mode;          /*   0: IFMT, permissions; see below. */
    int16_t      di_nlink;         /*   2: File link count. */
    union {
        u_int16_t oldids[2];       /*   4: Ffs: old user and group ids. */
    } di_u;
    u_int64_t    di_size;          /*   8: File byte count. */
    int32_t      di_atime;         /*  16: Last access time. */
    int32_t      di_atimensec;     /*  20: Last access time. */
    int32_t      di_mtime;         /*  24: Last modified time. */
    int32_t      di_mtimensec;     /*  28: Last modified time. */
    int32_t      di_ctime;         /*  32: Last inode change time. */
    int32_t      di_ctimensec;     /*  36: Last inode change time. */
    ufs1_daddr_t di_db[NDADDR];    /*  40: Direct disk blocks. */
    ufs1_daddr_t di_ib[NIADDR];    /*  88: Indirect disk blocks. */
    u_int32_t    di_flags;         /* 100: Status flags (chflags). */
    int32_t      di_blocks;        /* 104: Blocks actually held. */
    int32_t      di_gen;           /* 108: Generation number. */
    u_int32_t    di_uid;           /* 112: File owner. */
    u_int32_t    di_gid;           /* 116: File group. */
    int32_t      di_spare[2];      /* 120: Reserved; currently unused */
};

Bittorrent pieces

File size: 10 GB
Pieces downloaded: 512 MB
How much disk space do we need?

Welcome to ecs251

Alagesan, Shri Vidhya
Anand, Manish K.
Apple, James B.
Bhattacharyya, Prantik
Brasseur, Brian A.
Cai, Liang
Chan, An
Chan, Kelcey
Ding, Dan
Gulati, Supriya S.
Hadke, Amit A.
Harve, Shruthi K.
Jones, Chad E.
Kim, Yejin
Kishore, Apoorv
Kwon, Taeho
Lorca-Martinez, Daniel A.
Mason, Blake
Muppala, Yugandhar
Nelson, Jarom R.
Schwarz, Christopher G.
Shafii, Sohail S.
Shi, Lei
Singh, Anhad P.
Stegers, Till J.
Sukharev, Jeffrey
Thomason, Russell D.
Xia, Ming
Xu, Dan

Subjects to cover

File Systems
Distributed File Systems (NFS/AFS/Coda)
Peer-to-Peer Systems (Bittorrent, Chord, Web 2.0)
Social Network Systems (Facebook, LinkedIn, Orkut, MySpace, …)
Transaction, Recovery & Distributed Commit
Trust and Reputation Systems
Self-Stabilization Systems (if time available)

Class website

http://www.cs.ucdavis.edu/~wu/ecs251

About the Instructor

S. Felix Wu
– [email protected]
– Facebook group (under UC Davis network): “ecs251”
Office: 3057 Engineering II
Phone: 530-754-7070
Office Hours:
– 10-11 a.m. on Monday and Friday
– by appointment

Why 3 email addresses?

[email protected]
– Read and answered during the quarter, especially before the homework deadlines.
[email protected]
– My main email contact for everything all the time.
[email protected]
– Read only once in the past three months…

Anti-Spam

[email protected]
subject: [ecs251_f2007]…

ecs251_f2007_0927 is the cyber social link between the instructor and the students in ecs251, Fall 2007.

Textbook

A set of papers on these subjects

Requirements

Midterm: 30% (11/01/2007)
Final Project: 70% (1~2 persons)
– Pre-proposal (10%, 10/18/2007)
– Proposal/Design/Prototype (30%, 11/20/2007)
– Final Demo & Report (30%, 12/11/2007)

Projects (10/02/2007)

On-demand/interactive Bittorrent
– Kernel, BT client, & applications
Transaction-oriented File System
P2P-X over Internet
– X = Facebook, Google, eBay, ud-DFS, Data Center
“Internet-less information system”
– MANET-based systems
Social network information crawling

On-Demand/Interactive BT

BT piece selection: rarest-first plus priority
Downloading xyz.rm and we want to fast forward to a particular part of the file
Can “the kernel file system” pass that particular need to the BT?
– E.g., the system call read will be blocked until the BT obtains the needed pieces.
Advantage: don’t need to modify the applications!
Disadvantages: ???
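One possible reading of "rarest-first plus priority" is sketched below. The slide does not say how priority and rarity are combined. This sketch assumes priority wins and rarity breaks ties, so a piece the blocked read is waiting for jumps the queue. `struct piece` and `choose_piece` are hypothetical names, not part of any BitTorrent client.

```c
#include <stddef.h>

/* Hypothetical per-piece bookkeeping kept by the client. */
struct piece {
    int have;          /* 1 if already downloaded */
    int priority;      /* higher = wanted sooner (e.g. a read is blocked on it) */
    int availability;  /* number of peers that have it; lower = rarer */
};

/* Pick the next piece to request: highest priority first,
 * rarest first among pieces of equal priority.
 * Returns the piece index, or -1 if nothing is missing. */
int choose_piece(const struct piece *p, size_t n)
{
    int best = -1;

    for (size_t i = 0; i < n; i++) {
        if (p[i].have)
            continue;
        if (best < 0 ||
            p[i].priority > p[best].priority ||
            (p[i].priority == p[best].priority &&
             p[i].availability < p[best].availability))
            best = (int)i;
    }
    return best;
}
```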

Transaction-oriented FS

File system metadata consistency:
– FreeBSD: Soft Updates
– Linux: Journaling
Can we relax “All or Nothing”?
– A better user interface for us to recover partial file modifications.

P2P over X

Facebook
Google
eBay
MegaUpload

Problems with centralized service providers???

DaceBook

Davis version of FaceBook
Peer-to-Peer social networking
Community gateways to other social networks such as Facebook

Socially here, information here

Internet-less Information Sharing

Ad Hoc Collaboration for Content Sharing in the Infrastructure-less Heterogeneous Network

[Figure: nodes and their radios: A (Bluetooth), B (WiFi + Bluetooth), C (WiMAX), D (WiFi + WiMAX), E (WiFi), F (WiFi + WiMAX), S (WiFi)]

Infrastructure-Less

“No IP address, DNS, or Web servers”
Purely P2P from Layers 3 to 7

Social Network Crawling

Social Networking

Social Network services
– Friendster, MySpace, Facebook, Orkut, LinkedIn, ...
Social Links, Interest Keywords, Search & Community Services

Social Networks: # of Users

FaceBook ~34M
Friendster ~47M
Hi5 ~50M
LinkedIn ~12M
MySpace ~200M
Orkut ~47M
Yahoo! 360° ~4M
… among many others.

http://en.wikipedia.org/wiki/List_of_social_networking_sites

Social Network Access Interfaces

FBML - Facebook Markup Language
FQL - Facebook Query Language
REST (REpresentational State Transfer) Web API
FBJS - Implementation of Javascript on the Facebook platform

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>    /* sleep() */

int
main(void)
{
    FILE *f1 = fopen("./sss.txt", "w");
    int i;

    if (f1 == NULL) {
        perror("fopen");
        return 1;
    }

    /* Write 1000 short records at random offsets, leaving holes in between. */
    for (i = 0; i < 1000; i++) {
        fseek(f1, rand(), SEEK_SET);
        fprintf(f1, "%d%d%d%d", rand(), rand(), rand(), rand());
        if (i % 100 == 0)
            sleep(1);
    }
    fflush(f1);
    fclose(f1);
    return 0;
}

# ./t
# ls -l ./sss.txt

di_size vs. di_blocks

di_size: logical size (fstat)
di_blocks: physical size (du)
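The two numbers can be read from user space with stat(2). The sketch below is illustrative only. It assumes st_blocks is counted in 512-byte units, which is common on BSD and Linux but not guaranteed everywhere. `file_sizes` and `print_sizes` are hypothetical helper names.

```c
#include <stdio.h>
#include <sys/stat.h>

/* Fetch the logical size (like di_size) and the allocated bytes
 * (like di_blocks) of a file. Returns 0 on success, -1 on error. */
int file_sizes(const char *path, long long *logical, long long *physical)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    *logical = (long long)st.st_size;
    *physical = (long long)st.st_blocks * 512;  /* assumption: 512-byte units */
    return 0;
}

/* Print both sizes; for a sparse file such as sss.txt they can differ a lot. */
void print_sizes(const char *path)
{
    long long logical, physical;

    if (file_sizes(path, &logical, &physical) == 0)
        printf("%s: logical %lld bytes, allocated %lld bytes\n",
               path, logical, physical);
    else
        perror(path);
}
```

Calling print_sizes("./sss.txt") after running the program above should show a logical size far larger than the allocated size on file systems that support holes.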

An i-node

Typical: each block 1K

??? entries in one disk block

A file

i-node

How many disk blocks can a FS have?
How many levels of i-node indirection will be necessary to store a file of 2G bytes? (I.e., 0, 1, 2 or 3)
What is the largest possible file size in i-node?
What is the size of the i-node itself for a file of 10GB with only 512 MB downloaded?

Answer

How many disk blocks can a FS have?
– 2^64 or 2^32: pointer (to blocks) size is 8/4 bytes.
How many levels of i-node indirection will be necessary to store a file of 2G (2^31) bytes? (I.e., 0, 1, 2 or 3)
– 12 * 2^10 + 2^8 * 2^10 + 2^8 * 2^8 * 2^10 + 2^8 * 2^8 * 2^8 * 2^10 >? 2^31
What is the largest possible file size in i-node?
– 12 * 2^10 + 2^8 * 2^10 + 2^8 * 2^8 * 2^10 + 2^8 * 2^8 * 2^8 * 2^10
– 2^64 - 1
– 2^32 * 2^10
You need to consider three issues and find the minimum!
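The arithmetic above can be checked with a short sketch. It assumes the slide's numbers: 12 direct pointers, 1K blocks, and 4-byte (ufs1-style) pointers, so 2^8 pointers fit in one indirect block. The function names are hypothetical.

```c
#include <stdint.h>

#define NDADDR         12        /* direct pointers in the i-node */
#define BLOCK_SIZE     1024ULL   /* 1K blocks, as in the slide */
#define PTRS_PER_BLOCK 256ULL    /* 1K / 4-byte pointers = 2^8 (assumption) */

/* Bytes addressable using the direct pointers plus indirection
 * levels 1..levels (levels in 0..3). */
uint64_t reachable_bytes(int levels)
{
    uint64_t total = NDADDR * BLOCK_SIZE;
    uint64_t blocks = 1;

    for (int l = 1; l <= levels; l++) {
        blocks *= PTRS_PER_BLOCK;       /* 2^8, 2^16, 2^24 data blocks */
        total += blocks * BLOCK_SIZE;
    }
    return total;
}

/* Smallest number of indirection levels that can hold `size` bytes,
 * or -1 if even triple indirection is not enough. */
int levels_needed(uint64_t size)
{
    for (int l = 0; l <= 3; l++)
        if (reachable_bytes(l) >= size)
            return l;
    return -1;
}

/* The three limits from the slide; the answer is their minimum. */
uint64_t max_file_size(void)
{
    uint64_t by_pointers   = reachable_bytes(3);
    uint64_t by_size_field = UINT64_MAX;                 /* 2^64 - 1 */
    uint64_t by_block_addr = (1ULL << 32) * BLOCK_SIZE;  /* 2^32 * 2^10 */
    uint64_t m = by_pointers;

    if (by_size_field < m) m = by_size_field;
    if (by_block_addr < m) m = by_block_addr;
    return m;
}
```

With these assumptions, double indirection reaches only about 64 MB, so a 2^31-byte file needs the triple-indirect pointer, and the pointer structure (about 2^34 bytes) is the tightest of the three limits.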

Answer: Lower Bound

How many pointers?
– 512MB divided by the block size (1K)
– 512K pointers times 8 (4) bytes = 4 (2) MB

Answer: Upper Bound

In the worst case, EVERY indirection block has at least one entry!
How many indirection blocks?
– Single: 1 block
– Double: 1 + 2^8
– Triple: 1 + 2^8 + 2^16
Total ~ 2^16 blocks times 1K = 64 MB
– 2^14 times 1K = 16MB (ufs2 inode)
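Both bounds follow from the same two parameters, block size and pointer size. The sketch below simply reproduces the slide's arithmetic under those assumptions; the function names are hypothetical.

```c
#include <stdint.h>

/* Lower bound in bytes: one pointer for each downloaded data block. */
uint64_t pointer_bytes_lower(uint64_t downloaded, uint64_t block_size,
                             uint64_t ptr_size)
{
    return (downloaded / block_size) * ptr_size;
}

/* Upper bound in blocks: every single, double, and triple
 * indirection block is allocated. */
uint64_t indirect_blocks_upper(uint64_t block_size, uint64_t ptr_size)
{
    uint64_t p = block_size / ptr_size;   /* pointers per block */
    uint64_t single = 1;
    uint64_t dbl    = 1 + p;
    uint64_t triple = 1 + p + p * p;

    return single + dbl + triple;
}
```

For 1K blocks this gives 66051 blocks (about 64 MB) with 4-byte pointers and 16643 blocks (about 16 MB) with 8-byte pointers, matching the ~2^16 and ~2^14 estimates.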

Answer (4)

ufs1: 2 MB ~ 64 MB
ufs2: 4 MB ~ 16 MB

Answer: sss.txt ~17 MB
– ~16 MB (inode indirection blocks)
– 1000 writes times 1K ~ 1MB

A File System

[Figure: file system layout]
Disk: partition | partition | partition
Partition: sb | i-list | directory and data blocks
i-list: i-node | i-node | … | i-node

DIR *dirp = opendir(filename);    /* filename: const char * */
struct dirent *direntp = readdir(dirp);

struct dirent {
    ino_t d_ino;
    char d_name[NAME_MAX+1];
};

[Figure: a directory holds dirent entries; each dirent pairs an inode with a file_name and leads to a file]
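A minimal sketch of walking a directory with these calls follows. `list_dir` is a hypothetical helper name. d_ino is available on BSD and Linux, but other systems may differ.

```c
#include <dirent.h>
#include <stdio.h>

/* Print the i-node number and name of every entry in a directory.
 * Returns the number of entries seen, or -1 if the directory
 * cannot be opened. */
int list_dir(const char *path)
{
    DIR *dirp = opendir(path);
    struct dirent *direntp;
    int count = 0;

    if (dirp == NULL)
        return -1;
    while ((direntp = readdir(dirp)) != NULL) {
        printf("%10llu  %s\n",
               (unsigned long long)direntp->d_ino, direntp->d_name);
        count++;
    }
    closedir(dirp);
    return count;
}
```

The output makes the point of the figure visible: a directory is just a list of (i-node number, name) pairs.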

[Figure: i-node table and the directory and file blocks it points to]

i-nodes:
2: drwxr-xr-x, Apr 1 2004, root wheel (directory /)
4: drwxr-xr-x, Apr 1 2004, root wheel (directory /usr)
5: rwxr-xr-x, Apr 15 2004, root wheel (file /vmunix)
6: rw-rw-r--, Jan 19 2004, kirk staff (file /usr/foo)
7: drwxr-xr-x, Apr 1 2004, root wheel (directory /usr/bin)
9: rwxr-xr-x, Apr 15 2004, bin bin (file /usr/bin/vi)

Directory /: . 2, .. 2, usr 4, vmunix 5
Directory /usr: . 4, .. 2, bin 7, foo 6
Directory /usr/bin: . 7, .. 4, ex 9, groff 10, vi 9
File /vmunix: text data
File /usr/foo: Hello World!
File /usr/bin/vi: text data

struct dirent {
    ino_t d_ino;
    char d_name[NAME_MAX+1];
};

struct stat {
    …
    short nlinks;
    …
};

[Figure: a directory holds dirent entries; each dirent pairs an inode with a file_name and leads to a file]

What is the difference?

ln -s /usr/src/sys/sys/proc.h ppp.h
ln /usr/src/sys/sys/proc.h ppp.h
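One way to explore the question is to make both kinds of link from a program and compare i-nodes. The sketch below uses the POSIX calls link(2), symlink(2), stat(2), and lstat(2). `struct link_report` and `compare_links` are hypothetical names, and the function creates the target file if it is missing so that it can run anywhere.

```c
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

struct link_report {
    int  hard_same_inode;  /* 1 if the hard link shares the target's i-node */
    int  soft_is_symlink;  /* 1 if the soft link is a separate symlink i-node */
    long target_nlink;     /* link count of the target afterwards */
};

/* Create a hard link and a symbolic link to target, then compare.
 * Returns 0 on success, -1 on any failure. */
int compare_links(const char *target, const char *hard, const char *soft,
                  struct link_report *out)
{
    struct stat t, h, s;
    int fd = open(target, O_WRONLY | O_CREAT, 0644);  /* ensure target exists */

    if (fd < 0)
        return -1;
    close(fd);
    if (link(target, hard) != 0)      /* like: ln target hard */
        return -1;
    if (symlink(target, soft) != 0)   /* like: ln -s target soft */
        return -1;
    if (stat(target, &t) != 0 || stat(hard, &h) != 0 || lstat(soft, &s) != 0)
        return -1;

    out->hard_same_inode = (t.st_dev == h.st_dev && t.st_ino == h.st_ino);
    out->soft_is_symlink = (S_ISLNK(s.st_mode) && s.st_ino != t.st_ino);
    out->target_nlink = (long)t.st_nlink;
    return 0;
}
```

The hard link is a second directory entry for the same i-node, so the link count goes up. The symbolic link gets its own i-node whose content is the path name.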

File System Buffer Cache

application: read/write files

OS: translate file to disk blocks
– maintains ... buffer cache ...
– controls disk accesses: read/write blocks

hardware:

Any problems?

File System Consistency

To maintain file system consistency, the ordering of updates from buffer cache to disk is critical.
Example:
– if the directory block is written back before the i-node and the system crashes, the directory structure will be inconsistent

File System Consistency

File systems almost always use a buffer/disk cache for performance reasons.
This problem is critical especially for the blocks that contain control information: i-node, free-list, directory blocks.
Two copies of a disk block (buffer cache, disk): a consistency problem arises if the system crashes before all the modified blocks are written back to disk.
Write back critical blocks from the buffer cache to disk immediately.
Data blocks are also written back periodically: sync

Two Strategies

Prevention
– Use un-buffered I/O when writing i-nodes or pointer blocks
– Use buffered I/O for other writes and force sync every 30 seconds
Detect and Fix
– Detect the inconsistency
– Fix them according to the “rules”
– Fsck (File System Checker)

File System Integrity

Block consistency:
– Block-in-use table
– Free-list table
File consistency:
– how many directories pointing to that i-node?
– nlink?
– three cases: D == L, L > D, D > L
What to do with the latter two cases?

0 1 1 1 0 0 0 1 0 0 0 2
1 0 0 0 1 1 1 0 1 0 2 0

File System Integrity

File system states:
(a) consistent
(b) missing block
(c) duplicate block in free list
(d) duplicate data block
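The four states can be told apart from two counters per block: how many times files reference the block, and how many times it appears on the free list. The sketch below is a checker-style classification, not the actual fsck code. The enum and function names are hypothetical, and `BLOCK_IN_BOTH` is an extra case added here that the slide does not list.

```c
enum block_state {
    BLOCK_CONSISTENT,   /* (a) exactly one of the two counters is 1 */
    BLOCK_MISSING,      /* (b) in neither table */
    BLOCK_DUP_FREE,     /* (c) more than once in the free list */
    BLOCK_DUP_DATA,     /* (d) referenced by more than one file block */
    BLOCK_IN_BOTH       /* in use and also free (assumption: not on the slide) */
};

/* Classify one block from its in-use count and its free-list count. */
enum block_state classify_block(int in_use, int in_free)
{
    if (in_use > 1)
        return BLOCK_DUP_DATA;
    if (in_free > 1)
        return BLOCK_DUP_FREE;
    if (in_use == 0 && in_free == 0)
        return BLOCK_MISSING;
    if (in_use == 1 && in_free == 1)
        return BLOCK_IN_BOTH;
    return BLOCK_CONSISTENT;
}
```

A checker would build both tables by scanning every i-node and the free list, then run this test on each block number.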

Metadata Operations

Metadata operations modify the structure of the file system
– Creating, deleting, or renaming files, directories, or special files
– Directory & I-node
Data must be written to disk in such a way that the file system can be recovered to a consistent state after a system crash

Metadata Integrity

FFS uses synchronous writes to guarantee the integrity of metadata
– Any operation modifying multiple pieces of metadata will write its data to disk in a specific order
– These writes will be blocking
Guarantees integrity and durability of metadata updates

Deleting a file (I)

abc → i-node-1
def → i-node-2
ghi → i-node-3

Assume we want to delete file “def”

Deleting a file (II)

abc → i-node-1
def → ?
ghi → i-node-3

Cannot delete i-node before directory entry “def”

Deleting a file (III)

Correct sequence is
1. Write to disk directory block containing deleted directory entry “def”
2. Write to disk i-node block containing deleted i-node
Leaves the file system in a consistent state

Creating a file (I)

abc → i-node-1
ghi → i-node-3

Assume we want to create new file “tuv”

Creating a file (II)

abc → i-node-1
ghi → i-node-3
tuv → ?

Cannot write directory entry “tuv” before i-node

Creating a file (III)

Correct sequence is
1. Write to disk i-node block containing new i-node
2. Write to disk directory block containing new directory entry
Leaves the file system in a consistent state
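Why the order matters can be seen in a toy simulation. The sketch below models only two facts about the disk, whether the new i-node and the new directory entry have been written, and lets the caller crash the system after any number of writes. It is a teaching model, not FFS code, and all names are hypothetical.

```c
#include <stdbool.h>

/* Toy disk state for one file creation. */
struct tiny_disk {
    bool inode_written;    /* new i-node is on disk */
    bool dirent_written;   /* new directory entry is on disk */
};

/* Consistent means: no directory entry refers to an i-node
 * that never reached the disk. */
bool is_consistent(const struct tiny_disk *d)
{
    return !d->dirent_written || d->inode_written;
}

/* Simulate creating a file with two synchronous writes.
 * inode_first selects the order; the system "crashes" after
 * crash_after writes have completed (2 = no crash). */
struct tiny_disk create_file(bool inode_first, int crash_after)
{
    struct tiny_disk d = { false, false };

    for (int step = 0; step < 2 && step < crash_after; step++) {
        bool write_inode = ((step == 0) == inode_first);

        if (write_inode)
            d.inode_written = true;
        else
            d.dirent_written = true;
    }
    return d;
}
```

With the i-node written first, every crash point leaves a consistent state (at worst an unreferenced i-node). With the directory entry first, a crash after one write leaves a name pointing at garbage.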

Synchronous Updates

Used by FFS to guarantee consistency of metadata:
– All metadata updates are done through blocking writes
Increases the cost of metadata updates
Can significantly impact the performance of whole file system

SOFT UPDATES

Use delayed writes (write back)
Maintain dependency information about cached pieces of metadata:
– This i-node must be updated before/after this directory entry
Guarantee that metadata blocks are written to disk in the required order

3 Soft Update Rules

Never point to a structure before it has been initialized.

Never reuse a resource before nullifying all previous pointers to it.

Never reset the old pointer to a live resource before the new pointer has been set.

Problem #1 with S.U.

Synchronous writes guaranteed that metadata operations were durable once the system call returned

Soft Updates guarantee that the file system will recover into a consistent state, but not necessarily the most recent one

– Some updates could be lost

We want to delete file “foo” and create new file “bar”

[Figure: Block A holds the directory entries “foo” and NEW “bar”; Block B holds i-node-2 and NEW i-node-3]

What are the dependency relationships?

We want to delete file “foo” and create new file “bar”

[Figure: Block A holds the directory entries “foo” and NEW “bar”; Block B holds i-node-2 and NEW i-node-3]

Circular dependency (figure labels: X-2nd, Y-1st)

Problem #2 with S.U.

Cyclical dependencies:

– Same directory block contains entries to be created and entries to be deleted

– These entries point to i-nodes in the same block

Brainstorming:

– How to resolve this issue in S.U.?

How to update? i-node first or directory block first?

Solution in S.U.

Roll back metadata in one of the blocks to an earlier, safe state

(Safe state does not contain new directory entry)

[Figure: Block A’, the rolled-back copy of Block A, showing entry “def”]

Write first the block with metadata that were rolled back (block A’ of example)

Write blocks that can be written after the first block has been written (block B of example)

Roll forward the block that was rolled back

Write that block

Breaks the cyclical dependency, but block A must now be written twice
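The roll-back / roll-forward steps above can be sketched as follows. This is a toy model, not the Soft Updates implementation: block A is assumed to be a two-slot directory block and block B an i-node block, and all names are hypothetical. The asserts check after every write that no on-disk directory entry points at an i-node that is not live on disk.

```c
#include <assert.h>

enum { SLOTS = 2, NINODES = 8 };

/* Hypothetical layout: block A is a directory block, block B an i-node block. */
struct dir_blk   { int inum[SLOTS]; };     /* 0 means "empty slot" */
struct inode_blk { int live[NINODES]; };   /* live[n] != 0: i-node n initialized */

struct disk { struct dir_blk a; struct inode_blk b; };

/* An on-disk directory entry must point to a live on-disk i-node. */
static int consistent(const struct disk *d)
{
    for (int i = 0; i < SLOTS; i++)
        if (d->a.inum[i] != 0 && !d->b.live[d->a.inum[i]])
            return 0;
    return 1;
}

/* Write back cached blocks A and B; new_slot is the newly created entry
   in A. Returns the number of block writes performed. */
static int write_back(struct disk *d, const struct dir_blk *cached_a,
                      const struct inode_blk *cached_b, int new_slot)
{
    int writes = 0;

    /* 1. Roll back: write A' = cached A without the new entry. */
    struct dir_blk a_prime = *cached_a;
    a_prime.inum[new_slot] = 0;
    d->a = a_prime;
    writes++;
    assert(consistent(d));

    /* 2. Block B can now be written safely. */
    d->b = *cached_b;
    writes++;
    assert(consistent(d));

    /* 3. Roll forward: write A again, this time with the new entry. */
    d->a = *cached_a;
    writes++;
    assert(consistent(d));

    return writes;
}
```

Writing cached A or cached B directly as the first write would fail the consistency check in this model, which is the cycle the slide describes; the price of breaking it is the third write.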

Before any write operation: SU dependency checking (roll back if necessary)

After any write operation: SU dependency processing (task list updating; roll forward if necessary)

Two most popular approaches for improving the performance of metadata operations and recovery:

– Journaling

– Soft Updates

Journaling systems record metadata operations on an auxiliary log

Soft Updates uses ordered writes

JOURNALING

Journaling systems maintain an auxiliary log that records all meta-data operations

Write-ahead logging ensures that the log is written to disk before any blocks containing data modified by the corresponding operations

– After a crash, can replay the log to bring the file system to a consistent state

Journaling

Atomically updated

Old and new versions of data held on disk until the update commits

Undo logging:

– Copy old data to the log

– Write new data to disk

– If you crash during update, copy old data from log

Redo logging:

– Write new data to the log

– Old data remains on disk

– If you crash, copy new data from the log
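A minimal sketch of redo logging, assuming a toy disk of integer-valued blocks and a fixed-size log. The record layout and function names are hypothetical. New data goes only to the log, a commit marks how much of the log is final, and recovery copies committed records to their home blocks and discards the rest.

```c
#include <assert.h>

enum { NBLOCKS = 4, LOG_CAP = 8 };

/* Hypothetical redo-log record: "block blockno should contain value". */
struct log_rec { int blockno; int value; };

struct journal_disk {
    int block[NBLOCKS];          /* home locations of the data */
    struct log_rec log[LOG_CAP]; /* write-ahead log */
    int log_len;                 /* records written to the log */
    int committed;               /* records covered by a commit */
};

/* Redo logging: new data goes to the log, old data stays in place. */
static void log_write(struct journal_disk *d, int blockno, int value)
{
    assert(d->log_len < LOG_CAP && blockno >= 0 && blockno < NBLOCKS);
    d->log[d->log_len].blockno = blockno;
    d->log[d->log_len].value = value;
    d->log_len++;
}

/* The commit makes everything logged so far final. */
static void log_commit(struct journal_disk *d)
{
    d->committed = d->log_len;
}

/* After a crash: copy committed new data from the log, drop the rest. */
static void recover(struct journal_disk *d)
{
    for (int i = 0; i < d->committed; i++)
        d->block[d->log[i].blockno] = d->log[i].value;
    d->log_len = d->committed;
}
```

An update that was logged but not committed before the crash simply disappears on recovery, so the home blocks never show a half-applied transaction.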

How does it work?

Each disk update is a Transaction (atomic update)

– Write new data to the disk (journal)

– The update is not final until a commit

Only after the commit block is written is the update final

– The commit block is a single block of data on the disk

– Not necessarily flushed to disk yet!

JOURNALING

Log writes are performed in addition to the regular writes

Journaling systems incur log write overhead but

– Log writes can be performed efficiently because they are sequential (block operation consideration)

– Metadata blocks do not need to be written back after each update

JOURNALING

Journaling systems can provide

– same durability semantics as FFS if log is forced to disk after each meta-data operation

– the laxer semantics of Soft Updates if log writes are buffered until entire buffers are full

Versus log-structured file system

A log-structured file system ONLY contains a log; everything is written to the end of this log

LSFS dictates how the data is stored on disk

JFS does not dictate how the data is stored on disk

Soft Updates vs. Journaling

Advantages and disadvantages

With Soft Updates??

CPU

Do we still need “FSCK” at boot time?

Recover the Missing Resources

In the background, in an active FS…

– We don’t want to wait for the lengthy FSCK process to complete…

A related issue:

– the virus scanning process

– what happens if we get a new virus signature?

Snapshot of the FS

backup and restore

dump reliably an active File System

– what will we do today to dump our 40GB FS “consistent” snapshots? (at midnight…)

“background FSCK checks”

What is a snapshot? (I mean “conceptually”.)

Freeze all activities related to the FS.

Copy everything to “some space”.

Resume the activities.

How do we efficiently implement this concept such that the activities will only be blocked for about 0.25 seconds, and we don’t have to buy a really big hard drive?


Copy-on-Write
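A minimal copy-on-write sketch, assuming a toy file system of four integer-valued blocks and a single snapshot. The structure and names are hypothetical and do not reflect the FFS snapshot layout. Taking the snapshot copies nothing, so activity is blocked only briefly; an old block is saved the first time it is overwritten, so the snapshot's physical size grows only with the number of modified blocks.

```c
#include <assert.h>

enum { NBLK = 4 };

/* Hypothetical file system with one snapshot. */
struct cow_fs {
    int live[NBLK];    /* current contents of each block */
    int saved[NBLK];   /* old contents, filled in lazily */
    int copied[NBLK];  /* 1 once block i has been copied to the snapshot */
};

/* Taking the snapshot copies no data; it only resets the bookkeeping. */
static void take_snapshot(struct cow_fs *fs)
{
    for (int i = 0; i < NBLK; i++)
        fs->copied[i] = 0;
}

/* The first write to a block after the snapshot saves the old contents. */
static void write_block(struct cow_fs *fs, int i, int value)
{
    assert(i >= 0 && i < NBLK);
    if (!fs->copied[i]) {
        fs->saved[i] = fs->live[i];
        fs->copied[i] = 1;
    }
    fs->live[i] = value;
}

/* Reading through the snapshot: the saved copy if there is one,
   otherwise the still-unmodified live block. */
static int snapshot_read(const struct cow_fs *fs, int i)
{
    assert(i >= 0 && i < NBLK);
    return fs->copied[i] ? fs->saved[i] : fs->live[i];
}
```

The snapshot's logical size is the whole file system, while its physical size is only the blocks that have been copied so far.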

Snapshot: a file

Logical size versus physical size

Example

# mkdir /backups/usr/noon
# mount -u -o snapshot /usr/snap.noon /usr
# mdconfig -a -t vnode -u 0 -f /usr/snap.noon
# mount -r /dev/md0 /backups/usr/noon

/* do whatever you want to test it */

# umount /backups/usr/noon
# mdconfig -d -u 0
# rm -f /usr/snap.noon

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>   /* sleep() */

int
main(void)
{
    FILE *f1 = fopen("./sss.txt", "w");
    int i;

    if (f1 == NULL) {
        perror("fopen");
        return 1;
    }
    /* Seek to a random offset before each write. */
    for (i = 0; i < 1000; i++) {
        fseek(f1, rand(), SEEK_SET);
        fprintf(f1, "%d%d%d%d", rand(), rand(), rand(), rand());
        if (i % 100 == 0)
            sleep(1);
    }
    fclose(f1);   /* flushes buffered data and closes the file */
    return 0;
}

A File System

[Figure: block pointers of a file; ??? entries in one disk block]

A Snapshot i-node

[Figure: block pointers of a snapshot file; ??? entries in one disk block; pointers marked “not used” or “not yet copied”]

Copy-on-write

[Figure, shown on two slides: block pointers of a snapshot file; ??? entries in one disk block; pointers marked “not used” or “not yet copied”]

Multiple Snapshots

about 20 snapshots

Interactions/sharing among snapshots


Transaction-based FS

Performance versus consistency

“Atomic Writes” on Multiple Blocks

– See the paper titled “Atomic Writes for Data Integrity and Consistency in Shared Storage Devices for Clusters” by Okun and Barak, FGCS, vol. 20, pages 539-547, 2004.

– Modify SCSI handling

File System Mounting

A file system must be mounted before it can be accessed.

An unmounted file system is mounted at a mount point.

Mount Point

logical disks

fs0: /dev/hd0a

/
usr sys dev etc bin

fs1: /dev/hd0e

/
local adm home lib bin

mount -t ufs /dev/hd0e /usr

mount -t nfs 152.1.23.12:/export/cdrom /mnt/cdrom

Distributed FS

Distributed File System

– NFS (Network File System)

– AFS (Andrew File System)

– GFS (Google File System)

– CODA

Distributed FS

ftp.cs.ucdavis.edu fs0: /dev/hd0a

/
usr sys dev etc bin

Server.yahoo.com fs0: /dev/hd0e

/
local adm home lib bin

Distributed File System

Transparency and Location Independence

Reliability and Crash Recovery

Scalability and Efficiency

Correctness and Consistency

Security and Safety

Correctness

One-copy Unix Semantics??

– every modification to every byte of a file has to be immediately and permanently visible to every client.

– Conceptually, FS sequential access

– Makes sense in a local file system

– Single processor versus shared memory

Is this necessary?

DFS Architecture

Server

– storage for the distributed/shared files.

– provides an access interface for the clients.

Client

– consumer of the files.

– runs applications in a distributed environment.

[Figure: applications issuing open, close, read, write, opendir, stat, readdir]

NFS (SUN, 1985)

Based on RPC (Remote Procedure Call) and XDR (External Data Representation)

Server maintains no state

– a READ on the server opens, seeks, reads, and closes

– a WRITE is similar, but the buffer is flushed to disk before closing

Server crash: client continues to try until server reboots – no loss

Client crashes: client must rebuild its own state – no effect on server

RPC - XDR

RPC: Standard protocol for calling procedures in another machine

Procedure is packaged with authorization and admin info

XDR: standard format for data, because manufacturers of computers cannot agree on byte ordering.
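To illustrate the byte-ordering point, here is a small sketch of encoding a 32-bit integer the way XDR puts it on the wire: four bytes, most significant byte first, regardless of the host's native order. The helper names below are hypothetical and are not the functions of the system's XDR library.

```c
#include <assert.h>
#include <stdint.h>

/* Encode a 32-bit unsigned integer as 4 bytes, most significant byte
   first, independent of the host's byte order. */
static void xdr_put_u32(uint8_t out[4], uint32_t v)
{
    out[0] = (uint8_t)(v >> 24);
    out[1] = (uint8_t)(v >> 16);
    out[2] = (uint8_t)(v >> 8);
    out[3] = (uint8_t)v;
}

/* Decode 4 big-endian bytes back into a host integer. */
static uint32_t xdr_get_u32(const uint8_t in[4])
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
}

/* Usage: a value survives an encode/decode round trip on any host. */
static void xdr_self_check(void)
{
    uint8_t wire[4];
    xdr_put_u32(wire, 0x01020304u);
    assert(wire[0] == 0x01 && wire[3] == 0x04);
    assert(xdr_get_u32(wire) == 0x01020304u);
}
```

Because the encoding uses shifts rather than a memory copy, the same code produces the same bytes on little-endian and big-endian machines.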

rpcgen

[Figure: rpcgen takes an RPC program (with its data structures) and generates RPC client.c, RPC server.c, and RPC.h]

NFS Operations

Every operation is independent: server opens file for every operation

File identified by handle -- no state information retained by server

client maintains mount table, v-node, offset in file table etc.

What do these imply???

[Figure: NFS architecture. On the client computer, application programs make UNIX system calls into the UNIX kernel; the virtual file system sends operations on local files to the UNIX file system (or another file system) and operations on remote files to the NFS client. The NFS client talks to the NFS server on the server computer through the NFS protocol (remote operations); the server's virtual file system passes the requests to its UNIX file system.]

mount -t nfs home.yahoo.com:/pub/linux /mnt/linux

State-ful vs. State-less

A server is fully aware of its clients

– does the client have the newest copy?

– what is the offset of an opened file?

– “a session” between a client and a server!

A server is completely unaware of its clients

– memory-less: I do not remember you!!

– Just tell me what you want to get (and where).

– I am not responsible for your offset values (the client needs to maintain the state).
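A sketch of the state-less style, assuming a toy server that holds one in-memory file. The request layout and all names are hypothetical and are not the NFS wire format. Every request carries the file handle and the offset, so the server remembers nothing between calls, and the client keeps its own offset.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stateless read request: everything the server needs
   is in the request itself. */
struct read_req {
    int    handle;   /* identifies the file */
    size_t offset;   /* supplied by the client on every call */
    size_t count;    /* maximum number of bytes wanted */
};

/* Toy server storage: one file per handle. */
static const char *server_files[] = { "hello, stateless world" };

/* The server keeps no per-client state between calls. */
static size_t server_read(const struct read_req *req, char *buf)
{
    assert(req->handle == 0);
    const char *data = server_files[req->handle];
    size_t len = strlen(data);
    if (req->offset >= len)
        return 0;
    size_t n = len - req->offset;
    if (n > req->count)
        n = req->count;
    memcpy(buf, data + req->offset, n);
    return n;
}

/* The client owns the offset and advances it after each reply. */
struct client_file { int handle; size_t offset; };

static size_t client_read(struct client_file *f, char *buf, size_t count)
{
    struct read_req req = { f->handle, f->offset, count };
    size_t n = server_read(&req, buf);
    f->offset += n;
    return n;
}
```

If the server restarted between two client_read() calls, the second call would still work, because nothing it needs lives on the server besides the file itself.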

The State

[Figure: applications issuing open, read, stat, lseek (shown twice); label “offset”]

Network File Sharing

Server side:

– Rpcbind (portmap)

– Mountd - respond to mount requests (sometimes called rpc.mountd). Relies on several files

  – /etc/dfs/dfstab,

  – /etc/exports,

  – /etc/netgroup

– nfsd - serves files - actually a call to kernel level code.

– lockd – file locking daemon.

– statd – manages locks for lockd.

– rquotad – manages quotas for exported file systems.

Network File Sharing

Client Side

– biod - client side caching daemon

– mount must understand the hostname:directory convention.

– Filesystem entries in /etc/[v]fstab tell the client what filesystems to mount.

Unix file semantics

NFS:

– open a file with read-write mode

– later, the server’s copy becomes read-only mode

– now, the application tries to write it!!

Problems with NFS

Performance not scalable:

– maybe it is OK for a local office.

– will be horrible with large scale systems.

Similar to UNIX file caching for local files:

– pages (blocks) from disk are held in a main memory buffer cache until the space is required for newer pages. Read-ahead and delayed-write optimisations.

– For local files, writes are deferred to next sync event (30 second intervals)

– Works well in local context, where files are always accessed through the local cache, but in the remote case it doesn't offer necessary synchronization guarantees to clients.

NFS v3 servers offer two strategies for updating the disk:

– write-through - altered pages are written to disk as soon as they are received at the server. When a write() RPC returns, the NFS client knows that the page is on the disk.

– delayed commit - pages are held only in the cache until a commit() call is received for the relevant file. This is the default mode used by NFS v3 clients. A commit() is issued by the client whenever a file is closed.

Server caching does nothing to reduce RPC traffic between client and server

– further optimisation is essential to reduce server load in large networks

– NFS client module caches the results of read, write, getattr, lookup and readdir operations

– synchronization of file contents (one-copy semantics) is not guaranteed when two or more clients are sharing the same file.

Timestamp-based validity check

– reduces inconsistency, but doesn't eliminate it

– validity condition for cache entries at the client:

  (T - Tc < t) ∨ (Tm_client = Tm_server)

  t: freshness guarantee
  Tc: time when cache entry was last validated
  Tm: time when block was last updated at server
  T: current time

– t is configurable (per file) but is typically set to 3 seconds for files and 30 secs. for directories

– it remains difficult to write distributed applications that share files with NFS
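The validity condition can be written as a small function. This is a sketch under assumptions: the structure and parameter names are hypothetical, times are plain time_t seconds, and tm_server stands for the modification time the client would obtain from the server. Resetting Tc after a successful comparison is also an assumption of this sketch.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical client cache entry; fields follow the slide's symbols. */
struct cache_entry {
    time_t tc;         /* Tc: when the entry was last validated */
    time_t tm_client;  /* Tm as recorded by the client */
};

/* Returns 1 if the cached entry may be used. tm_server only matters once
   the freshness interval has expired; obtaining it would cost a request
   to the server in a real client. */
static int cache_entry_valid(struct cache_entry *e, time_t now,
                             time_t freshness, time_t tm_server)
{
    assert(freshness >= 0);
    if (now - e->tc < freshness)
        return 1;                     /* T - Tc < t */
    if (e->tm_client == tm_server) {  /* Tm_client = Tm_server */
        e->tc = now;                  /* validated again just now */
        return 1;
    }
    return 0;                         /* stale: fetch again from the server */
}
```

The first test is purely local, which is why a larger t saves server traffic but widens the window in which a client can read stale data.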

AFS

State-ful clients and servers.

Caching the files to clients.

– File close ==> check-in the changes.

How to maintain consistency?

– Using “Callback” in v2/3 (Valid or Cancelled)

[Figure: applications issuing open, read; “invalidate and re-cache”]

Why AFS?

Shared files are infrequently updated

Local cache of a few hundred megabytes

– Now 50~100 gigabytes

Unix workload:

– Files are small, read operations dominate, sequential access is common, files are read/written by one user, reference bursts.

– Are these still true?

Fault Tolerance in AFS

a server crashes

a client crashes

– check for call-back tokens first.

Problems with AFS

Availability

what happens if call-back itself is lost??

GFS – Google File System

“failures” are the norm

Multiple-GB files are common

Append rather than overwrite

– Random writes are rare

Can we relax the consistency?

The Master

Maintains all file system metadata.

– namespace, access control info, file to chunk mappings, chunk (including replicas) location, etc.

Periodically communicates with chunkservers in HeartBeat messages to give instructions and check state

The Master

Helps make sophisticated chunk placement and replication decisions, using global knowledge

For reading and writing, client contacts Master to get chunk locations, then deals directly with chunkservers

Master is not a bottleneck for reads/writes

Chunkservers

Files are broken into chunks. Each chunk has an immutable, globally unique 64-bit chunk handle.

The handle is assigned by the master at chunk creation

Chunk size is 64 MB

Each chunk is replicated on 3 (default) servers
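As a rough illustration of the fixed chunk size, the sketch below shows how a client might turn a byte range into chunk indices before asking the master for locations. It assumes only the 64 MB chunk size stated above; the function names are hypothetical and are not taken from GFS itself.

```python
from typing import List

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size stated on the slide


def chunk_index(offset: int) -> int:
    """Return the index of the chunk that holds the given byte offset."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return offset // CHUNK_SIZE


def chunks_for_range(offset: int, length: int) -> List[int]:
    """Return the indices of every chunk touched by [offset, offset + length)."""
    if length <= 0:
        return []
    first = chunk_index(offset)
    last = chunk_index(offset + length - 1)
    return list(range(first, last + 1))
```

A read that straddles a chunk boundary therefore needs locations for more than one chunk, which is one reason a client benefits from caching this metadata.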

Clients

Linked to apps using the file system API.

Communicate with the master and chunkservers for reading and writing

Master interactions only for metadata

Chunkserver interactions for data

Only cache metadata information
– Data is too large to cache.

Chunk Locations

Master does not keep a persistent record of locations of chunks and replicas.

Polls chunkservers for this at startup, and when chunkservers join or leave.

Stays up to date by controlling placement of new chunks and through HeartBeat messages (when monitoring chunkservers)
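The idea of keeping chunk locations only in memory can be sketched as follows. This is a toy model under stated assumptions: the class and method names (`Master`, `report`, `server_left`, `replicas`) are hypothetical, and a single `report` call stands in for both the startup poll reply and a HeartBeat.

```python
from collections import defaultdict


class Master:
    """Toy master that keeps chunk locations only in memory (nothing persisted)."""

    def __init__(self):
        # chunk handle -> set of chunkserver ids currently holding a replica
        self.locations = defaultdict(set)

    def report(self, server_id, chunk_handles):
        """Handle a poll reply or HeartBeat listing the chunks a server holds."""
        # Drop stale entries for this server, then record what it holds now.
        for holders in self.locations.values():
            holders.discard(server_id)
        for handle in chunk_handles:
            self.locations[handle].add(server_id)

    def server_left(self, server_id):
        """Forget every replica on a server that left or stopped responding."""
        for holders in self.locations.values():
            holders.discard(server_id)

    def replicas(self, handle):
        """Return the servers believed to hold the chunk, in sorted order."""
        return sorted(self.locations.get(handle, set()))
```

Because the map is rebuilt from what chunkservers report, a restarted master needs no on-disk copy of it; the chunkservers remain the authority on which chunks they hold.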

Atomic commit protocols

one-phase atomic commit protocol
– the coordinator tells the participants whether to commit or abort
– what is the problem with that?
– this does not allow one of the servers to decide to abort – it may have discovered a deadlock or it may have crashed and been restarted

two-phase atomic commit protocol
– is designed to allow any participant to choose to abort a transaction
– phase 1 - each participant votes. If it votes to commit, it is prepared. It cannot change its mind. In case it crashes, it must save updates in permanent store
– phase 2 - the participants carry out the joint decision

The decision could be commit or abort – participants record it in permanent store

Two phase commit (2PC)

[Figure: two panels, each showing a Coordinator above a row of Servers; the panels are labeled “What is your result?” and “Final consensus.”]

Failure model

Commit protocols are designed to work in
– asynchronous system (e.g. messages may take a very long time)
– servers/coordinator may crash
– messages may be lost.
– assume corrupt and duplicated messages are removed.
– no byzantine faults – servers either crash or they obey their requests

2PC is an example of a protocol for reaching a consensus.
– it can do so under these conditions because crash failures of processes are masked by replacing a crashed process with a new process whose state is set from information saved in permanent storage and information held by other processes.

2PC

– voting phase: coordinator asks all servers if they can commit
  if yes, server records updates in permanent storage and then votes

– completion phase: coordinator tells all servers to commit or abort
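The two phases can be sketched in a few lines. This is a failure-free toy, not a full protocol: there are no timeouts, lost messages, or crash recovery, and the names (`Participant`, `two_phase_commit`, `can_commit`) are hypothetical. A Python list stands in for permanent storage.

```python
class Participant:
    """Toy participant; `can_commit` stands in for its local decision."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "INIT"
        self.log = []  # stands in for permanent storage

    def vote(self):
        # Voting phase: save updates before voting yes, then become prepared.
        if self.can_commit:
            self.log.append("prepared")
            self.state = "READY"
            return True
        self.state = "ABORT"
        return False

    def finish(self, decision):
        # Completion phase: record and carry out the joint decision.
        self.log.append(decision)
        self.state = decision


def two_phase_commit(participants):
    """Run both phases and return the joint decision."""
    votes = [p.vote() for p in participants]  # phase 1: collect votes
    decision = "COMMIT" if all(votes) else "ABORT"
    for p in participants:  # phase 2: announce the decision
        p.finish(decision)
    return decision
```

A single "no" vote is enough to abort everyone, which is exactly what the one-phase protocol could not offer.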

[Figure: 2PC state diagrams. Coordinator states: INIT, WAIT, ABORT, COMMIT. Participant states: INIT, READY, ABORT, COMMIT. Messages: canCommit, vote, doCommit, doAbort, haveCommitted.]


Failures

Some servers missed “canCommit”.

Coordinator missed some “votes”.

Some servers missed “doAbort” or “doCommit”.

Failures/Crashes

Some servers crashed before/after “canCommit”.

Coordinator crashed before/after receiving some “votes”.

Some servers crashed before/after receiving “doAbort” or “doCommit”.

[Figure: 2PC coordinator and participant state diagrams (INIT, WAIT/READY, ABORT, COMMIT; messages canCommit, vote, doCommit, doAbort, haveCommitted).]

Assume the coordinator crashed after “canCommit” messages have been sent:
(0). Some servers have not received the vote requests. WAIT/INIT
(1). All good servers are in the WAIT state. WAIT/???
(2). Some servers are in either ABORT or COMMIT state. ABORT/COMMIT
(3). All servers are in either ABORT or COMMIT state. ABORT/COMMIT
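One way to read the cases above is as a decision rule for a prepared participant that has lost contact with the coordinator and asks its peers for their states. The sketch below uses a common textbook termination rule, which the slide does not spell out, so treat it as an assumption; the function name is hypothetical and the state names follow the participant diagram (INIT, READY, ABORT, COMMIT).

```python
def terminate(peer_states):
    """Decide what a READY participant can do after polling its peers."""
    states = set(peer_states)
    if "COMMIT" in states:
        return "COMMIT"  # a peer already saw the decision: it was commit
    if "ABORT" in states or "INIT" in states:
        # Either the decision was abort, or some peer never voted, so the
        # coordinator cannot have decided to commit.
        return "ABORT"
    return "BLOCKED"  # every peer is READY: must wait for the coordinator
```

The last branch is the uncertain case (1): when every reachable server is prepared, nobody can tell which way the coordinator decided, and 2PC blocks until it recovers.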

3PC

Skeen & Stonebraker, 1983

[Figure: 3PC state diagram. States: INIT, WAIT, ABORT, Pre-COMMIT, COMMIT; transitions labeled vote and ACK; state annotations: Uncertain, Aborted, Committable, Committed.]

Consistency

Read: are we reading the fresh copy?

Write: have we updated all the copies?

Failure/Partition

NFS, AFS, & GFS

Read the newest copy
– NFS: as long as we don’t cache…
– AFS: the read callback might be broken…
– GFS: we can check the master…

NFS, AFS, & GFS

Write to the master copy
– NFS: write through
– AFS: you have to have the callback …
– GFS: we will update all three replicas, but what happens if a read occurs while we are updating the copies? (name space and file-to-chunk mapping)

Pessimistic versus Optimistic

Locking the world while I am updating it…
– Scalable?
– “Serializable schedule”
– Assuming a closed system -- “you can NOT fail at this moment as I am updating this particular transaction” -- we can use a “log” to handle the issue of “atomicity” (all or nothing).

Soft Update

Create X (t1) and Delete Y (t2)

T1, T2 T2, T1

We will enforce it when we resolve the problem of circular dependency!

SU & Background FSCK

Soft Update guarantees that the file system will be in a “consistent” state at all times.

Essentially, Soft Update prevents any chance of inconsistency!

An Alternative Approach

“Optimistic”

The chance for certain bad things to occur is very small (depending on what you are doing).

And it is very expensive to pessimistically prevent them.

“Let it happen, and we try to detect and recover from that…”

The Optimistic Approach

Regular execution with “recording”

Conflict detection based on what was “recorded”

Conflict resolution
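The three steps can be sketched with a version number as the thing being “recorded”. This is a minimal illustration of optimistic validation in general, not of any particular file system; the class and method names (`Store`, `read`, `commit`) are hypothetical.

```python
class Store:
    """Toy key-value store with optimistic, version-checked updates."""

    def __init__(self):
        self.data = {}  # key -> (version, value)

    def read(self, key):
        """Regular execution with recording: return (version, value)."""
        return self.data.get(key, (0, None))

    def commit(self, key, seen_version, new_value):
        """Apply the write only if nobody changed the key since it was read."""
        current_version, _ = self.read(key)
        if current_version != seen_version:
            return False  # conflict detected; the caller must resolve it
        self.data[key] = (current_version + 1, new_value)
        return True
```

No lock is held between `read` and `commit`. A caller whose `commit` returns `False` resolves the conflict itself, for example by re-reading and retrying.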

Example

Allowing “inconsistencies”
– Without soft update, we have to do FSCK in the foreground (before we can use it).
– I.e., we try to eliminate “inconsistencies”

But do we really need a “perfectly consistent FS” for all the applications?
– Why not take a snapshot and then do background FSCK anyway!

Optimistic?

NFS

AFS

GFS

Optimistic?

NFS
– If the server changes the access mode in the middle of an open session from a client…

AFS
– “Callback” is the check for inconsistencies.

GFS
– Failure in the middle of a write/append

CODA

Server Replication:
– if one server goes down, I can get another.

Disconnected Operation:
– if all go down, I will use my own cache.

Disconnected Operation

Continue critical work when the shared repository is inaccessible.

Key idea: caching data.
– Performance
– Availability

Server Replication


Consistency

If John updates file X on server A and Mary reads file X on server B…

Read-one & Write-all

Read x & Write (N-x+1)

[Figure with labels: read, write]

Example: R3W4 (6+1)

Initial   0  0  0  0  0  0
Alice-W   2  2  0  2  2  0
Bob-W     2  3  3  3  3  0
Alice-R   2  3  3  3  3  0
Chris-W   2  1  1  1  1  0
Dan-R     2  1  1  1  1  0
Emily-W   7  7  1  1  1  7
Frank-R   7  7  1  1  1  7
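The read-x, write-(N-x+1) rule works because any read set and any write set must then share at least one replica. The sketch below illustrates that with N=6, R=3, W=4. It rests on assumptions the slide does not state: each copy carries a version number so a reader can tell which value is newest, and a single counter (`clock`) stands in for whatever versioning or timestamp scheme a real system would use. All names are hypothetical.

```python
def quorums_overlap(n, r, w):
    """True when every read quorum intersects every write quorum."""
    return r + w > n


class Replicas:
    """Toy replicated register with read and write quorums."""

    def __init__(self, n, r, w):
        if not quorums_overlap(n, r, w):
            raise ValueError("need r + w > n")
        self.r, self.w = r, w
        self.clock = 0  # assumed global version counter
        self.copies = [(0, 0)] * n  # (version, value) held by each replica

    def write(self, value, targets):
        """Store a new version of the value on every replica in `targets`."""
        if len(set(targets)) < self.w:
            raise ValueError("write quorum not met")
        self.clock += 1
        for i in targets:
            self.copies[i] = (self.clock, value)

    def read(self, targets):
        """Return the value with the highest version among `targets`."""
        if len(set(targets)) < self.r:
            raise ValueError("read quorum not met")
        return max(self.copies[i] for i in targets)[1]
```

Since values alone do not say which copy is newest (a later write may store a smaller number), the reader compares versions, not values.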

Recovery

The paper describes two recovery experiments in which each killed chunkserver held about 15,000-16,000 chunks with 600-660 GB of data.

(a) One chunkserver was killed. Recovery time was 23 minutes. Describe the process.

(b) Two chunkservers were killed. Explain how the recovery process was changed.

(c) Suppose that the master server had been killed. How would the recovery have taken place?

Consistency Model

(a) What is a consistency model?

(b) How does the consistency model used by the Google File System differ from the model that might be used by a general purpose file system?

(c) How does the file system guarantee correctness of the namespace?

(d) Explain what happens, from a consistency viewpoint, when an application attempts to append to a file.

[Figure: file service architecture. A client computer runs application programs and a client module; a server computer runs a directory service (Lookup, AddName, UnName, GetNames) and a flat file service (Read, Write, Create, Delete, GetAttributes, SetAttributes).]

