EIO: Error-handling is Occasionally Correct. Haryadi S. Gunawi, Cindy Rubio-González, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Ben Liblit. University of Wisconsin – Madison. FAST ’08 – February 28, 2008
Transcript
Page 1: 1 EIO: Error-handling is Occasionally Correct Haryadi S. Gunawi, Cindy Rubio-González, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Ben Liblit University.

1

EIO: Error-handling is Occasionally Correct

Haryadi S. Gunawi, Cindy Rubio-González,

Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Ben Liblit

University of Wisconsin – Madison

FAST ’08 – February 28, 2008


2

Robustness of File Systems

Today’s file systems have robustness issues

Buggy implementation [FiSC-OSDI’04, EXPLODE-OSDI’06]

− Unexpected behaviors in corner-case situations

Deficient fault-handling [IRONFS-SOSP’05]

− Inconsistent policies: propagate, retry, stop, ignore

Prevalent ignorance

− Ext3: Ignore write failures during checkpoint and journal replay

− NFS: Sync-failure at the server is not propagated to client

What is the root cause?


3

Incorrect Error Code Propagation

void dosync() {
    fdatawrite();
    sync_file();
    fdatawait();
}

[Figure: the NFS client sends sync() to the NFS server, which runs dosync. dosync calls fdatawrite, sync_file, and fdatawait, each of which leads into deeper call chains.]


4

Incorrect Error Code Propagation

void dosync() {
    fdatawrite();   /* X: return value not saved */
    sync_file();    /* X: return value not saved */
    fdatawait();    /* X: return value not saved */
}

[Figure: the same call graph between the NFS client and the NFS server. Deep in the call chains below fdatawrite, sync_file, and fdatawait, a function executes "return EIO;". The three calls in dosync are marked X.]

Unsaved error-codes
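The unsaved error codes in dosync can be repaired by saving each return value and handing the first failure back to the caller. The following is a minimal sketch, not the actual NFS server code: the three callees are hypothetical stubs (only their names come from the slide), the stub sync_file is assumed to fail with -EIO, and the int return type of dosync_checked is an assumption.

```c
#include <errno.h>

/* Hypothetical stubs standing in for the callees named on the slide.
 * sync_file() is assumed to hit an I/O error. */
int fdatawrite(void) { return 0; }
int sync_file(void)  { return -EIO; }
int fdatawait(void)  { return 0; }

/* The pattern from the slide: all three return values are discarded,
 * so the caller can never learn that sync_file() failed. */
void dosync_unsaved(void) {
    fdatawrite();
    sync_file();
    fdatawait();
}

/* One possible repair (an assumption, not the kernel's actual fix):
 * run all three steps, remember the first error, and return it. */
int dosync_checked(void) {
    int err = fdatawrite();
    int ret = sync_file();
    if (err == 0)
        err = ret;
    ret = fdatawait();
    if (err == 0)
        err = ret;
    return err;
}
```

With these stubs, dosync_checked() returns -EIO, which a server could turn into an error reply so that a client-side retry policy has a chance to run. Running all three steps before returning is a design choice of this sketch; returning at the first failure would be an equally plausible policy.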


5

Implications

Misleading error-codes in distributed systems

− NFS client receives SUCCEED instead of ERROR

Useless policies

− Retry in NFS client is not invoked

Silent failures

− Much harder debugging process


6

EDP: Error Detection and Propagation Analysis

Static analysis

− Useful to show how error codes flow

− Currently: 34 basic error codes (e.g. EIO, ENOMEM)

Target systems

− 51 file systems (all directories in linux/fs/*)

− 3 storage drivers (SCSI, IDE, Software-RAID)


7

Results

Number of violations

− Error-codes flow through 9022 function calls

− 1153 (13%) calls do not save the returned error-codes

Analysis, a closer look

− More complex file systems, more violations

− Location distance affects error propagation correctness

− Write errors are neglected more than read errors

− Many violations are not corner-case bugs: error-codes are consistently ignored


8

Outline

Introduction

Methodology

− Challenges

− EDP tool

Results

Analysis

Discussion and Conclusion


9

Challenges in Static Analysis

File systems use many error codes

− bufferstate[Uptodate] = 0

− journalflags = ABORT

− int err = -EIO; ... return err;

Error codes transform

− Block I/O error becomes journal error

− Journal error becomes generic error code

Error codes propagate through:

− Function call path

− Asynchronous path (e.g. interrupt, network messages)
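The transformation chain on this slide can be made concrete with a small sketch. It is illustrative only: the struct and function names below are hypothetical and much simpler than the kernel's buffer and journal state. It shows a failure that starts as a cleared flag, becomes an abort flag, and only at the end turns into an integer code; an analysis that follows only integer return values would see just the last step.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, simplified state; none of these types are the kernel's. */
struct buffer  { bool uptodate; };  /* block layer: a cleared flag signals failed I/O */
struct journal { bool aborted;  };  /* journal layer: an abort flag signals failure */

/* Block I/O error becomes a journal error. */
void journal_note_io(struct journal *j, const struct buffer *b) {
    if (!b->uptodate)
        j->aborted = true;
}

/* Journal error becomes a generic integer error code. */
int journal_status(const struct journal *j) {
    return j->aborted ? -EIO : 0;
}
```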


10

EDP

State

− Current State: Integer error-codes, function call path

− Future: Error transformation, asynchronous path

Implementation

− Utilize CIL: Infrastructure for C program analysis [Necula-CC’02]

− EDP: ~4000 LOC in OCaml

3 components of EDP architecture

− Specifying error-code information (e.g. EIO, ENOMEM)

− Constructing error channels

− Identifying violation points


11

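Constructing Error Channels

[Figure: error channel for EIO. Call path from the VFS down into ext3: sys_fsync → do_fsync → filemap_fdatawrite → filemap_fdatawrt_rn → do_writepages → generic_writepages → mpage_writepages → ext3_writepage. The error code is generated either through a return value (if (...) return -EIO;) or through a pointer argument (ext3_writepage (int *err) with *err = -EIO;).]

Propagate function

− Dataflow analysis

− Connect function pointers

Generation endpoint

− Generates error code

− Example: return -EIO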
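The pieces of an error channel named on this slide can be sketched in a few lines. This is a sketch under stated assumptions, not kernel code: all function names, the disk_ok parameter, and the ops struct are hypothetical. It shows the two generation styles (return value and pointer argument) and a propagate function that reaches its callee through a function pointer, which is the kind of link the analysis has to connect.

```c
#include <errno.h>

/* Generation endpoint, style 1: the error code is the return value. */
int writepage_ret(int disk_ok) {
    if (!disk_ok)
        return -EIO;
    return 0;
}

/* Generation endpoint, style 2: the error code leaves through a pointer argument. */
void writepage_arg(int disk_ok, int *err) {
    *err = disk_ok ? 0 : -EIO;
}

/* Hypothetical operations table, loosely modeled on dispatch through function pointers. */
struct ops {
    int (*writepage)(int disk_ok);
};

/* Propagate function: saves the callee's result and passes it up unchanged. */
int propagate_writepage(const struct ops *ops, int disk_ok) {
    int err = ops->writepage(disk_ok);
    return err;
}
```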


12

Detecting Violations

Termination endpoint

− Error code is no longer propagated

Two termination endpoints:

− error-complete (minimally checks)

− error-broken (unchecked, unsaved, overwritten)

Goal: Find error-broken endpoints

Error-complete endpoint: func() { err = func_call(); if (err) ... }

Unchecked: func() { err = func_call(); }

Unsaved / Bad Call: func() { func_call(); }

Overwritten: func() { err = func_call(); err = func_call_2(); }
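To make the difference between the endpoint kinds observable, the sketch below turns each pattern into a compilable function and reports what its caller would see. The callees are hypothetical stubs: func_call is assumed to always fail with -EIO and func_call_2 to always succeed. Returning the error from the error-complete variant goes slightly beyond the slide's "minimally checks"; it is one way a check might be acted on.

```c
#include <errno.h>

int func_call(void)   { return -EIO; }  /* hypothetical callee that fails */
int func_call_2(void) { return 0; }     /* hypothetical callee that succeeds */

/* Error-complete: the saved code is checked (here it is also returned). */
int error_complete(void) {
    int err = func_call();
    if (err)
        return err;
    return 0;
}

/* Unchecked: the code is saved but never examined. */
int unchecked(void) {
    int err = func_call();
    (void)err;
    return 0;
}

/* Unsaved (bad call): the return value is discarded at the call site. */
int unsaved(void) {
    func_call();
    return 0;
}

/* Overwritten: the first code is replaced before anyone looks at it. */
int overwritten(void) {
    int err = func_call();
    err = func_call_2();
    return err;
}
```

Only error_complete() reports the failure; the other three return 0 even though func_call failed.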


13

Outline

Introduction

Methodology

Results (unsaved error-codes / bad calls)

− Graphical outputs

− Complete results

Analysis of Results

Discussion and Conclusion


14

HFS

[Figure: call graph of HFS. Legend: nodes are functions that generate/propagate error-codes, or functions that make bad calls (do not save error-codes); edges are good calls (calls that propagate error-codes), or bad calls (calls that do not save error-codes). Three regions of the graph are labeled 1, 2, and 3.]


15

HFS (Example 1)

int find_init(find_data *fd) {
    …
    fd->search_key = kmalloc(…);
    if (!fd->search_key)
        return -ENOMEM;
    …
}

int file_lookup() {
    …
    find_init(fd);              /* Bad call! The return value is not saved */
    fd->search_key->cat = …;    /* Null pointer dereference */
    …
}

Inconsistencies

Callee        Good Calls   Bad Calls
find_init     3            11
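For contrast with the bad call in this example, here is a hedged sketch of a caller that saves and checks the result of find_init before touching search_key. It is a user-space approximation, not HFS code: the struct layouts are hypothetical and simplified, malloc stands in for kmalloc, and the parameters of file_lookup are invented for the illustration.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical, simplified types; the real HFS structures differ. */
struct search_key { int cat; };
struct find_data  { struct search_key *search_key; };

/* Mirrors the slide's find_init, with malloc standing in for kmalloc. */
int find_init(struct find_data *fd) {
    fd->search_key = malloc(sizeof *fd->search_key);
    if (!fd->search_key)
        return -ENOMEM;
    return 0;
}

/* A good call: the error code is saved, checked, and propagated. */
int file_lookup(struct find_data *fd, int cat) {
    int err = find_init(fd);
    if (err)
        return err;             /* propagate instead of dereferencing NULL */
    fd->search_key->cat = cat;
    return 0;
}
```

The caller owns fd->search_key after a successful lookup and is expected to free it.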


16

HFS (Example 2)

2


17

HFS (Example 2)

int __brec_find(key) {
    /* Finds a record in an HFS node that best matches the given key.
       Returns ENOENT if it fails. */
}

int brec_find(key) {
    …
    result = __brec_find(key);
    …
    return result;
}

Inconsistencies

Callee        Good Calls   Bad Calls
find_init     3            11
__brec_find   1            4
brec_find     18           0


18

HFS (Example 3)

3


19

HFS (Example 3)

int free_exts(…) {
    /* Traverses a list of extents and locates the extents to be freed.
       If not found, returns EIO.
       “panic?” is written before the return EIO statement. */
}

Inconsistencies

Callee        Good Calls   Bad Calls
find_init     3            11
__brec_find   1            4
brec_find     18           0
free_exts     1            3


20

HFS (Summary)

Not only in HFS

Almost all file systems and storage systems have major inconsistencies

Inconsistencies

Callee        Good Calls   Bad Calls
find_init     3            11
__brec_find   1            4
brec_find     18           0
free_exts     1            3


21

ext3 37 bad / 188 calls = 20%


22

ReiserFS 35 bad / 218 calls = 16%


23

IBM JFS 61 bad / 340 calls = 18%


24

NFS Client 54 bad / 446 calls = 12%


25

Coda 0 bad / 54 calls = 0% (internal)

0 bad / 95 calls = 0% (external)


26

Summary

Incorrect error propagation plagues almost all file systems and storage systems

                  Bad Calls   EC Calls   Fraction
File systems      914         7400       12%
Storage drivers   177         904        20%


27

Outline

Introduction

Methodology

Results

Analysis of Results

Discussion and Conclusion


28

Analysis of Results

Correlate robustness and complexity

− Correlate file system size with number of violations: more complex file systems, more violations (Corr = 0.82)

− Correlate file system size with frequency of violations: small file systems make frequent violations (Corr = -0.20)

Location distance of calls affects correct error propagation

− Inter-module > inter-file > intra-file bad calls

Read vs. Write failure-handling

Corner-case or consistent mistakes


29

Read vs. Write Failure-Handling

Filter read/write operations (string comparison)

− Callee contains “write”, or “sync”, or “wait”: Write ops

− Callee contains “read”: Read ops

Callee Type       Bad Calls   EC Calls   Fraction
Read              26*         603        4%
Sync+Wait+Write   177         904        20%

* mm/readahead.c: read prefetching in Memory Management

Lots of write failures are ignored!
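The string-comparison filter described on this slide could look roughly like the following. This is a sketch, not EDP's implementation (EDP is written in OCaml): the enum and function names are hypothetical, and testing the write-like substrings first is an assumption, since the slides do not say how a name containing both "read" and "write" is classified.

```c
#include <string.h>

enum op_kind { OP_OTHER, OP_READ, OP_WRITE };

/* Classify a callee by substring match on its name.
 * Write-like substrings are tested first (an assumption of this sketch). */
enum op_kind classify_callee(const char *name) {
    if (strstr(name, "write") || strstr(name, "sync") || strstr(name, "wait"))
        return OP_WRITE;
    if (strstr(name, "read"))
        return OP_READ;
    return OP_OTHER;
}
```

For example, sync_blockdev and filemap_fdatawrite would both be counted as write operations, while find_init would fall into neither group.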


30

Corner-Case or Consistent Mistakes?

Define bad call frequency = (# Bad calls to f()) / (# All calls to f())

− Example: sync_blockdev, 15/21

− Bad call frequency: 71%

Corner-case bugs

− Bad call frequency < 20%

Consistent bugs

− Bad call frequency > 50%
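The definition and the two thresholds translate directly into code. The sketch below is illustrative: the enum and function names are hypothetical, and the BUG_MIDDLE band for frequencies from 20% to 50% is an assumption, because the slide names only the two outer categories.

```c
enum bug_kind { BUG_NONE, BUG_CORNER_CASE, BUG_MIDDLE, BUG_CONSISTENT };

/* Bad call frequency in percent: 100 * bad / all.
 * Returns 0 when the callee has no calls at all. */
double bad_call_frequency(int bad_calls, int all_calls) {
    if (all_calls <= 0)
        return 0.0;
    return 100.0 * bad_calls / all_calls;
}

/* Apply the slide's thresholds: below 20% is a corner-case bug,
 * above 50% is a consistent bug. BUG_MIDDLE is this sketch's own label. */
enum bug_kind classify_callee_bugs(int bad_calls, int all_calls) {
    double freq = bad_call_frequency(bad_calls, all_calls);
    if (bad_calls == 0)
        return BUG_NONE;
    if (freq < 20.0)
        return BUG_CORNER_CASE;
    if (freq > 50.0)
        return BUG_CONSISTENT;
    return BUG_MIDDLE;
}
```

For sync_blockdev, bad_call_frequency(15, 21) is about 71.4, which lands in the consistent category. Assuming every call to a callee is either good or bad, find_init from the HFS table (11 bad of 14 calls) would be classified the same way.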


31

CDF of Bad Call Frequency

[Figure: CDF of bad call frequency. x-axis: Bad Call Frequency; y-axes: Cumulative #Bad Calls and Cumulative Fraction.]

Example point: sync_blockdev has 15 bad calls / 21 EC calls, so Bad Call Freq: 71%. At x = 71, y += 15.

Fewer than 100 violations are corner-case bugs

850 bad calls fall above the 50% mark


32

What’s going on?

Not just bugs

But more fundamental design issues

Checkpoint failures are ignored

− Why? Maybe because of journaling flaw [IOShepherd-SOSP’07]

− Cannot recover from checkpoint failures

− Ex: A simple block remap could not result in a consistent state

Many write failures are ignored

− Lack of recovery policies? Hard to recover?

Many failures are ignored in the middle of operations

− Hard to roll back?


33

Conclusion (developer comments)

ext3 “there's no way of reporting error to userspace. So ignore it”

XFS “Just ignore errors at this point. There is nothing we can do except to try to keep going”

ReiserFS “we can't do anything about an error here”

IBM JFS “note: todo: log error handler”

CIFS “should we pass any errors back?”

SCSI “Todo: handle failure”


34

Thank you! Questions?

ADvanced Systems Laboratory www.cs.wisc.edu/adsl


35

Extra Slides

