EIO: Error-handling is Occasionally Correct

Haryadi S. Gunawi, Cindy Rubio-González,

Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Ben Liblit

University of Wisconsin – Madison

FAST ’08 – February 28, 2008

Robustness of File Systems

Today’s file systems have robustness issues

Buggy implementation [FiSC-OSDI’04, EXPLODE-OSDI’06]
− Unexpected behaviors in corner-case situations

Deficient fault-handling [IRONFS-SOSP’05]
− Inconsistent policies: propagate, retry, stop, ignore
− Prevalent ignorance
  − Ext3: Ignore write failures during checkpoint and journal replay
  − NFS: Sync-failure at the server is not propagated to client

What is the root cause?

Incorrect Error Code Propagation

void dosync() {
    fdatawrite();
    sync_file();
    fdatawait();
}

[Figure: the NFS client calls sync() on the NFS server, which runs dosync; dosync in turn calls fdatawrite, sync_file, and fdatawait.]

[Figure: the same call graph. fdatawrite, sync_file, and fdatawait can each execute "return EIO;", but dosync saves none of the three return values (each call is marked X).]

Unsaved error-codes
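A minimal sketch of the difference between the unsaved pattern on the slide and one possible repair. The three callees are hypothetical stand-ins with test knobs (`fail_write`, `fail_sync`, `fail_wait`) that are not part of the slides, and the "remember the first error" policy is only one reasonable choice, not necessarily what any file system does.

```c
#include <errno.h>

/* Hypothetical stand-ins for the three callees on the slide. Each returns
 * 0 on success or a negative error code. */
static int fail_write, fail_sync, fail_wait; /* test knobs, not from the slides */

static int fdatawrite(void) { return fail_write ? -EIO : 0; }
static int sync_file(void)  { return fail_sync  ? -EIO : 0; }
static int fdatawait(void)  { return fail_wait  ? -EIO : 0; }

/* As on the slide: every return value is dropped, so callers never see a failure. */
void dosync_unsaved(void)
{
    fdatawrite();
    sync_file();
    fdatawait();
}

/* One possible repair: run all three steps, remember the first error, return it. */
int dosync_saved(void)
{
    int ret = fdatawrite();
    int err = sync_file();
    if (ret == 0)
        ret = err;
    err = fdatawait();
    if (ret == 0)
        ret = err;
    return ret;
}
```

With this shape, a caller such as the NFS server could forward a nonzero result to the client instead of reporting success.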

Implications

Misleading error-codes in distributed systems
− NFS client receives SUCCEED instead of ERROR

Useless policies
− Retry in NFS client is not invoked

Silent failures
− Much harder debugging process

EDP: Error Detection and Propagation Analysis

Static analysis
− Useful to show how error codes flow
− Currently: 34 basic error codes (e.g. EIO, ENOMEM)

Target systems
− 51 file systems (all directories in linux/fs/*)
− 3 storage drivers (SCSI, IDE, Software-RAID)

Results

Number of violations
− Error-codes flow through 9022 function calls
− 1153 (13%) calls do not save the returned error-codes

Analysis, a closer look
− More complex file systems, more violations
− Location distance affects error propagation correctness
− Write errors are neglected more than read errors
− Many violations are not corner-case bugs
  − Error-codes are consistently ignored

Outline

Introduction

Methodology
− Challenges
− EDP tool

Results

Analysis

Discussion and Conclusion

Challenges in Static Analysis

File systems use many error codes
− buffer state[Uptodate] = 0
− journal flags = ABORT
− int err = -EIO; ... return err;

Error codes transform
− Block I/O error becomes journal error
− Journal error becomes generic error code

Error codes propagate through:
− Function call path
− Asynchronous path (e.g. interrupt, network messages)

EDP

State
− Current State: Integer error-codes, function call path
− Future: Error transformation, asynchronous path

Implementation
− Utilize CIL: Infrastructure for C program analysis [Necula-CC’02]
− EDP: ~4000 LOC in OCaml

3 components of EDP architecture
− Specifying error-code information (e.g. EIO, ENOMEM)
− Constructing error channels
− Identifying violation points

Constructing Error Channels

Propagate function
− Dataflow analysis
− Connect function pointers

Generation endpoint
− Generates error code
− Example: return -EIO

[Figure: error channel for EIO. VFS: sys_fsync, do_fsync, filemap_fdatawrite, filemap_fdatawrt_rn, do_writepages, generic_writepages, mpage_writepages. ext3: ext3_writepage. The error code is generated either by "if (...) return -EIO;" or through an argument, "ext3_writepage (int *err)" with "*err = -EIO;".]
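The two generation forms and the function-pointer hop on this slide can be sketched in a few lines of C. Every name ending in `_sketch`, the `ops_sketch` struct, and the `disk_ok` flag are hypothetical; this is a miniature for illustration, not the kernel's code or EDP's internal model.

```c
#include <errno.h>

static int disk_ok = 1; /* illustrative knob: 0 simulates an I/O failure */

/* Generation endpoint, form 1: the error code is the return value. */
static int writepage_ret_sketch(void)
{
    if (!disk_ok)
        return -EIO;
    return 0;
}

/* Generation endpoint, form 2: the error code leaves through an argument. */
static void writepage_arg_sketch(int *err)
{
    *err = disk_ok ? 0 : -EIO;
}

/* The generic layer reaches the file system through a function pointer,
 * so an analysis has to connect the pointer to its possible targets. */
struct ops_sketch {
    int (*writepage)(void);
};

static int generic_writepages_sketch(const struct ops_sketch *ops)
{
    return ops->writepage(); /* propagate: pass the callee's code upward */
}

int do_fsync_sketch(void)
{
    struct ops_sketch ops = { writepage_ret_sketch };
    int err = generic_writepages_sketch(&ops);
    if (err)
        return err;
    writepage_arg_sketch(&err);
    return err;
}
```

An error channel in this miniature is the chain from either generation endpoint up through `generic_writepages_sketch` to `do_fsync_sketch`.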

Detecting Violations

Termination endpoint
− Error code is no longer propagated
− Two termination endpoints:
  − error-complete (minimally checks)
  − error-broken (unchecked, unsaved, overwritten)

Goal: Find error-broken endpoints

Error-complete endpoint
    func() {
        err = func_call();
        if (err)
            ...
    }

Unchecked
    func() {
        err = func_call();
    }

Unsaved / Bad Call
    func() {
        func_call();
    }

Overwritten
    func() {
        err = func_call();
        err = func_call_2();
    }
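The four endpoint kinds can be written as compilable C so their behavior is easy to compare. The callees and the `should_fail` flag are hypothetical; only the four patterns themselves come from the slide.

```c
#include <errno.h>

/* Hypothetical callees: func_call fails with -EIO when should_fail is set. */
static int should_fail;
static int func_call(void)   { return should_fail ? -EIO : 0; }
static int func_call_2(void) { return 0; }

/* Error-complete: the code is saved and at least minimally checked. */
int error_complete(void)
{
    int err = func_call();
    if (err)
        return err;
    return func_call_2();
}

/* Error-broken, unchecked: saved, but never examined or returned. */
int unchecked(void)
{
    int err = func_call();
    (void)err;
    return 0;
}

/* Error-broken, unsaved (bad call): the return value is discarded. */
int unsaved(void)
{
    func_call();
    return 0;
}

/* Error-broken, overwritten: the second assignment hides the first code. */
int overwritten(void)
{
    int err = func_call();
    err = func_call_2();
    return err;
}
```

When `func_call` fails, only `error_complete` reports the failure; the three error-broken variants all return 0.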

Outline

Introduction

Methodology

Results (unsaved error-codes / bad calls)
− Graphical outputs
− Complete results

Analysis of Results

Discussion and Conclusion

HFS

[Figure: call graph of HFS. Nodes: functions that generate/propagate error-codes, and functions that make bad calls (do not save error-codes). Edges: good calls (calls that propagate error-codes), and bad calls (calls that do not save error-codes). Three regions are labeled 1, 2, and 3.]

HFS (Example 1)

int find_init(find_data *fd) {
    …
    fd->search_key = kmalloc(…);
    if (!fd->search_key)
        return -ENOMEM;
    …
}

int file_lookup() {
    …
    find_init(fd);              /* Bad call! */
    fd->search_key->cat = …;    /* Null pointer dereference */
    …
}

Inconsistencies (Callee: Good Calls, Bad Calls)
find_init: 113
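A sketch of what a good call to find_init could look like. The struct layouts are simplified and hypothetical, `malloc` stands in for `kmalloc` so the example runs in user space, and `simulate_oom` is a test knob that does not appear in the slides.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical, simplified types; the real HFS structures are not shown on the slide. */
struct search_key { int cat; };
struct find_data  { struct search_key *search_key; };

static int simulate_oom; /* test knob: make the allocation fail */

int find_init(struct find_data *fd)
{
    fd->search_key = simulate_oom ? NULL : malloc(sizeof *fd->search_key);
    if (!fd->search_key)
        return -ENOMEM;
    return 0;
}

/* Good call: save the return value and stop before touching search_key. */
int file_lookup(struct find_data *fd)
{
    int err = find_init(fd);
    if (err)
        return err;
    fd->search_key->cat = 1;
    free(fd->search_key);
    fd->search_key = NULL;
    return 0;
}
```

Because the error is checked before the dereference, an allocation failure turns into an -ENOMEM return rather than a null pointer dereference.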

HFS (Example 2)

int __brec_find(key) {
    /* Finds a record in an HFS node that best matches the given key.
       Returns ENOENT if it fails. */
}

int brec_find(key) {
    …
    result = __brec_find(key);
    …
    return result;
}

Inconsistencies (Callee: Good Calls, Bad Calls)
find_init: 113
__brec_find: 4, 1
brec_find: 18, 0

HFS (Example 3)

int free_exts(…) {
    /* Traverses a list of extents and locates the extents to be freed.
       If not found, returns EIO.
       “panic?” is written before the return EIO statement. */
}

Inconsistencies (Callee: Good Calls, Bad Calls)
find_init: 113
__brec_find: 4, 1
brec_find: 18, 0
free_exts: 1, 3

HFS (Summary)

Not only in HFS

Almost all file systems and storage systems have major inconsistencies

Inconsistencies (Callee: Good Calls, Bad Calls)
find_init: 113
__brec_find: 4, 1
brec_find: 18, 0
free_exts: 1, 3

ext3 37 bad / 188 calls = 20%

ReiserFS 35 bad / 218 calls = 16%

IBM JFS 61 bad / 340 calls = 18%

NFS Client 54 bad / 446 calls = 12%

Coda 0 bad / 54 calls = 0% (internal)

Coda 0 bad / 95 calls = 0% (external)

Summary

Incorrect error propagation plagues almost all file systems and storage systems

                  Bad Calls   EC Calls   Fraction
File systems      914         7400       12%
Storage drivers   177         904        20%

Outline

Introduction

Methodology

Results

Analysis of Results

Discussion and Conclusion

Analysis of Results

Correlate robustness and complexity
− Correlate file system size with number of violations
  − More complex file systems, more violations (Corr = 0.82)
− Correlate file system size with frequency of violations
  − Small file systems make frequent violations (Corr = -0.20)

Location distance of calls affects correct error propagation
− Inter-module > inter-file > intra-file bad calls

Read vs. Write failure-handling

Corner-case or consistent mistakes

Read vs. Write Failure-Handling

Filter read/write operations (string comparison)
− Callee contains “write”, or “sync”, or “wait” → Write ops
− Callee contains “read” → Read ops

Callee Type       Bad Calls   EC Calls   Fraction
Read              26*         603        4%
Sync+Wait+Write   177         904        20%

* mm/readahead.c: Read prefetching in Memory Management

Lots of write failures are ignored!
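The string-comparison filter can be sketched as a small classifier. The enum and function name are hypothetical, and testing the write-side keywords first is an assumption of this sketch: the slide does not say how a name that matches both lists is treated.

```c
#include <string.h>

enum op_type { OP_OTHER, OP_READ, OP_WRITE };

/* Classify a callee by substring match on its name, as the slide describes.
 * Write-side keywords are tested first (an assumption of this sketch). */
enum op_type classify_callee(const char *name)
{
    if (strstr(name, "write") || strstr(name, "sync") || strstr(name, "wait"))
        return OP_WRITE;
    if (strstr(name, "read"))
        return OP_READ;
    return OP_OTHER;
}
```

For example, `sync_blockdev` and `ext3_writepage` land in the write group, while a name such as `find_init` matches neither list.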

Corner-Case or Consistent Mistakes?

Define bad call frequency = (# Bad calls to f()) / (# All calls to f())
− Example: sync_blockdev, 15/21
− Bad call frequency: 71%

Corner-case bugs
− Bad call frequency < 20%

Consistent bugs
− Bad call frequency > 50%
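The definition and the two thresholds translate directly into code. The function and enum names are hypothetical, and `IN_BETWEEN` is this sketch's own label for the 20% to 50% range, which the slide leaves unnamed.

```c
enum bug_kind { CORNER_CASE, CONSISTENT, IN_BETWEEN };

/* Bad call frequency of one callee, as a percentage.
 * Assumes all_calls is greater than zero. */
double bad_call_frequency(int bad_calls, int all_calls)
{
    return 100.0 * bad_calls / all_calls;
}

/* Thresholds from the slide: below 20% is corner-case, above 50% is consistent. */
enum bug_kind classify_frequency(double freq)
{
    if (freq < 20.0)
        return CORNER_CASE;
    if (freq > 50.0)
        return CONSISTENT;
    return IN_BETWEEN;
}
```

For sync_blockdev, 15 bad calls out of 21 gives roughly 71%, which falls on the consistent side.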

CDF of Bad Call Frequency

[Figure: CDF of bad call frequency. x-axis: bad call frequency; y-axes: cumulative # bad calls and cumulative fraction.]

Bad Call Frequency
− sync_blockdev: 15 bad calls / 21 EC calls
− Bad Call Freq: 71%
− At x = 71, y += 15

Less than 100 violations are corner-case bugs

850 bad calls fall above the 50% mark
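The "at x = 71, y += 15" step suggests how the cumulative count is built: each callee adds its bad calls at the x position given by its bad call frequency. A minimal sketch under that reading follows; the struct and function names are hypothetical, and truncating the percentage to an integer is an assumption.

```c
#include <stddef.h>

struct callee_stats { int bad_calls; int ec_calls; };

/* Sum the bad calls of every callee whose bad call frequency (in percent,
 * truncated to an integer) is at most x. This is the cumulative count at x. */
int cumulative_bad_calls(const struct callee_stats *s, size_t n, int x)
{
    int y = 0;
    for (size_t i = 0; i < n; i++) {
        if (s[i].ec_calls > 0 && 100 * s[i].bad_calls / s[i].ec_calls <= x)
            y += s[i].bad_calls;
    }
    return y;
}
```

Evaluating this function for x from 0 to 100 yields the points of the cumulative curve; dividing each by the total number of bad calls gives the cumulative fraction.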

What’s going on?

Not just bugs

But more fundamental design issues
− Checkpoint failures are ignored
  − Why? Maybe because of journaling flaw [IOShepherd-SOSP’07]
  − Cannot recover from checkpoint failures
  − Ex: A simple block remap could not result in a consistent state
− Many write failures are ignored
  − Lack of recovery policies? Hard to recover?
− Many failures are ignored in the middle of operations
  − Hard to rollback?

Conclusion (developer comments)

ext3 “there's no way of reporting error to userspace. So ignore it”

XFS “Just ignore errors at this point. There is nothing we can do except to try to keep going”

ReiserFS “we can't do anything about an error here”

IBM JFS “note: todo: log error handler”

CIFS “should we pass any errors back?”

SCSI “Todo: handle failure”

Thank you!

Questions?

ADvanced Systems Laboratory www.cs.wisc.edu/adsl

Extra Slides

