Date post: 18-Dec-2015
Improving File System Reliability with I/O Shepherding

Haryadi S. Gunawi, Vijayan Prabhakaran+, Swetha Krishnan, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau

University of Wisconsin - Madison

+

Storage Reality

Complex storage subsystem
– Mechanical/electrical failures, buggy drivers

Complex failures
– Intermittent faults, latent sector errors, corruption, lost writes, misdirected writes, etc.

FS reliability is important
– Managing disk and individual block failures

[Figure: storage stack. Labels: File System, Device Driver, Transport, Firmware, Media, Electrical, Mechanical]

File System Reality

Good news:
– Rich literature
  • Checksum, parity, mirroring
  • Versioning, physical/logical identity
– Important for both single-disk and multiple-disk settings

Bad news:
– File system reliability is broken [SOSP'05]
  • Unlike other components (performance, consistency)
  • Reliability approaches are hard to understand and evolve

Broken FS Reliability

Lack of a good reliability strategy
– No remapping, checksumming, redundancy
– Existing strategy is coarse-grained
  • Mount read-only, panic, retry

Inconsistent policies
– Different techniques in similar failure scenarios

Bugs
– Ignored write failures

Let's fix them! With the current framework? Not so easy.

No Reliability Framework

Diffused
– Each fault is handled at each I/O location
– Different developers might increase diffusion

Inflexible
– Fixed policies, hard to change
– But no single policy fits all the diverse settings
  • Less reliable vs. more reliable drives
  • Desktop workload vs. web-server apps

The need for a new framework
– Reliability is a first-class file system concern

[Figure: Reliability Policy, File System, Disk Subsystem]

Localized

I/O Shepherd
– Localized policies, …
  • More correct, fewer bugs, simpler reliability management

[Figure: the Shepherd layer between the File System and the Disk Subsystem]

Flexible

I/O Shepherd
– Localized, flexible policies, …

[Figure: the Shepherd between the File System and the Disk Subsystem. Labels: Add Mirror; Archival Scientific Data; Checksum; Networked Storage; More Retry; ATA (less reliable drive, more protection); SCSI (more reliable drive, less protection)]

Powerful

I/O Shepherd
– Localized, flexible, and powerful policies

[Figure: the Shepherd between the File System and the Disk Subsystem. Labels: Add Mirror; Archival Scientific Data; Checksum; Networked Storage; More Retry; ATA (less reliable drive, more protection); SCSI (more reliable drive, less protection); Custom Drive with composable policies (Add Mirror, Checksum, More Retry, More Protection)]

Outline

Introduction

I/O Shepherd Architecture

Implementation

Evaluation

Conclusion

Architecture

Building a reliability framework
– How to specify reliability policies?
– How to make powerful policies?
– How to simplify reliability management?

I/O Shepherd layer

Four important components
– Policy table
– Policy code
– Policy primitives
– Policy metadata

[Figure: the I/O Shepherd between the File System and the Disk Subsystem]

Policy Table
    Data    Mirror()
    Inode   …
    Super   …

Policy Code
    DynMirrorWrite(DiskAddr D, MemAddr A)
        DiskAddr copyAddr;
        IOS_MapLookup(MMap, D, &copyAddr);
        if (copyAddr == NULL) {
            PickMirrorLoc(MMap, D, &copyAddr);
            IOS_MapAllocate(MMap, D, copyAddr);
        }
        return (IOS_Write(D, A, copyAddr, A));

Primitives: Checksum, Lookup, Location, Read, SanityCheck, OnlineFsck, Write

Policy Metadata: Mirror-Map, Remap-Map, Checksum-Map

Policy Table

How to specify reliability policies?
– Different block types, different levels of importance
– Different volumes, different reliability levels
– Need fine-grained policy

Policy table
– Different policies across different block types
– Different policy tables across different volumes

Policy Table
    Block Type      Write Policy        Read Policy
    …               …                   …
    Superblock      TripleMirror()      …
    Inode           ChecksumParity()
    Inode Bitmap    ChecksumParity()
    Data            WriteRetry1sec()
    …               …                   …

[Figure: one Shepherd under the File System serving the volumes /lib, /tmp, /boot, and /archive, with reliability levels ranging from no protection to high-level reliability]
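
The policy table can be pictured as a per-volume array of function pointers indexed by block type. The following is a minimal sketch under that assumption, not CrookFS code: the names block_type, policy_entry, policy_table, shepherd_write, mirror_write, and plain_io are hypothetical, and the stand-in policies only count calls instead of issuing I/O.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical block types, following the rows of the slide's table. */
typedef enum { BT_SUPER, BT_INODE, BT_INODE_BITMAP, BT_DATA, BT_COUNT } block_type;

typedef enum { IO_OK = 0, IO_FAIL = -1 } io_status;

/* A policy is a function invoked for every I/O on a given block type. */
typedef io_status (*policy_fn)(unsigned long block);

typedef struct {
    policy_fn write_policy;
    policy_fn read_policy;
} policy_entry;

/* One table per volume, indexed by block type. */
typedef struct {
    const char  *volume;
    policy_entry entries[BT_COUNT];
} policy_table;

/* Stand-in policies: they only count invocations (assumption: no real I/O). */
static int mirror_calls, plain_calls;
static io_status mirror_write(unsigned long b) { (void)b; mirror_calls++; return IO_OK; }
static io_status plain_io(unsigned long b)     { (void)b; plain_calls++;  return IO_OK; }

/* Dispatch a write through the table; block types without a policy get plain I/O. */
static io_status shepherd_write(const policy_table *t, block_type bt, unsigned long b)
{
    assert(bt < BT_COUNT);
    policy_fn p = t->entries[bt].write_policy;
    return p != NULL ? p(b) : plain_io(b);
}

/* Example table: only superblocks and inodes are mirrored on this volume. */
static const policy_table archive_table = {
    .volume  = "/archive",
    .entries = {
        [BT_SUPER] = { mirror_write, plain_io },
        [BT_INODE] = { mirror_write, plain_io },
    },
};
```

Calling shepherd_write(&archive_table, BT_SUPER, 1) runs the mirroring stand-in, while a BT_DATA write falls through to plain I/O. Changing a volume's reliability level then means swapping table entries, not editing each I/O call site.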

Policy Metadata

What support is needed to make powerful policies?
– Remapping: track bad block remapping
– Mirroring: allocate new block
– Sanity check: need on-disk structure specification

Integration with file system
– Runtime allocation
– Detailed knowledge of on-disk structures

I/O Shepherd Maps
– Managed by the shepherd
– Commonly used maps:
  • Mirror-map
  • Checksum-map
  • Remap-map

[Figure: example maps kept by the I/O Shepherd below the File System]

    Block   Mirror-Map   Csum-Map   Remap-Map
    1001    2001         1010       null
    1002    2002         1010       null
    1003    null         1010       3003
    …       …            …          …
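
A shepherd map is, in essence, a persistent table from one block address to a value (another address or a checksum). The sketch below is an in-memory approximation only: shep_map, map_lookup, map_update, and resolve_addr are hypothetical names, NO_ADDR stands in for the "null" entries above, and persistence and journaling of the maps are deliberately left out.

```c
#include <assert.h>

#define NO_ADDR 0UL   /* stands in for the "null" entries on the slide */
#define MAP_CAP 16    /* arbitrary capacity for this sketch */

typedef struct { unsigned long key, val; } map_entry;
typedef struct { map_entry e[MAP_CAP]; int n; } shep_map;

/* Return the value stored for key, or NO_ADDR if there is no entry. */
static unsigned long map_lookup(const shep_map *m, unsigned long key)
{
    for (int i = 0; i < m->n; i++)
        if (m->e[i].key == key)
            return m->e[i].val;
    return NO_ADDR;
}

/* Insert or overwrite key -> val. Returns 0 on success, -1 if the map is full. */
static int map_update(shep_map *m, unsigned long key, unsigned long val)
{
    assert(m->n >= 0 && m->n <= MAP_CAP);
    for (int i = 0; i < m->n; i++)
        if (m->e[i].key == key) { m->e[i].val = val; return 0; }
    if (m->n == MAP_CAP)
        return -1;
    m->e[m->n].key = key;
    m->e[m->n].val = val;
    m->n++;
    return 0;
}

/* Where should a block really be read from? Its remapped location, if any. */
static unsigned long resolve_addr(const shep_map *remap, unsigned long block)
{
    unsigned long r = map_lookup(remap, block);
    return r != NO_ADDR ? r : block;
}
```

With the slide's example values, resolve_addr on a remap-map holding 1003 -> 3003 redirects block 1003 to 3003 and leaves block 1001 where it is.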

Policy Primitives and Code

How to make reliability management simple?

I/O Shepherd Primitives
– Rich set and reusable
– Complexities are hidden

The policy writer simply composes primitives into policy code

Policy Primitives
– Maps: Map Update, Map Lookup
– Computation: Checksum, Parity
– FS-Level: Sanity Check, Stop FS
– Layout: Allocate Near, Allocate Far

Policy Code
    MirrorData(Addr D)
        Addr M;
        MapLookup(MMap, D, M);
        if (M == NULL) {
            M = PickMirrorLoc(D);
            MapAllocate(MMap, D, M);
        }
        Copy(D, M);
        Write(D, M);

Policy Table
    Data    MirrorData()
    Inode   …
    Super   …

Policy Code
    MirrorData(Addr D)
        Addr R;
        R = MapLookup(MMap, D);
        if (R == NULL) {
            R = PickMirrorLoc(D);
            MapAllocate(MMap, D, R);
        }
        Copy(D, R);
        Write(D, R);

[Figure: a write of data block D passes from the File System through the I/O Shepherd to the Disk Subsystem. The Mirror-Map entry for D changes from NULL to R, and both D and its copy R are written.]
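
The MirrorData walk-through can be tried on a toy in-memory disk. This is a sketch under stated assumptions, not the shepherd's implementation: toy_disk, disk_init, pick_mirror_loc, and mirror_data are hypothetical, the mirror-map is a plain array, and the allocator simply hands out blocks from the upper half of the disk.

```c
#include <assert.h>
#include <string.h>

#define NBLOCKS   64
#define BLOCK_SZ  16
#define NO_MIRROR (-1)

/* Toy "disk" with a mirror-map indexed by primary block number. */
typedef struct {
    char data[NBLOCKS][BLOCK_SZ];
    int  mirror_of[NBLOCKS];  /* mirror-map: block -> mirror block, or NO_MIRROR */
    int  next_free;           /* assumption: mirrors come from the upper half */
} toy_disk;

static void disk_init(toy_disk *d)
{
    memset(d->data, 0, sizeof d->data);
    for (int i = 0; i < NBLOCKS; i++)
        d->mirror_of[i] = NO_MIRROR;
    d->next_free = NBLOCKS / 2;
}

/* Stand-in for PickMirrorLoc: a fresh block, or NO_MIRROR when none is left. */
static int pick_mirror_loc(toy_disk *d)
{
    return d->next_free < NBLOCKS ? d->next_free++ : NO_MIRROR;
}

/* Mirror-on-write: look up the mirror, allocate it on first write,
 * then write both copies. Returns the mirror block, or -1 on failure. */
static int mirror_data(toy_disk *d, int block, const char *buf)
{
    assert(block >= 0 && block < NBLOCKS / 2);
    int m = d->mirror_of[block];             /* MapLookup */
    if (m == NO_MIRROR) {
        m = pick_mirror_loc(d);              /* PickMirrorLoc */
        if (m == NO_MIRROR)
            return -1;
        d->mirror_of[block] = m;             /* MapAllocate */
    }
    memcpy(d->data[block], buf, BLOCK_SZ);   /* write primary copy */
    memcpy(d->data[m], buf, BLOCK_SZ);       /* write mirror copy */
    return m;
}
```

Writing the same block twice returns the same mirror location, because the second call finds the existing map entry and skips allocation.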

Summary

Interposition simplifies reliability management
– Localized policies
– Simple and extensible policies

Challenge: keeping new data and metadata consistent

Outline

Introduction

I/O Shepherd Architecture

Implementation
– Consistency Management

Evaluation

Conclusion

Implementation

CrookFS
– Named for the hooked staff of a shepherd
– An ext3 variant with I/O shepherding capabilities

Implementation
– Changes in core OS
  • Semantic information, layout and allocation interface, allocation during recovery
  • Consistency management (data journaling mode)
  • ~900 LOC (non-intrusive)
– Shepherd infrastructure
  • Shepherd primitives, thread support, maps management, etc.
  • ~3500 LOC (reusable for other file systems)

Well-integrated with the file system
– Small overhead

Data Journaling Mode

[Figure: data journaling mode. Labels: Memory (D, I, Bm); Journal (TB D I TC); Fixed Location (D, I). Steps: Sync (intent is logged), Checkpoint (intent is realized), Tx Release.]

Reliability Policy + Journaling

When to run policies?
– Policies (e.g. mirroring) are executed during checkpoint

Is the current journaling approach adequate to support reliability policies?
– Could we run remapping/mirroring during checkpoint?

No
– Problem of failed intentions
– Cannot react to checkpoint failures

Failed Intentions

Example policy: remapping

[Figure: the transaction (TB D I TC) is committed to the journal and released (Tx Release). The checkpoint of D to its fixed location fails (failed intent), so D is remapped to R and the in-memory Remap-Map entry changes from RMD0 to RMDR. A crash before RMDR reaches disk leaves the old entry RMD0, so reaching the state "Checkpoint completes, RMDR" is marked Impossible.]

Inconsistencies:
1) Pointer I→D is invalid
2) No reference to R

Journaling Flaw

Journal: intent is logged to the journal
– If a journal write fails? Simply abort the transaction

Checkpoint: intent is realized at the final location
– If a checkpoint write fails? No solution!
  • Ext3, IBM JFS: ignore
  • ReiserFS: stop the FS (coarse-grained recovery)

Flaw in the current journaling approach
– No consistency for any checkpoint recovery that changes state
  • Too late: the transaction has already been committed
  • A crash could occur at any time
– Hopes that checkpoint writes always succeed (wrong!)

Consistent reliability + current journal = impossible

Chained Transactions

Contains all recent changes (e.g. the shepherd's modified metadata)

"Chained" with the previous transaction

Rule: the previous transaction can be released only after the chained transaction commits
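
The release rule can be expressed as a small state check. This is a minimal sketch, assuming a transaction is just a set of flags plus a pointer to its chained transaction. The names tx, tx_init, commit, checkpoint, and try_release are hypothetical and do not correspond to the ext3/JBD interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct tx {
    bool committed;      /* commit block (TC) has reached the journal */
    bool checkpointed;   /* blocks were written to their final locations */
    bool released;       /* journal space has been reclaimed */
    struct tx *chained;  /* CTx holding metadata changed during checkpoint, or NULL */
} tx;

static void tx_init(tx *t)
{
    t->committed = t->checkpointed = t->released = false;
    t->chained = NULL;
}

static void commit(tx *t) { t->committed = true; }

/* Checkpoint t. If a policy changed shepherd metadata while doing so
 * (e.g. a remap after a failed write), the change lives in ctx, which
 * is chained to t. */
static void checkpoint(tx *t, bool policy_changed_metadata, tx *ctx)
{
    assert(t->committed && !t->released);
    t->checkpointed = true;
    if (policy_changed_metadata)
        t->chained = ctx;
}

/* The slide's rule: release only after the chained transaction commits. */
static bool try_release(tx *t)
{
    if (!t->checkpointed)
        return false;
    if (t->chained != NULL && !t->chained->committed)
        return false;
    t->released = true;
    return true;
}
```

If a crash happens while try_release still returns false, the previous transaction is still in the journal and can be replayed, which is the property the rule is meant to preserve.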

Chained Transactions

Example policy: remapping

[Figure: the transaction (TB D I TC) is in the journal. During checkpoint, D is remapped to R and the Remap-Map entry changes from RMD0 to RMDR. The updated entry RMDR is logged in a chained transaction (TB RMDR TC).]

Old: Tx Release when the checkpoint completes
New: Tx Release after CTx commits

Summary

Chained Transactions
– Handles failed intentions
– Works for all policies
– Minimal changes in the journaling layer

Repeatable across crashes
– Idempotent policy
  • An important property for consistency across multiple crashes

Outline

Introduction

I/O Shepherd Architecture

Implementation

Evaluation

Conclusion

Evaluation

Flexible
– Change ext3 to all-stop or more-retry policies

Fine-grained
– Implement gracefully degrading RAID [TOS'05]

Composable
– Perform multiple lines of defense

Simple
– Craft 8 policies in a simple manner

Flexibility

Modify ext3's inconsistent read recovery policies

[Figure: ext3 read recovery policy, one cell per workload and failed block type. Legend: Stop, Propagate, No Recovery, Retry, Not applicable.]

Example cell
– Failed block: indirect block
– Workload: path traversal, cd /mnt/fs2/test/a/b/
– Policy observed: detect the failure and propagate it to the application

Policies observed in ext3: Propagate, Retry, Ignore failure, Stop

Flexibility

Modify ext3 policies to all-stop policies

Policy Code
    AllStopRead(Block B)
        if (Read(B) == OK)
            return OK;
        else
            Stop();

Policy Table
    Any Block Type    AllStopRead()

[Figure: recovery policy grids for ext3 and for All-Stop. Legend: Stop, Propagate, No Recovery, Retry.]

Flexibility

Modify ext3 policies to retry-more policies

Policy Code
    RetryMoreRead(Block B)
        for (int i = 0; i < RETRY_MAX; i++)
            if (Read(B) == SUCCESS)
                return SUCCESS;
        return FAILURE;

Policy Table
    Any Block Type    RetryMoreRead()

[Figure: recovery policy grids for ext3 and for Retry-More. Legend: Stop, Propagate, No Recovery, Retry.]

Fine-Granularity

RAID problem
– Extreme unavailability
  • Partially available data
  • Unavailable root directory

DGRAID [TOS'05]
– Degrade gracefully
  • Fault-isolate a file to a disk
  • Highly replicate metadata

[Figure: File System over RAID-0, where file1.pdf and the root directory are spread across the disks, compared with Shepherd + DGRAID over RAID-0, where /root is replicated and f1.pdf and f2.pdf each reside on a single disk]

Fine-Granularity

DGRAID Policy Table
    Superblock    MirrorXway()
    Group Desc    MirrorXway()
    Bitmaps       MirrorXway()
    Directory     MirrorXway()
    Inode         MirrorXway()
    Indirect      MirrorXway()
    Data          IsolateAFileToADisk()

X = 1, 5, 10

[Figure: plot with curves labeled "10-way" and "Linear" and points annotated F: 1, A: 90%; F: 2, A: 80%; F: 3, A: ~40%]
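
The two DGRAID policies can be sketched as a placement function. This is only an illustration under loud assumptions: a file's data disk is chosen as its inode number modulo the number of disks, metadata replicas sit on disks 0 to X-1, and "available" means the data disk and at least one metadata replica are up. The names place_block and file_available are hypothetical, and none of this is taken from the D-GRAID or CrookFS code.

```c
#include <assert.h>

#define NDISKS 10

typedef enum { BLK_METADATA, BLK_DATA } blk_kind;

/* Fill out[] with the disks holding a copy of the block; return the count.
 * Data: a single disk chosen from the owning file (assumption: inode
 * number modulo NDISKS), so all blocks of one file share a disk.
 * Metadata: replicated on disks 0 .. ways-1 (the X in MirrorXway). */
static int place_block(blk_kind kind, unsigned long inode_no, int ways,
                       int out[NDISKS])
{
    assert(ways >= 1 && ways <= NDISKS);
    if (kind == BLK_DATA) {
        out[0] = (int)(inode_no % NDISKS);
        return 1;
    }
    for (int i = 0; i < ways; i++)
        out[i] = i;
    return ways;
}

/* A file is readable if its data disk is up and some metadata replica is up. */
static int file_available(unsigned long inode_no, int ways,
                          const int disk_up[NDISKS])
{
    int disks[NDISKS];
    place_block(BLK_DATA, inode_no, ways, disks);
    if (!disk_up[disks[0]])
        return 0;
    int n = place_block(BLK_METADATA, inode_no, ways, disks);
    for (int i = 0; i < n; i++)
        if (disk_up[disks[i]])
            return 1;
    return 0;
}
```

With disk 0 down and X = 1, every file is unreachable because the only metadata copy is gone. With X = 5, only files whose data lives on a failed disk are lost, which is the graceful degradation the slide describes.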

Composability

Multiple lines of defense

Assemble both low-level and high-level recovery mechanisms

Policy Code
    ReadInode(Block B) {
        C = Lookup(Ch-Map, B);
        Read(B, C);
        if (CompareChecksum(B, C) == OK)
            return OK;
        M = Lookup(M-Map, B);
        Read(M);
        if (CompareChecksum(M, C) == OK) {
            B = M;
            return OK;
        }
        if (SanityCheck(B) == OK)
            return OK;
        if (SanityCheck(M) == OK) {
            B = M;
            return OK;
        }
        RunOnlineFsck();
        return ReadInode(B);
    }

[Figure: plot with axis label "Time (ms)"]
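
The layered read can be exercised on in-memory buffers. The sketch below follows the same order of defenses (checksum of the primary, checksum of the mirror, sanity check of either copy) but stops where the slide would run online fsck. The additive checksum, the INODE_MAGIC byte, and the names checksum, sanity_check, read_src, and read_inode are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SZ    8
#define INODE_MAGIC 0xEF  /* hypothetical: first byte of a well-formed inode block */

/* Simple additive checksum, for illustration only (not collision resistant). */
static uint32_t checksum(const uint8_t *b, size_t n)
{
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += b[i];
    return s;
}

/* Hypothetical structural check standing in for SanityCheck. */
static int sanity_check(const uint8_t *b) { return b[0] == INODE_MAGIC; }

typedef enum { FROM_PRIMARY, FROM_MIRROR, FROM_SANITY, READ_FAILED } read_src;

/* Try each line of defense in turn and report which one produced the data. */
static read_src read_inode(const uint8_t *primary, const uint8_t *mirror,
                           uint32_t stored_csum, uint8_t *out)
{
    assert(primary != NULL && mirror != NULL && out != NULL);
    if (checksum(primary, BLOCK_SZ) == stored_csum) {
        memcpy(out, primary, BLOCK_SZ);
        return FROM_PRIMARY;
    }
    if (checksum(mirror, BLOCK_SZ) == stored_csum) {
        memcpy(out, mirror, BLOCK_SZ);
        return FROM_MIRROR;
    }
    /* Neither copy matches the stored checksum (the checksum itself may be
     * bad), so fall back to a structural check of each copy. */
    if (sanity_check(primary)) {
        memcpy(out, primary, BLOCK_SZ);
        return FROM_SANITY;
    }
    if (sanity_check(mirror)) {
        memcpy(out, mirror, BLOCK_SZ);
        return FROM_SANITY;
    }
    return READ_FAILED;  /* the slide's policy would run online fsck and retry */
}
```

Returning which defense succeeded is a choice made for this sketch; it makes the fallback order observable without logging.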

Simplicity

Writing a reliability policy is simple
– Implement 8 policies
  • Using reusable primitives
– The most complex one is < 80 LOC

    Policy                       LOC
    Propagate                    8
    Sanity Check                 10
    Reboot                       15
    Retry                        15
    Mirroring                    18
    Parity                       28
    Multiple Lines of Defense    39
    D-GRAID                      79

Conclusion

Modern storage failures are complex
– Not only fail-stop, but also individual block failures

An FS reliability framework does not exist
– Scattered policy code: can't expect much reliability
– Journaling + block failures lead to failed intentions (flaw)

I/O Shepherding
– Powerful
  • Deploy disk-level, RAID-level, FS-level policies
– Flexible
  • Reliability as a function of workload and environment
– Consistent
  • Chained transactions

ADvanced Systems Laboratory
www.cs.wisc.edu/adsl

Scholarship Sponsor:

Research Sponsor:

Thanks to:

I/O Shepherd's shepherd – Frans Kaashoek

Extra Slides

Policy Table
    Data    RemapMirrorData()
    ..      …
    ..      …

Policy Code
    RemapMirrorData(Addr D)
        Addr R, Q;
        MapLookup(MMap, D, R);
        if (R == NULL) {
            R = PickMirrorLoc(D);
            MapAllocate(MMap, D, R);
        }
        Copy(D, R);
        Write(D, R);
        if (Fail(R)) {
            Deallocate(R);
            Q = PickMirrorLoc(D);
            MapAllocate(MMap, D, Q);
            Write(Q);
        }

[Figure: a write of block D to the Disk Subsystem. The Mirror-Map entry for D goes from NULL to R; the write to R fails, so the entry is changed to Q and the copy is written to Q.]

Chained Transactions (2)

Example policy: RemapMirrorData

[Figure: chained transactions with the RemapMirrorData policy. Labels: Memory, Journal, Fixed Location; blocks D, I, R1, R2; map states MD0, MDR1, MDR2; transaction markers TB and TC; "Checkpoint completes".]

Existing Solution Enough?

Is the machinery in high-end systems enough (e.g. disk scrubbing, redundancy, end-to-end checksums)?
– Not pervasive in home environments (storing photos, tax returns)
– New trend: commodity storage clusters (Google, EMC Centera)

Is RAID enough?
– Requires more than one disk
– Does not protect against faults above the disk system
– Focuses on whole-disk failure
– Does not enable fine-grained policies

