Haryadi Gunawi, UC Berkeley: Towards Reliable Cloud Storage
Transcript
Page 1:

Haryadi Gunawi
UC Berkeley

Towards Reliable Cloud Storage

Page 2:

Outline
- Research background
- FATE and DESTINI

Page 3:

Yesterday's storage

[Diagram: local storage and storage servers]

Page 4:

Tomorrow's storage

[Diagram: a laptop backed by cloud storage, e.g. Google]

Page 5:

Tomorrow's storage
- Internet-service FS: GoogleFS, HadoopFS, CloudStore, ...
- Custom storage: Facebook Haystack Photo Store, Microsoft StarTrack (Map Apps), Amazon S3, EBS, ...
- Key-value store: Cassandra, Voldemort, ...
- Structured storage: Yahoo! PNUTS, Google BigTable, HBase, ...

"This is not just data. It's my life. And I would be sick if I lost it." [CNN '10]

+ Replication
+ Scale-up
+ Migration
+ ...

Reliable? Highly available?

Page 6:

Headlines

"Cloudy with a chance of failure"

Page 7:

Outline
- Research background
- FATE and DESTINI
  - Motivation
  - FATE
  - DESTINI
  - Evaluation

Joint work with: Thanh Do, Pallavi Joshi, Peter Alvaro, Joseph Hellerstein, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, Koushik Sen, Dhruba Borthakur

Page 8:

Failure recovery ...
- Cloud: thousands of commodity machines
- "Rare (HW) failures become frequent" [Hamilton]
- Failure recovery "... has to come from the software" [Dean]
- Failure recovery "... must be a first-class op" [Ramakrishnan et al.]

Page 9:

... hard to get right
- Google Chubby (lock service): four occasions of data loss; whole-system downtime; more problems after injecting multiple failures (randomly)
- Google BigTable (key-value store): a Chubby bug affects BigTable availability
- More details? How often? Other problems and implications?

Page 10:

... hard to get right
- Open-source cloud projects are a very valuable bug/issue repository! HDFS (1400+), ZooKeeper (900+), Cassandra (1600+), HBase (3000+), Hadoop (7000+)
- HDFS JIRA study: 1300 issues over 4 years (April 2006 to July 2010); selecting recovery problems due to hardware failures yields 91 recovery bugs/issues

Implications     Count
Data loss        13
Unavailability   48
Corruption       19
Misc.            10

Page 11:

Why?
- Testing is not advanced enough [Google]; the failure model is multiple, diverse failures
- Recovery is under-specified [Hamilton]; lots of custom recovery; implementation is complex

Need two advancements:
- Exercise complex failure modes
- Write specifications and test the implementation

Page 12:

[Diagram: FATE, a Failure Testing Service, injects failures (X1, X2) into the cloud software under test; DESTINI, Declarative Testing Specifications, checks whether the resulting behavior violates the specs]

Page 13:

Outline
- Research background
- FATE and DESTINI
  - Motivation
  - FATE: architecture; failure exploration
  - DESTINI
  - Evaluation

Page 14:

HadoopFS (HDFS) write protocol
- Stages: block allocation request, pipeline setup, data transfer
- Setup recovery (failure X1): recreate a fresh pipeline (e.g. nodes 1, 2, 4)
- Data transfer recovery (failure X2): continue on the surviving nodes (e.g. nodes 1, 2)

[Diagram: client C, master M, and datanode pipelines: 1-2-3 with no failures, 1-2-4 after a setup-stage crash, 1-2 after a data-transfer-stage crash]
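To make the stage-dependent recovery concrete, here is a minimal Java sketch of the stages and recovery actions named above (WriteStage and recoveryFor are illustrative names, not HDFS's actual API):

enum WriteStage { ALLOC_REQ, SETUP, DATA_TRANSFER }

class PipelineRecoverySketch {
    // Per the slide: a setup-stage failure recreates a fresh pipeline with a
    // replacement node; a data-transfer failure continues on the survivors.
    static String recoveryFor(WriteStage stage) {
        switch (stage) {
            case SETUP:         return "recreate fresh pipeline (1, 2, 4)";
            case DATA_TRANSFER: return "continue on surviving nodes (1, 2)";
            default:            return "(recovery not described on this slide)";
        }
    }
    public static void main(String[] args) {
        for (WriteStage s : WriteStage.values())
            System.out.println(s + " -> " + recoveryFor(s));
    }
}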

Page 15:

Failures and FATE
- Failures can strike anytime (different stages, different recovery), anywhere (N2 crashes, and then N3), and be of any type (bad disks, partitioned nodes/racks)
- FATE systematically exercises multiple, diverse failures
- How? FATE needs to "remember" failures, via failure IDs

Page 16:

Failure IDs
- Abstraction of I/O failures
- Building failure IDs: intercept every I/O, then inject possible failures (e.g. crash, network partition, disk failure (LSE/corruption))

Example: the I/O information (a stack trace ending at OutputStream.read() in BlockReceiver.java; net I/O from N3 to N2 carrying a "Data Ack") combined with the injected failure ("crash after") yields Failure ID 25. (Later slides abbreviate failure IDs as A, B, C, ...)
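As a rough illustration of that abstraction (class and field names are assumptions, not FATE's actual code), a failure ID can be modeled as an equality-comparable value built from the static I/O context plus the injected failure, which is what lets FATE "remember" which failures it has already exercised:

import java.util.Objects;

// Sketch of a failure ID: static I/O context plus the injected failure.
// Names are illustrative, not FATE's implementation.
final class FailureId {
    final String stackTrace; // e.g. OutputStream.read() in BlockReceiver.java
    final String ioInfo;     // e.g. net I/O from N3 to N2, "Data Ack"
    final String failure;    // e.g. "Crash After"

    FailureId(String stackTrace, String ioInfo, String failure) {
        this.stackTrace = stackTrace;
        this.ioInfo = ioInfo;
        this.failure = failure;
    }
    @Override public boolean equals(Object o) {
        return o instanceof FailureId
            && ((FailureId) o).stackTrace.equals(stackTrace)
            && ((FailureId) o).ioInfo.equals(ioInfo)
            && ((FailureId) o).failure.equals(failure);
    }
    @Override public int hashCode() {
        return Objects.hash(stackTrace, ioInfo, failure);
    }
}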

Page 17:

FATE architecture
- Target system (e.g. Hadoop FS) running on the Java SDK
- Workload driver: while (new FIDs) { hdfs.write() }
- AspectJ failure surface: intercepts I/O and passes the I/O info to the failure server
- Failure server: decides fail or no fail
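The workload driver shown in the architecture is essentially this loop (a sketch; FateServer and Hdfs are placeholder interfaces, not the real APIs):

// Sketch of the workload driver: rerun the workload as long as the failure
// server keeps discovering failure IDs it has not exercised yet.
interface FateServer { boolean hasNewFailureIds(); }
interface Hdfs { void write(); }

class WorkloadDriver {
    static void drive(FateServer fate, Hdfs hdfs) {
        while (fate.hasNewFailureIds()) {
            hdfs.write(); // each run, FATE injects a not-yet-exercised failure
        }
    }
}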

Page 18: Haryadi Gunawi UC Berkeley 1. 2  Research background  FATE and DESTINI.

Brute-force exploration

18

M 1C 2 3

A

A

B

A

B C

Exp #1: A

Exp #2: B

Exp #3: C

M 1C 2 3

A

B C

B

A

A

AB

AC

B CBC

1 failure / run 2 failures / run
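A minimal sketch of this enumeration (failure IDs reduced to strings for illustration):

import java.util.ArrayList;
import java.util.List;

// Sketch of brute-force exploration: every single failure ID is one
// experiment; for two failures per run, every pair is one experiment.
class BruteForce {
    static List<List<String>> experiments(List<String> fids, int failuresPerRun) {
        List<List<String>> exps = new ArrayList<>();
        if (failuresPerRun == 1) {
            for (String fid : fids) exps.add(List.of(fid));
        } else { // 2 failures per run
            for (int i = 0; i < fids.size(); i++)
                for (int j = i + 1; j < fids.size(); j++)
                    exps.add(List.of(fids.get(i), fids.get(j)));
        }
        return exps;
    }
    public static void main(String[] args) {
        List<String> fids = List.of("A", "B", "C");
        System.out.println(experiments(fids, 1)); // [[A], [B], [C]]
        System.out.println(experiments(fids, 2)); // [[A, B], [A, C], [B, C]]
    }
}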

Page 19:

Outline
- Research background
- FATE and DESTINI
  - Motivation
  - FATE: architecture; failure exploration challenge and solution
  - DESTINI
  - Evaluation

Page 20:

Combinatorial explosion
- Exercised over 40,000 unique combinations of 1, 2, and 3 failures per run: 80 hours of testing time!
- New challenge: combinatorial explosion; we need smarter exploration strategies

[Diagram: failure points A and B on each of nodes 1, 2, 3 (A1, B1, A2, B2, A3, B3); with 2 failures per run the sequences multiply: A1 A2, A1 B2, B1 A2, B1 B2, ...]

Page 21:

Pruning multiple failures
- Properties of multiple failures: pairwise dependent failures, pairwise independent failures
- Goal: exercise distinct recovery behaviors
- Key: some failures result in similar recovery
- Result: > 10x faster, and found the same bugs

Page 22:

Dependent failures
- Failure dependency graph: inject single failures first, then record the subsequent dependent failure IDs (e.g. X depends on A)
- Brute force would run: AX, BX, CX, DX, CY, DY
- Recovery clustering: two clusters, {X} and {X, Y}; only exercise distinct clusters, picking one failure ID that triggers each recovery cluster (see the sketch below)
- Result: AX, CX, CY

FID   Subsequent FIDs
A     X
B     X
C     X, Y
D     X, Y
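A sketch of this clustering (data structures are illustrative, not FATE's code): group the first failures by the set of subsequent failure IDs they enable, and keep one representative per distinct set.

import java.util.*;

// Sketch of dependent-failure pruning: first failures whose injected crash
// enables the same set of subsequent failure IDs land in one cluster, and
// only one representative per cluster is exercised.
class DependentPruning {
    static List<String> representatives(Map<String, Set<String>> subsequentIds) {
        Map<Set<String>, String> reps = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> e : subsequentIds.entrySet())
            reps.putIfAbsent(e.getValue(), e.getKey()); // first trigger wins
        return new ArrayList<>(reps.values());
    }
    public static void main(String[] args) {
        Map<String, Set<String>> dep = new LinkedHashMap<>();
        dep.put("A", Set.of("X"));
        dep.put("B", Set.of("X"));
        dep.put("C", Set.of("X", "Y"));
        dep.put("D", Set.of("X", "Y"));
        // Prints [A, C]: run AX for cluster {X}, CX and CY for cluster {X, Y}.
        System.out.println(representatives(dep));
    }
}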

Page 23:

Independent failures (1)
- Independent combinations: FP^2 x N(N-1). With FP = 2 failure points and N = 3 nodes, that is 4 x 6 = 24 combinations.
- Symmetric code: just pick two nodes, reducing the N(N-1) node pairs to 2, i.e. FP^2 x 2 = 8 combinations.

[Diagram: failure points A and B on each of nodes 1, 2, 3; after symmetry pruning, only the failure points on two of the nodes are combined]
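The counting is easy to check in code; a sketch of the two formulas above (names are mine, not the paper's):

// Sketch: number of 2-failure experiments over independent failure points.
// FP failure points per node, N nodes; one failure is injected on each of
// two distinct nodes. Symmetric code lets us fix the two nodes in advance.
class IndependentCount {
    static int experiments(int fp, int n, boolean exploitSymmetry) {
        int nodePairs = exploitSymmetry ? 2 : n * (n - 1);
        return fp * fp * nodePairs;
    }
    public static void main(String[] args) {
        System.out.println(experiments(2, 3, false)); // 24, as on the slide
        System.out.println(experiments(2, 3, true));  // 8 after symmetry pruning
    }
}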

Page 24:

Independent failures (2)
- FP^2 is the bottleneck: with FP = 4, total = 16; a real example has FP = 15
- Recovery clustering: cluster failure points A and B if fail(A) == fail(B), reducing FP^2 to FPclustered^2
- E.g. 15 FPs cluster down to 8 FPs

[Diagram: failure points A1, B1, C1, D1, A2, B2, C2, D2; clustering collapses failure points that trigger identical recovery]

Page 25: Haryadi Gunawi UC Berkeley 1. 2  Research background  FATE and DESTINI.

25

FATE Summary Exercise multiple, diverse failures

Via failure IDs abstraction Challenge: combinatorial explosion of multiple

failures

Smart failure exploration strategies > 10x improvement Built on top of failure IDs abstraction

Current limitations No I/O reordering Manual workload setup

Page 26:

Outline
- Research background
- FATE and DESTINI
  - Motivation
  - FATE
  - DESTINI: overview; building specifications
  - Evaluation

Page 27:

DESTINI: declarative specs
- Is the system correct under failures? We need to write specifications.
- "[It is] great to document (in a spec) the HDFS write protocol ..., but we shouldn't spend too much time on it, ... a formal spec may be overkill for a protocol we plan to deprecate imminently."

[Diagram: failures X1 and X2 hit the implementation; the specs define the expected behavior]

Page 28:

Declarative specs
- How to write specifications? They should be developer friendly (clear, concise, easy)
- Existing approaches: unit tests (ugly, bloated, not formal); others are too verbose and long
- Declarative relational logic language (Datalog); key: it is easy to express logical relations

Page 29:

Specs in DESTINI
- How to write specs? Violations, expectations, facts
- How to write recovery specs? "... recovery is under-specified" [Hamilton]; we need precise failure events and precise check timings
- How to test the implementation? Interpose I/O calls (lightweight); deduce expectations and facts from I/O events

Page 30:

Outline
- Research background
- FATE and DESTINI
  - Motivation
  - FATE
  - DESTINI: overview; building specifications
  - Evaluation

Page 31:

Specification template

violationTable(...) :- expectationTable(...), NOT-IN actualTable(...);

"Throw a violation if an expectation is different from the actual behavior."

Datalog syntax: head() :- predicates(), ...
The ":-" denotes derivation (the head holds if the predicates hold), and the comma between predicates denotes AND.
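The template is relational set difference; a minimal Java rendering of the same check (table contents reduced to strings; illustrative only):

import java.util.HashSet;
import java.util.Set;

// Sketch of the spec template: a violation is every tuple that appears in
// the expectation table but is NOT-IN the actual table.
class SpecTemplate {
    static Set<String> violations(Set<String> expectation, Set<String> actual) {
        Set<String> v = new HashSet<>(expectation);
        v.removeAll(actual); // expectationTable(...) NOT-IN actualTable(...)
        return v;
    }
    public static void main(String[] args) {
        Set<String> expectedNodes = Set.of("B@Node1", "B@Node2");
        Set<String> actualNodes = Set.of("B@Node1");
        // Prints [B@Node2]: the expected replica that recovery never placed.
        System.out.println(violations(expectedNodes, actualNodes));
    }
}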

Page 32:

Data transfer recovery: "Replicas should exist on the surviving nodes"

incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N);

expectedNodes(Block, Node):
B  Node1
B  Node2

actualNodes(Block, Node):
B  Node1
B  Node2

incorrectNodes(Block, Node): (empty; recovery behaved correctly)

[Diagram: datanode 3 crashes during data transfer; block B survives on nodes 1 and 2]

Page 33: Haryadi Gunawi UC Berkeley 1. 2  Research background  FATE and DESTINI.

33

Recovery bug

expectedNodes(Block, Node)

B Node 1

B Node 2

actualNodes(Block, Node)B Node 1

incorrectNodes(Block, Node)

B Node 2

M 1C 2 3

X

B B

incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N);

Page 34:

Building expectations
- Example: which nodes should have the blocks?
- Deduce expectations from I/O events: the client asks the master getBlockPipe(...) ("give me 3 nodes for B") and receives [Node1, Node2, Node3]

#1: incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N);
#2: expectedNodes(B, N) :- getBlockPipe(B, N);

expectedNodes(Block, Node):
B  Node1
B  Node2
B  Node3
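A sketch of how an interposed I/O event could populate this table (class and method names are assumptions, not DESTINI's actual code):

import java.util.*;

// Sketch of rule #2 as event handling: when the client receives a pipeline
// for block B, every node in it becomes an expected replica location.
class ExpectationBuilder {
    final Map<String, Set<String>> expectedNodes = new HashMap<>();

    void onGetBlockPipe(String block, List<String> pipeline) {
        expectedNodes.computeIfAbsent(block, b -> new LinkedHashSet<>())
                     .addAll(pipeline);
    }
    public static void main(String[] args) {
        ExpectationBuilder eb = new ExpectationBuilder();
        eb.onGetBlockPipe("B", List.of("Node1", "Node2", "Node3"));
        System.out.println(eb.expectedNodes); // {B=[Node1, Node2, Node3]}
    }
}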

Page 35:

Updating expectations: when FATE crashes a node, remove it from the expectation

#1: incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N);
#2: expectedNodes(B, N) :- getBlockPipe(B, N);
#3: DEL expectedNodes(B, N) :- expectedNodes(B, N), fateCrashNode(N);

expectedNodes(Block, Node):
B  Node1
B  Node2
B  Node3  (removed by the DEL rule once Node3 crashes)

Page 36: Haryadi Gunawi UC Berkeley 1. 2  Research background  FATE and DESTINI.

36

Precise failure eventsDEL expectedNodes (B, N) :- expectedNodes (B, N), fateCrashNode (N), writeStage (B, Stage), Stage == “Data

Transfer”;

Different stages different recovery

behaviors

#1: incorrectNodes(B,N) :- expectedNodes(B,N), NOT-IN actualNodes(B,N)

#2: expectedNodes(B,N) :- getBlockPipe(B,N)#3: expectedNodes(B,N) :- expectedNodes(B,N), fateCrashNode(N), writeStage(B,Stage), Stage ==

“DataTransfer” #4: writeStage(B, “DataTr”) :- writeStage(B,“Setup”), nodesCnt(Nc), acksCnt (Ac), Nc==Ac #5: nodesCnt (B, CNT<N>) :- pipeNodes (B, N); #6: pipeNodes (B, N) :- getBlockPipe (B, N); #7: acksCnt (B, CNT<A>) :- setupAcks (B, P, “OK”); #8: setupAcks (B, P, A) :- setupAck (B, P, A);

M 1C 2 3

M 1C 2 3 4

Datatransferrecover

y

vs.

Setupstage

recovery

Precise failure events

Page 37:

Facts
- Example: which nodes actually store valid blocks?
- Deduced from disk I/O events in 8 Datalog rules

#1: incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N);

actualNodes(Block, Node):
B  Node1

[Diagram: after the crash, only Node1 actually stores block B]

Page 38:

Violation and check timing
- Recovery ≠ invariant: while recovery is ongoing, invariants are violated, and we don't want false alarms
- We need precise check timings, e.g. check upon block completion:

#1: incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N), completeBlock(B);

Page 39:

Data-transfer recovery spec

r1  incorrectNodes(B, N) :- cnpComplete(B), expectedNodes(B, N), NOT-IN actualNodes(B, N);
r2  pipeNodes(B, Pos, N) :- getBlkPipe(UFile, B, Gs, Pos, N);
r3  expectedNodes(B, N) :- getBlkPipe(UFile, B, Gs, Pos, N);
r4  DEL expectedNodes(B, N) :- fateCrashNode(N), pipeStage(B, Stg), Stg == 2, expectedNodes(B, N);
r5  setupAcks(B, Pos, Ack) :- cdpSetupAck(B, Pos, Ack);
r6  goodAcksCnt(B, COUNT<Ack>) :- setupAcks(B, Pos, Ack), Ack == 'OK';
r7  nodesCnt(B, COUNT<Node>) :- pipeNodes(B, _, N);
r8  pipeStage(B, Stg) :- nodesCnt(NCnt), goodAcksCnt(ACnt), NCnt == ACnt, Stg := 2;
r9  blkGenStamp(B, Gs) :- dnpNextGenStamp(B, Gs);
r10 blkGenStamp(B, Gs) :- cnpGetBlkPipe(UFile, B, Gs, _, _);
r11 diskFiles(N, File) :- fsCreate(N, File);
r12 diskFiles(N, Dst) :- fsRename(N, Src, Dst), diskFiles(N, Src, Type);
r13 DEL diskFiles(N, Src) :- fsRename(N, Src, Dst), diskFiles(N, Src, Type);
r14 fileTypes(N, File, Type) :- diskFiles(N, File), Type := Util.getType(File);
r15 blkMetas(N, B, Gs) :- fileTypes(N, File, Type), Type == metafile, Gs := Util.getGs(File);
r16 actualNodes(B, N) :- blkMetas(N, B, Gs), blkGenStamp(B, Gs);

(r4 consumes the failure event (crash); r1-r3 build the expectation; r11-r16 derive the actual facts from I/O events.)

Page 40:

More detailed specs
- The spec so far only says "something is wrong." Why? Where is the bug?
- Let's write more detailed specs

[Diagram: after node 3 crashes, block B should survive on nodes 1 and 2, but in the buggy run it survives only on node 1]

Page 41:

Adding specs
- First analysis: the client's pipeline excludes Node2. Why? Maybe the client gets a bad ack for Node2:
  errBadAck(N) :- dataAck(N, "Error"), liveNodes(N);
- Second analysis: the client gets a bad ack for Node2. Why? Maybe Node1 could not communicate with Node2:
  errBadConnect(N, TgtN) :- dataTransfer(N, TgtN, "Terminated"), liveNodes(TgtN);
- We catch the bug! Node2 cannot talk to Node3 (which crashed), so Node2 terminates all of its connections (including the one to Node1!), and Node1 concludes that Node2 is dead.

Page 42:

Catch bugs with more specs
- More detailed specs catch bugs closer to the source and earlier in time
- Specs can be written from the global view, the client's view, and the datanode's view

Page 43:

More design patterns
- Add detailed specs
- Refine existing specs
- Write specs from different views (global, client, datanode)
- Incorporate diverse failures (crashes, network partitions)
- Express different violations (data loss, unavailability)

Page 44:

Outline
- Research background
- FATE and DESTINI
  - Motivation
  - FATE
  - DESTINI
  - Evaluation

Page 45:

Evaluation
- Implementation complexity: ~6000 LOC in Java
- Targets 3 popular cloud systems:
  - HadoopFS (primary target): underlying storage for Hadoop/MapReduce
  - ZooKeeper: distributed synchronization service
  - Cassandra: distributed key-value store
- Recovery bugs: found 22 new HDFS bugs (confirmed), including data-loss and unavailability bugs; reproduced 51 old bugs

Page 46:

Availability bug: "If multiple racks are available (reachable), a block should be stored in a minimum of two racks."

errorSingleRack(B) :- rackCnt(B, Cnt), Cnt == 1, blkRacks(B, R), connected(R, Rb), endOfReplicationMonitor(_);

"Throw a violation if a block is only stored in one rack, but that rack is connected to another rack."

FATE injects rack partitioning between Rack #1 and Rack #2. The replication monitor sees #replicas = 3 but never checks their locations, so B is not migrated to Rack #2: an availability bug!

rackCnt:          B, 1
blkRacks:         B, R1
connected:        R1, R2
errorSingleRack:  B

Page 47:

Pruning efficiency
- Reduces #experiments by an order of magnitude (each experiment takes 4-9 seconds)
- Found the same number of bugs (by experience)

[Bar chart: #experiments for brute-force vs. pruned exploration, for write and append workloads with 2 and 3 crashes per run; e.g. 7720 experiments brute force vs. 618 pruned]

Page 48:

Specification simplicity, compared to related work:

Framework                 #Checks  Lines/Check
D3S [NSDI '08]            10       53
Pip [NSDI '06]            44       43
WiDS [NSDI '07]           15       22
P2 Monitor [EuroSys '06]  11       12
DESTINI                   74       5

Page 49:

Summary
- Cloud systems: all good, but they must manage failures; performance, reliability, and availability depend on failure recovery
- FATE and DESTINI: explore multiple, diverse failures systematically; facilitate declarative recovery specifications; a unified framework
- Real-world adoption in progress

Page 50:

- Research background
- FATE and DESTINI

Thanks! Questions?
