
Towards Automatically Checking Thousands of Failures with Micro-Specifications

Haryadi S. Gunawi, Thanh Do†, Pallavi Joshi, Joseph M. Hellerstein, Andrea C. Arpaci-Dusseau†, Remzi H. Arpaci-Dusseau†, Koushik Sen

University of California, Berkeley
† University of Wisconsin, Madison

Cloud Era

• Solve bigger human problems
• Use clusters of thousands of machines


Failures in The Cloud

“The future is a world of failures everywhere” - Garth Gibson

“Recovery must be a first-class operation” - Raghu Ramakrishnan

“Reliability has to come from the software” - Jeffrey Dean


Why Is Failure Recovery Hard?

• Testing is not advanced enough against complex failures
  – Diverse, frequent, and multiple failures
  – Facebook photo loss
• Recovery is underspecified
  – Need to specify failure recovery behaviors
  – Customized well-grounded protocols
• Example: "Paxos Made Live: An Engineering Perspective" [PODC '07]

Our Solutions

• FTS ("FATE"): Failure Testing Service
  – New abstraction for failure exploration
  – Systematically exercises 40,000 unique combinations of failures
• DTS ("DESTINI"): Declarative Testing Specification
  – Enables concise recovery specifications
  – We have written 74 checks (3 lines / check)
• Note: Names have changed since the paper

Summary of Findings

• Applied FATE and DESTINI to three cloud systems: HDFS, ZooKeeper, Cassandra
• Found 16 new bugs
• Reproduced 74 bugs
• Problems found
  – Inconsistency
  – Data loss
  – Rack awareness broken
  – Unavailability

Outline

• Introduction
• FATE
• DESTINI
• Evaluation
• Summary

[Figure: write pipeline with master (M), client (C), and nodes 1, 2, 3 (a fourth node appears in one panel). Panels: no failures; setup stage recovery (recreate fresh pipeline); data transfer stage recovery (continue on surviving nodes); bug in data transfer stage recovery. Labels: Alloc. Req., Setup Stage, Data Transfer Stage; failure points X1, X2, X3.]

Failures at DIFFERENT STAGES lead to DIFFERENT FAILURE BEHAVIORS

Goal: Exercise different failure recovery paths

FATE

• A failure injection framework
  – Targets I/O points
  – Systematically explores failures
  – Multiple failures
• New abstraction of failure scenario
  – Remembers injected failures
  – Increases failure coverage

[Figure: pipeline M, C, 1, 2, 3 with failure injection points (X) marked at several I/O points.]

Failure ID

[Figure: an I/O between node 2 and node 3.]

Category         Field         Value
Static           Func. Call    OutputStream.read()
                 Source File   BlockReceiver.java
Dynamic          Stack Trace   …
Domain specific  Source        Node 2
                 Destination   Node 3
                 Net. Message  Data Packet
Failure Type                   Crash After
Hash                           12348729
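
The failure ID fields above can be pictured as an immutable record whose hash is derived from all of its fields. The following is only a minimal sketch: the class name FailureID, the method digest(), and the choice of SHA-1 are assumptions made for illustration, and the slides do not say how FATE computes its hash.

```python
import hashlib
from dataclasses import dataclass, astuple, replace


@dataclass(frozen=True)
class FailureID:
    """Hypothetical record mirroring the fields listed on the slide."""
    func_call: str      # static: intercepted I/O call
    source_file: str    # static: file containing the call
    stack_trace: str    # dynamic: call stack at the I/O point
    source: str         # domain specific: sending node
    destination: str    # domain specific: receiving node
    net_message: str    # domain specific: message type
    failure_type: str   # kind of failure to inject

    def digest(self) -> str:
        # Stable digest over all fields, so the same failure maps to the same ID across runs.
        joined = "|".join(astuple(self))
        return hashlib.sha1(joined.encode("utf-8")).hexdigest()[:8]


fid = FailureID("OutputStream.read()", "BlockReceiver.java", "...",  # stack trace is elided on the slide
                "Node 2", "Node 3", "Data Packet", "Crash After")
same = replace(fid)                        # identical fields
other = replace(fid, destination="Node 1")  # one field differs
```

Two failures with identical fields get the same digest, which is what lets a failure service remember what it has already injected; changing any single field yields a different ID.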

How Do Developers Build a Failure ID?

• FATE intercepts all I/Os
• Uses AspectJ to collect information at every I/O point
  – I/O buffers (e.g., file buffer, network buffer)
  – Target I/O (e.g., file name, IP address)
• Reverse engineer for domain-specific information


Exploring Failure Space

[Figure: pipeline M, C, 1, 2, 3 with failure points A, B, and C, explored one failure per experiment.]

• Exp #1: A
• Exp #2: B
• Exp #3: C

[Figure: the same pipeline explored with two failures per experiment: AB, AC, BC.]
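
The exploration pictured above (first A, B, C alone, then the pairs AB, AC, BC) can be sketched as an enumeration of combinations. This is an illustrative sketch only: the function name failure_experiments is hypothetical, and it assumes the set of failure points is fixed and known up front, which a real run may not satisfy because recovery can expose new I/O points.

```python
from itertools import combinations


def failure_experiments(failure_points, max_failures):
    """Yield every combination of up to max_failures distinct failure points."""
    for k in range(1, max_failures + 1):
        for combo in combinations(failure_points, k):
            yield combo


# Three failure points, at most two failures per experiment.
experiments = list(failure_experiments(["A", "B", "C"], max_failures=2))
```

With three points and up to two failures this gives six experiments; the count grows combinatorially as points and failures per run increase.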

Outline

• Introduction
• FATE
• DESTINI
• Evaluation
• Summary

DESTINI

• Enables concise recovery specifications
• Checks whether expected behaviors match actual behaviors
• Important elements:
  – Expectations
  – Facts
  – Failure events
  – Check timing
• Interposes on network and disk protocols

Writing Specifications

"Violation if expectation is different from actual facts"

violationTable() :- expectationTable(), NOT-IN actualTable()

Datalog syntax:
  :-   derivation
  ,    AND
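
Read as set operations, the violation rule is a set difference: every row that is expected but not present among the actual facts. A minimal sketch in Python, with a hypothetical helper name and sample tuples chosen only for illustration:

```python
def violations(expectation_table, actual_table):
    """Return the rows that are expected but absent from the actual facts."""
    return set(expectation_table) - set(actual_table)


# Sample (block, node) rows; the values are illustrative.
expectation_table = {("B", "Node 1"), ("B", "Node 2")}
actual_table = {("B", "Node 1")}
violation_table = violations(expectation_table, actual_table)
```

An empty result means the facts cover every expectation; rows present only in the actual table are not flagged by this rule.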

[Figure: pipeline M, C, 1, 2, 3 with a crash (X), shown for correct recovery and for incorrect recovery.]

expectedNodes(Block, Node)
  B  Node 1
  B  Node 2

actualNodes(Block, Node)
  B  Node 1
  B  Node 2

incorrectNodes(Block, Node)
  (empty)

incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N);

[Figure: pipeline M, C, 1, 2, 3 with a crash (X), shown for correct recovery and for incorrect recovery.]

BUILD EXPECTATIONS

expectedNodes(Block, Node)
  B  Node 1
  B  Node 2

CAPTURE FACTS

actualNodes(Block, Node)
  B  Node 1

incorrectNodes(Block, Node)
  B  Node 2

incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N);

Building Expectations

expectedNodes(B, N) :- getBlockPipe(B, N);

expectedNodes(Block, Node)
  B  Node 1
  B  Node 2
  B  Node 3

[Figure: pipeline M, C, 1, 2, 3. The client asks the master, "Give me list of nodes for B"; the reply is [Node 1, Node 2, Node 3].]

Updating Expectation

DEL expectedNodes(B, N) :- fateCrashNode(N), writeStage(B, Stage), Stage == "Data Transfer", expectedNodes(B, N);

expectedNodes(Block, Node)
  B  Node 1
  B  Node 2
  B  Node 3

[Figure: pipeline M, C, 1, 2, 3 with a crash (X).]

• "Client receives all acks from setup stage" → writeStage enters the Data Transfer stage
• Precise failure events
  – Different stages → different recovery behaviors → different specifications
  – FATE and DESTINI must work hand in hand

setupAcks (B, Pos, Ack) :- cdpSetupAck (B, Pos, Ack);
goodAcksCnt (B, COUNT<Ack>) :- setupAcks (B, Pos, Ack), Ack == 'OK';
nodesCnt (B, COUNT<Node>) :- pipeNodes (B, _, N, _);
writeStage (B, Stg) :- nodesCnt (NCnt), goodAcksCnt (ACnt), NCnt == ACnt, Stg := "Data Transfer";
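
The stage-dependent update can be sketched in two steps: derive the write stage from the setup acks, then delete the crashed node from the expectation only when the crash happens in the Data Transfer stage. This is a rough sketch under assumptions: the function names are hypothetical, and the "Setup" return value is a label introduced here for the case the rules leave underived.

```python
def write_stage(pipe_nodes, setup_acks):
    """Return "Data Transfer" once every pipeline node has an OK setup ack, else "Setup"."""
    good_acks = sum(1 for ack in setup_acks.values() if ack == "OK")
    return "Data Transfer" if good_acks == len(pipe_nodes) else "Setup"


def on_crash(expected_nodes, block, crashed_node, stage):
    """Drop the crashed node from the expectation only for a Data Transfer stage crash."""
    if stage == "Data Transfer":
        return expected_nodes - {(block, crashed_node)}
    return set(expected_nodes)


pipeline = ["Node 1", "Node 2", "Node 3"]
expected = {("B", node) for node in pipeline}
acks = {0: "OK", 1: "OK", 2: "OK"}      # pipeline position -> setup ack
stage = write_stage(pipeline, acks)
expected_after = on_crash(expected, "B", "Node 3", stage)
```

A crash during setup leaves the expectation untouched here, since in that stage recovery is expected to recreate a fresh pipeline rather than continue on the surviving nodes.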

Capture Facts

actualNodes(B, N) :- blocksLocation(B, N, Gs), latestGenStamp(B, Gs);

blocksLocation(B, N, Gs)
  B  Node 1  2
  B  Node 2  1
  B  Node 3  1

latestGenStamp(B, Gs)
  B  2

actualNodes(Block, Node)
  B  Node 1

[Figure: pipeline M, C, 1, 2, 3 with a crash (X), shown for correct recovery and for incorrect recovery; the replicas are labeled B_gs2, B_gs1, B_gs1.]
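
The actualNodes rule is a join on the generation stamp: a node counts as holding the block only if its copy carries the latest stamp. A small sketch using the table values from the slide; the function name and the use of a dictionary for latestGenStamp (one latest stamp per block) are assumptions.

```python
def actual_nodes(blocks_location, latest_gen_stamp):
    """Keep (block, node) pairs whose stored generation stamp equals the block's latest one."""
    return {
        (block, node)
        for (block, node, gs) in blocks_location
        if latest_gen_stamp.get(block) == gs
    }


blocks_location = {("B", "Node 1", 2), ("B", "Node 2", 1), ("B", "Node 3", 1)}
latest_gen_stamp = {"B": 2}
facts = actual_nodes(blocks_location, latest_gen_stamp)
```

Nodes 2 and 3 still hold the block at stamp 1, so only Node 1 survives the join.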

Violation and Check-Timing

expectedNodes(Block, Node)
  B  Node 1
  B  Node 2

actualNodes(Block, Node)
  B  Node 1

incorrectNodes(Block, Node)
  B  Node 2

incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N), cnpComplete(B);

• There is a point in time where recovery is ongoing, so specifications are temporarily violated
• Need precise events to decide when the check should be done
  – In this example, upon block completion
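
Check timing can be sketched by gating the comparison on a completion event, so that mismatches seen while recovery is still running are not reported. The class and method names below are hypothetical; only the idea of waiting for block completion comes from the slide.

```python
class NodeCheck:
    """Hold expectations and facts; report violations only after the block completes."""

    def __init__(self):
        self.expected = set()
        self.actual = set()
        self.completed = set()   # blocks for which the completion event was seen

    def on_complete(self, block):
        self.completed.add(block)

    def incorrect_nodes(self):
        return {
            (block, node)
            for (block, node) in self.expected - self.actual
            if block in self.completed
        }


check = NodeCheck()
check.expected = {("B", "Node 1"), ("B", "Node 2")}
check.actual = {("B", "Node 1")}
during_recovery = check.incorrect_nodes()   # block not complete yet: nothing is reported
check.on_complete("B")
after_completion = check.incorrect_nodes()
```

Before the completion event the mismatch is ignored; afterwards the missing node is reported as a violation.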

Rules

r1   incorrectNodes (B, N) :- cnpComplete (B), expectedNodes (B, N), NOT-IN actualNodes (B, N);
r2   pipeNodes (B, Pos, N) :- getBlkPipe (UFile, B, Gs, Pos, N);
r3   expectedNodes (B, N) :- getBlkPipe (UFile, B, Gs, Pos, N);
r4   DEL expectedNodes (B, N) :- fateCrashNode (N), pipeStage (B, Stg), Stg == 2, expectedNodes (B, N);
r5   setupAcks (B, Pos, Ack) :- cdpSetupAck (B, Pos, Ack);
r6   goodAcksCnt (B, COUNT<Ack>) :- setupAcks (B, Pos, Ack), Ack == 'OK';
r7   nodesCnt (B, COUNT<Node>) :- pipeNodes (B, _, N, _);
r8   pipeStage (B, Stg) :- nodesCnt (NCnt), goodAcksCnt (ACnt), NCnt == ACnt, Stg := 2;
r9   blkGenStamp (B, Gs) :- dnpNextGenStamp (B, Gs);
r10  blkGenStamp (B, Gs) :- cnpGetBlkPipe (UFile, B, Gs, _, _);
r11  diskFiles (N, File) :- fsCreate (N, File);
r12  diskFiles (N, Dst) :- fsRename (N, Src, Dst), diskFiles (N, Src, Type);
r13  DEL diskFiles (N, Src) :- fsRename (N, Src, Dst), diskFiles (N, Src, Type);
r14  fileTypes (N, File, Type) :- diskFiles (N, File), Type := Util.getType(File);
r15  blkMetas (N, B, Gs) :- fileTypes (N, File, Type), Type == metafile, Gs := Util.getGs(File);
r16  actualNodes (B, N) :- blkMetas (N, B, Gs), blkGenStamp (B, Gs);

• Capture facts and build expectations from I/O events
  – No need to interpose internal functions
• Specification reuse
  – For the first check, the rules-to-checks ratio is 16:1
  – Overall, the rules-to-checks ratio is 3:1
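
Rules r11 to r16 build the actual facts purely from file-system events. The sketch below replays create and rename events into a set of files per node, extracts block metadata from file names, and joins on the generation stamp. It rests on assumptions: the event tuples, the function names, and the blk_<block>_<gs>.meta naming pattern are illustrative stand-ins, since the slide does not show what Util.getType and Util.getGs do.

```python
import re

# Assumed metadata file naming scheme: blk_<block id>_<generation stamp>.meta
META_NAME = re.compile(r"^blk_(?P<block>\d+)_(?P<gs>\d+)\.meta$")


def replay(events):
    """Replay (op, node, args...) file-system events into a set of (node, file) facts."""
    disk_files = set()
    for event in events:
        op, node = event[0], event[1]
        if op == "create":                      # r11: a created file exists on the node
            disk_files.add((node, event[2]))
        elif op == "rename":                    # r12 and r13: move a known file
            src, dst = event[2], event[3]
            if (node, src) in disk_files:
                disk_files.discard((node, src))
                disk_files.add((node, dst))
    return disk_files


def blk_metas(disk_files):
    """Extract (node, block, generation stamp) from metadata file names (r14, r15)."""
    metas = set()
    for node, name in disk_files:
        match = META_NAME.match(name)
        if match:
            metas.add((node, match["block"], int(match["gs"])))
    return metas


def actual_nodes_from_disk(metas, blk_gen_stamp):
    """Join block metadata with the block's generation stamp, as r16 does."""
    return {(block, node) for (node, block, gs) in metas if blk_gen_stamp.get(block) == gs}


events = [
    ("create", "Node 1", "blk_7_1.meta"),
    ("create", "Node 2", "blk_7_1.meta"),
    ("rename", "Node 1", "blk_7_1.meta", "blk_7_2.meta"),
]
files = replay(events)
nodes = actual_nodes_from_disk(blk_metas(files), {"7": 2})
```

Only Node 1 renamed its metadata file to the newer stamp, so only Node 1 ends up in the actual facts, without interposing on any internal function of the system under test.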

Outline

• Introduction
• FATE
• DESTINI
• Evaluation
• Summary

Evaluation

• FATE: 3900 lines, DESTINI: 1200 lines
• Applied FATE and DESTINI to three cloud systems
  – HDFS, ZooKeeper, Cassandra
• 40,000 unique combinations of failures
• Found 16 new bugs, reproduced 74 bugs
• 74 recovery specifications
  – 3 lines / check

Bugs Found

• Reduced availability and performance
• Data loss due to multiple failures
• Data loss in log recovery protocol
• Data loss in append protocol
• Rack awareness property is broken

Conclusion

• FATE explores multiple failures systematically
• DESTINI enables concise recovery specifications
• FATE and DESTINI: a unified framework
  – Testing recovery specifications requires a failure service
  – Failure service needs recovery specifications to catch recovery bugs

Thank you!

QUESTIONS?

Download our full TR paper from these websites:

The Advanced Systems Laboratory
http://www.cs.wisc.edu/adsl

Berkeley Orders of Magnitude
http://boom.cs.berkeley.edu

New Challenges

• Exponential growth of multiple failures
  – FATE exercised 40,000 failure combinations in 80 hours

DESTINI vs. Related Work

Framework    # Checks   Lines/check
D3S          10         53
Pip          44         43
WiDS         15         22
P2 Monitor   11         12
DESTINI      74         3

FATE Architecture

[Figure: components labeled HDFS, Java SDK, Failure Surface, Failure Server (Filters, Fail / No Fail?), and Workload Driver.]

Workload Driver:
while (server injects new failureIDs) {
    runWorkload();  // e.g., hdfs.write
}
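
The workload driver loop on this slide can be sketched with a toy failure server that remembers injected failure IDs and fails at most once per run. Everything here is a simplified stand-in, not FATE's implementation: the class, its methods, and the workload that stops at the first injected failure are assumptions, and a real system would go on to run its recovery path.

```python
class FailureServer:
    """Toy stand-in for the failure server: remembers which failure IDs were injected."""

    def __init__(self):
        self.injected = set()
        self.armed = None          # failure ID chosen for the current run, if any

    def start_run(self):
        self.armed = None

    def should_fail(self, failure_id):
        """Called at each I/O point; fail at most once per run, and only on an unseen ID."""
        if self.armed is None and failure_id not in self.injected:
            self.armed = failure_id
            self.injected.add(failure_id)
            return True
        return False

    def injected_new_failure(self):
        return self.armed is not None


def run_workload(server, io_points):
    """Hypothetical workload: visit each I/O point and stop at the first injected failure."""
    for point in io_points:
        if server.should_fail(point):
            return point
    return None


def explore(server, io_points):
    """Rerun the workload until a run finishes without any new failure being injected."""
    history = []
    while True:
        server.start_run()
        failed_at = run_workload(server, io_points)
        if not server.injected_new_failure():
            return history
        history.append(failed_at)


explored = explore(FailureServer(), ["A", "B", "C"])
```

Each run exercises one failure ID that has not been seen before, and the loop ends on the first run in which the server has nothing new to inject.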

DESTINI

[Figure: DESTINI architecture, with components labeled N, D, C, and FATE. Example rule shown: stateY(..) :- cnpEv(..), state(X);]

Current State of the Art

• Failure exploration
  – Rarely deals with multiple failures
  – Or uses a random approach
• System specifications
  – Unit test checking: cumbersome
  – WiDS, Pip: not integrated with a failure service

[Figure: pipeline M, C, 1, 2, 3 (a fourth node appears in one panel) with failure points X1, X2, X3. Panels: no failures; recovery 1 (recreate fresh pipeline); recovery 2 (continue on surviving nodes); bug in recovery 2.]

Failure IDs shown in the figure:

• Static: InputStream.read(); Domain: Src: Node 1, Dest: Node 2, Type: Data Transfer
• Static: InputStream.read(); Domain: Src: Node 2, Dest: Node 3, Type: Data Transfer
• Static: InputStream.read(); Domain: Src: Node 1, Dest: Node 2, Type: Setup

