RICON keynote: outwards from the middle of the maze

Date post: 14-Jun-2015
Upload: palvaro
Description:
slides from my RICON keynote
Transcript
Page 1: RICON keynote: outwards from the middle of the maze

Outwards from the middle of the maze

Peter Alvaro UC Berkeley

Page 2: RICON keynote: outwards from the middle of the maze

Outline

1.  Mourning the death of transactions
2.  What is so hard about distributed systems?
3.  Distributed consistency: managing asynchrony
4.  Fault-tolerance: progress despite failures

Page 3: RICON keynote: outwards from the middle of the maze

The transaction concept

DEBIT_CREDIT:
    BEGIN_TRANSACTION;
    GET MESSAGE;
    EXTRACT ACCOUNT_NUMBER, DELTA, TELLER, BRANCH
        FROM MESSAGE;
    FIND ACCOUNT(ACCOUNT_NUMBER) IN DATA BASE;
    IF NOT_FOUND | ACCOUNT_BALANCE + DELTA < 0 THEN
        PUT NEGATIVE RESPONSE;
    ELSE DO;
        ACCOUNT_BALANCE = ACCOUNT_BALANCE + DELTA;
        POST HISTORY RECORD ON ACCOUNT (DELTA);
        CASH_DRAWER(TELLER) = CASH_DRAWER(TELLER) + DELTA;
        BRANCH_BALANCE(BRANCH) = BRANCH_BALANCE(BRANCH) + DELTA;
        PUT MESSAGE ('NEW BALANCE =' ACCOUNT_BALANCE);
        END;
    COMMIT;
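One way to read the DEBIT_CREDIT program is as an all-or-nothing unit: either every update lands or none does. The sketch below illustrates that reading in Python with the standard sqlite3 module. The schema, the table names, and the helpers make_bank and debit_credit are assumptions made for illustration; none of them come from the slides.

```python
import sqlite3


def make_bank():
    """Build a tiny in-memory bank (hypothetical schema, for illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE account(id INTEGER PRIMARY KEY, balance INTEGER);
        CREATE TABLE history(account INTEGER, delta INTEGER);
        CREATE TABLE cash_drawer(teller INTEGER PRIMARY KEY, balance INTEGER);
        CREATE TABLE branch(id INTEGER PRIMARY KEY, balance INTEGER);
        INSERT INTO account VALUES (1, 100);
        INSERT INTO cash_drawer VALUES (7, 500);
        INSERT INTO branch VALUES (3, 1000);
    """)
    return conn


def debit_credit(conn, account, delta, teller, branch):
    """Apply delta to the account, teller drawer, and branch as one transaction.

    Returns the new balance, or None for the "negative response" case
    (unknown account, or the balance would drop below zero).
    """
    try:
        with conn:  # commits on normal exit, rolls back if an exception escapes
            row = conn.execute(
                "SELECT balance FROM account WHERE id = ?", (account,)
            ).fetchone()
            if row is None or row[0] + delta < 0:
                raise ValueError("negative response")
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (delta, account))
            conn.execute(
                "INSERT INTO history(account, delta) VALUES (?, ?)",
                (account, delta))
            conn.execute(
                "UPDATE cash_drawer SET balance = balance + ? WHERE teller = ?",
                (delta, teller))
            conn.execute(
                "UPDATE branch SET balance = balance + ? WHERE id = ?",
                (delta, branch))
            return row[0] + delta
    except ValueError:
        return None
```

The application states its invariant (the balance never goes negative) once, inside the transaction, and leaves the rest to the store.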

Page 7: RICON keynote: outwards from the middle of the maze

The “top-down” ethos

Page 13: RICON keynote: outwards from the middle of the maze

Transactions: a holistic contract

Write   Read  

Application

Opaque store

Transactions

Page 14: RICON keynote: outwards from the middle of the maze

Transactions: a holistic contract

Write   Read  

Application

Opaque store

Transactions

Assert: balance > 0

Page 18: RICON keynote: outwards from the middle of the maze

Incidental complexities

•  The “Internet.” Searching it.
•  Cross-datacenter replication schemes
•  CAP Theorem
•  Dynamo & MapReduce
•  “Cloud”

Page 19: RICON keynote: outwards from the middle of the maze

Fundamental complexity

“[…] distributed systems require that the programmer be aware of latency, have a different model of memory access, and take into account issues of concurrency and partial failure.”

Jim Waldo et al., A Note on Distributed Computing (1994)

Page 20: RICON keynote: outwards from the middle of the maze

A holistic contract …stretched to the limit

Write   Read  

Application

Opaque store

Transactions

Page 22: RICON keynote: outwards from the middle of the maze

Are you blithely asserting that transactions aren’t webscale?

Some people just want to see the world burn. Those same people want to see the world use inconsistent databases.

- Emin Gun Sirer

Page 23: RICON keynote: outwards from the middle of the maze

Alternative to top-down design?

The “bottom-up,” systems tradition: Simple, reusable components first. Semantics later.

Page 24: RICON keynote: outwards from the middle of the maze

Alternative: the “bottom-up,” systems ethos

Page 25: RICON keynote: outwards from the middle of the maze

The “bottom-up” ethos

Page 31: RICON keynote: outwards from the middle of the maze

The “bottom-up” ethos

“‘Tis a fine barn, but sure ‘tis no castle, English”

Page 32: RICON keynote: outwards from the middle of the maze

The “bottom-up” ethos

Simple, reusable components first. Semantics later. This is how we live now. Question: Do we ever get those application-level guarantees back?

Page 33: RICON keynote: outwards from the middle of the maze

Low-level contracts

Write   Read  

Application

Distributed store KVS

Page 37: RICON keynote: outwards from the middle of the maze

Low-level contracts

Write   Read  

Application

Distributed store KVS

Assert: balance > 0

causal? PRAM? delta? fork/join? red/blue? Release?

R1(X=1)  R2(X=1)  W1(X=2)  W2(X=0)  

W1(X=1)  W1(Y=2)  R2(Y=2)  R2(X=0)  

Page 38: RICON keynote: outwards from the middle of the maze

When do contracts compose?

Application

Distributed service

Assert: balance > 0

Page 39: RICON keynote: outwards from the middle of the maze

Ew, did I get mongo in my riak?

Assert: balance > 0

Page 40: RICON keynote: outwards from the middle of the maze

Composition is the last hard problem

Composing modules is hard enough.
We must learn how to compose guarantees.

Page 41: RICON keynote: outwards from the middle of the maze

Outline

1.  Mourning the death of transactions
2.  What is so hard about distributed systems?
3.  Distributed consistency: managing asynchrony
4.  Fault-tolerance: progress despite failures

Page 42: RICON keynote: outwards from the middle of the maze

Why distributed systems are hard²

Asynchrony Partial Failure

Fundamental Uncertainty

Page 43: RICON keynote: outwards from the middle of the maze

Asynchrony isn’t that hard

Amelioration:

Logical timestamps
Deterministic interleaving
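As a reminder of what a logical timestamp looks like in practice, here is a minimal sketch of a Lamport clock in Python. The class name, the method names, and the total_order helper are assumptions made for illustration. Sorting events by (timestamp, node id) is one simple way to obtain a deterministic interleaving; the slides do not say which scheme the speaker had in mind.

```python
class LamportClock:
    """A logical clock: a counter that only moves forward."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Advance the clock for a local event and return the new timestamp."""
        self.time += 1
        return self.time

    def send(self):
        """Timestamp an outgoing message (sending counts as a local event)."""
        return self.tick()

    def receive(self, msg_time):
        """Merge the sender's timestamp so the receive is ordered after the send."""
        self.time = max(self.time, msg_time) + 1
        return self.time


def total_order(events):
    """Order (timestamp, node_id, label) events deterministically.

    Ties on the timestamp are broken by node id, so every replica that sees
    the same set of events agrees on the same interleaving.
    """
    return sorted(events, key=lambda event: (event[0], event[1]))
```

A receive always carries a larger timestamp than the matching send, which is the property that lets replicas agree on an order without sharing a physical clock.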

Page 44: RICON keynote: outwards from the middle of the maze

Partial failure isn’t that hard

Amelioration:

Replication
Replay

Page 45: RICON keynote: outwards from the middle of the maze

(asynchrony * partial failure) = hard²

Logical timestamps Deterministic interleaving

Replication Replay

Page 47: RICON keynote: outwards from the middle of the maze

(asynchrony * partial failure) = hard²

Tackling one clown at a time

Poor strategy for programming distributed systems
Winning strategy for analyzing distributed programs

 

Page 48: RICON keynote: outwards from the middle of the maze

Outline

1.  Mourning the death of transactions
2.  What is so hard about distributed systems?
3.  Distributed consistency: managing asynchrony
4.  Fault-tolerance: progress despite failures

Page 49: RICON keynote: outwards from the middle of the maze

Distributed consistency

Today: A quick summary of some great work.

Page 50: RICON keynote: outwards from the middle of the maze

Consider a (distributed) graph

[Figure: graph over nodes T1 through T14]

Page 51: RICON keynote: outwards from the middle of the maze

Partitioned, for scalability

[Figure: the graph over nodes T1 through T14, partitioned]

Page 52: RICON keynote: outwards from the middle of the maze

Replicated, for availability

[Figure: two replicas of the graph over nodes T1 through T14]

Page 53: RICON keynote: outwards from the middle of the maze

Deadlock detection

Task: Identify strongly-connected components

Waits-for graph

[Figure: waits-for graph over nodes T1 through T14]
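The deadlock-detection task above can be sketched on a single machine as follows. This uses Tarjan's algorithm for strongly connected components, which is a choice made here for illustration; the slides do not name an algorithm. The function names and the example edges in the usage note are hypothetical, since the edges of the slide's graph are not recoverable from the transcript.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm. graph maps each node to an iterable of successors."""
    index, low = {}, {}
    stack, on_stack = [], set()
    components = []
    counter = [0]

    def visit(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:
            # v is the root of a component: pop the component off the stack.
            component = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.add(w)
                if w == v:
                    break
            components.append(component)

    for v in list(graph):
        if v not in index:
            visit(v)
    return components


def deadlocks(waits_for):
    """Return the sets of transactions that wait on each other in a cycle."""
    return [
        component
        for component in strongly_connected_components(waits_for)
        if len(component) > 1
        or any(v in waits_for.get(v, ()) for v in component)
    ]
```

For example, with the hypothetical edges `{"T1": ["T2"], "T2": ["T3"], "T3": ["T1"], "T4": ["T1"]}`, T1, T2, and T3 form a deadlock, while T4 merely waits on it.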

Page 54: RICON keynote: outwards from the middle of the maze

Garbage collection

Task: Identify nodes not reachable from Root.

Refers-to graph

[Figure: refers-to graph over nodes T1 through T14, with a designated Root]

Page 55: RICON keynote: outwards from the middle of the maze

Correctness

Deadlock detection
•  Safety: No false positives
•  Liveness: Identify all deadlocks

Garbage collection
•  Safety: Never GC live memory!
•  Liveness: GC all orphaned memory

[Figure: graph over nodes T1 through T14]

Page 60: RICON keynote: outwards from the middle of the maze

Consistency at the extremes

Storage   Object   Flow   Language   Application

Linearizable key-value store?

   Custom solutions?

Efficient Correct

Page 61: RICON keynote: outwards from the middle of the maze

Object-level consistency

Capture semantics of data structures that
•  allow greater concurrency
•  maintain guarantees (e.g. convergence)

Storage   Object   Flow   Language   Application

Page 62: RICON keynote: outwards from the middle of the maze

Object-level consistency

Page 66: RICON keynote: outwards from the middle of the maze

Object-level consistency

Insert   Read

Convergent data structure (e.g., Set CRDT)

•  Commutativity: tolerant to reordering
•  Associativity: tolerant to batching
•  Idempotence: tolerant to retry/duplication
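To make the three algebraic properties concrete, here is a minimal sketch of a grow-only set, one of the simplest convergent data structures. The class name GSet and its methods are assumptions made for illustration, not the API of any particular CRDT library. Merging is set union, which is commutative, associative, and idempotent.

```python
class GSet:
    """A grow-only set: elements can be inserted but never removed."""

    def __init__(self, items=()):
        self.items = frozenset(items)

    def insert(self, item):
        """Return a new replica state that also contains item."""
        return GSet(self.items | {item})

    def merge(self, other):
        """Combine two replica states. Union is the least upper bound."""
        return GSet(self.items | other.items)

    def read(self):
        return self.items

    def __eq__(self, other):
        return isinstance(other, GSet) and self.items == other.items

    def __hash__(self):
        return hash(self.items)
```

Two replicas that receive the same inserts in different orders, in different batches, or with duplicates end up in the same state, so no coordination is needed for them to converge.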

Page 69: RICON keynote: outwards from the middle of the maze

Application

Convergent data structures

Object-level composition?

?   ?  

GC Assert: No live nodes are reclaimed

Assert: Graph replicas converge

Page 70: RICON keynote: outwards from the middle of the maze

Flow-level consistency  

Storage   Object   Flow   Language   Application

Page 71: RICON keynote: outwards from the middle of the maze

Flow-level consistency  

Capture semantics of data in motion
•  Asynchronous dataflow model
•  component properties → system-wide guarantees

[Figure: dataflow over the components Graph store, Transaction manager, Transitive closure, Deadlock detector]

Page 72: RICON keynote: outwards from the middle of the maze

Flow-level consistency

Order-insensitivity (confluence)

output  set  =  f(input  set)      
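The definition above says the output set depends only on the input set, not on the order in which inputs arrive. The following sketch checks that property by brute force over every delivery order. All names are assumptions made for illustration: closure_step is a confluent component (incremental transitive closure), and first_wins_step is a deliberately order-sensitive component included for contrast.

```python
from itertools import permutations


def transitive_closure(edges):
    """Naive fixpoint computation of the transitive closure of a set of edges."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def closure_step(state, edge):
    """Confluent: the state is always the closure of every edge seen so far."""
    return transitive_closure(state | {edge})


def first_wins_step(state, edge):
    """Not confluent: keep only the first edge seen for each source node."""
    if any(src == edge[0] for (src, _) in state):
        return state
    return state | {edge}


def run_incrementally(step, inputs):
    """Deliver inputs one at a time, in the given order, to a stateful component."""
    state = set()
    for item in inputs:
        state = step(state, item)
    return state


def outcomes(step, inputs):
    """The set of distinct final states over every possible delivery order."""
    return {frozenset(run_incrementally(step, order))
            for order in permutations(inputs)}


# Hypothetical input edges.
EDGES = [("A", "B"), ("B", "C"), ("A", "D")]
```

With these edges, `outcomes(closure_step, EDGES)` contains a single final state, while `outcomes(first_wins_step, EDGES)` contains two, depending on which of A's edges arrives first.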

Page 78: RICON keynote: outwards from the middle of the maze

Confluence is compositional

output set = f ∘ g(input set)

Page 82: RICON keynote: outwards from the middle of the maze

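Graph queries as dataflow

[Figure: two dataflows.
Deadlock detection: Graph store, Transaction manager, Transitive closure, Deadlock detector. Every component is labeled Confluent.
Garbage collection: Graph store, Memory allocator, Transitive closure, Garbage collector. One component is labeled Not Confluent and the others Confluent, with a “Coordinate here” marker.]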

Page 83: RICON keynote: outwards from the middle of the maze

Coordination: what is that?

[Figure: the garbage-collection dataflow (Graph store, Memory allocator, Transitive closure, Garbage collector), with one component labeled Not Confluent and a “Coordinate here” marker]

Strategy 1: Establish a total order

Page 84: RICON keynote: outwards from the middle of the maze

Coordination: what is that?

[Figure: the garbage-collection dataflow (Graph store, Memory allocator, Transitive closure, Garbage collector), with one component labeled Not Confluent and a “Coordinate here” marker]

Strategy 2: Establish a producer-consumer barrier

Page 85: RICON keynote: outwards from the middle of the maze

Fundamental costs: FT via replication

[Figure: replicated deadlock-detection dataflow (Graph store, Transaction manager, Transitive closure, Deadlock detector); every component is labeled Confluent]

(mostly) free!

Page 86: RICON keynote: outwards from the middle of the maze

Fundamental costs: FT via replication

[Figure: replicated dataflow ending in the Garbage collector, with components labeled Confluent or Not Confluent and Paxos between the replicas]

global synchronization!

Page 87: RICON keynote: outwards from the middle of the maze

Fundamental costs: FT via replication

[Figure: replicated dataflow ending in the Garbage collector, with a Barrier next to each component labeled Not Confluent]

The first principle of successful scalability is to batter the consistency mechanisms down to a minimum. – James Hamilton  

Page 88: RICON keynote: outwards from the middle of the maze

Language-level consistency  

DSLs for distributed programming?
•  Capture consistency concerns in the type system

Storage   Object   Flow   Language   Application

Page 89: RICON keynote: outwards from the middle of the maze

Language-level consistency  

CALM Theorem:

Monotonic → confluent

Conservative, syntactic test for confluence

 

Page 91: RICON keynote: outwards from the middle of the maze

Language-level consistency

Deadlock detector

Garbage collector

nonmonotonic  
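A small sketch of why the garbage collector is the nonmonotonic one. Reachability from the root only grows as more edges arrive, so a conclusion drawn from partial input is never retracted. "Not reachable" is a negation, so a conclusion drawn early may have to be withdrawn once more of the graph is known. The node names, edge sets, and function names below are assumptions made for illustration.

```python
def reachable(edges, root):
    """Nodes reachable from root by following (source, target) edges."""
    seen = {root}
    frontier = [root]
    while frontier:
        v = frontier.pop()
        for (a, b) in edges:
            if a == v and b not in seen:
                seen.add(b)
                frontier.append(b)
    return seen


def garbage(nodes, edges, root):
    """Nodes that are not reachable from root: a nonmonotonic query."""
    return set(nodes) - reachable(edges, root)


# Hypothetical graph, observed at two moments as edges arrive over the network.
NODES = {"Root", "T1", "T2", "T3"}
EARLY = {("Root", "T1")}
LATE = EARLY | {("T1", "T2")}
```

With only the EARLY edges, T2 looks like garbage; once the edge from T1 to T2 arrives, it turns out to be live. A collector that acted on the early answer would violate the safety property, which is why this component needs coordination before it can decide.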

Page 92: RICON keynote: outwards from the middle of the maze

Let’s review

•  Consistency is tolerance to asynchrony
•  Tricks:
   – focus on data in motion, not at rest
   – avoid coordination when possible
   – choose coordination carefully otherwise

(Tricks are great, but tools are better)

Page 93: RICON keynote: outwards from the middle of the maze

Outline

1.  Mourning the death of transactions
2.  What is so hard about distributed systems?
3.  Distributed consistency: managing asynchrony
4.  Fault-tolerance: progress despite failures

Page 94: RICON keynote: outwards from the middle of the maze

Grand challenge: composition

Hard problem: Is a given component fault-tolerant?
Much harder: Is this system (built up from components) fault-tolerant?

Page 95: RICON keynote: outwards from the middle of the maze

Example: Atomic multi-partition update

[Figure: partitioned graph over nodes T1 through T14]

Two-phase commit

Page 96: RICON keynote: outwards from the middle of the maze

Example: replication

[Figure: two replicas of the graph over nodes T1 through T14]

Reliable broadcast

Page 97: RICON keynote: outwards from the middle of the maze

Popular wisdom: don’t reinvent

Page 98: RICON keynote: outwards from the middle of the maze

Example: Kafka replication bug

Three “correct” components:
1.  Primary/backup replication
2.  Timeout-based failure detectors
3.  Zookeeper

One nasty bug: Acknowledged writes are lost

Page 99: RICON keynote: outwards from the middle of the maze

A guarantee would be nice

Bottom-up approach:
•  Use formal methods to verify individual components (e.g. protocols)
•  Build systems from verified components

Shortcomings:
•  Hard to use
•  Hard to compose

[Figure: plot of Returns against Investment]

Page 100: RICON keynote: outwards from the middle of the maze

Bottom-up assurances

Formal verification

Program   Environment   Correctness Spec

Page 102: RICON keynote: outwards from the middle of the maze

Composing bottom-up assurances  

Issue 1: Incompatible failure models (e.g., crash failures vs. omissions)
Issue 2: Specs do not compose (FT is an end-to-end property)

If you take 10 components off the shelf, you are putting 10 world views together, and the result will be a mess. -- Butler Lampson

Page 108: RICON keynote: outwards from the middle of the maze

Top-down “assurances”

Fault injection Testing

Page 110: RICON keynote: outwards from the middle of the maze

End-to-end testing would be nice

Top-down approach:
•  Build a large-scale system
•  Test the system under faults

Shortcomings:
•  Hard to identify complex bugs
•  Fundamentally incomplete

[Figure: plot of Returns against Investment]

Page 111: RICON keynote: outwards from the middle of the maze

Lineage-driven fault injection

Goal: top-down testing that
•  finds all of the fault-tolerance bugs, or
•  certifies that none exist

Page 112: RICON keynote: outwards from the middle of the maze

Lineage-driven fault injection

Correctness Specification

Malevolent sentience

Molly

Page 114: RICON keynote: outwards from the middle of the maze

Lineage-driven fault injection (LDFI)

Approach: think backwards from outcomes
Question: could a bad thing ever happen?
Reframe:
•  Why did a good thing happen?
•  What could have gone wrong along the way?

Page 115: RICON keynote: outwards from the middle of the maze

Thomasina: What a faint-heart! We must work outward from the middle of the maze. We will start with something simple.

Page 116: RICON keynote: outwards from the middle of the maze

The game

•  Both players agree on a failure model
•  The programmer provides a protocol
•  The adversary observes executions and chooses failures for the next execution.

Page 117: RICON keynote: outwards from the middle of the maze

Dedalus: it’s about data

log(B, “data”)@5  

What

Where

When

Some data

Page 119: RICON keynote: outwards from the middle of the maze

Dedalus: it’s like Datalog

consequence :- premise[s]

log(Node, Pload) :- bcast(Node, Pload);

(Which is like SQL)

create view log as select Node, Pload from bcast;

Page 121: RICON keynote: outwards from the middle of the maze

Dedalus: it’s about time

consequence@when :- premise[s]

node(Node, Neighbor)@next :- node(Node, Neighbor);

log(Node2, Pload)@async :- bcast(Node1, Pload),
                           node(Node1, Node2);

Natural join (bcast.Node1 == node.Node1)

State change

Communication

Page 122: RICON keynote: outwards from the middle of the maze

The match

Protocol: Reliable broadcast

Specification:

Pre: A correct process delivers a message m
Post: All correct processes deliver m

Failure Model:

(Permanent) crash failures
Message loss / partitions

Page 123: RICON keynote: outwards from the middle of the maze

Round 1

“An effort” delivery protocol

node(Node, Neighbor)@next :- node(Node, Neighbor);
log(Node, Pload)@next :- log(Node, Pload);

log(Node, Pload) :- bcast(Node, Pload);

log(Node2, Pload)@async :- bcast(Node1, Pload),
                           node(Node1, Node2);

Page 124: RICON keynote: outwards from the middle of the maze

Round 1 in space / time

[Figure: space-time diagram. Process a sends log messages at time 1; Process b and Process c receive them at time 2.]

Page 126: RICON keynote: outwards from the middle of the maze

Round 1: Lineage  

log(B,  data)@5    

log(B,  data)@4    

log(Node, Pload)@next :- log(Node, Pload);

log(B, data)@5 :- log(B, data)@4;

Page 129: RICON keynote: outwards from the middle of the maze

Round 1: Lineage  

log(B,  data)@5    

log(B,  data)@4    

log(B,  data)@3    

log(B,data)@2    

log(A,  data)@1    

log(Node2, Pload)@async :- bcast(Node1, Pload),
                           node(Node1, Node2);

log(B, data)@2 :- bcast(A, data)@1,
                  node(A, B)@1;

Page 130: RICON keynote: outwards from the middle of the maze

An execution is a (fragile) “proof” of an outcome

[Figure: proof trees deriving log(B, data)@5 from log(A, data)@1 and node(A, B)@1. Rule r1 carries log facts forward one timestep, rule r3 carries node facts forward, and rule r2 sends the message from A to B. The candidate sends are labeled AB1 through AB4 (a send at time 1, 2, 3, or 4), with the annotation AB1 ∧ AB2 ∧ AB3 ∧ AB4.]

(which required a message from A to B at time 1)

Page 131: RICON keynote: outwards from the middle of the maze

Valentine: “The unpredictable and the predetermined unfold together to make everything the way it is.”

Page 132: RICON keynote: outwards from the middle of the maze

Round 1: counterexample

The adversary wins!

[Figure: space-time diagram. One of the log messages that Process a sends at time 1 is marked LOST; the other is delivered at time 2.]

Page 133: RICON keynote: outwards from the middle of the maze

Round 2

Same as Round 1, but A retries.

bcast(N, P)@next :- bcast(N, P);

Page 134: RICON keynote: outwards from the middle of the maze

Round 2 in spacetime

[Figure: space-time diagram. Process a sends log messages to Process b and Process c at each of times 1 through 4; they arrive at times 2 through 5.]

Page 136: RICON keynote: outwards from the middle of the maze

Round 2  

log(B,  data)@5    

log(B,  data)@4    

log(Node, Pload)@next :- log(Node, Pload);

log(B, data)@5 :- log(B, data)@4;

Page 137: RICON keynote: outwards from the middle of the maze

Round 2  

log(B,  data)@5    

log(B,  data)@4    

log(A,  data)@4    

log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2);

log(B, data)@3 :- bcast(A, data)@2, node(A, B)@2;

Page 141: RICON keynote: outwards from the middle of the maze

Round 2  

log(B,  data)@5    

log(B,  data)@4    

log(A,  data)@4    

log(B,  data)@3    

log(A,  data)@3    

log(B,data)@2    

log(A,  data)@2    

log(A,  data)@1    

Retry provides redundancy in time

Page 142: RICON keynote: outwards from the middle of the maze

Traces are forests of proof trees

[Figure: four proof trees for log(B, data)@5, one per send from A to B (AB1 at time 1 through AB4 at time 4). In each tree, rule r1 carries log(A, data) forward, rule r3 carries node(A, B) forward, rule r2 sends the message, and r1 carries log(B, data) forward to time 5. Annotation: AB1 ∧ AB2 ∧ AB3 ∧ AB4.]
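One way to read the AB1 ∧ AB2 ∧ AB3 ∧ AB4 annotation: each proof tree is an independent way of reaching the good outcome, so an adversary that wants to prevent it has to break every tree, and breaking a tree means removing at least one message it depends on. The sketch below treats that as a minimal hitting-set search. Summarizing each tree by the set of message labels it uses, and solving by brute force, are assumptions made for illustration; the slides do not show how the search is actually carried out.

```python
from itertools import combinations


def minimal_hitting_sets(supports):
    """Smallest sets of failures that intersect every support.

    supports is a list of sets; each set names the messages that one proof
    tree of the good outcome depends on.
    """
    universe = sorted(set().union(*supports))
    found = []
    for size in range(1, len(universe) + 1):
        for candidate in combinations(universe, size):
            chosen = set(candidate)
            breaks_every_tree = all(chosen & support for support in supports)
            is_minimal = not any(smaller <= chosen for smaller in found)
            if breaks_every_tree and is_minimal:
                found.append(chosen)
    return found
```

With a single proof tree that depends on AB1, dropping AB1 is enough. With four trees, one per retry, the adversary would have to drop all four messages, which is the sense in which retry adds redundancy in time.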

Page 144: RICON keynote: outwards from the middle of the maze

Round  2:  counterexample  

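[Space-time diagram: Process b, Process a, Process c, with time labels 1 and 2. Two log messages are shown, one of them marked LOST, and a process is marked CRASHED.]

The adversary wins!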

Page 145: RICON keynote: outwards from the middle of the maze

Round 3

Same  as  in  Round  2,  but  symmetrical.  

bcast(N, P)@next :- log(N, P);
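To see why the symmetrical rule matters, the two rounds can be compared in a toy synchronous simulation. This is a sketch under stated assumptions, not the talk's Dedalus program: `run_broadcast` and its parameters are hypothetical, messages sent at time t arrive at t + 1, and the failure scenario (a's message to b lost at time 1, a crashing at time 2) is an assumed reading of the Round 2 counterexample.

```python
from typing import Dict, Set, Tuple

Drop = Tuple[str, str, int]  # (sender, receiver, send time)


def run_broadcast(nodes: Set[str], origin: str, senders: Set[str],
                  drops: Set[Drop], crashes: Dict[str, int],
                  eot: int = 5) -> Set[str]:
    """Return the nodes that hold the log entry at time `eot`.

    A message sent at time t arrives at t + 1 unless it is in `drops`.
    A node that crashes at time t sends nothing from t onwards.
    Only nodes listed in `senders` ever (re)transmit.
    """
    has_log = {origin}
    for t in range(1, eot):
        arrived = set()
        for sender in has_log & senders:
            if crashes.get(sender, eot + 1) <= t:
                continue  # crashed nodes stay silent
            for receiver in nodes - {sender}:
                if (sender, receiver, t) not in drops:
                    arrived.add(receiver)
        has_log |= arrived  # log entries persist once written
    return has_log


NODES = {"a", "b", "c"}
DROPS = {("a", "b", 1)}  # a's first message to b is lost
CRASHES = {"a": 2}       # a crashes before it can retry

round2 = run_broadcast(NODES, "a", {"a"}, DROPS, CRASHES)  # only a retries
round3 = run_broadcast(NODES, "a", NODES, DROPS, CRASHES)  # everyone relays
print(sorted(round2))  # ['a', 'c']
print(sorted(round3))  # ['a', 'b', 'c']
```

With retry alone, b never learns the entry once a is gone; with every holder rebroadcasting, c fills the gap.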

Page 146: RICON keynote: outwards from the middle of the maze

Round 3 in space / time

[Space-time diagram, times 1 to 5: Process b, Process a, Process c. The processes exchange log messages at every step.]

Redundancy in space and time

Page 147: RICON keynote: outwards from the middle of the maze

Round 3 -- lineage

log(B,  data)@5    

Page 148: RICON keynote: outwards from the middle of the maze

Round 3 -- lineage

log(B,  data)@5    

log(B,  data)@4    

log(A,  data)@4    

log(C,  data)@4    

Page 149: RICON keynote: outwards from the middle of the maze

Round 3 -- lineage

log(B,  data)@5    

log(B,  data)@4    

log(A,  data)@4    

log(C,  data)@4    

log(B, data)@3

log(A,  data)@3    

log(C,  data)@3    

Page 150: RICON keynote: outwards from the middle of the maze

Round 3 -- lineage

log(B,  data)@5    

log(B,  data)@4    

log(A,  data)@4    

log(C,  data)@4    

log(B, data)@3

log(A,  data)@3    

log(C,  data)@3    

log(B,data)@2    

log(A,  data)@2    

log(C,  data)@2    

log(A,  data)@1    

Page 152: RICON keynote: outwards from the middle of the maze

Round 3

The programmer wins!

Page 153: RICON keynote: outwards from the middle of the maze

Let’s reflect

Fault-tolerance is redundancy in space and time.
Best strategy for both players: reason backwards from outcomes using lineage.
Finding bugs: find a set of failures that “breaks” all derivations.
Fixing bugs: add additional derivations.

Page 154: RICON keynote: outwards from the middle of the maze

The role of the adversary can be automated

1.  Break a proof by dropping any contributing message.

Disjunction

(AB1 ∨ BC2)

Page 155: RICON keynote: outwards from the middle of the maze

The role of the adversary can be automated

1.  Break a proof by dropping any contributing message.

2.  Find a set of failures that breaks all proofs of a good outcome.

Disjunction

Conjunction of disjunctions (AKA CNF)

(AB1 ∨ BC2)

∧ (AC1) ∧ (AC2)
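The CNF reading can be made concrete: each clause lists the messages whose loss breaks one proof, and a satisfying assignment picks at least one message from every clause. The sketch below finds the smallest such sets by brute-force enumeration, which stands in for whatever solver a real tool would use; `minimal_failure_sets` is an illustrative name, not part of Molly.

```python
from itertools import combinations
from typing import FrozenSet, List


def minimal_failure_sets(proofs: List[FrozenSet[str]]) -> List[FrozenSet[str]]:
    """Return the smallest sets of dropped messages that break every proof."""
    messages = sorted(set().union(*proofs))
    for size in range(1, len(messages) + 1):
        hits = [
            frozenset(combo)
            for combo in combinations(messages, size)
            # A candidate works if it intersects every proof's message set.
            if all(proof & set(combo) for proof in proofs)
        ]
        if hits:
            return hits  # stop at the first size that works
    return []


# The formula from the slide: (AB1 ∨ BC2) ∧ (AC1) ∧ (AC2)
proofs = [frozenset({"AB1", "BC2"}), frozenset({"AC1"}), frozenset({"AC2"})]
for failure_set in minimal_failure_sets(proofs):
    print(sorted(failure_set))
```

For the slide's formula this yields two candidate fault sets of size three: AC1 and AC2 must both be dropped, together with either AB1 or BC2.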

Page 157: RICON keynote: outwards from the middle of the maze

Molly, the LDFI prototype

Molly finds fault-tolerance violations quickly or guarantees that none exist.
Molly finds bugs by explaining good outcomes – then it explains the bugs.
Bugs identified: 2pc, 2pc-ctp, 3pc, Kafka
Certified correct: paxos (synod), Flux, bully leader election, reliable broadcast

Page 158: RICON keynote: outwards from the middle of the maze

Commit protocols

Problem: Atomically change things
Correctness properties:
1.  Agreement (All or nothing)
2.  Termination (Something)
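The two correctness properties can be written as predicates over the final state of a run. This is a sketch under assumptions: each participant's outcome is modeled as "commit", "abort", or None (undecided), and the function names are illustrative.

```python
from typing import Dict, Optional, Set

Decision = Optional[str]  # "commit", "abort", or None if undecided


def agreement(decisions: Dict[str, Decision]) -> bool:
    """All or nothing: no two participants reach different decisions."""
    decided = {d for d in decisions.values() if d is not None}
    return len(decided) <= 1


def termination(decisions: Dict[str, Decision], crashed: Set[str]) -> bool:
    """Something: every participant that has not crashed reaches a decision."""
    return all(d is not None
               for name, d in decisions.items()
               if name not in crashed)


# Agents that voted but never heard a decision: agreement holds vacuously,
# termination does not.
blocked = {"a": None, "b": None, "d": None}
print(agreement(blocked), termination(blocked, crashed=set()))  # True False
```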

Page 159: RICON keynote: outwards from the middle of the maze

Two-phase commit

[Space-time diagram, times 1 to 5: Agent a, Agent b, Coordinator, Agent d. The coordinator sends prepare to each agent, each agent replies with a vote, and the coordinator then sends commit to each agent.]

Page 160: RICON keynote: outwards from the middle of the maze

Two-phase commit

[Space-time diagram, times 1 to 5: Agent a, Agent b, Coordinator, Agent d. The coordinator sends prepare to each agent, each agent replies with a vote, and the coordinator then sends commit to each agent.]

Can I kick it?

Page 161: RICON keynote: outwards from the middle of the maze

Two-phase commit

[Space-time diagram, times 1 to 5: Agent a, Agent b, Coordinator, Agent d. The coordinator sends prepare to each agent, each agent replies with a vote, and the coordinator then sends commit to each agent.]

Can I kick it?

YES YOU CAN

Page 162: RICON keynote: outwards from the middle of the maze

Two-phase commit

[Space-time diagram, times 1 to 5: Agent a, Agent b, Coordinator, Agent d. The coordinator sends prepare to each agent, each agent replies with a vote, and the coordinator then sends commit to each agent.]

Can I kick it?

YES YOU CAN

Well I’m gone

Page 163: RICON keynote: outwards from the middle of the maze

Two-phase commit

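[Space-time diagram, times 1 to 3: Agent a, Agent b, Coordinator, Agent d. The coordinator sends prepare (p) to each agent and each agent replies with a vote (v); the coordinator is then marked CRASHED, and no decision is shown.]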

Violation: Termination

Page 164: RICON keynote: outwards from the middle of the maze

The collaborative termination protocol

Basic idea: Agents talk amongst themselves when the coordinator fails.
Protocol: On timeout, ask other agents about decision.
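The timeout rule can be sketched as a single function: an agent that has waited too long adopts any decision a peer already knows. The sketch is hypothetical (`resolve_on_timeout` is not from the talk) and models each peer's knowledge as "commit", "abort", or None.

```python
from typing import Dict, Optional


def resolve_on_timeout(peer_decisions: Dict[str, Optional[str]]) -> Optional[str]:
    """Adopt the first decision any peer reports; stay undecided otherwise."""
    for peer in sorted(peer_decisions):
        decision = peer_decisions[peer]
        if decision is not None:
            return decision
    return None  # nobody knows: the agent is still blocked


# One peer heard "commit" before the coordinator failed.
print(resolve_on_timeout({"b": None, "d": "commit"}))  # commit
# The coordinator failed before telling anyone.
print(resolve_on_timeout({"b": None, "d": None}))      # None
```

The second call is the uncomfortable case: if the coordinator fails before any agent learns the decision, asking peers does not help.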

Page 165: RICON keynote: outwards from the middle of the maze

2PC - CTP

[Space-time diagram, times 1 to 7: Agent a, Agent b, Coordinator, Agent d. The coordinator sends prepare to each agent, each agent replies with a vote, and the coordinator is marked CRASHED after time 3. Each agent then sends decision_req to the other two agents.]

Page 166: RICON keynote: outwards from the middle of the maze

2PC - CTP

[Space-time diagram, times 1 to 7: Agent a, Agent b, Coordinator, Agent d. The coordinator sends prepare to each agent, each agent replies with a vote, and the coordinator is marked CRASHED after time 3. Each agent then sends decision_req to the other two agents.]

Can I kick it?

YES YOU CAN

……?

Page 167: RICON keynote: outwards from the middle of the maze

3PC

Basic idea: Add a round, a state, and simple failure detectors (timeouts).
Protocol:
1.  Phase 1: Just like in 2PC
    –  Agent timeout → abort
2.  Phase 2: send canCommit, collect acks
    –  Agent timeout → commit
3.  Phase 3: Just like phase 2 of 2PC
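The agent-side timeout rules above amount to a two-case lookup on the phase the agent is waiting in. A minimal sketch, assuming the agent tracks only which phase it has reached (`AgentPhase` and `on_timeout` are illustrative names):

```python
from enum import Enum


class AgentPhase(Enum):
    PHASE_1 = 1  # voted, waiting for the coordinator's next message
    PHASE_2 = 2  # acknowledged, waiting for the final decision


def on_timeout(phase: AgentPhase) -> str:
    """Decision an agent takes unilaterally when the coordinator goes quiet."""
    if phase is AgentPhase.PHASE_1:
        return "abort"
    return "commit"


print(on_timeout(AgentPhase.PHASE_1))  # abort
print(on_timeout(AgentPhase.PHASE_2))  # commit
```

Because the agent decides without hearing from the coordinator, a timeout in phase 2 can disagree with whatever the coordinator chose in the meantime.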

Page 168: RICON keynote: outwards from the middle of the maze

3PC

[Space-time diagram, times 1 to 7: Process a, Process b, Process C (coordinator), Process d. C sends cancommit to each agent and collects vote_msg replies, sends precommit and collects ack replies, then sends commit.]

Page 169: RICON keynote: outwards from the middle of the maze

3PC

[Space-time diagram, times 1 to 7: Process a, Process b, Process C (coordinator), Process d. C sends cancommit to each agent and collects vote_msg replies, sends precommit and collects ack replies, then sends commit.]

Timeout → Abort

Timeout → Commit

Page 170: RICON keynote: outwards from the middle of the maze

Network partitions make 3pc act crazy

[Space-time diagram, times 1 to 8: Process a, Process b, Process C (coordinator), Process d. C sends cancommit to each agent and each agent replies with vote_msg; d is then marked CRASHED. C sends precommit, and a and b reply with ack. C's abort messages to a and b are first marked LOST and then sent again. a and b each show commit.]

Page 171: RICON keynote: outwards from the middle of the maze

Network partitions make 3pc act crazy

[Space-time diagram, times 1 to 8: Process a, Process b, Process C (coordinator), Process d. C sends cancommit to each agent and each agent replies with vote_msg; d is then marked CRASHED. C sends precommit, and a and b reply with ack. C's abort messages to a and b are first marked LOST and then sent again. a and b each show commit.]

Agent crash

Agents learn commit decision

Page 172: RICON keynote: outwards from the middle of the maze

Network partitions make 3pc act crazy

[Space-time diagram, times 1 to 8: Process a, Process b, Process C (coordinator), Process d. C sends cancommit to each agent and each agent replies with vote_msg; d is then marked CRASHED. C sends precommit, and a and b reply with ack. C's abort messages to a and b are first marked LOST and then sent again. a and b each show commit.]

Agent crash

Agents learn commit decision

d is dead; coordinator decides to abort

Page 173: RICON keynote: outwards from the middle of the maze

Network partitions make 3pc act crazy

[Space-time diagram, times 1 to 8: Process a, Process b, Process C (coordinator), Process d. C sends cancommit to each agent and each agent replies with vote_msg; d is then marked CRASHED. C sends precommit, and a and b reply with ack. C's abort messages to a and b are first marked LOST and then sent again. a and b each show commit.]

Brief network partition

Agent crash

Agents learn commit decision

d is dead; coordinator decides to abort

Page 174: RICON keynote: outwards from the middle of the maze

Network partitions make 3pc act crazy

[Space-time diagram, times 1 to 8: Process a, Process b, Process C (coordinator), Process d. C sends cancommit to each agent and each agent replies with vote_msg; d is then marked CRASHED. C sends precommit, and a and b reply with ack. C's abort messages to a and b are first marked LOST and then sent again. a and b each show commit.]

Brief network partition

Agent crash

Agents learn commit decision

d is dead; coordinator decides to abort

Agents A & B decide to commit

Page 175: RICON keynote: outwards from the middle of the maze

Kafka durability bug

[Space-time diagram, times 1 to 5: Replica b, Replica c, Zookeeper, Replica a, Client. Messages are labeled m, l, a, c, and w. Replica a is marked CRASHED.]

Page 176: RICON keynote: outwards from the middle of the maze

Kafka durability bug

[Space-time diagram, times 1 to 5: Replica b, Replica c, Zookeeper, Replica a, Client. Messages are labeled m, l, a, c, and w. Replica a is marked CRASHED.]

Brief network partition

Page 177: RICON keynote: outwards from the middle of the maze

Kafka durability bug

[Space-time diagram, times 1 to 5: Replica b, Replica c, Zookeeper, Replica a, Client. Messages are labeled m, l, a, c, and w. Replica a is marked CRASHED.]

Brief network partition

a becomes leader and sole replica

Page 178: RICON keynote: outwards from the middle of the maze

Kafka durability bug

[Space-time diagram, times 1 to 5: Replica b, Replica c, Zookeeper, Replica a, Client. Messages are labeled m, l, a, c, and w. Replica a is marked CRASHED.]

Brief network partition

a becomes leader and sole replica

a ACKs client write

Page 179: RICON keynote: outwards from the middle of the maze

Kafka durability bug

[Space-time diagram, times 1 to 5: Replica b, Replica c, Zookeeper, Replica a, Client. Messages are labeled m, l, a, c, and w. Replica a is marked CRASHED.]

Brief network partition

a becomes leader and sole replica

a ACKs client write

Data loss
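The sequence of annotations above can be replayed in a deliberately simplified model. It is a sketch, not Kafka's replication code or API: the assumption is that a leader acknowledges a write once every replica it currently considers in sync holds it, and `acked` and `lost_after_crash` are hypothetical helpers.

```python
from typing import Set


def acked(write_holders: Set[str], in_sync: Set[str]) -> bool:
    """A write is acknowledged once every in-sync replica holds it."""
    return bool(in_sync) and in_sync <= write_holders


def lost_after_crash(write_holders: Set[str], crashed: Set[str]) -> bool:
    """An acknowledged write is lost if no surviving replica holds it."""
    return not (write_holders - crashed)


# Before the partition, the write would need all three replicas.
print(acked({"a"}, in_sync={"a", "b", "c"}))  # False

# After the partition, a is leader and sole in-sync replica.
holders = {"a"}
print(acked(holders, in_sync={"a"}))          # True

# a crashes after acknowledging: nobody else has the write.
print(lost_after_crash(holders, crashed={"a"}))  # True
```

In this model the acknowledgement is only as durable as the smallest in-sync set the leader is willing to accept.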

Page 180: RICON keynote: outwards from the middle of the maze

Molly summary

Lineage allows us to reason backwards from good outcomes
Molly: surgically-targeted fault injection
Investment similar to testing
Returns similar to formal methods

Page 181: RICON keynote: outwards from the middle of the maze

Where we’ve been; where we’re headed

1.  Mourning the death of transactions
2.  What is so hard about distributed systems?
3.  Distributed consistency: managing asynchrony
4.  Fault-tolerance: progress despite failures

Page 182: RICON keynote: outwards from the middle of the maze

Where we’ve been; where we’re headed

1.  We need application-level guarantees
2.  What is so hard about distributed systems?
3.  Distributed consistency: managing asynchrony
4.  Fault-tolerance: progress despite failures

Page 184: RICON keynote: outwards from the middle of the maze

Where we’ve been; where we’re headed

1.  We need application-level guarantees
2.  (asynchrony X partial failure) = too hard to hide! We need tools to manage it.
3.  Distributed consistency: managing asynchrony
4.  Fault-tolerance: progress despite failures

Page 185: RICON keynote: outwards from the middle of the maze

Where we’ve been; where we’re headed

1.  We need application-level guarantees
2.  asynchrony X partial failure = too hard to hide! We need tools to manage it.
3.  Distributed consistency: managing asynchrony
4.  Fault-tolerance: progress despite failures

Page 186: RICON keynote: outwards from the middle of the maze

Where we’ve been; where we’re headed

1.  We need application-level guarantees
2.  asynchrony X partial failure = too hard to hide! We need tools to manage it.
3.  Focus on flow: data in motion
4.  Fault-tolerance: progress despite failures

Page 187: RICON keynote: outwards from the middle of the maze

Outline

1.  We need application-level guarantees
2.  asynchrony X partial failure = too hard to hide! We need tools to manage it.
3.  Focus on flow: data in motion
4.  Fault-tolerance: progress despite failures

Page 188: RICON keynote: outwards from the middle of the maze

Outline

1.  We need application-level guarantees
2.  asynchrony X partial failure = too hard to hide! We need tools to manage it.
3.  Focus on flow: data in motion
4.  Backwards from outcomes

Page 189: RICON keynote: outwards from the middle of the maze

Remember

1.  We need application-level guarantees
2.  asynchrony X partial failure = too hard to hide! We need tools to manage it.
3.  Focus on flow: data in motion
4.  Backwards from outcomes

Composition is the hardest problem

Page 190: RICON keynote: outwards from the middle of the maze

A happy crisis

Valentine: “It makes me so happy. To be at the beginning again, knowing almost nothing.... It's the best possible time of being alive, when almost everything you thought you knew is wrong.”

