Verifying Distributed Systems
Zachary Tatlock
Coq Workshop 2018
The Team
Steve Anton
Mike Ernst
Tom Anderson
Ryan Doenges
Keith Simmons
Xi Wang
Ilya Sergey
Karl Palmskog
Miranda Edwards
Doug Woos
Pavel Panchekha
James Wilcox
Justin Adsuara
Amazing researchers on the job market next year!
Distributed Systems
Distributed Apps
Distributed Infrastructure
One summer day...
How distributed systems fail
Challenges:
 - concurrency
 - message drops
 - message dups
 - message reorder
 - machine crash
 - machine reboot
Too many possible behaviors to effectively test!
Edsger W. Dijkstra, Under the Spell of Leibniz's Dream:
“When exhaustive testing is impossible, our trust can only be based on proof.”
Toward verified distributed systems
Formalize network semantics: capture how faults can occur
Separate app / fault reasoning: develop and prove in simple fault model, apply generic verified fault handling
Toward verified distributed systems
Verified Raft Consensus
The Verdi Framework
TCB, Tools, Teaching
Enriching Models & Modularity
Formalizing distributed systems
 - timeouts
 - msg delivery
 - state change
 - node failure
1. Defining distributed systems
2. Giving systems semantics
3. Proving system safety
4. Reusable, verified fault-tolerance
1. Distributed sys as event handlers
Def mySys (P : params) : system :=
  (* types for state and I/O *)
  Type msg := (* to/from internal nodes *)
  Type cmsg := (* to/from external world *)
  Type data := (* node-local state *)
  Type resp := data * list cmsg * list msg
  (* event handlers *)
  Def onMsg : data * msg -> resp
  Def onTmOut : data * unit -> resp
  Def onClient : data * cmsg -> resp
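A minimal Python sketch of the event-handler shape, for readers who want something runnable. The counter system, the `Resp` and `CounterSys` names, and the message strings are all assumptions made for illustration; none of them come from Verdi.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Resp:
    """Mirrors `resp := data * list cmsg * list msg` from the slide."""
    data: int                                                   # new node-local state
    to_clients: List[str] = field(default_factory=list)          # cmsg outputs
    to_nodes: List[Tuple[str, str]] = field(default_factory=list)  # (dst, msg)


class CounterSys:
    """Hypothetical system: a primary counts client 'inc' requests
    and forwards each one to a backup node."""

    def on_client(self, data: int, cmsg: str) -> Resp:
        if cmsg == "inc":
            return Resp(data + 1, ["ok"], [("backup", "inc")])
        return Resp(data, ["unknown command"])

    def on_msg(self, data: int, msg: str) -> Resp:
        if msg == "inc":
            return Resp(data + 1)
        return Resp(data)

    def on_tm_out(self, data: int) -> Resp:
        # Nothing to retry in this toy system, so the state is unchanged.
        return Resp(data)
```

Each handler is a pure function from the current state and an event to a response, which is what lets a separate network semantics decide when and how often handlers run.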
2. Network semantics
State of the world:
 - packets in flight
 - data @ nodes
 - history of client I/O
Good old small step operational semantics.
Example rule: message delivery

$$\frac{H_{net}(dst, \Sigma[dst], src, m) = (\sigma', o, P') \qquad \Sigma' = \Sigma[dst \mapsto \sigma']}{(\{(src, dst, m)\} \uplus P, \Sigma, T) \rightsquigarrow (P \uplus P', \Sigma', T \mathbin{+\!\!+} \langle o \rangle)}\ \textsc{Deliver}$$

 - $\{(src, dst, m)\} \uplus P$: if this message is in the network
 - $H_{net}(dst, \Sigma[dst], src, m)$: run handler on message
 - $(\sigma', o, P')$: get response
 - $(P \uplus P', \Sigma', T \mathbin{+\!\!+} \langle o \rangle)$: resulting new global state

$$\frac{p \in P}{(P, \Sigma, T) \rightsquigarrow (P \uplus \{p\}, \Sigma, T)}\ \textsc{Duplicate}$$

$$\frac{}{(\{p\} \uplus P, \Sigma, T) \rightsquigarrow (P, \Sigma, T)}\ \textsc{Drop}$$

$$\frac{H_{tmt}(n, \Sigma[n]) = (\sigma', o, P') \qquad \Sigma' = \Sigma[n \mapsto \sigma']}{(P, \Sigma, T) \rightsquigarrow (P \uplus P', \Sigma', T \mathbin{+\!\!+} \langle tmt, o \rangle)}\ \textsc{Timeout}$$
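The Deliver, Duplicate, and Drop steps can be sketched as plain functions on a world (P, Σ, T). This is only an illustration in Python: the list-as-multiset encoding, the integer node state, and `count_handler` are assumptions, not the Coq definitions.

```python
from typing import Callable, Dict, List, Tuple

Packet = Tuple[str, str, str]                            # (src, dst, m)
World = Tuple[List[Packet], Dict[str, int], List[str]]   # (P, Sigma, T)
# H_net(dst, sigma, src, m) = (sigma', o, P')
Handler = Callable[[str, int, str, str], Tuple[int, List[str], List[Packet]]]


def deliver(h_net: Handler, world: World, i: int) -> World:
    """Deliver: take packet i out of P, run the handler, add what it sends."""
    packets, sigma, trace = world
    src, dst, m = packets[i]
    rest = packets[:i] + packets[i + 1:]
    new_state, out, sent = h_net(dst, sigma[dst], src, m)
    return (rest + sent, {**sigma, dst: new_state}, trace + out)


def duplicate(world: World, i: int) -> World:
    """Duplicate: a packet already in P appears once more."""
    packets, sigma, trace = world
    return (packets + [packets[i]], sigma, trace)


def drop(world: World, i: int) -> World:
    """Drop: a packet disappears; node state and trace are untouched."""
    packets, sigma, trace = world
    return (packets[:i] + packets[i + 1:], sigma, trace)


def count_handler(dst: str, state: int, src: str, m: str):
    # Hypothetical handler: count messages, report the count, send nothing.
    return (state + 1, [f"{dst}={state + 1}"], [])
```

Starting from one packet, `duplicate` followed by two `deliver` steps makes the receiver count the same message twice, which is exactly the behavior an application proof has to rule out or tolerate under a semantics with duplication.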
Library of network semantics
Type sem := state -> state -> Prop
Def sync_sem := (* in-order delivery *)
Def async_sem := (* + reordering *)
Def flaky_sem := (* + drops, timeouts *)
Def busy_sem := (* + duplicates *)
Def crash_sem := (* + crash, reboot *)
Precisely characterize fault model for sys.
More behaviors --> harder proof.
3. Verifying system safety
Def ok : state -> Prop
Need to show all states reachable from the init state are ok.
As usual, problem is specs not inductive.
Strengthen “ok” to inductive “ok_ind”.
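A toy illustration of why a true safety property may still fail to be inductive, and how strengthening fixes that. The ten-state counter, `step`, and both predicates are invented for this sketch; the state space is finite so the checks can be exhaustive.

```python
from typing import Callable

State = int
STATES = range(10)     # finite state space, so every check is exhaustive
INIT: State = 0


def step(s: State) -> State:
    """The only transition of the toy system: add 2, wrapping at 10."""
    return (s + 2) % 10


def ok(s: State) -> bool:
    """Safety property: the counter never equals 5."""
    return s != 5


def ok_ind(s: State) -> bool:
    """Strengthened invariant: the counter is always even."""
    return s % 2 == 0


def is_inductive(inv: Callable[[State], bool]) -> bool:
    """inv holds initially and is preserved by a step from ANY state
    satisfying inv, whether or not that state is reachable."""
    return inv(INIT) and all(inv(step(s)) for s in STATES if inv(s))


def implies(p: Callable[[State], bool], q: Callable[[State], bool]) -> bool:
    return all(q(s) for s in STATES if p(s))
```

Here `ok` holds in every reachable state but is not inductive, because the unreachable state 3 satisfies `ok` and steps to 5. `ok_ind` is inductive and implies `ok`, so proving `ok_ind` by induction establishes `ok`.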
When verifying systems in a particular semantics, need to repeat similar fault tolerance reasoning for every system.
4. Verifying system transformers
Implement fault tolerance as wrapper
  Def tcp : system -> system
Transfer proofs across semantics
  Theorem tcp_ok : forall s P, P s -> lift_tcp P (tcp s)
Separate app proof / fault tolerance
  handles class of faults once and for all
  can compose transformers, proofs
Transformers (each wraps an App): Raft Consensus, Primary Backup, Seq # and Retrans, Ghost Variables
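A rough Python sketch in the spirit of the "Seq # and Retrans" transformer: a wrapper that makes an unmodified application handler see each message at most once. The `SeqNumWrapper` class and `counter` handler are hypothetical, only the duplicate-suppression half is modeled (no retransmission), and nothing here is proved.

```python
from typing import Callable, Set, Tuple

# An app handler maps (state, msg) to a new state.
AppHandler = Callable[[int, str], int]


def counter(state: int, msg: str) -> int:
    """App written for a network that never duplicates messages."""
    return state + 1 if msg == "inc" else state


class SeqNumWrapper:
    """Wraps an app handler so each (src, seq) pair is applied at most once."""

    def __init__(self, app: AppHandler, init: int) -> None:
        self.app = app
        self.state = init
        self.seen: Set[Tuple[str, int]] = set()

    def on_msg(self, src: str, seq: int, msg: str) -> None:
        if (src, seq) in self.seen:
            return                      # duplicate delivery: ignore it
        self.seen.add((src, seq))
        self.state = self.app(self.state, msg)
```

The application code stays unaware of sequence numbers; the wrapper handles the fault once, for any handler it is given.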
Example: verifying Raft in Verdi
critical components must not fail
Replication for fault tolerance
⇒ available if a majority (more than n/2) of nodes are up
Replication correctness
≈ linearizability
cluster looks like a single node (state machine) to clients
Defining Raft
Def raft(sm: state machine, ...) :=
  (* types for state and I/O *)
  cmsg := sm.cmsg
  msg := (* ??? *)
  data := (* ??? *)
  (* event handlers *)
  Def onMsg := (* ??? *)
  Def onTmOut := (* ??? *)
  Def onClient := (* ??? *)
Raft: election and replication terms
Term 1, Term 2, Term 3, ...
Each term: election, then replication
Raft: leader election
Candidate sends ReqVote to Followers; Followers reply with Vote
Raft: log replication
Leader sends Append to Followers; Followers reply with AppendAck
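The ReqVote / Vote exchange can be sketched in a few lines of Python. This is a deliberately simplified model, not the verified implementation: the `Node` class and `run_election` are hypothetical, messages are delivered synchronously, and Raft's log up-to-date check is omitted.

```python
from typing import Dict, Optional


class Node:
    def __init__(self) -> None:
        self.current_term = 0
        self.voted_for: Optional[str] = None

    def on_req_vote(self, term: int, candidate: str) -> bool:
        """Grant at most one vote per term (log comparison omitted)."""
        if term < self.current_term:
            return False                 # stale request
        if term > self.current_term:
            self.current_term = term     # new term: forget the old vote
            self.voted_for = None
        if self.voted_for in (None, candidate):
            self.voted_for = candidate
            return True
        return False


def run_election(candidate: str, term: int, cluster: Dict[str, Node]) -> bool:
    """Ask every node (the candidate included) for a vote in `term`;
    the candidate wins with a strict majority of the cluster."""
    votes = sum(node.on_req_vote(term, candidate) for node in cluster.values())
    return votes > len(cluster) // 2
```

Because each node votes at most once per term and winning needs a strict majority, two different candidates cannot both win the same term.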
Defining Raft
Def raft(sm: state machine, ...) :=
  (* types for state and I/O *)
  cmsg := sm.cmsg
  msg := ReqVote | Vote | Append | ...
  data := { sm.data, list sm.op, ... }
  (* event handlers *)
  Def onMsg := (* ... *)
  Def onTmOut := (* ... *)
  Def onClient := (* ... *)
Verifying Raft
≈ linearizability
≈ Raft internal correctness
linearizability follows from internal correctness: state machine safety
Proving Raft in Verdi
State machine safety: nodes’ logs match on committed entries
proof by induction over executions
⇒ linearizability, since only committed entries are executed
State Machine Safety: Proof
State machine safety is not inductive!
Find an invariant I such that:
 - I ⇒ state machine safety
 - I true initially
 - I preserved
Lemma, Lemma, Lemma, … (90 invariants in total)
The burden of proof
P with ghost state: P ⇒ safety property, P true initially, P preserved (Lemma, Lemma, …, Lemma)
Re-verification is the primary challenge:
 - invariants are not inductive
 - not-yet-verified code is wrong
 - need additional invariants
Proof engineering techniques help:
 - affinity lemmas
 - intermediate reachability
 - structural tactics
 - information hiding
Ghost state: example
Capture all entries received by a node

Before Append:
            Log (real)    allEntries (ghost)
Leader      A,B,C         {A,B,C}
Follower    A,D           {A,D}

Leader sends Append [A],B,C to Follower.

After Append:
            Log (real)    allEntries (ghost)
Leader      A,B,C         {A,B,C}
Follower    A,B,C         {A,B,C,D}
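The same example as a small Python sketch. The `Follower` class and its `on_append` signature are assumptions for illustration (the sketch also assumes `prev` is present in the log); the point is only that the real log can lose entries while the ghost set never does.

```python
from typing import List, Set


class Follower:
    def __init__(self, log: List[str]) -> None:
        self.log = list(log)                      # real state
        self.all_entries: Set[str] = set(log)     # ghost state: never shrinks

    def on_append(self, prev: str, entries: List[str]) -> None:
        """Keep the log up to and including `prev`, then replace the rest
        with `entries`. Assumes `prev` occurs in the log."""
        keep = self.log.index(prev) + 1
        self.log = self.log[:keep] + list(entries)
        self.all_entries |= set(entries)          # ghost update only records
```

Delivering Append [A],B,C to a follower whose log is A,D overwrites D in the real log, but D stays in `all_entries`, so proofs can still refer to it.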
Affinity lemmas: example

$$\frac{e \in \mathit{log} \Rightarrow e.\mathit{term} > 0}{e \in \mathit{allEntries} \Rightarrow e.\mathit{term} > 0}\ \text{Affinity Lemma}$$

every invariant of entries in logs is invariant of entries in allEntries

$$\frac{e \in \mathit{log} \Rightarrow P\ e}{e \in \mathit{allEntries} \Rightarrow P\ e}\ \text{Affinity Lemma}$$
More affinity lemmas
Relate ghost state to real state: transfer properties once and for all
Relate current messages to past: response => past request
Structured handlers
handler = update_state ; respond
net --handler--> net’
net --update_state--> net_i --respond--> net’
Invariant I holds at net, at the intermediate state net_i, and at net’
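One way to picture the split is a handler that checks the invariant at the intermediate state as well as at the end. Everything in this Python sketch is hypothetical: the `Net` encoding, the two phases, and the invariant were chosen only to make the idea concrete.

```python
from typing import Dict, List, Tuple

Net = Tuple[Dict[str, int], List[str]]   # (node state, packets in flight)


def update_state(net: Net) -> Net:
    """Phase 1: change node-local state only."""
    state, packets = net
    return ({**state, "count": state["count"] + 1}, packets)


def respond(net: Net) -> Net:
    """Phase 2: send messages only, based on the updated state."""
    state, packets = net
    return (state, packets + [f"ack {state['count']}"])


def invariant(net: Net) -> bool:
    # Hypothetical invariant I: never more acks in flight than updates counted.
    state, packets = net
    return len(packets) <= state["count"]


def handler(net: Net) -> Net:
    """handler = update_state ; respond, with I checked after each phase."""
    net_i = update_state(net)
    assert invariant(net_i), "I must hold at the intermediate state net_i"
    net_prime = respond(net_i)
    assert invariant(net_prime), "I must hold at net'"
    return net_prime
```

Reasoning about each phase separately keeps every proof obligation small; in this toy, running `respond` before `update_state` would already break I at the intermediate state.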
First formal verification of Raft
50k lines of Coq
18 person-months
Considerable pizza and beer
Verified ≠ perfect
Network semantics shim is delicate: atomicity, fairness, serialization, …
Verdi users need Coq + distr sys skills: notorious learning curves hinder impact
Regular development still tricky: maintenance, extension, management
Network semantics shim is delicate
Note that all steps are atomic in semantics! Shim must carefully persist to ensure fidelity.
User stumbled across liveness bug for single node cluster.
“CSmith” paper for verified distr sys: Pedro et al. found several bugs, BUT none in any verified components.
Cheerios: new system transformer with correct serialization implemented and verified. (Justin Adsuara)
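A natural correctness statement for a serializer is a round-trip law: deserializing what was serialized gives back the original value. The Python sketch below shows that law on a hypothetical wire format for (term, payload) log entries; it is not Cheerios' format or API, and a test is of course much weaker than a proof.

```python
import struct
from typing import List, Tuple

Entry = Tuple[int, bytes]   # (term, payload); both names are illustrative


def serialize(entries: List[Entry]) -> bytes:
    """Encode entries with explicit big-endian counts and length prefixes."""
    out = struct.pack(">I", len(entries))
    for term, payload in entries:
        out += struct.pack(">II", term, len(payload)) + payload
    return out


def deserialize(data: bytes) -> List[Entry]:
    """Inverse of serialize for well-formed input."""
    (count,) = struct.unpack_from(">I", data, 0)
    offset = 4
    entries: List[Entry] = []
    for _ in range(count):
        term, size = struct.unpack_from(">II", data, offset)
        offset += 8
        entries.append((term, data[offset:offset + size]))
        offset += size
    return entries
```

Fixed byte order and explicit lengths are the kind of detail an unverified shim can get wrong without any handler noticing.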
Training the next generation
Doug Woos
There will always be a TCB: we’ll always need informed judgement
Engineers unlikely to pick this up at work, but courses great evangelism opportunity
How to get this into ugrad canon? Need reusable labs and tools
Proof engineering
Karl Palmskog
Finally catching the interest of the SE community: ASE ’17, ICSE ’18, ISSTA ’18
Churn = nodes joining & leaving a system at run time
Ryan Doenges
Punctuated safety properties:
 - reachable under churn
 - safety after churn stops
Toward verifying churn tolerance
Tree aggregation: aggregate data in sensor networks; designated root node eventually correct
Chord: distributed hash table; protocol bugs found [Zave 2015]; ring should eventually stabilize
Composition: A way to make proofs harder
“In 1997, the unfortunate reality is that engineers rarely specify and reason formally about the systems they build. It seems unlikely that reasoning about the composition of open-system specifications will be a practical concern within the next 15 years.”
“Horizontal composition”: eliminate closed world hypothesis
Compositional Verif of Distr Sys [POPL 18]
Ilya Sergey
James Wilcox
Reflections on Verdi experience
Distributed sys good fit for verification: critical, expert-written, I/O bound cases
Biggest challenge is proof engineering: reproving and managing scale daunting
Lots of low-hanging fruit left: dynamic update, concurrency, optimization
The most important ingredients
Steve Anton
Mike Ernst
Tom Anderson
Ryan Doenges
Keith Simmons
Xi Wang
Ilya Sergey
Karl Palmskog
Miranda Edwards
Doug Woos
Pavel Panchekha
James Wilcox
Justin Adsuara
Thank You!
Verified Raft Consensus
The Verdi Framework
TCB, Tools, Teaching
Enriching Models & Modularity
http://distributedcomponents.net