UNDERSTANDING AND IMPLEMENTING CHORD
Pamela Zave
Princeton University
Princeton, New Jersey
THE CHORD PROTOCOL MAINTAINS A PEER-TO-PEER NETWORK
[Figure: ring of members 1, 8, 14, 21, 32, 42, 51, with pointers labeled successor, successor2, predecessor]
here, the identifier space consists of all 6-bit binary values
the identifier of a node (assumed unique) is an m-bit hash of its IP address
nodes are arranged in a ring, each node having a successor pointer to the next node (in integer order with wraparound at 0)
the protocol preserves the ring structure as nodes join, leave silently, or fail
redundant pointers support fault-tolerance (extra successors, predecessors)
“consistent hashing” means that all IP addresses are spread evenly around the identifier space
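The identifier and successor rules above can be sketched in a few lines. This is a minimal illustration, not the protocol itself: the function names `chord_id` and `successor_of` are hypothetical, SHA-1 as the hash is an assumption, and the successor is computed from a global member list that a real node would not have.

```python
import hashlib
from bisect import bisect_right

M = 6  # identifier bits; 6 matches the example space of 0..63


def chord_id(ip_address: str, m: int = M) -> int:
    """Hash an IP address to an m-bit identifier (SHA-1 is an assumption)."""
    digest = hashlib.sha1(ip_address.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** m)


def successor_of(member: int, members: list[int]) -> int:
    """Next member in integer order, wrapping around at 0."""
    ring = sorted(members)
    i = bisect_right(ring, member)
    return ring[i % len(ring)]


ring = [1, 8, 14, 21, 32, 42, 51]
print(successor_of(14, ring))  # 21
print(successor_of(51, ring))  # 1 (wraparound)
```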
CHORD AS A DISTRIBUTED HASH TABLE
[Figure: the ring of members 1, 8, 14, 21, 32, 42, 51]
a hash table stores (key, value) pairs
the keys are the same as identifiers
keys 22 through 32 are stored at 32, keys 33 through 42 are stored at 42, etc.
to look up a value, you only need to know the IP address of one Chord member
a list of accessible members is the only central administration needed!
your query to that member is forwarded around the ring until it gets to the member that has the value
as an optimization, “finger” pointers go across the ring and make lookup faster
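A minimal sketch of the lookup just described, forwarding along successor pointers only (no fingers). The names `in_arc` and `lookup` and the dictionary representation of successor pointers are assumptions made for illustration.

```python
def in_arc(key: int, lo: int, hi: int) -> bool:
    """True if key lies in the ring interval (lo, hi], allowing wraparound at 0."""
    if lo < hi:
        return lo < key <= hi
    return key > lo or key <= hi  # the interval wraps past 0


def lookup(start: int, key: int, successor: dict[int, int]) -> tuple[int, list[int]]:
    """Forward a query from `start` until some member's successor owns the key."""
    path = [start]
    current = start
    while not in_arc(key, current, successor[current]):
        current = successor[current]
        path.append(current)
    return successor[current], path


members = [1, 8, 14, 21, 32, 42, 51]
successor = {m: members[(i + 1) % len(members)] for i, m in enumerate(members)}

owner, path = lookup(1, 30, successor)
print(owner, path)  # 32 [1, 8, 14, 21]
```

Each hop uses only the current member's own successor pointer, which is why knowing one member is enough to start a query. Finger pointers would shorten `path` without changing `owner`.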
OPERATIONS OF THE PROTOCOL
[Figure: members 7, 10, 16 in successive states, with steps labeled 10 Joins, 10 Stabilizes, 16 GetsNotified, 7 Stabilizes, 10 GetsNotified]
an operation changes the state of one member
Join and Stabilize are scheduled autonomously; GetsNotified is caused by another member’s Stabilize
now 10 is completely incorporated into the ring!
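The sequence in which 10 joins between 7 and 16 can be replayed in a single process. This is a simplified sketch following the Join, Stabilize, and GetsNotified rules of the original Chord paper; the `Member` class, the `between` helper, and the direct method calls standing in for network messages are all assumptions.

```python
def between(a: int, b: int, c: int) -> bool:
    """True if b lies strictly between a and c going clockwise around the ring."""
    if a < c:
        return a < b < c
    return b > a or b < c  # the arc from a to c wraps past 0


class Member:
    def __init__(self, ident: int):
        self.id = ident
        self.succ = None  # successor pointer
        self.pred = None  # predecessor pointer

    def join(self, known: "Member") -> None:
        """Join: find the member that should follow us, starting from a known member."""
        node = known
        while not between(node.id, self.id, node.succ.id):
            node = node.succ
        self.succ = node.succ

    def stabilize(self) -> None:
        """Stabilize: adopt the successor's predecessor if it is closer, then notify."""
        candidate = self.succ.pred
        if candidate is not None and between(self.id, candidate.id, self.succ.id):
            self.succ = candidate
        self.succ.gets_notified(self)

    def gets_notified(self, other: "Member") -> None:
        """GetsNotified: adopt `other` as predecessor if it is closer than the current one."""
        if self.pred is None or between(self.pred.id, other.id, self.id):
            self.pred = other


# Initial ring: 7 <-> 16
n7, n10, n16 = Member(7), Member(10), Member(16)
n7.succ, n7.pred = n16, n16
n16.succ, n16.pred = n7, n7

n10.join(n7)     # 10 Joins: successor becomes 16
n10.stabilize()  # 10 Stabilizes, so 16 GetsNotified: 16's predecessor becomes 10
n7.stabilize()   # 7 Stabilizes: successor becomes 10, so 10 GetsNotified
print(n7.succ.id, n10.succ.id, n16.succ.id)  # 10 16 7
```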
MORE OPERATIONS OF THE PROTOCOL
[Figure: members 9, 16, 22, 35 in successive states, with steps labeled 22 Fails or Leaves, 16 Updates, 35 Flushes, 9 Reconciles]
22 has no pointers and does not respond to queries, so other nodes know that it has failed
now the hole left by 22 is repaired
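The repair sequence can be replayed the same way. The slide names the operations but does not define them, so the semantics below are assumptions: Update replaces a dead first successor with the second successor, Flush discards a dead predecessor, and Reconcile copies the successor's successor into the second-successor slot. The `Member` class is hypothetical.

```python
class Member:
    def __init__(self, ident: int):
        self.id = ident
        self.live = True
        self.succ = None   # first successor
        self.succ2 = None  # second (redundant) successor
        self.pred = None   # predecessor

    def fail(self) -> None:
        """Fail or leave silently: stop responding and keep no pointers."""
        self.live = False
        self.succ = self.succ2 = self.pred = None

    def update(self) -> None:
        """Update (assumed semantics): replace a dead first successor with the second."""
        if not self.succ.live:
            self.succ = self.succ2

    def flush(self) -> None:
        """Flush (assumed semantics): discard a dead predecessor."""
        if self.pred is not None and not self.pred.live:
            self.pred = None

    def reconcile(self) -> None:
        """Reconcile (assumed semantics): learn the second successor from a live successor."""
        if self.succ.live:
            self.succ2 = self.succ.succ


# Ring 9 -> 16 -> 22 -> 35 -> 9 with correct redundant pointers
n9, n16, n22, n35 = Member(9), Member(16), Member(22), Member(35)
order = [n9, n16, n22, n35]
for i, node in enumerate(order):
    node.succ = order[(i + 1) % 4]
    node.succ2 = order[(i + 2) % 4]
    node.pred = order[(i - 1) % 4]

n22.fail()       # 22 Fails or Leaves
n16.update()     # 16 Updates: successor 22 is dead, so use 35
n35.flush()      # 35 Flushes its dead predecessor
n9.reconcile()   # 9 Reconciles: second successor becomes 16's successor, 35
print(n16.succ.id, n9.succ2.id, n35.pred)  # 35 35 None
```

In this sketch 35 is left without a predecessor; a later Stabilize by 16 would be expected to restore it through notification.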
WHAT YOUR IMPLEMENTATION MIGHT DO
[Figure: members 3, 16, 20 in successive states, with steps labeled 16 Fails, 3 Stabilizes]
now 3 has no pointer to another Chord member
the ring is broken, and the protocol cannot recover it
nodes should reconcile more often, so their redundant successors have good information
before over-writing a pointer, a node should have checked that the new node is live
DO YOU THINK YOU WOULD FIND AND FIX ALL THESE LITTLE PROBLEMS?
DHTs (there are others besides Chord) have a reputation for being unreliable
when a distributed system fails, it is very difficult to find out why!
Donald Kossmann writes about an experiment he did with M.S. students in computer science at ETH in 2008:
“As part of a lab project on distributed systems, I asked the 8 students to implement the Chord protocol based on the Chord paper [same one you read].
Each student had his/her own implementation. The ETH students are pretty good systems builders so they all did fairly well and got something running: We had a small test suite and all implementations passed that.
Then, I asked the students to do the second part of the project: Run a Chord ring with your own implementation X.
Then add a new node to this ring, but that new node uses Chord implementation Y from another student. There was no combination of X and Y (not even if there was only one X node and one Y node) in which two nodes with different Chord implementations could even exchange a message. The interesting thing was that the problem was not serialization or message format or use of different TCP libraries or so. Those would have been easy fixes. There was no way for the students to fix the problem and get anything running even though they would get extra credit if they succeeded.
We allocated 10 weeks for the first part and 4 weeks for the second part. 4 weeks were not enough!”
WHY IS CHORD IMPORTANT TO YOU?
the 2001 SIGCOMM paper introducing Chord is one of the most-referenced papers in computer science, . . .
. . . and won SIGCOMM’s 2011 Test of Time Award
APPLICATIONS
“Three features that distinguish Chord from many other peer-to-peer lookup protocols are . . .
. . . its simplicity,
. . . provable correctness,
. . . and provable performance.”
allows millions of ad hoc peers to cooperate
the best-known application is BitTorrent
WIDELY used as a building block in distributed fault-tolerant applications
if you are a Ph.D. student, your dissertation may use Chord!
WHY MIGHT YOU CHOOSE CHORD TO IMPLEMENT YOUR DISSERTATION IDEA?
[Slide annotations on the points above: OOPS!, TRUE!, TRUE!]
THE CLAIMS
Correctness Property:
In any execution state, IF there are no subsequent Join or Fail events, . . .
. . . THEN eventually . . .
. . . all pointers in the network will be globally correct, and remain so.
THE REALITY
even with simple bugs fixed and optimistic assumptions about atomicity, the original protocol is not correct
of the seven properties claimed invariant of the original version, not one is actually an invariant
not surprisingly, due to sloppy informal specification and proof
I found these problems by analyzing a small Alloy model
Chris Newcombe and others at AWS credit this work with overcoming their bias against formal methods, which they now use to find bugs. [CACM, April 2015]
“EVENTUAL REACHABILITY” IS NOT THE ONLY ISSUE
OrderedMerges . . .
. . . means that appendages merge in the correct places, as they do here
[Figure: members 6, 10, 12, 16 before and after the steps 6 Stabilizes, 12 GetsNotified]
OrderedMerges is easily violated
VIOLATIONS OF OrderedMerges
are not incorrect
invalidate some assumptions used in performance analysis
can be demonstrated in Chord networks with 3 nodes
how could they go unknown for ten years?
this is why formal methods are so important
BASIC CORRECTNESS STRATEGY 1
[Figure: a ring with appendages; labels: ring, appendages, dead, best successor (first live successor), member itself, extended successor list (ESL) of 29 (with L = 2)]
Original operating assumption:
No failure leaves a member without a live successor.
But if an ESL with L = 2 is . . .
. . . then 32 cannot fail!
Clearly, with L = 2, Chord is intended to tolerate one failure in a neighborhood, but not two.
BASIC CORRECTNESS STRATEGY 2
[Figure: a ring with appendages; labels: ring, appendages, dead, best successor (first live successor), member itself, extended successor list (ESL) of 29 (with L = 2)]
Definition of FullSuccessorLists: The extended successor list of each member has L+1 distinct entries.
New operating assumption:
If a Chord network has the property FullSuccessorLists, then no failure leaves a member without a live successor.
if not satisfied for the real failure rate,
. . . increase rate of stabilization,
. . . or increase redundancy
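FullSuccessorLists is simple to check on a snapshot of the network. A minimal sketch, assuming successor lists are given as a dictionary from each member to its list of L successors; the function name and the example identifiers are hypothetical.

```python
def full_successor_lists(successor_lists: dict[int, list[int]], L: int) -> bool:
    """True if every member's extended successor list has L + 1 distinct entries."""
    for member, succs in successor_lists.items():
        esl = [member] + succs  # the ESL is the member itself plus its successor list
        if len(esl) != L + 1 or len(set(esl)) != L + 1:
            return False
    return True


good = {29: [41, 55], 41: [55, 29], 55: [29, 41]}
bad = {29: [32, 32], 32: [29, 29]}  # duplicates: one failure can strand a member
print(full_successor_lists(good, L=2))  # True
print(full_successor_lists(bad, L=2))   # False
```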
BASIC CORRECTNESS STRATEGY 3
[Figure: a ring with appendages]
TO MAKE ORIGINAL CHORD CORRECT:
alter the initialization to satisfy FullSuccessorLists with all members live
requires L+1 members
alter the operations to populate successor lists more eagerly, preserve FullSuccessorLists at all times
now it is roughly correct (in hindsight)
but how do we prove it without an invariant?
WHY IS FINDING AN INVARIANT SO DIFFICULT?
THE KNOWN, NECESSARY PROPERTIES ARE STATED IN TERMS OF THE RING . . .
about the ring of best successors:
there is a ring of best successors
there is no more than one ring
on the unique ring, the members are in identifier order
about the appendages:
from each appendage member, the ring is reachable through best successors
. . . BUT “RING VERSUS APPENDAGE” IS CONTEXT-DEPENDENT AND FLUID:
[Figure: members u, w, x, y, z and the ring, before and after x fails]
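The ring properties can be checked on a snapshot of best-successor pointers. This is a sketch under assumptions: the pointers are given as a dictionary in which every member maps to another member, the function names are hypothetical, and the example identifiers are made up.

```python
def ring_members(best_succ: dict[int, int]) -> set[int]:
    """Members that lie on a cycle of best-successor pointers."""
    on_cycle = set()
    for start in best_succ:
        seen = []
        node = start
        while node not in seen:
            seen.append(node)
            node = best_succ[node]
        on_cycle.add(node)  # the first repeated member lies on a cycle
    return on_cycle


def one_ordered_ring(best_succ: dict[int, int]) -> bool:
    """True if there is exactly one ring and its members are in identifier order."""
    ring = ring_members(best_succ)
    if not ring:
        return False
    start = min(ring)
    cycle = [start]
    node = best_succ[start]
    while node != start:
        cycle.append(node)
        node = best_succ[node]
    # One ring: this cycle accounts for every ring member.
    # Ordered: starting from the smallest identifier, identifiers only increase.
    return set(cycle) == ring and cycle == sorted(cycle)


ordered = {7: 19, 19: 29, 29: 41, 41: 55, 55: 7, 13: 19}  # 13 is an appendage
disordered = {7: 29, 29: 19, 19: 7}
two_rings = {1: 2, 2: 1, 5: 6, 6: 5}
print(one_ordered_ring(ordered), one_ordered_ring(disordered), one_ordered_ring(two_rings))
# True False False
```

When every member has a best successor inside the map and there is exactly one ring, each appendage member necessarily reaches that ring, so the appendage property needs no separate check in this representation.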
AN INTERMEDIATE RESULT
THE INDUCTIVE INVARIANT:
OneOrderedRing and ConnectedAppendages and BaseNotSkipped
BaseNotSkipped: no successor list skips over a member of the stable base
ANOTHER OPERATING ASSUMPTION:
A Chord network has a stable base of L+1 nodes that are always members.
expensive to implement these high-availability nodes!
THE PROOF OF CORRECTNESS:
by exhaustive enumeration, in Alloy, for all model instances up to N = 9, L = 3
a stable base would have 3-6 members, while a Chord network can have millions of members; what is the base doing?
I believe it is just preventing anomalies in small networks, but how can we know for sure?
THE FINAL RESULT
THE INDUCTIVE INVARIANT:
OneLiveSuccessor and SufficientPrincipals
OneLiveSuccessor: this is just a formalization of the original operating assumption
Definition of a principal member: A member that is not skipped by any member’s successor list.
Definition of SufficientPrincipals: There are at least L+1 principal nodes.
ANOTHER OPERATING ASSUMPTION:
None
THE PROOF OF CORRECTNESS:
informal and intuitive, but . . .
. . . a real proof (no size limits)
. . . backed up by an Alloy model checked up to N = 9, L = 3 (as a protection against human error)
the “stable base” has become something we can prove, rather than an assumption!
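The definitions of principal member and SufficientPrincipals translate directly into a check. A minimal sketch, assuming that a list "skips" a member when the member's identifier falls strictly between two adjacent entries of an extended successor list; the function names and example network are hypothetical.

```python
def between(a: int, b: int, c: int) -> bool:
    """True if b lies strictly between a and c going clockwise around the ring."""
    if a < c:
        return a < b < c
    return b > a or b < c  # the arc from a to c wraps past 0


def principals(successor_lists: dict[int, list[int]]) -> set[int]:
    """Members that are not skipped by any member's extended successor list."""
    members = set(successor_lists)
    skipped = set()
    for member, succs in successor_lists.items():
        esl = [member] + succs
        for a, c in zip(esl, esl[1:]):
            if a == c:
                continue  # a duplicated entry spans no arc (an assumption of this sketch)
            skipped |= {p for p in members if between(a, p, c)}
    return members - skipped


def sufficient_principals(successor_lists: dict[int, list[int]], L: int) -> bool:
    """True if there are at least L + 1 principal members."""
    return len(principals(successor_lists)) >= L + 1


# 10 has joined, but 7's list still jumps from 7 straight to 16
network = {7: [16, 23], 10: [16, 23], 16: [23, 7], 23: [7, 10]}
print(sorted(principals(network)))          # [7, 16, 23]
print(sufficient_principals(network, L=2))  # True
```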
CONCLUSIONS
SPECIFICATION OF A CORRECT VERSION OF CHORD
initialization is more difficult than in original Chord, but a simple protocol will get networks off to a safe start
otherwise correct Chord is just as efficient as original Chord
these peer-to-peer protocols have a (justified) reputation for unreliability
it is an impressive pattern for fault-tolerance
a correct specification could pave the way for a new generation of reliable, more useful implementations
it also provides a firm foundation for work on better failure detection and security
FOR YOUR DISSERTATION, ONLY THE BEST WILL DO!
use an implementation based on “Reasoning about identifier spaces: How to make Chord correct”, Pamela Zave, IEEE Transactions on Software Engineering
APPENDIX: AN OUTLINE OF THE PROOF
ORDERED SUCCESSOR LISTS . . . ARE IMPLIED BY THE INVARIANT
Definition of OrderedSuccessorLists: For all distinct identifiers x, y, z, and sublists [x, y, z] of an ESL (whether the sublist is contiguous or not), between [x, y, z].
Proof of OrderedSuccessorLists
[Figure: x, y, z placed on the identifier space]
hypothesize a disordered extended successor list [. . . x, . . . y, . . . z, . . .], with between [y, x, z] in identifier space
[x, . . . y, . . . z] must include L + 1 principal nodes (from the picture, principal nodes not skipped, SufficientPrincipals)
x is either not a principal node, or is duplicated in [y, . . . z]
same reasoning for z
so the length of [x, . . . y, . . . z] is at least L + 3: (L + 1) plus one x and one z
but the length of an ESL is always L + 1
CONTRADICTION!
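The definition of OrderedSuccessorLists can be checked mechanically on one extended successor list. A minimal sketch; the function names and example lists are hypothetical, and `between` is taken to mean strictly between in clockwise identifier order.

```python
from itertools import combinations


def between(a: int, b: int, c: int) -> bool:
    """True if b lies strictly between a and c going clockwise around the ring."""
    if a < c:
        return a < b < c
    return b > a or b < c  # the arc from a to c wraps past 0


def ordered_successor_list(esl: list[int]) -> bool:
    """True if every sublist [x, y, z] of distinct identifiers satisfies between(x, y, z)."""
    # combinations() keeps list order, so it yields contiguous and non-contiguous sublists
    for x, y, z in combinations(esl, 3):
        if len({x, y, z}) == 3 and not between(x, y, z):
            return False
    return True


print(ordered_successor_list([29, 41, 55]))     # True
print(ordered_successor_list([50, 60, 5, 20]))  # True (wraps past 0)
print(ordered_successor_list([29, 55, 41]))     # False
```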
THE PRINCIPAL NODES MAKE THE SHAPE OF THE RING
every member has a best successor (first live successor)
there are sufficient principal nodes
here is a graph of best successors:
[Figure: graph of best successors over members 0, 7, 9, 12, 15, 23, 24, 33, 35, 37, 46, 51, shown before and after re-arrangement]
these paths . . .
. . . do not skip principal nodes
. . . are acyclic
. . . are ordered by identifiers
each tree has exactly one p, which is unique to it
so the re-arranged graph must look like this, automatically satisfying OneOrderedRing and ConnectedAppendages
CONCURRENCY AND COMMUNICATION
AN OPERATION IS A SEQUENCE OF ATOMIC STEPS
EACH ATOMIC STEP IS AN INTERNAL STATE CHANGE OR THIS:
[Figure: node X sends a query message to node Y and receives a reply message; in the atomic step at X, X changes state; the step can be assumed to occur at this instant]
formal model has a shared-state abstraction
while waiting for a reply (or timeout), X cannot answer queries about its state
because of the structure of operations, queries cannot form circular waits
HOW STABILIZE PRESERVES THE INVARIANT
[Figure: extended successor list of the stabilizing node before Stabilize, and extended successor list of its new successor; entries are labeled S and N with numeric subscripts]
because the invariant holds, no current principal nodes are skipped here or here
precondition guarantees that between [S, N, S] (subscripted as in the figure), so no current principal nodes are skipped from S to N
therefore, no former principal nodes are skipped by this new successor list, and the number of principal nodes has not decreased
some Chord operations need multiple atomic steps
in the new, provably correct, specification, every intermediate operation state is also constructed in this safe way
JOIN AND RECTIFY ARE SIMILAR
HOW FAIL PRESERVES THE INVARIANT
PRESERVATION OF OneLiveSuccessor
The operating assumption is that no failure leaves a member with no live successor, so the invariant is assumed to be preserved.
PRESERVATION OF SufficientPrincipals
Why can’t failure of a principal node leave the network with fewer than L+1 principals?
Lemma: The only operation that can cause a node to change from principal to non-principal is its own failure.
The life history of a long-lived member:
1. Join
2. become principal because all neighbors know you
3. enjoy life as a principal node
4. Fail
Therefore the number of principal nodes is proportional to the number of nodes.
Once the network has grown (especially to millions of members!) it is overwhelmingly improbable that it will have fewer than 3-6 principal nodes.
PROVING PROGRESS
IF THERE ARE NO MORE JOIN OR FAIL EVENTS . . .
. . . WHILE MEMBERS CONTINUE TO STABILIZE . . .
1. dead successors are removed, so that every member’s first successor is live
2. every member’s first successor and predecessor become globally correct
3. tails of all successor lists become correct
as with construction of intermediate successor lists, operations must be specified precisely to ensure correctness
here preconditions must ensure that no operation reverses the progress of a past or current phase
CONCLUSIONS
THE PRODUCT
initialization is more difficult than in original Chord, but a simple protocol will get networks off to a safe start
otherwise correct Chord is just as efficient as original Chord
these peer-to-peer protocols have a (justified) reputation for unreliability
it is an impressive pattern for fault-tolerance
a correct specification could pave the way for a new generation of reliable, more useful implementations
it also provides a firm foundation for work on better failure detection and security
THE PROCESS
Chord is a very interesting protocol; note that the invariant looks nothing like the properties we care about!
results would have been impossible to find without model-checking to explore bizarre cases and get ideas from them
the best result was impossible to find without the insights that came from the proof process
that is where the idea of a stable base came from
pamelazave.com > How to Make Chord Correct