DHTs
Distributed Hash Tables (DHTs)
• Abstraction: a distributed hash-table data structure
  - insert(id, item)
  - item = query(id)  (or lookup(id))
  - Note: item can be anything: a data object, document, file, pointer to a file…
• Proposals
  - CAN, Chord, Kademlia, Pastry, Tapestry, etc.
Structured Networks
• Distributed Hash Tables (DHTs)
• Hash table interface: put(key, item), get(key)
• O(log n) hops
[Figure: structured overlay of nodes, each holding (key, item) pairs; put(K1, I1) routes item I1 to the node responsible for K1, and a later get(K1) from any node returns I1.]
Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications
Ion Stoica, Robert Morris, David Karger,
M. Frans Kaashoek, Hari Balakrishnan
MIT and Berkeley
Chord IDs
• Key identifier = SHA-1(key)
• Node identifier = SHA-1(IP address)
• Both are uniformly distributed
• Both exist in the same ID space
• How to map key IDs to node IDs?
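As a concrete illustration of the two hash computations above, here is a minimal Python sketch; the helper name and the m = 7 ring size (matching the figures) are illustrative choices, not from the paper:

```python
# Sketch: deriving Chord IDs with SHA-1. The helper name and the
# m = 7 ring size are illustrative, not from the paper's code.
import hashlib

M = 7  # m-bit identifier space (matches the 7-bit ring in the figures)

def chord_id(data: str, m: int = M) -> int:
    """Hash a key or an IP address into the same m-bit ID space."""
    digest = hashlib.sha1(data.encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** m)

key_id = chord_id("song.mp3")      # key identifier = SHA-1(key)
node_id = chord_id("18.26.4.9")    # node identifier = SHA-1(IP address)
```

Because SHA-1 output is close to uniform, both kinds of IDs spread evenly over the circle.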
Consistent hashing [Karger 97]
[Figure: circular 7-bit ID space with nodes N32, N90, N105 ("Node 105") and keys K5 ("Key 5"), K20, K80 placed on the circle.]
A key is stored at its successor: node with next higher ID
• N: number of nodes in the system
• m: each node is assigned an m-bit identifier
• K: number of keys stored in the system
• Identifier circle: arithmetic modulo 2^m
• “Load balancing”: uniform distribution of keys to nodes
  - Roughly K/N keys change hands for each join/leave.
• Discussion:
  - What if nodes have heterogeneous capabilities?
  - What if keys have varying degrees of popularity?
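The successor rule can be sketched in a few lines of Python; the node IDs mirror the example ring from the consistent-hashing figure (a local toy model, not the distributed protocol):

```python
# Sketch of consistent hashing: a key lives at its successor, the node
# with the next-higher ID, wrapping around mod 2^m. Node IDs mirror the
# example ring (N32, N90, N105); this is a toy model, not a protocol.
import bisect

M = 7
nodes = sorted([32, 90, 105])

def successor(key_id: int) -> int:
    """Return the node responsible for key_id."""
    i = bisect.bisect_left(nodes, key_id % (2 ** M))
    return nodes[i % len(nodes)]   # wrap past the highest node to the lowest

assert successor(80) == 90    # K80 is stored at N90
assert successor(110) == 32   # wraps around the circle
```

Adding or removing one entry in `nodes` only remaps the keys in one arc of the circle, which is the "roughly K/N keys change hands" point above.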
Basic lookup
[Figure: N10 asks “Where is key 80?”; the query follows successor pointers through N32 and N60 until N90 answers “N90 has K80”. Ring: N10, N32, N60, N90, N105, N120.]
Simple lookup algorithm
Lookup(my-id, key-id)
  n = my successor
  if my-id < n < key-id
    call Lookup(key-id) on node n   // next hop
  else
    return my successor             // done

• Correctness depends only on successors
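The algorithm above can be simulated locally; a modular interval test makes the circular comparison "my-id < n < key-id" precise (node IDs and helper names are illustrative):

```python
# Local simulation of successor-only lookup. The interval test handles
# wrap-around on the 2^m circle; node IDs are an illustrative toy ring.
M = 7
RING = 2 ** M
nodes = sorted([10, 32, 60, 90, 105])
succ = {n: nodes[(i + 1) % len(nodes)] for i, n in enumerate(nodes)}

def in_interval(x: int, a: int, b: int) -> bool:
    """True if x lies in the circular interval (a, b]."""
    return 0 < (x - a) % RING <= (b - a) % RING

def lookup(my_id: int, key_id: int) -> int:
    while not in_interval(key_id, my_id, succ[my_id]):
        my_id = succ[my_id]        # next hop: forward to my successor
    return succ[my_id]             # done: key lies in (my_id, successor]

assert lookup(10, 80) == 90        # "N90 has K80", as in the earlier figure
```

Each hop advances exactly one position on the ring, so this version needs O(N) hops in the worst case, which is what the finger table fixes next.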
“Finger table” allows log(N)-time lookups
[Figure: node N80’s fingers partition the ring into intervals of ½, ¼, ⅛, 1/16, 1/32, 1/64, and 1/128 of the ID space.]
Finger i points to successor of n + 2^i
[Figure: same finger structure; e.g. N80’s finger for n + 2^5 = 112 points to N120, the successor of 112.]
Lookup with fingers
Lookup(my-id, key-id)
  look in local finger table for
    highest node n s.t. my-id < n < key-id
  if n exists
    call Lookup(key-id) on node n   // next hop
  else
    return my successor             // done
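A toy version of the finger-based lookup, building on the same illustrative ring (the `succ_of` helper and finger construction are my scaffolding, not the paper's code):

```python
# Toy finger-table lookup: finger i of node n points to the successor of
# n + 2^i, so each hop roughly halves the remaining distance. Node IDs
# and helper names are illustrative.
import bisect

M = 7
RING = 2 ** M
nodes = sorted([10, 32, 60, 90, 105])

def succ_of(x: int) -> int:
    i = bisect.bisect_left(nodes, x % RING)
    return nodes[i % len(nodes)]

def in_open_interval(x: int, a: int, b: int) -> bool:
    return 0 < (x - a) % RING < (b - a) % RING

# finger[i] of node n = successor of (n + 2^i)
fingers = {n: [succ_of(n + 2 ** i) for i in range(M)] for n in nodes}

def lookup(my_id: int, key_id: int) -> int:
    # highest finger n such that my-id < n < key-id (circularly)
    for f in reversed(fingers[my_id]):
        if in_open_interval(f, my_id, key_id):
            return lookup(f, key_id)     # next hop
    return fingers[my_id][0]             # done: finger 0 is my successor

assert lookup(32, 19) == succ_of(19)     # reaches K19's successor
```

Choosing the highest qualifying finger is what halves the distance to the key at every hop, giving the O(log N) bound on the next slide.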
Lookups take O(log N) hops
[Figure: Lookup(K19) hops across the ring via fingers, halving the remaining distance each time, until it reaches N20, the successor of K19. Ring: N5, N10, N20, N32, N60, N80, N99, N110.]
Discussion
• “Structure”:
  - Neighbor relations imposed by ID
  - Performance not considered
  - Potentially log N hops, but each hop could be really inefficient.
• Debate:
  - Can “structured” protocols result in good performance?
• Variants (e.g. Pastry):
  - Slightly “relax” the structure
  - A neighbor must be chosen from among a “set of possibilities”.
  - Use performance to choose among them.
Joining: linked list insert
[Figure: new node N36 joins a ring containing N25 and N40, where keys K30 and K38 are stored at N40. Step 1: N36 runs Lookup(36) to find its successor.]
Join (2)
[Figure: step 2: N36 sets its own successor pointer to N40.]
Join (3)
[Figure: step 3: keys 26..36 (here K30) are copied from N40 to N36.]
Join (4)
[Figure: step 4: N25’s successor pointer is set to N36; K30 now lives at N36, K38 at N40.]
• Update finger pointers in the background.
• Correct successors produce correct lookups.
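The four join steps read like a linked-list insert; a local sketch of the sequence (the single-map ring state and RPC-free style are simplifications of the real protocol):

```python
# Local sketch of the join sequence: find the successor, set the new
# node's successor pointer, copy the keys it now owns, then fix the
# predecessor's pointer. A toy stand-in for the RPC-based protocol.
M = 7
RING = 2 ** M
successor = {25: 40, 40: 25}        # two-node ring: N25 -> N40 -> N25
keys = {25: set(), 40: {30, 38}}    # K30 and K38 start at N40

def between(x, a, b):               # x in circular interval (a, b]
    return 0 < (x - a) % RING <= (b - a) % RING

def join(n, existing):
    s = existing                    # 1. Lookup(n): walk to n's predecessor
    while not between(n, s, successor[s]):
        s = successor[s]
    successor[n] = successor[s]     # 2. set the new node's successor pointer
    keys[n] = {k for k in keys[successor[n]] if between(k, s, n)}
    keys[successor[n]] -= keys[n]   # 3. copy keys in (predecessor, n] over
    successor[s] = n                # 4. set the predecessor's successor pointer

join(36, existing=25)
assert successor[25] == 36 and successor[36] == 40
assert keys[36] == {30} and keys[40] == {38}   # K30 moved, K38 stayed
```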
Join (finger pointers)
• Node n contacts a “bootstrap node” b
• Node n learns its predecessor and fingers by asking b to look them up.
  - O(log² N) messages
  - Practical optimizations reduce this further.
• Need to update the finger tables of existing nodes to point to n.
  - p = find_predecessor(n - 2^(i-1))
  - FindPredecessor(id):
    • Contact a node x towards the id
    • If id lies between x and succ(x), return x
    • Else ask x for the node it knows about that most closely precedes id
Failures might cause incorrect lookup
[Figure: ring N10, N80, N85, N102, N113, N120; several of N80’s immediate successors have failed, so N80 doesn’t know its correct successor and Lookup(90) returns an incorrect result.]
Solution: successor lists
• Each node knows its r immediate successors
• After a failure, a node will know its first live successor
• Correct successors guarantee correct lookups
• The guarantee holds with some probability
• With O(log N) successors, the probability of a ring partition is small.
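A minimal sketch of the successor-list idea (r = 3 and the node set are illustrative choices):

```python
# Sketch: each node keeps r immediate successors, so after a failure it
# can fall back to the first live one. Node set and r are illustrative.
nodes = sorted([10, 32, 60, 90, 105])
r = 3

def successor_list(n):
    i = nodes.index(n)
    return [nodes[(i + j) % len(nodes)] for j in range(1, r + 1)]

def first_live_successor(n, alive):
    for s in successor_list(n):
        if s in alive:
            return s
    return None        # all r successors dead: the ring may partition

assert first_live_successor(32, alive={10, 32, 90, 105}) == 90  # N60 failed
```

Lookups stay correct as long as at least one entry in each node's list is alive, which is why r = O(log N) makes a partition unlikely.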
Concurrent Operations and Failures
• What’s discussed:
  - Simplistic scenarios: individual joins and leaves.
  - How about concurrent failures, churn, transient state?
• Paper argument:
  - Maintain correctness of the ring.
  - It’s OK to have inaccuracies in the finger table (a performance issue only).
• “Stabilization” primitive:
  - If node n runs stabilize, it asks its successor s for s’s predecessor p.
  - n decides whether p should be its successor instead.
• Does not discuss failures leading to a “partition” of the ring
• Finger-pointer updates are not extensively discussed.
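The stabilize primitive can be sketched against the earlier join example, where N36 has joined but N25 still points at N40 (the ring state and the simplified notify step are my assumptions, not the paper's pseudocode):

```python
# Sketch of stabilize: n asks its successor s for s's predecessor p; if p
# has slipped in between, n adopts p as its successor and notifies it.
# Ring state and the simplified notify step are illustrative.
M = 7
RING = 2 ** M
successor = {25: 40, 36: 40, 40: 25}    # N36 joined; N25 still points at N40
predecessor = {25: 40, 36: 25, 40: 36}

def between(x, a, b):                   # x in circular interval (a, b)
    return 0 < (x - a) % RING < (b - a) % RING

def stabilize(n):
    s = successor[n]
    p = predecessor[s]                  # ask successor for its predecessor
    if between(p, n, s):
        successor[n] = p                # p should be n's successor instead
    predecessor[successor[n]] = n       # notify (real Chord checks first)

stabilize(25)
assert successor[25] == 36              # the ring pointer is now correct
```

Running stabilize periodically at every node is what repairs successor pointers after concurrent joins, keeping the ring correct even while fingers are stale.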
Other Notable DHTs
• Concurrently developed with Chord:
  - CAN (Ratnasamy, Francis, Handley, Karp, Shenker)
  - Pastry (Rowstron, Druschel)
  - Tapestry (Zhao, Kubiatowicz, Joseph)
• Later:
  - Kademlia (Maymounkov, Mazières)
Discussion
• Churn and concurrent failures are tricky in general
• The paper, and DHT work in general, seems superficial about them (some follow-on work exists).
• “Ring correctness”: what if NATs exist?
Data Replication
• If a node leaves, its data is moved to a nearby node.
• What about abrupt failure?
• Paper: replicate data on the r successors.
  - Follow-up work looked at replication.
DHTs: Discussion
• Structured vs. unstructured search.
  - Structure helps, but what if most searches are for popular files?
• Lots of practical issues are tricky with structured protocols:
  - Handling node heterogeneity (Ethernet vs. DSL)
  - Handling NATs and firewalls
  - High churn
  - Performance vs. structure in neighbor selection
  - Load imbalance due to differences in file popularity
  - Keyword search / fuzzy keyword search
• Lots of work since on:
  - Resolving some of these issues
  - Comparing structured and unstructured protocols
  - Hybrid schemes that combine structured and unstructured approaches.
DHTs: Discussion
• DHTs evolved as a nice algorithm for search/lookup.
• Since then, the DHT camp argues it is a generic substrate
  - Multicast, anycast, file storage, etc. (e.g. the i3 paper, SIGCOMM ’02)
• Panacea / generic substrate, or a “solution looking for problems”?
• DHTs in the real world:
  - Have primarily seen success in lookup/search.
  - Deployed in Kademlia and recent versions of BitTorrent.