Self-stabilizing Overlay Networks
Sukumar Ghosh, University of Iowa
Work in progress. Jointly with Andrew Berns and Sriram Pemmaraju
(Talk at Michigan Technological University)
On Thursday, 16th August 2007, Skype had an outage
(Skype is known to be a “self-healing” overlay network)
(Skype’s explanation)
The disruption was triggered by a massive restart of users’
computers across the globe within a very short timeframe,
as they re-booted after receiving a routine set of patches
through Windows Update.
Overlay Network
A logical network laid on top of the Internet
[Figure: nodes A, B, and C attached to the Internet, with logical link AB and logical link BC drawn on top]
The Formal Model
Let V be a set of nodes.
The function id : V → Z+ assigns a unique id to each node in V.
The function rs : V → {0, 1}* assigns a random bit string to each node in V.
A family of overlay networks is a mapping ON : F → G, where F is the set of all triples λ = (V, id, rs) and G is the set of all directed graphs.
The family of overlay networks associates a unique directed graph ON(λ) ∈ G with each labeled set λ = (V, id, rs) of nodes.
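As a minimal sketch of the model above, the mapping ON can be viewed as a pure function from a labeled node set to a directed graph. The sorted-list rule used here (`sorted_list_overlay`) is only an illustrative choice of ON, not one prescribed by the talk, and it ignores the random strings rs.

```python
def sorted_list_overlay(V, node_id, rs):
    """Hypothetical ON: link each node to its predecessor and successor in id order.

    V       -- set of node names
    node_id -- dict mapping each node to its unique positive integer id
    rs      -- dict mapping each node to its random bit string (unused by this rule)
    Returns a set of directed edges (a, b).
    """
    order = sorted(V, key=lambda v: node_id[v])
    edges = set()
    for a, b in zip(order, order[1:]):
        edges.add((a, b))  # edge toward the next larger id
        edges.add((b, a))  # edge back toward the next smaller id
    return edges
```

Because the result depends only on λ = (V, id, rs), the same labeled set always yields the same graph, which is the uniqueness property the model asks for.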
Structured vs. Unstructured Overlay Networks
Unstructured:
No restriction on network topology.
Examples: Gnutella, Kazaa, BitTorrent, Skype, etc.
Structured:
Network topology satisfies specific invariants.
Examples: Chord, CAN, Pastry, Skip Graph, etc.
The Challenge
Can an overlay network restore its correct functionality from
an arbitrary initial configuration?
Bad configurations can be caused by failures, perturbations,
selfish actions, malicious attacks.
Autonomic Systems
Self-management is the holy grail of all complex
dynamic systems.
Self-stabilizing systems
(Convergence) Recover from any arbitrary
initial configuration to a legal configuration in a
bounded number of steps, and
(Closure) remain in the legal configuration
thereafter, until another failure or perturbation
occurs.
Self-stabilizing Overlay Networks
Can an overlay network restore its topology from
an arbitrary initial configuration?
Does it make sense in unstructured networks?
Does it make sense in structured networks?
Related work
Self-stabilizing and Byzantine-tolerant overlay network. OPODIS 2007 [Dolev, Hoch, van Renesse]
A distributed polylog time algorithm for self-stabilizing SKIP graph. PODC 2009 [Jacob, Richa, Scheideler et al.]
Linearization: Locally self-stabilizing sorting in graphs. ALENEX, SIAM 2007 [Onus, Richa, Scheideler]
Example: Linearization
[Figure: an arbitrary connected topology on ten nodes (top) and the sorted list 2, 5, 7, 10, 13, 15, 18, 21, 30, 34 (bottom)]
The ideal topology is a sorted list. The goal is to spontaneously recover to the ideal topology from an arbitrary connected topology.
(Onus, Richa, Scheideler, ALENEX 2007)
Self-stabilizing algorithm: Linearization
Left and right neighbors:
– ‘w’ is a left neighbor of node ‘u’ if {u, w} ∈ E and w < u.
– ‘w’ is a right neighbor of node ‘u’ if {u, w} ∈ E and u < w.
Example: u = 10
Left neighbors: w1 = 2, w2 = 3, w3 = 6, w4 = 8
Right neighbors: v1 = 19, v2 = 28, v3 = 30, v4 = 35
Self-stabilizing algorithm: Linearization
(The Algorithm) In each round do
Convert left neighbors into sorted list
Convert right neighbors into sorted list
Takes at most (n-2) rounds.
Slide borrowed from Onus et al.
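The round structure above can be sketched in Python as follows. This is one possible reading of the rule, under stated assumptions: rounds are synchronous, every node acts on the graph as it stood at the start of the round, and the next graph is the union of the chains proposed by all nodes. The function names and the round cap are illustrative, not taken from Onus et al.

```python
def linearize_round(nodes, edges):
    """One synchronous round: each node chains its left and right neighbors."""
    adj = {u: set() for u in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    new_edges = set()
    for u in nodes:
        left = sorted(w for w in adj[u] if w < u)
        right = sorted(v for v in adj[u] if v > u)
        # left neighbors become the chain w1 - w2 - ... - wk - u,
        # right neighbors become the chain u - v1 - v2 - ... - vk
        for chain in (left + [u], [u] + right):
            for a, b in zip(chain, chain[1:]):
                new_edges.add((a, b))  # a < b because each chain is sorted
    return new_edges


def linearize(nodes, edges, max_rounds=None):
    """Repeat rounds until the edge set stops changing (or a safety cap is hit)."""
    current = {tuple(sorted(e)) for e in edges}
    limit = max_rounds if max_rounds is not None else len(nodes) ** 2
    for rounds in range(limit):
        nxt = linearize_round(nodes, current)
        if nxt == current:
            return current, rounds
        current = nxt
    return current, limit
```

For example, starting from the path 2 - 10 - 5 - 13 - 7, `linearize([2, 5, 7, 10, 13], [(2, 10), (10, 5), (5, 13), (13, 7)])` ends with the sorted list 2 - 5 - 7 - 10 - 13 as its edge set.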
Evolution of Skip Graph (Aspnes, Shah, SODA 2003)
[Figure: the nodes arranged in a single sorted linked list]
Search time is O(n) hops
SKIP Graph
[Figure: the same nodes arranged in a skip graph]
Node degree = O(log n), diameter = O(log n)
Number of levels = O(log n),
Search time now is O(log n) hops
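[Figure: each node carries a random bit string (001, 100, 110, ...). Level 0 is a single list of all nodes; Level 1 splits the nodes into two lists by first bit (0 - -, 1 - -); Level 2 splits them into four lists by the first two bits (00 -, 01 -, 10 -, 11 -)]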
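The level structure in the figure can be sketched in Python as follows, assuming the usual skip graph construction in which nodes sharing the same i-bit prefix of their random string form one sorted list at level i. The ids and bit strings in the usage note are made up for illustration; they are not the values from the slide.

```python
from collections import defaultdict


def build_skip_graph(membership, levels):
    """membership: dict id -> random bit string (at least levels - 1 bits long).

    Returns {level: {prefix: sorted list of ids sharing that prefix}}.
    """
    graph = {}
    for level in range(levels):
        lists = defaultdict(list)
        for node in sorted(membership):  # sorted order keeps every list sorted
            lists[membership[node][:level]].append(node)
        graph[level] = dict(lists)
    return graph


def neighbors(graph, membership, node):
    """Union of the node's list predecessor and successor over all levels."""
    result = set()
    for level, lists in graph.items():
        chain = lists[membership[node][:level]]
        i = chain.index(node)
        if i > 0:
            result.add(chain[i - 1])
        if i + 1 < len(chain):
            result.add(chain[i + 1])
    return result
```

With `membership = {4: "00", 15: "10", 23: "01", 29: "11", 32: "00", 47: "10"}` and three levels, level 0 is the full sorted list, level 1 has the lists [4, 23, 32] and [15, 29, 47], and node 4 ends up with neighbors 15, 23, and 32.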
SKIP Graph: the question
Can we have a self-stabilizing skip graph that
can spontaneously restore its topology starting
from any “connected” initial configuration?
Why local checking is important
Unless bad configurations are detected via local
checking, periodic global snapshots are needed,
which is disruptive for the system.
SKIP Graph is NOT locally checkable
Self-stabilization requires local detection of errors,
but certain failures are not locally checkable
SKIP+ graph
Jacob, Richa, Scheideler et al. (PODC 2009) proposed a
locally checkable version of SKIP Graph by adding a
few extra edges to an existing Skip Graph. They called
it a SKIP+ Graph.
They presented an algorithm to stabilize such a topology
in O(log² n) rounds with high probability. The algorithm is
quite cumbersome.
We try to devise a simpler and better solution.
Detectors
[Figure: a network in which several nodes are marked as detectors]
Our first step
Detector diameter
The detector diameter of G is the maximum hop
distance in G between any node and the closest
detector.
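The definition above can be computed with a multi-source breadth-first search. The sketch below assumes the graph is given as an adjacency dict and that the set of detectors is already known; both representations are choices made for this example.

```python
from collections import deque


def detector_diameter(adj, detectors):
    """Maximum, over all nodes, of the hop distance to the closest detector.

    adj       -- dict mapping each node to the set of its neighbors
    detectors -- non-empty collection of detector nodes
    """
    if not detectors:
        raise ValueError("at least one detector is required")
    dist = {d: 0 for d in detectors}
    queue = deque(dist)  # start the search from all detectors at once
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    if len(dist) < len(adj):
        raise ValueError("some node cannot reach any detector")
    return max(dist.values())
```

On a path of five nodes, a single detector at one end gives a detector diameter of 4, while a detector in the middle gives 2.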
Transitive Closure Framework
Due to the local checkability property,
in any faulty configuration there is at least one detector.
Transitive Closure Framework
Theorem
For a SKIP+ graph, the detector diameter D = O(log n)
Transitive Closure Framework
The neighbors of each detector become detectors in the next
round. In O(log n) rounds, every node becomes a detector, and
these detectors initiate the transitive closure process. After
an additional O(log n) rounds, all nodes become connected with
one another, and the topology becomes completely connected.
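A rough simulation of this process is sketched below. It simplifies the description in one respect, which is an assumption of the sketch: a node starts the transitive closure step as soon as it becomes a detector, rather than waiting until every node is a detector. In each round a detector turns its neighbors into detectors and introduces all of its neighbors to one another.

```python
def tcf_round(adj, detectors):
    """One synchronous round; returns the new adjacency dict and detector set."""
    new_adj = {u: set(nbrs) for u, nbrs in adj.items()}
    new_detectors = set(detectors)
    for u in detectors:
        new_detectors |= adj[u]      # neighbors of a detector become detectors
        group = adj[u] | {u}
        for a in group:              # u introduces its neighbors to one another
            new_adj[a] |= group - {a}
    return new_adj, new_detectors


def run_tcf(adj, detectors, max_rounds=1000):
    """Run rounds until every node is a detector and the graph is a clique."""
    n = len(adj)
    detectors = set(detectors)
    rounds = 0

    def done():
        return len(detectors) == n and all(len(nb) == n - 1 for nb in adj.values())

    while not done() and rounds < max_rounds:
        adj, detectors = tcf_round(adj, detectors)
        rounds += 1
    return adj, detectors, rounds
```

Starting from a connected graph with at least one detector, the loop ends with a completely connected topology in which every node is a detector; the returned round count can be compared against the O(log n) bounds quoted above on graphs with small detector diameter.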
Transitive Closure Framework
After all nodes become detectors and the topology
eventually becomes completely connected, the nodes
rebuild the correct topology using a REPAIR
subroutine. REPAIR takes only one round.
The Repair Process
Lemma
If the network is completely connected and all nodes are
detectors in round i, a legal overlay network will be built in
round (i + 1), and no node will be a detector.
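The reason one round suffices can be sketched as follows: in a completely connected network every node knows all ids, so each node can compute its own neighborhood in the legal topology locally and drop every other link. The sorted-list target used here is only a stand-in for ON(λ), and the function names are hypothetical.

```python
def sorted_list_neighbors(me, ordered_ids):
    """Stand-in target topology: predecessor and successor in id order."""
    i = ordered_ids.index(me)
    nbrs = set()
    if i > 0:
        nbrs.add(ordered_ids[i - 1])
    if i + 1 < len(ordered_ids):
        nbrs.add(ordered_ids[i + 1])
    return nbrs


def repair(known_ids, target_neighbors=sorted_list_neighbors):
    """Every node keeps only the links that the target topology prescribes.

    known_ids -- all ids, which each node can see because the graph is a clique
    Returns a dict mapping each node to its neighbor set after the round.
    """
    ordered = sorted(known_ids)
    return {u: target_neighbors(u, ordered) for u in ordered}
```

Each node's computation uses only information it already holds, so all nodes can perform it in the same round without further communication.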
Compare with Jacob et al.’s results
Local checkability
Let L define a correct configuration of an overlay network.
Then the network is locally checkable when
L = p_0 ∧ p_1 ∧ p_2 ∧ … ∧ p_(n-1)
where p_i is a local predicate involving process i and its
immediate neighbors only.
Most real-life networks are NOT locally checkable
Example: a clique
Theorem. A completely connected topology is locally checkable
[Figure: a clique on nodes a, b, and c]
Chord is not locally checkable
[Figure: a Chord ring (left) and a loopy Chord ring (right)]
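The following sketch illustrates the problem on a small ring. It assumes each node knows only its successor and predecessor, and it uses one natural local predicate (pointers are mutually consistent, and no known neighbor lies between a node and its successor). The predicate, the key-space size of 8, and the node ids are choices made for this example; the sketch shows that this particular predicate cannot tell the two rings apart, and is not a proof for every possible predicate.

```python
def in_open_arc(x, a, b, n):
    """True if x lies strictly inside the clockwise arc from a to b (mod n)."""
    span = (b - a) % n or n
    return 0 < (x - a) % n < span


def ring_from_order(order):
    """Successor and predecessor pointers for nodes visited in the given order."""
    succ = {a: b for a, b in zip(order, order[1:] + order[:1])}
    pred = {b: a for a, b in succ.items()}
    return succ, pred


def local_check(u, succ, pred, n):
    """Assumed local predicate for node u, using only its direct neighbors."""
    s, p = succ[u], pred[u]
    consistent = pred[s] == u and succ[p] == u
    no_closer = not any(in_open_arc(x, u, s, n) for x in (s, p))
    return consistent and no_closer


def is_legal_ring(succ):
    """Global check: successors follow the ids in increasing order, wrapping once."""
    ids = sorted(succ)
    return all(succ[a] == b for a, b in zip(ids, ids[1:] + ids[:1]))
```

With n = 8, the order [0, 2, 4, 1, 3, 5] winds around the key space twice. Every node passes `local_check`, yet `is_legal_ring` is false, so no node acts as a detector under this predicate.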
CAN is not locally checkable
Content Addressable Network (CAN) on a 2D torus
Replace the black edges by the red edges, and each column becomes a loopy Chord ring
LCON: a locally checkable overlay network in a circular key space
[Figure: nodes 0, 3, 5, 7, 18, 23, 25, 32, 37, 40, 50, 54, and 59 placed on a circular key space with N = 64]
LCON: a locally checkable overlay network in a circular key space
[Figure: nodes 0, 3, 5, 7, 18, 23, 25, 32, 37, 40, 50, 54, and 59 on the circular key space]
S-links for node u: one edge to each node in the range (u to u+s mod N)
D-links for node u: Succ(u+s mod N), Succ(u+2s mod N), …, Succ(u+(d-1)s mod N)
Nmax = s × d
Let s = 16, d = 4
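The two link rules can be sketched in Python as follows. Two details are assumptions of this sketch, since the slide does not spell them out: the S-link range is taken as the open arc strictly between u and u+s, and Succ(x) is taken as the first node at or clockwise after key x. Both choices are consistent with a neighbor count of d+s-2 when every key is occupied.

```python
import bisect


def succ(nodes_sorted, x, n):
    """First node at or clockwise after key x (assumed meaning of Succ)."""
    i = bisect.bisect_left(nodes_sorted, x % n)
    return nodes_sorted[i % len(nodes_sorted)]  # wrap past the largest id


def s_links(nodes_sorted, u, s, n):
    """Nodes strictly between u and u+s on the ring, in clockwise order."""
    hits = [v for v in nodes_sorted if 0 < (v - u) % n < s]
    return sorted(hits, key=lambda v: (v - u) % n)


def d_links(nodes_sorted, u, s, d, n):
    """Succ(u + k*s mod N) for k = 1 .. d-1."""
    return [succ(nodes_sorted, u + k * s, n) for k in range(1, d)]
```

With the node set from the figure, N = 64, s = 16, and d = 4, node 0 gets S-links to 3, 5, and 7 and D-links to 18, 32, and 50.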
Observations
Observation
Each node in LCON has (d+s-2) neighbors. When d = s, the size of the neighborhood is O(√N).
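The arithmetic behind the observation can be written out as below, assuming the key space is fully occupied so that N equals the N_max = s × d defined for LCON.

```latex
N = s \cdot d, \qquad d = s \;\Longrightarrow\; s = \sqrt{N}, \qquad
d + s - 2 \;=\; 2\sqrt{N} - 2 \;=\; O\!\left(\sqrt{N}\right)
```

For N = 64, the balanced choice s = d = 8 gives 14 neighbors, whereas s = 16, d = 4 gives 18.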
Theorem
The detector diameter of LCON is at most two.
Some properties of LCON
Theorem. LCON is locally checkable.
Main idea.
Case 1. If the diameter is two, then every node can “see”
every other node, and check if the topology is correct.
Case 2. We show that if the diameter is greater than two,
then there is at least one detector.
Self-stabilization of LCON
The Transitive Closure Framework (TCF) will stabilize LCON
in O(log N) time.
But it may be a sledgehammer. What is the space complexity
of stabilization using TCF?
Self-stabilization of LCON
We have an algorithm customized for LCON that stabilizes
LCON in polylog time, while the space complexity does
not skyrocket to O(n).
Generalization of LCON
Main idea
Consider a CAN-like topology on a d-dimensional torus. Convert the “ring” in each dimension into an LCON ring. The figure shows this only partially, on a 2-dimensional torus.
Each node has O(d · N^(1/(2d))) neighbors.
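One way to arrive at this bound is sketched below. It assumes the torus has the same number of key positions in every dimension and that each ring is a balanced LCON (equal S and D parameters); here d denotes the number of dimensions, not the LCON parameter.

```latex
\text{keys per ring} = N^{1/d}
\;\Longrightarrow\;
\text{neighbors per ring} = O\!\left(\sqrt{N^{1/d}}\right) = O\!\left(N^{1/(2d)}\right),
\qquad
\text{total over } d \text{ dimensions} = O\!\left(d \cdot N^{1/(2d)}\right)
```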
Conclusion
A new problem of growing interest. We need efficient algorithms for stabilizing a variety of overlay topologies.
The initial topology must be connected. Stabilization from a partitioned topology is impossible. Also, for a given (V, id, rs), the legal topology should be unique. Otherwise there will be an additional step for distributed consensus.
Working on extending this to more fragile networks.
Questions?