Traceroute-like exploration of unknown networks: a statistical
analysis
A. Barrat (LPT, Université Paris-Sud, France), I. Alvarez-Hamelin (LPT, France), L. Dall’Asta (LPT, France), A. Vázquez (Notre-Dame University, USA), A. Vespignani (LPT, France)
cond-mat/0406404
http://www.th.u-psud.fr/page_perso/Barrat
Plan of the talk
• Apparent complexity of the Internet’s structure
• Problem of sampling biases
• Model for traceroute
• Theoretical approach to the traceroute mapping process
• Numerical results
• Conclusions
• Multi-probe reconstruction (router level)
• Use of BGP tables for the Autonomous System level (domains)
Graph representation
different granularities
Internet representation
Many projects (CAIDA, NLANR, RIPE, IPM, PingER...)
=> Large-scale visualizations
TOPOLOGICAL ANALYSIS by STATISTICAL TOOLS and GRAPH THEORY !
Main topological characteristics
•Small world
Distribution of shortest paths (# hops) between two nodes
Main topological characteristics
• Small world: captured by Erdős–Rényi random graphs
An edge is established between each pair of vertices with probability p
=> Poisson degree distribution, with <k> = p N
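The G(N, p) construction above can be sketched in a few lines; the sizes, probability, and seed below are arbitrary illustrative values, not from the talk:

```python
import random
from itertools import combinations

def erdos_renyi(n, p, seed=0):
    """G(n, p): each pair of vertices is linked with probability p."""
    rng = random.Random(seed)
    return {(i, j) for i, j in combinations(range(n), 2) if rng.random() < p}

n, p = 2000, 0.005           # expected mean degree <k> = p * N = 10
edges = erdos_renyi(n, p)
mean_k = 2 * len(edges) / n  # empirical mean degree, close to p * N
print(round(mean_k, 1))
```

The empirical mean degree concentrates around p N, and the degree distribution is approximately Poissonian, as stated on the slide.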
Main topological characteristics
• Small world
• Large clustering: different neighbours of a node will likely know each other
(figure: neighbours 1, 2, 3, …, n of a node have a higher probability of being connected to each other)
=>graph models with large clustering, e.g. Watts-Strogatz 1998
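The clustering coefficient behind this slide (the fraction of a node’s neighbour pairs that are themselves linked) can be sketched as follows; the toy graph is an invented example:

```python
from itertools import combinations

def clustering(adj, i):
    """C_i = fraction of pairs of i's neighbours that are themselves linked."""
    nbrs = adj[i]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return 2 * links / (k * (k - 1))

# toy graph: triangle 0-1-2 plus a pendant node 3 attached to 0
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
print(clustering(adj, 0))   # 1 linked pair (1,2) out of 3 possible
```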
Main topological characteristics
• Small world
• Large clustering
• Dynamical network
• Broad connectivity distributions
• also observed in many other contexts (from biological to social networks)
• huge modeling activity
Main topological characteristics
Broad connectivity distributions: obtained from mapping projects
Govindan et al 2002
Result of a sampling: is this reliable ?
Sampling biases
Internet mapping:
• Sampling is incomplete
• Lateral connectivity is missed (edges are underestimated)
• Finite-size sample
=> essentially a spanning tree
Sampling biases
● Vertices and edges best sampled in the proximity of sources
● Bad estimation of some topological properties
Statistical properties of the sampled graph may differ sharply from those of the original one
Lakhina et al. 2002
Clauset & Moore 2004
De Los Rios & Petermann 2004
Guillaume & Latapy 2004
Bad sampling
?
Evaluating sampling biases
Real graph G=(V,E) (known, with given properties )
Sampled graph G’=(V’,E’)
simulated sampling
Analysis of G’, comparison with G
Model for traceroute
G=(V, E)
Sources Targets
First approximation: union of shortest paths
NB: Unique Shortest Path
Model for traceroute
G’=(V’, E’)
First approximation: union of shortest paths
Very simple model, but it allows for some analytical and numerical understanding
More formally...
G = (V, E): sparse undirected graph with a set of
NS sources S = { i1, i2, …, iNS }
NT targets T = {j1, j2, …, jNT }
randomly placed.
The sampled graph G’=(V’,E’) is obtained by considering the union of all the traceroute-like paths connecting source-target pairs.
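A minimal sketch of this traceroute model: one BFS shortest path per source–target pair (BFS parent trees give the “unique shortest path” approximation), and G’ is the union of the probed paths. The toy graph and the source/target choices are illustrative assumptions:

```python
from collections import deque

def bfs_path(adj, src, dst):
    """Return one shortest path src -> dst (the BFS tree yields a unique path)."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    if dst not in parent:
        return None               # dst unreachable
    path = []
    while dst is not None:
        path.append(dst)
        dst = parent[dst]
    return path[::-1]

def traceroute_sample(adj, sources, targets):
    """G' = (V', E'): union of the probed shortest paths."""
    V, E = set(), set()
    for s in sources:
        for t in targets:
            p = bfs_path(adj, s, t)
            if p:
                V.update(p)
                E.update(frozenset(e) for e in zip(p, p[1:]))
    return V, E

# toy example: 4-cycle 0-1-2-3; probing from 0 to 2 discovers only one side
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
V, E = traceroute_sample(adj, sources=[0], targets=[2])
print(sorted(V))   # the lateral path through node 3 is never seen
```

The toy run illustrates the sampling-bias slide: node 3 and its two edges exist in G but are invisible in G’.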
PARAMETERS:
ρ_S = N_S / N (density of sources)
ρ_T = N_T / N (density of targets)
ε = ρ_S ρ_T N = N_S N_T / N (probing effort)
Usually N_S = O(1), ρ_T = O(1)
For each set Ω = { S, T }, the indicator function that a given edge (i,j) belongs to the sampled graph is

π_ij = 1 − ∏_{s=1…N_S} ∏_{t=1…N_T} [ 1 − σ_ij(l_s, m_t) ]

with σ_ij(l,m) = 1 if (i,j) belongs to the (unique) shortest path between l and m, and 0 otherwise.
Analysis of the mapping process
Averaging over all the possible realizations of the set Ω = { S, T }:

<π_ij> = 1 − ∏_{(l,m)} [ 1 − ρ_S ρ_T σ_ij(l,m) ]

WE HAVE NEGLECTED CORRELATIONS !!

• usually ρ_S ρ_T << 1, so the product can be exponentiated
• Σ_{(l,m)} σ_ij(l,m) is essentially the edge betweenness b_ij
Mean-field statistical analysis

<π_ij> ≃ 1 − exp(−ρ_S ρ_T b_ij)

where b_ij is the edge betweenness
Betweenness centrality b_ij:
for each pair of nodes (l,m) in the graph, there are
• σ_lm shortest paths between l and m
• σ_lm(i,j) shortest paths going through edge (i,j)
b_ij is the sum of σ_lm(i,j) / σ_lm over all pairs (l,m)
Similar concept: node betweenness bi
(figure: an edge lying on many shortest paths has large betweenness; a peripheral edge has small betweenness)
Also: flow of information if each individual sends a message to all other individuals
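The definition above can be turned into a brute-force computation, which is fine for small graphs (the “barbell” example below, two triangles joined by a bridge, is an illustrative assumption):

```python
from collections import deque
from itertools import combinations

def all_shortest_paths(adj, s, t):
    """Enumerate every shortest path s -> t (brute force, small graphs only)."""
    dist = {s: 0}                      # BFS distances from s
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    if t not in dist:
        return []
    paths = []
    def back(u, tail):                 # walk backwards along distance-decreasing edges
        if u == s:
            paths.append([s] + tail)
            return
        for w in adj[u]:
            if dist.get(w) == dist[u] - 1:
                back(w, [u] + tail)
    back(t, [])
    return paths

def edge_betweenness(adj):
    """b_ij = sum over pairs (l,m) of (# shortest l-m paths through ij) / (# l-m paths)."""
    b = {}
    for l, m in combinations(adj, 2):
        paths = all_shortest_paths(adj, l, m)
        for p in paths:
            for e in map(frozenset, zip(p, p[1:])):
                b[e] = b.get(e, 0.0) + 1.0 / len(paths)
    return b

# barbell: triangles {0,1,2} and {3,4,5} joined by the bridge edge (2,3)
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
b = edge_betweenness(adj)
bridge = frozenset({2, 3})
print(b[bridge], max(b.values()) == b[bridge])
```

Every one of the 9 cross pairs must route through the bridge, so it carries the largest betweenness, exactly the kind of edge a traceroute probe discovers first.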
Consequences of the analysis
<π_ij> ≃ 1 − exp(−ρ_S ρ_T b_ij)

• Smallest betweenness b_ij = 2 => <π_ij> ≈ 2 ρ_S ρ_T
(i.e. i and j essentially have to be source and target to discover the edge i-j)
• Largest betweenness b_ij = O(N²) => <π_ij> ≈ 1
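Plugging numbers into the mean-field formula shows the two limits above; N, ρ_S, and ρ_T are illustrative values, not from the talk:

```python
import math

def discovery_prob(b_ij, rho_s, rho_t):
    """Mean-field estimate: <pi_ij> ~ 1 - exp(-rho_S * rho_T * b_ij)."""
    return 1.0 - math.exp(-rho_s * rho_t * b_ij)

N = 10_000
rho_s, rho_t = 5 / N, 0.1                 # a few sources, a finite density of targets
low = discovery_prob(2, rho_s, rho_t)     # minimal betweenness b_ij = 2
high = discovery_prob(N ** 2 / 10, rho_s, rho_t)  # hub-like edge, b_ij = O(N^2)
print(low, high)   # low ~ 2 * rho_S * rho_T, high ~ 1
```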
Results for the vertices
<π_ij> ≃ 1 − exp(−ε b_ij/N)

<π_i> ≃ 1 − (1 − ρ_T) exp(−ε b_i/N) (discovery probability)

N*_k/N_k ≈ 1 − exp(−ε b(k)/N) (discovery frequency)

<k*>/k ≈ (1 + ε b(k)/N)/k (discovered connectivity)
Summary:
• discovery probability strongly related to the centrality ;
• vertex discovery favored by a finite density of targets ;
• accuracy increased by increasing probing effort .
Numerical simulations
1. Homogeneous graphs:
• peaked distributions of k and b
• narrow range of betweenness: b_max is of the same order as the average <b>
(ex: ER random graphs)
=> Good sampling expected only for high probing effort
Numerical simulations
1. Homogeneous graphs
2. Heavy-tailed graphs (ex: scale-free BA model)
• broad distributions of k and b: P(k) ~ k^-3
• large range of available betweenness values
=> Expected: hubs well-sampled independently of ε (since b(k) grows with k)
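The contrast between the two graph classes can be illustrated by comparing the degree spread of an ER graph and a BA graph with matched mean degree; since b(k) grows with k, the degree spread is a rough proxy for the betweenness range. This is only a numerical sketch, not the paper’s simulation; sizes, m, and seeds are arbitrary assumptions:

```python
import random
from collections import Counter
from itertools import combinations

def er_degrees(n, p, seed=0):
    """Degree sequence of an Erdos-Renyi G(n, p) graph."""
    rng = random.Random(seed)
    deg = Counter()
    for i, j in combinations(range(n), 2):
        if rng.random() < p:
            deg[i] += 1
            deg[j] += 1
    return [deg[i] for i in range(n)]

def ba_degrees(n, m, seed=0):
    """Degree sequence of a Barabasi-Albert graph: each new node attaches
    m edges, choosing existing nodes proportionally to their degree."""
    rng = random.Random(seed)
    repeated = list(range(m))        # node list weighted by degree
    deg = Counter()
    for v in range(m, n):
        chosen = set()
        while len(chosen) < m:       # m distinct preferential targets
            chosen.add(rng.choice(repeated))
        for u in chosen:
            deg[u] += 1
            deg[v] += 1
            repeated += [u, v]
    return [deg[i] for i in range(n)]

n, m = 2000, 3
er = er_degrees(n, 2 * m / n)        # same mean degree ~ 2m as the BA graph
ba = ba_degrees(n, m)
spread = lambda d: max(d) / (sum(d) / len(d))
print(round(spread(er), 1), round(spread(ba), 1))  # BA spread is far larger
```

The ER maximum degree stays within a few times the mean, while the BA hubs reach degrees an order of magnitude above it, which is why the hubs (and their edges) are well sampled at any probing effort.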
Numerical simulations: Homogeneous graphs
• homogeneously (and pretty badly) sampled
• no heavy-tail behaviour, except: heavy-tailed P*(k) only for NS = 1 (cf. Clauset and Moore 2004)
• cut-off around <k> => a large, unrealistic <k> would be needed
• bad sampling of P(k)

Numerical simulations: Scale-free graphs
• hubs are well discovered
• good sampling, especially of the heavy tail;
• almost independent of NS ;
• slight bending for low-degree (less central) nodes => bad evaluation of the exponents
Summary
Analytical approach to a traceroute-like sampling process
Link with the topological properties, in particular the betweenness
Usual random graphs more “difficult” to sample than heavy-tails
Heavy-tails well sampled
Bias yielding a sampled scale-free network from a homogeneous one: only in a few cases, since <k> has to be unrealistically large
Heavy-tail properties are a genuine feature of the Internet
however
Quantitative analysis might be strongly biased (wrong exponents...)
• Optimized strategies: separate influence of ρ_T, ρ_S
• location of sources and targets
• investigation of other networks
• results on redundancy issues
• massive deployment: traceroute@home
Perspectives
• The Internet is a weighted network: bandwidth, traffic, efficiency, router capacities
• Data are scarce and on a limited scale
• Interaction between topology and traffic
and...
cf. also Guillaume and Latapy 2004