© 2012 Columbia University
E6885 Network Science Lecture 4: Network Estimation and Modeling
E 6885 Topics in Signal Processing -- Network Science
Ching-Yung Lin, Dept. of Electrical Engineering, Columbia University
October 1st, 2012
Course Structure
Class Date   Class Number   Topics Covered
09/10/12     1              Overview of Network Science
09/17/12     2              Network Representations and Characteristics
09/24/12     3              Network Partitioning, Clustering and Visualization
10/01/12     4              Network Estimation and Modeling
10/08/12     5              Network Models
10/15/12     6              Network Topology Inference
10/22/12     7              Dynamic Networks - I
10/29/12     8              Dynamic Networks - II
11/12/12     9              Final Project Proposals
11/19/12     10             Analysis of Network Flow
11/26/12     11             Graphical Models and Bayesian Networks
12/03/12     12             Social and Economic Impact of Network Analysis
12/10/12     13             Large-Scale Network Processing System
12/17/12     14             Final Project Presentation
Network Sampling and Estimation
Why are Network Sampling and Estimation important?
Frequently, only a portion of the nodes and edges is observed in a complex system.
What is the outcome?
– Do network characteristics measured on a subgraph represent those of the whole network?
– How do the principles of classical statistical sampling theory differ from those of graph sampling?
Definitions
Population graph: G = (V,E)
Sampled graph: G* = (V*,E*)
– In principle, G* is a subgraph of G.
– In practice, observation errors may cause vertices or edges to be wrongly recorded as present or absent.
Assume we are interested in a particular characteristic of G: η(G)
– For instance, η(G) is:
• Number of edges of G
• Average degree
• Distribution of vertex betweenness centrality scores
• Attributes of vertices such as the proportion of men with more female than male friends in a social network.
– Are we able to get a good estimate of η(G), say $\hat{\eta}$, from G*?
Estimation based on a sampled graph?
How accurate is the plug-in estimate
$$\hat{\eta} = \eta(G^*)\,?$$
This estimate is used implicitly in many network studies, which assert that the properties of an observed network graph are indicative of those same properties for the graph of the network from which the data were sampled.
In classical statistical sampling theory, quantities such as means, standard deviations, and quantiles summarize measurements of individual units' properties.
Examples
Suppose that the characteristic of interest is the average degree of a graph G,
$$\eta(G) = \frac{1}{N_v} \sum_{i \in V} d_i .$$
Let the sample graph G* be based on the n vertices $V^* = \{i_1, \ldots, i_n\}$, and denote its observed degree sequence by $\{d_i^*\}_{i \in V^*}$. The plug-in estimate is
$$\hat{\eta} = \eta(G^*) = \frac{1}{n} \sum_{i \in V^*} d_i^* .$$
Begin with a simple random sample without replacement.
Scenario 1: for each vertex $i \in V^*$, we observe all edges $\{i,j\} \in E$.
Scenario 2: for each vertex $i \in V^*$, we observe an edge $\{i,j\} \in E$ only when both $i, j \in V^*$.
Estimated average degree of the previous example
5,151 vertices and 31,201 edges. Original average degree 12.115.
Sample n =1,500.
Scenario 1: (mean, std) = (12.117, 0.3797). Scenario 2: (mean, std) = (3.528, 0.2260).
A typical adjustment for Scenario 2 uses $d_i^* \approx (n / N_v) \cdot d_i$: rescaling the Scenario 2 estimate by $N_v / n$ gives an adjusted mean of 12.115.
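The two scenarios can be compared with a small simulation. The sketch below is only illustrative: it uses a synthetic random graph rather than the 5,151-vertex network above, and the function names `random_graph` and `average_degree_estimates` are hypothetical.

```python
import random

def random_graph(num_vertices, num_edges, rng):
    """Return an adjacency dict for a simple graph with distinct random edges."""
    adj = {v: set() for v in range(num_vertices)}
    added = 0
    while added < num_edges:
        u, v = rng.sample(range(num_vertices), 2)
        if v not in adj[u]:
            adj[u].add(v)
            adj[v].add(u)
            added += 1
    return adj

def average_degree_estimates(adj, n, rng):
    """Plug-in average-degree estimates under Scenario 1 and Scenario 2."""
    sample = set(rng.sample(sorted(adj), n))
    # Scenario 1: the full degree of every sampled vertex is observed.
    scen1 = sum(len(adj[i]) for i in sample) / n
    # Scenario 2: only edges with both endpoints in the sample are observed.
    scen2 = sum(len(adj[i] & sample) for i in sample) / n
    # Adjustment: rescale the Scenario 2 estimate by N_v / n.
    scen2_adjusted = scen2 * len(adj) / n
    return scen1, scen2, scen2_adjusted

rng = random.Random(0)
adj = random_graph(500, 3000, rng)        # true average degree is 12
print(average_degree_estimates(adj, 150, rng))
```

Scenario 2 can never exceed Scenario 1 on the same sample, because each induced degree is at most the full degree of that vertex.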
Choices of Sampling Designs
“Statistical Properties of Sampled Networks,” Lee, Kim and Jeong, Physical Review, 2006.
“Effect of sampling on topology predictions of protein-protein interaction networks,” Han et al., Nature Biotechnology, 2005.
In principle, how well a sampled graph reflects the population graph depends on:
– The topology of the graph G.
– The characteristics of η(G)
– The nature of the sampling design
Estimation for Totals
Suppose we have:
– Population: $\Omega = \{1, \ldots, N_u\}$, and $y_i$ is the attribute value of interest for unit $i$.
– Let $\tau = \sum_i y_i$ and $\mu = \tau / N_u$ be the total and average values of the y’s in the population.
– Let $S = \{i_1, \ldots, i_n\}$ be a sample of n units from Ω.
In the canonical case in which S is chosen by drawing n units uniformly from Ω, with replacement, natural estimates of μ and τ are:
$$\bar{y} = \frac{1}{n} \sum_{i \in S} y_i \quad \text{and} \quad \hat{\tau} = N_u \bar{y} .$$
These estimates are unbiased (i.e., $E(\bar{y}) = \mu$ and $E(\hat{\tau}) = \tau$).
Estimation for Totals (cont’d)
The variances of these estimators take the forms:
$$V(\bar{y}) = \frac{\sigma^2}{n} \quad \text{and} \quad V(\hat{\tau}) = \frac{N_u^2 \sigma^2}{n},$$
where $\sigma^2$ is the variance of the values y in the full population Ω.
In practice:
– Simple random sampling with replacement is seldom used. Some units are more likely than others to be included. E.g.:
• Marketing
• Census
– Unequal Probability Sampling
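The unbiasedness of the sample mean and its variance σ²/n can be checked exactly on a toy population by enumerating every ordered sample drawn with replacement. This is a minimal sketch: the population values and the function name `exact_moments_with_replacement` are made up for illustration.

```python
from itertools import product
from statistics import fmean, pvariance

def exact_moments_with_replacement(y, n):
    """Enumerate all N^n equally likely ordered samples (with replacement)
    and return the exact mean and variance of the sample mean."""
    means = [fmean(sample) for sample in product(y, repeat=n)]
    return fmean(means), pvariance(means)

y = [2.0, 4.0, 4.0, 10.0]     # hypothetical population values
n = 2
mu = fmean(y)                 # population mean
sigma2 = pvariance(y)         # population variance (divisor N_u)
mean_ybar, var_ybar = exact_moments_with_replacement(y, n)
print(mean_ybar, mu)          # E(y-bar) matches mu
print(var_ybar, sigma2 / n)   # V(y-bar) matches sigma^2 / n
```

Exhaustive enumeration is only feasible for very small populations; it is used here because it avoids any Monte Carlo noise in the comparison.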
Horvitz-Thompson Estimation for Totals
The Horvitz-Thompson estimator handles unequal probability sampling through the use of weighted averaging.
Suppose that, under a given sampling design, each unit $i \in \Omega$ has probability $\pi_i$ of being included in a sample of size n.
Let S be the set of distinct units in the sample. Then the Horvitz-Thompson estimate of the total takes the form:
$$\hat{\tau}_\pi = \sum_{i \in S} \frac{y_i}{\pi_i}$$
Let $Z_i$ be a binary random variable that is 1 if unit i is in S, and zero otherwise. Since $E(Z_i) = P(Z_i = 1) = \pi_i$,
$$E(\hat{\tau}_\pi) = E\Big(\sum_{i \in S} \frac{y_i}{\pi_i}\Big) = E\Big(\sum_{i \in \Omega} \frac{y_i}{\pi_i} Z_i\Big) = \sum_{i \in \Omega} \frac{y_i}{\pi_i} E(Z_i) = \sum_{i \in \Omega} y_i = \tau ,$$
so $\hat{\tau}_\pi$ is an unbiased estimate of $\tau$, and
$$\hat{\mu}_\pi = (1/N_u)\, \hat{\tau}_\pi$$
is an unbiased estimate of $\mu$.
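The unbiasedness argument can be checked numerically on a toy design. In the sketch below, a sampling design is represented as a dict mapping each possible sample to its probability; this representation, the population values, and the function names are assumptions made for illustration.

```python
from itertools import combinations

def inclusion_probabilities(design, units):
    """pi_i is the total probability of the samples that contain unit i."""
    return {i: sum(p for s, p in design.items() if i in s) for i in units}

def horvitz_thompson_total(sample, y, pi):
    """HT estimate of the total: sum of y_i / pi_i over the sampled units."""
    return sum(y[i] / pi[i] for i in sample)

# Hypothetical population and an unequal-probability design over samples of size 2.
y = {"a": 3.0, "b": 5.0, "c": 8.0}
samples = [frozenset(s) for s in combinations(y, 2)]   # {a,b}, {a,c}, {b,c}
design = dict(zip(samples, [0.2, 0.3, 0.5]))

pi = inclusion_probabilities(design, y)
# Expected value of the estimator, taken over the whole design.
expected = sum(p * horvitz_thompson_total(s, y, pi) for s, p in design.items())
print(expected, sum(y.values()))   # the two values agree up to rounding
```

Dividing each observed value by its inclusion probability is what compensates for some units being more likely to appear in the sample than others.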
Horvitz-Thompson Estimation for Totals (cont’d)
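The variance of the estimator can be expressed as:
$$V(\hat{\tau}_\pi) = \sum_{i \in \Omega} \sum_{j \in \Omega} y_i y_j \Big( \frac{\pi_{ij}}{\pi_i \pi_j} - 1 \Big),$$
where $\pi_{ij}$ is the probability that units i and j are both included in the sample (with $\pi_{ii} = \pi_i$).
It can be estimated in an unbiased fashion by the quantity
$$\hat{V}(\hat{\tau}_\pi) = \sum_{i \in S} \sum_{j \in S} y_i y_j \Big( \frac{1}{\pi_i \pi_j} - \frac{1}{\pi_{ij}} \Big),$$
assuming $\pi_{ij} > 0$ for all pairs i,j.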
Simple Random Sampling Without Replacement
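Consider the case of simple random sampling without replacement. The inclusion probabilities are
$$\pi_i = \frac{\binom{N_u - 1}{n - 1}}{\binom{N_u}{n}} = \frac{n}{N_u} \quad \text{and} \quad \pi_{ij} = \frac{n(n-1)}{N_u (N_u - 1)} .$$
Then the Horvitz-Thompson estimates of the total and mean have the form:
$$\hat{\tau}_\pi = N_u \bar{y} \quad \text{and} \quad \hat{\mu}_\pi = \bar{y} .$$
The variance may be shown to be
$$V(\hat{\tau}_\pi) = N_u (N_u - n)\, \sigma^2 / n ,$$
compared to $V(\hat{\tau}) = N_u^2 \sigma^2 / n$ when sampling with replacement.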
Probability Proportional to Size Sampling
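For instance, sampling households with probability proportional to the number of people in each household.
Sampling is done with replacement.
If the selection probability is directly proportional to the value $c_i$ of some characteristic, then
$$\pi_i = 1 - (1 - p_i)^n , \quad \text{where} \quad p_i = c_i \Big/ \sum_j c_j .$$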
The Horvitz-Thompson estimators are more appropriate than the sample mean.
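As a small worked example of the inclusion probability formula, the sketch below computes π_i for a few hypothetical households whose size measure c_i is the number of residents. The data and the function name are illustrative assumptions.

```python
def pps_inclusion_probabilities(c, n):
    """pi_i = 1 - (1 - p_i)^n with p_i = c_i / sum(c), for n draws with replacement."""
    total = sum(c.values())
    return {i: 1.0 - (1.0 - ci / total) ** n for i, ci in c.items()}

# Hypothetical household sizes (the size measure c_i).
household_size = {"h1": 1, "h2": 2, "h3": 5}
pi = pps_inclusion_probabilities(household_size, n=2)
for household, prob in pi.items():
    print(household, round(prob, 4))
```

Larger households are more likely to be included, which is why weighting each observation by 1/π_i is preferable to taking the plain sample mean.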
Estimation of Group Size
In many real cases, the size of the population is unknown.
Capture-recapture estimators address this problem.
The simplest version of capture-recapture involves two stages of simple random sampling without replacement, yielding two samples, say S1 and S2.
Stage 1:
– The sample S1 of size n1 is taken.
– Mark all the units in S1.
– All units are returned.
Stage 2:
– Take another sample S2 of size n2. Then the population size is estimated by
$$\hat{N}_u^{(c/r)} = \frac{n_1 n_2}{m} ,$$
where $m = |S_1 \cap S_2|$ is the number of marked units that appear in the second sample.
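The two stages can be sketched in a few lines. The function name `capture_recapture_estimate` and the choice to return `None` when the two samples do not overlap are assumptions of this sketch, not part of the lecture.

```python
import random

def capture_recapture_estimate(population, n1, n2, rng):
    """Two-stage capture-recapture estimate of the population size."""
    marked = set(rng.sample(population, n1))          # stage 1: capture and mark
    second = rng.sample(population, n2)               # stage 2: recapture
    m = sum(1 for unit in second if unit in marked)   # marked units seen again
    if m == 0:
        return None   # the estimate is undefined when the samples do not overlap
    return n1 * n2 / m

rng = random.Random(42)
population = list(range(1000))    # true size is 1000, unknown to the estimator
print(capture_recapture_estimate(population, 200, 200, rng))
```

The estimator only uses n1, n2 and m; the true population size enters the simulation solely to generate the samples.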
Common Network Graph Sampling Designs
Procedures:
–Selection Stage: two inter-related sets of units (vertices and edges) are sampled.
–Observation Stage
Induced and Incident Subgraph Sampling
Star and Snowball Sampling
Link Tracing
Induced Subgraph Sampling
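Take a simple random sample of n vertices (without replacement) from the graph and observe the subgraph they induce. The vertex and edge inclusion probabilities are
$$\pi_i = \frac{n}{N_v} , \qquad \pi_{\{i,j\}} = \frac{n(n-1)}{N_v (N_v - 1)} .$$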
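Combined with the Horvitz-Thompson estimator, the edge inclusion probability gives an estimate of the total number of edges: the number of observed edges divided by π_{i,j}. The sketch below illustrates this under the assumption that vertices are labeled 0, ..., N_v - 1; the function name is hypothetical.

```python
import random
from itertools import combinations

def estimate_num_edges(edges, num_vertices, n, rng):
    """Horvitz-Thompson estimate of N_e from an induced subgraph sample."""
    sample = set(rng.sample(range(num_vertices), n))
    # An edge is observed only if both of its endpoints were sampled.
    observed = sum(1 for u, v in edges if u in sample and v in sample)
    pi_edge = n * (n - 1) / (num_vertices * (num_vertices - 1))
    return observed / pi_edge

rng = random.Random(7)
edges = list(combinations(range(10), 2))     # complete graph on 10 vertices
print(estimate_num_edges(edges, 10, 4, rng))
```

For a complete graph every sample of n vertices induces the same number of edges, so the estimate has no sampling variability; for other graphs it varies from sample to sample.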
Incident Subgraph Sampling
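Take a uniform random sample of n edges (without replacement) and observe the vertices incident to them. Vertex i, with degree $d_i$, is included with probability
$$\pi_i = \begin{cases} 1 - \dfrac{\binom{N_e - d_i}{n}}{\binom{N_e}{n}} , & \text{if } n \le N_e - d_i , \\ 1 , & \text{if } n > N_e - d_i . \end{cases}$$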
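The vertex inclusion probability under incident subgraph sampling can be evaluated directly with binomial coefficients. This is a minimal sketch; the function name `incident_vertex_inclusion` is hypothetical.

```python
from math import comb

def incident_vertex_inclusion(num_edges, degree, n):
    """Probability that a vertex of the given degree is incident to at least
    one of n edges sampled uniformly without replacement from num_edges."""
    if n > num_edges - degree:
        return 1.0   # too few non-incident edges to fill the whole sample
    return 1.0 - comb(num_edges - degree, n) / comb(num_edges, n)

for d in (1, 2, 5):
    print(d, incident_vertex_inclusion(num_edges=20, degree=d, n=4))
```

High-degree vertices are included with higher probability, so this design is an instance of unequal probability sampling.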
Star and Snowball Sampling
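Star sampling: take a simple random sample of n vertices without replacement and observe all edges incident to them. An edge {i,j} is then included with probability
$$\pi_{\{i,j\}} = 1 - \frac{\binom{N_v - 2}{n}}{\binom{N_v}{n}} .$$
When the vertices at the other ends of the observed edges are also recorded, vertex i is included with probability
$$\pi_i = \sum_{L \subseteq N_i^+} (-1)^{|L|+1} \frac{\binom{N_v - |L|}{n - |L|}}{\binom{N_v}{n}} ,$$
where $N_i^+$ is the set consisting of vertex i and its neighbors, and the sum runs over the non-empty subsets L.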
Link Tracing
After selection of an initial sample of vertices, some subset of the edges from vertices in this sample is traced to additional vertices.
Estimation of the number of edges. Example with different p by induced subgraph sampling
Estimation of the number of edges. Example with different p by induced subgraph sampling (cont’d)
Histograms of estimates of the number of triangles, connected triples and clustering coefficients.
Random Graph Models
Network Graph Models
Modeling of random network graphs. A model is a collection
$$\{ P_\theta(G),\ G \in \Gamma : \theta \in \Theta \}$$
where
$\Gamma$: a collection of possible graphs
$P_\theta$: a probability distribution on $\Gamma$
$\Theta$: a collection of parameters
Usage of network graph models
In practice, network graph models are used for a variety of purposes.
The study of proposed mechanisms for the emergence of certain commonly observed properties in real-world networks.
The testing of the significance of a pre-defined characteristic in a given network graph.
Random Graph Models
The term ‘Random Graph Model’ is typically used to refer to a model specifying a collection $\Gamma$ and a uniform probability $P(\cdot)$ over $\Gamma$.
Random graph models are arguably the most well-developed class of network graph models, due to:
– The comparatively simple nature of these models.
– This simplicity allows for the precise analytical characterization of many of the structural summary measures (in Chapter 4).
Model-based estimation vs. Design-based Estimation
Model-based estimation vs. design-based estimation in network graphs:
–Design-based: inference is based entirely on the random mechanism by which a subset of elements was selected from the population to create the sample.
–Model-based: a model is given by the analyst that specifies a relationship between the sample and the population.
In recent decades, the distinction between these two approaches has become more blurred.
Consider the task of estimating a given characteristic $\eta(G)$ of a network graph G, based on a sampled version of that graph, G*.
In Chapter 5, we used the ‘design-based’ perspective.
If we augment this perspective to include a model-based component, then G is assumed to be drawn randomly from a collection of graphs, and inference needs to account for this additional randomness.
Example – Assessing Significance in Network Graphs
Suppose we have a graph derived from observations: $G^{obs}$.
We are interested in assessing whether the value $\eta(G^{obs})$ is ‘significant’, in the sense of being somehow unusual or unexpected.
Formally, a random graph model is used to create a reference distribution which, under the accompanying assumption of uniform likelihood of elements in $\Gamma$, takes the form:
$$P_{\eta,\Gamma}(t) = \frac{\#\{ G \in \Gamma : \eta(G) \le t \}}{|\Gamma|}$$
If $\eta(G^{obs})$ is found to be sufficiently unlikely under this distribution, this is taken as evidence against the hypothesis that $G^{obs}$ is a uniform draw from $\Gamma$.
How best to choose $\Gamma$ is a practical issue of some importance.
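For very small graphs the reference distribution can be computed exactly by enumerating Γ. The sketch below assumes Γ is the set of all graphs with a fixed number of vertices and edges, and uses the number of triangles as the characteristic η; both choices, and the function names, are illustrative assumptions. For realistic graph sizes, Γ is far too large to enumerate and one would draw random graphs from Γ instead.

```python
from itertools import combinations

def count_triangles(edges, num_vertices):
    """eta(G): the number of triangles in the graph."""
    edge_set = {frozenset(e) for e in edges}
    return sum(
        1
        for a, b, c in combinations(range(num_vertices), 3)
        if {frozenset((a, b)), frozenset((a, c)), frozenset((b, c))} <= edge_set
    )

def reference_cdf(num_vertices, num_edges, eta, t):
    """Fraction of graphs with the given order and size for which eta(G) <= t."""
    pairs = list(combinations(range(num_vertices), 2))
    graphs = list(combinations(pairs, num_edges))   # every graph in Gamma
    hits = sum(1 for g in graphs if eta(g, num_vertices) <= t)
    return hits / len(graphs)

# Graphs with 4 vertices and 4 edges: how often do they contain no triangle?
print(reference_cdf(4, 4, count_triangles, 0))
```

Of the 15 graphs with 4 vertices and 4 edges, only the three 4-cycles are triangle-free, so a triangle-free observed graph would fall in the lower 20% of this reference distribution.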
Classical Random Graph Models
Erdős and Rényi model (1959):
– A simple model that places equal probability on all graphs of a given order and size.
– Let $\Gamma_{N_v,N_e}$ be the collection of all graphs $G = (V,E)$ with $|V| = N_v$ and $|E| = N_e$.
– Assign probability $P(G) = \binom{N}{N_e}^{-1}$ to each $G \in \Gamma_{N_v,N_e}$, where $N = \binom{N_v}{2}$ is the total number of distinct vertex pairs.
The key contribution of Erdős and Rényi was to develop a foundation of formal probabilistic results concerning the characteristics of graphs G drawn randomly from $\Gamma_{N_v,N_e}$.
Classical Random Graph Models
Gilbert model (1959):
– Let $\Gamma_{N_v,p}$ be the collection of all graphs $G = (V,E)$ with $|V| = N_v$.
– Assign an edge independently to each pair of distinct vertices with probability $p \in (0,1)$.
When p is an appropriately defined function of $N_v$ and $N_e$ (with $N_e$ on the order of $p N_v^2$), these two classes of models are essentially equivalent for large $N_v$.
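Both classical models are straightforward to sample from. The sketch below is a naive illustration that lists all vertex pairs explicitly, which is fine for small graphs but not for large ones; the function names `sample_gnm` and `sample_gnp` are hypothetical.

```python
import random
from itertools import combinations

def sample_gnm(num_vertices, num_edges, rng):
    """Erdos-Renyi model: choose N_e of the N = C(N_v, 2) vertex pairs
    uniformly at random, so every graph of this order and size is equally likely."""
    pairs = list(combinations(range(num_vertices), 2))
    return rng.sample(pairs, num_edges)

def sample_gnp(num_vertices, p, rng):
    """Gilbert model: include each vertex pair independently with probability p."""
    return [pair for pair in combinations(range(num_vertices), 2)
            if rng.random() < p]

rng = random.Random(0)
fixed_size = sample_gnm(100, 300, rng)
random_size = sample_gnp(100, 300 / (100 * 99 / 2), rng)   # p chosen so E|E| = 300
print(len(fixed_size), len(random_size))
```

The first model fixes the number of edges exactly, while the second only fixes it in expectation.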
Example
Let $p = c / N_v$.
Then, if c > 1, with high probability G will have a single connected component consisting of $\alpha_c N_v$ vertices, for some constant $\alpha_c$ depending on c, with the remaining components having only on the order of $O(\log N_v)$ vertices.
If c < 1, then all components will have on the order of $O(\log N_v)$ vertices: with high probability G will consist entirely of a large number of very small, separate components.
Classical random graphs have distributions that are concentrated
Classical random graphs have degree distributions that are concentrated, with exponentially decaying tails.
For $p = c / N_v$ and any $\varepsilon > 0$, with high probability
$$(1 - \varepsilon) \frac{c^d e^{-c}}{d!} \le f_d(G) \le (1 + \varepsilon) \frac{c^d e^{-c}}{d!} ,$$
where $f_d(G)$ is the proportion of vertices with degree d.
For large $N_v$, G will have a degree distribution that is like a Poisson distribution with mean c.
Are Classical Random Graphs Practical?
Classical random graphs do not have the broad degree distributions observed in many large-scale real-world networks.
They do not display much clustering.
On the other hand, these graphs do possess the small-world property. The diameter can be shown to vary like $O(\log N_v)$.
Comparing Random Graph Models with Small World Graphs
Small-world graphs start from a lattice structure, which can be shown to have a high level of clustering. The model begins with a set of $N_v$ vertices, arranged in a periodic fashion, and joins each vertex to r of its neighbors on each side. The clustering coefficient of this lattice is roughly 3/4 for large r.
A few edges are then rewired at random.
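The clustering of the starting lattice can be checked numerically. The sketch below builds the ring lattice and computes the local clustering coefficient of one vertex (all vertices are equivalent by symmetry). For comparison it also prints 3(r-1)/(2(2r-1)), a standard closed form for this lattice that is not stated in the lecture and that tends to 3/4 as r grows. The function names are hypothetical, and the rewiring step is not shown.

```python
from itertools import combinations

def ring_lattice(num_vertices, r):
    """Join each vertex to its r nearest neighbors on each side of a ring."""
    adj = {v: set() for v in range(num_vertices)}
    for v in range(num_vertices):
        for k in range(1, r + 1):
            w = (v + k) % num_vertices
            adj[v].add(w)
            adj[w].add(v)
    return adj

def local_clustering(adj, v):
    """Fraction of pairs of neighbors of v that are themselves connected."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return links / (k * (k - 1) / 2)

for r in (2, 5, 20):
    adj = ring_lattice(200, r)
    print(r, local_clustering(adj, 0), 3 * (r - 1) / (2 * (2 * r - 1)))
```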
Network Growth Models
Networks grow over time.
Preferential Attachment Models:
–‘The rich get richer’ principle.
–Simon (1955) proposed a class of models that produced such broad, skewed distributions.
–Price (1965) took this idea and applied it in creating a model for the manner in which networks of citations for documents in the literature grow.
–Barabási and Albert’s model (1999) – a network growth model for undirected graphs.
Some examples of Degree Distribution
(a) scientist collaboration: biologists (circle), physicists (square); (b) collaboration of movie actors; (d) network of directors of Fortune 1000 companies
Power-Law Model
Barabási-Albert model:
–Start with an initial graph $G^{(0)}$ of $N_v^{(0)}$ vertices and $N_e^{(0)}$ edges.
–At stage t = 1, 2, …, the current graph $G^{(t-1)}$ is modified to create a new graph $G^{(t)}$ by adding a new vertex of degree $m \ge 1$, where the m new edges are attached to m different vertices in $G^{(t-1)}$, and the probability that the new vertex will be connected to a given vertex v is given by
$$\frac{d_v}{\sum_{v' \in V} d_{v'}}$$
–At each stage, m existing vertices are connected to a new vertex in a manner preferential to those with higher degrees.
–After t iterations, the resulting graph $G^{(t)}$ will have $N_v^{(t)} = N_v^{(0)} + t$ vertices and $N_e^{(t)} = N_e^{(0)} + tm$ edges.
–In the limit as t tends to infinity, the graphs $G^{(t)}$ have degree distributions that tend to a power-law form $d^{-\alpha}$, with $\alpha = 3$.
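The growth process can be sketched as follows. This is an illustrative simplification, not a reference implementation: the m distinct targets are obtained by repeating degree-proportional draws until m different vertices have been picked, vertices are assumed to be labeled by consecutive integers, and every vertex of the initial graph is assumed to have at least one edge. The function name is hypothetical.

```python
import random

def barabasi_albert(initial_edges, m, steps, rng):
    """Grow a graph by preferential attachment and return its edge list."""
    edges = list(initial_edges)
    # Each vertex appears in `stubs` once per incident edge, so a uniform draw
    # from `stubs` picks vertex v with probability d_v / (sum of all degrees).
    stubs = [v for e in edges for v in e]
    next_vertex = max(stubs) + 1
    for _ in range(steps):
        targets = set()
        while len(targets) < m:            # needs at least m existing vertices
            targets.add(rng.choice(stubs))
        for v in targets:
            edges.append((next_vertex, v))
            stubs.extend((next_vertex, v))
        next_vertex += 1
    return edges

rng = random.Random(1)
triangle = [(0, 1), (1, 2), (0, 2)]        # G^(0): 3 vertices, 3 edges
edges = barabasi_albert(triangle, m=2, steps=1000, rng=rng)
print(len(edges))                          # N_e^(0) + t*m = 3 + 1000*2
```

Keeping a list with one entry per edge endpoint is a common trick that makes each degree-proportional draw a constant-time operation.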
Copying Models
More common for biochemical networks than for the WWW.
Gene duplication is at the heart of nature’s observed tendency to ‘re-use’ biological information in evolving the genomes of living organisms.
Chung et al. (2003):
– Begin with an initial graph $G^{(0)}$.
– Graphs $G^{(t)}$ are constructed from their immediate predecessors, $G^{(t-1)}$, by the addition of a new vertex, say v, that is connected to some randomly chosen subset of neighbors of a randomly chosen existing vertex, say u.
– A vertex u is chosen from $G^{(t-1)}$ uniformly at random, and then the new vertex v is joined with each of the neighbors of u independently with probability p.
– The degree distribution will tend to a power-law form $d^{-\alpha}$, with exponent $\alpha$ satisfying the equation $p(\alpha - 1) = 1 - p^{\alpha - 1}$.
– When p = 1, each new vertex is connected to $G^{(t-1)}$ by fully duplicating the edges of the randomly selected vertex u.
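A single duplication step can be sketched as below, assuming the graph is stored as an adjacency dict with integer vertex labels. The representation and the function name `duplication_step` are assumptions made for illustration.

```python
import random

def duplication_step(adj, p, rng):
    """Add one new vertex v that copies each neighbor of a uniformly chosen
    existing vertex u independently with probability p. Returns (u, v)."""
    u = rng.choice(sorted(adj))      # existing vertex, chosen uniformly at random
    v = max(adj) + 1                 # label of the new vertex
    adj[v] = set()
    for w in sorted(adj[u]):
        if rng.random() < p:
            adj[v].add(w)
            adj[w].add(v)
    return u, v

rng = random.Random(0)
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}    # G^(0): a triangle
for _ in range(200):
    duplication_step(adj, p=0.5, rng=rng)
print(len(adj))                             # 3 + 200 vertices
```

With p = 1 the new vertex receives exactly the neighbors of u, which is the full duplication case described above.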
Fitting Network Growth Models
Predicting: making informal comparisons between certain characteristics of an observed network and those of the graphs resulting from such models.
Example: Wiuf duplication-attachment models, for which a univariate likelihood function is calculated for a network (e.g., interactions among 2,368 proteins).
However, a number of open issues remain, such as how to scale the methodology up effectively to more complicated contexts involving multivariate parameters, larger networks, more realistic network growth models, etc.
Exponential Random Graph Models
Robins and Morris: “A good statistical network graph model needs to be both estimable from data and a reasonable representation of that data, to be theoretically plausible about the type of effects that might have produced the network, and to be amenable to examining which competing effects might be the best explanation of the data.”
One potential class of such models is the “Exponential Random Graph Models” (ERGMs).
Questions?