Centrality Measures: Computing Closeness and Betweenness

Andrea Marino

PhD Course on Graph Mining Algorithms, Università di Pisa

Pisa, February 2018

Centrality measures

The problem of identifying the most central nodes in a network is a fundamental question that has been asked many times in a plethora of research areas, such as

biology, computer science, sociology, and psychology...

Because of the importance of this question, dozens of centrality measures have been introduced in the literature.

Paolo Boldi, Sebastiano Vigna: Axioms for Centrality. Internet Mathematics 10(3-4): 222-262 (2014)


Some of them

Local indices

(In)degree
Number of triangles

Spectral indices, based on some linear-algebra construction

Recursive definition: a node is important if it is connected to important vertices (PageRank, Katz, Seeley)

Path-based indices, based on the number of paths or shortest paths passing through a vertex

Shortest paths passing through a vertex (Betweenness)
Paths ending in a vertex (a different point of view on the Katz index)

Geometric indices, based on distances from a vertex to other vertices

Average distance of one vertex to all the others (Closeness or Harmonic)


Closeness and Betweenness

Closeness and betweenness centrality are certainly two of the oldest and most widely used centrality measures:

almost all books dealing with network analysis discuss them,
almost all existing network analysis libraries implement algorithms to compute them.

We will examine algorithms for computing these two centrality measures, restricting ourselves to unweighted graphs.


Part I

Closeness Centrality


Closeness

Main Idea

A central node should be very efficient in spreading information to all other nodes.

A node is central if the average number of links needed to reach another node is small.

Definition

In a connected graph, the closeness centrality of a node $v$ is defined as $c(v) = \frac{n-1}{f(v)}$, where $f(v) = \sum_{w \in V} d(v,w)$ is the farness of $v$, and $d(v,w)$ is the distance between the two vertices $v$ and $w$ (that is, the number of edges in a shortest path from $v$ to $w$).


If the graph is not (strongly) connected

Researchers have proposed various ways to extend this definition: here, we focus on Lin's index.

Definition

Let $R(v)$ be the set of vertices reachable from $v$, and let $r(v)$ denote its cardinality (note that $v \in R(v)$ by definition). Then, the closeness centrality of a node $v$ is equal to

$$c(v) = \frac{r(v)-1}{f(v)} \cdot \frac{r(v)-1}{n-1} = \frac{(r(v)-1)^2}{(n-1)\,f(v)},$$

where $f(v) = \sum_{w \in R(v)} d(v,w)$.

N. Lin. Foundations of social research. McGraw-Hill, 1976.


Exact Closeness Computation

Computing the closeness value of a node $v$ can be easily done by executing a breadth-first search starting from $v$.

If we want to compute the closeness value of all nodes of the graph, then the time complexity would be $O(nm)$.

$O(nm)$ time complexity is not affordable whenever we deal with very large graphs.
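As an illustration, here is a minimal Python sketch of the exact computation, using Lin's index so that it also works on graphs that are not (strongly) connected. The adjacency-dictionary representation `adj` and the function names are assumptions made for this example, and so is the convention of returning 0 for a node that reaches no other node (the formula would otherwise be 0/0).

```python
from collections import deque

def bfs_distances(adj, source):
    """Return {node: distance} for every node reachable from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def lin_closeness(adj, v):
    """Lin's index (r(v) - 1)^2 / ((n - 1) * f(v)), computed with one BFS."""
    n = len(adj)
    dist = bfs_distances(adj, v)
    r = len(dist)            # r(v): reachable nodes, v included
    f = sum(dist.values())   # farness restricted to R(v)
    if r <= 1:
        return 0.0           # assumed convention: v reaches nobody
    return (r - 1) ** 2 / ((n - 1) * f)

# Directed path 0 -> 1 -> 2 plus an isolated node 3.
adj = {0: [1], 1: [2], 2: [], 3: []}
closeness = {v: lin_closeness(adj, v) for v in adj}
```

Computing `closeness` for every node runs one BFS per node, which is where the $O(nm)$ cost comes from.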


Approximating Closeness Centrality

Consider a simple algorithm for computing the closeness centrality in undirected unweighted graphs, which is based on random sampling. This algorithm performs $k$ breadth-first searches from $k$ random nodes $v_1, \ldots, v_k$ and, for any node $u$, returns

$$\hat{c}(u) = \frac{1}{\sum_{i=1}^{k} \frac{n\,d(v_i,u)}{k(n-1)}}.$$

Theorem

If $k = \Theta\left(\frac{\log n}{\varepsilon^2}\right)$, then with high probability $\left| \frac{1}{\hat{c}(u)} - \frac{1}{c(u)} \right| < \varepsilon$.

David Eppstein, Joseph Wang: Fast approximation of centrality. SODA 2001: 228-229
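The sampling estimator can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the graph is a connected undirected graph stored as an adjacency dictionary `adj`, the samples are drawn uniformly with replacement, the function names and the `seed` parameter are made up for the example, and a node whose accumulated distance is 0 (every sample hit that node) gets an infinite estimate.

```python
import random
from collections import deque

def bfs_distances(adj, source):
    """Return {node: distance} for every node reachable from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def approximate_closeness(adj, k, seed=0):
    """Estimate the closeness of every node from k BFSs rooted at random nodes."""
    rng = random.Random(seed)
    nodes = list(adj)
    n = len(nodes)
    total = {u: 0 for u in nodes}              # sum over i of d(v_i, u)
    for _ in range(k):
        v = rng.choice(nodes)                  # assumed: uniform, with replacement
        for u, d in bfs_distances(adj, v).items():
            total[u] += d
    estimate = {}
    for u in nodes:
        # 1 / c_hat(u) = sum_i n * d(v_i, u) / (k * (n - 1))
        if total[u] > 0:
            estimate[u] = k * (n - 1) / (n * total[u])
        else:
            estimate[u] = float("inf")
    return estimate

# Complete graph on 4 nodes.
k4 = {u: [w for w in range(4) if w != u] for u in range(4)}
est = approximate_closeness(k4, k=5, seed=1)
```

The total cost is $k$ breadth-first searches, that is, $O(km)$ instead of $O(nm)$.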


Top-k central nodes

We focus on the problem of computing only the $k$ most important nodes with respect to the closeness centrality.

Computing an approximation of the closeness values does not guarantee that we can determine the top-$k$ nodes.

Theorem

On directed sparse graphs, in the worst case, an algorithm computing the vertex with the highest closeness centrality in time $O(m^{2-\varepsilon})$ for some $\varepsilon > 0$ would falsify SETH.

Elisabetta Bergamini, Michele Borassi, Pierluigi Crescenzi, Andrea Marino, Henning Meyerhenke: Computing top-k Closeness Centrality Faster in Unweighted Graphs. CoRR abs/1704.01077 (2017). To appear.


Exact Computation

The simplest algorithm for computing the $k$ vertices with largest closeness:

1 It performs a breadth-first search from each vertex v ,

2 it computes its closeness c(v), and,

3 finally, it returns the k vertices with biggest c(v) values.


Speeding up the Algorithm: General Schema

1 For each vertex $v$, it sets $c(v)$ equal to the result of a pruned breadth-first search,

this pruned BFS receives in input the starting node $v$ and a value $x_k$, which is the $k$-th biggest closeness value found until now ($x_k = 0$ if we have not processed at least $k$ vertices).

2 If this pruned BFS returns the value 0, it means that $v$ is not one of the $k$ most central vertices, otherwise $c(v)$ is the actual closeness of $v$.

3 At the end, the $k$ vertices with biggest closeness values are again the $k$ most central vertices.

The order of the nodes

To speed up the pruned BFS, we want $x_k$ to be as big as possible, and consequently we need to process central vertices as soon as possible. To this purpose, we process vertices in decreasing order of degree.


Time cost analysis

This algorithm needs a pre-processing, which requires linear time.

It requires time $O(n \log n)$ to sort vertices, and it needs a priority queue containing at each step the $k$ most central vertices.

Since all other operations need time $O(1)$, the total running time is $O(m + n \log n + n \log k + T) = O(m + n \log n + T)$,

where $\log k$ is the time necessary to execute extraction and update operations in a priority queue and
$T$ is the time needed to perform the pruned breadth-first search $n$ times.

We can easily parallelise this algorithm, by giving each vertex to a different thread: there could be some race condition on $x_k$ but it does not affect correctness and performance.


The pruned breadth-first search

Reminder

A pruned BFS receives in input the starting node $v$ and a value $x_k$, which is the $k$-th biggest closeness value found until now ($x_k = 0$ if we have not processed at least $k$ vertices).

The pruning of a BFS started from node $v$ makes use of an upper bound $\tilde{c}_d(v)$ on the closeness of $v$, which has to be updated whenever, for any $d \geq 0$, the exploration of the $d$-th level of the breadth-first search tree is finished.

This upper bound $\tilde{c}_d(v)$ is obtained by proving a lower bound on the farness of $v$, i.e. $f(v)$, since

$$c(v) = \frac{(r(v)-1)^2}{(n-1)\,f(v)},$$

where, recall, $f(v) = \sum_{w \in R(v)} d(v,w)$.


A lower bound on the farness

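If $\Gamma_d(v)$ denotes the nodes at level $d$ of the BFS tree started from $v$ and if $\gamma_d(v) = |\Gamma_d(v)|$, then

$$f(v) \geq f_d(v) + (d+1)\,\gamma_{d+1}(v) + (d+2)\,(r(v) - n_{d+1}(v)),$$

where

$$f_d(v) = \sum_{i=1}^{d} i \cdot |\Gamma_i(v)| \quad \text{and} \quad n_d(v) = \sum_{i=0}^{d} |\Gamma_i(v)|.$$

Here $n_d(v)$ counts $v$ itself (level 0), exactly as $r(v)$ does. The bound holds because each node at level $d+1$ contributes $d+1$ to the farness, and each of the remaining $r(v) - n_{d+1}(v)$ reachable nodes contributes at least $d+2$.

Since $n_{d+1}(v) = \gamma_{d+1}(v) + n_d(v)$, we have that

$$f(v) \geq f_d(v) - \gamma_{d+1}(v) + (d+2)\,(r(v) - n_d(v)).$$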


Recall that

$$f(v) \geq f_d(v) - \gamma_{d+1}(v) + (d+2)\,(r(v) - n_d(v)).$$

At the end of the exploration of the $d$-th level of the breadth-first search tree, we do not know yet the value of $\gamma_{d+1}(v)$. However, we can certainly say that this value is not greater than the sum of the degrees (out-degrees, in a directed graph) of all nodes at level $d$, that is,

$$\gamma_{d+1}(v) \leq \sum_{u \in \Gamma_d(v)} \deg(u) := \tilde{\gamma}_{d+1}(v).$$

Hence,

$$f(v) \geq f_d(v) - \tilde{\gamma}_{d+1}(v) + (d+2)\,(r(v) - n_d(v)) := \tilde{f}_d(v, r(v)).$$

This lower bound on the farness implies the following upper bound on the closeness:

$$c(v) \leq \frac{(r(v)-1)^2}{n-1} \cdot \frac{1}{\tilde{f}_d(v, r(v))} := \tilde{c}_d(v).$$


Summary

After the exploration of the first $d$ levels of the breadth-first search started from node $v$, an upper bound on the closeness value of $v$ can be computed.


Using the upper bound to prune the BFS

Reminder

A pruned BFS receives in input the starting node $v$ and a value $x_k$, which is the $k$-th biggest closeness value found until now ($x_k = 0$ if we have not processed at least $k$ vertices).

At the end of the exploration of the $d$-th level of the breadth-first search tree, the upper bound $\tilde{c}_d(v)$ is computed and compared to $x_k$.

If $x_k > \tilde{c}_d(v) \geq c(v)$, the BFS is interrupted, since for sure the node $v$ is not among the top-$k$ vertices.

Otherwise, the BFS continues with the exploration of the next level of the search tree.
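The general schema and the pruning rule can be put together in a short Python sketch. It is an illustration rather than the authors' implementation, and it makes several simplifying assumptions: the graph is undirected, connected and has at least two nodes (so $r(v) = n$), it is stored as an adjacency dictionary `adj`, the function names are made up, and the bound is simply skipped when the lower bound on the farness is not positive.

```python
import heapq

def pruned_bfs(adj, v, x_k):
    """Return c(v), or 0.0 as soon as the level-d upper bound falls below x_k."""
    n = len(adj)
    r = n                      # assumption: connected undirected graph
    visited = {v}
    level = [v]                # Gamma_d(v)
    d = 0
    f_d = 0                    # sum of distances to nodes at levels 1..d
    n_d = 1                    # nodes at levels 0..d, v included
    while level:
        # Upper bound on gamma_{d+1}(v): total degree of the current level.
        gamma_tilde = sum(len(adj[u]) for u in level)
        f_tilde = f_d - gamma_tilde + (d + 2) * (r - n_d)
        if f_tilde > 0:        # otherwise the bound gives no information
            c_tilde = (r - 1) ** 2 / ((n - 1) * f_tilde)
            if x_k > c_tilde:
                return 0.0     # v cannot be among the top k
        next_level = []
        for u in level:
            for w in adj[u]:
                if w not in visited:
                    visited.add(w)
                    next_level.append(w)
        d += 1
        f_d += d * len(next_level)
        n_d += len(next_level)
        level = next_level
    return (r - 1) ** 2 / ((n - 1) * f_d)

def top_k_closeness(adj, k):
    """Return the k most central nodes as (closeness, node) pairs, best first."""
    order = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
    heap = []                  # min-heap with the k best pairs found so far
    for v in order:
        x_k = heap[0][0] if len(heap) == k else 0.0
        c = pruned_bfs(adj, v, x_k)
        if c == 0.0:
            continue           # pruned
        if len(heap) < k:
            heapq.heappush(heap, (c, v))
        elif c > heap[0][0]:
            heapq.heapreplace(heap, (c, v))
    return sorted(heap, reverse=True)

# Undirected path 0 - 1 - 2 - 3 - 4: the middle node is the most central.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
best = top_k_closeness(path, 1)
```

On this path the two endpoints are discarded after looking at level 0 only, because their bound 4/7 is already below the closeness 2/3 of the middle node.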


Computing the upper bound

We have said:

$$c(v) \leq \frac{(r(v)-1)^2}{n-1} \cdot \frac{1}{\tilde{f}_d(v, r(v))} := \tilde{c}_d(v).$$

Everything is known after the exploration of the $d$-th level of the BFS tree apart from the value $r(v)$.

If the graph is undirected, $r(v)$ can be easily pre-computed (connected components).
If the graph is directed and strongly connected, then $r(v) = n$.
It remains to deal with the case in which the graph is directed but not strongly connected.


The directed but not strongly connected case

Let us assume, for now, that we know a lower (respectively, upper) bound $\alpha(v)$ (respectively, $\omega(v)$) on $r(v)$;

without loss of generality we can assume that $\alpha(v) > 1$.

We now show that, instead of examining all possible values of $r(v)$ between $\alpha(v)$ and $\omega(v)$, it is sufficient to examine only the two extremes of this interval.

We prove the following lower bound $\lambda_d(v)$ on $\frac{1}{c(v)}$:

Lemma

$$\frac{1}{c(v)} \geq \lambda_d(v) = (n-1) \min\left( \frac{\tilde{f}_d(v, \alpha(v))}{(\alpha(v)-1)^2},\ \frac{\tilde{f}_d(v, \omega(v))}{(\omega(v)-1)^2} \right).$$


If we denote $a = d + 2$ and $b = \tilde{\gamma}_{d+1}(v) + a\,(n_d(v) - 1) - f_d(v)$, we have that

$$f(v) \geq f_d(v) - \tilde{\gamma}_{d+1}(v) + a\,(r(v) - n_d(v))$$

$$= a\,(r(v) - 1) + f_d(v) - \tilde{\gamma}_{d+1}(v) - a\,(n_d(v) - 1)$$

$$= a\,(r(v) - 1) - b.$$

Note that $a > 0$ because $d \geq 0$, and $b > 0$ because

$$f_d(v) = \sum_{w \in N_d(v)} d(v,w) \leq d\,(n_d(v) - 1) < a\,(n_d(v) - 1),$$

where $N_d(v) = \bigcup_{i=0}^{d} \Gamma_i(v)$ and the first inequality holds because, if $w = v$, then $d(v,w) = 0$, and if $w \in N_d(v)$, then $d(v,w) \leq d$. Hence,

$$\frac{1}{c(v)} \geq (n-1)\,\frac{a\,(r(v)-1) - b}{(r(v)-1)^2}.$$


Let us consider the function $g(x) = \frac{ax - b}{x^2}$.

The derivative $g'(x) = \frac{-ax + 2b}{x^3}$ is positive for $0 < x < \frac{2b}{a}$ and negative for $x > \frac{2b}{a}$:

this means that $\frac{2b}{a}$ is a local maximum, and there are no local minima for $x > 0$.
Consequently, in each closed interval $[x_1, x_2]$ where $x_1$ and $x_2$ are positive, the minimum of $g(x)$ is reached in $x_1$ or $x_2$.

Since $0 < \alpha(v) - 1 \leq r(v) - 1 \leq \omega(v) - 1$,

$$g(r(v) - 1) \geq \min\big(g(\alpha(v) - 1),\ g(\omega(v) - 1)\big).$$

Figure: the plot of the function $g(x) = \frac{ax - b}{x^2}$ with $a = b = 1$. There is a local maximum but no local minimum: hence, in each closed interval the minimum is reached in the extremes of the interval.
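The endpoint property can be checked numerically with a few lines of Python. This is only a sanity check on one example, not a proof: the interval $[0.5, 6]$ and the grid resolution are arbitrary choices made for this illustration, with $a = b = 1$ as in the plot.

```python
def g(x, a, b):
    """The function g(x) = (a*x - b) / x^2 from the proof."""
    return (a * x - b) / x ** 2

a, b = 1.0, 1.0
x1, x2 = 0.5, 6.0          # arbitrary interval containing the maximum 2b/a = 2

# Sample the interval on a regular grid that includes both endpoints.
grid = [x1 + i * (x2 - x1) / 1000 for i in range(1001)]
grid_min = min(g(x, a, b) for x in grid)
endpoint_min = min(g(x1, a, b), g(x2, a, b))
```

On this interval the smallest sampled value is attained at the left endpoint, so `grid_min` and `endpoint_min` coincide.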


It now remains to compute $\alpha(v)$ and $\omega(v)$ (in the case of a directed graph which is not strongly connected). This can be done during the pre-processing phase of the algorithm as follows.

Let $G_{scc} = (V_{scc}, E_{scc})$ be the component graph of $G$ and, for any SCC $D$, let $w(D)$ denote the number of nodes in $D$.

If $v$ and $w$ are in the same SCC $C$, then $r(v) = r(w) = \sum_{D \in r(C)} w(D)$, where $r(C)$ denotes the set of SCCs that are reachable from $C$ in $G_{scc}$.

Hence, we simply need to compute a lower (respectively, upper) bound $\alpha(C)$ (respectively, $\omega(C)$) on $\sum_{D \in r(C)} w(D)$, for every SCC $C$.


Goal: compute a lower (respectively, upper) bound $\alpha(C)$ (respectively, $\omega(C)$) on $\sum_{D \in r(C)} w(D)$, for every SCC $C$.

We first compute a topological sort $\{C_1, \ldots, C_l\}$ of $V_{scc}$ (that is, if $(C_i, C_j) \in E_{scc}$, then $i < j$).

Then, we use a dynamic programming approach: starting from $C_l$, we process the SCCs in reverse topological order, and we set

$$\alpha(C) = w(C) + \max_{(C,D) \in E_{scc}} \alpha(D), \qquad \omega(C) = w(C) + \sum_{(C,D) \in E_{scc}} \omega(D).$$

Processing the SCCs in reverse topological order ensures that the values $\alpha(D)$ and $\omega(D)$ on the right-hand side of these equalities are available when we process the SCC $C$.

Clearly, the complexity of computing $\alpha(C)$ and $\omega(C)$, for each SCC $C$, is linear in the size of $G_{scc}$, which is not larger than $G$.
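The dynamic programming step can be sketched as follows in Python. The sketch assumes that the component graph has already been built and is given as a dictionary `succ` mapping each SCC to the list of SCCs it has an edge to, together with a dictionary `weight` of component sizes; these names, and the use of the standard-library `graphlib` module (Python 3.9 or later) for the ordering, are choices made for this example.

```python
from graphlib import TopologicalSorter

def reachability_bounds(succ, weight):
    """Return (alpha, omega), lower and upper bounds on the number of
    nodes reachable from each SCC of the component DAG."""
    alpha, omega = {}, {}
    # TopologicalSorter interprets the mapped values as prerequisites, so
    # passing the successor lists yields each D before any C with (C, D)
    # in the DAG, which is the reverse topological order we need.
    for c in TopologicalSorter(succ).static_order():
        children = succ.get(c, ())
        alpha[c] = weight[c] + max((alpha[d] for d in children), default=0)
        omega[c] = weight[c] + sum(omega[d] for d in children)
    return alpha, omega

# Diamond-shaped component graph: A -> B, A -> C, B -> D, C -> D.
succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
weight = {"A": 1, "B": 2, "C": 3, "D": 4}
alpha, omega = reachability_bounds(succ, weight)
```

In this example the true number of nodes reachable from any node of A is 10: the lower bound 8 only follows the heaviest branch, while the upper bound 14 counts component D twice.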


Observe that the bounds obtained through this simple approach can be improved by using some "tricks".

When the biggest SCC $C$ is processed, we do not use the dynamic programming approach and we can exactly compute $\sum_{D \in r(C)} w(D)$ by simply performing a BFS starting from any node in $C$.

We get exact $\alpha(C)$ and $\omega(C)$.
Also, $\alpha(C')$ and $\omega(C')$ are improved for each SCC $C'$ from which it is possible to reach $C$.

In order to compute the upper bounds for the SCCs that are able to reach $C$, we can run the dynamic programming algorithm on the graph obtained from $G_{scc}$ by removing all components reachable from $C$, and we can then add $\sum_{D \in r(C)} w(D)$.


IMDB

We analyze the Internet Movie Database (in short, IMDB) graph, where nodes are actors, and two actors are connected if they played together in a movie (TV series are ignored).

The data can be collected from the website http://www.imdb.com (some genres can be excluded, such as awards shows, documentaries, game shows, news, reality shows and talk shows).

We can then analyze snapshots of the actor graph, taken every 5 years from 1940 to 2010, and in 2014.

The most central actors in the IMDB graph with respect to the closeness centrality measure. The total time needed to perform the computation with 30 threads is less than 40 minutes!


Part II

Betweenness centrality


Another popular centrality measure is betweenness centrality, which ranks the nodes according to their participation in the shortest paths between other node pairs.

Intuitively, betweenness measures a node's influence on the information flow circulating through the social network, under the assumption that the flow follows shortest paths.


Definition

Let $\sigma_{s,t}$ be the number of shortest paths going from node $s$ to node $t$, and let $\sigma_{s,t}(v)$ be the number of shortest paths going from node $s$ to node $t$ and passing through node $v$. Then, the betweenness centrality value of node $v$ is equal to

$$b(v) = \sum_{s \neq v,\ t \neq v} \frac{\sigma_{s,t}(v)}{\sigma_{s,t}}.$$

In order to compute $b(v)$, we can compute the contribution of a node $s \neq v$ to $b(v)$, that is, the value

$$b_s(v) = \sum_{t \neq v} \frac{\sigma_{s,t}(v)}{\sigma_{s,t}}.$$


To this aim, we make use of the so-called Brandes algorithm, which performs two basic steps (in the following, we will restrict ourselves to undirected unweighted connected graphs).

1 An augmented breadth-first search starting from $s$, which allows us to compute, for every node $t$, the value $\sigma_{s,t}$.

2 An accumulation phase which uses the breadth-first search DAG constructed during the previous phase, in order to compute, for every node $v$, the value $b_s(v)$.

U. Brandes. A faster algorithm for betweenness centrality. The Journal of Mathematical Sociology, 25:163–177, 2001.


Augmented breadth-first search

During the augmented BFS, each node $v$ maintains a list $A(v)$ of its predecessors in a shortest path from the starting node $s$, and the number $\sigma_{s,v}$ of shortest paths from $s$ to $v$.

Each time a node $v$ is inserted into the queue, the node $u$ which inserted it is also memorized along with its distance from $s$ and its value $\sigma_{s,u}$.

Once the node $v$ is extracted from the queue, if $d(s,v) = d(s,u) + 1$, then the node $u$ is added to the list $A(v)$, and $\sigma_{s,v}$ is increased by $\sigma_{s,u}$.

At the end of the augmented breadth-first search each node $v$ has computed its value $\sigma_{s,v}$ and the set of all its predecessors.
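A minimal Python sketch of the augmented BFS is given below. It follows the usual formulation of this step, in which every node enters the queue once and the predecessor list is updated while scanning the edges, rather than the exact queue bookkeeping described above; the adjacency dictionary `adj` and the function name are assumptions made for this example.

```python
from collections import deque

def augmented_bfs(adj, s):
    """BFS from s returning distances, shortest-path counts sigma,
    predecessor lists A(v), and the order in which nodes were visited."""
    dist = {s: 0}
    sigma = {s: 1}
    preds = {s: []}
    order = []                       # nodes by non-decreasing distance from s
    queue = deque([s])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            if v not in dist:        # v is reached for the first time
                dist[v] = dist[u] + 1
                sigma[v] = 0
                preds[v] = []
                queue.append(v)
            if dist[v] == dist[u] + 1:   # u precedes v on a shortest path
                sigma[v] += sigma[u]
                preds[v].append(u)
    return dist, sigma, preds, order

# Square 0 - 1 - 3 - 2 - 0: two shortest paths lead from 0 to 3.
square = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
dist, sigma, preds, order = augmented_bfs(square, 0)
```

The returned `order` is what the accumulation phase later scans backwards.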


An Example


After the augmented phase, each node $v$ has computed its value $\sigma_{s,v}$ and the set of all its predecessors.


Accumulation phase

During the accumulation phase, each node $v$ distributes its value $X(v)$ (defined below) to its predecessors $u_1, \ldots, u_h$, proportionally to their values $\sigma_{s,u_i}$.

Each node $v$ receives from its successors $w_1, \ldots, w_k$ in the breadth-first search DAG the values $x_1, \ldots, x_k$.

It then computes the sum $X(v) = 1 + x_1 + \cdots + x_k$, and it "sends" to each $u_i$ the value $X(v)\,\frac{\sigma_{s,u_i}}{\sigma_{s,v}}$.

By using the dynamic programming technique, this process can be done in linear time starting from the nodes in the DAG which have no outgoing edges.

All the information necessary to execute the process is available at the right time.

Finally, for each node $v$, we can set $b_s(v) = X(v) - 1$ (remember that we do not want to count the paths arriving at $v$).
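The accumulation phase can be sketched in Python as below. The function takes as input the counts $\sigma_{s,v}$, the predecessor lists and the BFS visiting order produced by the augmented BFS; here they are written out by hand for a small square graph, and all names are hypothetical. The result is the contribution of $s$ to the betweenness of each node, written $b_s(v)$ in the definition of betweenness.

```python
def accumulate(sigma, preds, order, s):
    """Return b_s(v) for every v != s, scanning the BFS DAG backwards."""
    received = {v: 0.0 for v in order}   # x_1 + ... + x_k for each node
    b_s = {}
    for v in reversed(order):            # nodes without successors come first
        X = 1.0 + received[v]
        for u in preds[v]:
            # Share X(v) among predecessors proportionally to sigma.
            received[u] += X * sigma[u] / sigma[v]
        if v != s:
            b_s[v] = X - 1.0             # drop the paths ending in v itself
    return b_s

# Hand-written output of an augmented BFS from s = 0 on the square
# 0 - 1 - 3 - 2 - 0 (two shortest paths from 0 to 3).
sigma = {0: 1, 1: 1, 2: 1, 3: 2}
preds = {0: [], 1: [0], 2: [0], 3: [1, 2]}
order = [0, 1, 2, 3]
b_0 = accumulate(sigma, preds, order, 0)
```

Nodes 1 and 2 each lie on one of the two shortest paths from 0 to 3, so each gets a contribution of one half.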


After the accumulation phase each node $v$ has computed its value $b_s(v)$.


The whole algorithm

By executing the augmented BFS starting from each node $s$ and by executing the corresponding accumulation phase, we can compute for each $v$ the value $b_s(v)$.

Hence, the last step is to compute the betweenness centrality value of $v$ by summing up all these values, that is,

$$b(v) = \sum_{s \neq v} b_s(v).$$

The time complexity of the Brandes algorithm is $O(nm)$, since we are visiting twice the breadth-first search DAG starting from each node $s$.

Since this time complexity is not affordable whenever the graph is very large, several approximation algorithms have been proposed.
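Putting the two phases together gives the following compact Python sketch of the whole computation for an unweighted graph stored as an adjacency dictionary `adj` (an assumed representation; the function name is also made up). Following the definition of $b(v)$ as a sum over ordered pairs $(s, t)$, the values are not halved for undirected graphs.

```python
from collections import deque

def betweenness(adj):
    """Betweenness of every node: one augmented BFS plus one
    accumulation phase per source node s."""
    b = {v: 0.0 for v in adj}
    for s in adj:
        # Phase 1: augmented BFS from s.
        dist, sigma, preds, order = {s: 0}, {s: 1}, {s: []}, []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v], sigma[v], preds[v] = dist[u] + 1, 0, []
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        # Phase 2: accumulation, from the farthest nodes back to s.
        received = {v: 0.0 for v in order}
        for v in reversed(order):
            X = 1.0 + received[v]
            for u in preds[v]:
                received[u] += X * sigma[u] / sigma[v]
            if v != s:
                b[v] += X - 1.0          # add b_s(v) to b(v)
    return b

# Undirected path 0 - 1 - 2 and the square 0 - 1 - 3 - 2 - 0.
path_b = betweenness({0: [1], 1: [0, 2], 2: [1]})
square_b = betweenness({0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]})
```

Each source costs $O(n + m)$ for the two passes, which gives the $O(nm)$ total on connected graphs.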


Thanks

Part of these slides is based on a chapter written by Pierluigi Crescenzi for his course "Algorithms for Graph Mining".

