Source: faculty.nps.edu/rgera/MA4404/Winter2020/04-CommunityDetectionAndModularity.pdf

Community Detection

Prof. Ralucca Gera, Applied Mathematics Dept., Naval Postgraduate School, Monterey

Excellence Through Knowledge

Learning Outcomes

• Understand why and how community detection and validation work:
  – Explain the connection to modularity
• Distinguish methodologies used for overlapping and non-overlapping community detection;
• Contrast methodology used in networks built as stochastic block models from random models.

Why Community Detection?

• Communities are features that appear in real networks
  – We generally try to identify them through the structural properties of the network: nodes tend to cluster based on common interests;
• A massive amount of research has been done in this area since 2002;
• Because of its usefulness, community detection became one of the most prominent directions of research in network science;
• It is one of the common analysis tools for understanding networks;
• A community ~ a group of people with common characteristics or shared interests

What is a community?

A community is a subset of nodes that share common or similar characteristics, based on which they tend to group.
• In a social network it might be a circle of friends,
• In the World Wide Web it might indicate a group of pages on closely related topics,
• In a network of emails it may indicate groups of emails that have similar patterns or domains, or that belong to individuals who correspond on a regular basis.

Community detection: identifying which nodes belong to which communities (fast algorithms are usually not deterministic).

What might influence a community?

Homophily: similar nodes cluster together, for example based on language (or based on degree, for degree homophily)

Ref: “Virality Prediction and Community Structure in Social Networks”, Yong-Yeol “YY” Ahn

Fundamental concepts for clustering: Identification and Evaluation


Adjacency matrices of different types of networks

Ref: “Think locally, act locally: Detection of small, medium-sized, and large communities in large networks” by Jeub et al, 2015

Figure: (a) good spectral clustering (b) core-periphery structure (c) unstructured, (d) either way

Different types of adjacency matrices and associated networks: Dark = 1 (or nonnegative weights) and Gray = 0 (no edge)
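To make the dark/gray picture concrete, here is a minimal Python sketch. The toy edge list and the community labels are invented for illustration; the only point is that ordering rows and columns by community makes the dense diagonal blocks of the adjacency matrix visible.

```python
# Toy graph (hypothetical): two triangles joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
communities = [[0, 1, 2], [3, 4, 5]]  # assumed to be known in advance


def adjacency_matrix(edges, order):
    """Return the 0/1 adjacency matrix with rows and columns in the given node order."""
    index = {node: pos for pos, node in enumerate(order)}
    matrix = [[0] * len(order) for _ in order]
    for u, v in edges:
        matrix[index[u]][index[v]] = 1
        matrix[index[v]][index[u]] = 1  # undirected graph: keep the matrix symmetric
    return matrix


# Order the nodes community by community so each community becomes a diagonal block.
order = [node for community in communities for node in community]
A = adjacency_matrix(edges, order)

# '#' plays the role of a dark cell (edge), '.' of a gray cell (no edge).
for row in A:
    print("".join("#" if cell else "." for cell in row))
```

With a node order that ignores the communities, the same matrix would show no visible blocks, which is why such figures are always drawn after sorting the nodes.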

What do networks look like?

Community detection

Methodology from Leskovec’s paper (Stanford):
(1) Data is modeled by an “interaction graph”.
(2) Hypothesis: the world contains groups that interact more strongly within the group than with the outside world.
(3) An objective function or metric is chosen to formalize this idea of groups.
(4) An algorithm is then selected to find sets of nodes that exactly or approximately optimize this function.
(5) The clusters (communities) are then evaluated.

Community evaluation

How do we confirm the value of the community detection?
• Ideally:
  – validating algorithms on community-labeled data (also called ground truth),
  – comparing against existing algorithms.
• Alternatively: since community detection identifies sets of nodes that should naturally be in a community in the real world, check whether the detected sets make intuitive sense as plausible communities.

Adjacency matrices (some overlapping communities)

Reference: Jure Leskovec https://www.youtube.com/watch?v=htWQWN1xAZQ

Overlapping vs non-overlapping

Overview of different types of adjacency matrices and associated networks: Dark = 1 (or nonnegative weights) and Gray = 0 (no edge)

Common clustering methodologies

Non-overlapping:
• Louvain
• Girvan-Newman
• Minimum-cut method
• Modularity maximization

Overlapping:
• Clique Percolation

Non-overlapping communities (node partitioning into communities)


Partitioning Nodes Methods

• We will discuss the two most commonly used methods for community detection that partition the node set:
  – Method 1: Louvain
  – Method 2: Girvan-Newman
• First, let’s talk about modularity
  – Goal of modularity-based community detection: assign nodes to communities so as to maximize modularity

Modularity

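Define modularity as: $Q$ = (number of edges within communities) – (expected number of such edges in a random network of the same size), normalized by $2m$.
• “Expected” comes from a “null model” that compares our network against random networks with the same $n$ and $m$:

$$Q = \frac{1}{2m} \sum_{c} \sum_{i, j \in c} \left( a_{ij} - p_{ij} \right), \quad \text{where } p_{ij} = \frac{\langle k \rangle^2}{2m} \text{ or } p_{ij} = \frac{k_i k_j}{2m}$$

  Here $a_{ij}$ is the adjacency-matrix entry, $k_i$ is the degree of node $i$, $\langle k \rangle$ is the average degree, and the outer sum runs over the communities $c$.
• $Q \in [-1, 1]$, and it compares the edges inside communities to the edges created at random (uniformly) in similar networks.
• Larger values of $Q$ indicate stronger community structure: dense communities with sparse connections between them.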
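As an illustration of the modularity definition, here is a minimal Python sketch that evaluates $Q$ by hand with the degree-based null model $p_{ij} = k_i k_j / (2m)$. The function name, the toy edge list and the two candidate partitions are assumptions made for this example. The code uses the identity $\sum_{i,j \in c} k_i k_j / (2m) = (\sum_{i \in c} k_i)^2 / (2m)$ to avoid a double loop over node pairs.

```python
def modularity(edges, communities):
    """Modularity Q of a partition of an undirected, unweighted graph.

    Null model (assumed here): p_ij = k_i * k_j / (2m).
    """
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    label = {node: idx for idx, comm in enumerate(communities) for node in comm}

    # Sum of a_ij over ordered pairs inside communities = 2 * (edges inside communities),
    # so after dividing by 2m the observed term is inside / m.
    inside = sum(1 for u, v in edges if label[u] == label[v])
    q = inside / m

    # Expected term: for each community, (sum of its degrees)^2 / (2m)^2.
    for comm in communities:
        degree_sum = sum(degree.get(node, 0) for node in comm)
        q -= (degree_sum / (2 * m)) ** 2
    return q


# Toy graph (hypothetical): two triangles joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
q_two_triangles = modularity(edges, [{0, 1, 2}, {3, 4, 5}])
q_one_block = modularity(edges, [{0, 1, 2, 3, 4, 5}])
print(q_two_triangles, q_one_block)
```

Putting every node in a single community always gives $Q = 0$ under this null model, which is why only partitions with $Q$ clearly above zero are read as evidence of community structure.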

Method 1: Louvain

• Goal: optimize modularity. In theory, maximum modularity gives the best possible grouping of the nodes (but modularity may not capture the right communities, as these depend on the function of the network and the definition of edges)
• The Louvain Method of community detection:
  – Step 1: find small communities by optimizing modularity locally on all nodes,
  – Step 2: group each small community into one node,
  – Step 3: repeat Step 1 on the new graph, until modularity no longer improves
• Louvain’s visualization
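The three Louvain steps can be tried out with NetworkX. This is a hedged sketch rather than the course's own code: it assumes a recent NetworkX release that provides `louvain_partitions`, and the karate club graph and the seed value are arbitrary choices for illustration. Each yielded partition corresponds to one pass of "optimize locally, then merge communities into nodes".

```python
import networkx as nx

# Zachary's karate club: a small, standard test network that ships with NetworkX.
G = nx.karate_club_graph()

# louvain_partitions yields one partition per aggregation level, coarsest last.
# The seed only fixes the random node order, since the method is not deterministic.
levels = list(nx.community.louvain_partitions(G, seed=42))

for depth, partition in enumerate(levels):
    q = nx.community.modularity(G, partition)
    print(f"level {depth}: {len(partition)} communities, Q = {q:.3f}")

final_partition = levels[-1]
final_q = nx.community.modularity(G, final_partition)
```

Running it with a different seed can return a different partition of similar modularity, which is worth keeping in mind when comparing results across runs.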

Method 1: Louvain (slide 2)

• Simple, efficient and easy to implement (available in NetworkX, Matlab, C++, Gephi, and R)
• For community detection in large networks
  – For sizes up to 100 million nodes and billions of links.
  – The analysis of a typical network of 2 million nodes takes 2 minutes on a standard PC.
• The method unveils hierarchies of communities and allows zooming within communities to discover sub-communities, sub-sub-communities, etc.
• It is today one of the most widely used methods for detecting communities in large networks

Method 2: Girvan Newman

http://www.jstor.org/stable/pdf/3058918.pdf

• The Girvan–Newman algorithm detects communities by progressively removing edges with high betweenness centrality from the original network, recomputing betweenness after each removal.
• These edges are believed to connect communities.
• The algorithm stops when there are no edges between the identified communities.
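A minimal sketch of the edge-removal idea, using the `girvan_newman` generator from NetworkX. The barbell graph is an assumption chosen for illustration: two 5-cliques joined by one bridge edge, so the bridge carries the highest betweenness and is removed first. Note that the NetworkX generator keeps removing edges and yields every successive split; picking the level of highest modularity, as done here, is one common way to choose where to stop, not the only one.

```python
import networkx as nx

# Two complete graphs on 5 nodes (0-4 and 5-9) connected by the bridge edge (4, 5).
G = nx.barbell_graph(5, 0)

# Each element of `levels` is a tuple of node sets with one more community than the last.
levels = list(nx.community.girvan_newman(G))

# The first split comes from removing the highest-betweenness edge(s) until the graph disconnects.
first_split = sorted(sorted(community) for community in levels[0])
print(first_split)

# Choose the level of the hierarchy with the highest modularity.
best = max(levels, key=lambda parts: nx.community.modularity(G, parts))
print(len(best), "communities at the best level")
```

Because betweenness is recomputed after every removal, this approach is far slower than Louvain and is usually reserved for small networks.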

Method 2: Girvan Newman (slide 2)


Implementation in Python and R.

Overlapping communities (not a partition into communities)


Cliques

• Recall that a clique is a complete subgraph, i.e., one in which all nodes are adjacent to each other (a maximal clique cannot be extended by another node)
• It is NP-hard to find the maximum clique in a network
• A straightforward implementation to find cliques is very expensive in time complexity

Nodes 5, 6, 7 and 8 form a clique

Clique Percolation Method (CPM)

• It uses cliques as a core or a seed to find larger communities
• Clique Percolation Method to find overlapping communities (diagram on next page)
  – Input
    • A parameter k, and a network
  – Procedure
    • Find all cliques of size k in the given network
    • Construct a clique graph: two cliques are adjacent if they share k-1 nodes
    • The nodes appearing in the labels of each connected component of the clique graph form a community

CPM Example

Parameter k = 3

Cliques of size 3: {1, 2, 3}, {1, 3, 4}, {4, 5, 6}, {5, 6, 7}, {5, 6, 8}, {5, 7, 8}, {6, 7, 8}

Figure: clique graph

Communities: {1, 2, 3, 4} and {4, 5, 6, 7, 8}
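The CPM procedure can be sketched in a few lines of plain Python. The edge list below is an assumption: it is the smallest graph whose triangles are exactly the seven cliques of size 3 listed in this example. Cliques are found by brute force over all k-subsets of nodes, which is only reasonable for tiny graphs; the function name is hypothetical.

```python
from itertools import combinations


def clique_percolation(edges, k):
    """Overlapping communities via the clique percolation procedure (brute force)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Step 1: all cliques of size k (every pair inside the subset must be adjacent).
    cliques = [frozenset(c) for c in combinations(sorted(adj), k)
               if all(b in adj[a] for a, b in combinations(c, 2))]

    # Step 2: clique graph, where two cliques are adjacent if they share k-1 nodes.
    neighbors = {c: set() for c in cliques}
    for a, b in combinations(cliques, 2):
        if len(a & b) == k - 1:
            neighbors[a].add(b)
            neighbors[b].add(a)

    # Step 3: each connected component of the clique graph gives one community,
    # namely the union of the nodes of its cliques.
    communities, seen = [], set()
    for start in cliques:
        if start in seen:
            continue
        seen.add(start)
        stack, members = [start], set()
        while stack:
            clique = stack.pop()
            members |= clique
            for other in neighbors[clique] - seen:
                seen.add(other)
                stack.append(other)
        communities.append(members)
    return communities


# Assumed edge list reproducing the seven triangles of the example.
edges = [(1, 2), (1, 3), (2, 3), (1, 4), (3, 4), (4, 5), (4, 6),
         (5, 6), (5, 7), (6, 7), (5, 8), (6, 8), (7, 8)]
communities = clique_percolation(edges, k=3)
print(sorted(sorted(c) for c in communities))
```

Node 4 ends up in both communities because the triangles {1, 3, 4} and {4, 5, 6} share only one node, so they are not adjacent in the clique graph.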

Source and code in R using igraph: http://infernusweb.altervista.org/wp/?p=1479

Evaluation of Community Detection


Community detection evaluation

• Map the sets of nodes back to the real world to see whether they appear to make intuitive sense as a plausible social community.
• Obtain some form of ground truth, in which case the set of nodes output by the algorithm may be compared with it (for example, using Normalized Mutual Information).
• Use modularity and conductance, the popular theoretical metrics, to evaluate the quality of the communities.
  – Network Community Profile: identifies the best community among all the communities of the same size (next page)
• Create an application and validate the derived community structure
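For the ground-truth comparison, a minimal sketch of normalized mutual information (NMI) between two labelings of the same nodes. The normalization used here, $2 I(A;B) / (H(A) + H(B))$, is one common convention and is an assumption of this example (other normalizations exist); the function name and the toy label lists are hypothetical.

```python
from collections import Counter
from math import log


def normalized_mutual_information(labels_a, labels_b):
    """NMI = 2 * I(A;B) / (H(A) + H(B)) for two labelings of the same nodes."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    # Entropies of the two labelings.
    h_a = -sum(c / n * log(c / n) for c in count_a.values())
    h_b = -sum(c / n * log(c / n) for c in count_b.values())

    # Mutual information from the joint distribution of label pairs.
    mutual = sum(c / n * log((c / n) / ((count_a[a] / n) * (count_b[b] / n)))
                 for (a, b), c in joint.items())

    if h_a + h_b == 0:
        return 1.0  # both labelings put every node in one community
    return 2 * mutual / (h_a + h_b)


ground_truth = [0, 0, 0, 1, 1, 1]
detected_same = [1, 1, 1, 0, 0, 0]       # same grouping, different label names
detected_unrelated = [0, 1, 0, 1, 0, 1]  # cuts across the true communities

print(normalized_mutual_information(ground_truth, detected_same))
print(normalized_mutual_information(ground_truth, detected_unrelated))
```

NMI depends only on which nodes are grouped together, not on the label names, so a detected partition does not need to be matched to the ground-truth labels first.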

Network Community Profile (NCP)

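• Start from a community “quality” score, i.e., a formalization of the idea of a “good” community.
• NCP plots the score of the best community of a given size as a function of community size.
• With conductance as the score, lower is better, so the best community of a given size is the one of minimum conductance:

$$\phi(S) = \frac{s}{\min(e, \, 2m - e)}, \qquad \mathrm{NCP}(x) = \min_{|S| = x} \phi(S)$$

  where $s$ is the number of edges between the community $S$ and its complement, $e$ is the sum of the degrees of the nodes in $S$, and $2m - e$ is the same sum over the complement.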
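A minimal sketch of conductance and of a brute-force network community profile. It assumes the standard definition, cut edges divided by the smaller of the two degree sums (the set and its complement), and it enumerates every node subset of each size, which is feasible only for toy graphs. The function names and the toy edge list are assumptions for this example.

```python
from itertools import combinations


def conductance(edges, subset):
    """Cut edges divided by the smaller total degree of the set or its complement."""
    subset = set(subset)
    cut = sum(1 for u, v in edges if (u in subset) != (v in subset))
    volume_in = sum((u in subset) + (v in subset) for u, v in edges)
    volume_out = 2 * len(edges) - volume_in
    smaller = min(volume_in, volume_out)
    # Assumed convention for degenerate sets (empty set, whole graph): worst score.
    return cut / smaller if smaller else 1.0


def network_community_profile(edges, max_size):
    """Best (minimum) conductance over all node sets of each size, by brute force."""
    nodes = sorted({node for edge in edges for node in edge})
    return {size: min(conductance(edges, s) for s in combinations(nodes, size))
            for size in range(1, max_size + 1)}


# Toy graph (hypothetical): two triangles joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
ncp = network_community_profile(edges, max_size=3)
for size, score in ncp.items():
    print(size, round(score, 3))
```

On this toy graph the profile drops sharply at size 3, the size of a triangle, which is the kind of dip an NCP plot is meant to reveal.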

“Think locally, act locally: Detection of small, medium-sized, and large communities in large networks” by Jeub et al, 2015

Generative models preserving community structure


Generative models

• They are probabilistic: they assign a probability value to each edge in the network.
  – Not deterministic (unlike modularity, for example)
• They capture specific assumptions about the way latent (unknown) parameters interact to create edges.
• Fitting the model to specific empirical data is not easy.
• Most commonly used: Stochastic Block Model

Stochastic Block Models

SBM is a commonly used model for creating networks with communities (by Holland, Laskey, & Leinhardt, 1983).
• Definition: for $n, k \in \mathbb{N}$ ($n$ nodes, $k$ communities), a community vector $z$ (where $z_v$ gives the group index of vertex $v$), and a symmetric stochastic block matrix (probability matrix $W \in [0, 1]^{k \times k}$), the model SBM($n$, $p$, $W$) is an $n$-vertex (labelled) random graph such that:
  1. each vertex $v$ belongs to community $z_v \in \{1, 2, \dots, k\}$ (independently chosen; $p$ denotes the distribution over the $k$ communities from which the labels are drawn),
  2. each edge $ij \in E(G)$ exists independently of the other edges, with probability $w_{z_i, z_j}$.
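The definition translates almost line by line into a sampler. This is a minimal sketch under stated assumptions: community indices are 0-based rather than 1-based, the community vector `z` is passed in rather than drawn at random, and the function name, the block matrix values and the seed are invented for illustration.

```python
import random
from itertools import combinations


def sample_sbm(z, W, seed=None):
    """Sample the edge list of an undirected stochastic block model.

    z[v] is the (0-based) community index of vertex v.
    W[r][s] is the probability of an edge between communities r and s.
    """
    rng = random.Random(seed)
    edges = []
    for i, j in combinations(range(len(z)), 2):
        # Each pair is decided independently, using only the two community labels.
        if rng.random() < W[z[i]][z[j]]:
            edges.append((i, j))
    return edges


z = [0, 0, 0, 0, 1, 1, 1, 1]              # 8 vertices, 2 communities
W_assortative = [[0.9, 0.1], [0.1, 0.9]]  # dense inside, sparse between

edges = sample_sbm(z, W_assortative, seed=1)
within = sum(1 for i, j in edges if z[i] == z[j])
print(len(edges), "edges,", within, "of them inside communities")
```

The loop over all pairs costs time quadratic in $n$; library implementations typically avoid this for sparse graphs, but the quadratic version keeps the correspondence with the definition obvious.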

http://tuvalu.santafe.edu/~aaronc/courses/5352/fall2013/csci5352_2013_L16.pdf

Two examples with k=5

http://tuvalu.santafe.edu/~aaronc/courses/5352/fall2013/csci5352_2013_L16.pdf

Assortative communities: nodes connect to similar nodes (dense groups): $m_{ii} > m_{ij}$ for $i \neq j$

Disassortative communities: unlike nodes tend to connect: $m_{ii} < m_{ij}$ for $i \neq j$

What happens if $m_{ii} = m_{ij}$, $\forall i, j$?

An example with constant M (k=5)

What happens if $m_{ii} = m_{ij}$, $\forall i, j$?
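One way to explore the three cases (diagonal entries of the block matrix larger than, smaller than, or equal to the off-diagonal entries) is to sample each and count where the edges fall. The sketch below uses `stochastic_block_model` from NetworkX; the block sizes, probabilities, seed and helper name are arbitrary assumptions for illustration.

```python
import networkx as nx


def within_fraction(p_in, p_out, sizes=(30, 30), seed=0):
    """Fraction of sampled edges that join two nodes of the same block."""
    probs = [[p_in, p_out], [p_out, p_in]]
    G = nx.stochastic_block_model(list(sizes), probs, seed=seed)
    block = nx.get_node_attributes(G, "block")  # block index stored on every node
    within = sum(1 for u, v in G.edges() if block[u] == block[v])
    return within / max(G.number_of_edges(), 1)


assortative = within_fraction(p_in=0.5, p_out=0.02)     # diagonal > off-diagonal
disassortative = within_fraction(p_in=0.02, p_out=0.5)  # diagonal < off-diagonal
constant = within_fraction(p_in=0.2, p_out=0.2)         # all entries equal

print(f"assortative:    {assortative:.2f}")
print(f"disassortative: {disassortative:.2f}")
print(f"constant:       {constant:.2f}")
```

In the constant case the block labels no longer influence any edge, so the fraction simply reflects how many node pairs happen to share a block.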


Core-periphery (we’ll study it later)

The density of connections decreases with the community index.


Figure: core-periphery example, with labels “inner core” and “outer periphery”

Visualization helps

https://arxiv.org/pdf/1703.10146.pdf

Extensions of SBM

• binomial SBM [Holland et al. 1983, Wang & Wong 1987]
• simple assortative SBM [Hofman & Wiggins 2008]
• mixed-membership SBM [Airoldi et al. 2008]
• hierarchical SBM [Clauset et al. 2006, 2008, Peixoto 2014]
• fractal SBM [Leskovec et al. 2005]
• infinite relational model [Kemp et al. 2006]
• degree-corrected SBM [Karrer & Newman 2011]
• SBM + topic models [Ball et al. 2011]
• SBM + vertex covariates [Mariadassou et al. 2010, Newman & Clauset 2016]
• SBM + edge weights [Aicher et al. 2013, 2014, Peixoto 2015]
• bipartite SBM [Larremore et al. 2014]
• multilayer SBM [Peixoto 2015, Vallès-Català et al. 2016, N. Stanley et al. 2016]

http://tuvalu.santafe.edu/~aaronc/courses/5352/csci5352_2017_L6_supplement.pdf

Common methods for dynamic networks

Synthetic models describing the evolution of communities in dynamic networks:
• Spectral graph theory;
• Dirichlet process mixture model;
• Stochastic block model;
• Quantifying the evolution of communities;
• And possibly others

[1] Lei Tang, Huan Liu, Jianping Zhang, and Zohreh Nazeri. Community evolution in dynamic multi-mode networks. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 677–685. ACM, 2008.
[2] Yizhou Sun, Jie Tang, Jiawei Han, Manish Gupta, and Bo Zhao. Community evolution detection in dynamic heterogeneous information networks. In Proceedings of the Eighth Workshop on Mining and Learning with Graphs, pages 137–146. ACM, 2010.
[3] Yu-Ru Lin, Yun Chi, Shenghuo Zhu, Hari Sundaram, and Belle L Tseng. Analyzing communities and their evolutions in dynamic social networks. ACM Transactions on Knowledge Discovery from Data (TKDD), 3(2):8, 2009.
[4] Gergely Palla, Albert-László Barabási, and Tamás Vicsek. Quantifying social group evolution. Nature, 446(7136):664–667, 2007.

Code for community analysis

• Python with NetworkX: Community library
• Matlab: http://commdetect.weebly.com/
• R: http://infernusweb.altervista.org/wp/?p=1479
• Gephi: DyCoNet
• Girvan Newman’s method in Python and R
• Create SBM in Python and R with igraph;
• Python visualization libraries Bokeh and VisPy

References

• “Statistical Properties of Community Structure in Large Social and Information Networks” by Jure Leskovec, Kevin J. Lang, Anirban Dasgupta, Michael W. Mahoney

• Porter, Mason A., Jukka-Pekka Onnela, and Peter J. Mucha. "Communities in networks." Notices of the AMS 56.9 (2009): 1082-1097.

• Conversations and PPT from Mason Porter, Oxford.
• https://networkit.iti.kit.edu/
• Vishwanathan, S. Vichy N., et al. "Graph Kernels." The Journal of Machine Learning Research 11 (2010): 1201-1242.
• Fast computing random walk kernels: Borgwardt, Karsten M., Nicol N. Schraudolph, and S. V. N. Vishwanathan. "Fast computation of graph kernels." Advances in Neural Information Processing Systems. 2006.
• An alternative to kernels using graphlets: Shervashidze, Nino, et al. "Efficient graphlet kernels for large graph comparison." International Conference on Artificial Intelligence and Statistics. 2009.
• Karsten M. Borgwardt and Hans-Peter Kriegel. Shortest path kernels, IEEE International Conference on Data Mining (ICDM’05), 2005.
• Robustness in Modular structure
• Relative centrality and local community

References (2)

• Kivelä, M., Arenas, A., Barthelemy, M., Gleeson, J.P., Moreno, Y. and Porter, M.A., 2014. Multilayer networks. Journal of Complex Networks, 2(3), pp.203-271.
• Lucas G. S. Jeub, Prakash Balachandran, Mason A. Porter, Peter J. Mucha, and Michael W. Mahoney, “Think locally, act locally: Detection of small, medium-sized, and large communities in large networks” Physical Review E 91, 012821 (2015)
• J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney, Internet Math. 6, 29 (2009).
• M. E. Newman “Finding community structure in networks using the eigenvectors of matrices” Physical Review E 74, 036104 (2006)
• Graph mining and management (clustering networks): Aggarwal, Charu C., and Haixun Wang. "Graph data management and mining: A survey of algorithms and applications." Managing and Mining Graph Data. Springer US, 2010. 13-68.
• Malliaros, Fragkiskos D., and Michalis Vazirgiannis. "Clustering and community detection in directed networks: A survey." Physics Reports 533.4 (2013): 95-142.
• Social Media: http://link.springer.com/article/10.1007/s10618-011-0224-z#page-1
• Encyclopedia of Distances

Date post:	11-Mar-2020