Statistical Analysis of Network Data

Lecture 1: Intro, Background, & Descriptive Statistics

Eric D. Kolaczyk

Dept of Mathematics and Statistics, Boston University

[email protected]

Les Houches, April 2014


Welcome

Outline

1 Welcome

2 Background & Motivation
    Network Analysis & Statistics
    Prelude

3 Review of Graphs
    Definitions and Concepts
    Graphs & Matrix Algebra
    Graph Data Structures & Algorithms

4 Descriptive Statistics for Networks
    Network Mapping
    Network Characterization
        Vertex Degree
        Centrality
        Components
        Connectivity / Cuts
        Dynamic Networks

5 Wrapping Up


Topics for my lectures

L1 Introduction, Background, and Descriptive Statistics (1.5hrs)

L2 Network Sampling (1hr)

L3 Network Modeling (1.5hrs)

L4 Additional Topics in Modeling/Analysis (1.5hrs)


Resources

Organization and presentation of material in these lectures will largely parallel that in

Eric Kolaczyk and Gábor Csárdi, Statistical Analysis of Network Data with R. Use R! series. ISBN 978-1-4939-0982-7.

Cover text:

This book is the first of its kind in network research. It can be used as a stand-alone resource in which multiple R packages are used to illustrate how to use the base code for many tasks. igraph is the central package and has created a standard for developing and manipulating network graphs in R. Measurement and analysis are integral components of network research. As a result, there is a critical need for all sorts of statistics for network analysis, ranging from applications to methodology and theory. Networks have permeated everyday life through everyday realities like the Internet, social networks, and viral marketing, and as such, network analysis is an important growth area in the quantitative sciences. Their roots are in social network analysis going back to the 1930s and graph theory going back centuries. This text also builds on Eric Kolaczyk’s book Statistical Analysis of Network Data (Springer, 2009).

Eric Kolaczyk is a professor of statistics, and Director of the Program in Statistics, in the Department of Mathematics and Statistics at Boston University, where he also is an affiliated faculty member in the Bioinformatics Program, the Division of Systems Engineering, and the Program in Computational Neuroscience. His publications on network-based topics, beyond the development of statistical methodology and theory, include work on applications ranging from the detection of anomalous traffic patterns in computer networks to the prediction of biological function in networks of interacting proteins to the characterization of influence of groups of actors in social networks. He is an elected fellow of the American Statistical Association (ASA) and an elected senior member of the Institute of Electrical and Electronics Engineers (IEEE).

Gábor Csárdi is a research associate at the Department of Statistics at Harvard University, Cambridge, Mass. He holds a PhD in Computer Science from Eötvös University, Hungary. His research includes applications of network analysis in biology and social sciences, bioinformatics and computational biology, and graph algorithms. He created the igraph software package in 2005 and has been one of the lead developers since then.


Topics for this lecture

1 Background & Motivation

2 (Brief!) review of graph-related concepts

3 Descriptive Statistics for Networks

    Network Mapping
    Network Characterization


Background & Motivation


Background & Motivation Network Analysis & Statistics

Why Networks?

[Figure: example network graph with vertices labeled Pdp, dCLK, Cyc, Tim, Vri, and Per.]

Relatively small ‘field’ of study until ∼ 15 years ago

Epidemic-like spread of interest in networks since mid-90s

Arguably due to various factors, such as

Increasingly systems-level perspective in science, away from reductionism;

Flood of high-throughput data;

Globalization, the Internet, etc.


What Do We Mean by ‘Network’?

Definition (OED): A collection of inter-connected things.

Caveat emptor: The term ‘network’ is used in the literature to mean various things.

Two extremes are

1 a system of inter-connected things

2 a graph representing such a system¹

Often it is not even clear what is meant when an author refers to ‘the’ network!

¹ I’ll use the slightly redundant term ‘network graph’.


Our Focus . . .

The statistical analysis of network data

i.e., analysis of measurements either of or from a system conceptualized as a network.

Challenges:

relational aspect to the data;

complex statistical dependencies (often the focus!);

high-dimensional and often massive in quantity.


Examples of Networks

Network-based perspective has been brought to bear on problems from across the sciences, humanities, and arts.

A convenient (but nevertheless only rough/approximate) grouping of networks is the following:

Technological

Biological

Social

Informational


Examples of Networks (cont)

[Figure: example network graph with vertices labeled a1 through a34.]

“Know thy data!” (W. Willinger)

Statistics and Network Analysis

The (still emerging?) field of ‘network science’ started very ‘horizontal’, although it is increasingly filling out with greater ‘vertical’ depth.

Lots of ‘players’ . . . uneven depth across the ‘field’ . . . mixed levels of communication/cross-fertilization.

Note: Statisticians arguably a (growing) minority in this area!

But from a statistical perspective, there are certain canonical tasks and problems faced in the questions addressed across the different areas of specialty.

Better vertical depth can be achieved in this area by viewing problems – and pursuing solutions – from this perspective.


Background & Motivation Prelude

Statistical Analysis of Network Data: Prelude

The unique relational nature of network data means that we frequently encounter challenges in canonical tasks that differ in important ways from those typically faced in the statistical analysis of ‘standard’ data.

We will examine how such challenges arise in relation to

visualization;

summary & description;

sampling and inference;

and modeling.


Statistical Analysis of Network Data: Prelude (cont.)

Let’s look at some examples of how the unique relational nature of network data makes seemingly familiar statistical problems change in character.

1 Mapping Science

2 Understanding Epilepsy

3 Monitoring Social Media

4 Predicting Protein Function


Mapping Science

How does one go about ‘mapping’ the ‘landscape’ of ‘Science’?

Statistical challenges:

Defining the population of interest.

Representativeness of our data.

Appropriate notions of units (i.e., ‘vertex’ and ‘edge’).

How to visualize?


Understanding Epilepsy

How can we effectively summarize/describe the complex interactions taking place during an epileptic seizure?

Statistical challenges:

Criterion for defining ‘brain networks’.

Choice of network summary statistics.

Assessing ‘significance’ of change/differences.


Monitoring Social Media

Can we monitor basic characteristics of (typically massive!) social media networks based on samples?

Statistical challenges:

Computer protocols correspond to what types of sampling designs?

What sort of bias(es) are inherent in the sampling?

Can we correct for such biases?


Predicting Protein Function

Is it possible to use knowledge of protein-protein interactions to predict biological function of proteins?

Statistical challenges:

To what extent do interacting proteins share common function?

How do we incorporate a network as an ‘explanatory/predictor variable’?

Can we account for uncertainty in training data and/or network?


Dramatic Pause!

Lots of interesting issues . . .

. . . shall we get started?


Review of Graphs


Networks & Graphs

The language of graphs typically is adopted in talking about networks.

Will assume that everyone is fluent in the basics . . .

. . . but a (very) quick review perhaps won’t hurt!


Review of Graphs Definitions and Concepts

Graphs

Formally², a graph $G = (V, E)$ is a mathematical structure consisting of sets

$V$ of vertices (also commonly called nodes)

$E$ of edges (also commonly called links),

where elements of $E$ are unordered pairs $\{u, v\}$ of distinct vertices $u, v \in V$.

The values $N_v = |V|$ and $N_e = |E|$ are the order and size of the graph, respectively.

² Even more formally, graphs are defined uniquely only up to isomorphisms, i.e., relabelings of vertices and edges that leave the structure unchanged.
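As a minimal sketch of this definition, a simple undirected graph can be stored as a vertex set plus a set of unordered pairs. The helper name `make_graph` and the toy vertex labels are illustrative assumptions, not part of the lecture.

```python
def make_graph(vertices, pairs):
    """Return (V, E) for a simple undirected graph.

    Each edge is stored as a frozenset {u, v}, so {u, v} == {v, u}.
    """
    V = set(vertices)
    E = set()
    for u, v in pairs:
        if u == v:
            raise ValueError("loops are not allowed in a simple graph")
        if u not in V or v not in V:
            raise ValueError("edge endpoints must be vertices")
        E.add(frozenset((u, v)))  # repeated pairs collapse, so no multi-edges
    return V, E


# Hypothetical toy graph on four vertices; the pair (2, 1) repeats {1, 2}.
V, E = make_graph([1, 2, 3, 4], [(1, 2), (2, 3), (3, 1), (3, 4), (2, 1)])
order, size = len(V), len(E)  # N_v and N_e
```

Storing edges as frozensets is one way to make the "unordered pair" requirement explicit; an ordered tuple would instead be the natural choice for arcs in a digraph.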


Some nuances . . .

More generally, graphs may have

loops, and/or

multi-edges

A graph with either is called a multi-graph.

We will typically assume the absence of loops and multi-edges, i.e., that our graphs are simple.


Directed Graphs

A graph $G$ for which each edge in $E$ has an ordering to its vertices,

i.e., $\{u, v\}$ is distinct from $\{v, u\}$, for $u, v \in V$,

is called a directed graph or digraph.

Such edges are called directed edges or arcs, with the direction of an arc $\{u, v\}$ read from left to right, from the tail $u$ to the head $v$.

Arcs are said to be mutual if they share the same vertex pair, but with the vertices playing opposite roles of head and tail for each arc.


Connectivity

It is necessary to have a language for discussing the connectivity of a graph. One of the most basic notions of connectivity is that of adjacency.

1 Two vertices $u, v \in V$ are said to be adjacent if joined by an edge in $E$.

2 Similarly, two edges $e_1, e_2 \in E$ are adjacent if joined by a common endpoint in $V$.

In addition, a vertex $v \in V$ is incident on an edge $e \in E$ if $v$ is an endpoint of $e$.


Degree

The notion of ‘degree’ is used to summarize the connectivity of a vertex.

The degree of a vertex $v$, say $d_v$, is the number of edges incident on $v$.

The degree sequence of a graph $G$ is the sequence formed by arranging the vertex degrees $d_v$ in non-decreasing order, i.e.,

$$d_{(1)} \le d_{(2)} \le \cdots \le d_{(N_v - 1)} \le d_{(N_v)}.$$


Degree (cont.)

Note that

1 the sum of the elements of the degree sequence is equal to twice the number of edges in the graph (i.e., twice the size of the graph).

2 for digraphs, vertex degree is replaced by in-degree (i.e., $d_v^{\text{in}}$) and out-degree (i.e., $d_v^{\text{out}}$), which count the number of edges pointing in towards and out from a vertex, respectively.

Hence, digraphs have both an in-degree sequence and an out-degree sequence.
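The two notes above can be checked on a small example. The sketch below assumes edges are given as pairs and arcs as (tail, head) pairs; the function names and the toy data are hypothetical.

```python
def degrees(vertices, edges):
    """Degree d_v of each vertex in an undirected simple graph."""
    d = {v: 0 for v in vertices}
    for u, v in edges:
        d[u] += 1
        d[v] += 1
    return d


def in_out_degrees(vertices, arcs):
    """In- and out-degrees; an arc (u, v) runs from tail u to head v."""
    d_in = {v: 0 for v in vertices}
    d_out = {v: 0 for v in vertices}
    for tail, head in arcs:
        d_out[tail] += 1
        d_in[head] += 1
    return d_in, d_out


# Hypothetical toy data.
vertices = [1, 2, 3, 4]
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]

d = degrees(vertices, edges)
degree_sequence = sorted(d.values())  # non-decreasing order
degree_sum = sum(degree_sequence)     # should equal 2 * N_e

# A digraph with one pair of mutual arcs (1 -> 2 and 2 -> 1).
d_in, d_out = in_out_degrees(vertices, [(1, 2), (2, 1), (3, 4)])
```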


Additional Concepts

Beyond these basics, it is useful to be able to discuss concepts of

Movement
E.g., walks, trails, and paths; circuits and cycles.

Reachability / Connectivity / Components
E.g., giant connected component; strongly and weakly connected components.

Distance
E.g., shortest path / geodesic distance; diameter.
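For the distance concepts, a short sketch may help: in an unweighted graph, geodesic distances from one vertex can be found by breadth-first search, and the diameter is the largest such distance. The adjacency-dictionary input format and the function names are assumptions made for illustration.

```python
from collections import deque


def bfs_distances(adj, source):
    """Geodesic (shortest-path) distance from source to each reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:  # first visit gives the shortest distance
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def diameter(adj):
    """Largest geodesic distance; assumes the graph is connected."""
    return max(max(bfs_distances(adj, v).values()) for v in adj)


# Hypothetical toy graph: path 1-2-3-4 plus the extra edge 2-4.
adj = {1: [2], 2: [1, 3, 4], 3: [2, 4], 4: [2, 3]}
dist_from_1 = bfs_distances(adj, 1)
diam = diameter(adj)
```

Vertices missing from the returned dictionary are unreachable from the source, which is also a simple way to read off the connected component containing it.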


Other ‘Flavors’ of Networks

There are a number of additional types of networks – beyond the basic (un)directed varieties – now commonly studied:

weighted networks

bipartite networks

multi-relational networks

dynamic/temporal networks

We will encounter a few of these here and there.


Review of Graphs Graphs & Matrix Algebra

Graphs and Matrix Algebra

Useful in the modeling and analysis of network data to be able to characterize a graph $G$ and certain aspects of its structure using matrices and matrix algebra³.

The fundamental connectivity of a graph $G$ may be captured in the $N_v \times N_v$ binary, symmetric adjacency matrix $A$, where

$$A_{ij} = \begin{cases} 1, & \text{if } \{i, j\} \in E, \\ 0, & \text{otherwise}. \end{cases}$$

In words, $A$ is non-zero for entries whose row-column indices correspond to vertices in $G$ joined by an edge, and zero for those that are not.

³ Formal blending of graph theory with matrix algebra is the focus of algebraic graph theory.
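A minimal sketch of building $A$ from a list of edges follows. It assumes vertices are labeled $0, \ldots, N_v - 1$ and uses plain nested lists; the function name is hypothetical.

```python
def adjacency_matrix(n_vertices, edges):
    """N_v x N_v binary, symmetric adjacency matrix as a list of lists.

    Vertices are assumed to be labeled 0, ..., n_vertices - 1.
    """
    A = [[0] * n_vertices for _ in range(n_vertices)]
    for i, j in edges:
        A[i][j] = 1
        A[j][i] = 1  # undirected: {i, j} and {j, i} are the same edge
    return A


# Hypothetical toy graph: triangle 0-1-2 plus a pendant vertex 3 attached to 2.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
A = adjacency_matrix(4, edges)
```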


Some Properties of A

More than just a storage mechanism, various operations applied to $A$ yield information on $G$.

Examples include

Degree

$$d_i = A_{i+} = \sum_j A_{ij}$$

Walks

$(A^r)_{ij}$ = number of walks of length $r$ between $i$ and $j$ on $G$

Eigen-structure

E.g., a connected graph $G$ is regular if and only if the maximum degree $d_{\max}$ of $G$ is an eigenvalue of $A$.
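The degree and walk properties can be verified numerically. The sketch below uses plain-Python matrix multiplication so that it stays self-contained; the helper names and the toy matrix are assumptions for illustration.

```python
def mat_mult(X, Y):
    """Plain-Python product of two square matrices of the same size."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(A, r):
    """A raised to the integer power r >= 1 by repeated multiplication."""
    P = A
    for _ in range(r - 1):
        P = mat_mult(P, A)
    return P


# Hypothetical toy graph: triangle 0-1-2 plus a pendant vertex 3 attached to 2.
A = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]

row_sum_degrees = [sum(row) for row in A]  # d_i = sum_j A_ij
A2 = mat_power(A, 2)  # entry (i, j): walks of length 2 between i and j
A3 = mat_power(A, 3)  # entry (i, j): walks of length 3 between i and j
```

In this example the diagonal of `A2` reproduces the degrees, since every walk of length 2 from a vertex back to itself goes out along an incident edge and returns.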


Review of Graphs Graphs Data Structures & Algorithms

Graph Data Structures and Algorithms

The study of graph

data structures, and

algorithms

facilitates the transition from

graphs as purely mathematical objects to

graphs as practical tools for network modeling and analysis.

Contributions in this area primarily due to computer science.


Graph Data Structures

Two common data structures for representing graphs:

1 adjacency matrix
$N_v \times N_v$ binary matrix (seen previously)

2 adjacency list
An array of size $N_v$, each element of which is a list, where the $i$-th list contains the labels of the $d_i$ vertices adjacent to vertex $v_i$.

Also, a variation on adjacency lists is the idea of an edge list, which is simply a two-column list of all vertex pairs that are joined by an edge.
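The relationship between an edge list and an adjacency list can be sketched as follows, assuming vertices are labeled $0, \ldots, N_v - 1$. Function names and the toy data are hypothetical.

```python
def edge_list_to_adjacency_list(n_vertices, edge_list):
    """Array of N_v lists; the i-th list holds the neighbors of vertex i."""
    adj = [[] for _ in range(n_vertices)]
    for i, j in edge_list:
        adj[i].append(j)
        adj[j].append(i)
    return adj


def adjacency_list_to_edge_list(adj):
    """Two-column list of vertex pairs (i, j), each edge listed once with i < j."""
    return sorted((i, j) for i, nbrs in enumerate(adj) for j in nbrs if i < j)


# Hypothetical toy graph: triangle 0-1-2 plus a pendant vertex 3 attached to 2.
edge_list = [(0, 1), (0, 2), (1, 2), (2, 3)]
adj = edge_list_to_adjacency_list(4, edge_list)
round_trip = adjacency_list_to_edge_list(adj)
```

For sparse graphs both structures need storage proportional to $N_v + N_e$, whereas the adjacency matrix always needs $N_v^2$ entries.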


Graph Algorithms

Suppose you have a graph stored on your computer. What questions might you like to ask?

For those used to working with ‘regular’ data (e.g., signals, images, survey samples, regression analyses, etc.), it may come as a surprise that the nature of the question can be important!

Roughly speaking, questions can be broken down into categories:

1 Directly answerable from the adjacency matrix / list

2 Answerable, with a bit of work, in a ‘reasonable’ amount of time⁴

3 (Expected to be) unanswerable in any ‘reasonable’ amount of time

⁴ Means polynomial in $N_v$ and/or $N_e$.


Descriptive Statistics for Networks


Descriptive Statistics for Networks

Statisticians typically distinguish between descriptive and inferential statistics.

We will spend the rest of this lecture looking at the network analogue of descriptive statistics⁵, in the form of

network mapping

characterization of network graphs

May seem ‘soft’ . . . but it’s important!

This is basically descriptive statistics for networks.

Probably constitutes at least 2/3 of the work done in this area.

Note: It’s sufficiently different from standard descriptive statistics that it’s something unto itself.

⁵ . . . and the remaining lectures looking at inferential statistics.


Descriptive Statistics for Networks Network Mapping

Network Mapping

What is ‘network mapping’?

Production of a network-based visualization of a complex system.

What is ‘the’ network?

Network as a ‘system’ of interest;

Network as a graph representing the system;

Network as a visual object.

Analogue: Geography and the production of cartographic maps.


Example: Mapping Belgium

Which of these is ‘the’ Belgium?


Three Stages of Network Mapping

Continuing our geography analogue . . . a fourth stage might be ‘validation’.


Stage 1: Collecting Relational Network Data

Begin with measurements on system ‘elements’ and ‘relations’.

Note that the choice of ‘elements’ and ‘relations’ can produce very different representations of the same system.

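[Figure: two network graph representations of the same system, one with vertices labeled Sgg, Tim, Dbt, Per, dCLK, and Cyc, the other with vertices labeled Pdp, dCLK, Cyc, Tim, Vri, and Per.]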


Standard Statistical Issues Present Too!

Type of measurements (e.g., cont., binary, etc.) can influence quality of information they contain on underlying ‘relation’.

Full or partial view of the system? (Analogues in spatial statistics . . .)

Sampling, missingness, etc.


Stage 2: Constructing Network Graphs

Sometimes measurements are direct declaration of edge/non-edge status.

More commonly, edges dictated after processing measurements

comparison of ‘similarity’ metric to threshold

voting among multiple views (e.g., router I-net)

Frequently ad hoc . . .

. . . sometimes formal (e.g., network inference).

Even with direct and error free observation of edges, decisions may be made to thin edges, adjust topology to match additional variables, etc.
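The first construction rule listed above, comparison of a ‘similarity’ metric to a threshold, can be sketched as below. The similarity scores, the threshold values, and the function name are all hypothetical and chosen only for illustration.

```python
def threshold_graph(similarity, threshold):
    """Declare an edge {i, j} whenever similarity[i][j] exceeds the threshold.

    similarity is assumed to be a symmetric square matrix (list of lists).
    """
    n = len(similarity)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if similarity[i][j] > threshold]


# Hypothetical similarity scores among four system elements.
S = [[1.0, 0.9, 0.2, 0.1],
     [0.9, 1.0, 0.6, 0.3],
     [0.2, 0.6, 1.0, 0.8],
     [0.1, 0.3, 0.8, 1.0]]

edges_strict = threshold_graph(S, 0.7)
edges_loose = threshold_graph(S, 0.25)
```

The two thresholds yield different graphs from the same measurements (two disconnected edges versus a connected graph), which illustrates why an ad hoc choice of threshold deserves scrutiny.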


Stage 3: Visualization

Goal is to embed a combinatorial object $G = (V, E)$ into two- or three-dimensional Euclidean space.

Non-unique . . . not even well-defined!

Common to better define / constrain this problem by incorporating

conventions (e.g., straight line segs)

aesthetics (e.g., minimal edge crossing)

constraints (e.g., on relative placement of vertices, subgraphs, etc.)


Graph Layout: Art and Science.

Software for network visualization (aka graph layout) largely uses a handful of standard classes of methods, e.g., circular, radial, analogies to physical systems, etc.

Many visualization packages, some general and some area-specific.

Examples in these talks primarily produced using

Pajek

R tools in igraph package

Better packages also allow for user interaction to manipulate further.
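Of the layout classes mentioned, the circular one is simple enough to sketch directly: vertices are placed at equally spaced angles on a circle. This is only an illustrative sketch with a hypothetical function name, not the algorithm used by Pajek or igraph.

```python
import math


def circular_layout(vertices, radius=1.0):
    """Place vertices at equally spaced angles on a circle centered at the origin."""
    n = len(vertices)
    return {v: (radius * math.cos(2 * math.pi * k / n),
                radius * math.sin(2 * math.pi * k / n))
            for k, v in enumerate(vertices)}


# Hypothetical vertex labels; positions are (x, y) coordinates in the plane.
pos = circular_layout(["a", "b", "c", "d"])
```

The result depends only on the vertex ordering, not on the edges, so the ordering is where most of the design effort goes for this class of layout.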


Layout ... Does it Matter?

Yes!

Layered, circular, and h-v layouts of the same tree.

Visualization and Scale: Attention Needed!

[Figure. Top: visualizations at the level of blogs (left: energy-based placement; right: projection). Bottom: visualization at the level of political parties, with vertices labeled Cap21, Commentateurs Analystes, Les Verts, liberaux, Parti Radical de Gauche, PCF − LCR, PS, UDF, and UMP.]


Mapping Dynamic Network Data

For dynamic networks the challenges are greater still, particularly for visualization.

Options include use of weighted networks (left), slices (center), and edge-timelines (right).

[Figure. Left: weighted network. Center: eight network slices for consecutive 12-hour windows (0 to 12 hrs, 12 to 24 hrs, . . ., 84 to 96 hrs). Right: edge-timeline plot with x-axis ‘Time (hours)’ and y-axis ‘Interacting Pair (Ordered by First Interaction)’; legend: ADM−ADM, MED−MED, NUR−NUR, PAT−PAT, ADM−MED, ADM−NUR, ADM−PAT, MED−NUR, MED−PAT, NUR−PAT.]

(Data represent person-to-person contacts over 96 hrs within a hospital environment. Src: Vanhems et al.)
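The weighted-network and slice options can be sketched for timestamped contact data. The contact records below are invented for illustration and are not the hospital data; the record format (u, v, time in hours) and the function names are assumptions.

```python
from collections import Counter


def slice_contacts(contacts, width):
    """Group timestamped contacts (u, v, t) into consecutive windows.

    Returns {slice_index: Counter mapping edge -> contacts in that window},
    where slice k covers times in [k * width, (k + 1) * width).
    """
    slices = {}
    for u, v, t in contacts:
        k = int(t // width)
        edge = (min(u, v), max(u, v))  # undirected pair in canonical order
        slices.setdefault(k, Counter())[edge] += 1
    return slices


def aggregate_weights(contacts):
    """Collapse all contacts into one weighted network (weight = contact count)."""
    return Counter((min(u, v), max(u, v)) for u, v, _ in contacts)


# Hypothetical contacts (u, v, time in hours).
contacts = [("NUR1", "PAT1", 1.5), ("PAT1", "NUR1", 3.0),
            ("MED1", "NUR1", 13.0), ("NUR1", "PAT1", 25.0)]
slices = slice_contacts(contacts, width=12)
weights = aggregate_weights(contacts)
```

The weighted network discards timing entirely, while the slices keep it at the resolution of the chosen window width.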


Descriptive Statistics for Networks Network Characterization

Characterization of Network Graphs: Intro

Given a network graph representation of a system (i.e., perhaps a result of network mapping), often questions of interest can be phrased in terms of structural properties of the graph.

social dynamics can be connected to patterns of edges among vertex triples;

routes for movement of information can be approximated by shortest paths between vertices;

‘importance’ of vertices can be captured through so-called centrality measures;

natural groups/communities of vertices can be approached through graph partitioning.
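As one illustration of the first item, a commonly used summary of edge patterns among vertex triples is transitivity: the fraction of connected triples that close into triangles. This particular measure is chosen here as an assumption for illustration; the slide itself does not name it.

```python
from itertools import combinations


def transitivity(adj):
    """Fraction of connected triples (paths of length two) that close into triangles.

    adj maps each vertex to the set of its neighbors (undirected, simple graph).
    """
    closed = 0
    triples = 0
    for v, nbrs in adj.items():
        for a, b in combinations(nbrs, 2):  # triple centered at v
            triples += 1
            if b in adj[a]:
                closed += 1
    return closed / triples if triples else 0.0


# Hypothetical toy graph: triangle 0-1-2 plus a pendant vertex 3 attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
cl = transitivity(adj)
```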


Characterization Intro (cont.)

Structural analysis of network graphs ≈ descriptive analysis; this is a standard first (and sometimes only!) step in statistical analysis of networks.

Main contributors of tools are

social network analysis,

mathematics & computer science,

statistical physics

Many tools out there . . . two rough classes based on scale:

characterization of vertices/edges, and

characterization of network cohesion.


Characterization of Vertices/Edges

Examples include

Degree distribution

Vertex/edge centrality

Role/positional analysis

We’ll look briefly at degree and centrality.


Vertex Degree

The degree $d_v$ of a vertex $v$ is the number of vertices in $V$ adjacent to $v$,

i.e., $d_v$ = the number of neighbors of $v$.

Define

$f_d$ = fraction of vertices $v \in V$ with degree $d_v = d$.

The degree distribution $\{f_d\}_{d \ge 0}$ is a summary of local connectivity across the graph $G$.

(Footnote: For weighted networks there is the analogous notion of vertex strength.)
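As a minimal sketch of these definitions, the degree of each vertex and the resulting degree distribution can be computed from an edge list. The toy graph and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical undirected graph given as an edge list.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
vertices = ["a", "b", "c", "d", "e"]   # "e" is an isolate with degree 0

def degrees(vertices, edges):
    """Degree of each vertex: the number of edges touching it."""
    deg = {v: 0 for v in vertices}
    for u, w in edges:
        deg[u] += 1
        deg[w] += 1
    return deg

def degree_distribution(deg):
    """Map each observed degree d to the fraction of vertices with that degree."""
    n = len(deg)
    counts = Counter(deg.values())
    return {d: counts[d] / n for d in sorted(counts)}
```

Here `degree_distribution(degrees(vertices, edges))` gives the fractions for degrees 0 through 3; the isolate is included so that the degree-0 fraction is not lost.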


Degree Distributions

The degree distribution is an object of fundamental importance in graph theory.

As such, it has been a source of considerable interest in empirical studies.

Much attention has been focused on distinguishing between two general flavors of distributions, i.e., homogeneous and heterogeneous.

[Figure: two histograms with x-axis "Degree" and y-axis "Frequency", titled "Degree Distribution for Random Graph" and "Degree Distribution for Power-Law Graph".]
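The contrast between the two flavors can be reproduced in a small simulation. The slides do not say how the two graphs were generated, so the sketch below rests on assumptions: the homogeneous case is taken to be a G(n, p) random graph, and the heterogeneous case a tree grown by preferential attachment. Both have mean degree of about 2.

```python
import random
from collections import Counter

def gnp_degrees(n, p, rng):
    """Degrees of a G(n, p) graph: each pair is joined independently with prob. p."""
    deg = [0] * n
    for u in range(n):
        for w in range(u + 1, n):
            if rng.random() < p:
                deg[u] += 1
                deg[w] += 1
    return deg

def preferential_attachment_degrees(n, rng):
    """Degrees of a tree grown by attaching each new vertex to one existing
    vertex chosen with probability proportional to its current degree."""
    deg = [1, 1]            # start from a single edge 0 - 1
    endpoints = [0, 1]      # each vertex appears once per incident edge
    for new in range(2, n):
        target = rng.choice(endpoints)
        deg[target] += 1
        deg.append(1)
        endpoints += [target, new]
    return deg

rng = random.Random(0)      # fixed seed so the sketch is reproducible
n = 200
homogeneous = Counter(gnp_degrees(n, 2 / (n - 1), rng))
heterogeneous = Counter(preferential_attachment_degrees(n, rng))
```

Comparing `max(homogeneous)` with `max(heterogeneous)`, or plotting the two counters, is one quick way to see how far the tail extends in each case.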


Fitting Heterogeneous Degree Distributions

Natural to want to fit these data.

But . . . in general (i.e., not just for networks) . . . the fitting of heavy-tailed densities is a non-trivial exercise.

Even in the case of a true power-law distribution, fitting must be handled with some care.

Arguably best treated as a descriptive task, rather than inferential.

(Shown: Data for yeast PPI network.)

[Figure: "Degree Distribution" histogram (x-axis "Degree", y-axis "Frequency") and "Log-Log Degree Distribution" (x-axis "Log-Degree", y-axis "Log-Intensity").]
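One commonly used alternative to fitting a line to the log-log plot is maximum likelihood for the exponent. The sketch below uses the standard approximate estimator for discrete data, assuming $f_d \propto d^{-\alpha}$ for $d \ge d_{\min}$. The cutoff `d_min` must be chosen by the analyst and is an assumption here; the slides do not prescribe a method.

```python
import math

def powerlaw_alpha_mle(degrees, d_min):
    """Approximate maximum-likelihood estimate of alpha in f_d ~ d**(-alpha),
    using only degrees d >= d_min (d_min is a user-chosen cutoff, d_min >= 1)."""
    tail = [d for d in degrees if d >= d_min]
    if not tail:
        raise ValueError("no degrees at or above d_min")
    # The 0.5 shift is the usual continuity correction for integer-valued data.
    log_sum = sum(math.log(d / (d_min - 0.5)) for d in tail)
    return 1.0 + len(tail) / log_sum
```

The estimate can change noticeably with `d_min`, which is one concrete reason the fitting needs care.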


Centrality: Motivation

Many questions related to ‘importance’ of vertices.

Which actors hold the ‘reins of power’?

How authoritative is a WWW page considered by peers?

The deletion of which genes is more likely to be lethal?

How critical to traffic flow is a given Internet router?

Researchers have sought to capture the notion of vertex importance through so-called centrality measures.


Centrality: An Illustration

Clockwise from top left: (i) toy graph, with (ii) closeness, (iii) betweenness, and (iv) eigenvector centralities.

(Example and figures courtesy of Ulrik Brandes.)
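Two of these measures can be sketched for an unweighted, undirected graph stored as an adjacency dictionary. Closeness is taken as the reciprocal of the summed shortest-path distances, and betweenness is accumulated with Brandes' algorithm. The function names are hypothetical, and the closeness sketch assumes a connected graph with at least two vertices.

```python
from collections import deque

def closeness(adj):
    """Closeness: 1 / (sum of shortest-path distances to all other vertices).
    Assumes a connected graph with at least two vertices."""
    result = {}
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        result[s] = 1.0 / sum(dist.values())
    return result

def betweenness(adj):
    """Betweenness via Brandes' accumulation (unweighted, undirected)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        dist, sigma = {s: 0}, {v: 0 for v in adj}
        sigma[s] = 1                       # number of shortest paths from s
        preds = {v: [] for v in adj}
        order, queue = [], deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
                if dist[w] == dist[u] + 1:  # u precedes w on a shortest path
                    sigma[w] += sigma[u]
                    preds[w].append(u)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):           # back-propagate path dependencies
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return {v: b / 2 for v, b in bc.items()}   # each unordered pair counted twice
```

On a star graph, for example, the hub gets the highest score under both measures and every leaf has betweenness zero.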


Higher-order Centrality

[Figure: two panels showing the same network map, with nodes Seattle, Sunnyvale, Los Angeles, Denver, Kansas City, Houston, Chicago, Indianapolis, Atlanta, New York, and Wash. DC.]

Conceptually, centrality generalizes naturally to higher orders. (Computationally, such generalizations may not be so straightforward!)

For betweenness, two logical extensions are

1. group betweenness (i.e., ‘union’/OR), and

2. co-betweenness (i.e., ‘intersection’/AND).

(See Kolaczyk, Chua, and Barthelemy (2009).)
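One way to read the OR/AND distinction is by brute force on a small graph: enumerate every shortest path between each pair of vertices and record the fraction that passes through any member of a vertex set (OR) or through all members (AND). The sketch below is an illustration under assumptions. The function names are hypothetical, pairs with an endpoint inside the set are excluded by choice, and it is not claimed to be the formulation or algorithm of the cited paper.

```python
from collections import deque
from itertools import combinations

def all_shortest_paths(adj, s, t):
    """Enumerate every shortest s-t path: BFS, then backtrack over predecessors."""
    dist, preds = {s: 0}, {s: []}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                preds[w] = []
                queue.append(w)
            if dist[w] == dist[u] + 1:
                preds[w].append(u)
    if t not in dist:
        return []
    paths = [[t]]
    while paths[0][0] != s:     # all shortest paths have equal length
        paths = [[p] + path for path in paths for p in preds[path[0]]]
    return paths

def higher_order_betweenness(adj, group, mode):
    """Sum over pairs s, t outside the group of the fraction of shortest s-t
    paths passing through any (mode='or') or all (mode='and') group vertices."""
    group = set(group)
    hit = any if mode == "or" else all
    others = [v for v in adj if v not in group]
    total = 0.0
    for s, t in combinations(others, 2):
        paths = all_shortest_paths(adj, s, t)
        if paths:
            total += sum(hit(g in path for g in group) for path in paths) / len(paths)
    return total
```

Enumerating paths explicitly is exponential in the worst case, which fits the remark that these generalizations are harder computationally than conceptually.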


Network Cohesion: Motivation

Many questions involve scales coarser than just individual vertices/edges. These are more properly considered questions regarding the ‘cohesion’ of the network.

Do friends of actors tend to be friends themselves?

Which proteins are most similar to each other?

Does the WWW tend to separate according to page content?

What proportion of the Internet is constituted by the ‘backbone’?

These questions go beyond individual vertices/edges.


Network Cohesion: Various Notions!

Various notions of ‘cohesion’.

density

clustering

connectivity

flow

partitioning

. . . and more . . .

We’ll look quickly at just a few examples relating to connectivity.


Components

Not uncommon in practice that a graph be unconnected!

A (connected) component of a graph G is a maximally connected subgraph.

Common to decompose graph into components. Often find this results in

giant component

smaller components

isolates

Frequently, reported analyses are for the giant component.
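A minimal sketch of this decomposition for an undirected graph stored as an adjacency dictionary (the function name is hypothetical): components are found by repeated graph search and returned largest first, so the giant component and the isolates can be read off directly.

```python
def connected_components(adj):
    """Vertex sets of the connected components, largest first."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        component, stack = {start}, [start]
        while stack:                      # depth-first search from `start`
            u = stack.pop()
            for w in adj[u]:
                if w not in component:
                    component.add(w)
                    stack.append(w)
        seen |= component
        components.append(component)
    return sorted(components, key=len, reverse=True)
```

With `comps = connected_components(adj)`, the giant component is `comps[0]` and the isolates are the components of size 1.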


Components in Directed Graphs Become Interesting!

[Figure: diagram of a directed graph with regions labeled Strongly Connected Component, In-Component, Out-Component, Tendrils, and Tubes.]

(Due to Broder et al. ’00.)


Example: AIDS Blog Network

Left: Original network. Right: Network with vertices annotated by component membership, i.e.,

strongly connected component (yellow)

in-component (blue)

out-component (red)

tendrils (pink)
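Such an annotation can be sketched with plain reachability searches. The assumptions are that the directed graph is stored as a dictionary mapping every vertex to its set of successors, and that the largest strongly connected component is used as the core. The function names are hypothetical, and this simple version favors clarity over speed.

```python
def reachable(succ, sources):
    """All vertices reachable from any source by following directed edges."""
    seen, stack = set(sources), list(sources)
    while stack:
        u = stack.pop()
        for w in succ[u]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen

def bow_tie(succ):
    """Split vertices into largest SCC, in-component, out-component, and the rest."""
    pred = {v: set() for v in succ}       # reversed graph
    for u, ws in succ.items():
        for w in ws:
            pred[w].add(u)
    scc, assigned = set(), set()
    for v in succ:
        if v not in assigned:
            # v's SCC = vertices reachable from v that can also reach v.
            candidate = reachable(succ, [v]) & reachable(pred, [v])
            assigned |= candidate
            if len(candidate) > len(scc):
                scc = candidate
    out_comp = reachable(succ, scc) - scc     # reachable from the core
    in_comp = reachable(pred, scc) - scc      # can reach the core
    other = set(succ) - scc - in_comp - out_comp   # tendrils, tubes, the rest
    return {"scc": scc, "in": in_comp, "out": out_comp, "other": other}
```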


Vertex/Edge-Connectivity

Question: “If an arbitrary subset of k vertices (edges) is removed from a graph G, is the remaining subgraph connected?”

The notions of k-vertex(edge)-connected seek to make this question and its answer precise.

E.g., a graph G is said to be k-vertex-connected if

$N_v > k$, and

the removal of any subset of vertices $X \subset V$ such that $|X| < k$ leaves a subgraph $G - X$ that is still connected.

Note: Connectivity closely related to results on ‘flow’.
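For small graphs the definition of k-vertex-connectivity can be checked literally, by removing every vertex subset of size less than k and testing connectivity. The sketch below does this by brute force; the function names are hypothetical, and practical implementations rely on flow-based algorithms instead.

```python
from itertools import combinations

def is_connected(adj, removed=frozenset()):
    """True if the graph induced on the vertices not in `removed` is connected."""
    keep = [v for v in adj if v not in removed]
    if not keep:
        return True
    seen, stack = {keep[0]}, [keep[0]]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w not in removed and w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(keep)

def is_k_vertex_connected(adj, k):
    """Literal check of the definition; exponential in k, for small graphs only."""
    if len(adj) <= k:                     # the definition requires N_v > k
        return False
    return all(
        is_connected(adj, set(X))
        for size in range(k)              # every subset X with |X| < k
        for X in combinations(adj, size)
    )
```

A 4-cycle, for instance, is 2-vertex-connected but not 3-vertex-connected, since removing two opposite vertices separates the other two.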


Illustration: Detecting Malicious Internet Sources

Ding et al. (2012) use the idea of cut-vertices to detect Internet IP addresses associated with malicious behavior.

Corresponds to a type of (anti)social behavior.

Reference: Ding, Q., Katenka, N., Barford, P., Kolaczyk, E.D., and Crovella, M. (2012). Intrusion as (Anti)social Communication: Characterization and Detection. Proceedings of the 2012 ACM SIGKDD Conference on Knowledge Discovery and Data Mining.

[Figure panels: "Source/Destination Network" and "Projected Source Network".]
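The notion of a cut-vertex itself is simple to illustrate: a vertex is a cut-vertex if removing it increases the number of connected components. The brute-force sketch below shows only that definition on an adjacency dictionary; the function names are hypothetical, and this is not the detection method of Ding et al.

```python
def count_components(adj, removed=None):
    """Number of connected components after deleting the vertex `removed`."""
    seen, count = set(), 0
    for start in adj:
        if start == removed or start in seen:
            continue
        count += 1
        seen.add(start)
        stack = [start]
        while stack:
            u = stack.pop()
            for w in adj[u]:
                if w != removed and w not in seen:
                    seen.add(w)
                    stack.append(w)
    return count

def cut_vertices(adj):
    """Vertices whose removal increases the number of connected components."""
    base = count_components(adj)
    return {v for v in adj if count_components(adj, removed=v) > base}
```

In two triangles that share a single vertex, that shared vertex is the only cut-vertex.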


Characterization of Dynamic Networks

Development of a comprehensive body of tools for characterizing dynamic networks lags far behind what we have for static networks.

increased demand relatively recent

graph-theoretic infrastructure less developed

potential for severe magnification of computational challenges


Characterization of Dynamic Networks (cont.)

[Figure: eight stacked histograms, one per 12-hour slice (0−12hrs through 84−96hrs), with x-axis "Degree" and y-axis "Count"; bars colored by "Status": ADM, MED, NUR, PAT.]

Arguably still the most common approach is to apply methods for static networks to consecutive time slices.

Left: Degree distributions for hospital contact data, every 12hrs.
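The time-slice approach amounts to building one static graph per window and applying the usual static summary to each. The sketch below computes per-slice degree counts from timestamped contacts; the record layout `(time, i, j)` and the function name are hypothetical, and vertices with no contact in a slice are simply omitted from that slice.

```python
from collections import Counter, defaultdict

def degree_counts_by_slice(contacts, slice_hours=12):
    """Map slice index -> Counter {degree: number of vertices with that degree}.
    Within a slice, repeated contacts between the same pair count as one edge."""
    neighbors = defaultdict(lambda: defaultdict(set))
    for t, i, j in contacts:
        k = int(t // slice_hours)         # index of the window containing t
        neighbors[k][i].add(j)
        neighbors[k][j].add(i)
    return {
        k: Counter(len(nb) for nb in by_vertex.values())
        for k, by_vertex in sorted(neighbors.items())
    }
```

The choice of window length matters: shorter windows give sparser slices and smaller degrees, so comparisons across studies need the same `slice_hours`.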


Wrapping Up

Final Thoughts

This first lecture only scratches the surface of mapping and characterization of networks.

More details will surely emerge in lectures of various other speakers over these next two weeks.

Remaining lectures:

L1 Introduction, Background, and Descriptive Statistics (1.5hrs)

L2 Network Sampling (1hr)

L3 Network Modeling (1.5hrs)

L4 Additional Topics in Modeling/Analysis (1.5hrs)
