No Slide Title - Leiden Universityliacs.leidenuniv.nl/~bakkerem2/dbdm2012/10_social_net.pdfOrkut...

12/6/2011 1

Data Mining: Concepts and Techniques

— Chapter 9 —

9.2. Social Network Analysis

Jiawei Han and Micheline Kamber

Department of Computer Science

University of Illinois at Urbana-Champaign

www.cs.uiuc.edu/~hanj

©2006 Jiawei Han and Micheline Kamber. All rights reserved.

Acknowledgements: Based on the slides by Sangkyum Kim and Chen Chen

http://www.cs.uiuc.edu/~hanj

12/6/2011 2

Social Network Analysis

Social Networks: An Introduction

Primitives for Network Analysis

Different Network Distributions

Models of Social Network Generation

Mining on Social Network

Summary

Social Networks

Social network: A social structure made of nodes (individuals or organizations) that are related to each other by various interdependencies like friendship, kinship, like, ...

Graphical representation

Nodes = members

Edges = relationships

Examples of typical social networks on the Web

Social bookmarking (Del.icio.us)

Friendship networks (Facebook, Myspace, LinkedIn)

Blogosphere

Media Sharing (Flickr, Youtube)

Folksonomies (collaborative ontologies etc.)

Web 2.0 Examples

Blogs Blogspot Wordpress

Wikis Wikipedia Wikiversity

Social Networking Sites Facebook Myspace Orkut

Digital media sharing websites Youtube Flickr

Social Tagging Del.icio.us

Others Twitter Yelp

Adapted from H. Liu & N. Agarwal, KDD’08 tutorial

6

Society

Nodes: individuals

Links: social relationship (family/work/friendship/etc.)

S. Milgram (1967)

Social networks: Many individuals with diverse social interactions between them

Facebook Study (11-2011): on average 3.74 intermediate friends between any two users, i.e., 4.74 hops.

John Guare Six Degrees of Separation

7

Communication Networks

The Earth is developing an electronic nervous system, a network with diverse nodes and links are

-computers

-routers

-satellites

-phone lines

-TV cables

-EM waves

Communication networks: Many non-identicalcomponents with diverseconnectionsbetween them

8

Complex systems

Made of many non-identical elements connected by diverse

interactions.

NETWORK

9

“Natural” Networks and Universality

Consider many kinds of networks:

social, technological, business, economic, content,…

These networks tend to share certain informal properties:

large scale; continual growth

distributed, organic growth: vertices “decide” who to link to

interaction restricted to links

mixture of local and long-distance connections

abstract notions of distance: geographical, content, social,…

Social network theory and link analysis

Do natural networks share more quantitative universals?

What would these “universals” be?

How can we make them precise and measure them?

How can we explain their universality?

10







Summary

Networks and Their Representations

A network (or a graph): G = (V, E), where V: vertices (or nodes), and

E: edges (or links)

Multi-edge: if more than one edge between the same pair of

vertices

Self-edge (self-loop): if an edge connects vertex to itself

Simple network/graph if a network has neither self-edges nor multi-

edges

Adjacency matrix:

Aij = 1 if there is an edge between vertices i and j; 0 otherwise

Weighted networks:

Edges having weight (strength), usually a real number

Directed network (directed graph): if each edge has a direction

Aij = 1 if there is an edge from i to j; 0 otherwise

Cocitation and Bibliographic Coupling

Cocitation of vertices i and j: # of vertices having outgoing edges pointing to both i and j

Cocitation of i and j:

Cocitation matrix: It is a symmetric matrix

Diagonal matrix (Cii): total # papers citing i

Bibliographic coupling of vertices i and j: # of other vertices to which both point

Bibliographic coupling of i and j:

Bib-coupling matrix:

Diagonal matrix (Bii): total # papers cited by i

i

Vertices i and j are co-cited by 3 papers

Vertices i and j cite 3 same papers

Cocitation & Bibliographic Coupling: Comparison

Two measures are affected by the number of incoming and outgoing

edges that vertices have

For strong cocitation: must have a lot of incoming edges

Must be well-cited (influential) papers, surveys, or books

Takes time to accumulate citations

Strong bib-coupling if two papers have similar citations

A more uniform indicator of similarity between papers

Can be computed as soon as a paper is published

Not change over time

Recent analysis algorithms

HITS explores both cocitation and bibliographic coupling

SIGMOD

SDM

ICDM

KDD

EDBT

VLDB

ICML

AAAI

Tom

Jim

Lucy

Mike

Jack

Tracy

Cindy

Bob

Mary

Alice

Bipartite Networks

Bipartite Network: two kinds of vertices, and

edges linking only vertices of unlike types

Incidence matrix:

Bij = 1 if vertex j links to group i

0 otherwise

One can create a one-mode project from the

two-mode partite form (but with info loss)

The projection to one-mode can be written in

terms of the incidence matrix B is follows

Degree and Network Density

Degree of a vertex i:

# of edges m = 1/2 of sum of degrees of all the vertices:

The mean degree c of a vertex in an undirected graph:

Density ρ of a graph:

A network is dense if density ρ tends to be a constant as n → ∞

A network is sparse if density ρ → 0 as n → ∞. The fraction of nonzero element in the adjacency matrix tends to zero

Internet, WWW and friendship networks are usually regarded as sparse

15

12/6/2011 16







Summary

17

Some Interesting Network Quantities

Connected components: how many, and how large?

Network diameter: maximum (worst-case) or average? exclude infinite distances? (disconnected components) the small-world phenomenon

Clustering: to what extent that links tend to cluster “locally”? what is the balance between local and long-distance

connections? what roles do the two types of links play?

Degree distribution: what is the typical degree in the network? what is the overall distribution?

18

A “Canonical” Natural Network has…

Few connected components:

often only 1 or a small number, indep. of network size

Small diameter:

often a constant independent of network size (like 6)

or perhaps growing only logarithmically with network size or even shrink?

typically exclude infinite distances

A high degree of clustering:

considerably more so than for a random network

in tension with small diameter

A heavy-tailed degree distribution:

a small but reliable number of high-degree vertices

often of power law form (random variable X assuming integer values

(> 0) probability of value x ~ 1/xa (typically 0 < a < 2)

19

Probabilistic Models of Networks

All of the network generation models we will study are probabilistic or statistical in nature

They can generate networks of any size

They often have various parameters that can be set:

size of network generated

average degree of a vertex

fraction of long-distance connections

The models generate a distribution over networks

Statements are always statistical in nature:

with high probability, diameter is small

on average, degree distribution has heavy tail

Thus, we’re going to need some basic statistics and probability theory

20

Probability and Random Variables

A random variable X is simply a variable that probabilistically assumes values in some set

set of possible values sometimes called the sample space S of X

sample space may be small and simple or large and complex

S = {Heads, Tails}, X is outcome of a coin flip

S = {0,1,…,U.S. population size}, X is number voting democratic

S = all networks of size N, X is generated by preferential attachment

Behavior of X determined by its distribution (or density)

for each value x in S, specify Pr[X = x]

these probabilities sum to exactly 1 (mutually exclusive outcomes)

complex sample spaces (such as large networks):

distribution often defined implicitly by simpler components

might specify the probability that each edge appears independently

this induces a probability distribution over networks

may be difficult to compute induced distribution

21

Some Basic Notions and Laws

Independence: let X and Y be random variables independence: for any x and y, Pr[X = x & Y = y] = Pr[X=x]Pr[Y=y] intuition: value of X does not influence value of Y, vice-versa dependence: e.g. X, Y coin flips, but Y is always opposite of X

Expected (mean) value of X: only makes sense for numeric random variables “average” value of X according to its distribution formally, E[X] = S (Pr[X = x] X), sum is over all x in S often denoted by m always true: E[X + Y] = E[X] + E[Y] true only for independent random variables: E[XY] = E[X]E[Y]

Variance of X: Var(X) = E[(X – m)^2]; often denoted by s^2 standard deviation is sqrt(Var(X)) = s

Union bound: for any X, Y, Pr[X=x & Y=y] <= Pr[X=x] + Pr[Y=y]

22

Convergence to Expectations

Let X1, X2,…, Xn be:

independent random variables

with the same distribution Pr[X=x]

expectation m = E[X] and variance s2

independent and identically distributed (i.i.d.)

essentially n repeated “trials” of the same experiment

natural to examine r.v. Z = (1/n) S Xi, where sum is over i=1,…,n

example: number of heads in a sequence of coin flips

example: degree of a vertex in the random graph model

E[Z] = E[X]; what can we say about the distribution of Z?

Central Limit Theorem:

as n becomes large, Z becomes normally distributed

with expectation m and variance s2/n

23

The Normal Distribution

The normal or Gaussian density: applies to continuous, real-valued random

variables characterized by mean m and standard

deviation s density at x is defined as

(1/(s sqrt(2p))) exp(-(x-m)2/2s2) special case m = 0, s = 1: a exp(-x2/b) for

some constants a,b > 0 peaks at x = m, then dies off exponentially

rapidly the classic “bell-shaped curve”

exam scores, human body temperature, remarks:

can control mean and standard deviation independently

can make as “broad” as we like, but always have finite variance

24

The Binomial Distribution

Coin with Pr[heads] = p, flip n

times, probability of getting exactly

k heads:

choose (n, k) = pk(1-p)n-k

For large n and p fixed:

approximated well by a normal

with

m = np, s = sqrt(np(1-p))

s/m 0 as n grows

leads to strong large deviation

bounds

www.professionalgambler.com/ binomial.html

http://www.professionalgambler.com/binomial.html

http://www.professionalgambler.com/binomial.html

25

The Poisson Distribution

Like binomial, applies to variables taken on

integer values > 0

Often used to model counts of events

number of phone calls placed in a given

time period

number of times a neuron fires in a given

time period

Single free parameter l, probability of exactly

x events:

exp(-l) lx/x!

mean and variance are both l

Binomial distribution with n large, p = l/n (l

fixed)

converges to Poisson with mean l

single photoelectron distribution

26

Power Law (or Pareto) Distributions

Heavy-tailed, pareto, or power lawdistributions:

For variables assuming integer values > 0

probability of value x ~ 1/xa

Typically 0 < a < 2; smaller a gives heavier tail

sometimes also referred to as being scale-free

For binomial, normal, and Poisson distributions the tail probabilities approach 0 exponentially fast

What kind of phenomena does this distribution model?

What kind of process would generate it?

27

Distinguishing Distributions in Data

All these distributions are idealized models

In practice, we do not see distributions, but data

Typical procedure to distinguish between Poisson, power law, …

might restrict our attention to a range of values of interest

accumulate counts of observed data into equal-sized bins

look at counts on a log-log plot

power law:

log(Pr[X = x]) = log(1/xa) = -a log(x)

linear, slope –a

Normal:

log(Pr[X = x]) = log(a exp(-x2/b)) = log(a) – x2/b

non-linear, concave near mean

Poisson:

log(Pr[X = x]) = log(exp(-l) lx/x!)

also non-linear

28

Zipf’s Law

Pareto distribution vs. Zipf’s Law

Pareto distributions are continuous probability distributions

Zipf's law: a discrete counterpart of the Pareto distribution

Zipf's law:

Given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table

Thus the most frequent word will occur approximately twice as often as the second most frequent word, which occurs twice as often as the fourth most frequent word, etc.

General theme:

rank events by their frequency of occurrence

resulting distribution often is a power law!

Other examples:

North America city sizes

personal income

file sizes

genus sizes (number of species)

12/6/2011 29

Linear scales on both axes Logarithmic scales on both axes

The same data plotted on linear and logarithmic scales. Both plots show a Zipf distribution with 300 data points

Zipf Distribution

30







Summary

31

Some Models of Network Generation

Random graphs (Erdös-Rényi models):

gives few components and small diameter

does not give high clustering and heavy-tailed degree distributions

is the mathematically most well-studied and understood model

Watts-Strogatz models:

give few components, small diameter and high clustering

does not give heavy-tailed degree distributions

Scale-free Networks:

gives few components, small diameter and heavy-tailed distribution

does not give high clustering

Hierarchical networks:

few components, small diameter, high clustering, heavy-tailed

Affiliation networks:

models group-actor formation

12/6/2011 32


Random Graphs (Erdös-Rényi models)

Watts-Strogatz models

Scale-free Networks

Basic Network Measures: Degrees and Clustering Coefficients

Let a network G = (V, E), degree of a vertex

Undirected network: d(vi):

Directed network

In-degree of a vertex din(vi):

Out-degree of a vertex dout(vi):

Clustering coefficients

Let Nv be the set of adjacent vertices of v, kv be the number of adjacent

vertices to node v

Local clustering coefficient for directed network

Local clustering coefficient for undirected network

For the whole network: Averaging the local clustering coefficient of all

the vertices (Watts & Strogatz):

33

34

The Erdös-Rényi (ER) Model: A Random Graph Model

A random graph is obtained by starting with a set of N vertices and adding

edges between them at random

Different random graph models produce different probability distributions

on graphs

Most commonly studied is the Erdős–Rényi model, denoted G(N,p), in which

every possible edge occurs independently with probability p

Random graphs were first defined by Paul Erdős and Alfréd Rényi in their

1959 paper "On Random Graphs”

The usual regime of interest is when p ~ 1/N, N is large

e.g., p = 1/2N, p = 1/N, p = 2/N, p=10/N, p = log(N)/N, etc.

in expectation, each vertex will have a “small” number of neighbors

will then examine what happens when N infinity

can thus study properties of large networks with bounded degree

Sharply concentrated; not heavy-tailed

35

Erdös-Rényi Model (1959)

- Democratic

- Random

Pál Erdös(1913-1996)

Connect with probability p

p=1/6N=10 k ~1.5 Poisson distribution

Btw.: Erdös NumberE.M. Bakker – J. Van Leeuwen – S. Zaks – P. Erdös

36

The -model

The -model has the following parameters or “knobs”:

N: size of the network to be generated

k: the average degree of a vertex in the network to be generated

p: the default probability that two vertices are connected

: adjustable parameter dictating bias towards local connections

For any vertices u and v:

define m(u,v) to be the number of common neighbors (so far)

Key quantity: the propensity R(u,v) of u to connect to v

if m(u,v) >= k, R(u,v) = 1 (share too many friends not to connect)

if m(u,v) = 0, R(u,v) = p (no mutual friends no bias to connect)

else, R(u,v) = p + (m(u,v)/k)^ (1-p)

Generate new edges incrementally

using R(u,v) as the edge probability; details omitted

Note: = infinity is “like” Erdos-Renyi (but not exactly)

The Watts and Strogatz Model

Proposed by Duncan J. Watts and Steven Strogatz in their joint 1998 Nature

paper

A random graph generation model that produces graphs with small-world

properties, including short average path lengths and high clustering

The model also became known as the (Watts) beta model after Watts used

β to formulate it in his popular science book Six Degrees

The Erdos-Renyi graphs fail to explain two important properties observed in

real-world networks:

By assuming a constant and independent probability of two nodes

being connected, they do not account for local clustering, i.e., having a

low clustering coefficient

Do not account for the formation of hubs. Formally, the degree

distribution of Erdos-Renyi graphs converges to a Poisson distribution,

rather than a power law observed in most real-world, scale-free

networks

38

C(p) : clustering coeff.

L(p) : average path length (Watts and Strogatz, Nature 393, 440 (1998))

Watts-Strogatz Model

39

Small Worlds and Occam’s Razor

For small , should generate large clustering coefficients

we “programmed” the model to do so

Watts claims that proving precise statements is hard…

But we do not want a new model for every little property

Erdos-Renyi small diameter

-model high clustering coefficient

In the interests of Occam’s Razor, we would like to find

a single, simple model of network generation…

… that simultaneously captures many properties

Watt’s small world: small diameter and high clustering

40

Discovered by Examining the Real World…

Watts examines three real networks as case studies:

the Kevin Bacon graph

the Western states power grid

the C. elegans nervous system

For each of these networks, he:

computes its size, diameter, and clustering coefficient

compares diameter and clustering to best Erdos-Renyi approx.

shows that the best -model approximation is better

important to be “fair” to each model by finding best fit

Overall:

if we care only about diameter and clustering, is better than p

12/6/2011 41

Case 1: Kevin Bacon Graph

Vertices: actors and actresses

Edge between u and v if they appeared in a film together

Is Kevin Bacon

the most

connected actor?

NO!

Rank NameAverage

distance

# of

movies

# of

links

1 Rod Steiger 2.537527 112 2562

2 Donald Pleasence 2.542376 180 2874

3 Martin Sheen 2.551210 136 3501

4 Christopher Lee 2.552497 201 2993

5 Robert Mitchum 2.557181 136 2905

6 Charlton Heston 2.566284 104 2552

7 Eddie Albert 2.567036 112 3333

8 Robert Vaughn 2.570193 126 2761

9 Donald Sutherland 2.577880 107 2865

10 John Gielgud 2.578980 122 2942

11 Anthony Quinn 2.579750 146 2978

12 James Earl Jones 2.584440 112 3787

…

876 Kevin Bacon 2.786981 46 1811

…876 Kevin Bacon 2.786981 46 1811

Kevin Bacon

No. of movies : 46

No. of actors : 1811

Average separation: 2.79

42

Rod Steiger

Martin Sheen

Donald

Pleasence

#1

#2

#3

#876

Kevin Bacon

Bacon-map

43

Case 2: New York State Power Grid

Vertices: generators and substations

Edges: high-voltage power transmission lines and transformers

Line thickness and color indicate the voltage level

Red 765 kV, 500 kV; brown 345 kV; green 230 kV; grey 138 kV

44

Case 3: C. Elegans Nervous System

Vertices: neurons in the C. elegans worm

Edges: axons/synapses between neurons

45

Two More Examples

M. Newman on scientific collaboration networks

coauthorship networks in several distinct communities

differences in degrees (papers per author)

empirical verification of

giant components

small diameter (mean distance)

high clustering coefficient

Alberich et al. on the Marvel Universe

purely fictional social network

two characters linked if they appeared together in an issue

“empirical” verification of

heavy-tailed distribution of degrees (issues and characters)

giant component

rather small clustering coefficient

46

Towards Scale-Free Networks

Major limitation of the Watts-Strogatz model

It produces graphs that are homogeneous in degree

In contrast, real networks are often scale-free networks

inhomogeneous in degree, having hubs and a scale-free

degree distribution. Such networks are better described by

the preferential attachment family of models, such as the

Barabási–Albert (BA) model

The Watts-Strogatz model also implies a fixed number of

nodes and thus cannot be used to model network growth

The leads to the proposal of a new model: scale-free network,

a network whose degree distribution follows a power law, at

least asymptotically

47

Scale-free Networks: World Wide Web

800 million documents (S. Lawrence, 1999)

ROBOT: collects all URL’s found in a document and follows them recursively

Nodes: WWW documents Links: URL links

R. Albert, H. Jeong, A-L Barabasi, Nature, 401 130 (1999)

48

k ~ 6

P(k=500) ~ 10-99

NWWW ~ 109

N(k=500)~10-90

Expected Result Real Result

Pout(k) ~ k- out

P(k=500) ~ 10-6

out= 2.45 in = 2.1

Pin(k) ~ k- in

NWWW ~ 109

N(k=500) ~ 103

J. Kleinberg, et. al, Proceedings of the ICCC (1999)

World Wide Web

49

< l

>

Finite size scaling: create a network with N nodes with Pin(k) and Pout(k)

< l > = 0.35 + 2.06 log(N)

l15=2 [1 2 5]

l17=4 [1 3 4 6 7]

… < l > = ??

1

2

3

4

5

6

7

nd.edu

19 degrees of separation

R. Albert et al, Nature (99)

based on 800 million webpages [S. Lawrence et al Nature (99)]

A. Broder et al WWW9 (00)

IBM

World Wide Web

50

What Does that Mean?

Poisson distribution

Exponential Network

Power-law distribution

Scale-free Network

51

Scale-Free Networks

The number of nodes (N) is not fixed

Networks continuously expand by additional new nodes

WWW: addition of new nodes

Citation: publication of new papers

The attachment is not uniform

A node is linked with higher probability to a node that

already has a large number of links

WWW: new documents link to well known sites (CNN,

Yahoo, Google)

Citation: Well cited papers are more likely to be cited

again

52

Scale-Free Networks

Start with (say) two vertices connected by an edge For i = 3 to N:

for each 1 <= j < i, d(j) = degree of vertex j so far let Z = Sum d(j) (sum of all degrees so far) add new vertex i with k edges back to {1, …, i-1}:

i is connected back to j with probability d(j)/Z Vertices j with high degree are likely to get more links! —“Rich get richer” Natural model for many processes:

hyperlinks on the web new business and social contacts transportation networks

Generates a power law distribution of degrees exponent depends on value of k

Preferential attachment explains heavy-tailed degree distributions small diameter (~log(N), via “hubs”)

Will not generate high clustering coefficient no bias towards local connectivity, but towards hubs

53

Case1: Internet Backbone

(Faloutsos, Faloutsos and Faloutsos, 1999)

Nodes: computers, routers Links: physical lines

12/6/2011 54

Internet-Map

55

Case2: Actor Connectivity

Nodes: actors

Links: cast jointly

N = 212,250 actors

k = 28.78

P(k) ~k-

Days of Thunder (1990) Far and Away (1992) Eyes Wide Shut (1999)

=2.3

12/6/2011 56

Case 3: Science Citation Index

( = 3)

Nodes: papers

Links: citations

(S. Redner, 1998)

P(k) ~k-

2212

25

1736 PRL papers (1988)

Witten-Sander

PRL 1981

12/6/2011 57

Nodes: scientist (authors) Links: write paper together

(Newman, 2000, H. Jeong et al 2001)

Case 4: Science Coauthorship

58

Case 5: Food Web

Nodes: trophic species

Links: trophic interactions

R.J. Williams, N.D. Martinez Nature (2000)R. Sole (cond-mat/0011195)

http://online.sfsu.edu/~webhead/lrlbest.jpg

12/6/2011 59

Robustness of Random vs. Scale-Free Networks

The accidental failure of a number of nodes in a random network can fracture the system into non-communicating islands.

Scale-free networks are more robust in the face of such failures

Scale-free networks are highly vulnerable to a coordinated attack against their hubs

12/6/2011 60

protein-gene

interactions

protein-protein

interactions

PROTEOME

GENOME

Citrate Cycle

METABOLISM

Bio-chemical

reactions

Bio-Map

12/6/2011 61

Boehring-Mennheim

62

Nodes: proteins

Links: physical interactions (binding)

P. Uetz, et al., Nature 403, 623-7 (2000)

Prot Interaction Map: Yeast Protein Network

63







Summary

64

Information on Social Network

Heterogeneous, multi-relational data represented as a graph or network

Nodes are objects

May have different kinds of objects

Objects have attributes

Objects may have labels or classes

Edges are links

May have different kinds of links

Links may have attributes

Links may be directed, are not required to be binary

Links represent relationships and interactions between objects - rich content for mining

Metrics (Measures) in Social Network Analysis (I)

Betweenness: The extent to which a node lies between other nodes in the

network. This measure takes into account the connectivity of the node's

neighbors, giving a higher value for nodes which bridge clusters. The

measure reflects the number of people who a person is connecting

indirectly through their direct links

Bridge: An edge is a bridge if deleting it would cause its endpoints to lie in

different components of a graph.

Centrality: This measure gives a rough indication of the social power of a

node based on how well they "connect" the network. "Betweenness",

"Closeness", and "Degree" are all measures of centrality.

Centralization: The difference between the number of links for each node

divided by maximum possible sum of differences. A centralized network

will have many of its links dispersed around one or a few nodes, while a

decentralized network is one in which there is little variation between the

number of links each node possesses.65

Metrics (Measures) in Social Network Analysis (II)

Closeness: The degree an individual is near all other individuals in a

network (directly or indirectly). It reflects the ability to access information

through the "grapevine" of network members. Thus, closeness is the

inverse of the sum of the shortest distances between each individual and

every other person in the network

Clustering coefficient: A measure of the likelihood that two associates of a

node are associates themselves. A higher clustering coefficient indicates a

greater 'cliquishness'.

Cohesion: The degree to which actors are connected directly to each other

by cohesive bonds. Groups are identified as ‘cliques’ if every individual is

directly tied to every other individual, ‘social circles’ if there is less

stringency of direct contact, which is imprecise, or as structurally cohesive

blocks if precision is wanted.

Degree (or geodesic distance): The count of the number of ties to other

actors in the network. 66

http://en.wikipedia.org/wiki/Social_circle

http://en.wikipedia.org/wiki/Structural_cohesion

Metrics (Measures) in Social Network Analysis (III)

(Individual-level) Density: The degree a respondent's ties know one another/

proportion of ties among an individual's nominees. Network or global-level

density is the proportion of ties in a network relative to the total number

possible (sparse versus dense networks).

Flow betweenness centrality: The degree that a node contributes to sum of

maximum flow between all pairs of nodes (not that node).

Eigenvector centrality: A measure of the importance of a node in a network. It

assigns relative scores to all nodes in the network based on the principle that

connections to nodes having a high score contribute more to the score of the

node in question.

Local Bridge: An edge is a local bridge if its endpoints share no common

neighbors. Unlike a bridge, a local bridge is contained in a cycle.

Path Length: The distances between pairs of nodes in the network. Average

path-length is the average of these distances between all pairs of nodes.

67

Metrics (Measures) in Social Network Analysis (IV)

Prestige: In a directed graph prestige is the term used to describe a node's

centrality. "Degree Prestige", "Proximity Prestige", and "Status Prestige" are all

measures of Prestige.

Radiality Degree: an individual’s network reaches out into the network and

provides novel information and influence.

Reach: The degree any member of a network can reach other members of the

network.

Structural cohesion: The minimum number of members who, if removed from

a group, would disconnect the group

Structural equivalence: Refers to the extent to which nodes have a common

set of linkages to other nodes in the system. The nodes don’t need to have any

ties to each other to be structurally equivalent.

Structural hole: Static holes that can be strategically filled by connecting one or

more links to link together other points. Linked to ideas of social capital: if you

link to two people who are not linked you can control their communication68

69

A Taxonomy of Common Link Mining Tasks

Object-Related Tasks

Link-based object ranking

Link-based object classification

Object clustering (group detection)

Object identification (entity resolution)

Link-Related Tasks

Link prediction

Graph-Related Tasks

Subgraph discovery

Graph classification

Generative model for graphs

70

Link-Based Object Ranking (LBR)

Exploit the link structure of a graph to order or prioritize the set of objects within the graph

Focused on graphs with single object type and single link type

A primary focus of link analysis community

Web information analysis

PageRank and Hits are typical LBR approaches

In social network analysis (SNA), LBR is a core analysis task

Objective: rank individuals in terms of “centrality”

Degree centrality vs. eigen vector/power centrality

Rank objects relative to one or more relevant objects in the graph vs. ranks object over time in dynamic graphs

71

PageRank: Capturing Page Popularity (Brin & Page’98)

Intuitions

Links are like citations in literature

A page that is cited often can be expected to be more useful in general

PageRank is essentially “citation counting”, but improves over simple counting

Consider “indirect citations” (being cited by a highly cited paper counts a lot…)

Smoothing of citations (every page is assumed to have a non-zero citation count)

PageRank can also be interpreted as random surfing (thus capturing popularity)

72

The PageRank Algorithm (Brin & Page’98)

1

( )

0 0 1/ 2 1/ 2

1 0 0 0

0 1 0 0

1/ 2 1/ 2 0 0

1( ) (1 ) ( ) ( )

1( ) [ (1 ) ] ( )

( (1 ) )

j i

t i ji t j t k

d IN d k

i ki k

k

T

M

p d m p d p dN

p d m p dN

p I M p

d1

d2

d4

“Transition matrix”

d3

Iterate until converge

Essentially an eigenvector problem….

Same as

/N (why?)

Stationary (“stable”)

distribution, so we

ignore time

Random surfing model:At any page,

With prob. , randomly jumping to a pageWith prob. (1 – ), randomly picking a link to follow

Iij = 1/N

Initial value p(d)=1/N

73

HITS: Capturing Authorities & Hubs (Kleinberg’98)

Intuitions

Pages that are widely cited are good authorities

Pages that cite many other pages are good hubs

The key idea of HITS

Good authorities are cited by good hubs

Good hubs point to good authorities

Iterative reinforcement …

12/6/2011 74

The HITS Algorithm (Kleinberg 98)

d1

d2

d4( )

( )

0 0 1 1

1 0 0 0

0 1 0 0

1 1 0 0

( ) ( )

( ) ( )

;

;

j i

j i

i j

d OUT d

i j

d IN d

T

T T

A

h d a d

a d h d

h Aa a A h

h AA h a A Aa

“Adjacency matrix”

d3

Again eigenvector problems…

Initial values: a=h=1

Iterate

Normalize:

2 2( ) ( ) 1i i

i i

a d h d

75

Block-level Link Analysis (Cai et al. 04)

Most of the existing link analysis algorithms, e.g. PageRank and HITS, treat a web page as a single node in the web graph

However, in most cases, a web page contains multiple semantics and hence it might not be considered as an atomic and homogeneous node

Web page is partitioned into blocks using the vision-based page segmentation algorithm

extract page-to-block, block-to-page relationships

Block-level PageRank and Block-level HITS

76

Link-Based Object Classification (LBC)

Predicting the category of an object based on its attributes, its

links and the attributes of linked objects

Web: Predict the category of a web page, based on words that

occur on the page, links between pages, anchor text, html tags,

etc.

Citation: Predict the topic of a paper, based on word

occurrence, citations, co-citations

Epidemics: Predict disease type based on characteristics of the

patients infected by the disease

Communication: Predict whether a communication contact is

by email, phone call or mail

77

Challenges in Link-Based Classification

Labels of related objects tend to be correlated

Collective classification: Explore such correlations and jointly

infer the categorical values associated with the objects in the

graph

Example: Classify related news items in Reuter data sets

(Chak’98)

Simply incorp. words from neighboring documents: not

helpful

Multi-relational classification is another solution for link-based

classification

78

Group Detection

Cluster the nodes in the graph into groups that share common characteristics

Web: identifying communities

Citation: identifying research communities

Methods

Hierarchical clustering

Blockmodeling of SNA

Spectral graph partitioning

Stochastic blockmodeling

Multi-relational clustering

79

Entity Resolution

Predicting when two objects are the same, based on their attributes and their links

Also known as: deduplication, reference reconciliation, co-reference resolution, object consolidation

Applications

Web: predict when two sites are mirrors of each other

Citation: predicting when two citations are referring to the same paper

Epidemics: predicting when two disease strains are the same

Biology: learning when two names refer to the same protein

80

Entity Resolution Methods

Earlier viewed as pair-wise resolution problem: resolved based on the similarity of their attributes

Importance at considering links

Coauthor links in bib data, hierarchical links between spatial references, co-occurrence links between name references in documents

Use of links in resolution

Collective entity resolution: one resolution decision affects another if they are linked

Propagating evidence over links in a depen. graph

Probabilistic models interact with different entity recognition decisions

81

Link Prediction

Predict whether a link exists between two entities, based on attributes and other observed links

Applications

Web: predict if there will be a link between two pages

Citation: predicting if a paper will cite another paper

Epidemics: predicting who a patient’s contacts are

Methods

Often viewed as a binary classification problem

Local conditional probability model, based on structural and attribute features

Difficulty: sparseness of existing links

Collective prediction, e.g., Markov random field model

82

Link Cardinality Estimation

Predicting the number of links to an object

Web: predict the authority of a page based on the number of in-links; identifying hubs based on the number of out-links

Citation: predicting the impact of a paper based on the number of citations

Epidemics: predicting the number of people that will be infected based on the infectiousness of a disease

Predicting the number of objects reached along a path from an object

Web: predicting number of pages retrieved by crawling a site

Citation: predicting the number of citations of a particular author in a specific journal

83

Subgraph Discovery

Find characteristic subgraphs

Focus of graph-based data mining

Applications

Biology: protein structure discovery

Communications: legitimate vs. illegitimate groups

Chemistry: chemical substructure discovery

Methods

Subgraph pattern mining

Graph classification

Classification based on subgraph pattern analysis

84

Metadata Mining

Schema mapping, schema discovery, schema reformulation

cite – matching between two bibliographic sources

web - discovering schema from unstructured or semi-structured data

bio – mapping between two medical ontologies

85

Link Mining Challenges

Logical vs. statistical dependencies

Feature construction: Aggregation vs. selection

Instances vs. classes

Collective classification and collective consolidation

Effective use of labeled & unlabeled data

Link prediction

Closed vs. open world

Challenges common to any link-based statistical model (Bayesian LogicPrograms, Conditional Random Fields, Probabilistic Relational Models,Relational Markov Networks, Relational Probability Trees, Stochastic LogicProgramming to name a few)

86







Summary

87

Ref: Mining on Social Networks

D. Liben-Nowell and J. Kleinberg. The Link Prediction Problem for Social Networks. CIKM’03

P. Domingos and M. Richardson, Mining the Network Value of Customers. KDD’01

M. Richardson and P. Domingos, Mining Knowledge-Sharing Sites for Viral Marketing. KDD’02

D. Kempe, J. Kleinberg, and E. Tardos, Maximizing the Spread of Influence through a Social Network. KDD’03.

P. Domingos, Mining Social Networks for Viral Marketing. IEEE Intelligent Systems, 20(1), 80-82, 2005.

S. Brin and L. Page, The anatomy of a large scale hypertextual Web search engine. WWW7.

S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, S.R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins, Mining the link structure of the World Wide Web. IEEE Computer’99

D. Cai, X. He, J. Wen, and W. Ma, Block-level Link Analysis. SIGIR'2004.

Lecture notes from Lise Getoor’s website: www.cs.umd.edu/~getoor/

http://www.cs.umd.edu/~getoor/

Date post:	12-Aug-2020
Category:	Documents
Upload:	others
View:	2 times
Download:	0 times

No Slide Title - Leiden Universityliacs.leidenuniv.nl/~bakkerem2/dbdm2012/10_social_net.pdfOrkut...

Documents