
Karsten Borgwardt, Christoph Lippert and Nino Shervashidze: Biological Network Analysis, Page 1

Link Prediction

Karsten Borgwardt, Christoph Lippert and Nino Shervashidze

Interdepartmental Bioinformatics Group
MPI for Biological Cybernetics
MPI for Developmental Biology

Link prediction

Definition
Given two nodes x and x′, should they be connected by an edge?

Unsupervised versus supervised

Supervised: We are given a training set of edges.
Unsupervised: No such training set is available.

Similarity-score versus cluster-based

Similarity-based: Nodes are connected if they are similar.
Cluster-based: Nodes from the same cluster show similar connectivity patterns.

Section 1: Similarity-score based link prediction

Similarity-based link prediction

Unsupervised link prediction
  Direct method
  Unsupervised link prediction using kernel methods

Supervised link prediction
  Basic scheme
  Protein interaction prediction

Unsupervised link prediction

Introduction to unsupervised network inference
  Direct approach
  Statistical interpretation

Network inference by kernel-based dependence maximization
  NETHSIC

Experiments
  Social network analysis

Conclusions

Unsupervised network inference

Given a set of objects described by their attributes x_i ∈ X

Find a set E of m edges e(i, j) that correspond to interactions

Example: social network

Objects are people

Attribute is the occupation

Target network: Who is friends with whom?

A direct approach:

Measure the pairwise distances d(x_i, x_j)

Iteratively connect the least distant pair by an edge
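The direct approach can be sketched in a few lines. This is a minimal illustration, assuming squared Euclidean distance as d and undirected edges; the function name `direct_network` and the toy data are hypothetical and not part of the slides. Because the distances do not change as edges are added, the iterative selection reduces to sorting all pairs by distance.

```python
import numpy as np

def direct_network(X, m):
    """Connect the m least distant pairs of objects, in order of insertion.

    X: (n, d) array of attributes. Squared Euclidean distance is an
    assumed choice of d(x_i, x_j); the slides leave d unspecified here.
    """
    n = X.shape[0]
    diff = X[:, None, :] - X[None, :, :]
    dist = (diff ** 2).sum(axis=-1)          # pairwise squared distances
    i_idx, j_idx = np.triu_indices(n, k=1)   # each undirected pair once
    order = np.argsort(dist[i_idx, j_idx], kind="stable")
    return [(int(i_idx[t]), int(j_idx[t])) for t in order[:m]]

# Toy example: four objects with a single attribute.
X = np.array([[0.0], [0.1], [1.0], [1.3]])
edges = direct_network(X, 2)
print(edges)
```

The order of the returned list is the order of insertion, which is what a ranking-based evaluation would consume.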

Direct approach

Measure the pairwise distances d(x_i, x_j) induced by a kernel k̃(x_i, x_j) on the centered attributes

Iteratively connect the least distant pair by an edge

$$
\begin{aligned}
&\operatorname*{argmin}_{e'} \sum_{(i,j) \in E \cup \{e'\}} d(x_i, x_j) \\
&= \operatorname*{argmin}_{e'} \sum_{(i,j) \in E \cup \{e'\}} \left( \tilde{k}(i,i) + \tilde{k}(j,j) - 2\tilde{k}(i,j) \right) \\
&= \operatorname*{argmin}_{e'} \sum_{i,j} \left[ \tilde{K} \odot (D - A_{E \cup \{e'\}}) \right]_{ij} \\
&= \operatorname*{argmin}_{e'} \operatorname{tr}(\tilde{K} L_{E \cup \{e'\}}) \\
&= \operatorname*{argmin}_{e'} \operatorname{tr}(HKH L_{E \cup \{e'\}}) \\
&= \operatorname*{argmax}_{e'} \operatorname{tr}\left(HKH (aI - L_{E \cup \{e'\}})^{1}\right) \\
&= \operatorname*{argmax}_{e'} \frac{1}{(n-1)^2} \operatorname{tr}\left(HKH L^{1\text{-step}}_{E \cup \{e'\}}\right) \\
&= \operatorname*{argmax}_{e'} \operatorname{HSIC}\left(K, L^{1\text{-step}}_{E \cup \{e'\}}\right)
\end{aligned}
$$

A: adjacency matrix
D: diagonal matrix holding the degree of each node; D(i, i) = ∑_j A(i, j)
L: graph Laplacian
L^{p-step}: p-step random walk kernel

So the direct approach iteratively maximizes the Hilbert-Schmidt independence criterion between a kernel on the attributes and a 1-step random walk kernel on the nodes in the network.

HSIC

Hilbert-Schmidt independence criterion (Gretton et al., 2005)

Let F and G be RKHS on X and Y, with mappings φ : X → F and ψ : Y → G

HSIC is a measure of dependence between F and G

$$\operatorname{HSIC}(\mathcal{F}, \mathcal{G}, \Pr\nolimits_{xy}) := \| C_{xy} \|_{HS}^{2}$$

For pairs of finite samples X, Y an empirical estimate of HSIC can be computed in terms of kernels

$$\operatorname{HSIC}(K, L) := \frac{1}{(n-1)^2} \operatorname{tr}(HKHL)$$

$$K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle, \qquad L_{ij} = \langle \psi(y_i), \psi(y_j) \rangle, \qquad H_{ij} = \delta_{ij} - n^{-1}$$

⇒ The direct approach maximizes the dependence between representations of the objects in the spaces induced by a kernel on the attributes and a 1-step random walk kernel on the network.
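The empirical estimate tr(HKHL)/(n − 1)^2 is short enough to compute directly. The sketch below assumes linear kernels on one-dimensional toy samples; the function name `hsic` and the data are illustrative only. With a linear kernel the estimate equals the squared empirical cross-covariance term, so a sample paired with itself scores high and a sample paired with an uncorrelated one scores zero.

```python
import numpy as np

def hsic(K, L):
    """Empirical HSIC: tr(HKHL) / (n - 1)^2 with H_ij = delta_ij - 1/n."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    return np.trace(H @ K @ H @ L) / (n - 1) ** 2

# Toy samples (hypothetical): y_dep copies x, y_unc is uncorrelated with x.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y_dep = x.copy()
y_unc = np.array([1.0, -1.0, 0.0, -1.0, 1.0])

K = np.outer(x, x)               # linear kernel on x
hsic_dep = hsic(K, np.outer(y_dep, y_dep))
hsic_unc = hsic(K, np.outer(y_unc, y_unc))
print(hsic_dep, hsic_unc)
```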


NETHSIC

exploits the fact that HSIC can be estimated using kernels.

K(i, j) = k(x_i, x_j): attribute kernel

L_E(i, j) = l_E(x_i, x_j): node kernel

$$\operatorname*{argmax}_{E \subset (V \times V) \,\wedge\, |E| = m} \frac{1}{(n-1)^2} \operatorname{tr}(HKHL_E)$$

O(n^m) number of choices ⇒ use greedy selection of m edges

real-world networks are often sparse ⇒ do greedy forward selection of edges

Input: the set of nodes V, number of edges m, attribute kernel k and node kernel l
Output: a subset E of V × V of size m

E ← ∅
repeat
    e = argmax_{e′ ∈ V×V} tr(HKHL_{E∪{e′}})
    E ← E ∪ {e}
until |E| = m

Algorithm 1: NETHSIC forward selection
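A possible Python rendering of the forward selection is sketched below. It is not the authors' implementation: it assumes undirected edges without self-loops, drops the constant factor 1/(n − 1)^2 (which does not change the argmax), and uses a 1-step random walk kernel aI − L as the default node kernel with the hypothetical choice a = n. The names `nethsic` and `one_step_kernel` are made up for this example.

```python
import numpy as np
from itertools import combinations

def one_step_kernel(A):
    """1-step random walk kernel aI - L with L = D - A; a = n is an assumption."""
    n = A.shape[0]
    laplacian = np.diag(A.sum(axis=1)) - A
    return n * np.eye(n) - laplacian

def nethsic(K, m, node_kernel=one_step_kernel):
    """Greedy forward selection of m edges maximizing tr(HKH L_E)."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    HKH = H @ K @ H
    A = np.zeros((n, n))
    candidates = sorted(combinations(range(n), 2))
    edges = []
    while len(edges) < m and candidates:
        best, best_score = None, -np.inf
        for i, j in candidates:
            A[i, j] = A[j, i] = 1.0            # tentatively add edge e'
            score = np.trace(HKH @ node_kernel(A))
            A[i, j] = A[j, i] = 0.0            # undo
            if score > best_score:
                best, best_score = (i, j), score
        A[best[0], best[1]] = A[best[1], best[0]] = 1.0
        edges.append(best)
        candidates.remove(best)
    return edges

# Toy example: linear attribute kernel on four one-dimensional objects.
X = np.array([[0.0], [0.1], [1.0], [1.3]])
selected = nethsic(X @ X.T, 2)
print(selected)
```

With the 1-step kernel and a linear attribute kernel, the selected edges coincide with the least distant pairs, in line with the statistical interpretation of the direct approach. Passing a different `node_kernel` changes which topology is favoured.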

NETHSIC – node kernels L_E

Who is friends with whom?

What happens if we look at a different kind of relation between the objects?

Who has a trade relation with whom?

NETHSIC is kernel based

Network topology defined by node kernel L_E
Here 1-step random walk does not fit

Define node kernel L_E expressing prior knowledge about the network structure

choice of node kernel L_E defines the topology of the network

$$\operatorname*{argmax}_{E \subset (V \times V) \,\wedge\, |E| = m} \frac{1}{(n-1)^2} \operatorname{tr}(HKHL_E)$$

1-step: (aI − L)^1 ⇒ similar x_i are connected
Laplacian: L = D − A ⇒ dissimilar x_i are connected
degree: ⟨δ(i), δ(j)⟩ ⇒ similar x_i have similar degrees
A^2: A^2 ⇒ similar x_i share many neighbors
closeness: ⟨C_C(i), C_C(j)⟩ ⇒ similar x_i have similar closeness centrality
betweenness: ⟨C_B(i), C_B(j)⟩ ⇒ similar x_i have similar betweenness centrality

$$C_C(i) = (n-1)^{-1} \sum_{t \in V \setminus \{i\}} d_G(i, t)$$

Average shortest path length d_G between i and all other nodes in G.

$$C_B(i) = \sum_{\substack{s \neq i \neq t \in V \\ s \neq t}} \frac{\sigma_{st}(i)}{\sigma_{st}}$$

Fraction of the shortest paths σ_st between s and t that pass through i, summed over all pairs s, t.
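Three of the node kernels listed above can be built from the adjacency matrix alone. The sketch below is illustrative: the function names are hypothetical, and the degree kernel is taken to be the product of the two node degrees, which is one reading of ⟨δ(i), δ(j)⟩ for scalar degrees. The closeness and betweenness kernels need shortest-path computations and are left out.

```python
import numpy as np

def laplacian_kernel(A):
    """Graph Laplacian L = D - A."""
    return np.diag(A.sum(axis=1)) - A

def degree_kernel(A):
    """Entry (i, j) is degree(i) * degree(j) (assumed reading of <delta(i), delta(j)>)."""
    deg = A.sum(axis=1)
    return np.outer(deg, deg)

def common_neighbor_kernel(A):
    """A^2: entry (i, j) counts the neighbors shared by i and j."""
    return A @ A

# Path graph 0 - 1 - 2 as a toy example.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
print(degree_kernel(A))
print(common_neighbor_kernel(A))
```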


Experiments

Countries trade data (Wassermann et al., 1994)

24 countries

3 attributes:
population size
GNP per capita
energy usage

reference network:
trade relations of basic manufactured goods

experimental setup:

linear kernel on each attribute

set m to max number of edges

rank edges by order of insertion

compute area under ROC curve using reference network

Countries trade data (Wassermann et al., 1994)

[Figure: area under ROC curve (axis from 0% to 90%) for the attributes population size, GNP per capita and energy consumption; legend: 1-step random walk, Laplacian, degree, mutual information, < 5% quantile, > 95% quantile]

Degree kernel often shows best results

Some results are below 5% quantile

Often it is not desirable to connect most similar nodes


Conclusions

Kernel method for unsupervised network inference (NETHSIC)

Statistically motivated

High flexibility by choice of node kernel L_E that can define complex network topologies

Allows for a statistical interpretation of direct approaches

In real-world networks it is not always desirable to connect the most similar objects

Future work: Use NETHSIC for network completion

$$\operatorname*{argmax}_{E \subset (V \times V) \,\wedge\, |E| = m} \frac{1}{(n-1)^2} \operatorname{tr}(HKHL_E)$$

Supervised approaches

Setting
We are now given a training set of edges E_training
We try to infer a rule, a classifier, from this set E_training that allows us to predict edges on the test set E_test.

Ingredients
a similarity measure or metric for two pairs of nodes
a set of negative examples of non-interacting nodes
a classifier that turns these similarity scores into predictions

Pairwise similarity measures

Tensor pairwise kernel (Ben-Hur and Noble, ISMB 2005)
Given two pairs of nodes (a, b) and (c, d).

$$k_{\text{tensor}}((a, b), (c, d)) = k_{\text{nodes}}(a, c)\, k_{\text{nodes}}(b, d) + k_{\text{nodes}}(a, d)\, k_{\text{nodes}}(b, c) \tag{1}$$

This kernel quantifies the similarity of the source and target nodes in both edges, for both directions.
k_nodes is a kernel that measures the similarity of two nodes, just like the ones that are used for unsupervised link prediction.
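Equation (1) translates directly into code once a node kernel matrix is available. The sketch below assumes a precomputed matrix `K_nodes` from a linear kernel on made-up feature vectors; the function name `tensor_pairwise_kernel` is hypothetical.

```python
import numpy as np

def tensor_pairwise_kernel(K_nodes, pair1, pair2):
    """k_tensor((a, b), (c, d)) = k(a, c) k(b, d) + k(a, d) k(b, c)."""
    a, b = pair1
    c, d = pair2
    return K_nodes[a, c] * K_nodes[b, d] + K_nodes[a, d] * K_nodes[b, c]

# Toy node features (assumed) and a linear node kernel.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K_nodes = X @ X.T

k_same = tensor_pairwise_kernel(K_nodes, (0, 1), (0, 1))
k_swapped = tensor_pairwise_kernel(K_nodes, (0, 1), (1, 0))
print(k_same, k_swapped)
```

Swapping the two nodes inside a pair leaves the value unchanged, which is the "both directions" property stated on the slide.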

Method 1: Direct similarity-based prediction

Motivation: “connect similar genes”
Connect a and b if d(a, b) is below a threshold.
This is an unsupervised approach (no use of the known subnetwork).

J.-P. Vert (Ecole des Mines) Supervised network inference 5 / 19


Method 2: metric learning

Motivation: use the known subnetwork to refine the distance measure, before applying the similarity-based method
Based on kernel CCA (Yamanishi et al., 2004) or kernel metric learning (V. and Yamanishi, 2005).


Metric learning pairwise kernel (Vert et al., 2007)
Given two pairs of nodes (a, b) and (c, d).

$$
\begin{aligned}
k_{\text{ml}}((a, b), (c, d)) &= \left( k_{\text{nodes}}(a, c) - k_{\text{nodes}}(a, d) - k_{\text{nodes}}(b, c) + k_{\text{nodes}}(b, d) \right)^2 \\
&= \left[ (\phi(a) - \phi(b))^\top (\phi(c) - \phi(d)) \right]^2
\end{aligned}
\tag{2}
$$

k_nodes is a kernel that measures the similarity of two nodes, just like the ones that are used for unsupervised link prediction.
a pair (a, b) is similar to a pair (c, d)
if a − b is similar to c − d, or . . .
if a − b is similar to d − c.
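The two forms in equation (2) can be checked against each other numerically. The sketch below assumes a linear node kernel, so that φ is the identity on made-up feature vectors; the function name `metric_learning_pairwise_kernel` is hypothetical.

```python
import numpy as np

def metric_learning_pairwise_kernel(K_nodes, pair1, pair2):
    """k_ml((a, b), (c, d)) = (k(a, c) - k(a, d) - k(b, c) + k(b, d))^2."""
    a, b = pair1
    c, d = pair2
    return (K_nodes[a, c] - K_nodes[a, d] - K_nodes[b, c] + K_nodes[b, d]) ** 2

# Toy node features (assumed); with a linear kernel, phi(x) = x.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K_nodes = X @ X.T

k_ml = metric_learning_pairwise_kernel(K_nodes, (0, 1), (0, 2))
k_feature = float(np.dot(X[0] - X[1], X[0] - X[2]) ** 2)   # feature-space form
k_flipped = metric_learning_pairwise_kernel(K_nodes, (0, 1), (2, 0))
print(k_ml, k_feature, k_flipped)
```

Because of the square, flipping (c, d) to (d, c) does not change the value, which matches the "a − b is similar to d − c" case.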

Protein interaction prediction

Setting
Protein-protein interactions (PPI) from yeast two-hybrid screens and mass spectrometry measurements provide only a partial view of the interactome
Goal of protein interaction prediction is to complete the interactome by link prediction

Sequence-based PPI prediction
domain or motif-based interaction prediction (Sprinzak and Margalit, 2001; Deng et al., 2002; Gomez et al., 2003; Wang et al., 2004)
3-mer sequence kernel (Martin et al., 2005)
phylogenetic trees (Ramani and Marcotte, 2003), correlated mutations (Pazos and Valencia, 2002) derived from sequence

Negative examples
Jansen et al., 2003: pairs of proteins from different cellular locations
Ben-Hur & Noble, 2005: select random pairs of non-interacting proteins

Ben-Hur & Noble, 2005
Used 3-mer kernel, kernel based on sequence and domain motifs, kernel based on GO annotation, interactions in other species and common neighbours
PPI prediction on BIND physical interaction dataset via SVM: AUC of 0.97, ROC50 of 0.58

Section 2: Cluster-based link prediction

Cluster-based link prediction

Approach
Similar nodes form a cluster
Nodes from the same cluster exhibit a similar connectivity pattern

Problems to be solved
How to find clusters on a graph? → graph-based clustering
How to define a connectivity pattern of a cluster?

Graph-based clustering I

Data representation
dataset D is given in terms of a graph G = (V, E)
a data object v_i is a node in G
edge e(i, j) from node v_i to node v_j has weight w(i, j)

Graph-based clustering
Define a threshold θ
Remove all edges e(i, j) from G with weight w(i, j) ≤ θ (weights read as similarities; with distance weights, remove edges with w(i, j) > θ instead)
Each connected component of the graph now corresponds to one cluster
Two nodes are in the same connected component if there is a path between them
Graph components can be found by depth-first search in a graph (O(|V| + |E|))
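Threshold-based clustering amounts to filtering edges and labelling connected components. The sketch below is an illustration under one assumption that should be stated loudly: the weights are treated as similarities, so only edges with weight above θ are kept (with distance weights the comparison would flip). The function name `threshold_clusters` and the toy graph are hypothetical.

```python
def threshold_clusters(n, weighted_edges, theta):
    """Label connected components after keeping only edges with weight > theta.

    weighted_edges: iterable of (i, j, w) with w read as a similarity
    (an assumption). Runs an iterative depth-first search in O(|V| + |E|).
    """
    adj = {v: [] for v in range(n)}
    for i, j, w in weighted_edges:
        if w > theta:
            adj[i].append(j)
            adj[j].append(i)
    labels = [-1] * n
    cluster = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = cluster
        stack = [start]
        while stack:
            v = stack.pop()
            for u in adj[v]:
                if labels[u] == -1:
                    labels[u] = cluster
                    stack.append(u)
        cluster += 1
    return labels

# Toy chain graph; the weak edge (2, 3) separates two clusters.
toy_edges = [(0, 1, 0.9), (1, 2, 0.8), (2, 3, 0.2), (3, 4, 0.7)]
cluster_labels = threshold_clusters(5, toy_edges, 0.5)
print(cluster_labels)
```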

Graph-based clustering II

[Figure: original graph]

Graph-based clustering III

[Figure: thresholded graph (θ = 0.5)]

Graph-based clustering IV

But how to get the graph in the first place?
Think of the weights as a similarity measure.
If two nodes are not connected, then their similarity measure is 0.
Graph-based clustering creates clusters of similar objects
For any object v_i in a cluster, there is a second object v_j such that similarity(v_i, v_j) is larger than θ.

DBScan I

Noise-robust graph-based clustering
Graph-based clustering can suffer from the fact that one noisy edge connects two clusters
DBScan (Ester et al., 1996) is a noise-robust extension of graph-based clustering
DBScan is short for Density-Based Spatial Clustering of Applications with Noise

Core object
Two objects v_i and v_j with distance d(v_i, v_j) < ε belong to the same cluster if either v_i or v_j is a core object.
v_i is a core object iff there are at least MinPoints points within a distance of ε from v_i.
A cluster is defined by iteratively checking this core object property.

DBScan II

DBSCAN (SetOfPoints, Eps, MinPts)
  // SetOfPoints is UNCLASSIFIED
  ClusterId := nextId(NOISE);
  for i FROM 1 TO SetOfPoints.size do
    Point := SetOfPoints.get(i);
    if Point.ClId = UNCLASSIFIED then
      if ExpandCluster(SetOfPoints, Point, ClusterId, Eps, MinPts) then
        ClusterId := nextId(ClusterId)
      end if
    end if
  end for

DBScan III and IV

Code: ExpandCluster
ExpandCluster(SetOfPoints, Point, ClId, Eps, MinPts): Boolean;
  seeds := SetOfPoints.regionQuery(Point, Eps);
  if seeds.size < MinPts then
    SetOfPoints.changeClId(Point, NOISE);
    RETURN False;
  else
    SetOfPoints.changeClIds(seeds, ClId);
    seeds.delete(Point);
    while seeds <> Empty
      currentP := seeds.first();
      result := SetOfPoints.regionQuery(currentP, Eps);
      if result.size >= MinPts then
        for i FROM 1 TO result.size do
          resultP := result.get(i);
          if resultP.ClId IN (UNCLASSIFIED, NOISE) then
            if resultP.ClId = UNCLASSIFIED then
              seeds.append(resultP);
            end if
            SetOfPoints.changeClId(resultP, ClId);
          end if // UNCLASSIFIED or NOISE
        end for;
      end if; // result.size >= MinPts
      seeds.delete(currentP);
    end while; // seeds <> Empty
    RETURN True;
  end if
end // ExpandCluster
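The pseudocode above maps almost line by line onto Python. The following is a sketch rather than a reference implementation: it assumes Euclidean distance, a brute-force region query that includes the query point itself and uses distance ≤ Eps, and integer labels with −2 for UNCLASSIFIED and −1 for NOISE. All function names are hypothetical.

```python
import numpy as np

UNCLASSIFIED, NOISE = -2, -1

def region_query(X, p, eps):
    """Indices of all points within distance eps of point p (p included)."""
    dist = np.linalg.norm(X - X[p], axis=1)
    return [int(q) for q in np.flatnonzero(dist <= eps)]

def expand_cluster(X, labels, p, cluster_id, eps, min_pts):
    seeds = region_query(X, p, eps)
    if len(seeds) < min_pts:          # p is not a core object
        labels[p] = NOISE
        return False
    for q in seeds:
        labels[q] = cluster_id
    seeds.remove(p)
    while seeds:
        current = seeds.pop(0)
        result = region_query(X, current, eps)
        if len(result) >= min_pts:    # current is a core object
            for q in result:
                if labels[q] in (UNCLASSIFIED, NOISE):
                    if labels[q] == UNCLASSIFIED:
                        seeds.append(q)
                    labels[q] = cluster_id
    return True

def dbscan(X, eps, min_pts):
    labels = [UNCLASSIFIED] * len(X)
    cluster_id = 0
    for p in range(len(X)):
        if labels[p] == UNCLASSIFIED:
            if expand_cluster(X, labels, p, cluster_id, eps, min_pts):
                cluster_id += 1
    return labels

# Toy data: two dense groups and one isolated point.
points = np.array([[0.0, 0.0], [0.0, 0.3], [0.3, 0.0],
                   [5.0, 5.0], [5.0, 5.3], [10.0, 10.0]])
point_labels = dbscan(points, eps=0.5, min_pts=2)
print(point_labels)
```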

DBScan V

[Figure: original graph]

DBScan VI

[Figure: DBScan-clustered graph (MinPts = 2, Eps = 0.5)]

DBScan VII

[Figure: original graph]

DBScan VIII

[Figure: DBScan-clustered graph (MinPts = 3, Eps = 0.5)]

DBScan IX

Properties
Cluster assignment of border points is order-dependent
Unlike k-means, one does not have to specify the number of clusters a priori
But one has to set MinPts and Eps
Ester et al. report that for 2D examples MinPts=4 is sufficient for good results
They determine Eps by visual inspection of a k-distance plot
Transfer question: How to kernelise DBScan?

Relational learning

Properties
Represents the graph as a probability distribution, in terms of a graphical model.
A graphical model is a probabilistic model for which a graph denotes the conditional independence structure between the nodes, that is, the random variables.
A link r is a random variable in this model, typically a binary variable:
r = 1; link does exist
r = 0; link does not exist

[Figure: Link prediction based on node attributes]

[Figure: Link prediction based on cluster membership]

Variants of cluster-based relational learning

Links between all members of the same cluster, no links between members of different clusters:
P(r = 1 | z_a, z_b) = 1 if z_a = z_b
P(r = 1 | z_a, z_b) = 0 if z_a ≠ z_b

Links between all members of the same cluster, fixed link probability between members of different clusters:
P(r = 1 | z_a, z_b) = 1 if z_a = z_b
P(r = 1 | z_a, z_b) = c if z_a ≠ z_b and 0 ≤ c ≤ 1

Link probability η(a, b) between members of clusters a and b:
P(r = 1 | z_a, z_b) = η(a, b) and η(a, b) ∼ Beta(β, β)
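The three variants differ only in how the link probability is read off the cluster assignments. A minimal sketch follows, with hypothetical function names and an arbitrary β; whether η should be symmetric is not stated on the slide, so the sampled matrix is left unconstrained here.

```python
import numpy as np

def link_prob_hard(za, zb):
    """Variant 1: links within a cluster, none between clusters."""
    return 1.0 if za == zb else 0.0

def link_prob_fixed(za, zb, c):
    """Variant 2: links within a cluster, fixed probability c between clusters."""
    return 1.0 if za == zb else c

def sample_eta(num_clusters, beta, rng):
    """Variant 3: one Beta(beta, beta) link probability per cluster pair."""
    return rng.beta(beta, beta, size=(num_clusters, num_clusters))

def link_prob_eta(za, zb, eta):
    return float(eta[za, zb])

rng = np.random.default_rng(0)
eta = sample_eta(3, beta=0.5, rng=rng)
print(link_prob_hard(0, 1), link_prob_fixed(0, 1, c=0.1), link_prob_eta(0, 1, eta))
```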

Infinite (Hidden) Relational Model (IRM, IHRM)
developed independently by Kemp et al. and Xu et al. in 2006
Cluster nodes via a Chinese Restaurant Process
Link probability η(a, b) between members of clusters a and b

Chinese restaurant process in a nutshell

$$
P(z_i = a \mid z_1, \ldots, z_{i-1}) =
\begin{cases}
\dfrac{n_a}{i - 1 + \gamma} & n_a > 0 \\[2ex]
\dfrac{\gamma}{i - 1 + \gamma} & a \text{ is a new cluster}
\end{cases}
$$

where z_1, . . . , z_{i−1} are the cluster assignments of objects 1, . . . , i − 1, n_a is the number of objects assigned to cluster a, and γ is a parameter.
The more objects there are in a cluster, the more likely it is that a new data point is also assigned to this cluster.
The creation of a new cluster is also possible.
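The assignment rule can be simulated sequentially. The sketch below assumes clusters are labelled by integers in order of creation; the function names `crp_probabilities` and `sample_crp` are hypothetical.

```python
import numpy as np

def crp_probabilities(counts, gamma):
    """Assignment probabilities for the next object.

    counts[a] = n_a for each existing cluster; the last returned entry
    is the probability of opening a new cluster.
    """
    total = sum(counts) + gamma            # equals i - 1 + gamma
    return [n_a / total for n_a in counts] + [gamma / total]

def sample_crp(n, gamma, rng):
    """Draw cluster assignments z_1, ..., z_n from a Chinese restaurant process."""
    counts, z = [], []
    for _ in range(n):
        probs = crp_probabilities(counts, gamma)
        a = int(rng.choice(len(probs), p=probs))
        if a == len(counts):               # a new cluster is created
            counts.append(1)
        else:
            counts[a] += 1
        z.append(a)
    return z

example_probs = crp_probabilities([3, 1], gamma=1.0)
assignments = sample_crp(20, gamma=1.0, rng=np.random.default_rng(0))
print(example_probs, assignments)
```

With counts (3, 1) and γ = 1, the larger cluster is three times as likely as the smaller one, and a new cluster is as likely as the smaller one.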

