
A discussion on sampling graphs to approximate network classification functions

Description:
The problem of network classification consists in assigning a finite set of labels to the nodes of a graph; the underlying assumption is that nodes with the same label tend to be connected via strong paths. This is similar to the assumption made by graph-based semi-supervised learning algorithms, which build an artificial graph from vectorial data. Such semi-supervised algorithms are based on label propagation principles, and their accuracy relies heavily on the structure (the presence of edges) of the graph. In this talk I will discuss ideas on how to sample the network graph, thus sparsifying its structure, in order to apply semi-supervised algorithms and compute the classification function on the network efficiently. I will show very preliminary experiments indicating that the sampling technique has an important effect on the final results, and discuss open theoretical and practical questions that remain to be solved.
Transcript
Page 1: A discussion on sampling graphs to approximate network classification functions

A discussion on sampling graphs to approximate network classification functions

(work in progress)

Gemma C Garriga (INRIA)

[email protected]

22.09.2011

Page 2: A discussion on sampling graphs to approximate network classification functions

Outline

Starting point

Classification in networks

Samples of graphs

Some first experiments

Page 3: A discussion on sampling graphs to approximate network classification functions

Outline

Starting point

Classification in networks

Samples of graphs

Some first experiments

Page 4: A discussion on sampling graphs to approximate network classification functions

Network classification problem

Learn a classification function f : X → Y for nodes x ∈ G

Relaxing to f : X → ℝ means inferring a probability Pr(y | {x}_n, G)

Aka collective classification or within-network prediction: nodes with the same label tend to be clustered together

Page 5: A discussion on sampling graphs to approximate network classification functions

Network classification problem

Challenges

Sparsely labeled: few labeled nodes but many unlabeled nodes

Heterogeneous types of contents, multiple types of links

Network structure (which edges are in the graph) affects the accuracy of the models

Networks are of large size

Related to

Semi-supervised learning based on graphs

Page 6: A discussion on sampling graphs to approximate network classification functions

Semi-supervised learning

Goal
Build a learner f that can label input instances x into different classes or categories y

Notation

input instance x, label y

learner f : X → Y

labeled data (X_l, Y_l) = {(x_{1:l}, y_{1:l})}

unlabeled data X_u = {x_{l+1:n}}, available during training

usually l ≪ n

Semi-supervised learning

Use both labeled and unlabeled data to build better learners

Page 7: A discussion on sampling graphs to approximate network classification functions

Semi-supervised graph-based methods

Transform vectorial data into a graph

Nodes: labeled and unlabeled X_l ∪ X_u

Edges: weighted edges (x_i, x_j) computed from features

Weights represent similarity, e.g. w_ij = exp(−γ ||x_i − x_j||²)

Sparsify with: k-nearest-neighbor graph, threshold graph (ε-distance graph), ...

The general idea is that similarity is implied via all paths in the graph
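For concreteness, a minimal sketch of this construction in Python with numpy: the RBF weights and k-nearest-neighbor sparsification follow the formulas above, while `k` and `gamma` are free parameters, not values prescribed by the talk.

```python
import numpy as np

def knn_rbf_graph(X, k=10, gamma=1.0):
    """Weighted graph from vectorial data X (n x d): RBF similarities
    w_ij = exp(-gamma * ||x_i - x_j||^2), sparsified by keeping only
    each node's k nearest neighbors, then symmetrized."""
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0)  # squared distances
    np.fill_diagonal(d2, np.inf)                                   # exclude self-loops
    W = np.zeros((n, n))
    for i in range(n):
        nn = np.argsort(d2[i])[:k]                                 # k nearest neighbors of x_i
        W[i, nn] = np.exp(-gamma * d2[i, nn])
    return np.maximum(W, W.T)  # keep an edge if either endpoint selected it
```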

Page 8: A discussion on sampling graphs to approximate network classification functions

Semi-supervised graph-based methods

Smoothness assumption

In a weighted graph, nodes that are similar are connected by heavy edges (high-density regions) and therefore tend to have the same label. Density is not uniform

[From Zhu et al. ICML 2003]

Page 9: A discussion on sampling graphs to approximate network classification functions

The harmonic function

Relaxing discrete labels to real values with f : X → ℝ that satisfies:

1 f(x_i) = y_i for i = 1 . . . l

2 f minimizes the energy function ∑_{ij} w_ij (f(x_i) − f(x_j))²

3 it is the mean of the associated Gaussian random field

4 the harmonic property means

f(x_i) = ∑_{j∼i} w_ij f(x_j) / ∑_{j∼i} w_ij

Page 10: A discussion on sampling graphs to approximate network classification functions

Harmonic solution with iterative method

An iterative method, as in self-training:

1 Set f(x_i) = y_i for i = 1 . . . l and f(x_j) arbitrary for x_j ∈ X_u

2 Repeat until convergence:

. Set f(x_i) = ∑_{j∼i} w_ij f(x_j) / ∑_{j∼i} w_ij on the unlabeled nodes

. Always keep f(X_l) fixed
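A minimal sketch of this loop, assuming a dense numpy weight matrix `W`, a label vector `y` (with arbitrary entries on unlabeled nodes), and a boolean mask `labeled`; a fixed iteration count stands in for a convergence test.

```python
import numpy as np

def harmonic_iterative(W, y, labeled, n_iter=100):
    """Label propagation toward the harmonic function.
    W: (n, n) symmetric non-negative weight matrix.
    y: (n,) label values, clamped on the labeled nodes.
    labeled: (n,) boolean mask of the l labeled nodes."""
    f = y.astype(float)
    deg = W.sum(axis=1)
    for _ in range(n_iter):
        # f(x_i) <- sum_{j~i} w_ij f(x_j) / sum_{j~i} w_ij
        f_new = W @ f / np.maximum(deg, 1e-12)   # guard against isolated nodes
        f = np.where(labeled, y, f_new)          # keep f(X_l) fixed
    return f
```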

Page 11: A discussion on sampling graphs to approximate network classification functions

A random walk interpretation on directed graphs

Randomly walk from node i to j with probability w_ij / ∑_k w_ik

The harmonic function gives Pr(hit label 1 | start from i)

[From Zhu’s tutorial at ICML 2007]
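This interpretation is easy to check by direct simulation. A small Monte Carlo sketch, under the assumptions of binary labels in {0, 1} and a graph where every walk eventually reaches a labeled node; the adjacency representation and walk budget are illustrative choices.

```python
import random

def hit_label_probability(W_rows, labels, start, n_walks=1000):
    """Estimate Pr(hit label 1 | start from i): walk from node to node with
    probability w_ij / sum_k w_ik, stopping at the first labeled node.
    W_rows: dict node -> list of (neighbor, weight) pairs.
    labels: dict of labeled nodes -> {0, 1}."""
    hits = 0
    for _ in range(n_walks):
        v = start
        while v not in labels:
            nbrs, ws = zip(*W_rows[v])
            v = random.choices(nbrs, weights=ws)[0]  # step proportional to edge weight
        hits += labels[v]
    # this estimate approaches the harmonic f(start) as n_walks grows
    return hits / n_walks
```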

Page 12: A discussion on sampling graphs to approximate network classification functions

Harmonic solution with graph Laplacian

Let W be the n × n weight matrix on X_l ∪ X_u

. Symmetric and non-negative

Let D be the diagonal degree matrix: D_ii = ∑_{j=1}^{n} w_ij

The graph Laplacian is ∆ = D − W

The energy function can be rewritten:

min_f ∑_{ij} w_ij (f(x_i) − f(x_j))² = min_f fᵀ∆f

The harmonic solution solves f_u = −∆_uu⁻¹ ∆_ul Y_l

Complexity of O(n³)
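A sketch of this closed-form solve with numpy; `labeled` is a boolean mask and `y_l` holds the labels of the labeled nodes in mask order. The O(n³) cost is the dense linear solve.

```python
import numpy as np

def harmonic_closed_form(W, y_l, labeled):
    """Exact harmonic solution f_u = -Laplacian_uu^{-1} Laplacian_ul y_l.
    W: (n, n) weight matrix; y_l: labels of the labeled nodes;
    labeled: (n,) boolean mask."""
    D = np.diag(W.sum(axis=1))
    L = D - W                                  # graph Laplacian
    u = ~labeled
    L_uu = L[np.ix_(u, u)]
    L_ul = L[np.ix_(u, labeled)]
    f_u = np.linalg.solve(L_uu, -L_ul @ y_l)   # the O(n^3) step
    f = np.empty(W.shape[0])
    f[labeled] = y_l
    f[u] = f_u
    return f
```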

Page 13: A discussion on sampling graphs to approximate network classification functions

Outline

Starting point

Classification in networks

Samples of graphs

Some first experiments

Page 14: A discussion on sampling graphs to approximate network classification functions

Characteristics of network data

So, can one use graph-based semi-supervised learning for networks? Some reflections:

+ The smoothness assumption can be seen as a clustering assumption, or a community-structure assumption

Groups of nodes that are similar tend to be more densely connected among themselves than with the rest of the network

+ The Laplacian matrix could help to integrate both the vectorial data and the structure of the network

− However, networks have scale-free degree distributions

Structure of the links influences iterative propagation

− Networks can be very large

Page 15: A discussion on sampling graphs to approximate network classification functions

How to use graph samples

First idea:

1 For i = 1 . . . |samples| do:

Extract graph sample G_i ≺ G from the full graph
Apply the harmonic iterative algorithm to G_i to get f(u), u ∈ G_i

2 Average f(u) for nodes u ∈ {G_i} selected in several samples

3 For all nodes v that did not appear in any sample do:

Make random walks to k nodes touched by the samples
Compute a weighted average of the k labels found:

f(v) = (1 / ∑_{j=1...k} d(v, u_j)) ∑_{j=1...k} d(v, u_j) f(u_j)
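A rough sketch of this first idea in Python with networkx. The sampler and the harmonic solver are passed in as callables (for instance, the iterative method sketched earlier or the sampling routines sketched later); the walk budget and the shortest-path distances standing in for d(v, u_j) are placeholder choices, not the talk's exact ones.

```python
import random
import networkx as nx

def sample_and_classify(G, labels, sampler, harmonic, n_samples=10, k=5, walk_len=50):
    """Sketch of the sampling pipeline. G: networkx graph; labels: dict
    node -> label on the labeled nodes; sampler: G -> subgraph G_i;
    harmonic: (G_i, labels) -> dict node -> f(u)."""
    f_sum, f_cnt = {}, {}
    for _ in range(n_samples):
        Gi = sampler(G)
        for u, val in harmonic(Gi, labels).items():
            f_sum[u] = f_sum.get(u, 0.0) + val
            f_cnt[u] = f_cnt.get(u, 0) + 1
    # step 2: average f(u) over the samples in which u was selected
    f = {u: f_sum[u] / f_cnt[u] for u in f_sum}

    def walk_to_sampled(v):
        # random-walk from v until some node touched by a sample is hit
        for _ in range(walk_len):
            nbrs = list(G.neighbors(v))
            if not nbrs:
                return None
            v = random.choice(nbrs)
            if v in f:
                return v
        return None

    # step 3: label nodes missed by every sample via k random walks
    for v in list(G.nodes()):
        if v in f:
            continue
        hits = [u for u in (walk_to_sampled(v) for _ in range(k)) if u is not None]
        if hits:
            # weighted average with weights d(v, u_j), as in the formula above
            w = [nx.shortest_path_length(G, v, u) for u in hits]
            f[v] = sum(wi * f[u] for wi, u in zip(w, hits)) / sum(w)
    return f
```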

Page 16: A discussion on sampling graphs to approximate network classification functions

How can samples help?

Samples have fewer edges than the full graph, so diffusion is different from the full graph

Subgraphs will be random, so we may hope for good behavior on average

The iterative algorithm (or the Laplacian harmonic solution) will be applied only on samples. Complexity is reduced

The nodes not contained in any sample will be labeled following the assumptions of the random walk interpretation given by the harmonic iterative solution

[From Zhu’s tutorial at ICML 2007]

Page 17: A discussion on sampling graphs to approximate network classification functions

How can samples fail to help?

It depends on how the graph samples are extracted. Things to take into account:

Including some labeled points from all classes in the sampled graph

Extracting a connected subgraph

Sampling on the vectorial data, on the structural edges, or integrating both in the sampling process (like random walk sampling)

It is just an approximation: how good is it? Can we say something theoretically? Ensemble approaches based on samples?

Page 18: A discussion on sampling graphs to approximate network classification functions

Going further: sparsify the samples
Finding some sort of "backbone"

Second idea:

1 For i = 1 . . . |samples| do:

Extract graph sample G_i ≺ G from the full graph
Apply the harmonic iterative algorithm to G_i to obtain f(u), u ∈ G_i

2 From S = {G_i} find nodes (or a subgraph) U ≺ S with |U| = l such that

f(U′) = g(f(U))

where U′ = S \ U and g is some defined (linear) transformation

3 Label any other node v by k random walks to the central nodes (or subgraph) U found above

Page 19: A discussion on sampling graphs to approximate network classification functions

Outline

Starting point

Classification in networks

Samples of graphs

Some first experiments

Page 20: A discussion on sampling graphs to approximate network classification functions

Induced subgraph sampling
From "Statistical analysis of network data", Kolaczyk

Sample n vertices without replacement to form V∗ = {i_1, . . . , i_n}

Edges are observed for vertex pairs i, j ∈ V∗ for which {i, j} ∈ E, yielding E∗

Selected nodes in yellow, observed edges in orange
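A one-function sketch of this scheme with networkx; `random.sample` draws the vertices without replacement.

```python
import random

def induced_subgraph_sample(G, n):
    """Induced subgraph sampling: draw n vertices uniformly without
    replacement, keep every edge of G with both endpoints in the draw."""
    V_star = random.sample(list(G.nodes()), n)
    return G.subgraph(V_star).copy()
```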

Page 21: A discussion on sampling graphs to approximate network classification functions

Incident subgraph sampling
From "Statistical analysis of network data", Kolaczyk

Select n edges by random sampling without replacement, yielding E∗

All vertices incident to E∗ are then observed, providing V∗

Selected edges in yellow, observed nodes in orange
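The corresponding sketch for incident subgraph sampling, again with networkx.

```python
import random

def incident_subgraph_sample(G, n):
    """Incident subgraph sampling: draw n edges uniformly without
    replacement; the observed vertices are those incident to the draw."""
    E_star = random.sample(list(G.edges()), n)
    return G.edge_subgraph(E_star).copy()
```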

Page 22: A discussion on sampling graphs to approximate network classification functions

Star and snowball sampling
From "Statistical analysis of network data", Kolaczyk

Take an initial vertex sample V∗_0 of size n, without replacement

Observe all edges incident to i ∈ V∗_0, yielding E∗

For labeled star sampling we also observe the vertices i ∈ V \ V∗_0 to which edges in E∗ are incident

For snowball sampling we iterate the process of labeled star sampling to neighbors up to the k-th wave

1-wave: yellow, 2-wave: orange, 3-wave: red
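A sketch covering both variants: `waves=1` corresponds to labeled star sampling, larger values to snowball sampling up to the k-th wave.

```python
import random

def snowball_sample(G, n, waves=1):
    """Labeled star / snowball sampling: start from n seed vertices drawn
    without replacement, then repeatedly absorb all neighbors of the
    current frontier, for `waves` waves."""
    V_star = set(random.sample(list(G.nodes()), n))
    frontier = set(V_star)
    for _ in range(waves):
        frontier = {u for v in frontier for u in G.neighbors(v)} - V_star
        V_star |= frontier
    return G.subgraph(V_star).copy()
```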

Page 23: A discussion on sampling graphs to approximate network classification functions

Link tracing sampling
From "Statistical analysis of network data", Kolaczyk

A sample S = {s_1, . . . , s_{n_s}} of "sources" is selected from V

A sample T = {t_1, . . . , t_{n_t}} of "targets" is selected from V \ S

A path is sampled between pairs (s_i, t_i), and all vertices and edges on the paths are observed, yielding G∗ = (V∗, E∗)

Sources {s_1, s_2} to targets {t_1, t_2}
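A sketch of link tracing with networkx; tracing every source-target pair along a shortest path is just one way to "sample a path", chosen here for simplicity.

```python
import random
import networkx as nx

def link_tracing_sample(G, n_sources, n_targets):
    """Link tracing: pick disjoint sources S and targets T, sample a path
    for each source-target pair, observe all vertices and edges on it."""
    nodes = list(G.nodes())
    S = random.sample(nodes, n_sources)
    T = random.sample([v for v in nodes if v not in S], n_targets)
    H = nx.Graph()
    for s in S:
        for t in T:
            try:
                path = nx.shortest_path(G, s, t)   # one simple path-sampling choice
            except nx.NetworkXNoPath:
                continue                           # s and t lie in different components
            H.add_edges_from(zip(path, path[1:]))
    return H
```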

Page 24: A discussion on sampling graphs to approximate network classification functions

Some other sampling algorithms

Other possible ideas of sampling algorithms for graphs:

Random node selection, random edge selection

Selecting nodes with probability proportional to "PageRank" weight

Random node neighbor

Random walk sampling

Random jump sampling

Forest fire sampling
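As one concrete example from this list, a sketch of random walk sampling; the restart probability and step budget are illustrative choices (random jump sampling would jump to a uniformly random node instead of the start).

```python
import random

def random_walk_sample(G, n, restart=0.15):
    """Random walk sampling: walk the graph from a random start node,
    collecting visited nodes until n distinct nodes are seen; with
    probability `restart`, jump back to the start node."""
    start = random.choice(list(G.nodes()))
    v, seen = start, {start}
    for _ in range(100 * n):           # guard: stop even if the component is small
        if len(seen) >= n:
            break
        if random.random() < restart:
            v = start                  # restart at the initial node
        else:
            v = random.choice(list(G.neighbors(v)))
        seen.add(v)
    return G.subgraph(seen).copy()
```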

Page 25: A discussion on sampling graphs to approximate network classification functions

Some challenges of sampling with labels

Including labels in the samples

Size of the samples

Isolated nodes

Edges of structure or content

Page 26: A discussion on sampling graphs to approximate network classification functions

Outline

Starting point

Classification in networks

Samples of graphs

Some first experiments

Page 27: A discussion on sampling graphs to approximate network classification functions

Experimental set-up

Classification algorithm

In the samples, compute the harmonic function f in an iterative fashion for ≈ 10 iterations

Final classification: every node u is assigned the label with the maximum value (probability) f(u)

Keep 1/3 of the labels

Datasets

Graph-generated data: (1) cluster generator and (2) community guided attachment generator

Other: Webkb, IMDB, Cora

Page 28: A discussion on sampling graphs to approximate network classification functions

What happens in one sample?

Incident (left) & induced (right), Webkb (Cornell), 867 nodes

Blue: error of the harmonic iterative method on the full graph
Green: error on one single sample of increasing size

Page 29: A discussion on sampling graphs to approximate network classification functions

What happens in one sample?

Link tracing, Imdb, 1169 nodes

Blue: error of the harmonic iterative method on the full graph
Green: error on one single sample of increasing size

Page 30: A discussion on sampling graphs to approximate network classification functions

What happens in one sample?

Random node-edge selection, Imdb, 1169 nodes

Blue: error of the harmonic iterative method on the full graph
Green: error on one single sample of increasing size

Page 31: A discussion on sampling graphs to approximate network classification functions

Full classification vs sampling classification

Induced & Incident, Cora, 1878 nodes

Blue: error of the harmonic iterative method on the full graph
Green: error of the sampling classification for an increasing number of samples

Page 32: A discussion on sampling graphs to approximate network classification functions

Full classification vs sampling classification

Induced & Incident, Webkb (Wisconsin), 1263 nodes

Blue: error of the harmonic iterative method on the full graph
Green: error of the sampling classification for an increasing number of samples

Page 33: A discussion on sampling graphs to approximate network classification functions

Full classification vs sampling classification

Link tracing, CGA generator, 1000 nodes

Blue: error of the harmonic iterative method on the full graph
Green: error of the sampling classification for an increasing number of samples

Page 34: A discussion on sampling graphs to approximate network classification functions

Some discussion

Samples of graphs can serve to avoid the high complexity (O(n³)) of applying the learning algorithm to the full graph

The choice of sampling method matters (e.g. snowball sampling is bad for highly connected graphs; link tracing is useful in highly clustered graphs)

The approximation of the accuracy is already reasonable with a small number of samples

Question of the I/O operations in the graph

Samples of the graph to estimate a distribution?

Ensemble approaches?

Approximation in terms of shortest paths?

