Structure, complexity and
learning
Edwin Hancock
Department of Computer Science
University of York
Supported by a Royal Society
Wolfson Research Merit Award
Aims
• How to probe or characterise structure of similarity data
in the form of graphs using random walks.
• How to characterise complexity of structures using ideas
from information theory (entropy).
• How to use complexity level characterisations to learn
variations in structure.
Outline
• Random walks, graph spectra and
structural characteristics (zeta functions).
• Complexity and graph kernels.
• Learning generative models.
Structural Variations
Learning with graph data
• Problems based on graphs arise in areas such as
language processing, proteomics/chemoinformatics,
data mining, computer vision and complex systems.
• Relatively little methodology available, and vectorial
methods from statistical machine learning not easily
applied since there is no canonical ordering of the nodes
in a graph.
• Can make considerable progress if we develop
permutation invariant characterisations of variations in
graph structure.
Protein-Protein Interaction Networks
Characterising graphs
• Topological: e.g. average degree, degree distribution, edge-density, diameter, cycle frequencies etc.
• Spectral or algebraic: use eigenvalues of the adjacency matrix or Laplacian, or equivalently the coefficients of the characteristic polynomial.
• Complexity: use information theoretic measures of structure (e.g. Shannon entropy).
Learning is difficult because
• Graphs are not vectors: there is no natural ordering of nodes and edges; correspondences must be used to establish order.
• Structural variations: the numbers of nodes and edges are not fixed; they can vary due to segmentation error.
• Not easily summarised: since graphs do not reside in a vector space, the mean and covariance are hard to characterise.
Learning with graphs
• Work with (dis)similarities: can perform pairwise clustering, or embed sets of graphs in a vector space using multidimensional scaling on the similarities. Non-metricity of the similarities may pose problems.
• Embed individual graphs in a low-dimensional space: characterise structural variations in terms of statistical variation in a point-pattern.
• Learn modes of structural variation: understand how edge (connectivity) structure varies for graphs belonging to the same class. Requires correspondences between raw structures or alignment of embedded ones; can also be effected using permutation-invariant characteristics (path length, commute times, cycle frequencies).
• Construct a generative model: borrow ideas from graphical models to construct a model for raw structures, or a point distribution model for embedded graphs.
Methods
• Graph-spectra lead to straightforward methods for
characterising structure and embedding, that do not
require correspondences and are permutation invariant.
• Random walks provide an intuitive way of understanding
spectral methods in terms of distributions of path and
cycle length.
• Complexity characterisations can be used in information-theoretic settings and allow tasks such as kernelisation and learning to be addressed in a principled way.
Aims
• Show how random walks can be used as
probes of graph structure.
• Explain links with spectral graph theory.
• Show links with complexity
characterisations, and information theory.
Our contributions
• IJCV 2007 (Torsello, Robles-Kelly, Hancock) –shape classes from edit distance using pairwise clustering.
• PAMI 06 and Pattern Recognition 05 (Wilson, Luo and Hancock) – graph clustering using spectral features and polynomials.
• PAMI 07 (Torsello and Hancock) – generative model for variations in tree structure using description length.
• CVIU09 (Xiao, Wilson and Hancock) – generative model from heat-kernel embedding of graphs.
• QIC09 (Emms, Wilson and Hancock) quantum version of commute time.
• PR09a,b,c (Emms, Wilson and Hancock) – graph matching using quantum walks; lifting cospectrality of graphs using quantum walks.
• PR09 (Xiao, Wilson and Hancock) – graph characteristics from the heat kernel trace.
• TNN11 (Ren, Wilson and Hancock) – Ihara zeta function as a graph characterisation.
Random walks
And links to graph spectra.
Problem studied
• How can we find efficient means of characterising graph structure that do not involve exhaustive search? Enumerate properties of graph structure without explicit search, e.g. count cycles, path-length frequencies, etc.
• Can we analyse the structure of sets of graphs without solving the graph-matching problem? Inexact graph matching is the computational bottleneck for most problems involving graphs.
• Answer: let a random walker do the work.
Graph Spectral Methods
• Use eigenvalues and eigenvectors of the adjacency (or Laplacian) matrix – Biggs, Cvetkovic, Fan Chung.
• Singular value methods for exact graph matching and point-set alignment (Umeyama; Scott and Longuet-Higgins; Shapiro and Brady).
• Use of eigenvectors for image segmentation (Shi and Malik) and for perceptual grouping (Freeman and Perona; Sarkar and Boyer).
• Graph-spectral methods for indexing shock-trees (Dickinson and Shokoufandeh).
Random walks on graphs
• Determined by the Laplacian spectrum (and, in the continuous-time case, by the heat kernel).
• Can be used to interpret and analyse spectral methods, since they can be understood intuitively as path-based.
Graph spectra and
random walks
Use spectrum of Laplacian matrix
to compute hitting and commute
times for random walk on a graph
Laplacian Matrix
• Weighted adjacency matrix:
  $W(u,v) = \begin{cases} w(u,v) & \text{if } (u,v) \in E \\ 0 & \text{otherwise} \end{cases}$
• Degree matrix:
  $D(u,u) = \sum_{v \in V} W(u,v)$
• Laplacian matrix:
  $L = D - W$
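As a concrete illustration, here is a minimal NumPy sketch of the three matrices defined above (the function name and the edge-list input format are my own, not from the slides):

    import numpy as np

    def laplacian(n_nodes, weighted_edges):
        """Build W, D and L = D - W from (u, v, weight) triples."""
        W = np.zeros((n_nodes, n_nodes))
        for u, v, w in weighted_edges:
            W[u, v] = W[v, u] = w      # weighted adjacency matrix (symmetric)
        D = np.diag(W.sum(axis=1))     # degree matrix
        return W, D, D - W             # Laplacian L = D - W

    W, D, L = laplacian(4, [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0), (0, 2, 1.0)])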
Laplacian spectrum
• Spectral decomposition of the Laplacian:
  $L = \Phi \Lambda \Phi^T$, where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_{|V|})$, $\Phi = (\phi_1 | \cdots | \phi_{|V|})$ and $0 = \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_{|V|}$
• Element-wise:
  $L(u,v) = \sum_k \lambda_k \phi_k(u) \phi_k(v)$
Properties of the Laplacian
• Eigenvalues are non-negative and the smallest eigenvalue is zero:
  $0 = \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_{|V|}$
• The multiplicity of the zero eigenvalue is the number of connected components of the graph.
• The zero eigenvalue is associated with the all-ones vector.
• The eigenvector associated with the second smallest eigenvalue is the Fiedler vector.
Continuous time random walk
Heat Kernels
• Solution of the heat equation; measures information flow across the edges of the graph with time:
  $\frac{\partial h_t}{\partial t} = -L h_t$
• Solution found by exponentiating the Laplacian eigensystem:
  $h_t = \exp[-Lt] = \Phi \exp[-\Lambda t] \Phi^T = \sum_k \exp[-\lambda_k t]\, \phi_k \phi_k^T$
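A short NumPy sketch of the formula above; it also evaluates the random-walk state vector $p_t = h_t p_0$ from the next slide (the example path graph is my own choice):

    import numpy as np

    def heat_kernel(L, t):
        lam, phi = np.linalg.eigh(L)                     # Laplacian eigensystem
        return phi @ np.diag(np.exp(-lam * t)) @ phi.T   # Phi exp(-Lambda t) Phi^T

    # 4-node path graph for illustration
    A = np.array([[0,1,0,0],[1,0,1,0],[0,1,0,1],[0,0,1,0]], float)
    L = np.diag(A.sum(1)) - A
    h = heat_kernel(L, t=1.0)      # h[u, v]: heat flow from u to v at time t

    # continuous-time random walk state vector (next slide): p_t = h_t p_0
    p0 = np.eye(4)[0]              # walker starts at node 0
    p1 = h @ p0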
Heat kernel and random walk
• The state vector of the continuous-time random walk satisfies the differential equation
  $\frac{\partial p_t}{\partial t} = -L p_t$
• Solution:
  $p_t = \exp[-Lt]\, p_0 = h_t\, p_0$
Heat kernel and path lengths
• In terms of the number $P_k(u,v)$ of paths of length k from node u to node v:
  $h_t(u,v) = \exp[-t] \sum_{k=0}^{\infty} P_k(u,v)\, \frac{t^k}{k!}$
  where
  $P_k(u,v) = \sum_{i \in V} (1 - \lambda_i)^k\, \phi_i(u)\, \phi_i(v)$
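A numerical sanity check (a sketch of my own, not from the slides) that the series above resums to the heat kernel; note that the spectral expansion gives $P_k = (I - L)^k$ in matrix form:

    import numpy as np
    from math import factorial

    A = np.array([[0,1,0,0],[1,0,1,0],[0,1,0,1],[0,0,1,0]], float)
    L = np.diag(A.sum(1)) - A
    t = 0.5

    lam, phi = np.linalg.eigh(L)
    h_exact = phi @ np.diag(np.exp(-lam * t)) @ phi.T

    h_series = np.zeros_like(L)
    P = np.eye(4)                        # P_0 = I
    for k in range(40):                  # truncate the infinite sum
        h_series += P * t**k / factorial(k)
        P = P @ (np.eye(4) - L)          # P_{k+1} = (I - L) P_k
    h_series *= np.exp(-t)

    assert np.allclose(h_exact, h_series)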
Example.
The figure shows the spanning tree of the heat kernel: here the edge weights of the graph are the elements of the heat kernel. As t increases, the spanning tree evolves from a tree rooted near the centre of the graph to a string (with ligatures). Low-t behaviour is dominated by the Laplacian; high-t behaviour is dominated by the Fiedler vector.
Moments of the heat-kernel trace
…can we characterise a graph by the shape of its heat-kernel trace function?
Heat Kernel Trace
  $Tr[h_t] = \sum_i \exp[-\lambda_i t]$
[Plot: heat-kernel trace as a function of time t.]
The shape of the heat-kernel trace distinguishes graphs… can we characterise its shape using moments?
Rosenberg Zeta function
• Definition of the zeta function:
  $\zeta(s) = \sum_{\lambda_k \neq 0} \lambda_k^{-s}$
Heat-kernel moments
• Mellin transform:
  $\lambda_i^{-s} = \frac{1}{\Gamma(s)} \int_0^{\infty} t^{s-1} \exp[-\lambda_i t]\, dt$, with $\Gamma(s) = \int_0^{\infty} t^{s-1} \exp[-t]\, dt$
• Trace and number of connected components:
  $Tr[h_t] = C + \sum_{\lambda_i \neq 0} \exp[-\lambda_i t]$
  where C is the multiplicity of the zero eigenvalue, i.e. the number of connected components of the graph.
• Zeta function:
  $\zeta(s) = \frac{1}{\Gamma(s)} \int_0^{\infty} t^{s-1} \left( Tr[h_t] - C \right) dt$
  so the zeta function is related to the moments of the heat-kernel trace.
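In practice the zeta function can be evaluated directly from the spectrum. A sketch, assuming NumPy and an example graph of my choosing, of the $\zeta(s)$, $s = 1,2,3,4$ pattern vector used in the experiments below:

    import numpy as np

    def zeta(L, s, tol=1e-8):
        lam = np.linalg.eigvalsh(L)
        lam = lam[lam > tol]             # discard the C zero eigenvalues
        return np.sum(lam ** (-float(s)))

    A = np.array([[0,1,0,0],[1,0,1,0],[0,1,0,1],[0,0,1,0]], float)
    L = np.diag(A.sum(1)) - A
    features = [zeta(L, s) for s in (1, 2, 3, 4)]   # zeta(s), s = 1,2,3,4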
Zeta-function behavior
Objects
72 views of each object, taken at 5-degree intervals as the camera moves in a circle around the object.
Feature points extracted using a corner detector.
Construct the Voronoi tessellation of the image plane using the corner points as seeds.
The Delaunay graph is the region adjacency graph of the Voronoi regions.
Heat kernel moments
(zeta(s), s=1,2,3,4)
PCA using zeta(s), s=1,2,3,4
PCA on Laplace spectrum
Ox-Caltech database
Line-patterns
• Use Huet+Hancock representation (TPAMI-99).
• Extract straight line segments from Canny edge-map.
• Weight computed using continuity and proximity.
• Captures arrangement Gestalts.
Zeta function derivative
• Zeta function in terms of the natural exponential:
  $\zeta(s) = \sum_{\lambda_k \neq 0} \exp[-s \ln \lambda_k]$
• Derivative:
  $\zeta'(s) = -\sum_{\lambda_k \neq 0} \ln \lambda_k \exp[-s \ln \lambda_k]$
• Derivative at the origin:
  $\zeta'(0) = -\sum_{\lambda_k \neq 0} \ln \lambda_k = \ln \prod_{\lambda_k \neq 0} \frac{1}{\lambda_k}$
Meaning
• $\zeta'(0)$ determines the number of spanning trees of the graph; with the normalised Laplacian spectrum,
  $\tau(G) = \frac{\prod_{u \in V} d_u}{\sum_{u \in V} d_u} \exp[-\zeta'(0)]$
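A sketch checking this identity numerically, assuming the normalised-Laplacian form given above (a path graph has exactly one spanning tree):

    import numpy as np

    A = np.array([[0,1,0,0],[1,0,1,0],[0,1,0,1],[0,0,1,0]], float)  # path graph
    d = A.sum(1)
    L_hat = np.eye(4) - A / np.sqrt(np.outer(d, d))   # normalised Laplacian
    lam = np.linalg.eigvalsh(L_hat)
    lam = lam[lam > 1e-8]

    zeta_prime_0 = -np.sum(np.log(lam))               # zeta'(0) = -sum ln lambda_k
    n_trees = np.prod(d) / d.sum() * np.exp(-zeta_prime_0)
    assert np.isclose(n_trees, 1.0)                   # one spanning tree, as expected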
COIL
Ox-Cal
Eigenvalue polynomials (COIL)
Trace
Determinant
Eigenvalue Polynomials (Ox-Cal)
Spectral polynomials (COIL)
Spectral Polynomials (Ox-Cal)
COIL: node and edge frequency
Ox-Cal: node+edge frequency
Performance
• Rand index = correct/(correct + wrong).

  Zeta function:            0.92
  Sym. poly. (evals):       0.90
  Sym. poly. (matrix):      0.88
  Laplacian spectrum:       0.78
Deeper probes of structure
Ihara zeta function
Zeta functions
• Used in number theory to characterise
distribution of prime numbers.
• Can be extended to graphs by replacing
notion of prime number with that of a
prime cycle.
Ihara Zeta function
• Determined by the distribution of prime cycles.
• Transform the graph into the oriented line graph (OLG): the arcs of the original graph become nodes, and OLG edges link arcs that are incident at a common vertex (excluding backtracking).
• The zeta function is the reciprocal of the characteristic polynomial of the OLG adjacency matrix.
• The coefficients of the polynomial are determined by the eigenvalues of the OLG adjacency matrix.
• The coefficients are linked to a topological quantity, i.e. prime cycle frequencies.
Oriented Line Graph – Frobenius operator
Links to walks on graphs
• The transition matrix of the OLG determines a discrete-time quantum walk on the graph (with a Hadamard coin).
• It can also be used to define a backtrackless classical walk.
Ihara Zeta Function
• Ihara zeta function for a graph G(V,E):
  – defined over the prime cycles of the graph;
  – a rational expression in terms of the characteristic polynomial of the oriented line graph,
  $\zeta_G(u)^{-1} = \det(I - uT)$,
  where T is the adjacency matrix of the line digraph; equivalently (Bass's formula),
  $\zeta_G(u)^{-1} = (1 - u^2)^{|E| - |V|} \det(I - uA + u^2 Q)$,
  with A the adjacency matrix of G and Q = D − I (the degree matrix minus the identity).
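A sketch of the construction, assuming NumPy: build the oriented line graph operator T for a small example graph and check the determinant identity relating it to A and Q (the graph and the test value of u are my own choices):

    import numpy as np

    # 4-cycle with a chord: minimum degree 2, as the Ihara framework assumes
    A = np.array([[0,1,1,1],[1,0,1,0],[1,1,0,1],[1,0,1,0]], float)
    n = len(A)
    arcs = [(i, j) for i in range(n) for j in range(n) if A[i, j]]  # both directions

    T = np.zeros((len(arcs), len(arcs)))
    for p, (a, b) in enumerate(arcs):
        for q, (c, d) in enumerate(arcs):
            if b == c and d != a:          # consecutive arcs, no backtracking
                T[p, q] = 1.0

    u = 0.1
    Q = np.diag(A.sum(1)) - np.eye(n)      # Q = D - I
    lhs = np.linalg.det(np.eye(len(arcs)) - u * T)            # det(I - uT)
    rhs = (1 - u**2) ** (len(arcs) // 2 - n) \
          * np.linalg.det(np.eye(n) - u * A + u**2 * Q)
    assert np.isclose(lhs, rhs)
    # Tr[T^l] counts the closed backtrackless walks of length l (used below)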
Characteristic Polynomials from the IZF
• The Perron-Frobenius operator is the adjacency matrix T of the oriented line graph.
• Determinant expression of the IZF:
  $\zeta_G(u)^{-1} = \det(I - uT) = c_0 + c_1 u + \cdots + c_{2|E|} u^{2|E|}$
  – Each coefficient $c_r$ (the Ihara coefficients) can be derived from the elementary symmetric polynomials of the eigenvalue set of T.
• Pattern vector in terms of the Ihara coefficients $(c_1, c_2, \ldots, c_{2|E|})$.
Analysis of determinant
• From matrix logarithms:
  $\zeta(s) = \det[I - sT]^{-1} = \exp\left[\sum_{k=1}^{\infty} \frac{s^k}{k} Tr[T^k]\right]$
• $Tr[T^k]$ is a symmetric polynomial of the eigenvalues $\mu_1, \ldots, \mu_N$ of T:
  $Tr[T] = \mu_1 + \mu_2 + \cdots + \mu_N$
  $Tr[T^2] = \mu_1^2 + \mu_2^2 + \cdots + \mu_N^2$
  $\cdots$
  $Tr[T^k] = \mu_1^k + \mu_2^k + \cdots + \mu_N^k$
Distribution of prime cycles
• Frequency distribution for cycles of length l:
  $s \frac{d}{ds} \ln \zeta(s) = \sum_{l} N_l s^{l}$
• Cycle frequencies:
  $N_l = \frac{1}{(l-1)!} \left. \frac{d^{l}}{ds^{l}} \ln \zeta(s) \right|_{s=0} = Tr[T^{l}]$
Experiments: Edge-weighted Graphs
Feature Distance
& Edit Distance
Three Classes of Randomly
Generated Graphs
Experiments: Hypergraphs
Complexity
Information theory, graphs and
kernels.
Protein-Protein Interaction Networks
Characterising graphs
• Topological: e.g. average degree, degree distribution, edge-density, diameter, cycle frequencies etc.
• Spectral or algebraic: use eigenvalues of the adjacency matrix or Laplacian, or equivalently the coefficients of the characteristic polynomial.
• Complexity: use information theoretic measures of structure (e.g. Shannon entropy).
Complexity characterisation
• Information theory: entropy measures
• Structural pattern recognition: graph
spectral indices of structure and topology.
• Complex systems: measures of centrality,
separation, searchability.
Information theory
• Entropic measures of complexity: Shannon, Erdos-Renyi, von Neumann.
• Description length: fitting of models to data; entropy (model cost) tensioned against log-likelihood (goodness of fit).
• Kernels: use entropy to compute the Jensen-Shannon divergence.
Entropy on graphs
• Permutation structure: numbers of
different colourings, projection onto
Birkhoff polytopes.
• Graph spectral: Von Neumann entropy
over Laplacian spectrum.
• Embedding: Entropy associated with
embedding as a point-set.
Von Neumann Entropy
• Derived from the normalised Laplacian spectrum:
  $H_{VN} = -\sum_{i=1}^{|V|} \frac{\hat{\lambda}_i}{2} \ln \frac{\hat{\lambda}_i}{2}$
  where $\hat{L} = D^{-1/2}(D - A)D^{-1/2} = \hat{\Phi} \hat{\Lambda} \hat{\Phi}^T$
• Comes from quantum mechanics: it is the entropy associated with a density matrix.
Approximation
• Quadratic entropy: replace $-x \ln x$ by its quadratic approximation $x(1 - x)$, giving
  $H_{VN} \approx \sum_{i=1}^{|V|} \frac{\hat{\lambda}_i}{2}\left(1 - \frac{\hat{\lambda}_i}{2}\right) = \frac{1}{2}\sum_{i=1}^{|V|} \hat{\lambda}_i - \frac{1}{4}\sum_{i=1}^{|V|} \hat{\lambda}_i^2$
• In terms of matrix traces:
  $H_{VN} = \frac{1}{2} Tr[\hat{L}] - \frac{1}{4} Tr[\hat{L}^2]$
Computing Traces
• Normalised Laplacian:
  $Tr[\hat{L}] = |V|$
• Normalised Laplacian squared:
  $Tr[\hat{L}^2] = |V| + \sum_{(u,v) \in E} \frac{1}{d_u d_v}$
  (the edge sum runs over ordered pairs of adjacent nodes)
Simplified entropy
• Collecting terms together, the von Neumann entropy reduces to
  $H_{VN} = \frac{|V|}{4} - \frac{1}{4} \sum_{(u,v) \in E} \frac{1}{d_u d_v}$
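A sketch comparing the three forms (exact spectral entropy, trace approximation, degree form), under the eigenvalue/2 convention used in these slides; the example graph is an arbitrary choice of mine:

    import numpy as np

    A = np.array([[0,1,1,1],[1,0,1,0],[1,1,0,1],[1,0,1,0]], float)
    n = len(A)
    d = A.sum(1)
    L_hat = np.eye(n) - A / np.sqrt(np.outer(d, d))

    lam = np.linalg.eigvalsh(L_hat)
    x = lam[lam > 1e-12] / 2.0
    H_exact = -np.sum(x * np.log(x))                  # -sum (l/2) ln(l/2)

    H_quad = 0.5 * np.trace(L_hat) - 0.25 * np.trace(L_hat @ L_hat)

    # degree form; the edge sum runs over ordered pairs of adjacent nodes
    S = sum(1.0 / (d[u] * d[v]) for u in range(n) for v in range(n) if A[u, v])
    H_degree = n / 4.0 - S / 4.0

    assert np.isclose(H_quad, H_degree)   # identical by construction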
Homogeneity index
• Based on degree statistics:
  $\rho(G) = \frac{1}{|V| - 2\sqrt{|V| - 1}} \sum_{(u,v) \in E} \left( \frac{1}{\sqrt{d_u}} - \frac{1}{\sqrt{d_v}} \right)^2$
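A sketch of this index, assuming the Estrada-style normalisation reconstructed above (for a star graph the index should be maximal):

    import numpy as np

    def homogeneity_index(A):
        d = A.sum(1)
        n = len(A)
        rho = sum((d[u] ** -0.5 - d[v] ** -0.5) ** 2
                  for u in range(n) for v in range(n) if A[u, v] and u < v)
        return rho / (n - 2 * np.sqrt(n - 1))

    star = np.array([[0,1,1,1],[1,0,0,0],[1,0,0,0],[1,0,0,0]], float)
    print(homogeneity_index(star))   # = 1: a star is maximally inhomogeneous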
Homogeneity meaning
• In the limit of large degree,
  $\rho(G) \sim \sum_{(u,v) \in E} A(u,v) \left( CT(u,v) - 2 \right)$
  where $CT(u,v)$ is the commute time between u and v.
• Largest when the commute time differs from 2 due to a large number of alternative connecting paths.
Polytopal complexity
• Decompose a doubly stochastic kernel matrix into a convex combination of permutation matrices (the vertices of the Birkhoff polytope):
  $K = \sum_{P} p_P P$
• Entropy defined over the expansion coefficients:
  $S = -\sum_{P} p_P \ln p_P$
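A sketch of one way to realise this, using greedy Birkhoff-von Neumann peeling with SciPy's assignment solver (a simple scheme of my own, not necessarily the decomposition used in the original work):

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def polytopal_entropy(K, tol=1e-10):
        K = K.astype(float).copy()
        coeffs = []
        while K.max() > tol:
            cost = -np.log(np.maximum(K, 1e-300))   # max-product permutation
            rows, cols = linear_sum_assignment(cost)
            p = K[rows, cols].min()                 # largest weight still feasible
            K[rows, cols] -= p                      # peel the permutation off
            K = np.maximum(K, 0.0)
            coeffs.append(p)
        w = np.array(coeffs)
        return -np.sum(w * np.log(w))               # S = -sum_P p_P ln p_P

    K = np.array([[0.5, 0.5, 0.0],
                  [0.5, 0.0, 0.5],
                  [0.0, 0.5, 0.5]])
    print(polytopal_entropy(K))    # ln 2: K is an even mix of two permutations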
Thermodynamic Depth Complexity
• Simulate heat flow on graph using continuous time
random walk.
• Characterise nodes by their thermodynamic depth (time
walk takes to reach node).
• Measure heat flow dependence at each node with time.
Record maximum.
• Compute homogeneity statistics over thermodynamic
depth.
Phase transition
• As time evolves, the complexity undergoes a phase transition.
• This corresponds to the maximum flow at a node.
• Maximum of entropy.
Uses
• Complexity-based clustering (especially
protein-protein interaction networks).
• Defining information theoretic (Jensen-
Shannon) kernels.
• Controlling complexity of generative
models of graphs.
Protein-Protein Interaction Networks
Experiment
Non-extensive kernel
Based on von Neumann entropy
of a graph
Graph kernels
• Avoid the correspondence problem.
• Path-length kernel (Gartner et al.) from powers of the adjacency matrix.
• Random walk kernel from the product graph (Borgwardt, Smola, et al.).
• Cycle-length kernel (Horvath and Gartner), subtree kernels, frequent subgraphs, etc.
Information theoretic kernels
• Rely on probability distributions, their entropy and mutual information (Jenssen, Figueiredo).
• Entropy is a complexity-level characterisation of graph structure.
• Extensive => additive entropy; non-extensive => non-additive entropy.
Jensen-Shannon Kernel
• Defined in terms of the J-S divergence:
  $JS(G_i, G_j) = H(G_i \times G_j) - \frac{H(G_i) + H(G_j)}{2}$
  $K_{JS}(G_i, G_j) = \ln 2 - JS(G_i, G_j)$
• Properties: extensive, positive.
Computation
• Construct the direct product graph for each graph pair.
• Compute the von Neumann entropy difference between the product graph and the two graphs individually.
• Construct the kernel matrix over all pairs.
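A schematic sketch of the pipeline for one graph pair, reusing the von Neumann entropy from earlier. The direct (tensor) product is taken per the slide (the disjoint union is a common alternative composite), and in practice the entropies may need normalising by graph size:

    import numpy as np

    def vn_entropy(A):
        d = A.sum(1)
        L_hat = np.eye(len(A)) - A / np.sqrt(np.outer(d, d))
        lam = np.linalg.eigvalsh(L_hat)
        x = lam[lam > 1e-12] / 2.0
        return -np.sum(x * np.log(x))

    def js_kernel(A1, A2):
        H12 = vn_entropy(np.kron(A1, A2))        # direct (tensor) product graph
        js = H12 - 0.5 * (vn_entropy(A1) + vn_entropy(A2))
        return np.log(2.0) - js

    A1 = np.array([[0,1,1],[1,0,1],[1,1,0]], float)                  # triangle
    A2 = np.array([[0,1,0,1],[1,0,1,0],[0,1,0,1],[1,0,1,0]], float)  # 4-cycle
    print(js_kernel(A1, A2))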
Generative Models
• Structural domain: define probability distribution over
prototype structure. Prototype together with parameters
of distribution minimise description length (Torsello and
Hancock, PAMI 2007) .
• Spectral domain: embed nodes of graphs into vector-
space using spectral decomposition. Construct point
distribution model over embedded positions of nodes
(Bai, Wilson and Hancock, CVIU 2009).
Deep learning
• Deep belief networks (Hinton 2006, Bengio 2007).
• Compositional networks (Amit+Geman 1999, Fergus
2010).
• Markov models (Leonardis 200
• Stochastic image grammars (Zhu, Mumford, Yuille)
• Taxonomy/category learning (Todorovic+Ahuja, 2006-
2008).
Aim
• Combine spectral and structural methods.
• Use description length criterion.
• Apply to graphs rather than trees.
Prior work
• IJCV 2007 (Torsello, Robles-Kelly, Hancock) –shape classes from edit distance using pairwise clustering.
• PAMI 06 and Pattern Recognition 05 (Wilson, Luo and Hancock) – graph clustering using spectral features and polynomials.
• PAMI 07 (Torsello and Hancock) – generative model for variations in tree structure using description length.
• CVIU09 (Xiao, Wilson and Hancock) – generative model from heat-kernel embedding of graphs.
Structural learning
Using description length
Description length
• Wallace+Freeman: minimum message
length.
• Rissanen: minimum description length.
Use log-posterior probability to locate model that is
optimal with respect to code-length.
Similarities/differences
• MDL: selection of model is aim; model
parameters are simply a means to this
end. Parameters usually maximum
likelihood. Prior on parameters is flat.
• MML: Recovery of model parameters is
central. Parameter prior may be more
complex.
Coding scheme
• Usually assumed to follow an exponential
distribution.
• Alternatives are universal codes and predictive
codes.
• MML has two part codes (model+parameters). In
MDL the codes may be one or two-part.
Method
• Model is a supergraph (i.e. a graph prototype) formed by graph union.
• Sample-data observation model: Bernoulli distribution over nodes and edges.
• Model complexity: von Neumann entropy of the supergraph.
• Fitting criterion:
  – MDL-like: make ML estimates of the Bernoulli parameters.
  – MML-like: two-part code for data-model fit + supergraph complexity.
Model overview
• Description length criterion:
  code-length = negative log-likelihood + model code-length (entropy)
  $\mathcal{L}(G, H) = -LL(G \mid H) + L(H)$
• Data-set: a set of graphs G.
• Model: a prototype graph plus correspondences with it.
• Updates by expectation maximisation: model graph adjacency matrix (M-step) + correspondence indicators (E-step).
Learn supergraph using MDL
• Follow Torsello and Hancock and pose the problem of learning a generative model for graphs as that of learning a supergraph representation.
• The required probability distribution is an extension of the model developed by Luo and Hancock.
• Use von Neumann entropy to control the supergraph's complexity.
• Develop an EM algorithm in which the node correspondences and the supergraph edge probability matrix are treated as missing data.
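A schematic sketch of the two-part code length, assuming the node correspondences are already given (the full EM algorithm also estimates them); the function names and the Bernoulli/entropy details here are illustrative, not the exact estimator of the paper:

    import numpy as np

    def bernoulli_ll(A, Theta, eps=1e-9):
        """Log-likelihood of an aligned sample adjacency A under edge probs Theta."""
        T = np.clip(Theta, eps, 1 - eps)
        return np.sum(A * np.log(T) + (1 - A) * np.log(1 - T))

    def supergraph_entropy(Theta, thresh=0.5):
        """Simplified von Neumann entropy of the thresholded supergraph."""
        A = (Theta > thresh).astype(float)
        d = np.maximum(A.sum(1), 1.0)
        S = sum(1.0 / (d[u] * d[v]) for u in range(len(A))
                for v in range(len(A)) if A[u, v])
        return len(A) / 4.0 - S / 4.0

    def code_length(samples, Theta):
        # negative log-likelihood (data cost) + entropy (model cost)
        return -sum(bernoulli_ll(A, Theta) for A in samples) \
               + supergraph_entropy(Theta)

    samples = [np.array([[0,1],[1,0]], float), np.array([[0,1],[1,0]], float)]
    Theta = np.array([[0.0, 0.9], [0.9, 0.0]])
    print(code_length(samples, Theta))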
Probabilistic Framework
[Figure: an example four-node graph with vertices V1–V4 and its 4×4 adjacency matrix A.]
Here the structure of the sample graphs and the supergraph is represented by their adjacency matrices. Given a sample graph and a supergraph, along with their assignment matrix, the a posteriori probability of the sample graph given the structure of the supergraph and the node correspondences is defined as follows.
Observation model
Data code-length
• For the sample graph-set G and the supergraph M, the set of assignments is S. Under the assumption that the graphs in G are independent samples from the distribution, the likelihood of the sample graphs can be written as a product over the set.
• The code length of the observed data is the negative logarithm of this likelihood.
Overall code-length
According to Rissanen and Grunwald’s minimum description length
criterion, we encode and transmit the sample graphs and the
supergraph structure. This leads to a two-part message whose total
length is given
We consider both the node correspondence information between
graphs S and the structure of the supergraph M as missing data and
locate M by minimizing the overall code-length using EM algorithm.
EM – code-length criterion
Expectation + Maximisation:
• M-step: recover the correspondence matrices by taking the partial derivative of the weighted log-likelihood function and applying softassign; then modify the supergraph structure.
• E-step: compute the a posteriori probabilities of the nodes in the sample graphs being matched to those of the supergraph.
Experiments
Delaunay graphs from images of different objects.
[Panels: example graphs from the COIL dataset and the Toys dataset.]
Experiments---validation
COIL dataset: the model complexity increases, the graph-data log-likelihood increases, and the overall code length decreases over the iterations.
Toys dataset: the model complexity decreases, the graph-data log-likelihood increases, and the overall code length decreases over the iterations.
Experiments---classification task
We compare the classification performance of our learned supergraph with two alternative constructions: the median graph, and a supergraph learned without using MDL. The table below shows the average classification rates from 10-fold cross-validation, followed by their standard errors.
Experiments---graph embedding
Pairwise graph distance based on the
Jensen-Shannon divergence and the von
Neumann entropy of graphs
Experiments---graph embedding
[Panels: embeddings obtained with edit distance vs. JSD distance.]
Generative model
• Train on graphs with a set of predetermined characteristics.
• Sample using Monte Carlo.
• Reproduces the characteristics of the training set, e.g. spectral gap, node degree distribution, etc., as in the sketch below.
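One simple instance of this idea, sketched with networkx (the configuration model is my stand-in for the Monte-Carlo sampler here; it reproduces the degree distribution only):

    import networkx as nx
    import random

    # "training set": graphs whose degree statistics we want to reproduce
    train = [nx.barabasi_albert_graph(50, 2, seed=s) for s in range(10)]
    degree_pool = [deg for G in train for _, deg in G.degree()]

    def sample_graph(n_nodes):
        degs = random.choices(degree_pool, k=n_nodes)
        if sum(degs) % 2:                  # configuration model needs an even sum
            degs[0] += 1
        G = nx.Graph(nx.configuration_model(degs))   # collapse parallel edges
        G.remove_edges_from(nx.selfloop_edges(G))
        return G

    new_sample = sample_graph(50)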
Erdos Renyi
Barabasi Albert (scale free)
Delaunay Graphs
Experiments---generate new samples