Graph Mining and Graph Kernels
Karsten Borgwardt & Chloé-Agathe Azencott | Data mining in Bioinformatics | 1
Data Mining in Bioinformatics Day 3: Graph Mining
August 24, 2008 | ACM SIG KDD, Las Vegas
Karsten Borgwardt & Chloé-Agathe Azencott
February 6 to February 17, 2012
Machine Learning and Computational Biology Research Group MPIs Tübingen
From Borgwardt & Yan, Graph Mining & Graph Kernels, KDD tutorial, 2008 – with permission from Xifeng Yan.
Graphs Are Everywhere
Co-expression Network (Magwene et al., Genome Biology 2004, 5:R100)
Program Flow
Social Network
Protein Structure
Chemical Compound
Mining Graph Patterns
Graph Pattern Mining
– Frequent graph patterns
– Pattern summarization
– Optimal graph patterns
– Graph patterns with constraints
– Approximate graph patterns
Graph Classification
– Pattern-based approach
– Decision tree
– Decision stumps
Graph Compression
Other important topics (graph model, laws, graph dynamics, social network analysis, visualization, summarization, graph clustering, link analysis, …)
Applications of Graph Pattern Mining
Mining biochemical structures
Finding biological conserved subnetworks
Finding functional modules
Program control flow analysis
Intrusion network analysis
Mining communication networks
Anomaly detection
Mining XML structures
Building blocks for graph classification, clustering, compression, comparison, correlation analysis, and indexing
Graph Pattern Mining
Graph pattern mining (single graph setting)
Graph classification (multiple graphs setting)
Interesting Graph Patterns
Interestingness measures / Objective functions
Frequency: frequent graph patterns
Discriminative: information gain, Fisher score
Significance: G-test
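As an illustration of a discriminative measure, the sketch below computes the information gain of a single pattern with respect to class labels. It assumes the pattern has already been matched against each graph, so the input is just a list of booleans; the function names `entropy` and `information_gain` are hypothetical and not taken from the tutorial.

```python
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(
        (labels.count(c) / n) * log2(labels.count(c) / n) for c in set(labels)
    )


def information_gain(occurs, labels):
    """Information gain of a pattern.

    occurs[i] is True if the pattern occurs in graph i (assumed precomputed),
    labels[i] is the class of graph i.
    """
    n = len(labels)
    gain = entropy(labels)
    for value in (True, False):
        subset = [c for o, c in zip(occurs, labels) if o == value]
        if subset:
            gain -= len(subset) / n * entropy(subset)
    return gain


# A pattern that occurs exactly in the graphs of class 1 separates the classes.
print(information_gain([True, True, False, False], [1, 1, 0, 0]))
# A pattern that occurs equally often in both classes carries no information.
print(information_gain([True, False, True, False], [1, 1, 0, 0]))
```

A pattern with high frequency can still have zero information gain, which is why frequency alone is not a good objective for classification.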
Graphs
A graph is an ordered pair (V, E)
V is a set of vertices (nodes)
E is a set of edges (links)
e = (v1, v2), with v1, v2 vertices in V
undirected / directed / mixed
loops / multigraphs / simple graphs
unlabeled / labeled
unweighted / weighted
Graph isomorphism
G and H are isomorphic if there exists a bijection f between the vertices of G and the vertices of H such that any two vertices u and v are adjacent in G iff f(u) and f(v) are adjacent in H.
Graph isomorphism is in NP, but it is not known whether it is in P or NP-complete.
In practice, it can be solved efficiently for a number of classes of graphs [McKay1981].
The subgraph isomorphism problem is NP-complete.
f(1) = A
f(2) = C
f(3) = D
f(4) = B
f(5) = F
f(6) = E
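The definition can be checked directly on small graphs by trying every bijection. The following is a minimal brute-force sketch for simple undirected graphs, not a practical algorithm (it tries up to n! mappings); the function name `find_isomorphism` and the example graphs are illustrative assumptions.

```python
from itertools import permutations


def find_isomorphism(vertices_g, edges_g, vertices_h, edges_h):
    """Return a bijection f (as a dict) from G to H that preserves adjacency,
    or None if the two simple undirected graphs are not isomorphic."""
    eg = {frozenset(e) for e in edges_g}
    eh = {frozenset(e) for e in edges_h}
    if len(vertices_g) != len(vertices_h) or len(eg) != len(eh):
        return None
    for image in permutations(vertices_h):
        f = dict(zip(vertices_g, image))
        # With equal edge counts, mapping every edge of G onto an edge of H
        # is enough: non-edges are then mapped onto non-edges as well.
        if all(frozenset(f[v] for v in e) in eh for e in eg):
            return f
    return None


# A path 1-2-3 and a path A-C-B: the middle vertex 2 must map to C.
f = find_isomorphism([1, 2, 3], [(1, 2), (2, 3)],
                     ["A", "B", "C"], [("A", "C"), ("C", "B")])
print(f[2])  # C
```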
Searching Graphs
Breadth-First Search: explore all vertices at distance k before exploring vertices at distance (k+1).
Depth-First Search: explore each branch until its end before exploring another branch.
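The two search orders can be sketched as follows, assuming the graph is given as an adjacency list (a dict from vertex to a list of neighbours). The names `bfs_order` and `dfs_order` and the small example graph are illustrative.

```python
from collections import deque


def bfs_order(adj, start):
    """Visit vertices level by level: distance k before distance k+1."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return order


def dfs_order(adj, start):
    """Follow each branch to its end before backtracking."""
    seen = set()
    order = []

    def visit(v):
        seen.add(v)
        order.append(v)
        for w in adj[v]:
            if w not in seen:
                visit(w)

    visit(start)
    return order


# Two branches hanging off vertex 1: 1-2-4 and 1-3-5.
adj = {1: [2, 3], 2: [1, 4], 3: [1, 5], 4: [2], 5: [3]}
print(bfs_order(adj, 1))  # [1, 2, 3, 4, 5]
print(dfs_order(adj, 1))  # [1, 2, 4, 3, 5]
```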
Frequent Pattern Mining
Frequent Pattern
A graph g is a frequent pattern in a graph dataset if freq(g) ≥ θ
freq(g) is called the support of g
θ is called the minimum support
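To make support and minimum support concrete, the sketch below counts in how many graphs of a dataset a pattern occurs. It assumes vertex-labeled undirected graphs stored as a pair (label dict, edge list), takes support as an absolute count, and uses a brute-force subgraph test that is only meant for tiny examples. The names `contains`, `support` and `is_frequent` are hypothetical.

```python
from itertools import permutations


def contains(graph, pattern):
    """True if pattern occurs in graph as a (not necessarily induced) subgraph
    with matching vertex labels. Brute force over injective vertex mappings."""
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    p_vertices = list(p_labels)
    g_edge_set = {frozenset(e) for e in g_edges}
    for image in permutations(g_labels, len(p_vertices)):
        f = dict(zip(p_vertices, image))
        labels_ok = all(p_labels[v] == g_labels[f[v]] for v in p_vertices)
        edges_ok = all(frozenset((f[u], f[v])) in g_edge_set for u, v in p_edges)
        if labels_ok and edges_ok:
            return True
    return False


def support(dataset, pattern):
    """freq(g): number of graphs in the dataset that contain the pattern."""
    return sum(contains(g, pattern) for g in dataset)


def is_frequent(dataset, pattern, theta):
    """A pattern is frequent if its support reaches the minimum support."""
    return support(dataset, pattern) >= theta


dataset = [
    ({0: "C", 1: "N", 2: "O"}, [(0, 1), (1, 2)]),  # C-N-O
    ({0: "C", 1: "N"}, [(0, 1)]),                  # C-N
    ({0: "C", 1: "O"}, [(0, 1)]),                  # C-O
]
c_n = ({"a": "C", "b": "N"}, [("a", "b")])
print(support(dataset, c_n))         # 2
print(is_frequent(dataset, c_n, 2))  # True
```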
Example: Chemical compounds
(a) caffeine (b) theobromine (c) viagra
Frequent subgraph:
…
Example: Program call graphs
Min support = 2
Frequent subgraphs:
Algorithms
Inductive Logic Programming (WARMR, King et al. 2001)
– Graphs are represented by Datalog facts
Graph Based Approaches
Apriori-based approach
– AGM/AcGM: Inokuchi et al. (PKDD’00)
– FSG: Kuramochi and Karypis (ICDM’01)
– PATH#: Vanetik and Gudes (ICDM’02, ICDM’04)
– FFSM: Huan et al. (ICDM’03)
– SPIN: Huan et al. (KDD’04)
– FTOSM: Horváth et al. (KDD’06)
Pattern growth approach
– Subdue: Holder et al. (KDD’94)
– MoFa: Borgelt and Berthold (ICDM’02)
– gSpan: Yan and Han (ICDM’02)
– Gaston: Nijssen and Kok (KDD’04)
– CMTreeMiner: Chi et al. (TKDE’05)
– LEAP: Yan et al. (SIGMOD’08)
Apriori Property
If a graph is frequent, all of its subgraphs are frequent. Therefore you can stop extending non-frequent patterns.
k edges → (k+1) edges → …
… heuristics
Cost Analysis
Number of candidates: frequent / infrequent (X) / duplicate (X)
Isomorphism checking against the data
Design Principles
Search Order
– breadth vs. depth
– complete vs. incomplete
Support Calculation
– compute for each new pattern vs. remember all subgraph isomorphism tests
1. Generation of Candidate Patterns
– apriori vs. pattern growth
2. Discovery Order of Patterns
– DFS order
– path → tree → graph
3. Elimination of Duplicate Subgraphs
– passive vs. active
Generation of Candidate Patterns
Apriori-Based Approach
join 2 patterns of size k that share a pattern of size (k-1)
[Figure: patterns P and Q; candidates G, G1, G2, …, Gn; size k → size (k+1)]
Pattern-Growth Approach
extend patterns of size k by 1 edge / vertex
[Figure: G, G1, G2, …, G’1, G’2, …; size k → size (k+1) → size (k+2)]
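The difference between the two candidate generation strategies can be sketched on a deliberately simplified model. The code assumes that every vertex label is unique within a graph, so that a graph is just a set of labeled edges and subgraph containment reduces to set inclusion; real graph miners cannot make this assumption and need subgraph isomorphism tests instead. All function names are hypothetical.

```python
from itertools import combinations


def support(dataset, pattern):
    """Number of graphs (edge sets) that contain the pattern (edge set)."""
    return sum(pattern <= g for g in dataset)


def is_connected(pattern):
    """True if the edges of a non-empty pattern form one connected component."""
    edges = list(pattern)
    reached = set(edges[0])
    changed = True
    while changed:
        changed = False
        for u, v in edges:
            if (u in reached) != (v in reached):
                reached |= {u, v}
                changed = True
    return all(u in reached for u, _ in edges)


def apriori_join(frequent_k):
    """Apriori-based: join two size-k patterns that share k-1 edges."""
    candidates = set()
    for p, q in combinations(frequent_k, 2):
        union = p | q
        if len(union) == len(p) + 1 and is_connected(union):
            candidates.add(union)
    return candidates


def pattern_growth(pattern, dataset):
    """Pattern growth: extend one pattern by one adjacent edge seen in the data."""
    touched = {v for e in pattern for v in e}
    candidates = set()
    for g in dataset:
        if pattern <= g:
            for e in g - pattern:
                if touched & set(e):
                    candidates.add(pattern | {e})
    return candidates


ab, bc, cd = ("a", "b"), ("b", "c"), ("c", "d")
dataset = [frozenset({ab, bc, cd}), frozenset({ab, bc}), frozenset({ab, cd})]
frequent_1 = [frozenset({e}) for e in (ab, bc, cd)]  # all have support >= 2

joined = apriori_join(frequent_1)
grown = pattern_growth(frozenset({ab}), dataset)
print(sorted(sorted(c) for c in joined))
print(sorted(sorted(c) for c in grown))
```

In this toy run the join proposes the candidate {bc, cd}, which occurs in only one graph, while growing from {ab} only proposes extensions that actually appear next to {ab} in the data. Either way, each candidate's support still has to be checked against the minimum support.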
Discovery Order – Free Extension
6 edges → 7 edges: 22 new patterns
Discovery Order – Right-Most Extension
depth-first search
right-most path (from start to end)
6 edges → 7 edges: 4 new patterns
[Yan and Han ICDM’02]
Duplicates Elimination
Existing patterns g1, g2, …, gm
Newly discovered pattern g
Option 1: Check graph isomorphism of g with each pattern g1, g2, …, gm (slow)
Option 2: Transform each graph to a canonical label, create a hash value for this canonical label, and check if there is a match with g (faster)
Option 3: Build a canonical order and generate graph patterns in that order (fastest)
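Option 2 can be illustrated with a naive canonical label: encode the graph under every vertex ordering and keep the smallest encoding, so that isomorphic graphs receive the same label. This brute-force sketch is only usable for very small patterns and is not the canonical form of any particular miner; the names `canonical_label` and `is_duplicate` are assumptions.

```python
from itertools import permutations


def canonical_label(labels, edges):
    """Smallest (vertex labels, upper-triangle adjacency bits) encoding
    over all vertex orderings of a vertex-labeled undirected graph."""
    edge_set = {frozenset(e) for e in edges}
    best = None
    for order in permutations(labels):
        n = len(order)
        code = (
            tuple(labels[v] for v in order),
            tuple(
                1 if frozenset((order[i], order[j])) in edge_set else 0
                for i in range(n)
                for j in range(i + 1, n)
            ),
        )
        if best is None or code < best:
            best = code
    return best


def is_duplicate(seen, labels, edges):
    """Hash-based duplicate check: one set lookup instead of m isomorphism tests."""
    key = canonical_label(labels, edges)
    if key in seen:
        return True
    seen.add(key)
    return False


seen = set()
# The same pattern C-N-O discovered twice with different vertex identifiers.
print(is_duplicate(seen, {1: "C", 2: "N", 3: "O"}, [(1, 2), (2, 3)]))  # False
print(is_duplicate(seen, {"x": "O", "y": "N", "z": "C"}, [("z", "y"), ("y", "x")]))  # True
# A different pattern C-O-N.
print(is_duplicate(seen, {1: "C", 2: "N", 3: "O"}, [(1, 3), (3, 2)]))  # False
```

The cost moves from comparing g against every stored pattern to computing one label per pattern. Option 3 goes further by generating patterns in canonical order, so that a duplicate can be recognised without a lookup table.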
Duplicates Elimination – Run Time
[Plot: run time per pattern (msec) vs. minimum support (in %)]
AIDS antiviral screen compound dataset from NCI/NIH
[Wörlein et al. PKDD’05]
MoFa: option 1 + free extension
gSpan: option 1 + right-most extension
FFSM: option 2
Gaston: option 3
Duplicates Elimination – Memory Usage
[Plot: memory usage (GB) vs. minimum support (in %)]
[Wörlein et al. PKDD’05]
MoFa: option 1 + free extension
gSpan: option 1 + right-most extension
FFSM: option 2
Gaston: option 3
Graph Pattern Explosion Problem
Conclusion: Many enumeration algorithms (AGM, FSG, gSpan, Path-Join, MoFa, FFSM, SPIN, Gaston, …)
But:
– If a graph is frequent, all of its subgraphs are frequent (apriori property)
– An n-edge frequent graph may have 2^n subgraphs
– E.g.: AIDS antiviral screen dataset with 400+ compounds, support level 5%: over 1M frequent graph patterns
Problem 1: How to interpret frequent patterns? → Pattern summarization
Problem 2: How to reduce the size of the pattern set? → Closed and maximal graphs
Problem 3: How to set the minimum support?
Pattern Summarization
Too many patterns may not lead to more explicit knowledge
It can confuse users as well as further discovery (e.g., clustering, classification, indexing, etc.)
→ Find a small set of representative patterns that preserve most of the information
[Xin et al. KDD’06, Chen et al. CIKM’08]
Pattern Summarization – Pattern Distance
measure 1: pattern-based (pattern containment, pattern similarity)
measure 2: data-based (data similarity)
[Figure: distance between patterns, measured on the patterns themselves or on the data]
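One possible data-based distance compares the sets of graphs in which two patterns occur. The sketch below uses the Jaccard distance between these supporting sets. This choice is an assumption made for illustration; the slide does not fix a formula, and the function name `data_distance` is hypothetical.

```python
def data_distance(supporting_a, supporting_b):
    """Jaccard distance between the sets of graph ids that contain each pattern.

    0.0 means both patterns occur in exactly the same graphs,
    1.0 means they never occur in the same graph.
    """
    union = supporting_a | supporting_b
    if not union:
        return 0.0
    return 1.0 - len(supporting_a & supporting_b) / len(union)


# Pattern p occurs in graphs 0, 1, 2; pattern q occurs in graphs 0 and 1.
print(data_distance({0, 1, 2}, {0, 1}))
# Two patterns with identical occurrences are redundant under this measure.
print(data_distance({0, 1}, {0, 1}))  # 0.0
```

Patterns at distance close to 0 describe almost the same part of the data, so a summary can keep just one representative of them.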
Closed and Maximal Patterns
Closed Frequent Graph
A frequent graph G is closed if there exists no supergraph of G that carries the same support as G
If some of G’s subgraphs have the same support, it is unnecessary to output these subgraphs (nonclosed graphs)
Lossless compression: still ensures that the mining result is complete
Maximal Frequent Graph
A frequent graph G is maximal if there exists no supergraph of G that is frequent
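Given the complete set of frequent patterns with their supports, closed and maximal patterns can be filtered out as in the sketch below. It again assumes a simplified model in which a pattern is a set of uniquely labeled edges, so that "supergraph" means "proper superset"; the function names and the toy supports are illustrative. Restricting the search to frequent supergraphs is sufficient, because a supergraph with the same support as a frequent pattern is itself frequent.

```python
def closed_patterns(supports):
    """Patterns with no proper supergraph of equal support.

    supports maps each frequent pattern (frozenset of edges) to its support.
    """
    return {
        p
        for p, s in supports.items()
        if not any(p < q and supports[q] == s for q in supports)
    }


def maximal_patterns(supports):
    """Patterns with no frequent proper supergraph."""
    return {p for p in supports if not any(p < q for q in supports)}


ab, bc, cd = ("a", "b"), ("b", "c"), ("c", "d")
supports = {
    frozenset({ab}): 3,
    frozenset({bc}): 2,
    frozenset({cd}): 2,
    frozenset({ab, bc}): 2,
}

closed = closed_patterns(supports)
maximal = maximal_patterns(supports)
print(frozenset({bc}) in closed)   # False: {ab, bc} has the same support
print(frozenset({ab}) in closed)   # True: its supergraph has lower support
print(frozenset({ab}) in maximal)  # False: {ab, bc} is frequent
```

Every maximal pattern is also closed, but the maximal set no longer lets us recover the support of the patterns it subsumes, which is why only the closed set is a lossless compression.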
Closed and Maximal Patterns – Examples
Data: graphs A, B, C (shown in the figure)
D is a subgraph of A, B, C, but so is its supergraph E, which has the same support (3). Therefore D is not closed.
No supergraph of E is also a subgraph of all 3 graphs, and therefore E is closed.
A further pattern from the figure is a subgraph of A and B and is also closed: none of its supergraphs has support 2.
If θ = 70%, E is maximal: E is frequent, and none of its supergraphs is frequent.
Closed and Maximal Patterns – Sizes
[Plot: number of patterns vs. minimum support]
CloseGraph (Yan and Han, KDD’03)
Pattern-Growth Approach
[Figure: a k-edge pattern G is grown into (k+1)-edge patterns G1, G2, …, Gn]
Under which condition can we stop searching supergraphs? (early termination)
If:
– G and H are frequent
– G is a subgraph of H
– in any part of the graphs in the dataset where G occurs, H also occurs
then we need not grow G, since none of G’s supergraphs will be closed except those of H.
[Yan and Han KDD’03]
References & Further Reading
B. McKay. Practical graph isomorphism. Congressus Numerantium, 30:45–87, 1981.
M. Wörlein, T. Meinl, I. Fischer, and M. Philippsen, A quantitative comparison of the subgraph miners MoFa, gSpan, FFSM, and Gaston, PKDD 2005
X. Yan and J. Han, gSpan: graph-based substructure pattern mining, ICDM 2002
X. Yan and J. Han, CloseGraph: mining closed frequent graph patterns, KDD 2003
More References
C. Borgelt and M. R. Berthold, Mining molecular fragments: finding relevant substructures of molecules, ICDM 2002
C. Chen, C. X. Lin, X. Yan, and J. Han. On effective presentation of graph patterns: a structural representative approach, CIKM 2008
Y. Chi, Y. Xia, Y. Yang, and R. Muntz, Mining closed and maximal frequent subtrees from databases of labeled rooted trees, TKDE 2005
T. Horváth, J. Ramon, and S. Wrobel, Frequent subgraph mining in outerplanar graphs, KDD 2006
J. Huan, W. Wang, and J. Prins, Efficient mining of frequent subgraphs in the presence of isomorphism, ICDM 2003
J. Huan, W. Wang, J. Prins, and J. Yang, SPIN: Mining maximal frequent subgraphs from graph databases, KDD 2004
A. Inokuchi, T. Washio, and H. Motoda. An apriori-based algorithm for mining frequent substructures from graph data, PKDD 2000
R. King, A. Srinivasan, and L. Dehaspe, WARMR: a data mining tool for chemical data, J. Comput. Aided Mol. Des. 2001
M. Kuramochi and G. Karypis. Frequent subgraph discovery, ICDM 2001
S. Nijssen and J. Kok, A quickstart in frequent structure mining can make a difference, KDD 2004
N. Vanetik, E. Gudes, and S. E. Shimony. Computing frequent graph patterns from semistructured data, ICDM 2002
D. Xin, H. Cheng, X. Yan, and J. Han, Extracting redundancy-aware top-k patterns, KDD 2006
X. Yan, H. Cheng, J. Han, and P. S. Yu, Mining significant graph patterns by leap search, SIGMOD 2008
The end
See you soon!
Next: Feature Selection