Graph Mining and Graph Kernels

Karsten Borgwardt & Chloé-Agathe Azencott | Data mining in Bioinformatics | 1

Data Mining in Bioinformatics Day 3: Graph Mining

August 24, 2008 | ACM SIG KDD, Las Vegas

Karsten Borgwardt & Chloé-Agathe Azencott

February 6 to February 17, 2012

Machine Learning and Computational Biology Research Group MPIs Tübingen

From Borgwardt & Yan, Graph Mining & Graph Kernels, KDD tutorial, 2008 – with permission from Xifeng Yan.

Graphs Are Everywhere

  Co-expression Network (Magwene et al., Genome Biology 2004, 5:R100)
  Program Flow
  Social Network
  Protein Structure
  Chemical Compound

Mining Graph Patterns

  Graph Pattern Mining
    – Frequent graph patterns
    – Pattern summarization
    – Optimal graph patterns
    – Graph patterns with constraints
    – Approximate graph patterns
  Graph Classification
    – Pattern-based approach
    – Decision tree
    – Decision stumps
  Graph Compression
  Other important topics (graph model, laws, graph dynamics, social network analysis, visualization, summarization, graph clustering, link analysis, …)

Applications of Graph Pattern Mining

  Mining biochemical structures

  Finding biologically conserved subnetworks

  Finding functional modules

  Program control flow analysis

  Intrusion network analysis

  Mining communication networks

  Anomaly detection

  Mining XML structures

  Building blocks for graph classification, clustering, compression, comparison, correlation analysis, and indexing

Graph Pattern Mining

Graph pattern mining (single graph setting)

Graph classification (multiple graphs setting)

Interesting Graph Patterns

Interestingness measures / Objective functions

  Frequency: frequent graph patterns

  Discriminative: information gain, Fisher score

  Significance: G-test
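
As a rough illustration of a discriminative measure, the sketch below computes the information gain of a pattern from four counts: the positive and negative graphs that contain the pattern, and the positive and negative graphs that do not. The function names and the two-class setting are assumptions made for this example and do not come from the slides.

```python
from math import log2

def entropy(pos, neg):
    """Binary class entropy (in bits) of a set with `pos` and `neg` graphs."""
    total = pos + neg
    return -sum(c / total * log2(c / total) for c in (pos, neg) if c)

def information_gain(pos_with, neg_with, pos_without, neg_without):
    """Entropy reduction obtained by splitting the data on 'contains the pattern'."""
    n_with = pos_with + neg_with
    n_without = pos_without + neg_without
    n = n_with + n_without
    before = entropy(pos_with + pos_without, neg_with + neg_without)
    after = sum(m / n * entropy(p, q)
                for m, p, q in ((n_with, pos_with, neg_with),
                                (n_without, pos_without, neg_without))
                if m)
    return before - after

# A pattern present in all 5 positive graphs and in none of the 5 negative ones.
perfect = information_gain(5, 0, 0, 5)
# A pattern whose presence says nothing about the class.
useless = information_gain(2, 2, 3, 3)
```

A pattern that separates the classes well gets a high gain, and a pattern that occurs equally often in both classes gets a gain close to zero.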

Graphs

A graph is an ordered pair (V, E)

  V is a set of vertices (nodes)

  E is a set of edges (links): e = (v1, v2), with v1, v2 vertices in V

Types of graphs:

  undirected / directed / mixed

  loops / multigraphs / simple graphs

  unlabeled / labeled

  unweighted / weighted
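
A minimal sketch of how the pair (V, E) might be stored in code. The helper name `adjacency` and the example graph are assumptions chosen for illustration.

```python
def adjacency(vertices, edges, directed=False):
    """Build an adjacency map {vertex: set of neighbours} from V and E."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        if not directed:
            adj[v].add(u)  # an undirected edge is stored in both directions
    return adj

V = {1, 2, 3, 4}
E = [(1, 2), (2, 3), (3, 1), (3, 4)]

undirected = adjacency(V, E)
directed = adjacency(V, E, directed=True)
```

Labels or weights could be kept in separate dictionaries keyed by vertex or by edge.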

Graph isomorphism

G and H are isomorphic if there exists a bijection f between the vertices of G and the vertices of H such that any two vertices u and v are adjacent in G iff f(u) and f(v) are adjacent in H.

Graph isomorphism is in NP, and it is not known whether it is in P or NP-complete.

In practice, it can be solved efficiently for a number of classes of graphs [McKay1981].

The subgraph isomorphism problem is NP-hard.

Example: f(1) = A, f(2) = C, f(3) = D, f(4) = B, f(5) = F, f(6) = E
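
The definition can be checked directly on very small graphs by trying every bijection. The sketch below is a brute-force illustration for simple undirected graphs. It is not a practical algorithm (the running time is factorial in the number of vertices), and the function name and example graphs are assumptions.

```python
from itertools import permutations

def find_isomorphism(vertices_g, edges_g, vertices_h, edges_h):
    """Return a bijection f (as a dict) from G to H that preserves adjacency, or None."""
    vg, vh = list(vertices_g), list(vertices_h)
    eg = {frozenset(e) for e in edges_g}
    eh = {frozenset(e) for e in edges_h}
    if len(vg) != len(vh) or len(eg) != len(eh):
        return None
    for image in permutations(vh):
        f = dict(zip(vg, image))
        # f is an isomorphism iff it maps the edge set of G exactly onto that of H.
        if {frozenset((f[u], f[v])) for u, v in eg} == eh:
            return f
    return None

# Path 1-2-3 versus path A-C-B: the middle vertex 2 has to map to C.
f = find_isomorphism([1, 2, 3], [(1, 2), (2, 3)],
                     ["A", "B", "C"], [("A", "C"), ("C", "B")])
```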

Searching Graphs

  Breadth-First Search: explore all vertices at distance k before exploring vertices at distance (k+1).

  Depth-First Search: explore each branch until its end before exploring another branch.
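
Both search orders can be sketched in a few lines. The adjacency map and the function names below are assumptions for illustration. Neighbours are visited in sorted order so that the result is deterministic.

```python
from collections import deque

def bfs_order(adj, start):
    """Visit vertices level by level (distance k before distance k+1)."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in sorted(adj[u]):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return order

def dfs_order(adj, start):
    """Follow each branch to its end before backtracking."""
    seen, order = set(), []

    def visit(u):
        seen.add(u)
        order.append(u)
        for v in sorted(adj[u]):
            if v not in seen:
                visit(v)

    visit(start)
    return order

# A small tree: 1 has children 2 and 3; 2 has child 4; 3 has child 5.
adj = {1: [2, 3], 2: [1, 4], 3: [1, 5], 4: [2], 5: [3]}
print(bfs_order(adj, 1))  # [1, 2, 3, 4, 5]
print(dfs_order(adj, 1))  # [1, 2, 4, 3, 5]
```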

Frequent Pattern Mining

  Frequent Pattern: a pattern g is frequent if freq(g) ≥ θ

  freq(g) is called the support of g

  θ is called the minimum support
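
A minimal sketch of support counting over a database of small labeled graphs. It assumes that freq(g) is the fraction of database graphs that contain g as a (not necessarily induced) subgraph, and it tests containment by brute force, which is only feasible for tiny graphs. The representation (a label dictionary plus an edge list) and all names are assumptions.

```python
from itertools import permutations

def contains(graph, pattern):
    """True if `pattern` occurs in `graph` as a label-preserving subgraph."""
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    g_edge_set = {frozenset(e) for e in g_edges}
    p_nodes = list(p_labels)
    # Try every injective mapping of pattern vertices to graph vertices.
    for image in permutations(g_labels, len(p_nodes)):
        f = dict(zip(p_nodes, image))
        labels_ok = all(p_labels[u] == g_labels[f[u]] for u in p_nodes)
        edges_ok = all(frozenset((f[u], f[v])) in g_edge_set for u, v in p_edges)
        if labels_ok and edges_ok:
            return True
    return False

def support(pattern, database):
    """freq(g): fraction of database graphs that contain the pattern."""
    return sum(contains(g, pattern) for g in database) / len(database)

def is_frequent(pattern, database, theta):
    return support(pattern, database) >= theta

database = [
    ({1: "C", 2: "N", 3: "O"}, [(1, 2), (2, 3)]),  # C-N-O
    ({1: "C", 2: "N"}, [(1, 2)]),                  # C-N
    ({1: "C", 2: "O"}, [(1, 2)]),                  # C-O
]
pattern = ({"a": "C", "b": "N"}, [("a", "b")])     # C-N occurs in 2 of 3 graphs
```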

Example: Chemical compounds

(a) caffeine (b) theobromine (c) viagra

Frequent subgraph:

Example: Program call graphs

Min support = 2

Frequent subgraphs:

Algorithms

Inductive Logic Programming (WARMR, King et al. 2001)

– Graphs are represented by Datalog facts

Graph-Based Approaches

  Apriori-based approach

–  AGM/AcGM: Inokuchi et al. (PKDD’00)

–  FSG: Kuramochi and Karypis (ICDM’01)

–  PATH#: Vanetik and Gudes (ICDM’02, ICDM’04)

–  FFSM: Huan et al. (ICDM’03)

–  SPIN: Huan et al. (KDD’04)

–  FTOSM: Horváth et al. (KDD’06)

  Pattern growth approach

–  Subdue: Holder et al. (KDD’94)

–  MoFa: Borgelt and Berthold (ICDM’02)

–  gSpan: Yan and Han (ICDM’02)

–  Gaston: Nijssen and Kok (KDD’04)

–  CMTreeMiner: Chi et al. (TKDE’05)

–  LEAP: Yan et al. (SIGMOD’08)

Apriori Property

If a graph is frequent, all of its subgraphs are frequent. Therefore you can stop extending infrequent patterns.

k edges → (k+1) edges → … (heuristics)
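
The pruning idea can be sketched as a level-wise search. To keep the example short it makes a strong simplifying assumption: vertex labels are unique within each graph, so a graph is just a set of labeled edges and subgraph containment reduces to a subset test. Connectivity of patterns is not enforced, support is an absolute count, and all names are hypothetical.

```python
from itertools import combinations

def support(pattern, database):
    """Number of database graphs whose edge set contains the pattern."""
    return sum(pattern <= g for g in database)

def apriori(database, min_support):
    """Level-wise mining: level k holds the frequent patterns with k edges."""
    edges = {e for g in database for e in g}
    level = {frozenset([e]) for e in edges
             if support(frozenset([e]), database) >= min_support}
    frequent = {}
    while level:
        for p in level:
            frequent[p] = support(p, database)
        # Join two k-edge patterns that share (k-1) edges.
        candidates = {p | q for p, q in combinations(level, 2)
                      if len(p | q) == len(p) + 1}
        # Apriori pruning: drop a candidate if any k-edge sub-pattern is infrequent.
        level = {c for c in candidates
                 if all(c - {e} in level for e in c)
                 and support(c, database) >= min_support}
    return frequent

AB, BC, CD = ("A", "B"), ("B", "C"), ("C", "D")
database = [frozenset({AB, BC, CD}), frozenset({AB, BC}), frozenset({AB, CD})]
result = apriori(database, min_support=2)
```

In this toy database the 3-edge candidate is never counted against the data, because one of its 2-edge sub-patterns, {BC, CD}, is already infrequent.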

Cost Analysis

  number of candidates
    – frequent
    – infrequent (X)
    – duplicate (X)

  isomorphism checking

  data

Design Principles

  Search Order
    – breadth vs. depth
    – complete vs. incomplete

  Support Calculation
    – compute for each new pattern vs. remember all subgraph isomorphism tests

  1. Generation of Candidate Patterns
    – apriori vs. pattern growth

  2. Discovery Order of Patterns
    – DFS order
    – path → tree → graph

  3. Elimination of Duplicate Subgraphs
    – passive vs. active

Generation of Candidate Patterns

Apriori-Based Approach

  join 2 patterns of size k that share a pattern of size (k-1)

  [Figure labels: G, G1, G2, …, Gn; P, Q; size k, size (k+1)]

Pattern-Growth Approach

  extend patterns of size k by 1 edge / vertex

  [Figure labels: G, G1, G2, G’1, G’2; size k, size (k+1), size (k+2)]
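
A minimal sketch of pattern growth, under the simplifying assumption that vertex labels are unique within each graph, so that a graph is a set of labeled edges and containment is a subset test. Each pattern is extended by one adjacent edge at a time, and a dictionary of already visited patterns acts as a simple duplicate check. Support is an absolute count and all names are hypothetical.

```python
def support(pattern, database):
    return sum(pattern <= g for g in database)

def extensions(pattern, graph):
    """Connected patterns obtained by adding one edge of `graph` to `pattern`."""
    covered = {v for e in pattern for v in e}
    return {pattern | {e} for e in graph
            if e not in pattern and covered & set(e)}

def grow(pattern, database, min_support, found):
    """Depth-first growth of `pattern`; `found` maps visited patterns to support."""
    found[pattern] = support(pattern, database)
    for g in database:
        if pattern <= g:
            for ext in extensions(pattern, g):
                if ext not in found and support(ext, database) >= min_support:
                    grow(ext, database, min_support, found)
    return found

def mine(database, min_support):
    found = {}
    for e in {e for g in database for e in g}:
        seed = frozenset([e])
        if seed not in found and support(seed, database) >= min_support:
            grow(seed, database, min_support, found)
    return found

AB, BC, CD = ("A", "B"), ("B", "C"), ("C", "D")
database = [frozenset({AB, BC, CD}), frozenset({AB, BC}), frozenset({AB, CD})]
patterns = mine(database, min_support=2)
```

Because an extension must touch a vertex that the pattern already covers, only connected patterns are produced. The disconnected edge pair {AB, CD} is never generated even though it occurs in two graphs.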

Discovery Order – Free Extension

  6 edges → 7 edges: 22 new patterns

Discovery Order – Right-Most Extension

  depth-first search

  right-most path (from start to end)

  6 edges → 7 edges: 4 new patterns

[Yan and Han ICDM’02]

Duplicates Elimination

  Existing patterns g1, g2, …, gm

  Newly discovered pattern g

Option 1

  Check graph isomorphism of g with each pattern g1, g2, …, gm (slow)

Option 2

  Transform each graph to a canonical label, create a hash value for this canonical label, and check if there is a match with g (faster)

Option 3

  Build a canonical order and generate graph patterns in that order (fastest)
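
Option 2 can be sketched as follows. The canonical label used here, the smallest (vertex labels, adjacency bits) encoding over all vertex orderings, is an assumption chosen for brevity. It is not the labeling used by any of the miners named in these slides, and it is only feasible for very small patterns. A Python set supplies the hashing.

```python
from itertools import permutations

def canonical_label(labels, edges):
    """Smallest encoding of a labeled graph over all vertex orderings."""
    edge_set = {frozenset(e) for e in edges}
    best = None
    for order in permutations(labels):
        n = len(order)
        code = (tuple(labels[v] for v in order),
                tuple(int(frozenset((order[i], order[j])) in edge_set)
                      for i in range(n) for j in range(i + 1, n)))
        if best is None or code < best:
            best = code
    return best

def is_duplicate(labels, edges, seen):
    """True if an isomorphic pattern was registered in `seen` before."""
    key = canonical_label(labels, edges)
    if key in seen:
        return True
    seen.add(key)
    return False

seen = set()
g = ({1: "C", 2: "N", 3: "O"}, [(1, 2), (2, 3)])                # C-N-O
h = ({"x": "O", "y": "N", "z": "C"}, [("x", "y"), ("y", "z")])  # same path, renamed
k = ({1: "C", 2: "N", 3: "O"}, [(1, 2), (1, 3)])                # N-C-O, different
first, second, third = (is_duplicate(*p, seen) for p in (g, h, k))
```

Isomorphic patterns receive the same label, so a single set lookup replaces m pairwise isomorphism tests.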

Duplicates Elimination – Run Time

[Figure: run time per pattern (msec) vs. minimum support (in %); AIDS antiviral screen compound dataset from NCI/NIH]

[Wörlein et al. PKDD’05]

  MoFa: option 1 + free extension

  gSpan: option 1 + right-most extension

  FFSM: option 2

  Gaston: option 3

Duplicates Elimination – Memory Usage

[Figure: memory usage (GB) vs. minimum support (in %)]

[Wörlein et al. PKDD’05]

  MoFa: option 1 + free extension

  gSpan: option 1 + right-most extension

  FFSM: option 2

  Gaston: option 3

Graph Pattern Explosion Problem

  Conclusion: Many enumeration algorithms (AGM, FSG, gSpan, Path-Join, MoFa, FFSM, SPIN, Gaston, …)

  But:
    – If a graph is frequent, all of its subgraphs are frequent (apriori property)
    – An n-edge frequent graph may have 2^n subgraphs
    – E.g.: AIDS antiviral screen dataset with 400+ compounds, support level 5%: over 1M frequent graph patterns

  Problem 1: How to interpret frequent patterns? → Pattern summarization

  Problem 2: How to reduce the size of the pattern set? → Closed and maximal graphs

  Problem 3: How to set the minimum support?

Pattern Summarization

  Too many patterns may not lead to more explicit knowledge

  It can confuse users as well as further discovery (e.g., clustering, classification, indexing, etc.)

  → A small set of representative patterns that preserve most of the information

[Xin et al. KDD’06, Chen et al. CIKM’08]

Pattern Summarization – Pattern Distance

Distance between patterns:

  measure 1: pattern-based
    – pattern containment
    – pattern similarity

  measure 2: data-based
    – data similarity

Closed and Maximal Patterns

Closed Frequent Graph

  A frequent graph G is closed if there exists no supergraph of G that carries the same support as G

  If some of G’s subgraphs have the same support, it is unnecessary to output these subgraphs (nonclosed graphs)

  Lossless compression: still ensures that the mining result is complete

Maximal Frequent Graph

  A frequent graph G is maximal if there exists no supergraph of G that is frequent
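
Both definitions can be applied as filters over a complete set of frequent patterns. The sketch below assumes that each pattern is a frozenset of labeled edges with unique vertex labels, so that "supergraph" reduces to "proper superset". The support table and all names are made up for illustration.

```python
def closed_patterns(supports):
    """Patterns that have no proper supergraph with the same support."""
    return {p for p, s in supports.items()
            if not any(p < q and supports[q] == s for q in supports)}

def maximal_patterns(supports):
    """Patterns that have no frequent proper supergraph."""
    return {p for p in supports if not any(p < q for q in supports)}

AB, BC, CD = ("A", "B"), ("B", "C"), ("C", "D")
# All frequent patterns of a toy database together with their support counts.
supports = {
    frozenset({AB}): 3,
    frozenset({BC}): 2,
    frozenset({CD}): 2,
    frozenset({AB, BC}): 2,
    frozenset({AB, CD}): 2,
}
closed = closed_patterns(supports)
maximal = maximal_patterns(supports)
```

Here {BC} is not closed because {AB, BC} has the same support, whereas {AB} is closed but not maximal. Every maximal pattern is also closed, so the maximal set is the smaller of the two summaries.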

Closed and Maximal Patterns – Examples

Data: graphs A, B, C

  D is a subgraph of A, B, C, but so is its supergraph E, which has the same support (3). Therefore D is not closed.

  No supergraph of E is also a subgraph of all 3 graphs and therefore E is closed.

  A further pattern is a subgraph of A, B and is also closed: none of its supergraphs has support 2

  If θ = 70%, E is maximal:
    – E is frequent
    – None of its supergraphs is frequent

Closed and Maximal Patterns – Sizes

[Figure: number of patterns vs. minimum support]

CloseGraph (Yan and Han, KDD’03)

Pattern-Growth Approach

[Figure labels: G (k-edge); G1, G2, …, Gn ((k+1)-edge)]

Under which condition can we stop searching supergraphs? (early termination)

If:
  – G and H are frequent
  – G is a subgraph of H
  – wherever G occurs in a graph of the dataset, H also occurs
then we need not grow G, since none of G’s supergraphs will be closed except those of H.

[Yan and Han KDD’03]

References & Further Reading

  B. McKay. Practical graph isomorphism. Congressus Numerantium, 30:45–87, 1981.

  M. Wörlein, T. Meinl, I. Fischer, and M. Philippsen, A quantitative comparison of the subgraph miners MoFa, gSpan, FFSM, and Gaston, PKDD 2005

  X. Yan and J. Han, gSpan: graph-based substructure pattern mining, ICDM 2002

  X. Yan and J. Han, CloseGraph: mining closed frequent graph patterns, KDD 2003

More References

  C. Borgelt and M. R. Berthold, Mining molecular fragments: finding relevant substructures of molecules, ICDM 2002

  C. Chen, C. X. Lin, X. Yan, and J. Han. On effective presentation of graph patterns: a structural representative approach, CIKM 2008

  Y. Chi, Y. Xia, Y. Yang, and R. Muntz, Mining closed and maximal frequent subtrees from databases of labeled rooted trees, TKDE 2005

  T. Horváth, J. Ramon, and S. Wrobel, Frequent subgraph mining in outerplanar graphs, KDD 2006

  J. Huan, W. Wang, and J. Prins, Efficient mining of frequent subgraphs in the presence of isomorphism, ICDM 2003

  J. Huan, W. Wang, J. Prins, and J. Yang, SPIN: Mining maximal frequent subgraphs from graph databases, KDD 2004

  A. Inokuchi, T. Washio, and H. Motoda. An apriori-based algorithm for mining frequent substructures from graph data, PKDD 2000

  R. King, A. Srinivasan, and L. Dehaspe, WARMR: a data mining tool for chemical data, J. Comput. Aided Mol. Des. 2001

  M. Kuramochi and G. Karypis. Frequent subgraph discovery, ICDM 2001

  S. Nijssen and J. Kok, A quickstart in frequent structure mining can make a difference, KDD 2004

  N. Vanetik, E. Gudes, and S. E. Shimony. Computing frequent graph patterns from semistructured data, ICDM 2002

  D. Xin, H. Cheng, X. Yan, and J. Han, Extracting redundancy-aware top-k patterns, KDD 2006

  X. Yan, H. Cheng, J. Han, and P. S. Yu, Mining significant graph patterns by leap search, SIGMOD 2008

The end

See you soon!

Next: Feature Selection

