Data Mining in Bioinformatics Day 3: Graph Mining · 2014. 10. 29. · M. Wörlein, T. Meinl, I....

Graph Mining and Graph Kernels

Karsten Borgwardt & Chloé-Agathe Azencott | Data mining in Bioinformatics | 1

Data Mining in Bioinformatics Day 3: Graph Mining

August 24, 2008 | ACM SIG KDD, Las Vegas

Karsten Borgwardt & Chloé-Agathe Azencott

February 6 to February 17, 2012

Machine Learning and Computational Biology Research Group MPIs Tübingen

From Borgwardt & Yan, Graph Mining & Graph Kernels, KDD tutorial, 2008 – with permission from Xifeng Yan.



Graphs Are Everywhere

[Figure panels: Co-expression Network (Magwene et al., Genome Biology 2004, 5:R100); Program Flow; Social Network; Protein Structure; Chemical Compound]



Mining Graph Patterns

  Graph Pattern Mining

– Frequent graph patterns

– Pattern summarization

– Optimal graph patterns

– Graph patterns with constraints

– Approximate graph patterns

  Graph Classification

– Pattern-based approach

– Decision tree

– Decision stumps

  Graph Compression

  Other important topics (graph model, laws, graph dynamics, social network analysis, visualization, summarization, graph clustering, link analysis, …)



Applications of Graph Pattern Mining

  Mining biochemical structures

  Finding biologically conserved subnetworks

  Finding functional modules

  Program control flow analysis

  Intrusion network analysis

  Mining communication networks

  Anomaly detection

  Mining XML structures

  Building blocks for graph classification, clustering, compression, comparison, correlation analysis, and indexing



Graph Pattern Mining

Graph pattern mining (single graph setting)

Graph classification (multiple graphs setting)



Interesting Graph Patterns

Interestingness measures / Objective functions

  Frequency: frequent graph patterns

  Discriminative: information gain, Fisher score

  Significance: G-test



Graphs

A graph is an ordered pair (V, E)

  V is a set of vertices (nodes)

  E is a set of edges (links) e = (v1, v2), with v1, v2 vertices in V

  undirected / directed / mixed

  loops / multigraphs / simple graphs

  unlabeled / labeled

  unweighted / weighted



Graph isomorphism

G and H are isomorphic if there exists a bijection f between the vertices of G and the vertices of H such that any two vertices u and v are adjacent in G iff f(u) and f(v) are adjacent in H.

Graph isomorphism is in NP, but it is not known whether it is in P or NP-complete.

In practice, it can be solved efficiently for a number of classes of graphs [McKay1981].

The subgraph isomorphism problem is NP-hard.

Example bijection: f(1) = A, f(2) = C, f(3) = D, f(4) = B, f(5) = F, f(6) = E
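
The isomorphism definition can be checked directly on small graphs by trying every bijection. The sketch below is a brute-force illustration only (it is exponential in the number of vertices and is not the method of [McKay1981]). The function name `find_isomorphism` and the example graphs are hypothetical and are not the graphs from the slide figure.

```python
from itertools import permutations


def find_isomorphism(vertices_g, edges_g, vertices_h, edges_h):
    """Return a bijection f: V(G) -> V(H) that preserves adjacency, or None.

    Brute force over all |V|! candidate bijections; simple undirected graphs only.
    """
    # Store undirected edges as frozensets so that (u, v) and (v, u) coincide.
    eg = {frozenset(e) for e in edges_g}
    eh = {frozenset(e) for e in edges_h}
    if len(vertices_g) != len(vertices_h) or len(eg) != len(eh):
        return None
    for image in permutations(vertices_h):
        f = dict(zip(vertices_g, image))
        # With equal edge counts, mapping every edge of G onto an edge of H
        # is equivalent to "u, v adjacent in G iff f(u), f(v) adjacent in H".
        if all(frozenset(f[x] for x in e) in eh for e in eg):
            return f
    return None


# Hypothetical example graphs: two 4-cycles and a triangle with a pendant edge.
cycle_g = ([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4), (4, 1)])
cycle_h = (["A", "B", "C", "D"], [("A", "C"), ("C", "B"), ("B", "D"), ("D", "A")])
paw = ([1, 2, 3, 4], [(1, 2), (2, 3), (3, 1), (3, 4)])

f = find_isomorphism(*cycle_g, *cycle_h)
print(f)                                  # one valid bijection
print(find_isomorphism(*paw, *cycle_h))   # None: same sizes, different structure
```

The second call shows why equal vertex and edge counts are necessary but not sufficient for isomorphism.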



Searching Graphs

  Breadth-First Search: explore all vertices at distance k before exploring vertices at distance (k+1).

  Depth-First Search: explore each branch to its end before exploring another branch.
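
A minimal sketch of both search orders on an adjacency-list graph. The example graph `adj` and the function names are illustrative assumptions, not taken from the slides.

```python
from collections import deque


def bfs_order(adj, start):
    """Visit vertices level by level: distance k before distance k+1."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return order


def dfs_order(adj, start):
    """Follow each branch to its end before backtracking to the next one."""
    seen = set()
    order = []

    def visit(v):
        seen.add(v)
        order.append(v)
        for w in adj[v]:
            if w not in seen:
                visit(w)

    visit(start)
    return order


# Hypothetical undirected graph: two branches A-B-D and A-C-E.
adj = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "E"],
    "D": ["B"],
    "E": ["C"],
}

print(bfs_order(adj, "A"))  # ['A', 'B', 'C', 'D', 'E']
print(dfs_order(adj, "A"))  # ['A', 'B', 'D', 'C', 'E']
```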



Frequent Pattern Mining

  Frequent Pattern: a pattern g is frequent in a graph dataset if freq(g) ≥ θ

  freq(g) is called the support of g

  θ is called the minimum support
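
As an illustration, the support freq(g) can be computed by counting the dataset graphs that contain g as a subgraph. The sketch below assumes vertex-labeled, undirected graphs and uses a brute-force subgraph isomorphism test that is only practical for very small graphs. The names `contains` and `support` and the toy dataset are hypothetical.

```python
from itertools import permutations


def contains(graph, pattern):
    """True if `pattern` occurs in `graph` as a (label-preserving) subgraph."""
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    g_edge_set = {frozenset(e) for e in g_edges}
    p_nodes = list(p_labels)
    # Try every injective mapping of pattern vertices onto graph vertices.
    for image in permutations(g_labels, len(p_nodes)):
        f = dict(zip(p_nodes, image))
        if any(p_labels[v] != g_labels[f[v]] for v in p_nodes):
            continue
        if all(frozenset((f[u], f[v])) in g_edge_set for u, v in p_edges):
            return True
    return False


def support(pattern, dataset):
    """freq(g): number of graphs in the dataset that contain the pattern."""
    return sum(contains(g, pattern) for g in dataset)


# Toy dataset: each graph is (vertex -> label, list of undirected edges).
dataset = [
    ({1: "C", 2: "N", 3: "O"}, [(1, 2), (1, 3)]),  # N-C-O
    ({1: "C", 2: "N"}, [(1, 2)]),                  # C-N
    ({1: "C", 2: "O"}, [(1, 2)]),                  # C-O
]
c_n = ({"a": "C", "b": "N"}, [("a", "b")])
n_c_o = ({"a": "N", "b": "C", "c": "O"}, [("a", "b"), ("b", "c")])

theta = 2  # minimum support
print(support(c_n, dataset), support(c_n, dataset) >= theta)      # 2 True
print(support(n_c_o, dataset), support(n_c_o, dataset) >= theta)  # 1 False
```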



Example: Chemical compounds

(a) caffeine (b) theobromine (c) viagra

Frequent subgraph:

…



Example: Program call graphs

Min support = 2

Frequent subgraphs:



Algorithms

Inductive Logic Programming (WARMR, King et al. 2001)

– Graphs are represented by Datalog facts

Graph Based Approaches

  Apriori-based approach

–  AGM/AcGM: Inokuchi et al. (PKDD’00)

–  FSG: Kuramochi and Karypis (ICDM’01)

–  PATH#: Vanetik and Gudes (ICDM’02, ICDM’04)

–  FFSM: Huan et al. (ICDM’03)

–  SPIN: Huan et al. (KDD’04)

–  FTOSM: Horváth et al. (KDD’06)

  Pattern growth approach

–  Subdue: Holder et al. (KDD’94)

–  MoFa: Borgelt and Berthold (ICDM’02)

–  gSpan: Yan and Han (ICDM’02)

–  Gaston: Nijssen and Kok (KDD’04)

–  CMTreeMiner: Chi et al. (TKDE’05)

–  LEAP: Yan et al. (SIGMOD’08)



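Apriori Property

If a graph is frequent, all of its subgraphs are frequent. Equivalently, no supergraph of an infrequent pattern can be frequent, so you can stop extending non-frequent patterns.

[Figure: level-wise search from patterns with k edges to patterns with (k+1) edges, …; heuristics]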



Cost Analysis

  isomorphism checking

  number of candidates

– frequent

– infrequent (X)

– duplicate (X)

  data



Design Principles

  Search Order

–  breadth vs. depth

–  complete vs. incomplete

  Support Calculation

–  compute for each new pattern vs. remember all subgraph isomorphism tests

  1. Generation of Candidate Patterns

–  apriori vs. pattern growth

  2. Discovery Order of Patterns

–  DFS order

–  path → tree → graph

  3. Elimination of Duplicate Subgraphs

–  passive vs. active



Generation of Candidate Patterns

Apriori-Based Approach

join 2 patterns of size k that share a pattern of size (k-1)

[Figure: size k → size (k+1); labels G, P, Q, G1, G2, …, Gn]

Pattern-Growth Approach

extend patterns of size k by 1 edge / vertex

[Figure: size k → size (k+1) → size (k+2); labels G, G1, G2, …, G’1, G’2, …]
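
The apriori-based join can be sketched under a strong simplifying assumption: every vertex label is unique within a graph, so a pattern is identified by its set of labeled edges and "is a subgraph of" reduces to a subset test. Connectivity is not enforced, and this is not the candidate generation of AGM or FSG, which must handle isomorphism. The function name and the toy dataset are hypothetical.

```python
from itertools import combinations


def apriori_edge_sets(dataset, min_support):
    """Level-wise mining; returns {frozenset(edges): support}."""

    def support(pattern):
        return sum(pattern <= g for g in dataset)

    # Level 1: frequent single edges.
    all_edges = set().union(*dataset)
    level = {frozenset([e]) for e in all_edges
             if support(frozenset([e])) >= min_support}
    frequent = {}
    k = 1
    while level:
        for p in level:
            frequent[p] = support(p)
        candidates = set()
        # Join two size-k patterns that share a (k-1)-edge core.
        for p, q in combinations(level, 2):
            joined = p | q
            if len(joined) != k + 1:
                continue
            # Apriori pruning: every k-edge sub-pattern must be frequent.
            if all(joined - {e} in level for e in joined):
                candidates.add(joined)
        level = {c for c in candidates if support(c) >= min_support}
        k += 1
    return frequent


AB, BC, AC = ("A", "B"), ("B", "C"), ("A", "C")
dataset = [{AB, BC, AC}, {AB, BC, AC}, {AB, BC}, {AC, BC}]

result = apriori_edge_sets(dataset, min_support=3)
for pattern, sup in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(pattern), sup)
```

With `min_support=3`, the 3-edge triangle is never counted against the data: its sub-pattern {AB, AC} has support 2, so the candidate is pruned by the apriori property.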



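Discovery Order – Free Extension

[Figure: extending a 6-edge pattern to 7 edges by free extension yields 22 new patterns]

Discovery Order – Right-Most Extension

[Figure: depth-first search; extending a 6-edge pattern to 7 edges only along the right-most path (from start to end) yields 4 new patterns] [Yan and Han ICDM’02]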



Duplicates Elimination

  Existing patterns g1, g2, …, gm

  Newly discovered pattern g

Option 1

  Check graph isomorphism of g with each pattern g1, g2, …, gm (slow)

Option 2

  Transform each graph to a canonical label, create a hash value for this canonical label, and check if there is a match with g (faster)

Option 3

  Build a canonical order and generate graph patterns in that order (fastest)
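
Option 2 can be sketched with a brute-force canonical label: the lexicographically smallest encoding of the graph over all vertex orderings. Isomorphic graphs get the same label, so a hash set of labels detects duplicates. This is an illustrative assumption, not the labeling used by any of the miners above (gSpan, for instance, uses minimum DFS codes), and it is only feasible for tiny patterns.

```python
from itertools import permutations


def canonical_label(labels, edges):
    """Smallest (vertex labels, edge list) encoding over all vertex orderings."""
    nodes = list(labels)
    best = None
    for perm in permutations(range(len(nodes))):
        pos = dict(zip(nodes, perm))  # vertex -> position in this ordering
        node_part = tuple(lab for _, lab in sorted((pos[v], labels[v]) for v in nodes))
        edge_part = tuple(sorted(tuple(sorted((pos[u], pos[v]))) for u, v in edges))
        code = (node_part, edge_part)
        if best is None or code < best:
            best = code
    return best


# Hypothetical patterns: p1 and p2 are the same chain C-C-O written differently.
p1 = ({1: "C", 2: "C", 3: "O"}, [(1, 2), (2, 3)])
p2 = ({"x": "O", "y": "C", "z": "C"}, [("x", "y"), ("y", "z")])
p3 = ({1: "C", 2: "O", 3: "C"}, [(1, 2), (2, 3)])  # C-O-C, a different pattern

seen = set()      # hash set of canonical labels of existing patterns
unique = []
for pattern in (p1, p2, p3):
    label = canonical_label(*pattern)
    if label not in seen:  # a match means the pattern is a duplicate
        seen.add(label)
        unique.append(pattern)

print(len(unique))  # 2
```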



Duplicates Elimination – Run Time

[Figure: run time per pattern (msec) vs. minimum support (in %); AIDS antiviral screen compound dataset from NCI/NIH] [Wörlein et al. PKDD’05]

  MoFa: option 1 + free extension

  gSpan: option 1 + right-most extension

  FFSM: option 2

  Gaston: option 3



Duplicates Elimination – Memory Usage

[Figure: memory usage (GB) vs. minimum support (in %)] [Wörlein et al. PKDD’05]

  MoFa: option 1 + free extension

  gSpan: option 1 + right-most extension

  FFSM: option 2

  Gaston: option 3



Graph Pattern Explosion Problem

  Conclusion: many enumeration algorithms exist (AGM, FSG, gSpan, Path-Join, MoFa, FFSM, SPIN, Gaston, …)

  But:

–  If a graph is frequent, all of its subgraphs are frequent (apriori property)

–  An n-edge frequent graph may have 2^n subgraphs

–  E.g.: on the AIDS antiviral screen dataset with 400+ compounds, a support level of 5% yields over 1M frequent graph patterns

  Problem 1: How to interpret frequent patterns? → Pattern summarization

  Problem 2: How to reduce the size of the pattern set? → Closed and maximal graphs

  Problem 3: How to set the minimum support?



Pattern Summarization

  Too many patterns may not lead to more explicit knowledge

  They can confuse users as well as further discovery (e.g., clustering, classification, indexing, etc.)

  Goal: a small set of representative patterns that preserve most of the information

[Xin et al. KDD’06, Chen et al. CIKM’08]



Pattern Summarization – Pattern Distance

[Figure: distance between patterns, measured on the patterns themselves or on the data]

measure 1: pattern-based

  pattern containment

  pattern similarity

measure 2: data-based

  data similarity



Closed and Maximal Patterns

Closed Frequent Graph

  A frequent graph G is closed if there exists no supergraph of G that carries the same support as G

  If some of G’s subgraphs have the same support, it is unnecessary to output these subgraphs (nonclosed graphs)

  Lossless compression: still ensures that the mining result is complete

Maximal Frequent Graph

  A frequent graph G is maximal if there exists no supergraph of G that is frequent
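
Both definitions can be applied as a post-filter on a set of frequent patterns. The sketch below assumes patterns are identified by their edge sets (unique vertex labels), so "supergraph" becomes a proper-superset test. The supports are a hypothetical toy result for θ = 2, not data from the slides.

```python
def closed_and_maximal(frequent):
    """frequent: {frozenset(edges): support}, containing every frequent pattern.

    A supergraph with the same support as a frequent pattern is itself
    frequent, so it is enough to compare against the patterns in `frequent`.
    """
    closed, maximal = set(), set()
    for p, sup in frequent.items():
        supers = [q for q in frequent if p < q]  # proper supergraphs of p
        if all(frequent[q] < sup for q in supers):
            closed.add(p)   # no supergraph carries the same support
        if not supers:
            maximal.add(p)  # no supergraph is frequent at all
    return closed, maximal


AB, BC, AC = ("A", "B"), ("B", "C"), ("A", "C")
frequent = {
    frozenset([AB]): 3,
    frozenset([BC]): 4,
    frozenset([AC]): 3,
    frozenset([AB, BC]): 3,
    frozenset([AB, AC]): 2,
    frozenset([BC, AC]): 3,
    frozenset([AB, BC, AC]): 2,
}

closed, maximal = closed_and_maximal(frequent)
print(len(frequent), len(closed), len(maximal))  # 7 4 1
```

In this toy result every maximal pattern is also closed, which follows from the definitions: a pattern with no frequent supergraph has no supergraph of equal support.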



Closed and Maximal Patterns – Examples

Data: graphs A, B, C [figure]

  D is a subgraph of A, B, C, but so is its supergraph E, which has the same support (3). Therefore D is not closed.

  No supergraph of E is also a subgraph of all 3 graphs and therefore E is closed.

  [pattern figure] is a subgraph of A, B and is also closed: none of its supergraphs has support 2

  If θ = 70%, E is maximal:

– E is frequent

– None of its supergraphs is frequent



Closed and Maximal Patterns – Sizes

[Figure: number of patterns vs. minimum support]



CloseGraph (Yan and Han, KDD’03)

Pattern-Growth Approach

[Figure: a k-edge pattern G is grown into (k+1)-edge patterns G1, G2, …, Gn]

Under which condition can we stop searching supergraphs? (early termination)

If:

  G and H are frequent

  G is a subgraph of H

  wherever G occurs in a graph of the dataset, H also occurs

then we need not grow G, since none of G’s supergraphs will be closed except those of H.

[Yan and Han KDD’03]



References & Further Reading

  B. McKay. Practical graph isomorphism. Congressus Numerantium, 30:45–87, 1981.

  M. Wörlein, T. Meinl, I. Fischer, and M. Philippsen, A quantitative comparison of the subgraph miners MoFa, gSpan, FFSM, and Gaston, PKDD 2005

  X. Yan and J. Han, gSpan: graph-based substructure pattern mining, ICDM 2002

  X. Yan and J. Han, CloseGraph: mining closed frequent graph patterns, KDD 2003



More References

  C. Borgelt and M. R. Berthold, Mining molecular fragments: finding relevant substructures of molecules, ICDM 2002

  C. Chen, C. X. Lin, X. Yan, and J. Han. On effective presentation of graph patterns: a structural representative approach, CIKM 2008

  Y. Chi, Y. Xia, Y. Yang, and R. Muntz, Mining closed and maximal frequent subtrees from databases of labeled rooted trees, TKDE 2005

  T. Horváth, J. Ramon, and S. Wrobel, Frequent subgraph mining in outerplanar graphs, KDD 2006

  J. Huan, W. Wang, and J. Prins, Efficient mining of frequent subgraphs in the presence of isomorphism, ICDM 2003

  J. Huan, W. Wang, J. Prins, and J. Yang, SPIN: Mining maximal frequent subgraphs from graph databases, KDD 2004

  A. Inokuchi, T. Washio, and H. Motoda. An apriori-based algorithm for mining frequent substructures from graph data, PKDD 2000

  R. King, A. Srinivasan, and L. Dehaspe, WARMR: a data mining tool for chemical data, J. Comput. Aided Mol. Des. 2001

  M. Kuramochi and G. Karypis. Frequent subgraph discovery, ICDM 2001

  S. Nijssen and J. Kok, A quickstart in frequent structure mining can make a difference, KDD 2004

  N. Vanetik, E. Gudes, and S. E. Shimony. Computing frequent graph patterns from semistructured data, ICDM 2002

  D. Xin, H. Cheng, X. Yan, and J. Han, Extracting redundancy-aware top-k patterns, KDD 2006

  X. Yan, H. Cheng, J. Han, and P. S. Yu, Mining significant graph patterns by leap search, SIGMOD 2008



The end

See you soon!

Next: Feature Selection
