Technische Universität München

Large-Scale Graph Mining Using Backbone Refinement Classes

05/2009

Andreas Maunz (1), Christoph Helma (1,2), and Stefan Kramer (3)

1) FDM Universität Freiburg (D)
2) in-silico toxicology Basel (CH)
3) Technische Universität München (D)



BACKBONE REFINEMENT CLASS MINING

Efficient diverse substructure mining from a large class-labelled graph database


BBRC Rationale

Trees are the most frequent substructure type, yet they can be enumerated efficiently. However:

• Excessively large result sets are obtained even for high correlation and minimum frequency constraints.

Figure (pie chart): Typical substructure frequencies for databases of small molecules: paths 5%, real trees 85%, cycle-closing graphs 10%.


BBRC Definitions


GASTON (GrAph, Sequence and Tree ExtractiON) by Nijssen and Kok [1]:

• Backbone of a tree: longest path with the lowest sequence (assuming canonical sequence ordering).

• Since every tree has exactly one backbone, backbones partition the partial order of trees disjointly.

Backbone Refinement Class (BBRC): All tree refinements growing from a specific backbone.

• Pre-order (depth-first) traversal is used within each partition to refine structures.

[1] S. Nijssen and J. N. Kok, “A Quickstart in Frequent Structure Mining Can Make a Difference”, KDD ’04: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA: ACM, 2004, pp. 647–652.
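The backbone definition can be illustrated with a minimal sketch. It is not GASTON's or libfminer's code: it assumes that the canonical sequence ordering is plain lexicographic order on node labels, it ignores edge labels, and the function name `backbone` is hypothetical. It enumerates all paths of a small labelled tree by brute force, keeps the longest ones, and returns the one with the lowest label sequence.

```python
from itertools import combinations


def backbone(labels, edges):
    """Label sequence of the backbone of a small labelled tree.

    labels: dict node -> label, edges: list of (u, v) pairs forming a tree.
    Assumption: "lowest sequence" means lexicographically smallest label list.
    """
    adj = {v: [] for v in labels}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    def path(src, dst):
        # The path between two nodes of a tree is unique; find it by DFS,
        # never stepping back to the node we just came from.
        stack = [(src, [src])]
        while stack:
            node, trail = stack.pop()
            if node == dst:
                return trail
            for nxt in adj[node]:
                if len(trail) < 2 or nxt != trail[-2]:
                    stack.append((nxt, trail + [nxt]))
        raise ValueError("input is not connected")

    nodes = list(labels)
    pairs = list(combinations(nodes, 2)) or [(nodes[0], nodes[0])]
    best = None
    for u, v in pairs:
        seq = [labels[n] for n in path(u, v)]
        seq = min(seq, seq[::-1])      # a path can be read in both directions
        key = (-len(seq), seq)         # longest first, then lowest sequence
        if best is None or key < best:
            best = key
    return best[1]


# Toy tree: C-C(-N)-O-C, written as node labels plus an edge list.
toy_labels = {0: "C", 1: "C", 2: "O", 3: "C", 4: "N"}
toy_edges = [(0, 1), (1, 2), (2, 3), (1, 4)]
print(backbone(toy_labels, toy_edges))
```

Two paths of length four exist in the toy tree; the tie is broken by the label sequence, which is why every tree ends up with exactly one backbone.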


BBRC Example


Figure: refinements growing from the backbone c:c:c-C=C-O-C (backbones in gray), grouped into Class 1 and Class 2. Structures shown: C-C(-O-C)(=C-c:c:c), C-C(=C(-O-C)(-C))(-c:c:c), C-C(=C-O-C)(-c:c:c).


BBRC Properties (1)


• BBRCs partition the search space structurally (as opposed to occurrence-based methods, such as open/closed features).

Search space for two BBRCs within the same backbone.

Some Properties

• Two types of BBRCs:

I. within a backbone: not disjoint (see figure on the left)

II. across backbones: disjoint

• A given backbone spans a maximum search tree. No node may be added without changing the backbone.


BBRC Properties (2)


The Number of BBRCs

Consider the special case of a rooted perfect binary tree of height h.

Figure: Perfect binary tree of height 3; backbone with branches in gray.

→ The number of Backbone Refinement Classes is governed by the (recursive) branches on this backbone.


BBRC Properties (3)


The Number of BBRCs (unpublished)

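The number of backbone refinement classes of a branch of length l is b(l).

[Equation for b(l) not recoverable from this transcript. Surviving terms: b(l), (l-1)!, (l-1), and b(i) with an index i running from 1 to l-1.]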

The full set of subtrees containing the root has size [1]

$$F(h) \approx q^{2^{h+1}} - 1,$$

where q ≈ 1.50284.

The set of BBRCs containing the root has size B(h).

[Equation for B(h) not recoverable from this transcript. Surviving symbols: B(h), the indices i, j, s, t, factorial terms, and b(s), b(t).]

[1] L. A. Székely and H. Wang, On subtrees of trees, Advances in Applied Mathematics, Volume 34, Issue 1, January 2005, pp. 138–155.


BBRC Properties (4)


Summary of Feature Counts

h    B(h)         F(h)
1    1            4
2    8            25
3    632          676
4    6.03E+004    4.58E+005
5    1.19E+007    2.10E+011
6    4.13E+009    4.41E+022
7    2.14E+012    1.95E+045
8    1.54E+015    3.79E+090

Figure: B(h) and F(h) plotted against h = 1, ..., 8 on a logarithmic axis from 1.00E+000 to 1.00E+100.
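The F(h) column can be reproduced with a short sketch. The recurrence used here is not stated on the slides; it is the standard counting argument for a perfect binary tree (each child of the root is either left out or contributes one of its own F(h-1) root-containing subtrees), so it should be read as an assumption. The function names are hypothetical. B(h) is not computed because its formula is not available here.

```python
def subtrees_containing_root(h):
    """F(h) for a perfect binary tree of height h, via F(h) = (F(h-1) + 1)**2."""
    count = 1  # height 0: the root alone
    for _ in range(h):
        count = (count + 1) ** 2
    return count


Q = 1.50284  # constant q as quoted in the text (rounded)


def closed_form(h):
    """Approximation q**(2**(h+1)) - 1; only as accurate as the rounded q."""
    return Q ** (2 ** (h + 1)) - 1


for h in range(1, 9):
    exact = subtrees_containing_root(h)
    print(f"h={h}  F(h)={float(exact):.2E}  closed form={closed_form(h):.2E}")
```

Python integers are unbounded, so the exact counts stay exact even at h = 8, where the value has about 90 decimal digits.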


BBRC Implementation


Use paths as candidate backbones. Idea: mine BBRCs and represent each BBRC by its most (χ²-) significant member.

• χ² thresholds cannot be used for anti-monotonic pruning; however, an upper bound for the χ² values of refinements of a pattern exists [1] (Statistical Metric Pruning).

Dynamic Upper Bound Pruning: the χ² threshold may be increased during depth-first traversal, since we only search for the maximal elements of classes.

• In case of several most significant members, use the most general one.

[1] S. Morishita and J. Sese. Traversing Itemset Lattices with Statistical Metric Pruning. In Symposium on Principles of Database Systems, pages 226–236, 2000.
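The two pruning ideas can be sketched as follows. This is a toy illustration, not the libfminer implementation: patterns are hand-written tuples `(name, x, y, children)` where x is the number of graphs containing the pattern and y the number of active graphs among them, and all function names are hypothetical. The bound evaluates χ² at the two extreme supports a refinement could still reach, which is the convexity argument behind statistical metric pruning.

```python
def chi2(x, y, n, m):
    """Chi-square of a pattern found in x of n graphs, y of them among the m actives."""
    a, b = y, x - y                    # pattern present: active / inactive
    c, d = m - y, (n - m) - (x - y)    # pattern absent:  active / inactive
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        return 0.0                     # degenerate table, no association
    return n * (a * d - b * c) ** 2 / margins


def chi2_upper_bound(x, y, n, m):
    """Largest chi-square any refinement (support shrinks, never grows) can reach."""
    return max(chi2(y, y, n, m), chi2(x - y, 0, n, m))


def most_significant(root, n, m):
    """Pre-order search for the best member of one class with a rising threshold."""
    best_score, best_name, visited = -1.0, None, []
    stack = [root]
    while stack:
        name, x, y, children = stack.pop()
        visited.append(name)
        score = chi2(x, y, n, m)
        if score > best_score:         # strict: on ties the earlier, more general pattern stays
            best_score, best_name = score, name
        # Dynamic threshold: expand only if a refinement could beat the current best.
        if chi2_upper_bound(x, y, n, m) > best_score:
            stack.extend(reversed(children))
    return best_name, best_score, visited


# Toy class: 100 graphs, 50 of them active.
toy = ("A", 20, 15, [
    ("AB", 12, 12, [("ABC", 5, 5, [])]),
    ("AD", 6, 1, [("ADE", 3, 0, [])]),
])
print(most_significant(toy, n=100, m=50))
```

In the toy class the refinements ABC and ADE are never scored: once AB has raised the threshold, the bounds of AB and AD can no longer exceed it. A static bound with a fixed user threshold would have no such effect on branches that are merely worse than the best member found so far.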


BBRC Experiments (1)

Investigation of BBRCs regarding time efficiency, feature set sizes and expressiveness

• BBRC Representatives: most significant representatives of the backbone refinement classes.

• Significant Trees: all trees that are frequent and significant.

• Open Trees [1]: most general significant trees with the same occurrences.

Class-balanced CPDB datasets:

• Salmonella Mutagenicity (SM, 388 active / 810 compounds)

• Rat Carcinogenicity (RC, 459 active / 1145 compounds)

• Mouse Carcinogenicity (MoC, 428 active / 927 compounds)

• Multicell Call (MuC, 553 active / 1067 compounds)

[1] B. Bringmann, A. Zimmermann, L. de Raedt, and S. Nijssen. Don’t Be Afraid of Simpler Patterns. In Proceedings 10th PKDD, pages 55–66. Springer-Verlag, 2006.


BBRC Experiments (2)

Feature Set Sizes

Dataset   Sign. Trees   Open Trees   BBRC Repr.
SM        27,093        8,062        2,715
RC        94,991        4,569        5,183
MoC       22,395        1,937        3,083
MuC       29,970        5,122        3,636

Minimum frequency: 6


BBRC Experiments (3)

Time Efficiency

Dataset   No statistical pruning   Static UB pruning   Dynamic UB pruning
SM        2.63                     2.55                0.44
RC        21.23                    21.11               6.63
MoC       3.71                     2.98                2.13
MuC       5.17                     4.76                1.76

Minimum frequency: 6


BBRC Experiments (4)

Accuracy, Sensitivity, Specificity

Figure legend: Black: Sign. Trees; Dark Gray: BBRC-R.; Light Gray: Open Trees

Dataset   Subset   Sign. Tr.   Open Tr.   BBRC-R.
SM        all      74.6        75.5       74.6
SM        AD       80.7        80.6       79.4
SM        wt.      86.8        84.5       85.4
RC        all      64.4        64.5       67.2
RC        AD       70.0        68.7       70.4
RC        wt.      81.8        80.0       82.2
MoC       all      73.3        71.5       71.7
MoC       AD       75.7        74.4       76.5
MoC       wt.      83.7        80.8       82.0
MuC       all      71.9        70.2       70.3
MuC       AD       75.6        73.5       74.1
MuC       wt.      83.5        81.3       84.9

Instance-based predictions

all: all predictions
AD: top 80% confidence predictions
wt.: predictions weighted by confidence
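The three evaluation subsets (all, AD, wt.) can be made concrete with a small sketch. The slides do not give the formulas, so the definitions below are assumptions: "AD" is taken as accuracy over the 80% of predictions with the highest confidence, and "wt." as the confidence-weighted fraction of correct predictions. The function name and the toy data are hypothetical.

```python
def accuracy_variants(correct, confidence, keep=0.8):
    """Return (all, AD, wt.) accuracies for parallel lists of 0/1 hits and confidences."""
    n = len(correct)
    acc_all = sum(correct) / n

    # AD: keep only the most confident share of the predictions.
    order = sorted(range(n), key=lambda i: confidence[i], reverse=True)
    top = order[:max(1, int(round(keep * n)))]
    acc_ad = sum(correct[i] for i in top) / len(top)

    # wt.: each prediction counts in proportion to its confidence.
    acc_wt = sum(c * w for c, w in zip(correct, confidence)) / sum(confidence)
    return acc_all, acc_ad, acc_wt


hits = [1, 1, 0, 1, 0]
conf = [0.9, 0.8, 0.7, 0.6, 0.1]
print(accuracy_variants(hits, conf))
```

When wrong predictions tend to have low confidence, both AD and wt. come out above the plain accuracy, which matches the ordering of the three rows per dataset in the table.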


BBRC Experiments (5)

Figure: Euclidean embedding based on co-occurrences and entropy [1]. Legend: active / inactive compounds; activating / deactivating features.

• Differently colored features are nearly perfectly separated.

• Features are well distributed, with few clusters.

[1] Hannes Schulz, Kristian Kersting, Andreas Karwath. ILP, the Blind, and the Elephant: Euclidean Embedding of Co-Proven Queries. In Proceedings of the 19th International Conference on Inductive Logic Programming (ILP 2009), forthcoming.


Large-Scale Analysis (1)

Large Scale Analysis

NCI Yeast Anticancer Drug Screen datasets (April 2002 release)

1. AC-One (stage 0): 87,264 compounds, 12,068 active

2. AC-All (stage 0): 87,264 compounds, 5,777 active

3. AC-All (stage 1): 10,924 compounds, 5,433 active

To the best of the authors' knowledge, datasets 1 and 2 are the largest labelled datasets that have been considered in correlated graph mining.


Large-Scale Analysis (2)

Effects of Minimum Frequency on Dataset Coverage

AC-One (stage 0), 87,264 compounds:

Min. freq.      Coverage   Time
100 (~0.12%)    47.1       36m40s
200 (~0.23%)    44.7       19m40s

BBRC descriptors are more probable in lighter regions.

Similar results were obtained for the other datasets*.

* The effects of not using aromatic perception, i.e. no special node and edge labels for aromatic bonds, were much greater. The number of descriptors per compound in this setting was > 80 for both thresholds.


Large-Scale Analysis (3)

Feature Count for Balanced Datasets (downsampling)

Feature type     AC-one (stage 0), 23,400 comp.   AC-all (stage 1), 10,548 comp.
Sign. Trees      1,190,763                        291,729
Open Trees       Memory alloc. error              216,206
Max. Trees [1]   556,673                          148,562
BBRC Repr.       31,450                           14,381

Max. Trees: the positive border as implied by minimum frequency and significance constraints [1].

[1] M. Al Hasan et al. Origami: Mining Representative Orthogonal Graph Patterns. ICDM 2007, Seventh IEEE International Conference on Data Mining, pages 153–162, Oct. 2007.


Large-Scale Analysis (4)

Time efficiency (Mining)

AC-one (st. 0), 23,400 comp.: 4m52s
AC-all (st. 1), 10,548 comp.: 1m13s
Open Trees: mining times of ~12h

Time efficiency (Prediction)

AC-one (st. 0): 11.1s
AC-all (st. 1): 4.7s
Open Trees: prediction times of >60s, impractical RAM demand.

Accuracy

Figure: accuracy (axis from 62 to 80) for AC-one (st. 0) and AC-all (st. 1) under the settings all, AD, and wt.

all: all predictions
AD: top 80% confidence predictions
wt.: predictions weighted by confidence


SUMMARY


Summary (1)

Backbone Refinement Class Representatives

• Structurally heterogeneous descriptors; compression by a structural invariant (backbone constraint)

• Good dataset coverage, robust against increasing minimum frequencies

• Applicable to large-scale graph databases through a novel statistical pruning technique


Summary (2)

Backbone Refinement Class Representatives

• Compression of 90% compared to all trees and 31% compared to open trees

• Time efficiency improved by 85% and 83% versus no statistical pruning and static upper bound pruning, respectively

• Discriminative potential similar to the complete set of trees, but significantly better than open trees


Acknowledgements

The authors would like to thank Björn Bringmann for providing a binary and for his friendly cooperation in dataset testing, and Ulrich Rückert for providing datasets.

The research was (partially) supported by the EU Seventh Framework Programme under contract no. Health-F5-2008-200787 (OpenTox).

http://www.opentox.org

C++ implementation: http://www.maunz.de/libfminer-doc

