12/4/2003
Computing frequent patterns from semi-structured data
Natalia Vanetik, Ehud Gudes and Eyal S. Shimony
Department of Computer Science, Ben-Gurion University of the Negev
Talk outline
Part 1: Introduction
  1.1 Motivation and background
  1.2 Overview of the graph mining algorithm
Part 2: Details
  2.1 The support measure
  2.2 Combining graphs
  2.3 Details of the mining algorithm
Part 3: Experimental evaluation
Part 4: Conclusions and future work
What Graphs are good for?
Most existing data mining algorithms are based on a transaction representation, i.e., sets of items. Datasets with structure, layers, hierarchy and/or geometry often do not fit well into this transaction setting, e.g.:
• Numerical simulations
• 3D protein structures
• Chemical compounds
• Generic XML files
Graph Based Data Mining
Graph mining, then, is essentially the problem of discovering repetitive subgraphs occurring in the input graphs.
Motivation:
• finding subgraphs capable of compressing the data by abstracting instances of the substructures
• identifying conceptually interesting patterns
Overview
Frequent patterns discovered from semi-structured data are useful for:
• Improving database design (A. Deutsch, M. Fernandez, D. Suciu, “Storing Semistructured Data with STORED”, SIGMOD’99)
• Efficient indexing (R. Goldman, J. Widom, “DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases”, VLDB’97)
• User preference based applications
• User behavior prediction
• Database storage and archival
Semi-structured data is any data that can be modeled as a labeled graph, for example XML and HTML data, or user access patterns.
Overview
Rule-based patterns: patterns of the form A1,A2,…,An ⇒ B where A1,…,An,B are atomic values. Example: “diapers ⇒ beer”.
Topology-based patterns: patterns that have structure in addition to atomic values. Example: graph patterns.
Overview
Definition: A transaction database is a set of records where each record contains the results of a single transaction.
Example: a supermarket database where each purchase is a transaction.
An item-set X in the relational model is a set of tuples (f1,v1),…,(fn,vn) where the fi are names of fields and the vi are values; n is the size of the item-set.
A transaction T supports X if the value of fi equals vi for every i.
Let the user-defined support threshold be s%. An item-set X is frequent if s% or more transactions in the database support X.
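As a minimal sketch (not from the talk), item-set support in this transaction model can be computed by counting transactions that contain every (field, value) pair; the database below is a made-up example.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every (field, value) pair
    of the item-set."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# A hypothetical 3-transaction supermarket database:
db = [
    {("item", "diapers"), ("item", "beer")},
    {("item", "diapers"), ("item", "milk")},
    {("item", "beer")},
]
print(support({("item", "diapers"), ("item", "beer")}, db))  # 1/3
```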
Apriori algorithm
Apriori algorithm: each candidate item-set X of size n is first generated from two frequent item-sets X1 and X2 of size (n-1), and then its frequency is evaluated by a pass over the database.
Apriori principle: if X is a frequent item-set of size n, then all item-sets X’ contained in X are frequent as well.
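The generate-then-prune step can be sketched as follows; `apriori_gen` is a hypothetical helper name, and item-sets are modeled as frozensets.

```python
from itertools import combinations

def apriori_gen(frequent, n):
    """Generate size-n candidates by joining frequent (n-1)-item-sets,
    then prune any candidate that has an infrequent (n-1)-subset
    (the Apriori principle)."""
    candidates = set()
    for a, b in combinations(frequent, 2):
        union = a | b
        if len(union) == n and all(
            frozenset(sub) in frequent for sub in combinations(union, n - 1)
        ):
            candidates.add(frozenset(union))
    return candidates

L2 = {frozenset("ab"), frozenset("bc"), frozenset("ac")}
print(apriori_gen(L2, 3))  # the single candidate {a, b, c}
```

If any 2-subset of {a,b,c} were missing from L2, the candidate would be pruned before the database pass.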
Background
Existing algorithms
• Simple path patterns (Chen,Park,Yu 98)
• Generalized path patterns (Nanopoulos,Manolopoulos 01)
• Simple tree patterns (Lin,Liu,Zhang, Zhou 98)
• Tree-like patterns (Wang,Huiqing,Liu 98)
• “Naïve” graph patterns (Kuramochi,Karypis 01)
FSG algorithm [M. Kuramochi and G. Karypis, “Frequent subgraph discovery”]
Works incrementally and breadth-first in the size of frequent subgraphs (like Apriori for frequent item-sets):
• Count single-edge and double-edge subgraphs.
• For finding frequent subgraphs of size k (k > 2):
  - Candidate generation: all possible joinings of two graphs of size k-1 which share a common kernel subgraph of size k-2.
  - Candidate pruning: a necessary condition for a candidate to be frequent is that each of its subgraphs is frequent.
  - Frequency counting: check whether a candidate subgraph appears at least minSup (minimum support) times.
  - Repeat the steps for k = k+1.
gSpan algorithm [X. Yan and J. Han, “gSpan: Graph-based substructure pattern mining”]
Adopts pattern growth, growing patterns from a single graph directly:
• The algorithm maps each subgraph to a unique label.
• Using these labels, a Tree Search Space (TSS) hierarchy is constructed over all possible subgraphs; a subgraph of size k is kept in a node at depth k in the TSS.
• An in-order search over the TSS discovers all frequent subgraphs.
• Pruning: if a node in the TSS holds an infrequent subgraph, its sub-tree in the TSS is pruned.
Overview
Model: a semistructured database is viewed as a (labeled) graph. A pattern is any connected subgraph of a database graph.
Goal: find all frequent connected subgraphs of a database graph.
Related problem: what if there are no transactions in the database (when it is a single graph, such as an XML file, the Web, etc.)? How do we count pattern instances? What is the support of a graph pattern?
Most existing algorithms use transaction-based databases!
We assume a single-graph database!
Overview (contd.)
Most existing algorithms use Apriori-based approach.
The basic building block is either a tree (for tree mining only) or an edge.
Our approach: use edge-disjoint paths as building blocks.
Result: faster convergence of the algorithm
Path Facts
Definition: The path number p(G) of a graph is the minimal number of edge-disjoint paths that cover all edges in the graph. A collection of p(G) paths that cover all edges is called a minimal path cover.
Graph G is Eulerian if it can be covered by a single cyclic path (in this case, p(G) = 1).
For a non-Eulerian connected graph G, the following is true:
  p(G) = |{v ∈ V : d(v) is odd}| / 2   (for an undirected graph)
  p(G) = (1/2) Σ_{v ∈ V} |d⁺(v) − d⁻(v)|   (for a directed graph)
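The undirected case (half the number of odd-degree vertices, with a minimum of one path for the Eulerian case) can be checked with a small sketch; connectivity is assumed, not verified.

```python
from collections import Counter

def path_number_undirected(edges):
    """p(G) for a connected undirected graph: half the number of
    odd-degree vertices, with a minimum of 1 (the Eulerian case)."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    odd = sum(1 for d in deg.values() if d % 2 == 1)
    return max(odd // 2, 1)

# A 3-leaf star has four odd-degree vertices, so it needs
# two edge-disjoint paths to cover its edges:
print(path_number_undirected([("c", "a"), ("c", "b"), ("c", "d")]))  # 2
```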
Path Facts
Definition: G’ = G\P, where P is an edge-disjoint path, denotes the graph obtained by removing from G all edges of P, followed by removal of all trivial sub-graphs.
Claim 1: Let P be any path from a minimal path cover of a connected graph G. Then p(G\P) = p(G)-1.
Claim 2: In any path cover of a connected graph G there are at least two paths P1, P2 such that G\P1 and G\P2 are connected.
We also define an order ≤P on paths in order to represent a path decomposition of a graph in a unique way. We only store decompositions that are minimal with respect to this order (denoted P-minimal).
Three phases of mining algorithm
Phase #1 finds all frequent graph patterns with path number 1
Phase #2 finds all frequent graph patterns with path number 2 by “joining” pairs of patterns found in phase #1
Phase #3 finds all frequent graph patterns with path number n ≥ 3 by “joining” pairs of patterns with path number (n-1).
Support issue
Definition: A support measure S is admissible if for any pattern P and any sub-pattern Q ⊂ P, S(Q) ≥ S(P).
Problem: the number of appearances of the graph pattern in the database graph is not an admissible support measure.
Graph A appears 3 times in the database graph, while graph B ⊂ A appears only once.
Support issue
Definition: An instance graph I(P) of pattern P in database graph D is a graph G = (V,E) where V = {g ⊂ D : g ≈ P} and E = {(g,h) : g,h ∈ V and E(g) ∩ E(h) ≠ ∅}.
Operations on instance graph:
• clique contraction: replacing a clique C by a single node c such that only the nodes that were adjacent to each node of C may become adjacent to c
• node expansion: replacing an existing node v by a new subgraph whose nodes may or may not be adjacent to the nodes adjacent to v
• node addition: adding a new node to the graph and arbitrary edges between the new node and the old ones
• edge removal
Example of operations on instance graph
• clique contraction • vertex addition
• vertex expansion • edge removal
The main result
Theorem A support measure S is an admissible support measure ifit is non-decreasing on instance graph I(P) of every pattern P under clique contraction, node expansion, node addition and edge removal.
Note we proved a stronger result:support measure is admissible if and only if it is non-decreasingon I(P) under these operations.
Example of support measure
Admissible support measure:
support(P) = (maximum independent set size of the instance graph I(P)) / (number of edges in the database graph)
Motivation:
We are interested in typical structures, i.e., structures created by many users.
A single complex structure that has many references is less interesting for us.
Most common support measures are covered by this definition,including the standard one for transaction databases.
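A brute-force sketch of this measure, assuming pattern occurrences are given as edge sets; `mis_support` and `mis_size` are hypothetical helper names, and exact MIS computation is only feasible for tiny instance graphs.

```python
from itertools import combinations

def mis_size(n, adj):
    """Exact maximum independent set size by brute force (tiny graphs only)."""
    for r in range(n, 0, -1):
        for sub in combinations(range(n), r):
            if all(b not in adj[a] for a, b in combinations(sub, 2)):
                return r
    return 0

def mis_support(instances, num_db_edges):
    """Instances are the edge sets of a pattern's occurrences; two
    occurrences are adjacent in the instance graph when they share
    an edge. Support = MIS size / database edge count."""
    n = len(instances)
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if instances[i] & instances[j]:
            adj[i].add(j)
            adj[j].add(i)
    return mis_size(n, adj) / num_db_edges

# Three occurrences; the first two share edge (a,b), the third is disjoint:
occurrences = [{("a", "b")}, {("a", "b"), ("b", "c")}, {("c", "d")}]
print(mis_support(occurrences, 10))  # MIS = 2, so support 0.2
```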
Graph composition - composition relation
Definition: A composition relation C(P1,…,Pn) on paths P1,…,Pn of graph G is a table with nodes of G as rows and paths as columns such that C[i,j] ≠ ⊥ iff the i-th node of G is also a node of path Pj.
Example: C(P1,P2,P3) as a table

Node | P1 | P2 | P3
v1   | a1 | ⊥  | ⊥
v2   | a2 | b2 | ⊥
v3   | a3 | ⊥  | ⊥
v4   | ⊥  | b1 | ⊥
v5   | ⊥  | b3 | c3
v6   | ⊥  | ⊥  | c1
v7   | ⊥  | ⊥  | c2
Graph composition
By treating table rows as graph nodes, and defining edges (i,j) whenever two nodes of a path Pk appearing in rows i and j have an edge between them, we can construct a graph corresponding to the composition relation C(P1,…,Pn).
Notation: the graph composition (realization) Ω(C) of C(P1,…,Pn).
Example: Graph composition Ω(C(P1,P2,P3)).
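A sketch of the realization Ω(C), under the assumed reading that cell ai/bi/ci marks the i-th node of the corresponding path, so each column can be handed over as an ordered node list.

```python
def realize(paths):
    """Sketch of the realization Ω(C): each argument is one path of the
    relation, given as the ordered list of database nodes it passes
    through; consecutive nodes contribute the edges of Ω(C)."""
    nodes, edges = set(), set()
    for p in paths:
        nodes |= set(p)
        for u, v in zip(p, p[1:]):
            edges.add(frozenset((u, v)))
    return nodes, edges

# The slides' example C(P1,P2,P3), columns read off in path order:
nodes, edges = realize([["v1", "v2", "v3"],
                        ["v4", "v2", "v5"],
                        ["v6", "v7", "v5"]])
print(len(nodes), len(edges))  # 7 nodes, 6 edges
```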
Subtraction from Composition Relation
Definition: Subtraction of a path Pi from a composition relation consists of: a) eliminating the i-th column from the table; b) removal of all rows containing only null values.
Example: C(P1,P2,P3) after subtraction of P3, denoted C\P3 or C|1,2

Node | P1 | P2
v1   | a1 | ⊥
v2   | a2 | b2
v3   | a3 | ⊥
v4   | ⊥  | b1
v5   | ⊥  | b3
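Modeling the table as a dictionary of rows (missing keys play the role of ⊥), subtraction is a per-row filter; the example relation is the one from the slides.

```python
def subtract(table, path):
    """Sketch of C \\ Pi: drop the path's column, then drop rows that
    are left with only nulls (here: rows whose dict becomes empty)."""
    out = {}
    for node, row in table.items():
        kept = {p: x for p, x in row.items() if p != path}
        if kept:
            out[node] = kept
    return out

# The composition relation C(P1,P2,P3) from the slides:
C = {"v1": {"P1": "a1"}, "v2": {"P1": "a2", "P2": "b2"},
     "v3": {"P1": "a3"}, "v4": {"P2": "b1"},
     "v5": {"P2": "b3", "P3": "c3"}, "v6": {"P3": "c1"},
     "v7": {"P3": "c2"}}
C12 = subtract(C, "P3")  # C|1,2: rows v6 and v7 vanish
```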
Bijective sum
Definition: A bijective sum BS(C1,C2,I1,I2) of composition relations C1 and C2, where I1, I2 are sets of indices and C1|I1 = C2|I2, is a composition relation obtained by adding all columns of C2 corresponding to paths that are not in C1 to the table of C1.
Example: Bijective sum of C1 and C2 on common paths P1 and P2.

C1:
Node | P1 | P2 | P3
v1   | a1 | ⊥  | ⊥
v2   | a2 | b2 | ⊥
v3   | a3 | ⊥  | ⊥
v4   | ⊥  | b1 | ⊥
v5   | ⊥  | b3 | c3
v6   | ⊥  | ⊥  | c1
v7   | ⊥  | ⊥  | c2

C2:
Node | P1 | P2 | P4
v1   | a1 | ⊥  | ⊥
v2   | a2 | b2 | ⊥
v3   | a3 | ⊥  | ⊥
v4   | ⊥  | b1 | d1
v5   | ⊥  | b3 | ⊥
v6   | ⊥  | ⊥  | d2
v7   | ⊥  | ⊥  | d3

C3 = BS(C1,C2) (C2's nodes outside C1 become the fresh nodes v8, v9):
Node | P1 | P2 | P3 | P4
v1   | a1 | ⊥  | ⊥  | ⊥
v2   | a2 | b2 | ⊥  | ⊥
v3   | a3 | ⊥  | ⊥  | ⊥
v4   | ⊥  | b1 | ⊥  | d1
v5   | ⊥  | b3 | c3 | ⊥
v6   | ⊥  | ⊥  | c1 | ⊥
v7   | ⊥  | ⊥  | c2 | ⊥
v8   | ⊥  | ⊥  | ⊥  | d2
v9   | ⊥  | ⊥  | ⊥  | d3
Graph composition of a bijective sum
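A sketch of the bijective sum on such dictionary tables, under the simplifying assumption that the two relations already agree on the common paths and that C2-only nodes carry fresh names (v8, v9 in this made-up encoding of the slides' example).

```python
def bijective_sum(c1, c2, common):
    """Sketch of BS(C1,C2,I1,I2): add C2's non-common columns (and any
    rows they need) to a copy of C1. Assumes C1|I1 = C2|I2 already holds
    with identical node names, and C2-only nodes have fresh names."""
    out = {node: dict(row) for node, row in c1.items()}
    for node, row in c2.items():
        extra = {p: x for p, x in row.items() if p not in common}
        if extra:
            out.setdefault(node, {}).update(extra)
    return out

c1 = {"v1": {"P1": "a1"}, "v2": {"P1": "a2", "P2": "b2"},
      "v3": {"P1": "a3"}, "v4": {"P2": "b1"},
      "v5": {"P2": "b3", "P3": "c3"}, "v6": {"P3": "c1"},
      "v7": {"P3": "c2"}}
c2 = {"v1": {"P1": "a1"}, "v2": {"P1": "a2", "P2": "b2"},
      "v3": {"P1": "a3"}, "v4": {"P2": "b1", "P4": "d1"},
      "v5": {"P2": "b3"}, "v8": {"P4": "d2"}, "v9": {"P4": "d3"}}
c3 = bijective_sum(c1, c2, {"P1", "P2"})  # 9 rows, columns P1..P4
```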
Splice
Definition: A splice ⊕i,j of two composition relations C1(P1,…,Pn) and C2(Pi,Pj) is a composition relation that turns every node common to Pi and Pj in C2 into a node common to Pi and Pj in C1 as well.
Example: C3 = C1(P1,P2,P3) ⊕2,3 C2(P2,P3).
C1:
Node | P1 | P2 | P3
v1   | a1 | ⊥  | ⊥
v2   | a2 | b2 | ⊥
v3   | a3 | ⊥  | ⊥
v4   | ⊥  | b1 | ⊥
v5   | ⊥  | b3 | c3
v6   | ⊥  | ⊥  | c1
v7   | ⊥  | ⊥  | c2

C2 (P2 and P3 share their first node):
Node | P2 | P3
v2   | b1 | c1
v4   | b2 | ⊥
v5   | b3 | c3
v7   | ⊥  | c2

C3 = C1 ⊕2,3 C2 (c1 moves to v4; the emptied row v6 disappears):
Node | P1 | P2 | P3
v1   | a1 | ⊥  | ⊥
v2   | a2 | b2 | ⊥
v3   | a3 | ⊥  | ⊥
v4   | ⊥  | b1 | c1
v5   | ⊥  | b3 | c3
v7   | ⊥  | ⊥  | c2
Graph composition of a splice
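A sketch of the splice on the same dictionary-table representation; it assumes every relevant pi-cell of C2 also occurs somewhere in C1 (otherwise `next` raises StopIteration), which is a simplification of the definition above.

```python
def splice(c1, c2, pi, pj):
    """Sketch of C1 ⊕i,j C2: for every C2 node lying on both pi and pj,
    relocate the matching pj-cell in C1 onto the C1 node holding the
    matching pi-cell; rows left empty are dropped."""
    out = {node: dict(row) for node, row in c1.items()}
    for row in c2.values():
        if pi in row and pj in row:
            x, y = row[pi], row[pj]
            target = next(n for n, r in out.items() if r.get(pi) == x)
            for n, r in out.items():
                if r.get(pj) == y and n != target:
                    del r[pj]
            out[target][pj] = y
    return {n: r for n, r in out.items() if r}

c1 = {"v1": {"P1": "a1"}, "v2": {"P1": "a2", "P2": "b2"},
      "v3": {"P1": "a3"}, "v4": {"P2": "b1"},
      "v5": {"P2": "b3", "P3": "c3"}, "v6": {"P3": "c1"},
      "v7": {"P3": "c2"}}
# C2(P2,P3): cells b1 and c1 share a node (u-names are C2's own nodes):
c2 = {"u2": {"P2": "b1", "P3": "c1"}, "u4": {"P2": "b2"},
      "u5": {"P2": "b3", "P3": "c3"}, "u7": {"P3": "c2"}}
c3 = splice(c1, c2, "P2", "P3")  # c1 moves to v4; empty row v6 is dropped
```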
Phase #1 – overview
Phase #1 constructs frequent paths by adding one edge at a time.
If the path is cyclic (i.e., it is a not necessarily simple cycle), we can add an edge anywhere (provided the labels match):
1. between two existing nodes,
2. between an existing node and a new node.
If the path is not cyclic, we can add an edge between a pair of nodes one of which is unbalanced:
1. between two existing unbalanced nodes,
2. between an existing unbalanced node and an existing balanced node,
3. between an existing unbalanced node and a new node.
Definition: A node v in graph G is balanced if the degree of v is even (for undirected graphs). A node is unbalanced if it is not balanced.
Phase #1 – Path generation
Algorithm: Phase #1
1. Find all frequent edges and add them to L1,1. Set k ← 2.
2. Set C1,k ← ∅, L1,k ← ∅.
3. For every path P ∈ L1,k-1 and every edge e = (v,u) ∈ L1,1 do:
   a. Let X be all nodes of P if P is cyclic, and all unbalanced nodes of P if P is non-cyclic.
   b. For every x ∈ X such that x ≈ v, add Q = (V(P) ∪ {u}, E(P) ∪ {(x,u)}) to C1,k (if p(Q) = 1).
   c. For every x ∈ X such that x ≈ u, add Q = (V(P) ∪ {v}, E(P) ∪ {(v,x)}) to C1,k (if p(Q) = 1).
   d. For every x,y ∈ X such that x ≈ v, y ≈ u and (x,y) ∉ E(P), add Q = (V(P), E(P) ∪ {(x,y)}) to C1,k (if p(Q) = 1).
4. Compute the frequency of all paths from C1,k and add the frequent ones to L1,k.
5. If L1,k = ∅, stop. Otherwise, set k ← k+1 and go to step 2.
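Step 3a relies on identifying the unbalanced (odd-degree) nodes of a candidate path; a minimal sketch for the undirected case:

```python
from collections import Counter

def unbalanced_nodes(edges):
    """Odd-degree nodes of an undirected graph; step 3a extends a
    non-cyclic path only at these."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return {n for n, d in deg.items() if d % 2 == 1}

# A path a-b-c can only grow at its endpoints a and c:
print(unbalanced_nodes([("a", "b"), ("b", "c")]))
```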
Phase #1 – Example
Phase #2 – Path pairs generation
Algorithm: Phase #2
1. Let L1 be the set of all frequent paths. Set C2 ← ∅, L2 ← ∅.
2. For every pair P1,P2 ∈ L1 and every possible label-preserving composition relation C on P1 and P2 do:
   a. If p(Ω(<P1,P2,C>)) = 2, add <P1,P2,C> to C2.
3. Remove all tuples producing non-P-minimal graphs from C2.
4. For every t ∈ C2, if Ω(t) is frequent, add it to L2.
Phase #2 - Example
Join of two paths produced three graphs with path number 2
Phase #3 – overview
Phase #3: Input = frequent graphs with path number k; Output = frequent graphs with path number (k+1).
The main steps:
1. find a common (k-1)-subgraph of two k-graphs,
2. if found, join these graphs into a (k+1)-graph using the bijective sum operation.
Additional step:
3. for a bijective sum G of two graphs, and the two paths P and Q in which these graphs differ, find all frequent combinations of P and Q in L2, and join them with G using the splice operation.
Phase #3 – Graphs with p(G)≥3
Algorithm: Phase #3
1. Let L2 be the set of all frequent path pairs. Set k ← 3.
2. Set Ck ← ∅, Lk ← ∅.
3. For every t1,t2 ∈ Lk-1 such that t1 = <P1,…,Pi-1,Pi+1,…,Pj,…,Pk,C1> and t2 = <P1,…,Pi,…,Pj-1,Pj+1,…,Pk,C2> do:
   a. Let C = BS(C1,C2,(k)-i-j,(k)-i-j).
   b. Add t = <P1,…,Pk,C> to Ck (if p(Ω(t)) = k).
   c. For every t3 = <Pi,Pj,C3> ∈ L2, add t = <P1,…,Pk,C ⊕i,j C3> to Ck (if p(Ω(t)) = k).
4. Remove all non-P-minimal tuples from Ck.
5. Add every t ∈ Ck where Ω(t) is frequent to Lk.
6. If Lk = ∅, stop. Otherwise, set k ← k+1 and go to step 2.
Proof milestones
Theorem 1: All frequent graphs with path number 1 are produced by phase 1 of the algorithm.
Basis: for every path P and unbalanced vertex v of P there exists a vertex u such that (u,v) ∈ E(P) and P\(u,v) is a path.
Theorem 2: All frequent graphs with path number 2 are produced by phase 2 of the algorithm.
Basis: each graph G with p(G) = 2 can be expressed as a label-preserving composition relation on two paths from any of its P-minimal path decompositions.
Theorem 3: All frequent graphs with p(G) > 2 are produced by phase 3.
Main steps:
1. There exist two paths P,Q in a minimal path decomposition of G such that G\P and G\Q are connected.
2. G\P and G\Q are also frequent and were found earlier.
3. If P and Q are disjoint in G, then application of a bijective sum produces G.
4. Otherwise, bijective sum and splice combined produce G.
Complexity
Exponential, as the number of frequent patterns can be exponential in the size of the database.
Difficult tasks:
1. Support computation, which consists of:
   a. finding all instances of a frequent pattern in the database;
   b. computing the MIS (maximum independent set size) of an instance graph.
Relatively easy tasks:
1. Candidate set generation: polynomial in the size of the frequent set from the previous iteration.
2. Elimination of isomorphic candidate patterns: graph isomorphism computation is at worst exponential in the size of a pattern, not of the database.
Phase #3 – complexity
Algorithm: Phase #3
1. Let L2 be the set of all frequent path pairs. Set k ← 3.
2. Set Ck ← ∅, Lk ← ∅.
3. For every t1,t2 ∈ Lk-1 such that t1 = <P1,…,Pi-1,Pi+1,…,Pj,…,Pk,C1> and t2 = <P1,…,Pi,…,Pj-1,Pj+1,…,Pk,C2> do:
   a. Let C = BS(C1,C2,(k)-i-j,(k)-i-j).
   b. Add t = <P1,…,Pk,C> to Ck (if p(Ω(t)) = k).
   c. For every t3 = <Pi,Pj,C3> ∈ L2, add t = <P1,…,Pk,C ⊕i,j C3> to Ck (if p(Ω(t)) = k).
4. Remove all non-P-minimal tuples from Ck.
5. Add every t ∈ Ck where Ω(t) is frequent to Lk.
6. If Lk = ∅, stop. Otherwise, set k ← k+1 and go to step 2.

The dominant cost is step 3, which performs subgraph and graph isomorphism tests: on the order of O(2^(k²) · |Lk-1|² · |L2|).
Complexity (cont.)
Why is mining in real-life databases easier?
• Real databases tend to be sparse rather than dense.
• Real databases tend to have a large number of different labels.
Impact on the algorithm's complexity:
• the number of database subgraphs isomorphic to a given graph pattern is not exponential,
• the size of the instance graph is not exponential,
• instance graphs tend to be very sparse, which makes the task of finding the MIS much easier.
Additional improvements: approximate techniques can be used for MIS computation, as the user usually does not care about the exact support value.
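One such approximation (a judgment call, not the talk's stated method) is the minimum-degree greedy heuristic, which performs well on sparse instance graphs:

```python
def greedy_mis(adj):
    """Minimum-degree greedy heuristic for an independent set:
    repeatedly pick a node of smallest remaining degree and discard
    its neighbors. On very sparse graphs this is close to (and often
    equals) the exact MIS."""
    adj = {v: set(ns) for v, ns in adj.items()}
    chosen = set()
    while adj:
        v = min(adj, key=lambda u: len(adj[u]))
        chosen.add(v)
        dead = adj[v] | {v}
        adj = {u: ns - dead for u, ns in adj.items() if u not in dead}
    return chosen

# On a 3-node path a-b-c the heuristic finds the exact MIS {a, c}:
print(greedy_mis({"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}))
```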
Experiment overview
The goals of our experiments are:
• To compare our algorithm with naïve algorithms:
  - Naive1: produce all graphs and compute their support (B. D. McKay, “Isomorph-free exhaustive generation”, J. of Algorithms vol. 26, 1998)
  - Naive2: at each iteration, add an edge to the frequent graphs from the previous iteration
• To study the algorithm's behavior on various graph topologies: cliques, trees, sparse graphs vs. dense graphs
• To study the effect of the following parameters on the number of frequent patterns found: size of the database, number of different labels
• To test the algorithm on both synthetic and real-life databases
Experiments setting
Experimental results on synthetic data: trees
Notation: S – support; N – nodes; L – labels; E – edges; FP – frequent patterns; C – candidate patterns; I – isomorphism checks; SC – support calculations; ALG – algorithm in use.
#  N   L  S   FP  ALG      C    I    SC
1  40  4  7%  15  Naive2   100  24   92
                  Apriori  41   29   26
7  60  8  5%  14  Naive2   103  18   87
                  Apriori  175  868  276
6  60  6  5%  44  Naive2   728  203  716
                  Apriori  52   47   52
5  60  4  5%  15  Naive2   100  24   92
                  Apriori  119  91   111
4  50  8  3%  27  Naive2   306  62   290
                  Apriori  202  239  205
3  50  6  3%  37  Naive2   470  82   458
                  Apriori  45   45   42
2  50  4  7%  16  Naive2   110  41   102
                  Apriori  52   47   52
Experimental results on synthetic data: sparse graphs

#  N   L  E    S   FP  ALG      C    I    SC
1  40  4  50   7%  14  Naive2   60   33   52
                       Apriori  149  127  141
7  80  8  100  3%  32  Naive2   403  74   387
                       Apriori  126  98   110
6  70  8  90   3%  27  Naive2   252  77   236
                       Apriori  120  102  110
5  60  6  80   3%  27  Naive2   265  86   253
                       Apriori  56   58   56
4  60  4  80   4%  16  Naive2   101  31   93
                       Apriori  117  185  143
3  50  6  60   5%  28  Naive2   355  74   343
                       Apriori  59   70   54
2  40  6  50   5%  17  Naive2   84   48   76
                       Apriori  49   55   42
Subsets of the Movie database used in experiments:

Data set #  Nodes  Edges  Labels
1           12656  13878  112
2           8337   9416   25
3           7027   7851   22
4           4730   4813   90
5           2757   2794   76
6           1293   1292   91
Experimental results on subsets of the Movie database

Number of frequent patterns per support level and data set:

Support  #1  #2  #3  #4  #5  #6
5%       22  14  11  16  11  76
6%       21  13  11  12  10  76
7%       18  12  10  12  9   84
8%       16  12  10  12  8   84
9%       16  12  10  11  7   79
10%      15  12  10  11  7   79
20%      8   6   5   9   6   46
30%      7   5   4   8   5   34
40%      6   4   4   5   2   32
50%      5   3   3   5   2   21
60%      4   3   3   4   2   9
70%      4   3   3   4   2   5
80%      3   3   3   3   2   5
90%      3   3   3   3   2   4
Comparison
Our algorithm vs. the naive ones:
• The Naive1 algorithm does not work on graphs with ≥ 10 nodes.
• Our algorithm produces fewer candidate patterns and therefore performs fewer support computations than the Naive2 algorithm.
Trees vs. sparse graphs:
• Support computation is easier for trees.
• Fewer candidate patterns are generated for trees.
Synthetic vs. real-life data:
• Synthetic graphs are not very regular. When the number of labels increases, the chance of finding non-trivial frequent graph patterns decreases drastically.
• Large real-life graph databases are highly regular and contain complex frequent graph patterns.
Experimental results (on synthetic data)
Pattern examples in Movies database
Conclusion
An Apriori-like algorithm for mining graph patternsthat uses edge-disjoint paths as building blocks has been constructed.
The problem of defining a support measure for semi-structured data was addressed.
An experimental analysis of the algorithm was conducted.
Future work
Use building blocks other than edge-disjoint paths, such as trees.
Use the Apriori-TID technique at the advanced stages of the search.
Treat patterns that have a high degree of resemblance, such as bisimilar patterns, as representatives of their equivalence classes, and generate representatives of each class instead of performing the full search.
Find additional examples of admissible support measures.
Take into account topological properties of a database graph while computing support.
Compare with the gSpan algorithm using our support measure.