12/4/2003
Computing frequent patterns from semi-structured data
Natalia Vanetik, Ehud Gudes and Eyal S. Shimony
Department of Computer Science, Ben-Gurion University of the Negev
Talk outline
Part 1: Introduction
  1.1 Motivation and background
  1.2 Overview of the graph mining algorithm
Part 2: Details
  2.1 The support measure
  2.2 Combining graphs
  2.3 Details of the mining algorithm
Part 3: Experimental evaluation
Part 4: Conclusions and future work
What Graphs are good for?
Most existing data mining algorithms are based on a transaction representation, i.e., sets of items. Datasets with structure, layers, hierarchy and/or geometry often do not fit well into this transaction setting, e.g.:
• Numerical simulations
• 3D protein structures
• Chemical compounds
• Generic XML files
Graph Based Data Mining
Graph mining, then, is essentially the problem of discovering repetitive subgraphs occurring in the input graphs.
Motivation:
• finding subgraphs capable of compressing the data by abstracting instances of the substructures
• identifying conceptually interesting patterns
Overview
Frequent patterns discovered from semi-structured data are useful for:
• Improving database design (A. Deutsch, M. Fernandez, D. Suciu, “Storing Semistructured Data with STORED”, SIGMOD’99)
• Efficient indexing (R. Goldman, J. Widom, “DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases”, VLDB’97)
• User preference based applications
• User behavior prediction
• Database storage and archival
Semi-structured data is any data that can be modeled as a labeled graph, for example XML and HTML data, or user access patterns.
Overview
Rule-based patterns: patterns of the form A1,A2,…,An ⇒ B where A1,…,An,B are atomic values. Example: “diapers ⇒ beer”.
Topology-based patterns: patterns that have structure in addition to atomic values. Example: graph patterns.
Overview
Definition: A transaction database is a set of records where each record contains the results of a single transaction.
Example: a supermarket database where each purchase is a transaction.
An item-set X in the relational model is a set of tuples (f1,v1),…,(fn,vn) where the fi are names of fields and the vi are values; n is the size of the item-set.
A transaction T supports X if the value of fi equals vi for every i.
Let the user-defined support threshold be s%. An item-set X is frequent if s% or more transactions in the database support X.
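As a minimal sketch (not from the talk), item-set support in this transaction model can be computed by counting transactions that contain every (field, value) pair; the database below is a made-up example.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every (field, value) pair
    of the item-set."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# A hypothetical 3-transaction supermarket database:
db = [
    {("item", "diapers"), ("item", "beer")},
    {("item", "diapers"), ("item", "milk")},
    {("item", "beer")},
]
print(support({("item", "diapers"), ("item", "beer")}, db))  # 1/3
```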
Apriori algorithm
Apriori algorithm: each candidate item-set X of size n is first generated from two frequent item-sets X1 and X2 of size (n-1), and then its frequency is evaluated by a pass over the database.
Apriori principle: if X is a frequent item-set of size n, then all item-sets X’ contained in X are frequent as well.
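The generate-then-prune step can be sketched as follows; `apriori_gen` is a hypothetical helper name, and item-sets are modeled as frozensets.

```python
from itertools import combinations

def apriori_gen(frequent, n):
    """Generate size-n candidates by joining frequent (n-1)-item-sets,
    then prune any candidate that has an infrequent (n-1)-subset
    (the Apriori principle)."""
    candidates = set()
    for a, b in combinations(frequent, 2):
        union = a | b
        if len(union) == n and all(
            frozenset(sub) in frequent for sub in combinations(union, n - 1)
        ):
            candidates.add(frozenset(union))
    return candidates

L2 = {frozenset("ab"), frozenset("bc"), frozenset("ac")}
print(apriori_gen(L2, 3))  # the single candidate {a, b, c}
```

If any 2-subset of {a,b,c} were missing from L2, the candidate would be pruned before the database pass.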
Background
Existing algorithms
• Simple path patterns (Chen,Park,Yu 98)
• Generalized path patterns (Nanopoulos,Manolopoulos 01)
• Simple tree patterns (Lin,Liu,Zhang, Zhou 98)
• Tree-like patterns (Wang,Huiqing,Liu 98)
• “Naïve” graph patterns (Kuramochi,Karypis 01)
FSG algorithm [M. Kuramochi and G. Karypis, “Frequent subgraph discovery”]
Works incrementally and breadth-first in the size of frequent subgraphs (like Apriori for frequent item-sets):
• Count single-edge and double-edge subgraphs.
• For finding frequent subgraphs of size k (k > 2):
  - Candidate generation: all possible joinings of two graphs of size k-1 which share a common kernel subgraph of size k-2.
  - Candidate pruning: a necessary condition for a candidate to be frequent is that each of its subgraphs is frequent.
  - Frequency counting: check whether a candidate subgraph appears at least minSup (minimum support) times.
  - Repeat the steps for k = k+1.
gSpan algorithm [X. Yan and J. Han, “gSpan: Graph-based substructure pattern mining”]
Adopts pattern growth, growing patterns from a single graph directly:
• The algorithm maps each subgraph to a unique label.
• Using these labels, a Tree Search Space (TSS) hierarchy is constructed over all possible subgraphs; a subgraph of size k is kept in a node at depth k in the TSS.
• An in-order search over the TSS discovers all frequent subgraphs.
• Pruning: if a node in the TSS holds an infrequent subgraph, its sub-tree in the TSS is pruned.
Overview
Model: a semistructured database is viewed as a (labeled) graph. A pattern is any connected subgraph of a database graph.
Goal: find all frequent connected subgraphs of a database graph.
Related problem: what if there are no transactions in the database (when it is a single graph, such as an XML file, the Web, etc.)? How do we count pattern instances? What is the support of a graph pattern?
Most existing algorithms use transaction-based databases!
We assume a single-graph database!
Overview (contd.)
Most existing algorithms use Apriori-based approach.
The basic building block is either a tree (for tree mining only) or an edge.
Our approach: use edge-disjoint paths as building blocks.
Result: faster convergence of the algorithm
Path Facts
Definition: The path number p(G) of a graph is the minimal number of edge-disjoint paths that cover all edges in the graph. A collection of p(G) paths that cover all edges is called a minimal path cover.
Graph G is Eulerian if it can be covered by a single cyclic path (in this case, p(G) = 1).
For a non-Eulerian connected graph G, the following is true:
  p(G) = |{v ∈ V : d(v) is odd}| / 2   (for an undirected graph)
  p(G) = (1/2) Σ_{v ∈ V} |d⁺(v) − d⁻(v)|   (for a directed graph)
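The undirected case (half the number of odd-degree vertices, with a minimum of one path for the Eulerian case) can be checked with a small sketch; connectivity is assumed, not verified.

```python
from collections import Counter

def path_number_undirected(edges):
    """p(G) for a connected undirected graph: half the number of
    odd-degree vertices, with a minimum of 1 (the Eulerian case)."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    odd = sum(1 for d in deg.values() if d % 2 == 1)
    return max(odd // 2, 1)

# A 3-leaf star has four odd-degree vertices, so it needs
# two edge-disjoint paths to cover its edges:
print(path_number_undirected([("c", "a"), ("c", "b"), ("c", "d")]))  # 2
```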
Path Facts
Definition: G’ = G\P, where P is an edge-disjoint path, denotes the graph obtained by removing from G all edges of P, followed by removal of all trivial sub-graphs.
Claim 1: Let P be any path from a minimal path cover of a connected graph G. Then p(G\P) = p(G)-1.
Claim 2: In any path cover of a connected graph G there are at least two paths P1, P2 such that G\P1 and G\P2 are connected.
We also define an order ≤P on paths in order to represent a path decomposition of a graph in a unique way. We only store decompositions that are minimal with respect to this order (denoted P-minimal).
Three phases of mining algorithm
Phase #1 finds all frequent graph patterns with path number 1
Phase #2 finds all frequent graph patterns with path number 2 by “joining” pairs of patterns found in phase #1
Phase #3 finds all frequent graph patterns with path number n ≥ 3 by “joining” pairs of patterns with path number (n-1).
Support issue
Definition: A support measure S is admissible if for any pattern P and any sub-pattern Q ⊂ P, S(Q) ≥ S(P).
Problem: the number of appearances of the graph pattern in the database graph is not an admissible support measure.
Graph A appears 3 times in the database graph, while graph B ⊂ A appears only once.
Support issue
Definition: An instance graph I(P) of pattern P in database graph D is a graph G = (V,E) where V = {g ⊂ D : g ≈ P} and E = {(g,h) : g,h ∈ V and E(g) ∩ E(h) ≠ ∅}.
Operations on instance graph:
• clique contraction: replacing a clique C by a single node c such that only the nodes that were adjacent to each node of C may become adjacent to c
• node expansion: replacing an existing node v by a new subgraph whose nodes may or may not be adjacent to the nodes adjacent to v
• node addition: adding a new node to the graph and arbitrary edges between the new node and the old ones
• edge removal
Example of operations on instance graph
• clique contraction • vertex addition
• vertex expansion • edge removal
The main result
Theorem A support measure S is an admissible support measure ifit is non-decreasing on instance graph I(P) of every pattern P under clique contraction, node expansion, node addition and edge removal.
Note we proved a stronger result:support measure is admissible if and only if it is non-decreasingon I(P) under these operations.
Example of support measure
Admissible support measure:
support(P) = (maximum independent set size of the instance graph I(P)) / (number of edges in the database graph)
Motivation:
We are interested in typical structures, i.e., structures created by many users.
A single complex structure that has many references is less interesting for us.
Most common support measures are covered by this definition,including the standard one for transaction databases.
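A brute-force sketch of this measure, assuming pattern occurrences are given as edge sets; `mis_support` and `mis_size` are hypothetical helper names, and exact MIS computation is only feasible for tiny instance graphs.

```python
from itertools import combinations

def mis_size(n, adj):
    """Exact maximum independent set size by brute force (tiny graphs only)."""
    for r in range(n, 0, -1):
        for sub in combinations(range(n), r):
            if all(b not in adj[a] for a, b in combinations(sub, 2)):
                return r
    return 0

def mis_support(instances, num_db_edges):
    """Instances are the edge sets of a pattern's occurrences; two
    occurrences are adjacent in the instance graph when they share
    an edge. Support = MIS size / database edge count."""
    n = len(instances)
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if instances[i] & instances[j]:
            adj[i].add(j)
            adj[j].add(i)
    return mis_size(n, adj) / num_db_edges

# Three occurrences; the first two share edge (a,b), the third is disjoint:
occurrences = [{("a", "b")}, {("a", "b"), ("b", "c")}, {("c", "d")}]
print(mis_support(occurrences, 10))  # MIS = 2, so support 0.2
```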
Graph composition - composition relation
Definition: A composition relation C(P1,…,Pn) on paths P1,…,Pn of graph G is a table with nodes of G as rows and paths as columns such that C[i,j] ≠ ⊥ iff the i-th node of G is also a node of path Pj.
Example: C(P1,P2,P3) as a table

Node | P1 | P2 | P3
v1   | a1 | ⊥  | ⊥
v2   | a2 | b2 | ⊥
v3   | a3 | ⊥  | ⊥
v4   | ⊥  | b1 | ⊥
v5   | ⊥  | b3 | c3
v6   | ⊥  | ⊥  | c1
v7   | ⊥  | ⊥  | c2
Graph composition
By treating table rows as graph nodes, and defining edges (i,j) whenever two nodes of a path Pk appearing in rows i and j have an edge between them, we can construct a graph corresponding to the composition relation C(P1,…,Pn).
Notation: the graph composition (realization) Ω(C) of C(P1,…,Pn).
Example: Graph composition Ω(C(P1,P2,P3)).
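A sketch of the realization Ω(C), under the assumed reading that cell ai/bi/ci marks the i-th node of the corresponding path, so each column can be handed over as an ordered node list.

```python
def realize(paths):
    """Sketch of the realization Ω(C): each argument is one path of the
    relation, given as the ordered list of database nodes it passes
    through; consecutive nodes contribute the edges of Ω(C)."""
    nodes, edges = set(), set()
    for p in paths:
        nodes |= set(p)
        for u, v in zip(p, p[1:]):
            edges.add(frozenset((u, v)))
    return nodes, edges

# The slides' example C(P1,P2,P3), columns read off in path order:
nodes, edges = realize([["v1", "v2", "v3"],
                        ["v4", "v2", "v5"],
                        ["v6", "v7", "v5"]])
print(len(nodes), len(edges))  # 7 nodes, 6 edges
```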
Subtraction from Composition Relation
Definition: Subtraction of a path Pi from a composition relation consists of: a) eliminating the i-th column from the table; b) removal of all rows containing only null values.
Example: C(P1,P2,P3) after subtraction of P3, denoted C\P3 or C|1,2

Node | P1 | P2
v1   | a1 | ⊥
v2   | a2 | b2
v3   | a3 | ⊥
v4   | ⊥  | b1
v5   | ⊥  | b3
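Modeling the table as a dictionary of rows (missing keys play the role of ⊥), subtraction is a per-row filter; the example relation is the one from the slides.

```python
def subtract(table, path):
    """Sketch of C \\ Pi: drop the path's column, then drop rows that
    are left with only nulls (here: rows whose dict becomes empty)."""
    out = {}
    for node, row in table.items():
        kept = {p: x for p, x in row.items() if p != path}
        if kept:
            out[node] = kept
    return out

# The composition relation C(P1,P2,P3) from the slides:
C = {"v1": {"P1": "a1"}, "v2": {"P1": "a2", "P2": "b2"},
     "v3": {"P1": "a3"}, "v4": {"P2": "b1"},
     "v5": {"P2": "b3", "P3": "c3"}, "v6": {"P3": "c1"},
     "v7": {"P3": "c2"}}
C12 = subtract(C, "P3")  # C|1,2: rows v6 and v7 vanish
```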
Bijective sum
Definition: A bijective sum BS(C1,C2,I1,I2) of composition relations C1 and C2, where I1, I2 are sets of indices and C1|I1 = C2|I2, is a composition relation obtained by adding all columns of C2 corresponding to paths that are not in C1 to the table of C1.
Example: Bijective sum of C1 and C2 on common paths P1 and P2.

C1:
Node | P1 | P2 | P3
v1   | a1 | ⊥  | ⊥
v2   | a2 | b2 | ⊥
v3   | a3 | ⊥  | ⊥
v4   | ⊥  | b1 | ⊥
v5   | ⊥  | b3 | c3
v6   | ⊥  | ⊥  | c1
v7   | ⊥  | ⊥  | c2

C2:
Node | P1 | P2 | P4
v1   | a1 | ⊥  | ⊥
v2   | a2 | b2 | ⊥
v3   | a3 | ⊥  | ⊥
v4   | ⊥  | b1 | d1
v5   | ⊥  | b3 | ⊥
v6   | ⊥  | ⊥  | d2
v7   | ⊥  | ⊥  | d3

C3 = BS(C1,C2) (C2's nodes outside C1 become the fresh nodes v8, v9):
Node | P1 | P2 | P3 | P4
v1   | a1 | ⊥  | ⊥  | ⊥
v2   | a2 | b2 | ⊥  | ⊥
v3   | a3 | ⊥  | ⊥  | ⊥
v4   | ⊥  | b1 | ⊥  | d1
v5   | ⊥  | b3 | c3 | ⊥
v6   | ⊥  | ⊥  | c1 | ⊥
v7   | ⊥  | ⊥  | c2 | ⊥
v8   | ⊥  | ⊥  | ⊥  | d2
v9   | ⊥  | ⊥  | ⊥  | d3
Graph composition of a bijective sum
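A sketch of the bijective sum on such dictionary tables, under the simplifying assumption that the two relations already agree on the common paths and that C2-only nodes carry fresh names (v8, v9 in this made-up encoding of the slides' example).

```python
def bijective_sum(c1, c2, common):
    """Sketch of BS(C1,C2,I1,I2): add C2's non-common columns (and any
    rows they need) to a copy of C1. Assumes C1|I1 = C2|I2 already holds
    with identical node names, and C2-only nodes have fresh names."""
    out = {node: dict(row) for node, row in c1.items()}
    for node, row in c2.items():
        extra = {p: x for p, x in row.items() if p not in common}
        if extra:
            out.setdefault(node, {}).update(extra)
    return out

c1 = {"v1": {"P1": "a1"}, "v2": {"P1": "a2", "P2": "b2"},
      "v3": {"P1": "a3"}, "v4": {"P2": "b1"},
      "v5": {"P2": "b3", "P3": "c3"}, "v6": {"P3": "c1"},
      "v7": {"P3": "c2"}}
c2 = {"v1": {"P1": "a1"}, "v2": {"P1": "a2", "P2": "b2"},
      "v3": {"P1": "a3"}, "v4": {"P2": "b1", "P4": "d1"},
      "v5": {"P2": "b3"}, "v8": {"P4": "d2"}, "v9": {"P4": "d3"}}
c3 = bijective_sum(c1, c2, {"P1", "P2"})  # 9 rows, columns P1..P4
```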
Splice
Definition: A splice ⊕i,j of two composition relations C1(P1,…,Pn) and C2(Pi,Pj) is a composition relation that turns every node common to Pi and Pj in C2 into a node common to Pi and Pj in C1 as well.
Example: C3 = C1(P1,P2,P3) ⊕2,3 C2(P2,P3).
C1:
Node | P1 | P2 | P3
v1   | a1 | ⊥  | ⊥
v2   | a2 | b2 | ⊥
v3   | a3 | ⊥  | ⊥
v4   | ⊥  | b1 | ⊥
v5   | ⊥  | b3 | c3
v6   | ⊥  | ⊥  | c1
v7   | ⊥  | ⊥  | c2

C2 (P2 and P3 share their first node):
Node | P2 | P3
v2   | b1 | c1
v4   | b2 | ⊥
v5   | b3 | c3
v7   | ⊥  | c2

C3 = C1 ⊕2,3 C2 (c1 moves to v4; the emptied row v6 disappears):
Node | P1 | P2 | P3
v1   | a1 | ⊥  | ⊥
v2   | a2 | b2 | ⊥
v3   | a3 | ⊥  | ⊥
v4   | ⊥  | b1 | c1
v5   | ⊥  | b3 | c3
v7   | ⊥  | ⊥  | c2
Graph composition of a splice
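A sketch of the splice on the same dictionary-table representation; it assumes every relevant pi-cell of C2 also occurs somewhere in C1 (otherwise `next` raises StopIteration), which is a simplification of the definition above.

```python
def splice(c1, c2, pi, pj):
    """Sketch of C1 ⊕i,j C2: for every C2 node lying on both pi and pj,
    relocate the matching pj-cell in C1 onto the C1 node holding the
    matching pi-cell; rows left empty are dropped."""
    out = {node: dict(row) for node, row in c1.items()}
    for row in c2.values():
        if pi in row and pj in row:
            x, y = row[pi], row[pj]
            target = next(n for n, r in out.items() if r.get(pi) == x)
            for n, r in out.items():
                if r.get(pj) == y and n != target:
                    del r[pj]
            out[target][pj] = y
    return {n: r for n, r in out.items() if r}

c1 = {"v1": {"P1": "a1"}, "v2": {"P1": "a2", "P2": "b2"},
      "v3": {"P1": "a3"}, "v4": {"P2": "b1"},
      "v5": {"P2": "b3", "P3": "c3"}, "v6": {"P3": "c1"},
      "v7": {"P3": "c2"}}
# C2(P2,P3): cells b1 and c1 share a node (u-names are C2's own nodes):
c2 = {"u2": {"P2": "b1", "P3": "c1"}, "u4": {"P2": "b2"},
      "u5": {"P2": "b3", "P3": "c3"}, "u7": {"P3": "c2"}}
c3 = splice(c1, c2, "P2", "P3")  # c1 moves to v4; empty row v6 is dropped
```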
Phase #1 – overview
Phase #1 constructs frequent paths by adding one edge at a time.
If the path is cyclic (i.e., it is a not necessarily simple cycle), we can add an edge anywhere (provided the labels match):
1. between two existing nodes,
2. between an existing node and a new node.
If the path is not cyclic, we can add an edge between a pair of nodes one of which is unbalanced:
1. between two existing unbalanced nodes,
2. between an existing unbalanced node and an existing balanced node,
3. between an existing unbalanced node and a new node.
Definition: A node v in graph G is balanced if the degree of v is even (for undirected graphs). A node is unbalanced if it is not balanced.
Phase #1 – Path generation
Algorithm: Phase #1
1. Find all frequent edges and add them to L1,1. Set k ← 2.
2. Set C1,k ← ∅, L1,k ← ∅.
3. For every path P ∈ L1,k-1 and every edge e = (v,u) ∈ L1,1 do:
   a. Let X be all nodes of P if P is cyclic, and all unbalanced nodes of P if P is non-cyclic.
   b. For every x ∈ X such that x ≈ v, add Q = (V(P) ∪ {u}, E(P) ∪ {(x,u)}) to C1,k (if p(Q) = 1).
   c. For every x ∈ X such that x ≈ u, add Q = (V(P) ∪ {v}, E(P) ∪ {(v,x)}) to C1,k (if p(Q) = 1).
   d. For every x,y ∈ X such that x ≈ v, y ≈ u and (x,y) ∉ E(P), add Q = (V(P), E(P) ∪ {(x,y)}) to C1,k (if p(Q) = 1).
4. Compute the frequency of all paths from C1,k and add the frequent ones to L1,k.
5. If L1,k = ∅, stop. Otherwise, set k ← k+1 and go to step 2.
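Step 3a relies on identifying the unbalanced (odd-degree) nodes of a candidate path; a minimal sketch for the undirected case:

```python
from collections import Counter

def unbalanced_nodes(edges):
    """Odd-degree nodes of an undirected graph; step 3a extends a
    non-cyclic path only at these."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return {n for n, d in deg.items() if d % 2 == 1}

# A path a-b-c can only grow at its endpoints a and c:
print(unbalanced_nodes([("a", "b"), ("b", "c")]))
```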
Phase #1 – Example
Phase #2 – Path pairs generation
Algorithm: Phase #2
1. Let L1 be the set of all frequent paths. Set C2 ← ∅, L2 ← ∅.
2. For every pair P1,P2 ∈ L1 and every possible label-preserving composition relation C on P1 and P2 do:
   a. If p(Ω(<P1,P2,C>)) = 2, add <P1,P2,C> to C2.
3. Remove all tuples producing non-P-minimal graphs from C2.
4. For every t ∈ C2, if Ω(t) is frequent, add it to L2.
Phase #2 - Example
Join of two paths produced three graphs with path number 2
Phase #3 – overview
Phase #3: Input = frequent graphs with path number k; Output = frequent graphs with path number (k+1).
The main steps:
1. find a common (k-1)-subgraph of two k-graphs,
2. if found, join these graphs into a (k+1)-graph using the bijective sum operation.
Additional step:
3. for a bijective sum G of two graphs, and the two paths P and Q in which these graphs differ, find all frequent combinations of P and Q in L2, and join them with G using the splice operation.
Phase #3 – Graphs with p(G)≥3
Algorithm: Phase #3
1. Let L2 be the set of all frequent path pairs. Set k ← 3.
2. Set Ck ← ∅, Lk ← ∅.
3. For every t1,t2 ∈ Lk-1 such that t1 = <P1,…,Pi-1,Pi+1,…,Pj,…,Pk,C1> and t2 = <P1,…,Pi,…,Pj-1,Pj+1,…,Pk,C2> do:
   a. Let C = BS(C1,C2,(k)-i-j,(k)-i-j).
   b. Add t = <P1,…,Pk,C> to Ck (if p(Ω(t)) = k).
   c. For every t3 = <Pi,Pj,C3> ∈ L2, add t = <P1,…,Pk,C ⊕i,j C3> to Ck (if p(Ω(t)) = k).
4. Remove all non-P-minimal tuples from Ck.
5. Add every t ∈ Ck where Ω(t) is frequent to Lk.
6. If Lk = ∅, stop. Otherwise, set k ← k+1 and go to step 2.
Proof milestones
Theorem 1: All frequent graphs with path number 1 are produced by phase 1 of the algorithm.
Basis: for every path P and unbalanced vertex v of P there exists a vertex u such that (u,v) ∈ E(P) and P\(u,v) is a path.
Theorem 2: All frequent graphs with path number 2 are produced by phase 2 of the algorithm.
Basis: each graph G with p(G) = 2 can be expressed as a label-preserving composition relation on two paths from any of its P-minimal path decompositions.
Theorem 3: All frequent graphs with p(G) > 2 are produced by phase 3.
Main steps:
1. There exist two paths P,Q in a minimal path decomposition of G such that G\P and G\Q are connected.
2. G\P and G\Q are also frequent and were found earlier.
3. If P and Q are disjoint in G, then application of a bijective sum produces G.
4. Otherwise, bijective sum and splice combined produce G.
Complexity
Exponential, as the number of frequent patterns can be exponential in the size of the database.
Difficult tasks:
1. Support computation, which consists of:
   a. finding all instances of a frequent pattern in the database;
   b. computing the MIS (maximum independent set size) of an instance graph.
Relatively easy tasks:
1. Candidate set generation: polynomial in the size of the frequent set from the previous iteration.
2. Elimination of isomorphic candidate patterns: graph isomorphism computation is at worst exponential in the size of a pattern, not of the database.
Phase #3 – complexity
Algorithm: Phase #3
1. Let L2 be the set of all frequent path pairs. Set k ← 3.
2. Set Ck ← ∅, Lk ← ∅.
3. For every t1,t2 ∈ Lk-1 such that t1 = <P1,…,Pi-1,Pi+1,…,Pj,…,Pk,C1> and t2 = <P1,…,Pi,…,Pj-1,Pj+1,…,Pk,C2> do:
   a. Let C = BS(C1,C2,(k)-i-j,(k)-i-j).
   b. Add t = <P1,…,Pk,C> to Ck (if p(Ω(t)) = k).
   c. For every t3 = <Pi,Pj,C3> ∈ L2, add t = <P1,…,Pk,C ⊕i,j C3> to Ck (if p(Ω(t)) = k).
4. Remove all non-P-minimal tuples from Ck.
5. Add every t ∈ Ck where Ω(t) is frequent to Lk.
6. If Lk = ∅, stop. Otherwise, set k ← k+1 and go to step 2.

The dominant cost is step 3, which performs subgraph and graph isomorphism tests: on the order of O(2^(k²) · |Lk-1|² · |L2|).
Complexity (cont.)
Why is mining in real-life databases easier?
• Real databases tend to be sparse rather than dense.
• Real databases tend to have a large number of different labels.
Impact on the algorithm's complexity:
• the number of database subgraphs isomorphic to a given graph pattern is not exponential,
• the size of the instance graph is not exponential,
• instance graphs tend to be very sparse, which makes the task of finding the MIS much easier.
Additional improvements: approximate techniques can be used for MIS computation, as the user usually does not care about the exact support value.
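One such approximation (a judgment call, not the talk's stated method) is the minimum-degree greedy heuristic, which performs well on sparse instance graphs:

```python
def greedy_mis(adj):
    """Minimum-degree greedy heuristic for an independent set:
    repeatedly pick a node of smallest remaining degree and discard
    its neighbors. On very sparse graphs this is close to (and often
    equals) the exact MIS."""
    adj = {v: set(ns) for v, ns in adj.items()}
    chosen = set()
    while adj:
        v = min(adj, key=lambda u: len(adj[u]))
        chosen.add(v)
        dead = adj[v] | {v}
        adj = {u: ns - dead for u, ns in adj.items() if u not in dead}
    return chosen

# On a 3-node path a-b-c the heuristic finds the exact MIS {a, c}:
print(greedy_mis({"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}))
```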
Experiment overview
The goals of our experiments are:
• To compare our algorithm with naïve algorithms:
  - Naive1: produce all graphs and compute their support (B. D. McKay, “Isomorph-free exhaustive generation”, J. of Algorithms vol. 26, 1998)
  - Naive2: at each iteration, add an edge to the frequent graphs from the previous iteration
• To study the algorithm's behavior on various graph topologies: cliques, trees, sparse graphs vs. dense graphs
• To study the effect of the following parameters on the number of frequent patterns found: size of the database, number of different labels
• To test the algorithm on both synthetic and real-life databases
Experiments setting
Experimental results on synthetic data: trees
Notation: S – support; N – nodes; L – labels; E – edges; FP – frequent patterns; C – candidate patterns; I – isomorphism checks; SC – support calculations; ALG – algorithm in use.
#  N   L  S   FP  ALG      C    I    SC
1  40  4  7%  15  Naive2   100  24   92
                  Apriori  41   29   26
7  60  8  5%  14  Naive2   103  18   87
                  Apriori  175  868  276
6  60  6  5%  44  Naive2   728  203  716
                  Apriori  52   47   52
5  60  4  5%  15  Naive2   100  24   92
                  Apriori  119  91   111
4  50  8  3%  27  Naive2   306  62   290
                  Apriori  202  239  205
3  50  6  3%  37  Naive2   470  82   458
                  Apriori  45   45   42
2  50  4  7%  16  Naive2   110  41   102
                  Apriori  52   47   52
Experimental results on synthetic data: sparse graphs

#  N   L  E    S   FP  ALG      C    I    SC
1  40  4  50   7%  14  Naive2   60   33   52
                       Apriori  149  127  141
7  80  8  100  3%  32  Naive2   403  74   387
                       Apriori  126  98   110
6  70  8  90   3%  27  Naive2   252  77   236
                       Apriori  120  102  110
5  60  6  80   3%  27  Naive2   265  86   253
                       Apriori  56   58   56
4  60  4  80   4%  16  Naive2   101  31   93
                       Apriori  117  185  143
3  50  6  60   5%  28  Naive2   355  74   343
                       Apriori  59   70   54
2  40  6  50   5%  17  Naive2   84   48   76
                       Apriori  49   55   42
Subsets of the Movie database used in experiments:

Data set #  Nodes  Edges  Labels
1           12656  13878  112
2           8337   9416   25
3           7027   7851   22
4           4730   4813   90
5           2757   2794   76
6           1293   1292   91
Experimental results on subsets of the Movie database

Number of frequent patterns per support level and data set:

Support  #1  #2  #3  #4  #5  #6
5%       22  14  11  16  11  76
6%       21  13  11  12  10  76
7%       18  12  10  12  9   84
8%       16  12  10  12  8   84
9%       16  12  10  11  7   79
10%      15  12  10  11  7   79
20%      8   6   5   9   6   46
30%      7   5   4   8   5   34
40%      6   4   4   5   2   32
50%      5   3   3   5   2   21
60%      4   3   3   4   2   9
70%      4   3   3   4   2   5
80%      3   3   3   3   2   5
90%      3   3   3   3   2   4
Comparison
Our algorithm vs. the naive ones:
• The Naive1 algorithm does not work on graphs with ≥ 10 nodes.
• Our algorithm produces fewer candidate patterns and therefore performs fewer support computations than the Naive2 algorithm.
Trees vs. sparse graphs:
• Support computation is easier for trees.
• Fewer candidate patterns are generated for trees.
Synthetic vs. real-life data:
• Synthetic graphs are not very regular. When the number of labels increases, the chance of finding non-trivial frequent graph patterns decreases drastically.
• Large real-life graph databases are highly regular and contain complex frequent graph patterns.
Experimental results (on synthetic data)
Pattern examples in Movies database
Conclusion
An Apriori-like algorithm for mining graph patternsthat uses edge-disjoint paths as building blocks has been constructed.
The problem of defining a support measure for semi-structured data was addressed.
An experimental analysis of the algorithm was conducted.
Future work
Use building blocks other than edge-disjoint paths, such as trees.
Use the Apriori-TID technique at the advanced stages of the search.
Treat patterns that have a high degree of resemblance, such as bisimilar patterns, as representatives of their equivalence classes, and generate representatives of each class instead of performing the full search.
Find additional examples of admissible support measures.
Take into account topological properties of a database graph while computing support.
Compare with the gSpan algorithm using our support measure.