
Mining Graph Patterns - UCSB Computer Sciencexyan/classes/CS595D-2009winter... · 2009. 1. 28. ·...

Date post: 23-Sep-2020
Mining Graph Patterns
Transcript
Mining Graph Patterns

Why mine graph patterns?

• Direct Use:
– Mining over-represented sub-structures in chemical databases
– Mining conserved sub-networks
– Program control flow analysis

• Indirect Uses:
– Building block of further analysis
• Classification
• Clustering
• Similarity searches
• Indexing

What are graph patterns?

• Given a function f(g) and a threshold θ, find all subgraphs g, such that f(g) ≥ θ.

• Example: frequent subgraph mining.
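
As a minimal sketch of this definition, take f(g) to be the support of g, that is, the number of database graphs that contain g. The sketch assumes node labels are unique within each graph, so a graph can be stored as a set of labeled edges and subgraph containment reduces to a subset test; general subgraph isomorphism is not handled. The database `DB` and the helper names are hypothetical and are not taken from the slides.

```python
# Assumption: node labels are unique within a graph, so an undirected edge
# is a frozenset of two labels and a graph is a frozenset of edges.
def graph(*edges):
    return frozenset(frozenset(e) for e in edges)

# Hypothetical database, not the graphs drawn on the slides.
DB = [
    graph("ab", "bc", "cd"),
    graph("ab", "bc", "cd", "de"),
    graph("ab", "bc", "ef"),
    graph("ad", "df"),
]

def support(g, db):
    """f(g): number of database graphs that contain g as a subgraph."""
    return sum(1 for G in db if g <= G)

def is_frequent(g, db, theta):
    """True when f(g) >= theta."""
    return support(g, db) >= theta

pattern = graph("ab", "bc")
print(support(pattern, DB))         # support of the path a-b-c
print(is_frequent(pattern, DB, 3))  # compared against a threshold of 3
```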

[Figure: four example graphs G1, G2, G3, G4 over node labels a to f, and a frequent subgraph on nodes a, b, c, d]

θ = 3

Is this the only frequent subgraph?

NO!

Other Mining Functions

• Maximal frequent subgraph mining

– A frequent subgraph is maximal if none of its supergraphs is frequent

• Closed frequent subgraph mining

– A frequent subgraph is closed if all its supergraphs have a lower frequency

• Significant subgraph mining

– G-test, p-value
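
The maximal and closed definitions can be checked mechanically once the frequent patterns and their supports are known. The sketch below assumes each pattern is stored as a set of edge names, so "supergraph" becomes "proper superset"; the `frequent` table is hypothetical. Only frequent supergraphs need to be inspected for closedness, because an infrequent supergraph necessarily has a lower frequency than a frequent pattern.

```python
# Hypothetical table: frequent pattern (set of edge names) -> support.
frequent = {
    frozenset({"ab"}): 4,
    frozenset({"bc"}): 3,
    frozenset({"ab", "bc"}): 3,
}

def is_maximal(p, frequent):
    """No proper supergraph of p is frequent."""
    return not any(p < q for q in frequent)

def is_closed(p, frequent):
    """Every frequent proper supergraph of p has strictly lower support."""
    return all(frequent[q] < frequent[p] for q in frequent if p < q)

for p in frequent:
    print(sorted(p), "maximal:", is_maximal(p, frequent),
          "closed:", is_closed(p, frequent))
```

In this table every maximal pattern is also closed, but not the other way round: the single edge `ab` is closed without being maximal.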

Frequent Subgraph Mining

Frequent subgraph mining

• Apriori Based Approach (FSG)

– Find all frequent subgraphs of size k

– Find candidates of size k+1 edges by joining frequent subgraphs of size k edges

– Must share a common subgraph of k-1 edges

Example: (FSG)
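
The level-wise idea can be sketched as follows. This is not the FSG implementation: it assumes node labels are unique within each graph, so patterns are plain edge sets, two k-edge patterns share a (k-1)-edge core exactly when their union has k+1 edges, and no isomorphism test is needed. The database `DB` and all function names are hypothetical.

```python
from itertools import combinations

def graph(*edges):
    # Assumption: unique node labels, so an edge is a frozenset of two labels.
    return frozenset(frozenset(e) for e in edges)

def support(g, db):
    return sum(1 for G in db if g <= G)

def is_connected(g):
    """Grow a node set from the first edge until no edge can be attached."""
    edges = list(g)
    seen = set(edges[0])
    grew = True
    while grew:
        grew = False
        for e in edges:
            if e & seen and not e <= seen:
                seen |= e
                grew = True
    return all(e <= seen for e in edges)

def apriori(db, theta):
    all_edges = {e for G in db for e in G}
    level = {frozenset([e]) for e in all_edges}
    level = {p for p in level if support(p, db) >= theta}
    frequent = {}
    while level:
        frequent.update((p, support(p, db)) for p in level)
        candidates = set()
        for p, q in combinations(level, 2):
            joined = p | q
            # Union of size k+1 means p and q share k-1 edges.
            if len(joined) == len(p) + 1 and is_connected(joined):
                candidates.add(joined)
        level = {c for c in candidates if support(c, db) >= theta}
    return frequent

# Hypothetical database, not the graphs drawn on the slides.
DB = [
    graph("ab", "bc", "cd"),
    graph("ab", "bc", "cd", "de"),
    graph("ab", "bc", "ef"),
    graph("ad", "df"),
]
for pattern, count in apriori(DB, 2).items():
    print(sorted("".join(sorted(e)) for e in pattern), count)
```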

Pattern Growth Approach

• Pattern Growth Approach

– Depth first exploration

– Recursively grow a frequent subgraph
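
A depth-first variant can be sketched in the same simplified setting. The assumptions are the same as before: node labels are unique within each graph, patterns are edge sets, and duplicates are avoided with a plain dictionary of visited patterns rather than the canonical labeling a real pattern-growth miner would use. The database `DB` and the function names are hypothetical.

```python
def graph(*edges):
    # Assumption: unique node labels, so an edge is a frozenset of two labels.
    return frozenset(frozenset(e) for e in edges)

def support(g, db):
    return sum(1 for G in db if g <= G)

def grow(pattern, db, theta, found):
    """Record a frequent pattern, then recursively extend it by one edge."""
    found[pattern] = support(pattern, db)
    nodes = set().union(*pattern)
    # Candidate edges: edges of supporting graphs that touch the pattern.
    extensions = {e for G in db if pattern <= G
                  for e in G - pattern if e & nodes}
    for e in extensions:
        bigger = pattern | {e}
        if bigger not in found and support(bigger, db) >= theta:
            grow(bigger, db, theta, found)

def pattern_growth(db, theta):
    found = {}
    for e in {e for G in db for e in G}:
        seed = frozenset([e])
        if seed not in found and support(seed, db) >= theta:
            grow(seed, db, theta, found)
    return found

# Hypothetical database, not the graphs drawn on the slides.
DB = [
    graph("ab", "bc", "cd"),
    graph("ab", "bc", "cd", "de"),
    graph("ab", "bc", "ef"),
    graph("ad", "df"),
]
print(len(pattern_growth(DB, 2)))  # number of frequent connected patterns
```

Because each pattern is only ever extended with an edge that touches it, every pattern visited is connected by construction.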

Mining Significant subgraphs

• What is significance?

– G-test, p-value

– Both attempt to measure the deviation of the observed frequency from the expected frequency

– Example: Snow in Santa Barbara is significant, but snow in Alaska is not.

P-value

• p-value: the probability, under the null hypothesis, of obtaining a test statistic at least as extreme as the one actually observed

• The lower the p-value, the higher the significance

Problem formulation

• Find answer set: all subgraphs g, occurring in some graph G of the graph database, whose p-value is at most η

– η : Significance Threshold

– g ⊆ G : g is a subgraph of G

• Low frequency does not imply low significance and vice versa

– A graph with frequency 1% can be significant if its expected frequency is 0.1%

Solution to Problem: Approach 1

• The number of frequent subgraphs grows exponentially as the frequency threshold decreases

Graph Database → Frequent sub-graph mining with low frequency threshold (bottleneck) → Calculate p-value to find significant sub-graphs → Answer Set

Alternative Approximate Solution

Graph DB → (slide window) → Feature Vectors → Feature Vector Mining (user defined p-value threshold) → Significant sub-feature vectors → Sets of similar regions → Frequent sub-graph mining (high frequency threshold) → Significant Subgraphs

Converting graphs to feature vectors

• Random walk with Restart (RWR) on each node in a graph

• Feature vectors discretized to 10 bins
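
The two steps above can be sketched as a power iteration followed by binning. Several details are assumptions not stated on the slides: the restart probability (0.25), the number of iterations, using one feature per node label (the visit probability summed over nodes carrying that label), and equal-width bins. The example graph and all names are hypothetical, and every node is assumed to have at least one neighbor.

```python
def rwr(adj, start, restart=0.25, iters=100):
    """Visit probabilities of a random walk that restarts at `start`."""
    p = {v: 0.0 for v in adj}
    p[start] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in adj}
        for v, mass in p.items():
            for u in adj[v]:
                # Move to a uniformly chosen neighbor with prob. 1 - restart.
                nxt[u] += (1 - restart) * mass / len(adj[v])
        nxt[start] += restart  # jump back to the start node
        p = nxt
    return p

def feature_vector(adj, labels, start, alphabet, bins=10):
    """Sum visit probability per node label, then discretize into bins."""
    p = rwr(adj, start)
    totals = {a: 0.0 for a in alphabet}
    for v, prob in p.items():
        totals[labels[v]] += prob
    return [min(int(totals[a] * bins), bins - 1) for a in alphabet]

# Hypothetical path graph 0-1-2-3 with labels a, b, c, d.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
labels = {0: "a", 1: "b", 2: "c", 3: "d"}
for node in adj:
    print(node, feature_vector(adj, labels, node, "abcd"))
```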

What do RWR vectors preserve?

• Distribution of node-types around each node in graph

• Stores more structural information than a simple count of node-types

• Captures the feature vector representation of the subgraph around each node in a graph

Extracting information from feature vectors

• Floor of G1,G2,G3: [2,0,0,0,1,1,0,0,0]
• Floor of G1,G2,G3,G4: [0,0,0,0,0,0,0,0,0]
• False positives pruned later

[Figure: example graphs G1, G2, G3, G4]
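
The floor of a set of feature vectors is their component-wise minimum, that is, the largest vector that occurs in every one of them. A minimal sketch follows; the input vectors are hypothetical and are not the vectors of G1 to G4.

```python
def floor(vectors):
    """Component-wise minimum of equally long feature vectors."""
    return [min(column) for column in zip(*vectors)]

# Hypothetical feature vectors.
vectors = [
    [4, 5, 6],
    [3, 2, 4],
    [3, 4, 2],
]
print(floor(vectors))
```

Adding a vector that shares little with the others drives the floor toward the all-zero vector, which is what happens on the slide when G4 is included.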

Measuring p-value of feature vector

• Sub-feature vector: X=[x1, ..,xn] is a sub-feature vector of Y =[y1, ..,yn] if xi≤ yi for i=1..n.

– Example: [2,3,1] ≤ [4,3,2].

– In other words, “X occurs in Y”

• Given a vector X:

– P(X) = Probability of X occurring in an arbitrary Y

= P(y1 ≥ x1, .., yn ≥ xn)

= P(y1 ≥ x1) × .. × P(yn ≥ xn), assuming the features are independent

More p-value calculation

• Individual feature probabilities calculated empirically.

• Example:

• P(a-b≥2) = 3/4
• P(a-e≥1) = 1/4
• P([2,0,0,0,1,1,0,0,0]) = ¾ × ¾ × ¾ = 27/64
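
The empirical calculation can be sketched as follows, assuming the features are treated as independent so that P(X) is the product of the per-feature probabilities. The database of vectors is hypothetical and is not the one behind the numbers above. Entries equal to 0 contribute a factor of 1, which is why only the non-zero entries of a vector matter.

```python
from fractions import Fraction

def feature_prob(db, i, x):
    """Empirical P(y_i >= x): share of database vectors with entry i >= x."""
    return Fraction(sum(1 for y in db if y[i] >= x), len(db))

def vector_prob(db, X):
    """P(X) as a product of per-feature probabilities (independence assumed)."""
    p = Fraction(1)
    for i, x in enumerate(X):
        p *= feature_prob(db, i, x)
    return p

# Hypothetical database of feature vectors.
db = [
    [4, 5, 6],
    [3, 2, 4],
    [3, 4, 2],
    [2, 3, 3],
]
print(feature_prob(db, 0, 3))
print(vector_prob(db, [3, 4, 2]))
```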

Probability Distribution of X

• The distribution can be modeled as a Binomial Distribution

– m = number of vectors in database

– μ = number of successes

• X occurring in a vector is a “success”

P-value…

• = observed frequency
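
The formula on this slide did not survive extraction. One common reading, assumed here rather than taken from the slides, is that the p-value is the upper tail of the binomial distribution: the probability that X occurs in at least as many of the m vectors as were observed, when each vector contains X independently with probability P(X). The function and parameter names are hypothetical.

```python
from math import comb

def binom_pmf(m, k, p):
    """Probability of exactly k successes in m independent trials."""
    return comb(m, k) * p**k * (1 - p) ** (m - k)

def p_value(m, observed, p):
    """Assumed definition: P(number of successes >= observed)."""
    return sum(binom_pmf(m, k, p) for k in range(observed, m + 1))

# Hypothetical numbers: 1000 vectors, P(X) = 0.001, X observed 10 times.
print(p_value(1000, 10, 0.001))
# The same observed count is unremarkable when P(X) = 0.01.
print(p_value(1000, 10, 0.01))
```

Under this reading, a pattern with a low observed frequency can still be highly significant when its expected frequency is much lower.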

Monotonicity properties of p-value

• If X is a sub-feature vector of Y

– p-value(X,s) ≥ p-value(Y,s) for any support s

• For some support s1 ≥ s2

– p-value(X,s1) ≤ p-value(X,s2)

Mining Significant subgraphs

• What have we developed so far?

– Vector representation of subgraphs

– Significance of a subgraph using its vector representation

• Next Step?

– Find all significant vectors

v1: 4,5,6
v2: 3,2,4
v3: 3,4,2
v4: 2,3,3

[Figure: search tree over sub-feature vectors, rooted at 2,2,2 with supporting set {v1,v2,v3,v4}; its children are 3,2,2 {v1,v2,v3}, 2,3,2 {v1,v3,v4} and 2,2,3 {v1,v2,v4}; pruned branches are marked X]

Pruning Criteria:
1. Duplicate
2. Ceiling of supporting set is less than p-value threshold
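
The search over v1 to v4 can be sketched as a depth-first enumeration. This is a simplification rather than the mining algorithm from the slides: it applies only the duplicate check, and it replaces the p-value based pruning with a plain minimum-support cut-off (`min_support`, a hypothetical parameter that must be at least 1). Each node of the search is the floor of its supporting set, and a child is obtained by keeping only the vectors that strictly exceed the current vector in one dimension.

```python
def floor(vectors):
    return tuple(min(column) for column in zip(*vectors))

def mine(db, min_support=1):
    """Map each reachable sub-feature vector to its supporting set (indices)."""
    found = {}

    def visit(support_set):
        x = floor([db[j] for j in support_set])
        if x in found:  # pruning criterion 1: duplicate
            return
        found[x] = support_set
        for i in range(len(x)):
            # Keep only the vectors that strictly exceed x in dimension i.
            narrower = frozenset(j for j in support_set if db[j][i] > x[i])
            if len(narrower) >= min_support:
                visit(narrower)

    visit(frozenset(range(len(db))))
    return found

db = [(4, 5, 6), (3, 2, 4), (3, 4, 2), (2, 3, 3)]  # v1, v2, v3, v4
for x, members in mine(db).items():
    print(x, sorted("v%d" % (j + 1) for j in members))
```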

Definitions

• Vector X occurs in graph G

– X ≤ hi for some feature vector hi ∈ G

– Ex: [3,1,2] occurs in G, [3, 3, 3] does not.

Definitions..

• Cut-off/Isolate structure around node n in Graph G within radius r

– Ex: around b within radius 1

– Ex: around f within radius 2

[Figure: example graph on nodes a, b, c, d, e, f with the cut-off structures]
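
Cutting off the structure around a node can be sketched as a breadth-first search that stops at distance r and then keeps the subgraph induced by the nodes reached. The adjacency-list representation and the example graph are hypothetical; the edges of the graph in the figure are not recoverable from the transcript.

```python
from collections import deque

def cut_off(adj, center, radius):
    """Subgraph induced by all nodes within `radius` hops of `center`."""
    dist = {center: 0}
    queue = deque([center])
    while queue:
        v = queue.popleft()
        if dist[v] == radius:
            continue  # do not expand beyond the radius
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    kept = set(dist)
    return {v: [u for u in adj[v] if u in kept] for v in kept}

# Hypothetical path graph a-b-c-d-e.
adj = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}
print(sorted(cut_off(adj, "b", 1)))
print(sorted(cut_off(adj, "a", 2)))
```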

Mapping significant vectors to significant subgraphs

Significant Vectors → Scan all nodes in database → Sets of nodes → Isolate sub-structures around each node in a set → Sets of similar sub-structures → Maximal frequent subgraph mining → Significant sub-structures

Application of significant subgraphs

• Over-represented molecular sub-structures

• Graph Classification

– Significant subgraphs are more efficient than frequent subgraphs

Graph Setting

Classification Flowchart

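[Flowchart: significant vectors are mined separately from the actives and from the inactives in the graph DB. A query graph Q is converted by RWR into a set of vectors; K-nearest neighbor classification against the top k significant vectors produces an active vs. inactive vote, which gives the classification result.]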

Experimental Results: Datasets

• AIDS dataset

• Cancer Datasets

Representing molecules as graphs

Time Vs. Frequency

Time vs DB size

Profiling of Computation Cost

Quality of Patterns

• Subgraphs mined from AIDS database

• Subgraphs mined from molecules active against Leukemia

– Sb and Bi are found at a frequency below 1%

– Current techniques unable to scale to such low frequencies

Classification

• Performance Measure: Area under ROC Curve (AUC)

• AUC is between 0 and 1.

• The higher the AUC, the better the performance.
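
AUC can be computed without drawing the ROC curve, because it equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as one half. The sketch below uses that pairwise form; the scores and labels are hypothetical, with label 1 standing for active and 0 for inactive.

```python
def auc(scores, labels):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores and true labels.
scores = [0.9, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0]
print(auc(scores, labels))
```

The pairwise form is quadratic in the number of examples, which is fine for illustration; a rank-based computation would be preferable for large test sets.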

AUC Comparison

Running Time Comparison

Questions?
