Direct Mining of Discriminative and Essential Frequent Patterns via Model-based Search Tree
Wei Fan, Kun Zhang, Hong Cheng,
Jing Gao, Xifeng Yan, Jiawei Han,
Philip S. Yu, Olivier Verscheure
How to find good features from semi-structured raw data for classification?
Feature Construction
Most data mining and machine learning models assume the following structured data: (x1, x2, ..., xk) -> y, where the xi's are independent variables and y is the dependent variable.
y drawn from a discrete set: classification. y drawn from a continuous range: regression.
When the feature vectors are good, the differences in accuracy among learners are small.
Question: where do good features come from?
Frequent Pattern-Based Feature Extraction
Data not in pre-defined feature vectors:
Transactions
Biological sequences
Graph databases
A frequent pattern is a good candidate for discriminative features. So, how to mine them?
FP: Sub-graph
[Figure: a discovered sub-graph pattern shared by the chemical compounds NSC 4960, NSC 191370, NSC 40773, NSC 164863, and NSC 699181]
(example borrowed from George Karypis' presentation)
Frequent Pattern Feature Vector Representation
        P1  P2  P3
Data1   1   1   0
Data2   1   0   1
Data3   1   1   0
Data4   0   0   1
...
[Figure: decision tree on iris data, splitting on Petal.Length < 2.45 (setosa) and Petal.Width < 1.75 (versicolor vs. virginica)]
Any classifier you can name: NN, DT, SVM, LR
Mining these predictive features is an NP-hard problem.
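The table above can be produced mechanically once patterns are mined. A minimal sketch, with hypothetical transactions and patterns, of the feature-vector representation: column Pi is 1 iff the example contains frequent pattern Pi.

```python
# Hypothetical transactions and mined patterns, chosen to reproduce the table.
transactions = [
    {"a", "b", "c"},  # Data1
    {"a", "d"},       # Data2
    {"a", "b", "e"},  # Data3
    {"d", "f"},       # Data4
]
patterns = [frozenset("a"), frozenset("ab"), frozenset("d")]  # P1, P2, P3

def to_feature_vector(transaction, patterns):
    """Binary indicator vector: 1 if the pattern is contained in the transaction."""
    return [int(p <= transaction) for p in patterns]

vectors = [to_feature_vector(t, transactions and patterns) for t in transactions]
vectors = [to_feature_vector(t, patterns) for t in transactions]
# Reproduces the table: [[1, 1, 0], [1, 0, 1], [1, 1, 0], [0, 0, 1]]
```

Any standard classifier (NN, DT, SVM, LR) can then be trained on `vectors`.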
100 examples can generate up to 10^10 patterns; most are useless.
Example: 192 examples.
At 12% support (at least 12% of examples contain the pattern), 8,600 itemsets are returned: 192 examples vs. 8,600 patterns?
At 4% support, 92,000 patterns are returned: 192 vs. 92,000??
Most patterns have no predictive power and cannot be used to construct features.
Our algorithm: finding only 20 highly predictive patterns can construct a decision tree with about 90% accuracy.
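The explosion as support drops can be seen even on a toy scale. A hypothetical brute-force illustration (random transactions, not the paper's data): lowering minimum support multiplies the number of frequent itemsets returned.

```python
import random
from itertools import combinations

# Hypothetical toy data: 50 transactions of 5 items drawn from a 10-item alphabet.
random.seed(0)
items = "abcdefghij"
data = [frozenset(random.sample(items, 5)) for _ in range(50)]

def count_frequent(data, min_sup):
    """Count all itemsets whose support meets min_sup (brute-force enumeration)."""
    universe = sorted({i for t in data for i in t})
    total = 0
    for size in range(1, len(universe) + 1):
        for combo in combinations(universe, size):
            if sum(set(combo) <= t for t in data) >= min_sup * len(data):
                total += 1
    return total

high_sup = count_frequent(data, 0.50)
low_sup = count_frequent(data, 0.10)
# Every itemset frequent at 50% support is frequent at 10%, plus many more.
```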
Data in a "bad" feature space
Discriminative patterns: a non-linear combination of single feature(s)
Increase the expressive and discriminative power of the feature space
An example:
X   Y   C
0   0   0
1   1   1
-1  1   1
1   -1  1
-1  -1  1
Data is non-linearly separable in (x, y).
[Figure: the five points plotted in the (x, y) plane]
New feature space: data is linearly separable in (x, y, F).
Mine & Transform
Solving the problem: map data to a different space.

X   Y   C        X   Y   F: x=0, y=0   C
0   0   0        0   0   1             0
1   1   1        1   1   0             1
-1  1   1        -1  1   0             1
1   -1  1        1   -1  0             1
-1  -1  1        -1  -1  0             1

[Figure: the augmented points in (x, y, F) space, now linearly separable]
ItemSet F: x=0, y=0
Association rule F: x=0 -> y=0
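The mapping in the table can be sketched in a few lines: the itemset F: x=0, y=0 becomes one extra binary column, after which the XOR-like data is linearly separable (predict class 1 whenever F = 0).

```python
# The slide's five points as (x, y, class) triples.
points = [(0, 0, 0), (1, 1, 1), (-1, 1, 1), (1, -1, 1), (-1, -1, 1)]

def pattern_feature(x, y):
    """F fires exactly when the example contains both items x=0 and y=0."""
    return int(x == 0 and y == 0)

augmented = [(x, y, pattern_feature(x, y), c) for x, y, c in points]
# A single linear threshold on the new axis separates the classes: class = 1 - F.
predictions = [1 - f for _, _, f, _ in augmented]
```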
Computational Issues
A pattern is measured by its "frequency" or support, e.g., frequent subgraphs with sup >= 10%: at least 10% of the examples contain these patterns.
"Ordered" enumeration: one cannot enumerate patterns with sup = 10% without first enumerating all patterns with sup > 10%.
NP-hard problem: easily up to 10^10 patterns for a realistic problem. Most patterns are non-discriminative, yet low-support patterns can have high discriminative power. Bad!
Random sampling does not work since it is not exhaustive: randomly sampled patterns (or blind enumeration without considering frequency) are useless.
Small number of examples: searching a subset of the vocabulary gives an incomplete search; searching the complete vocabulary does not help much but introduces a sample selection bias problem, in particular missing low-support but high-information-gain patterns.
Conventional Procedure: Feature Construction and Selection (Two-Step Batch Method)
1. Mine frequent patterns (> minsup): [frequent patterns 1, 2, 3, 4, 5, 6, 7, ...]
2. Select the most discriminative patterns (e.g., 1, 2, 4);
3. Represent data in the feature space using such patterns;
4. Build classification models.

        F1  F2  F4
Data1   1   1   0
Data2   1   0   1
Data3   1   1   0
Data4   0   0   1
...

[Figure: decision tree on iris data, splitting on Petal.Length < 2.45 (setosa) and Petal.Width < 1.75 (versicolor vs. virginica)]
Any classifier you can name: NN, DT, SVM, LR
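The two-step batch method can be sketched end to end on hypothetical toy data (this is an illustration of the conventional baseline, not the paper's implementation): step 1 enumerates all frequent itemsets, step 2 ranks them by InfoGain against the full dataset and keeps the top-k.

```python
from itertools import combinations
from math import log2

# Hypothetical labeled transactions: (itemset, class).
data = [({"a", "b"}, 1), ({"a", "c"}, 1), ({"b", "c"}, 0), ({"c"}, 0)]

def entropy(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 0.0 if p in (0.0, 1.0) else -(p * log2(p) + (1 - p) * log2(1 - p))

def info_gain(pattern, data):
    """InfoGain of splitting the dataset on containment of the pattern."""
    inside = [y for x, y in data if pattern <= x]
    outside = [y for x, y in data if not pattern <= x]
    n = len(data)
    return (entropy([y for _, y in data])
            - len(inside) / n * entropy(inside)
            - len(outside) / n * entropy(outside))

def mine_frequent(data, min_sup):
    """Step 1 (combinatorial part): brute-force enumerate all frequent itemsets."""
    items = sorted({i for x, _ in data for i in x})
    frequent = []
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            p = frozenset(combo)
            if sum(p <= x for x, _ in data) / len(data) >= min_sup:
                frequent.append(p)
    return frequent

patterns = mine_frequent(data, min_sup=0.5)
# Step 2: select by InfoGain measured on the COMPLETE dataset.
top = sorted(patterns, key=lambda p: info_gain(p, data), reverse=True)[:2]
```

Note that both weaknesses discussed next are visible here: step 1 enumerates everything above minsup, and step 2 scores each pattern only against the whole dataset.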
Two Problems
Mine step: combinatorial explosion
[Figure: mining the dataset yields frequent patterns 1, 2, 3, 4, 5, 6, 7, ...]
1. exponential explosion
2. patterns not considered if minsupport isn't small enough
Two Problems
Select step: issue of discriminative power
[Figure: from frequent patterns 1-7, only 1, 2, 4 are selected as discriminative]
3. InfoGain is evaluated against the complete dataset, NOT on subsets of examples
4. Correlation is not directly evaluated on the patterns' joint predictability
Direct Mining & Selection via Model-based Search Tree: Basic Flow
[Figure: divide-and-conquer based frequent pattern mining. Starting from the full dataset at node 1, each node runs "Mine & Select" with local support P = 20%, keeps the most discriminative feature F based on InfoGain, and splits the examples on containment of F (Y/N) into child nodes 2, 3, 4, 5, 6, 7, ...; recursion stops when a node has few data.]
The tree is both a feature miner and a classifier, and returns a compact set of highly discriminative patterns.
Global support: 10 * 20% / 10000 = 0.02%.
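The flow above can be sketched as a short recursive procedure. This is a compact, hypothetical illustration of the Model-based Search Tree idea, not the authors' implementation: every node mines frequent itemsets from only its local examples with a fixed local support (e.g., P = 20%), keeps the single pattern with the highest InfoGain, splits on pattern containment (Y/N), and recurses until a node is small or pure.

```python
from itertools import combinations
from math import log2

def entropy(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 0.0 if p in (0.0, 1.0) else -(p * log2(p) + (1 - p) * log2(1 - p))

def info_gain(pattern, data):
    inside = [y for x, y in data if pattern <= x]
    outside = [y for x, y in data if not pattern <= x]
    n = len(data)
    return (entropy([y for _, y in data])
            - len(inside) / n * entropy(inside)
            - len(outside) / n * entropy(outside))

def mine_frequent(data, min_sup):
    items = sorted({i for x, _ in data for i in x})
    return [frozenset(c)
            for size in (1, 2)  # toy limit: itemsets up to size 2
            for c in combinations(items, size)
            if sum(set(c) <= x for x, _ in data) / len(data) >= min_sup]

def build_mbt(data, local_sup=0.2, min_node=2):
    """Recursively mine & select on the LOCAL examples, then split and recurse."""
    labels = [y for _, y in data]
    majority = max(set(labels), key=labels.count)
    if len(data) < min_node or len(set(labels)) == 1:
        return {"leaf": majority}  # few data or pure node: stop
    patterns = mine_frequent(data, local_sup)  # local support on local data
    best = max(patterns, key=lambda p: info_gain(p, data), default=None)
    if best is None or info_gain(best, data) <= 0:
        return {"leaf": majority}
    return {"pattern": best,
            "yes": build_mbt([(x, y) for x, y in data if best <= x],
                             local_sup, min_node),
            "no": build_mbt([(x, y) for x, y in data if not best <= x],
                            local_sup, min_node)}

# Toy usage on hypothetical data: the root mines {a} as the most
# discriminative pattern and splits the examples perfectly.
data = [({"a", "b"}, 1), ({"a", "c"}, 1), ({"b", "c"}, 0), ({"c", "d"}, 0)]
tree = build_mbt(data)
```

Because each node re-mines with local support relative to its (shrinking) subset, patterns whose global support is extremely low can still be found, which is where the 10 * 20% / 10000 = 0.02% global support figure comes from.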
Analyses (I)
1. Scalability (Theorem 1): upper bound; "scale down" ratio to obtain extremely low-support patterns
2. Bound on the number of returned features (Theorem 2)
4. Non-overfitting
5. Optimality under exhaustive search
Analyses (II)
3. Subspace is important for discriminative patterns
Original set: no information gain when P1/C1 = P0/C0, where C1 and C0 are the numbers of examples belonging to class 1 and class 0, P1 is the number of examples in C1 that contain a pattern α, and P0 is the number of examples in C0 that contain the same pattern α.
Subsets could still have information gain.
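Analysis 3 can be checked numerically. Below is a hypothetical worked example: a pattern α that covers the two classes in proportion to their sizes has zero information gain on the full dataset, yet the same pattern evaluated on a subset of examples (such as those reaching one tree node) can have clearly positive gain.

```python
from math import log2

def info_gain(pos_in, neg_in, pos_out, neg_out):
    """InfoGain of splitting on alpha; *_in counts examples containing alpha."""
    def H(p, n):
        t = p + n
        if t == 0 or p == 0 or n == 0:
            return 0.0
        return -(p / t) * log2(p / t) - (n / t) * log2(n / t)
    t = pos_in + neg_in + pos_out + neg_out
    return (H(pos_in + pos_out, neg_in + neg_out)
            - (pos_in + neg_in) / t * H(pos_in, neg_in)
            - (pos_out + neg_out) / t * H(pos_out, neg_out))

# Full dataset: alpha covers 2 of 4 positives and 2 of 4 negatives
# (P1/C1 = P0/C0), so InfoGain is zero.
ig_full = info_gain(2, 2, 2, 2)
# Subset reached by one tree branch: alpha covers 2 positives and 0 negatives.
ig_subset = info_gain(2, 0, 1, 3)
```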
Experimental Studies: Itemset Mining (I)
Scalability Comparison
[Figure: Log(DT #Pat) vs. Log(MbT #Pat), and Log(DT Abs Support) vs. Log(MbT Abs Support), on Adult, Chess, Hypo, Sick, Sonar]

Dataset   #Pat using MbT sup   Ratio (MbT #Pat / #Pat using MbT sup)
Adult     252,809              0.41%
Chess     +∞                   ~0%
Hypo      423,439              0.0035%
Sick      4,818,391            0.00032%
Sonar     95,507               0.00775%
Experimental Studies: Itemset Mining (II)
Accuracy of Mined Itemsets
[Figure: DT Accuracy vs. MbT Accuracy (70%-100%) on Adult, Chess, Hypo, Sick, Sonar: 4 wins, 1 loss for MbT]
[Figure: Log(DT #Pat) vs. Log(MbT #Pat): MbT uses a much smaller number of patterns]
Experimental Studies: Itemset Mining (III)
Convergence
Experimental Studies: Graph Mining (I)
9 NCI anti-cancer screen datasets (The PubChem Project, pubchem.ncbi.nlm.nih.gov); active (positive) class: around 1%-8.3%
2 AIDS anti-viral screen datasets (http://dtp.nci.nih.gov); H1: CM+CA, 3.5%; H2: CA, 1%
[Figure: an example chemical compound structure]
Experimental Studies: Graph Mining (II)
Scalability
[Figure: DT #Pat vs. MbT #Pat (0-1800), and Log(DT Abs Support) vs. Log(MbT Abs Support), on NCI1, NCI33, NCI41, NCI47, NCI81, NCI83, NCI109, NCI123, NCI145, H1, H2]
Experimental Studies: Graph Mining (III)
AUC and Accuracy
[Figure: AUC (0.5-0.8), DT vs. MbT on NCI1, NCI33, NCI41, NCI47, NCI81, NCI83, NCI109, NCI123, NCI145, H1, H2: 11 wins for MbT]
[Figure: Accuracy (0.88-1.0), DT vs. MbT on the same datasets: 10 wins, 1 loss for MbT]
Experimental Studies: Graph Mining (IV)
AUC of MbT and DT; MbT vs. benchmarks: 7 wins, 4 losses
Summary
Model-based Search Tree:
Integrated feature mining and construction
Dynamic support: can mine patterns with extremely low support
Both a feature construction method and a classifier
Not limited to one type of frequent pattern: plug-and-play
Experiment results: itemset mining and graph mining
Software and dataset available from: www.cs.columbia.edu/~wfan