Direct Mining of Discriminative and Essential Frequent Patterns via Model-based Search Tree
Wei Fan, Kun Zhang, Hong Cheng,
Jing Gao, Xifeng Yan, Jiawei Han,
Philip S. Yu, Olivier Verscheure
How to find good features from semi-structured raw data for classification?
Feature Construction
Most data mining and machine learning models assume the following structured data: (x1, x2, ..., xk) -> y, where the xi's are independent variables and y is the dependent variable.
y drawn from a discrete set: classification. y drawn from a continuous range: regression.
When the feature vectors are good, the differences in accuracy among learners are small.
Question: where do good features come from?
Frequent Pattern-Based Feature Extraction
Data not in pre-defined feature vectors:
Transactions
Biological sequences
Graph databases
A frequent pattern is a good candidate for discriminative features. So, how to mine them?
FP: Sub-graph
[Figure: a discovered sub-graph pattern shared by the chemical compounds NSC 4960, NSC 191370, NSC 40773, NSC 164863, and NSC 699181]
(example borrowed from George Karypis' presentation)
Frequent Pattern Feature Vector Representation
        P1  P2  P3
Data1   1   1   0
Data2   1   0   1
Data3   1   1   0
Data4   0   0   1
...
[Figure: decision tree on iris data, splitting on Petal.Length < 2.45 (setosa) and Petal.Width < 1.75 (versicolor vs. virginica)]
Any classifier you can name: NN, DT, SVM, LR
Mining these predictive features is an NP-hard problem.
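The table above can be produced mechanically once patterns are mined. A minimal sketch, with hypothetical transactions and patterns, of the feature-vector representation: column Pi is 1 iff the example contains frequent pattern Pi.

```python
# Hypothetical transactions and mined patterns, chosen to reproduce the table.
transactions = [
    {"a", "b", "c"},  # Data1
    {"a", "d"},       # Data2
    {"a", "b", "e"},  # Data3
    {"d", "f"},       # Data4
]
patterns = [frozenset("a"), frozenset("ab"), frozenset("d")]  # P1, P2, P3

def to_feature_vector(transaction, patterns):
    """Binary indicator vector: 1 if the pattern is contained in the transaction."""
    return [int(p <= transaction) for p in patterns]

vectors = [to_feature_vector(t, transactions and patterns) for t in transactions]
vectors = [to_feature_vector(t, patterns) for t in transactions]
# Reproduces the table: [[1, 1, 0], [1, 0, 1], [1, 1, 0], [0, 0, 1]]
```

Any standard classifier (NN, DT, SVM, LR) can then be trained on `vectors`.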
100 examples can generate up to 10^10 patterns; most are useless.
Example: 192 examples.
At 12% support (at least 12% of examples contain the pattern), 8,600 itemsets are returned: 192 examples vs. 8,600 patterns?
At 4% support, 92,000 patterns are returned: 192 vs. 92,000??
Most patterns have no predictive power and cannot be used to construct features.
Our algorithm: finding only 20 highly predictive patterns can construct a decision tree with about 90% accuracy.
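The explosion as support drops can be seen even on a toy scale. A hypothetical brute-force illustration (random transactions, not the paper's data): lowering minimum support multiplies the number of frequent itemsets returned.

```python
import random
from itertools import combinations

# Hypothetical toy data: 50 transactions of 5 items drawn from a 10-item alphabet.
random.seed(0)
items = "abcdefghij"
data = [frozenset(random.sample(items, 5)) for _ in range(50)]

def count_frequent(data, min_sup):
    """Count all itemsets whose support meets min_sup (brute-force enumeration)."""
    universe = sorted({i for t in data for i in t})
    total = 0
    for size in range(1, len(universe) + 1):
        for combo in combinations(universe, size):
            if sum(set(combo) <= t for t in data) >= min_sup * len(data):
                total += 1
    return total

high_sup = count_frequent(data, 0.50)
low_sup = count_frequent(data, 0.10)
# Every itemset frequent at 50% support is frequent at 10%, plus many more.
```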
Data in a "bad" feature space
Discriminative patterns: a non-linear combination of single feature(s)
Increase the expressive and discriminative power of the feature space
An example:
X   Y   C
0   0   0
1   1   1
-1  1   1
1   -1  1
-1  -1  1
Data is non-linearly separable in (x, y).
[Figure: the five points plotted in the (x, y) plane]
New feature space: data is linearly separable in (x, y, F).
Mine & Transform
Solving the problem: map data to a different space.

X   Y   C        X   Y   F: x=0, y=0   C
0   0   0        0   0   1             0
1   1   1        1   1   0             1
-1  1   1        -1  1   0             1
1   -1  1        1   -1  0             1
-1  -1  1        -1  -1  0             1

[Figure: the augmented points in (x, y, F) space, now linearly separable]
ItemSet F: x=0, y=0
Association rule F: x=0 -> y=0
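The mapping in the table can be sketched in a few lines: the itemset F: x=0, y=0 becomes one extra binary column, after which the XOR-like data is linearly separable (predict class 1 whenever F = 0).

```python
# The slide's five points as (x, y, class) triples.
points = [(0, 0, 0), (1, 1, 1), (-1, 1, 1), (1, -1, 1), (-1, -1, 1)]

def pattern_feature(x, y):
    """F fires exactly when the example contains both items x=0 and y=0."""
    return int(x == 0 and y == 0)

augmented = [(x, y, pattern_feature(x, y), c) for x, y, c in points]
# A single linear threshold on the new axis separates the classes: class = 1 - F.
predictions = [1 - f for _, _, f, _ in augmented]
```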
Computational Issues
A pattern is measured by its "frequency" or support, e.g., frequent subgraphs with sup >= 10%: at least 10% of the examples contain these patterns.
"Ordered" enumeration: one cannot enumerate patterns with sup = 10% without first enumerating all patterns with sup > 10%.
NP-hard problem: easily up to 10^10 patterns for a realistic problem. Most patterns are non-discriminative, yet low-support patterns can have high discriminative power. Bad!
Random sampling does not work since it is not exhaustive: randomly sampled patterns (or blind enumeration without considering frequency) are useless.
Small number of examples: searching a subset of the vocabulary gives an incomplete search; searching the complete vocabulary does not help much but introduces a sample selection bias problem, in particular missing low-support but high-information-gain patterns.
Conventional Procedure: Feature Construction and Selection (Two-Step Batch Method)
1. Mine frequent patterns (> minsup): [frequent patterns 1, 2, 3, 4, 5, 6, 7, ...]
2. Select the most discriminative patterns (e.g., 1, 2, 4);
3. Represent data in the feature space using such patterns;
4. Build classification models.

        F1  F2  F4
Data1   1   1   0
Data2   1   0   1
Data3   1   1   0
Data4   0   0   1
...

[Figure: decision tree on iris data, splitting on Petal.Length < 2.45 (setosa) and Petal.Width < 1.75 (versicolor vs. virginica)]
Any classifier you can name: NN, DT, SVM, LR
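The two-step batch method can be sketched end to end on hypothetical toy data (this is an illustration of the conventional baseline, not the paper's implementation): step 1 enumerates all frequent itemsets, step 2 ranks them by InfoGain against the full dataset and keeps the top-k.

```python
from itertools import combinations
from math import log2

# Hypothetical labeled transactions: (itemset, class).
data = [({"a", "b"}, 1), ({"a", "c"}, 1), ({"b", "c"}, 0), ({"c"}, 0)]

def entropy(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 0.0 if p in (0.0, 1.0) else -(p * log2(p) + (1 - p) * log2(1 - p))

def info_gain(pattern, data):
    """InfoGain of splitting the dataset on containment of the pattern."""
    inside = [y for x, y in data if pattern <= x]
    outside = [y for x, y in data if not pattern <= x]
    n = len(data)
    return (entropy([y for _, y in data])
            - len(inside) / n * entropy(inside)
            - len(outside) / n * entropy(outside))

def mine_frequent(data, min_sup):
    """Step 1 (combinatorial part): brute-force enumerate all frequent itemsets."""
    items = sorted({i for x, _ in data for i in x})
    frequent = []
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            p = frozenset(combo)
            if sum(p <= x for x, _ in data) / len(data) >= min_sup:
                frequent.append(p)
    return frequent

patterns = mine_frequent(data, min_sup=0.5)
# Step 2: select by InfoGain measured on the COMPLETE dataset.
top = sorted(patterns, key=lambda p: info_gain(p, data), reverse=True)[:2]
```

Note that both weaknesses discussed next are visible here: step 1 enumerates everything above minsup, and step 2 scores each pattern only against the whole dataset.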
Two Problems
Mine step: combinatorial explosion
[Figure: mining the dataset yields frequent patterns 1, 2, 3, 4, 5, 6, 7, ...]
1. exponential explosion
2. patterns not considered if minsupport isn't small enough
Two Problems
Select step: issue of discriminative power
[Figure: from frequent patterns 1-7, only 1, 2, 4 are selected as discriminative]
3. InfoGain is evaluated against the complete dataset, NOT on subsets of examples
4. Correlation is not directly evaluated on the patterns' joint predictability
Direct Mining & Selection via Model-based Search Tree: Basic Flow
[Figure: divide-and-conquer based frequent pattern mining. Starting from the full dataset at node 1, each node runs "Mine & Select" with local support P = 20%, keeps the most discriminative feature F based on InfoGain, and splits the examples on containment of F (Y/N) into child nodes 2, 3, 4, 5, 6, 7, ...; recursion stops when a node has few data.]
The tree is both a feature miner and a classifier, and returns a compact set of highly discriminative patterns.
Global support: 10 * 20% / 10000 = 0.02%.
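The flow above can be sketched as a short recursive procedure. This is a compact, hypothetical illustration of the Model-based Search Tree idea, not the authors' implementation: every node mines frequent itemsets from only its local examples with a fixed local support (e.g., P = 20%), keeps the single pattern with the highest InfoGain, splits on pattern containment (Y/N), and recurses until a node is small or pure.

```python
from itertools import combinations
from math import log2

def entropy(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 0.0 if p in (0.0, 1.0) else -(p * log2(p) + (1 - p) * log2(1 - p))

def info_gain(pattern, data):
    inside = [y for x, y in data if pattern <= x]
    outside = [y for x, y in data if not pattern <= x]
    n = len(data)
    return (entropy([y for _, y in data])
            - len(inside) / n * entropy(inside)
            - len(outside) / n * entropy(outside))

def mine_frequent(data, min_sup):
    items = sorted({i for x, _ in data for i in x})
    return [frozenset(c)
            for size in (1, 2)  # toy limit: itemsets up to size 2
            for c in combinations(items, size)
            if sum(set(c) <= x for x, _ in data) / len(data) >= min_sup]

def build_mbt(data, local_sup=0.2, min_node=2):
    """Recursively mine & select on the LOCAL examples, then split and recurse."""
    labels = [y for _, y in data]
    majority = max(set(labels), key=labels.count)
    if len(data) < min_node or len(set(labels)) == 1:
        return {"leaf": majority}  # few data or pure node: stop
    patterns = mine_frequent(data, local_sup)  # local support on local data
    best = max(patterns, key=lambda p: info_gain(p, data), default=None)
    if best is None or info_gain(best, data) <= 0:
        return {"leaf": majority}
    return {"pattern": best,
            "yes": build_mbt([(x, y) for x, y in data if best <= x],
                             local_sup, min_node),
            "no": build_mbt([(x, y) for x, y in data if not best <= x],
                            local_sup, min_node)}

# Toy usage on hypothetical data: the root mines {a} as the most
# discriminative pattern and splits the examples perfectly.
data = [({"a", "b"}, 1), ({"a", "c"}, 1), ({"b", "c"}, 0), ({"c", "d"}, 0)]
tree = build_mbt(data)
```

Because each node re-mines with local support relative to its (shrinking) subset, patterns whose global support is extremely low can still be found, which is where the 10 * 20% / 10000 = 0.02% global support figure comes from.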
Analyses (I)
1. Scalability (Theorem 1): upper bound; "scale down" ratio to obtain extremely low-support patterns
2. Bound on the number of returned features (Theorem 2)
4. Non-overfitting
5. Optimality under exhaustive search
Analyses (II)
3. Subspace is important for discriminative patterns
Original set: no information gain when P1/C1 = P0/C0, where C1 and C0 are the numbers of examples belonging to class 1 and class 0, P1 is the number of examples in C1 that contain a pattern α, and P0 is the number of examples in C0 that contain the same pattern α.
Subsets could still have information gain.
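Analysis 3 can be checked numerically. Below is a hypothetical worked example: a pattern α that covers the two classes in proportion to their sizes has zero information gain on the full dataset, yet the same pattern evaluated on a subset of examples (such as those reaching one tree node) can have clearly positive gain.

```python
from math import log2

def info_gain(pos_in, neg_in, pos_out, neg_out):
    """InfoGain of splitting on alpha; *_in counts examples containing alpha."""
    def H(p, n):
        t = p + n
        if t == 0 or p == 0 or n == 0:
            return 0.0
        return -(p / t) * log2(p / t) - (n / t) * log2(n / t)
    t = pos_in + neg_in + pos_out + neg_out
    return (H(pos_in + pos_out, neg_in + neg_out)
            - (pos_in + neg_in) / t * H(pos_in, neg_in)
            - (pos_out + neg_out) / t * H(pos_out, neg_out))

# Full dataset: alpha covers 2 of 4 positives and 2 of 4 negatives
# (P1/C1 = P0/C0), so InfoGain is zero.
ig_full = info_gain(2, 2, 2, 2)
# Subset reached by one tree branch: alpha covers 2 positives and 0 negatives.
ig_subset = info_gain(2, 0, 1, 3)
```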
Experimental Studies: Itemset Mining (I)
Scalability Comparison
[Figure: Log(DT #Pat) vs. Log(MbT #Pat), and Log(DT Abs Support) vs. Log(MbT Abs Support), on Adult, Chess, Hypo, Sick, Sonar]

Dataset   #Pat using MbT sup   Ratio (MbT #Pat / #Pat using MbT sup)
Adult     252,809              0.41%
Chess     +∞                   ~0%
Hypo      423,439              0.0035%
Sick      4,818,391            0.00032%
Sonar     95,507               0.00775%
Experimental Studies: Itemset Mining (II)
Accuracy of Mined Itemsets
[Figure: DT Accuracy vs. MbT Accuracy (70%-100%) on Adult, Chess, Hypo, Sick, Sonar: 4 wins, 1 loss for MbT]
[Figure: Log(DT #Pat) vs. Log(MbT #Pat): MbT uses a much smaller number of patterns]
Experimental Studies: Itemset Mining (III)
Convergence
Experimental Studies: Graph Mining (I)
9 NCI anti-cancer screen datasets (The PubChem Project, pubchem.ncbi.nlm.nih.gov); active (positive) class: around 1%-8.3%
2 AIDS anti-viral screen datasets (http://dtp.nci.nih.gov); H1: CM+CA, 3.5%; H2: CA, 1%
[Figure: an example chemical compound structure]
Experimental Studies: Graph Mining (II)
Scalability
[Figure: DT #Pat vs. MbT #Pat (0-1800), and Log(DT Abs Support) vs. Log(MbT Abs Support), on NCI1, NCI33, NCI41, NCI47, NCI81, NCI83, NCI109, NCI123, NCI145, H1, H2]
Experimental Studies: Graph Mining (III)
AUC and Accuracy
[Figure: AUC (0.5-0.8), DT vs. MbT on NCI1, NCI33, NCI41, NCI47, NCI81, NCI83, NCI109, NCI123, NCI145, H1, H2: 11 wins for MbT]
[Figure: Accuracy (0.88-1.0), DT vs. MbT on the same datasets: 10 wins, 1 loss for MbT]
Experimental Studies: Graph Mining (IV)
AUC of MbT and DT; MbT vs. benchmarks: 7 wins, 4 losses
Summary
Model-based Search Tree:
Integrated feature mining and construction
Dynamic support: can mine patterns with extremely low support
Both a feature construction method and a classifier
Not limited to one type of frequent pattern: plug-and-play
Experiment results: itemset mining and graph mining
Software and dataset available from: www.cs.columbia.edu/~wfan