Page 1:

From Feature Construction, to Simple but Effective

Modeling, to Domain Transfer

Wei Fan, IBM T.J. Watson

www.cs.columbia.edu/~wfan | www.weifan.info

[email protected], [email protected]

Page 2:

Feature Vector

Most data mining and machine learning models assume structured data of the form (x1, x2, ..., xk) -> y, where the xi's are independent variables and y is the dependent variable.

y drawn from a discrete set: classification. y drawn from a continuous range: regression.

Page 3:

Frequent Pattern-Based Feature Construction

Some data does not come in pre-defined feature vectors:

Transactions

Biological sequences

Graph databases

Frequent patterns are good candidates for discriminative features. So, how do we mine them?

Page 4:

FP: Sub-graph

A discovered pattern

[Figure: molecular structures of compounds NSC 4960, NSC 191370, NSC 40773, NSC 164863, and NSC 699181, each containing the discovered sub-graph pattern]

(example borrowed from George Karypis' presentation)

Page 5:

Computational Issues

A pattern is measured by its "frequency" or support, e.g., frequent subgraphs with sup >= 10%.

Cannot enumerate patterns with sup = 10% without first enumerating all patterns with sup > 10%.

Random sampling does not work, since it is not exhaustive.

NP-hard problem.
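To make the support notion concrete, here is a minimal sketch (not from the slides) of counting support and naively enumerating frequent itemsets; the function names (`support`, `frequent_itemsets`) and the toy transactions are illustrative only, and the brute-force enumeration is exactly what explodes combinatorially as min-support shrinks.

```python
from itertools import combinations

def support(pattern, transactions):
    """Fraction of transactions that contain every item of the pattern."""
    hits = sum(1 for t in transactions if pattern <= t)
    return hits / len(transactions)

def frequent_itemsets(transactions, min_sup=0.10, max_size=3):
    """Naive enumeration of all itemsets with support >= min_sup."""
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, max_size + 1):
        for cand in combinations(items, k):
            s = support(frozenset(cand), transactions)
            if s >= min_sup:
                result[cand] = s
    return result

# toy usage
transactions = [frozenset(t) for t in
                [{"a", "b", "c"}, {"a", "c"}, {"a", "d"}, {"b", "c", "d"}]]
print(frequent_itemsets(transactions, min_sup=0.5))
```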

Page 6:

Conventional Procedure: Feature Construction followed by Selection (the Two-Step Batch Method)

1. Mine frequent patterns (support > sup) from the dataset (frequent patterns 1, 2, 3, 4, 5, 6, 7, ...).

2. Select the most discriminative patterns (e.g., patterns 1, 2, 4).

3. Represent the data in the feature space defined by the selected patterns, e.g.:

          F1   F2   F4
   Data1   1    1    0
   Data2   1    0    1
   Data3   1    1    0
   Data4   0    0    1

4. Build classification models: any classifier you can name (NN, DT, SVM, LR, ...), e.g., a decision tree splitting on Petal.Length < 2.45 and Petal.Width < 1.75 to separate setosa, versicolor, and virginica.
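As a rough illustration of this two-step batch pipeline (not code from the talk), the sketch below takes already-mined frequent patterns from step 1, binarizes the data over them, scores each pattern's discriminative power against the class labels (an information-gain-style criterion), and fits an off-the-shelf classifier on the selected features; the function and variable names are made up for the example.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.tree import DecisionTreeClassifier

def two_step_batch(transactions, labels, patterns, top_k=5):
    """patterns: frequent itemsets already mined in step 1 (e.g., by Apriori/FP-growth)."""
    # Steps 2-3: represent each transaction as a 0/1 vector over the mined patterns,
    # then score each pattern's discriminative power against the class labels.
    X = np.array([[1 if set(p) <= set(t) else 0 for p in patterns] for t in transactions])
    scores = mutual_info_classif(X, labels, discrete_features=True)
    keep = np.argsort(scores)[::-1][:top_k]          # most discriminative patterns
    # Step 4: build a classification model on the selected binary features.
    clf = DecisionTreeClassifier().fit(X[:, keep], labels)
    return clf, [patterns[i] for i in keep]

# toy usage
trans = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
pats = [("a",), ("b",), ("c",), ("a", "b")]
labels = [1, 0, 0, 1]
model, selected = two_step_batch(trans, labels, pats, top_k=2)
```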

Page 7:

Two Problems

Mine step: combinatorial explosion

1. Exponential explosion of the frequent pattern space.
2. Patterns are not considered at all if min-support isn't small enough.

Page 8:

Two Problems

Select step: issue of discriminative power

3. InfoGain is computed against the complete dataset, NOT on subsets of examples.
4. Correlation among patterns is not directly evaluated on their joint predictability.

Page 9:

Direct Mining & Selection via Model-based Search Tree

Basic Flow: Divide-and-Conquer Based Frequent Pattern Mining

[Flow diagram: starting from the whole dataset at the root node (1), mine frequent patterns on that node's data with support P = 20%, select the most discriminative feature F based on InfoGain, and split the data on it (Y/N); recurse on the child nodes (2, 3, 4, 5, 6, 7, ...) until a node has too few data points (leaf "+"). The tree is both the feature miner and the classifier, and returns a compact set of highly discriminative patterns. Global support reached: 10 * 20% / 10000 = 0.02%.]
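A compact sketch of this divide-and-conquer idea (an illustration, not the authors' implementation): the pattern miner, the scoring function, and the `matches` pattern API are abstract placeholders supplied by the caller.

```python
import numpy as np

def mbt(data, labels, mine_patterns, info_gain, min_size=10, local_sup=0.2):
    """Model-based search tree: mine and select features per node, then recurse.

    mine_patterns(data, sup) -> candidate patterns frequent on *this node's* data
    info_gain(pattern, data, labels) -> discriminative score of the pattern here
    """
    node = {"n": len(data)}
    if len(data) <= min_size or len(set(labels)) == 1:
        node["prediction"] = max(set(labels), key=list(labels).count)  # leaf: majority class
        return node
    # Mine with support relative to the data that reached this node, so patterns
    # with very small *global* support become reachable deeper in the tree.
    candidates = mine_patterns(data, local_sup)
    if not candidates:
        node["prediction"] = max(set(labels), key=list(labels).count)
        return node
    best = max(candidates, key=lambda p: info_gain(p, data, labels))
    mask = np.array([best.matches(x) for x in data])    # hypothetical pattern API
    if mask.all() or not mask.any():                    # degenerate split -> leaf
        node["prediction"] = max(set(labels), key=list(labels).count)
        return node
    node["pattern"] = best
    node["yes"] = mbt([d for d, m in zip(data, mask) if m],
                      [l for l, m in zip(labels, mask) if m],
                      mine_patterns, info_gain, min_size, local_sup)
    node["no"] = mbt([d for d, m in zip(data, mask) if not m],
                     [l for l, m in zip(labels, mask) if not m],
                     mine_patterns, info_gain, min_size, local_sup)
    return node
```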

Page 10:

Analyses (I)

1. Scalability of pattern enumeration

Upper bound (Theorem 1):

“Scale down” ratio:

2. Bound on number of returned features

Page 11:

Analyses (II)

3. Subspace pattern selection

Original set:

Subset: 4. Non-overfitting

5. Optimality under exhaustive search

Page 12:

Experimental Studies: Itemset Mining (I)

Scalability Comparison

[Bar charts: Log(#Pat) for DT vs. MbT, and Log(absolute support) for DT vs. MbT, on Adult, Chess, Hypo, Sick, Sonar]

Datasets   #Pat using MbT sup   Ratio (MbT #Pat / #Pat using MbT sup)
Adult      252809               0.41%
Chess      +∞                   ~0%
Hypo       423439               0.0035%
Sick       4818391              0.00032%
Sonar      95507                0.00775%


Page 13:

Experimental Studies: Itemset Mining (II)

Accuracy of Mined Itemsets

[Bar chart: DT Accuracy vs. MbT Accuracy (70%-100%) on Adult, Chess, Hypo, Sick, Sonar: 4 wins, 1 loss for MbT]

[Bar chart: Log(DT #Pat) vs. Log(MbT #Pat) on the same datasets: but MbT uses a much smaller number of patterns]

Page 14:

Experimental Studies: Itemset Mining (III)

Convergence

Page 15:

Experimental Studies: Graph Mining (I)

9 NCI anti-cancer screen datasets (The PubChem Project, pubchem.ncbi.nlm.nih.gov); active (positive) class: around 1% - 8.3%.

2 AIDS anti-viral screen datasets (http://dtp.nci.nih.gov); H1: CM+CA - 3.5%; H2: CA - 1%.


Page 16:

Experimental Studies: Graph Mining (II) Scalability

[Bar charts: number of patterns (DT #Pat vs. MbT #Pat, 0-1800) and Log(absolute support) (DT vs. MbT) on NCI1, NCI33, NCI41, NCI47, NCI81, NCI83, NCI109, NCI123, NCI145, H1, H2]

Page 17:

Experimental Studies: Graph Mining (III) AUC and Accuracy

[Bar chart: AUC (0.5-0.8) of DT vs. MbT on NCI1, NCI33, NCI41, NCI47, NCI81, NCI83, NCI109, NCI123, NCI145, H1, H2: 11 wins for MbT]

[Bar chart: Accuracy (0.88-1) of DT vs. MbT on the same datasets: 10 wins, 1 loss for MbT]

Page 18:

Experimental Studies: Graph Mining (IV)

AUC of MbT and DT; MbT vs. benchmarks: 7 wins, 4 losses.

Page 19:

Summary: Model-based Search Tree

Integrated feature mining and construction. Dynamic support: can mine patterns with extremely small support. It is both a feature constructor and a classifier. Not limited to one type of frequent pattern: plug-and-play.

Experiment results: itemset mining and graph mining.

New: found a DNA sequence pattern not previously reported, but which can be explained biologically.

Code and dataset available for download

Page 20:

Even though the true distribution is unknown, we still assume the data is generated by some known functional form, and estimate the parameters inside that function via the training data (e.g., CV on the training data).

[Diagram: some unknown distribution approximated by candidate Models 1-6]

How to train models?

There probably will always be mistakes unless: 1. the chosen model indeed generates the distribution; 2. the data is sufficient to estimate those parameters.

But what if you don't know which model to choose, or use the wrong one?

List of methods: Logistic Regression, Probit models, Naïve Bayes, Kernel Methods, Linear Regression, RBF, Mixture models.

After the structure is fixed, learning becomes optimization to minimize errors: quadratic loss, exponential loss, slack variables.
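For concreteness, the standard forms of these objectives (not spelled out on the slide) are:

```latex
\begin{align*}
\text{quadratic loss:}\quad & \min_\theta \sum_i \bigl(y_i - f_\theta(x_i)\bigr)^2 \\
\text{exponential loss:}\quad & \min_\theta \sum_i \exp\bigl(-y_i f_\theta(x_i)\bigr) \\
\text{slack variables (soft margin):}\quad & \min_{w,\,\xi} \tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i
  \quad \text{s.t. } y_i(w^\top x_i + b) \ge 1 - \xi_i,\ \xi_i \ge 0
\end{align*}
```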

Page 21:

How to train models II

Not sure of the exact function; instead, use a family of "free-form" functions chosen by some "preference criteria".

There probably will always be mistakes unless: the training data is sufficiently large, and the free-form function/criteria are appropriate.

List of methods: Decision Trees, RIPPER rule learner, CBA (association rules), clustering-based methods, ...

Preference criteria: the simplest hypothesis that fits the data is the best. Heuristics: info gain, gini index, Kearns-Mansour, etc.; pruning: MDL pruning, reduced-error pruning, cost-based pruning. Truth: none of these purity-check functions guarantees accuracy on unseen test data; they only try to build a smaller model.

Page 22:

Can Data Speak for Themselves?

Make no assumption about the true model: neither a parametric form nor a free form.

"Encode" the data in some rather "neutral" representation. Think of it like encoding numbers in a computer's binary representation: some numbers can never be represented exactly, but overall the encoding is accurate enough.

Main challenge: avoid "rote learning" (do not just remember every detail); generalize. "Evenly" representing "numbers" corresponds to "evenly" encoding the "data".

Page 23:

Potential Advantages

If the accuracy is quite good, then the method is quite "automatic and easy" to use: a no-brainer, and data mining can be everybody's tool.

Page 24:

Encoding Data for Major Problems

Classification: given a set of labeled data items, such as (amt, merchant category, outstanding balance, date/time, ...), where the label is whether the transaction is a fraud or non-fraud. Label: a set of discrete values. Classifier: predict whether a transaction is a fraud or non-fraud.

Probability estimation: similar to the above setting, but estimate the probability that a transaction is a fraud. Difference: no truth is given, i.e., no true probability is observed.

Regression: given a set of valued data items, such as (zipcode, capital gain, education, ...), the value of interest is annual gross income. Target value: continuous.

Several other on-going problems.

Page 25:

Encoding Data in Decision Trees

Think of each tree as a way to "encode" the training data. Why a tree? A decision tree records some common characteristics of the data, but not every piece of trivial detail. Obviously, each tree encodes the data differently. Subjective criteria that prefer some encodings over others are always ad hoc; so do not prefer anything, and just pick encodings randomly. Minimize the difference by using multiple encodings, and then "average" them.

[Scatter plot (shown twice): Iris data, Petal length (1-7) vs. Petal width (0.5-2.5), with classes setosa, versicolor, virginica]

Page 26:

Random Decision Tree to Encode Data

-classification, regression, probability estimation

At each node, an un-used feature is chosen randomly. A discrete feature is un-used if it has never been chosen previously on the decision path from the root to the current node. A continuous feature can be chosen multiple times on the same decision path, but each time a different random threshold value is chosen.

Page 27:

Continued

We stop splitting when one of the following happens: a node becomes too small (<= 3 examples), or the total height of the tree exceeds some limit, such as the total number of features.
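A minimal sketch of this construction (an illustration only; the original RDT code is available from the author's site). The data layout, the dict-based tree, and the treatment of discrete features as binary are assumptions made for the example.

```python
import random

def build_rdt(X, y, features, depth=0, max_depth=None, min_leaf=3):
    """Grow one random decision tree: features and thresholds are picked at random;
    labels are only used at the leaves, where class counts ("node statistics") are kept.

    X: list of dict-like examples; y: labels;
    features: dict name -> 'discrete' or 'continuous' (assumed input format).
    """
    max_depth = max_depth if max_depth is not None else len(features)
    classes = sorted(set(y))
    if len(X) <= min_leaf or depth >= max_depth or not features:
        return {"leaf": True,
                "counts": {c: sum(1 for lbl in y if lbl == c) for c in classes}}
    name = random.choice(list(features))
    if features[name] == "discrete":
        remaining = {f: t for f, t in features.items() if f != name}  # used once per path
        thr = 0                                  # assume a binary feature: split on value == 0
        left = [i for i, x in enumerate(X) if x[name] == thr]
    else:
        remaining = features                      # continuous: may be chosen again later
        lo, hi = min(x[name] for x in X), max(x[name] for x in X)
        thr = random.uniform(lo, hi)              # a different random threshold each time
        left = [i for i, x in enumerate(X) if x[name] < thr]
    left_set = set(left)
    right = [i for i in range(len(X)) if i not in left_set]
    if not left or not right:                     # degenerate split: make a leaf instead
        return {"leaf": True,
                "counts": {c: sum(1 for lbl in y if lbl == c) for c in classes}}
    return {"leaf": False, "feature": name, "threshold": thr,
            "left": build_rdt([X[i] for i in left], [y[i] for i in left], remaining,
                              depth + 1, max_depth, min_leaf),
            "right": build_rdt([X[i] for i in right], [y[i] for i in right], remaining,
                               depth + 1, max_depth, min_leaf)}
```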

Page 28:

Illustration of RDT

[Diagram: features B1: {0,1}, B2: {0,1}, B3: continuous. B1 is chosen randomly at the root (test B1 == 0?); on one branch B2 is chosen randomly (B2 == 0?); B3 is chosen randomly with a random threshold 0.3 (B3 < 0.3?) and may be chosen again deeper with another random threshold 0.6 (B3 < 0.6?).]

Page 29:

Classification

[Decision tree on the Iris data: splits on Petal.Length < 2.45 and Petal.Width < 1.75; leaf class counts (setosa/versicolor/virginica): 50/0/0, 0/49/5, 0/1/45]

For an example x falling into the 0/49/5 leaf:

P(setosa | x, θ) = 0
P(versicolor | x, θ) = 49/54
P(virginica | x, θ) = 5/54

Page 30:

Regression

[The same tree used for regression; leaf values: setosa height = 10 in, versicolor height = 15 in, virginica height = 12 in]

15 in is the average value of all examples in that leaf node.

Page 31:

Prediction

Simply average over multiple trees.
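Continuing the earlier sketch (illustrative names, not the original implementation), prediction just averages the leaf posteriors of many random trees; `classes` is the global list of class labels assumed as input.

```python
def tree_posterior(node, x, classes):
    """Follow x down one random tree and return the leaf's class distribution."""
    while not node["leaf"]:
        go_left = (x[node["feature"]] == node["threshold"]
                   if isinstance(node["threshold"], int)     # discrete split from build_rdt
                   else x[node["feature"]] < node["threshold"])
        node = node["left"] if go_left else node["right"]
    total = sum(node["counts"].values()) or 1
    return [node["counts"].get(c, 0) / total for c in classes]

def rdt_predict_proba(trees, x, classes):
    """RDT prediction: simply average P(y | x, tree_i) over all trees."""
    probs = [tree_posterior(t, x, classes) for t in trees]
    return [sum(p[j] for p in probs) / len(trees) for j in range(len(classes))]
```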

Page 32:

Potential Advantage

Training can be very efficient, particularly for very large datasets. No cross-validation-based estimation of parameters, as needed by some parametric methods. Natural multi-class probability. Natural multi-label classification and probability estimation. Imposes very little about the structure of the model.

Page 33:

Reasons

The true distribution P(y|X) is never known. (Is it an elephant?)

Each random tree is not a random guess of this P(y|X): its structure is random, but its "node statistics" are not. Every random tree is consistent with the training data, and each tree is quite strong, not weak. In other words, if the distribution stays the same, each random tree by itself is a rather decent model.

Page 34:

Expected Error Reduction

Proven for quadratic loss, such as:

probability estimation: ( P(y|X) - P(y|X, θ) )^2
regression: ( y - f(X) )^2

General theorem: the expected quadratic loss of RDT (and any other model averaging) is less than that of any single combined model chosen "at random".
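The underlying step is the standard model-averaging argument (stated here for completeness, not copied from the slides): for quadratic loss, by Jensen's inequality,

```latex
\mathbb{E}_{x,y}\Bigl[\bigl(y - \tfrac{1}{k}\textstyle\sum_{i=1}^{k} f_{\theta_i}(x)\bigr)^{2}\Bigr]
\;\le\;
\frac{1}{k}\sum_{i=1}^{k}\,\mathbb{E}_{x,y}\bigl[(y - f_{\theta_i}(x))^{2}\bigr],
```

so the averaged model's expected quadratic loss is no worse than the average loss of the individual trees, i.e., the expected loss of a tree picked uniformly at random.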

Page 35:

Theorem Summary

Page 36:

Number of trees

Sampling theory: the random decision trees can be thought of as samples from a large (infinite, when continuous features exist) population of trees.

Unless the data is highly skewed, 30 to 50 trees give a pretty good estimate with reasonably small variance. In most cases, 10 trees are usually enough.

Page 37:

Variance Reduction

Page 38:

Optimal Decision Boundary

from Tony Liu’s thesis (supervised by Kai Ming Ting)

Page 39:

RDT looks like the optimal boundary.

Page 40:

Regression Decision Boundary (GUIDE)

Properties: broken and discontinuous; some points are far from the truth; some wrong ups and downs.

Page 41:

RDT Computed Function

Properties: smooth and continuous; close to the true function; all ups and downs are caught.

Page 42:

Hidden Variable

Page 43:

Hidden Variable: Limitation of GUIDE

Need to decide grouping variables and independent variables. A non-trivial task.

If all variables are categorical, GUIDE becomes a single CART regression tree.

Strong assumptions and greedy-based search can sometimes lead to very unexpected results.

Page 44:

It grows like …

Page 45:

ICDM’08 Cup Crown Winner

Nuclear ban monitoring: the RDT-based approach is the highest-award winner.

Page 46:

Ozone Level Prediction (ICDM'06 Best Application Paper)

Daily summary maps of two datasets from the Texas Commission on Environmental Quality (TCEQ).

Page 47:

SVM: 1-hr criteria CV

Page 48:

AdaBoost: 1-hr criteria CV

Page 49:

SVM: 8-hr criteria CV

Page 50:

AdaBoost: 8-hr criteria CV

Page 51:

Other Applications

Credit card fraud detection; late and default payment prediction; intrusion detection; semiconductor process control; trading anomaly detection.

Page 52:

Conclusion

Imposing a particular form of model may not be a good idea for training highly accurate models for general-purpose data mining; it may not even be efficient for some forms of models.

RDT has been shown to solve all three major problems in data mining (classification, probability estimation, and regression) simply, efficiently, and accurately.

When the physical truth is unknown, RDT is highly recommended. Code and datasets are available for download.

Page 53:

Standard Supervised Learning

[Diagram: a classifier is trained on labeled New York Times data and applied to unlabeled New York Times test data; accuracy 85.5%]

Page 54:

In Reality……

[Diagram: labeled New York Times data is not available, so the classifier is trained on labeled Reuters data and applied to unlabeled New York Times test data; accuracy 64.1%]

Page 55:

Domain Difference leads to a Performance Drop

train -> test
Ideal setting:     New York Times -> New York Times: classifier accuracy 85.5%
Realistic setting: Reuters -> New York Times: classifier accuracy 64.1%

Page 56:

A Synthetic Example

Training (have conflicting concepts)

Test

Partially overlapping

Page 57:

Goal

[Diagram: multiple source domains feeding one target domain]

To unify knowledge that is consistent with the test domain from multiple source domains (models).

Page 58:

Summary

Transfer from one or multiple source domains; the target domain has no labeled examples.

No re-training needed: rely on base models trained from each domain. The base models are not necessarily developed for transfer learning applications.

Page 59:

Locally Weighted Ensemble

Base models M1, M2, ..., Mk are trained on training sets 1, 2, ..., k; x denotes the feature vector of a test example and y a class label. Each model contributes f^i(x, y) = P(y | x, M_i), and the ensemble combines them with per-example weights w_i(x):

$$ f^{E}(x, y) = \sum_{i=1}^{k} w_i(x)\, f^{i}(x, y), \qquad \sum_{i=1}^{k} w_i(x) = 1, $$

and the prediction for x is $\; y \mid x = \arg\max_{y} f^{E}(x, y)$.
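A small sketch of this combination rule (illustrative only; how the weights w_i(x) are obtained is the subject of the following slides). It assumes sklearn-style base models exposing `predict_proba`, and a caller-supplied `weights_fn` that returns per-example weights.

```python
import numpy as np

def lwe_predict(models, weights_fn, x, classes):
    """Locally weighted ensemble: f_E(x, y) = sum_i w_i(x) * P(y | x, M_i)."""
    w = np.asarray(weights_fn(x), dtype=float)
    w = w / w.sum()                                    # enforce sum_i w_i(x) = 1
    probs = np.array([m.predict_proba([x])[0] for m in models])
    f_E = w @ probs                                    # weighted per-class posterior
    return classes[int(np.argmax(f_E))], f_E           # argmax_y f_E(x, y), plus the posterior
```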

Page 60:

Modified Bayesian Model Averaging

Bayesian Model Averaging: models M1, M2, ..., Mk are applied to the test set, each weighted by its posterior given the training data D:

$$ P(y \mid x) = \sum_{i=1}^{k} P(M_i \mid D)\, P(y \mid x, M_i). $$

Modified for Transfer Learning: the model weights are conditioned on the test example x rather than on the training data:

$$ P(y \mid x) = \sum_{i=1}^{k} P(M_i \mid x)\, P(y \mid x, M_i). $$

Page 61:

Global versus Local Weights

Example (training set shown): six examples with two features (x), true label y, each model's output, and its global weight wg versus its local (per-example) weight wl:

   x1     x2     y    M1    M1 wg   M1 wl   M2    M2 wg   M2 wl
  2.40   5.23    1    0.6    0.3     0.2    0.9    0.7     0.8
 -2.69   0.55    0    0.4    0.3     0.6    0.6    0.7     0.4
 -3.97  -3.62    0    0.2    0.3     0.7    0.4    0.7     0.3
  2.08  -3.73    0    0.1    0.3     0.5    0.1    0.7     0.5
  5.08   2.15    0    0.6    0.3     0.3    0.3    0.7     0.7
  1.43   4.48    1    1      0.3     1      0.2    0.7     0

Locally weighted scheme: the weight of each model is computed per example, and the weights are determined according to the models' performance on the test set, not the training set.

Page 62:

Synthetic Example Revisited

Training (have conflicting concepts)

Test

Partially overlapping

[Decision boundaries of M1 and M2 shown on the training and test data]

Page 63:

Optimal Local Weights

At a test example x, classifier C1 outputs the class distribution (0.9, 0.1) and classifier C2 outputs (0.4, 0.6), while the true distribution is (0.8, 0.2); C1 should therefore receive the higher weight.

The optimal weights are the solution to a regression problem H w = f:

$$ \begin{pmatrix} 0.9 & 0.4 \\ 0.1 & 0.6 \end{pmatrix} \begin{pmatrix} w_1 \\ w_2 \end{pmatrix} = \begin{pmatrix} 0.8 \\ 0.2 \end{pmatrix}, \qquad \sum_{i=1}^{k} w_i(x) = 1. $$
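Working this small system out (a check by direct arithmetic, not on the slide):

```latex
w_2 = 1 - w_1,\qquad 0.9\,w_1 + 0.4\,(1 - w_1) = 0.8 \;\Rightarrow\; 0.5\,w_1 = 0.4
\;\Rightarrow\; w_1 = 0.8,\; w_2 = 0.2,
```

and the second row is consistent: 0.1(0.8) + 0.6(0.2) = 0.2. The classifier whose prediction is closer to the truth (C1) indeed receives the higher weight.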

Page 64:

Approximate Optimal Weights

The optimal weights are impossible to get since the true f is unknown. How to approximate them: M_i should be assigned a higher weight at x if P(y|M_i, x) is closer to the true P(y|x).

If we had some labeled examples in the target domain, we could use them to compute the weights. But none of the examples in the target domain are labeled, so we need to make some assumptions about the relationship between feature values and class labels.

Page 65:

Clustering-Manifold Assumption

Test examples that are closer in feature space are more likely to share the same class label.

Page 66:

Graph-based Heuristics

Graph-based weight approximation: map the structures of the models onto the test domain and compare them with the clustering structure.

[Diagram: clustering structure of the test set vs. the partitions induced by M1 and M2, used to compute the weight on x]

Page 67:

Graph-based Heuristics

Local weight calculation: the weight of a model at x is proportional to the similarity between its neighborhood graph and the clustering structure around x; the model whose neighborhood structure agrees more receives the higher weight.
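One plausible instantiation of this heuristic (a sketch under assumptions, not the authors' exact formula): for each test point, take its nearest neighbors and score a model by how often the model puts the point and a neighbor in the same predicted class exactly when the clustering puts them in the same cluster. All names and the choice of KMeans are illustrative; the base models are assumed to expose a `predict` method.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def graph_based_weights(models, X_test, n_clusters=2, n_neighbors=10):
    """Per-example model weights from agreement between each model's predicted
    partition and a clustering of the test set (illustrative approximation)."""
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X_test)
    preds = [m.predict(X_test) for m in models]                 # model-induced partitions
    nbrs = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X_test)
    _, idx = nbrs.kneighbors(X_test)                            # idx[:, 0] is the point itself
    W = np.zeros((len(X_test), len(models)))
    for i, neigh in enumerate(idx[:, 1:]):
        same_cluster = clusters[neigh] == clusters[i]
        for j, p in enumerate(preds):
            same_pred = p[neigh] == p[i]
            W[i, j] = np.mean(same_cluster == same_pred)        # fraction of agreeing neighbors
    return W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)  # rows sum to 1
```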

Page 68:

Local Structure Based Adjustment

Why is adjustment needed? It is possible that no model's structure is similar to the clustering structure at x. That simply means the training information conflicts with the true target distribution at x.

[Diagram: clustering structure vs. M1 and M2, both in error around x]

Page 69:

Local Structure Based Adjustment

How to adjust? Check whether the model weights at x fall below a threshold; if so, ignore the training information and propagate the labels of x's neighbors in the test set to x.

[Diagram: clustering structure vs. M1 and M2 around x]

Page 70:

Verify the Assumption

We need to check the validity of this assumption, but P(y|x) is unknown, and an appropriate clustering algorithm must also be chosen.

Findings from real data sets: the property is usually determined by the nature of the task. Positive cases: document categorization. Negative cases: sentiment classification. One could validate this assumption on the training set.

Page 71:

Algorithm

1. Check Assumption
2. Neighborhood Graph Construction
3. Model Weight Computation
4. Weight Adjustment

Page 72:

Data Sets

Different applications:

Synthetic data sets.

Spam filtering: public email collection -> personal inboxes (u01, u02, u03) (ECML/PKDD 2006).

Text classification: the same top-level classification problems with different sub-fields in the training and test sets (Newsgroup, Reuters).

Intrusion detection data: different types of intrusions in the training and test sets.

Page 73:

Baseline Methods

One source domain, single models: Winnow (WNN), Logistic Regression (LR), Support Vector Machine (SVM), Transductive SVM (TSVM).

Multiple source domains: SVM on each of the domains; TSVM on each of the domains.

Merge all source domains into one (ALL): SVM, TSVM.

Simple averaging ensemble: SMA.

Locally weighted ensemble without local structure based adjustment: pLWE.

Locally weighted ensemble: LWE.

Implementation packages: classification: SNoW, BBR, LibSVM, SVMlight; clustering: the CLUTO package.

Page 74:

Performance Measure

Prediction accuracy: 0-1 loss (accuracy) and squared loss (mean squared error).

Area Under the ROC Curve (AUC): trade-off between true positive rate and false positive rate; should ideally be 1.

Page 75:

A Synthetic Example

Training (have conflicting concepts)

Test

Partially overlapping

Page 76:

Experiments on Synthetic Data

Page 77:

Spam Filtering

Problems: training set: public emails; test set: personal emails from three users (U00, U01, U02).

[Bar charts: Accuracy and MSE of WNN, LR, SVM, TSVM, SMA, pLWE, LWE on the three users]

Page 78:

20 Newsgroup

C vs S

R vs T

R vs S

C vs T

C vs R

S vs T

Page 79:

[Bar charts: Accuracy (Acc) and MSE of WNN, LR, SVM, TSVM, SMA, pLWE, LWE on the 20 Newsgroup tasks]

Page 80:

Reuters

[Bar charts: Accuracy and MSE of WNN, LR, SVM, TSVM, SMA, pLWE, LWE on the Reuters tasks]

Problems: Orgs vs People (O vs Pe), Orgs vs Places (O vs Pl), People vs Places (Pe vs Pl).

Page 81:

Intrusion Detection

Problems (Normal vs Intrusions): Normal vs R2L (1), Normal vs Probing (2), Normal vs DOS (3).

Tasks: 2 + 1 -> 3 (DOS); 3 + 1 -> 2 (Probing); 3 + 2 -> 1 (R2L).

Page 82:

Conclusions

Locally weighted ensemble framework: transfers useful knowledge from multiple source domains.

Graph-based heuristics to compute the weights make the framework practical and effective.

Code and dataset available for download.

Page 83:

More information

www.weifan.info or www.cs.columbia.edu/~wfan for code, datasets, and papers.

