
1

Systematic Data Selection to Mine Concept-Drifting Data Streams

Wei Fan
Proceedings of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

2005/01/21 董原賓

2

Outline

Issues with data streams
Will old data really help?
Optimal models
Computing optimal models
Cross-validation decision tree ensemble
Experiments
Conclusion

3

Issues with data streams

Concept drift:
FO_i(x): the optimal model at time stamp i.
There is concept drift from time stamp i-1 to time stamp i if there are inconsistencies between FO_{i-1}(x) and FO_i(x).

Data sufficiency:
A data set is considered sufficient if adding more data will not increase the generalization accuracy.
Determining the sufficient amount can be prohibitively expensive.
Even if the dataset is insufficient, we still need to train a model that best fits the changing data.
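As a concrete illustration of the drift definition, here is a minimal Python sketch (the model callables and the evaluation sample are illustrative assumptions, not part of the paper) that measures the inconsistency between FO_{i-1}(x) and FO_i(x) as a disagreement rate:

```python
def disagreement_rate(f_old, f_new, sample):
    """Fraction of examples on which the two consecutive optimal models
    FO_{i-1} (f_old) and FO_i (f_new) disagree; a nonzero rate signals
    the inconsistency, i.e. concept drift, defined above."""
    disagreements = sum(1 for x, _ in sample if f_old(x) != f_new(x))
    return disagreements / len(sample)
```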

4

Will old data really help?

Definitions:
S_i: the most recent data chunk.
S_P: the previous data chunks, S_P = S_1 ∪ … ∪ S_{i-1}.
y = f(x): the underlying true model that we aim to learn.

5

Will old data really help? (Cont.)

Three types of examples in S_P:
1. FO_i(x) ≠ FO_{i-1}(x): such examples can only cancel out the changing concept.
2. FO_i(x) = FO_{i-1}(x) ≠ y: both models agree, but their predictions are wrong; we cannot determine whether these examples will help.
3. FO_i(x) = FO_{i-1}(x) = y: both models make the correct prediction; this is the only portion of S_P that may help.

6

Will old data really help? (Cont.)

7

Optimal Models

The two main themes of our comparison are possible data insufficiency and concept drift.

Four situations:
1. New data is sufficient by itself and there is no concept drift: the optimal model should be the one trained from the new data by itself.

8

Optimal Models (cont.)

2. New data is sufficient by itself and there is concept drift: the optimal model should be the one trained from the new data by itself.

3. New data is insufficient by itself and there is no concept drift: if the previous data is sufficient, the optimal model should be the existing model; otherwise, train a new model from the new data plus the existing data.

9

Optimal Models (cont.)

4. New data is insufficient by itself and there is concept drift: choose the previous data chunks whose concept is consistent with the new data chunk and combine them with the new data.

In practice we usually never know whether the data is indeed sufficient. Ideally, we should compare a few sensible choices if the training cost is affordable.

10

Computing optimal models

Definitions:
D_{i-1}: the dataset that trained the most recent optimal model FO_{i-1}(x); it is collected iteratively throughout the stream mining process.
s_{i-1}: the examples selected from D_{i-1}.

11

Computing optimal models (cont.)

Steps:
1. Train a model FN_i(x) from the new data chunk S_i only.
2. Select the examples from D_{i-1} on which both the newly trained model FN_i(x) and the most recent optimal model FO_{i-1}(x) make the correct prediction:

s_{i-1} = { (x, y) ∈ D_{i-1} : (FN_i(x) = y) ∧ (FO_{i-1}(x) = y) }
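This selection step translates almost directly into code; a minimal sketch, with the two models as hypothetical callables:

```python
def select_consistent(d_prev, fn_i, fo_prev):
    """Step 2: s_{i-1} = {(x, y) in D_{i-1} : FN_i(x) = y and FO_{i-1}(x) = y}.
    fn_i and fo_prev are the two models as callables; d_prev is a list
    of (x, y) pairs."""
    return [(x, y) for (x, y) in d_prev
            if fn_i(x) == y and fo_prev(x) == y]
```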

12

Computing optimal models (cont.)

3. Train a model FN_i^+(x) from the new data plus the data selected in the previous step (S_i ∪ s_{i-1}).

4. Update the most recent optimal model FO_{i-1}(x) with S_i and call the result FO_{i-1}^+(x). When updating the model, keep its structure fixed and update only its internal statistics.
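A minimal sketch of step 4, assuming the model exposes a route() method and per-leaf class counts (a stand-in for whatever internal statistics the real model keeps):

```python
def update_statistics(tree, new_chunk):
    """Step 4 sketch: build FO_{i-1}^+ by keeping the tree's structure
    fixed and updating the class counts at the leaves with the examples
    of S_i. The route()/counts leaf interface is hypothetical."""
    for x, y in new_chunk:
        leaf = tree.route(x)                        # structure unchanged
        leaf.counts[y] = leaf.counts.get(y, 0) + 1  # only statistics move
    return tree
```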

13

Computing optimal models (cont.)

5. Use cross-validation to compare the accuracy of FN_i(x), FO_{i-1}(x), FN_i^+(x), and FO_{i-1}^+(x). Choose the most accurate one and name it FO_i(x).

6. D_i is the training set that produced FO_i(x); it is one of S_i, D_{i-1}, S_i ∪ s_{i-1}, and S_i ∪ D_{i-1}.
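Steps 5 and 6 amount to picking the most accurate of the four candidates and remembering the data that produced it; a minimal sketch, with cv_accuracy standing in for the cross-validation estimate:

```python
def pick_optimal(candidates, cv_accuracy):
    """Steps 5-6: candidates maps each candidate model (FN_i, FO_{i-1},
    FN_i^+, FO_{i-1}^+) to the training set that produced it; cv_accuracy
    is the cross-validated accuracy estimator. Returns FO_i and D_i."""
    fo_i = max(candidates, key=cv_accuracy)   # most accurate candidate
    return fo_i, candidates[fo_i]             # its data becomes D_i
```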

14

Computing optimal models (cont.)

How the above framework finds the optimal model in each of the four previously discussed situations:

1. New data is sufficient by itself and there is no concept drift: conceptually, FN_i(x) should be the optimal model; FO_{i-1}(x), FN_i^+(x), and FO_{i-1}^+(x) could be close matches.

15

Computing optimal models (cont.)

2. New data is sufficient by itself and there is concept drift: FN_i(x) should be the optimal model; FN_i^+(x) could be very similar in performance to FN_i(x).

3. New data is insufficient by itself and there is no concept drift: the optimal model should be either FO_{i-1}(x) or FO_{i-1}^+(x).

4. New data is insufficient by itself and there is concept drift: the optimal model should be either FN_i(x) or FN_i^+(x).

16

Cross Validation Decision Tree Ensemble

Decision tree ensemble:
Step 1: Sequentially scan the complete dataset once and find all features with information gain.
Step 2: When building trees, at each step choose one of the remaining features at random.
Step 3: A tree stops growing a branch when no more examples pass through that branch.
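A minimal sketch of this construction for discrete features only (the feature list is assumed to be the one that survived the information-gain scan; continuous features and the rules on the following slides are omitted):

```python
import random

def build_tree(examples, features):
    """Sketch of steps 1-3: only features that showed information gain in
    the single initial scan are candidates, the split feature is chosen at
    random (step 2), each discrete feature is used at most once per path,
    and a branch is simply not grown when no examples pass through it
    (step 3). Examples are (x, y) pairs with x a feature-to-value dict."""
    if not examples or not features:
        return {"leaf": True, "examples": examples}
    feat = random.choice(list(features))      # random choice, no gain comparison
    groups = {}
    for x, y in examples:
        groups.setdefault(x[feat], []).append((x, y))
    remaining = [f for f in features if f != feat]
    return {"leaf": False,
            "feature": feat,
            "children": {v: build_tree(g, remaining)
                         for v, g in groups.items()}}
```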

17

Cross Validation Decision Tree Ensemble (cont.)

Rule 1: Features without information gain will never be used.

Rule 2: Each discrete feature can be used at most once on a particular decision path.

Rule 3: Each continuous feature can be chosen multiple times on the same decision path.

18

Cross Validation Decision Tree Ensemble (cont.)

Rule 4: The splitting threshold for a continuous feature is a random value between the minimum and maximum of that feature.

Rule 5: In the training data, each example x is assigned an initial weight of 1.0. When a missing feature value is encountered, the current weight of x is distributed across the children nodes: if the prior distribution of the known values is given, the weight is distributed in proportion to that distribution; otherwise, it is divided equally among the children.
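A minimal sketch of rules 4 and 5, with a hypothetical node layout (children maps feature values to child nodes; none of these names come from the paper):

```python
import random

def random_threshold(values):
    """Rule 4: the split threshold for a continuous feature is drawn
    uniformly at random between the feature's min and max."""
    return random.uniform(min(values), max(values))

def distribute_weight(children, weight, value, prior=None):
    """Rule 5: route an example's current weight through a split node.
    When `value` is missing (None), the weight is spread over the
    children, proportionally to the prior distribution of known values
    if one is given, else equally."""
    if value is not None:
        return [(children[value], weight)]        # known value: full weight
    if prior is not None:                         # distribute by the prior
        return [(child, weight * prior[v]) for v, child in children.items()]
    share = weight / len(children)                # no prior: equal split
    return [(child, share) for child in children.values()]
```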

19

Cross Validation Decision Tree Ensemble (cont.)

Cross-validation:
n: the size of the training set.
n-fold cross-validation leaves one example x out and uses the remaining n - 1 examples to train a model, which then classifies the left-out example x.
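A minimal sketch of this n-fold (leave-one-out) procedure, with train and classify as hypothetical stand-ins for the ensemble's training and prediction routines:

```python
def loo_accuracy(data, train, classify):
    """n-fold (leave-one-out) cross-validation as described: for each
    example, train on the other n - 1 examples and classify the
    left-out one. Returns the estimated accuracy."""
    correct = 0
    for i, (x, y) in enumerate(data):
        model = train(data[:i] + data[i + 1:])    # the remaining n - 1
        correct += classify(model, x) == y        # test on the left-out x
    return correct / len(data)
```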

20

Cross Validation Decision Tree Ensemble (cont.)

Computing the probability of x: assume there are two class labels, fraud and non-fraud. The probability of x being fraudulent is then the fraction of fraud examples among the training examples that reach the same leaf as x, averaged over the trees of the ensemble.
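A minimal sketch of this estimate, assuming each tree stores per-class weights at its leaves (this route()/counts interface is an assumption, not from the slides):

```python
def fraud_probability(trees, x):
    """Estimate P(fraud | x): for each tree, take the fraction of fraud
    examples among the (weighted) training examples that reached the
    same leaf as x, then average over the ensemble."""
    probs = []
    for tree in trees:
        counts = tree.route(x).counts             # e.g. {"fraud": w_f, "non-fraud": w_n}
        total = sum(counts.values())
        probs.append(counts.get("fraud", 0) / total if total else 0.5)
    return sum(probs) / len(probs)
```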

21

Experiment (streaming data)

Synthetic data: data with drifting concepts based on a moving hyperplane. Roughly half of the examples are positive, and the other half negative.

Credit card fraud data: real-life credit card transactions for cost-sensitive mining, covering a one-year period and 5 million transactions. Investigating a fraudulent transaction costs $90.

Donation dataset: the well-known donation dataset that first appeared in the KDDCUP'98 competition; 95,412 known records and 96,367 unknown records for testing.

22

Experiment (Synthetic dataset)

23

Experiment (credit card dataset)

24

Experiment (Donation dataset)

25

Experiment (Accuracy of Cross validation)

26

Experiment (Optimal model counts)

27

Experiment (Training time)

28

Experiment (Data sufficiency test)

29

Conclusion

Using old data unselectively is like gambling.

Proposed a cross-validation-based framework to choose data and compare sensible choices.

Proposed an implementation of this framework using a cross-validation decision tree ensemble.