MuSeRA: Multiple Selectively Recursive Approach towards Imbalanced Stream Data Mining

Sheng Chen, Haibo He, Kang Li, and Sachi Desai

Abstract- Learning from data streams has inspired considerable interest in recent years due to its wide applications in fields such as network intrusion detection, credit fraud identification, spam filtering, and many others. Given that most algorithms developed thus far assume the class distribution of the streaming data is relatively balanced, they inevitably suffer severe performance deterioration when handling imbalanced data streams. Evolved from our previous work SERA (SElectively Recursive Approach), the MuSeRA algorithm is proposed in this paper to deal with the problem of learning from imbalanced data streams. By maintaining an ensemble of hypotheses built upon the incoming training data chunks, which are balanced by selectively accommodating previous minority examples, MuSeRA can efficiently learn the target concept of imbalanced data streams and thus obtain substantial performance improvement over our previous work SERA and the existing stream data mining algorithms. Simulation results validate the effectiveness of the proposed MuSeRA algorithm.

I. INTRODUCTION

Many real applications, such as network traffic monitoring, credit card fraud detection, and web click streams, generate continuously arriving data, known as data streams [1]. A data stream is an unbounded sequence of real-time data items with a very high data rate that can be read only once by an application [2]. Due to the high-speed nature of data streams, it is practically required that the streaming data not be scanned more than once, which is referred to as the one-pass constraint [3]. Further, classic algorithms handling static data sets can keep all data in memory to build the model; for the virtually unbounded volume of data streams, however, algorithms can only reserve a limited amount of data for building the model at any time. Besides these two concerns for developing an algorithm dealing with data streams, the most important challenge with respect to classification is concept drift in evolving data streams [3]. Concept drifts, also termed data stream evolution [4], occur when the underlying target concept evolves over time. In such a scenario, researchers often face the stability-plasticity dilemma [5], since it is very difficult to strike a balance between maintaining the most consistent knowledge (the current training data chunk) and accommodating previous relevant information (the previous data chunks).

Sheng Chen is with the Department of Electrical and Computer Engineering, Stevens Institute of Technology, Hoboken, NJ 07030 USA, email: [email protected]. Haibo He is with the Department of Electrical, Computer, and Biomedical Engineering, University of Rhode Island, Kingston, RI 02881 USA, email: [email protected]. Kang Li is with the School of Electronics, Electrical Engineering and Computer Science, Queen's University Belfast, BT7 1NN, Northern Ireland, UK, email: [email protected]. Sachi Desai is with the U.S. Army RDECOM-ARDEC, Picatinny Arsenal, New Jersey, 07806, USA, email: [email protected]



The existing algorithms for stationary data streams include: VFDT [6], which constructs and updates a single decision tree system based on the Hoeffding tree whenever new data arrive; Learn++ [7], which builds an ensemble for each incoming data chunk and then combines them to make predictions on the testing data set; and IMORL [8], which employs knowledge from the immediately previous data chunk to facilitate the learning process on the current data chunk. Among others for nonstationary data streams, [9] proposed a general ensemble-based framework. It builds a hypothesis for each incoming data chunk, where each hypothesis is weighted according to its performance on the current data chunk. Rather than establishing an ensemble model based on all accumulated data, OLIN dynamically adjusts the size of the training window and the number of new examples between model constructions according to the current rate of concept drift [10]. By adopting the idea of micro-clusters, an on-demand classification method was proposed in [11]. It has two components: one stores the summary statistics of the data streams, while the other uses such information to conduct the classification. ANNCAD, proposed in [12], uses a multi-resolution data representation based on the Haar wavelet transformation to adaptively locate the nearest neighbors of the testing data. An exponential fading factor is designed to address concept drifts by weakening the impact of old data on the classification process.

Despite the great success achieved in learning from time-evolving data streams, few algorithms have been developed so far to handle imbalanced data streams. In many real-world applications, the underlying class distributions of data streams are usually imbalanced. Generally speaking, the imbalanced learning problem corresponds to domains in which certain types of data distribution over-dominate the instance space compared to others [13]. Most existing imbalanced learning algorithms [14] [15] [16] take all data into memory at one time to build the learning models, which contradicts the unbounded nature of data streams and hence cannot be applied directly to learn from imbalanced data streams.

Evolved from our previous work SERA [17], the MuSeRA algorithm is proposed in this paper in an effort to provide a solution for dealing with imbalanced data streams of nonstationary target concepts. It makes the following contributions:

1. Differing from classic imbalanced learning algorithms [14], which generate synthetic minority instances to intensify the under-represented minority concepts, MuSeRA collects all the minority data arriving so far and selectively accommodates them into the current data chunk to balance the class distributions. This does not violate the one-scan requirement of stream data mining, since the intrinsic nature of imbalanced data streams means the amount of minority class examples is extremely limited, so they can be accumulated with tolerable storage space.

2. Rather than accommodating all previous minority data into the current data chunk as in [18], MuSeRA employs a metric to evaluate the similarity between each previous minority example and the current minority set, which is then used to determine the set of previous minority data that should be taken into the current data chunk.

3. Different from SERA, which maintains a single hypothesis on the current amplified training data chunk, MuSeRA additionally includes hypotheses built on previous training data chunks, all of which are weighted as an ensemble to predict the class labels of the current testing data set.

The rest of this paper is organized as follows. Section II discusses the technical details and the corresponding MuSeRA algorithm. Section III elaborates the simulation configuration details and the evaluation metrics, and then the statistical simulation results and the corresponding discussion are given. Section IV concludes the paper and enumerates several potential improvements for MuSeRA in the future.

II. THE PROPOSED LEARNING FRAMEWORK FOR NONSTATIONARY IMBALANCED DATA STREAM

A. The MuSeRA Learning Algorithm

The pseudo-code of the proposed MuSeRA for learning from nonstationary imbalanced data streams is formulated as follows:

Algorithm: MuSeRA - Multiple Selectively Recursive Approach for Nonstationary Imbalanced Data Stream

Input:
• The current timestamp k.
• The minority class ratio γ of the data chunk.
• The post-balance ratio f.
• The current training data chunk S_k with m training examples ((x_1, y_1), ..., (x_m, y_m)), where y_i ∈ C = {1, ..., k} is the class label corresponding to the i-th instance x_i.
• The current testing data chunk T_k with n testing instances (x'_1, x'_2, ..., x'_n), where x'_i is the i-th instance.
• The data set G consisting of the minority examples arriving before the current timestamp k.
• The hypotheses set H_{k-1} = {h_1, h_2, ..., h_{k-1}}, in which h_i represents the hypothesis built on the post-balanced data chunk with timestamp i. Obviously H_1 = ∅.

Procedure:
1) Split S_k into the minority example set P_k and the majority example set N_k.
2) If f > (k - 1) × γ:
   • Include the entire G into the current data chunk S_k, which gives rise to S'_k = {S_k, G}.
3) Else:
   a) Calculate the Mahalanobis distance d_i between P_k and each minority example (x_i, y_i) in G based on equation (1):

      $d_i = \sqrt{(x_i - \mu)^T \Sigma^{-1} (x_i - \mu)}$   (1)

      where μ is the mean of P_k, and Σ is the covariance matrix of P_k.
   b) Sort {d_i} in ascending order, then select the minority examples in G corresponding to the first (f - γ) × m terms in the sorted {d_i} and collect them as M.
   c) Include M into the current data chunk S_k, which gives rise to S'_k = {S_k, M}.
4) End If
5) Build the soft-typed hypothesis h_k based on S'_k, and include h_k into the hypothesis set H_{k-1}, which gives rise to H_k = {H_{k-1}, h_k}.
6) For each hypothesis h_j ∈ H_k:
   • Apply h_j on S'_k to derive the weight w_j based on equations (2) and (3):

      $e_j = \frac{1}{|S'_k|} \sum_{(x_i, y_i) \in S'_k} \left(1 - f^j_{y_i}(x_i)\right)^2$   (2)

      $w_j = \log\left(\frac{1}{e_j}\right)$   (3)

      where $f^j_{y_i}(x_i)$ is the probability output by h_j that x_i is an example belonging to class y_i.
7) End For
8) Append P_k into G, i.e., G = {G, P_k}.

Output: The final hypothesis h_final for T_k is:

$h_{final}(x'_i) = \arg\max_{c \in C} \sum_{h_j \in H_k} w_j f^j_c(x'_i)$   (4)
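For concreteness, the following minimal Python/NumPy sketch illustrates the selective accommodation of steps 2)-3); the function name select_minority and the use of a pseudo-inverse to guard against a singular covariance matrix are our own assumptions, not part of the algorithm specification.

```python
import numpy as np

def select_minority(P_k, G, f, gamma, m):
    """Selective accommodation (steps 2-3 of MuSeRA).

    P_k : (p, d) array of minority examples in the current chunk.
    G   : (g, d) array of minority examples accumulated so far.
    f   : post-balance ratio; gamma: minority class ratio; m: chunk size.
    Returns the subset of G closest to P_k in Mahalanobis distance.
    """
    n_needed = int((f - gamma) * m)
    if n_needed >= len(G):
        return G                                    # step 2: G is small enough, take all of it
    mu = P_k.mean(axis=0)                           # mean of P_k
    cov_inv = np.linalg.pinv(np.cov(P_k, rowvar=False))  # pseudo-inverse guards against singular covariance
    diff = G - mu
    d = np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))  # eq. (1) for every example in G
    idx = np.argsort(d)[:n_needed]                  # step 3b: keep the closest examples
    return G[idx]
```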

Fig. 1 shows the system-level framework of the proposed MuSeRA algorithm. Data set G contains all minority training data prior to the current time. At time t = k, a certain amount ((f - γ) × m) of minority examples in G are chosen based on the criterion that their Mahalanobis distance to the current minority example set P_k is minimum. These examples are then appended to the current training data chunk S_k such that the ratio of minority examples in the post-balanced training data chunk S'_k is equal to f. Hypothesis h_k is built upon S'_k and then added into the hypotheses set H_{k-1} to obtain H_k.

Fig. 1. The system-level framework of MuSeRA (the figure shows, for timestamps t = 1 through t = k + 2, the majority training data, the minority training data, the selectively absorbed data, and the testing data, together with the hypotheses weighted into the final hypothesis h_final)

Each of the hypotheses in H_k is applied on S'_k to calculate the error rates {e_j} by equation (2), which are then used to calculate the weights {w_j} by equation (3). Finally, the hypotheses in H_k are weighted by {w_j} to obtain the final hypothesis h_final, which makes predictions on the current testing data set T_k.
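A sketch of the weighting and prediction steps (equations (2)-(4)) follows; it assumes scikit-learn style hypotheses whose predict_proba columns are indexed by 0-based class labels, which is an assumption of this illustration rather than a requirement of MuSeRA.

```python
import numpy as np

def ensemble_predict(hypotheses, S_k_X, S_k_y, T_k_X):
    """Weight each hypothesis on the post-balanced chunk (eqs. (2)-(3))
    and combine the soft outputs into the final prediction (eq. (4))."""
    eps = 1e-12                                    # guards log(1/e) when e is 0
    weights = []
    for h in hypotheses:
        proba = h.predict_proba(S_k_X)             # soft outputs on S'_k
        p_true = proba[np.arange(len(S_k_y)), S_k_y]   # probability assigned to the true class
        e = np.mean((1.0 - p_true) ** 2)           # eq. (2)
        weights.append(np.log(1.0 / (e + eps)))    # eq. (3)
    # eq. (4): weighted sum of the soft outputs, then argmax over classes
    votes = sum(w * h.predict_proba(T_k_X) for w, h in zip(weights, hypotheses))
    return np.argmax(votes, axis=1)
```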

B. The Theoretical Analysis of MuSeRA

Similar to our previous work SERA, MuSeRA selectively accommodates a certain amount of previous minority examples into the current training data chunk to balance the skewed class distribution. This is different from the mechanism of [18], which amplifies the current training data chunk by incorporating all previous minority examples regardless of their similarity to the current minority example set. In [17], an empirical study showed that the performance of SERA was competitive compared to the universal accommodation mechanism employed by [18]. MuSeRA inherits the selective accommodation mechanism of SERA, which gives it a good opportunity to receive similar benefits.

So far, both [18] and SERA focus on developing hypotheses on the amplified current training data chunk to optimize the prediction performance on the testing data set. In this way, the learning algorithm is always kept updated with the data whose target concept most closely accords with the current one, so the concept drifting problem is apparently alleviated or even bypassed. However, maintaining hypotheses only on the current training data chunk is more or less equivalent to discarding a significant part of the previous knowledge, since the knowledge of the previous majority class examples can never be accessed again, either explicitly or implicitly, once they have been processed. This situation, according to [5], can be partly characterized as "catastrophic forgetting", which a qualified online learning system should always avoid by not disconnecting itself from previously available knowledge.

To this end, we propose to include hypotheses developed on all training data chunks over time to mine nonstationary stream data. Through a weighted combination, the previous knowledge can be preserved, while the most contributory hypotheses can exert greater impact on the decision process by being granted higher weights. [19] employs a uniform voting mechanism to combine the hypotheses, since it argues that in practice the testing data set may not necessarily evolve with the streaming training data chunks. In this work, we assume the class distribution of the testing data set remains in step with the evolution of the training data chunks. The weights are determined according to the heuristic that they should be inversely proportional to the error the corresponding hypotheses make on the current training data chunk, as shown mathematically in equations (2) and (3).

In [18], an ensemble was developed on the current amplified training data chunk. Concretely, the majority class examples are decomposed into K subsets, each of approximately the same size as the minority example set. Each of these subsets is then combined with a replicate of the minority example set, on which a hypothesis h_i (i = 1, ..., K) is developed. Assuming the probability output of hypothesis h_i that the testing instance x belongs to class c is $f^i_c(x)$,

the corresponding probability output of the ensemble classifier is:

$f^E_c(x) = \frac{1}{K} \sum_{i=1}^{K} f^i_c(x)$   (5)

For simplicity of exposition, we refer to this framework as "Uncorrelated Bagging", abbreviated as "UB", in the rest of this paper.
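The following sketch illustrates the UB procedure described above under our own assumptions (a decision-tree base learner and equally sized random majority subsets); [18] does not prescribe a particular base learner.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def uncorrelated_bagging(X_min, y_min, X_maj, y_maj, T_X):
    """Split the majority class into K subsets of roughly the minority-set
    size, pair each subset with a copy of the minority set, train one
    classifier per pair, and average the probability outputs (eq. (5))."""
    K = max(1, len(X_maj) // len(X_min))
    subsets = np.array_split(np.random.permutation(len(X_maj)), K)
    probas = []
    for idx in subsets:
        X = np.vstack([X_maj[idx], X_min])          # one majority subset plus all minority data
        y = np.concatenate([y_maj[idx], y_min])
        h = DecisionTreeClassifier().fit(X, y)
        probas.append(h.predict_proba(T_X))
    return np.mean(probas, axis=0)                  # eq. (5): uniform average of soft outputs
```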

According to [20], the probability output of a soft-typed classifier for an instance x can be expressed as:

$f_c(x) = p(c|x) + \eta_c(x)$   (6)

where $p(c|x)$ is the a posteriori probability that instance x belongs to class c, and $\eta_c(x)$ is the error associated with the c-th output.

Based on equation (6), given that we target a binary classification problem, e.g., between class i and class j, [21] proved that the expected error can be reasonably approximated by:

$\mathrm{Error} = \sigma^2_{\eta_c} \cdot \frac{p(c_j|x) - p(c_i|x)}{2}$   (7)

where $p(c_i|x)$ and $p(c_j|x)$ are the a posteriori probabilities that instance x belongs to class i and class j under the true Bayes model, respectively; they are irrelevant to the training algorithm itself. Therefore, the expected error is proportional to the variance of $\eta_c(x)$ up to a constant, i.e., $\mathrm{Error} \propto \sigma^2_{\eta_c}$.

It was shown in [18] that, given the independence of the hypotheses developed on each consolidated subset, the boundary variance of UB can be reduced by a factor of $K^2$, i.e.,

$\sigma^2_{\bar{\eta}_c} = \frac{1}{K^2} \sum_{i=1}^{K} \sigma^2_{b^i_c}$   (8)

where $\sigma^2_{b^i_c}$ denotes the variance of UB's i-th single classifier.

Therefore, the expected error of UB should be proportional to $\frac{1}{K^2} \sum_{i=1}^{K} \sigma^2_{b^i_c}$. In the rest of this section, we prove that the error of MuSeRA is less than that of UB.

Since the weights of MuSeRA are determined to be inversely proportional to the errors the single classifiers make on the current training data chunk, they can approximately be described by:

$w_i = \frac{C}{\sigma^2_{\eta^i_c}}$   (9)

where C is a constant for all $\{w_i\}$. According to equation (6), the error term $\eta^E_c(x)$ is part of MuSeRA's probability output, and can thus be represented as the weighted sum of the errors of the single classifiers, i.e.,

$\eta^E_c(x) = \frac{\sum_{i=1}^{N} w_i \eta^i_c(x)}{\sum_{i=1}^{N} w_i}$   (10)

If we make the same assumption as in [18] that the single classifiers are independent of each other, the variance of $\eta^E_c(x)$ can be represented by:

$\sigma^2_{\eta^E_c} = \frac{\sum_{i=1}^{N} w_i^2 \sigma^2_{\eta^i_c}}{\left(\sum_{i=1}^{N} w_i\right)^2}$   (11)

Substituting equation (9), equation (11) simplifies to:

$\sigma^2_{\eta^E_c} = \frac{1}{\sum_{i=1}^{N} 1/\sigma^2_{\eta^i_c}}$   (12)

On the other hand, we have the following inequality:

$\sum_{i=1}^{N} \sigma^2_{\eta^i_c} \times \sum_{i=1}^{N} \frac{1}{\sigma^2_{\eta^i_c}} \ge N^2$   (13)

The proof is straightforward: writing $a_i = \sigma^2_{\eta^i_c}$, the left side of inequality (13) can be rewritten as:

$\sum_{i=1}^{N} a_i \times \sum_{i=1}^{N} \frac{1}{a_i} = \sum_{\substack{i,j=1 \\ i=j}}^{N} \frac{a_i}{a_j} + \sum_{\substack{i,j=1 \\ i \neq j}}^{N} \frac{a_i}{a_j} = N + \frac{1}{2} \sum_{\substack{i,j=1 \\ i \neq j}}^{N} \frac{a_i^2 + a_j^2}{a_i a_j} \ge N + \binom{N}{2} \times 2 = N + N^2 - N = N^2$   (14)

Based on inequality (13), we have:

$\frac{1}{\sum_{i=1}^{N} 1/\sigma^2_{\eta^i_c}} \le \frac{1}{N^2} \sum_{i=1}^{N} \sigma^2_{\eta^i_c}$   (15)
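Inequality (15) is easy to check numerically; the snippet below draws arbitrary positive variances and verifies that the inverse-error-weighted variance of equation (12) never exceeds the uniform-weighting bound on the right-hand side.

```python
import numpy as np

# Numerical check of inequalities (13)/(15): for any positive variances s,
# 1 / sum(1/s) never exceeds (1/N^2) * sum(s).
rng = np.random.default_rng(0)
s = rng.uniform(0.1, 5.0, size=20)      # 20 arbitrary positive variances
lhs = 1.0 / np.sum(1.0 / s)             # variance under inverse-error weighting, eq. (12)
rhs = np.sum(s) / len(s) ** 2           # uniform-weighting bound, right side of eq. (15)
assert lhs <= rhs + 1e-12
print(lhs, rhs)
```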

The single classifier developed by MuSeRA is based on the complete picture of the amplified training data chunk, while each single classifier developed by UB is based on only part of the amplified training data chunk, which can be categorized as random undersampling. The random undersampling technique may cause the classifier to lose important concepts pertaining to the majority class [13]. Besides, the simulation results in [17] also showed that the single classifier created by SERA is at least competitive compared to UB. Therefore, the error of a single classifier of MuSeRA should be no greater than that of a single classifier of UB, i.e., $\sigma^2_{\eta^i_c} \le \sigma^2_{b^i_c}$. Provided sufficient training data chunks have been presented to the learning algorithm such that $N \ge K$, we have:

$\frac{1}{N^2} \sum_{i=1}^{N} \sigma^2_{\eta^i_c} \le \frac{1}{K^2} \sum_{i=1}^{K} \sigma^2_{b^i_c}$   (16)

Based on equations (8), (12), (15) and (16), we have:

$\sigma^2_{\eta^E_c} \le \sigma^2_{\bar{\eta}_c}$   (17)

According to the previous discussion that $\mathrm{Error} \propto \sigma^2_{\eta_c}$, we can conclude that MuSeRA provides less erroneous prediction results than UB.

III. SIMULATION RESULTS AND ANALYSIS

A. Simulation Objectives

[18] proposed to include all previous minority data into the current training data chunk to facilitate imbalanced learning. However, including excessive minority data with drifted concepts may undermine the original target concept of the current training data chunk. In the same way as our previous work SERA [17], MuSeRA incorporates a fixed amount of previous minority examples whose underlying target concepts are the most similar to those of the current minority example set. Our simulation will aim to show that the selective accommodation mechanism is more effective than the universal accommodation mechanism.

Our previous work SERA is based on a single hypothesis built on the current amplified training data chunk. Our simulation will show that by combining all previous similarly built hypotheses, as well as the one built on the current amplified training data chunk, in a properly weighted manner, the performance of MuSeRA in predicting the class labels of the current testing data set can be considerably improved.

Finally, existing imbalanced classification algorithms, such as SMOTE [14], usually employ an over-sampling technique that generates synthetic minority instances to balance the skewed class distribution of the data set. Such algorithms are inferior to our proposed MuSeRA for learning from imbalanced data streams, which will also be justified by the simulation results.
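For reference, the core interpolation idea of SMOTE can be sketched as follows; the function and parameter names are ours, and the reference implementation [14] contains considerably more machinery.

```python
import numpy as np

def smote_like(X_min, n_new, k=5, rng=np.random.default_rng(0)):
    """Bare-bones SMOTE-style interpolation: each synthetic instance lies
    on the segment between a minority example and one of its k nearest
    minority neighbors."""
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nn = np.argsort(d)[1:k + 1]        # k nearest neighbors, excluding the point itself
        j = rng.choice(nn)
        out.append(X_min[i] + rng.random() * (X_min[j] - X_min[i]))
    return np.array(out)
```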

The algorithms taken into the simulation for performance comparison are listed as follows:

• MuSeRA, which is denoted as "MuSeRA" in the simulation results.

• SERA, which is denoted as "SERA" in the simulation results; it only uses the hypothesis built on the amplified current training data chunk to evaluate the current testing instance set.

• The approach proposed in [18], which is denoted as "UB" in the simulation results.

• SMOTE, which is denoted as "SMOTE" in the simulation results; it employs the synthetic minority over-sampling technique [14] to balance the class distribution of the current training data chunk, and then builds a hypothesis to evaluate the current testing data set.

B. Simulation Configuration and Evaluation Metrics

The synthetic data set proposed in [18] is used to validate the effectiveness and superiority of our proposed MuSeRA. Based on the occurrence mechanism of concept drifts discussed in [18], the employed synthetic data set is designed with both feature change and conditional change, and the class label is determined non-stochastically.

The number of chunks in the data stream is set to 100, each of which carries 1,000 training examples and 10,000 testing instances. The minority class ratio γ alternates between 0.01 and 0.05 in our current simulation.

                     Hypothesis Output
                     Y                      N
True     p    TP (True Positives)    FN (False Negatives)
class    n    FP (False Positives)   TN (True Negatives)

Fig. 2. Confusion Matrix for Performance Evaluation

The comparative algorithms are applied on the 100th training data chunk to have their performance evaluated.

Traditional evaluation metrics such as overall accuracy or overall classification error are not sufficient for evaluating the performance of algorithms under imbalanced learning scenarios [13]. To this end, several evaluation metrics associated with the confusion matrix are employed to validate the effectiveness of the proposed MuSeRA.

The classification performance can be illustrated by the confusion matrix shown in Fig. 2. Provided the minority and majority examples are deemed to belong to the positive and negative classes respectively, let {p, n} denote the true positive and negative class labels and {Y, N} denote the predicted positive and negative class labels. The evaluation metrics used in our simulation are as follows:

• Overall Accuracy (OA):

$\mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN}$   (18)

• Precision:

$\mathrm{Precision} = \frac{TP}{TP + FP}$   (19)

• Recall:

$\mathrm{Recall} = \frac{TP}{TP + FN}$   (20)

• F-measure:

$\mathrm{F\text{-}measure} = \frac{(1 + \beta^2) \cdot \mathrm{Recall} \cdot \mathrm{Precision}}{\beta^2 \cdot \mathrm{Recall} + \mathrm{Precision}}$   (21)

• G-mean:

$\mathrm{G\text{-}mean} = \sqrt{\frac{TP}{TP + FN} \times \frac{TN}{TN + FP}}$   (22)

Note that β is set to 1 for the F-measure in our current simulation.
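A small helper computing metrics (18)-(22) directly from the confusion matrix entries may clarify the definitions; it assumes non-degenerate counts (no zero denominators).

```python
import numpy as np

def imbalance_metrics(TP, FN, FP, TN, beta=1.0):
    """Evaluation metrics (18)-(22) from the confusion matrix of Fig. 2."""
    oa = (TP + TN) / (TP + TN + FP + FN)                    # eq. (18)
    precision = TP / (TP + FP)                              # eq. (19)
    recall = TP / (TP + FN)                                 # eq. (20)
    f_measure = ((1 + beta**2) * recall * precision /
                 (beta**2 * recall + precision))            # eq. (21), beta = 1 in our simulation
    g_mean = np.sqrt(TP / (TP + FN) * TN / (TN + FP))       # eq. (22)
    return oa, precision, recall, f_measure, g_mean
```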

Another popular evaluation metric for imbalanced learning is the Receiver Operating Characteristic (ROC) curve [22].

TABLE I
SIMULATION RESULTS OF COMPARATIVE ALGORITHMS

                          γ = 0.01                                     γ = 0.05
 f    Method   OA      Precision  Recall  F-measure  G-mean | OA      Precision  Recall  F-measure  G-mean
0.1   MuSeRA   0.9839  0.1111     0.01    0.0183     0.1    | 0.9434  0.2105     0.048   0.0782     0.218
      SERA     0.9594  0.0578     0.2     0.0897     0.4398 | 0.9209  0.1788     0.162   0.17       0.3945
      UB       0.8397  0.0492     0.82    0.0928     0.8299 | 0.7403  0.1371     0.792   0.2337     0.7643
      SMOTE    0.9593  0.0334     0.11    0.0513     0.3263 | 0.8005  0.1309     0.53    0.2099     0.6571
0.3   MuSeRA   0.9726  0.125      0.29    0.1747     0.533  | 0.9124  0.2767     0.466   0.3472     0.6604
      SERA     0.9229  0.0649     0.5     0.1148     0.6809 | 0.851   0.1696     0.508   0.2543     0.6644
      UB       0.831   0.0373     0.64    0.0704     0.7301 | 0.7415  0.1356     0.776   0.2309     0.7576
      SMOTE    0.95    0.0413     0.18    0.0672     0.4152 | 0.8076  0.144      0.576   0.2404     0.6872
0.5   MuSeRA   0.946   0.1014     0.56    0.1718     0.7293 | 0.8753  0.2065     0.652   0.3136     0.7523
      SERA     0.8931  0.0535     0.58    0.0979     0.721  | 0.8204  0.1604     0.612   0.2572     0.7133
      UB       0.8177  0.043      0.81    0.0816     0.8139 | 0.7302  0.1317     0.786   0.2256     0.7561
      SMOTE    0.9052  0.031      0.28    0.0558     0.5092 | 0.8252  0.1467     0.518   0.2286     0.6602

Fig. 3. ROC Graphs of Comparative Algorithms (six panels, (a)-(f), each plotting tp_rate against fp_rate)

Based on the confusion matrix defined in Fig. 2, one can calculate the tp_rate and fp_rate as follows:

$\mathrm{tp\_rate} = \frac{TP}{TP + FN}$   (23)

$\mathrm{fp\_rate} = \frac{FP}{FP + TN}$   (24)

The ROC space is established by plotting tp_rate against fp_rate. Generally speaking, hard-type classifiers (those that only output discrete class labels) correspond to points in the ROC space: (fp_rate, tp_rate). On the other hand, soft-type classifiers (those that output a likelihood of the degree to which an instance belongs to each class label) correspond to curves in the ROC space. Such curves are formulated by adjusting the decision threshold to generate a series of points in the ROC space. In order to assess different classifiers' performance in this case, one generally uses the Area Under Curve (AUC) as an evaluation criterion; it is defined as the area between the ROC curve and the horizontal axis (the axis representing fp_rate).
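The threshold-sweeping construction of the ROC curve and its AUC can be sketched as below; this simplified version ignores tied scores and assumes binary 0/1 labels.

```python
import numpy as np

def roc_points(scores, labels):
    """Sweep the decision threshold over a classifier's soft outputs to
    trace its ROC curve, then integrate for AUC (trapezoidal rule).
    scores: predicted probability of the positive class; labels: 0/1."""
    order = np.argsort(-scores)                   # descending score order
    labels = np.asarray(labels)[order]
    P, N = labels.sum(), len(labels) - labels.sum()
    tp = np.cumsum(labels)                        # true positives as the threshold lowers
    fp = np.cumsum(1 - labels)                    # false positives as the threshold lowers
    tp_rate = np.concatenate([[0], tp / P])       # eq. (23)
    fp_rate = np.concatenate([[0], fp / N])       # eq. (24)
    auc = np.trapz(tp_rate, fp_rate)              # area under the curve
    return fp_rate, tp_rate, auc
```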

C. Results and Discussion

Table I provides the simulation results for the scenarios in which the minority class ratio γ alternates between 0.01 and 0.05. When f = 0.1, UB provides competitive results, i.e., it outperforms the other comparative methods in terms of Recall, F-measure, and G-mean. This is most likely due to the insufficient amplification of the current minority example set: since the skewed ratio of the amplified training data chunk is still high, it is understandable that MuSeRA and SERA yield inferior prediction results compared to UB.

TABLE II
AUC OF COMPARATIVE ALGORITHMS

                       AUC
 f    Method   γ = 0.01   γ = 0.05
0.1   MuSeRA   0.9084     0.8729
      SERA     0.8073     0.7665
      UB       0.8937     0.8491
      SMOTE    0.5934     0.747
0.3   MuSeRA   0.9313     0.8752
      SERA     0.8815     0.8391
      UB       0.8301     0.7821
      SMOTE    0.642      0.7759
0.5   MuSeRA   0.9236     0.873
      SERA     0.8859     0.8383
      UB       0.8558     0.8124
      SMOTE    0.6321     0.7514

However, with enough previous minority examples accommodated into the current training data chunk, i.e., f > 0.1, MuSeRA achieves a steady increase in its performance and outperforms the other comparative algorithms in terms of OA, Precision, and F-measure. A similar story holds for SERA, which performs superiorly in terms of OA, Precision and F-measure when compared to UB. The last note for Table I is that SMOTE by no means bears comparison with the other algorithms in terms of the numerical results, which means a traditional imbalanced learning algorithm, e.g., the over-sampling technique, cannot be directly applied to the stream data mining scenario.

Fig. 3 shows the ROC curves of the comparative algorithms in different scenarios, where Fig. 3(a)-3(c) correspond to f = 0.1, 0.3, and 0.5 when γ = 0.01, and Fig. 3(d)-3(f) correspond to f = 0.1, 0.3, and 0.5 when γ = 0.05. Table II shows the AUC results of these ROC curves, from which one can see that the AUC of MuSeRA is greater than that of UB as well as the other comparative algorithms in all simulated scenarios. SERA is also competitive compared to UB when f > 0.1. These results verify our discussion in Section II-B that MuSeRA can provide superior performance when enough data chunks have been presented to it.

IV. CONCLUSION

In this paper, we propose MuSeRA to learn from nonstationary imbalanced data streams. MuSeRA measures the Mahalanobis distance between the reserved previous minority examples and the current minority example set, based on which a certain amount of previous minority examples are incorporated into the current training data chunk. A hypothesis is then built on this amplified training data chunk. Different from our previous work SERA, which solely depends on the currently built hypothesis, the proposed MuSeRA algorithm weights all the hypotheses built over time to constitute an ensemble classifier that accomplishes the prediction task on the testing data. The simulation results validate the effectiveness of MuSeRA.

There are several possible improvements for the proposed MuSeRA. The Mahalanobis distance may be calculated inaccurately if the size of the data set is not big enough. Given the very limited number of minority examples in imbalanced data streams, more reliable mechanisms to measure the similarity between the previous minority examples and the current minority example set, such as density estimation, could be considered. Besides, with the elapse of time, the ensemble size will inflate considerably. [21] discussed that combining too many classifiers may jeopardize the independence assumption of the single classifiers in an ensemble, so that the error can no longer be reduced by a factor of N² as the number of available hypotheses increases. On one hand, this does not contradict our theoretical conclusion that MuSeRA provides less erroneous results than UB, since the ensemble size of UB decreases with the accumulation of minority examples, so that the number of hypotheses N needed for MuSeRA to outperform UB is correspondingly reduced. On the other hand, it raises an important issue related to pruning techniques: how to rule out the least informative hypotheses from the ensemble to achieve the best performance. Finally, given that our simulation is currently based only on a synthetic data set, extensive simulations on real-world data sets should be performed to fully justify the effectiveness of the proposed MuSeRA algorithm.

REFERENCES

[1] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, "Models and issues in data stream systems," in Proc. PODS, 2002.

[2] M. M. Gaber, S. Krishnaswamy, and A. Zaslavsky, "Adaptive mining techniques for data streams using algorithm output granularity," in Proc. Australasian Data Mining Workshop (AusDM 2003), held in conjunction with the 2003 Congress on Evolutionary Computation (CEC 2003), 2003.

[3] C. Aggarwal, Data Streams: Models and Algorithms, C. Aggarwal, Ed. Springer Press, 2007.

[4] ——, "A framework for diagnosing changes in evolving data streams," in Proc. ACM SIGMOD Conference, 2003, pp. 575-586.

[5] S. Grossberg, "Nonlinear neural networks: Principles, mechanisms, and architectures," Neural Networks, vol. 1, no. 1, pp. 17-61, 1988.

[6] P. Domingos and G. Hulten, "Mining high-speed data streams," in Proc. Int. Conf. KDD. ACM Press, 2000, pp. 71-80.

[7] R. Polikar, L. Udpa, S. Udpa, and V. Honavar, "Learn++: An incremental learning algorithm for supervised neural networks," IEEE Transactions on Systems, Man and Cybernetics (C), Special Issue on Knowledge Management, vol. 31, pp. 497-508, 2001.

[8] H. He and S. Chen, "IMORL: Incremental multiple-object recognition and localization," IEEE Transactions on Neural Networks, vol. 19, no. 10, pp. 1727-1738, 2008.

[9] H. Wang, W. Fan, P. S. Yu, and J. Han, "Mining concept-drifting data streams using ensemble classifiers," in KDD '03: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003, pp. 226-235.

[10] M. Last, "Online classification of nonstationary data streams," Intelligent Data Analysis, vol. 6, no. 2, pp. 129-147, 2002.

[11] C. Aggarwal, J. Han, J. Wang, and P. S. Yu, "On demand classification of data streams," in Proc. Int. Conf. KDD, 2004.

[12] Y. Law and C. Zaniolo, "An adaptive nearest neighbor classification algorithm for data streams," in Proc. European Conf. PKDD, 2005.

[13] H. He and E. A. Garcia, "Learning from imbalanced data," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 9, pp. 1263-1284, 2009.

[14] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321-357, 2002.

[15] H. Masnadi-Shirazi and N. Vasconcelos, "Asymmetric boosting," in Proc. Int. Conf. Machine Learning, 2007.

[16] X. Hong, S. Chen, and C. J. Harris, "A kernel-based two-class classifier for imbalanced data-sets," IEEE Transactions on Neural Networks, vol. 18, no. 1, pp. 28-41, 2007.

[17] S. Chen and H. He, "SERA: Selectively recursive approach towards nonstationary imbalanced stream data mining," in Proc. International Joint Conference on Neural Networks (IJCNN), 2009, pp. 522-529.

[18] J. Gao, W. Fan, J. Han, and P. S. Yu, "A general framework for mining concept-drifting streams with skewed distribution," in Proc. SIAM Int. Conf. Data Mining, 2007.

[19] J. Gao, W. Fan, and J. Han, "On appropriate assumptions to mine data streams: Analysis and practice," in Proc. Int. Conf. Data Mining, Washington, DC, USA, 2007, pp. 143-152.

[20] K. Tumer and J. Ghosh, "Error correlation and error reduction in ensemble classifiers," Connection Science, vol. 8, no. 3-4, pp. 385-403, 1996.

[21] ——, "Analysis of decision boundaries in linearly combined neural classifiers," Pattern Recognition, vol. 29, pp. 341-348, 1996.

[22] T. Fawcett, "ROC graphs: Notes and practical considerations for data mining researchers," HP Laboratories Tech. Rep. HPL-2003-4, 2003.

