Data Stream Classification: Training with Limited Amount of Labeled Data
Mohammad Mehedy Masud, Latifur Khan, Bhavani Thuraisingham
University of Texas at Dallas
Jing Gao, Jiawei Han
University of Illinois at Urbana-Champaign
To appear in the IEEE International Conference on Data Mining (ICDM), Pisa, Italy, Dec 15-19, 2008. Funded by: Air Force
Data Stream Classification Techniques
Data stream classification is challenging for two reasons:
◦ Infinite length – we cannot use all historical data for training
◦ Concept-drift – old models become outdated
Solutions:
◦ Single-model classification with incremental learning
◦ Ensemble classification – ensembles can be updated more efficiently and handle concept-drift more effectively
Our solution: an ensemble approach
Ensemble Classification
Ensemble techniques build an ensemble of M classifiers:
◦ The data stream is divided into equal-sized chunks
◦ A new classifier is trained from each labeled chunk
◦ The new classifier replaces one old classifier (if required)
[Figure: the last labeled data chunk trains a new classifier, which updates the ensemble; the ensemble then classifies the last unlabeled data chunk.]
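As a rough illustration of this loop, here is a minimal Python sketch of fixed-size ensemble maintenance. The names `train_classifier` and `error_of` are hypothetical placeholders for whatever learner and error estimate are used, and dropping the highest-error models is one common replacement policy, not necessarily the paper's exact rule.

```python
# Minimal sketch of chunk-based ensemble maintenance (illustrative names).
def update_ensemble(ensemble, labeled_chunk, M, train_classifier, error_of):
    """Train a model on the newest labeled chunk and keep the best M models."""
    new_model = train_classifier(labeled_chunk)       # learn from the new chunk
    candidates = ensemble + [new_model]
    # Rank candidates by error on the newest chunk, so models invalidated
    # by concept-drift are the first to be replaced.
    candidates.sort(key=lambda m: error_of(m, labeled_chunk))
    return candidates[:M]
```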
Limited Labeled Data Problem
Ensemble techniques assume that the entire data chunk is labeled during training. This assumption is impractical because:
◦ Data labeling is both time consuming and costly
◦ It may not be possible to label all the data, especially in a streaming environment where data is produced at high speed
Our solution:
◦ Train classifiers with a limited amount of labeled data, assuming only a fraction of each data chunk is labeled
◦ We obtain better results than other techniques that train classifiers on fully labeled data chunks
Training With a Partially-Labeled Chunk
◦ Train a new classifier using semi-supervised clustering
◦ If a new class has arrived in the stream, refine the existing classifiers
◦ Update the ensemble
[Figure: (1) the last partially-labeled data chunk trains a classifier using semi-supervised clustering; (2) the existing classifiers are refined; (3) the ensemble is updated; (4) the ensemble classifies the last unlabeled data chunk.]
Semi-Supervised Clustering
Overview:
◦ Only a few instances in the training data are labeled
◦ We apply impurity-based clustering
◦ The goal is to minimize cluster impurity
◦ A cluster is completely pure if all the labeled data in that cluster belong to the same class
Objective function (minimized over cluster assignments):
$$\min \sum_{i=1}^{K} \Big( \sum_{x \in X_i} D(x, \mu_i) + Imp_i \Big)$$
where K = number of clusters; X_i = data points belonging to cluster i; D(x, \mu_i) = distance between point x and the centroid \mu_i of cluster i; L_i = labeled data points belonging to cluster i; Imp_i = impurity of cluster i = entropy of the labels in L_i × dissimilarity count.
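A minimal sketch of this impurity measure, assuming the dissimilarity count is the number of labeled points that disagree with the cluster's majority label (an interpretation of the slide, not a definition taken from the paper):

```python
import math
from collections import Counter

def impurity(labels):
    """Imp_i = (entropy of labels) * (dissimilarity count) for one cluster.

    `labels` holds the class labels of the cluster's labeled points only.
    """
    if not labels:
        return 0.0                                 # no labeled points: pure
    counts = Counter(labels)
    n = len(labels)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    dissimilarity = n - max(counts.values())       # points off the majority class
    return entropy * dissimilarity
```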
Semi-Supervised Clustering Using E-M
Constrained initialization: initialize K seeds as follows.
◦ For each class Cj, select kj seeds from the labeled instances of Cj using the farthest-first traversal heuristic, where kj = (Nj / N) * K; Nj = number of labeled instances in class Cj, N = total number of labeled instances
Repeat the E-step and M-step until convergence:
◦ E-step: assign each instance to a cluster so that the objective function is minimized; apply Iterated Conditional Modes (ICM) until convergence
◦ M-step: re-compute the cluster centroids
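The constrained initialization might look like the sketch below; the proportional rounding and the random starting point of the traversal are illustrative choices, not details taken from the paper.

```python
import random

def farthest_first(points, k, dist):
    """Farthest-first traversal: start anywhere, then repeatedly pick the
    point whose nearest chosen seed is farthest away."""
    seeds = [random.choice(points)]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(dist(p, s) for s in seeds)))
    return seeds

def constrained_init(labeled, K, dist):
    """Allocate k_j = (N_j / N) * K seeds per class C_j, drawn from that
    class's labeled instances; `labeled` maps a class label to its points."""
    N = sum(len(pts) for pts in labeled.values())
    seeds = []
    for cls, pts in labeled.items():
        k_j = max(1, round(len(pts) / N * K))      # proportional seed budget
        seeds.extend(farthest_first(pts, min(k_j, len(pts)), dist))
    return seeds                                   # roughly K seeds after rounding
```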
Saving Cluster Summaries as Micro-Clusters
For each of the K clusters created by the semi-supervised clustering, save the following:
◦ Centroid
◦ n: total number of points
◦ L: total number of labeled points
◦ Lt[ ]: array containing the number of labeled points belonging to each class (e.g., Lt[j] = number of labeled data points belonging to class Cj)
◦ Sum[ ]: array containing the per-dimension sum of the attribute values of all the data points (e.g., Sum[r] = sum of the r-th dimension over all data points in the cluster)
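One way to hold this summary, as an illustrative Python sketch whose field names mirror the slide's notation (class labels are assumed to be integer indices):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class MicroCluster:
    """Per-cluster summary kept after the raw points are discarded."""
    centroid: List[float]
    n: int = 0                                      # total number of points
    L: int = 0                                      # number of labeled points
    Lt: List[int] = field(default_factory=list)     # labeled count per class
    Sum: List[float] = field(default_factory=list)  # per-dimension value sums

    def add(self, point: List[float], label: Optional[int] = None):
        """Fold one (optionally labeled) data point into the summary."""
        self.n += 1
        if not self.Sum:
            self.Sum = [0.0] * len(point)
        for r, v in enumerate(point):
            self.Sum[r] += v                        # Sum[r]: r-th dimension sum
        if label is not None:
            self.L += 1
            while len(self.Lt) <= label:
                self.Lt.append(0)
            self.Lt[label] += 1                     # Lt[j]: labeled count of C_j
        self.centroid = [s / self.n for s in self.Sum]
```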
Using the Micro-Clusters as a Classification Model
◦ We discard all raw data points after saving the micro-clusters
◦ The set of K micro-clusters built from a data chunk serves as a classification model
To classify a test data point x using the model:
◦ Find the Q nearest micro-clusters (by computing the distance between x and their centroids)
◦ For each class Cj, compute the cumulative normalized frequency CNFrq[j] = sum of Lt[j] / L over the Q nearest micro-clusters
◦ Output the class Cj with the highest CNFrq[j]
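A sketch of this classification rule, reusing the MicroCluster fields from the previous sketch; Euclidean distance to the centroid is an assumed choice of metric.

```python
import math

def classify(x, micro_clusters, Q):
    """Return the class with the highest cumulative normalized frequency
    CNFrq[j] = sum of Lt[j] / L over the Q nearest micro-clusters.
    Assumes at least one of the Q nearest clusters has labeled points."""
    nearest = sorted(micro_clusters, key=lambda mc: math.dist(x, mc.centroid))[:Q]
    num_classes = max(len(mc.Lt) for mc in nearest)
    cnfrq = [0.0] * num_classes
    for mc in nearest:
        if mc.L == 0:
            continue                                # no labels, no vote
        for j, count in enumerate(mc.Lt):
            cnfrq[j] += count / mc.L
    return max(range(num_classes), key=lambda j: cnfrq[j])
```

With Q small relative to K, this behaves like a nearest-neighbor vote over cluster summaries rather than over raw points, which is what makes discarding the raw data affordable.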
Experiments
Data sets:
◦ Synthetic data – simulates an evolving data stream
◦ Real data – botnet data, collected from real botnet traffic generated in a controlled environment
Baseline: On-Demand Stream – C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A framework for on-demand classification of evolving data streams. IEEE Transactions on Knowledge and Data Engineering, 18(5):577-589, 2006.
For training, we use 5% labeled data in each chunk. So, if there are 100 instances in a chunk:
◦ Our technique (SmSCluster) uses only 5 labeled and 95 unlabeled instances for training
◦ On-Demand Stream uses all 100 labeled instances for training
Environment: S/W: Java; OS: Windows XP; H/W: Pentium-IV 3 GHz dual-core, 2 GB RAM
Results
Each time unit = 1,000 data points (botnet) and 1,600 data points (synthetic)
Conclusion
◦ We address a more practical approach to classifying evolving data streams: training with a limited amount of labeled data.
◦ Our technique applies semi-supervised clustering to train classification models.
◦ Our technique outperforms other state-of-the-art stream classification techniques that use 20 times more labeled data for training.