AN ADAPTIVE GRID-BASED METHOD FOR CLUSTERING MULTI-DIMENSIONAL ONLINE DATA STREAMS

Toktam Dehghani

Department of Computer Engineering, Ferdowsi University Mashhad, Mashhad, Khorasan Razavi, Iran

[email protected] http://toktamdehghani.com

Mahmoud Naghibzadeh

Department of Computer Engineering, Ferdowsi University Mashhad, Mashhad, Khorasan Razavi, Iran

[email protected] http://profsite.um.ac.ir/~naghibzadeh/

Mohamadreza Afsharisaleh

Department of Engineering, Islamic Azad University, Mashhad, Khorasan Razavi, Iran, [email protected]

Abstract:

Clustering is an important task in mining evolving data streams. A lot of data streams are high dimensional in nature. Clustering in a high dimensional data space is a complex problem, which is inherently more complex for data streams. Most data stream clustering methods are not capable of dealing with high dimensional data streams, so they sacrifice the accuracy of clusters. In order to solve this problem, we propose an adaptive grid-based clustering method. Our focus is on providing up-to-date arbitrarily shaped clusters along with improving the processing time and bounding the amount of memory usage. In our method (B+C tree), a structure called the "B+cell tree" is used to keep the recent information of a data stream. In order to reduce the complexity of the clustering, a structure called the "cluster tree" is proposed to maintain multi-dimensional clusters. A cluster tree yields high quality clusters by keeping the boundaries of clusters in a semi-optimal way. The cluster tree captures the dynamic changes of data streams and adjusts the clusters. Our performance study over a number of real and synthetic data streams demonstrates the scalability of the algorithm on the number of dimensions and the amount of data without sacrificing the accuracy of the identified clusters.

Keywords: data streams; data mining; clustering; grid-based clustering; high dimensional data streams.

1. Introduction

In recent years, data streams have attracted attention in different applications of computer science, such as customer click streams, multimedia data, sensor data, network monitoring, telecommunication systems, and stock markets. A data stream is defined as a massive unbounded sequence of data elements continuously generated at a rapid rate [Park and Lee (2007)]. Management and processing of these online, rapid, unbounded streams raises new challenges because traditional algorithms are usually not feasible for such operations [Beringer and Hüllermeier (2003)]. Online data stream processing should satisfy the following requirements [Park and Lee (2007)]:

1. Each data element should be examined at most once to analyze a data stream.
2. Memory usage for data stream analysis should be finitely bounded even though new elements are continuously generated in the data stream.
3. Newly generated data elements should be processed as fast as possible to produce the up-to-date analysis result of the data stream.

Clustering refers to the process of grouping a collection of objects into "clusters" such that objects within the same cluster are similar in a certain sense, and objects from different clusters are dissimilar [Beringer and Hüllermeier (2003)]. Clustering of data streams has been studied in recent years, but few of these methods can effectively cluster large multi-dimensional data streams.


In this paper, we consider the problem of on-line clustering of multi-dimensional data streams. Our focus is on providing up-to-date arbitrarily shaped clusters along with processing as fast as possible and bounding the amount of memory space used to maintain information. The remainder of the paper is organized as follows: section 2 provides some background information on data stream clustering algorithms. In section 3, a method for clustering data streams is proposed. In section 4, several experimental results are analyzed to evaluate the performance of the proposed method. Finally, section 5 concludes the paper.

2. Related work

Clustering is one of the major data mining categories; it groups a set of data into classes called clusters. Clustering techniques are categorized into several different approaches: partitioning, hierarchical, density-based, grid-based and model-based [Park and Lee (2007)][Guha et al. (2003)]. There are several clustering algorithms for data streams that use different approaches. In the following, data stream clustering algorithms such as STREAM [Guha et al. (2003)], CluStream [Agrawal et al. (2003)], HPStream [Agrawal et al. (2004)], EStream [Thanawin et al. (2007)], DenStream [Cao et al. (2006)], DStream [Chen and Tu (2007)], cell tree [Park and Lee (2007)], and CS tree [Jae et al. (2009)] are discussed.

In [Guha et al. (2003)], the STREAM and LSEARCH algorithms are proposed to find the clusters of the continuously generated data elements over a data stream [Park and Lee (2007)] [Muthukrishnan (2003)]. STREAM regards a data stream as a sequence of stream chunks. A stream chunk is a set of consecutively generated data elements that fits in the main memory. For each chunk, STREAM clusters its elements and retains the weighted cluster centers. The centers are weighted according to the number of elements attracted to them. Then, the weighted centers are retained for each chunk examined so far, to obtain a set of weighted centers for the entire stream. STREAM uses LSEARCH, which is an O(1)-approximate k-means algorithm, for clustering the chunks and the weighted centers. Although this algorithm makes a single pass over a data stream and uses small space, when the number of clusters is not known in advance the LSEARCH routine should be iteratively performed until the quality of clusters is maximized, which makes it not directly applicable to data streams [Park and Lee (2007)]. Like other partitioning approaches, STREAM is incapable of revealing clusters of arbitrary shapes and of detecting noise and outliers [Chen and Tu (2007)].

A hierarchical algorithm called CluStream [Agrawal et al. (2003)] is proposed for the clustering of evolving data streams. It divides the clustering process into on-line and off-line components. The on-line component computes and stores statistics about the data stream using micro clusters. The information of a micro cluster is represented by a cluster feature vector which is similar to the cluster feature vector of BIRCH. The on-line micro cluster processing is divided into two phases: statistical data collection and updating of the micro clusters. In the first phase, the statistical totals of the micro clusters are maintained; the predefined number of micro clusters is determined by the available space of main memory. In the second phase, micro clusters are updated when a new data element is processed. If the new data element falls within the boundary of an existing cluster, the feature vector of the micro cluster is updated by the new data element; otherwise, a new cluster with a unique id is created for the new data element. In this case, if the number of micro clusters becomes larger than the predefined one, the nearest two micro clusters are merged into one micro cluster or the oldest micro clusters are deleted. However, CluStream uses a predefined constant number of micro clusters, which is especially risky for an evolving data stream [Chen and Tu (2007)].
To cluster an evolving data stream based on both historical and current stream data, the snapshots of a set of micro clusters are stored at different levels of granularity, so more information is maintained for recent events as opposed to older events. In the off-line component, the macro clusters of CluStream are generated by executing the k-means algorithm on the accumulated snapshots of micro clusters. This component can perform user-directed macro clustering as well as cluster evolution analysis. To allow a user to explore the stream clusters over a specified time period 'h', the two snapshots of the micro clusters at the times 'tc' and 'tc−h' are compared, and the k-means algorithm is executed on the subtracted cluster feature vectors. To analyze the evolution of micro clusters in the period 'h', the ids of the clusters in the two snapshots are compared and the added, deleted or retained clusters are identified. CluStream yields high quality clusters and it maintains scalability in terms of stream size. However, this algorithm is not suitable for finding clusters over an on-line data stream due to its off-line component.

A lot of data streams are high dimensional in nature. The high dimensional case presents a special challenge to clustering algorithms even in the traditional domain of static data sets, and it is significantly more computationally intensive in the data stream environment. Another algorithm, which is an extended form of CluStream, is introduced in [Agrawal et al. (2004)]. The algorithm is referred to as HPStream since it describes a high dimensional projected stream clustering method. The method incorporates projection-based clustering and a fading cluster structure. In projected clustering, each cluster is specific to a particular group of dimensions and the subset of dimensions may vary over the different clusters. Previous projected clustering methods cannot be easily generalized to the data stream problem because they require multiple passes over the data and they are too computationally intensive for the data stream setting.


In addition, for data streams it is essential to design methods which efficiently adjust to the progression of streams. HPStream assigns to each cluster a bit-vector which corresponds to the relevant set of dimensions of the stream data. Each element in this vector has a 0-1 value according to whether or not a given dimension is included in that cluster. As the algorithm progresses, this bit vector is updated in order to reflect the changing set of dimensions. HPStream uses a fading cluster structure to be able to adjust the clusters in a flexible way. The fading cluster structure captures a sufficient number of statistics, so it is possible to compute key characteristics of the clusters. A function called the fading function is defined, which is monotonically decreasing and whose value lies in the range (0,1). This function is exponential and gradually discounts the history of past behavior. HPStream is incrementally updatable and scalable on both the number of dimensions and the size of the data stream, and in comparison with STREAM and CluStream it achieves better clustering quality for high dimensional data [Agrawal et al. (2004)].

Since the characteristics of the data in streams evolve over time, various types of evolution should be supported by algorithms. In order to improve existing stream clustering algorithms, EStream [Thanawin et al. (2007)] was presented. EStream classifies the evolution of clusters into five categories: appearance, disappearance, self-evolution, merge and split. In this technique, incoming data, based on a similarity score, may be assigned to an active cluster or be classified as isolated. Eventually, if a region becomes dense, a new cluster appears. Existing clusters that contain only old data are faded, and ultimately disappear. By analyzing histograms, clusters can be split. Also, this algorithm checks every pair of clusters and merges the overlapping ones. If the number of clusters exceeds the defined limit, the algorithm merges the closest pairs. EStream improved stream clustering algorithms by supporting data evolutions and presenting a new suitable cluster representation and a distance function. However, EStream requires a limit on the number of clusters, which may cause incorrect clustering. This algorithm needs a lot of accumulated data for the appearance of initial clusters and for detecting some evolutions such as merges. EStream exhibits linear runtime in the number of dimensions but polynomial runtime in the number of clusters due to the merging procedure.

Previously proposed streaming algorithms produce spherical clusters. A density-based algorithm called DenStream [Cao et al. (2006)] was introduced to overcome this drawback. This algorithm can be divided into two parts: an online part for maintaining micro clusters and an offline part for generating the final clusters. In order to summarize clusters with arbitrary shapes, the micro cluster synopsis is designed as a set of micro clusters, and the final clusters are found by applying DBSCAN in the offline part. In addition to distinguishing potential clusters and outliers, DenStream stores them as micro clusters in an online way and separates their processing and memory space. For each new data element, if it is far from all potential micro clusters and outlier micro clusters, a new outlier micro cluster is created. An outlier micro cluster whose weight is more than the threshold is converted into a potential micro cluster. To limit memory consumption, DenStream uses a pruning strategy which provides opportunity for the growth of new clusters while promptly getting rid of outliers.
So, in this algorithm no assumption on the number of clusters is needed. DenStream achieves consistently high clustering quality, but it relies on global absolute density parameters, making the result of clustering sensitive to the parameter values. This algorithm cannot distinguish clusters which have different levels of density.

DStream [Chen and Tu (2007)] is a density- and grid-based algorithm like DenStream. DStream also tries to resolve the inability to find clusters of arbitrary shapes. The difference is that it is a grid-based algorithm using a density grid structure. The algorithm uses an online component which maps each input data record into a grid cell, and an offline component which computes the grids' densities and clusters the grids based on their density. In the online component, the space is partitioned into fine grids and new data records are mapped into the corresponding grids. The algorithm adopts a density decaying technique to capture the dynamic changes of a data stream. The offline component dynamically adjusts the clusters in every gap time. A grid cluster is a connected grid group which has higher density than the surrounding grids. Grids that are under consideration for clustering analysis are maintained in a grid list, implemented as a hash table to allow fast access and update. Further, a technique is developed to detect and remove sporadic grids mapped to by outliers. In this algorithm, grids that have previously received many data elements, but whose density has been reduced by the effect of the decay factor, are not removed or marked as sporadic, because they may become dense again in the future. During the clustering, considering only the non-sporadic grids in the grid list instead of all possible grids saves computing time and space. However, the DStream algorithm does not perform well on high dimensional data streams because it requires a very large number of grids.

In [Jae et al. (2009)], a grid-based clustering method called the CS tree is proposed to find the clusters of continuously generated data elements over a data stream. The multi-dimensional data space of a data stream is partitioned into a fixed number of equal-size grid cells; this fixed number is called the partitioning factor. The algorithm is composed of three steps. In the first step, on-going one-dimensional clusters in each dimension of a data stream are independently traced by the one-dimensional version of the cell-tree method [Park and Lee (2007)]. A unique cluster identifier is assigned to each individual cluster. The range of each one-dimensional cluster in an individual dimension becomes the granule for finding multi-dimensional clusters. Upon receiving a new data element, each of its dimensional values is compared with the one-dimensional clusters of its corresponding dimension. For each dimensional value, if a matched cluster exists, the identifier of the cluster is obtained.


The result of this matching is represented by a list of matched cluster identifiers ordered by a predefined sequence of dimensions. The support of a rectangular space is defined by the ratio of the number of data elements in the space over the total number of data elements generated so far. In order to find d-dimensional clusters, the CS tree is used. A k-depth node in the CS tree corresponds to a k-dimensional rectangle space and is allowed to have child nodes. Among the leaf nodes whose depth is the same as the dimensionality of the data stream, the dense multi-dimensional rectangles whose supports are high enough correspond to the final clusters. For improving the clustering, the precise range of each final cluster is estimated by a data distribution synopsis. This algorithm is scalable on the number of dimensions without sacrificing the accuracy of identified clusters.

Due to our study, there are some problems in the CS tree method that can be solved to gain better results. The following are the CS tree's problems. First, for each data element, a single linked list in each dimension should be scanned in a sequential manner to find the related interval, which is a time-consuming process. Second, because of the defined partition threshold, in the recursive procedure of partitioning the grid cells to find the unit grid cells, only a few of the data elements belong to the final clusters and many small clusters are not discovered. Third, in this algorithm, in order to reduce the complexity of the clustering of high dimensional data streams, in the first step one-dimensional clusters in each dimension are traced; then a sequence of one-dimensional clusters is combined by the CS tree to make the multi-dimensional clusters. Although the algorithm tries to find the real clusters by finding a frequently co-occurring set of one-dimensional clusters, this method is not precise. The results show that the number of multi-dimensional clusters and their outliers are not accurate, and this may lead to overlap of the clusters (occultation). Increasing the density of the data space and the number of clusters makes the problem more obvious. Also, the updating of multi-dimensional clusters is not precise.

Fig. 1(a) shows a two-dimensional data space; in the x dimension there are two clusters (c1,c2) and in the y dimension there are two clusters (c3,c4). In the CS tree, one-dimensional clusters are combined to make the final clusters, so the CS tree finds four clusters in this data space due to the noise; however, there are only three clusters. Fig. 1(b) shows a two-dimensional data space in which, after projecting the data in each dimension, in the x dimension there is one cluster (c1) and in the y dimension there are two clusters (c2,c3). The CS tree finds three clusters in this data space due to the overlapping of clusters; however, there are only two clusters.

Fig. 1) An example of clustering with CS tree

3. The proposed algorithm (B+C tree)

In this section, we present the fundamental concepts of the grid, which is mainly based on the CS tree, and then the proposed algorithm is described.

3.1. Fundamental concepts

A data stream for a d-dimensional data space N = N1 × ... × Nd consists of a set of d-dimensional records {e1, ..., ek, ...} arriving at time stamps {T1, ..., Tk, ...}. Each data point ei is a multi-dimensional record containing d dimensions, denoted by ei = <ei1, ..., eid>.

We note that since a data stream is a massive unbounded sequence of data elements and applications naturally impose a limited memory constraint, it is impossible to maintain all the elements of a data stream. Due to this reason, it is essential to use a scalable method to monitor the distribution of the continuously generated data elements of a data stream. In the next section, we will discuss the structure for maintaining and updating the distribution statistics of data elements.

3.2. A fading structure for monitoring the distribution statistics of data elements

To find clusters over a data stream accurately, the distribution statistics of continuously generated data elements should be carefully monitored. A common way to find clusters and high-density regions in the data space is dividing a multi-dimensional space into finite intervals (cells) in each dimension, which are merged to explore clusters in higher dimensions. In order to monitor the distribution of data, a histogram is constructed by dynamically partitioning the data space into a number of non-overlapping regions and then mapping the data to the cells in the grid. The number of points inside a cell can be used to determine


the density (count), average and deviation of the data elements of the cell. Clustering patterns embedded in a data stream usually change as time goes by. In order to keep only the recent information of a data stream, the weight of the information represented by each data element should be differentiated according to the generation time of the data element. To identify the recent change of data elements, a fading factor is used. A fading factor determines how fast the effect of old information fades away. According to [Javitz and Valdes (1994)], the weight of the information represented by a data element generated in a data stream can be decayed based on the decay rate (τ). The recent distribution statistics of a cell are defined as follows [Park and Lee (2007)]:

(1) ct = cv · τ^(t−v) + 1

(2) µt = (µv · cv · τ^(t−v) + et) / ct

(3) δt = √( (cv · τ^(t−v) · (δv² + µv²) + et²) / ct − µt² )

In these equations, τ, ct, µt, δt and v denote the following: τ is the decay rate based on the model representation in [Javitz and Valdes (1994)]; ct is the decayed count of data elements in the cell until 't'; µt is the decayed average of the data elements in the cell until 't'; δt is the decayed standard deviation of the data elements in the cell until 't'; v is the latest update time of the cell; and et is the value of the data element arriving at time 't'.
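Since each update touches only the cell's current summary, equations (1)-(3) can be maintained in constant time per arriving element. Below is a minimal Python sketch of this decayed-statistics update, assuming the reconstructed forms of (1)-(3) above; the class and field names are illustrative, not from the paper.

```python
import math

class CellStats:
    """Decayed distribution statistics of one grid cell (eqs. 1-3)."""
    def __init__(self, tau):
        self.tau = tau      # decay rate, 0 < tau < 1
        self.c = 0.0        # decayed count c_t
        self.mu = 0.0       # decayed average mu_t
        self.delta = 0.0    # decayed standard deviation delta_t
        self.v = None       # latest update time

    def update(self, e, t):
        """Fold the element value e arriving at time t into the statistics."""
        if self.v is None:                     # first element in this cell
            self.c, self.mu, self.delta, self.v = 1.0, e, 0.0, t
            return
        w = self.tau ** (t - self.v)           # fade the old statistics
        c_new = self.c * w + 1.0                          # eq. (1)
        mu_new = (self.mu * self.c * w + e) / c_new       # eq. (2)
        var = (self.c * w * (self.delta ** 2 + self.mu ** 2) + e * e) / c_new \
              - mu_new ** 2                               # eq. (3)
        self.c, self.mu = c_new, mu_new
        self.delta, self.v = math.sqrt(max(var, 0.0)), t
```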

3.3. Parameters of the proposed algorithm

In our algorithm several parameters are used to manage clustering of data streams. The parameters are summarized in table 1.

Table 1: Clustering parameters

Name | Definition | Value
λ | Size of a unit cell | 2, 4, 8, 16
h | Partitioning factor | 2, 4, 8, 16
f-th | Percent of data in a final cluster | 0.0001, 0.001, 0.01
c-th | Percent of data in initial clusters | f-th ≥ c-th
s-th | Percent of data in a sparse cluster | f-th ≥ c-th > s-th
p-th | Percent of data in a dense cell | p-th = (α · f-th)/log, α ∈ (0,1)
m-th | Percent of data in a sparse cell | m-th = p-th/(h + 1)

3.4. Adaptive grid-based method for maintaining the distribution statistics of data elements

In this paper, adaptive grid-based clustering is used for clustering of data elements in data streams. Grid-based clustering algorithms first cover the data space with grid cells, and the statistical distribution is collected for all the data objects. Regions which have more points than a specified threshold are identified as dense. Dense regions that are adjacent to each other are merged to find the embedded clusters. Given the current data stream Dt, for each one-dimensional data space N the distribution statistics of the corresponding cell are updated. When a cell is dense enough, it is partitioned into smaller equal-size cells. Since such partitioning can be performed recursively in dense regions of the data space, the distribution statistics of these regions become more accurate. The current density of a cell is the ratio of the number of data elements that are inside the interval of the cell over the total number of data elements. When the current density of a cell (g) is greater than or equal to the partitioning threshold (p-th), it is partitioned into h (a predefined partitioning factor) smaller equal-size cells. The distribution statistics of the new cells gi (1 ≤ i ≤ h) are initialized by the normal distribution of the parent cell as follows [Park and Lee (2007)]:

(4) φ(x) = (1 / (g.δ · √(2π))) · e^(−(x − g.µ)² / (2 · g.δ²))

(5) gi.c = g.c · ∫I(gi) φ(x) dx

(6) gi.µ = (g.c / gi.c) · ∫I(gi) x · φ(x) dx

(7) gi.δ = √( (g.c / gi.c) · ∫I(gi) x² · φ(x) dx − (gi.µ)² )

Here I(gi) denotes the interval of the new cell gi, and g.c, g.µ and g.δ are the decayed count, average and standard deviation of the parent cell g. In these equations, gi.c, gi.µ and gi.δ denote the following: gi.c is the count of data elements in gi until 't'; gi.µ is the average of the data elements in gi until 't'; gi.δ is the standard deviation of the data elements in gi until 't'.
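Equations (6) and (7) are reconstructed here as the mean and deviation of the parent's normal model restricted to the child interval; under that assumption, the initialization has a closed form via the standard truncated-normal moments. A hypothetical Python sketch:

```python
import math

def _phi(z):
    """Standard normal pdf."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def _Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def split_cell(c, mu, delta, lo, hi, h):
    """Split the cell [lo, hi) with statistics (c, mu, delta) into h equal
    sub-cells, initializing each sub-cell from the parent's normal model
    (reconstructed eqs. 4-7). Returns (count, mean, deviation, lo, hi) tuples."""
    width = (hi - lo) / h
    children = []
    for i in range(h):
        a, b = lo + i * width, lo + (i + 1) * width
        if delta <= 0.0:                   # degenerate parent: all mass at mu
            inside = a <= mu < b
            children.append((c if inside else 0.0,
                             mu if inside else (a + b) / 2, 0.0, a, b))
            continue
        alpha, beta = (a - mu) / delta, (b - mu) / delta
        z = _Phi(beta) - _Phi(alpha)       # probability mass over [a, b): eq. (5)
        if z <= 0.0:
            children.append((0.0, (a + b) / 2, 0.0, a, b))
            continue
        m = mu + delta * (_phi(alpha) - _phi(beta)) / z               # eq. (6)
        var = delta ** 2 * (1.0 + (alpha * _phi(alpha) - beta * _phi(beta)) / z
                            - ((_phi(alpha) - _phi(beta)) / z) ** 2)  # eq. (7)
        children.append((c * z, m, math.sqrt(max(var, 0.0)), a, b))
    return children
```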


In Fig. 2, the cell g2 is just becoming dense in the t-th turn and is partitioned into smaller disjoint cells (g11, g12, g13, g14). This partitioning procedure can be recursively invoked until a unit cell is found. A unit cell is defined as the smallest cell in the data space, and the interval size of every unit cell is the same as λ. Since the distribution statistics of the data elements in a data stream can change as time goes by, a specific cell may become sparse although it was dense in the past.

Fig. 2) A dense cell partitioning process [Park and Lee (2007)]

By merging such sparse cells, unnecessary cells are eliminated and the memory usage can be reduced. We consider a decay rate for reducing the weight of cells which are not updated in the recent turns. For a cell, when the current density of the cell is low, that is, when the ratio of the decayed number of the data elements inside the interval of the cell over the total number of data elements becomes less than or equal to the predefined merging threshold (m-th), such a cell is merged with a set of h−1 sparse neighbor cells.

3.5. B+cell tree

In order to manage the dynamically varied configuration of cells in the entire range of the data space efficiently, the "B+cell tree" is proposed. The B+cell tree (based on the B+ tree) provides random access to the cells in order to prepare a faster finding and updating of the distribution statistics of the cells; it also makes sequential access possible for retrieving the distribution statistics of neighboring cells. A B+cell tree is defined as follows:

- Each node contains a number of cells varying between m/2 and m (except the root).
- The id of each cell is defined by the beginning of its range. Among the cells there exists a total ordering relationship according to their ids.
- All leaves appear in the same level and carry the distribution statistic of the cell.

In a B+cell tree, two kinds of nodes are defined (Fig. 3):

- Non-leaf nodes: this kind of node includes a list of cell ids and a list of pointers to its children.
- Leaf nodes: this kind of node includes a list of cell ids and a list of pointers to the defined structure for storing the distribution statistic of a cell, called the cell's info-box.

Fig. 3) B+cell tree


Theorem 1: Given a partitioning factor h for a data set of a one-dimensional data space N, the minimum number of recursive partitioning operations needed to produce a unit cell is log_h(range(N)/λ) [Agrawal et al. (2005)].

Theorem 2: In a B+cell tree, if n is the number of data elements and m is the maximum number of children a node can have, the average time complexity of searching, insertion and deletion is O(log_m n) [Mehta and Sahni (2004)].

Assume the total number of cells in a B+cell tree in a one-dimensional space is range(N)/λ and the maximum number of children that a node can have is h; then, according to Theorems 1 and 2, the average height of a B+cell tree is log_h(range(N)/λ), and the average time complexity of the operations is bounded by the minimum number of recursive partitioning operations needed to produce a unit cell. For example, with range(N) = 100, λ = 2 and h = 4, the height is about log_4(50) ≈ 2.8, so on average about three levels are traversed per operation.

Definition 1: insert procedure
(1) For each new cell, perform a search to determine the related leaf node. Record the path in a stack.
(2) Insert the id of the new cell into the related node, together with the pointer to the cell's info-box.
(3) If the node is full (more than m cells in a node), split it (a sketch of this step follows the definition):
  (i) Allocate a new leaf and move half of the node's cells to the new leaf.
  (ii) Update the extra pointer of the node, its neighbors and the new node.
  (iii) Insert the smallest id of the new leaf into the parent.
(4) If the parent is full, split it:
  (i) Add the middle id to the parent node.
  (ii) Repeat until a parent is found that does not need to be split.
(5) If the root splits, create a new root which has one cell and two pointers.
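Step (3) of Definition 1 is the usual B+ tree leaf split. A small illustrative sketch, assuming a leaf holds parallel lists of cell ids and info-box pointers plus the extra (sequential) pointer; all names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Leaf:
    ids: list            # sorted cell ids; an id is the beginning of the cell's range
    boxes: list          # parallel list of info-boxes (distribution statistics)
    next: "Leaf" = None  # extra pointer to the neighboring leaf (sequential access)

def split_leaf(leaf):
    """Definition 1, step (3): split an over-full leaf node.
    Returns the new right sibling and the id to insert into the parent."""
    mid = len(leaf.ids) // 2
    new_leaf = Leaf(leaf.ids[mid:], leaf.boxes[mid:], leaf.next)
    leaf.ids, leaf.boxes = leaf.ids[:mid], leaf.boxes[:mid]
    leaf.next = new_leaf                 # keep the leaf chain intact
    return new_leaf, new_leaf.ids[0]     # smallest id of the new leaf
```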

Definition 2: partitioning procedure
If the ratio of the number of data elements that are inside the interval of a cell over the total number of data elements is greater than or equal to the partitioning threshold (p-th), the cell is partitioned as follows (a sketch follows the definition):
(1) Split the range of the cell into h smaller equal cells. Create h−1 new ids.
(2) Initialize the distribution statistics of the new cells.
(3) Assign a value between 0 and h−1 to each small cell according to its order.
(4) If a small cell has the same id as its parent cell, replace the parent cell with the small cell.
(5) Else, insert the (h−1) small cells into the B+cell tree.
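A minimal sketch of Definition 2, assuming (per the B+cell tree definition) that a cell's id is the beginning of its range, so exactly one sub-cell inherits the parent's id; init_stats stands in for the normal-distribution initialization of equations (4)-(7):

```python
def partition_cell(cell_id, lo, hi, h, init_stats):
    """Definition 2: split a dense cell [lo, hi) into h equal sub-cells.

    init_stats(a, b) initializes a sub-cell's statistics from the parent
    (eqs. 4-7). Returns (kept, inserted): the sub-cell that keeps the
    parent's id, and the h-1 sub-cells to insert with new ids."""
    width = (hi - lo) / h
    kept, inserted = None, []
    for value in range(h):          # value 0..h-1 records the sub-cell's place
        a, b = lo + value * width, lo + (value + 1) * width
        sub = {"id": a, "value": value, "stats": init_stats(a, b)}
        if a == cell_id:            # same id as the parent cell:
            kept = sub              # replace the parent in place
        else:
            inserted.append(sub)    # insert via Definition 1
    return kept, inserted
```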

Definition 3: removing procedure
To merge the neighboring cells, each cell is removed as follows:
(1) Start at the root and find the leaf node where the cell belongs. Remove the cell.
(2) If the cell's id is the smallest in the node, update the parent with the second smallest id in the node.
(3) If the leaf node is at least half-full, done.
(4) If the leaf node has fewer cells than it should:
(5) If the sum of the numbers of cells in it and in one of its adjacent nodes is more than m/2, try to re-distribute, borrowing from the adjacent node. Else, merge the node with an adjacent node such that the sum of the numbers of cells in the two nodes is less than m; the node with the bigger id is deleted.
(6) Merging could propagate to the root, decreasing the height of the tree.

Definition 4: merging procedure
In the partitioning procedure, a value between 0 and h−1 is assigned to each new cell. This value shows the place of the new cell in the range of the parent cell; it also helps to recognize cells that were partitioned together. In order to find the sparse cells, the leaf nodes of the tree are scanned. In a B+cell tree, some of the neighboring cells can be in another leaf node; processing of these cells is made possible by the extra pointer that references the nearest neighbor node in the tree. According to the assigned value of the cell, the direction of processing is determined (see the sketch after this list):
(1) If the value is equal to zero, the (h−1) cells in the right direction are processed.
(2) If the value is equal to h−1, the (h−1) cells in the left direction are processed.
(3) Otherwise, both directions are processed.
(4) The distribution statistics of all the cells are merged.
(5) Except the cell with the value equal to zero, all the cells' ids are stored in a stack.
(6) If all the neighbors of a sparse cell are sparse, they are merged and replaced by a cell with the smaller id; the other cells are popped from the stack and removed.

The algorithm starts by maintaining the distribution statistics of new data elements in the B+cell tree and by partitioning, inserting and merging cells, in order to recognize the dense regions of the one-dimensional data space.
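The scan directions in Definition 4 depend only on the partition value 0..h−1 recorded at split time. A small sketch; splitting the interior case into `value` cells to the left and h−1−value cells to the right is an assumption about which cells were partitioned together:

```python
def neighbors_to_scan(value, h):
    """Definition 4: which neighboring cells to visit when testing a sparse
    cell for merging, given its partition value in 0..h-1."""
    if value == 0:          # leftmost sub-cell: scan the h-1 cells to the right
        return [("right", h - 1)]
    if value == h - 1:      # rightmost sub-cell: scan the h-1 cells to the left
        return [("left", h - 1)]
    # interior sub-cell: scan both directions
    return [("left", value), ("right", h - 1 - value)]
```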


We present a "cluster tree" for composing n-dimensional clusters from one-dimensional clusters. The combination of the "B+cell tree" and the "cluster tree" makes the "B+C tree".

3.6. Cluster tree (C tree)

In this tree, for each data element, cells of the one-dimensional spaces are updated according to the dimension sequence. Based on the dense cells of the one-dimensional spaces, one-dimensional clusters are combined to make d-dimensional clusters. Each node in the k-th depth of a cluster tree corresponds to a k-dimensional cluster (Ck). For a cluster in the cluster tree, the following features are maintained:

- Count: the count of data elements in Ck, calculated according to the decay model of [Park and Lee (2007)].
- v: the last update time of the cluster.
- Interface: a developed structure for maintaining the boundaries of a k-dimensional cluster very close to its real boundaries.
- Child[]: pointers to the children of a cluster. Children of a k-dimensional cluster are (k+1)-dimensional clusters.

An interface for a k-dimensional cluster is a set of k-dimensional hypercubes that covers all the surface of the main cluster. Fig. 4 shows a cluster that is covered with 3 hypercubes; its interface can be expressed as:

(8) <(x1, x2) ⋁ (x1, y2)> ⋀ <(x4, x6) ⋁ (y2, y2)> ⋀ <(x4, x5) ⋁ (y4, y5)>

The proposed method scans the hypercubes of an interface and combines the neighboring hypercubes.

Fig. 4) Clustering interface

3.7. Multi-dimensional clusters

For each new data element, the corresponding cell in each dimension is updated. According to the defined dimension sequence, the beginning and the end of the updated cells are inserted in a list. Finally, the cluster tree monitors the list and its adjacent clusters to update the result. Clustering is based on the following parameters (for the j-th cell of the i-th dimension):

- Bi,j: the cell corresponding to the new data element in the i-th dimension.
- |Bi,j|: the range of the j-th cell of the i-th dimension.
- Count(Bi,j): the number of data elements in the j-th cell of the i-th dimension.
- Count(ci−1): the number of data elements in the parent cluster in the (i−1)-th dimension.
- |Dt|: the total number of data elements until 't'.

Initially, the root of the tree is assumed as the parent cluster, and only a dense unit cell can be a part of a cluster. For the corresponding cell in each dimension, the following conditions are discussed (for the i-th dimension; a sketch of this per-dimension step as code follows the list):

(9) Bi,j is dense enough to be a cluster, if Count(Bi,j) / |Dt| ≥ c-th.

(10) childz is sparse enough to be removed from the clusters, if Count(childz) / |Dt| < s-th.

(1) If |Bi,j| < λ, the clustering condition is invalid and the algorithm stops clustering.
For each child of the parent cluster:
(2) If Bi,j is not dense:
  (i) If childz does not exist, then stop clustering.
  (ii) If childz is sparse, delete childz and stop clustering.
(3) If Bi,j is dense and childz does not exist:
  (i) Insert a new child (cluster) if the parent cluster is dense.
  (ii) Send the new child as the parent cluster for the next dimension.
(4) If Bi,j is dense, childz exists, and in the i-th dimension childz is adjacent to Bi,j, the child (childz) is updated.
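Read as code, the case analysis above is one step per dimension. The following Python sketch assumes the reconstructed conditions (9) and (10) and an illustrative cluster-node representation; it is not the authors' implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Cluster:
    count: float = 0.0
    children: dict = field(default_factory=dict)   # cell id -> child Cluster

def update_dimension(parent, cell_id, cell_count, cell_width, d_total,
                     lam, c_th, s_th, parent_dense):
    """One per-dimension step of the multi-dimensional clustering (section 3.7).
    Returns the cluster to use as the parent for the next dimension,
    or None to stop clustering."""
    if cell_width < lam:                                  # step (1)
        return None
    child = parent.children.get(cell_id)
    if cell_count / d_total < c_th:                       # (9) fails: not dense
        if child is not None and child.count / d_total < s_th:   # condition (10)
            del parent.children[cell_id]                  # step (2.ii)
        return None                                       # step (2): stop
    if child is None:                                     # step (3)
        if not parent_dense:
            return None
        child = Cluster()                                 # simplified; eq. (13)
        parent.children[cell_id] = child                  # initializes the count
    child.count += 1                                      # step (4): update child
    return child                                          # parent for dimension i+1
```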


3.7.1. Creation of new clusters

When a new data element arrives, the corresponding cells in each dimension are updated. If a cell becomes dense, a new cluster is added to the cluster tree, and if there are any adjacent clusters, it is merged with them.

Theorem 3: If a collection of points S is a cluster in a k-dimensional space, then S is also part of a cluster in any (k−1)-dimensional projection of this space [Agrawal et al. (1998)]. So, only points that belong to the same (k−1)-dimensional cluster can be clustered together in the k-dimensional space.

According to Theorem 3, a child of a cluster in the i-th depth is an i-dimensional cluster whose points, in the first (i−1) dimensions, were part of its parent cluster. The conditions for creating a new child (childz) in the cluster tree are defined as follows:

(11) Count(Bi,j) / |Dt| ≥ c-th

(12) Count(Bi,j) / ∑j' Count(Bi,j') ≥ c-th, where the sum ranges over the cells of the parent cluster ci−1 in the i-th dimension

If both conditions are satisfied, a new child (childz) is created. The count of childz is initialized as follows:

(13) Count(childz) = Count(ci−1) · Count(Bi,j) / ∑j' Count(Bi,j')

3.7.2. Merging clusters

Our algorithm for merging clusters consists of the following steps (sketched below). For each child childv of the parent cluster (v ∈ 1..number of children):

(1) Compare childz with childv.
(2) If childz and childv are neighbors, merge childv into childz and delete childv.
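A sketch of this neighbor-merge pass; the neighborhood test and the statistics/interface combination (is_neighbor, merge) are assumed to be supplied by the cluster-tree implementation:

```python
def merge_neighbors(children, child_z, is_neighbor, merge):
    """Section 3.7.2: merge every sibling of child_z that neighbors it."""
    for child_v in list(children):        # copy: we remove while iterating
        if child_v is not child_z and is_neighbor(child_z, child_v):
            merge(child_z, child_v)       # fold child_v into child_z
            children.remove(child_v)      # delete child_v
    return child_z
```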

3.7.3. Removing a cluster

Since the distribution statistics of data elements in a data stream can change as time goes by, a cluster may become sparse although it was dense in the past. If the decayed number of data elements in a cluster over the total number of data elements is less than the sparse threshold (s-th), the cluster is removed from the cluster tree.

3.7.4. Final clusters

For each data element, the corresponding cells in the B+cell tree are updated and forwarded for clustering. The cluster tree is traversed and, according to the distribution statistics of the cells, the clusters in the depths 1 to k (1 ≤ k ≤ d) are updated. If there is a path with depth equal to the number of the data element's dimensions (d), and the number of data elements in the d-dimensional cluster over the total number of data elements until now is greater than the final cluster threshold (f-th), then the cluster is dense enough to be reported as a final cluster. The final clustering threshold defines the minimum percentage of data elements that should be in a final cluster. In the experiments, f-th is a small value, in order to determine the clusters more accurately. Therefore, in the beginning of the data stream, just a few data elements in a region can make a cluster; gradually over time, the minimum number of data elements needed to make a cluster increases. Table 2 shows the growing rate of the number of data elements needed to make a cluster (f-th = 0.001).

Table 2) The growing rate of the minimum number of data for clustering

Minimum number of data elements for final clusters | Minimum number of data elements for intermediate clusters | Number of data elements
3 | 0.3 | 1000
30 | 3 | 10000
300 | 30 | 100000
1500 | 150 | 500000

As Table 2 shows, in the 500000th turn, to cluster a cell it should contain at least 1500 data elements; according to a real experience, in 80 percent of the real clusters the number of data elements is less than 1500. So, for a large number of data elements, f-th can prevent the detection of small clusters. In order to solve this problem, a periodical adjustment is done on |Dt| as follows:

(14) |Dt| ← |Dt| · (αi / α), if αi < α; otherwise |Dt| is left unchanged

α is the desired minimum percent of the data elements that should be part of clusters (α ≪ 1). αi is the number of the data elements that belong to a final cluster in a constant period of time over the total number of the generated data elements in that period.
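Equation (14) is only partially recoverable from the source; under the reading above (scale |Dt| down whenever the clustered fraction αi falls below the target α), the periodic adjustment might look like this:

```python
def adjust_stream_length(d_t, clustered_in_period, generated_in_period, alpha):
    """Periodic adjustment of |D_t| (eq. 14) so that f-th * |D_t| does not
    outgrow genuine clusters. alpha is the desired minimum clustered fraction."""
    alpha_i = clustered_in_period / generated_in_period
    if alpha_i < alpha:          # too few elements fall inside final clusters:
        d_t *= alpha_i / alpha   # scale |D_t| down so the thresholds relax
    return d_t
```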


3.9. Refinement

In our method, the sequence of dimensions does not have to be preordered. But, based on our knowledge about a data set, the sequence of dimensions can be determined by monitoring the standard deviations of the data elements in each dimension: the dimensions can be sorted by their standard deviations in an ascending order. When the distribution of a dimension is tightly concentrated, the number of nonadjacent clusters is decreased, so the number of children and the number of nodes are reduced. Fig. 5 shows the effect of the standard deviation of the data in each dimension on the number of the cluster tree's nodes. Our algorithm is designed for numerical data sets; when a data set contains non-numerical data, the k-means technique can be applied on the final clusters, in order to minimize unwanted behaviors in the clustering such as breaking a cluster into sub-clusters.

Fig. 5) Effect of the different sequences of dimensions on the structure of the clustering tree. (a) Dimensions are sorted by the standard deviations in an ascending order. (b) Dimensions are sorted by the standard deviations in a descending order.

4. Evaluation

The evaluation criteria are described as follows [Zhao and Karypis (2002)]: the FScore is a combination of precision and recall. Precision defines the rate of correct matches in the generated solution, and recall defines the rate of correct matches in the model solution. Given a category Zr with nr similar data elements and a cluster Ci with ni data elements categorized as similar, let nri be the number of data elements in cluster Ci belonging to Zr. The FScore value of a category Zr is the maximum FScore value attained in any cluster of the clustering solution, where k is the number of clusters. A good solution has an FScore close to one.

Precision (correctness) is defined as:

(15) P(Zr, Ci) = nri / ni

Recall (accuracy) is defined as:

(16) R(Zr, Ci) = nri / nr

The FScore is defined as:

(17) F(Zr, Ci) = 2 · P(Zr, Ci) · R(Zr, Ci) / (P(Zr, Ci) + R(Zr, Ci))

The FScore of the overall clustering solution is:

(18) FScore = ∑r (nr / n) · max(1 ≤ i ≤ k) F(Zr, Ci)

where n is the total number of data elements.
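For reference, equations (15)-(18) can be computed directly from the category and cluster label assignments. A self-contained sketch (the label-list representation is an assumption):

```python
from collections import Counter

def fscore(categories, clusters):
    """Overall FScore (eqs. 15-18) of a clustering against reference categories.
    categories[i] and clusters[i] are the true and assigned labels of element i."""
    n = len(categories)
    n_r = Counter(categories)                  # size of each category Z_r
    n_i = Counter(clusters)                    # size of each cluster C_i
    n_ri = Counter(zip(categories, clusters))  # overlap of Z_r and C_i
    total = 0.0
    for r, nr in n_r.items():
        best = 0.0
        for i, ni in n_i.items():
            nri = n_ri.get((r, i), 0)
            if nri == 0:
                continue
            p = nri / ni                       # precision, eq. (15)
            rec = nri / nr                     # recall, eq. (16)
            best = max(best, 2 * p * rec / (p + rec))   # eq. (17)
        total += (nr / n) * best               # eq. (18)
    return total

# Example: two categories; the clustering splits the second one in half
print(fscore(["a", "a", "b", "b"], [1, 1, 2, 3]))  # 0.5*1.0 + 0.5*(2/3) = 0.833...
```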

In order to evaluate the performance of the proposed method "B+C tree", a number of synthetic data sets are generated by the data generator used in ENCLUS [Cheng et al. (1999)]. The domain of each dimensional value ranges over [0,100) and the value of each data element is randomly selected. In this experiment, most data elements are concentrated on 20 randomly chosen data regions, with randomly varied sizes in different dimensions [Park and Lee (2007)]. In order to show the performance of the proposed method on a real data set, the KDD-CUP'99 network intrusion detection data set [KDD Cup (1999)] is experimented with. All 41 continuous attributes are employed for clustering, and the size of each dimension is normalized into [0,100).

In the following, four different experiments are done to evaluate the performance of the proposed method. All experiments are performed on a 2.4 GHz Core 2 Duo Pentium PC machine with 2 GB main memory on Windows Vista, and all programs are implemented in Microsoft Visual Studio 2005.

Comparing "B+C tree" to the previous algorithms. In fig. 6, the performance of the proposed algorithm is compared with LSEARCH and CS tree, since they are single-pass (online) algorithms. The direct comparison is done to CS tree, which is also a grid-based clustering method for data streams. The parameters of the three methods are adjusted to provide a similar situation in order to gain fair results. The conditions of the experiment are as follows:

- f-th = 0.001, h = 10.
- Conditions are checked for refinement after each 1M data elements.
- The data set for the experiment is KDD-CUP'99.


The results show an improvement in the accuracy of the clustering in the proposed algorithm. Also, the memory usage of our algorithm is noticeably lower than that of the other algorithms. The processing time of the algorithm is slightly more than CS tree, because in our algorithm more processing time is needed for clustering in order to improve the accuracy and the memory usage. Fig. 6 shows that, as the number of data elements is increased, the average processing time per each data element is decreased.

Fig. 6) Comparing the performance of algorithms

Studying the scalability of "B+C tree" on the number of data. The performance of the proposed method on a large data stream is shown in fig. 7. The conditions of the experiment are the same as in the previous experiment. This study is done on 400,000 data elements. The average processing time per each data element as well as the memory usage decreases linearly, and the accuracy increases. In the beginning of a data stream, due to the construction of the trees, the processing time and the memory usage are increased. With time, the algorithm updates the trees, and this makes the clusters more accurate, so the processing time and the memory usage are decreased. Fig. 7 shows that the total amount of memory used to maintain clusters and cells is almost not dependent on the number of data elements.

Studying the scalability of "B+C tree" on the number of dimensions. Fig. 8 illustrates the performance of the proposed algorithm on high dimensional data streams. The conditions of the experiment are as follows:

- f-th = 0.001, h = 4.
- Conditions are checked for refinement after each 100M data elements.
- The data set for the experiment is ENCLUS.
- The number of dimensions of the data streams is varied from 10 to 50.

Table 3 shows that the memory usage and the processing time increase rapidly with an increasing number of dimensions. The algorithm has better performance on data with less than 100 dimensions.

Studying the performance of "B+C tree" on the number of clusters. Fig. 9 shows the accuracy of the algorithm on different numbers of clusters. The conditions of the experiment are similar to experiment 3. The number of clusters is varied between 4 and 50. For a data set like ENCLUS, which clusters data with distinguished borders, the number of clusters does not affect the proposed algorithm. On the other hand, the accuracies of the other algorithms are decreased because of the increment of the clusters' occultation.


Fig. 7) The performance of "B+C tree" on the number of data

Fig. 8) Comparing the scalability of the algorithms on the number of dimensions

Fig. 9) Comparing the accuracy of algorithms on the different number of clusters

Table 3) Performance of the algorithm on data streams with more than 100 dimensions

5. Conclusion

In this paper, we proposed an adaptive grid-based clustering method (B+C tree) to distinguish the potential clusters of a multi-dimensional continually generated data stream. In order to maintain the on-going distribution statistics of the data elements of data streams, the "B+cell tree" is defined. To cluster high dimensional data streams, the "cluster tree" is proposed. Our study over data streams shows that the algorithm is capable of providing up-to-date arbitrarily shaped clusters. The cluster tree maintains the boundaries of multi-dimensional clusters precisely, so the probability of clusters overlapping is almost zero. The algorithm is scalable on the number of dimensions and the size of data streams. Also, the amount of data needed for the appearance of the initial clusters is reduced by the defined clustering parameters. Finally, this algorithm improves the accuracy of clustering and reduces the memory consumption, at the cost of a slight increase in processing time.

Toktam Dehghani et al. / International Journal of Engineering Science and Technology (IJEST)

ISSN : 0975-5462 Vol. 4 No.10 October 2012 4505

References

[1] Agrawal, R.; Gehrke, J.; Gunopulos, D.; Raghavan, P. (1998): Automatic subspace clustering of high dimensional data for data mining applications. In: Proc. of the ACM SIGMOD International Conference on Management of Data, Seattle, Washington, June, pp. 94–105.

[2] Agrawal, C. C.; Han, J.; Wang, J. (2003): A framework for clustering evolving data streams. In: Proc. of the 29th International Conference on Very Large Data Bases, pp. 81–92.

[3] Agrawal, C. C.; Han, J.; Wang, J.; Yu, P. S. (2004): A framework for projected clustering of high dimensional data streams. In: Proc. of the 30th International Conference on Very Large Data Bases, pp. 852–863.

[4] Agrawal, R.; Gehrke, J.; Gunopulos, D.; Raghavan, P. (2005): Automatic subspace clustering of high dimensional data. Data Mining and Knowledge Discovery, 11(1), pp. 5–33.

[5] Beringer, J.; Hüllermeier, E. (2003): Online clustering of parallel data streams. Data & Knowledge Engineering.

[6] Cao, F.; Ester, M.; Qian, W.; Zhou, A. (2006): Density-based clustering over an evolving data stream with noise. In: Proc. of the SIAM Conference on Data Mining.

[7] Chen, Y.; Tu, L. (2007): Density-based clustering for real-time stream data. In: Proc. of KDD'07, August 12–15, San Jose, California, USA, pp. 133–142.

[8] Cheng, C. H.; Fu, A. W.; Zhang, Y. (1999): Entropy-based subspace clustering for mining numerical data. In: Proc. of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-99), San Diego, pp. 84–93.

[9] Guha, S.; Meyerson, A.; Mishra, N.; Motwani, R.; O'Callaghan, L. (2003): Clustering data streams: theory and practice. IEEE Transactions on Knowledge and Data Engineering, 15(3), pp. 515–528.

[10] Jae, W. L.; Park, N. H.; Lee, W. S. (2009): Efficiently tracing clusters over high-dimensional on-line data streams. Data & Knowledge Engineering.

[11] Javitz, H. S.; Valdes, A. (1994): The NIDES Statistical Component Description and Justification. Annual Report, A010.

[12] Mehta, D. P.; Sahni, S. (2004): Handbook of Data Structures and Applications. Chapman & Hall/CRC, chapter 15.

[13] Muthukrishnan, S. (2003): Data streams: algorithms and applications. In: Proc. of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms.

[14] Park, N. H.; Lee, W. S. (2007): Cell trees: an adaptive synopsis structure for clustering multi-dimensional on-line data streams. Data & Knowledge Engineering, 63(2), pp. 528–549.

[15] KDD Cup (1999): <http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html>.

[16] Thanawin, R.; Komkrit, U.; Kitsana, W. (2007): E-Stream: evolution-based technique for stream clustering. In: ADMA, Springer-Verlag Berlin Heidelberg, pp. 605–615.

[17] Zhao, Y.; Karypis, G. (2002): Criterion Functions for Document Clustering: Experiments and Analysis.


