Distributed Data Clustering in Multi-Dimensional Peer-To-Peer Networks

Stefano Lodi, Gianluca Moro, Claudio Sartori

Dept. of Electronics, Computer Science and Systems, University of Bologna
Via Venezia, 52 - I-47023 Cesena (FC), Italy; Viale Risorgimento, 2, Bologna, Italy

Email: {stefano.lodi, gianluca.moro, claudio.sartori}@unibo.it

Abstract

Several algorithms have been recently developed for distributed data clustering, which are applied when data cannot be concentrated on a single machine, for instance because of privacy reasons or due to network bandwidth limitations, or because of the huge amount of distributed data. Deployed and research Peer-to-Peer systems have proven to be able to manage very large databases made up of thousands of personal computers, resulting in a concrete solution for the forthcoming new distributed database systems to be used in large grid computing networks and in clustering database management systems. Current distributed data clustering algorithms cannot be applied to such kinds of networks because they expect data to be organized according to traditional distributed database management systems, where the distribution of the relational schema is planned a priori in the design phase. In this paper we describe methods to cluster distributed data across peer-to-peer networks without requiring any costly reorganization of data, which would be infeasible in such large and dynamic overlay networks, and without reducing their performance in message routing and query processing.

We compare the data clustering quality and efficiency of three multi-dimensional peer-to-peer systems according to two well-known clustering techniques.

Keywords: Data Mining, Peer-to-Peer, Data Clustering, Multi-dimensional Data

1 Introduction

Distributed and automated recording, analysis and mining of data generated by high-volume information sources is becoming common practice in medium-sized and large enterprises and organizations. Whereas distributed core database technology has been an active research area for decades, distributed data analysis and mining have been investigated only since the early nineties (Zaki & Ho 2000, Kargupta & Chan 2000), motivated by issues of scalability, bandwidth, privacy, and cooperation among competing data owners.

An important distributed data mining problem which has been investigated recently is the distributed data clustering problem.

Copyright © 2010, Australian Computer Society, Inc. This paper appeared at the Twenty-First Australasian Database Conference (ADC2010), Brisbane, Australia, January 2010. Conferences in Research and Practice in Information Technology (CRPIT), Vol. 104, Heng Tao Shen and Athman Bouguettaya, Eds. Reproduction for academic, not-for-profit purposes permitted provided this text is included.

The goal of data clustering is to extract new, potentially useful knowledge from a generally large data set by grouping together similar data items and by separating dissimilar ones according to some defined dissimilarity measure among the data items themselves. In a distributed environment, this goal must be achieved when data cannot be concentrated on a single machine, for instance because of privacy concerns or due to network bandwidth limitations, or because of the huge amount of distributed data. Several algorithms have been developed for distributed data clustering (Johnson & Kargupta 1999, Kargupta, Huang, Sivakumar & Johnson 2001, Klusch, Lodi & Moro 2003, da Silva, Klusch, Lodi & Moro 2006, Merugu & Ghosh 2003, Tasoulis & Vrahatis 2004). A common scheme underlying all approaches is to first locally extract suitable aggregates, then send the aggregates to a central site where they are processed and combined into a global approximate model. The kind of aggregates and the combination algorithm depend on the data types and distributed environment under consideration, e.g. homogeneous or heterogeneous data, numeric or categorical data.

Among the various distributed computing paradigms, peer-to-peer (P2P) computing is currently the topic of one of the largest bodies of both theoretical and applied research. In P2P computing networks, all nodes (peers) cooperate with each other to perform a critical function in a decentralized manner, and all nodes are both users and providers of resources (Milojicic, Kalogeraki, Lukose, Nagaraja, Pruyne, Richard, Rollins & Xu 2002, Moro, Ouksel & Sartori 2002). In data management applications, deployed peer-to-peer systems have proven to be able to manage very large databases made up of thousands of personal computers. Many proposals in the literature have significantly improved existing P2P systems in several aspects, such as searching performance, query expressivity, and multi-dimensional distributed indexing. The ensuing solutions can be effectively employed in the forthcoming new distributed database systems to be used in large grid computing networks and in clustering database management systems.

In light of the foregoing, it is natural to foresee an evolution of P2P networks towards supporting distributed data mining services, by which many peers spontaneously negotiate and cooperatively perform a distributed data mining task. In particular, the data clustering task matches well the features of P2P networks, since clustering models exploit local information, and consequently clustering algorithms can be effective in handling topological changes and data updates. Current distributed data clustering algorithms cannot be directly applied to data stored in P2P networks because they expect data to be organized according to traditional distributed database management systems, where the distribution of the relational schema is planned a priori in the design phase.


In this paper we describe methods to cluster data distributed across peer-to-peer networks by using the same peer-to-peer systems with some revisions, namely without requiring any costly reorganization of data, which would be infeasible in such large and dynamic overlay networks, and without reducing their performance in message routing and query processing. Moreover, we compare the data clustering quality and efficiency of three multi-dimensional peer-to-peer systems with a well-known traditional clustering algorithm. The comparisons have been done by conducting extensive experiments on the peer-to-peer systems together with the clustering algorithm we have fully implemented.

2 Related Work

Extensions of P2P networks to data analysis and mining services have been dealt with by relatively few research contributions to date. In (Wolff & Schuster 2004) the problem of association rule mining is extended to databases which are partitioned among a very large number of computers that are dispersed over a wide area (large-scale distributed, or LSD, systems), including databases in P2P and grid systems. The core of the approach is the LSD-Majority protocol, an anytime distributed algorithm expressly designed for large-scale, dynamic distributed systems, by which peers can decide if a given fraction of the peers has a data bit set or not. The Majority-Rule Algorithm for the discovery of association rules in P2P databases adopts a direct rule generation approach and incorporates LSD-Majority, generalized to frequency counts, in order to decide which association rules globally satisfy given support and confidence. The authors show that their approach exhibits good locality, fast convergence and low communication demands.

In (Klampanos & Jose 2004, Klampanos, Jose & van Rijsbergen 2006) the problem of P2P information retrieval is addressed by locally clustering documents residing at each peer and subsequently clustering the peers by a one-pass algorithm: each new peer is assigned to the closest existing cluster, or initiates a new peer cluster, depending on a distance threshold. Although the approach produces a clustering of the documents in the network, these works do not compare directly to ours, since their main goal is to show how simple forms of clustering can be exploited to reorganize the network to improve query answering effectiveness. The work (Agostini & Moro 2004) describes a method for inducing the emergence of communities of semantically related peers, which corresponds to a clustering of the P2P network by document contents. In this approach, as queries are resolved, the routing strategy of each peer, initially based on syntactic matching of keywords, becomes more and more trust-based, namely, based on the semantics of contents, leading to the resolution of queries with a reduced number of hops.

Recently, distributed data clustering approaches have also been developed for wireless sensor networks, such as in (Lodi, Monti, Moro & Sartori 2009), where the peculiarity, differently from large wired peer-to-peer systems, is the need to satisfy severe constraints on resources, such as energy consumption, short-range connectivity, and computational and memory limits.

As of writing, there is only one study on P2P data clustering not in relation to automatic, content-based reorganization of the network for efficiency purposes. In (Li, Lee, Lee & Sivasubramaniam 2006) the PENS algorithm is proposed to cluster data stored in P2P networks with a CAN overlay, employing a density-based criterion. Initially, each peer executes locally the DBSCAN algorithm. Then, for each peer, neighbouring CAN zones which contain clusters that can be merged to local clusters contained in the peer's zone are discovered, by performing a cluster expansion check. The check is performed bottom-up in the virtual tree implicitly defined by CAN's zone-splitting mechanism. Finally, arbiters appropriately selected in the tree merge the clusters. The authors show that the communication cost of their approach is linear in the number of peers. Like the methods we have considered in our analysis, the approach of this work assumes a density-based clustering model. However, clusters emerge by bounding the space embedding the data along contours of constant density, as in the DBSCAN algorithm, whereas the algorithms considered in the present paper utilize either a gradient-based criterion, similar to the one proposed in (Hinneburg & Keim 1998) to define center-based clusters, or a mean density criterion.

3 Multi-dimensional Peer-To-Peer Systems

In this section we review three different P2P networks which have been proposed in the literature: CAN, MURK-CAN, and MURK-SF. In Section 4, data clustering algorithms for each of these networks will be described and experimentally evaluated.

A CAN (Content-Addressable Network) overlay network (Ratnasamy, Francis, Handley, Karp & Schenker 2001) is a type of distributed hash table by which (key, value) pairs are mapped to a d-dimensional toroidal space by a deterministic hash function. The toroidal hash space is partitioned into "zones" which are assigned uniquely to nodes of the network. Every node keeps a routing table as a list of pointers to its immediate neighbours and of the boundaries of their zones. Using this information, query messages are routed from node to node by always choosing the neighbour which decreases the distance to the query point most, until the node which owns the zone containing the query point is reached. A peer joining the network randomly selects an existing zone and sends (using routing) a split request to the owning node, which splits it into two sub-zones along one dimension (the dimension is chosen as the next dimension in a fixed ordering) and transfers to the new peer both the ownership of the sub-zone and the (key, value) pairs hashed to the sub-zone. A peer leaving the network hands over its zone and the associated (key, value) pairs to one of its neighbours. In both cases, the routing tables of the nodes owning the zones which are adjacent to the affected zone are updated.
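As a rough illustration of the greedy routing rule just described, the following Python sketch forwards a query to the neighbour whose zone center is closest to the query point until the owning zone is reached. The class and function names are illustrative, not the authors' implementation, and the torus wraparound is ignored for simplicity.

```python
import math

class Zone:
    """A hyper-rectangular CAN zone with its routing table (list of neighbour zones)."""
    def __init__(self, lower, upper):
        self.lower, self.upper = lower, upper   # per-dimension bounds
        self.neighbours = []                    # adjacent zones (routing table entries)

    def contains(self, point):
        return all(l <= x < u for l, x, u in zip(self.lower, point, self.upper))

    def center(self):
        return [(l + u) / 2.0 for l, u in zip(self.lower, self.upper)]

def route(start_zone, query_point, max_hops=64):
    """Greedy CAN-style routing: always move to the neighbour closest to the query point."""
    zone, hops = start_zone, 0
    while not zone.contains(query_point) and hops < max_hops:
        zone = min(zone.neighbours,
                   key=lambda z: math.dist(z.center(), query_point))
        hops += 1
    return zone, hops
```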

A MURK (MUlti-dimensional Rectangulation with Kd-trees) network (Ganesan, Yang & Garcia-Molina 2004) manages a nested, rectangular partition in a similar way, but in contrast to CAN, the partition is defined in the data space directly, which is assumed to be a multi-dimensional vector space. Moreover, when a node arrives, the zone is split into two sub-zones containing the same number of objects; that is, MURK balances load whereas CAN balances volume. Two different variants of MURK are introduced in (Ganesan et al. 2004), MURK-CAN and MURK-SF, which differ in the way nodes are linked by the routing tables. In MURK-CAN, neighbouring nodes are linked exactly as in CAN, whereas in MURK-SF, links are determined by a skip structure. A space-filling curve (the Hilbert curve) is used to map the partition centroids of the zones to one-dimensional space. The images of all centroids induce a linear ordering of the nodes which is used to build the skip graph.
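The contrast between volume balancing (CAN) and load balancing (MURK) can be pictured with a small, purely illustrative helper that splits a zone either at the midpoint of a coordinate range or at the median of the objects stored in it; the function name and modes are assumptions for the sketch only.

```python
def split_zone(points, dim, mode="murk"):
    """Split a list of d-dimensional points along dimension `dim`.

    mode="can"  : cut at the midpoint of the coordinate range (balances volume),
    mode="murk" : cut at the median coordinate (balances the number of objects).
    """
    coords = sorted(p[dim] for p in points)
    if mode == "can":
        cut = (coords[0] + coords[-1]) / 2.0
    else:  # MURK-style load balancing
        cut = coords[len(coords) // 2]
    left = [p for p in points if p[dim] < cut]
    right = [p for p in points if p[dim] >= cut]
    return cut, left, right
```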


4 Density-Based Clustering

Data clustering is a descriptive data mining task which aims at decomposing or partitioning a usually multivariate data set into groups such that the data objects in one group are similar to each other and are as different as possible from those in other groups. Therefore, a clustering algorithm A() is a mapping from any data set S of objects to a clustering of S, that is, a collection of pairwise disjoint subsets of S. Clustering techniques inherently hinge on the notion of distance between the data objects to be grouped, and all we need to know is the set of interobject distances but not the values of any of the data object variables. Several techniques for data clustering are available but must be matched by the developer to the objectives of the considered clustering task (Grabmeier & Rudolph 2002).

In partition-based clustering, for example, the task is to partition a given data set into multiple disjoint sets of data objects such that the objects within each set are as homogeneous as possible. Homogeneity here is captured by an appropriate cluster scoring function. Another option is based on the intuition that homogeneity is expected to be high in densely populated regions of the given data set. Consequently, searching for clusters may be reduced to searching for dense regions of the data space which are more likely to be populated by data objects.

We assume a set $S = \{\vec{O}_i \mid i = 1, \ldots, N\} \subseteq \mathbb{R}^d$ of data points or objects. Kernel estimators formalize the following idea: the higher the number of neighbouring data objects $\vec{O}_i$ of some given $\vec{O} \in \mathbb{R}^d$, the higher the density at $\vec{O}$. The influence of $\vec{O}_i$ may be quantified by using a so-called kernel function. Precisely, a kernel function $K(\vec{x})$ is a real-valued, non-negative function on $\mathbb{R}^d$ having unit integral over $\mathbb{R}^d$. Kernel functions are often non-increasing with $\|\vec{x}\|$. When the kernel is given the vector difference between $\vec{O}$ and $\vec{O}_i$ as argument, the latter property ensures that any element $\vec{O}_i$ in $S$ exerts more influence on some $\vec{O} \in \mathbb{R}^d$ than elements which are farther from $\vec{O}$ than $\vec{O}_i$. Prominent examples of kernel functions are the standard multivariate normal density $(2\pi)^{-d/2}\exp(-\tfrac{1}{2}\vec{x}^T\vec{x})$, the uniform kernel $K_u(\vec{x})$, and the multivariate Epanechnikov kernel $K_e(\vec{x})$, defined by

$$K_u(\vec{x}) = \begin{cases} c_d^{-1} & \text{if } \vec{x}^T\vec{x} < 1,\\ 0 & \text{otherwise,} \end{cases} \qquad (1)$$

$$K_e(\vec{x}) = \begin{cases} \tfrac{1}{2} c_d^{-1}(d+2)(1-\vec{x}^T\vec{x}) & \text{if } \vec{x}^T\vec{x} < 1,\\ 0 & \text{otherwise,} \end{cases} \qquad (2)$$

where $c_d$ is the volume of the unit $d$-dimensional sphere. A kernel estimator (KE) $\hat{\varphi}[S](\vec{O}) : \mathbb{R}^d \to \mathbb{R}^+$ is defined as the sum over all data objects $\vec{O}_i$ of the differences between $\vec{O}$ and $\vec{O}_i$, scaled by a factor $h$, called window width, and weighted by the kernel function $K$:

$$\hat{\varphi}[S](\vec{O}) = \frac{1}{Nh^d}\sum_{i=1}^{N} K\!\left(\frac{1}{h}(\vec{O} - \vec{O}_i)\right). \qquad (3)$$
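A direct transcription of estimator (3) might look as follows; this is a centralized sketch that ignores the P2P distribution of the data, and the helper names `unit_ball_volume`, `epanechnikov` and `kde` are illustrative.

```python
import math
import numpy as np

def unit_ball_volume(d):
    """Volume c_d of the unit d-dimensional sphere."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

def epanechnikov(x):
    """Multivariate Epanechnikov kernel K_e of Equation (2); x is a length-d vector."""
    d = len(x)
    t = float(np.dot(x, x))
    return 0.5 * (d + 2) * (1.0 - t) / unit_ball_volume(d) if t < 1.0 else 0.0

def kde(query, data, h, kernel=epanechnikov):
    """Kernel density estimate of Equation (3) at `query`; data is an (N, d) array."""
    N, d = data.shape
    return sum(kernel((query - obj) / h) for obj in data) / (N * h ** d)
```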

The estimate is therefore a sum of exactly one "bump" placed at each data object, dilated by $h$. The parameter $h \in \mathbb{R}^+$ controls the smoothness of the estimate. Small values of $h$ result in merging fewer bumps and a larger number of local maxima. Thus, the estimate reflects more accurately slight local variations in the density. Increasing $h$ causes the distinctions between regions having different local density to progressively blur and the number of local maxima to decrease, until the estimate is unimodal.

An objective criterion to choose $h$ which has gained wide acceptance is to minimize the mean integrated square error (MISE), that is, the expected value of the integrated squared pointwise difference between the estimate and the true density $\varphi$ of the data. An approximate minimizer is given by

$$h_{\mathrm{opt}} = A(K)\, N^{-1/(d+4)}, \qquad (4)$$

where $A(K)$ depends also on the dimensionality $d$ of the data and the unknown true density $\varphi$. In particular, for the unit multivariate normal density

$$A(K) = \left(\frac{4}{2d+1}\right)^{1/(d+4)}. \qquad (5)$$

For a multivariate Gaussian density,

$$h = h_{\mathrm{opt}} \sqrt{d^{-1}\sum_{j=1}^{d} s_{jj}}, \qquad (6)$$

where $s_{jj}$ is the data variance on the $j$-th dimension (Silverman 1986).
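Equations (4)-(6) can be combined into a small routine; the sketch below assumes the normal-kernel constant of Equation (5) and the average per-dimension variance rescaling of Equation (6), and the function name is illustrative.

```python
import numpy as np

def window_width(data):
    """Rule-of-thumb window width h from Equations (4)-(6); data is an (N, d) array."""
    N, d = data.shape
    A = (4.0 / (2 * d + 1)) ** (1.0 / (d + 4))   # Equation (5), normal-kernel constant
    h_opt = A * N ** (-1.0 / (d + 4))            # Equation (4)
    s = data.var(axis=0, ddof=1)                 # per-dimension variances s_jj
    return h_opt * np.sqrt(s.mean())             # Equation (6)
```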

In some applications, including data clustering, it may be useful to locally adapt the degree of smoothing of the estimate. In clustering, for instance, a single dataset may contain both large, sparse clusters and smaller, dense clusters, possibly not well separated. The estimate given by (3) is not suitable in such cases. In fact, a fixed global value of the window width would either merge the smaller clusters or cause spurious details to emerge in the larger ones.

Adaptive density estimates have been proposed both as generalizations of kernel estimates and of nearest neighbour estimates. In the following we will recall the latter family of estimators. The nearest neighbour estimator in $d$ dimensions is defined as

$$\hat{\psi}[S](\vec{O}) = \frac{k/N}{c_d\, r_k(\vec{O})^d}, \qquad (7)$$

where $r_k(\vec{O})$ is the distance from $\vec{O}$ to its $k$-th nearest data object; the estimate is obtained by equating $k$, the number of data objects in the smallest sphere including the $k$-th neighbour of $\vec{O}$, to the expected number of such objects, $N\hat{\psi}[S](\vec{O})\, c_d\, r_k(\vec{O})^d$. Equation (7) can be viewed as a special case, for $K = K_u$, of a kernel estimator having $r_k(\vec{O})$ as window width:

$$\hat{\varphi}[S](\vec{O}) = \frac{1}{N r_k(\vec{O})^d}\sum_{i=1}^{N} K\!\left(\frac{\vec{O} - \vec{O}_i}{r_k(\vec{O})}\right). \qquad (8)$$

The latter estimate is called a generalized nearest neighbour estimate (GNNE).
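Following Equation (8), the generalized nearest neighbour estimate replaces the fixed window width with the distance to the $k$-th neighbour. A minimal sketch, reusing the illustrative `epanechnikov` kernel assumed above:

```python
import numpy as np

def gnne(query, data, k, kernel):
    """Generalized nearest neighbour estimate, Equation (8): the window width at
    `query` is r_k, the distance from `query` to its k-th nearest data object."""
    N, d = data.shape
    dists = np.sort(np.linalg.norm(data - query, axis=1))
    r_k = dists[k - 1]                      # k-th nearest neighbour distance
    return sum(kernel((query - obj) / r_k) for obj in data) / (N * r_k ** d)
```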

A simple property of kernel density estimates that is of interest for P2P computing is locality. In order to obtain a meaningful estimate, the window width $h$ is usually much smaller than the data range on every coordinate. Moreover, the value of commonly used kernel functions is negligible for distances larger than a few $h$ units; it may even be zero if the kernel has bounded support, as is the case for the Epanechnikov kernel. Therefore, in practice the number of distances that are needed for calculating the kernel density estimate at a given object $\vec{O}$ may be much smaller than the number of data objects $N$, and the involved objects span a small portion of the data space.

Once the kernel density estimate of a data set has been computed, there is a straightforward strategy to cluster its objects: detect disjoint regions of the data space where the value of the estimate is high and group all data objects of each region into one cluster. Data clustering is thus reduced to space partitioning, and the different ways "high" can be defined induce different clustering schemes.

In the approach of Koontz, Narendra and Fukunaga (Koontz, Narendra & Fukunaga 1976), as generalized in (Silverman 1986), each data object $\vec{O}_i$ is connected by a directed edge to the data object $\vec{O}_j$, within a distance threshold, that maximizes the average steepness of the density estimate between $\vec{O}_i$ and $\vec{O}_j$, and such that $\hat{\varphi}[S](\vec{O}_j) > \hat{\varphi}[S](\vec{O}_i)$. Clusters are defined by the connected components of the resulting graph. More recently, Hinneburg and Keim (Hinneburg & Keim 1998) have proposed two types of cluster. Center-defined clusters are based on the idea that every local maximum of $\hat{\varphi}$ having a sufficiently large density corresponds to a cluster including all data objects which can be connected to the maximum by a continuous, uphill path in the graph of $\hat{\varphi}$. An arbitrary-shape cluster (Hinneburg & Keim 1998) is the union of center-defined clusters whose maxima are connected by a continuous path whose density exceeds a threshold. A density-based cluster (Ester, Kriegel, Sander & Xu 1996) collects all data objects included in a region where the value of a kernel estimate with uniform kernel exceeds a threshold.
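A minimal sketch of the Koontz-Narendra-Fukunaga scheme, as generalized in (Silverman 1986), might look as follows: each object points uphill to the neighbour, within a distance threshold, that maximizes the average steepness of the estimate, and clusters are the connected components of the resulting graph. The function name and the `density` callback are assumptions for illustration.

```python
import numpy as np

def steepest_neighbour_clustering(data, density, threshold):
    """data: (N, d) array; density: callable returning the estimate at a point;
    threshold: maximum distance within which an uphill edge may be created."""
    N = data.shape[0]
    dens = np.array([density(o) for o in data])
    parent = list(range(N))                      # union-find for connected components

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]        # path halving
            i = parent[i]
        return i

    for i in range(N):
        best_j, best_slope = None, 0.0
        for j in range(N):
            dist = np.linalg.norm(data[j] - data[i])
            if j != i and 0.0 < dist <= threshold and dens[j] > dens[i]:
                slope = (dens[j] - dens[i]) / dist   # average steepness towards j
                if slope > best_slope:
                    best_j, best_slope = j, slope
        if best_j is not None:
            parent[find(i)] = find(best_j)           # connect i uphill to best_j

    return [find(i) for i in range(N)]               # component label per object
```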

5 Density-Based Clustering in P2P Systems

When applying kernel-based clustering to P2P overlay networks, some observations are in order.

• It is mandatory to impose a bound on the distance $H$, in hops, of the zones containing the objects that contribute to the estimate in a given zone. A full calculation of summation (3) would require answering an unacceptable number of point queries. Note that, depending on the overlay network, the lower bound on the distance from the center of a zone to an object in a zone beyond $H$ hops may be not greater than the radius of the zone itself. Thus, although the contribution to the estimate at $\vec{O}$ of objects located more than, say, $4h$ away is negligible, if $4h$ is greater than the zone radius, some terms of the estimate may be missed. There will be a trade-off between network messaging costs and clustering accuracy, and clustering results must be experimentally compared with the ideal clustering obtained when $H$ is large enough to reach all objects.

• Different peers may prefer different parameters for clustering the network's data, e.g., different values of $h$, kernel functions, maximum number of hops, or whether to use an adaptive estimate. Therefore, a peer interested in clustering the data acts as a clustering initiator, i.e., it must take care of all the preliminary steps needed to make its choices available to the network, and to gather information useful to make those choices, e.g., descriptive statistics.

In this paper, we investigate two approaches to P2P density-based clustering. In both approaches, the computation of clusters is based on the generalized approach in (Silverman 1986) as described in Section 4. The first one, M1, uses kernel or generalized nearest neighbour estimates, and it can be summarized as follows.

I. If the estimate (3) is used, then the initiator collects from every zone its object count, object sum, and square sum to globally choose a window width $h$ according to Equations (4)–(6).

II. At every node: For every local data point $\vec{O}$, compute the density estimate value $\hat{\varphi}[S](\vec{O})$, in the form (3) or (8), from the local data set and the remote data points which are reachable by routing a query message for at most $H$ hops, where $H$ is an integer parameter.

III. At every node: Query the location and value of all local maxima of the estimate located within other zones.

IV. At every node: Associate each local data point to the maximum which maximizes the ratio between the value of the maximum and its distance from the point.
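A schematic rendering of steps II–IV at a single node is sketched below, under the simplifying assumption that the objects reachable within $H$ hops have already been materialized as a list of remote points and that the zone maxima of step III have already been gathered; the routing itself is not modeled, and all names are illustrative.

```python
import numpy as np

def m1_at_one_node(local_points, remote_points, h, kernel, maxima):
    """M1 at one node. local_points, remote_points: lists of d-dimensional arrays
    (remote_points = data visible within H hops); maxima: list of (location, value)
    pairs of the estimate maxima collected from the other zones."""
    visible = np.array(local_points + remote_points)
    N, d = visible.shape
    densities, labels = [], []
    for p in np.array(local_points):
        # Step II: density estimate at p from all data visible within H hops, Eq. (3)
        densities.append(sum(kernel((p - o) / h) for o in visible) / (N * h ** d))
        # Step IV: assign p to the maximum with the best value / distance ratio
        best = max(range(len(maxima)),
                   key=lambda m: maxima[m][1] /
                                 (np.linalg.norm(p - maxima[m][0]) + 1e-12))
        labels.append(best)
    return densities, labels
```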

The second approach, M2, exploits the data space partitions implicitly generated by the data management subdivision among peers, as described in Section 3. In this approach, the data are not purposely reorganized or queried to compute a density estimate to perform a clustering. In this case, the density value at data objects in a zone can be set as the ratio between the number of objects in the zone and the volume of the zone.

A. At every node: For every local data object $\vec{O}$, compute the density estimate value $\hat{\varphi}[S](\vec{O})$ from the local data set only, as the mean zone density, that is, the object count in the node's zone divided by its volume.

B. At every node: Define the maximum of the node's zone as the mean density of the zone, and its location as the geometric center of the zone.

C. At every node: Query the maxima of all zones and their locations; associate each local data point to the maximum which maximizes, over all zones, the ratio between the value of the maximum and its distance from the point.

In this approach, no messages are sent over the network for computing densities, but only for computing the clusters. Therefore, it is expected to be much more efficient than the previous one, but less accurate, due to the approximation in computing the maxima of the estimates.
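Steps A–C require no density queries at all; a compact sketch, under the assumption that each node knows its own zone bounds and has already collected the (center, mean density) pairs of all zones, with illustrative function names:

```python
import numpy as np

def zone_mean_density(points, lower, upper):
    """Step A: mass/volume density of a zone = object count / zone volume."""
    volume = float(np.prod(np.array(upper) - np.array(lower)))
    return len(points) / volume

def zone_center(lower, upper):
    """Step B: the zone maximum is placed at the geometric center of the zone."""
    return (np.array(lower) + np.array(upper)) / 2.0

def assign_points(points, zone_maxima):
    """Step C: assign each local point to the zone maximum that maximizes
    (maximum value) / (distance from the point); zone_maxima = [(center, density), ...]."""
    labels = []
    for p in points:
        best = max(range(len(zone_maxima)),
                   key=lambda m: zone_maxima[m][1] /
                                 (np.linalg.norm(np.array(p) - zone_maxima[m][0]) + 1e-12))
        labels.append(best)
    return labels
```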

6 Data Clustering Efficacy and Efficiency

The main goal of the experiments described in the next section is to compare the accuracy of the clusters produced by the three P2P systems, namely their efficacy, as a function of the network costs, that is, their efficiency as clustering algorithms.

To determine the accuracy of clustering, we have compared the clusters generated by each P2P system, as a function of the number of hops, with the ideal clustering computed by the system when routing through a number of hops large enough to include the entire network; for our experiments we have chosen 1024. In the latter case, all zones are reachable from every other zone, thus simulating a density-based algorithm operating as if all distributed data were centralized on a single machine, as far as query results are concerned. Limiting the number of hops means the computed estimate is an approximation of the true estimate computed by routing queries to the entire network, which therefore yields a "reference" clustering.

Figure 1: Dataset S0

We have employed the Rand index (Rand 1971) as a measure of clustering accuracy. Let $S = \{\vec{O}_1, \ldots, \vec{O}_N\}$ be a dataset of $N$ objects and $X$ and $Y$ two data clusterings of $S$ to be compared. The Rand index can be determined by computing the quantities $a$, $b$, $c$, $d$ defined as follows:

• $a$ is the number of pairs of objects in $S$ that are in the same partition in $X$ and in the same partition in $Y$,

• $b$ is the number of pairs of objects in $S$ that are not in the same partition in $X$ and not in the same partition in $Y$,

• $c$ is the number of pairs of objects in $S$ that are in the same partition in $X$ and not in the same partition in $Y$,

• $d$ is the number of pairs of objects in $S$ that are not in the same partition in $X$ but are in the same partition in $Y$.

The sum $a + b$ can be regarded as the number of agreements between $X$ and $Y$, and $c + d$ as the number of disagreements between $X$ and $Y$. The Rand index $R \in [0, 1]$ expresses the number of agreements as a fraction of the total number of pairs of objects:

$$R = \frac{a + b}{a + b + c + d} = \frac{a + b}{\binom{N}{2}}.$$

In our case, one of the two data clusterings is always the one computed when $H = 1024$.
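Computed over pairs of objects, the index takes only a few lines; a sketch assuming the two clusterings are given as per-object label sequences (the function name is illustrative):

```python
from itertools import combinations

def rand_index(labels_x, labels_y):
    """Rand index of two clusterings given as per-object labels: the fraction of
    object pairs on which the two clusterings agree (a + b over N choose 2)."""
    agreements, total = 0, 0
    for i, j in combinations(range(len(labels_x)), 2):
        same_x = labels_x[i] == labels_x[j]
        same_y = labels_y[i] == labels_y[j]
        agreements += (same_x == same_y)   # counts both a (same/same) and b (diff/diff)
        total += 1
    return agreements / total
```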

We have implemented in Java a simulator of the three P2P systems described in Section 3, each coupled with the two density-based clustering methods described in Section 5.

We have conducted extensive experiments on a desktop workstation equipped with two Intel dual-core Xeon processors at 2.6 GHz and 2 GB of internal memory.

Two generated datasets of two-dimensional real vectors have been used in our experiments. The first dataset, S0, shown in Figure 1, has 24000 vectors generated from 5 normal densities. The second dataset, S1, is shown in Figure 2. It has 24000 vectors generated from 5 normal densities. Three groups of 200 vectors each have been generated very close in mean, with a deviation of 10. Two groups of 10700 vectors each have been generated with a deviation of 70.

The experiments have been performed on both S0 and S1 for both method M1, with KE and GNNE estimates, and M2. Each experiment compares the three P2P networks as the number of hops varies from 1 to 8. For each experiment we have analysed (i) how the Rand index improves as the number of hops $H$ increases (i.e. efficacy) and (ii) the efficiency, measured by counting the number of messages among peers generated by the computation of density and clustering.

Figure 2: Dataset S1

Figure 3: Clustering of S0 by M1 with GNNE estimate and 1024 hops

The number of peers has been set to 1000, with 100 objects each on average.

Figure 3 shows a clustering computed on S0 by M1 with GNNE estimate and 1024 hops.

7 Experimental Results

Figure 4 illustrates the clustering accuracy, computed by using method M1 with KE density on S0, as the number of hops (on the x axis) increases. All P2P systems attain a very good accuracy, over 0.95, with 8 hops. The best is MURK-CAN with 0.98.¹

At low hop counts, MURK-SF is significantly more accurate than the other systems. Similar results, in terms of absolute accuracy, have been obtained on the same dataset by M1 with GNNE density, as shown in Figure 5. In this case, MURK-SF is consistently the best system, although by a small margin. The accuracy of method M2 on S0, shown in Figure 6, is much poorer.

On dataset S1, the same set of experiments shows a less accurate behaviour of all P2P systems and clustering methods, particularly for low hop counts, as illustrated by Figures 7, 8 and 9. This is due to the higher complexity of dataset S1, which contains both sparse and dense clusters of different sizes.

The first set of experiments provides some evidence for a superior efficacy of MURK-SF over CAN and MURK-CAN.

¹ In the sequel we will use MURK- and Torus- as synonyms.


Figure 4: Accuracy of method M1 with KE density on S0 (Rand index vs. hops, 1–8, for Torus-CAN, Torus-SF and CAN)

Figure 5: Accuracy of method M1 with GNNE density on S0 (Rand index vs. hops, 1–8, for Torus-CAN, Torus-SF and CAN)

Figures 10, 11, 12 and 13 illustrate the network costs of M1 on both datasets.

The number of messages for M2 equals the number of messages for M1. However, the size of a single message is $1/(2b/3)$ times the size of a message routed in method M1, where $b$ is the bucket size. Therefore, assuming 100 objects per peer, network costs are on average lower by a factor of about 66. In view of this relation, the figures of network costs for M2 have been omitted.
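With the assumed bucket size of $b = 100$ objects per peer, the stated ratio works out as

$$\frac{\text{size of an M2 message}}{\text{size of an M1 message}} = \frac{1}{2b/3} = \frac{3}{2b} = \frac{3}{200} \approx \frac{1}{66.7},$$

which is where the factor of about 66 comes from.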

The better clustering quality of MURK-SF can be explained simply by its neighbour selection strategy, according to which each peer has more neighbours than in CAN and MURK-CAN, and these neighbours are better distributed in the data space. In fact, while the neighbours of a peer in CAN and MURK-CAN are those that manage directly adjacent space partitions, the neighbours of a peer in MURK-SF can manage non-contiguous space partitions, guaranteeing a better view of the data space.

However, as shown in Figures 10–13, the network costs of MURK-SF are almost always greater than those of the other two P2P systems, and for 4 hops the number of messages sent is more than 30% higher than the number of messages sent by MURK-CAN and CAN. At 8 hops the three systems are essentially equivalent from the viewpoint of network costs.

To be more precise, the number of messages depicted in the figures corresponds to the network traffic necessary to compute the density and then the clustering.

Figure 6: Accuracy of method M2 on S0 (Rand index vs. hops, 1–8, for Torus-CAN, Torus-SF and CAN)

Figure 7: Accuracy of method M1 with KE density on S1 (Rand index vs. hops, 1–8, for Torus-CAN, Torus-SF and CAN)

The weight in bytes of each message depends basically on which density computation is adopted: the traditional M1 requires transferring entire data space partitions among peers, whereas the density in M2 costs nothing, since it is computed locally at the peer. The weight of the clustering messages, independently of which density computation is selected, is negligible, because peers exchange a real number corresponding to their local maximum density.

8 Conclusions

In this paper we have described methods to cluster data in multi-dimensional P2P networks without requiring a specific reorganization of the network and without altering or compromising the basic services of P2P systems, which are the routing mechanism, the data space partition among peers and the search capabilities.

We have applied our approach, which is a density-based solution, to CAN, MURK-CAN and MURK-SF, developing a simulator of the three systems. Besides a traditional computation of the density, we have experimented with a novel technique in P2P systems, which consists of calculating the density locally at each peer as the ratio between the mass, i.e., the number of local data objects, and the volume of the local partition.

The experiments have reported a difference in clustering quality between the two density approaches that is much smaller than their difference in network costs; in fact, the network transmissions of the mass/volume technique are several orders of magnitude fewer than those of the traditional density-based approach.


Figure 8: Accuracy of method M1 with GNNE density on S1 (Rand index vs. hops, 1–8, for Torus-CAN, Torus-SF and CAN)

Figure 9: Accuracy of method M2 on S1 (Rand index vs. hops, 1–8, for Torus-CAN, Torus-SF and CAN)

Figure 10: Network costs of M1 with KE density on S0 (number of network messages vs. hops, 1–8, for Torus-CAN, Torus-SF and CAN)

Their best clusterings, on the other hand, show a quality difference of about 16 percentage points.

The methods described in this work can be extended in several directions, among which is the possibility of improving the clustering quality of the mass/volume-based technique by including, in the density calculated locally at each peer, an influence from its neighbour peers according to their local density.

Figure 11: Network costs of M1 with GNNE density on S0 (number of network messages vs. hops, 1–8, for Torus-CAN, Torus-SF and CAN)

Figure 12: Network costs of M1 with KE density on S1 (number of network messages vs. hops, 1–8, for Torus-CAN, Torus-SF and CAN)

Figure 13: Network costs of M1 with GNNE density on S1 (number of network messages vs. hops, 1–8, for Torus-CAN, Torus-SF and CAN)

Other developments of the approach regard the adoption of new multi-dimensional indexing structures designed for distributed systems, both for wired environments, such as in (Moro & Ouksel 2003), and for wireless sensor networks, as in (Monti & Moro 2008, Monti & Moro 2009).

References

Agostini, A. & Moro, G. (2004), Identification of communities of peers by trust and reputation, in C. Bussler & D. Fensel, eds, 'AIMSA', Vol. 3192 of Lecture Notes in Computer Science, Springer, pp. 85–95.


da Silva, J. C., Klusch, M., Lodi, S. & Moro, G. (2006), 'Privacy-preserving agent-based distributed data clustering', Web Intelligence and Agent Systems 4(2), 221–238.

Ester, M., Kriegel, H.-P., Sander, J. & Xu, X. (1996), A density-based algorithm for discovering clusters in large spatial databases with noise, in 'Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96)', Portland, OR, pp. 226–231.

Ganesan, P., Yang, B. & Garcia-Molina, H. (2004), One torus to rule them all: multi-dimensional queries in p2p systems, in 'Proceedings of the 7th International Workshop on the Web and Databases (WebDB 2004)', ACM Press, New York, NY, USA, pp. 19–24.

Hinneburg, A. & Keim, D. A. (1998), An efficient approach to clustering in large multimedia databases with noise, in 'Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD-98)', AAAI Press, New York City, New York, USA, pp. 58–65.

Johnson, E. & Kargupta, H. (1999), Collective, hierarchical clustering from distributed heterogeneous data, in M. Zaki & C. Ho, eds, 'Large-Scale Parallel KDD Systems', Vol. 1759 of Lecture Notes in Computer Science, Springer, pp. 221–244.

Kargupta, H. & Chan, P., eds (2000), Distributed and Parallel Data Mining, AAAI Press / MIT Press, Menlo Park, CA / Cambridge, MA.

Kargupta, H., Huang, W., Sivakumar, K. & Johnson, E. L. (2001), 'Distributed clustering using collective principal component analysis', Knowledge and Information Systems 3(4), 422–448. URL: http://citeseer.nj.nec.com/article/kargupta01distributed.html

Klampanos, I. A. & Jose, J. M. (2004), An architecture for information retrieval over semi-collaborating peer-to-peer networks, in 'Proceedings of the 2004 ACM Symposium on Applied Computing', ACM Press, New York, NY, USA, pp. 1078–1083.

Klampanos, I. A., Jose, J. M. & van Rijsbergen, C. J. K. (2006), Single-pass clustering for peer-to-peer information retrieval: The effect of document ordering, in 'INFOSCALE '06: Proceedings of the First International Conference on Scalable Information Systems', ACM, Hong Kong.

Klusch, M., Lodi, S. & Moro, G. (2003), Distributed clustering based on sampling local density estimates, in 'Proceedings of the 19th International Joint Conference on Artificial Intelligence, IJCAI-03', AAAI Press, Acapulco, Mexico, pp. 485–490.

Koontz, W. L. G., Narendra, P. M. & Fukunaga, K. (1976), 'A graph-theoretic approach to nonparametric cluster analysis', IEEE Transactions on Computers C-25(9), 936–944.

Li, M., Lee, G., Lee, W.-C. & Sivasubramaniam, A. (2006), PENS: An algorithm for density-based clustering in peer-to-peer systems, in 'INFOSCALE '06: Proceedings of the First International Conference on Scalable Information Systems', ACM, Hong Kong.

Lodi, S., Monti, G., Moro, G. & Sartori, C. (2009), Peer-to-peer data clustering in self-organizing sensor networks, in 'Intelligent Techniques for Warehousing and Mining Sensor Network Data', IGI Global, Information Science Reference, December 2009, Hershey, PA, USA.

Merugu, S. & Ghosh, J. (2003), Privacy-preserving distributed clustering using generative models, in 'Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM 2003), 19–22 December 2003, Melbourne, Florida, USA', IEEE Computer Society.

Milojicic, D. S., Kalogeraki, V., Lukose, R., Nagaraja, K., Pruyne, J., Richard, B., Rollins, S. & Xu, Z. (2002), Peer-to-peer computing, Technical Report HPL-2002-57, HP Labs.

Monti, G. & Moro, G. (2008), Multidimensional range query and load balancing in wireless ad hoc and sensor networks, in K. Wehrle, W. Kellerer, S. K. Singhal & R. Steinmetz, eds, 'Peer-to-Peer Computing', IEEE Computer Society, Los Alamitos, CA, USA, pp. 205–214.

Monti, G. & Moro, G. (2009), Self-organization and local learning methods for improving the applicability and efficiency of data-centric sensor networks, in 'QShine/AAA-IDEA 2009, LNICST 22', Institute for Computer Science, Social-Informatics and Telecommunications Engineering, pp. 627–643.

Moro, G. & Ouksel, A. M. (2003), G-Grid: A class of scalable and self-organizing data structures for multi-dimensional querying and content routing in p2p networks, in 'Proceedings of Agents and Peer-to-Peer Computing, Melbourne, Australia', Vol. 2872, pp. 123–137.

Moro, G., Ouksel, A. M. & Sartori, C. (2002), Agents and peer-to-peer computing: A promising combination of paradigms, in 'AP2PC', pp. 1–14.

Rand, W. M. (1971), 'Objective criteria for the evaluation of clustering methods', Journal of the American Statistical Association 66(336), 846–850.

Ratnasamy, S., Francis, P., Handley, M., Karp, R. & Schenker, S. (2001), A scalable content-addressable network, in 'Proceedings of the 2001 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications', San Diego, California, United States, pp. 161–172.

Silverman, B. W. (1986), Density Estimation for Statistics and Data Analysis, Chapman and Hall, London.

Tasoulis, D. K. & Vrahatis, M. N. (2004), Unsupervised distributed clustering, in 'IASTED International Conference on Parallel and Distributed Computing and Networks', Innsbruck, Austria, pp. 347–351.

Wolff, R. & Schuster, A. (2004), 'Association rule mining in peer-to-peer systems', IEEE Transactions on Systems, Man, and Cybernetics—Part B: Cybernetics 34(6), 2426–2438.

Zaki, M. J. & Ho, C.-T., eds (2000), Large-Scale Parallel Data Mining, Vol. 1759 of Lecture Notes in Computer Science, Springer.


