
Research Article
Localized Ambient Solidity Separation Algorithm Based Computer User Segmentation

Xiao Sun,1,2 Tongda Zhang,3 Yueting Chai,1 and Yi Liu1

1 National Engineering Laboratory for E-Commerce Technology, Tsinghua University, Beijing 100084, China
2 DNSLAB, China Internet Network Information Center, Beijing 100190, China
3 Electrical Engineering Department, Stanford University, Stanford, CA 94305, USA

Correspondence should be addressed to Xiao Sun; sunx11@mails.tsinghua.edu.cn

Received 10 March 2015; Accepted 28 May 2015

Academic Editor: J. Alfredo Hernandez

Copyright © 2015 Xiao Sun et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Computational Intelligence and Neuroscience, Volume 2015, Article ID 829201, 16 pages. http://dx.doi.org/10.1155/2015/829201

Most popular clustering methods typically make some strong assumptions about the dataset. For example, k-means implicitly assumes that all clusters come from spherical Gaussian distributions which have different means but the same covariance. However, when dealing with datasets that have diverse distribution shapes or high dimensionality, these assumptions might not be valid anymore. In order to overcome this weakness, we propose a new clustering algorithm named the localized ambient solidity separation (LASS) algorithm, using a new isolation criterion called centroid distance. Compared with other density based isolation criteria, our proposed centroid distance isolation criterion addresses the problems caused by high dimensionality and varying density. An experiment on a designed two-dimensional benchmark dataset shows that our proposed LASS algorithm not only inherits the advantage of the original dissimilarity increments clustering method, separating naturally isolated clusters, but also can identify clusters which are adjacent, overlapping, and under background noise. Finally, we compared our LASS algorithm with the dissimilarity increments clustering method on a massive computer user dataset with over two million records that contains demographic and behaviour information. The results show that the LASS algorithm works extremely well on this computer user dataset and can gain more knowledge from it.

1. Introduction

Background and Related Work. The fast growing Internet technologies and multidisciplinary integration, in fields such as social networks, e-commerce, and bioinformatics, have accumulated huge amounts of data, which is far beyond human beings' processing ability in terms of both data scalability and structure complexity [1]. For example, as scientists study the working mechanism of the cell, they gather data about protein sequences or genomic sequences, which could be as large as tens or hundreds of terabytes and have a fairly intricate internal structure. Even the smartest person has no way to deal with such a dataset without any assistant tool. Data mining technologies [2] like semisupervised learning [3] and deep learning [4] have been developed to address this problem and play an important role in many fields, such as smart homes [5], decision support systems [6], biology [7], and marketing science [8]. In most of these areas, people constantly want to gain knowledge and learn structure from the data they have collected. Clustering [9], as one of the most important unsupervised learning methods in data mining, is designed to find hidden structure in unlabeled datasets, which can then be used for further processing, such as data summarization [10] and compression [11].

Despite the dozens of different clustering methods from a variety of fields, they can be roughly divided into two categories: partitional methods and hierarchical methods [12]. A partitional clustering method tries to generate a definite number of clusters directly; considering the computationally prohibitive cost of optimizing the criterion function globally, an iterative strategy is usually adopted. A hierarchical clustering method, on the other hand, generates a group of clustering results: different threshold parameters lead to different clustering results. Both kinds of methods have limitations which make them perform badly when applied without any change to datasets, like human behaviour datasets, which have various kinds of features and scales in high-dimensional space.

The first limitation is dimensionality. The datasets we are dealing with usually have a dimension higher than 3, which makes it almost impossible for people to have a clear intuition of the data distribution. Current clustering methods typically need a given parameter to decide the number of generated clusters. For example, in k-means [13], a predetermined parameter k, which represents the number of clusters to be generated, is required to run the algorithm; in single link and complete link [8], a threshold parameter plays a similar role. In such cases, the selection of the parameter is a highly subjective judgement and becomes harder as the dimension goes up. Also, high dimensionality makes the traditional Euclidean density notion meaningless, since the density tends to 0 as dimensionality increases; therefore, density-based clustering methods built on traditional similarity get into trouble. The second limitation is the diversity of data distribution shapes. The distribution of objects in a dataset is typically diverse and may involve isolated, adjacent, overlapping, and background-noise areas at the same time. However, current clustering methods usually make strong assumptions about the data distribution shape. For example, k-means implicitly assumes circular cluster shapes because of its Euclidean distance based optimization function, which makes it perform badly when handling nonglobular clusters. Density-based clustering methods can handle clusters of arbitrary shape, but they have difficulties in finding clusters whose densities vary a lot. Taking density-based spatial clustering of applications with noise (DBSCAN) as an example, its sensitivity to density variation is influenced by the indicated radius, which is fixed and selected in advance, so it has trouble when the densities of clusters vary widely. In a word, since a lot of current massive datasets typically have high dimensionality and diverse distribution shapes, traditional clustering methods like k-means, single link, complete link, or basic density-based clustering algorithms are no longer a good choice. In this paper, we address the problem of clustering datasets with high dimensionality and diverse distribution shapes and try to develop an applicable clustering algorithm.

For the validation of the clustering algorithm in practical applications, a segmentation of Chinese computer users is carried out in this paper. Segmentation is another name for clustering in some specific areas. For example, in computer vision, image segmentation [14] means partitioning a digital image into several segments to make it easier to understand or analyse further, while in marketing management, market segmentation [15] or customer segmentation [16] uses clustering techniques to segment a target market or customers into a small number of groups who share common needs and characteristics. The goal of market or customer segmentation is to address each customer effectively and maximize his value according to the corresponding segment. Related research has been conducted on the food market [17, 18], vegetable consumers [19], the financial market [20], the banking industry [21], flight tourists [22], rail-trail users [23], and so on. Although a lot of work has been done on traditional offline market segmentation, not enough attention has been given to computer user or online market segmentation.

Additionally, existing research on online market segmentation typically collects data through an online survey or questionnaire [16, 24, 25], which cannot ensure the accuracy and objectivity of responders' behaviour information, such as computer use time per week and browsing time per week. In our research, computer users' demographic information is self-administered, while their behaviour information is extracted from the log files of background software which records their human-computer interaction behaviours in real time, term by term. Therefore, the computer user behaviour information used in our research can minimize the error caused by subjective perception bias.

Dataset. The dataset used in this paper is provided by the China Internet Network Information Center (CNNIC) [26], which recruits a sample of more than 30 thousand computer users and records more than ten million items per day about their computer interaction behaviour. These volunteers are required to install background software on their daily used online computers, by which their interaction behaviours are collected. In addition to interaction behaviours, demographic information such as gender and age has also been collected when a volunteer creates his account. Thousands of personal attributes together with the behaviour information set up the validation foundation of our proposed algorithm.

More specifically, the data used in this paper are extracted from 1000 randomly selected volunteers' log files, with over two million records in 7 days, and their personal attribute information. To protect privacy, each volunteer's name is replaced by its hashed value so that the actual identification cannot be retrieved.

Outline of the Paper. The remainder of the paper is organized as follows. Section 2.1 shows the performance of a hierarchical dissimilarity increments clustering method on a designed two-dimensional benchmark dataset, and several drawbacks are pointed out. Sections 2.2 and 2.3 propose a new isolation criterion based on the nonhomogeneous density within a cluster. Section 2.4 demonstrates the performance of our LASS clustering algorithm on the same two-dimensional benchmark dataset. In Section 3, our LASS clustering algorithm is applied to the computer user dataset, which contains demographic and behaviour information. Section 3.1 describes the cleaning process of the raw data, from which 7 features are extracted to characterize computer users. Sections 3.2 and 3.3 describe the data normalization process and define a dissimilarity measurement. In Section 3.4, our LASS algorithm is performed on the normalized dataset; segmentation and validation results are given, followed by a comprehensive summarization and discussion of the segmentation results. Finally, we draw the conclusions of this paper and point out some potential directions in Section 4.

2. Dissimilarity Increments and Centroid Distance Criteria Based Clustering Method

Based on the dissimilarity increments between neighbouring objects within a cluster, a new isolation criterion called dissimilarity increments has been proposed, and a hierarchical agglomerative clustering algorithm was designed around it [27]. In this section, we first generate a two-dimensional benchmark dataset to test the effectiveness of the dissimilarity increments clustering method; strengths and weaknesses of this method are discussed in comparison with other classical clustering methods. After that, in order to make up for the pointed-out drawbacks, we analyse the characteristics of the density distribution within a cluster and propose a new isolation criterion called centroid distance, based on which a nonhomogeneous density detection algorithm is designed to generate further subclusters from an isolated parent cluster. Then an integration of the original dissimilarity increments clustering method and our proposed centroid distance isolation criterion is made, and a new clustering algorithm named localized ambient solidity separation (LASS) is developed. Finally, our LASS algorithm is applied on the two-dimensional benchmark dataset again, and its performance is demonstrated.

2.1. Dissimilarity Increments Based Clustering Method. Integrating the dissimilarity increments isolation criterion with a hierarchical clustering method, a novel hierarchical agglomerative clustering method has been proposed [27], which is called the dissimilarity increments clustering method in this paper. Compared with classical hierarchical clustering methods, such as single link or complete link, this method does not need a threshold to determine the number of clusters; instead, the number of generated clusters is decided automatically by the algorithm. On the other hand, compared with classical partitioning clustering methods, such as k-means, this method does not make any prior hypothesis about cluster shape and thus can handle clusters of arbitrary shape, as long as they are naturally isolated.

However, the dissimilarity increments clustering method also has some drawbacks. That is, due to the nature of hierarchical clustering methods, it is not sensitive to points in adjacent, overlapping, and background noise areas. In Figure 1, a two-dimensional benchmark dataset is designed to show this fact. This dataset contains six well-isolated groups, three of which have a nonhomogeneous internal structure. We use this dataset to test the performance of a clustering algorithm at identifying clusters when they are completely isolated and when they are somewhat touching. As we can see from the figure, the dissimilarity increments clustering method grouped the points into six clusters, which is consistent with first-glance intuition. However, the clustering result also shows that this method is not applicable in three cases, namely, the yellow cluster in the upper half of Figure 1 and the red and green clusters in the right half of Figure 1. The case of the yellow forks represents two adjacent clusters, the case of the red forks represents two overlapping clusters, and the case of the green forks represents a cluster under background noise.

Figure 1: Result generated by dissimilarity increments clustering method.

2.2. The Density Distribution within a Cluster. Considering the six identified clusters in Figure 1, we can find that the points' density distribution within a cluster can be quite different from one cluster to another. Specifically, the points' density of the three circle-shaped clusters in the bottom left part of Figure 1 is homogeneous, while that of the remaining three clusters is nonhomogeneous. Nonhomogeneous means that the points' density does not change continuously and smoothly but heavily, with a clear boundary between two touching clusters. So a mechanism can be designed to identify potential subclusters within a given cluster based on its nonhomogeneous, or heterogeneous, distribution of density.

The first question is how to define and measure density. Conventionally, the concept of points' density refers to the number of points in a unit area. But just as mentioned in Background and Related Work (see Section 1), the Euclidean notion of density runs into trouble with high-dimensional datasets and cannot identify clusters when their densities vary widely. The key idea to address these two problems is to associate density with each point and its surrounding context and, moreover, to associate the isolation criterion with the distribution of points' counts rather than with absolute values. In this paper, the density around point $x_i$ is defined as the reciprocal of the centroid distance of $x_i$'s $n$ nearest neighbours, as formula (1) shows. In this formula, Distance(·) is a function that outputs the distance between two points, the set $X$ is the collection of $x_i$'s $n$ nearest neighbour points, $x_m$ refers to the point which has the largest distance to $x_i$ in set $X$, and Centroid(·) is a function that calculates the centroid of a given point set. Intuitively, a point which lies in a high density area will have a small centroid distance and thus a large value of density around it:

$$\text{Density}(x_i) = \frac{1}{\text{Centroid\_Distance}(x_i)} = \frac{1}{(n-1)\times\text{Distance}\left(x_m, \text{Centroid}(X - \{x_m\})\right)}. \quad (1)$$

A more concrete example of centroid distance is the two-dimensional case shown in Figure 2, in which $p_0$ is the target point and $p_1 \sim p_4$ are $p_0$'s 4 nearest neighbour points in the given dataset. With the help of the defined function Distance(·), we find that, compared with the line segments $l_{p_0p_1}$, $l_{p_0p_2}$, and $l_{p_0p_3}$, the distance between $p_0$ and $p_4$, namely, $l_{p_0p_4}$, is the largest. So if $p_5$ is the centroid of triangle $p_1p_2p_3$, then $3l_{p_4p_5}$ is the centroid distance of $p_0$, and therefore the density around point $p_0$ is $1/(3l_{p_4p_5})$. Considering the correlation between centroid distance and density, we will use the value of centroid distance directly to describe density in the remainder of this paper.

Figure 2: Centroid distance of point $p_0$.
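To make formula (1) concrete, here is a minimal Python sketch of the centroid distance and density computation; the function names and the brute-force nearest-neighbour search are ours, not the paper's.

```python
import numpy as np

def centroid_distance(points, i, n):
    """Centroid distance of point i, per formula (1).

    points: (N, d) array; n: number of nearest neighbours used.
    """
    x_i = points[i]
    # Distances from x_i to all points (brute force for clarity).
    d = np.linalg.norm(points - x_i, axis=1)
    d[i] = np.inf                                    # exclude the point itself
    neighbours = np.argsort(d)[:n]                   # set X: n nearest points
    m = neighbours[np.argmax(d[neighbours])]         # x_m: farthest point in X
    rest = points[[j for j in neighbours if j != m]] # X - {x_m}
    centroid = rest.mean(axis=0)                     # Centroid(X - {x_m})
    return (n - 1) * np.linalg.norm(points[m] - centroid)

def density(points, i, n):
    return 1.0 / centroid_distance(points, i, n)
```

On the configuration of Figure 2, with n = 4, centroid_distance returns $3l_{p_4p_5}$, matching the worked example above.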

Based on the analysis above, the points' densities in the cyan circle-shaped cluster and the blue circle-shaped cluster in Figure 1 are analysed in Figures 3(a) and 3(b), and the points' densities in the red forks cluster and the green forks cluster are analysed in Figures 4(a) and 4(b). The horizontal axis in these figures represents the normalized centroid distance, while the vertical axis represents the number of points. Comparing Figure 4 with Figure 3, a pattern can be found. The density distributions of the cyan circle-shaped cluster and the blue circle-shaped cluster, which are homogeneous, have only one peak, as shown in Figure 3. In contrast, there are at least two apparent peaks on the density distribution curves of the red forks and green forks clusters, whose densities are nonhomogeneous, as shown in Figure 4. Therefore, an analogy can be drawn: the centroid distance distribution curve of a given cluster will have more than one peak if heterogeneity exists. Furthermore, based on this analogy, the centroid distance values corresponding to the valleys on a centroid distance distribution curve which has more than one peak can be used as a new isolation criterion.

2.3. Centroid Distance Isolation Criterion Based on Nonhomogeneous Density. In order to identify different density distributions within a cluster, we assume that its centroid distance distribution obeys a Gaussian Mixture Model (GMM) as long as heterogeneity exists. More specifically, if there are $n$ valleys on the density distribution curve, then, for point $x_i$, $p(\text{Centroid\_Distance}(x_i))$ obeys a GMM consisting of $n+1$ Gaussian distribution components, as shown in the following formulas, in which $\sum_{i=1}^{n+1}\pi_i = 1$:

$$N_i(x \mid \mu_i, \sigma_i) = \frac{1}{\sqrt{2\pi}\,\sigma_i}\exp\left[-\frac{1}{2\sigma_i}(x-\mu_i)^2\right], \quad (2)$$

$$p(\text{Centroid\_Distance}(x_i)) = \sum_{i=1}^{n+1}\pi_i\, N_i(\text{Centroid\_Distance}(x_i) \mid \mu_i, \sigma_i). \quad (3)$$

Based on the GMM assumption, we used the EM algorithm to derive two sets of parameters $(\pi_i, \mu_i, \sigma_i)$ for the red forks and green forks clusters in Figure 1. The results are shown in Figures 5(a) and 5(b), where the dashed-line curve represents the high density area and the dashed-dot curve represents the other area. Thus, the components of a GMM can be derived from a given cluster whose centroid distance distribution curve has at least one valley. Specifically, the $x$ values of the intersection points of different Gaussian distributions in a GMM can be used as the isolation criterion.
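The EM derivation of formulas (2) and (3) can be reproduced with any GMM implementation. The sketch below uses scikit-learn's GaussianMixture, which the paper does not name, so treat it as one possible realization; the isolation values are read off where the most probable component switches, that is, at the intersections of the weighted Gaussians.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_isolation_values(cd_values, n_components=2):
    """Fit a GMM to 1-D centroid distances and return candidate
    isolation values: the x positions where the dominant component changes."""
    X = np.asarray(cd_values, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=n_components, random_state=0).fit(X)
    # Scan a grid and find where the most responsible component switches;
    # these are the intersection points of the weighted Gaussians.
    grid = np.linspace(X.min(), X.max(), 1000).reshape(-1, 1)
    labels = gmm.predict(grid)
    switches = np.where(np.diff(labels) != 0)[0]
    return grid[switches].ravel()
```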

In terms of efficiency, the complexity of the EM algorithm depends on the number of iterations and on the complexity of the E and M steps, which is seriously affected by cluster size. In order to guarantee the efficiency of the isolation criterion's computation, we designed a simpler algorithm which reduces the computational complexity to $O(n)$, where $n$ is the number of points in a given cluster. The next paragraph describes the idea behind this simplification.

Through the observation of Figure 6, which demonstrates a comparison of a GMM and the corresponding centroid distance distribution curve, we find that the $x$ values of the lowest point of the valley on the centroid distance distribution curve and of the intersection point of the two Gaussian distributions are almost identical. So the task of identifying a GMM can be converted into identifying the valleys on a centroid distance distribution curve. Intuitively, if a valley is deep enough, the centroid distance corresponding to its lowest point will be a good partitioning value. The concept of the derivative is utilized to reflect this intuition. Figure 7 illustrates the derivatives of the centroid distance distribution curves in Figure 6. A derivative segment corresponding to a peak-valley-peak segment on a density distribution curve must satisfy two requirements. The first is that it has to cross the zero point of the vertical axis, which means that there is indeed a valley on the centroid distance distribution curve there. On the premise of meeting this requirement, the derivative segment still needs to be long enough, which means that the valley is deep enough to provide a good isolation value. The dashed-line segments in Figure 7 satisfy these two requirements, and the corresponding centroid distance values are 6 and 8, which are nearly identical with the $x$ values of the intersection points of the two Gaussian distributions in Figure 5.


Figure 3: Centroid distance histograms of two homogeneous clusters.

Figure 4: Centroid distance histograms of two heterogeneous clusters.

Based on the analysis above, a nonhomogeneous density detection algorithm is proposed to carry out potential partitions within a given cluster. This algorithm first uses a zero-crossing index to filter candidate partitioning values from all points and then measures the slopes on either side of each candidate point on the centroid distance distribution curve to evaluate the significance level of the isolation criterion. A schematic description is shown in Algorithm 1.

In our nonhomogeneous density detection algorithm, one parameter, $n$, which is the number of points used to calculate the centroid distance, still needs to be decided. In order to give a determination policy for $n$, let us consider the three concrete examples shown in Figure 8.


Input: N samples of a certain cluster; n (the number of samples used to calculate the centroid distance)
Output: partition values, if necessary
Steps:
(1) Set Partitioning_Points = ∅; threshold = tan 45° = 1; the i-th sample is s_i.
(2) Calculate the centroid distance for every sample:
    Centroid_Distance(s_i) = (n − 1) × Distance(s_j, Centroid(S_i^n − s_j)),
    where S_i^n is the collection of the n nearest samples to s_i, and s_j is the sample which has the largest distance to s_i in S_i^n.
    Get histogram data (x_i, y_i) of the centroid distance array, i = 1, 2, ..., ⌊N/10⌋.
(3) Set i = 2.
(4) If i == ⌊N/10⌋, then stop and return the points in Partitioning_Points; else continue.
(5) If y_i < y_{i−1} and y_i < y_{i+1}, then
        j = i; tan1 = 0; tan2 = 0
        While j > 1 and y_j < y_{j−1}:
            If tan1 < (y_{j−1} − y_j)/(x_j − x_{j−1}), then tan1 = (y_{j−1} − y_j)/(x_j − x_{j−1})
            j = j − 1
        j = i
        While j < ⌊N/10⌋ and y_j < y_{j+1}:
            If tan2 < (y_{j+1} − y_j)/(x_{j+1} − x_j), then tan2 = (y_{j+1} − y_j)/(x_{j+1} − x_j)
            j = j + 1
        If tan1 > threshold and tan2 > threshold, then
            Partitioning_Points = Partitioning_Points ∪ {s_i}
        Go to Step (6); else continue.
(6) i = i + 1; go to Step (4).

Algorithm 1
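A runnable Python rendering of Algorithm 1 is sketched below. The ⌊N/10⌋ histogram binning and the tan 45° = 1 slope threshold follow the pseudocode (which assumes normalized axes); the function and variable names are ours.

```python
import numpy as np

def nonhomogeneous_density_detection(cd_values, threshold=1.0):
    """Return centroid distance values at sufficiently deep valleys of the
    centroid distance histogram (Algorithm 1); threshold = tan(45 deg)."""
    cd = np.asarray(cd_values, dtype=float)
    bins = max(len(cd) // 10, 3)          # floor(N/10) bins, per the pseudocode
    y, edges = np.histogram(cd, bins=bins)
    x = (edges[:-1] + edges[1:]) / 2      # bin centres
    partitioning_values = []
    for i in range(1, bins - 1):
        if not (y[i] < y[i - 1] and y[i] < y[i + 1]):
            continue                      # not the bottom of a valley
        # Steepest slope on the descending (left) side of the valley.
        tan1, j = 0.0, i
        while j > 0 and y[j] < y[j - 1]:
            tan1 = max(tan1, (y[j - 1] - y[j]) / (x[j] - x[j - 1]))
            j -= 1
        # Steepest slope on the ascending (right) side of the valley.
        tan2, j = 0.0, i
        while j < bins - 1 and y[j] < y[j + 1]:
            tan2 = max(tan2, (y[j + 1] - y[j]) / (x[j + 1] - x[j]))
            j += 1
        if tan1 > threshold and tan2 > threshold:
            partitioning_values.append(x[i])  # valley deep enough: keep it
    return partitioning_values
```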

Table 1: First- and second-level nearest points in uniformly distributed space.

Dimensions | One | Two | Three
First-level nearest points | 2 | 4 | 6
Second-level nearest points | 2 | 4 | 12

Figures 8(a), 8(b), and 8(c) represent uniformly distributed points in one-, two-, and three-dimensional space, respectively. Uniformly distributed points means that, for a given point, there exist two nearest equidistant points in every dimension. In our examples, Euclidean distance is used, and the value of the nearest equal distance is $r$. Further investigation tells us that the change of distance from a given point is not continuous but discrete. In Figure 8, for the central yellow point, the first-level nearest points are marked in red and the second-level nearest points are marked in blue. The three subfigures are summarized in Table 1, based on which formula (4) is put forward to calculate the number of $k$-level nearest points in $d$-dimensional space ($k \le d$). More specifically, when $k$ equals 1, formula (4) reduces to the number of first-level nearest points, which is $2d$. We believe that the number of first-level nearest points is sufficient for centroid distance computation in a uniformly distributed dataset. In reality, however, data can hardly be uniformly distributed, so, in order to guarantee the ability of the centroid distance to reflect nonhomogeneous density, we multiply the number of first-level nearest points by 2. Formula (5) finally gives the policy to determine $n$ in the nonhomogeneous density detection algorithm according to the dimension $d$ of the dataset:

$$n = C_d^k\, 2^k, \quad (4)$$

$$n = 4d. \quad (5)$$
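Formulas (4) and (5) amount to two one-liners; a quick check reproduces Table 1, for example, k_level_points(3, 2) == 12.

```python
from math import comb

def k_level_points(d, k):
    return comb(d, k) * 2 ** k   # formula (4): number of k-level nearest points

def n_policy(d):
    return 4 * d                 # formula (5): n = 4d, twice the 2d first-level points
```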

2.4. The Integration of Dissimilarity Increments and Centroid Distance Criteria. Applying the nonhomogeneous density detection algorithm after the dissimilarity increments clustering method, in other words, taking dissimilarity increments and centroid distance as isolation criteria successively, a new clustering algorithm named the localized ambient solidity separation algorithm (LASS) is developed, and the clustering result is obtained. Just as demonstrated in Figure 9, besides the perfect partition of naturally isolated clusters, their internal structure has also been explored, and points are partitioned further where necessary.

Figure 5: GMMs derived by EM algorithm from two heterogeneous clusters.

Figure 6: Comparison of GMM and centroid distance distribution curve.

The yellow, red, and green clusters in Figure 1 are each divided further into two subclusters according to their nonhomogeneous density distributions. Therefore, our LASS algorithm can handle clusters of arbitrary shape which are isolated, adjacent, overlapping, and under background noise. Moreover, compared with the traditional notion of density, which is the number of points in a unit Euclidean volume, our proposed centroid distance isolation criterion works well in high-dimensional space.

Figure 7: Centroid distance derivatives of two heterogeneous clusters.

Figure 8: Uniformly distributed points in one-, two-, and three-dimensional space.

Figure 9: Result generated by our LASS algorithm.

In fact, it becomes even more sensitive as the dimension increases. Also, compared with direct similarity, the centroid distance isolation criterion takes into account the surrounding context of each point, by using its $n$ nearest points, and depends on the histogram distribution instead of exact absolute similarity values, so it can automatically scale according to the density of points. All in all, with the dissimilarity increments and centroid distance isolation criteria integrated together, our LASS algorithm achieves broader applicability, especially on datasets with high dimension and diverse distribution shapes.
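Putting the pieces together, the part II refinement stage of LASS can be sketched as follows, reusing centroid_distance and nonhomogeneous_density_detection from the sketches above. Here base_clusters stands for the output of the part I dissimilarity increments method [27], for which we assume no ready-made implementation; the splitting-by-bands step is our reading of how the isolation values are applied.

```python
import numpy as np

def lass_refine(points, base_clusters, d):
    """Part II of LASS: split each part-I cluster at the centroid distance
    valleys detected by the nonhomogeneous density criterion.

    points: (N, dim) array; base_clusters: list of index arrays; d: dimension.
    """
    n = 4 * d                                  # formula (5)
    refined = []
    for idx in base_clusters:                  # idx: np.array of point indices
        sub = points[idx]
        cd = np.array([centroid_distance(sub, i, n) for i in range(len(sub))])
        cuts = nonhomogeneous_density_detection(cd)
        if not cuts:
            refined.append(idx)                # homogeneous: keep as is
            continue
        # Assign points to bands between successive isolation values.
        bands = np.digitize(cd, sorted(cuts))
        for b in np.unique(bands):
            refined.append(idx[bands == b])
    return refined
```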

3. Computer User Segmentation

In this section, our proposed LASS algorithm is applied to a computer user dataset, which contains demographic and behaviour information. To accomplish this, we first cleaned the raw data and extracted 7 features to characterize computer users. Then the cleaned data was normalized, and a dissimilarity measurement was defined. On the basis of these, the original dissimilarity increments clustering algorithm and our LASS algorithm were applied to the dataset, respectively. The clustering processes are analysed and the effectiveness of the results is verified. At last, the segmentation result of computer users is analysed and summarized.

3.1. Data Cleaning and Features Selection. The raw data provided by CNNIC contains two kinds of information. They are 1000 computer users' personal attributes and their computer using log files. Specifically, personal attributes include a volunteer's gender, birthday, education level, job type, income level, province of residence, city of residence, and type of residence, while the computer using log files record these 1000 volunteers' computer interaction behaviours over 7 days, including start time, end time, websites browsing history, and programs opening history.

Although many features could be extracted from the raw data, we focus our attention on volunteers' natural attributes as persons and the statistical indicators of their fundamental behaviours, and we ignore environmental and geographic factors, such as job type, province of residence, city of residence, and residence type. The reason behind this is that we regard the Internet as a strength which has broken down geographic barriers; therefore, we assume that environmental and geographic factors are no longer crucial influence factors in the Internet world. From this point of view, we extracted 7 features to profile computer users. Taking the $i$th computer user $u_i$ as a concrete example, these extracted features are described in Table 2. The data of volunteers whose value of Times(·) is less than 4 were cleared out, and 775 sample data were left.

3.2. Data Normalization and Dissimilarity Measurement. Data normalization is needed before applying our LASS algorithm. The reason is that similarity measurements are usually sensitive to differences in mean and variability. In this paper, two kinds of normalization are used, as expressed in formulas (6) and (7), respectively. In formula (6), $m_j$ and $s_j$ are the mean and standard deviation of feature $j$; through this transformation, feature $j$ will have zero mean and unit variance. In formula (7), the function Rank(·) returns the rank of $x^*_{ij}$ in the data sequence of feature $j$; the transformed data will have a mean of $(n+1)/2$ and a variance of $(n+1)\left[(2n+1)/6 - (n+1)/4\right]$, where $n$ is the number of data points. Related study has shown that, for clustering performance, formula (7) outperforms formula (6); in particular, in hierarchical clustering methods, formula (7) is more robust to outliers and noise in the dataset [28]:

$$x_{ij} = \frac{x^*_{ij} - m_j}{s_j}, \quad (6)$$

$$x_{ij} = \text{Rank}(x^*_{ij}). \quad (7)$$

Table 2: Description of computer users' features.

Variables | Descriptions
Gender(u_i) | The gender of u_i; discrete variable: 1 stands for male, 0 stands for female
Age(u_i) | The age of u_i; discrete variable between 10 and 70
Edu(u_i) | The education level of u_i; discrete variable: 0, below primary school; 1, junior school; 2, senior school; 3, junior college; 4, bachelor degree; 5, others
Income(u_i) | The monthly income level of u_i; discrete variable: 0, no income; 1, below 500 Yuan; 2, 501–1000 Yuan; 3, 1001–1500 Yuan; 4, 1501–2000 Yuan; 5, 2001–3000 Yuan; 6, 3001–5000 Yuan; 7, 5001–8000 Yuan; 8, 8001–12000 Yuan; 9, others
Times(u_i) | Boot times of u_i's computer; discrete variable
Booting_Duration(u_i) | The duration of u_i using the computer; continuous variable
Brows_Duration(u_i) | The duration of u_i browsing websites; continuous variable

In this paper, for continuous variables' normalization, such as Booting_Duration(·) and Brows_Duration(·), formulas (7) and (6) are used successively, while for discrete variables' normalization, such as Gender(·), Age(·), and Edu(·), only formula (6) is used.

After normalization, a dissimilarity index is defined to measure the distance between different data points. As formula (8) shows, it is a sum of 1-norms, where $f_{in}$ stands for the value of the $i$th data point's $n$th feature:

$$\text{Dissimilarity}(u_i, u_j) = \sum_{n=1}^{7}\left|f_{in} - f_{jn}\right|. \quad (8)$$
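A compact sketch of formulas (6)-(8) follows; the tabular feature layout (one row per user, columns ordered as in Table 2) and the column-index argument are our assumptions.

```python
import numpy as np
from scipy.stats import rankdata

def normalize(features, continuous_cols):
    """Formulas (6) and (7): rank then z-score for continuous columns
    (formulas (7) and (6) applied successively), z-score only for discrete ones."""
    X = np.asarray(features, dtype=float).copy()
    for j in range(X.shape[1]):
        if j in continuous_cols:
            X[:, j] = rankdata(X[:, j])                           # formula (7)
        X[:, j] = (X[:, j] - X[:, j].mean()) / X[:, j].std()      # formula (6)
    return X

def dissimilarity(u_i, u_j):
    """Formula (8): 1-norm sum over the 7 normalized features."""
    return float(np.abs(u_i - u_j).sum())
```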

3.3. Computer Users Segmentation Process. Our proposed LASS algorithm is applied to the segmentation of computer users in this section. The whole segmentation process consists of two parts. Part I is the dissimilarity increments based clustering strategy (for details please refer to Section 3 in [27]), which aims to find naturally isolated clusters; part II is our proposed centroid distance based clustering strategy (for details please refer to Section 2.3 in this paper), whose goal is to explore the internal structure of every cluster generated by part I and identify potential subclusters that are adjacent, overlapping, or under background noise.

The clustering process is partly shown in Figure 10, where three representative clusters obtained by the part I strategy are chosen for demonstration.

Figure 10: Centroid distance histograms of three clusters.

Further exploration is carried out by the part II strategy of the LASS algorithm, and a partition valley is found in cluster 2, as shown in Figure 10(b). Next, the horizontal axis value of the lowest point of this valley can be acquired as a further isolation criterion, based on which cluster 2 is divided into two subclusters. Figure 11 shows a comparison of the GMM generated by the EM algorithm and the centroid distance distribution curve of cluster 2. Despite the differences between these two graphs' shapes, the two acquired isolation criteria are nearly the same, which validates our simplification of the GMM computation.

3.4. Segmentation Results Analysis and Discussion. The segmentation results generated by the original dissimilarity increments method and by our LASS algorithm are demonstrated in Tables 3 and 4. These two tables list the prototypes summarized from the obtained clusters. As shown, the sixth cluster in Table 3 is divided into two subclusters, the sixth and seventh clusters in Table 4. The reason for this further partition, as analysed in Section 3.3, is the existence of a deep enough valley on cluster 6's centroid distance distribution curve (as shown in Figure 10(b)), which implies the existence of two different density areas within cluster 6 of Table 3.

To understand this process, some investigation should be made into the relationship between Tables 3 and 4. In Table 3, cluster 6 is the largest group of all clusters, and its gender proportion is almost 50%. However, an intuitive sense of behaviour tells us that behaviour mode should be seriously affected by people's gender. This intuition is proved, to some extent, by the first 5 clusters in Table 3, whose gender proportion is 100% male.

Figure 11: Comparison of GMM and centroid distance distribution curve.


Table 3: Results generated by dissimilarity increments clustering method.

Segment | Size | Gender | Age | Education level | Income level | Computer using frequency (times/week) | Computer using time (hours/week) | Website browsing time (hours/week)
1 | 24 | Male 100%, Female 0% | 28 | Junior college | 2001–5000 Yuan | 6.7 | 67.5 | 0.6
2 | 35 | Male 100%, Female 0% | 24 | Bachelor degree | 0–500 Yuan | 6.5 | 44.8 | 4.4
3 | 58 | Male 100%, Female 0% | 41 | Junior college | 3001–5000 Yuan | 5.7 | 44.7 | 2.9
4 | 70 | Male 100%, Female 0% | 32 | Senior school | 2001–3000 Yuan | 6.0 | 33.0 | 3.1
5 | 185 | Male 100%, Female 0% | 32 | Bachelor degree | 2001–5000 Yuan | 5.9 | 39.0 | 6.7
6 | 352 | Male 42%, Female 58% | 32 | Junior college | 1501–3000 Yuan | 6.5 | 42.1 | 5.1

Table 4: Results generated by our LASS algorithm.

Segment | Size | Gender | Age | Education level | Income level | Computer using frequency (times/week) | Computer using time (hours/week) | Website browsing time (hours/week)
1 | 24 | Male 100%, Female 0% | 28 | Junior college | 2001–5000 Yuan | 6.7 | 67.5 | 0.6
2 | 35 | Male 100%, Female 0% | 24 | Bachelor degree | 0–500 Yuan | 6.5 | 44.8 | 4.4
3 | 58 | Male 100%, Female 0% | 41 | Junior college | 3001–5000 Yuan | 5.7 | 44.7 | 2.9
4 | 70 | Male 100%, Female 0% | 32 | Senior school to junior college | 2001–3000 Yuan | 6.0 | 33.0 | 3.1
5 | 185 | Male 100%, Female 0% | 32 | Bachelor degree | 2001–5000 Yuan | 5.9 | 39.0 | 6.7
6 | 136 | Male 0.7%, Female 99.3% | 30 | Junior college to bachelor degree | 1501–3000 Yuan | 5.9 | 37.8 | 3.1
7 | 216 | Male 68.1%, Female 31.9% | 33 | Junior college | 1001–2000 Yuan | 6.9 | 44.8 | 6.3

The reason why cluster 6 had not been divided further by the dissimilarity increments clustering method is that there may exist many touching areas in the high-dimensional space of cluster 6, a situation in which the dissimilarity increments clustering method no longer works. Our proposed centroid distance based nonhomogeneous density detection algorithm, however, found that there still exist two potential subgroups within cluster 6 of Table 3, which are identified as clusters 6 and 7 in Table 4. These two clusters differ in gender, age, and computer using behaviours. Cluster 6 is almost entirely composed of women, who spend less time on computers and website browsing, while in cluster 7 men are twice as many as women, who are older than the people in cluster 6 and spend much more time on computers, especially on browsing.

In order to quantify the overall effectiveness of our LASS algorithm, the between-group sum of dissimilarities (SDB) is calculated as in formula (9), which is the sum of the dissimilarities between each cluster centroid $c_i$ and the overall centroid $c$ of all the data. In this formula, $K$ is the number of clusters and $n_i$ is the number of points in cluster $i$:

$$\text{Total SDB} = \sum_{i=1}^{K} n_i\,\text{Dissimilarity}(c_i, c). \quad (9)$$

The higher the total SDB achieved, the more separated the identified clusters are, so it can be used to measure the effectiveness of a clustering method. The total SDB values of the original dissimilarity increments clustering method and our LASS algorithm on the given dataset are shown in Table 5. Obviously, our LASS algorithm achieves a larger total SDB, more specifically 30% larger, and thus fits the given computer user dataset better.

Table 5: Total SDB of two clustering methods.

Method | Dissimilarity increments clustering method | Our LASS algorithm
Total SDB | 853 | 1109

In terms of the evaluation of individual clusters, the silhouette coefficient is used here, whose value varies between −1 and 1; a positive value of the silhouette coefficient is desirable.

Table 6: The silhouette coefficients of clusters.

Clusters | Cluster 6 in Table 3 | Cluster 6 in Table 4 | Cluster 7 in Table 4
Silhouette coefficient | −0.34 | 0.02 | −0.41

As Table 6 shows, the silhouette coefficient of cluster 6 in Table 3 is negative, which implies that the internal cohesion and external separation of the cluster are not good, so cluster 6 in Table 3 cannot be seen as a typical cluster. Through our LASS algorithm, however, cluster 6 of Table 3 is identified as two individual clusters, one of whose silhouette coefficients is positive. As for cluster 7, whose silhouette coefficient is still negative, we guess that it belongs to some kind of background noise; this will be discussed later. As for cluster 6 in Table 4, we believe that it is a typical prototype of Chinese female computer users, which was not revealed in Table 3. Therefore, compared with the original dissimilarity increments clustering method, our LASS algorithm can gain more knowledge and understanding from the computer user dataset.
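Both validation measures are straightforward to reproduce. Below is a sketch of formula (9) together with per-cluster silhouette coefficients via scikit-learn; the Manhattan metric matches the 1-norm dissimilarity of formula (8), and the variable names are ours.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

def total_sdb(X, labels):
    """Formula (9): sum over clusters of n_i * Dissimilarity(c_i, c),
    with the 1-norm dissimilarity of formula (8)."""
    c = X.mean(axis=0)                          # overall centroid
    sdb = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        c_k = members.mean(axis=0)              # cluster centroid
        sdb += len(members) * np.abs(c_k - c).sum()
    return sdb

def cluster_silhouettes(X, labels):
    """Mean silhouette coefficient per cluster, as in Table 6."""
    s = silhouette_samples(X, labels, metric="manhattan")
    return {k: s[labels == k].mean() for k in np.unique(labels)}
```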


Further, the Kruskal-Wallis H test is applied to the clusters in Table 4 to test the difference between two or more clusters on a given dimension. As a nonparametric test, the Kruskal-Wallis H test is typically used to determine whether there are statistically significant differences between two or more groups of an independent variable. The results are shown in Tables 7 and 8. In the hypothesis tests of Table 7, the null hypothesis is that the distributions of a given variable in all 7 clusters are identical, and the alternative hypothesis is that they are not identical. In the hypothesis tests of Table 8, the null hypothesis is that the distributions of a given variable in a given pair of clusters are identical, and the alternative hypothesis is that they are not. The p values are listed and marked by a star if they are bigger than 0.05, which means accepting the null hypothesis and rejecting the alternative one. For the cases in which the p value is below 0.05, the smaller the p value is, the more statistically significant the variable's difference is. In Table 7, all of the p values are below 0.002, which means that, for any given variable, its distributions are extremely different among the seven clusters of Table 4. Therefore, we can draw the conclusion that these seven variables perform well in identifying different groups of computer users. In Table 8, the p value changes a lot according to the given pair of clusters and the variable. The significance of these seven variables for distinguishing different pairs of clusters will be discussed one by one, combined with Table 9, which reveals the detailed demographic and computer interaction behaviour characteristics of the obtained seven computer user clusters.
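The tests behind Tables 7 and 8 can be reproduced with scipy.stats.kruskal. In this sketch, clusters is a list of per-segment value arrays for one variable; the helper names are ours.

```python
from itertools import combinations
from scipy.stats import kruskal

def kw_all(clusters):
    """Table 7: one test across all segments for a given variable."""
    return kruskal(*clusters).pvalue

def kw_pairs(clusters):
    """Table 8: one test per pair of segments for a given variable."""
    return {(a + 1, b + 1): kruskal(clusters[a], clusters[b]).pvalue
            for a, b in combinations(range(len(clusters)), 2)}
```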

The segmentation results will now be analysed variable by variable, with the help of Tables 4, 8, and 9, and significant characteristics will be pointed out. For the variable of gender, Table 8 tells us that its distributions in the first five segments are identical, which is proved to be 100% male in Table 9. The most significant difference in gender lies among segments 1–5, segment 6, and segment 7, which represent male groups, a female group, and a mixed-gender group, respectively. For the variable of age, Table 8 reveals that its distribution among segments 4–7 can be seen as identical; the main difference happens between the first three segments. Combined with Tables 4 and 9, we can find that segment 2 consists of the youngest members, whose age is around 24, and segment 1 is a little bit older, with an average age around 28, while segment 3 is a middle-aged group with an average age of 41, much older than the other segments. As for the variable of education level, it discriminates between the different segments well. Its distributions in segments 2 and 5 can be seen as identical, with the highest education level, bachelor degree, while the people from segment 4 have the lowest education level; the other segments differ from one another. For the variable of income level, segment 1 earns the highest income, while segment 2 earns the lowest. The income levels of segments 3 and 5 can be seen as identical, as can those of segments 4 and 6, and the income of segments 3 and 5 is higher than that of segments 4 and 6. In terms of computer using frequency, the segments can be divided into two groups, segments 1, 2, and 7 and segments 3–6; the former group uses computers more frequently. As for the variable of computer using time, it discriminates segments 1 and 4 well, which spend the most and the least time on computers, respectively, while for the remaining 5 segments no significant difference exists in their computer using time. For the last variable, website browsing time, its distributions in segments 2, 3, 4, and 6 can be seen as identical; the difference mainly lies among segments 1, 5, and 7. Specifically, segment 1 spends the least time on website browsing, segment 5 spends the most, and the browsing time of segment 7 falls between those of segments 1 and 5.

Based on the analysis above, the 7 segments obtained by our LASS algorithm are summarized and discussed below, respectively.

Category 1 (little-browsing group). This group is entirely composed of young men who have received a high education level and earn a decent income. The most significant feature of the people in this group is that, although they spend the most time on computers compared with the other groups, they seldom visit webpages. We guess that, for this group of people, computer interaction behaviours mainly happen in the workplace or in public, where personal browsing is not encouraged.

Category 2 (little-income group). This group is composed of the youngest people, who are purely male and have the highest education level. The most significant feature of this group of people is that they all have the same income level, namely, no income. Additionally, they spend relatively more time on computers and browsing websites. We guess that the main body of this group is college students, who have lots of free time but no source of revenue.


Table 7: p values of features among all clusters.

Variables | Gender | Age | Education level | Income level | Computer using frequency | Computer using time | Website browsing time
p value | <0.002 | <0.002 | <0.002 | <0.002 | <0.002 | <0.002 | <0.002

Table 8: p values of features between pairs of clusters (values marked with a star are bigger than 0.05).

Variables | 1-2 | 1-3 | 1-4 | 1-5 | 1-6 | 1-7 | 2-3
Gender | >0.5* | >0.5* | >0.5* | >0.5* | <0.002 | <0.002 | >0.5*
Age | <0.005 | <0.002 | >0.05* | <0.05 | >0.2* | >0.05* | <0.002
Education level | <0.002 | >0.5* | <0.002 | <0.002 | >0.05* | >0.5* | <0.002
Income level | <0.002 | >0.2* | >0.1* | >0.5* | >0.05* | <0.002 | <0.002
Computer using frequency | >0.2* | <0.002 | <0.005 | <0.002 | <0.002 | >0.1* | <0.005
Computer using time | <0.002 | <0.002 | <0.002 | <0.002 | <0.002 | <0.002 | >0.5*
Website browsing time | <0.002 | <0.05 | <0.01 | <0.002 | <0.05 | <0.002 | >0.1*

Variables | 2-4 | 2-5 | 2-6 | 2-7 | 3-4 | 3-5 | 3-6
Gender | >0.5* | >0.5* | <0.002 | <0.002 | >0.5* | >0.5* | <0.002
Age | <0.002 | <0.002 | <0.002 | <0.002 | <0.002 | <0.002 | <0.002
Education level | <0.002 | >0.5* | <0.002 | <0.002 | <0.002 | <0.002 | <0.005
Income level | <0.002 | <0.002 | <0.002 | <0.002 | <0.005 | >0.1* | <0.002
Computer using frequency | <0.05 | <0.005 | <0.002 | >0.5* | >0.2* | >0.2* | >0.5*
Computer using time | <0.01 | >0.2* | >0.1* | >0.5* | <0.002 | >0.1* | <0.05
Website browsing time | >0.2* | >0.05* | >0.05* | >0.5* | >0.5* | <0.002 | >0.5*

Variables | 3-7 | 4-5 | 4-6 | 4-7 | 5-6 | 5-7 | 6-7
Gender | <0.002 | >0.5* | <0.002 | <0.002 | <0.002 | <0.002 | <0.002
Age | <0.002 | >0.5* | >0.1* | >0.5* | <0.02 | >0.2* | >0.1*
Education level | >0.2* | <0.002 | <0.002 | <0.002 | <0.002 | <0.002 | <0.002
Income level | <0.002 | <0.05 | >0.5* | <0.02 | <0.002 | <0.002 | <0.02
Computer using frequency | <0.002 | >0.5* | >0.2* | <0.005 | >0.5* | <0.002 | <0.002
Computer using time | >0.5* | <0.01 | >0.05* | <0.002 | >0.2* | >0.1* | <0.05
Website browsing time | <0.005 | <0.002 | >0.2* | <0.02 | <0.002 | <0.05 | <0.002

Category 3 (high-income group). This group consists entirely of middle-aged men. The most significant feature of the people in this group is the highest income they earn. Besides, they spend relatively little time on computer interaction, in terms of both using frequency and total browsing time. We guess that, for the middle-aged men in this group, most of whom have not received a higher education, computers and the Internet are not so necessary in daily life.

Category 4 (low-education group). This group is entirely composed of young men, older than those in Categories 1 and 2. The most significant feature of the people in this group is their low education level, the average of which is senior school, ranging from junior school to junior college. Moreover, they earn a medium level income and get smaller values on every computer interaction index. We guess that this group of people is mainly engaged in jobs independent of computers.

Category 5 (much-browsing group). The structure of this group is very similar to that of Category 4, except for the higher education its members have received, say, bachelor degree. As shown, people in this group earn more; we guess that the education difference may account for this. Also, compared with other categories, especially Category 4, this group of people spends much more time on browsing websites. We guess that the main job type of this group could be intellectual work, with close access to online computers.

Table 9: Demographic and behaviour description of computer user segmentations.

Demographic and computer interaction behaviours characteristics | Segment 1 | Segment 2 | Segment 3 | Segment 4 | Segment 5 | Segment 6 | Segment 7 | Total

Gender (%)
Male | 100 | 100 | 100 | 100 | 100 | 0.7 | 68.1 | 71.8
Female | 0 | 0 | 0 | 0 | 0 | 99.3 | 31.9 | 28.2

Age (%)
10–20 | 0 | 0 | 0 | 0 | 0 | 2.2 | 12.0 | 4.0
20–25 | 4.2 | 68.6 | 6.8 | 14.3 | 6.4 | 12.5 | 17.6 | 14.6
25–30 | 62.5 | 22.9 | 3.4 | 24.2 | 34.1 | 41.9 | 16.2 | 27.2
30–35 | 33.3 | 8.6 | 13.8 | 28.6 | 27.6 | 21.3 | 16.2 | 21.2
35–40 | 0 | 0 | 15.5 | 22.9 | 16.2 | 14.7 | 10.2 | 13.4
40–50 | 0 | 0 | 37.9 | 10.0 | 13.5 | 5.9 | 18.1 | 14.0
50–60 | 0 | 0 | 20.6 | 0 | 2.2 | 1.4 | 8.3 | 5.0
60–70 | 0 | 0 | 1.7 | 0 | 0 | 0 | 1.4 | 0.6

Education level (%)
Below primary school | 0 | 0 | 0 | 0 | 0 | 0 | 1.4 | 0.4
Junior school | 0 | 0 | 1.7 | 2.9 | 0 | 2.2 | 14.4 | 5.1
Senior school | 0 | 0 | 25.9 | 74.3 | 0.5 | 19.9 | 26.9 | 21.1
Junior college | 100 | 20.0 | 55.1 | 22.9 | 26.5 | 33.1 | 19.9 | 29.8
Bachelor degree | 0 | 71.4 | 17.2 | 0 | 73.0 | 41.2 | 28.7 | 39.8
Others | 0 | 8.6 | 0 | 0 | 0 | 3.7 | 8.8 | 3.7

Income level (%)
No income | 0 | 91.4 | 0 | 0 | 0 | 5.1 | 24.1 | 12.6
Below 500 Yuan | 0 | 5.7 | 0 | 0 | 0 | 0.7 | 3.2 | 1.4
501–1000 Yuan | 0 | 2.9 | 1.7 | 0 | 1.6 | 2.9 | 4.6 | 2.6
1001–1500 Yuan | 0 | 0 | 5.2 | 11.4 | 4.3 | 5.9 | 9.7 | 6.6
1501–2000 Yuan | 12.5 | 0 | 10.3 | 22.9 | 10.8 | 17.6 | 8.3 | 12.0
2001–3000 Yuan | 29.2 | 0 | 24.1 | 24.3 | 28.6 | 32.4 | 16.2 | 12.5
3001–5000 Yuan | 45.8 | 0 | 20.7 | 32.9 | 39.5 | 23.5 | 15.7 | 25.6
5001–8000 Yuan | 12.5 | 0 | 17.2 | 8.5 | 11.9 | 6.6 | 8.3 | 9.4
8001–12000 Yuan | 0 | 0 | 12.1 | 0 | 3.2 | 4.4 | 2.8 | 3.4
Others | 0 | 0 | 0 | 0 | 0 | 0.7 | 6.9 | 2.9

Computer using frequency (times/week)
Mean | 6.7 | 6.5 | 5.7 | 6.0 | 5.9 | 5.87 | 6.9 | 6.3
Variance | 0.04 | 0.20 | 0.33 | 0.18 | 0.25 | 0.40 | 1.4 | 0.68

Computer using time (hours/week)
Mean | 67.5 | 44.7 | 44.7 | 33.0 | 39.4 | 37.8 | 44.8 | 41.7
Variance | 0.23 | 0.88 | 1.31 | 0.31 | 0.70 | 0.83 | 1.31 | 0.99

Website browsing time (hours/week)
Mean | 0.64 | 4.43 | 2.88 | 3.12 | 6.7 | 3.1 | 6.3 | 4.95
Variance | 0.43 | 0.89 | 0.89 | 0.80 | 0.83 | 0.96 | 1.01 | 0.99


Category 7 (noise group). This category is the only gender-mixed group, in which men are twice as many as women. However, in terms of age, education level, and income level, this category shows no significant difference compared with the total population. And as for the variables of computer using frequency, computer using time, and website browsing time, their variances are fairly large, even bigger than the overall variances. So, due to the dispersed distribution of this category on every dimension, we believe that it is a noise group.



4. Conclusion

In this paper, we proposed a new clustering algorithm named the localized ambient solidity separation (LASS) algorithm. This algorithm is built on a new isolation criterion called centroid distance, which is used to detect the nonhomogeneous density distribution of a given cluster. The proposed isolation criterion is based on the recognition that if there exist nonhomogeneous densities within a cluster, then partitions should be carried out. The intuition behind this recognition is the GMM assumption of the points' centroid distance values in a cluster; the EM algorithm was used to derive the components and parameters of a GMM. Additionally, in order to make the algorithm more efficient, we designed a nonhomogeneous density detection algorithm to reduce the computation complexity to O(n), where n is the number of points for clustering. Moreover, the parameter determination policy of the nonhomogeneous density detection algorithm was investigated. Finally, we integrated our designed nonhomogeneous density detection algorithm as a follow-up mechanism with the original dissimilarity increments clustering method and developed the LASS algorithm. It is demonstrated that, compared with the original dissimilarity increments clustering method, our LASS algorithm not only can identify naturally isolated clusters but also can identify clusters which are adjacent, overlapping, and under background noise.
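To make the mechanism concrete, the following is a minimal sketch of the centroid distance computation of formula (1); it is an illustration under the stated definitions, not the exact implementation used in our experiments. The function name `centroid_distance` and the random test points are ours, and the neighbourhood size follows the n = 4d policy of formula (5).

```python
# Sketch of formula (1): for point i, take its n nearest neighbours X,
# let x_m be the neighbour farthest from point i, and compute
# (n - 1) * distance(x_m, centroid(X - {x_m})); density is the reciprocal.
import numpy as np

def centroid_distance(points: np.ndarray, i: int, n: int) -> float:
    dists = np.linalg.norm(points - points[i], axis=1)
    dists[i] = np.inf                                   # exclude the point itself
    neighbours = np.argsort(dists)[:n]                  # n nearest neighbours of point i
    x_m = neighbours[np.argmax(dists[neighbours])]      # farthest neighbour x_m
    rest = points[[k for k in neighbours if k != x_m]]  # X - {x_m}
    return (n - 1) * np.linalg.norm(points[x_m] - rest.mean(axis=0))

pts = np.random.rand(200, 2)   # toy 2-dimensional data
n = 4 * pts.shape[1]           # n = 4d, the parameter policy of formula (5)
cd = [centroid_distance(pts, i, n) for i in range(len(pts))]
# A histogram of cd is then scanned for deep valleys, whose positions
# serve as the isolation thresholds of the LASS partitioning step.
```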

Additionally, in order to evaluate the performance of the LASS algorithm in practice, we applied it on the computer user dataset, which contains 1000 computer users' demographic and behaviour information, and compared the result with that obtained from the original dissimilarity increments clustering method. The segmentation results show that one of the clusters generated by the dissimilarity increments clustering method is further divided into two subclusters by our LASS algorithm. The comparison of total SDB and silhouette coefficient validates the rationality of this further partition. The discussion and analysis of the segmentation results prove that our LASS algorithm can gain more knowledge and understanding from a dataset with high dimensionality and diverse distribution shapes, such as the computer user dataset.
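For reference, the two validation indices could be computed as in the sketch below; it assumes a normalized feature matrix `X` and cluster `labels`, uses the 1-norm dissimilarity of formula (8), and relies on scikit-learn only for the per-point silhouette values. The helper names are ours, not from the paper's code.

```python
# Sketch of the validation indices: total SDB of formula (9) with the
# 1-norm dissimilarity of formula (8), and per-cluster silhouette means.
import numpy as np
from sklearn.metrics import silhouette_samples

def total_sdb(X: np.ndarray, labels: np.ndarray) -> float:
    overall = X.mean(axis=0)                     # overall centroid c
    return sum(
        (labels == k).sum()                      # cluster size n_i
        * np.abs(X[labels == k].mean(axis=0) - overall).sum()
        for k in np.unique(labels)
    )

def cluster_silhouettes(X: np.ndarray, labels: np.ndarray) -> dict:
    # per-cluster silhouette = mean silhouette value of the cluster's members
    s = silhouette_samples(X, labels, metric="manhattan")
    return {k: s[labels == k].mean() for k in np.unique(labels)}
```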

There are some future directions to explore from this paper. First, the GMM assumption of centroid distance values can be further investigated and tested among more distributions, such as Gaussian and exponential. Second, our proposed centroid distance isolation criterion could be integrated with other traditional clustering methods, either partitional or hierarchical; more strengths and weaknesses could then be pointed out and analysed. Third, the centroid distance based clustering strategy in our LASS algorithm relies on the histogram distribution of centroid distance values; therefore, if the number of points in one cluster is too small, this clustering strategy may not work effectively any more. This drawback should be given enough attention and further investigated.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to acknowledge the dataset and technical support for their work provided by NELECT (National Engineering Laboratory for E-Commerce Technology, Tsinghua University) and DNSLAB of China Internet Network Information Center.

References

[1] E. Miller, "Community cleverness required," Nature, vol. 455, no. 7209, p. 1, 2008.
[2] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, 2005.
[3] O. Chapelle, B. Scholkopf, and A. Zien, Semi-Supervised Learning, MIT Press, Cambridge, Mass, USA, 2006.
[4] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[5] V. Jakkula and D. J. Cook, "Anomaly detection using temporal data mining in a smart home environment," Methods of Information in Medicine, vol. 47, no. 1, pp. 70–75, 2008.
[6] Q.-Y. Tang and C.-X. Zhang, "Data Processing System (DPS) software with experimental design, statistical analysis and data mining developed for use in entomological research," Insect Science, vol. 20, no. 2, pp. 254–260, 2013.
[7] M. A. Musen, B. Middleton, and R. A. Greenes, "Clinical decision-support systems," in Biomedical Informatics, pp. 643–674, Springer, London, UK, 2014.
[8] L. Hubert, "Approximate evaluation techniques for the single-link and complete-link hierarchical clustering procedures," Journal of the American Statistical Association, vol. 69, no. 347, pp. 698–704, 1974.
[9] E. Diday and J. C. Simon, "Clustering analysis," in Digital Pattern Recognition, vol. 10 of Communication and Cybernetics, pp. 47–94, Springer, Berlin, Germany, 1980.
[10] S. Nassar, J. Sander, and C. Cheng, "Incremental and effective data summarization for dynamic hierarchical clustering," in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '04), pp. 467–478, ACM, June 2004.
[11] K. Tasdemir, "Vector quantization based approximate spectral clustering of large datasets," Pattern Recognition, vol. 45, no. 8, pp. 3034–3044, 2012.
[12] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review," ACM Computing Surveys, vol. 31, no. 3, pp. 264–323, 1999.
[13] A. K. Jain, "Data clustering: 50 years beyond K-means," Pattern Recognition Letters, vol. 31, no. 8, pp. 651–666, 2010.
[14] N. R. Pal and S. K. Pal, "A review on image segmentation techniques," Pattern Recognition, vol. 26, no. 9, pp. 1277–1294, 1993.
[15] M. Wedel, Market Segmentation: Conceptual and Methodological Foundations, Springer, 2000.
[16] R.-S. Wu and P.-H. Chou, "Customer segmentation of multiple category data in e-commerce using a soft-clustering approach," Electronic Commerce Research and Applications, vol. 10, no. 3, pp. 331–341, 2011.
[17] M. C. Onwezen, M. J. Reinders, I. A. van der Lans et al., "A cross-national consumer segmentation based on food benefits: the link with consumption situations and food perceptions," Food Quality and Preference, vol. 24, no. 2, pp. 276–286, 2012.
[18] F. Westad, M. Hersleth, and P. Lea, "Strategies for consumer segmentation with applications on preference data," Food Quality and Preference, vol. 15, no. 7-8, pp. 681–687, 2004.
[19] J. Macharia, R. Collins, and T. Sun, "Value-based consumer segmentation: the key to sustainable agri-food chains," British Food Journal, vol. 115, no. 9, pp. 1313–1328, 2013.
[20] T.-C. Hsieh and C. Yang, "Multi-level latent class analysis of internet use pattern in Taiwan," in e-Technologies and Networks for Development, vol. 171 of Communications in Computer and Information Science, pp. 197–208, Springer, Berlin, Germany, 2011.
[21] Z. Bosnjak and O. Grljevic, "Credit users segmentation for improved customer relationship management in banking," in Proceedings of the 6th IEEE International Symposium on Applied Computational Intelligence and Informatics (SACI '11), pp. 379–384, May 2011.
[22] E. Martinez-Garcia and M. Royo-Vela, "Segmentation of low-cost flights users at secondary airports," Journal of Air Transport Management, vol. 16, no. 4, pp. 234–237, 2010.
[23] M. Bichis-Lupas and R. N. Moisey, "A benefit segmentation of rail-trail users: implications for marketing by local communities," Journal of Park and Recreation Administration, vol. 19, no. 3, pp. 78–92, 2001.
[24] A. Bhatnagar and S. Ghose, "A latent class segmentation analysis of e-shoppers," Journal of Business Research, vol. 57, no. 7, pp. 758–767, 2004.
[25] C. Lorenzo-Romero and M.-D. Alarcon-del-Amo, "Segmentation of users of social networking websites," Social Behavior and Personality, vol. 40, no. 3, pp. 401–414, 2012.
[26] http://www.cnnic.net.cn/
[27] A. L. N. Fred and J. M. N. Leitao, "A new cluster isolation criterion based on dissimilarity increments," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 8, pp. 944–958, 2003.
[28] M. C. P. De Souto, D. S. A. De Araujo, I. G. Costa, R. G. F. Soares, T. B. Ludermir, and A. Schliep, "Comparative study on normalization procedures for cluster analysis of gene expression datasets," in Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN '08), pp. 2792–2798, June 2008.

Page 2: Localized Ambient Solidity Separation Algorithm Based Computer … · 2016-05-15 · ResearchArticle Localized Ambient Solidity Separation Algorithm Based Computer User Segmentation

2 Computational Intelligence and Neuroscience

kinds of features and scales in high-dimensional space Thefirst limitation is the dimensionality The dataset we aredealing with is usually with a dimension higher than 3 whichmakes it almost impossible for people to have a clear intuitionof the data distribution Current clustering methods typicallyneed a given parameter to decide the number of generatedclusters For example in 119896-means [13] a predetermined para-meter 119896 which represents the number of clusters to be gener-ated is required to run the algorithm In single link and com-plete link [8] threshold parameter plays a similar role In suchcases the selection of parameter is highly subjective judge-ment and will become harder as the dimension goes up Alsohigh dimensionality makes traditional Euclidean densitynotion meaningless since the density tends to 0 as dimen-sionality increasesTherefore density-based clusteringmeth-ods with traditional similarity would get into trouble Thesecond limitation is the diversity of data distribution shapesThe distribution of objects in dataset is typically diversewhich may involve isolated adjacent overlapping and back-ground noise at the same time However current clusteringmethods usually make some strong assumptions on datadistribution shape For example 119896-means implicitly assumescircle shape of clusters because of its Euclidean distance basedoptimization function which makes it perform badly whenhandling nonglobular cluster cases Density-based clusteringmethod can handle clusters of arbitrary shape but it hasdifficulties in finding clusters if their densities vary a lotTaking density-based spatial clustering of applications withnoise (DBSCAN) as an example its sensitivity of densityvariation is influenced by the indicated radius which is fixedand selected in advance so it would have troubles if thedensities of clusters vary widely In a word since a lot ofcurrent massive datasets typically have high dimensionalityand diverse distribution shapes traditional clustering meth-ods like 119896-means single link complete link or basic density-based clustering algorithmare no longer a good choice In thispaper we address the problem of clustering the dataset withhigh dimensionality and diverse distribution shapes and tryto develop an applicable clustering algorithm

For the validation of clustering algorithm in practicalapplications a segmentation of Chinese computer users iscarried out in this paper Segmentation is another name ofclustering in some specific area For example in computerversion image segmentation [14] means to partition a digitalimage into several segments to make it easier for understand-ing or further analysis While in marketing managementmarket segmentation [15] or customer segmentation [16] usesclustering techniques to segment target market or customersinto a small number of groups who share common needs andcharacteristicsThe goal of market segmentation or customersegmentation is to address each customer effectively andmaximize his value according to the corresponding segmentRelated researches have been conducted about food market[17 18] vegetable consumers [19] financial market [20]banking industry [21] flight tourists [22] rail-trail users [23]and so on Although lots of works have been done about tra-ditional offline market segmentation not enough attentionis given to computer user or online market segmentation

Additionally existing researches about online market seg-mentation typically collect data through an online survey orquestionnaire [16 24 25] which cannot ensure the accu-racy and objectivity of respondersrsquo behavior informationsuch as computer use time per week and browsing time perweek In our research computer users demographic infor-mation is self-administered while their behaviour informa-tion is extracted from the log files of background softwarewhich real-timely records their human-computer interactionbehaviours term by term Therefore the computer userbehaviour information used in our research canminimize theerror caused by subjective perception bias

Dataset The dataset used in this paper is provided by ChinaInternet Network Information Center (CNNIC) [26] whichrecruits a sample of more than 30 thousand computer usersand records more than ten million items per day about theircomputer interaction behaviour These volunteers arerequired to install background software on their daily usedonline computers by which their interaction behaviours willbe collected In addition to interaction behaviours demo-graphic information such as gender and age has also beencollected when a volunteer creates his account Thousands ofpersonal attributesrsquo information together with their behav-iour information set up the validation foundation of ourproposed algorithm

More specifically the data used in this paper are extractedfrom 1000 randomly selected volunteersrsquo log files with overtwo million records in 7 days and their personal attributeinformation To protect privacy the volunteerrsquos name isreplaced by his hashed value so that actual identificationcannot be retrieved

Outline of the Paper The remainder of the paper is orga-nized as follows Section 21 shows the performance of ahierarchical dissimilarity increments clustering method ona designed two-dimensional benchmark dataset and severaldrawbacks are pointed out Sections 22 and 23 propose a newisolation criterion based on the nonhomogeneous densitywithin a cluster Section 24 demonstrates the performanceof our LASS clustering algorithm on the previous two-dimensional benchmark dataset In Section 3 our LASS clus-tering algorithm is applied on computer users dataset whichcontains their demographic and behaviour informationSection 31 describes the cleaning process of raw data and 7features are extracted to characterize computer users Sec-tions 32 and 33 describe the data normalization processand define a dissimilarity measurement in Section 34 ourLASS algorithm is performed on the normalized dataset seg-mentation and validation results are given in Section 34we give a comprehensive summarization and discussion ofthe segmentation results Finally we draw conclusions of thispaper and point out some potential directions in Section 4

2 Dissimilarity Increments and CentroidDistance Criteria Based Clustering Method

Based on the dissimilarity increments between neighbouringobjects within a cluster a new isolation criterion called

Computational Intelligence and Neuroscience 3

dissimilarity increments is proposed and a hierarchicalagglomerative clustering algorithm is designed [27] In thissection we first generate a two-dimensional benchmark data-set to test the effectiveness of the dissimilarity incrementsclustering method Strengths and weaknesses of this methodare discussed compared to other classical clusteringmethodsAfter that in order to make up for the pointed drawbacks weanalysed the characteristics of density distribution within acluster and proposed a new isolation criterion called centroiddistance based on which a nonhomogeneous density detec-tion algorithm is designed to generate further subclustersfrom an isolated parent cluster Then an integration of theoriginal dissimilarity increments clustering method and ourproposed centroid distance isolation criterion is made anew clustering algorithm named localized ambient solidityseparation (LASS) is developed Finally our LASS algorithmis applied on the two-dimensional benchmark dataset againand the performance is demonstrated

21 Dissimilarity Increments Based Clustering Method Inte-grating dissimilarity increments isolation criterion with hier-archical clustering method a novel hierarchical agglomer-ative clustering method has been proposed [27] which iscalled dissimilarity increments clustering method in thispaper Compared with classical hierarchical clustering meth-ods such as single link or complete link thismethod does notneed a threshold to determine the number of clusters Insteadthe number of generated clusters is automatically decided byalgorithm While on the other hand compared with classicalpartitioning clusteringmethods such as 119896-means thismethoddoes not make any prior hypothesis about cluster shape andthus can handle clusters of arbitrary shape as long as they arenaturally isolated

However dissimilarity increments clusteringmethod alsohas some drawbacksThat is due to the nature of hierarchicalclustering method it is not sensitive to the points in adjacentoverlapping and background noise area In Figure 1 a two-dimensional benchmark dataset is designed to show thisfact This dataset contains six well-isolated groups three ofwhich have nonhomogeneous internal structure We use thisdataset to test the performance of a clustering algorithm onidentifying clusters when they are completely isolated andsomewhat in touching As we can see from the figure the dis-similarity increments clustering method grouped the pointsinto six clusters which is consistentwith first glance intuitionHowever the clustering result also shows that this method isnot applicable in three cases which are the yellow cluster inthe upper half of Figure 1 and the red and green clusters inthe right half of Figure 1 The case of yellow forks representstwo adjacent clusters the case of red forks represents twooverlapping clusters and the case of green forks representsa cluster under background noise

22 The Density Distribution within a Cluster Consideringthe six identified clusters in Figure 1 we could find that thepointsrsquo density distribution within a cluster could be quitedifferent from one another Specifically the pointsrsquo densityof the three circle-shaped clusters in the bottom left part of

7

6

5

4

3

220 25 30 35 40 45 50

Figure 1 Result generated by dissimilarity increments clusteringmethod

Figure 1 is homogeneous while the remaining three clustersare nonhomogeneous Nonhomogeneous means that thepointsrsquo density does not change continuously and smoothlybut heavily with a clear boundary of two touching clustersSo a mechanism could be designed to identify potential sub-clusters within a given cluster based on the nonhomogeneousor heterogeneous distribution of density

The first question is how to define andmeasure density Inconvention the concept of pointsrsquo density refers to the num-ber of points in unit area But just as it is mentioned in Back-ground and Related Work (see Section 1) Euclidean notationof density would have trouble with high-dimensional datasetand cannot identify clusters when their densities vary widelyThe key idea to address these two problems is to associatedensity with each point and its surrounding context andmoreover to associate isolation criterion with pointsrsquo countdistribution rather than absolute values In this paper thedensity around point 119909

119894is defined as the reciprocal of the

centroid distance of 119909119894rsquos 119899 nearest neighbours just as formula

(1) shows In this formula Distance(sdot) is a defined functionto output the distance of two points set 119883 is a collection of119909119894rsquos 119899 nearest neighbour points 119909

119898refers to the point which

has the largest distance to 119909119894in set 119883 and Centroid(sdot) is a

function to calculate the centroid point of a given point setIntuitively the point which lies in high density areawill have asmall centroid distance and thus have a large value of densityaround

Density (119909119894) =

1Centroid Distance (119909

119894)

=1

(119899 minus 1) times Distance (119909119898Centroid (119883 minus 119909

119898))

(1)

A more concrete example of centroid distance is thetwo-dimensional case shown in Figure 2 in which 119901

0is the

target point and 1199011

sim 1199014

are 1199010rsquos 4 nearest neighbour

points among the given dataset With the help of the defined

4 Computational Intelligence and Neuroscience

x

y

p4 p3

p1

p2

p0

p5

Figure 2 Centroid distance of point 1199010

function Distance(sdot) we could find that compared with linesegmentations 119897

11990101199011 11989711990101199012

and 11989711990101199013

the distance of 1199010and 119901

4

say 11989711990101199014

is the largest So if 1199015is the centroid point of triangle

119901111990121199013 then 3119897

11990141199015is the centroid distance of 119901

0 Therefore

the density around point 1199010is 1(3119897

11990101199015) Considering the

correlation between centroid distance and density wewill usethe value of centroid distance directly to describe density inthe remainder of this paper

Based on the analysis above the pointsrsquo densities in cyancircle-shaped cluster and blue circle-shaped cluster inFigure 1 are analysed as Figures 3(a) and 3(b) the pointsrsquodensities in red forks cluster and green forks cluster are anal-ysed as Figures 4(a) and 4(b) The horizontal axis in thesefigures represents normalized centroid distance while thevertical axis represents the number of points ComparingFigure 4 with Figure 3 some law could be found The densitydistribution of cyan circle-shaped cluster and blue circle-shaped cluster which are homogeneous has only one peakas what is shown in Figure 3 In contrast there are at leasttwo apparent peaks on the density distribution curve ofred forks and green crosses clusters whose densities arenonhomogeneous as what is shown in Figure 4Therefore ananalogy can be drawn that the centroid distance distributioncurve of a given cluster would have more than one peakif heterogeneity exists Furthermore based on this analogythe centroid distance values corresponding to the valleys oncentroid distance distribution curvewhich hasmore than onepeak could be seen as a new isolation criterion

23 Centroid Distance Isolation Criterion Based on Nonho-mogeneous Density In order to identify different densitydistributions within a cluster we assume that its centroid dis-tance distribution obeys Gaussian Mixture Models (GMMs)as long as heterogeneity exists More specifically if there are119899 valleys on density distribution curve then for point 119909

119894

119901(Centroid Distance(119909119894)) obeys a GMM consisting of 119899 + 1

Gaussian distribution components as shown in the followingformula in which

119899+1sum

119894=1120587119894= 1

119873119894(119909 | 120583

119894 120590119894) =

1radic2120587120590

119894

exp [minus 12120590119894

(119909 minus 120583119894)2]

(2)

119901 (Centroid Distance (119909119894))

=

119899+1sum

119894=1120587119894119873119894(Centroid Distance (119909

119894) | 120583119894 120590119894)

(3)

Based on the GMM assumption we used EM algorithmto derive two sets of parameters 120587

119894 120583119894 and 120590

119894for the red

forks and green forks clusters in Figure 1 The results areshown in Figures 5(a) and 5(b) where the dashed-line curverepresents high density area and the dashed-dot curve rep-resents the other area Therefore the components of a GMMcould be derived from a given cluster whose centroid distancedistribution curve has at least one valley Specifically the 119909values of the intersection points of different Gaussian distri-butions in a GMM could be seen as isolation criterion

In terms of efficiency the complexity of EM algorithmdepends on the number of iterations and the complexity ofE and M step which is seriously related with cluster sizeIn order to guarantee the efficiency of isolation criterionrsquoscomputation we designed a more simple algorithm whichcould reduce the computational complexity to 119874(119899) where119899 is the number of points in a given cluster For the nextparagraph we will describe the thought of simplification

Through the observation of Figure 6 which demonstratesa comparison of GMM and centroid distance distributioncurve we could find that the 119909 values of the lowest pointof the valley on centroid distance distribution curve andthe intersection point of two Gaussian distributions arealmost identical So the task of identifying a GMM can beconverted into identifying the valleys on a centroid distancedistribution curve Intuitively if a valley is deep enoughthe corresponding centroid distance of the lowest point willbe a good partitioning value The concept of derivation isthen utilized to reflect this intuition here Figure 7 illustratesthe derivative of the centroid distance distribution curvesin Figure 6 The derivative segmentation corresponding toa peak-valley-peak segmentation on a density distributioncurve must satisfy two requirements The first is that it hasto cross zero point of vertical axis which means that there isindeed a valley on centroid distance distribution curve thereOn the premise of meeting this requirement the derivationsegmentation still needs to be long enough which meansthat the valley has enough depth to be a good isolationvalueThe dashed-line segmentations in Figure 7 satisfy thesetwo requirements and the corresponding centroid distancevalues are 6 and 8 which are nearly identical with the 119909 valuesof the intersection points of two Gaussian distributions inFigure 5

Computational Intelligence and Neuroscience 5

Histogram of centroid distance

10

9

8

7

6

5

Num

ber o

f poi

nts

0 1 2 3 4 5 6

Normalized centroid distance

(a)

Histogram of centroid distance

Num

ber o

f poi

nts

Normalized centroid distance

14

12

10

8

6

4

0 2 4 6 8 10 12 14

(b)

Figure 3 Centroid distance histogram of two homogeneous clusters

Normalized centroid distance

22

20

18

16

14

12

10

8

0 10 20 30 40 50

Histogram of centroid distance

Num

ber o

f poi

nts

(a)

20

18

16

14

12

10

8

Normalized centroid distance0 10 20 30 40 50

Histogram of centroid distance

Num

ber o

f poi

nts

(b)

Figure 4 Centroid distance histogram of two heterogeneous clusters

Based on the analysis above a nonhomogeneous densitydetection algorithm is proposed to carry potential partitionswithin a given cluster This algorithm first uses crossing-zero index to filter optional partitioning values from allpoints and then measures the angles on either side of thispoint on centroid distance distribution curve to evaluate

the significant level of the isolation criterion A schematicdescription is as shown in Algorithm 1

In our nonhomogeneous density detection algorithmone parameter 119899 which is the number of points used to cal-culate centroid distance still needs to be decided In orderto give a determination policy of 119899 let us consider three

6 Computational Intelligence and Neuroscience

Input119873 samples of a certain cluster 119899 (the number of samples used to calculte centroid distance)Output patition values if neccesarySteps(1) Set Paritioning Points = 0 threshold = tan 45∘ = 1 the 119894th sample is 119878

119894

(2) Calculate the centroid distance for every sampleCentroid Distance(119904

119894) = (119899 minus 1)Distance(119904

119895 Centroid(119878

119894119899minus 119904119895))

119878119894119899

is the collection of 119899 nearest samples to 119904119894 119904119895is the sample which has the largest distance to 119904

119894in 119878119894119899

get histogram data (119909119894 119910119894) about centroid distance array 119894 = 1 2 lfloor11987310rfloor

(3) Set 119894 = 2(4) If 119894 == lfloor11987310rfloorThen stop and return the points in Paritioning PointsElse continue(5) If 119910

119894lt 119910119894minus1 and 119910

119894lt 119910119894+1

Then119895 = 119894tan 1 = 0 tan 2 = 0While 119895 gt 1 and 119910

119895lt 119910119895minus1

If tan 1 lt ((119910119895minus1 minus 119910119895)(119909119895 minus 119909119895minus1))

Then tan 1 = ((119910119895minus1 minus 119910119895)(119909119895 minus 119909119895minus1))

119895 = 119895 minus 1

119895 = 119894While 119895 lt lfloor11987310rfloor and 119910

119895lt 119910119895+1

If tan 2 lt ((119910119895+1 minus 119910119895)(119909119895+1 minus 119909119895))

Then tan 2 = ((119910119895+1 minus 119910119895)(119909119895 minus 119909119895minus1))

119895 = 119895 minus 1

If tan 1 gt threshold and tan 2 gt thresholdThen

Paritioning Points = Paritioning Points cup 119904119894

Go to Step (6)Else continue

(6) 119894 = 119894 + 1Go to Step (4)

Algorithm 1

Table 1 First- and second-level nearest points in uniformly dis-tributed space

Dimensions One Two ThreeFirst-level nearest points 2 4 6Second-level nearest points 2 4 12

concrete examples in Figures 8(a) 8(b) and 8(c) which rep-resent uniformly distributed points in one- two- and three-dimensional space respectively Uniformly distributed pointsmeans for a given point there exist two nearest equidis-tant points on every dimension In our examples Euclid-ean distance is used and the value of nearest equal distanceis 119903 Further investigation tells us that the change of distancefrom a given point is not continuous but discrete In Figure 8for the central yellow point the first-level nearest points aremarked in red and the second-level nearest points aremarked in blue The three subfigures are summarized inTable 1 based onwhich formula (4) is put forward to calculatethe number of 119896-level nearest points in 119889-dimensional space(119896 le 119889) More specifically when 119896 equals 1 formula (4) isreduced to be the number of first-level nearest points whichis 2119889 We believe that the number of first-level nearest

points is sufficient for centroid distance computation inuniformly distributed dataset In reality however data canhardly be uniformly distributed so in order to guarantee theavailability of centroid distance to reflect nonhomogeneousdensity we multiply the first-level nearest pointsrsquo numberby 2 Formula (5) finally gives the policy to determine 119899 innonhomogeneous density detection algorithm according tothe dimension of data set

119899 = 119862119896

1198892119896 (4)

119899 = 4119889 (5)

24 The Integration of Dissimilarity Increment and CentroidDistance Criteria Applying nonhomogeneous density detec-tion algorithm after using dissimilarity increments clusteringmethod in other words taking dissimilarity increments andcentroid distance as an isolation criterion successively anew clustering algorithm named localized ambient solidityseparation algorithm (LASS) is developed and the clusteringresult is obtained Just as demonstrated in Figure 9 exceptfor the perfect partition of naturally isolated clusters theirinternal structure has also been explored and points arepartitioned further if necessary The yellow red and green

Computational Intelligence and Neuroscience 7

20

15

10

5

00 10 20 30 40 50

x

Gaussian distribution 1Gaussian distribution 2

yaft

er b

eing

scal

ed

(a)

20

15

10

5

00 10 20 30 40 50

x

Gaussian distribution 1Gaussian distribution 2

yaft

er b

eing

scal

ed

(b)

Figure 5 GMMs derived by EM algorithm from two heterogeneous clusters

20

15

10

5

00 10 20 30 40 50

x

Gaussian distribution 1Gaussian distribution 2

Centroid distance

yaft

er b

eing

scal

ed

(a)

20

15

10

5

00 10 20 30 40 50

x

yaft

er b

eing

scal

ed

Gaussian distribution 1Gaussian distribution 2

Centroid distance

(b)

Figure 6 Comparison of GMM and centroid distance distribution curve

clusters in Figure 1 are divided into two subclusters furtheraccording to their nonhomogeneous density distributionTherefore our LASS algorithm can handle clusters of arbi-trary shape which are isolated adjacent overlapping and

under background noise Moreover compared with the tra-ditional notation of density which is the number of pointsin unit Euclidean volume our proposed centroid distanceisolation criterion works well in high-dimensional space

8 Computational Intelligence and Neuroscience

3

2

1

0

minus1

minus2

minus3

Derivative of centroid distance

Num

ber o

f poi

nts

Normalized centroid distance0 10 20 30 40 50

(a)

Derivative of centroid distance

3

2

1

0

minus1

minus2

minus3

Num

ber o

f poi

nts

Normalized centroid distance0 5 10 15 20 25 30 35 40

(b)

Figure 7 Centroid distance derivative of two heterogeneous clusters

x

2rr0minus2r minusr

(a)

y

r

radic2r

(b) (c)

Figure 8 Uniformly distributed points in one- two- and three-dimensional space

7

6

5

4

3

220 25 30 35 40 45 50

Figure 9 Result generated by our LASS algorithm

actually it is evenmore sensitive as dimension increases Alsocompared with direct similarity centroid distance isolationcriterion takes into account the surrounding context of eachpoint by using its nrsquos nearest points and depends on thehistogram distribution instead of the exact absolute value ofsimilarity So it can automatically scale according to the den-sity of points All in all integrated dissimilarity incrementsand centroid distance isolation criteria together our LASSalgorithm can achieve broader applicability especially on thedataset with high dimension and diverse distribution shape

3 Computer User Segmentation

In this section our proposed LASS algorithm is applied oncomputer users dataset which contains their demographicand behaviour information To accomplish this we first

Computational Intelligence and Neuroscience 9

cleaned the raw data and extracted 7 features to characterizecomputer users Then the cleaned data is normalized and adissimilarity measurement is defined On the basis of thesethe original dissimilarity increments clustering algorithmand our LASS algorithm are applied on the dataset respec-tivelyThe clustering processes are analysed and the effective-ness of results is verified At last the segmentation result ofcomputer users is analysed and summarized

31 Data Cleaning and Features Selection The raw dataprovided by CNNIC contains two kinds of informationTheyare 1000 computer usersrsquo personal attributes and their com-puter using log files Specifically personal attributes includea volunteerrsquos gender birthday education level job typeincome level province of residence city of residence andtype of residence while computer using log files record these1000 volunteersrsquo computer interaction behaviours in 7 daysincluding start time end time websites browsing history andprograms opening history

Although many features could be extracted from rawdata we focus our attention on volunteersrsquo natural attributesas persons and their fundamental behavioursrsquo statistical indi-cators but ignore environmental and geographic factors suchas job type province of residence city of residence and resi-dence type The reason behind this is that we regard Internetas a strength which has broken down geographic barrierTherefore we assume that environmental and geographic fac-tors are no longer crucial influence factors in Internet worldFrom this point of view we extracted 7 features to profilecomputer users Taking the 119894th computer user 119906

119894as a concrete

example these extracted features are described inTable 2Thedata of volunteers whose value of Times(sdot) is less than 4 arecleared out and 775 sample data are left

32 Data Normalization and Dissimilarity MeasurementData normalization is needed before applying our LASS algo-rithm The reason to do so is that similarity measurement isusually sensitive to differences inmean and variability In thispaper two kinds of normalization are used as expressed informulas (6) and (7) respectively In formula (6) 119898

119895and 119904119895

are the mean and standard deviation of feature 119895 Throughthis transformation feature 119895 will have zero mean and unitvariance While in formula (7) function Rank(sdot) returns theranked number of119909lowast

119894119895in feature 119895 data sequenceTherefore the

transformed data will have a mean of (119899+1)2 and a varianceof (119899 + 1)[(2119899 + 1)6 minus (119899 + 1)4] where 119899 is the number ofdata Related study has shown that on the performance ofclustering formula (7) outperforms formula (6) particularlyin hierarchical clusteringmethods formula (7) ismore robustto outliers and noise in dataset [28]

119909119894119895=119909lowast

119894119895minus 119898119895

119904119895

(6)

119909119894119895= Rank (119909lowast

119894119895) (7)

In this paper for continuous variablersquos normalizationsuch as bootDuration(sdot) and visitingDuration(sdot) formulas (7)

Table 2 Description of computer users features

Variables Descriptions

Gender (119906119894)

The gender of 119906119894 discrete variable

1 stands for male0 stands for female

Age (119906119894) The age of 119906

119894 discrete variable between 10

and 70

Edu (119906119894)

The education level of 119906119894 discrete variable

0 below primary school1 junior school2 senior school3 junior college4 bachelor degree5 others

Income (119906119894)

The monthly income level of 119906119894 discrete

variable0 no income1 below 500 Yuan2 501ndash1000 Yuan3 1001ndash1500 Yuan4 1501ndash2000 Yuan5 2001ndash3000 Yuan6 3001ndash5000 Yuan7 5001ndash8000 Yuan8 8001ndash12000 Yuan9 others

Times (119906119894) Boot times of 119906

119894rsquos computer discrete

variableBooting Duration(119906119894)

The duration of 119906119894using computer

continuous variable

Brows Duration (119906119894) The duration of 119906

119894browsing websites

continuous variable

and (6) are used successively while for discrete variablersquosnormalization such as Gender(sdot) Age(sdot) and Edu(sdot) onlyformula (6) is used

After normalization a dissimilarity index is defined tomeasure the distance between different data As formula (8)shows it is a form of 1-normsrsquo sum where 119891

119894119899stands for the

value of 119894th datarsquos 119899th feature

Dissimilarity (119906119894 119906119895) =

7sum

119899=1

10038161003816100381610038161003816119891119894119899minus119891119895119899

10038161003816100381610038161003816 (8)

33 Computer Users Segmentation Process Our proposedLASS algorithm is applied for the segmentation of computerusers in this sectionThewhole segmentation process consistsof two parts Part I is the dissimilarity increments basedclustering strategy (for details please refer to Section 3 in[27]) which aims to find natural isolated clusters part II isour proposed centroid distance based clustering strategy (fordetails please refer to Section 23 in this paper) whose goalis to explore the internal structure of every cluster generatedby part I and identify potential subclusters that are adjacentoverlapping and under background noise

The clustering process is partly shown in Figure 10 wherethree representative clusters obtained in part I strategy arechosen to be demonstrated Further exploration is carried

10 Computational Intelligence and Neuroscience

Normalized centroid distance

15

10

5

0

Histogram of centroid distance

Num

ber o

f poi

nts

20 22 24 26 28 30 32 34

(a)

Normalized centroid distance

Histogram of centroid distance

Num

ber o

f poi

nts

20 25 30 35 40 45 50 55

14

12

10

8

6

4

2

0

(b)

Normalized centroid distance

Histogram of centroid distance

Num

ber o

f poi

nts

6

4

2

05 6 7 8 9 10

16

14

12

10

8

(c)

Figure 10 Centroid distance histogram of three clusters

out by part II strategy of LASS algorithm and a partitionvalley is found in cluster 2 as shown in Figure 10(b) Nextthe horizontal axis value of the lowest point on this valley canbe acquired as a further isolation criterion based on whichcluster 2 will be divided into two subclusters Figure 11 showsa comparison of the GMM generated by EM algorithm andcentroid distance distribution curve of cluster 2 Despite thedifferences between these two graphsrsquo shapes the acquiredtwo isolation criteria are nearly the same which validates oursimplification of GMMrsquos computation

34 Segmentation Results Analysis and Discussion The seg-mentation results generated by the original dissimilarityincrements method and our LASS algorithm are demon-strated in Tables 3 and 4 These two tables list the prototypessummarized from the obtained clusters As it is shown thesixth cluster in Table 3 is divided into two subclusters thesixth and seventh cluster in Table 4The reason of this furtherpartition as analyzed in Section 33 is the existence of a deepenough valley on cluster 6rsquos centroid distribution curve (asshown in Figure 10(b)) which implies the existence of twodifferent density areas within cluster 6 in Table 3

To understand this process some investigation shouldbe made about the relationship between Tables 3 and 4 InTable 3 cluster 6 is the largest group of all clusters whosegender proportion is almost 50 However an intuitive senseof behavior tells us that behavior mode should be seriouslyaffected by peoplersquos gender This intuition is proved by thefirst 5 clusters in Table 3 to some extent in which genderproportion is 100 male The reason why cluster 6 has not

Normalized centroid distance

Histogram of centroid distance

Num

ber o

f poi

nts

20 25 30 35 40 45 50

14

12

10

8

6

4

2

0

Centroid distanceGaussian distribution 1Gaussian distribution 2

Figure 11 Comparison of GMM and centroid distance distributioncurve

Computational Intelligence and Neuroscience 11

Table 3 Results generated by dissimilarity increments clustering method

Segment Size Gender Age Education level Income levelComputer using

frequency(timesweek)

Computerusing time

(hoursweek)

Websitebrowsing time(hoursweek)

1 24 Male 100Female 0 28 Junior college 2001ndash5000 Yuan 67 675 06

2 35 Male 100Female 0 24 Bachelor degree 0ndash500 Yuan 65 448 44

3 58 Male 100Female 0 41 Junior college 3001ndash5000 Yuan 57 447 29

4 70 Male 100Female 0 32 Senior school 2001ndash3000 Yuan 6 33 31

5 185 Male 100Female 0 32 Bachelor degree 2001ndash5000 Yuan 59 39 67

6 352 Male 42Female 58 32 Junior college 1501ndash3000 Yuan 65 421 51

Table 4 Results generated by our LASS algorithm

Segment Size Gender Age Education level Income levelComputer using

frequency(timesweek)

Computerusing time

(hoursweek)

Websitebrowsing time(hoursweek)

1 24 Male 100Female 0 28 Junior college 2001ndash5000 Yuan 67 675 06

2 35 Male 100Female 0 24 Bachelor degree 0ndash500 Yuan 65 448 44

3 58 Male 100Female 0 41 Junior college 3001ndash5000 Yuan 57 447 29

4 70 Male 100Female 0 32 Senior school to

junior college 2001ndash3000 Yuan 6 33 31

5 185 Male 100Female 0 32 Bachelor degree 2001ndash5000 Yuan 59 39 67

6 136 Male 07Female 993 30 Junior college to

bachelor degree 1501ndash3000 Yuan 59 378 31

7 216 Male 681Female 329 33 Junior college 1001ndash2000 Yuan 69 448 63

been divided further apart by the dissimilarity incrementsclustering method is that there may exist much touchingareas in high-dimensional space of cluster 6 under whichsituation the dissimilarity increments clusteringmethod doesnot work anymore While our proposed centroid distancebased nonhomogeneous density detection algorithm hasfound that there still exist two potential subgroups withincluster 6 in Table 3 which are identified as clusters 6 and7 in Table 4 these two clusters are different in gender ageand computer using behaviors Cluster 6 is almost totallycomposed of women who spend less time on computer andwebsites browsing while in cluster 7 men are twice as muchas women who are older than people in cluster 6 and spendmuch more time on computers especially on browsing

In order to quantify the overall effectiveness of our LASSalgorithm a between group sum of dissimilarities (SDB) iscalculated as formula (9) which is the sumof the dissimilaritybetween a cluster centroid 119888

119894 and the overall centroid 119888

of all the data In this formula 119870 is the number of clusters

Table 5 Total SDB of two clustering methods

MethodDissimilarityincrements

clustering methodOur LASS algorithm

Total SDB 853 1109

and 119899119894is the number of points in cluster 119894 The higher the

total SDB is achieved the more adjoint the identified clustersare So it could be used to measure the effectiveness of aclustering methodThe total SDB of the original dissimilarityincrements clustering method and our LASS algorithm onthe given dataset are shown in Table 5 Obviously our LASSalgorithm achieves larger total SDB more specifically 30larger thus it fits for the given computer user dataset better

In terms of the evaluation of individual clusters silhouettecoefficient is used here whose value varies between minus1 and 1A positive value of silhouette coefficient is desirable

12 Computational Intelligence and Neuroscience

Table 6 The silhouette coefficients of clusters

Clusters Cluster 6in Table 3

Cluster 6in Table 4

Cluster 7 inTable 5

Silhouettecoefficient minus034 002 minus041

As Table 6 shows the silhouette coefficient value of cluster 6in Table 3 is negative which implies that the inside cohesionand outside separation of the cluster are not good So cluster6 in Table 3 could not be seen as a typical cluster whilethrough our LASS algorithm cluster 6 in Table 3 is identifiedas two individual clusters one of whose silhouette coefficientsis positive So as to cluster 7 whose silhouette coefficient isstill negative we guess that it belongs to some kind of back-ground noise This will be discussed later As for cluster 6 inTable 4 we believe that it is a typical prototype of Chinesefemale computer users which has not been revealed inTable 3 Therefore compared with the original dissimilarityincrements clustering method our LASS algorithm can gainmore knowledge and understanding from computer userdataset

Total SDB =

119870

sum

119894=1119899119894Dissimilarity (119888

119894 119888) (9)

Further Kruskal-WallisHTest is applied on the clusters inTable 4 to test the difference between two ormore clusters of agiven dimension As a nonparametric test method Kruskal-Wallis H Test is typically used to determine if there are sta-tistical significance differences between two or more groupsof an independent variable The results are shown in Tables 7and 8 In the hypothesis tests of Table 7 original hypothesisis that the distributions of a given variable in all 7 clusters areidentical and alternative hypothesis is that the distributionsof a given variable in all 7 clusters are not identical While inthe hypothesis tests of Table 8 original hypothesis is that thedistributions of a given variable in a given pair of clusters areidentical and alternative hypothesis is that the distributionsof a given variable in a given pair of clusters are not identicalThe 119901 values are listed and marked by star if they are biggerthan 005 whichmeans accepting the original hypothesis andrejecting the alternative one For the cases in which 119901 valueis below 005 the smaller the 119901 value is the more statisticallysignificant the variablersquos difference is In Table 7 all of the 119901values are below 0002 which means for any given variableits distributions are extremely different among the sevenclusters in Table 4Therefore we can draw the conclusion thatthese seven variables perform well in identifying differentgroups of computer users While in Table 8 119901 value changesa lot according to the given pair of clusters and variable Thesignificance of these seven variables to distinguish differentpair of clusters will be discussed one by one combined withTable 9 which reveals the detailed demographic and com-puter interaction behaviours characteristics of the obtainedseven computer users clusters

Segmentation results will be analysed from the perspec-tive of variables with the help of Table 8 and Tables 9 and 4

and significant characteristics will be pointed out For thevariable of gender Table 8 tells us that its distributions in thefirst five segments are identical which is proved to be 100male in Table 9 The most significant difference of genderlies among segments 1ndash5 segment 6 and segment 7 whichrepresentsmale groups female group andmix-gender grouprespectively For the variable of age Table 8 reveals that itsdistribution among segments 4ndash7 could be seen as identicalthemain difference happens between the first three segmentsCombined with Tables 9 and 4 we could find that segment 2consists of the youngest members whose age is around 24Segment 1 is a little bit elder group whose average age isaround 28 While segment 3 is a middle-aged group with anaverage age of 41 they are much older than other segmentsSo as to the variable of education level it discriminatesdifferent segments well Its distribution in segments 2 and 5could be seen as identical that has the highest education levelbachelor degree while the people from segment 4 have thelowest education level Other segments differ from oneanother For the variable of income level segment 1 earns thehighest income while segment 2 earns the lowest one Theincome level of segments 3 and 5 could be seen as identicalso it is with segments 4 and 6 And the former tworsquos incomeis lower than the latter tworsquos In the terms of computer usingfrequency the segments could be divided into two groupsthey are segments 1 2 and 7 and segments 3ndash6 The formergroup uses computer more frequently As for the variableof computer using time it discriminates segments 1 and 4well that spend the most and the least time on computerrespectively while for the remaining 5 segments no signifi-cant difference exists among their computer using time Forthe last variable website browsing time its distribution insegments 2 3 4 and 6 could be seen as identical differencemainly lies among segments 1 5 and 7 Specifically segment1 spends the least time on website browsing while segment 5spends the most and the browsing time of segment 7 falls inbetween segment 1 and segment 5

Based on the analysis above the 7 segments obtained byour LASS algorithm are summarized and discussed belowrespectively

Category 1 (little-browsing group) This group is entirelycomposed of young men who received a high educationlevel and earn a decent income The most significant featureof the people in this group is that although they spendthe most time on computers compared with other groupsthey seldom visit webpages We guess that for this group ofpeople the computer interaction behaviours mainly happenin workplace or public where personal browsing is notencouraged

Category 2 (little-income group) This group is composedof the youngest people who are purely male and have thehighest education level The most significant feature of thisgroup of people is that they have the same income level whichis no income Additionally they spend relatively more timeon computers and browsing websites We guess that the mainbody of this group is college students in progress who havelots of free time but no source of revenue

Computational Intelligence and Neuroscience 13

Table 7 119901 values of features among all clusters

Variables Gender Age Education level Income level Computer using frequency Computer using time Website browsing time119901 value lt0002 lt0002 lt0002 lt0002 lt0002 lt0002 lt0002

Table 8 119901 values of features between two pairs of clusters

Variables Pair of segments1-2 1-3 1-4 1-5 1-6 1-7 2-3

Gender gt05lowast gt05lowast gt05lowast gt05lowast lt0002 lt0002 gt05lowast

Age lt0005 lt0002 gt005lowast lt005 gt02lowast gt005lowast lt0002Education level lt0002 gt05lowast lt0002 lt0002 gt005lowast gt05lowast lt0002Income level lt0002 gt02lowast gt01lowast gt05lowast gt005lowast lt0002 lt0002Computer using frequency gt02lowast lt0002 lt0005 lt0002 lt0002 gt01lowast lt0005Computer using time lt0002 lt0002 lt0002 lt0002 lt0002 lt0002 gt05lowast

Website browsing time lt0002 lt005 lt001 lt0002 lt005 lt0002 gt01lowast

Variables Pair of segments2-4 2-5 2-6 2-7 3-4 3-5 3-6

Gender gt05lowast gt05lowast lt0002 lt0002 gt05lowast gt05lowast lt0002Age lt0002 lt0002 lt0002 lt0002 lt0002 lt0002 lt0002Education level lt0002 gt05lowast lt0002 lt0002 lt0002 lt0002 lt0005Income level lt0002 lt0002 lt0002 lt0002 lt0005 gt01lowast lt0002Computer using frequency lt005 lt0005 lt0002 gt05lowast gt02lowast gt02lowast gt05lowast

Computer using time lt001 gt02lowast gt01lowast gt05lowast lt0002 gt01lowast lt005Website browsing time gt02lowast gt005lowast gt005lowast gt05lowast gt05lowast lt0002 gt05lowast

Variables Pair of segments3-7 4-5 4-6 4-7 5-6 5-7 6-7

Gender lt0002 gt05lowast lt0002 lt0002 lt0002 lt0002 lt0002Age lt0002 gt05lowast gt01lowast gt05lowast lt002 gt02lowast gt01lowast

Education level gt02lowast lt0002 lt0002 lt0002 lt0002 lt0002 lt0002Income level lt0002 lt005 gt05lowast lt002 lt0002 lt0002 lt002Computer using frequency lt0002 gt05lowast gt02lowast lt0005 gt05lowast lt0002 lt0002Computer using time gt05lowast lt001 gt005lowast lt0002 gt02lowast gt01lowast lt005Website browsing time lt0005 lt0002 gt02lowast lt002 lt0002 lt005 lt0002

Category 3 (high-income group) This group of people isentirely middle-aged menThemost significant feature of thepeople in this group is the highest income they earn Besidesthey spend relatively less time on computer interaction interms of both using frequency and total browsing time Weguess that for the middle-aged men in this group most ofwhom have not received a higher education computers orInternet is not so necessary in their daily life

Category 4 (low-education group) This group is entirelycomposed of young men whose age is older than Categories1 and 2 The most significant feature of the people in thisgroup is their low-education level the average of which issenior school ranging from junior school to junior collegeMoreover they earn a medium level income and get smallervalues on every computer interaction index We guess that

this group of people is mainly engaged in jobs independentof computers

Category 5 (much-browsing group) The structure of thisgroup is very similar to Category 4 except for the highereducation they received say bachelor degree As it is shownpeople in this group earn more we guess that education dif-ference may account for this Also compared with othercategories especially Category 4 this group of people spendsmuch more time on browsing websites We guess that themain job types of this group could be intellectual work thusthey have close access to online computers

Category 6 (young-women group) Female accounts fornearly 100 in this group which is the only case in these 7categories However from computer interaction aspects say

14 Computational Intelligence and Neuroscience

Table 9: Demographic and behaviour description of computer user segmentations. (Values for gender, age, education level, and income level are column percentages.)

Demographic and computer
interaction behaviours       Segment 1  Segment 2  Segment 3  Segment 4  Segment 5  Segment 6  Segment 7  Total

Gender
  Male                       100        100        100        100        100        0.7        68.1       71.8
  Female                     0          0          0          0          0          99.3       31.9       28.2

Age
  10~20                      0          0          0          0          0          2.2        12.0       4.0
  20~25                      4.2        68.6       6.8        14.3       6.4        12.5       17.6       14.6
  25~30                      62.5       22.9       3.4        24.2       34.1       41.9       16.2       27.2
  30~35                      33.3       8.6        13.8       28.6       27.6       21.3       16.2       21.2
  35~40                      0          0          15.5       22.9       16.2       14.7       10.2       13.4
  40~50                      0          0          37.9       10.0       13.5       5.9        18.1       14.0
  50~60                      0          0          20.6       0          2.2        1.4        8.3        5.0
  60~70                      0          0          1.7        0          0          0          1.4        0.6

Education level
  Below primary school       0          0          0          0          0          0          1.4        0.4
  Junior school              0          0          1.7        2.9        0          2.2        14.4       5.1
  Senior school              0          0          25.9       74.3       0.5        19.9       26.9       21.1
  Junior college             100        20.0       55.1       22.9       26.5       33.1       19.9       29.8
  Bachelor degree            0          71.4       17.2       0          73.0       41.2       28.7       39.8
  Others                     0          8.6        0          0          0          3.7        8.8        3.7

Income level
  No income                  0          91.4       0          0          0          5.1        24.1       12.6
  Below 500 Yuan             0          5.7        0          0          0          0.7        3.2        1.4
  501–1000 Yuan              0          2.9        1.7        0          1.6        2.9        4.6        2.6
  1001–1500 Yuan             0          0          5.2        11.4       4.3        5.9        9.7        6.6
  1501–2000 Yuan             12.5       0          10.3       22.9       10.8       17.6       8.3        12.0
  2001–3000 Yuan             29.2       0          24.1       24.3       28.6       32.4       16.2       12.5
  3001–5000 Yuan             45.8       0          20.7       32.9       39.5       23.5       15.7       25.6
  5001–8000 Yuan             12.5       0          17.2       8.5        11.9       6.6        8.3        9.4
  8001–12000 Yuan            0          0          12.1       0          3.2        4.4        2.8        3.4
  Others                     0          0          0          0          0          0.7        6.9        2.9

Computer using frequency
  Mean                       6.7        6.5        5.7        6.0        5.9        5.87       6.9        6.3
  Variance                   0.04       0.20       0.33       0.18       0.25       0.40       1.4        0.68

Computer using time
  Mean                       67.5       44.7       44.7       33.0       39.4       37.8       44.8       41.7
  Variance                   0.23       0.88       1.31       0.31       0.70       0.83       1.31       0.99

Website browsing time
  Mean                       0.64       4.43       2.88       3.12       6.7        3.1        6.3        4.95
  Variance                   0.43       0.89       0.89       0.80       0.83       0.96       1.01       0.99

Category 7 (noise group). This category is the only gender-mixed group, in which men are about twice as many as women. However, in terms of age, education level, and income level, this category shows no significant difference from the total population. And as for the variables of computer using frequency, computer using time, and website browsing time, their variances are fairly large, even bigger than the overall variances. So, due to the dispersed distribution of this category on every dimension, we believe that it is a noise group.

4. Conclusion

In this paper, we proposed a new clustering algorithm named the localized ambient solidity separation (LASS) algorithm. This algorithm is built on a new isolation criterion called centroid distance, which is used to detect the nonhomogeneous density distribution of a given cluster. The proposed isolation criterion is based on the recognition that, if there exist nonhomogeneous densities within a cluster, then partitions should be carried out. The intuition behind this recognition is a GMM assumption on the points' centroid distance values in a cluster, and the EM algorithm was used to derive the components and parameters of the GMM. Additionally, in order to make the algorithm more efficient, we designed a nonhomogeneous density detection algorithm to reduce the computation complexity to O(n), where n is the number of points for clustering. Moreover, the parameter determination policy of the nonhomogeneous density detection algorithm was investigated. Finally, we integrated our designed nonhomogeneous density detection algorithm as a follow-up mechanism with the original dissimilarity increments clustering method and developed the LASS algorithm. It is demonstrated that, compared with the original dissimilarity increments clustering method, our LASS algorithm not only can identify naturally isolated clusters but also can identify clusters which are adjacent, overlapping, and under background noise.
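For concreteness, the following is a minimal sketch of the centroid distance computation defined in formula (1), with the neighbourhood size n = 4d chosen according to formula (5); for the 7-feature computer user data this gives n = 28 neighbours per point. The sketch assumes Euclidean distance and a NumPy array of points, and is illustrative rather than the exact implementation used in our experiments.

```python
# Sketch: centroid distance of formula (1) with the n = 4d policy of formula (5).
import numpy as np

def centroid_distances(points: np.ndarray) -> np.ndarray:
    N, d = points.shape
    n = 4 * d  # parameter policy of the nonhomogeneous density detection algorithm
    # Pairwise Euclidean distances between all points.
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    result = np.empty(N)
    for i in range(N):
        # Indices of the n nearest neighbours of point i (excluding itself).
        nn = np.argsort(dists[i])[1:n + 1]
        # x_m: the neighbour farthest from point i.
        m = nn[np.argmax(dists[i][nn])]
        # Centroid of the remaining n - 1 neighbours.
        centroid = points[nn[nn != m]].mean(axis=0)
        result[i] = (n - 1) * np.linalg.norm(points[m] - centroid)
    return result  # the density around point i is 1 / result[i]
```

The histogram of these values is then scanned for deep valleys, whose locations serve as the isolation thresholds used by the clustering strategy.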

Additionally, in order to evaluate the performance of the LASS algorithm in practice, we applied it on the computer user dataset, which contains 1000 computer users' demographic and behaviour information, and compared the outcome with the result obtained from the original dissimilarity increments clustering method. The segmentation results show that one of the clusters generated by the dissimilarity increments clustering method is further divided into two subclusters by our LASS algorithm. The comparison of total SDB and silhouette coefficient validates the rationality of this further partition. The discussion and analysis of the segmentation results prove that our LASS algorithm can gain more knowledge and understanding from a dataset with high dimensionality and diverse distribution shapes, like the computer user dataset.
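As a reference for this validation step, the sketch below shows how the two measures could be computed: total SDB follows formula (9) with the 1-norm dissimilarity of formula (8), and the per-cluster silhouette coefficients are obtained from scikit-learn. The centroids are assumed here to be arithmetic means of the normalized feature vectors, a choice the paper does not prescribe explicitly.

```python
# Sketch: total SDB (formulas (8)-(9)) and per-cluster silhouette coefficients.
import numpy as np
from sklearn.metrics import silhouette_samples

def total_sdb(X: np.ndarray, labels: np.ndarray) -> float:
    overall = X.mean(axis=0)  # overall centroid c of all the data
    sdb = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        centroid = members.mean(axis=0)  # cluster centroid c_i
        # n_i times the 1-norm dissimilarity between c_i and c.
        sdb += len(members) * np.abs(centroid - overall).sum()
    return sdb

def cluster_silhouettes(X: np.ndarray, labels: np.ndarray) -> dict:
    # Mean of the per-point silhouette values within each cluster.
    s = silhouette_samples(X, labels)
    return {k: s[labels == k].mean() for k in np.unique(labels)}
```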

There are some future directions to explore from this paper. First, the GMM assumption on centroid distance values can be further investigated and tested among more distributions, such as Gaussian and exponential. Second, our proposed centroid distance isolation criterion could be integrated with other traditional clustering methods, either partitional or hierarchical; more strengths and weaknesses could be pointed out and analysed. Third, the centroid distance based clustering strategy in our LASS algorithm relies on the histogram distribution of centroid distance values; therefore, if the number of points in one cluster is too small, this clustering strategy may not work effectively any more. This drawback should be given enough attention and further investigated.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to acknowledge the dataset and technical support for their work provided by NELECT (National Engineering Laboratory for E-Commerce Technology, Tsinghua University) and DNSLAB of China Internet Network Information Center.

References

[1] E. Miller, "Community cleverness required," Nature, vol. 455, no. 7209, p. 1, 2008.

[2] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, 2005.

[3] O. Chapelle, B. Scholkopf, and A. Zien, Semi-Supervised Learning, MIT Press, Cambridge, Mass, USA, 2006.

[4] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.

[5] V. Jakkula and D. J. Cook, "Anomaly detection using temporal data mining in a smart home environment," Methods of Information in Medicine, vol. 47, no. 1, pp. 70–75, 2008.

[6] Q.-Y. Tang and C.-X. Zhang, "Data Processing System (DPS) software with experimental design, statistical analysis and data mining developed for use in entomological research," Insect Science, vol. 20, no. 2, pp. 254–260, 2013.

[7] M. A. Musen, B. Middleton, and R. A. Greenes, "Clinical decision-support systems," in Biomedical Informatics, pp. 643–674, Springer, London, UK, 2014.

[8] L. Hubert, "Approximate evaluation techniques for the single-link and complete-link hierarchical clustering procedures," Journal of the American Statistical Association, vol. 69, no. 347, pp. 698–704, 1974.

[9] E. Diday and J. C. Simon, "Clustering analysis," in Digital Pattern Recognition, vol. 10 of Communication and Cybernetics, pp. 47–94, Springer, Berlin, Germany, 1980.

[10] S. Nassar, J. Sander, and C. Cheng, "Incremental and effective data summarization for dynamic hierarchical clustering," in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '04), pp. 467–478, ACM, June 2004.

[11] K. Tasdemir, "Vector quantization based approximate spectral clustering of large datasets," Pattern Recognition, vol. 45, no. 8, pp. 3034–3044, 2012.

[12] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review," ACM Computing Surveys, vol. 31, no. 3, pp. 264–323, 1999.

[13] A. K. Jain, "Data clustering: 50 years beyond K-means," Pattern Recognition Letters, vol. 31, no. 8, pp. 651–666, 2010.

[14] N. R. Pal and S. K. Pal, "A review on image segmentation techniques," Pattern Recognition, vol. 26, no. 9, pp. 1277–1294, 1993.

[15] M. Wedel, Market Segmentation: Conceptual and Methodological Foundations, Springer, 2000.

[16] R.-S. Wu and P.-H. Chou, "Customer segmentation of multiple category data in e-commerce using a soft-clustering approach," Electronic Commerce Research and Applications, vol. 10, no. 3, pp. 331–341, 2011.

[17] M. C. Onwezen, M. J. Reinders, I. A. van der Lans et al., "A cross-national consumer segmentation based on food benefits: the link with consumption situations and food perceptions," Food Quality and Preference, vol. 24, no. 2, pp. 276–286, 2012.

[18] F. Westad, M. Hersleth, and P. Lea, "Strategies for consumer segmentation with applications on preference data," Food Quality and Preference, vol. 15, no. 7-8, pp. 681–687, 2004.

[19] J. Macharia, R. Collins, and T. Sun, "Value-based consumer segmentation: the key to sustainable agri-food chains," British Food Journal, vol. 115, no. 9, pp. 1313–1328, 2013.

[20] T.-C. Hsieh and C. Yang, "Multi-level latent class analysis of internet use pattern in Taiwan," in e-Technologies and Networks for Development, vol. 171 of Communications in Computer and Information Science, pp. 197–208, Springer, Berlin, Germany, 2011.

[21] Z. Bosnjak and O. Grljevic, "Credit users segmentation for improved customer relationship management in banking," in Proceedings of the 6th IEEE International Symposium on Applied Computational Intelligence and Informatics (SACI '11), pp. 379–384, May 2011.

[22] E. Martinez-Garcia and M. Royo-Vela, "Segmentation of low-cost flights users at secondary airports," Journal of Air Transport Management, vol. 16, no. 4, pp. 234–237, 2010.

[23] M. Bichis-Lupas and R. N. Moisey, "A benefit segmentation of rail-trail users: implications for marketing by local communities," Journal of Park and Recreation Administration, vol. 19, no. 3, pp. 78–92, 2001.

[24] A. Bhatnagar and S. Ghose, "A latent class segmentation analysis of e-shoppers," Journal of Business Research, vol. 57, no. 7, pp. 758–767, 2004.

[25] C. Lorenzo-Romero and M.-D. Alarcon-del-Amo, "Segmentation of users of social networking websites," Social Behavior and Personality, vol. 40, no. 3, pp. 401–414, 2012.

[26] http://www.cnnic.net.cn.

[27] A. L. N. Fred and J. M. N. Leitao, "A new cluster isolation criterion based on dissimilarity increments," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 8, pp. 944–958, 2003.

[28] M. C. P. De Souto, D. S. A. De Araujo, I. G. Costa, R. G. F. Soares, T. B. Ludermir, and A. Schliep, "Comparative study on normalization procedures for cluster analysis of gene expression datasets," in Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN '08), pp. 2792–2798, June 2008.

Page 3: Localized Ambient Solidity Separation Algorithm Based Computer … · 2016-05-15 · ResearchArticle Localized Ambient Solidity Separation Algorithm Based Computer User Segmentation

Computational Intelligence and Neuroscience 3

dissimilarity increments is proposed and a hierarchicalagglomerative clustering algorithm is designed [27] In thissection we first generate a two-dimensional benchmark data-set to test the effectiveness of the dissimilarity incrementsclustering method Strengths and weaknesses of this methodare discussed compared to other classical clusteringmethodsAfter that in order to make up for the pointed drawbacks weanalysed the characteristics of density distribution within acluster and proposed a new isolation criterion called centroiddistance based on which a nonhomogeneous density detec-tion algorithm is designed to generate further subclustersfrom an isolated parent cluster Then an integration of theoriginal dissimilarity increments clustering method and ourproposed centroid distance isolation criterion is made anew clustering algorithm named localized ambient solidityseparation (LASS) is developed Finally our LASS algorithmis applied on the two-dimensional benchmark dataset againand the performance is demonstrated

21 Dissimilarity Increments Based Clustering Method Inte-grating dissimilarity increments isolation criterion with hier-archical clustering method a novel hierarchical agglomer-ative clustering method has been proposed [27] which iscalled dissimilarity increments clustering method in thispaper Compared with classical hierarchical clustering meth-ods such as single link or complete link thismethod does notneed a threshold to determine the number of clusters Insteadthe number of generated clusters is automatically decided byalgorithm While on the other hand compared with classicalpartitioning clusteringmethods such as 119896-means thismethoddoes not make any prior hypothesis about cluster shape andthus can handle clusters of arbitrary shape as long as they arenaturally isolated

However dissimilarity increments clusteringmethod alsohas some drawbacksThat is due to the nature of hierarchicalclustering method it is not sensitive to the points in adjacentoverlapping and background noise area In Figure 1 a two-dimensional benchmark dataset is designed to show thisfact This dataset contains six well-isolated groups three ofwhich have nonhomogeneous internal structure We use thisdataset to test the performance of a clustering algorithm onidentifying clusters when they are completely isolated andsomewhat in touching As we can see from the figure the dis-similarity increments clustering method grouped the pointsinto six clusters which is consistentwith first glance intuitionHowever the clustering result also shows that this method isnot applicable in three cases which are the yellow cluster inthe upper half of Figure 1 and the red and green clusters inthe right half of Figure 1 The case of yellow forks representstwo adjacent clusters the case of red forks represents twooverlapping clusters and the case of green forks representsa cluster under background noise

22 The Density Distribution within a Cluster Consideringthe six identified clusters in Figure 1 we could find that thepointsrsquo density distribution within a cluster could be quitedifferent from one another Specifically the pointsrsquo densityof the three circle-shaped clusters in the bottom left part of

7

6

5

4

3

220 25 30 35 40 45 50

Figure 1 Result generated by dissimilarity increments clusteringmethod

Figure 1 is homogeneous while the remaining three clustersare nonhomogeneous Nonhomogeneous means that thepointsrsquo density does not change continuously and smoothlybut heavily with a clear boundary of two touching clustersSo a mechanism could be designed to identify potential sub-clusters within a given cluster based on the nonhomogeneousor heterogeneous distribution of density

The first question is how to define andmeasure density Inconvention the concept of pointsrsquo density refers to the num-ber of points in unit area But just as it is mentioned in Back-ground and Related Work (see Section 1) Euclidean notationof density would have trouble with high-dimensional datasetand cannot identify clusters when their densities vary widelyThe key idea to address these two problems is to associatedensity with each point and its surrounding context andmoreover to associate isolation criterion with pointsrsquo countdistribution rather than absolute values In this paper thedensity around point 119909

119894is defined as the reciprocal of the

centroid distance of 119909119894rsquos 119899 nearest neighbours just as formula

(1) shows In this formula Distance(sdot) is a defined functionto output the distance of two points set 119883 is a collection of119909119894rsquos 119899 nearest neighbour points 119909

119898refers to the point which

has the largest distance to 119909119894in set 119883 and Centroid(sdot) is a

function to calculate the centroid point of a given point setIntuitively the point which lies in high density areawill have asmall centroid distance and thus have a large value of densityaround

Density (119909119894) =

1Centroid Distance (119909

119894)

=1

(119899 minus 1) times Distance (119909119898Centroid (119883 minus 119909

119898))

(1)

A more concrete example of centroid distance is thetwo-dimensional case shown in Figure 2 in which 119901

0is the

target point and 1199011

sim 1199014

are 1199010rsquos 4 nearest neighbour

points among the given dataset With the help of the defined

4 Computational Intelligence and Neuroscience

x

y

p4 p3

p1

p2

p0

p5

Figure 2 Centroid distance of point 1199010

function Distance(sdot) we could find that compared with linesegmentations 119897

11990101199011 11989711990101199012

and 11989711990101199013

the distance of 1199010and 119901

4

say 11989711990101199014

is the largest So if 1199015is the centroid point of triangle

119901111990121199013 then 3119897

11990141199015is the centroid distance of 119901

0 Therefore

the density around point 1199010is 1(3119897

11990101199015) Considering the

correlation between centroid distance and density wewill usethe value of centroid distance directly to describe density inthe remainder of this paper

Based on the analysis above the pointsrsquo densities in cyancircle-shaped cluster and blue circle-shaped cluster inFigure 1 are analysed as Figures 3(a) and 3(b) the pointsrsquodensities in red forks cluster and green forks cluster are anal-ysed as Figures 4(a) and 4(b) The horizontal axis in thesefigures represents normalized centroid distance while thevertical axis represents the number of points ComparingFigure 4 with Figure 3 some law could be found The densitydistribution of cyan circle-shaped cluster and blue circle-shaped cluster which are homogeneous has only one peakas what is shown in Figure 3 In contrast there are at leasttwo apparent peaks on the density distribution curve ofred forks and green crosses clusters whose densities arenonhomogeneous as what is shown in Figure 4Therefore ananalogy can be drawn that the centroid distance distributioncurve of a given cluster would have more than one peakif heterogeneity exists Furthermore based on this analogythe centroid distance values corresponding to the valleys oncentroid distance distribution curvewhich hasmore than onepeak could be seen as a new isolation criterion

23 Centroid Distance Isolation Criterion Based on Nonho-mogeneous Density In order to identify different densitydistributions within a cluster we assume that its centroid dis-tance distribution obeys Gaussian Mixture Models (GMMs)as long as heterogeneity exists More specifically if there are119899 valleys on density distribution curve then for point 119909

119894

119901(Centroid Distance(119909119894)) obeys a GMM consisting of 119899 + 1

Gaussian distribution components as shown in the followingformula in which

119899+1sum

119894=1120587119894= 1

119873119894(119909 | 120583

119894 120590119894) =

1radic2120587120590

119894

exp [minus 12120590119894

(119909 minus 120583119894)2]

(2)

119901 (Centroid Distance (119909119894))

=

119899+1sum

119894=1120587119894119873119894(Centroid Distance (119909

119894) | 120583119894 120590119894)

(3)

Based on the GMM assumption we used EM algorithmto derive two sets of parameters 120587

119894 120583119894 and 120590

119894for the red

forks and green forks clusters in Figure 1 The results areshown in Figures 5(a) and 5(b) where the dashed-line curverepresents high density area and the dashed-dot curve rep-resents the other area Therefore the components of a GMMcould be derived from a given cluster whose centroid distancedistribution curve has at least one valley Specifically the 119909values of the intersection points of different Gaussian distri-butions in a GMM could be seen as isolation criterion

In terms of efficiency the complexity of EM algorithmdepends on the number of iterations and the complexity ofE and M step which is seriously related with cluster sizeIn order to guarantee the efficiency of isolation criterionrsquoscomputation we designed a more simple algorithm whichcould reduce the computational complexity to 119874(119899) where119899 is the number of points in a given cluster For the nextparagraph we will describe the thought of simplification

Through the observation of Figure 6 which demonstratesa comparison of GMM and centroid distance distributioncurve we could find that the 119909 values of the lowest pointof the valley on centroid distance distribution curve andthe intersection point of two Gaussian distributions arealmost identical So the task of identifying a GMM can beconverted into identifying the valleys on a centroid distancedistribution curve Intuitively if a valley is deep enoughthe corresponding centroid distance of the lowest point willbe a good partitioning value The concept of derivation isthen utilized to reflect this intuition here Figure 7 illustratesthe derivative of the centroid distance distribution curvesin Figure 6 The derivative segmentation corresponding toa peak-valley-peak segmentation on a density distributioncurve must satisfy two requirements The first is that it hasto cross zero point of vertical axis which means that there isindeed a valley on centroid distance distribution curve thereOn the premise of meeting this requirement the derivationsegmentation still needs to be long enough which meansthat the valley has enough depth to be a good isolationvalueThe dashed-line segmentations in Figure 7 satisfy thesetwo requirements and the corresponding centroid distancevalues are 6 and 8 which are nearly identical with the 119909 valuesof the intersection points of two Gaussian distributions inFigure 5

Computational Intelligence and Neuroscience 5

Histogram of centroid distance

10

9

8

7

6

5

Num

ber o

f poi

nts

0 1 2 3 4 5 6

Normalized centroid distance

(a)

Histogram of centroid distance

Num

ber o

f poi

nts

Normalized centroid distance

14

12

10

8

6

4

0 2 4 6 8 10 12 14

(b)

Figure 3 Centroid distance histogram of two homogeneous clusters

Normalized centroid distance

22

20

18

16

14

12

10

8

0 10 20 30 40 50

Histogram of centroid distance

Num

ber o

f poi

nts

(a)

20

18

16

14

12

10

8

Normalized centroid distance0 10 20 30 40 50

Histogram of centroid distance

Num

ber o

f poi

nts

(b)

Figure 4 Centroid distance histogram of two heterogeneous clusters

Based on the analysis above a nonhomogeneous densitydetection algorithm is proposed to carry potential partitionswithin a given cluster This algorithm first uses crossing-zero index to filter optional partitioning values from allpoints and then measures the angles on either side of thispoint on centroid distance distribution curve to evaluate

the significant level of the isolation criterion A schematicdescription is as shown in Algorithm 1

In our nonhomogeneous density detection algorithmone parameter 119899 which is the number of points used to cal-culate centroid distance still needs to be decided In orderto give a determination policy of 119899 let us consider three

6 Computational Intelligence and Neuroscience

Input119873 samples of a certain cluster 119899 (the number of samples used to calculte centroid distance)Output patition values if neccesarySteps(1) Set Paritioning Points = 0 threshold = tan 45∘ = 1 the 119894th sample is 119878

119894

(2) Calculate the centroid distance for every sampleCentroid Distance(119904

119894) = (119899 minus 1)Distance(119904

119895 Centroid(119878

119894119899minus 119904119895))

119878119894119899

is the collection of 119899 nearest samples to 119904119894 119904119895is the sample which has the largest distance to 119904

119894in 119878119894119899

get histogram data (119909119894 119910119894) about centroid distance array 119894 = 1 2 lfloor11987310rfloor

(3) Set 119894 = 2(4) If 119894 == lfloor11987310rfloorThen stop and return the points in Paritioning PointsElse continue(5) If 119910

119894lt 119910119894minus1 and 119910

119894lt 119910119894+1

Then119895 = 119894tan 1 = 0 tan 2 = 0While 119895 gt 1 and 119910

119895lt 119910119895minus1

If tan 1 lt ((119910119895minus1 minus 119910119895)(119909119895 minus 119909119895minus1))

Then tan 1 = ((119910119895minus1 minus 119910119895)(119909119895 minus 119909119895minus1))

119895 = 119895 minus 1

119895 = 119894While 119895 lt lfloor11987310rfloor and 119910

119895lt 119910119895+1

If tan 2 lt ((119910119895+1 minus 119910119895)(119909119895+1 minus 119909119895))

Then tan 2 = ((119910119895+1 minus 119910119895)(119909119895 minus 119909119895minus1))

119895 = 119895 minus 1

If tan 1 gt threshold and tan 2 gt thresholdThen

Paritioning Points = Paritioning Points cup 119904119894

Go to Step (6)Else continue

(6) 119894 = 119894 + 1Go to Step (4)

Algorithm 1

Table 1 First- and second-level nearest points in uniformly dis-tributed space

Dimensions One Two ThreeFirst-level nearest points 2 4 6Second-level nearest points 2 4 12

concrete examples in Figures 8(a) 8(b) and 8(c) which rep-resent uniformly distributed points in one- two- and three-dimensional space respectively Uniformly distributed pointsmeans for a given point there exist two nearest equidis-tant points on every dimension In our examples Euclid-ean distance is used and the value of nearest equal distanceis 119903 Further investigation tells us that the change of distancefrom a given point is not continuous but discrete In Figure 8for the central yellow point the first-level nearest points aremarked in red and the second-level nearest points aremarked in blue The three subfigures are summarized inTable 1 based onwhich formula (4) is put forward to calculatethe number of 119896-level nearest points in 119889-dimensional space(119896 le 119889) More specifically when 119896 equals 1 formula (4) isreduced to be the number of first-level nearest points whichis 2119889 We believe that the number of first-level nearest

points is sufficient for centroid distance computation inuniformly distributed dataset In reality however data canhardly be uniformly distributed so in order to guarantee theavailability of centroid distance to reflect nonhomogeneousdensity we multiply the first-level nearest pointsrsquo numberby 2 Formula (5) finally gives the policy to determine 119899 innonhomogeneous density detection algorithm according tothe dimension of data set

119899 = 119862119896

1198892119896 (4)

119899 = 4119889 (5)

24 The Integration of Dissimilarity Increment and CentroidDistance Criteria Applying nonhomogeneous density detec-tion algorithm after using dissimilarity increments clusteringmethod in other words taking dissimilarity increments andcentroid distance as an isolation criterion successively anew clustering algorithm named localized ambient solidityseparation algorithm (LASS) is developed and the clusteringresult is obtained Just as demonstrated in Figure 9 exceptfor the perfect partition of naturally isolated clusters theirinternal structure has also been explored and points arepartitioned further if necessary The yellow red and green

Computational Intelligence and Neuroscience 7

20

15

10

5

00 10 20 30 40 50

x

Gaussian distribution 1Gaussian distribution 2

yaft

er b

eing

scal

ed

(a)

20

15

10

5

00 10 20 30 40 50

x

Gaussian distribution 1Gaussian distribution 2

yaft

er b

eing

scal

ed

(b)

Figure 5 GMMs derived by EM algorithm from two heterogeneous clusters

20

15

10

5

00 10 20 30 40 50

x

Gaussian distribution 1Gaussian distribution 2

Centroid distance

yaft

er b

eing

scal

ed

(a)

20

15

10

5

00 10 20 30 40 50

x

yaft

er b

eing

scal

ed

Gaussian distribution 1Gaussian distribution 2

Centroid distance

(b)

Figure 6 Comparison of GMM and centroid distance distribution curve

clusters in Figure 1 are divided into two subclusters furtheraccording to their nonhomogeneous density distributionTherefore our LASS algorithm can handle clusters of arbi-trary shape which are isolated adjacent overlapping and

under background noise Moreover compared with the tra-ditional notation of density which is the number of pointsin unit Euclidean volume our proposed centroid distanceisolation criterion works well in high-dimensional space

8 Computational Intelligence and Neuroscience

3

2

1

0

minus1

minus2

minus3

Derivative of centroid distance

Num

ber o

f poi

nts

Normalized centroid distance0 10 20 30 40 50

(a)

Derivative of centroid distance

3

2

1

0

minus1

minus2

minus3

Num

ber o

f poi

nts

Normalized centroid distance0 5 10 15 20 25 30 35 40

(b)

Figure 7 Centroid distance derivative of two heterogeneous clusters

x

2rr0minus2r minusr

(a)

y

r

radic2r

(b) (c)

Figure 8 Uniformly distributed points in one- two- and three-dimensional space

7

6

5

4

3

220 25 30 35 40 45 50

Figure 9 Result generated by our LASS algorithm

actually it is evenmore sensitive as dimension increases Alsocompared with direct similarity centroid distance isolationcriterion takes into account the surrounding context of eachpoint by using its nrsquos nearest points and depends on thehistogram distribution instead of the exact absolute value ofsimilarity So it can automatically scale according to the den-sity of points All in all integrated dissimilarity incrementsand centroid distance isolation criteria together our LASSalgorithm can achieve broader applicability especially on thedataset with high dimension and diverse distribution shape

3 Computer User Segmentation

In this section our proposed LASS algorithm is applied oncomputer users dataset which contains their demographicand behaviour information To accomplish this we first

Computational Intelligence and Neuroscience 9

cleaned the raw data and extracted 7 features to characterizecomputer users Then the cleaned data is normalized and adissimilarity measurement is defined On the basis of thesethe original dissimilarity increments clustering algorithmand our LASS algorithm are applied on the dataset respec-tivelyThe clustering processes are analysed and the effective-ness of results is verified At last the segmentation result ofcomputer users is analysed and summarized

31 Data Cleaning and Features Selection The raw dataprovided by CNNIC contains two kinds of informationTheyare 1000 computer usersrsquo personal attributes and their com-puter using log files Specifically personal attributes includea volunteerrsquos gender birthday education level job typeincome level province of residence city of residence andtype of residence while computer using log files record these1000 volunteersrsquo computer interaction behaviours in 7 daysincluding start time end time websites browsing history andprograms opening history

Although many features could be extracted from rawdata we focus our attention on volunteersrsquo natural attributesas persons and their fundamental behavioursrsquo statistical indi-cators but ignore environmental and geographic factors suchas job type province of residence city of residence and resi-dence type The reason behind this is that we regard Internetas a strength which has broken down geographic barrierTherefore we assume that environmental and geographic fac-tors are no longer crucial influence factors in Internet worldFrom this point of view we extracted 7 features to profilecomputer users Taking the 119894th computer user 119906

119894as a concrete

example these extracted features are described inTable 2Thedata of volunteers whose value of Times(sdot) is less than 4 arecleared out and 775 sample data are left

32 Data Normalization and Dissimilarity MeasurementData normalization is needed before applying our LASS algo-rithm The reason to do so is that similarity measurement isusually sensitive to differences inmean and variability In thispaper two kinds of normalization are used as expressed informulas (6) and (7) respectively In formula (6) 119898

119895and 119904119895

are the mean and standard deviation of feature 119895 Throughthis transformation feature 119895 will have zero mean and unitvariance While in formula (7) function Rank(sdot) returns theranked number of119909lowast

119894119895in feature 119895 data sequenceTherefore the

transformed data will have a mean of (119899+1)2 and a varianceof (119899 + 1)[(2119899 + 1)6 minus (119899 + 1)4] where 119899 is the number ofdata Related study has shown that on the performance ofclustering formula (7) outperforms formula (6) particularlyin hierarchical clusteringmethods formula (7) ismore robustto outliers and noise in dataset [28]

119909119894119895=119909lowast

119894119895minus 119898119895

119904119895

(6)

119909119894119895= Rank (119909lowast

119894119895) (7)

In this paper for continuous variablersquos normalizationsuch as bootDuration(sdot) and visitingDuration(sdot) formulas (7)

Table 2 Description of computer users features

Variables Descriptions

Gender (119906119894)

The gender of 119906119894 discrete variable

1 stands for male0 stands for female

Age (119906119894) The age of 119906

119894 discrete variable between 10

and 70

Edu (119906119894)

The education level of 119906119894 discrete variable

0 below primary school1 junior school2 senior school3 junior college4 bachelor degree5 others

Income (119906119894)

The monthly income level of 119906119894 discrete

variable0 no income1 below 500 Yuan2 501ndash1000 Yuan3 1001ndash1500 Yuan4 1501ndash2000 Yuan5 2001ndash3000 Yuan6 3001ndash5000 Yuan7 5001ndash8000 Yuan8 8001ndash12000 Yuan9 others

Times (119906119894) Boot times of 119906

119894rsquos computer discrete

variableBooting Duration(119906119894)

The duration of 119906119894using computer

continuous variable

Brows Duration (119906119894) The duration of 119906

119894browsing websites

continuous variable

and (6) are used successively while for discrete variablersquosnormalization such as Gender(sdot) Age(sdot) and Edu(sdot) onlyformula (6) is used

After normalization a dissimilarity index is defined tomeasure the distance between different data As formula (8)shows it is a form of 1-normsrsquo sum where 119891

119894119899stands for the

value of 119894th datarsquos 119899th feature

Dissimilarity (119906119894 119906119895) =

7sum

119899=1

10038161003816100381610038161003816119891119894119899minus119891119895119899

10038161003816100381610038161003816 (8)

33 Computer Users Segmentation Process Our proposedLASS algorithm is applied for the segmentation of computerusers in this sectionThewhole segmentation process consistsof two parts Part I is the dissimilarity increments basedclustering strategy (for details please refer to Section 3 in[27]) which aims to find natural isolated clusters part II isour proposed centroid distance based clustering strategy (fordetails please refer to Section 23 in this paper) whose goalis to explore the internal structure of every cluster generatedby part I and identify potential subclusters that are adjacentoverlapping and under background noise

The clustering process is partly shown in Figure 10 wherethree representative clusters obtained in part I strategy arechosen to be demonstrated Further exploration is carried

10 Computational Intelligence and Neuroscience

Normalized centroid distance

15

10

5

0

Histogram of centroid distance

Num

ber o

f poi

nts

20 22 24 26 28 30 32 34

(a)

Normalized centroid distance

Histogram of centroid distance

Num

ber o

f poi

nts

20 25 30 35 40 45 50 55

14

12

10

8

6

4

2

0

(b)

Normalized centroid distance

Histogram of centroid distance

Num

ber o

f poi

nts

6

4

2

05 6 7 8 9 10

16

14

12

10

8

(c)

Figure 10 Centroid distance histogram of three clusters

out by part II strategy of LASS algorithm and a partitionvalley is found in cluster 2 as shown in Figure 10(b) Nextthe horizontal axis value of the lowest point on this valley canbe acquired as a further isolation criterion based on whichcluster 2 will be divided into two subclusters Figure 11 showsa comparison of the GMM generated by EM algorithm andcentroid distance distribution curve of cluster 2 Despite thedifferences between these two graphsrsquo shapes the acquiredtwo isolation criteria are nearly the same which validates oursimplification of GMMrsquos computation

34 Segmentation Results Analysis and Discussion The seg-mentation results generated by the original dissimilarityincrements method and our LASS algorithm are demon-strated in Tables 3 and 4 These two tables list the prototypessummarized from the obtained clusters As it is shown thesixth cluster in Table 3 is divided into two subclusters thesixth and seventh cluster in Table 4The reason of this furtherpartition as analyzed in Section 33 is the existence of a deepenough valley on cluster 6rsquos centroid distribution curve (asshown in Figure 10(b)) which implies the existence of twodifferent density areas within cluster 6 in Table 3

To understand this process some investigation shouldbe made about the relationship between Tables 3 and 4 InTable 3 cluster 6 is the largest group of all clusters whosegender proportion is almost 50 However an intuitive senseof behavior tells us that behavior mode should be seriouslyaffected by peoplersquos gender This intuition is proved by thefirst 5 clusters in Table 3 to some extent in which genderproportion is 100 male The reason why cluster 6 has not

Normalized centroid distance

Histogram of centroid distance

Num

ber o

f poi

nts

20 25 30 35 40 45 50

14

12

10

8

6

4

2

0

Centroid distanceGaussian distribution 1Gaussian distribution 2

Figure 11 Comparison of GMM and centroid distance distributioncurve

Computational Intelligence and Neuroscience 11

Table 3 Results generated by dissimilarity increments clustering method

Segment Size Gender Age Education level Income levelComputer using

frequency(timesweek)

Computerusing time

(hoursweek)

Websitebrowsing time(hoursweek)

1 24 Male 100Female 0 28 Junior college 2001ndash5000 Yuan 67 675 06

2 35 Male 100Female 0 24 Bachelor degree 0ndash500 Yuan 65 448 44

3 58 Male 100Female 0 41 Junior college 3001ndash5000 Yuan 57 447 29

4 70 Male 100Female 0 32 Senior school 2001ndash3000 Yuan 6 33 31

5 185 Male 100Female 0 32 Bachelor degree 2001ndash5000 Yuan 59 39 67

6 352 Male 42Female 58 32 Junior college 1501ndash3000 Yuan 65 421 51

Table 4 Results generated by our LASS algorithm

Segment Size Gender Age Education level Income levelComputer using

frequency(timesweek)

Computerusing time

(hoursweek)

Websitebrowsing time(hoursweek)

1 24 Male 100Female 0 28 Junior college 2001ndash5000 Yuan 67 675 06

2 35 Male 100Female 0 24 Bachelor degree 0ndash500 Yuan 65 448 44

3 58 Male 100Female 0 41 Junior college 3001ndash5000 Yuan 57 447 29

4 70 Male 100Female 0 32 Senior school to

junior college 2001ndash3000 Yuan 6 33 31

5 185 Male 100Female 0 32 Bachelor degree 2001ndash5000 Yuan 59 39 67

6 136 Male 07Female 993 30 Junior college to

bachelor degree 1501ndash3000 Yuan 59 378 31

7 216 Male 681Female 329 33 Junior college 1001ndash2000 Yuan 69 448 63

been divided further apart by the dissimilarity incrementsclustering method is that there may exist much touchingareas in high-dimensional space of cluster 6 under whichsituation the dissimilarity increments clusteringmethod doesnot work anymore While our proposed centroid distancebased nonhomogeneous density detection algorithm hasfound that there still exist two potential subgroups withincluster 6 in Table 3 which are identified as clusters 6 and7 in Table 4 these two clusters are different in gender ageand computer using behaviors Cluster 6 is almost totallycomposed of women who spend less time on computer andwebsites browsing while in cluster 7 men are twice as muchas women who are older than people in cluster 6 and spendmuch more time on computers especially on browsing

In order to quantify the overall effectiveness of our LASSalgorithm a between group sum of dissimilarities (SDB) iscalculated as formula (9) which is the sumof the dissimilaritybetween a cluster centroid 119888

119894 and the overall centroid 119888

of all the data In this formula 119870 is the number of clusters

Table 5 Total SDB of two clustering methods

MethodDissimilarityincrements

clustering methodOur LASS algorithm

Total SDB 853 1109

and 119899119894is the number of points in cluster 119894 The higher the

total SDB is achieved the more adjoint the identified clustersare So it could be used to measure the effectiveness of aclustering methodThe total SDB of the original dissimilarityincrements clustering method and our LASS algorithm onthe given dataset are shown in Table 5 Obviously our LASSalgorithm achieves larger total SDB more specifically 30larger thus it fits for the given computer user dataset better

In terms of the evaluation of individual clusters silhouettecoefficient is used here whose value varies between minus1 and 1A positive value of silhouette coefficient is desirable

12 Computational Intelligence and Neuroscience

Table 6 The silhouette coefficients of clusters

Clusters Cluster 6in Table 3

Cluster 6in Table 4

Cluster 7 inTable 5

Silhouettecoefficient minus034 002 minus041

As Table 6 shows the silhouette coefficient value of cluster 6in Table 3 is negative which implies that the inside cohesionand outside separation of the cluster are not good So cluster6 in Table 3 could not be seen as a typical cluster whilethrough our LASS algorithm cluster 6 in Table 3 is identifiedas two individual clusters one of whose silhouette coefficientsis positive So as to cluster 7 whose silhouette coefficient isstill negative we guess that it belongs to some kind of back-ground noise This will be discussed later As for cluster 6 inTable 4 we believe that it is a typical prototype of Chinesefemale computer users which has not been revealed inTable 3 Therefore compared with the original dissimilarityincrements clustering method our LASS algorithm can gainmore knowledge and understanding from computer userdataset

Total SDB =

119870

sum

119894=1119899119894Dissimilarity (119888

119894 119888) (9)

Further Kruskal-WallisHTest is applied on the clusters inTable 4 to test the difference between two ormore clusters of agiven dimension As a nonparametric test method Kruskal-Wallis H Test is typically used to determine if there are sta-tistical significance differences between two or more groupsof an independent variable The results are shown in Tables 7and 8 In the hypothesis tests of Table 7 original hypothesisis that the distributions of a given variable in all 7 clusters areidentical and alternative hypothesis is that the distributionsof a given variable in all 7 clusters are not identical While inthe hypothesis tests of Table 8 original hypothesis is that thedistributions of a given variable in a given pair of clusters areidentical and alternative hypothesis is that the distributionsof a given variable in a given pair of clusters are not identicalThe 119901 values are listed and marked by star if they are biggerthan 005 whichmeans accepting the original hypothesis andrejecting the alternative one For the cases in which 119901 valueis below 005 the smaller the 119901 value is the more statisticallysignificant the variablersquos difference is In Table 7 all of the 119901values are below 0002 which means for any given variableits distributions are extremely different among the sevenclusters in Table 4Therefore we can draw the conclusion thatthese seven variables perform well in identifying differentgroups of computer users While in Table 8 119901 value changesa lot according to the given pair of clusters and variable Thesignificance of these seven variables to distinguish differentpair of clusters will be discussed one by one combined withTable 9 which reveals the detailed demographic and com-puter interaction behaviours characteristics of the obtainedseven computer users clusters

Segmentation results will be analysed from the perspec-tive of variables with the help of Table 8 and Tables 9 and 4

and significant characteristics will be pointed out For thevariable of gender Table 8 tells us that its distributions in thefirst five segments are identical which is proved to be 100male in Table 9 The most significant difference of genderlies among segments 1ndash5 segment 6 and segment 7 whichrepresentsmale groups female group andmix-gender grouprespectively For the variable of age Table 8 reveals that itsdistribution among segments 4ndash7 could be seen as identicalthemain difference happens between the first three segmentsCombined with Tables 9 and 4 we could find that segment 2consists of the youngest members whose age is around 24Segment 1 is a little bit elder group whose average age isaround 28 While segment 3 is a middle-aged group with anaverage age of 41 they are much older than other segmentsSo as to the variable of education level it discriminatesdifferent segments well Its distribution in segments 2 and 5could be seen as identical that has the highest education levelbachelor degree while the people from segment 4 have thelowest education level Other segments differ from oneanother For the variable of income level segment 1 earns thehighest income while segment 2 earns the lowest one Theincome level of segments 3 and 5 could be seen as identicalso it is with segments 4 and 6 And the former tworsquos incomeis lower than the latter tworsquos In the terms of computer usingfrequency the segments could be divided into two groupsthey are segments 1 2 and 7 and segments 3ndash6 The formergroup uses computer more frequently As for the variableof computer using time it discriminates segments 1 and 4well that spend the most and the least time on computerrespectively while for the remaining 5 segments no signifi-cant difference exists among their computer using time Forthe last variable website browsing time its distribution insegments 2 3 4 and 6 could be seen as identical differencemainly lies among segments 1 5 and 7 Specifically segment1 spends the least time on website browsing while segment 5spends the most and the browsing time of segment 7 falls inbetween segment 1 and segment 5

Based on the analysis above the 7 segments obtained byour LASS algorithm are summarized and discussed belowrespectively

Category 1 (little-browsing group) This group is entirelycomposed of young men who received a high educationlevel and earn a decent income The most significant featureof the people in this group is that although they spendthe most time on computers compared with other groupsthey seldom visit webpages We guess that for this group ofpeople the computer interaction behaviours mainly happenin workplace or public where personal browsing is notencouraged

Category 2 (little-income group) This group is composedof the youngest people who are purely male and have thehighest education level The most significant feature of thisgroup of people is that they have the same income level whichis no income Additionally they spend relatively more timeon computers and browsing websites We guess that the mainbody of this group is college students in progress who havelots of free time but no source of revenue

Computational Intelligence and Neuroscience 13

Table 7 119901 values of features among all clusters

Variables Gender Age Education level Income level Computer using frequency Computer using time Website browsing time119901 value lt0002 lt0002 lt0002 lt0002 lt0002 lt0002 lt0002

Table 8 119901 values of features between two pairs of clusters

Variables Pair of segments1-2 1-3 1-4 1-5 1-6 1-7 2-3

Gender gt05lowast gt05lowast gt05lowast gt05lowast lt0002 lt0002 gt05lowast

Age lt0005 lt0002 gt005lowast lt005 gt02lowast gt005lowast lt0002Education level lt0002 gt05lowast lt0002 lt0002 gt005lowast gt05lowast lt0002Income level lt0002 gt02lowast gt01lowast gt05lowast gt005lowast lt0002 lt0002Computer using frequency gt02lowast lt0002 lt0005 lt0002 lt0002 gt01lowast lt0005Computer using time lt0002 lt0002 lt0002 lt0002 lt0002 lt0002 gt05lowast

Website browsing time lt0002 lt005 lt001 lt0002 lt005 lt0002 gt01lowast

Variables Pair of segments2-4 2-5 2-6 2-7 3-4 3-5 3-6

Gender gt05lowast gt05lowast lt0002 lt0002 gt05lowast gt05lowast lt0002Age lt0002 lt0002 lt0002 lt0002 lt0002 lt0002 lt0002Education level lt0002 gt05lowast lt0002 lt0002 lt0002 lt0002 lt0005Income level lt0002 lt0002 lt0002 lt0002 lt0005 gt01lowast lt0002Computer using frequency lt005 lt0005 lt0002 gt05lowast gt02lowast gt02lowast gt05lowast

Computer using time lt001 gt02lowast gt01lowast gt05lowast lt0002 gt01lowast lt005Website browsing time gt02lowast gt005lowast gt005lowast gt05lowast gt05lowast lt0002 gt05lowast

Variables Pair of segments3-7 4-5 4-6 4-7 5-6 5-7 6-7

Gender lt0002 gt05lowast lt0002 lt0002 lt0002 lt0002 lt0002Age lt0002 gt05lowast gt01lowast gt05lowast lt002 gt02lowast gt01lowast

Education level gt02lowast lt0002 lt0002 lt0002 lt0002 lt0002 lt0002Income level lt0002 lt005 gt05lowast lt002 lt0002 lt0002 lt002Computer using frequency lt0002 gt05lowast gt02lowast lt0005 gt05lowast lt0002 lt0002Computer using time gt05lowast lt001 gt005lowast lt0002 gt02lowast gt01lowast lt005Website browsing time lt0005 lt0002 gt02lowast lt002 lt0002 lt005 lt0002

Category 3 (high-income group) This group of people isentirely middle-aged menThemost significant feature of thepeople in this group is the highest income they earn Besidesthey spend relatively less time on computer interaction interms of both using frequency and total browsing time Weguess that for the middle-aged men in this group most ofwhom have not received a higher education computers orInternet is not so necessary in their daily life

Category 4 (low-education group) This group is entirelycomposed of young men whose age is older than Categories1 and 2 The most significant feature of the people in thisgroup is their low-education level the average of which issenior school ranging from junior school to junior collegeMoreover they earn a medium level income and get smallervalues on every computer interaction index We guess that

this group of people is mainly engaged in jobs independentof computers

Category 5 (much-browsing group) The structure of thisgroup is very similar to Category 4 except for the highereducation they received say bachelor degree As it is shownpeople in this group earn more we guess that education dif-ference may account for this Also compared with othercategories especially Category 4 this group of people spendsmuch more time on browsing websites We guess that themain job types of this group could be intellectual work thusthey have close access to online computers

Category 6 (young-women group) Female accounts fornearly 100 in this group which is the only case in these 7categories However from computer interaction aspects say

14 Computational Intelligence and Neuroscience

Table 9 Demographic and behaviour description of computer user segmentations

Demographic and computerinteraction behaviourscharacteristics

Segment 1 Segment 2 Segment 3 Segment 4 Segment 5 Segment 6 Segment 7 Total

GenderMale 100 100 100 100 100 07 681 718Female 0 0 0 0 0 993 319 282

Age10sim20 0 0 0 0 0 22 120 4020sim25 42 686 68 143 64 125 176 14625sim30 625 229 34 242 341 419 162 27230sim35 333 86 138 286 276 213 162 21235sim40 0 0 155 229 162 147 102 13440sim50 0 0 379 10 135 59 181 14050sim60 0 0 206 0 22 14 83 5060sim70 0 0 17 0 0 0 14 06

Education levelBelow primary school 0 0 0 0 0 0 14 04Junior school 0 0 17 29 0 22 144 51Senior school 0 0 259 743 05 199 269 211Junior college 100 20 551 229 265 331 199 298Bachelor degree 0 714 172 0 730 412 287 398Others 0 86 0 0 0 37 88 37

Income levelNo income 0 914 0 0 0 51 241 126Below 500 Yuan 0 57 0 0 0 07 32 14501ndash1000 Yuan 0 29 17 0 16 29 46 261001ndash1500 Yuan 0 0 52 114 43 59 97 661501ndash2000 Yuan 125 0 103 229 108 176 83 1202001ndash3000 Yuan 292 0 241 243 286 324 162 1253001ndash5000 Yuan 458 0 207 329 395 235 157 2565001ndash8000 Yuan 125 0 172 85 119 66 83 948001ndash12000 Yuan 0 0 121 0 32 44 28 34Others 0 0 0 0 0 07 69 29

Computer using frequencyMean 67 65 57 60 59 587 69 63Variance 004 020 033 018 025 040 14 068

Computer using timeMean 675 447 447 330 394 378 448 417Variance 023 088 131 031 070 083 131 099

Website browsing timeMean 064 443 288 312 67 31 63 495Variance 043 089 089 080 083 096 101 099

using frequency and browsing time this group is very similarto Category 4 So we guess that these two groups of peoplehave similar type of job or similar working circumstanceMoreover although these young women have a higher edu-cation level thanmen in Category 4 they do not earn a better

salaryWe guess that this phenomenonmay be due to the lackof career experience and gender discrimination

Category 7 (noise group) This category is the only gendermixed group in which men are twice as much as women

Computational Intelligence and Neuroscience 15

However in terms of age education level and income levelthis category shows no significant difference compared withtotal population And as for the variables of computer usingfrequency computer using time and website browsing timetheir variances are fairly large even bigger than the overallvariances So due to the dispersed distribution of this categoryon every dimension we believe that it is a noise group

4 Conclusion

In this paper we proposed a new clustering algorithm namedlocalized ambient solidity separation (LASS) algorithmThis algorithm is built on a new isolation criterion called cen-troid distance which is used to detect the nonhomogeneousdensity distribution of a given cluster The proposed isolationcriterion is based on the recognition that if there existnonhomogeneous densities within a cluster then partitionsshould be carried out The intuition behind this recognitionis GMM assumption of the pointsrsquo centroid distance value ina cluster EM algorithm was used to derive the componentsand parameters of a GMM Additionally in order to makethe algorithmmore efficient we designed a nonhomogeneousdensity detection algorithm to reduce computation com-plexity to 119874(119899) where 119899 is the number of points for cluster-ing Moreover the parameter determination policy of non-homogeneous density detection algorithm is investigatedFinally we integrated our designed nonhomogeneous densitydetection algorithm as a follow-up mechanism with theoriginal dissimilarity increments clustering method anddeveloped LASS algorithm It is demonstrated that comparedwith the original dissimilarity increments clustering methodour LASS algorithm not only can identify naturally isolatedclusters but also can identify the clusters which are adjacentoverlapping and under background noise


Figure 2: Centroid distance of point $p_0$.

function Distance(·), we could find that, compared with the line segments $l_{p_0p_1}$, $l_{p_0p_2}$, and $l_{p_0p_3}$, the distance between $p_0$ and $p_4$, say $l_{p_0p_4}$, is the largest. So if $p_5$ is the centroid point of triangle $p_1p_2p_3$, then $3l_{p_4p_5}$ is the centroid distance of $p_0$. Therefore, the density around point $p_0$ is $1/(3l_{p_4p_5})$. Considering the correlation between centroid distance and density, we will use the value of centroid distance directly to describe density in the remainder of this paper.
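To make the definition concrete, the following is a minimal sketch of the centroid distance computation described above, assuming Euclidean distance and a NumPy array of points; the function name and array layout are our own illustration, not code from the original method.

```python
import numpy as np

def centroid_distance(points, i, n):
    """Centroid distance of points[i]: among its n nearest samples S_i^n,
    let s_j be the farthest one; the value is (n - 1) times the distance
    between s_j and the centroid of the remaining n - 1 samples."""
    dists = np.linalg.norm(points - points[i], axis=1)
    order = np.argsort(dists)
    neighbours = order[1:n + 1]        # n nearest samples (index 0 is the point itself)
    s_j = neighbours[-1]               # farthest of the n nearest
    rest = points[neighbours[:-1]]     # S_i^n minus s_j
    centroid = rest.mean(axis=0)       # plays the role of p5 in Figure 2
    return (n - 1) * np.linalg.norm(points[s_j] - centroid)
```

For $p_0$ in Figure 2 with $n = 4$, the four nearest samples are $p_1, \dots, p_4$, the farthest of them is $p_4$, the centroid of the remaining three is $p_5$, and the returned value is $3l_{p_4p_5}$.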

Based on the analysis above, the points' densities in the cyan circle-shaped cluster and blue circle-shaped cluster in Figure 1 are analysed as in Figures 3(a) and 3(b); the points' densities in the red forks cluster and green forks cluster are analysed as in Figures 4(a) and 4(b). The horizontal axis in these figures represents normalized centroid distance, while the vertical axis represents the number of points. Comparing Figure 4 with Figure 3, some law could be found. The density distribution of the cyan circle-shaped cluster and blue circle-shaped cluster, which are homogeneous, has only one peak, as shown in Figure 3. In contrast, there are at least two apparent peaks on the density distribution curves of the red forks and green forks clusters, whose densities are nonhomogeneous, as shown in Figure 4. Therefore, an analogy can be drawn that the centroid distance distribution curve of a given cluster will have more than one peak if heterogeneity exists. Furthermore, based on this analogy, the centroid distance values corresponding to the valleys on a centroid distance distribution curve which has more than one peak can be seen as a new isolation criterion.

2.3. Centroid Distance Isolation Criterion Based on Nonhomogeneous Density. In order to identify different density distributions within a cluster, we assume that its centroid distance distribution obeys a Gaussian Mixture Model (GMM) as long as heterogeneity exists. More specifically, if there are $n$ valleys on the density distribution curve, then for point $x_i$, $p(\text{Centroid\_Distance}(x_i))$ obeys a GMM consisting of $n+1$ Gaussian distribution components, as shown in formula (3), in which

$$\sum_{i=1}^{n+1} \pi_i = 1, \qquad N_i(x \mid \mu_i, \sigma_i) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left[-\frac{(x-\mu_i)^2}{2\sigma_i^2}\right], \tag{2}$$

$$p(\text{Centroid\_Distance}(x_i)) = \sum_{i=1}^{n+1} \pi_i\, N_i(\text{Centroid\_Distance}(x_i) \mid \mu_i, \sigma_i). \tag{3}$$

Based on the GMM assumption, we used the EM algorithm to derive the two sets of parameters, $\pi_i$, $\mu_i$, and $\sigma_i$, for the red forks and green forks clusters in Figure 1. The results are shown in Figures 5(a) and 5(b), where the dashed-line curve represents the high density area and the dashed-dot curve represents the other area. Therefore, the components of a GMM can be derived from a given cluster whose centroid distance distribution curve has at least one valley. Specifically, the $x$ values of the intersection points of different Gaussian distributions in a GMM can be seen as the isolation criterion.
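As an illustration of this step, the sketch below fits a two-component GMM to a cluster's centroid distance values via EM; the use of scikit-learn's GaussianMixture is our assumption, since the paper does not name an implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_centroid_distance_gmm(cd_values, n_components=2):
    """Fit a GMM to 1-D centroid distance values via EM and return
    the mixture weights pi_i, means mu_i, and standard deviations sigma_i."""
    X = np.asarray(cd_values).reshape(-1, 1)
    gmm = GaussianMixture(n_components=n_components, random_state=0).fit(X)
    return gmm.weights_, gmm.means_.ravel(), np.sqrt(gmm.covariances_).ravel()
```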

In terms of efficiency, the complexity of the EM algorithm depends on the number of iterations and on the complexity of the E and M steps, which is strongly related to cluster size. In order to guarantee the efficiency of the isolation criterion's computation, we designed a simpler algorithm, which reduces the computational complexity to $O(n)$, where $n$ is the number of points in a given cluster. The next paragraph describes the idea behind this simplification.

Through the observation of Figure 6, which demonstrates a comparison of the GMM and the centroid distance distribution curve, we could find that the $x$ values of the lowest point of the valley on the centroid distance distribution curve and of the intersection point of the two Gaussian distributions are almost identical. So the task of identifying a GMM can be converted into identifying the valleys on a centroid distance distribution curve. Intuitively, if a valley is deep enough, the centroid distance corresponding to its lowest point will be a good partitioning value. The concept of derivation is then utilized to reflect this intuition. Figure 7 illustrates the derivatives of the centroid distance distribution curves in Figure 6. A derivative segment corresponding to a peak-valley-peak segment on a density distribution curve must satisfy two requirements. The first is that it has to cross the zero point of the vertical axis, which means that there is indeed a valley on the centroid distance distribution curve there. On the premise of meeting this requirement, the derivative segment still needs to be long enough, which means that the valley is deep enough to yield a good isolation value. The dashed-line segments in Figure 7 satisfy these two requirements, and the corresponding centroid distance values are 6 and 8, which are nearly identical to the $x$ values of the intersection points of the two Gaussian distributions in Figure 5.

Figure 3: Centroid distance histogram of two homogeneous clusters.

Figure 4: Centroid distance histogram of two heterogeneous clusters.

Based on the analysis above, a nonhomogeneous density detection algorithm is proposed to carry out potential partitions within a given cluster. This algorithm first uses a crossing-zero index to filter optional partitioning values from all points and then measures the angles on either side of each candidate point on the centroid distance distribution curve to evaluate the significance level of the isolation criterion. A schematic description is shown in Algorithm 1; a runnable sketch follows the algorithm box.

In our nonhomogeneous density detection algorithm, one parameter, $n$, which is the number of points used to calculate the centroid distance, still needs to be decided. In order to give a determination policy for $n$, let us consider three concrete examples, discussed after Algorithm 1 and Table 1.

Input: $N$ samples of a certain cluster; $n$ (the number of samples used to calculate the centroid distance).
Output: partition values, if necessary.
Steps:
(1) Set Partitioning_Points = ∅; threshold = tan 45° = 1; the $i$-th sample is $s_i$.
(2) Calculate the centroid distance for every sample:
    Centroid_Distance($s_i$) = $(n-1)$ · Distance($s_j$, Centroid($S_i^n - s_j$)),
    where $S_i^n$ is the collection of the $n$ nearest samples to $s_i$, and $s_j$ is the sample which has the largest distance to $s_i$ in $S_i^n$.
    Get histogram data $(x_i, y_i)$ about the centroid distance array, $i = 1, 2, \ldots, \lfloor N/10 \rfloor$.
(3) Set $i = 2$.
(4) If $i == \lfloor N/10 \rfloor$, then stop and return the points in Partitioning_Points; else continue.
(5) If $y_i < y_{i-1}$ and $y_i < y_{i+1}$, then:
      $j = i$; tan1 = 0; tan2 = 0.
      While $j > 1$ and $y_j < y_{j-1}$:
        If tan1 < $(y_{j-1} - y_j)/(x_j - x_{j-1})$, then tan1 = $(y_{j-1} - y_j)/(x_j - x_{j-1})$;
        $j = j - 1$.
      $j = i$.
      While $j < \lfloor N/10 \rfloor$ and $y_j < y_{j+1}$:
        If tan2 < $(y_{j+1} - y_j)/(x_{j+1} - x_j)$, then tan2 = $(y_{j+1} - y_j)/(x_{j+1} - x_j)$;
        $j = j + 1$.
      If tan1 > threshold and tan2 > threshold, then Partitioning_Points = Partitioning_Points ∪ $\{x_i\}$.
      Go to Step (6); else continue.
(6) $i = i + 1$; go to Step (4).

Algorithm 1
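The following Python sketch restates Algorithm 1 in runnable form; the histogram binning ($\lfloor N/10 \rfloor$ bins) and the tan 45° = 1 threshold follow the algorithm box, while the function and variable names are ours.

```python
import numpy as np

def nonhomogeneous_density_partitions(cd, threshold=1.0):
    """Algorithm 1 sketch: scan the centroid distance histogram for local
    minima (crossing-zero index) whose steepest slopes on both flanks
    exceed threshold = tan(45 deg) = 1, and return those centroid
    distance values as partitioning values."""
    num_bins = max(len(cd) // 10, 3)
    y, edges = np.histogram(cd, bins=num_bins)
    x = (edges[:-1] + edges[1:]) / 2.0                # bin centres
    partitions = []
    for i in range(1, num_bins - 1):
        if not (y[i] < y[i - 1] and y[i] < y[i + 1]):
            continue                                  # not the bottom of a valley
        tan1 = tan2 = 0.0
        j = i
        while j > 0 and y[j] < y[j - 1]:              # climb the left flank
            tan1 = max(tan1, (y[j - 1] - y[j]) / (x[j] - x[j - 1]))
            j -= 1
        j = i
        while j < num_bins - 1 and y[j] < y[j + 1]:   # climb the right flank
            tan2 = max(tan2, (y[j + 1] - y[j]) / (x[j + 1] - x[j]))
            j += 1
        if tan1 > threshold and tan2 > threshold:     # valley is deep on both sides
            partitions.append(x[i])
    return partitions
```

Note that the threshold of 1 is meaningful only on the normalized axes used in the paper's figures, so the centroid distance values are assumed to be normalized before this scan.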

Table 1: First- and second-level nearest points in uniformly distributed space.

Dimensions                  | One | Two | Three
First-level nearest points  |  2  |  4  |  6
Second-level nearest points |  2  |  4  |  12

The three concrete examples in Figures 8(a), 8(b), and 8(c) represent uniformly distributed points in one-, two-, and three-dimensional space, respectively. Uniformly distributed points means that, for a given point, there exist two nearest equidistant points on every dimension. In our examples, Euclidean distance is used, and the value of the nearest equal distance is $r$. Further investigation tells us that the change of distance from a given point is not continuous but discrete. In Figure 8, for the central yellow point, the first-level nearest points are marked in red and the second-level nearest points are marked in blue. The three subfigures are summarized in Table 1, based on which formula (4) is put forward to calculate the number of $k$-level nearest points in $d$-dimensional space ($k \le d$). More specifically, when $k$ equals 1, formula (4) reduces to the number of first-level nearest points, which is $2d$. We believe that the number of first-level nearest points is sufficient for centroid distance computation in a uniformly distributed dataset. In reality, however, data can hardly be uniformly distributed, so in order to guarantee the ability of centroid distance to reflect nonhomogeneous density, we multiply the first-level nearest points' number by 2. Formula (5) finally gives the policy to determine $n$ in the nonhomogeneous density detection algorithm according to the dimension of the dataset:

$$n = C_d^k\, 2^k, \tag{4}$$

$$n = 4d. \tag{5}$$
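As a quick check of formulas (4) and (5), a small sketch with hypothetical helper names:

```python
from math import comb

def k_level_nearest(d, k):
    """Formula (4): number of k-level nearest points in d-dimensional space."""
    return comb(d, k) * 2 ** k

def centroid_distance_sample_size(d):
    """Formula (5): the policy n = 4d, i.e. twice the 2d first-level points."""
    return 4 * d

# Reproduces Table 1: 6 first-level and 12 second-level points in 3-D space
assert k_level_nearest(3, 1) == 6 and k_level_nearest(3, 2) == 12
```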

2.4. The Integration of Dissimilarity Increment and Centroid Distance Criteria. Applying the nonhomogeneous density detection algorithm after the dissimilarity increments clustering method, in other words, taking dissimilarity increments and centroid distance as isolation criteria successively, a new clustering algorithm named the localized ambient solidity separation (LASS) algorithm is developed, and the clustering result is obtained. Just as demonstrated in Figure 9, besides the perfect partition of naturally isolated clusters, their internal structure has also been explored, and points are partitioned further if necessary.

Figure 5: GMMs derived by EM algorithm from two heterogeneous clusters.

Figure 6: Comparison of GMM and centroid distance distribution curve.

The yellow, red, and green clusters in Figure 1 are each divided further into two subclusters according to their nonhomogeneous density distribution. Therefore, our LASS algorithm can handle clusters of arbitrary shape which are isolated, adjacent, overlapping, and under background noise. Moreover, compared with the traditional notion of density, which is the number of points in unit Euclidean volume, our proposed centroid distance isolation criterion works well in high-dimensional space;

Figure 7: Centroid distance derivative of two heterogeneous clusters.

Figure 8: Uniformly distributed points in one-, two-, and three-dimensional space.

Figure 9: Result generated by our LASS algorithm.

actually, it is even more sensitive as dimension increases. Also, compared with direct similarity, the centroid distance isolation criterion takes into account the surrounding context of each point by using its $n$ nearest points, and it depends on the histogram distribution instead of the exact absolute value of similarity, so it can automatically scale according to the density of points. All in all, with the dissimilarity increments and centroid distance isolation criteria integrated together, our LASS algorithm achieves broader applicability, especially on datasets with high dimensionality and diverse distribution shapes.

3. Computer User Segmentation

In this section, our proposed LASS algorithm is applied to a computer user dataset which contains demographic and behaviour information. To accomplish this, we first cleaned the raw data and extracted 7 features to characterize computer users. Then the cleaned data was normalized, and a dissimilarity measurement was defined. On this basis, the original dissimilarity increments clustering algorithm and our LASS algorithm were applied to the dataset, respectively. The clustering processes are analysed and the effectiveness of the results is verified. Finally, the segmentation result of computer users is analysed and summarized.

3.1. Data Cleaning and Features Selection. The raw data provided by CNNIC contains two kinds of information: 1000 computer users' personal attributes and their computer using log files. Specifically, personal attributes include a volunteer's gender, birthday, education level, job type, income level, province of residence, city of residence, and type of residence, while the computer using log files record these 1000 volunteers' computer interaction behaviours over 7 days, including start time, end time, websites browsing history, and programs opening history.

Although many features could be extracted from the raw data, we focus our attention on volunteers' natural attributes as persons and on their fundamental behaviours' statistical indicators, and we ignore environmental and geographic factors such as job type, province of residence, city of residence, and residence type. The reason behind this is that we regard the Internet as a strength which has broken down geographic barriers; therefore, we assume that environmental and geographic factors are no longer crucial influence factors in the Internet world. From this point of view, we extracted 7 features to profile computer users. Taking the $i$-th computer user $u_i$ as a concrete example, these extracted features are described in Table 2. The data of volunteers whose value of Times(·) is less than 4 are cleared out, and 775 sample data are left.

3.2. Data Normalization and Dissimilarity Measurement. Data normalization is needed before applying our LASS algorithm. The reason is that similarity measurements are usually sensitive to differences in mean and variability. In this paper, two kinds of normalization are used, as expressed in formulas (6) and (7), respectively. In formula (6), $m_j$ and $s_j$ are the mean and standard deviation of feature $j$; through this transformation, feature $j$ will have zero mean and unit variance. In formula (7), the function Rank(·) returns the rank of $x^*_{ij}$ in feature $j$'s data sequence; therefore, the transformed data will have a mean of $(n+1)/2$ and a variance of $(n+1)[(2n+1)/6 - (n+1)/4]$, where $n$ is the number of data. A related study has shown that, regarding clustering performance, formula (7) outperforms formula (6); particularly in hierarchical clustering methods, formula (7) is more robust to outliers and noise in the dataset [28]:

$$x_{ij} = \frac{x^*_{ij} - m_j}{s_j}, \tag{6}$$

$$x_{ij} = \mathrm{Rank}(x^*_{ij}). \tag{7}$$
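A small sketch of the two normalizations, under the assumption of NumPy array columns; rank ties are broken arbitrarily here, which the paper does not specify.

```python
import numpy as np

def z_score(col):
    """Formula (6): zero mean and unit variance."""
    return (col - col.mean()) / col.std()

def rank_transform(col):
    """Formula (7): replace each value by its rank (1..n) in the sequence."""
    ranks = np.empty(len(col))
    ranks[np.argsort(col)] = np.arange(1, len(col) + 1)
    return ranks

# Continuous features are rank-transformed first and then z-scored:
# normalized = z_score(rank_transform(raw_column))
```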

In this paper, for continuous variables' normalization, such as Booting_Duration(·) and Brows_Duration(·), formulas (7) and (6) are used successively, while for discrete variables' normalization, such as Gender(·), Age(·), and Edu(·), only formula (6) is used.

Table 2: Description of computer users' features.

Variables              | Descriptions
Gender(u_i)            | The gender of u_i, discrete variable: 1 stands for male, 0 stands for female
Age(u_i)               | The age of u_i, discrete variable between 10 and 70
Edu(u_i)               | The education level of u_i, discrete variable: 0: below primary school; 1: junior school; 2: senior school; 3: junior college; 4: bachelor degree; 5: others
Income(u_i)            | The monthly income level of u_i, discrete variable: 0: no income; 1: below 500 Yuan; 2: 501–1000 Yuan; 3: 1001–1500 Yuan; 4: 1501–2000 Yuan; 5: 2001–3000 Yuan; 6: 3001–5000 Yuan; 7: 5001–8000 Yuan; 8: 8001–12000 Yuan; 9: others
Times(u_i)             | Boot times of u_i's computer, discrete variable
Booting_Duration(u_i)  | The duration of u_i using the computer, continuous variable
Brows_Duration(u_i)    | The duration of u_i browsing websites, continuous variable

After normalization, a dissimilarity index is defined to measure the distance between different data points. As formula (8) shows, it is a sum of 1-norms, where $f_{in}$ stands for the value of the $i$-th data point's $n$-th feature:

$$\mathrm{Dissimilarity}(u_i, u_j) = \sum_{n=1}^{7} \left| f_{in} - f_{jn} \right|. \tag{8}$$
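Formula (8) translates directly into code; here u and v are assumed to be the 7-element normalized feature vectors of two users.

```python
def dissimilarity(u, v):
    """Formula (8): the 1-norm distance over the 7 normalized features."""
    return sum(abs(a - b) for a, b in zip(u, v))
```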

3.3. Computer Users Segmentation Process. Our proposed LASS algorithm is applied to the segmentation of computer users in this section. The whole segmentation process consists of two parts. Part I is the dissimilarity increments based clustering strategy (for details please refer to Section 3 in [27]), which aims to find naturally isolated clusters; part II is our proposed centroid distance based clustering strategy (for details please refer to Section 2.3 in this paper), whose goal is to explore the internal structure of every cluster generated by part I and to identify potential subclusters that are adjacent, overlapping, and under background noise. A pipeline sketch is given below.
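The overall flow can be sketched as follows, reusing the centroid_distance and nonhomogeneous_density_partitions sketches from Section 2; dissimilarity_increments_clustering stands for the method of [27] and split_by_values for a split at the returned partition values, both hypothetical helpers rather than code from either paper.

```python
def lass(points):
    """LASS pipeline sketch: part I finds naturally isolated clusters;
    part II re-examines each of them with the centroid distance criterion."""
    clusters = dissimilarity_increments_clustering(points)  # hypothetical part I helper
    result = []
    for cluster in clusters:
        n = 4 * cluster.shape[1]                            # formula (5): n = 4d
        cd = [centroid_distance(cluster, i, n) for i in range(len(cluster))]
        cuts = nonhomogeneous_density_partitions(cd)
        result.extend(split_by_values(cluster, cd, cuts) if cuts else [cluster])
    return result
```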

The clustering process is partly shown in Figure 10, where three representative clusters obtained by the part I strategy are chosen to be demonstrated.

Figure 10: Centroid distance histogram of three clusters.

Further exploration is carried out by the part II strategy of the LASS algorithm, and a partition valley is found in cluster 2, as shown in Figure 10(b). Next, the horizontal axis value of the lowest point on this valley can be acquired as a further isolation criterion, based on which cluster 2 will be divided into two subclusters. Figure 11 shows a comparison of the GMM generated by the EM algorithm and the centroid distance distribution curve of cluster 2. Despite the differences between these two graphs' shapes, the two acquired isolation criteria are nearly the same, which validates our simplification of the GMM's computation.

3.4. Segmentation Results Analysis and Discussion. The segmentation results generated by the original dissimilarity increments method and by our LASS algorithm are demonstrated in Tables 3 and 4. These two tables list the prototypes summarized from the obtained clusters. As shown, the sixth cluster in Table 3 is divided into two subclusters, the sixth and seventh clusters in Table 4. The reason for this further partition, as analyzed in Section 3.3, is the existence of a deep enough valley on cluster 6's centroid distance distribution curve (as shown in Figure 10(b)), which implies the existence of two different density areas within cluster 6 in Table 3.

To understand this process, some investigation should be made into the relationship between Tables 3 and 4. In Table 3, cluster 6 is the largest group of all clusters, whose gender proportion is almost 50%. However, an intuitive sense of behavior tells us that behavior mode should be seriously affected by people's gender. This intuition is proved to some extent by the first 5 clusters in Table 3, in which the gender proportion is 100% male.

Figure 11: Comparison of GMM and centroid distance distribution curve.

Table 3: Results generated by the dissimilarity increments clustering method.

Segment | Size | Gender               | Age | Education level | Income level   | Computer using frequency (times/week) | Computer using time (hours/week) | Website browsing time (hours/week)
1       | 24   | Male 100%, Female 0% | 28  | Junior college  | 2001–5000 Yuan | 6.7 | 67.5 | 0.6
2       | 35   | Male 100%, Female 0% | 24  | Bachelor degree | 0–500 Yuan     | 6.5 | 44.8 | 4.4
3       | 58   | Male 100%, Female 0% | 41  | Junior college  | 3001–5000 Yuan | 5.7 | 44.7 | 2.9
4       | 70   | Male 100%, Female 0% | 32  | Senior school   | 2001–3000 Yuan | 6   | 33   | 3.1
5       | 185  | Male 100%, Female 0% | 32  | Bachelor degree | 2001–5000 Yuan | 5.9 | 39   | 6.7
6       | 352  | Male 42%, Female 58% | 32  | Junior college  | 1501–3000 Yuan | 6.5 | 42.1 | 5.1

Table 4: Results generated by our LASS algorithm.

Segment | Size | Gender                  | Age | Education level                   | Income level   | Computer using frequency (times/week) | Computer using time (hours/week) | Website browsing time (hours/week)
1       | 24   | Male 100%, Female 0%    | 28  | Junior college                    | 2001–5000 Yuan | 6.7 | 67.5 | 0.6
2       | 35   | Male 100%, Female 0%    | 24  | Bachelor degree                   | 0–500 Yuan     | 6.5 | 44.8 | 4.4
3       | 58   | Male 100%, Female 0%    | 41  | Junior college                    | 3001–5000 Yuan | 5.7 | 44.7 | 2.9
4       | 70   | Male 100%, Female 0%    | 32  | Senior school to junior college   | 2001–3000 Yuan | 6   | 33   | 3.1
5       | 185  | Male 100%, Female 0%    | 32  | Bachelor degree                   | 2001–5000 Yuan | 5.9 | 39   | 6.7
6       | 136  | Male 0.7%, Female 99.3% | 30  | Junior college to bachelor degree | 1501–3000 Yuan | 5.9 | 37.8 | 3.1
7       | 216  | Male 68.1%, Female 31.9% | 33 | Junior college                    | 1001–2000 Yuan | 6.9 | 44.8 | 6.3

The reason why cluster 6 has not been divided further by the dissimilarity increments clustering method is that there may exist many touching areas in the high-dimensional space of cluster 6, under which situation the dissimilarity increments clustering method does not work anymore, while our proposed centroid distance based nonhomogeneous density detection algorithm has found that there still exist two potential subgroups within cluster 6 in Table 3, which are identified as clusters 6 and 7 in Table 4. These two clusters are different in gender, age, and computer using behaviours. Cluster 6 is almost totally composed of women, who spend less time on computers and website browsing, while in cluster 7 men are twice as many as women, who are older than the people in cluster 6 and spend much more time on computers, especially on browsing.

In order to quantify the overall effectiveness of our LASS algorithm, a between-group sum of dissimilarities (SDB) is calculated as in formula (9), which is the sum of the dissimilarities between each cluster centroid $c_i$ and the overall centroid $c$ of all the data. In this formula, $K$ is the number of clusters and $n_i$ is the number of points in cluster $i$:

$$\text{Total SDB} = \sum_{i=1}^{K} n_i\, \mathrm{Dissimilarity}(c_i, c). \tag{9}$$

Table 5: Total SDB of two clustering methods.

Method    | Dissimilarity increments clustering method | Our LASS algorithm
Total SDB | 853                                        | 1109

The higher the total SDB achieved, the more separated the identified clusters are, so it could be used to measure the effectiveness of a clustering method. The total SDB values of the original dissimilarity increments clustering method and of our LASS algorithm on the given dataset are shown in Table 5. Obviously, our LASS algorithm achieves a larger total SDB, more specifically 30% larger; thus it fits the given computer user dataset better.
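A sketch of the total SDB computation of formula (9), assuming clusters is a list of NumPy arrays (one per cluster) and dissimilarity is the measure of formula (8):

```python
import numpy as np

def total_sdb(clusters, dissimilarity):
    """Formula (9): sum over clusters of n_i * Dissimilarity(c_i, c),
    where c_i is a cluster centroid and c the overall centroid."""
    overall_centroid = np.vstack(clusters).mean(axis=0)
    return sum(len(cluster) * dissimilarity(cluster.mean(axis=0), overall_centroid)
               for cluster in clusters)
```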

In terms of the evaluation of individual clusters, the silhouette coefficient is used here, whose value varies between −1 and 1; a positive value is desirable.

Table 6: The silhouette coefficients of clusters.

Clusters               | Cluster 6 in Table 3 | Cluster 6 in Table 4 | Cluster 7 in Table 4
Silhouette coefficient | −0.34                | 0.02                 | −0.41

As Table 6 shows, the silhouette coefficient of cluster 6 in Table 3 is negative, which implies that the inside cohesion and outside separation of the cluster are not good, so cluster 6 in Table 3 could not be seen as a typical cluster, while through our LASS algorithm it is identified as two individual clusters, one of whose silhouette coefficients is positive. As for cluster 7, whose silhouette coefficient is still negative, we guess that it belongs to some kind of background noise; this will be discussed later. As for cluster 6 in Table 4, we believe that it is a typical prototype of Chinese female computer users, which had not been revealed in Table 3. Therefore, compared with the original dissimilarity increments clustering method, our LASS algorithm can gain more knowledge and understanding from the computer user dataset.


Further, the Kruskal-Wallis H test is applied to the clusters in Table 4 to test the difference between two or more clusters on a given dimension. As a nonparametric test method, the Kruskal-Wallis H test is typically used to determine whether there are statistically significant differences between two or more groups of an independent variable. The results are shown in Tables 7 and 8. In the hypothesis tests of Table 7, the original hypothesis is that the distributions of a given variable in all 7 clusters are identical, and the alternative hypothesis is that they are not identical. In the hypothesis tests of Table 8, the original hypothesis is that the distributions of a given variable in a given pair of clusters are identical, and the alternative hypothesis is that they are not identical. The $p$ values are listed and marked by a star if they are bigger than 0.05, which means accepting the original hypothesis and rejecting the alternative one. For the cases in which the $p$ value is below 0.05, the smaller the $p$ value is, the more statistically significant the variable's difference is. In Table 7, all of the $p$ values are below 0.002, which means that, for any given variable, its distributions are extremely different among the seven clusters in Table 4. Therefore, we can draw the conclusion that these seven variables perform well in identifying different groups of computer users. In Table 8, the $p$ value changes a lot according to the given pair of clusters and variable. The significance of these seven variables in distinguishing different pairs of clusters will be discussed one by one, combined with Table 9, which reveals the detailed demographic and computer interaction behaviour characteristics of the obtained seven computer user clusters.
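Such tests can be reproduced, for instance, with SciPy's implementation of the Kruskal-Wallis H test; the choice of SciPy is our assumption, as the paper does not name a tool.

```python
from scipy.stats import kruskal

def feature_p_value(groups):
    """Kruskal-Wallis H test across segments for one feature;
    groups is a list of per-segment value arrays (e.g. the Age
    values of each of the 7 segments)."""
    return kruskal(*groups).pvalue
```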

Segmentation results will be analysed from the perspective of variables, with the help of Tables 4, 8, and 9,

and significant characteristics will be pointed out. For the variable of gender, Table 8 tells us that its distributions in the first five segments are identical, which is proved to be 100% male in Table 9. The most significant difference of gender lies among segments 1–5, segment 6, and segment 7, which represent male groups, a female group, and a mixed-gender group, respectively. For the variable of age, Table 8 reveals that its distribution among segments 4–7 could be seen as identical; the main difference happens between the first three segments. Combined with Tables 4 and 9, we could find that segment 2 consists of the youngest members, whose age is around 24. Segment 1 is a slightly older group, whose average age is around 28, while segment 3 is a middle-aged group with an average age of 41; they are much older than the other segments. As to the variable of education level, it discriminates different segments well. Its distribution in segments 2 and 5 could be seen as identical, with the highest education level, bachelor degree, while the people from segment 4 have the lowest education level; other segments differ from one another. For the variable of income level, segment 1 earns the highest income, while segment 2 earns the lowest one. The income level of segments 3 and 5 could be seen as identical, as could that of segments 4 and 6, and the former two's income is lower than the latter two's. In terms of computer using frequency, the segments could be divided into two groups: segments 1, 2, and 7, and segments 3–6; the former group uses computers more frequently. As for the variable of computer using time, it discriminates segments 1 and 4 well, which spend the most and the least time on computers, respectively, while for the remaining 5 segments no significant difference exists among their computer using times. For the last variable, website browsing time, its distribution in segments 2, 3, 4, and 6 could be seen as identical; the difference mainly lies among segments 1, 5, and 7. Specifically, segment 1 spends the least time on website browsing, segment 5 spends the most, and the browsing time of segment 7 falls between those of segment 1 and segment 5.

Based on the analysis above, the 7 segments obtained by our LASS algorithm are summarized and discussed below, respectively.

Category 1 (little-browsing group). This group is entirely composed of young men who received a high education level and earn a decent income. The most significant feature of the people in this group is that, although they spend the most time on computers compared with other groups, they seldom visit webpages. We guess that, for this group of people, computer interaction behaviours mainly happen in workplaces or public places, where personal browsing is not encouraged.

Category 2 (little-income group). This group is composed of the youngest people, who are purely male and have the highest education level. The most significant feature of this group of people is that they have the same income level, which is no income. Additionally, they spend relatively more time on computers and on browsing websites. We guess that the main body of this group is college students in progress, who have lots of free time but no source of revenue.


Table 7: p values of features among all clusters.

Variables | Gender | Age    | Education level | Income level | Computer using frequency | Computer using time | Website browsing time
p value   | <0.002 | <0.002 | <0.002          | <0.002       | <0.002                   | <0.002              | <0.002

Table 8: p values of features between pairs of clusters. (A star marks p values bigger than 0.05, i.e., cases where the original hypothesis is accepted.)

Variables                | 1-2    | 1-3    | 1-4    | 1-5    | 1-6    | 1-7    | 2-3
Gender                   | >0.5*  | >0.5*  | >0.5*  | >0.5*  | <0.002 | <0.002 | >0.5*
Age                      | <0.005 | <0.002 | >0.05* | <0.05  | >0.2*  | >0.05* | <0.002
Education level          | <0.002 | >0.5*  | <0.002 | <0.002 | >0.05* | >0.5*  | <0.002
Income level             | <0.002 | >0.2*  | >0.1*  | >0.5*  | >0.05* | <0.002 | <0.002
Computer using frequency | >0.2*  | <0.002 | <0.005 | <0.002 | <0.002 | >0.1*  | <0.005
Computer using time      | <0.002 | <0.002 | <0.002 | <0.002 | <0.002 | <0.002 | >0.5*
Website browsing time    | <0.002 | <0.05  | <0.01  | <0.002 | <0.05  | <0.002 | >0.1*

Variables                | 2-4    | 2-5    | 2-6    | 2-7    | 3-4    | 3-5    | 3-6
Gender                   | >0.5*  | >0.5*  | <0.002 | <0.002 | >0.5*  | >0.5*  | <0.002
Age                      | <0.002 | <0.002 | <0.002 | <0.002 | <0.002 | <0.002 | <0.002
Education level          | <0.002 | >0.5*  | <0.002 | <0.002 | <0.002 | <0.002 | <0.005
Income level             | <0.002 | <0.002 | <0.002 | <0.002 | <0.005 | >0.1*  | <0.002
Computer using frequency | <0.05  | <0.005 | <0.002 | >0.5*  | >0.2*  | >0.2*  | >0.5*
Computer using time      | <0.01  | >0.2*  | >0.1*  | >0.5*  | <0.002 | >0.1*  | <0.05
Website browsing time    | >0.2*  | >0.05* | >0.05* | >0.5*  | >0.5*  | <0.002 | >0.5*

Variables                | 3-7    | 4-5    | 4-6    | 4-7    | 5-6    | 5-7    | 6-7
Gender                   | <0.002 | >0.5*  | <0.002 | <0.002 | <0.002 | <0.002 | <0.002
Age                      | <0.002 | >0.5*  | >0.1*  | >0.5*  | <0.02  | >0.2*  | >0.1*
Education level          | >0.2*  | <0.002 | <0.002 | <0.002 | <0.002 | <0.002 | <0.002
Income level             | <0.002 | <0.05  | >0.5*  | <0.02  | <0.002 | <0.002 | <0.02
Computer using frequency | <0.002 | >0.5*  | >0.2*  | <0.005 | >0.5*  | <0.002 | <0.002
Computer using time      | >0.5*  | <0.01  | >0.05* | <0.002 | >0.2*  | >0.1*  | <0.05
Website browsing time    | <0.005 | <0.002 | >0.2*  | <0.02  | <0.002 | <0.05  | <0.002

Category 3 (high-income group). This group is entirely composed of middle-aged men. The most significant feature of the people in this group is the highest income they earn. Besides, they spend relatively less time on computer interaction, in terms of both using frequency and total browsing time. We guess that, for the middle-aged men in this group, most of whom have not received a higher education, computers and the Internet are not so necessary in their daily life.

Category 4 (low-education group). This group is entirely composed of young men, older than those in Categories 1 and 2. The most significant feature of the people in this group is their low education level, the average of which is senior school, ranging from junior school to junior college. Moreover, they earn a medium-level income and get smaller values on every computer interaction index. We guess that this group of people is mainly engaged in jobs independent of computers.

Category 5 (much-browsing group). The structure of this group is very similar to Category 4, except for the higher education its members received, say a bachelor degree. As shown, people in this group earn more; we guess that the education difference may account for this. Also, compared with other categories, especially Category 4, this group of people spends much more time on browsing websites. We guess that the main job type of this group could be intellectual work, so they have close access to online computers.

Category 6 (young-women group). Females account for nearly 100% in this group, which is the only case among these 7 categories.

Table 9: Demographic and behaviour description of computer user segmentations.

Characteristics | Segment 1 | Segment 2 | Segment 3 | Segment 4 | Segment 5 | Segment 6 | Segment 7 | Total

Gender (%)
Male   | 100 | 100 | 100 | 100 | 100 | 0.7  | 68.1 | 71.8
Female | 0   | 0   | 0   | 0   | 0   | 99.3 | 31.9 | 28.2

Age (%)
10–20 | 0    | 0    | 0    | 0    | 0    | 2.2  | 12.0 | 4.0
20–25 | 4.2  | 68.6 | 6.8  | 14.3 | 6.4  | 12.5 | 17.6 | 14.6
25–30 | 62.5 | 22.9 | 3.4  | 24.2 | 34.1 | 41.9 | 16.2 | 27.2
30–35 | 33.3 | 8.6  | 13.8 | 28.6 | 27.6 | 21.3 | 16.2 | 21.2
35–40 | 0    | 0    | 15.5 | 22.9 | 16.2 | 14.7 | 10.2 | 13.4
40–50 | 0    | 0    | 37.9 | 10.0 | 13.5 | 5.9  | 18.1 | 14.0
50–60 | 0    | 0    | 20.6 | 0    | 2.2  | 1.4  | 8.3  | 5.0
60–70 | 0    | 0    | 1.7  | 0    | 0    | 0    | 1.4  | 0.6

Education level (%)
Below primary school | 0   | 0    | 0    | 0    | 0    | 0    | 1.4  | 0.4
Junior school        | 0   | 0    | 1.7  | 2.9  | 0    | 2.2  | 14.4 | 5.1
Senior school        | 0   | 0    | 25.9 | 74.3 | 0.5  | 19.9 | 26.9 | 21.1
Junior college       | 100 | 20   | 55.1 | 22.9 | 26.5 | 33.1 | 19.9 | 29.8
Bachelor degree      | 0   | 71.4 | 17.2 | 0    | 73.0 | 41.2 | 28.7 | 39.8
Others               | 0   | 8.6  | 0    | 0    | 0    | 3.7  | 8.8  | 3.7

Income level (%)
No income       | 0    | 91.4 | 0    | 0    | 0    | 5.1  | 24.1 | 12.6
Below 500 Yuan  | 0    | 5.7  | 0    | 0    | 0    | 0.7  | 3.2  | 1.4
501–1000 Yuan   | 0    | 2.9  | 1.7  | 0    | 1.6  | 2.9  | 4.6  | 2.6
1001–1500 Yuan  | 0    | 0    | 5.2  | 11.4 | 4.3  | 5.9  | 9.7  | 6.6
1501–2000 Yuan  | 12.5 | 0    | 10.3 | 22.9 | 10.8 | 17.6 | 8.3  | 12.0
2001–3000 Yuan  | 29.2 | 0    | 24.1 | 24.3 | 28.6 | 32.4 | 16.2 | 12.5
3001–5000 Yuan  | 45.8 | 0    | 20.7 | 32.9 | 39.5 | 23.5 | 15.7 | 25.6
5001–8000 Yuan  | 12.5 | 0    | 17.2 | 8.5  | 11.9 | 6.6  | 8.3  | 9.4
8001–12000 Yuan | 0    | 0    | 12.1 | 0    | 3.2  | 4.4  | 2.8  | 3.4
Others          | 0    | 0    | 0    | 0    | 0    | 0.7  | 6.9  | 2.9

Computer using frequency (times/week)
Mean     | 6.7  | 6.5  | 5.7  | 6.0  | 5.9  | 5.87 | 6.9 | 6.3
Variance | 0.04 | 0.20 | 0.33 | 0.18 | 0.25 | 0.40 | 1.4 | 0.68

Computer using time (hours/week)
Mean     | 67.5 | 44.7 | 44.7 | 33.0 | 39.4 | 37.8 | 44.8 | 41.7
Variance | 0.23 | 0.88 | 1.31 | 0.31 | 0.70 | 0.83 | 1.31 | 0.99

Website browsing time (hours/week)
Mean     | 0.64 | 4.43 | 2.88 | 3.12 | 6.7  | 3.1  | 6.3  | 4.95
Variance | 0.43 | 0.89 | 0.89 | 0.80 | 0.83 | 0.96 | 1.01 | 0.99

However, from computer interaction aspects, say using frequency and browsing time, this group is very similar to Category 4, so we guess that these two groups of people have similar types of jobs or similar working circumstances. Moreover, although these young women have a higher education level than the men in Category 4, they do not earn a better salary. We guess that this phenomenon may be due to the lack of career experience and gender discrimination.

Category 7 (noise group). This category is the only mixed-gender group, in which men are twice as many as women. However, in terms of age, education level, and income level, this category shows no significant difference compared with the total population. And for the variables of computer using frequency, computer using time, and website browsing time, their variances are fairly large, even bigger than the overall variances. So, due to the dispersed distribution of this category on every dimension, we believe that it is a noise group.

4. Conclusion

In this paper, we proposed a new clustering algorithm named the localized ambient solidity separation (LASS) algorithm. This algorithm is built on a new isolation criterion called centroid distance, which is used to detect nonhomogeneous density distributions within a given cluster. The proposed isolation criterion is based on the recognition that if there exist nonhomogeneous densities within a cluster, then partitions should be carried out. The intuition behind this recognition is a GMM assumption on the points' centroid distance values in a cluster; the EM algorithm was used to derive the components and parameters of a GMM. Additionally, in order to make the algorithm more efficient, we designed a nonhomogeneous density detection algorithm to reduce the computational complexity to $O(n)$, where $n$ is the number of points for clustering. Moreover, the parameter determination policy of the nonhomogeneous density detection algorithm was investigated. Finally, we integrated our designed nonhomogeneous density detection algorithm, as a follow-up mechanism, with the original dissimilarity increments clustering method and developed the LASS algorithm. It is demonstrated that, compared with the original dissimilarity increments clustering method, our LASS algorithm not only can identify naturally isolated clusters but also can identify clusters which are adjacent, overlapping, and under background noise.

Additionally, in order to evaluate the performance of the LASS algorithm in practice, we applied it to a computer user dataset which contains 1000 computer users' demographic and behaviour information, comparing it with the result obtained from the original dissimilarity increments clustering method. The segmentation results show that one of the clusters generated by the dissimilarity increments clustering method is further divided into two subclusters by our LASS algorithm. The comparison of total SDB and silhouette coefficients validates the rationality of this further partition. The discussion and analysis of the segmentation results prove that our LASS algorithm can gain more knowledge and understanding from datasets with high dimensionality and diverse distribution shapes, like the computer user dataset.

There are some future directions to explore from this paper. First, the GMM assumption on centroid distance values could be further investigated and tested among more distributions, such as Gaussian and exponential. Second, our proposed centroid distance isolation criterion could be integrated with other traditional clustering methods, either partitional or hierarchical, and more strengths and weaknesses could be pointed out and analysed. Third, the centroid distance based clustering strategy in our LASS algorithm relies on the histogram distribution of centroid distance values; therefore, if the number of points in one cluster is too small, this clustering strategy may not work effectively anymore. This drawback should be given enough attention and further investigated.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to acknowledge the dataset and technical support for their work provided by NELECT (National Engineering Laboratory for E-Commerce Technology, Tsinghua University) and DNSLAB of China Internet Network Information Center.

References

[1] E. Miller, "Community cleverness required," Nature, vol. 455, no. 7209, p. 1, 2008.
[2] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, 2005.
[3] O. Chapelle, B. Scholkopf, and A. Zien, Semi-Supervised Learning, MIT Press, Cambridge, Mass, USA, 2006.
[4] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[5] V. Jakkula and D. J. Cook, "Anomaly detection using temporal data mining in a smart home environment," Methods of Information in Medicine, vol. 47, no. 1, pp. 70–75, 2008.
[6] Q.-Y. Tang and C.-X. Zhang, "Data processing system (DPS) software with experimental design, statistical analysis and data mining developed for use in entomological research," Insect Science, vol. 20, no. 2, pp. 254–260, 2013.
[7] M. A. Musen, B. Middleton, and R. A. Greenes, "Clinical decision-support systems," in Biomedical Informatics, pp. 643–674, Springer, London, UK, 2014.
[8] L. Hubert, "Approximate evaluation techniques for the single-link and complete-link hierarchical clustering procedures," Journal of the American Statistical Association, vol. 69, no. 347, pp. 698–704, 1974.
[9] E. Diday and J. C. Simon, "Clustering analysis," in Digital Pattern Recognition, vol. 10 of Communication and Cybernetics, pp. 47–94, Springer, Berlin, Germany, 1980.
[10] S. Nassar, J. Sander, and C. Cheng, "Incremental and effective data summarization for dynamic hierarchical clustering," in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '04), pp. 467–478, ACM, June 2004.
[11] K. Tasdemir, "Vector quantization based approximate spectral clustering of large datasets," Pattern Recognition, vol. 45, no. 8, pp. 3034–3044, 2012.
[12] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review," ACM Computing Surveys, vol. 31, no. 3, pp. 264–323, 1999.
[13] A. K. Jain, "Data clustering: 50 years beyond K-means," Pattern Recognition Letters, vol. 31, no. 8, pp. 651–666, 2010.
[14] N. R. Pal and S. K. Pal, "A review on image segmentation techniques," Pattern Recognition, vol. 26, no. 9, pp. 1277–1294, 1993.
[15] M. Wedel, Market Segmentation: Conceptual and Methodological Foundations, Springer, 2000.
[16] R.-S. Wu and P.-H. Chou, "Customer segmentation of multiple category data in e-commerce using a soft-clustering approach," Electronic Commerce Research and Applications, vol. 10, no. 3, pp. 331–341, 2011.
[17] M. C. Onwezen, M. J. Reinders, I. A. van der Lans et al., "A cross-national consumer segmentation based on food benefits: the link with consumption situations and food perceptions," Food Quality and Preference, vol. 24, no. 2, pp. 276–286, 2012.
[18] F. Westad, M. Hersleth, and P. Lea, "Strategies for consumer segmentation with applications on preference data," Food Quality and Preference, vol. 15, no. 7-8, pp. 681–687, 2004.
[19] J. Macharia, R. Collins, and T. Sun, "Value-based consumer segmentation: the key to sustainable agri-food chains," British Food Journal, vol. 115, no. 9, pp. 1313–1328, 2013.
[20] T.-C. Hsieh and C. Yang, "Multi-level latent class analysis of internet use pattern in Taiwan," in e-Technologies and Networks for Development, vol. 171 of Communications in Computer and Information Science, pp. 197–208, Springer, Berlin, Germany, 2011.
[21] Z. Bosnjak and O. Grljevic, "Credit users segmentation for improved customer relationship management in banking," in Proceedings of the 6th IEEE International Symposium on Applied Computational Intelligence and Informatics (SACI '11), pp. 379–384, May 2011.
[22] E. Martinez-Garcia and M. Royo-Vela, "Segmentation of low-cost flights users at secondary airports," Journal of Air Transport Management, vol. 16, no. 4, pp. 234–237, 2010.
[23] M. Bichis-Lupas and R. N. Moisey, "A benefit segmentation of rail-trail users: implications for marketing by local communities," Journal of Park and Recreation Administration, vol. 19, no. 3, pp. 78–92, 2001.
[24] A. Bhatnagar and S. Ghose, "A latent class segmentation analysis of e-shoppers," Journal of Business Research, vol. 57, no. 7, pp. 758–767, 2004.
[25] C. Lorenzo-Romero and M.-D. Alarcon-del-Amo, "Segmentation of users of social networking websites," Social Behavior and Personality, vol. 40, no. 3, pp. 401–414, 2012.
[26] http://www.cnnic.net.cn.
[27] A. L. N. Fred and J. M. N. Leitao, "A new cluster isolation criterion based on dissimilarity increments," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 8, pp. 944–958, 2003.
[28] M. C. P. De Souto, D. S. A. De Araujo, I. G. Costa, R. G. F. Soares, T. B. Ludermir, and A. Schliep, "Comparative study on normalization procedures for cluster analysis of gene expression datasets," in Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN '08), pp. 2792–2798, June 2008.

Page 5: Localized Ambient Solidity Separation Algorithm Based Computer … · 2016-05-15 · ResearchArticle Localized Ambient Solidity Separation Algorithm Based Computer User Segmentation

Computational Intelligence and Neuroscience 5

Histogram of centroid distance

10

9

8

7

6

5

Num

ber o

f poi

nts

0 1 2 3 4 5 6

Normalized centroid distance

(a)

Histogram of centroid distance

Num

ber o

f poi

nts

Normalized centroid distance

14

12

10

8

6

4

0 2 4 6 8 10 12 14

(b)

Figure 3 Centroid distance histogram of two homogeneous clusters

Normalized centroid distance

22

20

18

16

14

12

10

8

0 10 20 30 40 50

Histogram of centroid distance

Num

ber o

f poi

nts

(a)

20

18

16

14

12

10

8

Normalized centroid distance0 10 20 30 40 50

Histogram of centroid distance

Num

ber o

f poi

nts

(b)

Figure 4 Centroid distance histogram of two heterogeneous clusters

Based on the analysis above a nonhomogeneous densitydetection algorithm is proposed to carry potential partitionswithin a given cluster This algorithm first uses crossing-zero index to filter optional partitioning values from allpoints and then measures the angles on either side of thispoint on centroid distance distribution curve to evaluate

the significant level of the isolation criterion A schematicdescription is as shown in Algorithm 1

In our nonhomogeneous density detection algorithmone parameter 119899 which is the number of points used to cal-culate centroid distance still needs to be decided In orderto give a determination policy of 119899 let us consider three

6 Computational Intelligence and Neuroscience

Input119873 samples of a certain cluster 119899 (the number of samples used to calculte centroid distance)Output patition values if neccesarySteps(1) Set Paritioning Points = 0 threshold = tan 45∘ = 1 the 119894th sample is 119878

119894

(2) Calculate the centroid distance for every sampleCentroid Distance(119904

119894) = (119899 minus 1)Distance(119904

119895 Centroid(119878

119894119899minus 119904119895))

119878119894119899

is the collection of 119899 nearest samples to 119904119894 119904119895is the sample which has the largest distance to 119904

119894in 119878119894119899

get histogram data (119909119894 119910119894) about centroid distance array 119894 = 1 2 lfloor11987310rfloor

(3) Set 119894 = 2(4) If 119894 == lfloor11987310rfloorThen stop and return the points in Paritioning PointsElse continue(5) If 119910

119894lt 119910119894minus1 and 119910

119894lt 119910119894+1

Then119895 = 119894tan 1 = 0 tan 2 = 0While 119895 gt 1 and 119910

119895lt 119910119895minus1

If tan 1 lt ((119910119895minus1 minus 119910119895)(119909119895 minus 119909119895minus1))

Then tan 1 = ((119910119895minus1 minus 119910119895)(119909119895 minus 119909119895minus1))

119895 = 119895 minus 1

119895 = 119894While 119895 lt lfloor11987310rfloor and 119910

119895lt 119910119895+1

If tan 2 lt ((119910119895+1 minus 119910119895)(119909119895+1 minus 119909119895))

Then tan 2 = ((119910119895+1 minus 119910119895)(119909119895 minus 119909119895minus1))

119895 = 119895 minus 1

If tan 1 gt threshold and tan 2 gt thresholdThen

Paritioning Points = Paritioning Points cup 119904119894

Go to Step (6)Else continue

(6) 119894 = 119894 + 1Go to Step (4)

Algorithm 1

Table 1 First- and second-level nearest points in uniformly dis-tributed space

Dimensions One Two ThreeFirst-level nearest points 2 4 6Second-level nearest points 2 4 12

concrete examples in Figures 8(a) 8(b) and 8(c) which rep-resent uniformly distributed points in one- two- and three-dimensional space respectively Uniformly distributed pointsmeans for a given point there exist two nearest equidis-tant points on every dimension In our examples Euclid-ean distance is used and the value of nearest equal distanceis 119903 Further investigation tells us that the change of distancefrom a given point is not continuous but discrete In Figure 8for the central yellow point the first-level nearest points aremarked in red and the second-level nearest points aremarked in blue The three subfigures are summarized inTable 1 based onwhich formula (4) is put forward to calculatethe number of 119896-level nearest points in 119889-dimensional space(119896 le 119889) More specifically when 119896 equals 1 formula (4) isreduced to be the number of first-level nearest points whichis 2119889 We believe that the number of first-level nearest

points is sufficient for centroid distance computation inuniformly distributed dataset In reality however data canhardly be uniformly distributed so in order to guarantee theavailability of centroid distance to reflect nonhomogeneousdensity we multiply the first-level nearest pointsrsquo numberby 2 Formula (5) finally gives the policy to determine 119899 innonhomogeneous density detection algorithm according tothe dimension of data set

119899 = 119862119896

1198892119896 (4)

119899 = 4119889 (5)

24 The Integration of Dissimilarity Increment and CentroidDistance Criteria Applying nonhomogeneous density detec-tion algorithm after using dissimilarity increments clusteringmethod in other words taking dissimilarity increments andcentroid distance as an isolation criterion successively anew clustering algorithm named localized ambient solidityseparation algorithm (LASS) is developed and the clusteringresult is obtained Just as demonstrated in Figure 9 exceptfor the perfect partition of naturally isolated clusters theirinternal structure has also been explored and points arepartitioned further if necessary The yellow red and green

Computational Intelligence and Neuroscience 7

20

15

10

5

00 10 20 30 40 50

x

Gaussian distribution 1Gaussian distribution 2

yaft

er b

eing

scal

ed

(a)

20

15

10

5

00 10 20 30 40 50

x

Gaussian distribution 1Gaussian distribution 2

yaft

er b

eing

scal

ed

(b)

Figure 5 GMMs derived by EM algorithm from two heterogeneous clusters

20

15

10

5

00 10 20 30 40 50

x

Gaussian distribution 1Gaussian distribution 2

Centroid distance

yaft

er b

eing

scal

ed

(a)

20

15

10

5

00 10 20 30 40 50

x

yaft

er b

eing

scal

ed

Gaussian distribution 1Gaussian distribution 2

Centroid distance

(b)

Figure 6 Comparison of GMM and centroid distance distribution curve

clusters in Figure 1 are divided into two subclusters furtheraccording to their nonhomogeneous density distributionTherefore our LASS algorithm can handle clusters of arbi-trary shape which are isolated adjacent overlapping and

under background noise Moreover compared with the tra-ditional notation of density which is the number of pointsin unit Euclidean volume our proposed centroid distanceisolation criterion works well in high-dimensional space

8 Computational Intelligence and Neuroscience

3

2

1

0

minus1

minus2

minus3

Derivative of centroid distance

Num

ber o

f poi

nts

Normalized centroid distance0 10 20 30 40 50

(a)

Derivative of centroid distance

3

2

1

0

minus1

minus2

minus3

Num

ber o

f poi

nts

Normalized centroid distance0 5 10 15 20 25 30 35 40

(b)

Figure 7 Centroid distance derivative of two heterogeneous clusters

x

2rr0minus2r minusr

(a)

y

r

radic2r

(b) (c)

Figure 8 Uniformly distributed points in one- two- and three-dimensional space

7

6

5

4

3

220 25 30 35 40 45 50

Figure 9 Result generated by our LASS algorithm

actually it is evenmore sensitive as dimension increases Alsocompared with direct similarity centroid distance isolationcriterion takes into account the surrounding context of eachpoint by using its nrsquos nearest points and depends on thehistogram distribution instead of the exact absolute value ofsimilarity So it can automatically scale according to the den-sity of points All in all integrated dissimilarity incrementsand centroid distance isolation criteria together our LASSalgorithm can achieve broader applicability especially on thedataset with high dimension and diverse distribution shape

3 Computer User Segmentation

In this section our proposed LASS algorithm is applied oncomputer users dataset which contains their demographicand behaviour information To accomplish this we first

Computational Intelligence and Neuroscience 9

cleaned the raw data and extracted 7 features to characterizecomputer users Then the cleaned data is normalized and adissimilarity measurement is defined On the basis of thesethe original dissimilarity increments clustering algorithmand our LASS algorithm are applied on the dataset respec-tivelyThe clustering processes are analysed and the effective-ness of results is verified At last the segmentation result ofcomputer users is analysed and summarized

3.1. Data Cleaning and Features Selection

The raw data provided by CNNIC contains two kinds of information. They are 1000 computer users' personal attributes and their computer using log files. Specifically, personal attributes include a volunteer's gender, birthday, education level, job type, income level, province of residence, city of residence, and type of residence, while computer using log files record these 1000 volunteers' computer interaction behaviours in 7 days, including start time, end time, websites browsing history, and programs opening history.

Although many features could be extracted from the raw data, we focus our attention on volunteers' natural attributes as persons and the statistical indicators of their fundamental behaviours, but ignore environmental and geographic factors such as job type, province of residence, city of residence, and residence type. The reason behind this is that we regard the Internet as a force which has broken down geographic barriers; therefore we assume that environmental and geographic factors are no longer crucial influence factors in the Internet world. From this point of view, we extracted 7 features to profile computer users. Taking the ith computer user u_i as a concrete example, these extracted features are described in Table 2. The data of volunteers whose value of Times(·) is less than 4 are cleared out, and 775 sample data are left.

3.2. Data Normalization and Dissimilarity Measurement

Data normalization is needed before applying our LASS algorithm, because similarity measurement is usually sensitive to differences in mean and variability. In this paper, two kinds of normalization are used, as expressed in formulas (6) and (7), respectively. In formula (6), m_j and s_j are the mean and standard deviation of feature j; through this transformation, feature j will have zero mean and unit variance. In formula (7), the function Rank(·) returns the rank of x*_ij in feature j's data sequence; therefore the transformed data will have a mean of (n+1)/2 and a variance of (n+1)[(2n+1)/6 − (n+1)/4], where n is the number of data points. A related study has shown that, for clustering performance, formula (7) outperforms formula (6), particularly in hierarchical clustering methods, because formula (7) is more robust to outliers and noise in the dataset [28].

$$x_{ij} = \frac{x^{*}_{ij} - m_j}{s_j} \quad (6)$$

$$x_{ij} = \mathrm{Rank}\left(x^{*}_{ij}\right) \quad (7)$$

Table 2: Description of computer users' features.

| Variables | Descriptions |
|---|---|
| Gender(u_i) | The gender of u_i; discrete variable: 1 stands for male, 0 stands for female |
| Age(u_i) | The age of u_i; discrete variable between 10 and 70 |
| Edu(u_i) | The education level of u_i; discrete variable: 0 below primary school, 1 junior school, 2 senior school, 3 junior college, 4 bachelor degree, 5 others |
| Income(u_i) | The monthly income level of u_i; discrete variable: 0 no income, 1 below 500 Yuan, 2 501–1000 Yuan, 3 1001–1500 Yuan, 4 1501–2000 Yuan, 5 2001–3000 Yuan, 6 3001–5000 Yuan, 7 5001–8000 Yuan, 8 8001–12000 Yuan, 9 others |
| Times(u_i) | Boot times of u_i's computer; discrete variable |
| Booting Duration(u_i) | The duration of u_i using the computer; continuous variable |
| Brows Duration(u_i) | The duration of u_i browsing websites; continuous variable |

In this paper, for continuous variables' normalization, such as bootDuration(·) and visitingDuration(·), formulas (7) and (6) are used successively, while for discrete variables' normalization, such as Gender(·), Age(·), and Edu(·), only formula (6) is used.
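As a minimal sketch of this normalization pipeline (assuming NumPy, ignoring tie handling in the rank transform; the helper names are ours):

```python
import numpy as np

def z_score(col):
    """Formula (6): transform a feature to zero mean and unit variance."""
    return (col - col.mean()) / col.std()

def rank(col):
    """Formula (7): Rank(x*_ij) within the feature's data sequence
    (ties are ignored here for brevity)."""
    return np.argsort(np.argsort(col)).astype(float) + 1.0

def normalize(col, continuous):
    """Continuous features: formula (7) followed by formula (6);
    discrete features: formula (6) only."""
    return z_score(rank(col)) if continuous else z_score(col)
```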

After normalization, a dissimilarity index is defined to measure the distance between different data points. As formula (8) shows, it is a sum of 1-norms, where f_in stands for the value of the ith datum's nth feature:

$$\mathrm{Dissimilarity}(u_i, u_j) = \sum_{n=1}^{7} \left| f_{in} - f_{jn} \right| \quad (8)$$
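Under the same assumptions as the sketch above, formula (8) is a one-liner:

```python
import numpy as np

def dissimilarity(fi, fj):
    """Formula (8): 1-norm distance over the 7 normalized features
    of two computer users."""
    return float(np.sum(np.abs(fi - fj)))
```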

3.3. Computer Users Segmentation Process

Our proposed LASS algorithm is applied for the segmentation of computer users in this section. The whole segmentation process consists of two parts. Part I is the dissimilarity increments based clustering strategy (for details please refer to Section 3 in [27]), which aims to find natural isolated clusters; part II is our proposed centroid distance based clustering strategy (for details please refer to Section 2.3 in this paper), whose goal is to explore the internal structure of every cluster generated by part I and identify potential subclusters that are adjacent, overlapping, and under background noise.

The clustering process is partly shown in Figure 10, where three representative clusters obtained by the part I strategy are chosen to be demonstrated.

Figure 10: Centroid distance histogram of three clusters (panels (a)-(c); axes: normalized centroid distance versus number of points).

Further exploration is carried out by the part II strategy of the LASS algorithm, and a partition valley is found in cluster 2, as shown in Figure 10(b). Next, the horizontal axis value of the lowest point on this valley can be acquired as a further isolation criterion, based on which cluster 2 will be divided into two subclusters. Figure 11 shows a comparison of the GMM generated by the EM algorithm and the centroid distance distribution curve of cluster 2. Despite the differences between these two graphs' shapes, the two acquired isolation criteria are nearly the same, which validates our simplification of the GMM's computation.
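Reusing the centroid_distances() and valley_split() helpers sketched in Section 2.4, the part II refinement of a single part-I cluster might look as follows; the n = 4d neighbourhood size follows the parameter determination policy of the nonhomogeneous density detection algorithm, and refine_cluster is our own name.

```python
def refine_cluster(points):
    """Part II: split one part-I cluster at the valley of its centroid
    distance histogram, if a sufficiently deep valley exists."""
    d = points.shape[1]                       # data dimensionality
    cd = centroid_distances(points, n=4 * d)  # n = 4d neighbourhood policy
    t = valley_split(cd)
    if t is None:
        return [points]                       # homogeneous density: keep intact
    return [points[cd <= t], points[cd > t]]  # e.g. cluster 2 -> two subclusters
```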

3.4. Segmentation Results Analysis and Discussion

The segmentation results generated by the original dissimilarity increments method and by our LASS algorithm are demonstrated in Tables 3 and 4. These two tables list the prototypes summarized from the obtained clusters. As shown, the sixth cluster in Table 3 is divided into two subclusters, the sixth and seventh clusters in Table 4. The reason for this further partition, as analyzed in Section 3.3, is the existence of a deep enough valley on cluster 6's centroid distance distribution curve (as shown in Figure 10(b)), which implies the existence of two different density areas within cluster 6 in Table 3.

To understand this process, some investigation should be made about the relationship between Tables 3 and 4. In Table 3, cluster 6 is the largest group of all clusters, and its gender proportion is almost 50%. However, an intuitive sense of behavior tells us that behavior mode should be seriously affected by people's gender. This intuition is proved to some extent by the first 5 clusters in Table 3, in which the gender proportion is 100% male. The reason why cluster 6 has not

Figure 11: Comparison of GMM and centroid distance distribution curve (axes: normalized centroid distance versus number of points; curves: centroid distance, Gaussian distributions 1 and 2).

Table 3: Results generated by the dissimilarity increments clustering method.

| Segment | Size | Gender | Age | Education level | Income level | Computer using frequency (times/week) | Computer using time (hours/week) | Website browsing time (hours/week) |
|---|---|---|---|---|---|---|---|---|
| 1 | 24 | Male 100%, Female 0% | 28 | Junior college | 2001–5000 Yuan | 6.7 | 67.5 | 0.6 |
| 2 | 35 | Male 100%, Female 0% | 24 | Bachelor degree | 0–500 Yuan | 6.5 | 44.8 | 4.4 |
| 3 | 58 | Male 100%, Female 0% | 41 | Junior college | 3001–5000 Yuan | 5.7 | 44.7 | 2.9 |
| 4 | 70 | Male 100%, Female 0% | 32 | Senior school | 2001–3000 Yuan | 6.0 | 33.0 | 3.1 |
| 5 | 185 | Male 100%, Female 0% | 32 | Bachelor degree | 2001–5000 Yuan | 5.9 | 39.0 | 6.7 |
| 6 | 352 | Male 42%, Female 58% | 32 | Junior college | 1501–3000 Yuan | 6.5 | 42.1 | 5.1 |

Table 4: Results generated by our LASS algorithm.

| Segment | Size | Gender | Age | Education level | Income level | Computer using frequency (times/week) | Computer using time (hours/week) | Website browsing time (hours/week) |
|---|---|---|---|---|---|---|---|---|
| 1 | 24 | Male 100%, Female 0% | 28 | Junior college | 2001–5000 Yuan | 6.7 | 67.5 | 0.6 |
| 2 | 35 | Male 100%, Female 0% | 24 | Bachelor degree | 0–500 Yuan | 6.5 | 44.8 | 4.4 |
| 3 | 58 | Male 100%, Female 0% | 41 | Junior college | 3001–5000 Yuan | 5.7 | 44.7 | 2.9 |
| 4 | 70 | Male 100%, Female 0% | 32 | Senior school to junior college | 2001–3000 Yuan | 6.0 | 33.0 | 3.1 |
| 5 | 185 | Male 100%, Female 0% | 32 | Bachelor degree | 2001–5000 Yuan | 5.9 | 39.0 | 6.7 |
| 6 | 136 | Male 0.7%, Female 99.3% | 30 | Junior college to bachelor degree | 1501–3000 Yuan | 5.9 | 37.8 | 3.1 |
| 7 | 216 | Male 68.1%, Female 31.9% | 33 | Junior college | 1001–2000 Yuan | 6.9 | 44.8 | 6.3 |

been divided further apart by the dissimilarity increments clustering method is that there may exist many touching areas in the high-dimensional space of cluster 6, in which situation the dissimilarity increments clustering method does not work anymore. However, our proposed centroid distance based nonhomogeneous density detection algorithm found that there still exist two potential subgroups within cluster 6 of Table 3, which are identified as clusters 6 and 7 in Table 4. These two clusters differ in gender, age, and computer using behaviors: cluster 6 is almost totally composed of women, who spend less time on computer and website browsing, while in cluster 7 men are twice as many as women, are older than the people in cluster 6, and spend much more time on computers, especially on browsing.

In order to quantify the overall effectiveness of our LASS algorithm, a between-group sum of dissimilarities (SDB) is calculated as in formula (9), which is the sum of the dissimilarities between each cluster centroid c_i and the overall centroid c of all the data. In this formula, K is the number of clusters and n_i is the number of points in cluster i:

$$\mathrm{Total\ SDB} = \sum_{i=1}^{K} n_i \, \mathrm{Dissimilarity}(c_i, c) \quad (9)$$

The higher the total SDB achieved, the more separated the identified clusters are, so it can be used to measure the effectiveness of a clustering method. The total SDB values of the original dissimilarity increments clustering method and of our LASS algorithm on the given dataset are shown in Table 5. Our LASS algorithm achieves a larger total SDB, more specifically 30% larger, and thus fits the given computer user dataset better.

Table 5: Total SDB of two clustering methods.

| Method | Dissimilarity increments clustering method | Our LASS algorithm |
|---|---|---|
| Total SDB | 853 | 1109 |
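A direct transcription of formula (9), assuming each cluster is a NumPy array of normalized feature rows and taking centroids as feature-wise means (total_sdb is our name):

```python
import numpy as np

def total_sdb(clusters):
    """Formula (9): sum over clusters of n_i * Dissimilarity(c_i, c),
    with Dissimilarity the 1-norm of formula (8)."""
    c = np.vstack(clusters).mean(axis=0)                    # overall centroid
    return sum(len(cl) * np.abs(cl.mean(axis=0) - c).sum()  # n_i * dissimilarity
               for cl in clusters)
```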

In terms of the evaluation of individual clusters, the silhouette coefficient is used here, whose value varies between −1 and 1; a positive value of the silhouette coefficient is desirable.

Table 6: The silhouette coefficients of clusters.

| Clusters | Cluster 6 in Table 3 | Cluster 6 in Table 4 | Cluster 7 in Table 4 |
|---|---|---|---|
| Silhouette coefficient | −0.34 | 0.02 | −0.41 |

As Table 6 shows, the silhouette coefficient value of cluster 6 in Table 3 is negative, which implies that the inside cohesion and outside separation of the cluster are not good. So cluster 6 in Table 3 could not be seen as a typical cluster, while through our LASS algorithm, cluster 6 in Table 3 is identified as two individual clusters, one of whose silhouette coefficients is positive. As for cluster 7, whose silhouette coefficient is still negative, we guess that it belongs to some kind of background noise; this will be discussed later. As for cluster 6 in Table 4, we believe that it is a typical prototype of Chinese female computer users which had not been revealed in Table 3. Therefore, compared with the original dissimilarity increments clustering method, our LASS algorithm can gain more knowledge and understanding from the computer user dataset.
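Per-cluster silhouette values such as those in Table 6 can be computed with scikit-learn's per-sample silhouettes averaged within each cluster; this is a sketch under our own assumptions (in particular the Manhattan metric, chosen to match the 1-norm of formula (8)):

```python
import numpy as np
from sklearn.metrics import silhouette_samples

def cluster_silhouettes(X, labels):
    """Mean silhouette coefficient of each cluster; negative values
    (such as cluster 6 in Table 3) signal poor cohesion/separation."""
    s = silhouette_samples(X, labels, metric="manhattan")  # 1-norm, as in formula (8)
    return {k: s[labels == k].mean() for k in np.unique(labels)}
```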


Further, the Kruskal-Wallis H test is applied on the clusters in Table 4 to test the difference between two or more clusters on a given dimension. As a nonparametric test method, the Kruskal-Wallis H test is typically used to determine if there are statistically significant differences between two or more groups of an independent variable. The results are shown in Tables 7 and 8. In the hypothesis tests of Table 7, the original hypothesis is that the distributions of a given variable in all 7 clusters are identical, and the alternative hypothesis is that they are not identical. In the hypothesis tests of Table 8, the original hypothesis is that the distributions of a given variable in a given pair of clusters are identical, and the alternative hypothesis is that they are not identical. The p values are listed and marked by a star if they are bigger than 0.05, which means accepting the original hypothesis and rejecting the alternative one. For the cases in which the p value is below 0.05, the smaller the p value is, the more statistically significant the variable's difference is. In Table 7, all of the p values are below 0.002, which means that, for any given variable, its distributions are extremely different among the seven clusters in Table 4. Therefore we can draw the conclusion that these seven variables perform well in identifying different groups of computer users. In Table 8, by contrast, the p value changes a lot according to the given pair of clusters and variable. The significance of these seven variables in distinguishing different pairs of clusters will be discussed one by one, combined with Table 9, which reveals the detailed demographic and computer interaction behaviour characteristics of the obtained seven computer user clusters.
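Tables of this kind can be produced with SciPy's kruskal; a sketch, where feature_by_segment is a list of 1-D arrays holding one variable's values per segment (kruskal_pvalues is our name):

```python
from scipy.stats import kruskal

def kruskal_pvalues(feature_by_segment):
    """Overall test across all segments (as in Table 7) and pairwise
    tests between every two segments (as in Table 8) for one variable."""
    overall = kruskal(*feature_by_segment).pvalue
    pairwise = {(i + 1, j + 1): kruskal(feature_by_segment[i],
                                        feature_by_segment[j]).pvalue
                for i in range(len(feature_by_segment))
                for j in range(i + 1, len(feature_by_segment))}
    return overall, pairwise
```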

The segmentation results will be analysed from the perspective of the variables, with the help of Tables 8, 9, and 4, and significant characteristics will be pointed out. For the variable of gender, Table 8 tells us that its distributions in the first five segments are identical, which is proved to be 100% male in Table 9. The most significant difference of gender lies among segments 1–5, segment 6, and segment 7, which represent male groups, a female group, and a mixed-gender group, respectively. For the variable of age, Table 8 reveals that its distribution among segments 4–7 could be seen as identical; the main difference happens between the first three segments. Combined with Tables 9 and 4, we can find that segment 2 consists of the youngest members, whose age is around 24. Segment 1 is a little bit older, with an average age around 28, while segment 3 is a middle-aged group with an average age of 41, much older than the other segments. As for the variable of education level, it discriminates different segments well. Its distribution in segments 2 and 5 could be seen as identical, with the highest education level, bachelor degree, while the people from segment 4 have the lowest education level; the other segments differ from one another. For the variable of income level, segment 1 earns the highest income while segment 2 earns the lowest one. The income levels of segments 3 and 5 could be seen as identical, and so could those of segments 4 and 6, with the income of segments 4 and 6 lower than that of segments 3 and 5. In terms of computer using frequency, the segments could be divided into two groups, segments 1, 2, and 7 and segments 3–6; the former group uses computers more frequently. As for the variable of computer using time, it discriminates segments 1 and 4 well, which spend the most and the least time on computers, respectively, while for the remaining 5 segments no significant difference exists among their computer using times. For the last variable, website browsing time, its distribution in segments 2, 3, 4, and 6 could be seen as identical; the difference mainly lies among segments 1, 5, and 7. Specifically, segment 1 spends the least time on website browsing, segment 5 spends the most, and the browsing time of segment 7 falls in between segments 1 and 5.

Based on the analysis above, the 7 segments obtained by our LASS algorithm are summarized and discussed below, respectively.

Category 1 (little-browsing group). This group is entirely composed of young men who received a high education level and earn a decent income. The most significant feature of the people in this group is that, although they spend the most time on computers compared with other groups, they seldom visit webpages. We guess that for this group of people the computer interaction behaviours mainly happen in the workplace or in public, where personal browsing is not encouraged.

Category 2 (little-income group). This group is composed of the youngest people, who are purely male and have the highest education level. The most significant feature of this group of people is that they have the same income level, which is no income. Additionally, they spend relatively more time on computers and browsing websites. We guess that the main body of this group is students currently in college, who have lots of free time but no source of revenue.

Table 7: p values of features among all clusters.

| Variable | Gender | Age | Education level | Income level | Computer using frequency | Computer using time | Website browsing time |
|---|---|---|---|---|---|---|---|
| p value | <0.002 | <0.002 | <0.002 | <0.002 | <0.002 | <0.002 | <0.002 |

Table 8: p values of features between pairs of clusters (* marks p > 0.05).

| Variable | 1-2 | 1-3 | 1-4 | 1-5 | 1-6 | 1-7 | 2-3 |
|---|---|---|---|---|---|---|---|
| Gender | >0.5* | >0.5* | >0.5* | >0.5* | <0.002 | <0.002 | >0.5* |
| Age | <0.005 | <0.002 | >0.05* | <0.05 | >0.2* | >0.05* | <0.002 |
| Education level | <0.002 | >0.5* | <0.002 | <0.002 | >0.05* | >0.5* | <0.002 |
| Income level | <0.002 | >0.2* | >0.1* | >0.5* | >0.05* | <0.002 | <0.002 |
| Computer using frequency | >0.2* | <0.002 | <0.005 | <0.002 | <0.002 | >0.1* | <0.005 |
| Computer using time | <0.002 | <0.002 | <0.002 | <0.002 | <0.002 | <0.002 | >0.5* |
| Website browsing time | <0.002 | <0.05 | <0.01 | <0.002 | <0.05 | <0.002 | >0.1* |

| Variable | 2-4 | 2-5 | 2-6 | 2-7 | 3-4 | 3-5 | 3-6 |
|---|---|---|---|---|---|---|---|
| Gender | >0.5* | >0.5* | <0.002 | <0.002 | >0.5* | >0.5* | <0.002 |
| Age | <0.002 | <0.002 | <0.002 | <0.002 | <0.002 | <0.002 | <0.002 |
| Education level | <0.002 | >0.5* | <0.002 | <0.002 | <0.002 | <0.002 | <0.005 |
| Income level | <0.002 | <0.002 | <0.002 | <0.002 | <0.005 | >0.1* | <0.002 |
| Computer using frequency | <0.05 | <0.005 | <0.002 | >0.5* | >0.2* | >0.2* | >0.5* |
| Computer using time | <0.01 | >0.2* | >0.1* | >0.5* | <0.002 | >0.1* | <0.05 |
| Website browsing time | >0.2* | >0.05* | >0.05* | >0.5* | >0.5* | <0.002 | >0.5* |

| Variable | 3-7 | 4-5 | 4-6 | 4-7 | 5-6 | 5-7 | 6-7 |
|---|---|---|---|---|---|---|---|
| Gender | <0.002 | >0.5* | <0.002 | <0.002 | <0.002 | <0.002 | <0.002 |
| Age | <0.002 | >0.5* | >0.1* | >0.5* | <0.02 | >0.2* | >0.1* |
| Education level | >0.2* | <0.002 | <0.002 | <0.002 | <0.002 | <0.002 | <0.002 |
| Income level | <0.002 | <0.05 | >0.5* | <0.02 | <0.002 | <0.002 | <0.02 |
| Computer using frequency | <0.002 | >0.5* | >0.2* | <0.005 | >0.5* | <0.002 | <0.002 |
| Computer using time | >0.5* | <0.01 | >0.05* | <0.002 | >0.2* | >0.1* | <0.05 |
| Website browsing time | <0.005 | <0.002 | >0.2* | <0.02 | <0.002 | <0.05 | <0.002 |

Category 3 (high-income group). This group is entirely composed of middle-aged men. The most significant feature of the people in this group is the highest income they earn. Besides, they spend relatively little time on computer interaction, in terms of both using frequency and total browsing time. We guess that, for the middle-aged men in this group, most of whom have not received a higher education, computers and the Internet are not so necessary in their daily life.

Category 4 (low-education group). This group is entirely composed of young men, older than those in Categories 1 and 2. The most significant feature of the people in this group is their low education level, the average of which is senior school, ranging from junior school to junior college. Moreover, they earn a medium-level income and get smaller values on every computer interaction index. We guess that this group of people is mainly engaged in jobs independent of computers.

Category 5 (much-browsing group). The structure of this group is very similar to Category 4, except for the higher education its members received, say bachelor degree. As shown, people in this group earn more; we guess that the education difference may account for this. Also, compared with other categories, especially Category 4, this group of people spends much more time on browsing websites. We guess that the main job type of this group could be intellectual work, and thus they have close access to online computers.

Category 6 (young-women group). Female accounts for nearly 100% of this group, which is the only such case in these 7 categories. However, from computer interaction aspects, say

Table 9: Demographic and behaviour description of computer user segmentations.

| Demographic and computer interaction behaviour characteristics | Segment 1 | Segment 2 | Segment 3 | Segment 4 | Segment 5 | Segment 6 | Segment 7 | Total |
|---|---|---|---|---|---|---|---|---|
| Gender: Male | 100% | 100% | 100% | 100% | 100% | 0.7% | 68.1% | 71.8% |
| Gender: Female | 0% | 0% | 0% | 0% | 0% | 99.3% | 31.9% | 28.2% |
| Age: 10~20 | 0% | 0% | 0% | 0% | 0% | 2.2% | 12.0% | 4.0% |
| Age: 20~25 | 4.2% | 68.6% | 6.8% | 14.3% | 6.4% | 12.5% | 17.6% | 14.6% |
| Age: 25~30 | 62.5% | 22.9% | 3.4% | 24.2% | 34.1% | 41.9% | 16.2% | 27.2% |
| Age: 30~35 | 33.3% | 8.6% | 13.8% | 28.6% | 27.6% | 21.3% | 16.2% | 21.2% |
| Age: 35~40 | 0% | 0% | 15.5% | 22.9% | 16.2% | 14.7% | 10.2% | 13.4% |
| Age: 40~50 | 0% | 0% | 37.9% | 10.0% | 13.5% | 5.9% | 18.1% | 14.0% |
| Age: 50~60 | 0% | 0% | 20.6% | 0% | 2.2% | 1.4% | 8.3% | 5.0% |
| Age: 60~70 | 0% | 0% | 1.7% | 0% | 0% | 0% | 1.4% | 0.6% |
| Education: Below primary school | 0% | 0% | 0% | 0% | 0% | 0% | 1.4% | 0.4% |
| Education: Junior school | 0% | 0% | 1.7% | 2.9% | 0% | 2.2% | 14.4% | 5.1% |
| Education: Senior school | 0% | 0% | 25.9% | 74.3% | 0.5% | 19.9% | 26.9% | 21.1% |
| Education: Junior college | 100% | 20.0% | 55.1% | 22.9% | 26.5% | 33.1% | 19.9% | 29.8% |
| Education: Bachelor degree | 0% | 71.4% | 17.2% | 0% | 73.0% | 41.2% | 28.7% | 39.8% |
| Education: Others | 0% | 8.6% | 0% | 0% | 0% | 3.7% | 8.8% | 3.7% |
| Income: No income | 0% | 91.4% | 0% | 0% | 0% | 5.1% | 24.1% | 12.6% |
| Income: Below 500 Yuan | 0% | 5.7% | 0% | 0% | 0% | 0.7% | 3.2% | 1.4% |
| Income: 501–1000 Yuan | 0% | 2.9% | 1.7% | 0% | 1.6% | 2.9% | 4.6% | 2.6% |
| Income: 1001–1500 Yuan | 0% | 0% | 5.2% | 11.4% | 4.3% | 5.9% | 9.7% | 6.6% |
| Income: 1501–2000 Yuan | 12.5% | 0% | 10.3% | 22.9% | 10.8% | 17.6% | 8.3% | 12.0% |
| Income: 2001–3000 Yuan | 29.2% | 0% | 24.1% | 24.3% | 28.6% | 32.4% | 16.2% | 23.5% |
| Income: 3001–5000 Yuan | 45.8% | 0% | 20.7% | 32.9% | 39.5% | 23.5% | 15.7% | 25.6% |
| Income: 5001–8000 Yuan | 12.5% | 0% | 17.2% | 8.5% | 11.9% | 6.6% | 8.3% | 9.4% |
| Income: 8001–12000 Yuan | 0% | 0% | 12.1% | 0% | 3.2% | 4.4% | 2.8% | 3.4% |
| Income: Others | 0% | 0% | 0% | 0% | 0% | 0.7% | 6.9% | 2.9% |
| Computer using frequency: mean | 6.7 | 6.5 | 5.7 | 6.0 | 5.9 | 5.87 | 6.9 | 6.3 |
| Computer using frequency: variance | 0.04 | 0.20 | 0.33 | 0.18 | 0.25 | 0.40 | 1.4 | 0.68 |
| Computer using time: mean | 67.5 | 44.7 | 44.7 | 33.0 | 39.4 | 37.8 | 44.8 | 41.7 |
| Computer using time: variance | 0.23 | 0.88 | 1.31 | 0.31 | 0.70 | 0.83 | 1.31 | 0.99 |
| Website browsing time: mean | 0.64 | 4.43 | 2.88 | 3.12 | 6.7 | 3.1 | 6.3 | 4.95 |
| Website browsing time: variance | 0.43 | 0.89 | 0.89 | 0.80 | 0.83 | 0.96 | 1.01 | 0.99 |

using frequency and browsing time, this group is very similar to Category 4, so we guess that these two groups of people have similar types of jobs or similar working circumstances. Moreover, although these young women have a higher education level than the men in Category 4, they do not earn a better salary. We guess that this phenomenon may be due to a lack of career experience and to gender discrimination.

Category 7 (noise group). This category is the only mixed-gender group, in which men are twice as many as women. However, in terms of age, education level, and income level, this category shows no significant difference compared with the total population. And as for the variables of computer using frequency, computer using time, and website browsing time, their variances are fairly large, even bigger than the overall variances. So, due to the dispersed distribution of this category on every dimension, we believe that it is a noise group.

4. Conclusion

In this paper we proposed a new clustering algorithm named the localized ambient solidity separation (LASS) algorithm. This algorithm is built on a new isolation criterion called centroid distance, which is used to detect the nonhomogeneous density distribution of a given cluster. The proposed isolation criterion is based on the recognition that, if there exist nonhomogeneous densities within a cluster, then partitions should be carried out. The intuition behind this recognition is a GMM assumption on the points' centroid distance values in a cluster; the EM algorithm was used to derive the components and parameters of a GMM. Additionally, in order to make the algorithm more efficient, we designed a nonhomogeneous density detection algorithm to reduce the computation complexity to O(n), where n is the number of points for clustering. Moreover, the parameter determination policy of the nonhomogeneous density detection algorithm was investigated. Finally, we integrated our designed nonhomogeneous density detection algorithm as a follow-up mechanism with the original dissimilarity increments clustering method and developed the LASS algorithm. It is demonstrated that, compared with the original dissimilarity increments clustering method, our LASS algorithm not only can identify naturally isolated clusters but also can identify clusters which are adjacent, overlapping, and under background noise.

Additionally, in order to evaluate the performance of the LASS algorithm in practice, we applied it on a computer user dataset which contains 1000 computer users' demographic and behaviour information, comparing it with the result obtained from the original dissimilarity increments clustering method. The segmentation results show that one of the clusters generated by the dissimilarity increments clustering method is further divided into two subclusters by our LASS algorithm. The comparison of total SDB and silhouette coefficients validates the rationality of this further partition. The discussion and analysis of the segmentation results prove that our LASS algorithm can gain more knowledge and understanding from datasets with high dimensionality and diverse distribution shapes, like the computer user dataset.

There are some future directions to explore from this paper. First, the GMM assumption on centroid distance values can be further investigated and tested among more distributions, such as Gaussian and exponential. Second, our proposed centroid distance isolation criterion could be integrated with other traditional clustering methods, either partitional or hierarchical, and more strengths and weaknesses could be pointed out and analysed. Third, the centroid distance based clustering strategy in our LASS algorithm relies on the histogram distribution of centroid distance values; therefore, if the number of points in one cluster is too small, this clustering strategy may not work effectively anymore. This drawback should be given enough attention and further investigated.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to acknowledge the dataset and technical support for their work provided by NELECT (National Engineering Laboratory for E-Commerce Technology, Tsinghua University) and DNSLAB of China Internet Network Information Center.

References

[1] E. Miller, "Community cleverness required," Nature, vol. 455, no. 7209, p. 1, 2008.

[2] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, 2005.

[3] O. Chapelle, B. Scholkopf, and A. Zien, Semi-Supervised Learning, MIT Press, Cambridge, Mass, USA, 2006.

[4] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527-1554, 2006.

[5] V. Jakkula and D. J. Cook, "Anomaly detection using temporal data mining in a smart home environment," Methods of Information in Medicine, vol. 47, no. 1, pp. 70-75, 2008.

[6] Q.-Y. Tang and C.-X. Zhang, "Data Processing System (DPS) software with experimental design, statistical analysis and data mining developed for use in entomological research," Insect Science, vol. 20, no. 2, pp. 254-260, 2013.

[7] M. A. Musen, B. Middleton, and R. A. Greenes, "Clinical decision-support systems," in Biomedical Informatics, pp. 643-674, Springer, London, UK, 2014.

[8] L. Hubert, "Approximate evaluation techniques for the single-link and complete-link hierarchical clustering procedures," Journal of the American Statistical Association, vol. 69, no. 347, pp. 698-704, 1974.

[9] E. Diday and J. C. Simon, "Clustering analysis," in Digital Pattern Recognition, vol. 10 of Communication and Cybernetics, pp. 47-94, Springer, Berlin, Germany, 1980.

[10] S. Nassar, J. Sander, and C. Cheng, "Incremental and effective data summarization for dynamic hierarchical clustering," in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '04), pp. 467-478, ACM, June 2004.

[11] K. Tasdemir, "Vector quantization based approximate spectral clustering of large datasets," Pattern Recognition, vol. 45, no. 8, pp. 3034-3044, 2012.

[12] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review," ACM Computing Surveys, vol. 31, no. 3, pp. 264-323, 1999.

[13] A. K. Jain, "Data clustering: 50 years beyond K-means," Pattern Recognition Letters, vol. 31, no. 8, pp. 651-666, 2010.

[14] N. R. Pal and S. K. Pal, "A review on image segmentation techniques," Pattern Recognition, vol. 26, no. 9, pp. 1277-1294, 1993.

[15] M. Wedel, Market Segmentation: Conceptual and Methodological Foundations, Springer, 2000.

[16] R.-S. Wu and P.-H. Chou, "Customer segmentation of multiple category data in e-commerce using a soft-clustering approach," Electronic Commerce Research and Applications, vol. 10, no. 3, pp. 331-341, 2011.

[17] M. C. Onwezen, M. J. Reinders, I. A. van der Lans, et al., "A cross-national consumer segmentation based on food benefits: the link with consumption situations and food perceptions," Food Quality and Preference, vol. 24, no. 2, pp. 276-286, 2012.

[18] F. Westad, M. Hersleth, and P. Lea, "Strategies for consumer segmentation with applications on preference data," Food Quality and Preference, vol. 15, no. 7-8, pp. 681-687, 2004.

[19] J. Macharia, R. Collins, and T. Sun, "Value-based consumer segmentation: the key to sustainable agri-food chains," British Food Journal, vol. 115, no. 9, pp. 1313-1328, 2013.

[20] T.-C. Hsieh and C. Yang, "Multi-level latent class analysis of internet use pattern in Taiwan," in e-Technologies and Networks for Development, vol. 171 of Communications in Computer and Information Science, pp. 197-208, Springer, Berlin, Germany, 2011.

[21] Z. Bosnjak and O. Grljevic, "Credit users segmentation for improved customer relationship management in banking," in Proceedings of the 6th IEEE International Symposium on Applied Computational Intelligence and Informatics (SACI '11), pp. 379-384, May 2011.

[22] E. Martinez-Garcia and M. Royo-Vela, "Segmentation of low-cost flights users at secondary airports," Journal of Air Transport Management, vol. 16, no. 4, pp. 234-237, 2010.

[23] M. Bichis-Lupas and R. N. Moisey, "A benefit segmentation of rail-trail users: implications for marketing by local communities," Journal of Park and Recreation Administration, vol. 19, no. 3, pp. 78-92, 2001.

[24] A. Bhatnagar and S. Ghose, "A latent class segmentation analysis of e-shoppers," Journal of Business Research, vol. 57, no. 7, pp. 758-767, 2004.

[25] C. Lorenzo-Romero and M.-D. Alarcon-del-Amo, "Segmentation of users of social networking websites," Social Behavior and Personality, vol. 40, no. 3, pp. 401-414, 2012.

[26] China Internet Network Information Center, http://www.cnnic.net.cn/.

[27] A. L. N. Fred and J. M. N. Leitao, "A new cluster isolation criterion based on dissimilarity increments," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 8, pp. 944-958, 2003.

[28] M. C. P. de Souto, D. S. A. de Araujo, I. G. Costa, R. G. F. Soares, T. B. Ludermir, and A. Schliep, "Comparative study on normalization procedures for cluster analysis of gene expression datasets," in Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN '08), pp. 2792-2798, June 2008.

Page 6: Localized Ambient Solidity Separation Algorithm Based Computer … · 2016-05-15 · ResearchArticle Localized Ambient Solidity Separation Algorithm Based Computer User Segmentation

6 Computational Intelligence and Neuroscience

Input119873 samples of a certain cluster 119899 (the number of samples used to calculte centroid distance)Output patition values if neccesarySteps(1) Set Paritioning Points = 0 threshold = tan 45∘ = 1 the 119894th sample is 119878

119894

(2) Calculate the centroid distance for every sampleCentroid Distance(119904

119894) = (119899 minus 1)Distance(119904

119895 Centroid(119878

119894119899minus 119904119895))

119878119894119899

is the collection of 119899 nearest samples to 119904119894 119904119895is the sample which has the largest distance to 119904

119894in 119878119894119899

get histogram data (119909119894 119910119894) about centroid distance array 119894 = 1 2 lfloor11987310rfloor

(3) Set 119894 = 2(4) If 119894 == lfloor11987310rfloorThen stop and return the points in Paritioning PointsElse continue(5) If 119910

119894lt 119910119894minus1 and 119910

119894lt 119910119894+1

Then119895 = 119894tan 1 = 0 tan 2 = 0While 119895 gt 1 and 119910

119895lt 119910119895minus1

If tan 1 lt ((119910119895minus1 minus 119910119895)(119909119895 minus 119909119895minus1))

Then tan 1 = ((119910119895minus1 minus 119910119895)(119909119895 minus 119909119895minus1))

119895 = 119895 minus 1

119895 = 119894While 119895 lt lfloor11987310rfloor and 119910

119895lt 119910119895+1

If tan 2 lt ((119910119895+1 minus 119910119895)(119909119895+1 minus 119909119895))

Then tan 2 = ((119910119895+1 minus 119910119895)(119909119895 minus 119909119895minus1))

119895 = 119895 minus 1

If tan 1 gt threshold and tan 2 gt thresholdThen

Paritioning Points = Paritioning Points cup 119904119894

Go to Step (6)Else continue

(6) 119894 = 119894 + 1Go to Step (4)

Algorithm 1

Table 1 First- and second-level nearest points in uniformly dis-tributed space

Dimensions One Two ThreeFirst-level nearest points 2 4 6Second-level nearest points 2 4 12

concrete examples in Figures 8(a) 8(b) and 8(c) which rep-resent uniformly distributed points in one- two- and three-dimensional space respectively Uniformly distributed pointsmeans for a given point there exist two nearest equidis-tant points on every dimension In our examples Euclid-ean distance is used and the value of nearest equal distanceis 119903 Further investigation tells us that the change of distancefrom a given point is not continuous but discrete In Figure 8for the central yellow point the first-level nearest points aremarked in red and the second-level nearest points aremarked in blue The three subfigures are summarized inTable 1 based onwhich formula (4) is put forward to calculatethe number of 119896-level nearest points in 119889-dimensional space(119896 le 119889) More specifically when 119896 equals 1 formula (4) isreduced to be the number of first-level nearest points whichis 2119889 We believe that the number of first-level nearest

points is sufficient for centroid distance computation inuniformly distributed dataset In reality however data canhardly be uniformly distributed so in order to guarantee theavailability of centroid distance to reflect nonhomogeneousdensity we multiply the first-level nearest pointsrsquo numberby 2 Formula (5) finally gives the policy to determine 119899 innonhomogeneous density detection algorithm according tothe dimension of data set

119899 = 119862119896

1198892119896 (4)

119899 = 4119889 (5)

24 The Integration of Dissimilarity Increment and CentroidDistance Criteria Applying nonhomogeneous density detec-tion algorithm after using dissimilarity increments clusteringmethod in other words taking dissimilarity increments andcentroid distance as an isolation criterion successively anew clustering algorithm named localized ambient solidityseparation algorithm (LASS) is developed and the clusteringresult is obtained Just as demonstrated in Figure 9 exceptfor the perfect partition of naturally isolated clusters theirinternal structure has also been explored and points arepartitioned further if necessary The yellow red and green

Computational Intelligence and Neuroscience 7

20

15

10

5

00 10 20 30 40 50

x

Gaussian distribution 1Gaussian distribution 2

yaft

er b

eing

scal

ed

(a)

20

15

10

5

00 10 20 30 40 50

x

Gaussian distribution 1Gaussian distribution 2

yaft

er b

eing

scal

ed

(b)

Figure 5 GMMs derived by EM algorithm from two heterogeneous clusters

20

15

10

5

00 10 20 30 40 50

x

Gaussian distribution 1Gaussian distribution 2

Centroid distance

yaft

er b

eing

scal

ed

(a)

20

15

10

5

00 10 20 30 40 50

x

yaft

er b

eing

scal

ed

Gaussian distribution 1Gaussian distribution 2

Centroid distance

(b)

Figure 6 Comparison of GMM and centroid distance distribution curve

clusters in Figure 1 are divided into two subclusters furtheraccording to their nonhomogeneous density distributionTherefore our LASS algorithm can handle clusters of arbi-trary shape which are isolated adjacent overlapping and

under background noise Moreover compared with the tra-ditional notation of density which is the number of pointsin unit Euclidean volume our proposed centroid distanceisolation criterion works well in high-dimensional space

8 Computational Intelligence and Neuroscience

3

2

1

0

minus1

minus2

minus3

Derivative of centroid distance

Num

ber o

f poi

nts

Normalized centroid distance0 10 20 30 40 50

(a)

Derivative of centroid distance

3

2

1

0

minus1

minus2

minus3

Num

ber o

f poi

nts

Normalized centroid distance0 5 10 15 20 25 30 35 40

(b)

Figure 7 Centroid distance derivative of two heterogeneous clusters

x

2rr0minus2r minusr

(a)

y

r

radic2r

(b) (c)

Figure 8 Uniformly distributed points in one- two- and three-dimensional space

7

6

5

4

3

220 25 30 35 40 45 50

Figure 9 Result generated by our LASS algorithm

actually it is evenmore sensitive as dimension increases Alsocompared with direct similarity centroid distance isolationcriterion takes into account the surrounding context of eachpoint by using its nrsquos nearest points and depends on thehistogram distribution instead of the exact absolute value ofsimilarity So it can automatically scale according to the den-sity of points All in all integrated dissimilarity incrementsand centroid distance isolation criteria together our LASSalgorithm can achieve broader applicability especially on thedataset with high dimension and diverse distribution shape

3 Computer User Segmentation

In this section our proposed LASS algorithm is applied oncomputer users dataset which contains their demographicand behaviour information To accomplish this we first

Computational Intelligence and Neuroscience 9

cleaned the raw data and extracted 7 features to characterizecomputer users Then the cleaned data is normalized and adissimilarity measurement is defined On the basis of thesethe original dissimilarity increments clustering algorithmand our LASS algorithm are applied on the dataset respec-tivelyThe clustering processes are analysed and the effective-ness of results is verified At last the segmentation result ofcomputer users is analysed and summarized

31 Data Cleaning and Features Selection The raw dataprovided by CNNIC contains two kinds of informationTheyare 1000 computer usersrsquo personal attributes and their com-puter using log files Specifically personal attributes includea volunteerrsquos gender birthday education level job typeincome level province of residence city of residence andtype of residence while computer using log files record these1000 volunteersrsquo computer interaction behaviours in 7 daysincluding start time end time websites browsing history andprograms opening history

Although many features could be extracted from rawdata we focus our attention on volunteersrsquo natural attributesas persons and their fundamental behavioursrsquo statistical indi-cators but ignore environmental and geographic factors suchas job type province of residence city of residence and resi-dence type The reason behind this is that we regard Internetas a strength which has broken down geographic barrierTherefore we assume that environmental and geographic fac-tors are no longer crucial influence factors in Internet worldFrom this point of view we extracted 7 features to profilecomputer users Taking the 119894th computer user 119906

119894as a concrete

example these extracted features are described inTable 2Thedata of volunteers whose value of Times(sdot) is less than 4 arecleared out and 775 sample data are left

32 Data Normalization and Dissimilarity MeasurementData normalization is needed before applying our LASS algo-rithm The reason to do so is that similarity measurement isusually sensitive to differences inmean and variability In thispaper two kinds of normalization are used as expressed informulas (6) and (7) respectively In formula (6) 119898

119895and 119904119895

are the mean and standard deviation of feature 119895 Throughthis transformation feature 119895 will have zero mean and unitvariance While in formula (7) function Rank(sdot) returns theranked number of119909lowast

119894119895in feature 119895 data sequenceTherefore the

transformed data will have a mean of (119899+1)2 and a varianceof (119899 + 1)[(2119899 + 1)6 minus (119899 + 1)4] where 119899 is the number ofdata Related study has shown that on the performance ofclustering formula (7) outperforms formula (6) particularlyin hierarchical clusteringmethods formula (7) ismore robustto outliers and noise in dataset [28]

119909119894119895=119909lowast

119894119895minus 119898119895

119904119895

(6)

119909119894119895= Rank (119909lowast

119894119895) (7)

In this paper for continuous variablersquos normalizationsuch as bootDuration(sdot) and visitingDuration(sdot) formulas (7)

Table 2 Description of computer users features

Variables Descriptions

Gender (119906119894)

The gender of 119906119894 discrete variable

1 stands for male0 stands for female

Age (119906119894) The age of 119906

119894 discrete variable between 10

and 70

Edu (119906119894)

The education level of 119906119894 discrete variable

0 below primary school1 junior school2 senior school3 junior college4 bachelor degree5 others

Income (119906119894)

The monthly income level of 119906119894 discrete

variable0 no income1 below 500 Yuan2 501ndash1000 Yuan3 1001ndash1500 Yuan4 1501ndash2000 Yuan5 2001ndash3000 Yuan6 3001ndash5000 Yuan7 5001ndash8000 Yuan8 8001ndash12000 Yuan9 others

Times (119906119894) Boot times of 119906

119894rsquos computer discrete

variableBooting Duration(119906119894)

The duration of 119906119894using computer

continuous variable

Brows Duration (119906119894) The duration of 119906

119894browsing websites

continuous variable

and (6) are used successively while for discrete variablersquosnormalization such as Gender(sdot) Age(sdot) and Edu(sdot) onlyformula (6) is used

After normalization a dissimilarity index is defined tomeasure the distance between different data As formula (8)shows it is a form of 1-normsrsquo sum where 119891

119894119899stands for the

value of 119894th datarsquos 119899th feature

Dissimilarity (119906119894 119906119895) =

7sum

119899=1

10038161003816100381610038161003816119891119894119899minus119891119895119899

10038161003816100381610038161003816 (8)

33 Computer Users Segmentation Process Our proposedLASS algorithm is applied for the segmentation of computerusers in this sectionThewhole segmentation process consistsof two parts Part I is the dissimilarity increments basedclustering strategy (for details please refer to Section 3 in[27]) which aims to find natural isolated clusters part II isour proposed centroid distance based clustering strategy (fordetails please refer to Section 23 in this paper) whose goalis to explore the internal structure of every cluster generatedby part I and identify potential subclusters that are adjacentoverlapping and under background noise

The clustering process is partly shown in Figure 10 wherethree representative clusters obtained in part I strategy arechosen to be demonstrated Further exploration is carried

10 Computational Intelligence and Neuroscience

Normalized centroid distance

15

10

5

0

Histogram of centroid distance

Num

ber o

f poi

nts

20 22 24 26 28 30 32 34

(a)

Normalized centroid distance

Histogram of centroid distance

Num

ber o

f poi

nts

20 25 30 35 40 45 50 55

14

12

10

8

6

4

2

0

(b)

Normalized centroid distance

Histogram of centroid distance

Num

ber o

f poi

nts

6

4

2

05 6 7 8 9 10

16

14

12

10

8

(c)

Figure 10 Centroid distance histogram of three clusters

out by part II strategy of LASS algorithm and a partitionvalley is found in cluster 2 as shown in Figure 10(b) Nextthe horizontal axis value of the lowest point on this valley canbe acquired as a further isolation criterion based on whichcluster 2 will be divided into two subclusters Figure 11 showsa comparison of the GMM generated by EM algorithm andcentroid distance distribution curve of cluster 2 Despite thedifferences between these two graphsrsquo shapes the acquiredtwo isolation criteria are nearly the same which validates oursimplification of GMMrsquos computation

34 Segmentation Results Analysis and Discussion The seg-mentation results generated by the original dissimilarityincrements method and our LASS algorithm are demon-strated in Tables 3 and 4 These two tables list the prototypessummarized from the obtained clusters As it is shown thesixth cluster in Table 3 is divided into two subclusters thesixth and seventh cluster in Table 4The reason of this furtherpartition as analyzed in Section 33 is the existence of a deepenough valley on cluster 6rsquos centroid distribution curve (asshown in Figure 10(b)) which implies the existence of twodifferent density areas within cluster 6 in Table 3

To understand this process some investigation shouldbe made about the relationship between Tables 3 and 4 InTable 3 cluster 6 is the largest group of all clusters whosegender proportion is almost 50 However an intuitive senseof behavior tells us that behavior mode should be seriouslyaffected by peoplersquos gender This intuition is proved by thefirst 5 clusters in Table 3 to some extent in which genderproportion is 100 male The reason why cluster 6 has not

Normalized centroid distance

Histogram of centroid distance

Num

ber o

f poi

nts

20 25 30 35 40 45 50

14

12

10

8

6

4

2

0

Centroid distanceGaussian distribution 1Gaussian distribution 2

Figure 11 Comparison of GMM and centroid distance distributioncurve

Computational Intelligence and Neuroscience 11

Table 3 Results generated by dissimilarity increments clustering method

Segment Size Gender Age Education level Income levelComputer using

frequency(timesweek)

Computerusing time

(hoursweek)

Websitebrowsing time(hoursweek)

1 24 Male 100Female 0 28 Junior college 2001ndash5000 Yuan 67 675 06

2 35 Male 100Female 0 24 Bachelor degree 0ndash500 Yuan 65 448 44

3 58 Male 100Female 0 41 Junior college 3001ndash5000 Yuan 57 447 29

4 70 Male 100Female 0 32 Senior school 2001ndash3000 Yuan 6 33 31

5 185 Male 100Female 0 32 Bachelor degree 2001ndash5000 Yuan 59 39 67

6 352 Male 42Female 58 32 Junior college 1501ndash3000 Yuan 65 421 51

Table 4 Results generated by our LASS algorithm

Segment Size Gender Age Education level Income levelComputer using

frequency(timesweek)

Computerusing time

(hoursweek)

Websitebrowsing time(hoursweek)

1 24 Male 100Female 0 28 Junior college 2001ndash5000 Yuan 67 675 06

2 35 Male 100Female 0 24 Bachelor degree 0ndash500 Yuan 65 448 44

3 58 Male 100Female 0 41 Junior college 3001ndash5000 Yuan 57 447 29

4 70 Male 100Female 0 32 Senior school to

junior college 2001ndash3000 Yuan 6 33 31

5 185 Male 100Female 0 32 Bachelor degree 2001ndash5000 Yuan 59 39 67

6 136 Male 07Female 993 30 Junior college to

bachelor degree 1501ndash3000 Yuan 59 378 31

7 216 Male 681Female 329 33 Junior college 1001ndash2000 Yuan 69 448 63

been divided further apart by the dissimilarity incrementsclustering method is that there may exist much touchingareas in high-dimensional space of cluster 6 under whichsituation the dissimilarity increments clusteringmethod doesnot work anymore While our proposed centroid distancebased nonhomogeneous density detection algorithm hasfound that there still exist two potential subgroups withincluster 6 in Table 3 which are identified as clusters 6 and7 in Table 4 these two clusters are different in gender ageand computer using behaviors Cluster 6 is almost totallycomposed of women who spend less time on computer andwebsites browsing while in cluster 7 men are twice as muchas women who are older than people in cluster 6 and spendmuch more time on computers especially on browsing

In order to quantify the overall effectiveness of our LASSalgorithm a between group sum of dissimilarities (SDB) iscalculated as formula (9) which is the sumof the dissimilaritybetween a cluster centroid 119888

119894 and the overall centroid 119888

of all the data In this formula 119870 is the number of clusters

Table 5 Total SDB of two clustering methods

MethodDissimilarityincrements

clustering methodOur LASS algorithm

Total SDB 853 1109

and 119899119894is the number of points in cluster 119894 The higher the

total SDB is achieved the more adjoint the identified clustersare So it could be used to measure the effectiveness of aclustering methodThe total SDB of the original dissimilarityincrements clustering method and our LASS algorithm onthe given dataset are shown in Table 5 Obviously our LASSalgorithm achieves larger total SDB more specifically 30larger thus it fits for the given computer user dataset better

In terms of the evaluation of individual clusters silhouettecoefficient is used here whose value varies between minus1 and 1A positive value of silhouette coefficient is desirable

12 Computational Intelligence and Neuroscience

Table 6 The silhouette coefficients of clusters

Clusters Cluster 6in Table 3

Cluster 6in Table 4

Cluster 7 inTable 5

Silhouettecoefficient minus034 002 minus041

As Table 6 shows the silhouette coefficient value of cluster 6in Table 3 is negative which implies that the inside cohesionand outside separation of the cluster are not good So cluster6 in Table 3 could not be seen as a typical cluster whilethrough our LASS algorithm cluster 6 in Table 3 is identifiedas two individual clusters one of whose silhouette coefficientsis positive So as to cluster 7 whose silhouette coefficient isstill negative we guess that it belongs to some kind of back-ground noise This will be discussed later As for cluster 6 inTable 4 we believe that it is a typical prototype of Chinesefemale computer users which has not been revealed inTable 3 Therefore compared with the original dissimilarityincrements clustering method our LASS algorithm can gainmore knowledge and understanding from computer userdataset

Total SDB =

119870

sum

119894=1119899119894Dissimilarity (119888

119894 119888) (9)

Further Kruskal-WallisHTest is applied on the clusters inTable 4 to test the difference between two ormore clusters of agiven dimension As a nonparametric test method Kruskal-Wallis H Test is typically used to determine if there are sta-tistical significance differences between two or more groupsof an independent variable The results are shown in Tables 7and 8 In the hypothesis tests of Table 7 original hypothesisis that the distributions of a given variable in all 7 clusters areidentical and alternative hypothesis is that the distributionsof a given variable in all 7 clusters are not identical While inthe hypothesis tests of Table 8 original hypothesis is that thedistributions of a given variable in a given pair of clusters areidentical and alternative hypothesis is that the distributionsof a given variable in a given pair of clusters are not identicalThe 119901 values are listed and marked by star if they are biggerthan 005 whichmeans accepting the original hypothesis andrejecting the alternative one For the cases in which 119901 valueis below 005 the smaller the 119901 value is the more statisticallysignificant the variablersquos difference is In Table 7 all of the 119901values are below 0002 which means for any given variableits distributions are extremely different among the sevenclusters in Table 4Therefore we can draw the conclusion thatthese seven variables perform well in identifying differentgroups of computer users While in Table 8 119901 value changesa lot according to the given pair of clusters and variable Thesignificance of these seven variables to distinguish differentpair of clusters will be discussed one by one combined withTable 9 which reveals the detailed demographic and com-puter interaction behaviours characteristics of the obtainedseven computer users clusters

Segmentation results will be analysed from the perspec-tive of variables with the help of Table 8 and Tables 9 and 4

and significant characteristics will be pointed out For thevariable of gender Table 8 tells us that its distributions in thefirst five segments are identical which is proved to be 100male in Table 9 The most significant difference of genderlies among segments 1ndash5 segment 6 and segment 7 whichrepresentsmale groups female group andmix-gender grouprespectively For the variable of age Table 8 reveals that itsdistribution among segments 4ndash7 could be seen as identicalthemain difference happens between the first three segmentsCombined with Tables 9 and 4 we could find that segment 2consists of the youngest members whose age is around 24Segment 1 is a little bit elder group whose average age isaround 28 While segment 3 is a middle-aged group with anaverage age of 41 they are much older than other segmentsSo as to the variable of education level it discriminatesdifferent segments well Its distribution in segments 2 and 5could be seen as identical that has the highest education levelbachelor degree while the people from segment 4 have thelowest education level Other segments differ from oneanother For the variable of income level segment 1 earns thehighest income while segment 2 earns the lowest one Theincome level of segments 3 and 5 could be seen as identicalso it is with segments 4 and 6 And the former tworsquos incomeis lower than the latter tworsquos In the terms of computer usingfrequency the segments could be divided into two groupsthey are segments 1 2 and 7 and segments 3ndash6 The formergroup uses computer more frequently As for the variableof computer using time it discriminates segments 1 and 4well that spend the most and the least time on computerrespectively while for the remaining 5 segments no signifi-cant difference exists among their computer using time Forthe last variable website browsing time its distribution insegments 2 3 4 and 6 could be seen as identical differencemainly lies among segments 1 5 and 7 Specifically segment1 spends the least time on website browsing while segment 5spends the most and the browsing time of segment 7 falls inbetween segment 1 and segment 5

Based on the analysis above the 7 segments obtained byour LASS algorithm are summarized and discussed belowrespectively

Category 1 (little-browsing group) This group is entirelycomposed of young men who received a high educationlevel and earn a decent income The most significant featureof the people in this group is that although they spendthe most time on computers compared with other groupsthey seldom visit webpages We guess that for this group ofpeople the computer interaction behaviours mainly happenin workplace or public where personal browsing is notencouraged

Category 2 (little-income group) This group is composedof the youngest people who are purely male and have thehighest education level The most significant feature of thisgroup of people is that they have the same income level whichis no income Additionally they spend relatively more timeon computers and browsing websites We guess that the mainbody of this group is college students in progress who havelots of free time but no source of revenue

Computational Intelligence and Neuroscience 13

Table 7 119901 values of features among all clusters

Variables Gender Age Education level Income level Computer using frequency Computer using time Website browsing time119901 value lt0002 lt0002 lt0002 lt0002 lt0002 lt0002 lt0002

Table 8: p values of features between each pair of clusters. Values marked with * are greater than 0.05, meaning the two distributions could be seen as identical.

Pair of segments           1-2      1-3      1-4      1-5      1-6      1-7      2-3
Gender                     >0.5*    >0.5*    >0.5*    >0.5*    <0.002   <0.002   >0.5*
Age                        <0.005   <0.002   >0.05*   <0.05    >0.2*    >0.05*   <0.002
Education level            <0.002   >0.5*    <0.002   <0.002   >0.05*   >0.5*    <0.002
Income level               <0.002   >0.2*    >0.1*    >0.5*    >0.05*   <0.002   <0.002
Computer using frequency   >0.2*    <0.002   <0.005   <0.002   <0.002   >0.1*    <0.005
Computer using time        <0.002   <0.002   <0.002   <0.002   <0.002   <0.002   >0.5*
Website browsing time      <0.002   <0.05    <0.01    <0.002   <0.05    <0.002   >0.1*

Pair of segments           2-4      2-5      2-6      2-7      3-4      3-5      3-6
Gender                     >0.5*    >0.5*    <0.002   <0.002   >0.5*    >0.5*    <0.002
Age                        <0.002   <0.002   <0.002   <0.002   <0.002   <0.002   <0.002
Education level            <0.002   >0.5*    <0.002   <0.002   <0.002   <0.002   <0.005
Income level               <0.002   <0.002   <0.002   <0.002   <0.005   >0.1*    <0.002
Computer using frequency   <0.05    <0.005   <0.002   >0.5*    >0.2*    >0.2*    >0.5*
Computer using time        <0.01    >0.2*    >0.1*    >0.5*    <0.002   >0.1*    <0.05
Website browsing time      >0.2*    >0.05*   >0.05*   >0.5*    >0.5*    <0.002   >0.5*

Pair of segments           3-7      4-5      4-6      4-7      5-6      5-7      6-7
Gender                     <0.002   >0.5*    <0.002   <0.002   <0.002   <0.002   <0.002
Age                        <0.002   >0.5*    >0.1*    >0.5*    <0.02    >0.2*    >0.1*
Education level            >0.2*    <0.002   <0.002   <0.002   <0.002   <0.002   <0.002
Income level               <0.002   <0.05    >0.5*    <0.02    <0.002   <0.002   <0.02
Computer using frequency   <0.002   >0.5*    >0.2*    <0.005   >0.5*    <0.002   <0.002
Computer using time        >0.5*    <0.01    >0.05*   <0.002   >0.2*    >0.1*    <0.05
Website browsing time      <0.005   <0.002   >0.2*    <0.02    <0.002   <0.05    <0.002

Category 3 (high-income group). This group of people consists entirely of middle-aged men. The most significant feature of the people in this group is the highest income they earn. Besides, they spend relatively less time on computer interaction, in terms of both using frequency and total browsing time. We guess that, for the middle-aged men in this group, most of whom have not received higher education, computers and the Internet are not so necessary in their daily life.

Category 4 (low-education group). This group is entirely composed of young men, who are older than those in Categories 1 and 2. The most significant feature of the people in this group is their low education level, the average of which is senior school, ranging from junior school to junior college. Moreover, they earn a medium-level income and get smaller values on every computer interaction index. We guess that this group of people is mainly engaged in jobs independent of computers.

Category 5 (much-browsing group). The structure of this group is very similar to that of Category 4, except for the higher education its members received, namely, bachelor degree. As the tables show, people in this group earn more; we guess that the education difference may account for this. Also, compared with other categories, especially Category 4, this group of people spends much more time browsing websites. We guess that the main job type of this group could be intellectual work, so they have close access to online computers.

Category 6 (young-women group). Female accounts for nearly 100% of this group, which is the only such case among these 7 categories. However, from computer interaction aspects, say, using frequency and browsing time, this group is very similar to Category 4, so we guess that these two groups of people have similar types of jobs or similar working circumstances. Moreover, although these young women have a higher education level than the men in Category 4, they do not earn a better salary. We guess that this phenomenon may be due to the lack of career experience and gender discrimination.

Table 9: Demographic and behaviour description of computer user segments.

Characteristic                  Seg. 1   Seg. 2   Seg. 3   Seg. 4   Seg. 5   Seg. 6   Seg. 7   Total
Gender (%)
  Male                          100      100      100      100      100      0.7      68.1     71.8
  Female                        0        0        0        0        0        99.3     31.9     28.2
Age (%)
  10~20                         0        0        0        0        0        2.2      12.0     4.0
  20~25                         4.2      68.6     6.8      14.3     6.4      12.5     17.6     14.6
  25~30                         62.5     22.9     3.4      24.2     34.1     41.9     16.2     27.2
  30~35                         33.3     8.6      13.8     28.6     27.6     21.3     16.2     21.2
  35~40                         0        0        15.5     22.9     16.2     14.7     10.2     13.4
  40~50                         0        0        37.9     10.0     13.5     5.9      18.1     14.0
  50~60                         0        0        20.6     0        2.2      1.4      8.3      5.0
  60~70                         0        0        1.7      0        0        0        1.4      0.6
Education level (%)
  Below primary school          0        0        0        0        0        0        1.4      0.4
  Junior school                 0        0        1.7      2.9      0        2.2      14.4     5.1
  Senior school                 0        0        25.9     74.3     0.5      19.9     26.9     21.1
  Junior college                100      20.0     55.1     22.9     26.5     33.1     19.9     29.8
  Bachelor degree               0        71.4     17.2     0        73.0     41.2     28.7     39.8
  Others                        0        8.6      0        0        0        3.7      8.8      3.7
Income level (%)
  No income                     0        91.4     0        0        0        5.1      24.1     12.6
  Below 500 Yuan                0        5.7      0        0        0        0.7      3.2      1.4
  501-1000 Yuan                 0        2.9      1.7      0        1.6      2.9      4.6      2.6
  1001-1500 Yuan                0        0        5.2      11.4     4.3      5.9      9.7      6.6
  1501-2000 Yuan                12.5     0        10.3     22.9     10.8     17.6     8.3      12.0
  2001-3000 Yuan                29.2     0        24.1     24.3     28.6     32.4     16.2     12.5
  3001-5000 Yuan                45.8     0        20.7     32.9     39.5     23.5     15.7     25.6
  5001-8000 Yuan                12.5     0        17.2     8.5      11.9     6.6      8.3      9.4
  8001-12000 Yuan               0        0        12.1     0        3.2      4.4      2.8      3.4
  Others                        0        0        0        0        0        0.7      6.9      2.9
Computer using frequency (times/week)
  Mean                          6.7      6.5      5.7      6.0      5.9      5.87     6.9      6.3
  Variance                      0.04     0.20     0.33     0.18     0.25     0.40     1.4      0.68
Computer using time (hours/week)
  Mean                          67.5     44.7     44.7     33.0     39.4     37.8     44.8     41.7
  Variance                      0.23     0.88     1.31     0.31     0.70     0.83     1.31     0.99
Website browsing time (hours/week)
  Mean                          0.64     4.43     2.88     3.12     6.7      3.1      6.3      4.95
  Variance                      0.43     0.89     0.89     0.80     0.83     0.96     1.01     0.99
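
Profile tables in the style of Table 9 are plain per-segment summaries: percentage distributions for the categorical variables and mean/variance pairs for the behavioural ones. A rough sketch, again under the assumption of a pandas DataFrame with a "segment" column and illustrative feature names, could look as follows.

    # Sketch: per-segment profiles in the style of Table 9 (names are ours).
    import pandas as pd

    def segment_profile(users: pd.DataFrame) -> None:
        """Print the distributions and moments summarized in Table 9."""
        for col in ["gender", "age_band", "edu_level", "income_level"]:
            # percentage distribution of a categorical variable per segment
            dist = (users.groupby("segment")[col]
                         .value_counts(normalize=True)
                         .mul(100).round(1)
                         .unstack(fill_value=0))
            print(dist)
        for col in ["use_frequency", "use_time", "browse_time"]:
            # mean and variance of a behavioural variable per segment
            print(users.groupby("segment")[col].agg(["mean", "var"]))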

Category 7 (noise group). This category is the only gender-mixed group, in which men are twice as many as women. However, in terms of age, education level, and income level, this category shows no significant difference compared with the total population. And as for the variables of computer using frequency, computer using time, and website browsing time, their variances are fairly large, even bigger than the overall variances. So, due to the dispersed distribution of this category on every dimension, we believe that it is a noise group.

4. Conclusion

In this paper, we proposed a new clustering algorithm named the localized ambient solidity separation (LASS) algorithm. This algorithm is built on a new isolation criterion called centroid distance, which is used to detect the nonhomogeneous density distribution of a given cluster. The proposed isolation criterion is based on the recognition that if there exist nonhomogeneous densities within a cluster, then partitions should be carried out. The intuition behind this recognition is the GMM assumption on the points' centroid distance values in a cluster; the EM algorithm was used to derive the components and parameters of the GMM. Additionally, in order to make the algorithm more efficient, we designed a nonhomogeneous density detection algorithm that reduces the computation complexity to O(n), where n is the number of points for clustering. Moreover, the parameter determination policy of the nonhomogeneous density detection algorithm was investigated. Finally, we integrated our designed nonhomogeneous density detection algorithm, as a follow-up mechanism, with the original dissimilarity increments clustering method and developed the LASS algorithm. It is demonstrated that, compared with the original dissimilarity increments clustering method, our LASS algorithm not only can identify naturally isolated clusters but also can identify clusters which are adjacent, overlapping, and under background noise.
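
To make the mechanism concrete, the following sketch reimplements the detection-and-split step in a simplified form, as we understand it: assign each point a centroid distance value (here assumed to be the distance from the point to the centroid of its n nearest neighbours), build the histogram of these values, and split the cluster at a sufficiently deep valley. This is a brute-force illustration with invented parameter names, not the paper's O(n) implementation.

    # Sketch: centroid distance based split of one cluster (illustrative only;
    # brute force O(n^2), whereas the paper reports an O(n) detection step).
    import numpy as np

    def centroid_distances(X: np.ndarray, n_neighbors: int = 10) -> np.ndarray:
        """Distance from each point to the centroid of its n nearest neighbours."""
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        idx = np.argsort(d, axis=1)[:, 1:n_neighbors + 1]  # skip the point itself
        return np.linalg.norm(X - X[idx].mean(axis=1), axis=1)

    def split_by_valley(X: np.ndarray, n_neighbors: int = 10,
                        bins: int = 30, min_depth: float = 0.5):
        """Return 0/1 labels if the histogram shows a deep valley, else None."""
        cd = centroid_distances(X, n_neighbors)
        counts, edges = np.histogram(cd, bins=bins)
        best = None
        for i in range(1, bins - 1):  # interior bins only
            left, right = counts[:i].max(), counts[i + 1:].max()
            if counts[i] < left and counts[i] < right:
                depth = min(left, right) - counts[i]
                if best is None or depth > best[1]:
                    best = (i, depth)
        if best is None or best[1] < min_depth * counts.max():
            return None  # no convincing nonhomogeneous density found
        threshold = (edges[best[0]] + edges[best[0] + 1]) / 2
        return (cd > threshold).astype(int)  # two subclusters by density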

Additionally, in order to evaluate the performance of the LASS algorithm in practice, we applied it to the computer user dataset, which contains 1000 computer users' demographic and behaviour information, and compared the outcome with the result obtained from the original dissimilarity increments clustering method. The segmentation results show that one of the clusters generated by the dissimilarity increments clustering method is further divided into two subclusters by our LASS algorithm. The comparison of total SDB and silhouette coefficient validates the rationality of this further partition. The discussion and analysis of the segmentation results prove that our LASS algorithm can gain more knowledge and understanding from a dataset with high dimensionality and diverse distribution shapes, like the computer user dataset.
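
Both validation figures are straightforward to recompute. The sketch below follows the paper's 1-norm dissimilarity of formula (8) and the total SDB definition of formula (9); the use of scikit-learn with a Manhattan metric for the silhouette coefficient is our choice, since the paper does not name a library.

    # Sketch: total SDB (formula (9)) plus silhouette validation.
    import numpy as np
    from sklearn.metrics import silhouette_score

    def dissimilarity(a: np.ndarray, b: np.ndarray) -> float:
        """1-norm dissimilarity between two feature vectors (formula (8))."""
        return float(np.abs(a - b).sum())

    def total_sdb(X: np.ndarray, labels: np.ndarray) -> float:
        """Sum over clusters of n_i * Dissimilarity(c_i, c), per formula (9)."""
        overall = X.mean(axis=0)
        return sum(len(X[labels == k])
                   * dissimilarity(X[labels == k].mean(axis=0), overall)
                   for k in np.unique(labels))

    # Usage on a normalized feature matrix X with cluster labels:
    #   print(total_sdb(X, labels))
    #   print(silhouette_score(X, labels, metric="manhattan"))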

There are some future directions to explore from this paper. First, the GMM assumption on centroid distance values can be further investigated and tested against more distributions, such as a single Gaussian or an exponential distribution. Second, our proposed centroid distance isolation criterion could be integrated with other traditional clustering methods, either partitional or hierarchical, and more strengths and weaknesses could be pointed out and analysed. Third, the centroid distance based clustering strategy in our LASS algorithm relies on the histogram distribution of centroid distance values; therefore, if the number of points in one cluster is too small, this clustering strategy may no longer work effectively. This drawback should be given enough attention and further investigated.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to acknowledge the dataset and technical support for their work provided by NELECT (National Engineering Laboratory for E-Commerce Technology, Tsinghua University) and DNSLAB of China Internet Network Information Center.

References

[1] E. Miller, "Community cleverness required," Nature, vol. 455, no. 7209, p. 1, 2008.

[2] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, 2005.

[3] O. Chapelle, B. Scholkopf, and A. Zien, Semi-Supervised Learning, MIT Press, Cambridge, Mass, USA, 2006.

[4] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527-1554, 2006.

[5] V. Jakkula and D. J. Cook, "Anomaly detection using temporal data mining in a smart home environment," Methods of Information in Medicine, vol. 47, no. 1, pp. 70-75, 2008.

[6] Q.-Y. Tang and C.-X. Zhang, "Data processing system (DPS) software with experimental design, statistical analysis and data mining developed for use in entomological research," Insect Science, vol. 20, no. 2, pp. 254-260, 2013.

[7] M. A. Musen, B. Middleton, and R. A. Greenes, "Clinical decision-support systems," in Biomedical Informatics, pp. 643-674, Springer, London, UK, 2014.

[8] L. Hubert, "Approximate evaluation techniques for the single-link and complete-link hierarchical clustering procedures," Journal of the American Statistical Association, vol. 69, no. 347, pp. 698-704, 1974.

[9] E. Diday and J. C. Simon, "Clustering analysis," in Digital Pattern Recognition, vol. 10 of Communication and Cybernetics, pp. 47-94, Springer, Berlin, Germany, 1980.

[10] S. Nassar, J. Sander, and C. Cheng, "Incremental and effective data summarization for dynamic hierarchical clustering," in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '04), pp. 467-478, ACM, June 2004.

[11] K. Tasdemir, "Vector quantization based approximate spectral clustering of large datasets," Pattern Recognition, vol. 45, no. 8, pp. 3034-3044, 2012.

[12] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review," ACM Computing Surveys, vol. 31, no. 3, pp. 264-323, 1999.

[13] A. K. Jain, "Data clustering: 50 years beyond K-means," Pattern Recognition Letters, vol. 31, no. 8, pp. 651-666, 2010.

[14] N. R. Pal and S. K. Pal, "A review on image segmentation techniques," Pattern Recognition, vol. 26, no. 9, pp. 1277-1294, 1993.

[15] M. Wedel, Market Segmentation: Conceptual and Methodological Foundations, Springer, 2000.

[16] R.-S. Wu and P.-H. Chou, "Customer segmentation of multiple category data in e-commerce using a soft-clustering approach," Electronic Commerce Research and Applications, vol. 10, no. 3, pp. 331-341, 2011.

[17] M. C. Onwezen, M. J. Reinders, I. A. van der Lans et al., "A cross-national consumer segmentation based on food benefits: the link with consumption situations and food perceptions," Food Quality and Preference, vol. 24, no. 2, pp. 276-286, 2012.

[18] F. Westad, M. Hersleth, and P. Lea, "Strategies for consumer segmentation with applications on preference data," Food Quality and Preference, vol. 15, no. 7-8, pp. 681-687, 2004.

[19] J. Macharia, R. Collins, and T. Sun, "Value-based consumer segmentation: the key to sustainable agri-food chains," British Food Journal, vol. 115, no. 9, pp. 1313-1328, 2013.

[20] T.-C. Hsieh and C. Yang, "Multi-level latent class analysis of internet use pattern in Taiwan," in e-Technologies and Networks for Development, vol. 171 of Communications in Computer and Information Science, pp. 197-208, Springer, Berlin, Germany, 2011.

[21] Z. Bosnjak and O. Grljevic, "Credit users segmentation for improved customer relationship management in banking," in Proceedings of the 6th IEEE International Symposium on Applied Computational Intelligence and Informatics (SACI '11), pp. 379-384, May 2011.

[22] E. Martinez-Garcia and M. Royo-Vela, "Segmentation of low-cost flights users at secondary airports," Journal of Air Transport Management, vol. 16, no. 4, pp. 234-237, 2010.

[23] M. Bichis-Lupas and R. N. Moisey, "A benefit segmentation of rail-trail users: implications for marketing by local communities," Journal of Park and Recreation Administration, vol. 19, no. 3, pp. 78-92, 2001.

[24] A. Bhatnagar and S. Ghose, "A latent class segmentation analysis of e-shoppers," Journal of Business Research, vol. 57, no. 7, pp. 758-767, 2004.

[25] C. Lorenzo-Romero and M.-D. Alarcon-del-Amo, "Segmentation of users of social networking websites," Social Behavior and Personality, vol. 40, no. 3, pp. 401-414, 2012.

[26] http://www.cnnic.net.cn.

[27] A. L. N. Fred and J. M. N. Leitao, "A new cluster isolation criterion based on dissimilarity increments," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 8, pp. 944-958, 2003.

[28] M. C. P. De Souto, D. S. A. De Araujo, I. G. Costa, R. G. F. Soares, T. B. Ludermir, and A. Schliep, "Comparative study on normalization procedures for cluster analysis of gene expression datasets," in Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN '08), pp. 2792-2798, June 2008.

Page 8: Localized Ambient Solidity Separation Algorithm Based Computer … · 2016-05-15 · ResearchArticle Localized Ambient Solidity Separation Algorithm Based Computer User Segmentation

8 Computational Intelligence and Neuroscience

3

2

1

0

minus1

minus2

minus3

Derivative of centroid distance

Num

ber o

f poi

nts

Normalized centroid distance0 10 20 30 40 50

(a)

Derivative of centroid distance

3

2

1

0

minus1

minus2

minus3

Num

ber o

f poi

nts

Normalized centroid distance0 5 10 15 20 25 30 35 40

(b)

Figure 7 Centroid distance derivative of two heterogeneous clusters

x

2rr0minus2r minusr

(a)

y

r

radic2r

(b) (c)

Figure 8 Uniformly distributed points in one- two- and three-dimensional space

7

6

5

4

3

220 25 30 35 40 45 50

Figure 9 Result generated by our LASS algorithm

actually it is evenmore sensitive as dimension increases Alsocompared with direct similarity centroid distance isolationcriterion takes into account the surrounding context of eachpoint by using its nrsquos nearest points and depends on thehistogram distribution instead of the exact absolute value ofsimilarity So it can automatically scale according to the den-sity of points All in all integrated dissimilarity incrementsand centroid distance isolation criteria together our LASSalgorithm can achieve broader applicability especially on thedataset with high dimension and diverse distribution shape

3 Computer User Segmentation

In this section our proposed LASS algorithm is applied oncomputer users dataset which contains their demographicand behaviour information To accomplish this we first

Computational Intelligence and Neuroscience 9

cleaned the raw data and extracted 7 features to characterizecomputer users Then the cleaned data is normalized and adissimilarity measurement is defined On the basis of thesethe original dissimilarity increments clustering algorithmand our LASS algorithm are applied on the dataset respec-tivelyThe clustering processes are analysed and the effective-ness of results is verified At last the segmentation result ofcomputer users is analysed and summarized

31 Data Cleaning and Features Selection The raw dataprovided by CNNIC contains two kinds of informationTheyare 1000 computer usersrsquo personal attributes and their com-puter using log files Specifically personal attributes includea volunteerrsquos gender birthday education level job typeincome level province of residence city of residence andtype of residence while computer using log files record these1000 volunteersrsquo computer interaction behaviours in 7 daysincluding start time end time websites browsing history andprograms opening history

Although many features could be extracted from rawdata we focus our attention on volunteersrsquo natural attributesas persons and their fundamental behavioursrsquo statistical indi-cators but ignore environmental and geographic factors suchas job type province of residence city of residence and resi-dence type The reason behind this is that we regard Internetas a strength which has broken down geographic barrierTherefore we assume that environmental and geographic fac-tors are no longer crucial influence factors in Internet worldFrom this point of view we extracted 7 features to profilecomputer users Taking the 119894th computer user 119906

119894as a concrete

example these extracted features are described inTable 2Thedata of volunteers whose value of Times(sdot) is less than 4 arecleared out and 775 sample data are left

32 Data Normalization and Dissimilarity MeasurementData normalization is needed before applying our LASS algo-rithm The reason to do so is that similarity measurement isusually sensitive to differences inmean and variability In thispaper two kinds of normalization are used as expressed informulas (6) and (7) respectively In formula (6) 119898

119895and 119904119895

are the mean and standard deviation of feature 119895 Throughthis transformation feature 119895 will have zero mean and unitvariance While in formula (7) function Rank(sdot) returns theranked number of119909lowast

119894119895in feature 119895 data sequenceTherefore the

transformed data will have a mean of (119899+1)2 and a varianceof (119899 + 1)[(2119899 + 1)6 minus (119899 + 1)4] where 119899 is the number ofdata Related study has shown that on the performance ofclustering formula (7) outperforms formula (6) particularlyin hierarchical clusteringmethods formula (7) ismore robustto outliers and noise in dataset [28]

119909119894119895=119909lowast

119894119895minus 119898119895

119904119895

(6)

119909119894119895= Rank (119909lowast

119894119895) (7)

In this paper for continuous variablersquos normalizationsuch as bootDuration(sdot) and visitingDuration(sdot) formulas (7)

Table 2 Description of computer users features

Variables Descriptions

Gender (119906119894)

The gender of 119906119894 discrete variable

1 stands for male0 stands for female

Age (119906119894) The age of 119906

119894 discrete variable between 10

and 70

Edu (119906119894)

The education level of 119906119894 discrete variable

0 below primary school1 junior school2 senior school3 junior college4 bachelor degree5 others

Income (119906119894)

The monthly income level of 119906119894 discrete

variable0 no income1 below 500 Yuan2 501ndash1000 Yuan3 1001ndash1500 Yuan4 1501ndash2000 Yuan5 2001ndash3000 Yuan6 3001ndash5000 Yuan7 5001ndash8000 Yuan8 8001ndash12000 Yuan9 others

Times (119906119894) Boot times of 119906

119894rsquos computer discrete

variableBooting Duration(119906119894)

The duration of 119906119894using computer

continuous variable

Brows Duration (119906119894) The duration of 119906

119894browsing websites

continuous variable

and (6) are used successively while for discrete variablersquosnormalization such as Gender(sdot) Age(sdot) and Edu(sdot) onlyformula (6) is used

After normalization a dissimilarity index is defined tomeasure the distance between different data As formula (8)shows it is a form of 1-normsrsquo sum where 119891

119894119899stands for the

value of 119894th datarsquos 119899th feature

Dissimilarity (119906119894 119906119895) =

7sum

119899=1

10038161003816100381610038161003816119891119894119899minus119891119895119899

10038161003816100381610038161003816 (8)

33 Computer Users Segmentation Process Our proposedLASS algorithm is applied for the segmentation of computerusers in this sectionThewhole segmentation process consistsof two parts Part I is the dissimilarity increments basedclustering strategy (for details please refer to Section 3 in[27]) which aims to find natural isolated clusters part II isour proposed centroid distance based clustering strategy (fordetails please refer to Section 23 in this paper) whose goalis to explore the internal structure of every cluster generatedby part I and identify potential subclusters that are adjacentoverlapping and under background noise

The clustering process is partly shown in Figure 10 wherethree representative clusters obtained in part I strategy arechosen to be demonstrated Further exploration is carried

10 Computational Intelligence and Neuroscience

Normalized centroid distance

15

10

5

0

Histogram of centroid distance

Num

ber o

f poi

nts

20 22 24 26 28 30 32 34

(a)

Normalized centroid distance

Histogram of centroid distance

Num

ber o

f poi

nts

20 25 30 35 40 45 50 55

14

12

10

8

6

4

2

0

(b)

Normalized centroid distance

Histogram of centroid distance

Num

ber o

f poi

nts

6

4

2

05 6 7 8 9 10

16

14

12

10

8

(c)

Figure 10 Centroid distance histogram of three clusters

out by part II strategy of LASS algorithm and a partitionvalley is found in cluster 2 as shown in Figure 10(b) Nextthe horizontal axis value of the lowest point on this valley canbe acquired as a further isolation criterion based on whichcluster 2 will be divided into two subclusters Figure 11 showsa comparison of the GMM generated by EM algorithm andcentroid distance distribution curve of cluster 2 Despite thedifferences between these two graphsrsquo shapes the acquiredtwo isolation criteria are nearly the same which validates oursimplification of GMMrsquos computation

34 Segmentation Results Analysis and Discussion The seg-mentation results generated by the original dissimilarityincrements method and our LASS algorithm are demon-strated in Tables 3 and 4 These two tables list the prototypessummarized from the obtained clusters As it is shown thesixth cluster in Table 3 is divided into two subclusters thesixth and seventh cluster in Table 4The reason of this furtherpartition as analyzed in Section 33 is the existence of a deepenough valley on cluster 6rsquos centroid distribution curve (asshown in Figure 10(b)) which implies the existence of twodifferent density areas within cluster 6 in Table 3

To understand this process some investigation shouldbe made about the relationship between Tables 3 and 4 InTable 3 cluster 6 is the largest group of all clusters whosegender proportion is almost 50 However an intuitive senseof behavior tells us that behavior mode should be seriouslyaffected by peoplersquos gender This intuition is proved by thefirst 5 clusters in Table 3 to some extent in which genderproportion is 100 male The reason why cluster 6 has not

Figure 11: Comparison of GMM and centroid distance distribution curve (curves shown: centroid distance, Gaussian distribution 1, Gaussian distribution 2).


Table 3: Results generated by dissimilarity increments clustering method.

| Segment | Size | Gender | Age | Education level | Income level | Computer using frequency (times/week) | Computer using time (hours/week) | Website browsing time (hours/week) |
| 1 | 24 | Male 100%, Female 0% | 28 | Junior college | 2001–5000 Yuan | 6.7 | 67.5 | 0.6 |
| 2 | 35 | Male 100%, Female 0% | 24 | Bachelor degree | 0–500 Yuan | 6.5 | 44.8 | 4.4 |
| 3 | 58 | Male 100%, Female 0% | 41 | Junior college | 3001–5000 Yuan | 5.7 | 44.7 | 2.9 |
| 4 | 70 | Male 100%, Female 0% | 32 | Senior school | 2001–3000 Yuan | 6 | 33 | 3.1 |
| 5 | 185 | Male 100%, Female 0% | 32 | Bachelor degree | 2001–5000 Yuan | 5.9 | 39 | 6.7 |
| 6 | 352 | Male 42%, Female 58% | 32 | Junior college | 1501–3000 Yuan | 6.5 | 42.1 | 5.1 |

Table 4: Results generated by our LASS algorithm.

| Segment | Size | Gender | Age | Education level | Income level | Computer using frequency (times/week) | Computer using time (hours/week) | Website browsing time (hours/week) |
| 1 | 24 | Male 100%, Female 0% | 28 | Junior college | 2001–5000 Yuan | 6.7 | 67.5 | 0.6 |
| 2 | 35 | Male 100%, Female 0% | 24 | Bachelor degree | 0–500 Yuan | 6.5 | 44.8 | 4.4 |
| 3 | 58 | Male 100%, Female 0% | 41 | Junior college | 3001–5000 Yuan | 5.7 | 44.7 | 2.9 |
| 4 | 70 | Male 100%, Female 0% | 32 | Senior school to junior college | 2001–3000 Yuan | 6 | 33 | 3.1 |
| 5 | 185 | Male 100%, Female 0% | 32 | Bachelor degree | 2001–5000 Yuan | 5.9 | 39 | 6.7 |
| 6 | 136 | Male 0.7%, Female 99.3% | 30 | Junior college to bachelor degree | 1501–3000 Yuan | 5.9 | 37.8 | 3.1 |
| 7 | 216 | Male 68.1%, Female 31.9% | 33 | Junior college | 1001–2000 Yuan | 6.9 | 44.8 | 6.3 |

The reason why cluster 6 has not been divided further apart by the dissimilarity increments clustering method is that there may exist many touching areas in the high-dimensional space of cluster 6, in which situation the dissimilarity increments clustering method does not work anymore. However, our proposed centroid distance based nonhomogeneous density detection algorithm has found that there still exist two potential subgroups within cluster 6 in Table 3, which are identified as clusters 6 and 7 in Table 4. These two clusters are different in gender, age, and computer using behaviors: cluster 6 is almost totally composed of women, who spend less time on computer and website browsing, while in cluster 7 men are twice as many as women, who are older than people in cluster 6 and spend much more time on computers, especially on browsing.

In order to quantify the overall effectiveness of our LASS algorithm, a between-group sum of dissimilarities (SDB) is calculated as formula (9), which is the sum of the dissimilarities between each cluster centroid $c_i$ and the overall centroid $c$ of all the data. In this formula, $K$ is the number of clusters and $n_i$ is the number of points in cluster $i$:

$$\text{Total SDB} = \sum_{i=1}^{K} n_i \,\text{Dissimilarity}(c_i, c). \quad (9)$$

The higher the total SDB achieved, the more disjoint the identified clusters are, so it can be used to measure the effectiveness of a clustering method. The total SDB values of the original dissimilarity increments clustering method and of our LASS algorithm on the given dataset are shown in Table 5. Our LASS algorithm achieves a larger total SDB, more specifically 30% larger, and thus fits the given computer user dataset better.

Table 5: Total SDB of two clustering methods.

| Method | Dissimilarity increments clustering method | Our LASS algorithm |
| Total SDB | 853 | 1109 |
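Formula (9) is straightforward to compute once cluster labels are available. The sketch below takes centroids as feature-wise means of the normalized data and reuses the 1-norm dissimilarity of formula (8); this is our reading of the setup rather than a stated implementation detail.

```python
import numpy as np

def total_sdb(X, labels):
    """Between-group sum of dissimilarities (formula (9)):
    sum over clusters of n_i * Dissimilarity(c_i, c)."""
    overall = X.mean(axis=0)                      # overall centroid c
    sdb = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        centroid = members.mean(axis=0)           # cluster centroid c_i
        sdb += len(members) * np.abs(centroid - overall).sum()
    return sdb

# Synthetic example
rng = np.random.default_rng(2)
X = rng.random((100, 7))
labels = rng.integers(0, 3, 100)
print(total_sdb(X, labels))
```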

In terms of the evaluation of individual clusters, the silhouette coefficient is used here, whose value varies between −1 and 1; a positive value of the silhouette coefficient is desirable.
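Since formula (8) is a plain 1-norm, the per-cluster silhouette values reported in Table 6 can be obtained from a precomputed distance matrix. The scikit-learn call below is one possible realization, not necessarily the authors' tooling; the data and labels are synthetic placeholders.

```python
import numpy as np
from sklearn.metrics import silhouette_samples
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(3)
X = rng.random((200, 7))                      # normalized features
labels = rng.integers(0, 3, 200)

D = squareform(pdist(X, metric="cityblock"))  # formula (8) distances
s = silhouette_samples(D, labels, metric="precomputed")

# mean silhouette per cluster, as reported in Table 6
for k in np.unique(labels):
    print(k, s[labels == k].mean())
```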


Table 6: The silhouette coefficients of clusters.

| Clusters | Cluster 6 in Table 3 | Cluster 6 in Table 4 | Cluster 7 in Table 4 |
| Silhouette coefficient | −0.34 | 0.02 | −0.41 |

As Table 6 shows, the silhouette coefficient of cluster 6 in Table 3 is negative, which implies that the internal cohesion and external separation of this cluster are not good, so cluster 6 in Table 3 cannot be seen as a typical cluster. Through our LASS algorithm, cluster 6 in Table 3 is identified as two individual clusters, one of whose silhouette coefficients is positive. As for cluster 7, whose silhouette coefficient is still negative, we guess that it belongs to some kind of background noise; this will be discussed later. As for cluster 6 in Table 4, we believe that it is a typical prototype of Chinese female computer users, which has not been revealed in Table 3. Therefore, compared with the original dissimilarity increments clustering method, our LASS algorithm can gain more knowledge and understanding from the computer user dataset.


Further, the Kruskal-Wallis H test is applied on the clusters in Table 4 to test the difference between two or more clusters on a given dimension. As a nonparametric test method, the Kruskal-Wallis H test is typically used to determine whether there are statistically significant differences between two or more groups of an independent variable. The results are shown in Tables 7 and 8. In the hypothesis tests of Table 7, the original hypothesis is that the distributions of a given variable in all 7 clusters are identical, and the alternative hypothesis is that they are not identical. In the hypothesis tests of Table 8, the original hypothesis is that the distributions of a given variable in a given pair of clusters are identical, and the alternative hypothesis is that they are not identical. The p values are listed and marked by a star if they are bigger than 0.05, which means accepting the original hypothesis and rejecting the alternative one. For the cases in which the p value is below 0.05, the smaller the p value is, the more statistically significant the variable's difference is. In Table 7, all of the p values are below 0.002, which means that, for any given variable, its distributions are extremely different among the seven clusters in Table 4. Therefore, we can draw the conclusion that these seven variables perform well in identifying different groups of computer users. In Table 8, the p value changes a lot according to the given pair of clusters and variable. The significance of these seven variables in distinguishing different pairs of clusters will be discussed one by one, combined with Table 9, which reveals the detailed demographic and computer interaction behaviour characteristics of the obtained seven computer user clusters.
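The tests behind Tables 7 and 8 can be reproduced with SciPy's kruskal, assuming the values of one variable are collected into a separate array per cluster; the groups below are synthetic placeholders.

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(4)
# hypothetical: one array of a variable's values per cluster
groups = [rng.normal(loc, 1.0, 80) for loc in (0.0, 0.2, 1.5)]

stat, p = kruskal(*groups)                 # test across all clusters (Table 7)
print(p)

stat, p = kruskal(groups[0], groups[1])    # pairwise test (Table 8)
print(p)
```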

Segmentation results will be analysed from the perspective of variables, with the help of Table 8 and Tables 9 and 4, and significant characteristics will be pointed out. For the variable of gender, Table 8 tells us that its distributions in the first five segments are identical, which is proved to be 100% male in Table 9. The most significant difference of gender lies among segments 1–5, segment 6, and segment 7, which represent male groups, a female group, and a mixed-gender group, respectively. For the variable of age, Table 8 reveals that its distribution among segments 4–7 can be seen as identical; the main difference happens between the first three segments. Combined with Tables 9 and 4, we can find that segment 2 consists of the youngest members, whose age is around 24. Segment 1 is a slightly older group, whose average age is around 28, while segment 3 is a middle-aged group with an average age of 41, much older than other segments. As for the variable of education level, it discriminates different segments well. Its distribution in segments 2 and 5 can be seen as identical, having the highest education level, bachelor degree, while the people from segment 4 have the lowest education level; other segments differ from one another. For the variable of income level, segment 1 earns the highest income, while segment 2 earns the lowest one. The income levels of segments 3 and 5 can be seen as identical, as can those of segments 4 and 6, and the former two's income is lower than the latter two's. In terms of computer using frequency, the segments can be divided into two groups, segments 1, 2, and 7 and segments 3–6; the former group uses computers more frequently. As for the variable of computer using time, it discriminates segments 1 and 4 well, which spend the most and the least time on computers, respectively, while for the remaining 5 segments no significant difference exists in their computer using time. For the last variable, website browsing time, its distribution in segments 2, 3, 4, and 6 can be seen as identical; the difference mainly lies among segments 1, 5, and 7. Specifically, segment 1 spends the least time on website browsing, segment 5 spends the most, and the browsing time of segment 7 falls between segment 1 and segment 5.

Based on the analysis above, the 7 segments obtained by our LASS algorithm are summarized and discussed below, respectively.

Category 1 (little-browsing group). This group is entirely composed of young men who received a high education level and earn a decent income. The most significant feature of the people in this group is that, although they spend the most time on computers compared with other groups, they seldom visit webpages. We guess that, for this group of people, computer interaction behaviours mainly happen in workplaces or public places, where personal browsing is not encouraged.

Category 2 (little-income group). This group is composed of the youngest people, who are purely male and have the highest education level. The most significant feature of this group of people is that they have the same income level, which is no income. Additionally, they spend relatively more time on computers and browsing websites. We guess that the main body of this group is college students in progress, who have lots of free time but no source of revenue.


Table 7: p values of features among all clusters.

| Variables | Gender | Age | Education level | Income level | Computer using frequency | Computer using time | Website browsing time |
| p value | <0.002 | <0.002 | <0.002 | <0.002 | <0.002 | <0.002 | <0.002 |

Table 8: p values of features between pairs of clusters (values marked * are bigger than 0.05).

| Variables | 1-2 | 1-3 | 1-4 | 1-5 | 1-6 | 1-7 | 2-3 |
| Gender | >0.5* | >0.5* | >0.5* | >0.5* | <0.002 | <0.002 | >0.5* |
| Age | <0.005 | <0.002 | >0.05* | <0.05 | >0.2* | >0.05* | <0.002 |
| Education level | <0.002 | >0.5* | <0.002 | <0.002 | >0.05* | >0.5* | <0.002 |
| Income level | <0.002 | >0.2* | >0.1* | >0.5* | >0.05* | <0.002 | <0.002 |
| Computer using frequency | >0.2* | <0.002 | <0.005 | <0.002 | <0.002 | >0.1* | <0.005 |
| Computer using time | <0.002 | <0.002 | <0.002 | <0.002 | <0.002 | <0.002 | >0.5* |
| Website browsing time | <0.002 | <0.05 | <0.01 | <0.002 | <0.05 | <0.002 | >0.1* |

| Variables | 2-4 | 2-5 | 2-6 | 2-7 | 3-4 | 3-5 | 3-6 |
| Gender | >0.5* | >0.5* | <0.002 | <0.002 | >0.5* | >0.5* | <0.002 |
| Age | <0.002 | <0.002 | <0.002 | <0.002 | <0.002 | <0.002 | <0.002 |
| Education level | <0.002 | >0.5* | <0.002 | <0.002 | <0.002 | <0.002 | <0.005 |
| Income level | <0.002 | <0.002 | <0.002 | <0.002 | <0.005 | >0.1* | <0.002 |
| Computer using frequency | <0.05 | <0.005 | <0.002 | >0.5* | >0.2* | >0.2* | >0.5* |
| Computer using time | <0.01 | >0.2* | >0.1* | >0.5* | <0.002 | >0.1* | <0.05 |
| Website browsing time | >0.2* | >0.05* | >0.05* | >0.5* | >0.5* | <0.002 | >0.5* |

| Variables | 3-7 | 4-5 | 4-6 | 4-7 | 5-6 | 5-7 | 6-7 |
| Gender | <0.002 | >0.5* | <0.002 | <0.002 | <0.002 | <0.002 | <0.002 |
| Age | <0.002 | >0.5* | >0.1* | >0.5* | <0.02 | >0.2* | >0.1* |
| Education level | >0.2* | <0.002 | <0.002 | <0.002 | <0.002 | <0.002 | <0.002 |
| Income level | <0.002 | <0.05 | >0.5* | <0.02 | <0.002 | <0.002 | <0.02 |
| Computer using frequency | <0.002 | >0.5* | >0.2* | <0.005 | >0.5* | <0.002 | <0.002 |
| Computer using time | >0.5* | <0.01 | >0.05* | <0.002 | >0.2* | >0.1* | <0.05 |
| Website browsing time | <0.005 | <0.002 | >0.2* | <0.02 | <0.002 | <0.05 | <0.002 |

Category 3 (high-income group). This group consists entirely of middle-aged men. The most significant feature of the people in this group is the highest income they earn. Besides, they spend relatively less time on computer interaction, in terms of both using frequency and total browsing time. We guess that, for the middle-aged men in this group, most of whom have not received a higher education, computers or the Internet is not so necessary in their daily life.

Category 4 (low-education group). This group is entirely composed of young men, who are older than those in Categories 1 and 2. The most significant feature of the people in this group is their low education level, the average of which is senior school, ranging from junior school to junior college. Moreover, they earn a medium-level income and get smaller values on every computer interaction index. We guess that this group of people is mainly engaged in jobs independent of computers.

Category 5 (much-browsing group). The structure of this group is very similar to Category 4, except for the higher education its members received, say bachelor degree. As shown, people in this group earn more; we guess that the education difference may account for this. Also, compared with other categories, especially Category 4, this group of people spends much more time on browsing websites. We guess that the main job type of this group could be intellectual work; thus they have close access to online computers.

Category 6 (young-women group). Female accounts for nearly 100% in this group, which is the only such case in these 7 categories. However, from computer interaction aspects, say using frequency and browsing time, this group is very similar to Category 4, so we guess that these two groups of people have similar types of job or similar working circumstances. Moreover, although these young women have a higher education level than the men in Category 4, they do not earn a better salary. We guess that this phenomenon may be due to the lack of career experience and gender discrimination.


Table 9: Demographic and behaviour description of computer user segmentations (demographic entries in %).

| Characteristics | Segment 1 | Segment 2 | Segment 3 | Segment 4 | Segment 5 | Segment 6 | Segment 7 | Total |

Gender
| Male | 100 | 100 | 100 | 100 | 100 | 0.7 | 68.1 | 71.8 |
| Female | 0 | 0 | 0 | 0 | 0 | 99.3 | 31.9 | 28.2 |

Age
| 10~20 | 0 | 0 | 0 | 0 | 0 | 2.2 | 12.0 | 4.0 |
| 20~25 | 4.2 | 68.6 | 6.8 | 14.3 | 6.4 | 12.5 | 17.6 | 14.6 |
| 25~30 | 62.5 | 22.9 | 3.4 | 24.2 | 34.1 | 41.9 | 16.2 | 27.2 |
| 30~35 | 33.3 | 8.6 | 13.8 | 28.6 | 27.6 | 21.3 | 16.2 | 21.2 |
| 35~40 | 0 | 0 | 15.5 | 22.9 | 16.2 | 14.7 | 10.2 | 13.4 |
| 40~50 | 0 | 0 | 37.9 | 10.0 | 13.5 | 5.9 | 18.1 | 14.0 |
| 50~60 | 0 | 0 | 20.6 | 0 | 2.2 | 1.4 | 8.3 | 5.0 |
| 60~70 | 0 | 0 | 1.7 | 0 | 0 | 0 | 1.4 | 0.6 |

Education level
| Below primary school | 0 | 0 | 0 | 0 | 0 | 0 | 1.4 | 0.4 |
| Junior school | 0 | 0 | 1.7 | 2.9 | 0 | 2.2 | 14.4 | 5.1 |
| Senior school | 0 | 0 | 25.9 | 74.3 | 0.5 | 19.9 | 26.9 | 21.1 |
| Junior college | 100 | 20.0 | 55.1 | 22.9 | 26.5 | 33.1 | 19.9 | 29.8 |
| Bachelor degree | 0 | 71.4 | 17.2 | 0 | 73.0 | 41.2 | 28.7 | 39.8 |
| Others | 0 | 8.6 | 0 | 0 | 0 | 3.7 | 8.8 | 3.7 |

Income level
| No income | 0 | 91.4 | 0 | 0 | 0 | 5.1 | 24.1 | 12.6 |
| Below 500 Yuan | 0 | 5.7 | 0 | 0 | 0 | 0.7 | 3.2 | 1.4 |
| 501–1000 Yuan | 0 | 2.9 | 1.7 | 0 | 1.6 | 2.9 | 4.6 | 2.6 |
| 1001–1500 Yuan | 0 | 0 | 5.2 | 11.4 | 4.3 | 5.9 | 9.7 | 6.6 |
| 1501–2000 Yuan | 12.5 | 0 | 10.3 | 22.9 | 10.8 | 17.6 | 8.3 | 12.0 |
| 2001–3000 Yuan | 29.2 | 0 | 24.1 | 24.3 | 28.6 | 32.4 | 16.2 | 12.5 |
| 3001–5000 Yuan | 45.8 | 0 | 20.7 | 32.9 | 39.5 | 23.5 | 15.7 | 25.6 |
| 5001–8000 Yuan | 12.5 | 0 | 17.2 | 8.5 | 11.9 | 6.6 | 8.3 | 9.4 |
| 8001–12000 Yuan | 0 | 0 | 12.1 | 0 | 3.2 | 4.4 | 2.8 | 3.4 |
| Others | 0 | 0 | 0 | 0 | 0 | 0.7 | 6.9 | 2.9 |

Computer using frequency (times/week)
| Mean | 6.7 | 6.5 | 5.7 | 6.0 | 5.9 | 5.87 | 6.9 | 6.3 |
| Variance | 0.04 | 0.20 | 0.33 | 0.18 | 0.25 | 0.40 | 1.4 | 0.68 |

Computer using time (hours/week)
| Mean | 67.5 | 44.7 | 44.7 | 33.0 | 39.4 | 37.8 | 44.8 | 41.7 |
| Variance | 0.23 | 0.88 | 1.31 | 0.31 | 0.70 | 0.83 | 1.31 | 0.99 |

Website browsing time (hours/week)
| Mean | 0.64 | 4.43 | 2.88 | 3.12 | 6.7 | 3.1 | 6.3 | 4.95 |
| Variance | 0.43 | 0.89 | 0.89 | 0.80 | 0.83 | 0.96 | 1.01 | 0.99 |


Category 7 (noise group). This category is the only mixed-gender group, in which men are twice as many as women.


However, in terms of age, education level, and income level, this category shows no significant difference compared with the total population. As for the variables of computer using frequency, computer using time, and website browsing time, their variances are fairly large, even bigger than the overall variances. So, due to the dispersed distribution of this category on every dimension, we believe that it is a noise group.

4. Conclusion

In this paper, we proposed a new clustering algorithm named the localized ambient solidity separation (LASS) algorithm. This algorithm is built on a new isolation criterion called centroid distance, which is used to detect the nonhomogeneous density distribution of a given cluster. The proposed isolation criterion is based on the recognition that, if there exist nonhomogeneous densities within a cluster, then partitions should be carried out. The intuition behind this recognition is the GMM assumption of the points' centroid distance values in a cluster; the EM algorithm was used to derive the components and parameters of a GMM. Additionally, in order to make the algorithm more efficient, we designed a nonhomogeneous density detection algorithm to reduce the computation complexity to O(n), where n is the number of points for clustering. Moreover, the parameter determination policy of the nonhomogeneous density detection algorithm is investigated. Finally, we integrated our designed nonhomogeneous density detection algorithm as a follow-up mechanism with the original dissimilarity increments clustering method and developed the LASS algorithm. It is demonstrated that, compared with the original dissimilarity increments clustering method, our LASS algorithm not only can identify naturally isolated clusters but also can identify clusters which are adjacent, overlapping, and under background noise.

Additionally, in order to evaluate the performance of the LASS algorithm in practice, we applied it on the computer user dataset, which contains 1000 computer users' demographic and behaviour information, comparing with the result obtained from the original dissimilarity increments clustering method. The segmentation results show that one of the clusters generated by the dissimilarity increments clustering method is further divided into two subclusters by our LASS algorithm. The comparison of total SDB and silhouette coefficients validates the rationality of this further partition. The discussion and analysis of the segmentation results prove that our LASS algorithm can gain more knowledge and understanding from datasets with high dimensionality and diverse distribution shapes, like the computer user dataset.

There are some future directions to explore from this paper. First, the GMM assumption on centroid distance values can be further investigated and tested against more distributions, such as Gaussian and exponential. Second, our proposed centroid distance isolation criterion could be integrated with other traditional clustering methods, either partitional or hierarchical, and more strengths and weaknesses could be pointed out and analysed. Third, the centroid distance based clustering strategy in our LASS algorithm relies on the histogram distribution of centroid distance values; therefore, if the number of points in one cluster is too small, this clustering strategy may not work effectively anymore. This drawback should be given enough attention and further investigated.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to acknowledge the dataset and technical support for their work provided by NELECT (National Engineering Laboratory for E-Commerce Technology, Tsinghua University) and DNSLAB of China Internet Network Information Center.

References

[1] E. Miller, "Community cleverness required," Nature, vol. 455, no. 7209, p. 1, 2008.

[2] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, 2005.

[3] O. Chapelle, B. Scholkopf, and A. Zien, Semi-Supervised Learning, MIT Press, Cambridge, Mass, USA, 2006.

[4] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.

[5] V. Jakkula and D. J. Cook, "Anomaly detection using temporal data mining in a smart home environment," Methods of Information in Medicine, vol. 47, no. 1, pp. 70–75, 2008.

[6] Q.-Y. Tang and C.-X. Zhang, "Data processing system (DPS) software with experimental design, statistical analysis and data mining developed for use in entomological research," Insect Science, vol. 20, no. 2, pp. 254–260, 2013.

[7] M. A. Musen, B. Middleton, and R. A. Greenes, "Clinical decision-support systems," in Biomedical Informatics, pp. 643–674, Springer, London, UK, 2014.

[8] L. Hubert, "Approximate evaluation techniques for the single-link and complete-link hierarchical clustering procedures," Journal of the American Statistical Association, vol. 69, no. 347, pp. 698–704, 1974.

[9] E. Diday and J. C. Simon, "Clustering analysis," in Digital Pattern Recognition, vol. 10 of Communication and Cybernetics, pp. 47–94, Springer, Berlin, Germany, 1980.

[10] S. Nassar, J. Sander, and C. Cheng, "Incremental and effective data summarization for dynamic hierarchical clustering," in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '04), pp. 467–478, ACM, June 2004.

[11] K. Tasdemir, "Vector quantization based approximate spectral clustering of large datasets," Pattern Recognition, vol. 45, no. 8, pp. 3034–3044, 2012.

[12] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review," ACM Computing Surveys, vol. 31, no. 3, pp. 264–323, 1999.

[13] A. K. Jain, "Data clustering: 50 years beyond K-means," Pattern Recognition Letters, vol. 31, no. 8, pp. 651–666, 2010.

[14] N. R. Pal and S. K. Pal, "A review on image segmentation techniques," Pattern Recognition, vol. 26, no. 9, pp. 1277–1294, 1993.

[15] M. Wedel, Market Segmentation: Conceptual and Methodological Foundations, Springer, 2000.

[16] R.-S. Wu and P.-H. Chou, "Customer segmentation of multiple category data in e-commerce using a soft-clustering approach," Electronic Commerce Research and Applications, vol. 10, no. 3, pp. 331–341, 2011.

[17] M. C. Onwezen, M. J. Reinders, I. A. van der Lans et al., "A cross-national consumer segmentation based on food benefits: the link with consumption situations and food perceptions," Food Quality and Preference, vol. 24, no. 2, pp. 276–286, 2012.

[18] F. Westad, M. Hersleth, and P. Lea, "Strategies for consumer segmentation with applications on preference data," Food Quality and Preference, vol. 15, no. 7-8, pp. 681–687, 2004.

[19] J. Macharia, R. Collins, and T. Sun, "Value-based consumer segmentation: the key to sustainable agri-food chains," British Food Journal, vol. 115, no. 9, pp. 1313–1328, 2013.

[20] T.-C. Hsieh and C. Yang, "Multi-level latent class analysis of internet use pattern in Taiwan," in e-Technologies and Networks for Development, vol. 171 of Communications in Computer and Information Science, pp. 197–208, Springer, Berlin, Germany, 2011.

[21] Z. Bosnjak and O. Grljevic, "Credit users segmentation for improved customer relationship management in banking," in Proceedings of the 6th IEEE International Symposium on Applied Computational Intelligence and Informatics (SACI '11), pp. 379–384, May 2011.

[22] E. Martinez-Garcia and M. Royo-Vela, "Segmentation of low-cost flights users at secondary airports," Journal of Air Transport Management, vol. 16, no. 4, pp. 234–237, 2010.

[23] M. Bichis-Lupas and R. N. Moisey, "A benefit segmentation of rail-trail users: implications for marketing by local communities," Journal of Park and Recreation Administration, vol. 19, no. 3, pp. 78–92, 2001.

[24] A. Bhatnagar and S. Ghose, "A latent class segmentation analysis of e-shoppers," Journal of Business Research, vol. 57, no. 7, pp. 758–767, 2004.

[25] C. Lorenzo-Romero and M.-D. Alarcon-del-Amo, "Segmentation of users of social networking websites," Social Behavior and Personality, vol. 40, no. 3, pp. 401–414, 2012.

[26] http://www.cnnic.net.cn.

[27] A. L. N. Fred and J. M. N. Leitao, "A new cluster isolation criterion based on dissimilarity increments," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 8, pp. 944–958, 2003.

[28] M. C. P. De Souto, D. S. A. De Araujo, I. G. Costa, R. G. F. Soares, T. B. Ludermir, and A. Schliep, "Comparative study on normalization procedures for cluster analysis of gene expression datasets," in Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN '08), pp. 2792–2798, June 2008.

Page 9: Localized Ambient Solidity Separation Algorithm Based Computer … · 2016-05-15 · ResearchArticle Localized Ambient Solidity Separation Algorithm Based Computer User Segmentation

Computational Intelligence and Neuroscience 9

cleaned the raw data and extracted 7 features to characterizecomputer users Then the cleaned data is normalized and adissimilarity measurement is defined On the basis of thesethe original dissimilarity increments clustering algorithmand our LASS algorithm are applied on the dataset respec-tivelyThe clustering processes are analysed and the effective-ness of results is verified At last the segmentation result ofcomputer users is analysed and summarized

31 Data Cleaning and Features Selection The raw dataprovided by CNNIC contains two kinds of informationTheyare 1000 computer usersrsquo personal attributes and their com-puter using log files Specifically personal attributes includea volunteerrsquos gender birthday education level job typeincome level province of residence city of residence andtype of residence while computer using log files record these1000 volunteersrsquo computer interaction behaviours in 7 daysincluding start time end time websites browsing history andprograms opening history

Although many features could be extracted from rawdata we focus our attention on volunteersrsquo natural attributesas persons and their fundamental behavioursrsquo statistical indi-cators but ignore environmental and geographic factors suchas job type province of residence city of residence and resi-dence type The reason behind this is that we regard Internetas a strength which has broken down geographic barrierTherefore we assume that environmental and geographic fac-tors are no longer crucial influence factors in Internet worldFrom this point of view we extracted 7 features to profilecomputer users Taking the 119894th computer user 119906

119894as a concrete

example these extracted features are described inTable 2Thedata of volunteers whose value of Times(sdot) is less than 4 arecleared out and 775 sample data are left

32 Data Normalization and Dissimilarity MeasurementData normalization is needed before applying our LASS algo-rithm The reason to do so is that similarity measurement isusually sensitive to differences inmean and variability In thispaper two kinds of normalization are used as expressed informulas (6) and (7) respectively In formula (6) 119898

119895and 119904119895

are the mean and standard deviation of feature 119895 Throughthis transformation feature 119895 will have zero mean and unitvariance While in formula (7) function Rank(sdot) returns theranked number of119909lowast

119894119895in feature 119895 data sequenceTherefore the

transformed data will have a mean of (119899+1)2 and a varianceof (119899 + 1)[(2119899 + 1)6 minus (119899 + 1)4] where 119899 is the number ofdata Related study has shown that on the performance ofclustering formula (7) outperforms formula (6) particularlyin hierarchical clusteringmethods formula (7) ismore robustto outliers and noise in dataset [28]

119909119894119895=119909lowast

119894119895minus 119898119895

119904119895

(6)

119909119894119895= Rank (119909lowast

119894119895) (7)

In this paper for continuous variablersquos normalizationsuch as bootDuration(sdot) and visitingDuration(sdot) formulas (7)

Table 2 Description of computer users features

Variables Descriptions

Gender (119906119894)

The gender of 119906119894 discrete variable

1 stands for male0 stands for female

Age (119906119894) The age of 119906

119894 discrete variable between 10

and 70

Edu (119906119894)

The education level of 119906119894 discrete variable

0 below primary school1 junior school2 senior school3 junior college4 bachelor degree5 others

Income (119906119894)

The monthly income level of 119906119894 discrete

variable0 no income1 below 500 Yuan2 501ndash1000 Yuan3 1001ndash1500 Yuan4 1501ndash2000 Yuan5 2001ndash3000 Yuan6 3001ndash5000 Yuan7 5001ndash8000 Yuan8 8001ndash12000 Yuan9 others

Times (119906119894) Boot times of 119906

119894rsquos computer discrete

variableBooting Duration(119906119894)

The duration of 119906119894using computer

continuous variable

Brows Duration (119906119894) The duration of 119906

119894browsing websites

continuous variable

and (6) are used successively while for discrete variablersquosnormalization such as Gender(sdot) Age(sdot) and Edu(sdot) onlyformula (6) is used

After normalization a dissimilarity index is defined tomeasure the distance between different data As formula (8)shows it is a form of 1-normsrsquo sum where 119891

119894119899stands for the

value of 119894th datarsquos 119899th feature

Dissimilarity (119906119894 119906119895) =

7sum

119899=1

10038161003816100381610038161003816119891119894119899minus119891119895119899

10038161003816100381610038161003816 (8)

33 Computer Users Segmentation Process Our proposedLASS algorithm is applied for the segmentation of computerusers in this sectionThewhole segmentation process consistsof two parts Part I is the dissimilarity increments basedclustering strategy (for details please refer to Section 3 in[27]) which aims to find natural isolated clusters part II isour proposed centroid distance based clustering strategy (fordetails please refer to Section 23 in this paper) whose goalis to explore the internal structure of every cluster generatedby part I and identify potential subclusters that are adjacentoverlapping and under background noise

The clustering process is partly shown in Figure 10 wherethree representative clusters obtained in part I strategy arechosen to be demonstrated Further exploration is carried

10 Computational Intelligence and Neuroscience

Normalized centroid distance

15

10

5

0

Histogram of centroid distance

Num

ber o

f poi

nts

20 22 24 26 28 30 32 34

(a)

Normalized centroid distance

Histogram of centroid distance

Num

ber o

f poi

nts

20 25 30 35 40 45 50 55

14

12

10

8

6

4

2

0

(b)

Normalized centroid distance

Histogram of centroid distance

Num

ber o

f poi

nts

6

4

2

05 6 7 8 9 10

16

14

12

10

8

(c)

Figure 10 Centroid distance histogram of three clusters

out by part II strategy of LASS algorithm and a partitionvalley is found in cluster 2 as shown in Figure 10(b) Nextthe horizontal axis value of the lowest point on this valley canbe acquired as a further isolation criterion based on whichcluster 2 will be divided into two subclusters Figure 11 showsa comparison of the GMM generated by EM algorithm andcentroid distance distribution curve of cluster 2 Despite thedifferences between these two graphsrsquo shapes the acquiredtwo isolation criteria are nearly the same which validates oursimplification of GMMrsquos computation

34 Segmentation Results Analysis and Discussion The seg-mentation results generated by the original dissimilarityincrements method and our LASS algorithm are demon-strated in Tables 3 and 4 These two tables list the prototypessummarized from the obtained clusters As it is shown thesixth cluster in Table 3 is divided into two subclusters thesixth and seventh cluster in Table 4The reason of this furtherpartition as analyzed in Section 33 is the existence of a deepenough valley on cluster 6rsquos centroid distribution curve (asshown in Figure 10(b)) which implies the existence of twodifferent density areas within cluster 6 in Table 3

To understand this process some investigation shouldbe made about the relationship between Tables 3 and 4 InTable 3 cluster 6 is the largest group of all clusters whosegender proportion is almost 50 However an intuitive senseof behavior tells us that behavior mode should be seriouslyaffected by peoplersquos gender This intuition is proved by thefirst 5 clusters in Table 3 to some extent in which genderproportion is 100 male The reason why cluster 6 has not

Normalized centroid distance

Histogram of centroid distance

Num

ber o

f poi

nts

20 25 30 35 40 45 50

14

12

10

8

6

4

2

0

Centroid distanceGaussian distribution 1Gaussian distribution 2

Figure 11 Comparison of GMM and centroid distance distributioncurve

Computational Intelligence and Neuroscience 11

Table 3 Results generated by dissimilarity increments clustering method

Segment Size Gender Age Education level Income levelComputer using

frequency(timesweek)

Computerusing time

(hoursweek)

Websitebrowsing time(hoursweek)

1 24 Male 100Female 0 28 Junior college 2001ndash5000 Yuan 67 675 06

2 35 Male 100Female 0 24 Bachelor degree 0ndash500 Yuan 65 448 44

3 58 Male 100Female 0 41 Junior college 3001ndash5000 Yuan 57 447 29

4 70 Male 100Female 0 32 Senior school 2001ndash3000 Yuan 6 33 31

5 185 Male 100Female 0 32 Bachelor degree 2001ndash5000 Yuan 59 39 67

6 352 Male 42Female 58 32 Junior college 1501ndash3000 Yuan 65 421 51

Table 4 Results generated by our LASS algorithm

Segment Size Gender Age Education level Income levelComputer using

frequency(timesweek)

Computerusing time

(hoursweek)

Websitebrowsing time(hoursweek)

1 24 Male 100Female 0 28 Junior college 2001ndash5000 Yuan 67 675 06

2 35 Male 100Female 0 24 Bachelor degree 0ndash500 Yuan 65 448 44

3 58 Male 100Female 0 41 Junior college 3001ndash5000 Yuan 57 447 29

4 70 Male 100Female 0 32 Senior school to

junior college 2001ndash3000 Yuan 6 33 31

5 185 Male 100Female 0 32 Bachelor degree 2001ndash5000 Yuan 59 39 67

6 136 Male 07Female 993 30 Junior college to

bachelor degree 1501ndash3000 Yuan 59 378 31

7 216 Male 681Female 329 33 Junior college 1001ndash2000 Yuan 69 448 63

been divided further apart by the dissimilarity incrementsclustering method is that there may exist much touchingareas in high-dimensional space of cluster 6 under whichsituation the dissimilarity increments clusteringmethod doesnot work anymore While our proposed centroid distancebased nonhomogeneous density detection algorithm hasfound that there still exist two potential subgroups withincluster 6 in Table 3 which are identified as clusters 6 and7 in Table 4 these two clusters are different in gender ageand computer using behaviors Cluster 6 is almost totallycomposed of women who spend less time on computer andwebsites browsing while in cluster 7 men are twice as muchas women who are older than people in cluster 6 and spendmuch more time on computers especially on browsing

In order to quantify the overall effectiveness of our LASSalgorithm a between group sum of dissimilarities (SDB) iscalculated as formula (9) which is the sumof the dissimilaritybetween a cluster centroid 119888

119894 and the overall centroid 119888

of all the data In this formula 119870 is the number of clusters

Table 5 Total SDB of two clustering methods

MethodDissimilarityincrements

clustering methodOur LASS algorithm

Total SDB 853 1109

and 119899119894is the number of points in cluster 119894 The higher the

total SDB is achieved the more adjoint the identified clustersare So it could be used to measure the effectiveness of aclustering methodThe total SDB of the original dissimilarityincrements clustering method and our LASS algorithm onthe given dataset are shown in Table 5 Obviously our LASSalgorithm achieves larger total SDB more specifically 30larger thus it fits for the given computer user dataset better

In terms of the evaluation of individual clusters silhouettecoefficient is used here whose value varies between minus1 and 1A positive value of silhouette coefficient is desirable

12 Computational Intelligence and Neuroscience

Table 6 The silhouette coefficients of clusters

Clusters Cluster 6in Table 3

Cluster 6in Table 4

Cluster 7 inTable 5

Silhouettecoefficient minus034 002 minus041

As Table 6 shows the silhouette coefficient value of cluster 6in Table 3 is negative which implies that the inside cohesionand outside separation of the cluster are not good So cluster6 in Table 3 could not be seen as a typical cluster whilethrough our LASS algorithm cluster 6 in Table 3 is identifiedas two individual clusters one of whose silhouette coefficientsis positive So as to cluster 7 whose silhouette coefficient isstill negative we guess that it belongs to some kind of back-ground noise This will be discussed later As for cluster 6 inTable 4 we believe that it is a typical prototype of Chinesefemale computer users which has not been revealed inTable 3 Therefore compared with the original dissimilarityincrements clustering method our LASS algorithm can gainmore knowledge and understanding from computer userdataset

Total SDB =

119870

sum

119894=1119899119894Dissimilarity (119888

119894 119888) (9)

Further Kruskal-WallisHTest is applied on the clusters inTable 4 to test the difference between two ormore clusters of agiven dimension As a nonparametric test method Kruskal-Wallis H Test is typically used to determine if there are sta-tistical significance differences between two or more groupsof an independent variable The results are shown in Tables 7and 8 In the hypothesis tests of Table 7 original hypothesisis that the distributions of a given variable in all 7 clusters areidentical and alternative hypothesis is that the distributionsof a given variable in all 7 clusters are not identical While inthe hypothesis tests of Table 8 original hypothesis is that thedistributions of a given variable in a given pair of clusters areidentical and alternative hypothesis is that the distributionsof a given variable in a given pair of clusters are not identicalThe 119901 values are listed and marked by star if they are biggerthan 005 whichmeans accepting the original hypothesis andrejecting the alternative one For the cases in which 119901 valueis below 005 the smaller the 119901 value is the more statisticallysignificant the variablersquos difference is In Table 7 all of the 119901values are below 0002 which means for any given variableits distributions are extremely different among the sevenclusters in Table 4Therefore we can draw the conclusion thatthese seven variables perform well in identifying differentgroups of computer users While in Table 8 119901 value changesa lot according to the given pair of clusters and variable Thesignificance of these seven variables to distinguish differentpair of clusters will be discussed one by one combined withTable 9 which reveals the detailed demographic and com-puter interaction behaviours characteristics of the obtainedseven computer users clusters

Segmentation results will be analysed from the perspec-tive of variables with the help of Table 8 and Tables 9 and 4

and significant characteristics will be pointed out For thevariable of gender Table 8 tells us that its distributions in thefirst five segments are identical which is proved to be 100male in Table 9 The most significant difference of genderlies among segments 1ndash5 segment 6 and segment 7 whichrepresentsmale groups female group andmix-gender grouprespectively For the variable of age Table 8 reveals that itsdistribution among segments 4ndash7 could be seen as identicalthemain difference happens between the first three segmentsCombined with Tables 9 and 4 we could find that segment 2consists of the youngest members whose age is around 24Segment 1 is a little bit elder group whose average age isaround 28 While segment 3 is a middle-aged group with anaverage age of 41 they are much older than other segmentsSo as to the variable of education level it discriminatesdifferent segments well Its distribution in segments 2 and 5could be seen as identical that has the highest education levelbachelor degree while the people from segment 4 have thelowest education level Other segments differ from oneanother For the variable of income level segment 1 earns thehighest income while segment 2 earns the lowest one Theincome level of segments 3 and 5 could be seen as identicalso it is with segments 4 and 6 And the former tworsquos incomeis lower than the latter tworsquos In the terms of computer usingfrequency the segments could be divided into two groupsthey are segments 1 2 and 7 and segments 3ndash6 The formergroup uses computer more frequently As for the variableof computer using time it discriminates segments 1 and 4well that spend the most and the least time on computerrespectively while for the remaining 5 segments no signifi-cant difference exists among their computer using time Forthe last variable website browsing time its distribution insegments 2 3 4 and 6 could be seen as identical differencemainly lies among segments 1 5 and 7 Specifically segment1 spends the least time on website browsing while segment 5spends the most and the browsing time of segment 7 falls inbetween segment 1 and segment 5

Based on the analysis above the 7 segments obtained byour LASS algorithm are summarized and discussed belowrespectively

Category 1 (little-browsing group) This group is entirelycomposed of young men who received a high educationlevel and earn a decent income The most significant featureof the people in this group is that although they spendthe most time on computers compared with other groupsthey seldom visit webpages We guess that for this group ofpeople the computer interaction behaviours mainly happenin workplace or public where personal browsing is notencouraged

Category 2 (little-income group) This group is composedof the youngest people who are purely male and have thehighest education level The most significant feature of thisgroup of people is that they have the same income level whichis no income Additionally they spend relatively more timeon computers and browsing websites We guess that the mainbody of this group is college students in progress who havelots of free time but no source of revenue

Computational Intelligence and Neuroscience 13

Table 7 119901 values of features among all clusters

Variables Gender Age Education level Income level Computer using frequency Computer using time Website browsing time119901 value lt0002 lt0002 lt0002 lt0002 lt0002 lt0002 lt0002

Table 8 119901 values of features between two pairs of clusters

Variables Pair of segments1-2 1-3 1-4 1-5 1-6 1-7 2-3

Gender gt05lowast gt05lowast gt05lowast gt05lowast lt0002 lt0002 gt05lowast

Age lt0005 lt0002 gt005lowast lt005 gt02lowast gt005lowast lt0002Education level lt0002 gt05lowast lt0002 lt0002 gt005lowast gt05lowast lt0002Income level lt0002 gt02lowast gt01lowast gt05lowast gt005lowast lt0002 lt0002Computer using frequency gt02lowast lt0002 lt0005 lt0002 lt0002 gt01lowast lt0005Computer using time lt0002 lt0002 lt0002 lt0002 lt0002 lt0002 gt05lowast

Website browsing time lt0002 lt005 lt001 lt0002 lt005 lt0002 gt01lowast

Variables Pair of segments2-4 2-5 2-6 2-7 3-4 3-5 3-6

Gender gt05lowast gt05lowast lt0002 lt0002 gt05lowast gt05lowast lt0002Age lt0002 lt0002 lt0002 lt0002 lt0002 lt0002 lt0002Education level lt0002 gt05lowast lt0002 lt0002 lt0002 lt0002 lt0005Income level lt0002 lt0002 lt0002 lt0002 lt0005 gt01lowast lt0002Computer using frequency lt005 lt0005 lt0002 gt05lowast gt02lowast gt02lowast gt05lowast

Computer using time lt001 gt02lowast gt01lowast gt05lowast lt0002 gt01lowast lt005Website browsing time gt02lowast gt005lowast gt005lowast gt05lowast gt05lowast lt0002 gt05lowast

Variables Pair of segments3-7 4-5 4-6 4-7 5-6 5-7 6-7

Gender lt0002 gt05lowast lt0002 lt0002 lt0002 lt0002 lt0002Age lt0002 gt05lowast gt01lowast gt05lowast lt002 gt02lowast gt01lowast

Education level gt02lowast lt0002 lt0002 lt0002 lt0002 lt0002 lt0002Income level lt0002 lt005 gt05lowast lt002 lt0002 lt0002 lt002Computer using frequency lt0002 gt05lowast gt02lowast lt0005 gt05lowast lt0002 lt0002Computer using time gt05lowast lt001 gt005lowast lt0002 gt02lowast gt01lowast lt005Website browsing time lt0005 lt0002 gt02lowast lt002 lt0002 lt005 lt0002

Category 3 (high-income group) This group of people isentirely middle-aged menThemost significant feature of thepeople in this group is the highest income they earn Besidesthey spend relatively less time on computer interaction interms of both using frequency and total browsing time Weguess that for the middle-aged men in this group most ofwhom have not received a higher education computers orInternet is not so necessary in their daily life

Category 4 (low-education group) This group is entirelycomposed of young men whose age is older than Categories1 and 2 The most significant feature of the people in thisgroup is their low-education level the average of which issenior school ranging from junior school to junior collegeMoreover they earn a medium level income and get smallervalues on every computer interaction index We guess that

this group of people is mainly engaged in jobs independentof computers

Category 5 (much-browsing group) The structure of thisgroup is very similar to Category 4 except for the highereducation they received say bachelor degree As it is shownpeople in this group earn more we guess that education dif-ference may account for this Also compared with othercategories especially Category 4 this group of people spendsmuch more time on browsing websites We guess that themain job types of this group could be intellectual work thusthey have close access to online computers

Category 6 (young-women group) Female accounts fornearly 100 in this group which is the only case in these 7categories However from computer interaction aspects say

14 Computational Intelligence and Neuroscience

Table 9 Demographic and behaviour description of computer user segmentations

Demographic and computerinteraction behaviourscharacteristics

Segment 1 Segment 2 Segment 3 Segment 4 Segment 5 Segment 6 Segment 7 Total

GenderMale 100 100 100 100 100 07 681 718Female 0 0 0 0 0 993 319 282

Age10sim20 0 0 0 0 0 22 120 4020sim25 42 686 68 143 64 125 176 14625sim30 625 229 34 242 341 419 162 27230sim35 333 86 138 286 276 213 162 21235sim40 0 0 155 229 162 147 102 13440sim50 0 0 379 10 135 59 181 14050sim60 0 0 206 0 22 14 83 5060sim70 0 0 17 0 0 0 14 06

Education levelBelow primary school 0 0 0 0 0 0 14 04Junior school 0 0 17 29 0 22 144 51Senior school 0 0 259 743 05 199 269 211Junior college 100 20 551 229 265 331 199 298Bachelor degree 0 714 172 0 730 412 287 398Others 0 86 0 0 0 37 88 37

Income levelNo income 0 914 0 0 0 51 241 126Below 500 Yuan 0 57 0 0 0 07 32 14501ndash1000 Yuan 0 29 17 0 16 29 46 261001ndash1500 Yuan 0 0 52 114 43 59 97 661501ndash2000 Yuan 125 0 103 229 108 176 83 1202001ndash3000 Yuan 292 0 241 243 286 324 162 1253001ndash5000 Yuan 458 0 207 329 395 235 157 2565001ndash8000 Yuan 125 0 172 85 119 66 83 948001ndash12000 Yuan 0 0 121 0 32 44 28 34Others 0 0 0 0 0 07 69 29

Computer using frequencyMean 67 65 57 60 59 587 69 63Variance 004 020 033 018 025 040 14 068

Computer using timeMean 675 447 447 330 394 378 448 417Variance 023 088 131 031 070 083 131 099

Website browsing timeMean 064 443 288 312 67 31 63 495Variance 043 089 089 080 083 096 101 099

using frequency and browsing time this group is very similarto Category 4 So we guess that these two groups of peoplehave similar type of job or similar working circumstanceMoreover although these young women have a higher edu-cation level thanmen in Category 4 they do not earn a better

salaryWe guess that this phenomenonmay be due to the lackof career experience and gender discrimination

Category 7 (noise group) This category is the only gendermixed group in which men are twice as much as women

Computational Intelligence and Neuroscience 15

However in terms of age education level and income levelthis category shows no significant difference compared withtotal population And as for the variables of computer usingfrequency computer using time and website browsing timetheir variances are fairly large even bigger than the overallvariances So due to the dispersed distribution of this categoryon every dimension we believe that it is a noise group

4 Conclusion

In this paper we proposed a new clustering algorithm namedlocalized ambient solidity separation (LASS) algorithmThis algorithm is built on a new isolation criterion called cen-troid distance which is used to detect the nonhomogeneousdensity distribution of a given cluster The proposed isolationcriterion is based on the recognition that if there existnonhomogeneous densities within a cluster then partitionsshould be carried out The intuition behind this recognitionis GMM assumption of the pointsrsquo centroid distance value ina cluster EM algorithm was used to derive the componentsand parameters of a GMM Additionally in order to makethe algorithmmore efficient we designed a nonhomogeneousdensity detection algorithm to reduce computation com-plexity to 119874(119899) where 119899 is the number of points for cluster-ing Moreover the parameter determination policy of non-homogeneous density detection algorithm is investigatedFinally we integrated our designed nonhomogeneous densitydetection algorithm as a follow-up mechanism with theoriginal dissimilarity increments clustering method anddeveloped LASS algorithm It is demonstrated that comparedwith the original dissimilarity increments clustering methodour LASS algorithm not only can identify naturally isolatedclusters but also can identify the clusters which are adjacentoverlapping and under background noise

Additionally in order to evaluate the performance ofLASS algorithm in practice we applied it on the computeruser dataset which contains 1000 computer usersrsquo demo-graphic and behaviours information comparing with theresult got from the original dissimilarity increments cluster-ing method The segmentation results show that one of theclusters generated by the dissimilarity increments clusteringmethod is further divided into two subclusters by our LASSalgorithm The comparison of total SDB and silhouettecoefficient validates the rationality of this further partitionThe discussion and analysis of segmentation results are madeand prove that our LASS algorithm can gainmore knowledgeand understanding from dataset with high dimensionalityand diverse distribution shapes like computer user dataset

There are some future directions to explore from thispaper First the GMM assumption of centroid distance valuecan be further investigated and tested among more dis-tributions such as Gaussian and exponential Second ourproposed centroid distance isolation criterion could beintegrated with other traditional clustering methods eitherpartitional or hierarchical more strengths and weaknessescould be pointed out and analysed Third the centroiddistance based clustering strategy in our LASS algorithmrelies on the histogram distribution of centroid distancevalues therefore if the number of points in one cluster is too

small this clustering strategy may not work effectively anymore This drawback should be given enough attention andfurther investigated

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors would like to acknowledge the dataset andtechnical support for their work provided by NELECT(National Engineering Laboratory for E-Commerce Technol-ogy Tsinghua University) and DNSLAB of China InternetNetwork Information Center

References

[1] E Miller ldquoCommunity cleverness requiredrdquo Nature vol 455no 7209 p 1 2008

[2] I H Witten and E Frank Data Mining Practical MachineLearning Tools and Techniques Morgan Kaufmann 2005

[3] O Chapelle B Scholkopf and A Zien Semi-Supervised Learn-ing MIT Press Cambridge Mass USA 2006

[4] G E Hinton S Osindero and Y-W Teh ldquoA fast learning algo-rithm for deep belief netsrdquo Neural Computation vol 18 no 7pp 1527ndash1554 2006

[5] V Jakkula and D J Cook ldquoAnomaly detection using temporaldata mining in a smart home environmentrdquo Methods of Infor-mation in Medicine vol 47 no 1 pp 70ndash75 2008

[6] Q-Y Tang and C-X Zhang ldquoData processing system (DPS)software with experimental design statistical analysis and datamining developed for use in entomological researchrdquo InsectScience vol 20 no 2 pp 254ndash260 2013

[7] M A Musen B Middleton and R A Greenes ldquoClinical deci-sion-support systemsrdquo in Biomedical Informatics pp 643ndash674Springer London UK 2014

[8] L Hubert ldquoApproximate evaluation techniques for the single-link and complete-link hierarchical clustering proceduresrdquoJournal of the American Statistical Association vol 69 no 347pp 698ndash704 1974

[9] E Diday and J C Simon ldquoClustering analysisrdquo in Digital Pat-tern Recognition vol 10 of Communication and Cybernetics pp47ndash94 Springer Berlin Germany 1980

[10] S Nassar J Sander and C Cheng ldquoIncremental and effectivedata summarization for dynamic hierarchical clusteringrdquo inProceedings of the ACM SIGMOD International Conference onManagement of Data (SIGMOD rsquo04) pp 467ndash478 ACM June2004

[11] K Tasdemir ldquoVector quantization based approximate spectralclustering of large datasetsrdquo Pattern Recognition vol 45 no 8pp 3034ndash3044 2012

[12] A K Jain M N Murty and P J Flynn ldquoData clustering areviewrdquo ACM Computing Surveys vol 31 no 3 pp 264ndash3231999

[13] A K Jain ldquoData clustering 50 years beyond K-meansrdquo PatternRecognition Letters vol 31 no 8 pp 651ndash666 2010

[14] N R Pal and S K Pal ldquoA review on image segmentation tech-niquesrdquo Pattern Recognition vol 26 no 9 pp 1277ndash1294 1993

16 Computational Intelligence and Neuroscience

[15] M Wedel Market Segmentation Conceptual and Methodologi-cal Foundations Springer 2000

[16] R-S Wu and P-H Chou ldquoCustomer segmentation of multiplecategory data in e-commerce using a soft-clustering approachrdquoElectronic Commerce Research and Applications vol 10 no 3pp 331ndash341 2011

[17] MCOnwezenM J Reinders I A van der Lans et al ldquoA cross-national consumer segmentation based on food benefits thelink with consumption situations and food perceptionsrdquo FoodQuality and Preference vol 24 no 2 pp 276ndash286 2012

[18] FWestadMHersleth and P Lea ldquoStrategies for consumer seg-mentation with applications on preference datardquo Food Qualityand Preference vol 15 no 7-8 pp 681ndash687 2004

[19] J Macharia R Collins and T Sun ldquoValue-based consumer seg-mentation the key to sustainable agri-food chainsrdquo British FoodJournal vol 115 no 9 pp 1313ndash1328 2013

[20] T-C Hsieh and C Yang ldquoMulti-level latent class analysis ofinternet use pattern in Taiwanrdquo in e-Technologies and Networksfor Development vol 171 of Communications in Computer andInformation Science pp 197ndash208 Springer Berlin Germany2011

[21] Z Bosnjak and O Grljevic ldquoCredit users segmentation forimproved customer relationship management in bankingrdquo inProceedings of the 6th IEEE International Symposium on AppliedComputational Intelligence and Informatics (SACI rsquo11) pp 379ndash384 May 2011

[22] E Martinez-Garcia and M Royo-Vela ldquoSegmentation of low-cost flights users at secondary airportsrdquo Journal of Air TransportManagement vol 16 no 4 pp 234ndash237 2010

[23] M Bichis-Lupas and R N Moisey ldquoA benefit segmentation ofrail-trail users implications for marketing by local communi-tiesrdquo Journal of Park and Recreation Administration vol 19 no3 pp 78ndash92 2001

[24] A Bhatnagar and S Ghose ldquoA latent class segmentation analysisof e-shoppersrdquo Journal of Business Research vol 57 no 7 pp758ndash767 2004

[25] C Lorenzo-Romero and M-D Alarcon-del-Amo ldquoSegmenta-tion of users of social networking websitesrdquo Social Behavior andPersonality vol 40 no 3 pp 401ndash414 2012

[26] httpwwwcnnicnetcn[27] A L N Fred and J M N Leitao ldquoA new cluster isolation crite-

rion based on dissimilarity incrementsrdquo IEEE Transactions onPatternAnalysis andMachine Intelligence vol 25 no 8 pp 944ndash958 2003

[28] M C P De Souto D S A De Araujo I G Costa R G FSoares T B Ludermir and A Schliep ldquoComparative study onnormalization procedures for cluster analysis of gene expressiondatasetsrdquo in Proceedings of the IEEE International Joint Confer-ence onNeural Networks (IJCNN rsquo08) pp 2792ndash2798 June 2008

Page 10: Localized Ambient Solidity Separation Algorithm Based Computer … · 2016-05-15 · ResearchArticle Localized Ambient Solidity Separation Algorithm Based Computer User Segmentation

[Figure 10: Centroid distance histograms of three clusters, panels (a), (b), and (c). Each panel plots the number of points against normalized centroid distance.]

out by the part II strategy of the LASS algorithm, and a partition valley is found in cluster 2, as shown in Figure 10(b). Next, the horizontal axis value of the lowest point in this valley is taken as a further isolation criterion, based on which cluster 2 is divided into two subclusters. Figure 11 shows a comparison between the GMM generated by the EM algorithm and the centroid distance distribution curve of cluster 2. Despite the differences between these two graphs' shapes, the two isolation criteria acquired from them are nearly the same, which validates our simplification of the GMM's computation.
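To make this step concrete, the sketch below (not the paper's code; the function and variable names are hypothetical, and scikit-learn's GaussianMixture is assumed) fits a two-component GMM to one-dimensional centroid distance values and scans for the density minimum between the two component means, which plays the role of the partition valley used as the isolation threshold:

```python
# Minimal sketch: locate the "partition valley" between two GMM components
# fitted to one-dimensional centroid distance values.
import numpy as np
from sklearn.mixture import GaussianMixture

def valley_threshold(centroid_distances):
    """Return the x-value between the two component means where the
    fitted mixture density is lowest (the partition valley)."""
    x = np.asarray(centroid_distances).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(x)
    lo, hi = np.sort(gmm.means_.ravel())
    grid = np.linspace(lo, hi, 1000).reshape(-1, 1)
    density = np.exp(gmm.score_samples(grid))   # mixture density on the grid
    return float(grid[np.argmin(density)])

# Synthetic example mimicking two density regions within one cluster:
rng = np.random.default_rng(0)
d = np.concatenate([rng.normal(2.5, 0.3, 200), rng.normal(4.0, 0.4, 150)])
print(valley_threshold(d))   # threshold separating the two subclusters
```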

3.4. Segmentation Results Analysis and Discussion. The segmentation results generated by the original dissimilarity increments method and by our LASS algorithm are presented in Tables 3 and 4, which list the prototypes summarized from the obtained clusters. As shown, the sixth cluster in Table 3 is divided into two subclusters, the sixth and seventh clusters in Table 4. The reason for this further partition, as analyzed in Section 3.3, is the existence of a sufficiently deep valley in cluster 6's centroid distance distribution curve (shown in Figure 10(b)), which implies the existence of two areas of different density within cluster 6 of Table 3.

To understand this process, the relationship between Tables 3 and 4 deserves some investigation. In Table 3, cluster 6 is the largest group of all clusters, and its gender proportion is almost 50%. However, an intuitive sense of behavior tells us that behavior patterns should be strongly affected by people's gender. This intuition is proved, to some extent, by the first 5 clusters in Table 3, in which the gender proportion is 100% male.

[Figure 11: Comparison of the GMM (Gaussian distributions 1 and 2) and the centroid distance distribution curve of cluster 2, plotted as number of points versus normalized centroid distance.]


Table 3: Results generated by the dissimilarity increments clustering method.

Segment | Size | Gender | Age | Education level | Income level | Computer using frequency (times/week) | Computer using time (hours/week) | Website browsing time (hours/week)
1 | 24 | Male 100%, Female 0% | 28 | Junior college | 2001-5000 Yuan | 6.7 | 67.5 | 0.6
2 | 35 | Male 100%, Female 0% | 24 | Bachelor degree | 0-500 Yuan | 6.5 | 44.8 | 4.4
3 | 58 | Male 100%, Female 0% | 41 | Junior college | 3001-5000 Yuan | 5.7 | 44.7 | 2.9
4 | 70 | Male 100%, Female 0% | 32 | Senior school | 2001-3000 Yuan | 6 | 33 | 3.1
5 | 185 | Male 100%, Female 0% | 32 | Bachelor degree | 2001-5000 Yuan | 5.9 | 39 | 6.7
6 | 352 | Male 42%, Female 58% | 32 | Junior college | 1501-3000 Yuan | 6.5 | 42.1 | 5.1

Table 4: Results generated by our LASS algorithm.

Segment | Size | Gender | Age | Education level | Income level | Computer using frequency (times/week) | Computer using time (hours/week) | Website browsing time (hours/week)
1 | 24 | Male 100%, Female 0% | 28 | Junior college | 2001-5000 Yuan | 6.7 | 67.5 | 0.6
2 | 35 | Male 100%, Female 0% | 24 | Bachelor degree | 0-500 Yuan | 6.5 | 44.8 | 4.4
3 | 58 | Male 100%, Female 0% | 41 | Junior college | 3001-5000 Yuan | 5.7 | 44.7 | 2.9
4 | 70 | Male 100%, Female 0% | 32 | Senior school to junior college | 2001-3000 Yuan | 6 | 33 | 3.1
5 | 185 | Male 100%, Female 0% | 32 | Bachelor degree | 2001-5000 Yuan | 5.9 | 39 | 6.7
6 | 136 | Male 0.7%, Female 99.3% | 30 | Junior college to bachelor degree | 1501-3000 Yuan | 5.9 | 37.8 | 3.1
7 | 216 | Male 68.1%, Female 31.9% | 33 | Junior college | 1001-2000 Yuan | 6.9 | 44.8 | 6.3

The reason why cluster 6 has not been divided further by the dissimilarity increments clustering method is that there may exist many touching areas in the high-dimensional space of cluster 6, a situation in which the dissimilarity increments clustering method no longer works. Our proposed centroid distance based nonhomogeneous density detection algorithm, however, found that there still exist two potential subgroups within cluster 6 of Table 3, identified as clusters 6 and 7 in Table 4. These two clusters differ in gender, age, and computer using behaviors. Cluster 6 is almost entirely composed of women, who spend less time on computers and website browsing, while in cluster 7 men are twice as numerous as women; its members are older than those in cluster 6 and spend much more time on computers, especially on browsing.

In order to quantify the overall effectiveness of our LASS algorithm, the between-group sum of dissimilarities (SDB) is calculated as in formula (9), that is, the sum of the dissimilarities between each cluster centroid c_i and the overall centroid c of all the data, where K is the number of clusters and n_i is the number of points in cluster i.

Table 5: Total SDB of the two clustering methods.

Method | Total SDB
Dissimilarity increments clustering method | 853
Our LASS algorithm | 1109

The higher the total SDB, the more separated the identified clusters are, so it can be used to measure the effectiveness of a clustering method. The total SDB values of the original dissimilarity increments clustering method and of our LASS algorithm on the given dataset are shown in Table 5. Our LASS algorithm achieves a larger total SDB, more specifically 30% larger, and thus fits the given computer user dataset better.

In terms of the evaluation of individual clusters, the silhouette coefficient is used here, whose value varies between -1 and 1; a positive value of the silhouette coefficient is desirable.
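As a hedged illustration (assuming scikit-learn and a Euclidean feature space, which may differ from the dissimilarity actually used in the paper), per-cluster silhouette coefficients like those reported in Table 6 can be obtained by averaging the per-point values:

```python
# Sketch: average the per-point silhouette values within each cluster.
import numpy as np
from sklearn.metrics import silhouette_samples

def cluster_silhouettes(X, labels):
    labels = np.asarray(labels)
    s = silhouette_samples(X, labels)           # one coefficient per point
    return {k: float(s[labels == k].mean())     # mean silhouette per cluster
            for k in np.unique(labels)}
```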


Table 6: The silhouette coefficients of clusters.

Clusters | Cluster 6 in Table 3 | Cluster 6 in Table 4 | Cluster 7 in Table 4
Silhouette coefficient | -0.34 | 0.02 | -0.41

As Table 6 shows, the silhouette coefficient of cluster 6 in Table 3 is negative, which implies that the inside cohesion and outside separation of this cluster are not good, so cluster 6 in Table 3 cannot be seen as a typical cluster. Through our LASS algorithm, cluster 6 in Table 3 is identified as two individual clusters, one of whose silhouette coefficients is positive. As for cluster 7, whose silhouette coefficient is still negative, we guess that it belongs to some kind of background noise; this will be discussed later. As for cluster 6 in Table 4, we believe that it is a typical prototype of Chinese female computer users, one which had not been revealed in Table 3. Therefore, compared with the original dissimilarity increments clustering method, our LASS algorithm can gain more knowledge and understanding from the computer user dataset.

$$\text{Total SDB} = \sum_{i=1}^{K} n_i \,\text{Dissimilarity}(c_i, c) \tag{9}$$
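A minimal sketch of formula (9) follows; it assumes Euclidean distance as the dissimilarity measure, whereas the dissimilarity used in the paper may be defined differently:

```python
# Sketch of formula (9): between-group sum of dissimilarities (total SDB).
import numpy as np

def total_sdb(X, labels):
    """Sum over clusters of n_i * Dissimilarity(c_i, c), where c is the
    overall centroid and c_i the centroid of cluster i."""
    labels = np.asarray(labels)
    c = X.mean(axis=0)                       # overall centroid of all data
    sdb = 0.0
    for k in np.unique(labels):
        cluster = X[labels == k]
        c_i = cluster.mean(axis=0)           # centroid of cluster i
        sdb += len(cluster) * np.linalg.norm(c_i - c)
    return sdb
```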

Further, the Kruskal-Wallis H test is applied to the clusters in Table 4 to test the difference between two or more clusters along a given dimension. As a nonparametric method, the Kruskal-Wallis H test is typically used to determine whether there are statistically significant differences between two or more groups of an independent variable. The results are shown in Tables 7 and 8. In the hypothesis tests of Table 7, the null hypothesis is that the distributions of a given variable in all 7 clusters are identical, and the alternative hypothesis is that they are not identical. In the hypothesis tests of Table 8, the null hypothesis is that the distributions of a given variable in a given pair of clusters are identical, and the alternative hypothesis is that they are not. The p values are listed and marked by a star if they are bigger than 0.05, which means accepting the null hypothesis and rejecting the alternative one. For the cases in which the p value is below 0.05, the smaller the p value is, the more statistically significant the variable's difference is. In Table 7, all of the p values are below 0.002, which means that, for any given variable, its distributions are extremely different among the seven clusters in Table 4. Therefore, we can draw the conclusion that these seven variables perform well in identifying different groups of computer users. In Table 8, by contrast, the p value varies a lot with the given pair of clusters and variable. The significance of these seven variables in distinguishing each pair of clusters will be discussed one by one, combined with Table 9, which reveals the detailed demographic and computer interaction behaviour characteristics of the obtained seven computer user clusters.
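For illustration, the omnibus test behind Table 7 can be reproduced with SciPy's kruskal function, sketched below with hypothetical per-segment samples of a single variable (the actual per-segment data are not reproduced here):

```python
# Sketch: Kruskal-Wallis H test across segments for one variable (e.g., age).
from scipy.stats import kruskal

# Hypothetical age samples for three segments (illustration only).
seg1 = [28, 27, 29, 30]
seg2 = [24, 23, 25, 24]
seg3 = [41, 39, 44, 42]

h_stat, p_value = kruskal(seg1, seg2, seg3)
print(f"H = {h_stat:.2f}, p = {p_value:.4f}")  # small p: distributions differ
```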

The segmentation results will now be analysed variable by variable, with the help of Tables 4, 8, and 9, and significant characteristics will be pointed out. For the variable of gender, Table 8 tells us that its distributions in the first five segments are identical, which proves to be 100% male in Table 9. The most significant difference in gender lies among segments 1-5, segment 6, and segment 7, which represent male groups, a female group, and a mixed-gender group, respectively. For the variable of age, Table 8 reveals that its distribution among segments 4-7 can be seen as identical; the main differences occur between the first three segments. Combining Tables 4 and 9, we find that segment 2 consists of the youngest members, whose age is around 24. Segment 1 is a slightly older group, with an average age around 28, while segment 3 is a middle-aged group with an average age of 41, much older than the other segments. As for the variable of education level, it discriminates between the different segments well. Its distribution in segments 2 and 5, which have the highest education level, bachelor degree, can be seen as identical, while the people of segment 4 have the lowest education level; the other segments differ from one another. For the variable of income level, segment 1 earns the highest income, while segment 2 earns the lowest. The income level of segments 3 and 5 can be seen as identical, as can that of segments 4 and 6, with segments 4 and 6 earning less than segments 3 and 5. In terms of computer using frequency, the segments can be divided into two groups, segments 1, 2, and 7 and segments 3-6, the former using computers more frequently. As for the variable of computer using time, it discriminates segments 1 and 4 well, which spend the most and the least time on computers, respectively, while for the remaining 5 segments no significant difference exists in their computer using time. For the last variable, website browsing time, its distribution in segments 2, 3, 4, and 6 can be seen as identical; the differences mainly lie among segments 1, 5, and 7. Specifically, segment 1 spends the least time on website browsing, segment 5 spends the most, and the browsing time of segment 7 falls between those of segments 1 and 5.

Based on the analysis above, the 7 segments obtained by our LASS algorithm are summarized and discussed below, respectively.

Category 1 (little-browsing group). This group is entirely composed of young men who received a high education level and earn a decent income. The most significant feature of the people in this group is that, although they spend the most time on computers compared with the other groups, they seldom visit webpages. We guess that, for this group of people, computer interaction mainly happens in the workplace or in public, where personal browsing is not encouraged.

Category 2 (little-income group). This group is composed of the youngest people, who are purely male and have the highest education level. The most significant feature of this group is that its members share the same income level, namely, no income. Additionally, they spend relatively more time on computers and browsing websites. We guess that the main body of this group is college students, who have lots of free time but no source of revenue.


Table 7: p values of features among all clusters.

Variables | Gender | Age | Education level | Income level | Computer using frequency | Computer using time | Website browsing time
p value | <0.002 | <0.002 | <0.002 | <0.002 | <0.002 | <0.002 | <0.002

Table 8: p values of features between pairs of clusters (* marks p values bigger than 0.05).

Variables | 1-2 | 1-3 | 1-4 | 1-5 | 1-6 | 1-7 | 2-3
Gender | >0.5* | >0.5* | >0.5* | >0.5* | <0.002 | <0.002 | >0.5*
Age | <0.005 | <0.002 | >0.05* | <0.05 | >0.2* | >0.05* | <0.002
Education level | <0.002 | >0.5* | <0.002 | <0.002 | >0.05* | >0.5* | <0.002
Income level | <0.002 | >0.2* | >0.1* | >0.5* | >0.05* | <0.002 | <0.002
Computer using frequency | >0.2* | <0.002 | <0.005 | <0.002 | <0.002 | >0.1* | <0.005
Computer using time | <0.002 | <0.002 | <0.002 | <0.002 | <0.002 | <0.002 | >0.5*
Website browsing time | <0.002 | <0.05 | <0.01 | <0.002 | <0.05 | <0.002 | >0.1*

Variables | 2-4 | 2-5 | 2-6 | 2-7 | 3-4 | 3-5 | 3-6
Gender | >0.5* | >0.5* | <0.002 | <0.002 | >0.5* | >0.5* | <0.002
Age | <0.002 | <0.002 | <0.002 | <0.002 | <0.002 | <0.002 | <0.002
Education level | <0.002 | >0.5* | <0.002 | <0.002 | <0.002 | <0.002 | <0.005
Income level | <0.002 | <0.002 | <0.002 | <0.002 | <0.005 | >0.1* | <0.002
Computer using frequency | <0.05 | <0.005 | <0.002 | >0.5* | >0.2* | >0.2* | >0.5*
Computer using time | <0.01 | >0.2* | >0.1* | >0.5* | <0.002 | >0.1* | <0.05
Website browsing time | >0.2* | >0.05* | >0.05* | >0.5* | >0.5* | <0.002 | >0.5*

Variables | 3-7 | 4-5 | 4-6 | 4-7 | 5-6 | 5-7 | 6-7
Gender | <0.002 | >0.5* | <0.002 | <0.002 | <0.002 | <0.002 | <0.002
Age | <0.002 | >0.5* | >0.1* | >0.5* | <0.02 | >0.2* | >0.1*
Education level | >0.2* | <0.002 | <0.002 | <0.002 | <0.002 | <0.002 | <0.002
Income level | <0.002 | <0.05 | >0.5* | <0.02 | <0.002 | <0.002 | <0.02
Computer using frequency | <0.002 | >0.5* | >0.2* | <0.005 | >0.5* | <0.002 | <0.002
Computer using time | >0.5* | <0.01 | >0.05* | <0.002 | >0.2* | >0.1* | <0.05
Website browsing time | <0.005 | <0.002 | >0.2* | <0.02 | <0.002 | <0.05 | <0.002

Category 3 (high-income group). This group is entirely composed of middle-aged men. The most significant feature of the people in this group is the highest income they earn. Besides, they spend relatively little time on computer interaction, in terms of both using frequency and total browsing time. We guess that, for the middle-aged men in this group, most of whom have not received a higher education, computers and the Internet are not so necessary in their daily lives.

Category 4 (low-education group). This group is entirely composed of young men, older than those in Categories 1 and 2. The most significant feature of the people in this group is their low education level, the average of which is senior school, ranging from junior school to junior college. Moreover, they earn a medium level income and score lower on every computer interaction index. We guess that this group of people is mainly engaged in jobs that do not depend on computers.

Category 5 (much-browsing group). The structure of this group is very similar to that of Category 4, except for the higher education its members received, namely, bachelor degree. As shown, people in this group earn more; we guess that the education difference may account for this. Also, compared with the other categories, especially Category 4, this group spends much more time browsing websites. We guess that the main job type of this group could be intellectual work, which gives them close access to online computers.

Category 6 (young-women group). Female members account for nearly 100% of this group, the only such case in these 7 categories.


Table 9: Demographic and behaviour description of computer user segmentations (values in % unless noted).

Characteristic | Segment 1 | Segment 2 | Segment 3 | Segment 4 | Segment 5 | Segment 6 | Segment 7 | Total
Gender: Male | 100 | 100 | 100 | 100 | 100 | 0.7 | 68.1 | 71.8
Gender: Female | 0 | 0 | 0 | 0 | 0 | 99.3 | 31.9 | 28.2
Age: 10-20 | 0 | 0 | 0 | 0 | 0 | 2.2 | 12.0 | 4.0
Age: 20-25 | 4.2 | 68.6 | 6.8 | 14.3 | 6.4 | 12.5 | 17.6 | 14.6
Age: 25-30 | 62.5 | 22.9 | 3.4 | 24.2 | 34.1 | 41.9 | 16.2 | 27.2
Age: 30-35 | 33.3 | 8.6 | 13.8 | 28.6 | 27.6 | 21.3 | 16.2 | 21.2
Age: 35-40 | 0 | 0 | 15.5 | 22.9 | 16.2 | 14.7 | 10.2 | 13.4
Age: 40-50 | 0 | 0 | 37.9 | 10.0 | 13.5 | 5.9 | 18.1 | 14.0
Age: 50-60 | 0 | 0 | 20.6 | 0 | 2.2 | 1.4 | 8.3 | 5.0
Age: 60-70 | 0 | 0 | 1.7 | 0 | 0 | 0 | 1.4 | 0.6
Education: Below primary school | 0 | 0 | 0 | 0 | 0 | 0 | 1.4 | 0.4
Education: Junior school | 0 | 0 | 1.7 | 2.9 | 0 | 2.2 | 14.4 | 5.1
Education: Senior school | 0 | 0 | 25.9 | 74.3 | 0.5 | 19.9 | 26.9 | 21.1
Education: Junior college | 100 | 20.0 | 55.1 | 22.9 | 26.5 | 33.1 | 19.9 | 29.8
Education: Bachelor degree | 0 | 71.4 | 17.2 | 0 | 73.0 | 41.2 | 28.7 | 39.8
Education: Others | 0 | 8.6 | 0 | 0 | 0 | 3.7 | 8.8 | 3.7
Income: No income | 0 | 91.4 | 0 | 0 | 0 | 5.1 | 24.1 | 12.6
Income: Below 500 Yuan | 0 | 5.7 | 0 | 0 | 0 | 0.7 | 3.2 | 1.4
Income: 501-1000 Yuan | 0 | 2.9 | 1.7 | 0 | 1.6 | 2.9 | 4.6 | 2.6
Income: 1001-1500 Yuan | 0 | 0 | 5.2 | 11.4 | 4.3 | 5.9 | 9.7 | 6.6
Income: 1501-2000 Yuan | 12.5 | 0 | 10.3 | 22.9 | 10.8 | 17.6 | 8.3 | 12.0
Income: 2001-3000 Yuan | 29.2 | 0 | 24.1 | 24.3 | 28.6 | 32.4 | 16.2 | 12.5
Income: 3001-5000 Yuan | 45.8 | 0 | 20.7 | 32.9 | 39.5 | 23.5 | 15.7 | 25.6
Income: 5001-8000 Yuan | 12.5 | 0 | 17.2 | 8.5 | 11.9 | 6.6 | 8.3 | 9.4
Income: 8001-12000 Yuan | 0 | 0 | 12.1 | 0 | 3.2 | 4.4 | 2.8 | 3.4
Income: Others | 0 | 0 | 0 | 0 | 0 | 0.7 | 6.9 | 2.9
Computer using frequency: Mean | 6.7 | 6.5 | 5.7 | 6.0 | 5.9 | 5.87 | 6.9 | 6.3
Computer using frequency: Variance | 0.04 | 0.20 | 0.33 | 0.18 | 0.25 | 0.40 | 1.4 | 0.68
Computer using time: Mean | 67.5 | 44.7 | 44.7 | 33.0 | 39.4 | 37.8 | 44.8 | 41.7
Computer using time: Variance | 0.23 | 0.88 | 1.31 | 0.31 | 0.70 | 0.83 | 1.31 | 0.99
Website browsing time: Mean | 0.64 | 4.43 | 2.88 | 3.12 | 6.7 | 3.1 | 6.3 | 4.95
Website browsing time: Variance | 0.43 | 0.89 | 0.89 | 0.80 | 0.83 | 0.96 | 1.01 | 0.99

However, in computer interaction aspects such as using frequency and browsing time, this group is very similar to Category 4, so we guess that these two groups of people have similar types of jobs or similar working circumstances. Moreover, although these young women have a higher education level than the men in Category 4, they do not earn a better salary. We guess that this phenomenon may be due to lack of career experience and to gender discrimination.

Category 7 (noise group). This category is the only gender-mixed group, in which men are twice as numerous as women.


However, in terms of age, education level, and income level, this category shows no significant difference compared with the total population. As for the variables of computer using frequency, computer using time, and website browsing time, their variances are fairly large, even bigger than the overall variances. So, due to the dispersed distribution of this category on every dimension, we believe that it is a noise group.

4. Conclusion

In this paper, we proposed a new clustering algorithm named the localized ambient solidity separation (LASS) algorithm. This algorithm is built on a new isolation criterion called centroid distance, which is used to detect the nonhomogeneous density distribution of a given cluster. The proposed isolation criterion is based on the recognition that if there exist nonhomogeneous densities within a cluster, then partitions should be carried out. The intuition behind this recognition is a GMM assumption on the points' centroid distance values in a cluster; the EM algorithm was used to derive the components and parameters of the GMM. Additionally, in order to make the algorithm more efficient, we designed a nonhomogeneous density detection algorithm to reduce the computational complexity to O(n), where n is the number of points for clustering. Moreover, the parameter determination policy of the nonhomogeneous density detection algorithm was investigated. Finally, we integrated our nonhomogeneous density detection algorithm, as a follow-up mechanism, with the original dissimilarity increments clustering method and developed the LASS algorithm. It is demonstrated that, compared with the original dissimilarity increments clustering method, our LASS algorithm can identify not only naturally isolated clusters but also clusters which are adjacent, overlapping, and under background noise.

Additionally, in order to evaluate the performance of the LASS algorithm in practice, we applied it to a computer user dataset, which contains 1000 computer users' demographic and behaviour information, and compared the result with that of the original dissimilarity increments clustering method. The segmentation results show that one of the clusters generated by the dissimilarity increments clustering method is further divided into two subclusters by our LASS algorithm. The comparison of total SDB and silhouette coefficients validates the rationality of this further partition. The discussion and analysis of the segmentation results prove that our LASS algorithm can gain more knowledge and understanding from datasets with high dimensionality and diverse distribution shapes, like the computer user dataset.

There are some future directions to explore from this paper. First, the GMM assumption on centroid distance values can be further investigated and tested against more distributions, such as Gaussian and exponential. Second, our proposed centroid distance isolation criterion could be integrated with other traditional clustering methods, either partitional or hierarchical, and more strengths and weaknesses could be pointed out and analysed. Third, the centroid distance based clustering strategy in our LASS algorithm relies on the histogram distribution of centroid distance values; therefore, if the number of points in a cluster is too small, this clustering strategy may not work effectively any more. This drawback should be given enough attention and further investigated.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to acknowledge the dataset and technical support for their work provided by NELECT (National Engineering Laboratory for E-Commerce Technology, Tsinghua University) and DNSLAB of China Internet Network Information Center.

References

[1] E. Miller, "Community cleverness required," Nature, vol. 455, no. 7209, p. 1, 2008.
[2] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, 2005.
[3] O. Chapelle, B. Scholkopf, and A. Zien, Semi-Supervised Learning, MIT Press, Cambridge, Mass, USA, 2006.
[4] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527-1554, 2006.
[5] V. Jakkula and D. J. Cook, "Anomaly detection using temporal data mining in a smart home environment," Methods of Information in Medicine, vol. 47, no. 1, pp. 70-75, 2008.
[6] Q.-Y. Tang and C.-X. Zhang, "Data processing system (DPS) software with experimental design, statistical analysis and data mining developed for use in entomological research," Insect Science, vol. 20, no. 2, pp. 254-260, 2013.
[7] M. A. Musen, B. Middleton, and R. A. Greenes, "Clinical decision-support systems," in Biomedical Informatics, pp. 643-674, Springer, London, UK, 2014.
[8] L. Hubert, "Approximate evaluation techniques for the single-link and complete-link hierarchical clustering procedures," Journal of the American Statistical Association, vol. 69, no. 347, pp. 698-704, 1974.
[9] E. Diday and J. C. Simon, "Clustering analysis," in Digital Pattern Recognition, vol. 10 of Communication and Cybernetics, pp. 47-94, Springer, Berlin, Germany, 1980.
[10] S. Nassar, J. Sander, and C. Cheng, "Incremental and effective data summarization for dynamic hierarchical clustering," in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '04), pp. 467-478, ACM, June 2004.
[11] K. Tasdemir, "Vector quantization based approximate spectral clustering of large datasets," Pattern Recognition, vol. 45, no. 8, pp. 3034-3044, 2012.
[12] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review," ACM Computing Surveys, vol. 31, no. 3, pp. 264-323, 1999.
[13] A. K. Jain, "Data clustering: 50 years beyond K-means," Pattern Recognition Letters, vol. 31, no. 8, pp. 651-666, 2010.
[14] N. R. Pal and S. K. Pal, "A review on image segmentation techniques," Pattern Recognition, vol. 26, no. 9, pp. 1277-1294, 1993.
[15] M. Wedel, Market Segmentation: Conceptual and Methodological Foundations, Springer, 2000.
[16] R.-S. Wu and P.-H. Chou, "Customer segmentation of multiple category data in e-commerce using a soft-clustering approach," Electronic Commerce Research and Applications, vol. 10, no. 3, pp. 331-341, 2011.
[17] M. C. Onwezen, M. J. Reinders, I. A. van der Lans et al., "A cross-national consumer segmentation based on food benefits: the link with consumption situations and food perceptions," Food Quality and Preference, vol. 24, no. 2, pp. 276-286, 2012.
[18] F. Westad, M. Hersleth, and P. Lea, "Strategies for consumer segmentation with applications on preference data," Food Quality and Preference, vol. 15, no. 7-8, pp. 681-687, 2004.
[19] J. Macharia, R. Collins, and T. Sun, "Value-based consumer segmentation: the key to sustainable agri-food chains," British Food Journal, vol. 115, no. 9, pp. 1313-1328, 2013.
[20] T.-C. Hsieh and C. Yang, "Multi-level latent class analysis of internet use pattern in Taiwan," in e-Technologies and Networks for Development, vol. 171 of Communications in Computer and Information Science, pp. 197-208, Springer, Berlin, Germany, 2011.
[21] Z. Bosnjak and O. Grljevic, "Credit users segmentation for improved customer relationship management in banking," in Proceedings of the 6th IEEE International Symposium on Applied Computational Intelligence and Informatics (SACI '11), pp. 379-384, May 2011.
[22] E. Martinez-Garcia and M. Royo-Vela, "Segmentation of low-cost flights users at secondary airports," Journal of Air Transport Management, vol. 16, no. 4, pp. 234-237, 2010.
[23] M. Bichis-Lupas and R. N. Moisey, "A benefit segmentation of rail-trail users: implications for marketing by local communities," Journal of Park and Recreation Administration, vol. 19, no. 3, pp. 78-92, 2001.
[24] A. Bhatnagar and S. Ghose, "A latent class segmentation analysis of e-shoppers," Journal of Business Research, vol. 57, no. 7, pp. 758-767, 2004.
[25] C. Lorenzo-Romero and M.-D. Alarcon-del-Amo, "Segmentation of users of social networking websites," Social Behavior and Personality, vol. 40, no. 3, pp. 401-414, 2012.
[26] http://www.cnnic.net.cn/
[27] A. L. N. Fred and J. M. N. Leitao, "A new cluster isolation criterion based on dissimilarity increments," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 8, pp. 944-958, 2003.
[28] M. C. P. De Souto, D. S. A. De Araujo, I. G. Costa, R. G. F. Soares, T. B. Ludermir, and A. Schliep, "Comparative study on normalization procedures for cluster analysis of gene expression datasets," in Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN '08), pp. 2792-2798, June 2008.

Page 11: Localized Ambient Solidity Separation Algorithm Based Computer … · 2016-05-15 · ResearchArticle Localized Ambient Solidity Separation Algorithm Based Computer User Segmentation

Computational Intelligence and Neuroscience 11

Table 3 Results generated by dissimilarity increments clustering method

Segment Size Gender Age Education level Income levelComputer using

frequency(timesweek)

Computerusing time

(hoursweek)

Websitebrowsing time(hoursweek)

1 24 Male 100Female 0 28 Junior college 2001ndash5000 Yuan 67 675 06

2 35 Male 100Female 0 24 Bachelor degree 0ndash500 Yuan 65 448 44

3 58 Male 100Female 0 41 Junior college 3001ndash5000 Yuan 57 447 29

4 70 Male 100Female 0 32 Senior school 2001ndash3000 Yuan 6 33 31

5 185 Male 100Female 0 32 Bachelor degree 2001ndash5000 Yuan 59 39 67

6 352 Male 42Female 58 32 Junior college 1501ndash3000 Yuan 65 421 51

Table 4 Results generated by our LASS algorithm

Segment Size Gender Age Education level Income levelComputer using

frequency(timesweek)

Computerusing time

(hoursweek)

Websitebrowsing time(hoursweek)

1 24 Male 100Female 0 28 Junior college 2001ndash5000 Yuan 67 675 06

2 35 Male 100Female 0 24 Bachelor degree 0ndash500 Yuan 65 448 44

3 58 Male 100Female 0 41 Junior college 3001ndash5000 Yuan 57 447 29

4 70 Male 100Female 0 32 Senior school to

junior college 2001ndash3000 Yuan 6 33 31

5 185 Male 100Female 0 32 Bachelor degree 2001ndash5000 Yuan 59 39 67

6 136 Male 07Female 993 30 Junior college to

bachelor degree 1501ndash3000 Yuan 59 378 31

7 216 Male 681Female 329 33 Junior college 1001ndash2000 Yuan 69 448 63

been divided further apart by the dissimilarity incrementsclustering method is that there may exist much touchingareas in high-dimensional space of cluster 6 under whichsituation the dissimilarity increments clusteringmethod doesnot work anymore While our proposed centroid distancebased nonhomogeneous density detection algorithm hasfound that there still exist two potential subgroups withincluster 6 in Table 3 which are identified as clusters 6 and7 in Table 4 these two clusters are different in gender ageand computer using behaviors Cluster 6 is almost totallycomposed of women who spend less time on computer andwebsites browsing while in cluster 7 men are twice as muchas women who are older than people in cluster 6 and spendmuch more time on computers especially on browsing

In order to quantify the overall effectiveness of our LASSalgorithm a between group sum of dissimilarities (SDB) iscalculated as formula (9) which is the sumof the dissimilaritybetween a cluster centroid 119888

119894 and the overall centroid 119888

of all the data In this formula 119870 is the number of clusters

Table 5 Total SDB of two clustering methods

MethodDissimilarityincrements

clustering methodOur LASS algorithm

Total SDB 853 1109

and 119899119894is the number of points in cluster 119894 The higher the

total SDB is achieved the more adjoint the identified clustersare So it could be used to measure the effectiveness of aclustering methodThe total SDB of the original dissimilarityincrements clustering method and our LASS algorithm onthe given dataset are shown in Table 5 Obviously our LASSalgorithm achieves larger total SDB more specifically 30larger thus it fits for the given computer user dataset better

In terms of the evaluation of individual clusters silhouettecoefficient is used here whose value varies between minus1 and 1A positive value of silhouette coefficient is desirable

12 Computational Intelligence and Neuroscience

Table 6 The silhouette coefficients of clusters

Clusters Cluster 6in Table 3

Cluster 6in Table 4

Cluster 7 inTable 5

Silhouettecoefficient minus034 002 minus041

As Table 6 shows the silhouette coefficient value of cluster 6in Table 3 is negative which implies that the inside cohesionand outside separation of the cluster are not good So cluster6 in Table 3 could not be seen as a typical cluster whilethrough our LASS algorithm cluster 6 in Table 3 is identifiedas two individual clusters one of whose silhouette coefficientsis positive So as to cluster 7 whose silhouette coefficient isstill negative we guess that it belongs to some kind of back-ground noise This will be discussed later As for cluster 6 inTable 4 we believe that it is a typical prototype of Chinesefemale computer users which has not been revealed inTable 3 Therefore compared with the original dissimilarityincrements clustering method our LASS algorithm can gainmore knowledge and understanding from computer userdataset

Total SDB =

119870

sum

119894=1119899119894Dissimilarity (119888

119894 119888) (9)

Further Kruskal-WallisHTest is applied on the clusters inTable 4 to test the difference between two ormore clusters of agiven dimension As a nonparametric test method Kruskal-Wallis H Test is typically used to determine if there are sta-tistical significance differences between two or more groupsof an independent variable The results are shown in Tables 7and 8 In the hypothesis tests of Table 7 original hypothesisis that the distributions of a given variable in all 7 clusters areidentical and alternative hypothesis is that the distributionsof a given variable in all 7 clusters are not identical While inthe hypothesis tests of Table 8 original hypothesis is that thedistributions of a given variable in a given pair of clusters areidentical and alternative hypothesis is that the distributionsof a given variable in a given pair of clusters are not identicalThe 119901 values are listed and marked by star if they are biggerthan 005 whichmeans accepting the original hypothesis andrejecting the alternative one For the cases in which 119901 valueis below 005 the smaller the 119901 value is the more statisticallysignificant the variablersquos difference is In Table 7 all of the 119901values are below 0002 which means for any given variableits distributions are extremely different among the sevenclusters in Table 4Therefore we can draw the conclusion thatthese seven variables perform well in identifying differentgroups of computer users While in Table 8 119901 value changesa lot according to the given pair of clusters and variable Thesignificance of these seven variables to distinguish differentpair of clusters will be discussed one by one combined withTable 9 which reveals the detailed demographic and com-puter interaction behaviours characteristics of the obtainedseven computer users clusters

Segmentation results will be analysed from the perspec-tive of variables with the help of Table 8 and Tables 9 and 4

and significant characteristics will be pointed out For thevariable of gender Table 8 tells us that its distributions in thefirst five segments are identical which is proved to be 100male in Table 9 The most significant difference of genderlies among segments 1ndash5 segment 6 and segment 7 whichrepresentsmale groups female group andmix-gender grouprespectively For the variable of age Table 8 reveals that itsdistribution among segments 4ndash7 could be seen as identicalthemain difference happens between the first three segmentsCombined with Tables 9 and 4 we could find that segment 2consists of the youngest members whose age is around 24Segment 1 is a little bit elder group whose average age isaround 28 While segment 3 is a middle-aged group with anaverage age of 41 they are much older than other segmentsSo as to the variable of education level it discriminatesdifferent segments well Its distribution in segments 2 and 5could be seen as identical that has the highest education levelbachelor degree while the people from segment 4 have thelowest education level Other segments differ from oneanother For the variable of income level segment 1 earns thehighest income while segment 2 earns the lowest one Theincome level of segments 3 and 5 could be seen as identicalso it is with segments 4 and 6 And the former tworsquos incomeis lower than the latter tworsquos In the terms of computer usingfrequency the segments could be divided into two groupsthey are segments 1 2 and 7 and segments 3ndash6 The formergroup uses computer more frequently As for the variableof computer using time it discriminates segments 1 and 4well that spend the most and the least time on computerrespectively while for the remaining 5 segments no signifi-cant difference exists among their computer using time Forthe last variable website browsing time its distribution insegments 2 3 4 and 6 could be seen as identical differencemainly lies among segments 1 5 and 7 Specifically segment1 spends the least time on website browsing while segment 5spends the most and the browsing time of segment 7 falls inbetween segment 1 and segment 5

Based on the analysis above the 7 segments obtained byour LASS algorithm are summarized and discussed belowrespectively

Category 1 (little-browsing group) This group is entirelycomposed of young men who received a high educationlevel and earn a decent income The most significant featureof the people in this group is that although they spendthe most time on computers compared with other groupsthey seldom visit webpages We guess that for this group ofpeople the computer interaction behaviours mainly happenin workplace or public where personal browsing is notencouraged

Category 2 (little-income group) This group is composedof the youngest people who are purely male and have thehighest education level The most significant feature of thisgroup of people is that they have the same income level whichis no income Additionally they spend relatively more timeon computers and browsing websites We guess that the mainbody of this group is college students in progress who havelots of free time but no source of revenue

Computational Intelligence and Neuroscience 13

Table 7 119901 values of features among all clusters

Variables Gender Age Education level Income level Computer using frequency Computer using time Website browsing time119901 value lt0002 lt0002 lt0002 lt0002 lt0002 lt0002 lt0002

Table 8 119901 values of features between two pairs of clusters

Variables Pair of segments1-2 1-3 1-4 1-5 1-6 1-7 2-3

Gender gt05lowast gt05lowast gt05lowast gt05lowast lt0002 lt0002 gt05lowast

Age lt0005 lt0002 gt005lowast lt005 gt02lowast gt005lowast lt0002Education level lt0002 gt05lowast lt0002 lt0002 gt005lowast gt05lowast lt0002Income level lt0002 gt02lowast gt01lowast gt05lowast gt005lowast lt0002 lt0002Computer using frequency gt02lowast lt0002 lt0005 lt0002 lt0002 gt01lowast lt0005Computer using time lt0002 lt0002 lt0002 lt0002 lt0002 lt0002 gt05lowast

Website browsing time lt0002 lt005 lt001 lt0002 lt005 lt0002 gt01lowast

Variables Pair of segments2-4 2-5 2-6 2-7 3-4 3-5 3-6

Gender gt05lowast gt05lowast lt0002 lt0002 gt05lowast gt05lowast lt0002Age lt0002 lt0002 lt0002 lt0002 lt0002 lt0002 lt0002Education level lt0002 gt05lowast lt0002 lt0002 lt0002 lt0002 lt0005Income level lt0002 lt0002 lt0002 lt0002 lt0005 gt01lowast lt0002Computer using frequency lt005 lt0005 lt0002 gt05lowast gt02lowast gt02lowast gt05lowast

Computer using time lt001 gt02lowast gt01lowast gt05lowast lt0002 gt01lowast lt005Website browsing time gt02lowast gt005lowast gt005lowast gt05lowast gt05lowast lt0002 gt05lowast

Variables Pair of segments3-7 4-5 4-6 4-7 5-6 5-7 6-7

Gender lt0002 gt05lowast lt0002 lt0002 lt0002 lt0002 lt0002Age lt0002 gt05lowast gt01lowast gt05lowast lt002 gt02lowast gt01lowast

Education level gt02lowast lt0002 lt0002 lt0002 lt0002 lt0002 lt0002Income level lt0002 lt005 gt05lowast lt002 lt0002 lt0002 lt002Computer using frequency lt0002 gt05lowast gt02lowast lt0005 gt05lowast lt0002 lt0002Computer using time gt05lowast lt001 gt005lowast lt0002 gt02lowast gt01lowast lt005Website browsing time lt0005 lt0002 gt02lowast lt002 lt0002 lt005 lt0002

Category 3 (high-income group) This group of people isentirely middle-aged menThemost significant feature of thepeople in this group is the highest income they earn Besidesthey spend relatively less time on computer interaction interms of both using frequency and total browsing time Weguess that for the middle-aged men in this group most ofwhom have not received a higher education computers orInternet is not so necessary in their daily life

Category 4 (low-education group) This group is entirelycomposed of young men whose age is older than Categories1 and 2 The most significant feature of the people in thisgroup is their low-education level the average of which issenior school ranging from junior school to junior collegeMoreover they earn a medium level income and get smallervalues on every computer interaction index We guess that

this group of people is mainly engaged in jobs independentof computers

Category 5 (much-browsing group) The structure of thisgroup is very similar to Category 4 except for the highereducation they received say bachelor degree As it is shownpeople in this group earn more we guess that education dif-ference may account for this Also compared with othercategories especially Category 4 this group of people spendsmuch more time on browsing websites We guess that themain job types of this group could be intellectual work thusthey have close access to online computers

Category 6 (young-women group) Female accounts fornearly 100 in this group which is the only case in these 7categories However from computer interaction aspects say

14 Computational Intelligence and Neuroscience

Table 9 Demographic and behaviour description of computer user segmentations

Demographic and computerinteraction behaviourscharacteristics

Segment 1 Segment 2 Segment 3 Segment 4 Segment 5 Segment 6 Segment 7 Total

GenderMale 100 100 100 100 100 07 681 718Female 0 0 0 0 0 993 319 282

Age10sim20 0 0 0 0 0 22 120 4020sim25 42 686 68 143 64 125 176 14625sim30 625 229 34 242 341 419 162 27230sim35 333 86 138 286 276 213 162 21235sim40 0 0 155 229 162 147 102 13440sim50 0 0 379 10 135 59 181 14050sim60 0 0 206 0 22 14 83 5060sim70 0 0 17 0 0 0 14 06

Education levelBelow primary school 0 0 0 0 0 0 14 04Junior school 0 0 17 29 0 22 144 51Senior school 0 0 259 743 05 199 269 211Junior college 100 20 551 229 265 331 199 298Bachelor degree 0 714 172 0 730 412 287 398Others 0 86 0 0 0 37 88 37

Income levelNo income 0 914 0 0 0 51 241 126Below 500 Yuan 0 57 0 0 0 07 32 14501ndash1000 Yuan 0 29 17 0 16 29 46 261001ndash1500 Yuan 0 0 52 114 43 59 97 661501ndash2000 Yuan 125 0 103 229 108 176 83 1202001ndash3000 Yuan 292 0 241 243 286 324 162 1253001ndash5000 Yuan 458 0 207 329 395 235 157 2565001ndash8000 Yuan 125 0 172 85 119 66 83 948001ndash12000 Yuan 0 0 121 0 32 44 28 34Others 0 0 0 0 0 07 69 29

Computer using frequencyMean 67 65 57 60 59 587 69 63Variance 004 020 033 018 025 040 14 068

Computer using timeMean 675 447 447 330 394 378 448 417Variance 023 088 131 031 070 083 131 099

Website browsing timeMean 064 443 288 312 67 31 63 495Variance 043 089 089 080 083 096 101 099

using frequency and browsing time this group is very similarto Category 4 So we guess that these two groups of peoplehave similar type of job or similar working circumstanceMoreover although these young women have a higher edu-cation level thanmen in Category 4 they do not earn a better

salaryWe guess that this phenomenonmay be due to the lackof career experience and gender discrimination

Category 7 (noise group) This category is the only gendermixed group in which men are twice as much as women

Computational Intelligence and Neuroscience 15

However in terms of age education level and income levelthis category shows no significant difference compared withtotal population And as for the variables of computer usingfrequency computer using time and website browsing timetheir variances are fairly large even bigger than the overallvariances So due to the dispersed distribution of this categoryon every dimension we believe that it is a noise group

4 Conclusion

In this paper we proposed a new clustering algorithm namedlocalized ambient solidity separation (LASS) algorithmThis algorithm is built on a new isolation criterion called cen-troid distance which is used to detect the nonhomogeneousdensity distribution of a given cluster The proposed isolationcriterion is based on the recognition that if there existnonhomogeneous densities within a cluster then partitionsshould be carried out The intuition behind this recognitionis GMM assumption of the pointsrsquo centroid distance value ina cluster EM algorithm was used to derive the componentsand parameters of a GMM Additionally in order to makethe algorithmmore efficient we designed a nonhomogeneousdensity detection algorithm to reduce computation com-plexity to 119874(119899) where 119899 is the number of points for cluster-ing Moreover the parameter determination policy of non-homogeneous density detection algorithm is investigatedFinally we integrated our designed nonhomogeneous densitydetection algorithm as a follow-up mechanism with theoriginal dissimilarity increments clustering method anddeveloped LASS algorithm It is demonstrated that comparedwith the original dissimilarity increments clustering methodour LASS algorithm not only can identify naturally isolatedclusters but also can identify the clusters which are adjacentoverlapping and under background noise

Additionally in order to evaluate the performance ofLASS algorithm in practice we applied it on the computeruser dataset which contains 1000 computer usersrsquo demo-graphic and behaviours information comparing with theresult got from the original dissimilarity increments cluster-ing method The segmentation results show that one of theclusters generated by the dissimilarity increments clusteringmethod is further divided into two subclusters by our LASSalgorithm The comparison of total SDB and silhouettecoefficient validates the rationality of this further partitionThe discussion and analysis of segmentation results are madeand prove that our LASS algorithm can gainmore knowledgeand understanding from dataset with high dimensionalityand diverse distribution shapes like computer user dataset

There are some future directions to explore from thispaper First the GMM assumption of centroid distance valuecan be further investigated and tested among more dis-tributions such as Gaussian and exponential Second ourproposed centroid distance isolation criterion could beintegrated with other traditional clustering methods eitherpartitional or hierarchical more strengths and weaknessescould be pointed out and analysed Third the centroiddistance based clustering strategy in our LASS algorithmrelies on the histogram distribution of centroid distancevalues therefore if the number of points in one cluster is too

small this clustering strategy may not work effectively anymore This drawback should be given enough attention andfurther investigated

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors would like to acknowledge the dataset andtechnical support for their work provided by NELECT(National Engineering Laboratory for E-Commerce Technol-ogy Tsinghua University) and DNSLAB of China InternetNetwork Information Center

References

[1] E Miller ldquoCommunity cleverness requiredrdquo Nature vol 455no 7209 p 1 2008

[2] I H Witten and E Frank Data Mining Practical MachineLearning Tools and Techniques Morgan Kaufmann 2005

[3] O Chapelle B Scholkopf and A Zien Semi-Supervised Learn-ing MIT Press Cambridge Mass USA 2006

[4] G E Hinton S Osindero and Y-W Teh ldquoA fast learning algo-rithm for deep belief netsrdquo Neural Computation vol 18 no 7pp 1527ndash1554 2006

[5] V Jakkula and D J Cook ldquoAnomaly detection using temporaldata mining in a smart home environmentrdquo Methods of Infor-mation in Medicine vol 47 no 1 pp 70ndash75 2008

[6] Q-Y Tang and C-X Zhang ldquoData processing system (DPS)software with experimental design statistical analysis and datamining developed for use in entomological researchrdquo InsectScience vol 20 no 2 pp 254ndash260 2013

[7] M A Musen B Middleton and R A Greenes ldquoClinical deci-sion-support systemsrdquo in Biomedical Informatics pp 643ndash674Springer London UK 2014

[8] L Hubert ldquoApproximate evaluation techniques for the single-link and complete-link hierarchical clustering proceduresrdquoJournal of the American Statistical Association vol 69 no 347pp 698ndash704 1974

[9] E Diday and J C Simon ldquoClustering analysisrdquo in Digital Pat-tern Recognition vol 10 of Communication and Cybernetics pp47ndash94 Springer Berlin Germany 1980

[10] S Nassar J Sander and C Cheng ldquoIncremental and effectivedata summarization for dynamic hierarchical clusteringrdquo inProceedings of the ACM SIGMOD International Conference onManagement of Data (SIGMOD rsquo04) pp 467ndash478 ACM June2004

[11] K Tasdemir ldquoVector quantization based approximate spectralclustering of large datasetsrdquo Pattern Recognition vol 45 no 8pp 3034ndash3044 2012

[12] A K Jain M N Murty and P J Flynn ldquoData clustering areviewrdquo ACM Computing Surveys vol 31 no 3 pp 264ndash3231999

[13] A K Jain ldquoData clustering 50 years beyond K-meansrdquo PatternRecognition Letters vol 31 no 8 pp 651ndash666 2010

[14] N R Pal and S K Pal ldquoA review on image segmentation tech-niquesrdquo Pattern Recognition vol 26 no 9 pp 1277ndash1294 1993

16 Computational Intelligence and Neuroscience

[15] M Wedel Market Segmentation Conceptual and Methodologi-cal Foundations Springer 2000

[16] R-S Wu and P-H Chou ldquoCustomer segmentation of multiplecategory data in e-commerce using a soft-clustering approachrdquoElectronic Commerce Research and Applications vol 10 no 3pp 331ndash341 2011

[17] MCOnwezenM J Reinders I A van der Lans et al ldquoA cross-national consumer segmentation based on food benefits thelink with consumption situations and food perceptionsrdquo FoodQuality and Preference vol 24 no 2 pp 276ndash286 2012

[18] FWestadMHersleth and P Lea ldquoStrategies for consumer seg-mentation with applications on preference datardquo Food Qualityand Preference vol 15 no 7-8 pp 681ndash687 2004

[19] J Macharia R Collins and T Sun ldquoValue-based consumer seg-mentation the key to sustainable agri-food chainsrdquo British FoodJournal vol 115 no 9 pp 1313ndash1328 2013

[20] T-C Hsieh and C Yang ldquoMulti-level latent class analysis ofinternet use pattern in Taiwanrdquo in e-Technologies and Networksfor Development vol 171 of Communications in Computer andInformation Science pp 197ndash208 Springer Berlin Germany2011

[21] Z Bosnjak and O Grljevic ldquoCredit users segmentation forimproved customer relationship management in bankingrdquo inProceedings of the 6th IEEE International Symposium on AppliedComputational Intelligence and Informatics (SACI rsquo11) pp 379ndash384 May 2011

[22] E Martinez-Garcia and M Royo-Vela ldquoSegmentation of low-cost flights users at secondary airportsrdquo Journal of Air TransportManagement vol 16 no 4 pp 234ndash237 2010

[23] M Bichis-Lupas and R N Moisey ldquoA benefit segmentation ofrail-trail users implications for marketing by local communi-tiesrdquo Journal of Park and Recreation Administration vol 19 no3 pp 78ndash92 2001

[24] A Bhatnagar and S Ghose ldquoA latent class segmentation analysisof e-shoppersrdquo Journal of Business Research vol 57 no 7 pp758ndash767 2004

[25] C Lorenzo-Romero and M-D Alarcon-del-Amo ldquoSegmenta-tion of users of social networking websitesrdquo Social Behavior andPersonality vol 40 no 3 pp 401ndash414 2012

[26] httpwwwcnnicnetcn[27] A L N Fred and J M N Leitao ldquoA new cluster isolation crite-

rion based on dissimilarity incrementsrdquo IEEE Transactions onPatternAnalysis andMachine Intelligence vol 25 no 8 pp 944ndash958 2003

[28] M C P De Souto D S A De Araujo I G Costa R G FSoares T B Ludermir and A Schliep ldquoComparative study onnormalization procedures for cluster analysis of gene expressiondatasetsrdquo in Proceedings of the IEEE International Joint Confer-ence onNeural Networks (IJCNN rsquo08) pp 2792ndash2798 June 2008

Page 12: Localized Ambient Solidity Separation Algorithm Based Computer … · 2016-05-15 · ResearchArticle Localized Ambient Solidity Separation Algorithm Based Computer User Segmentation

12 Computational Intelligence and Neuroscience

Table 6 The silhouette coefficients of clusters

Clusters Cluster 6in Table 3

Cluster 6in Table 4

Cluster 7 inTable 5

Silhouettecoefficient minus034 002 minus041

As Table 6 shows the silhouette coefficient value of cluster 6in Table 3 is negative which implies that the inside cohesionand outside separation of the cluster are not good So cluster6 in Table 3 could not be seen as a typical cluster whilethrough our LASS algorithm cluster 6 in Table 3 is identifiedas two individual clusters one of whose silhouette coefficientsis positive So as to cluster 7 whose silhouette coefficient isstill negative we guess that it belongs to some kind of back-ground noise This will be discussed later As for cluster 6 inTable 4 we believe that it is a typical prototype of Chinesefemale computer users which has not been revealed inTable 3 Therefore compared with the original dissimilarityincrements clustering method our LASS algorithm can gainmore knowledge and understanding from computer userdataset

Total SDB =

119870

sum

119894=1119899119894Dissimilarity (119888

119894 119888) (9)

Further Kruskal-WallisHTest is applied on the clusters inTable 4 to test the difference between two ormore clusters of agiven dimension As a nonparametric test method Kruskal-Wallis H Test is typically used to determine if there are sta-tistical significance differences between two or more groupsof an independent variable The results are shown in Tables 7and 8 In the hypothesis tests of Table 7 original hypothesisis that the distributions of a given variable in all 7 clusters areidentical and alternative hypothesis is that the distributionsof a given variable in all 7 clusters are not identical While inthe hypothesis tests of Table 8 original hypothesis is that thedistributions of a given variable in a given pair of clusters areidentical and alternative hypothesis is that the distributionsof a given variable in a given pair of clusters are not identicalThe 119901 values are listed and marked by star if they are biggerthan 005 whichmeans accepting the original hypothesis andrejecting the alternative one For the cases in which 119901 valueis below 005 the smaller the 119901 value is the more statisticallysignificant the variablersquos difference is In Table 7 all of the 119901values are below 0002 which means for any given variableits distributions are extremely different among the sevenclusters in Table 4Therefore we can draw the conclusion thatthese seven variables perform well in identifying differentgroups of computer users While in Table 8 119901 value changesa lot according to the given pair of clusters and variable Thesignificance of these seven variables to distinguish differentpair of clusters will be discussed one by one combined withTable 9 which reveals the detailed demographic and com-puter interaction behaviours characteristics of the obtainedseven computer users clusters

Segmentation results will be analysed from the perspec-tive of variables with the help of Table 8 and Tables 9 and 4

and significant characteristics will be pointed out For thevariable of gender Table 8 tells us that its distributions in thefirst five segments are identical which is proved to be 100male in Table 9 The most significant difference of genderlies among segments 1ndash5 segment 6 and segment 7 whichrepresentsmale groups female group andmix-gender grouprespectively For the variable of age Table 8 reveals that itsdistribution among segments 4ndash7 could be seen as identicalthemain difference happens between the first three segmentsCombined with Tables 9 and 4 we could find that segment 2consists of the youngest members whose age is around 24Segment 1 is a little bit elder group whose average age isaround 28 While segment 3 is a middle-aged group with anaverage age of 41 they are much older than other segmentsSo as to the variable of education level it discriminatesdifferent segments well Its distribution in segments 2 and 5could be seen as identical that has the highest education levelbachelor degree while the people from segment 4 have thelowest education level Other segments differ from oneanother For the variable of income level segment 1 earns thehighest income while segment 2 earns the lowest one Theincome level of segments 3 and 5 could be seen as identicalso it is with segments 4 and 6 And the former tworsquos incomeis lower than the latter tworsquos In the terms of computer usingfrequency the segments could be divided into two groupsthey are segments 1 2 and 7 and segments 3ndash6 The formergroup uses computer more frequently As for the variableof computer using time it discriminates segments 1 and 4well that spend the most and the least time on computerrespectively while for the remaining 5 segments no signifi-cant difference exists among their computer using time Forthe last variable website browsing time its distribution insegments 2 3 4 and 6 could be seen as identical differencemainly lies among segments 1 5 and 7 Specifically segment1 spends the least time on website browsing while segment 5spends the most and the browsing time of segment 7 falls inbetween segment 1 and segment 5

Based on the analysis above the 7 segments obtained byour LASS algorithm are summarized and discussed belowrespectively

Category 1 (little-browsing group) This group is entirelycomposed of young men who received a high educationlevel and earn a decent income The most significant featureof the people in this group is that although they spendthe most time on computers compared with other groupsthey seldom visit webpages We guess that for this group ofpeople the computer interaction behaviours mainly happenin workplace or public where personal browsing is notencouraged

Category 2 (little-income group) This group is composedof the youngest people who are purely male and have thehighest education level The most significant feature of thisgroup of people is that they have the same income level whichis no income Additionally they spend relatively more timeon computers and browsing websites We guess that the mainbody of this group is college students in progress who havelots of free time but no source of revenue

Computational Intelligence and Neuroscience 13

Table 7 119901 values of features among all clusters

Variables Gender Age Education level Income level Computer using frequency Computer using time Website browsing time119901 value lt0002 lt0002 lt0002 lt0002 lt0002 lt0002 lt0002

Table 8: p values of features between pairs of clusters. (Entries marked with an asterisk correspond to pairs whose distributions are treated as identical in the analysis above.)

Variable                     1-2      1-3      1-4      1-5      1-6      1-7      2-3
Gender                       >0.5*    >0.5*    >0.5*    >0.5*    <0.002   <0.002   >0.5*
Age                          <0.005   <0.002   >0.05*   <0.05    >0.2*    >0.05*   <0.002
Education level              <0.002   >0.5*    <0.002   <0.002   >0.05*   >0.5*    <0.002
Income level                 <0.002   >0.2*    >0.1*    >0.5*    >0.05*   <0.002   <0.002
Computer using frequency     >0.2*    <0.002   <0.005   <0.002   <0.002   >0.1*    <0.005
Computer using time          <0.002   <0.002   <0.002   <0.002   <0.002   <0.002   >0.5*
Website browsing time        <0.002   <0.05    <0.01    <0.002   <0.05    <0.002   >0.1*

Variable                     2-4      2-5      2-6      2-7      3-4      3-5      3-6
Gender                       >0.5*    >0.5*    <0.002   <0.002   >0.5*    >0.5*    <0.002
Age                          <0.002   <0.002   <0.002   <0.002   <0.002   <0.002   <0.002
Education level              <0.002   >0.5*    <0.002   <0.002   <0.002   <0.002   <0.005
Income level                 <0.002   <0.002   <0.002   <0.002   <0.005   >0.1*    <0.002
Computer using frequency     <0.05    <0.005   <0.002   >0.5*    >0.2*    >0.2*    >0.5*
Computer using time          <0.01    >0.2*    >0.1*    >0.5*    <0.002   >0.1*    <0.05
Website browsing time        >0.2*    >0.05*   >0.05*   >0.5*    >0.5*    <0.002   >0.5*

Variable                     3-7      4-5      4-6      4-7      5-6      5-7      6-7
Gender                       <0.002   >0.5*    <0.002   <0.002   <0.002   <0.002   <0.002
Age                          <0.002   >0.5*    >0.1*    >0.5*    <0.02    >0.2*    >0.1*
Education level              >0.2*    <0.002   <0.002   <0.002   <0.002   <0.002   <0.002
Income level                 <0.002   <0.05    >0.5*    <0.02    <0.002   <0.002   <0.02
Computer using frequency     <0.002   >0.5*    >0.2*    <0.005   >0.5*    <0.002   <0.002
Computer using time          >0.5*    <0.01    >0.05*   <0.002   >0.2*    >0.1*    <0.05
Website browsing time        <0.005   <0.002   >0.2*    <0.02    <0.002   <0.05    <0.002

Category 3 (high-income group). This group consists entirely of middle-aged men. The most significant feature of the people in this group is the highest income they earn. Besides, they spend relatively less time on computer interaction, in terms of both using frequency and total browsing time. We guess that for the middle-aged men in this group, most of whom have not received a higher education, computers and the Internet are not so necessary in their daily life.

Category 4 (low-education group). This group is entirely composed of young men who are older than those in Categories 1 and 2. The most significant feature of the people in this group is their low education level, the average of which is senior school, ranging from junior school to junior college. Moreover, they earn a medium-level income and get smaller values on every computer interaction index. We guess that this group of people is mainly engaged in jobs independent of computers.

Category 5 (much-browsing group). The structure of this group is very similar to that of Category 4, except for the higher education its members received, namely a bachelor degree. As is shown, people in this group earn more; we guess that the education difference may account for this. Also, compared with other categories, especially Category 4, this group of people spends much more time on browsing websites. We guess that the main job type of this group could be intellectual work, so that they have close access to online computers.

Category 6 (young-women group). Female accounts for nearly 100% in this group, which is the only such case among these 7 categories. However, from computer interaction aspects, say using frequency and browsing time, this group is very similar to Category 4, so we guess that these two groups of people have similar types of jobs or similar working circumstances. Moreover, although these young women have a higher education level than the men in Category 4, they do not earn a better salary. We guess that this phenomenon may be due to the lack of career experience and gender discrimination.


Table 9: Demographic and behaviour description of computer user segmentations.

Characteristic                 Seg. 1   Seg. 2   Seg. 3   Seg. 4   Seg. 5   Seg. 6   Seg. 7   Total

Gender (%)
  Male                         100      100      100      100      100      0.7      68.1     71.8
  Female                       0        0        0        0        0        99.3     31.9     28.2

Age (%)
  10-20                        0        0        0        0        0        2.2      12.0     4.0
  20-25                        4.2      68.6     6.8      14.3     6.4      12.5     17.6     14.6
  25-30                        62.5     22.9     3.4      24.2     34.1     41.9     16.2     27.2
  30-35                        33.3     8.6      13.8     28.6     27.6     21.3     16.2     21.2
  35-40                        0        0        15.5     22.9     16.2     14.7     10.2     13.4
  40-50                        0        0        37.9     10.0     13.5     5.9      18.1     14.0
  50-60                        0        0        20.6     0        2.2      1.4      8.3      5.0
  60-70                        0        0        1.7      0        0        0        1.4      0.6

Education level (%)
  Below primary school         0        0        0        0        0        0        1.4      0.4
  Junior school                0        0        1.7      2.9      0        2.2      14.4     5.1
  Senior school                0        0        25.9     74.3     0.5      19.9     26.9     21.1
  Junior college               100      20.0     55.1     22.9     26.5     33.1     19.9     29.8
  Bachelor degree              0        71.4     17.2     0        73.0     41.2     28.7     39.8
  Others                       0        8.6      0        0        0        3.7      8.8      3.7

Income level (%)
  No income                    0        91.4     0        0        0        5.1      24.1     12.6
  Below 500 Yuan               0        5.7      0        0        0        0.7      3.2      1.4
  501-1000 Yuan                0        2.9      1.7      0        1.6      2.9      4.6      2.6
  1001-1500 Yuan               0        0        5.2      11.4     4.3      5.9      9.7      6.6
  1501-2000 Yuan               12.5     0        10.3     22.9     10.8     17.6     8.3      12.0
  2001-3000 Yuan               29.2     0        24.1     24.3     28.6     32.4     16.2     12.5
  3001-5000 Yuan               45.8     0        20.7     32.9     39.5     23.5     15.7     25.6
  5001-8000 Yuan               12.5     0        17.2     8.5      11.9     6.6      8.3      9.4
  8001-12000 Yuan              0        0        12.1     0        3.2      4.4      2.8      3.4
  Others                       0        0        0        0        0        0.7      6.9      2.9

Computer using frequency
  Mean                         6.7      6.5      5.7      6.0      5.9      5.87     6.9      6.3
  Variance                     0.04     0.20     0.33     0.18     0.25     0.40     1.4      0.68

Computer using time
  Mean                         6.75     4.47     4.47     3.30     3.94     3.78     4.48     4.17
  Variance                     0.23     0.88     1.31     0.31     0.70     0.83     1.31     0.99

Website browsing time
  Mean                         0.64     4.43     2.88     3.12     6.7      3.1      6.3      4.95
  Variance                     0.43     0.89     0.89     0.80     0.83     0.96     1.01     0.99


Category 7 (noise group). This category is the only gender-mixed group, in which men are twice as many as women. However, in terms of age, education level, and income level, this category shows no significant difference compared with the total population. And as for the variables of computer using frequency, computer using time, and website browsing time, their variances are fairly large, even bigger than the overall variances. So, due to the dispersed distribution of this category on every dimension, we believe that it is a noise group.
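The noise-group judgment above can be made mechanical. The sketch below is our own illustration, not part of the paper; column names are assumed. It flags a segment as noise-like when its within-segment variance exceeds the population variance on every behavioural variable, which is exactly the pattern segment 7 shows in Table 9.

    # Illustrative, not from the paper: flag "noise-like" segments whose
    # within-segment variance exceeds the overall variance on every
    # behavioural variable, as segment 7 does in Table 9.
    import pandas as pd

    BEHAVIOUR_COLS = [
        "computer_using_frequency",  # assumed column names
        "computer_using_time",
        "website_browsing_time",
    ]

    def noise_like_segments(df: pd.DataFrame, label_col: str = "segment") -> list:
        overall_var = df[BEHAVIOUR_COLS].var()
        segment_var = df.groupby(label_col)[BEHAVIOUR_COLS].var()
        # A segment is noise-like when it is more dispersed than the
        # whole population on every behavioural dimension.
        mask = (segment_var > overall_var).all(axis=1)
        return segment_var.index[mask].tolist()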

4. Conclusion

In this paper, we proposed a new clustering algorithm named the localized ambient solidity separation (LASS) algorithm. This algorithm is built on a new isolation criterion called centroid distance, which is used to detect the nonhomogeneous density distribution of a given cluster. The proposed isolation criterion is based on the recognition that if there exist nonhomogeneous densities within a cluster, then partitions should be carried out. The intuition behind this recognition is a GMM assumption on the points' centroid distance values in a cluster; the EM algorithm was used to derive the components and parameters of the GMM. Additionally, in order to make the algorithm more efficient, we designed a nonhomogeneous density detection algorithm to reduce the computation complexity to O(n), where n is the number of points for clustering. Moreover, the parameter determination policy of the nonhomogeneous density detection algorithm was investigated. Finally, we integrated our nonhomogeneous density detection algorithm as a follow-up mechanism with the original dissimilarity increments clustering method and developed the LASS algorithm. It is demonstrated that, compared with the original dissimilarity increments clustering method, our LASS algorithm can identify not only naturally isolated clusters but also clusters which are adjacent, overlapping, and under background noise.
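As a small illustration of the GMM-plus-EM idea summarized above, the following minimal sketch fits one-dimensional Gaussian mixtures to a cluster's centroid distance values using scikit-learn's EM-based GaussianMixture and asks how many components the data support. Selecting the component count by BIC is our own simplification for illustration; it is not the paper's O(n) histogram-based nonhomogeneous density detection algorithm.

    # A minimal sketch of the GMM assumption on centroid distances, fitted
    # with EM via scikit-learn. Model selection by BIC is our simplification;
    # the paper's O(n) nonhomogeneous density detection works on histograms.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def supported_components(centroid_distances, max_components: int = 3) -> int:
        """Number of mixture components best supported by BIC; a value
        greater than 1 hints at nonhomogeneous density, i.e., the cluster
        is a candidate for further partition."""
        X = np.asarray(centroid_distances, dtype=float).reshape(-1, 1)
        fits = [GaussianMixture(n_components=k, random_state=0).fit(X)
                for k in range(1, max_components + 1)]
        return min(fits, key=lambda m: m.bic(X)).n_components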

Additionally, in order to evaluate the performance of the LASS algorithm in practice, we applied it to the computer user dataset, which contains 1000 computer users' demographic and behaviour information, and compared the results with those obtained from the original dissimilarity increments clustering method. The segmentation results show that one of the clusters generated by the dissimilarity increments clustering method is further divided into two subclusters by our LASS algorithm. The comparison of total SDB and silhouette coefficient validates the rationality of this further partition. The discussion and analysis of the segmentation results prove that our LASS algorithm can gain more knowledge and understanding from datasets with high dimensionality and diverse distribution shapes, like the computer user dataset.
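The silhouette part of that validation step can be outlined as follows; X, the label arrays, and the use of scikit-learn's silhouette coefficient are illustrative assumptions on our part, and the paper's total SDB index is not reproduced here.

    # Sketch of the partition-validation idea: accept the finer partition
    # when it yields a higher mean silhouette coefficient. X and the label
    # arrays are placeholders; the paper's total SDB index is omitted.
    from sklearn.metrics import silhouette_score

    def finer_partition_is_better(X, labels_coarse, labels_fine) -> bool:
        return silhouette_score(X, labels_fine) > silhouette_score(X, labels_coarse)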

There are some future directions to explore from this paper. First, the GMM assumption on centroid distance values can be further investigated and tested against more distributions, such as Gaussian and exponential. Second, our proposed centroid distance isolation criterion could be integrated with other traditional clustering methods, either partitional or hierarchical, and more strengths and weaknesses could be pointed out and analysed. Third, the centroid distance based clustering strategy in our LASS algorithm relies on the histogram distribution of centroid distance values; therefore, if the number of points in one cluster is too small, this clustering strategy may not work effectively any more. This drawback should be given enough attention and further investigated.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to acknowledge the dataset and technical support for their work provided by NELECT (National Engineering Laboratory for E-Commerce Technology, Tsinghua University) and DNSLAB of China Internet Network Information Center.

References

[1] E. Miller, "Community cleverness required," Nature, vol. 455, no. 7209, p. 1, 2008.

[2] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, 2005.

[3] O. Chapelle, B. Scholkopf, and A. Zien, Semi-Supervised Learning, MIT Press, Cambridge, Mass, USA, 2006.

[4] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.

[5] V. Jakkula and D. J. Cook, "Anomaly detection using temporal data mining in a smart home environment," Methods of Information in Medicine, vol. 47, no. 1, pp. 70–75, 2008.

[6] Q.-Y. Tang and C.-X. Zhang, "Data processing system (DPS) software with experimental design, statistical analysis and data mining developed for use in entomological research," Insect Science, vol. 20, no. 2, pp. 254–260, 2013.

[7] M. A. Musen, B. Middleton, and R. A. Greenes, "Clinical decision-support systems," in Biomedical Informatics, pp. 643–674, Springer, London, UK, 2014.

[8] L. Hubert, "Approximate evaluation techniques for the single-link and complete-link hierarchical clustering procedures," Journal of the American Statistical Association, vol. 69, no. 347, pp. 698–704, 1974.

[9] E. Diday and J. C. Simon, "Clustering analysis," in Digital Pattern Recognition, vol. 10 of Communication and Cybernetics, pp. 47–94, Springer, Berlin, Germany, 1980.

[10] S. Nassar, J. Sander, and C. Cheng, "Incremental and effective data summarization for dynamic hierarchical clustering," in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '04), pp. 467–478, ACM, June 2004.

[11] K. Tasdemir, "Vector quantization based approximate spectral clustering of large datasets," Pattern Recognition, vol. 45, no. 8, pp. 3034–3044, 2012.

[12] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review," ACM Computing Surveys, vol. 31, no. 3, pp. 264–323, 1999.

[13] A. K. Jain, "Data clustering: 50 years beyond K-means," Pattern Recognition Letters, vol. 31, no. 8, pp. 651–666, 2010.

[14] N. R. Pal and S. K. Pal, "A review on image segmentation techniques," Pattern Recognition, vol. 26, no. 9, pp. 1277–1294, 1993.

[15] M. Wedel, Market Segmentation: Conceptual and Methodological Foundations, Springer, 2000.

[16] R.-S. Wu and P.-H. Chou, "Customer segmentation of multiple category data in e-commerce using a soft-clustering approach," Electronic Commerce Research and Applications, vol. 10, no. 3, pp. 331–341, 2011.

[17] M. C. Onwezen, M. J. Reinders, I. A. van der Lans et al., "A cross-national consumer segmentation based on food benefits: the link with consumption situations and food perceptions," Food Quality and Preference, vol. 24, no. 2, pp. 276–286, 2012.

[18] F. Westad, M. Hersleth, and P. Lea, "Strategies for consumer segmentation with applications on preference data," Food Quality and Preference, vol. 15, no. 7-8, pp. 681–687, 2004.

[19] J. Macharia, R. Collins, and T. Sun, "Value-based consumer segmentation: the key to sustainable agri-food chains," British Food Journal, vol. 115, no. 9, pp. 1313–1328, 2013.

[20] T.-C. Hsieh and C. Yang, "Multi-level latent class analysis of internet use pattern in Taiwan," in e-Technologies and Networks for Development, vol. 171 of Communications in Computer and Information Science, pp. 197–208, Springer, Berlin, Germany, 2011.

[21] Z. Bosnjak and O. Grljevic, "Credit users segmentation for improved customer relationship management in banking," in Proceedings of the 6th IEEE International Symposium on Applied Computational Intelligence and Informatics (SACI '11), pp. 379–384, May 2011.

[22] E. Martinez-Garcia and M. Royo-Vela, "Segmentation of low-cost flights users at secondary airports," Journal of Air Transport Management, vol. 16, no. 4, pp. 234–237, 2010.

[23] M. Bichis-Lupas and R. N. Moisey, "A benefit segmentation of rail-trail users: implications for marketing by local communities," Journal of Park and Recreation Administration, vol. 19, no. 3, pp. 78–92, 2001.

[24] A. Bhatnagar and S. Ghose, "A latent class segmentation analysis of e-shoppers," Journal of Business Research, vol. 57, no. 7, pp. 758–767, 2004.

[25] C. Lorenzo-Romero and M.-D. Alarcon-del-Amo, "Segmentation of users of social networking websites," Social Behavior and Personality, vol. 40, no. 3, pp. 401–414, 2012.

[26] http://www.cnnic.net.cn/

[27] A. L. N. Fred and J. M. N. Leitao, "A new cluster isolation criterion based on dissimilarity increments," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 8, pp. 944–958, 2003.

[28] M. C. P. De Souto, D. S. A. De Araujo, I. G. Costa, R. G. F. Soares, T. B. Ludermir, and A. Schliep, "Comparative study on normalization procedures for cluster analysis of gene expression datasets," in Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN '08), pp. 2792–2798, June 2008.
