Real-World Data Clustering Using a Hybrid of Normalized Particle Swarm
Optimization and Density Sensitive K-means Algorithm
Temitayo M. Fagbola1, Surendra C. Thakur2, Oludayo O. Olugbara3
1,2,3 ICT and Society Research Group, Durban University of Technology, P.O. Box 1334, Durban 4000, South Africa.
Abstract K-means is one of the most widely used classical partitional clustering algorithms owing to its speed of
convergence, adaptability to sparse data and simplicity of implementation. However, it only
guarantees convergence of its sum-of-squares objective function to a local minimum, while convergence to
the global optimum appears NP-hard on large, noisy and non-convex structures. This, in turn,
widens its error margin. Most existing improvements on K-means adopt techniques that
introduce additional challenges of their own, including inaccurate clustering results, high time and space
complexities and, sometimes, premature convergence. Yet high accuracy on large
datasets, robustness to noisy data, low clustering time and low sum-of-squared error are sought-after
capabilities of good clustering algorithms. In this paper, a hybrid Normalized-Particle-Swarm-Optimized-
Density-Sensitive (NPSO-DS) K-means algorithm is proposed to manage the aforementioned limitations
of K-means. The proposed NPSO-DS K-means algorithm combines the global consistency property of a
normalized Particle Swarm Optimization (PSO) technique, which incorporates Min-Max normalization and
clustering error as the objective function, with the stable properties of density-sensitive K-means to realize
convergence of particles to the global optimum on large and noisy real-world datasets. Using clustering
accuracy, sum-of-squared error and clustering time as evaluation metrics, the experimental results obtained
when the proposed algorithm was tested on Educational Process Mining (EPM) and Wine real-world
datasets show that it consistently yields high-quality results. Furthermore, the
proposed NPSO-DS K-means algorithm can identify non-convex clustering structures and offers
appreciable robustness to noisy data, thus generalizing the application areas of the conventional K-means
algorithm.
Keywords – K-means, Normalized Particle Swarm Optimization, Clustering, Real World
Dataset, Density-Sensitive Distance Metric, Min-Max Normalization
1. Introduction
Clustering is a data mining technique that groups a set of data objects
into multiple groups, called clusters, such that the objects in a cluster share very similar
attributes that distinguish them from the objects in other clusters (Joshi and Kaur, 2013;
Neelamadhab, Pragnyaban and Rasmita, 2012). The attribute values of each object are distinctive
characteristics used to assess the level of dissimilarities and similarities that uniquely differentiate
an object from other objects. Many applications arising from a wide range of problems including
exploratory data analysis, image segmentation, pattern recognition, medical image analysis, web
handling and mathematical programming have been developed using clustering algorithms (Chen
and Zhang, 2017; Jiawei and Micheline, 2006; Rauber, 2000). Owing to the huge amount of data
collected in databases, cluster analysis has recently become a major research area of interest to
many researchers. There are several applications where it is necessary to cluster a large collection
of patterns. For example, in document retrieval, millions of instances with dimensionality
beyond 100 have to be clustered to achieve data abstraction (Adigun, Omidiora,
Olabiyisi, Adetunji and Adedeji, 2012). In the same vein, the vagueness that characterizes the
borders of regions in most real-world data makes accurate clustering very difficult. Therefore,
clustering algorithms are expected to produce high quality results especially with large and noisy
real world datasets.
K-means is one of the most widely used classical partitional clustering algorithms due to
its ease of interpretation, simplicity of implementation, speed of convergence on considerably
small volumes of clean data and adaptability to sparse data (Sharfuddin, Mohammad, Dip and
Mashiour, 2015). It uses the Euclidean distance dissimilarity measure within a non-convex objective
function, which often fails to recover the correct clusters for data points with non-convex
distributions (Ling, Liefeng and Licheng, 2012). Since global consistency of data is crucial to
accurate clustering, Euclidean distance measure (EDM) is highly undesirable especially when
clusters have such complex structure and random distributions (Su and Chou, 2001).
Consequently, the error gap in K-means performance widens, as K-means can only
converge to local minima under its associated EDM. In addition, K-means has a strong sensitivity
to noisy data (Adigun et al., 2012). If there is a certain amount of noise associated with a dataset,
the final clustering results produced by K-means become automatically impaired by errors (Zhou,
Bousquet, Lal, Weston and Scholkopf, 2004). In the same vein, potential errors that may evolve
when K-means is used to cluster certain real-world critical datasets emerging from medical,
security and finance sectors can be highly disastrous. This makes K-means less suitable for
clustering large and noisy real-world datasets (Li, Lei, Bo, Yue and Jin, 2015; Amita and Ashwani,
2014).
Most existing improvements on K-means adopt techniques including
genetic algorithm (Jenn-Long, Yu-Tzu and Chih-Lung, 2012), principal component analysis
(Chetna and Garima, 2013), expectation maximization (Adigun et al., 2012), MapReduce and grid
(Li, Lei, Bo, Yue and Jin, 2015) to optimize the performance of K-means. However, these adopted
techniques often induce some additional performance drawbacks including longer steps before
convergence, curse of dimensionality, inaccurate clustering results, high time and space
complexities as well as premature convergence. In the same vein, most of these works were tested
only on noise-free or small datasets. Emphatically, insensitivity to noisy data, high accuracy
obtainable from large datasets, low clustering time and low sum-of-squared error are sought-after
capabilities of good clustering algorithms (Fagbola, Babatunde and Oyeleye, 2013). As a result,
obtaining an improved K-means that could guarantee convergence to global optimum with quality
results in the face of large and noisy real-world datasets still largely remains an open problem.
To overcome this problem, Particle Swarm Optimization (PSO) is considered to be a
leading and effective metaheuristic method that could offer improved precision, runtime efficiency
and robustness of results (Olugbara, Adetiba and Oyewole, 2015; Shinde and Gunjal, 2012) due
to its robustness to noise and its ability to efficiently find an optimal set of feature weights in large-
dimensional complex features (Oloyede, Fagbola, Olabiyisi, Omidiora and Oladosu, 2016) via a
global search. It is an evolutionary algorithm that mimics the schooling and the flocking social
behaviors of fishes and birds respectively (Kennedy and Eberhart, 1995). Characteristically, it is
fast, simple to implement and understand, requires very few parameter settings and is
computationally efficient. Furthermore, it has been widely adopted to optimize the performance of
other algorithms for solving clustering problems (Chen and Zhang, 2017; Qiang and Xinjian, 2011;
Sun, Xu and Ye, 2006), scheduling problems (Xia, Wu, Zhang and Yang, 2004; Koay and
Srinivasan, 2003), medical imaging (Jagdeep and Jatinder, 2017; Fazel and Wail, 2006) and
anomaly detection problems (Karami and Guerrero-Zapata, 2015; Abimbola, Temitayo and
Adekanmi, 2014) among others. In this study, a Normalized PSO (NPSO) based on Min-Max
technique, with clustering error as the objective function, was developed to pre-process
large, complex and noisy datasets before final clustering by K-means. The Euclidean
distance measure in K-means was replaced with a density-sensitive distance metric to maximize
the speed and improve the tendency of K-means to converge to a global optimum. Finally, a hybrid
Normalized-Particle-Swarm-Optimized-Density-Sensitive (NPSO-DS) K-means algorithm is
proposed as a major improvement over the conventional K-means and its existing modifications.
The three major contributions of this paper are mentioned as follows:
(1). Propose a modified Particle Swarm Optimization (PSO) algorithm based on Min-Max
normalization technique and termed Normalized PSO (NPSO) that uses clustering
error as the objective function. This algorithm can serve as a dimensionality reduction
technique capable of eliminating noise, managing the inherent curse of dimensionality
associated with most real-world datasets and evaluating particles’ fitness for optimal
feature subset selection in classical data mining problems domain.
(2). Propose a hybrid algorithm composed of NPSO and density-sensitive K-means. This
algorithm can be easily adapted to solve any feature selection and dimensionality
reduction problem characterized by large and noisy data with complex structures. It
can also be integrated seamlessly with any classification system to improve its quality.
(3). Evaluate the proposed hybrid Normalized-Particle-Swarm-Optimized-Density-Sensitive
(NPSO-DS) K-means algorithm quantitatively on the public
Educational Process Mining (EPM) and Wine real-world datasets using clustering
accuracy also known as Rand index, sum of squared error and clustering time as
metrics.
The rest of this paper is presented as follows: in section 2, K-means clustering algorithm, feature
selection for clustering, density-sensitive distance metric, Min-Max data normalization and trends
of improvement on K-means clustering algorithm are discussed. In section 3, real-world dataset
acquisition, development of a normalized particle swarm optimization algorithm and the hybrid
NPSO-density sensitive K-means algorithm are discussed. The results obtained are presented in
section 4, while the conclusion and future work are presented in section 5.
2. Literature Review
Clustering is a common approach for statistical machine learning-based data analytics that
has been widely employed in a number of challenging domains like pattern recognition, medical
imaging, bioinformatics, social media analytics and so on (Su and Chou, 2001). Clustering is
an unsupervised learning approach that attempts to group a finite set of
closely related samples into one group called a cluster. Given an unlabeled dataset, the task is to
put like-samples in a cluster such that each cluster possesses maximum intracluster and minimum
intercluster similarities based on some indices (Joshi and Kaur, 2013). However, finding clusters
in high-dimensional spaces is computationally expensive and may degrade the learning
performance of most learning systems.
2.1 K-means clustering algorithm
K-means is one of the most commonly used algorithms in the field of data mining,
introduced to solve various clustering problems. K-means is a partitioning clustering technique in
which clusters are formed with the help of centroids. On the basis of these centroids, clusters can
vary from one another with different iterations (Nasser, Alkhaldi and Vert, 2004). Moreover, data
elements can vary from one cluster to another, as clusters are based on the random numbers known
as initial centroids. The clusters are fully dependent on the selection of the initial clusters centroids.
K data elements are selected as initial centers; then distances of all data elements are calculated by
squared Euclidean distance measure. Data elements having less distance to centroids are moved to
the appropriate cluster. The process is continued until no more changes occur in clusters. The
clusters generated by K-means are non-hierarchical in nature (Twinkle et al., 2014). It requires a
huge initial set to start the clustering and does not guarantee convergence. It is easy to implement
and debug, and its objective function optimizes intra-cluster similarity. K-means is applicable
only when the mean is defined, and it terminates at a local optimum because it depends on a
gradient-descent-style algorithm (Hai, Yunlong, Li and Zhu, 2010). This makes it incapable of handling noise
and outliers. The pseudocode description of K-means is presented in Algorithm 1. By using
Euclidean distance as a measure of dissimilarity, K-means algorithm has a good performance on
the data with compact super-sphere distributions but tends to fail with data characterized by more
complex and unknown shapes, which indicates that this dissimilarity measure is undesirable when
clusters have random distributions. In this case, there arises the need for a more intuitive objective
function in K-means, on the one hand to realize high intra-cluster (within-cluster) similarity and low
inter-cluster (between-cluster) similarity and on the other hand, for robustness to large, complex
and noisy datasets with arbitrary shaped clusters.
Input: Number of desired clusters K; a set D = {d1, d2, …, dn} of data objects
Output: A set of K clusters
(1) Specify the number of clusters (k) for D
(2) Randomly select k centroids in the data space, D, or select first k instances
(3) Calculate the distance of all data points to the centroids in D
(4) Assign each data point to the nearest cluster using the shortest Euclidean distance
(5) Re-compute new cluster centers by averaging the observations assigned to a cluster
(6) Repeat steps 3, 4 and 5 until no more changes occur or convergence criterion is satisfied
(7) Stop
Algorithm 1: Conventional K-means (Azhar, Arthur and Vassilvitskii, 2012)
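For illustration, Algorithm 1 can be realized in a few lines of NumPy, as sketched below; the function name, the exact convergence test and the random seed are illustrative choices rather than part of the source.

import numpy as np

def kmeans(D, k, max_iter=100, seed=0):
    """Minimal sketch of Algorithm 1: random initial centroids, squared-Euclidean
    assignment (steps 3-4) and mean-based centroid update (step 5)."""
    rng = np.random.default_rng(seed)
    centroids = D[rng.choice(len(D), size=k, replace=False)]     # step (2)
    for _ in range(max_iter):                                    # step (6)
        dists = ((D[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)                            # step (4)
        new_centroids = np.array([D[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])            # step (5)
        if np.allclose(new_centroids, centroids):                # no more changes
            break
        centroids = new_centroids
    return labels, centroids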
2.2 Feature Selection for Clustering
Feature selection is a problem pervasive in all domains of application of machine learning
and data mining including but not limited to product image classification, robotics and pattern
recognition, text categorization and medical applications especially for diagnosis, prognosis and
drug discovery (Fagbola, Olabiyisi and Adigun, 2012; Isabelle, 2008). Feature subset selection is
often a randomized or probabilistic selection of inputs which can further be formulated as an
optimization problem towards searching the solution space of subsets for an optimal or near-
optimal subset of features based on some specified criteria. Basically, feature selection algorithms
select subset of highly discriminant features. In other words, features that are capable of
discriminating samples that belong to different classes are identified and selected. This is a major
step to realizing effective utilization of computational resources and some cost savings. It often
provides better understanding of the data, the model and prediction performance (Fagbola et al.,
2012). Feature selection algorithms search for the best feature subset that reduces the feature space
dimensionality with the smallest change in classification accuracy. In other words, given a set of
D features, the algorithm chooses a subset of size d < D, which has the greatest ability to
discriminate between classes. The selection of the optimal subset out of all possible subsets is an
NP-hard problem. Thus, with large input spaces, the high computational overhead of optimal
methods necessitates the use of heuristic techniques to find near-optimal subsets in relatively
reduced computational times.
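To make the intractability of optimal subset selection concrete, the exhaustive wrapper search below scores every non-empty subset of the features; the scoring function is a hypothetical placeholder supplied by the caller, and with 2^D - 1 candidates the loop is already infeasible for modest D, which is precisely what motivates the heuristic searches surveyed next.

from itertools import combinations

def exhaustive_feature_search(features, score):
    """Score every non-empty feature subset and return the best one.
    With D features there are 2**D - 1 candidates (over a million at D = 20),
    which is why heuristics replace exhaustive search as D grows."""
    best_subset, best_score = None, float("-inf")
    for r in range(1, len(features) + 1):
        for subset in combinations(features, r):
            s = score(subset)
            if s > best_score:
                best_subset, best_score = subset, s
    return best_subset, best_score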
An exploratory study of some widely used feature selection algorithms including variants
of Sequential Forward / Backward Selection (SFS/SBS), Branch and Bound, and relaxed Branch
and Bound was carried out by Kudo and Sklansky (2000). Other approaches include genetic
algorithms (Siedlecki and Sklansky, 1988), floating search (Pudil et al., 1994), the Tabu search
metaheuristic (Zhang and Sun, 2006), simulated annealing (Siedlecki and Sklansky, 1988) and
particle swarm optimization (Shinde and Gunjal, 2012). However, Particle Swarm Optimization
(PSO) emerged as a leading metaheuristic method for feature selection and multi-thresholding
because of its ability to effectively find an optimal set of feature weights that improve precision,
runtime efficiency and robustness of results (Olugbara, Adetiba and Oyewole, 2015; Shinde and
Gunjal, 2012).
By description, PSO is a stochastic, population-based evolutionary algorithm for devising
efficient solutions to numerous general optimization problems. PSO simulates the shared behavior
happening among the flocking birds and schooling fishes (Mohammed, Pavlik, Cen, Wu and
Koedinger, 2009). It is computationally cheap due to its low memory and CPU requirements and
can easily be implemented (Qinghai, 2010; Eberhart, Simpson and Dobbins, 1996). Additionally,
it does not suffer from the problem of overfitting often encountered by other evolutionary
computation techniques (Kennedy and Eberhart, 1995). The search can be carried out by the speed
of the particle. It also depends on a population of individuals to discover favorable regions of the
search space. Every member of the population is called a particle and the group of all particles is
called a swarm. The aim of the PSO is to find the particle position that results in the best evaluation
of a given objective (fitness) function. The flow diagram illustrating the PSO algorithm is
presented in Figure 1. PSO searches the problem domain by manipulating the trajectories of
moving points in a multidimensional space. The movement of each particle towards the optimal
solution is governed by the position and velocity of each individual, own previous best
performance and that of their neighbors. All particles receive the broadcast of the best position
encountered by all swarm particles. The relationships among the particles are often conceptualized
as a graph G = {V, E}, where each vertex in V depicts a swarm particle and each edge in E connects a
pair of particles. Generally, the basic PSO algorithm consists of three steps, which are the generation of
particle’s positions and velocities, velocity update and position update (Olaleye, Olabiyisi,
Olaniyan and Fagbola, 2014). First, the positions 𝑥𝑖𝑑 and velocities 𝑉𝑖𝑑 of the initial swarm of
particles are initialized randomly and generated using upper and lower bounds on the search
variables values, LB and UB, expressed as:
$X_{id} = LB + \mathrm{rand} \cdot (UB - LB)$ (1)
$V_{id} = \dfrac{LB + \mathrm{rand} \cdot (UB - LB)}{\Delta t}$ (2)
In equations (1 and 2), rand is a uniformly distributed random variable that takes a value between
0 and 1. This initialization process allows the swarm particles to be randomly distributed across
the search space. Afterwards, the swarm updates its best value at every cycle in order to find the
optimized solution after several iterations using (Eberhart and Shi, 2001):
$V_{id}(t+1) \leftarrow w \cdot V_{id}(t) + c_1 r_1 \big(p_{id}(t) - x_{id}(t)\big) + c_2 r_2 \big(p_{gd}(t) - x_{id}(t)\big)$ (3)
and
$x_{id}(t+1) \leftarrow x_{id}(t) + V_{id}(t+1)$ (4)
6
where $V_{id}(t)$ is the velocity of particle $i$ at time $t$ along dimension $d$ of the search space,
$p_{id}(t)$ is the best position previously found by the particle (pbest), $x_{id}(t)$ is the current
position of particle $i$ in the search space, $r_1$ and $r_2$ are randomly generated numbers in the
range $[0, 1]$, $p_{gd}(t)$ is the overall best position found by any particle (gbest), $c_1$ and $c_2$
are acceleration parameters and $w$ is the inertia weight, whose value is decreased linearly over time
from 0.9 to 0.4. Furthermore, $x_{id}(t+1)$ is the new position to which the particle must move and
$V_{id}(t+1)$ is the new velocity that determines this new position (Mohammed et al., 2009). The three
steps of velocity update, position update and fitness calculation are repeated until a desired
convergence criterion is met.
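As a compact sketch of equations (1)-(4), one PSO generation can be coded as below; the bounds LB and UB, the time step, and the acceleration parameters c1 = c2 = 2 are illustrative assumptions, while w decays linearly from 0.9 to 0.4 as stated above.

import numpy as np

rng = np.random.default_rng(42)
n_particles, dim = 30, 5
LB, UB, dt = -1.0, 1.0, 1.0   # assumed search bounds and time step

# Equations (1) and (2): random initialization of positions and velocities
X = LB + rng.random((n_particles, dim)) * (UB - LB)
V = (LB + rng.random((n_particles, dim)) * (UB - LB)) / dt

def pso_step(X, V, pbest, gbest, t, t_max, c1=2.0, c2=2.0):
    """One PSO generation: velocity update (equation 3), position update (equation 4)."""
    w = 0.9 - (0.9 - 0.4) * t / t_max          # inertia weight, decayed from 0.9 to 0.4
    r1, r2 = rng.random(X.shape), rng.random(X.shape)
    V = w * V + c1 * r1 * (pbest - X) + c2 * r2 * (gbest - X)   # equation (3)
    return X + V, V                                             # equation (4)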
Figure 1: Flow Diagram Illustrating the Behaviour of Particle Swarm Optimization Algorithm
(Olaleye et al., 2014)
2.3 Density-Sensitive Distance Metric
Given a density-adjusted length of line segment defined as (Ling et al., 2012):
$L(x_i, x_j) = \rho^{dist(x_i, x_j)} - 1$ (5)
where 𝑑𝑖𝑠𝑡(𝑥𝑖, 𝑥𝑗) is the Euclidean distance between xi and xj whilst ρ > 1 is the flexing factor, the
length of line segment between two points is elongated or shortened by adjusting the flexing factor
$\rho$. To describe the global consistency of data points, let the data points be the nodes of a graph
G = (V, E), and let $p \in V^l$ be a path of length $l = |p|$ connecting the nodes $p_1$ and $p_{|p|}$,
in which $(p_k, p_{k+1}) \in E$ for $1 \leq k < |p|$. With $P_{ij}$ denoting the set of all paths
connecting nodes $x_i$ and $x_j$, the density-sensitive distance metric between two points can be
defined as (Ling, Liefeng and Licheng, 2012):
$D_{ij} = \min_{p \in P_{ij}} \sum_{k=1}^{|p|-1} L(p_k, p_{k+1})$ (6)
such that $D_{ij}$ satisfies the four conditions for a metric, that is,
$D_{ij} = D_{ji}$; $D_{ij} \geq 0$; $D_{ij} \leq D_{ik} + D_{kj}$ for all $x_i, x_j, x_k$; and $D_{ij} = 0$ iff $x_i = x_j$.
With these conditions satisfied, the density-sensitive distance metric can measure the geodesic
distance along the manifold, which results in any two points in the same region of high density
being connected by a lot of shorter edges while any two points in different regions of high density
are connected by a longer edge through a region of low density. That is, the distance between a
pair of points is measured by finding the shortest path in the graph G. This achieves the aim of
elongating the distance among data points in different regions of high density and simultaneously
shortening that in the same region of high density (Ling, Liefeng and Licheng, 2012). Hence, this
distance metric can help converge complex and unstructured data to global optimum.
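A small sketch of equations (5) and (6) follows, using SciPy's shortest-path routine over a fully connected graph; connecting every pair of points is a simplifying assumption of ours (a sparser neighbourhood graph would serve equally well).

import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import cdist

def density_sensitive_distances(X, rho=2.0):
    """All-pairs density-sensitive distances D_ij: edge lengths from equation (5),
    then the minimal path length of equation (6) via Dijkstra's algorithm."""
    assert rho > 1.0                      # flexing factor must exceed 1
    L = rho ** cdist(X, X) - 1.0          # equation (5) applied to every edge
    np.fill_diagonal(L, 0.0)
    return shortest_path(L, method="D")   # equation (6): shortest paths in G

Because long Euclidean edges are stretched exponentially, a path made of many short hops through a dense region ends up shorter than one long hop across a sparse region, which is exactly the behaviour described above.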
2.4 Min-Max Data Normalization
Normalization is employed to standardize the features of a dataset using a specified
predefined criterion so that redundant and noisy objects can be eliminated and only valid and
reliable data, which can improve the quality of results, are used. Normalization is sometimes
used to enhance specific feature measurement methods rather than to fix problems. Data
normalization techniques include Min-Max, Z-Score and decimal scaling (Vaishali and Rupa,
2011). However, Min-Max technique is chosen for this study because of its robustness to noise
(Xiaoyan and Yanping, 2016). Min-max normalization performs a linear transformation on the
original data. Suppose that 𝑚𝑖𝑛𝑎 and 𝑚𝑎𝑥𝑎 are the minimum and the maximum values for attribute
$A$. Min-Max normalization maps a value $v$ of $A$ to $v'$ in the range $[0, 1]$ by computing:
$v' = \dfrac{v - min_a}{max_a - min_a}$ (7)
where $v'$ is the normalized value, $v$ is the original value of the attribute, $min_a$ is the minimum
value of attribute $A$ in the dataset and $max_a$ is the maximum.
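Equation (7) reduces to the column-wise one-liner sketched below; the guard against constant attributes is our own addition.

import numpy as np

def min_max_normalize(X):
    """Column-wise Min-Max normalization, equation (7): v' = (v - min_a) / (max_a - min_a)."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)   # avoid division by zero on constant columns
    return (X - mins) / span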
2.5 Trends of Improvement on K-means Clustering Algorithm
Over time, several significant improvements have been made to K-means. Ming-Chuan,
Jungpin, Jin-Hua and Don-Lin (2005) modified the K-means clustering algorithm using a simple
partitioning method. The authors highlighted that most K-means methods require expensive
distance calculations of centroids to achieve convergence. In their work, binary splitting was used
to partition the original dataset into blocks. Each block unit (UB) containing at least one pattern
has its centroid (CUB) determined via a simple calculation in order to form a reduced dataset that
represents the original dataset. The reduced dataset was then used to compute the final centroid of
the original dataset. Each UB was examined on the boundary of candidate clusters to find the
closest final centroid for every pattern in the UB. In this manner, the time for calculating final
converged centroids was dramatically reduced. It was claimed that the algorithm showed
significant improvement in performance in terms of total execution time, the number of distance
calculations and the efficiency of clustering than other K-means algorithms. However, the
modified K-means needs more iterations to reach the k centroids and sometimes fails to converge
even after spending the maximum number of iterations.
Levent, Seyda, Ding and Lee (2007) developed a K-SVMeans for multi-type interrelated
datasets that combines K-means clustering with Support Vector Machines. In a bid to eliminate
the need for labeled training instances for SVM learning, the cluster assignments of K-means are
used to train an online SVM on the secondary data type, and the SVM influences the clustering
decisions of K-means in the primary clustering space. This heterogeneous clustering process
effectively increases the clustering performance compared to clustering using a single
homogeneous data source. The authors reported results for Euclidean and spherical K-means
averaged over ten runs. Euclidean K-means makes cluster assignment decisions based on the
Euclidean distances between the document vectors while the spherical K-means uses the cosine
distances between documents as the similarity metric. The experimental results on analysis of
citeseer and newsgroup datasets which are real-world web-based datasets show that K-SVMeans
can successfully discover topical clusters of documents and achieve better clustering solutions than
the homogeneous K-means algorithm. However, K-SVMeans suffers from high computational effort
and can give inaccurate results if the initial dataset used to train the SVM is very large. The
computational demands of SVM parameter settings and training are also a great
challenge.
Mary and Raja (2009) used the Ant Colony Optimization (ACO) to improve K-means
clustering performance. The authors improved the cluster quality after grouping via a two-phased
process. The resultant technique uses Euclidean distance and remains highly sensitive to the
changes in the value of the initial k. This makes it less applicable for clustering real world datasets.
Qian and Xinjian (2011) developed an improved K-means algorithm in gene expression data
analysis based on the Kruskal algorithm. Firstly, the minimum spanning tree (MST) of the
objects to be clustered was obtained by the Kruskal algorithm. Then, the K-1 edges of largest weight
were deleted. Finally, the average values of the objects in each of the K connected subgraphs
resulting from the last two steps were regarded as the initial clustering centers. Results
showed that this method is less sensitive to the initial choice of K than the conventional K-means
algorithm and increased the stability and accuracy of clusters. However, the developed K-means
algorithm failed when tested on large, complex and real-time datasets and suffers from high
time complexity. In addition, the developed technique is difficult to program.
Adigun, Omidiora, Olabiyisi, Adetunji and Adedeji (2012) developed a hybrid K-means –
Expectation Maximization (KEM) algorithm to address the limitations of K-means and
Expectation Maximization (EM) algorithms. K-means converges only to local minima after a large
number of trials while EM converges prematurely. The hybrid KEM algorithm was developed via
the initialization stage and the iterative stage. In the initialization stage, the weighted average
variation of the K-means algorithm was used to classify the data into the number of clusters
desired. At the iterative stage, a large number M, of uniformly distributed random cluster point
vectors for the cluster centers were selected. Any cluster point vectors that are too close to other
cluster point vectors were eliminated and M was reduced accordingly until the number of clusters
produced equalled the number of desired clusters. This was achieved by computing the distances between all
the clusters and eliminating the clusters with distances lesser than a predefined threshold value.
Assigning each of the feature vectors to the nearest random cluster point vector was the next step
achieved by computing the distance between each feature vector and all other cluster point vectors.
The feature vector was assigned to the cluster point vector such that the distance between them is
the shortest. The hybrid algorithm improved on K-means and EM in accuracy and in computational
efficiency and was tested on a real-world educational dataset. However,
the hybrid KEM still converges to local minima because its K-means component uses the Euclidean
distance metric, and it is as such not suitable for clustering large real-world datasets. The hybrid KEM
was also not developed to handle the noise that characterizes real-world datasets.
Momin and Yelmar (2012) developed a Rough–Fuzzy Possibilistic K-means (RFPKM).
The membership function of the fuzzy sets enables overlapping clusters, and the concept of lower and
upper approximations from rough sets handles uncertainty, vagueness and incompleteness.
Possibilistic membership functions generate memberships which are compatible with the center of
the class and not coupled with centers of other classes. RFPKM can cluster categorical data by
using the probability distribution of categorical values. The evaluation results obtained showed that
RFPKM gives a lower objective function value for categorical data clustering than the
traditional K-means and the variants considered. However, it produced inaccurate classification with
noisy and large datasets. Furthermore, Jenn-Long, Yu-Tzu and Chih-Lung (2012) developed a
hybrid method based on the genetic algorithm (GA) and the K-means algorithm, termed GAKM.
function of GAKM is to determine the optimal weights of the attributes and centers of clusters that
are needed to classify the dataset. GA generates an optimal solution by means of reproduction,
crossover and mutation operators. In GAKM, the result of K-means algorithm was used to adjust
the GA parameters. If fitness value is satisfied, the best solution is obtained, otherwise, the GA
parameters are recombined and re-evaluated to generate an optimal number of clusters. The work
did not present any evaluation result. However, it was reported that the developed GAKM
performed better than K-means on categorical data. This improvement comes at the expense of
additional computational overhead, because GA requires longer execution steps to obtain the
optimal number of clusters. Overfitting is also a challenge of the developed GAKM because the
K-means module was implemented using Euclidean distance.
Shanmugapriya and Punithavalli (2012) developed a modified projected K-means
clustering algorithm with effective distance measure that continuously optimizes a comprehensive
objective function. In the objective function of this developed algorithm, an effective distance
measure makes use of local and non-local information to provide better clustering results in high
dimensional data. In order to avoid the value of the objective function from decreasing as a
consequence of the exclusion of dimensions, virtual dimensions were incorporated with the
objective function. It only works efficiently in principle as the developed algorithm was not
evaluated. Mohamed and Wesam (2013) addressed the problems of random initialization of
prototypes and the requirement of pre-defined number of clusters in the dataset for classical K-
means. Randomly initialized prototypes reportedly produce results that converge to local rather
than global optimum. Based on this rationale, an improved K-means clustering algorithm called
Efficient Data Clustering Algorithm (EDCA) was developed. This algorithm uses a density
computation of data points based on the K-Nearest Neighbor method to determine the
initial number of clusters. Furthermore, noise and outliers which affect K-means strongly were
detected. Their result showed slight improvement over the conventional K-means algorithm.
EDCA is able to detect clusters with different non-convex shapes, different sizes and densities.
This solution suffers from the high computational overhead incurred by the K-Nearest Neighbor
method and is not suitable for highly complex data.
Nidhi and Ujjwal (2013) developed an incremental K-means clustering algorithm that
assigns any random data object to the first cluster of a given set of data objects. After selecting the
next random object, the distance between selected object and centroids of existing clusters was
determined. This distance was compared with the threshold limit so as to be able to group the
object into existing cluster or form a new cluster with that object. Experimental results revealed
that the developed algorithm produced clusters in less computation time, but only with small and
noise-free datasets. It cannot handle large, noisy datasets in a computationally efficient manner due
to the rigid nature of the incremental approach used. Chunfei and Zhiyi (2013) modified the
traditional K-means algorithm by improving on the initial focal point and process of determining
the K value. The cluster center was initialized and adjusted. The Euclidean distance of various data
objects from each cluster center was calculated and the square error criterion function was
determined to ascertain whether convergence had been reached. The improved clustering algorithm added
data-point weights to the cluster centers so as to reduce or even avoid the impact of noisy data
in the dataset. The final clustering result of the modified K-means showed improved
performance over some variants when evaluated. However, the developed technique was tested on
small-sized datasets with low noise levels and is thus not appropriate for real-world data clustering.
Chetna and Garima (2013) developed a linear Principal Component Analysis (PCA)-based
hybrid K-means PSO algorithm for clustering large dataset. PCA module was executed to convert
high dimensional data to low dimension using covariance matrix. Then, the K-means clustering
algorithm was made to search for the clusters’ centroid locations using the Euclidean distance
similarity metric. This information was passed to the PSO module for the generation of the final
optimal clustering solution as the result. In general, PSO conducts a global search for the optimal
clustering but requires more iterations. The PSO was assisted by K-means to start with
good initial cluster centroids that converge faster, thereby yielding a more compact result. The result
from the K-means module was treated as the initial seed for the PSO module to discover the
optimal solution by a globalized search to avoid high computational time complexity. Better
clustering results were obtained with the PCA-based HYBRID (K-PSO) algorithm when compared with
PSO only. The hybrid system is complex, converged to local minima given clusters with wide
variation in size and shape, incurred high computational overhead and was not evaluated with other
improved K-means variants.
Furthermore, Li, Lei, Bo, Yue and Jin (2015) developed an improved K-means algorithm
based on MapReduce and grid. The improved method divides the data space into uniform grids
according to the range of the data points' attribute values and assigns each point to its corresponding grid. It
counts the number of data points in each grid, selects the M (M > K) grids containing the maximum
numbers of data points and calculates their central points. These M central points serve as input data
to determine the K value based on the clustering results. Among the M points, it finds the K points farthest
from each other and uses those K center points as the initial cluster centers of the K-means algorithm. At the
same time, the maximum value in M was included in K. If the number of data points in a grid is less
than a threshold, those points were considered noise points and were removed. In order
to make the improved algorithm adapt to handle large data, the improved K-means algorithm was
paralleled and combined with the MapReduce framework. Theoretical analysis and experimental
results show that the improved algorithm compared to the traditional K-means clustering algorithm
has high quality results, less iteration and good stability.
Sharfuddin, Mohammad, Dip and Mashiour (2015) argued that the current minimum
distance in traditional K-means is not always the correct minimum distance because the distance
between a cluster center and each data point is measured in every iteration. This makes the
algorithm more complex and increases the number of computations. In the modified version of K-
means algorithm developed by the authors, a checkpoint value was added to store the center point
of the distance of two cluster centers and was used to determine the cluster an object is going to
be assigned to. This checkpoint value reduced the possibility of error during the clustering process.
The authors reported that the modified K-means requires less computation and has enhanced
accuracy than the traditional K-means algorithm as well as some modified variants of it. However,
a shortage of available resources and time limited the work, and the developed K-means algorithm
was not tested on large, complex and real-world datasets.
Min, Tommy and Rosa (2015) clustered heterogeneous data with K-means by mutual
information-based Unsupervised Feature Transformation (UFT). The work addressed the
computational complexities of K-means algorithm for datasets with large sample sizes and its
sensitivity to outliers. To address these challenges, the mutual information-based unsupervised
feature transformation which could transform non-numerical features into numerical features was
integrated with the conventional K-means to cluster the heterogeneous data. Simulation results
showed that the integrated UFT-K-means improved over other clustering algorithms with
reasonable clusters for one modified real-world dataset and five real-world benchmark datasets.
However, the developed algorithm is parameter-dependent and computationally highly inefficient.
In summary, most existing improvements on K-means do not sufficiently improve its ability to cluster
noisy and large data accurately and in a computationally efficient manner, while some others suffer
from high computational overhead. Consequently, clustering large and noisy datasets with K-means
in a computationally efficient and accurate manner still remains largely an open problem, which is
addressed in this study.
3. Materials and Method
The experimental architecture for the hybrid NPSO-DS K-means algorithm is presented in
3 developmental stages:
i. Real-World Dataset Acquisition
ii. Development of a Normalized Particle Swarm Optimization
iii. Integration of NPSO into a Density-Sensitive K-means algorithm
3.1 Real-World Dataset Acquisition
UCI Educational Process Mining (EPM) and wine datasets are the most widely used real
world datasets in literatures. These datasets can be accessed and downloaded from
https://archive.ics.uci.edu.ml/datasets. However, the description of the datasets is
presented in Table 1. However, sample EPM and Wine datasets are shown in Figures (2
and 3) respectively.
i. UCI Educational process mining (EPM) Dataset: This is a publicly-available learning
analytics dataset from smartlab located in Italy. It was collected in 2015 and contains the
time series of students’ activities during 6 laboratory sessions of a course on digital
electronics. There are 6 folders containing the students’ data per session. Each folder
contains up to 99 CSV files, one per student log for that session. The number
of files in each folder varies with the number of students present in each session.
In all, the dataset contains 230318 instances with 13 integer attributes.
ii. UCI Wine dataset: The data are the results of a chemical analysis of wines grown
in Italy but derived from three different cultivars. The analysis determined the quantities
of 13 constituents found in each of the three types of wines.
Table 1
Description of the EPM and Wine Real-World Datasets

UCI Dataset   Instances   Number of Attributes   Type of Attributes
EPM           230318      13                     Integer
Wine          178         13                     Integer and real
Figure 2: Sample Data of Educational Process Mining Dataset
Figure 3: Sample Data of Wine Dataset
3.2 Development of a Normalized Particle Swarm Optimization Algorithm
The particle swarm optimization (PSO) technique was applied to reduce the dimension and
number of the particles to be clustered by K-means. The conventional PSO was modified such that
it incorporates the clustering error measure (CE) as the objective function. The clustering error
(CE) is defined as (Ling, Liefeng and Licheng, 2012):
$CE(\Delta, \Delta^{true}) = \dfrac{1}{n} \sum_{i=1}^{k^{true}} \sum_{j=1,\, j \neq i}^{k} \mathrm{Confusion}(i, j)$ (8)
where the clustering produced, $\Delta$, is given by
$\Delta = \{C_1, C_2, \ldots, C_k\}$, (9)
the true clustering, $\Delta^{true}$, is expressed as
$\Delta^{true} = \{C_1^{true}, C_2^{true}, \ldots, C_{k^{true}}^{true}\}$, (10)
and $n$ is the total number of data points. Thus, for all $i \in [1, \ldots, k^{true}]$ and
$j \in [1, \ldots, k]$, $\mathrm{Confusion}(i, j)$ denotes the number of data points shared by the true
cluster $C_i^{true}$ and the produced cluster $C_j$. However, there exists a renumbering problem; for
example, cluster 1 in the true clustering might be assigned cluster 3 in the clustering produced. To
counter this, the CE is computed for all possible renumberings of the clustering produced and the
minimum of all those is taken; the best clustering is the one with the smallest CE.
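The clustering error of equation (8), including the minimization over renumberings, can be sketched as below; using SciPy's Hungarian solver to find the best renumbering is our shortcut, equivalent to (but far cheaper than) enumerating every permutation.

import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_error(true_labels, pred_labels):
    """Equation (8): fraction of points falling outside their true cluster,
    minimized over all renumberings of the clusters produced.
    Labels are assumed to be integers numbered from 0."""
    k_true, k = true_labels.max() + 1, pred_labels.max() + 1
    confusion = np.zeros((k_true, k))
    for t, p in zip(true_labels, pred_labels):
        confusion[t, p] += 1                   # Confusion(i, j): shared points
    # The best renumbering maximizes the matched mass, hence minimizes CE
    rows, cols = linear_sum_assignment(confusion, maximize=True)
    return 1.0 - confusion[rows, cols].sum() / len(true_labels)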
The flowchart and use case diagram of the proposed NPSO are presented in Figures (4 and 5)
respectively. Given a set of data points as input, the normalized PSO is expected to return a reduced
set of discriminant particles. In this study, five (5) major steps were carried out to develop the NPSO
technique for a given set of inputs, namely $n$ data points $\{x_i\}_{i=1}^{n}$, a maximum iteration
number $t_{max}$ and a stop threshold $e$.
Step 1: Particles are initialized via random generation to form an initial population where each
particle represents a feasible cluster solution. The number of particles is taken as the product of the
dataset dimension and the number of clusters to be generated. The dataset represents a swarm and its
constituent elements represent the particles. Analytically, the swarm is composed of a set of particles:
$p = \{p_1, p_2, p_3, \ldots, p_n\}$ (11)
where $n$ is the dimension of the dataset.
Step 2: The position and velocity of the particles are initialized, such that at any time step $t$ the
particle $p_i$ has two associated vectors, a position $X_i(t)$ and a velocity $V_i(t)$. Each candidate
solution possesses a position, which represents the solution in the search space, and a velocity, which
moves the particle in search of the global optimal solution. The particles' positions and velocities were
initialized using equations (1 and 2) respectively.
Step 3: Evaluation of particles' fitness: the fitness value of each particle was computed using the
clustering error described in equation (8). At each generation, the best fitness values were updated
using (Gursharan and Harpreet, 2014):
$P_i(t+1) = \begin{cases} P_i(t) & \text{if } f(X_i(t+1)) \leq f(X_i(t)) \\ X_i(t+1) & \text{if } f(X_i(t+1)) > f(X_i(t)) \end{cases}$ (12)
where $f$ denotes the fitness function (clustering error), $P_i(t)$ stores the best fitness value and the
coordinates at which it was attained, $X_i(t)$ is the current position and $t$ denotes the generation step.
Step 4: Position and velocity update: the search for the global optimal solution was carried out through
a dynamic update of the particles in the swarm. Equation (3) is used to update the velocity as a function
of the previous velocity, the particle's own best performance and the swarm's best performance. The
position is updated by adding the incremental change in position at each step using equation (4). At this
step in the conventional PSO, some particles usually move out of the search space boundary, which leads
to errors and in turn degrades the overall output accuracy. This is usually due to the presence of noisy
data in the dataset (Gursharan and Harpreet, 2014).
[Flowchart summary: input the dataset $\{x_i\}_{i=1}^{n}$, $t_{max}$ and $e$; generate the initial particle population; initialize positions and velocities via equations (1 and 2); evaluate fitness with the clustering error of equation (8); update best fitness values via equation (12); update positions and velocities via equations (3 and 4); normalize particles with the Min-Max rule of equation (7); repeat while $t \leq t_{max}$; output the final global best particle population $\{x_i\}_{i=1}^{m}$ with $m < n$.]
Figure 4: Flowchart of the Proposed NPSO
[Use case summary: the User supplies the real-world dataset; the PSO generates the initial population, initializes and updates particle positions and velocities, evaluates and updates best fitness values at each generation using the clustering error function, adopts the Min-Max algorithm to normalize particles and minimize noise, and generates the final population of particles.]
Figure 5: Use Cases Diagram for the NPSO
In this study, the devastating impact of noisy data was addressed by forcing straying particles
to remain within the boundary, or resetting them to the boundary value, using the Min-Max
normalization function defined in equation (7).
Step 5: Steps 2-4 are repeated until one of the following termination conditions is satisfied:
a. The maximum number of iterations is reached.
b. The mean change in centroid vectors is less than a predetermined value.
After the completion of step 5, the expected output is $m$ data points $\{x_i\}_{i=1}^{m}$ with $m \ll n$.
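Putting Steps 1-5 together, a high-level sketch of the NPSO loop might look as follows; the fitness callback (mapping a particle to its clustering error), the acceleration constants and the use of Min-Max rescaling to pull stray particles back into [LB, UB] are stated assumptions, and pbest is updated whenever fitness improves, a common reading of equation (12).

import numpy as np

def npso(X0, fitness, t_max=100, seed=0):
    """Sketch of the NPSO of Steps 1-5. `fitness` is assumed to map a particle
    position to its clustering error (equation 8); lower is better."""
    rng = np.random.default_rng(seed)
    LB, UB = X0.min(), X0.max()                                  # assumed search bounds
    X, V = X0.copy(), LB + rng.random(X0.shape) * (UB - LB)      # equations (1, 2)
    pbest, pbest_f = X.copy(), np.array([fitness(x) for x in X])
    for t in range(t_max):                                       # Step 5(a): iteration cap
        gbest = pbest[pbest_f.argmin()]
        w = 0.9 - (0.9 - 0.4) * t / t_max
        r1, r2 = rng.random(X.shape), rng.random(X.shape)
        V = w * V + 2.0 * r1 * (pbest - X) + 2.0 * r2 * (gbest - X)   # equation (3)
        X = X + V                                                      # equation (4)
        # Min-Max step (equation 7): rescale particles back into [LB, UB]
        span = np.where(np.ptp(X, axis=0) > 0, np.ptp(X, axis=0), 1.0)
        X = LB + (X - X.min(axis=0)) / span * (UB - LB)
        f = np.array([fitness(x) for x in X])
        improved = f < pbest_f                    # improvement variant of equation (12)
        pbest[improved], pbest_f[improved] = X[improved], f[improved]
    return pbest[np.argsort(pbest_f)]             # best particles first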
3.3 The Hybrid NPSO-Density Sensitive K-means Algorithm
The hybrid NPSO-density sensitive K-means is the product of integrating the NPSO
algorithm into a density-sensitive K-means algorithm. The corresponding conceptual flow and use
case diagrams are shown in Figures (6 and 7) respectively. As presented in Algorithm 2, with DS-
K-means, a density-sensitive distance is incorporated into K-means to replace the Euclidean
distance. The justification for this step is borne out of the fact that poor assignment of particles to
clusters is inevitable especially where the particle has equal minimum Euclidean distance to a
number of clusters.
Figure 6: The flow diagram of the Hybrid NPSO-DS K-means algorithm
Figure 7: Use Cases Diagram of the Hybrid NPSO-DS K-means algorithm
Input: $m$ data points $\{x_i\}_{i=1}^{m}$ obtained from NPSO; cluster number $k$; maximum
iteration number $t_{max}$; stop threshold $e$.
Output: Partition of the dataset $C_1, \ldots, C_k$.
(1) randomly choose $k$ data points from the $k$ best position particles of PSO to initialize $k$
cluster centers;
(2) for any two data points 𝑥𝑖 and 𝑥𝑗 do
(3) compute the density-sensitive distance using equations (5 and 6);
(4) assign each particle to the closest centroid calculated by the minimum density-sensitive
distance;
(5) if all particles have not been assigned, then go to (4) else go to (6)
(6) recalculate new centroid for each cluster
(7) end for
(8) if centroids move or the maximum number of iterations, 𝑡𝑚𝑎𝑥, has not been reached, then
go to (2) else go to (9)
(9) stop
Algorithm 2: The hybrid NPSO-DS K-means Algorithm
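A sketch of Algorithm 2 follows, reusing density_sensitive_distances from Section 2.3; precomputing the full m x m distance matrix and recomputing each center as the cluster medoid (the member point with minimal total density-sensitive distance to its cluster) are our simplifying assumptions, the latter because arithmetic means are not defined under a graph metric.

import numpy as np

def ds_kmeans(X, k, t_max=100, rho=2.0, seed=0):
    """Sketch of Algorithm 2: K-means-style clustering under the
    density-sensitive metric of equations (5) and (6)."""
    rng = np.random.default_rng(seed)
    D = density_sensitive_distances(X, rho)                # steps (2)-(3)
    centers = rng.choice(len(X), size=k, replace=False)    # step (1)
    for _ in range(t_max):                                 # step (8): iteration cap
        labels = D[:, centers].argmin(axis=1)              # steps (4)-(5): assignment
        new_centers = centers.copy()
        for j in range(k):                                 # step (6): new center per cluster
            members = np.flatnonzero(labels == j)
            if len(members):
                new_centers[j] = members[D[np.ix_(members, members)].sum(axis=1).argmin()]
        if np.array_equal(new_centers, centers):           # centers no longer move
            break
        centers = new_centers
    return labels, centers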
Consequently, the centroids are forced to converge to local minima and as such would be unable
to typify any group of data as desired (Olugbara, Adetiba and Oyewole, 2015). However,
employing a density-based objective function makes convergence to the global optimum possible even
with arbitrarily and non-convexly shaped clusters (Joshi and Kaur, 2013). Clusters can easily be
formed by data points located in dense regions while the low density regions separate data points
from different clusters.
3.4 Performance Evaluation Metrics
The performance of the developed NPSO-DS K-means algorithm was evaluated using the
following metrics:
i. Clustering time: This represents the time required to cluster all data points. This
parameter depends on the platform where the clustering is implemented and will dictate
if real-time functionality is available or not.
ii. Sum-of-Squared Error (SSE): This is the sum of squares of the departure from the
average for each calculated value of data (Jiming and Yu, 2005)
$SSE = \sum_{i=1}^{n} (x_i - \bar{x})^2$ (13)
where $n$ denotes the number of particles, $x_i$ represents the actual value of the $i$th particle and
$\bar{x}$ is the mean of all values.
iii. Clustering Accuracy: This is also known as the Rand Index (RI), a measure that
describes the actual percentage of documents that are correctly assigned to their
corresponding clusters. It is defined as (Rand, 1971):
$\text{Clustering Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN} \times 100\%$ (14)
where 𝑇𝑃, 𝑇𝑁, 𝐹𝑃 and 𝐹𝑁 represent the true positive, true negative, the false positive
and the false negative values respectively. In this study, 𝑇𝑃 defines two close particles
that are correctly assigned to the same cluster, a 𝑇𝑁 correctly assigns two contrasting
particles in different clusters. Similarly, 𝐹𝑃 defines two contrasting particles that are
wrongly placed in the same cluster while the 𝐹𝑁 wrongly assigns two close particles
in different clusters.
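The two metrics of equations (13) and (14) reduce to a few lines, sketched below; the pair-counting reading of TP, TN, FP and FN given above is implemented directly, which costs O(n^2) pairs and is acceptable for evaluation purposes.

import numpy as np
from itertools import combinations

def sse(x):
    """Equation (13): sum of squared departures from the mean."""
    x = np.asarray(x, dtype=float)
    return ((x - x.mean()) ** 2).sum()

def clustering_accuracy(true_labels, pred_labels):
    """Equation (14): Rand index as a percentage, counting agreeing pairs."""
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            tp += 1          # two close particles correctly placed together
        elif not same_true and not same_pred:
            tn += 1          # two contrasting particles correctly separated
        elif not same_true and same_pred:
            fp += 1          # contrasting particles wrongly placed together
        else:
            fn += 1          # close particles wrongly separated
    return 100.0 * (tp + tn) / (tp + tn + fp + fn)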
4. Results
In this study, a hybrid NPSO-DS K-means algorithm was developed and benchmarked with
three (3) variants which are K-means, PCA-based HYBRID (K-PSO) and UFT-K-means. All the
algorithms were implemented using MATLAB 7.7.0 (R2008b) on a Windows 7 Ultimate 32-bit
operating system with an AMD Athlon(tm) X2 Dual-Core QL-66 central processing unit running at
2.2 GHz, 2 GB of random access memory and a 320 GB hard disk drive. We tested for values of K = 2,
3, 4 and the results obtained for educational process mining (EPM) and wine datasets are presented
in Tables (2 and 3) respectively. In all the evaluations, results were obtained for the three (3)
effective metrics for evaluating a good clustering algorithm, which include clustering accuracy,
clustering time and sum of squared error (Hai et al., 2010). In Figures (8a and 8b), the sample
visual outputs of NPSO-DS and PCA-based HYBRID (K-PSO) K-means Algorithm with EPM
Dataset are respectively shown for K = 2, 3 and 4.
Table 2
Evaluation results of the clustering algorithms using the EPM dataset

Number of Clusters   Algorithm                   Clustering accuracy (%)   Clustering time (s)   Sum of Squared Error
2                    K-means                     64.8                      83.2                  0.48
2                    PCA-based HYBRID (K-PSO)    72.1                      99.4                  0.39
2                    UFT-K-means                 77.7                      87.2                  0.33
2                    Developed NPSO-DS K-means   80.2                      85.7                  0.28
3                    K-means                     67.3                      78.7                  0.42
3                    PCA-based HYBRID (K-PSO)    76.4                      91.8                  0.32
3                    UFT-K-means                 79.1                      84.7                  0.27
3                    Developed NPSO-DS K-means   83.6                      80.5                  0.21
4                    K-means                     69.2                      74.4                  0.36
4                    PCA-based HYBRID (K-PSO)    83.9                      87.2                  0.28
4                    UFT-K-means                 87.3                      80.4                  0.22
4                    Developed NPSO-DS K-means   92.4                      74.8                  0.13
Table 3
Evaluation results of the clustering algorithms using the wine dataset

Number of Clusters   Algorithm                   Clustering accuracy (%)   Clustering time (s)   Sum of Squared Error
2                    K-means                     88.5                      16.7                  0.164
2                    PCA-based HYBRID (K-PSO)    89.4                      24.3                  0.160
2                    UFT-K-means                 91.1                      21.7                  0.156
2                    Developed NPSO-DS K-means   93.6                      17.2                  0.133
3                    K-means                     91.3                      14.3                  0.148
3                    PCA-based HYBRID (K-PSO)    92.2                      23.8                  0.126
3                    UFT-K-means                 92.9                      19.9                  0.113
3                    Developed NPSO-DS K-means   94.8                      15.1                  0.098
4                    K-means                     92.8                      12.1                  0.119
4                    PCA-based HYBRID (K-PSO)    94.1                      21.7                  0.106
4                    UFT-K-means                 95.6                      18.4                  0.097
4                    Developed NPSO-DS K-means   96.3                      13.9                  0.082
Figure 8a: Sample Output of NPSO-DS K-means Algorithm with EPM Dataset (panels: K = 2, K = 3, K = 4)
Figure 8b: Sample Output of PCA-based HYBRID (K-PSO) with EPM Dataset (panels: K = 2, K = 3, K = 4)
4.1 Clustering Accuracy (Rand Index)
As shown in Figure 9, the clustering accuracies produced by the original K-means, PCA-
based HYBRID (K-PSO), UFT-K-means and the developed NPSO-DS K-means for 2 clusters (K
= 2) using EPM dataset are 64.8%, 72.1%, 77.7% and 80.2% respectively. For 3 clusters (K = 3)
using EPM dataset, the accuracies obtained by the original K-means, PCA-based HYBRID (K-
PSO), UFT-K-means and the developed NPSO-DS K-means are 67.3%, 76.4%, 79.1% and 83.6%
respectively. When cluster number was increased to 4 (K = 4), the original K-means, PCA-based
HYBRID (K-PSO), UFT-K-means and the developed NPSO-DS K-means yielded accuracies of
69.2%, 83.9%, 87.3% and 92.4% respectively on EPM dataset. However, in Figure 10, the
clustering accuracies produced by the original K-means, PCA-based HYBRID (K-PSO), UFT-K-
means and the developed NPSO-DS K-means for 2 clusters (K = 2) using wine dataset are 88.5%,
89.4%, 91.1% and 93.6% respectively. In the same vein, the accuracies produced by the original
K-means, PCA-based HYBRID (K-PSO), UFT-K-means and the developed NPSO-DS K-means
are 91.3%, 92.2%, 92.9% and 94.8% respectively for 3 clusters (K = 3) with wine dataset. When
cluster number was increased to 4 (K = 4), the original K-means, PCA-based HYBRID (K-PSO),
UFT-K-means and the developed NPSO-DS K-means yielded accuracies of 92.8%, 94.1%, 95.6%
and 96.3% respectively on wine dataset.
Figure 9: Accuracy of the Clustering Algorithms on EPM Dataset
Figure 10: Accuracy of the Clustering Algorithms on wine Dataset
4.2 Clustering Time
The execution time of the clustering algorithms obtained on EPM dataset is presented in
Figure 11. The original K-means, PCA-based HYBRID (K-PSO), UFT-K-means and the
developed NPSO-DS K-means converged approximately in 83.2s, 99.4s, 87.2s and 85.7s
respectively when the number of clusters was 2. Similarly, the original K-means, PCA-based
HYBRID (K-PSO), UFT-K-means and the developed NPSO-DS K-means converged in
approximately 78.7s, 91.8s, 84.7s and 80.5s respectively for 3 clusters. When cluster number was
increased to 4 (K = 4), the original K-means, PCA-based HYBRID (K-PSO), UFT-K-means and
the developed NPSO-DS K-means converged at approximate clustering time of 74.4s, 87.2s, 80.4s
and 74.8s respectively. Furthermore, the execution times of the clustering algorithms on the wine
dataset are presented in Figure 12.
Figure 11: Execution time of the Clustering Algorithms on EPM Dataset
Figure 12: Execution time of the Clustering Algorithms on Wine Dataset
The original K-means, PCA-based HYBRID (K-PSO), UFT-K-means and the developed NPSO-
DS K-means converged approximately in 16.7s, 24.3s, 21.7s and 17.2s respectively when the
number of clusters was 2. Similarly, the original K-means, PCA-based HYBRID (K-PSO), UFT-
K-means and the developed NPSO-DS K-means converged in approximately 14.3s, 23.8s, 19.9s
and 15.1s respectively for 3 clusters. When cluster number was increased to 4 (K = 4), the original
K-means, PCA-based HYBRID (K-PSO), UFT-K-means and the developed NPSO-DS K-means
converged at approximate clustering time of 12.1s, 21.7s, 18.4s and 13.9s respectively.
4.3 Sum of Squared Error (SSE)
The SSE incurred by the clustering algorithms on the EPM dataset is shown in Figure 13.
The original K-means, PCA-based HYBRID (K-PSO), UFT-K-means and the developed NPSO-
DS K-means incurred errors of 0.48, 0.39, 0.33 and 0.28 respectively when the number of clusters
was 2. Similarly, the original K-means, PCA-based HYBRID (K-PSO), UFT-K-means and the
developed NPSO-DS K-means yielded errors of 0.42, 0.32, 0.27 and 0.21 respectively for 3 clusters.
When the cluster number was increased to 4 (K = 4), the original K-means, PCA-based HYBRID
(K-PSO), UFT-K-means and the developed NPSO-DS K-means had errors of 0.36, 0.28, 0.22 and
0.13 respectively. In Figure 14, the errors obtained by the clustering algorithms on the Wine dataset
are presented.
Figure 13: Sum of Squared Error obtained from the Clustering Algorithms on EPM Dataset
Figure 14: Sum of Squared Error obtained from the Algorithms on Wine Dataset
The original K-means, PCA-based HYBRID (K-PSO), UFT-K-means and the developed NPSO-
DS K-means incurred errors of 0.164, 0.16, 0.156 and 0.133 respectively when the number of
clusters was 2. In addition, the original K-means, PCA-based HYBRID (K-PSO), UFT-K-means
and the developed NPSO-DS K-means yielded errors of 0.148, 0.126, 0.113 and 0.098 respectively
for 3 clusters. However, when the cluster number was increased to 4 (K = 4), the original K-means,
PCA-based HYBRID (K-PSO), UFT-K-means and the developed NPSO-DS K-means had errors
of 0.119, 0.106, 0.097 and 0.082 respectively.
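For reference, SSE is the total squared Euclidean distance of each point from its assigned centroid; the small magnitudes reported above are consistent with normalized features, though that scaling is an assumption here. A minimal sketch follows.

import numpy as np

def sum_of_squared_error(X, labels, centroids):
    # labels is an integer array assigning each row of X to a centroid.
    residuals = X - centroids[labels]
    # Total squared Euclidean distance over all points.
    return float(np.sum(residuals ** 2))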
4.4 Discussion
The developed NPSO-DS K-means algorithm outperformed the conventional K-means, UFT-K-
means and PCA-based HYBRID (K-PSO) clustering algorithms on both the EPM and the Wine
real-world datasets, especially in terms of clustering accuracy and sum of squared error. The
consistently lowest accuracies produced by K-means in all the evaluations conducted on the EPM
dataset indicate that K-means is not a good candidate for clustering large real-world datasets such
as EPM, which contains 230318 instances. As the number of clusters was increased, K-means
showed some improvement in accuracy, but its accuracy remained the lowest among the algorithms
considered: for cluster numbers 2, 3 and 4, K-means obtained accuracies of 64.8%, 67.3% and
69.2% respectively. This indicates that the higher the number of clusters, the better the clustering
accuracy of the K-means algorithm, a behaviour common to all the algorithms evaluated.
The results obtained for K-means corroborate the assertion of Li et al. (2015) that K-means
can fail on large and noisy datasets because it converges only to local minima and is limited by its
default Euclidean distance similarity metric. However, on the Wine dataset, which contains only
178 instances, K-means improved markedly, with accuracies of 88.5%, 91.3% and 92.8% for
cluster numbers 2, 3 and 4 respectively. This implies that K-means is well suited to small datasets,
as stated by Twinkle et al. (2014). It is worth mentioning that K-means is the most computationally
efficient algorithm, producing the lowest clustering time in all the evaluations conducted on the
EPM and Wine datasets, followed by the developed NPSO-DS K-means, UFT-K-means and the
PCA-based HYBRID (K-PSO) algorithm in that order. In all the evaluations conducted on the
Wine and EPM datasets, the developed NPSO-DS K-means algorithm was the most accurate and
had the least sum of squared error, followed by UFT-K-means, PCA-based HYBRID (K-PSO)
and the original K-means in that order. This superior performance is attributable to the data
normalization and relevant particle selection procedures, as well as the globally converging
density-sensitive distance measure, incorporated into the developed NPSO-DS K-means algorithm.
Olaleye et al. (2014) and Fagbola et al. (2012) stated that improvements in feature selection and
data normalization procedures invariably impact the effectiveness of data mining algorithms,
which justifies the results obtained for the NPSO-DS K-means algorithm.
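As an illustration of the Min-Max normalization step credited above, the following minimal sketch rescales each feature to [0, 1]; the guard against constant-valued features is an added assumption.

import numpy as np

def min_max_normalize(X):
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    # Avoid division by zero on features with a single constant value.
    span = np.where(hi > lo, hi - lo, 1.0)
    return (X - lo) / span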
5. Conclusion and Future Works
This research work presents an NPSO-DS K-means algorithm based on relevant optimal
particle selection and a density-sensitive distance measure. The results reveal that the developed
NPSO-DS K-means algorithm outperforms the conventional K-means, UFT-K-means and
PCA-based HYBRID (K-PSO) algorithms, especially in terms of clustering accuracy.
This superior performance is attributable to the relevant particle selection procedure and the
globally converging density-sensitive distance measure incorporated into the developed algorithm.
Olaleye et al. (2014) and Fagbola et al. (2012) stated that improvements in efficient and effective
feature selection procedures invariably impact the effectiveness of clustering algorithms, which
justifies the results obtained for the NPSO-DS K-means algorithm. On the other hand, the lowest
accuracies produced by K-means in all the evaluations corroborate the assertion of Li et al. (2015)
that K-means is not a good candidate for clustering large real-world datasets. The developed
NPSO-DS K-means can identify non-convex clustering structures, thus generalizing the
application area of the conventional K-means algorithm, and can be applied in situations where
the distributions of data points are not compact super-spheres. The experimental results on the
EPM real-world dataset, which contains 230318 instances, validate the effectiveness of the
developed algorithm. Based on the results obtained, the developed NPSO-DS K-means clustering
algorithm performed best in all the evaluations conducted on the EPM and Wine datasets in terms
of clustering accuracy and sum of squared error. However, it yielded a higher clustering time than
the conventional K-means only, possibly due to the time required to normalize data and select
relevant features at each generation of the NPSO technique before the final clustering of the
resultant particles by DS-K-means. Although it is more computationally efficient than
UFT-K-means and PCA-based HYBRID (K-PSO), its clustering time can be further investigated
for possible improvements, and future research can be directed along this direction.
References
1. Abimbola Adebisi Adigun, Temitayo Matthew Fagbola and Adekanmi Adegun (2014).
Swarmdroid: Swarm Optimized Intrusion Detection System for the Android Mobile
Enterprise. International Journal of Computer Science Issues (IJCSI), Mauritius,
11(3): 62-69.
2. Adigun A.A, Omidiora E.O, Olabiyisi S.O, Adetunji A.B, Adedeji O.T (2012): “Development
of a Hybrid K-means-Expectation Maximization Clustering Algorithm”, Journal of
Computations & Modeling, 2(4): 55-65.
3. Amita V. and Ashwani K., (2014): “Performance Enhancement of K-means Clustering
Algorithms for High Dimensional Data sets”, International Journal of Advanced Research in
Computer Science and Software Engineering, 4(1), ISSN: 2277 128.
4. Azhar T., Arthur D. and S. Vassilvitskii (2012): “K-means++: The advantages of careful
seeding”, Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms
(SODA ’07), PA, USA, 1027-1035.
5. Chen J. Y. and H. Y. Zhang (2017). "Research on application of clustering algorithm based
on PSO for the web usage pattern," International Conference on Wireless Communications,
Networking and Mobile Computing, pp.3705-3708.
6. Chetna Sethi and Garima Mishra (2013): “A Linear PCA based hybrid K-means PSO
algorithm for clustering large dataset”, International Journal of Scientific & Engineering
Research, 4(6), 1559-1566.
7. Chunfei Zhang and Zhiyi Fang (2013): “An Improved K-means Clustering Algorithm”,
Journal of Information & Computational Science, 10(1), 193–199.
8. Eberhart R. C. and Shi Y. (2001): “Particle swarm optimization: Developments, Applications
and Resources”, IEEE Proceedings of the Congress on Evolutionary Computation, 27-30.
9. Eberhart R.C., Simpson P. and Dobbins R. (1996): “Computational Intelligence PC Tools”,
A Book of Intelligent Systems, Academic Press.
10. Fagbola Temitayo Matthew, Babatunde Ronke Seyi and Oyeleye Akinwale (2013). Image
Clustering Using a Hybrid GA-FCM Algorithm. International Journal of Engineering and
Technology, UK, 3(2): pp 99-107.
11. Fagbola Temitayo, Olabiyisi Stephen and Adigun Abimbola (2012): “Hybrid GA-SVM for
Efficient Feature Selection in E-mail Classification”, Computer Engineering and Intelligent
Systems, 3(3): 17-28.
12. Fazel Keshtkar and Wail Gueaieb (2006). Segmentation of Dental Radiographs using a
Swarm Intelligence Approach, in proceedings of IEEE Canadian Conference on Electrical
and Computer Engineering, Ottawa, Canada, pp. 328– 331. DOI:
10.1109/CCECE.2006.277656.
13. Gursharan Saini and Harpreet Kaur (2014): “A Novel Approach towards K-Mean Clustering
Algorithm with PSO”, International Journal of Computer Science and Information
Technologies, 5 (4), 5978-5986.
14. Hai Shen, Yunlong Zhu, Li Jin and Zhu Zhu (2010). Hybridization of Particle Swarm
Optimization with the K-Means Algorithm for Clustering Analysis. pp. 531-535, 978-1-4244-
6439-5/10/IEEE.
15. Isabelle Guyon (2008). Practical Feature Selection: from Correlation to Causality. Pattern
Recognition. Letter, 28(12):1438–1444.
16. Jagdeep Kaur and Jatinder Singh Bal (2017). A Study of Particle Swarm Optimization based
K-means Clustering for Detection of Dental Caries in Dental X-ray Images, International
Journal of Advanced Research in Computer Science, 8(4).
17. Jenn-Long, Yu-Tzu H. and Chih-Lung, G. (2012): Mining Student Behavior Models in
Learning-by-Teaching Environments. In Proceedings of the 1st International Conference on
Educational Data Mining, 127-136.
18. Jiawei Han and Micheline Kamber (2006): “Data Mining Concepts and Techniques”, Morgan
Kaufmann Publishers, 2nd Revised edition.
19. Jiming Peng and Yu Xia (2005). A Cutting Algorithm for the Minimum Sum-of-Squared
Error Clustering. In Proceedings of the 2005 SIAM International Conference on Data
Mining, Society for Industrial and Applied Mathematics, ISBN: 978-0-89871-593-4.
20. Joshi A., and Kaur R. (2013): A review: Comparative Study of Various Clustering Techniques
in Data Mining. International Journal of Advanced Research in Computer Science and
Software Engineering, 3(3), 55-57.
21. Karami A. and Guerrero-Zapata M. (2015). “A Fuzzy Anomaly Detection System based on
Hybrid PSO-K-means Algorithm in Content-centric Networks,” Neurocomputing, Vol. 149,
pp. 1253–1269.
22. Kennedy J. and Eberhart R.C. (1995): “Particle Swarm Optimization”, Proceedings IEEE
International Conference on Neural Networks, IV, p. 1942-1948.
23. Koay C. A. and D. Srinivasan (2003). “Particle Swarm Optimization-based Approach for
Generator Maintenance Scheduling,” in Proceedings of the IEEE Swarm Intelligence
Symposium (SIS ’03), pp. 167–173, IEEE.
24. Levent Bolelli, Ertekin Seyda, Zhou Ding and Clyde Lee (2007): A Clustering Method for
Web Data with Multi-type Interrelated Components. In: Proceedings of the International
Conference on the World Wide Web, 1121-1122.
http://doi.acm.org/10.1145/1242572.1242725.
25. Li Zheng, Lei Tao, Yue Yin and Jin Ding (2015): “A Framework for Hierarchical Ensemble
Clustering”, ACM Transactions on Knowledge Discovery from Data (ACM TKDD), 9(2): 9-
16.
26. Ling Wang, Liefeng Bo, Licheng Jiao (2012): “A Modified K-means Clustering with a
Density-Sensitive Distance Metric”, Technical report, University of California, Department
of Information and Computer Science, Irvine, CA.
27. Mary C. Immaculate and Raja Kasmir (2009): “A Modified Ant-based Clustering for Medical
Data”, International Journal on Computer Science and Engineering, 2(7), 2253-2257.
28. Min Wei, Tommy W. S. Chow and Rosa H. M. Chan (2015): “Clustering Heterogeneous Data
with K-means by Mutual Information-Based Unsupervised Feature Transformation”, Entropy
2015, 17(3), 1535-1548; doi:10.3390/e17031535.
29. Ming-Chuan Hung, Jungpin Wu, Jin-Hua Chang and Don-Lin Yang (2005): “An Efficient K-
means Clustering Algorithm Using Simple Partitioning”, Journal of Information Science and
Engineering, 21, 1157-1177.
30. Mohammed T. H. Elbatta and Wesam M. Ashou (2013): “A Dynamic Method for Discovering
Density Varied Clusters”, International Journal of Signal Processing, Image Processing and
Pattern Recognition, 6(1), 123-134.
31. Mohammed Tiri, Pavlik, P., Cen, H., Wu, L. and Koedinger, K. (2009): Using Item-type
Performance Covariance to Improve the Skill Model of an Existing Tutor. In Proceedings of
the 1st International Conference on Educational Data Mining: 77-86.
32. Momin, B.F. and Yelmar, P.M. (2012). “Modifications in K-means Clustering Algorithm”,
International Journal of Soft Computing and Engineering, 2(3), pp. 6297-6316.
33. Nasser S., Alkhaldi R. and Vert G. (2004): Semi-supervised learning literature survey,
University of Wisconsin-Madison.
34. Neelamadhab Padhy, Pragnyaban Mishra and Rasmita Panigrahi (2012): The Survey of Data
Mining Applications and Feature Scope. CoRR abs/1211.5723.
35. Nidhi Gupta and Ujjwal R. L. (2013), "An Efficient Incremental Clustering Algorithm" in
World of Computer Science and Information Technology Journal (WCSIT), 3(5),97-99.
36. Olaleye Oludare, Olabiyisi Stephen, Olaniyan Ayodele and Fagbola Temitayo (2014): “An
Optimized Feature Selection Technique for Email Classification”, International Journal of
Scientific and Technology Research, 3(10): 286-293.
37. Oloyede Ayodele, Fagbola Temitayo, Olabiyisi Stephen, Omidiora Elijah and Oladosu John
(2016): Development of a Modified Local Binary Pattern-Gabor Wavelet Transform Aging
Invariant Face Recognition System. In Proceedings of ACM International Conference on
Computing Research & Innovations, University of Ibadan, Nigeria, pp. 108-114, 7-9
September 2016.
38. Oludayo O. Olugbara, Emmanuel Adetiba and Stanley A. Oyewole (2015). Pixel Intensity
Clustering Algorithm for Multilevel Image Segmentation, Mathematical Problems in
Engineering, Volume 2015, pp. 1-19, http://dx.doi.org/10.1155/2015/649802.
39. Pudil P., Ferri F. J., Novovicova J. and Kittler J (1994). Floating Search Methods for Feature
Selection with Nonmonotonic Criterion Functions. Proceedings of the 12th IAPR International
Conference on Computer Vision and Image Processing, Pattern Recognition Letters, volume
15(11), pp. 1119-1125.
40. Qiang Niu and Xinjian Huang (2011): “An improved fuzzy C-means clustering algorithm
based on PSO”, Journal of Software. 6(5), 873-879.
41. Qinghai Bai (2010). “Analysis of Particle Swarm Optimization Algorithm”, Computer and
Information Science (CCSE), 3(1), 180-184.
42. Rand W. M. (1971). Objective Criteria for the Evaluation of Clustering Methods. Journal of
the American Statistical Association, 66:846-850.
43. Rauber (2000): Educational Data Mining: A Survey from 1995 to 1999. Expert Systems with
Applications; 33; 125-146.
44. Shanmugapriya B. and Punithavalli M. (2012): “A Modified Projected K-means Clustering
Algorithm with Effective Distance Measure”, International Journal of Computer Applications
44(8):32-36.
45. Sharfuddin Mahmood, Mohammad Saiedur Rahaman, Dip Nandi, Mashiour Rahmann
(2015): “A Proposed Modification of K-means Algorithm”, IJMECS, 7(6), 37-42.
46. Shinde P. V. and Gunjal B. L. (2012): “Particle Swarm Optimization - Best Feature Selection
method for Face Images”, International Journal of Scientific & Engineering Research, 3(8):
1-5.
47. Siedlecki W. and Sklansky J. (1988). “On Automatic Feature Selection”, Int. J. Patt. Recog.
Art. Intell. 2(2): 197-220.
48. Su M.C. and Chou C.H. (2001). A Modified Version of the K-means Algorithm with a
Distance based on Cluster Symmetry. IEEE Trans. Pattern Anal. Machine Intel. 23 (6), 674–
680.
49. Sun J., W. B. Xu and B. Ye (2006). "Quantum-behaved particle swarm optimization clustering
algorithm," Lecture Notes in Computer Science, Vol.4093, pp. 340-347.
50. Twinkle G., Lofter F. and Arun M. (2014): Survey on various enhanced K-means Algorithms,
International Journal of Advanced Research in Computer and Communication Engineering,
3(2):43-61.
51. Vaishali R. Patel and Rupa G. Mehta (2011). Impact of Outlier Removal and Normalization
Approach in Modified k-Means Clustering Algorithm, IJCSI International Journal of
Computer Science Issues, 8(5):2, pp. 331-336.
52. Xia W., Z. Wu, W. Zhang, and G. Yang (2004). “A New Hybrid Optimization Algorithm for
the Job-Shop Scheduling Problem,” in Proceedings of the American Control Conference
(AAC ’04), pp. 5552–5557, IEEE, Boston, Mass, USA.
53. Xiaoyan Wang and Yanping Bai (2016). A Modified MinMax k-Means Algorithm Based on
PSO. Computational Intelligence and Neuroscience, pp. 1-13,
http://dx.doi.org/10.1155/2016/4606384.
54. Zhang H. and Sun G. (2006): “Feature Selection Using Tabu Search Method,” Pattern
Recognition, 35(3): pp. 701-711.
55. Zhou D., Bousquet O., Lal T.N., Weston J. and Scholkopf B. (2004): Learning with Local and
Global Consistency. In: Thrun, S., Saul, L., Scholkopf B, Eds., Advances in Neural
Information Processing Systems 16. MIT Press, Cambridge, MA, USA, 321-328.