http://www.iaeme.com/IJCET/index.asp 217 [email protected]
International Journal of Computer Engineering & Technology (IJCET) Volume 9, Issue 4, July-Aug 2018, pp. 217-228, Article IJCET_09_04_024
Available online at http://www.iaeme.com/IJCET/issues.asp?JType=IJCET&VType=9&IType=4
Journal Impact Factor (2016): 9.3590 (Calculated by GISI), www.jifactor.com
ISSN Print: 0976-6367 and ISSN Online: 0976-6375
© IAEME Publication
CURE IMPLEMENTATION
Anchal Chauhan and Seema Maitrey
Krishna Institute of Engineering and Technology, Uttar Pradesh, India
ABSTRACT
Data mining is the process of extracting relevant knowledge and interesting
patterns from large amounts of available information. Among the many data
mining techniques is clustering: the unsupervised classification of patterns
(data items, feature vectors and observations) into groups, i.e. clusters.
This paper discusses the CURE hierarchical clustering algorithm and presents
an implementation of it.
Keywords: Data mining, clustering, CURE hierarchical clustering.
Cite this Article: Anchal Chauhan and Seema Maitrey, Cure Implementation.
International Journal of Computer Engineering & Technology, 9(4), 2018, pp. 217-
228.
http://www.iaeme.com/IJCET/issues.asp?JType=IJCET&VType=9&IType=4
1. INTRODUCTION
Data mining is a sorting technique used to extract hidden patterns from voluminous
databases; it is sometimes called KDD (knowledge discovery in databases). The main
goals of mining include fast retrieval of information, identification of hidden and
previously unexplored patterns to reduce the level of complexity, knowledge discovery
from databases, and time saving [1]. Classification is supervised learning: class
labels are defined in advance and incoming data is categorized according to those
labels. Clustering, on the other hand, is unsupervised learning: data is categorized
into different groups according to similarity, and the groups are then labelled [2].
Clustering can be performed through different families of algorithms, such as
partitioning, grid-based, density-based and hierarchical algorithms. Hierarchical
clustering algorithms are categorised as agglomerative and divisive [3], and
agglomerative algorithms are further categorised into CURE, BIRCH, ROCK and
CHAMELEON [4]. This paper focuses on the CURE hierarchical clustering algorithm
and its implementation.
Figure 1 Phases of data mining
2. RELATED WORK
Many researchers have investigated the CURE hierarchical clustering technique in the
past. Some works on the clustering process and the CURE clustering algorithm are
summarised below.

Sudipto Guha et al. [5] proposed the CURE algorithm and demonstrated its efficiency
on large databases. Qian Yuntao, Wang Qi and Shi Qingsong [6] then studied the
relation between CURE's shrinking scheme and its hidden assumption of spherical
cluster shapes. G. Adomavicius et al. [7] proposed a new approach for discovering
clusters in very large amounts of continuously arriving data, using a sampling
technique to cluster the dataset. M. Kaya and R. Alhajj [8] introduced an automated
method for mining fuzzy association rules with the help of a genetic algorithm and
the CURE algorithm. Parthasarathy, Zaki, Ogihara and Dwarkadas [9] worked on
discovering clusters from database updates; they proposed a method based on the
SPADE algorithm for interactive and incremental frequent-sequence mining.
3. CURE CLUSTERING ALGORITHM
CURE is improved hierarchical algorithm. In data mining, clustering is useful to discover
groups and identify interesting distributions in underlying data. Traditionally, clustering
favours the clusters having spherical shape and similar sizes, or weak in presence of outliers.
CURE is one algorithm which is very robust to the outliers, also performs well in identifying
clusters that have non spherical shapes or wide variances in the size. The CURE algorithm
achieves this through each and every cluster by certain fix number of the points which are
generated from selecting the well scattered points and then towards centre of cluster by
specified fraction. Ability of having more than the one representative point in per cluster allows
the CURE algorithm to adjust well in geometry of the non-spherical shape. The shrinking helps
in dampening effects of the outliers. For handling the large databases, CURE clustering
algorithm employs the combination of the random sampling techniques and partitioning [5].
CURE implements a novel hierarchical algorithm that adopts middle ground in between the
centroid based and the approaches based on representative object. Instead of making use of an
single centroid or the object for representing cluster, a fix number of the representative points
in the space are selected. Representative point in the clusters are generated through the
Cure Implementation
http://www.iaeme.com/IJCET/index.asp 219 [email protected]
selection of the well scattered objects and then moving and shrinking them towards cluster
centre by an specified fraction or an shrinking factor. In each of the step, two clusters with the
closest pair of the representative points are selected. Having the more than a single
representative point in per cluster allows the CURE algorithm for adjusting well in geometry
of the non-spherical shapes. Condensing and shrinking of the cluster helps in dampening effects
of the outliers. CURE algorithm is much more robust to the outliers and then helps in
identifying clusters which have non spherical shapes, also wide variances in the size. Because
of this, it scales very well for the large or voluminous databases without sacrificing the
clustering quality. The random sample which is drawn from dataset is firstly partitioned and
then each partition is clustered partially. All the partial clusters are again clustered in second
pass to get the all required clusters. It will confirm quality of the clusters produced from CURE
which is much better than those found from other algorithm. [10]
Figure 2 Overview of CURE Algorithm
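The representative-point selection and shrinking described above can be sketched as follows. This is a minimal illustration in Python, not the authors' implementation; the parameter names (c for the number of representatives, alpha for the shrinking fraction) are our own.

```python
import math

def cure_representatives(points, c=4, alpha=0.3):
    """Pick c well-scattered points from one cluster and shrink them
    towards the cluster centroid by the fraction alpha, as in CURE."""
    dim = len(points[0])
    centroid = [sum(p[d] for p in points) / len(points) for d in range(dim)]

    # First representative: the point farthest from the centroid.
    reps = [max(points, key=lambda p: math.dist(p, centroid))]

    # Greedily add the point farthest from all representatives chosen so
    # far, so the representatives are well scattered over the cluster.
    while len(reps) < min(c, len(points)):
        reps.append(max(points,
                        key=lambda p: min(math.dist(p, r) for r in reps)))

    # Shrink each representative towards the centroid by the fraction
    # alpha; this dampens the effect of outliers among scattered points.
    return [[r[d] + alpha * (centroid[d] - r[d]) for d in range(dim)]
            for r in reps]
```

With alpha = 0 the representatives stay at the scattered points; with alpha = 1 they collapse to the centroid, recovering purely centroid-based behaviour.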
4. PROPOSED WORK
In the field of data mining, it is well known that handling voluminous data can be
difficult. We therefore address this issue: among the many clustering algorithms,
the CURE (Clustering Using REpresentatives) hierarchical algorithm was chosen for
implementation, as it finds clusters in voluminous databases, is very robust to
outliers, and can determine clusters with non-spherical shapes. The CURE algorithm
is implemented as a combination of data collection and data reduction, using the
random sampling and partitioning methods.
Algorithm:
Input:
Table A: main table
Table B: join table
Column A: join column from the main table
Column B: join column from the join table
n: number of clusters
Column C: filter column
Value: filter value
1. Join table A and table B on the equivalence of column A and column B.
2. Calculate the row count of the result set and store it in a variable T (total
number of rows).
3. Start random sampling by calculating the size of each clustered partition as the
total number of rows divided by the number of clusters: size of cluster = T / n.
4. Store this value in a variable s.
5. Select every nth row from the result set, starting from the 0th row, applying
the filter criteria built from the user-supplied filter column and filter value.
6. Create the other partitions by selecting every nth row with starting element i,
where i ranges from 1 to n-1.
7. Analyse all partitions and perform clustering.
8. Merge relevant partitions to obtain knowledge-based, meaningful and relevant
data.
9. Repeat steps 3 to 9 if further clustering is required.
10. End.
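The join, counting and every-nth-row partitioning steps above can be sketched in Python as follows. This is a hypothetical illustration operating on in-memory lists of dict rows, not the authors' database implementation; all function names are our own.

```python
def join_tables(table_a, table_b, col_a, col_b):
    """Step 1: equi-join two lists of dict rows on col_a == col_b."""
    return [{**ra, **rb} for ra in table_a for rb in table_b
            if ra.get(col_a) == rb.get(col_b)]

def partition_result_set(rows, n, filter_col=None, filter_val=None):
    """Steps 2-6: count the rows, apply the optional filter, and split
    the result set into n partitions of every nth row, with the
    starting offset i ranging from 0 to n-1."""
    if filter_col is not None:
        rows = [r for r in rows if r.get(filter_col) == filter_val]
    total = len(rows)                # step 2: T = total number of rows
    size = total // n                # steps 3-4: s = T / n
    partitions = [rows[i::n] for i in range(n)]   # steps 5-6
    return size, partitions
```

Each partition can then be clustered independently (step 7) and the relevant partitions merged (step 8), as the algorithm prescribes.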
5. FLOWCHART
6. STEPS OF ALGORITHM
STEP 1: Install the software on the system by clicking the install application icon.
When installation is complete, a welcome page appears on the screen. Provide the
connection URL for the database, based on the DB name, username and password values.
Figure 4 Initial Database Settings
STEP 2: Next we set up the initial properties of the process, which takes a few
inputs from the user: the main table as table A and, if the Join Dataset checkbox
is selected, the join table as table B, the join column of table A as column A and
the join column of table B as column B. All of these settings are populated using
the connection string provided in the previous step: the system tables are queried
to give a list of all existing tables with their relationship fields, from which we
can specify the initial settings for the process.
STEP 3: In the next step, we can see the count of the result set created, which we
will use when supplying the requested values for the sampling, partitioning and
clustering steps. The result set is stored in a temp table, and the random sampling
inputs are provided: the number of clusters (n), the filter value as Value, and the
filter column as column C. A list of conditionals is also available for filtering
the data further; it contains entries such as EQUAL, IN and BETWEEN. Once all
settings for the random sampling procedure are provided, click Next to check the
partitioning results.
Figure 5 Setup Table Name
Figure 6 Setup Cure Properties
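The filter conditionals mentioned in Step 3 (EQUAL, IN, BETWEEN) could be modelled as predicates along the following lines. This is a hypothetical sketch; the operator table and function names are our assumptions, not the tool's actual code.

```python
# Hypothetical mapping from conditional names to predicate functions.
OPERATORS = {
    "EQUAL":   lambda cell, v: cell == v,
    "IN":      lambda cell, v: cell in v,
    "BETWEEN": lambda cell, v: v[0] <= cell <= v[1],
}

def apply_filter(rows, column, op, value):
    """Keep only the rows whose `column` satisfies the chosen conditional."""
    test = OPERATORS[op]
    return [r for r in rows if test(r.get(column), value)]
```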
STEP 4: In step 4, we can see the different partitions, each holding every nth value
of the result set starting from row number i, where i ranges from 0 to n-1. The
result set is grouped, based on the value of column C provided in the previous
steps, into the different partitions: partition 1, partition 2, partition 3 and so
on. This shows that the clusters are separated from the outliers; the values
filtered by the specified criteria can also be seen (see Figures 7 and 8).
Figure 7 Partition 1 results
Figure 8 Partition 2 results
STEP 5: In this step, after analysing and processing all partitions, we merge them
into a single result table. To perform more clustering on the same database
instance, return to the previous setup page to start a further clustering pass.
Figure 9 Final Result Set
STEP 6: Steps 2 to 6 can be repeated if more clustering is required on the same
database instance. New initial and CURE settings can be provided, and the data
processed according to them, to obtain more relevant and better clustering results.
Figure 10 Setup Table and Join Table Name
Figure 11 Setup Join Clusters and CURE properties
Figure 12 Join Partition 1 Results
Figure 13 Join Partition 2 Results
Figure 14 Join Partition 3 Results
Figure 15 Join Final Results
STEP 7: Once analysis and processing are complete, the process can be finished by
clicking the Finish button on the Merge Results page.
Figure 16 Filtered Result based on parameter
7. CONCLUSION
In this paper we have shown that the CURE clustering algorithm can determine
clusters with non-spherical shapes and wide variance in size. Through its random
sampling and partitioning techniques, CURE provides better execution time than
other algorithms on large databases. The CURE clustering algorithm also works very
well when the data contain outliers: in CURE hierarchical clustering, all outliers
are detected first and then eliminated. Each level and step is important for
achieving efficiency, scalability and improved concurrency. It can therefore be
concluded that the CURE algorithm is well suited to handling voluminous data.
8. FUTURE SCOPE
In future, parallel programming can be introduced into the CURE algorithm, through
which results can be obtained with greater accuracy in much less time. In the CURE
algorithm, the result set is broken into different partitions during random
sampling; as an enhancement to the CURE hierarchical algorithm, these partitions
could be processed in a parallel thread environment. In this way the performance of
the CURE algorithm can be improved, making it more efficient than other
hierarchical algorithms.
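The proposed enhancement of processing the partitions in parallel threads could be sketched as follows. This is a hypothetical illustration using Python's thread pool, with a placeholder per-partition clustering function standing in for the real clustering pass.

```python
from concurrent.futures import ThreadPoolExecutor

def cluster_partition(partition):
    """Placeholder for the per-partition clustering pass; here it simply
    returns the partition's size as a stand-in result."""
    return len(partition)

def cluster_partitions_parallel(partitions):
    """Cluster each random-sampling partition in its own worker thread,
    then collect the partial results for the merge step."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return list(pool.map(cluster_partition, partitions))
```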
REFERENCES
[1] Smita and Priti Sharma, "Use of Data Mining in Various Fields: A Survey Paper",
May-June 2014.
[2] Megha Mandloi, "A Survey on Clustering Algorithms and K-Means", July 2014.
[3] G. Thilagavathi, D. Srivaishnavi and N. Aparna, "A Survey on Efficient
Hierarchical Algorithms used in Clustering", IJERT, 2013.
[4] Marjan Kuchaki Rafsanjani, Zahra Asghari Varzaneh and Nasibeh Emami Chukanlo,
"A Survey of Hierarchical Clustering Algorithms", The Journal of Mathematics and
Computer Science, 2012.
[5] Sudipto Guha, Rajeev Rastogi and Kyuseok Shim, "CURE: An Efficient Clustering
Algorithm for Large Databases", in Proc. of the 1998 ACM SIGMOD Intl. Conf. on
Management of Data, 1998, pp. 73-84.
[6] Qian Yuntao, Shi Qingsong and Wang Qi, "CURENS: A Hierarchical Clustering
Algorithm with New Shrinking Scheme", ICMLC 2002, Beijing, Nov. 4-5, 2002,
pp. 895-899.
[7] G. Adomavicius, J. Bockstedt and V. Parimi, "Scalable Temporal Clustering for
Massive Multidimensional Data Streams", Proceedings of the 18th Workshop on
Information Technology and Systems (WITS'08), Paris, France, December 2008.
[8] M. Kaya and R. Alhajj, "Genetic Algorithm Based Framework for Mining Fuzzy
Association Rules", Fuzzy Sets and Systems, 152(3), 2005, pp. 587-601.
[9] Srinivasan Parthasarathy, Mohammed J. Zaki, Mitsunori Ogihara and Sandhya
Dwarkadas, "Incremental and Interactive Sequence Mining", Proc. 8th ACM
International Conference on Information and Knowledge Management, Nov. 1999.
[10] Seema Maitrey, C. K. Jha, Rajat Gupta and Jaiveer Singh, "Enhancement of CURE
Clustering Technique in Data Mining", International Journal of Computer
Applications, Foundation of Computer Science, New York, USA, April 2012.