Knowledge Discovery and Data Mining In Agricultural Database
Using Machine Learning Techniques
PhD synopsis
For the degree of
Doctor of Philosophy
in Computer IT Engineering
Submitted by:
Bhagirath Parshuram Prajapati
(Enrol. No.: 119997107001, Batch: 2011)
Supervisor
Dr. Dhaval Kathiriya,
Dean & Principal, AIT,
Anand Agriculture University.
DPC Members:
Dr. Apurva M. Shah,
Associate Professor,
MSU.
Dr. Ramji Makwana,
CEO and Founder,
AI eSmart Solutions
Submitted to
Gujarat Technological University
1. Abstract
Studies from the 1990s show that most people preferred information obtained from other people over automated information retrieval systems. The Pew Internet Survey (Fallows 2004), however, found that "the Internet is a good place to go for getting everyday information"; this survey indicates that in the twenty-first century people prefer, and are satisfied with, information retrieval systems and the World Wide Web (WWW). Today, digitization reaches every field, and data grow rapidly in gargantuan amounts as they are continuously gathered by numerous affordable information-sensing devices. One domain of interest in which newer datasets are being generated is agriculture. From the beginning of human history to the present day, we human beings have depended on agriculture for our daily nourishment needs, including food and milk. In addition, the Industrial Revolution was possible because of raw material provided by agriculture. Nowadays, various agricultural data are being collected and stored in computer systems, and data mining concepts can be applied to the agriculture sector to analyse these data sets. The question is: can we automate an information retrieval system for the agriculture domain based on data mining? Machine learning, a field of data mining, provides the necessary techniques to solve this problem. This research evaluates classification techniques and applies them to agricultural soil data sets to discover meaningful relationships that can be used for decision making about quality crop production on a massive scale. The soil health card database consists of the macro- and micronutrient records of soil samples taken from farm fields and tested in a soil laboratory. In this research we concentrate on the k-Nearest Neighbor classification algorithm to classify soil sample instances into the appropriate fertilizer-deficiency categories. Although k-Nearest Neighbor classification is simple and effective, it has large computational and storage requirements; in addition, its classification effectiveness decreases when the training data are unevenly distributed. We present novel Fast k-Nearest Neighbor, Training Set Reduction k-Nearest Neighbour and Hybrid k-Nearest Neighbour classification methods that decrease the time and space requirements. We have applied these new approaches to the soil health card agriculture data set, and our evaluation illustrates that they solve the mentioned problems effectively. We discuss a comparative analysis of the methods to identify the best one.
2. State of the art of the research topic
Background Study:
Data mining can be approached in many ways; machine learning is one of the most widely used. Machine learning is a domain focused on developing algorithms that allow computers to learn to resolve problems based on past records [1]. Data mining is the science of discovering knowledge from databases. A database contains a collection of instances (records or cases), and each instance used by machine learning and data mining algorithms is formatted using the same set of fields (features, attributes, inputs, or variables). When the instances contain the correct output (class label), the learning process is called supervised learning [2]. The other machine learning approach, clustering, works without knowing the class labels of the instances and is called unsupervised learning [3]. The focus of this research is on classification and clustering for the agricultural soil health card database.
k-Nearest Neighbors algorithm:
The k-Nearest Neighbor algorithm (k-NN) is a simple, instance-based machine learning algorithm [5], [6]. k-NN finds the k training instances closest to a query instance; the classification decision is made by identifying the most frequent class label among the training instances at minimum distance from the query instance [5]. The distance is determined by a distance metric such as Euclidean, Cosine or Chebyshev [4].
The k-Nearest Neighbor classification algorithm is a classical, well-known method in machine learning [6], [7]. It is well established in the area of pattern recognition, and a great deal of research has been done on k-NN [8], [9], [10], for example in remote sensing [11], [12], image processing [13], [14] and so on. Raymer et al. [15] applied k-NN in combination with a genetic algorithm to medical data sets for knowledge discovery. Frigui et al. [16] used a k-NN classifier to perform detection of land mines, adopting a possibilistic k-NN classifier. Yang et al. [9] adopted the local-mean-based nearest neighbour algorithm to perform discriminant analysis. Li et al. [12] applied the k-NN classifier to the classification of hyperspectral images. Bosch et al. [8] adopted a k-NN classifier for scene classification. Xu et al. [10] performed k-local hyperplane nearest neighbor classification for feature extraction and classification. Mensink et al. [13] applied the k-NN classifier in image classification. Maji [17] applied the k-NN classifier to microarray data classification. To reduce dimensionality in pattern classification, Geng et al. [18] combined k-NN with a dimensionality reduction technique. Z. Pan et al. [19] applied k-NN based on mutual neighborhood information. Lu et al. [20] proposed a hybrid feature selection algorithm for gene expression data classification. Z. Pan et al. [21] adopted a new k-harmonic nearest neighbor classifier for gene data classification.
The simple k-NN classification algorithm is as follows [22]:
Step 1: Add each sample to the training_list.
Step 2: For a given unlabeled sample si, select the k nearest neighbors of si in training_list.
Step 3: Return as the class label for si the label that occurs the maximum number of times among the top k training_list records.
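The three steps above can be sketched in Python as a minimal function; the Euclidean distance metric and the (feature_vector, label) record layout are assumptions made for illustration, not the implementation used in this work.

```python
import math
from collections import Counter

def knn_classify(query, training_list, k):
    """Classify `query` by majority vote among its k nearest neighbours.

    `training_list` holds (feature_vector, label) pairs; Euclidean
    distance plays the role of the distance metric (Step 2)."""
    ranked = sorted(training_list, key=lambda rec: math.dist(query, rec[0]))
    top_k_labels = [label for _, label in ranked[:k]]
    # Step 3: the most frequent label among the top k records wins.
    return Counter(top_k_labels).most_common(1)[0][0]
```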
The advantages of using k-NN:
No need to retrain the model when additional instances are added [23].
Simplicity and flexibility in use [23].
Weights can be assigned to significant features [24].
Accelerating k-NN:
In machine learning classification, k-NN is a powerful and widely used nonparametric technique. However, an exhaustive k-NN search requires a lot of computational resources when the training data set is large, in which case plain k-NN is not preferable [25], [26]. Accelerating the k-NN search has been an active area of research for decades.
Approaches to speeding up the k-NN search are mainly divided into two categories: template condensation and template reorganization [29]. Template condensation identifies redundant patterns in the template set and removes them [25], [26], [27], while template reorganization algorithms restructure the templates [30], [31], [32], [33]. A lot of work has been done to find new approaches, including methods in which classification performance is not affected while the storage and computation costs are reduced [22].
In some methods, representative samples are selected from the training set and the remaining samples are deleted to reduce the training sample set. In text categorization research [34], the training set is reduced based on density: the text density is calculated and, if it is found to be bigger than the average density, some samples are removed from the training set.
Some research has examined the factors affecting k-NN performance: the best k value, the training sample size, and so on. Majumdar and Ward [35] combined the k-NN classifier with the random projection technique. Ghosh et al. [36] estimated the optimal value of k in k-NN. Hu et al. [37] applied sample weight learning to the nearest neighbor classifier. Domeniconi et al. [30] studied large margin nearest neighbor classifiers theoretically. Parthasarathy and Chatterji [29] explored the use of k-NN when the sample size is small. Some researchers have related properties of the data points, such as class centers and hyperplane data points, to the nearest neighbor relationships. Gao et al. [40] designed a center-based nearest neighbor classifier. Li et al. [41] used the local probabilistic centers of each class in the classification process. Vincent et al. [42] applied the k-local hyperplane nearest neighbor technique.
Other research work has explored the efficiency of the k-NN classifier. Hernández-Rodríguez et al. [43] proposed an approximate fast k-most-similar-neighbor classifier based on a tree structure and checked the efficiency of the k-NN classifier. Zhang and Srihari [44] explored cluster-based tree algorithms for fast k-NN classification. Ghosh et al. [45] explored the visualization and aggregation of nearest neighbor classifiers. Some research work has explored distance metrics. Derrac et al. [46] proposed a method to improve the performance of the k-NN classifier based on cooperative coevolution. Triguero et al. [47] adopted differential evolution to optimize the positioning of the prototypes to address the limitations of the nearest neighbor classifier. Weinberger et al. [35] investigated distance metric learning to obtain a large margin for the k-NN classifier.
3. Definition of the Problem
To explore the applicability of machine learning techniques to the agricultural soil health card dataset, and to propose improved, efficient machine learning algorithms to classify soil samples into categories of micro- and macronutrient deficiencies.
4. Objective and Scope of work
Objectives:
To study and analyze the agricultural soil health card database, data mining on the soil health card database, and the applicability of machine learning to it.
To design the concept for carrying out machine learning, specifically classification algorithms, on the soil health card database.
To identify concepts for improving the time and space complexity of the classification algorithm that classifies soil samples into their respective nutrient-deficiency categories.
To measure the performance of the proposed algorithms based on accuracy, precision, recall and F1 measure.
To design and develop a software prototype to prove the above concepts.
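The performance measures named in the objectives can all be derived from the counts of a binary confusion matrix. A minimal sketch, assuming the usual true/false positive/negative counts:

```python
def classification_measures(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)          # of predicted positives, how many are right
    recall = tp / (tp + fn)             # of actual positives, how many are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1
```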
Scope:
In this section we specify the scope of the research presented in this work as follows:
Based on the nature of this research, a sub-dataset is abstracted from the agricultural soil health card database, consisting of the micro and macro nutrients for individual farms of selected districts of Gujarat; this data set is preprocessed for applying machine learning techniques.
The classification algorithm k-Nearest Neighbor is applied to the soil health card data set to classify each soil sample into the categories of nutrient deficiencies.
The primary limitation of k-Nearest Neighbor is that it retains all the training data and is therefore prone to high computational cost; hence this research proposes novel k-Nearest Neighbour algorithms, namely Fast k-Nearest Neighbor, Training set reduction k-Nearest Neighbor and Hybrid k-Nearest Neighbor, which are evaluated comparatively.
5. Original contribution by the thesis
This research is distinctive in its application to the agricultural domain, encompassing the machine learning k-Nearest Neighbor algorithm and its improvements. The agricultural soil health card database is used, from which the macro and micro nutrients are abstracted for each farm for classification. This research work is original, as no such work has been carried out on the Soil Health Card Database of the state of Gujarat. Though the k-Nearest Neighbor algorithm is effective for classification, it suffers from large storage and computational requirements. In this research we present novel Fast k-Nearest Neighbor, Training Set Reduction k-Nearest Neighbour and Hybrid k-Nearest Neighbour classification methods that decrease the time and space requirements. The original contribution is also reflected in the research papers listed at the end.
6. Methodology of Research, Results / Comparisons
Methodology of Research
Research methodology encompasses the systematic representation of the methods applied to carry out the study, with emphasis on the theoretical illustration of the methods and principles related to the branch of knowledge.
To solve the aforementioned research problem, the main design research phases applied in this work are as follows.
Problem awareness
In the agricultural soil health card database, the scope for machine learning application is to classify the soil samples into categories of nutrient deficiencies. This step also identifies the lack of a general framework that can be used to extract knowledge in the agricultural domain.
Literature review
This step studies various machine learning techniques and their applications in various fields in detail. It identifies the k-nearest neighbor classifier for this research work, and also covers the selection of the dataset extracted from the soil health card database. The main purpose of this phase is to identify research directions for improving the existing k-nearest neighbor algorithm for the aforesaid classification task on the agricultural soil data set.
Implementation
The work is implemented in different steps as given below,
o A collection of data and generation of the data set.
o Pre-processing of data.
o Application of classification technique k-Nearest Neighbour.
o Application of classification technique Fast k-Nearest Neighbour.
o Application of classification technique training set reduction k-Nearest
Neighbour.
o Application of classification technique Hybrid k-Nearest Neighbour.
Evaluation
The purpose of the evaluation phase is to check whether the implemented classification methods meet expectations. Hence the classification methods are compared in terms of various measures.
Conclusion
It is the final phase of the research cycle and covers the impact of the research and the contributions of the researchers. In this work, the scope and applicability of machine learning to the agricultural soil health card data set are evaluated with respect to k-Nearest Neighbor classifiers and their adaptation to the agricultural domain.
Proposed work:
Collection of data and generation of the data set
This research work concentrates on exploring the applicability of machine learning techniques to the agricultural soil health card dataset and on proposing improved, efficient machine learning algorithms to classify soil samples into categories of micro- and macronutrient deficiencies. Hence, the agricultural soil dataset is collected from the Soil Health Card Database (SHCDB), which is available with Anand Agriculture University. The SHCDB for Gujarat state consists of soil health card data for individual farms of all districts of Gujarat. Each district of Gujarat has its respective database table with a total of 49 different attributes, ranging from the identification of the soil sample to the soil characteristics of the particular land. The SHCDB is maintained in an MS SQL DBMS. For our research purpose we have been provided the SHCDB of six districts, namely Kutch, Rajkot, Banaskantha, Vadodara, Anand, and Surat. As mentioned previously, there are 49 attributes in the SHCDB; of these we are concerned with the macro and micro nutrients, so four macronutrients and four micronutrients are abstracted from the SHCDB of each district and a label is assigned to each sample indicating its deficiencies.
Table 1. Soil health card data set: soil fertilizer treatments based on deficiency
• Macro soil nutrient parameters
• Potassium (K) <= 150 ppm: the soil needs potassium fertilizer treatment.
• Sulfur (S) <= 20 ppm: the soil needs sulfur fertilizer treatment.
• Magnesium (Mg) <= 2 ppm: the soil needs magnesium fertilizer treatment.
• Phosphorus (P) <= 20 ppm: the soil needs phosphorus fertilizer treatment.
• Micro soil nutrient parameters
• Iron (Fe) <= 10 ppm: the soil requires ferrous sulfate.
• Manganese (Mn) <= 10 ppm: the soil requires manganese sulfate.
• Zinc (Zn) <= 1 ppm: the soil requires zinc sulfate.
• Copper (Cu) <= 0.4 ppm: the soil requires copper sulfate.
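The threshold rules of Table 1 can be expressed as a simple labeling function. The dictionary layout below is an assumption made for illustration and does not reflect the actual SHCDB schema; the thresholds are those listed above.

```python
# Deficiency thresholds in ppm, taken from the rules listed in Table 1.
THRESHOLDS = {
    "K": 150, "S": 20, "Mg": 2, "P": 20,     # macronutrients
    "Fe": 10, "Mn": 10, "Zn": 1, "Cu": 0.4,  # micronutrients
}

def deficiency_labels(sample):
    """Return the nutrients whose measured content is at or below the threshold.

    `sample` maps nutrient symbols to measured ppm values; this layout is
    a hypothetical stand-in for a SHCDB record, used only for illustration."""
    return sorted(n for n, limit in THRESHOLDS.items()
                  if sample.get(n, float("inf")) <= limit)
```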
Application of classification technique k-Nearest Neighbor
We first applied the k-Nearest Neighbor algorithm to the data set of Kutch district, which has 14000 samples of soil parameters from the SHCDB, and calculated the accuracy, precision, recall, F1 measure and classification time in milliseconds.
Algorithm 1: k-Nearest Neighbour (k-NN) classifier
Input: A set of agriculture records R = {R1, R2, ..., Rn}, where n is the total number of agriculture records; training record set D.
Procedure:
Step 1: Divide the records into a training set and a test set with a 50-50 split.
Step 2: For each test record, calculate the similarity with each training record.
Step 3: Sort the training records in descending order of cosine similarity and select the top k training records.
Step 4: Assign to the test record the class that occurs the maximum number of times in the top k training records.
Step 5: Construct a confusion matrix.
Step 6: Calculate the performance measures from the confusion matrix.
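Steps 2-4 of Algorithm 1 can be sketched as follows. The cosine similarity metric follows the algorithm, while the (feature_vector, label) pair layout, e.g. an eight-dimensional nutrient vector per soil sample, is an assumption made for illustration.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def knn_predict(test_record, training_set, k):
    """Steps 2-4 of Algorithm 1: rank training records by cosine similarity
    in descending order and take a majority vote among the top k.
    `training_set` holds (feature_vector, label) pairs."""
    ranked = sorted(training_set,
                    key=lambda rec: cosine_similarity(test_record, rec[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```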
Sr. No   k    Accuracy   Precision   Recall   F1 measure   Classification time (ms)
1        31   90.21      76.28       55.03    63.93        5766
2        33   88.85      92.22       67.92    78.23        5779
3        35   90.41      85.94       55.88    67.72        5777
4        37   90         88.28       60.97    72.13        5746
Table 2. Results of the k-NN classifier
Results of the k-Nearest Neighbor classifier applied to the agricultural soil data set (SHCDB) are shown in Table 2. For different values of k, accuracy, precision, recall, F1 measure and classification time are measured.
Application of classification technique Fast k-Nearest Neighbor (F-kNN)
The primary limitation of the simple k-NN algorithm is that it retains all the training data and is prone to high computational cost. To reduce the computational cost of simple k-NN, we proposed and designed a fast k-Nearest Neighbor algorithm. The fast k-NN (F-kNN) classifier first finds k clusters using the k-Means clustering algorithm. The class label of each cluster is the class with the maximum number of records in that cluster [48]. For each test record, we calculate the similarity with each cluster and assign a class based on the k-NN approach. The fast k-Nearest Neighbor algorithm is described below.
Algorithm 2: Fast k-Nearest Neighbour algorithm (F-kNN)
Input: A set of agriculture records R = {R1, R2, ..., Rn}, where n is the total number of agriculture records; training record set D.
Procedure:
Step 1: Divide the records into a training set and a test set with a 50-50 split.
Step 2: Construct k clusters using the kMeans clustering algorithm (validating the k value for kMeans by the Elbow method or the Silhouette method) and assign a class label to each cluster based on the maximum occurrences of a particular class in that cluster.
Step 3: For each test record, calculate the similarity with each cluster's centroid.
Step 4: Sort the clusters in descending order of cosine similarity and select the top k clusters.
Step 5: Assign to the test record the class whose summed similarity over the top k clusters is maximum.
Step 6: Construct a confusion matrix.
Step 7: Calculate the performance measures from the confusion matrix.
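Steps 3-5 of Algorithm 2 can be sketched as below, assuming the clusters have already been produced by a kMeans run and summarised as (centroid, majority label) pairs; comparing test records against a handful of centroids instead of the full training set is what makes F-kNN fast.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def fknn_predict(test_record, clusters, k):
    """Steps 3-5 of Algorithm 2. `clusters` is a list of (centroid, label)
    pairs from a kMeans run, standing in for the full training set.
    Rank clusters by similarity to the test record, then pick the label
    whose summed similarity over the top k clusters is largest."""
    ranked = sorted(clusters,
                    key=lambda c: cosine_similarity(test_record, c[0]),
                    reverse=True)
    scores = {}
    for centroid, label in ranked[:k]:
        scores[label] = scores.get(label, 0.0) + cosine_similarity(test_record, centroid)
    return max(scores, key=scores.get)
```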
In Step 2 of the above algorithm we apply cluster validation techniques, because in cluster analysis there is always the question of how to evaluate the goodness of the clusters [49]. For kMeans clustering, it is desirable to cluster with an optimal k value, to avoid finding patterns in noise and to allow comparison with other clustering algorithms. In this research work we considered two methods of cluster validation; the first computes the sum of squared errors (SSE) [50], [51] as below,
SSE = Σ_{i=1}^{K} Σ_{x ∈ c_i} dist(x, c_i)²
The SSE is defined as the sum of the squared distances between each member of a cluster and its centroid. It measures cohesion [52], that is, how closely related the objects in a cluster are.
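A minimal sketch of the SSE computation, assuming each cluster is represented by its centroid and the list of points assigned to it; in the Elbow method this value is computed for increasing k, and the k at which the SSE stops dropping sharply is chosen.

```python
import math

def sse(clusters):
    """Sum of squared distances from each point to its cluster centroid,
    i.e. the cohesion measure SSE defined above. `clusters` maps each
    centroid (a tuple) to the list of points assigned to it."""
    return sum(
        math.dist(point, centroid) ** 2
        for centroid, points in clusters.items()
        for point in points
    )
```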
The second method of cluster validation computes the silhouette value [53] as below. For an instance i:
a(i) = average distance of i to the points in its own cluster,
b(i) = minimum, over the other clusters, of the average distance of i to the points in that cluster,
s(i) = (b(i) − a(i)) / max{a(i), b(i)}
The silhouette is an interpretation and validation of the consistency of data within a cluster; it combines the ideas of both cohesion and separation [54]. Cluster separation measures how distinct or well separated a cluster is from the other clusters. The silhouette value lies between −1 and 1, and the closer it is to 1 the better.
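The silhouette value for a single point can be sketched as below; here `own_cluster` holds the other members of the point's cluster and `other_clusters` the remaining clusters, a data-layout assumption made for illustration.

```python
import math

def silhouette(point, own_cluster, other_clusters):
    """Silhouette value s(i) for one point, following the definition above:
    a = mean distance to the other members of the point's own cluster,
    b = smallest mean distance to any other cluster."""
    a = sum(math.dist(point, q) for q in own_cluster) / len(own_cluster)
    b = min(
        sum(math.dist(point, q) for q in cluster) / len(cluster)
        for cluster in other_clusters
    )
    return (b - a) / max(a, b)
```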
Sr. No   k    Accuracy   Precision   Recall   F1 measure   Classification time (ms)   Training instances (kMeans, SSE)
1        31   88.64      88.57       94.91    91.64        199                        141
2        33   87.92      87.96       94.91    89.51        191                        131
3        35   88.85      88.79       94.91    91.75        217                        131
4        37   88.78      88.72       92.91    90.77        233                        181
Table 3. Results of the Fast k-NN classifier, k for kMeans chosen by SSE
Results of the Fast k-Nearest Neighbor classifier applied to the agricultural soil data set (SHCDB) are shown in Tables 3 and 4. For different values of k, accuracy, precision, recall, F1 measure and classification time are measured. The optimal value of k in kMeans is decided by the SSE and the silhouette value in Tables 3 and 4 respectively.
Sr. No   k    Accuracy   Precision   Recall   F1 measure   Classification time (ms)   Training instances (kMeans, silhouette)
1        31   88.71      88.65       90.91    89.77        240                        201
2        33   89.78      89.72       93.92    91.77        255                        201
3        35   90.35      90.29       92.92    91.58        248                        191
4        37   89.28      89.22       92.92    91.03        240                        191
Table 4. Results of the Fast k-NN classifier, k for kMeans chosen by silhouette value
Application of classification technique Training set reduction k-Nearest Neighbor
In Fast k-NN the computational cost is significantly reduced compared to k-NN, but it still retains all the training instances in the clusters and therefore has a high storage requirement. We have proposed and designed the Training set reduction k-NN (TSR-kNN) algorithm, which reduces the training set significantly and so has the advantage of low computational and storage cost.
Figure 1. Classification using the Training set reduction k-NN algorithm [56]
In the first phase of this approach, the training set is converted into a set of training vectors [55]. The training vectors are given as input to the training set reduction algorithm, whose output is the reduced training set. In the second phase, the reduced training set is employed by the classifier to classify a new test instance. We have applied the shrink (subtractive) algorithm [56] to reduce the training set.
Algorithm 3: Training set reduction k-NN (TSR-kNN)
Phase I: Shrink (subtractive) algorithm
Input: A set of training instances T = {T1, T2, ..., Tn}, where n is the total number of agriculture records; training record set D.
Step 1: Assign all the training instances to S.
Step 2: Select an instance P from S at random.
Step 3: Classify the instance P using the remaining instances of S.
Step 4: Remove the instance P from S if it is correctly classified.
Phase II:
Input: A set of agriculture records R = {R1, R2, ..., Rn}, where n is the total number of agriculture records; reduced training record set D.
Procedure:
Step 1: Divide the records into a training set and a test set with a 50-50 split.
Step 2: For each test record, calculate the similarity with each record of the reduced training set.
Step 3: Sort the training records in descending order of cosine similarity and select the top k training records.
Step 4: Assign to the test record the class that occurs the maximum number of times in the top k training records.
Step 5: Construct a confusion matrix.
Step 6: Calculate the performance measures from the confusion matrix.
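Phase I of Algorithm 3, the shrink (subtractive) pass, can be sketched as below. The Euclidean distance and the fixed random seed are illustrative assumptions; the surviving instances form the reduced training set.

```python
import math
import random
from collections import Counter

def knn_label(query, records, k):
    """Majority label among the k records nearest to `query` (Euclidean)."""
    ranked = sorted(records, key=lambda r: math.dist(query, r[0]))
    return Counter(lbl for _, lbl in ranked[:k]).most_common(1)[0][0]

def shrink(training_set, k=3, seed=0):
    """Shrink (subtractive) pass: visit the training instances in random
    order and discard each one that the remaining kept instances already
    classify correctly. The survivors form the reduced training set."""
    rng = random.Random(seed)
    keep = [True] * len(training_set)
    order = list(range(len(training_set)))
    rng.shuffle(order)
    for i in order:
        rest = [training_set[j] for j in range(len(training_set))
                if keep[j] and j != i]
        if len(rest) >= k and knn_label(training_set[i][0], rest, k) == training_set[i][1]:
            keep[i] = False
    return [rec for j, rec in enumerate(training_set) if keep[j]]
```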
Sr. No   k    Accuracy   Precision   Recall   F1 measure   Classification time (ms)   Reduced training set
1        31   89.71      81.81       52.01    63.59        1021                       3005
2        33   90.21      86.27       68.03    76.07        1017                       2855
3        35   90.5       84.73       59.59    69.97        1039                       2970
4        37   89         79.69       53.16    63.78        1343                       3040
Table 5. Results of the Training set reduction k-NN classifier
Results of the Training set reduction k-Nearest Neighbor classifier applied to the agricultural soil data set (SHCDB) are shown in Table 5. For different values of k, accuracy, precision, recall, F1 measure and classification time are measured. The reduced training set consists of the samples remaining from the whole training set after the shrink (subtractive) algorithm has been performed on it.
Application of classification technique Training set reduction Fast k-Nearest Neighbor (TSR-FkNN)
This is a hybrid method in which we have combined the features of both fast k-NN and training set reduction.
Fig 2. An overview of the hybrid machine learning technique TSR-FkNN.
Figure 2 gives an overview of the hybrid of the previous two approaches. In the hybrid method, the training set reduction technique is applied to the training set feature vectors, reducing the training set. The reduced training set is given as input to the clustering algorithm, and the resulting set of clusters is given as input to the machine learning algorithm, which learns the classifier model. The classifier model assigns a class to each new test instance.
Algorithm 4: Training set reduction fast k-Nearest Neighbour (TSR-FkNN)
Phase I:
Input: A set of training instances T = {T1, T2, ..., Tn}, where n is the total number of agriculture records; training record set D.
Step 1: Assign all the training instances to S.
Step 2: Select an instance P from S at random.
Step 3: Classify the instance P using the remaining instances of S.
Step 4: Remove the instance P from S if it is correctly classified.
Phase II:
Input: A set of agriculture records R = {R1, R2, ..., Rn}, where n is the total number of agriculture records; reduced training record set D.
Procedure:
Step 1: Divide the records into a training set and a test set with a 50-50 split.
Step 2: Construct k clusters using the kMeans clustering algorithm (validating the k value for kMeans by the Elbow method or the Silhouette method) and assign a class label to each cluster based on the maximum occurrences of a particular class in that cluster.
Step 3: For each test record, calculate the similarity with each cluster's centroid.
Step 4: Sort the clusters in descending order of cosine similarity and select the top k clusters.
Step 5: Assign to the test record the class whose summed similarity over the top k clusters is maximum.
Step 6: Construct a confusion matrix.
Step 7: Calculate the performance measures from the confusion matrix.
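Step 2 of Phase II, which bridges the reduction and clustering stages, can be sketched as below: each cluster of labelled records is collapsed into a (centroid, majority label) pair for the F-kNN stage to use in place of the raw training data. The list-of-record-lists input layout is an assumption made for illustration.

```python
from collections import Counter

def labelled_centroids(clusters):
    """Turn each cluster of (vector, label) records into a
    (centroid, majority-label) pair, as in Step 2 of Algorithm 4.
    `clusters` is a list of record lists, e.g. the output of a kMeans
    run over the shrunk training set."""
    result = []
    for records in clusters:
        dim = len(records[0][0])
        # Component-wise mean of the member vectors.
        centroid = tuple(sum(vec[d] for vec, _ in records) / len(records)
                         for d in range(dim))
        # The cluster's label is the most frequent label among its members.
        majority = Counter(lbl for _, lbl in records).most_common(1)[0][0]
        result.append((centroid, majority))
    return result
```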
Sr. No   k    Accuracy   Precision   Recall   F1 measure   Classification time (ms)   Training instances (kMeans, SSE, after reduction)
1        31   88.92      87.86       93.91    90.79        248                        141
2        33   90.42      89.36       91.92    90.62        143                        61
3        35   90.85      89.79       94.92    92.28        180                        71
4        37   88.85      87.79       92.91    90.28        217                        111
Table 6. Results of the Training set reduction fast k-NN classifier, k for kMeans chosen by SSE
Results of the Training set reduction Fast k-Nearest Neighbor classifier applied to the agricultural soil data set (SHCDB) are shown in Tables 6 and 7, in which accuracy, precision, recall, F1 measure and classification time are measured for different values of k. The optimal value of k in kMeans is decided by the SSE and the silhouette value in Tables 6 and 7 respectively.
Sr. No   k    Accuracy   Precision   Recall   F1 measure   Classification time (ms)   Training instances (kMeans, silhouette, after reduction)
1        31   92.64      92.57       93.91    93.24        261                        131
2        33   92.14      95.07       92.91    93.98        219                        151
3        35   93.85      95.79       93.92    94.84        192                        131
4        37   91.85      94.79       92.91    93.84        217                        141
Table 7. Results of the Training set reduction fast k-NN classifier, k for kMeans chosen by silhouette value
Comparisons of results:
In this section, comparisons between the different proposed classification techniques are carried out in terms of performance measures and classification time in milliseconds.
The experiments were performed on a computer with an Intel i5 processor and 4 GB of RAM; the software IDE is NetBeans 8.2. Depending on the hardware, some of the results may vary. The observed results are averages over multiple runs.
Accuracy comparison:
Sr. No   k    k-NN    F-kNN (SSE)   F-kNN (silhouette)   TSR-kNN   TSR-FkNN (SSE)   TSR-FkNN (silhouette)
1        31   90.21   88.64         88.71                88.21     88.92            92.64
2        33   88.85   87.92         89.78                89.71     90.42            92.14
3        35   90.41   88.85         90.35                89.57     90.85            93.85
4        37   90      88.78         89.28                89.07     88.85            91.85
Table 8. Comparison of accuracy for all k-NN classifiers
In Table 8, the accuracy of the different k-NN classifiers is compared. Williams et al. [57] adopted accuracy as a measure to compare five machine learning algorithms, preferring the algorithm with the highest accuracy. In our research work it is found that the proposed TSR-FkNN (applying silhouette) classifier has the highest accuracy; all other classifiers have lower accuracy.
Precision comparison:
Sr. No   k    k-NN    F-kNN (SSE)   F-kNN (silhouette)   TSR-kNN   TSR-FkNN (SSE)   TSR-FkNN (silhouette)
1        31   76.28   88.57         88.65                81.81     87.86            92.57
2        33   92.22   87.96         89.72                86.27     89.36            95.07
3        35   85.94   88.79         90.29                84.73     89.79            95.79
4        37   88.28   88.72         89.22                79.69     87.79            94.79
Table 9. Comparison of precision for all k-NN classifiers
In Table 9, the precision of the different k-NN classifiers is compared. Vafeiadis et al. [59] applied machine learning algorithms to a customer churn data set and proposed precision as one of the comparison measures for machine learning algorithms. In our research it is found that TSR-FkNN (applying silhouette) has the highest precision of all the algorithms.
Recall comparison:
Sr. No   k    k-NN    F-kNN (SSE)   F-kNN (silhouette)   TSR-kNN   TSR-FkNN (SSE)   TSR-FkNN (silhouette)
1        31   55.03   90.91         88.91                52.01     93.91            93.91
2        33   67.92   91.91         91.92                68.03     91.92            92.91
3        35   55.88   91.91         90.92                59.59     92.91            93.92
4        37   60.97   90.91         90.92                53.16     92.91            92.91
Table 10. Comparison of recall for all k-NN classifiers
In Table 10, the recall of the different k-NN classifiers is compared. Patel et al. [60] proposed recall as one of the measures for comparing machine learning classifiers in share price prediction. In our research it is found that the recall of the TSR-FkNN (applying silhouette) classifier is the highest; all other classifiers have lower recall.
F1 measure comparison:
Sr. No   k    k-NN    F-kNN (SSE)   F-kNN (silhouette)   TSR-kNN   TSR-FkNN (SSE)   TSR-FkNN (silhouette)
1        31   63.93   90.68         88.77                63.59     90.79            93.24
2        33   78.23   90.36         90.80                76.07     90.62            93.98
3        35   67.72   90.8          90.60                69.97     92.28            94.84
4        37   72.13   89.8          90.06                63.78     90.28            93.84
Table 11. Comparison of F1 measure for all k-NN classifiers
Table 11 compares the F1 measure of the k-NN classifiers. Kanj et al. [61] adopted F1 as the measure for comparing training-data-editing machine learning algorithms. In our research it is found that TSR-FkNN (applying silhouette) has the highest F1 measure; every other classifier has a lower F1 value.
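The F1 values in Table 11 follow from the precision and recall tables via the harmonic mean. A quick sanity check in Python (an illustration, not the thesis code), using the k = 35 entries for TSR-FkNN (applying silhouette) from Tables 9 and 10:

```python
def f1(precision, recall):
    """F1 measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision 95.79 (Table 9) and recall 93.92 (Table 10) for
# TSR-FkNN (applying silhouette) at k = 35
print(round(f1(95.79, 93.92), 2))  # close to the 94.84 reported in Table 11
```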
Training set comparison:
Table 12 compares the number of training instances used by the k-NN classifiers. Witten et al. [62] dedicate a chapter to reduction techniques for instance-based learning algorithms; here, the training-set-reduction machine learning algorithms are compared on that basis. In our research, TSR-FkNN (applying SSE) has the fewest training instances when the value of k is 33 and 35, all other classifiers use more, and plain k-NN uses the most. The training sets of all classifiers other than k-NN are reduced by the novel techniques designed for this research.
Sr. No | k  | k-NN | F-kNN (SSE) | F-kNN (silhouette) | TSR-kNN | TSR-FkNN (SSE) | TSR-FkNN (silhouette)
1      | 31 | 7000 | 141         | 201                | 3005    | 141            | 131
2      | 33 | 7000 | 131         | 201                | 2855    | 61             | 151
3      | 35 | 7000 | 131         | 191                | 2970    | 71             | 131
4      | 37 | 7000 | 181         | 191                | 3040    | 111            | 141
Table 12. Comparison of training instances for all k-NN classifiers
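The reduction behind Table 12 replaces most of the 7000 raw training instances with cluster prototypes. A minimal numpy-only sketch of that idea on synthetic two-class data (the cluster count and iteration limit are illustrative assumptions, not the thesis implementation):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's algorithm: return k cluster centroids of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        centers = np.array([X[assign == j].mean(0) if (assign == j).any()
                            else centers[j] for j in range(k)])
    return centers

def knn_predict(Xtr, ytr, x, k=3):
    """Brute-force k-NN: majority vote among the k nearest training points."""
    idx = np.argsort(((Xtr - x) ** 2).sum(1))[:k]
    vals, counts = np.unique(ytr[idx], return_counts=True)
    return vals[counts.argmax()]

rng = np.random.default_rng(1)
# Synthetic two-class data standing in for soil samples
X = np.vstack([rng.normal(0, 1, (500, 2)), rng.normal(5, 1, (500, 2))])
y = np.array([0] * 500 + [1] * 500)

# Replace each class's 500 samples with 5 k-means centroids
X_red = np.vstack([kmeans(X[y == c], 5) for c in (0, 1)])
y_red = np.array([0] * 5 + [1] * 5)
print(len(X_red), "training instances instead of", len(X))
```

Classification then runs against the 10 prototypes instead of the 1000 originals, which is the source of the storage and time savings reported above.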
Classification time comparison:
Sr. No | k  | k-NN | F-kNN (SSE) | F-kNN (silhouette) | TSR-kNN | TSR-FkNN (SSE) | TSR-FkNN (silhouette)
1      | 31 | 5766 | 199         | 240                | 1021    | 248            | 261
2      | 33 | 5779 | 191         | 255                | 1017    | 143            | 219
3      | 35 | 5777 | 217         | 248                | 1039    | 180            | 192
4      | 37 | 5746 | 233         | 240                | 1343    | 217            | 217
Table 13. Comparison of classification time for all k-NN classifiers
Table 13 compares all classifiers in terms of classification time in milliseconds. Williams et al. [57] applied this method to compare the running time of five classifiers, and Bost et al. [58] compared machine learning algorithms on encrypted data sets by execution time. In our research it is observed that TSR-FkNN (applying SSE) has the lowest classification time when the value of k is 33 and 35.
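Timings like those in Table 13 can be reproduced by wrapping the prediction loop in a wall-clock timer. A hedged sketch with synthetic data and a brute-force k-NN stand-in (the sizes mirror Table 12 but this is not the thesis code):

```python
import time
import numpy as np

def knn_predict(Xtr, ytr, x, k=3):
    """Brute-force k-NN majority vote."""
    idx = np.argsort(((Xtr - x) ** 2).sum(1))[:k]
    vals, counts = np.unique(ytr[idx], return_counts=True)
    return vals[counts.argmax()]

rng = np.random.default_rng(0)
X_full, y_full = rng.normal(size=(7000, 4)), rng.integers(0, 3, 7000)
X_red, y_red = rng.normal(size=(140, 4)), rng.integers(0, 3, 140)  # stand-in reduced set
queries = rng.normal(size=(100, 4))

times = {}
for name, Xtr, ytr in [("k-NN (7000 instances)", X_full, y_full),
                       ("reduced (140 instances)", X_red, y_red)]:
    t0 = time.perf_counter()
    for q in queries:
        knn_predict(Xtr, ytr, q)
    times[name] = (time.perf_counter() - t0) * 1000  # milliseconds
    print(f"{name}: {times[name]:.1f} ms")
```

Because brute-force k-NN scans every stored instance per query, the reduced set classifies markedly faster, which is the pattern the table reports.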
6. Conclusion
• Storage reduction:
  • The storage requirement of k-NN is very high in comparison to the other algorithms.
  • TSR-FkNN (applying SSE) has the lowest storage requirement when the value of k is 33 and 35, followed by F-kNN and TSR-FkNN (applying silhouette). Hence, in terms of storage, TSR-FkNN and F-kNN are efficient.
• Execution time:
  • Execution time is highest for k-NN, followed by TSR-kNN, as they store a larger number of training instances.
  • Execution time is lowest for TSR-FkNN (applying SSE), followed by F-kNN, as they store fewer training instances.
• Generalization accuracy, precision, recall and F1 measure:
  • The generalization accuracy of TSR-FkNN (applying silhouette) is the highest of all the algorithms; hence, in terms of accuracy, TSR-FkNN (applying silhouette) is recommended.
  • The precision of TSR-FkNN (applying silhouette) is the highest of all the algorithms; hence, in terms of precision, TSR-FkNN (applying silhouette) is recommended.
  • The recall and F1 measure of TSR-FkNN (applying silhouette) are the highest of all the algorithms; hence, in terms of recall and F1, TSR-FkNN (applying silhouette) is recommended.
Considering the time, space and accuracy comparisons together, the proposed novel hybrid algorithm TSR-FkNN (applying silhouette) is the best algorithm, and it can therefore be recommended for classifying soil samples into their respective nutrient-deficiency categories.
7. Achievements with respect to objectives
• We have successfully applied the machine learning classifier k-NN to the agriculture soil health card data set to classify soil samples into particular classes of nutrient deficiency.
• The design of the novel k-NN variants F-kNN, TSR-kNN and TSR-FkNN was carried out successfully.
• Noticeable performance improvements in time and space complexity are observed for the proposed work.
8. Published papers
• “A Novel Framework for Association Rule Mining to observe Crop Cultivation Practices
based on Soil type”, International Journal of Computer Science and Information Security
(IJCSIS), Vol. 14, No. 9, September 2016.
• “Evaluation of Effectiveness of k-Means Cluster based Fast k-Nearest Neighbor
classification applied on Agriculture Dataset”, International Journal of Computer Science
and Information Security (IJCSIS), Vol. 14, No. 10, October 2016.
• “Reducing execution time of Machine Learning Techniques by Applying Greedy
Algorithms for Training Set Reduction”, International Journal of Computer Science and
Information Security (IJCSIS), Vol. 14, No. 12, December 2016.
• “Towards the new Similarity Measures in Application of Machine Learning Techniques
on Agriculture Dataset”, International Journal of Computer Applications (0975 – 8887)
Volume 156 – No 11, December 2016
9. References
[1] Tan, P.-N., Steinbach, M., et al. Introduction to Data Mining. Addison-Wesley, 2006.
[2] Witten, Ian H., et al. “Data Mining: Practical machine learning tools and techniques”. Morgan Kaufmann, 2016.
[3] Maimon, Oded, Lior Rokach, “Data mining and knowledge discovery handbook. Vol. 2. New York: Springer”,
2005.
[4] Kittipong C. , Pasapitch C. et al. “An Empirical Study of Distance Metrics for k-Nearest Neighbor Algorithm”,
ICIE 2015.
[5] Thirumuruganathan, S. “A Detailed Introduction to k-Nearest Neighbor (k-NN) Algorithm”, 2010.
[6] T. Cover, P. Hart, “Nearest neighbor pattern classification,” IEEE Trans. Inf. Theory, vol. 13, no. 1, pp. 21–27,
Jan. 1967.
[7] T. Denoeux, “A k-nearest neighbor classification rule based on Dempster–Shafer theory,” IEEE Trans. Syst.,
Man, Cybern., vol. 25, no. 5, pp. 804–813, May 1995.
[8] A. Bosch, A. Zisserman, and X. Muoz, “Scene classification using a hybrid generative/discriminative approach,”
IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 4, pp. 712–727, Apr. 2008.
[9] J. Yang, L. Zhang, J. Yang, and D. Zhang, “From classifiers to discriminators: A nearest neighbor rule induced
discriminant analysis,” Pattern Recognit., vol. 44, no. 7, pp. 1387–1402, 2011.
[10] J. Xu, J. Yang, and Z. Lai, “K-local hyperplane distance nearest neighbor classifier oriented local discriminant
analysis,” Inf. Sci., vol. 232, pp. 11–26, May 2013.
[11] H. Frigui and P. Gader, “Detection and discrimination of land mines in a ground-penetrating radar based on edge
histogram descriptors and a possibilistic K-nearest neighbor classifier,” IEEE Trans. Fuzzy Syst.,vol. 17, no. 1, pp.
185–199, Feb. 2009.
[12] M. Li, M. M. Crawford, and J. Tian, “Local manifold learning-based k-nearest-neighbor for hyperspectral image
classification,” IEEE Trans. Geosci. Remote Sens., vol. 48, no. 11, pp. 4099–4109, Nov. 2010.
[13] T. Mensink, J. Verbeek, F. Perronnin, and G. Csurka, “Distance-based image classification: Generalizing to new
classes at near-zero cost,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 11, pp. 2624–2637, Nov. 2013.
[14] Acharya, Tinku, and Ajoy K. Ray. Image processing: principles and applications. John Wiley & Sons, 2005.
[15] M. L. Raymer, T. E. Doom, L. A. Kuhn, and W. F. Punch, “Knowledge discovery in medical and biological
datasets using a hybrid Bayes classifier/evolutionary algorithm,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol.
33, no. 5, pp. 802–813, Oct. 2003.
[16] H. Frigui and P. Gader, “Detection and discrimination of land mines in a ground-penetrating radar based on edge
histogram descriptors and a possibilistic K-nearest neighbor classifier,” IEEE Trans. Fuzzy Syst., vol. 17, no. 1, pp.
185–199, Feb. 2009.
[17] P. Maji, “Fuzzy–rough supervised attribute clustering algorithm and classification of microarray data,” IEEE
Trans. Syst., Man, Cybern. B, Cybern., vol. 41, no. 1, pp. 222–233, Feb. 2011.
[18] X. Geng, D.-C. Zhan, and Z.-H. Zhou, “Supervised nonlinear dimensionality reduction for visualization and
classification,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 35, no. 6, pp. 1098–1107, Dec. 2005.
[19] Z. Pan, Y. Wang, and W. Ku, "A new general nearest neighbor classification based on the mutual neighborhood
information," Knowledge-Based Systems, 2017. ISSN 0950-7051.
[20] Lu, Huijuan, et al. "A hybrid feature selection algorithm for gene expression data
classification." Neurocomputing (2017).
[21] Pan, Zhibin, Yidi Wang, and Weiping Ku. "A new k-harmonic nearest neighbor classifier based on the multi-
local means." Expert Systems with Applications 67 (2017): 115-125.
[22] Yu, Xiaopeng. "The Research on an adaptive k-nearest neighbors classifier." Cognitive Informatics, 2006. ICCI
2006. 5th IEEE International Conference on. Vol. 1. IEEE, 2006.
[23] Parvin, H., Alizadeh, H., and Minaei-Bidgoli, B., "MKNN: Modified k-Nearest Neighbor," in Proceedings of the
World Congress on Engineering and Computer Science, USA, 2008.
[24] Cedeno, W. and D. Agrafiotis, Using particle swarms for the development of QSAR models based on k-nearest
neighbor and kernel regression. Journal of Computer-Aided Molecular Design, 2003. 17(2-4).
[25] G.L. Ritter, H.B. Woodruff, S.R. Lowry, and T.L. Isenhour, “An Algorithm for a Selective Nearest Neighbor
Decision Rule,” IEEE Trans. Information Theory, vol. 21, pp. 665-669, Nov. 1975.
[26] C.L. Chang, “Finding Prototypes for Nearest Neighbor Decision Rule,” IEEE Trans. Computers, vol. 23, no. 11,
pp. 1179-1184, Nov. 1974.
[27] P.E. Hart, “Condensed Nearest Neighbor Rule,” IEEE Trans. Information Theory, vol. 14, pp. 515-516, May
1968.
[28] D.W. Jacobs and D. Weinshall, “Classification with Non-Metric Distances: Image Retrieval and Class
Representation,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 6, pp. 583-600, June 2000.
[29] Zhang, Bin, and Sargur N. Srihari. "Fast k-nearest neighbor classification using cluster-based trees." IEEE
Transactions on Pattern analysis and machine intelligence 26.4 (2004): 525-528.
[30] A.J. Broder, “Strategies for Efficient Incremental Nearest Neighbor Search,” Pattern Recognition, vol. 23, nos.
1/2, pp. 171-178, Nov. 1986.
[31] A. Farago, T. Linder, and G. Lugosi, “Fast Nearest-Neighbor Search in Dissimilarity Spaces,” IEEE Trans.
Pattern Analysis and Machine Intelligence, vol. 15, no. 9, pp. 957-962, Sept. 1993.
[32] B.S. Kim and S.B. Park, “A Fast k Nearest Neighbor Finding Algorithm Based on the Ordered Partition,” IEEE
Trans. Pattern Analysis and Machine Intelligence, vol. 8, no. 6, pp. 761-766, Nov. 1986.
[33] E. Vidal, “An Algorithm for Finding Nearest Neighbors in (Approximately) Constant Average Time,” Pattern
Recognition Letters, vol. 4, no. 3, pp. 145-157, July 1986.
[34] Wang, Chun-Yan, et al. "A K-Nearest Neighbor Algorithm based on cluster in text classification." Computer,
Mechatronics, Control and Electronic Engineering (CMCE), 2010 International Conference on. Vol. 1. IEEE, 2010.
[35] A. Majumdar and R. K. Ward, “Robust classifiers for data reduced via random projections,” IEEE Trans. Syst.,
Man, Cybern. B, Cybern, vol. 40, no. 5, pp. 1359–1371, Oct. 2010.
[36] A. K. Ghosh, P. Chaudhuri, and C. A. Murthy, “Multiscale classification using nearest neighbor density
estimates,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 36, no. 5, pp. 1139–1148, Oct. 2006.
[37] Q. Hu, P. Zhu, Y. Yang, and D. Yu, “Letters: Large-margin nearest neighbor classifiers via sample weight
learning,” Neurocomputing, vol. 74, no. 4, pp. 656–660, 2011.
[38] C. Domeniconi, D. Gunopulos, and J. Peng, “Large margin nearest neighbor classifiers,” IEEE Trans. Neural
Netw., vol. 16, no. 4, pp. 899–909, Jul. 2005.
[39] G. Parthasarathy and B. N. Chatterji, “A class of new KNN methods for low sample problems,” IEEE Trans.
Syst., Man, Cybern., vol. 20, no. 3, pp. 715–718, May/Jun. 1990.
[40] Q. Gao and Z. Wang, “Center-based nearest neighbor classifier,” Pattern Recognit., vol. 40, no. 1, pp. 346–349,
2007.
[41] B. Li, Y. W. Chen, and Y.-Q. Chen, “The nearest neighbor algorithm of local probability centers,” IEEE Trans.
Syst., Man, Cybern. B, Cybern., vol. 38, no. 1, pp. 141–154, Feb. 2008.
[42] P. Vincent and Y. Bengio, “K-local hyperplane and convex distance nearest neighbour algorithms,” in Proc. Adv.
Neural Inf. Process. Syst. (NIPS), vol. 14. Vancouver, BC, Canada, 2002, pp. 985–992.
[43] S. Hernández-Rodríguez, J. F. Martínez-Trinidad, and J. A. Carrasco-Ochoa, “Fast k most similar neighbor
classifier for mixed data (tree k-MSN),” Pattern Recognit., vol. 43, no. 3, pp. 873–886, 2010.
[44] B. Zhang and S. N. Srihari, “Fast k-nearest neighbor classification using cluster-based trees,” IEEE Trans. Pattern
Anal. Mach. Intell., vol. 26, no. 4, pp. 525–528, Apr. 2004.
[45] A. K. Ghosh, P. Chaudhuri, and C. A. Murthy, “On visualization and aggregation of nearest neighbor classifiers,”
IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 10, pp. 1592–1602, Oct. 2005.
[46] J. Derrac, I. Triguero, S. Garcia, and F. Herrera, “Integrating instance selection, instance weighting, and feature
weighting for nearest neighbor classifiers by coevolutionary algorithms,” IEEE Trans. Syst., Man, Cybern. B, Cybern.,
vol. 42, no. 5, pp. 1383–1397, Oct. 2012.
[47] I. Triguero, S. García, and F. Herrera, “Differential evolution for optimizing the positioning of prototypes in
nearest neighbor classification,” Pattern Recognit., vol. 44, no. 4, pp. 901–916, 2011.
[48] B. P. Prajapati, and D. R. Kathiriya. "Evaluation of Effectiveness of k-Means Cluster based Fast k-Nearest
Neighbor classification applied on Agriculture Dataset." International Journal of Computer Science and Information
Security 14.10 (2016): 800.
[49] Hardy, André. "An examination of procedures for determining the number of clusters in a data set." New
approaches in classification and data analysis. Springer, Berlin, Heidelberg, 1994. 178-185.
[50] Milligan, G. W., & Cooper, M. C. (1985). An examination of procedures for determining the number of clusters
in a data set. Psychometrika, 50(2), 159-179.
[51] Lee, Paul H., et al. "A cluster analysis of patterns of objectively measured physical activity in Hong Kong." Public
health nutrition 16.8 (2013): 1436-1444.
[52] Arbelaitz, Olatz, et al. "An extensive comparative study of cluster validity indices." Pattern Recognition 46.1
(2013): 243-256.
[53] Rendón, Eréndira, et al. "Internal versus external cluster validation indexes." International Journal of computers
and communications 5.1 (2011): 27-34.
[54] Brun, Marcel, et al. "Model-based evaluation of clustering validation measures." Pattern recognition 40.3 (2007):
807-824.
[55] Prajapati, B.P. and Kathiriya, D.R., 2016. Reducing execution time of Machine Learning Techniques by Applying
Greedy Algorithms for Training Set Reduction. International Journal of Computer Science and Information Security,
14(12), p.705.
[56] Wettschereck, D., Aha, D.W. and Mohri, T., 1997. A review and empirical evaluation of feature weighting
methods for a class of lazy learning algorithms. In Lazy learning (pp. 273-314). Springer Netherlands.
[57] Williams, Nigel, Sebastian Zander, and Grenville Armitage. "A preliminary performance comparison of five
machine learning algorithms for practical IP traffic flow classification." ACM SIGCOMM Computer
Communication Review 36.5 (2006): 5-16.
[58] Bost, Raphael, et al. "Machine Learning Classification over Encrypted Data." NDSS. 2015.
[59] Vafeiadis, Thanasis, et al. "A comparison of machine learning techniques for customer churn prediction."
Simulation Modelling Practice and Theory 55 (2015): 1-9.
[60] Patel, Jigar, et al. "Predicting stock and stock price index movement using trend deterministic data preparation
and machine learning techniques." Expert Systems with Applications 42.1 (2015): 259-268.
[61] Kanj, Sawsan, et al. "Editing training data for multi-label classification with the k-nearest neighbor rule."
Pattern Analysis and Applications 19.1 (2016): 145-161.
[62] Witten, Ian H., et al. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, 2016.