Knowledge Discovery and Data Mining In Agricultural Database
Using Machine Learning Techniques
PhD synopsis
For the degree of
Doctor of Philosophy
in Computer IT Engineering
Submitted by:
Bhagirath Parshuram Prajapati
(Enrol. No.: 119997107001, Batch: 2011)
Supervisor
Dr. Dhaval Kathiriya,
Dean & Principal, AIT,
Anand Agriculture University.
DPC Members:
Dr. Apurva M. Shah,
Associate Professor,
MSU.
Dr. Ramji Makwana,
CEO and Founder,
AI eSmart Solutions
Submitted to
Gujarat Technological University
1. Abstract
Studies from the 1990s show that most people preferred information obtained from other people over automated information retrieval systems. The Pew Internet Survey (Fallows 2004), however, found that "the Internet is a good place to go for getting everyday information"; this survey indicates that in the twenty-first century people prefer, and are satisfied with, information retrieval systems and the World Wide Web (WWW). Today, digitization reaches every field, and data grow rapidly in gargantuan amounts as they are continuously gathered by numerous affordable information-sensing devices. One domain of interest in which newer datasets are being generated is agriculture. From the beginning of human history to the present day, we human beings have depended on agriculture for our daily nourishment needs, including food and milk. In addition, the Industrial Revolution was possible because of raw material provided by agriculture. Nowadays, various agricultural data are being collected and stored in computer systems, and data mining concepts can be applied to the agriculture sector to analyse these data sets. The question is: can we automate an information retrieval system for the agriculture domain based on data mining? Machine learning, a field of data mining, provides the necessary techniques to solve this problem. This research evaluates classification techniques and applies them to agricultural soil data sets to discover meaningful relationships that can be used for decision making about quality crop production on a massive scale. The soil health card database consists of the macro- and micronutrient records of soil samples taken from farm fields and tested in a soil laboratory. In this research we concentrate on the k-Nearest Neighbor classification algorithm to classify soil sample instances into the appropriate fertilizer-deficiency categories. Although k-Nearest Neighbor classification is simple and effective, it has large computational and storage requirements; in addition, its classification effectiveness decreases when the training data are unevenly distributed. We present novel Fast k-Nearest Neighbor, Training Set Reduction k-Nearest Neighbour and Hybrid k-Nearest Neighbour classification methods that decrease the time and space requirements. We have applied these new approaches to the soil health card agriculture data set, and our evaluation illustrates that they solve the mentioned problems effectively. We discuss a comparative analysis of the methods to identify the best one.
2. State of the art of the research topic
Background Study:
Data mining can be approached in many ways; machine learning is one of the most widely used. Machine learning is a domain focused on developing algorithms that allow computers to learn to resolve problems based on past records [1]. Data mining is the science of discovering knowledge from databases. A database contains a collection of instances (records or cases), and each instance used by machine learning and data mining algorithms is formatted using the same set of fields (features, attributes, inputs, or variables). When the instances contain the correct output (class label), the learning process is called supervised learning [2]. The other machine learning approach, clustering, works without knowing the class labels of the instances and is called unsupervised learning [3]. The focus of this research is on classification and clustering for the agricultural soil health card database.
k-Nearest Neighbors algorithm:
The k-Nearest Neighbor algorithm (k-NN) is a simple, instance-based machine learning algorithm [5], [6]. k-NN finds the k training instances closest to a query instance; the classification decision is made by identifying the most frequent class label among the training instances at minimum distance from the query instance [5]. The distance is determined by a distance metric such as Euclidean, Cosine or Chebyshev [4].
The k-Nearest Neighbor classification algorithm is a classical, well-known method in machine learning [6], [7]. It is well established in the area of pattern recognition, and a great deal of research has been done on k-NN [8], [9], [10], for example in remote sensing [11], [12], image processing [13], [14] and so on. Raymer et al. [15] applied k-NN in combination with a genetic algorithm to medical data sets for knowledge discovery. Frigui et al. [16] used a k-NN classifier to perform detection of land mines, adopting a possibilistic k-NN classifier. Yang et al. [9] adopted the local-mean-based nearest neighbour algorithm to perform discriminant analysis. Li et al. [12] applied the k-NN classifier to the classification of hyperspectral images. Bosch et al. [8] adopted a k-NN classifier for scene classification. Xu et al. [10] performed k-local hyperplane nearest neighbor classification for feature extraction and classification. Mensink et al. [13] applied the k-NN classifier in image classification. Maji [17] applied the k-NN classifier to microarray data classification. To reduce dimensionality in pattern classification, Geng et al. [18] combined k-NN with a dimensionality reduction technique. Z. Pan et al. [19] applied k-NN based on mutual neighborhood information. Lu et al. [20] proposed a hybrid feature selection algorithm for gene expression data classification. Z. Pan et al. [21] adopted a new k-harmonic nearest neighbor classifier for gene data classification.
The simple k-NN classification algorithm is as follows [22]:
Step 1: Add each sample to the training_list.
Step 2: For a given unlabeled sample si, select the k nearest neighbors of si in training_list.
Step 3: Return as the class label for si the label that occurs the maximum number of times among the top k training_list records.
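The three steps above can be sketched in Python as a minimal function; the Euclidean distance metric and the (feature_vector, label) record layout are assumptions made for illustration, not the implementation used in this work.

```python
import math
from collections import Counter

def knn_classify(query, training_list, k):
    """Classify `query` by majority vote among its k nearest neighbours.

    `training_list` holds (feature_vector, label) pairs; Euclidean
    distance plays the role of the distance metric (Step 2)."""
    ranked = sorted(training_list, key=lambda rec: math.dist(query, rec[0]))
    top_k_labels = [label for _, label in ranked[:k]]
    # Step 3: the most frequent label among the top k records wins.
    return Counter(top_k_labels).most_common(1)[0][0]
```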
The advantages of using k-NN:
No need to retrain the model when additional instances are added [23].
Simplicity and flexibility in use [23].
Weights can be assigned to significant features [24].
Accelerating k-NN:
In machine learning classification, k-NN is a powerful and widely used nonparametric technique. However, an exhaustive k-NN search requires a lot of computational resources when the training data set is large, in which case plain k-NN is not preferable [25], [26]. Accelerating the k-NN search has been an active area of research for decades.
Approaches to speeding up the k-NN search are mainly divided into two categories: template condensation and template reorganization [29]. Template condensation identifies redundant patterns in the template set and removes them [25], [26], [27], while template reorganization algorithms restructure the templates [30], [31], [32], [33]. A lot of work has been done to find new approaches, including methods in which classification performance is not affected while the storage and computation costs are reduced [22].
In some methods, representative samples are selected from the training set and the remaining samples are deleted to reduce the training sample set. In text categorization research [34], the training set is reduced based on density: the text density is calculated and, if it is found to be bigger than the average density, some samples are removed from the training set.
Some research has examined the factors affecting k-NN performance: the best k value, the training sample size, and so on. Majumdar and Ward [35] combined the k-NN classifier with the random projection technique. Ghosh et al. [36] estimated the optimal value of k in k-NN. Hu et al. [37] applied sample weight learning to the nearest neighbor classifier. Domeniconi et al. [30] studied large margin nearest neighbor classifiers theoretically. Parthasarathy and Chatterji [29] explored the use of k-NN when the sample size is small. Some researchers have related properties of the data points, such as class centers and hyperplane data points, to the nearest neighbor relationships. Gao et al. [40] designed a center-based nearest neighbor classifier. Li et al. [41] used the local probabilistic centers of each class in the classification process. Vincent et al. [42] applied the k-local hyperplane nearest neighbor technique.
Other research work has explored the efficiency of the k-NN classifier. Hernández-Rodríguez et al. [43] proposed an approximate fast k-most-similar-neighbor classifier based on a tree structure and checked the efficiency of the k-NN classifier. Zhang and Srihari [44] explored cluster-based tree algorithms for fast k-NN classification. Ghosh et al. [45] explored the visualization and aggregation of nearest neighbor classifiers. Some research work has explored distance metrics. Derrac et al. [46] proposed a method to improve the performance of the k-NN classifier based on cooperative coevolution. Triguero et al. [47] adopted differential evolution to optimize the positioning of the prototypes to address the limitations of the nearest neighbor classifier. Weinberger et al. [35] investigated distance metric learning to obtain a large margin for the k-NN classifier.
3. Definition of the Problem
To explore the applicability of machine learning techniques to the agricultural soil health card dataset, and to propose improved, efficient machine learning algorithms to classify soil samples into categories of micro- and macronutrient deficiencies.
4. Objective and Scope of work
Objectives:
To study and analyze the agricultural soil health card database, data mining on the soil health card database, and the applicability of machine learning to it.
To design the concept for carrying out machine learning, specifically classification algorithms, on the soil health card database.
To identify concepts for improving the time and space complexity of the classification algorithm that classifies soil samples into their respective nutrient-deficiency categories.
To measure the performance of the proposed algorithms based on accuracy, precision, recall and F1 measure.
To design and develop a software prototype to prove the above concepts.
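The performance measures named in the objectives can all be derived from the counts of a binary confusion matrix. A minimal sketch, assuming the usual true/false positive/negative counts:

```python
def classification_measures(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)          # of predicted positives, how many are right
    recall = tp / (tp + fn)             # of actual positives, how many are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1
```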
Scope:
In this section we specify the scope of the research presented in this work as follows:
Based on the nature of this research, a sub-dataset is abstracted from the agricultural soil health card database, consisting of the micro and macro nutrients for individual farms of selected districts of Gujarat; this data set is preprocessed for applying machine learning techniques.
The classification algorithm k-Nearest Neighbor is applied to the soil health card data set to classify each soil sample into the categories of nutrient deficiencies.
The primary limitation of k-Nearest Neighbor is that it retains all the training data and is therefore prone to high computational cost; hence this research proposes novel k-Nearest Neighbour algorithms, namely Fast k-Nearest Neighbor, Training set reduction k-Nearest Neighbor and Hybrid k-Nearest Neighbor, which are evaluated comparatively.
5. Original contribution by the thesis
This research is distinctive in its application to the agricultural domain, encompassing the machine learning k-Nearest Neighbor algorithm and its improvements. The agricultural soil health card database is used, from which the macro and micro nutrients are abstracted for each farm for classification. This research work is original, as no such work has been carried out on the Soil Health Card Database of the state of Gujarat. Though the k-Nearest Neighbor algorithm is effective for classification, it suffers from large storage and computational requirements. In this research we present novel Fast k-Nearest Neighbor, Training Set Reduction k-Nearest Neighbour and Hybrid k-Nearest Neighbour classification methods that decrease the time and space requirements. The original contribution is also reflected in the research papers listed at the end.
6. Methodology of Research, Results / Comparisons
Methodology of Research
Research methodology encompasses the systematic representation of the methods applied to carry out the study, with emphasis on the theoretical illustration of the methods and principles related to the branch of knowledge.
To solve the aforementioned research problem, the main design research phases applied in this work are as follows.
Problem awareness
In the agricultural soil health card database, the scope for machine learning application is to classify the soil samples into categories of nutrient deficiencies. This step also identifies the lack of a general framework that can be used to extract knowledge in the agricultural domain.
Literature review
This step studies various machine learning techniques and their applications in various fields in detail. It identifies the k-nearest neighbor classifier for this research work, and also covers the selection of the dataset extracted from the soil health card database. The main purpose of this phase is to identify research directions for improving the existing k-nearest neighbor algorithm for the aforesaid classification task on the agricultural soil data set.
Implementation
The work is implemented in different steps as given below,
o A collection of data and generation of the data set.
o Pre-processing of data.
o Application of classification technique k-Nearest Neighbour.
o Application of classification technique Fast k-Nearest Neighbour.
o Application of classification technique training set reduction k-Nearest
Neighbour.
o Application of classification technique Hybrid k-Nearest Neighbour.
Evaluation
The purpose of the evaluation phase is to check whether the implemented classification methods meet expectations. Hence the classification methods are compared in terms of various measures.
Conclusion
It is the final phase of the research cycle and covers the impact of the research and the contributions of the researchers. In this work, the scope and applicability of machine learning to the agricultural soil health card data set are evaluated with respect to k-Nearest Neighbor classifiers and their adaptation to the agricultural domain.
Proposed work:
Collection of data and generation of the data set
This research work concentrates on exploring the applicability of machine learning techniques to the agricultural soil health card dataset and on proposing improved, efficient machine learning algorithms to classify soil samples into categories of micro- and macronutrient deficiencies. Hence, the agricultural soil dataset is collected from the Soil Health Card Database (SHCDB), which is available with Anand Agriculture University. The SHCDB for Gujarat state consists of soil health card data for individual farms of all districts of Gujarat. Each district of Gujarat has its respective database table with a total of 49 different attributes, ranging from the identification of the soil sample to the soil characteristics of the particular land. The SHCDB is maintained in an MS SQL DBMS. For our research purpose we have been provided the SHCDB of six districts, namely Kutch, Rajkot, Banaskantha, Vadodara, Anand, and Surat. As mentioned previously, there are 49 attributes in the SHCDB; of these we are concerned with the macro and micro nutrients, so four macronutrients and four micronutrients are abstracted from the SHCDB of each district and a label is assigned to each sample indicating its deficiencies.
Table 1. Soil health card data set: soil fertilizer treatments based on deficiency
• Macro soil nutrient parameters
• Potassium (K) <= 150 ppm: the soil needs potassium fertilizer treatment.
• Sulfur (S) <= 20 ppm: the soil needs sulfur fertilizer treatment.
• Magnesium (Mg) <= 2 ppm: the soil needs magnesium fertilizer treatment.
• Phosphorus (P) <= 20 ppm: the soil needs phosphorus fertilizer treatment.
• Micro soil nutrient parameters
• Iron (Fe) <= 10 ppm: the soil requires ferrous sulfate.
• Manganese (Mn) <= 10 ppm: the soil requires manganese sulfate.
• Zinc (Zn) <= 1 ppm: the soil requires zinc sulfate.
• Copper (Cu) <= 0.4 ppm: the soil requires copper sulfate.
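The threshold rules of Table 1 can be expressed as a simple labeling function. The dictionary layout below is an assumption made for illustration and does not reflect the actual SHCDB schema; the thresholds are those listed above.

```python
# Deficiency thresholds in ppm, taken from the rules listed in Table 1.
THRESHOLDS = {
    "K": 150, "S": 20, "Mg": 2, "P": 20,     # macronutrients
    "Fe": 10, "Mn": 10, "Zn": 1, "Cu": 0.4,  # micronutrients
}

def deficiency_labels(sample):
    """Return the nutrients whose measured content is at or below the threshold.

    `sample` maps nutrient symbols to measured ppm values; this layout is
    a hypothetical stand-in for a SHCDB record, used only for illustration."""
    return sorted(n for n, limit in THRESHOLDS.items()
                  if sample.get(n, float("inf")) <= limit)
```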
Application of classification technique k-Nearest Neighbor
We first applied the k-Nearest Neighbor algorithm to the data set of Kutch district, which has 14000 samples of soil parameters from the SHCDB, and calculated the accuracy, precision, recall, F1 measure and classification time in milliseconds.
Algorithm 1: k-Nearest Neighbour (k-NN) classifier
Input: A set of agriculture records R = {R1, R2, ..., Rn}, where n is the total number of agriculture records; training record set D.
Procedure:
Step 1: Divide the records into a training set and a test set with a 50-50 split.
Step 2: For each test record, calculate the similarity with each training record.
Step 3: Sort the training records in descending order of cosine similarity and select the top k training records.
Step 4: Assign to the test record the class that occurs the maximum number of times in the top k training records.
Step 5: Construct a confusion matrix.
Step 6: Calculate the performance measures from the confusion matrix.
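Steps 2-4 of Algorithm 1 can be sketched as follows. The cosine similarity metric follows the algorithm, while the (feature_vector, label) pair layout, e.g. an eight-dimensional nutrient vector per soil sample, is an assumption made for illustration.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def knn_predict(test_record, training_set, k):
    """Steps 2-4 of Algorithm 1: rank training records by cosine similarity
    in descending order and take a majority vote among the top k.
    `training_set` holds (feature_vector, label) pairs."""
    ranked = sorted(training_set,
                    key=lambda rec: cosine_similarity(test_record, rec[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```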
Sr. No   k    Accuracy   Precision   Recall   F1 measure   Classification time (ms)
1        31   90.21      76.28       55.03    63.93        5766
2        33   88.85      92.22       67.92    78.23        5779
3        35   90.41      85.94       55.88    67.72        5777
4        37   90         88.28       60.97    72.13        5746
Table 2. Results of the k-NN classifier
Results of the k-Nearest Neighbor classifier applied to the agricultural soil data set (SHCDB) are shown in Table 2. For different values of k, accuracy, precision, recall, F1 measure and classification time are measured.
Application of classification technique Fast k-Nearest Neighbor (F-kNN)
The primary limitation of the simple k-NN algorithm is that it retains all the training data and is prone to high computational cost. To reduce the computational cost of simple k-NN, we proposed and designed a fast k-Nearest Neighbor algorithm. The fast k-NN (F-kNN) classifier first finds k clusters using the k-Means clustering algorithm. The class label of each cluster is the class with the maximum number of records in that cluster [48]. For each test record, we calculate the similarity with each cluster and assign a class based on the k-NN approach. The fast k-Nearest Neighbor algorithm is described below.
Algorithm 2: Fast k-Nearest Neighbour algorithm (F-kNN)
Input: A set of agriculture records R = {R1, R2, ..., Rn}, where n is the total number of agriculture records; training record set D.
Procedure:
Step 1: Divide the records into a training set and a test set with a 50-50 split.
Step 2: Construct k clusters using the kMeans clustering algorithm (validating the k value for kMeans by the Elbow method or the Silhouette method) and assign a class label to each cluster based on the maximum occurrences of a particular class in that cluster.
Step 3: For each test record, calculate the similarity with each cluster's centroid.
Step 4: Sort the clusters in descending order of cosine similarity and select the top k clusters.
Step 5: Assign to the test record the class whose summed similarity over the top k clusters is maximum.
Step 6: Construct a confusion matrix.
Step 7: Calculate the performance measures from the confusion matrix.
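Steps 3-5 of Algorithm 2 can be sketched as below, assuming the clusters have already been produced by a kMeans run and summarised as (centroid, majority label) pairs; comparing test records against a handful of centroids instead of the full training set is what makes F-kNN fast.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def fknn_predict(test_record, clusters, k):
    """Steps 3-5 of Algorithm 2. `clusters` is a list of (centroid, label)
    pairs from a kMeans run, standing in for the full training set.
    Rank clusters by similarity to the test record, then pick the label
    whose summed similarity over the top k clusters is largest."""
    ranked = sorted(clusters,
                    key=lambda c: cosine_similarity(test_record, c[0]),
                    reverse=True)
    scores = {}
    for centroid, label in ranked[:k]:
        scores[label] = scores.get(label, 0.0) + cosine_similarity(test_record, centroid)
    return max(scores, key=scores.get)
```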
In Step 2 of the above algorithm we apply cluster validation techniques, because in cluster analysis there is always the question of how to evaluate the goodness of the clusters [49]. For kMeans clustering, it is desirable to cluster with an optimal k value, to avoid finding patterns in noise and to allow comparison with other clustering algorithms. In this research work we considered two methods of cluster validation; the first computes the sum of squared errors (SSE) [50], [51] as below,
SSE = Σ_{i=1}^{K} Σ_{x ∈ c_i} dist(x, c_i)²
The SSE is defined as the sum of the squared distances between each member of a cluster and its centroid. It measures cohesion [52], that is, how closely related the objects in a cluster are.
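A minimal sketch of the SSE computation, assuming each cluster is represented by its centroid and the list of points assigned to it; in the Elbow method this value is computed for increasing k, and the k at which the SSE stops dropping sharply is chosen.

```python
import math

def sse(clusters):
    """Sum of squared distances from each point to its cluster centroid,
    i.e. the cohesion measure SSE defined above. `clusters` maps each
    centroid (a tuple) to the list of points assigned to it."""
    return sum(
        math.dist(point, centroid) ** 2
        for centroid, points in clusters.items()
        for point in points
    )
```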
The second method of cluster validation computes the silhouette value [53] as below. For an instance i:
a(i) = average distance of i to the points in its own cluster,
b(i) = minimum, over the other clusters, of the average distance of i to the points in that cluster,
s(i) = (b(i) − a(i)) / max{a(i), b(i)}
The silhouette is an interpretation and validation of the consistency of data within a cluster; it combines the ideas of both cohesion and separation [54]. Cluster separation measures how distinct or well separated a cluster is from the other clusters. The silhouette value lies between −1 and 1, and the closer it is to 1 the better.
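The silhouette value for a single point can be sketched as below; here `own_cluster` holds the other members of the point's cluster and `other_clusters` the remaining clusters, a data-layout assumption made for illustration.

```python
import math

def silhouette(point, own_cluster, other_clusters):
    """Silhouette value s(i) for one point, following the definition above:
    a = mean distance to the other members of the point's own cluster,
    b = smallest mean distance to any other cluster."""
    a = sum(math.dist(point, q) for q in own_cluster) / len(own_cluster)
    b = min(
        sum(math.dist(point, q) for q in cluster) / len(cluster)
        for cluster in other_clusters
    )
    return (b - a) / max(a, b)
```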
Sr. No   k    Accuracy   Precision   Recall   F1 measure   Classification time (ms)   Training instances (kMeans, SSE)
1        31   88.64      88.57       94.91    91.64        199                        141
2        33   87.92      87.96       94.91    89.51        191                        131
3        35   88.85      88.79       94.91    91.75        217                        131
4        37   88.78      88.72       92.91    90.77        233                        181
Table 3. Results of the Fast k-NN classifier, k for kMeans chosen by SSE
Results of the Fast k-Nearest Neighbor classifier applied to the agricultural soil data set (SHCDB) are shown in Tables 3 and 4. For different values of k, accuracy, precision, recall, F1 measure and classification time are measured. The optimal value of k in kMeans is decided by the SSE and the silhouette value in Tables 3 and 4 respectively.
Sr. No   k    Accuracy   Precision   Recall   F1 measure   Classification time (ms)   Training instances (kMeans, silhouette)
1        31   88.71      88.65       90.91    89.77        240                        201
2        33   89.78      89.72       93.92    91.77        255                        201
3        35   90.35      90.29       92.92    91.58        248                        191
4        37   89.28      89.22       92.92    91.03        240                        191
Table 4. Results of the Fast k-NN classifier, k for kMeans chosen by silhouette value
Application of classification technique Training set reduction k-Nearest Neighbor
In Fast k-NN the computational cost is significantly reduced compared to k-NN, but it still retains all the training instances in the clusters and therefore has a high storage requirement. We have proposed and designed the Training set reduction k-NN (TSR-kNN) algorithm, which reduces the training set significantly and so has the advantage of low computational and storage cost.
Figure 1. Classification using the Training set reduction k-NN algorithm [56]
In the first phase of this approach, the training set is converted into a set of training vectors [55]. The training vectors are given as input to the training set reduction algorithm, whose output is the reduced training set. In the second phase, the reduced training set is employed by the classifier to classify a new test instance. We have applied the shrink (subtractive) algorithm [56] to reduce the training set.
Algorithm 3: Training set reduction k-NN (TSR-kNN)
Phase I: Shrink (subtractive) algorithm
Input: A set of training instances T = {T1, T2, ..., Tn}, where n is the total number of agriculture records; training record set D.
Step 1: Assign all the training instances to S.
Step 2: Select an instance P from S at random.
Step 3: Classify the instance P using the remaining instances of S.
Step 4: Remove the instance P from S if it is correctly classified.
Phase II:
Input: A set of agriculture records R = {R1, R2, ..., Rn}, where n is the total number of agriculture records; reduced training record set D.
Procedure:
Step 1: Divide the records into a training set and a test set with a 50-50 split.
Step 2: For each test record, calculate the similarity with each record of the reduced training set.
Step 3: Sort the training records in descending order of cosine similarity and select the top k training records.
Step 4: Assign to the test record the class that occurs the maximum number of times in the top k training records.
Step 5: Construct a confusion matrix.
Step 6: Calculate the performance measures from the confusion matrix.
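Phase I of Algorithm 3, the shrink (subtractive) pass, can be sketched as below. The Euclidean distance and the fixed random seed are illustrative assumptions; the surviving instances form the reduced training set.

```python
import math
import random
from collections import Counter

def knn_label(query, records, k):
    """Majority label among the k records nearest to `query` (Euclidean)."""
    ranked = sorted(records, key=lambda r: math.dist(query, r[0]))
    return Counter(lbl for _, lbl in ranked[:k]).most_common(1)[0][0]

def shrink(training_set, k=3, seed=0):
    """Shrink (subtractive) pass: visit the training instances in random
    order and discard each one that the remaining kept instances already
    classify correctly. The survivors form the reduced training set."""
    rng = random.Random(seed)
    keep = [True] * len(training_set)
    order = list(range(len(training_set)))
    rng.shuffle(order)
    for i in order:
        rest = [training_set[j] for j in range(len(training_set))
                if keep[j] and j != i]
        if len(rest) >= k and knn_label(training_set[i][0], rest, k) == training_set[i][1]:
            keep[i] = False
    return [rec for j, rec in enumerate(training_set) if keep[j]]
```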
Sr. No   k    Accuracy   Precision   Recall   F1 measure   Classification time (ms)   Reduced training set
1        31   89.71      81.81       52.01    63.59        1021                       3005
2        33   90.21      86.27       68.03    76.07        1017                       2855
3        35   90.5       84.73       59.59    69.97        1039                       2970
4        37   89         79.69       53.16    63.78        1343                       3040
Table 5. Results of the Training set reduction k-NN classifier
Results of the Training set reduction k-Nearest Neighbor classifier applied to the agricultural soil data set (SHCDB) are shown in Table 5. For different values of k, accuracy, precision, recall, F1 measure and classification time are measured. The reduced training set consists of the samples remaining from the whole training set after the shrink (subtractive) algorithm has been performed on it.
Application of classification technique Training set reduction Fast k-Nearest Neighbor (TSR-FkNN)
This is a hybrid method in which we have combined the features of both fast k-NN and training set reduction.
Fig 2. An overview of the hybrid machine learning technique TSR-FkNN.
Figure 2 gives an overview of the hybrid of the previous two approaches. In the hybrid method, the training set reduction technique is applied to the training set feature vectors, reducing the training set. The reduced training set is given as input to the clustering algorithm, and the resulting set of clusters is given as input to the machine learning algorithm, which learns the classifier model. The classifier model assigns a class to each new test instance.
Algorithm 4: Training set reduction fast k-Nearest Neighbour (TSR-FkNN)
Phase I:
Input: A set of training instances T = {T1, T2, ..., Tn}, where n is the total number of agriculture records; training record set D.
Step 1: Assign all the training instances to S.
Step 2: Select an instance P from S at random.
Step 3: Classify the instance P using the remaining instances of S.
Step 4: Remove the instance P from S if it is correctly classified.
Phase II:
Input: A set of agriculture records R = {R1, R2, ..., Rn}, where n is the total number of agriculture records; reduced training record set D.
Procedure:
Step 1: Divide the records into a training set and a test set with a 50-50 split.
Step 2: Construct k clusters using the kMeans clustering algorithm (validating the k value for kMeans by the Elbow method or the Silhouette method) and assign a class label to each cluster based on the maximum occurrences of a particular class in that cluster.
Step 3: For each test record, calculate the similarity with each cluster's centroid.
Step 4: Sort the clusters in descending order of cosine similarity and select the top k clusters.
Step 5: Assign to the test record the class whose summed similarity over the top k clusters is maximum.
Step 6: Construct a confusion matrix.
Step 7: Calculate the performance measures from the confusion matrix.
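Step 2 of Phase II, which bridges the reduction and clustering stages, can be sketched as below: each cluster of labelled records is collapsed into a (centroid, majority label) pair for the F-kNN stage to use in place of the raw training data. The list-of-record-lists input layout is an assumption made for illustration.

```python
from collections import Counter

def labelled_centroids(clusters):
    """Turn each cluster of (vector, label) records into a
    (centroid, majority-label) pair, as in Step 2 of Algorithm 4.
    `clusters` is a list of record lists, e.g. the output of a kMeans
    run over the shrunk training set."""
    result = []
    for records in clusters:
        dim = len(records[0][0])
        # Component-wise mean of the member vectors.
        centroid = tuple(sum(vec[d] for vec, _ in records) / len(records)
                         for d in range(dim))
        # The cluster's label is the most frequent label among its members.
        majority = Counter(lbl for _, lbl in records).most_common(1)[0][0]
        result.append((centroid, majority))
    return result
```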
Sr. No   k    Accuracy   Precision   Recall   F1 measure   Classification time (ms)   Training instances (kMeans, SSE, after reduction)
1        31   88.92      87.86       93.91    90.79        248                        141
2        33   90.42      89.36       91.92    90.62        143                        61
3        35   90.85      89.79       94.92    92.28        180                        71
4        37   88.85      87.79       92.91    90.28        217                        111
Table 6. Results of the Training set reduction fast k-NN classifier, k for kMeans chosen by SSE
Results of the Training set reduction Fast k-Nearest Neighbor classifier applied to the agricultural soil data set (SHCDB) are shown in Tables 6 and 7, in which accuracy, precision, recall, F1 measure and classification time are measured for different values of k. The optimal value of k in kMeans is decided by the SSE and the silhouette value in Tables 6 and 7 respectively.
Sr. No   k    Accuracy   Precision   Recall   F1 measure   Classification time (ms)   Training instances (kMeans, silhouette, after reduction)
1        31   92.64      92.57       93.91    93.24        261                        131
2        33   92.14      95.07       92.91    93.98        219                        151
3        35   93.85      95.79       93.92    94.84        192                        131
4        37   91.85      94.79       92.91    93.84        217                        141
Table 7. Results of the Training set reduction fast k-NN classifier, k for kMeans chosen by silhouette value
Comparisons of results:
In this section, comparisons between the different proposed classification techniques are carried out in terms of performance measures and classification time in milliseconds.
The experiments were performed on a computer with an Intel i5 processor and 4 GB of RAM; the software IDE is NetBeans 8.2. Depending on the hardware, some of the results may vary. The observed results are averages over multiple runs.
Accuracy comparison:
Sr. No   k    k-NN    F-kNN (SSE)   F-kNN (silhouette)   TSR-kNN   TSR-FkNN (SSE)   TSR-FkNN (silhouette)
1        31   90.21   88.64         88.71                88.21     88.92            92.64
2        33   88.85   87.92         89.78                89.71     90.42            92.14
3        35   90.41   88.85         90.35                89.57     90.85            93.85
4        37   90      88.78         89.28                89.07     88.85            91.85
Table 8. Comparison of accuracy for all k-NN classifiers
In Table 8, the accuracy of the different k-NN classifiers is compared. Williams et al. [57] adopted accuracy as a measure to compare five machine learning algorithms, preferring the algorithm with the highest accuracy. In our research work it is found that the proposed TSR-FkNN (applying silhouette) classifier has the highest accuracy; all other classifiers have lower accuracy.
Precision comparison:
Sr. No   k    k-NN    F-kNN (SSE)   F-kNN (silhouette)   TSR-kNN   TSR-FkNN (SSE)   TSR-FkNN (silhouette)
1        31   76.28   88.57         88.65                81.81     87.86            92.57
2        33   92.22   87.96         89.72                86.27     89.36            95.07
3        35   85.94   88.79         90.29                84.73     89.79            95.79
4        37   88.28   88.72         89.22                79.69     87.79            94.79
Table 9. Comparison of precision for all k-NN classifiers
In Table 9, the precision of the different k-NN classifiers is compared. Vafeiadis et al. [59] applied machine learning algorithms to a customer churn data set and proposed precision as one of the comparison measures for machine learning algorithms. In our research it is found that TSR-FkNN (applying silhouette) has the highest precision of all the algorithms.
Recall comparison:
Sr. No   k    k-NN    F-kNN (SSE)   F-kNN (silhouette)   TSR-kNN   TSR-FkNN (SSE)   TSR-FkNN (silhouette)
1        31   55.03   90.91         88.91                52.01     93.91            93.91
2        33   67.92   91.91         91.92                68.03     91.92            92.91
3        35   55.88   91.91         90.92                59.59     92.91            93.92
4        37   60.97   90.91         90.92                53.16     92.91            92.91
Table 10. Comparison of recall for all k-NN classifiers
In Table 10, the recall of the different k-NN classifiers is compared. Patel et al. [60] proposed recall as one of the measures for comparing machine learning classifiers in share price prediction. In our research it is found that the recall of the TSR-FkNN (applying silhouette) classifier is the highest; all other classifiers have lower recall.
F1 measure comparison:
Sr. No   k    k-NN    F-kNN (SSE)   F-kNN (silhouette)   TSR-kNN   TSR-FkNN (SSE)   TSR-FkNN (silhouette)
1        31   63.93   90.68         88.77                63.59     90.79            93.24
2        33   78.23   90.36         90.80                76.07     90.62            93.98
3        35   67.72   90.8          90.60                69.97     92.28            94.84
4        37   72.13   89.8          90.06                63.78     90.28            93.84
Table 11. Comparison of F1 measure for all k-NN classifiers
Table 11 compares the F1 measure of the k-NN classifiers. Kanj et al. [61] adopted F1 as the measure for comparing training-data-editing machine learning algorithms. In our research it is found that TSR-FkNN (applying silhouette) has the highest F1 measure; every other classifier has a lower F1 value.
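The F1 values in Table 11 follow from the precision and recall tables via the harmonic mean. A quick sanity check in Python (an illustration, not the thesis code), using the k = 35 entries for TSR-FkNN (applying silhouette) from Tables 9 and 10:

```python
def f1(precision, recall):
    """F1 measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision 95.79 (Table 9) and recall 93.92 (Table 10) for
# TSR-FkNN (applying silhouette) at k = 35
print(round(f1(95.79, 93.92), 2))  # close to the 94.84 reported in Table 11
```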
Training set comparison:
Table 12 compares the number of training instances used by the k-NN classifiers. Witten et al. [62] dedicate a chapter to reduction techniques for instance-based learning algorithms; here, the training-set-reduction machine learning algorithms are compared on that basis. In our research, TSR-FkNN (applying SSE) has the fewest training instances when the value of k is 33 and 35, all other classifiers use more, and plain k-NN uses the most. The training sets of all classifiers other than k-NN are reduced by the novel techniques designed for this research.
Sr. No | k  | k-NN | F-kNN (SSE) | F-kNN (silhouette) | TSR-kNN | TSR-FkNN (SSE) | TSR-FkNN (silhouette)
1      | 31 | 7000 | 141         | 201                | 3005    | 141            | 131
2      | 33 | 7000 | 131         | 201                | 2855    | 61             | 151
3      | 35 | 7000 | 131         | 191                | 2970    | 71             | 131
4      | 37 | 7000 | 181         | 191                | 3040    | 111            | 141
Table 12. Comparison of training instances for all k-NN classifiers
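The reduction behind Table 12 replaces most of the 7000 raw training instances with cluster prototypes. A minimal numpy-only sketch of that idea on synthetic two-class data (the cluster count and iteration limit are illustrative assumptions, not the thesis implementation):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's algorithm: return k cluster centroids of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        centers = np.array([X[assign == j].mean(0) if (assign == j).any()
                            else centers[j] for j in range(k)])
    return centers

def knn_predict(Xtr, ytr, x, k=3):
    """Brute-force k-NN: majority vote among the k nearest training points."""
    idx = np.argsort(((Xtr - x) ** 2).sum(1))[:k]
    vals, counts = np.unique(ytr[idx], return_counts=True)
    return vals[counts.argmax()]

rng = np.random.default_rng(1)
# Synthetic two-class data standing in for soil samples
X = np.vstack([rng.normal(0, 1, (500, 2)), rng.normal(5, 1, (500, 2))])
y = np.array([0] * 500 + [1] * 500)

# Replace each class's 500 samples with 5 k-means centroids
X_red = np.vstack([kmeans(X[y == c], 5) for c in (0, 1)])
y_red = np.array([0] * 5 + [1] * 5)
print(len(X_red), "training instances instead of", len(X))
```

Classification then runs against the 10 prototypes instead of the 1000 originals, which is the source of the storage and time savings reported above.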
Classification time comparison:
Sr. No | k  | k-NN | F-kNN (SSE) | F-kNN (silhouette) | TSR-kNN | TSR-FkNN (SSE) | TSR-FkNN (silhouette)
1      | 31 | 5766 | 199         | 240                | 1021    | 248            | 261
2      | 33 | 5779 | 191         | 255                | 1017    | 143            | 219
3      | 35 | 5777 | 217         | 248                | 1039    | 180            | 192
4      | 37 | 5746 | 233         | 240                | 1343    | 217            | 217
Table 13. Comparison of classification time for all k-NN classifiers
Table 13 compares all classifiers in terms of classification time in milliseconds. Williams et al. [57] applied this method to compare the running time of five classifiers, and Bost et al. [58] compared machine learning algorithms on encrypted data sets by execution time. In our research it is observed that TSR-FkNN (applying SSE) has the lowest classification time when the value of k is 33 and 35.
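Timings like those in Table 13 can be reproduced by wrapping the prediction loop in a wall-clock timer. A hedged sketch with synthetic data and a brute-force k-NN stand-in (the sizes mirror Table 12 but this is not the thesis code):

```python
import time
import numpy as np

def knn_predict(Xtr, ytr, x, k=3):
    """Brute-force k-NN majority vote."""
    idx = np.argsort(((Xtr - x) ** 2).sum(1))[:k]
    vals, counts = np.unique(ytr[idx], return_counts=True)
    return vals[counts.argmax()]

rng = np.random.default_rng(0)
X_full, y_full = rng.normal(size=(7000, 4)), rng.integers(0, 3, 7000)
X_red, y_red = rng.normal(size=(140, 4)), rng.integers(0, 3, 140)  # stand-in reduced set
queries = rng.normal(size=(100, 4))

times = {}
for name, Xtr, ytr in [("k-NN (7000 instances)", X_full, y_full),
                       ("reduced (140 instances)", X_red, y_red)]:
    t0 = time.perf_counter()
    for q in queries:
        knn_predict(Xtr, ytr, q)
    times[name] = (time.perf_counter() - t0) * 1000  # milliseconds
    print(f"{name}: {times[name]:.1f} ms")
```

Because brute-force k-NN scans every stored instance per query, the reduced set classifies markedly faster, which is the pattern the table reports.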
6. Conclusion
• Storage reduction:
  • The storage requirement of k-NN is very high in comparison to the other algorithms.
  • TSR-FkNN (applying SSE) has the lowest storage requirement when the value of k is 33 and 35, followed by F-kNN and TSR-FkNN (applying silhouette). Hence, in terms of storage, TSR-FkNN and F-kNN are efficient.
• Execution time:
  • Execution time is highest for k-NN, followed by TSR-kNN, as they store a larger number of training instances.
  • Execution time is lowest for TSR-FkNN (applying SSE), followed by F-kNN, as they store fewer training instances.
• Generalization accuracy, precision, recall and F1 measure:
  • The generalization accuracy of TSR-FkNN (applying silhouette) is the highest of all the algorithms; hence, in terms of accuracy, TSR-FkNN (applying silhouette) is recommended.
  • The precision of TSR-FkNN (applying silhouette) is the highest of all the algorithms; hence, in terms of precision, TSR-FkNN (applying silhouette) is recommended.
  • The recall and F1 measure of TSR-FkNN (applying silhouette) are the highest of all the algorithms; hence, in terms of recall and F1, TSR-FkNN (applying silhouette) is recommended.
Considering the time, space and accuracy comparisons together, the proposed novel hybrid algorithm TSR-FkNN (applying silhouette) is the best algorithm, and it can therefore be recommended for classifying soil samples into their respective nutrient-deficiency categories.
7. Achievements with respect to objectives
• We have successfully applied the machine learning classifier k-NN to the agriculture soil health card data set to classify soil samples into particular classes of nutrient deficiency.
• The design of the novel k-NN variants F-kNN, TSR-kNN and TSR-FkNN was carried out successfully.
• Noticeable performance improvements in time and space complexity are observed for the proposed work.
8. Published papers
• “A Novel Framework for Association Rule Mining to observe Crop Cultivation Practices
based on Soil type”, International Journal of Computer Science and Information Security
(IJCSIS), Vol. 14, No. 9, September 2016.
• “Evaluation of Effectiveness of k-Means Cluster based Fast k-Nearest Neighbor
classification applied on Agriculture Dataset”, International Journal of Computer Science
and Information Security (IJCSIS), Vol. 14, No. 10, October 2016.
• “Reducing execution time of Machine Learning Techniques by Applying Greedy
Algorithms for Training Set Reduction”, International Journal of Computer Science and
Information Security (IJCSIS), Vol. 14, No. 12, December 2016.
• “Towards the new Similarity Measures in Application of Machine Learning Techniques
on Agriculture Dataset”, International Journal of Computer Applications (0975 – 8887)
Volume 156 – No 11, December 2016
9. References
[1] Tan, P.-N., Steinbach, M., et al. Introduction to Data Mining. Addison-Wesley, 2006.
[2] Witten, Ian H., et al. “Data Mining: Practical machine learning tools and techniques”. Morgan Kaufmann, 2016.
[3] Maimon, Oded, Lior Rokach, “Data mining and knowledge discovery handbook. Vol. 2. New York: Springer”,
2005.
[4] Kittipong C. , Pasapitch C. et al. “An Empirical Study of Distance Metrics for k-Nearest Neighbor Algorithm”,
ICIE 2015.
[5] Thirumuruganathan, S. “A Detailed Introduction to k-Nearest Neighbor (k-NN) Algorithm”, 2010.
[6] T. Cover, P. Hart, “Nearest neighbor pattern classification,” IEEE Trans. Inf. Theory, vol. 13, no. 1, pp. 21–27,
Jan. 1967.
[7] T. Denoeux, “A k-nearest neighbor classification rule based on Dempster–Shafer theory,” IEEE Trans. Syst.,
Man, Cybern., vol. 25, no. 5, pp. 804–813, May 1995.
[8] A. Bosch, A. Zisserman, and X. Muoz, “Scene classification using a hybrid generative/discriminative approach,”
IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 4, pp. 712–727, Apr. 2008.
[9] J. Yang, L. Zhang, J. Yang, and D. Zhang, “From classifiers to discriminators: A nearest neighbor rule induced
discriminant analysis,” Pattern Recognit., vol. 44, no. 7, pp. 1387–1402, 2011.
[10] J. Xu, J. Yang, and Z. Lai, “K-local hyperplane distance nearest neighbor classifier oriented local discriminant
analysis,” Inf. Sci., vol. 232, pp. 11–26, May 2013.
[11] H. Frigui and P. Gader, “Detection and discrimination of land mines in a ground-penetrating radar based on edge
histogram descriptors and a possibilistic K-nearest neighbor classifier,” IEEE Trans. Fuzzy Syst.,vol. 17, no. 1, pp.
185–199, Feb. 2009.
[12] M. Li, M. M. Crawford, and J. Tian, “Local manifold learning-based k-nearest-neighbor for hyperspectral image
classification,” IEEE Trans. Geosci. Remote Sens., vol. 48, no. 11, pp. 4099–4109, Nov. 2010.
[13] T. Mensink, J. Verbeek, F. Perronnin, and G. Csurka, “Distance-based image classification: Generalizing to new
classes at near-zero cost,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 11, pp. 2624–2637, Nov. 2013.
[14] Acharya, Tinku, and Ajoy K. Ray. Image processing: principles and applications. John Wiley & Sons, 2005.
[15] M. L. Raymer, T. E. Doom, L. A. Kuhn, and W. F. Punch, “Knowledge discovery in medical and biological
datasets using a hybrid Bayes classifier/evolutionary algorithm,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol.
33, no. 5, pp. 802–813, Oct. 2003.
[16] H. Frigui and P. Gader, “Detection and discrimination of land mines in a ground-penetrating radar based on edge
histogram descriptors and a possibilistic K-nearest neighbor classifier,” IEEE Trans. Fuzzy Syst., vol. 17, no. 1, pp.
185–199, Feb. 2009.
[17] P. Maji, “Fuzzy–rough supervised attribute clustering algorithm and classification of microarray data,” IEEE
Trans. Syst., Man, Cybern. B, Cybern., vol. 41, no. 1, pp. 222–233, Feb. 2011.
[18] X. Geng, D.-C. Zhan, and Z.-H. Zhou, “Supervised nonlinear dimensionality reduction for visualization and
classification,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 35, no. 6, pp. 1098–1107, Dec. 2005.
[19] Z. Pan, Y. Wang, and W. Ku, "A new general nearest neighbor classification based on the mutual neighborhood
information," Knowledge-Based Systems, 2017. ISSN 0950-7051.
[20] Lu, Huijuan, et al. "A hybrid feature selection algorithm for gene expression data
classification." Neurocomputing (2017).
[21] Pan, Zhibin, Yidi Wang, and Weiping Ku. "A new k-harmonic nearest neighbor classifier based on the multi-
local means." Expert Systems with Applications 67 (2017): 115-125.
[22] Yu, Xiaopeng. "The Research on an adaptive k-nearest neighbors classifier." Cognitive Informatics, 2006. ICCI
2006. 5th IEEE International Conference on. Vol. 1. IEEE, 2006.
[23] Parvin, H., Alizadeh, H., and Minaei-Bidgoli, B., "MKNN: Modified k-Nearest Neighbor," in Proceedings of the
World Congress on Engineering and Computer Science, USA, 2008.
[24] Cedeno, W. and D. Agrafiotis, Using particle swarms for the development of QSAR models based on k-nearest
neighbor and kernel regression. Journal of Computer-Aided Molecular Design, 2003. 17(2-4).
[25] G.L. Ritter, H.B. Woodruff, S.R. Lowry, and T.L. Isenhour, “An Algorithm for a Selective Nearest Neighbor
Decision Rule,” IEEE Trans. Information Theory, vol. 21, pp. 665-669, Nov. 1975.
[26] C.L. Chang, “Finding Prototypes for Nearest Neighbor Decision Rule,” IEEE Trans. Computers, vol. 23, no. 11,
pp. 1179-1184, Nov. 1974.
[27] P.E. Hart, “Condensed Nearest Neighbor Rule,” IEEE Trans. Information Theory, vol. 14, pp. 515-516, May
1968.
[28] D.W. Jacobs and D. Weinshall, “Classification with Non-Metric Distances: Image Retrieval and Class
Representation,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 6, pp. 583-600, June 2000.
[29] Zhang, Bin, and Sargur N. Srihari. "Fast k-nearest neighbor classification using cluster-based trees." IEEE
Transactions on Pattern analysis and machine intelligence 26.4 (2004): 525-528.
[30] A.J. Broder, “Strategies for Efficient Incremental Nearest Neighbor Search,” Pattern Recognition, vol. 23, nos.
1/2, pp. 171-178, Nov. 1986.
[31] A. Farago, T. Linder, and G. Lugosi, “Fast Nearest-Neighbor Search in Dissimilarity Spaces,” IEEE Trans.
Pattern Analysis and Machine Intelligence, vol. 15, no. 9, pp. 957-962, Sept. 1993.
[32] B.S. Kim and S.B. Park, “A Fast k Nearest Neighbor Finding Algorithm Based on the Ordered Partition,” IEEE
Trans. Pattern Analysis and Machine Intelligence, vol. 8, no. 6, pp. 761-766, Nov. 1986.
[33] E. Vidal, “An Algorithm for Finding Nearest Neighbors in (Approximately) Constant Average Time,” Pattern
Recognition Letters, vol. 4, no. 3, pp. 145-157, July 1986.
[34] Wang, Chun-Yan, et al. "A K-Nearest Neighbor Algorithm based on cluster in text classification." Computer,
Mechatronics, Control and Electronic Engineering (CMCE), 2010 International Conference on. Vol. 1. IEEE, 2010.
[35] A. Majumdar and R. K. Ward, “Robust classifiers for data reduced via random projections,” IEEE Trans. Syst.,
Man, Cybern. B, Cybern, vol. 40, no. 5, pp. 1359–1371, Oct. 2010.
[36] A. K. Ghosh, P. Chaudhuri, and C. A. Murthy, “Multiscale classification using nearest neighbor density
estimates,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 36, no. 5, pp. 1139–1148, Oct. 2006.
[37] Q. Hu, P. Zhu, Y. Yang, and D. Yu, “Letters: Large-margin nearest neighbor classifiers via sample weight
learning,” Neurocomputing, vol. 74, no. 4, pp. 656–660, 2011.
[38] C. Domeniconi, D. Gunopulos, and J. Peng, “Large margin nearest neighbor classifiers,” IEEE Trans. Neural
Netw., vol. 16, no. 4, pp. 899–909, Jul. 2005.
[39] G. Parthasarathy and B. N. Chatterji, “A class of new KNN methods for low sample problems,” IEEE Trans.
Syst., Man, Cybern., vol. 20, no. 3, pp. 715–718, May/Jun. 1990.
[40] Q. Gao and Z. Wang, “Center-based nearest neighbor classifier,” Pattern Recognit., vol. 40, no. 1, pp. 346–349,
2007.
[41] B. Li, Y. W. Chen, and Y.-Q. Chen, “The nearest neighbor algorithm of local probability centers,” IEEE Trans.
Syst., Man, Cybern. B, Cybern., vol. 38, no. 1, pp. 141–154, Feb. 2008.
[42] P. Vincent and Y. Bengio, “K-local hyperplane and convex distance nearest neighbour algorithms,” in Proc. Adv.
Neural Inf. Process. Syst. (NIPS), vol. 14. Vancouver, BC, Canada, 2002, pp. 985–992.
[43] S. Hernández-Rodríguez, J. F. Martínez-Trinidad, and J. A. Carrasco-Ochoa, “Fast k most similar neighbor
classifier for mixed data (tree k-MSN),” Pattern Recognit., vol. 43, no. 3, pp. 873–886, 2010.
[44] B. Zhang and S. N. Srihari, “Fast k-nearest neighbor classification using cluster-based trees,” IEEE Trans. Pattern
Anal. Mach. Intell., vol. 26, no. 4, pp. 525–528, Apr. 2004.
[45] A. K. Ghosh, P. Chaudhuri, and C. A. Murthy, “On visualization and aggregation of nearest neighbor classifiers,”
IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 10, pp. 1592–1602, Oct. 2005.
[46] J. Derrac, I. Triguero, S. Garcia, and F. Herrera, “Integrating instance selection, instance weighting, and feature
weighting for nearest neighbor classifiers by coevolutionary algorithms,” IEEE Trans. Syst., Man, Cybern. B, Cybern.,
vol. 42, no. 5, pp. 1383–1397, Oct. 2012.
[47] I. Triguero, S. García, and F. Herrera, “Differential evolution for optimizing the positioning of prototypes in
nearest neighbor classification,” Pattern Recognit., vol. 44, no. 4, pp. 901–916, 2011.
[48] B. P. Prajapati, and D. R. Kathiriya. "Evaluation of Effectiveness of k-Means Cluster based Fast k-Nearest
Neighbor classification applied on Agriculture Dataset." International Journal of Computer Science and Information
Security 14.10 (2016): 800.
[49] Hardy, André. "An examination of procedures for determining the number of clusters in a data set." New
approaches in classification and data analysis. Springer, Berlin, Heidelberg, 1994. 178-185.
[50] Milligan, G. W., & Cooper, M. C. (1985). An examination of procedures for determining the number of clusters
in a data set. Psychometrika, 50(2), 159-179.
[51] Lee, Paul H., et al. "A cluster analysis of patterns of objectively measured physical activity in Hong Kong." Public
health nutrition 16.8 (2013): 1436-1444.
[52] Arbelaitz, Olatz, et al. "An extensive comparative study of cluster validity indices." Pattern Recognition 46.1
(2013): 243-256.
[53] Rendón, Eréndira, et al. "Internal versus external cluster validation indexes." International Journal of computers
and communications 5.1 (2011): 27-34.
[54] Brun, Marcel, et al. "Model-based evaluation of clustering validation measures." Pattern recognition 40.3 (2007):
807-824.
[55] Prajapati, B.P. and Kathiriya, D.R., 2016. Reducing execution time of Machine Learning Techniques by Applying
Greedy Algorithms for Training Set Reduction. International Journal of Computer Science and Information Security,
14(12), p.705.
[56] Wettschereck, D., Aha, D.W. and Mohri, T., 1997. A review and empirical evaluation of feature weighting
methods for a class of lazy learning algorithms. In Lazy learning (pp. 273-314). Springer Netherlands.
[57] Williams, Nigel, Sebastian Zander, and Grenville Armitage. "A preliminary performance comparison of five
machine learning algorithms for practical IP traffic flow classification." ACM SIGCOMM Computer
Communication Review 36.5 (2006): 5-16.
[58] Bost, Raphael, et al. "Machine Learning Classification over Encrypted Data." NDSS. 2015.
[59] Vafeiadis, Thanasis, et al. "A comparison of machine learning techniques for customer churn prediction."
Simulation Modelling Practice and Theory 55 (2015): 1-9.
[60] Patel, Jigar, et al. "Predicting stock and stock price index movement using trend deterministic data preparation
and machine learning techniques." Expert Systems with Applications 42.1 (2015): 259-268.
[61] Kanj, Sawsan, et al. "Editing training data for multi-label classification with the k-nearest neighbor rule."
Pattern Analysis and Applications 19.1 (2016): 145-161.
[62] Witten, Ian H., et al. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, 2016.