
International Journal on Computational Sciences & Applications (IJCSA) Vol.5, No.3, June 2015

DOI:10.5121/ijcsa.2015.5305

Regularized Weighted Ensemble of Deep Classifiers

Shruti Asmita1 and K.K. Shukla2

1Department of Computer Science, Banasthali University, Jaipur-302001, Rajasthan, India

2Department of Computer Science and Engineering, Indian Institute of Technology, Banaras Hindu University, Varanasi-221005, Uttar Pradesh, India

ABSTRACT

An ensemble of classifiers improves classification performance because the decisions of many experts are fused to produce the final prediction. Deep learning is a classification approach in which the basic learning step is followed by fine-tuning stages that improve the precision of learning, and ensembles of deep classifiers offer good scope for research. Feature subset selection is another way of creating the individual classifiers to be fused in an ensemble. All of these ensemble techniques face the ill-posed problem of overfitting. We apply a regularized weighted ensemble of deep support vector machines to three UCI repository problems, the IRIS, Ionosphere and Seed data sets, and thereby improve the generalization of the decision boundary between the classes of each data set. In our experiments, singular value decomposition reduced norm 2 regularization combined with the two-level deep classifier ensemble gives the best result.

KEYWORDS

Deep learning, support vector machine, feature subset selection, singular value decomposition, regularization

1. INTRODUCTION

Machine learning is a domain of computational statistics, a specialized field of prediction making. It aims at artificial learning, i.e. the construction of algorithms that are capable of learning from data [1]. Such learning is based on building a model from training data and then using that model to make decisions on test data. Supervised machine learning [2] is marked by the presence of a supervisor: a training set comprising a number of inputs and their corresponding outputs, i.e. associated labels, is provided to the machine for initial learning and model formation. The resulting model is then used to generate the required output for any input not present in the training set. Unsupervised learning [2], on the other hand, has no such supervisor; it tries to find hidden relations in unlabelled data. Classification and regression are techniques of supervised learning, whereas clustering and self-organizing neural network maps are techniques of unsupervised learning. Other learning approaches include semi-supervised learning, reinforcement learning and developmental learning.

In classification [3], the training data is divided into two or more classes. A model must be formed that can distinguish between the categories and place new input instances in the class to which they belong. The performance measure of classification is the classification accuracy, and the goal of learning is to achieve the best possible classification accuracy on test data. Several classification algorithms have been applied to various datasets, but there is always scope for improving performance through new techniques. Popular classifiers in wide use include the k nearest neighbour classifier, decision tree classifier, frequent pattern classifier, Bayes classifier, rule-based classifier and support vector machine (SVM) classifier [4]. Among these, the SVM classifier [5] is currently the most studied and implemented because of its high accuracy and its ability to model complex non-linear decision boundaries by mapping non-linear data to higher dimensions. Hence both linearly and non-linearly separable data can be classified well by this classifier. Also, because the SVM model is described by its support vectors, the resulting classifier is very compact. Groups of people can often make better decisions than individuals [6]. Likewise, an ensemble of classification models can achieve better classification accuracy than an individual classifier model.

The prediction task may involve time series, where the training data for model generation is recorded over a long span of time; in such cases batch learning is used [7]. In batch learning, the models generated on the individual batches up to the previous time unit are combined into an ensemble that is used to test the current batch of data. The prediction task may also be non-time-series, where the training data contains various instances at one particular time instant. Batch learning is not feasible in such classification problems since all the instances are equally related to each other. Hence, to obtain an ensemble of classifiers, the techniques available for generating the individual models include bagging [6] with bootstrap subsampling, deep learning and feature subset selection. These techniques aim at increasing the diversity of the ensemble of classifiers.

Even an ensemble of classifiers suffers from the ill-posed problem of overfitting. This problem can be handled through regularization. The vector norms applied in the process of regularization handle overfitting by reducing the mean squared distance between the training instances.

This paper deals with three prediction problems: first, predicting the type of Iris plant from among Iris Setosa, Iris Versicolour and Iris Virginica; second, predicting a good or a bad radar return from the ionosphere; and third, predicting the type of wheat kernel from among the Kama, Rosa and Canadian varieties. The predictions are made through a regularized weighted ensemble of deep support vector machine classifiers. The individual models for the ensemble are generated through feature subset selection and deep learning. Weights are assigned to each individual model by the majority voting technique. These weights are then regularized in four variations: norm 1, norm 2, Tikhonov and singular value decomposition (SVD) reduced norm 2 regularization. This form of regularization reduces the curvature of each depression and convolution of the non-linear SVM boundary, so the loss function is modified to promote generalization and provide the essential curve fitting over the input feature vectors for classification. To the best of our knowledge, regularizing ensemble weights in combination with deep learning and these ensemble approaches, as a way of dealing with classifier overfitting in supervised learning, has not yet been applied to such prediction problems. The rest of the paper first describes the datasets and background concepts, and then presents the algorithms, framework, experimental results and comparative analysis.

2. DATA SET

The three prediction problems used in this paper are summarized in Table 1. The training set and test set comprise 70% and 30% of the whole database respectively. The 7:3 ratio is arbitrary, but it is chosen because it is a common practical choice in machine learning experiments.



2.1. IRIS Dataset

The Iris database was created by R.A. Fisher and donated by Michael Marshall in July 1988 [8]. It is a popular dataset that has been used successfully in many prediction and pattern recognition problems. The data set contains 3 classes specifying the type of iris plant: Iris Setosa, Iris Versicolour and Iris Virginica. There are 50 instances per class in the whole dataset. The classification problem is to predict the category of Iris plant. The four attributes, or features, in each record are sepal length (cm), sepal width (cm), petal length (cm) and petal width (cm). Table 2 gives the number of instances of each class in the total, training and test data of the Iris data set. Table 3 lists major previous related work on the Iris data.

Table 1. Instances distribution in training and test set of data

S.No. | Dataset | Year of data set creation | Number of classes | Number of features | Total number of instances | Training number of instances | Test number of instances
1 | IRIS | 1988 | 3 | 4 | 150 | 105 | 45
2 | Ionosphere | 1989 | 2 | 34 | 351 | 246 | 105
3 | Seed | 2012 | 3 | 7 | 210 | 147 | 63

Table 2. Number of instances of each class in total, training and test data set of Iris Data set

S. No. | Class | Total number of instances | Training number of instances | Test number of instances
1 | Iris Setosa | 50 | 35 | 15
2 | Iris Versicolour | 50 | 35 | 15
3 | Iris Virginica | 50 | 35 | 15
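The per-class counts in Table 2 (35 training and 15 test instances out of 50) are what a stratified 70/30 split would produce. The sketch below shows one way such a split could be drawn. The source does not state how the split was made, so the shuffling, the seed and the helper name `stratified_split` are assumptions for illustration only.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_idx, test_idx) holding train_fraction of every class in train."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label][:]
        rng.shuffle(members)  # random assignment within each class
        n_train = round(train_fraction * len(members))
        train_idx.extend(members[:n_train])
        test_idx.extend(members[n_train:])
    return sorted(train_idx), sorted(test_idx)


# Iris has 50 instances per class (Table 2).
labels = ["setosa"] * 50 + ["versicolour"] * 50 + ["virginica"] * 50
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

With 50 instances per class this yields 35 training and 15 test instances for each class, i.e. 105 and 45 in total, matching Tables 1 and 2.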

Table 3. Previous major experiments reported on Iris data set and classification accuracy achieved in each case

S. No. | Year of research | Problem statement | Reported classification accuracy (%)
1 | 2014 | Neuro-fuzzy classifier system [9] | 96.70
2 | 2013 | Evolving neural network ensembles using string genetic algorithms for pattern classification [10] | 93.30
3 | 2012 | Hybrid SVM and decision tree classifier [11] | 97.08
4 | 2012 | Classifier ensemble for SVM [12] | 95.00
5 | 2011 | One class SVM weighted bagging [13] | 92.00
6 | 2010 | Large margin classifier SVM [14] | 95.30
7 | 2010 | Feature subset selection in neural network classifier [15] | 97.00
8 | 2008 | SVM based semi supervised classification [16] | 95.00
9 | 2003 | SVM ensemble with majority voting [17]: SVM | 96.50
9 | 2003 | SVM ensemble with majority voting [17]: Bagging | 96.80
9 | 2003 | SVM ensemble with majority voting [17]: Boosting | 97.20



2.2. Ionosphere Dataset

The Ionosphere database was created at Johns Hopkins University and donated by Vince Sigillito in 1989 [18]. The data was collected with a radar system consisting of a phased array of 16 high-frequency antennas, which records returns from the free electrons in the ionosphere. The two classes to be distinguished are “good” and “bad” radar returns from the ionosphere. Predictions are made on the basis of 34 attributes. This large number of attributes distinguishes this dataset from the other two datasets described in this section. Table 4 shows the number of instances of each class in the total, training and test data of the Ionosphere data set. Table 5 shows major previous similar contributions on the Ionosphere data set.

Table 4. Number of instances of each class in total, training and test data set of Ionosphere Data set

S. No. | Class | Total number of instances | Training number of instances | Test number of instances
1 | Good radar signal | 224 | 168 | 56
2 | Bad radar signal | 127 | 78 | 49

Table 5. Previous major experiments reported on Ionosphere data set and classification accuracy achieved in each case

S. No. | Year of research | Problem statement | Reported classification accuracy (%)
1 | 2014 | Classifier ensemble based on weighted accuracy and diversity [19] | 94.00
2 | 2014 | Weighted classifier ensemble SVM [20] | 94.00
3 | 2013 | Artificial immune recognition through SVM classification [21] | 93.00
4 | 2013 | One class ensemble classifier majority voting approach [22] | 89.80
5 | 2010 | Fast local radial basis function kernel SVM classification [23] | 93.72
6 | 2008 | Oblique decision tree embedded with SVM classification [24] | 92.59
7 | 2008 | SVM infinite ensemble learning [25] | 92.00
8 | 2006 | Evolving ensemble of classifiers with majority voting [26] | 81.00

2.3. Seed Dataset

The Seed database is one of the newer databases and hence has very few previous experiments to its name. The dataset records the geometrical properties of wheat kernels, which are characteristic enough to differentiate the Kama, Rosa and Canadian varieties. The data was collected using X-ray techniques [27]. The seven kernel parameters that form the feature set are area (A), compactness (C = 4*pi*A/P^2), perimeter (P), length of kernel, asymmetry coefficient, width of kernel and length of kernel groove. Table 6 shows the number of instances of each class in the total, training and test data of the Seed data set. Table 7 shows the major previous similar contribution on the Seed data. (To the best of our knowledge, this data has so far been used for similar purposes only by its developers. Hence a single previous work is reported in Table 7.)

Table 6. Number of instances of each class in total, training and test data set of Seed Data set

S. No. | Class | Total number of instances | Training number of instances | Test number of instances
1 | Kama | 70 | 49 | 21
2 | Rosa | 70 | 49 | 21
3 | Canadian | 70 | 49 | 21

Table 7. Previous major experiments reported on seed data set and classification accuracy achieved in each case

S. No. | Year of research | Problem statement | Reported classification accuracy (%)
1 | 2012 | Complete gradient clustering with K Mean Algorithm [28] | 92.00

3. BACKGROUND APPROACH

3.1. SVM Classifier

The origin of SVM classifiers lies in the Vapnik-Chervonenkis (VC) dimension. The VC dimension is defined on a set of functions: it is the maximum number of points that can be separated in all possible ways by that set of functions. Data that is not linearly separable is transformed to higher dimensions to achieve classification through SVM (figure 1). The margin between the classes can be a soft margin or a hard margin (figure 2). A soft margin classifier tolerates some misclassified training instances and compensates for them in the model. A hard margin classifier does not allow any misclassification; instead, it draws a strict boundary to avoid it. SVM classifies the data by optimizing the hinge loss function. Soft margin classification is more prevalent than hard margin classification since the latter is highly prone to overfitting.
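As a small illustration of the hinge loss mentioned above, the sketch below evaluates max(0, 1 - y·f(x)) for a few decision scores. The labels and scores are made-up illustrative values, not data from the experiments, and the function names are hypothetical.

```python
def hinge_loss(y, score):
    """Hinge loss for a label y in {-1, +1} and a real-valued decision score f(x)."""
    return max(0.0, 1.0 - y * score)


def mean_hinge_loss(labels, scores):
    """Average hinge loss over a set of instances."""
    return sum(hinge_loss(y, s) for y, s in zip(labels, scores)) / len(labels)


# Illustrative values only: two confidently correct points, one correct point
# inside the margin, and one misclassified point.
labels = [+1, +1, -1, -1]
scores = [2.0, 0.5, -3.0, 0.5]

print([hinge_loss(y, s) for y, s in zip(labels, scores)])
print(mean_hinge_loss(labels, scores))
```

Points that are correctly classified with a margin of at least 1 contribute zero loss, while points inside the margin or on the wrong side contribute a loss that grows linearly. A soft margin classifier trades this loss against the margin width.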

Figure 1. SVM



3.2. Ensemble of classifiers

Ensemble learning is the process of training multiple learning machines individually and then combining their outputs, similar to a committee of decision makers. The principle behind this method of decision making is that the individual predictions, combined appropriately, should have better overall accuracy, on average, than any individual committee member [29]. The principal aggregation methods applied in ensemble learning are voting techniques such as majority voting, Borda count aggregation, behaviour knowledge based aggregation and dynamic classifier selection [30]. Of these, our proposed learning technique uses majority voting [31]. The three versions of majority voting are unanimous voting, simple voting and plurality voting. Plurality voting is the most optimal form of majority voting.

Majority voting as used in this paper aims at giving higher weight to the more qualified experts in the ensemble of classifiers. The expertise of a classifier is taken to be inversely proportional to its classification error.
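The weighting idea can be sketched as follows. The text only states that expertise is inversely proportional to classification error, so the exact form used here (weights proportional to 1/error, normalized to sum to one) and the function names are assumptions rather than the authors' formula.

```python
from collections import defaultdict


def expertise_weights(errors, eps=1e-9):
    """Weights proportional to 1/error, normalized to sum to 1 (assumed form)."""
    raw = [1.0 / (e + eps) for e in errors]  # eps guards against a zero error
    total = sum(raw)
    return [r / total for r in raw]


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight among the classifiers."""
    tally = defaultdict(float)
    for label, w in zip(predictions, weights):
        tally[label] += w
    return max(tally, key=tally.get)


# Illustrative error rates for three hypothetical classifiers.
errors = [0.05, 0.20, 0.20]
weights = expertise_weights(errors)
print(weights)
print(weighted_vote(["A", "B", "B"], weights))
```

In this example the single low-error classifier outvotes the two weaker ones, whereas with equal weights the plain majority ("B") would win.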

Figure 2. Hard Margin SVM plot

3.3. Feature Subset Selection

Feature selection algorithms attempt to select features that are useful and deselect features that are unhelpful or destructive to learning [32]. Feature subset selection is an important pre-processing phase in machine learning [33]. Some features may be removed entirely in this phase. However, a removed feature may become important when combined with other features. This disadvantage of feature selection can be overcome by using it in ensemble learning, where several combinations of features are selected by some algorithm to form the individual models of the ensemble. Selection algorithms include exhaustive selection (evaluation of all possible subsets of features), branch and bound selection (evaluation using the branch and bound algorithm), sequential forward selection (SFS) (select the best single feature, then add one feature at a time, choosing the one whose addition maximizes decision accuracy), sequential backward selection (SBS) (start with all the features, then remove one feature at a time, choosing the one whose removal maximizes decision accuracy) and best individual feature selection (evaluate all N features individually and then take the best set of features) [34]. SFS is a bottom-up procedure and SBS is a top-down procedure. Exhaustive selection is the ideal approach but is feasible only when the number of attributes is small; otherwise the number of possible combinations grows exponentially and becomes unmanageable.
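The exponential growth mentioned above is easy to see by enumerating the subsets directly. The following sketch lists every non-empty feature subset for the four Iris attributes and counts how many there would be for 34 attributes; the helper name `all_feature_subsets` is hypothetical.

```python
from itertools import combinations


def all_feature_subsets(features):
    """Yield every non-empty subset of the given features."""
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            yield subset


iris_features = ["sepal length", "sepal width", "petal length", "petal width"]
subsets = list(all_feature_subsets(iris_features))
print(len(subsets))   # 2**4 - 1 = 15 candidate models for Iris

# For d features there are 2**d - 1 non-empty subsets.
print(2 ** 34 - 1)    # 17179869183 for the 34 Ionosphere attributes
```

Fifteen SVMs are cheap to train, but more than 17 billion are not, which is why exhaustive selection is only practical for a small number of attributes.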



3.4. Deep Learning

Deep SVM is inspired by the success of deep neural networks [35], deep belief networks [35] and deep Boltzmann machines [36]. A multilayer perceptron with many hidden layers is an example of deep learning. Deep learning is a family of machine learning techniques that learn multiple levels of representation in deep architectures [37]. Conventional classifiers may get trapped in local optima of the objective function, whereas deep architectures learn feature representations through both supervised training and fine tuning in the further, deeper phases of learning. The first phase of deep SVM is the standard training process. In the second phase, the kernel activations of the support vectors of the first phase are used as inputs to another SVM, and so on up to whatever level of tuning is required [38]. Usually the tuning starts to repeat after 3-4 levels of deep learning. This training procedure is greedy in nature, which makes it computationally very efficient. Forming an ensemble of the phases of deep learning further increases the precision of the model. However, even with fine tuning, the model function still overfits the data points because of the non-linear kernel activation learning.
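To make the second phase concrete, the sketch below computes the kernel activations of a sample against a set of support vectors, which would then serve as the input features of the next SVM. It assumes an RBF kernel, and the support vectors and gamma value are illustrative placeholders; no SVM is actually trained here.

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF kernel exp(-gamma * ||x - z||^2) between two feature vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def kernel_activations(x, support_vectors, gamma):
    """Map x to one activation per support vector of the previous level."""
    return [rbf_kernel(x, sv, gamma) for sv in support_vectors]


# Placeholder support vectors standing in for those of a trained first-level SVM.
support_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
x = [1.0, 0.0]

features = kernel_activations(x, support_vectors, gamma=0.5)
print(features)
```

Each sample is thereby re-expressed by its similarity to the support vectors, so the next-level SVM sees as many input dimensions as there are support vectors at the previous level.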

3.5. Regularization

The concept of regularization came into existence in the 1990s. In supervised machine learning problems, accurate prediction is more important than a close fit of the function to the training data. Hence generalization is preferred, or in other words, overfitting of the function has to be kept in check. In figure 3 the blue curve is a degree 2 curve, the red curve is a degree 4 curve and the green curve is a degree 8 curve, the highest of the three. The green curve fits the boundary between the two classes most closely, but its test accuracy decreases. The blue curve has the lowest training accuracy, but its chance of achieving better test accuracy is the highest. The green curve illustrates overfitting. Hence overfitting can be said to occur when generalization decreases. Regularization is a measure to keep this overfitting in check, and it also makes the problem stable. Regularization restricts the hypothesis space to linear functions or polynomials of a particular degree, depending on the scenario, and smoothness is imposed by placing the function in a reproducing kernel Hilbert space (RKHS). A regularization parameter ‘λ’ associated with the regularization term of the optimization function controls the trade-off between stability and accuracy.

Figure 3. Fitting of classifier on the data set



In ensemble learning, regularization can be applied to the optimization of the loss function. Doing so reduces the degree of the best-fit polynomial and improves test classification accuracy. Alternatively, overfitting can be dealt with by keeping the degree of the best-fit function constant and regularizing the weights associated with the individual classifiers participating in the ensemble. This reduces the curvature of each positive or negative depression in the curve without reducing the degree of the whole curve. Hence the loss function is modified to provide the boundary fitting over the input feature vectors.

Another statistical technique is bootstrap resampling, in which a new dataset DT' is drawn from the previous dataset DT by random sampling with replacement. Bagging is performed by repeating this over several iterations and then performing ensemble learning on the resulting models. For a large DT, the number of individual samples that are absent from a given bootstrapped dataset is large. For a dataset of N samples, the probability that a given training sample is not selected in one draw is (1 - 1/N), and the probability that it is not selected in any of the N draws is (1 - 1/N)^N [1]. As N -> ∞, this tends to 1/e ≈ 0.368. Hence only about 63% of the original training samples are represented in any bootstrapped set. Since bagging reduces variance, it provides an alternative approach to regularization [6], because even if each classifier is individually overfit, the classifiers are likely to be overfit to different things.
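The 63% figure can be checked numerically. The sketch below evaluates (1 - 1/N)^N for growing N and also draws one bootstrap sample to measure the fraction of distinct original samples it contains; the sample size and seed are arbitrary choices for illustration.

```python
import math
import random


def prob_never_selected(n):
    """Probability that a given sample is missed by all n draws with replacement."""
    return (1.0 - 1.0 / n) ** n


def bootstrap_unique_fraction(n, seed=0):
    """Fraction of the n original samples that appear in one bootstrap sample."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n) for _ in range(n)}
    return len(drawn) / n


for n in (10, 100, 10000):
    print(n, prob_never_selected(n))

print("limit 1/e:", 1.0 / math.e)
print("empirical fraction present:", bootstrap_unique_fraction(10000))
```

The analytic value approaches 1/e from below as N grows, and the empirical fraction of samples present in a bootstrap set settles near 1 - 1/e, i.e. roughly 0.63.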

4. PROPOSED WORK

In our work, a regularized ensemble of deep SVM classifiers is used, which shows a marked improvement in classification accuracy on the prediction problems. For training and optimization we have used the popular library libSVM [40,41]. The ensemble of deep classifiers is generated using four different frameworks, shown in fig 4, fig 5, fig 6 and fig 7. Fig 4 shows the ensemble of classifiers based on the feature subset selection framework, where the individual models are formed by training on different feature subsets. Even features that do not contribute well in isolation or in the full combination may work well in some combinations, so this framework explores the best possible decisions over feature combinations. Fig 5 shows the ensemble of deep classifiers level 1, where each individual model is generated by the training in each phase of deep learning. This improves on basic training through the fine tuning of the deep phases. Fig 6 shows the ensemble of deep classifiers level 2, where fine tuning is done at a further level. Fig 7 combines the ideas of fig 4 and fig 6, i.e. an ensemble of deep classifiers with feature subset selection.

Figure 4. Ensemble of classifiers based on feature subset selection framework



Figure 5. Ensemble of deep classifiers level 1 framework

Figure 6. Ensemble of deep classifiers level 2 framework



Figure 7. Ensemble of deep classifiers learning with feature subset selection

For SVM, the loss function optimized is the hinge loss L(f(x), y) = max(0, 1 - y·f(x)). We observed that the regularization technique giving the best accuracy for our proposed work combines the singular value decomposition (SVD) reduced weight matrix, with regularization parameter λ1, and the square of the norm 2 of the weight matrix, with regularization parameter λ2. The other regularization factors are norm 1, norm 2 and Tikhonov regularization. The objective function is described in equation 1:

(1)

Here βi is achieved through regularized majority voting.

Algorithm 1: Regularized ensemble of classifiers using exhaustive feature subset selection

1: Start

2: Find all possible combinations of features

3: Train an SVM classifier on each combination obtained in step 2

4: Estimate the weights {β1, ..., βt} associated with each individual model through regularized majority voting

5: Evaluate the ensemble of classifiers model

6: Report the ensemble model, the classification accuracy on the test data set and the weights {β1, ..., βt}

7: End

Algorithm 2: Regularized ensemble of classifiers using best N feature subset selection

1: Start

2: Train the SVM classifier on each individual feature

3: Record the accuracy obtained for each feature and sort the features by accuracy in descending order

4: Train SVM classifiers on Classifierset = {Best N, Best N-1, Best N-2, ..., Best 1}

5: Estimate the weights {β1, ..., βt} associated with each member of Classifierset through the regularized majority voting technique

6: Evaluate the ensemble of classifiers model

7: Report the ensemble model, the classification accuracy on the test data set and the weights {β1, ..., βt}

8: End
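The construction of the classifier set in Algorithm 2 can be sketched as below: features are ranked by their single-feature accuracy and then grouped into nested subsets (the best N features, the best N-1, and so on). The feature names and accuracies are placeholders for illustration, not measured results, and the SVM training itself is omitted.

```python
def nested_best_subsets(feature_accuracy):
    """Rank features by single-feature accuracy and return the nested subsets
    [best N, best N-1, ..., best 1], each listed in descending order of accuracy."""
    ranked = sorted(feature_accuracy, key=feature_accuracy.get, reverse=True)
    return [ranked[:k] for k in range(len(ranked), 0, -1)]


# Placeholder accuracies of SVMs trained on one feature each (illustrative only).
feature_accuracy = {"f1": 0.60, "f2": 0.90, "f3": 0.75}

classifier_set = nested_best_subsets(feature_accuracy)
for subset in classifier_set:
    print(subset)
```

This yields only N feature subsets to train on instead of the 2^N - 1 required by exhaustive selection, at the cost of never trying combinations that exclude a highly ranked feature.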

Algorithm 3: Regularized ensemble of deep classifiers

1: Start

2: for level = 1 : t

3: Train the SVM classifier on data set D and record the model generated in [Model]

4: Generate new data set D' with the support vectors of the model generated

5: D = D'

6: end for

7: Estimate the weights {β1, ..., βt} associated with each member of [Model] through the regularized majority voting technique

8: Evaluate the ensemble of classifiers model

9: Report the ensemble model, the classification accuracy on the test data set and the weights {β1, ..., βt}

10: End
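The level loop of Algorithm 3 (steps 2 to 6) can be sketched as a small driver that is independent of the SVM implementation. Here `train_fn` is a hypothetical callback standing in for a real SVM trainer such as libSVM; it is assumed to return a model together with the data set built from that model's support vectors. The toy trainer below exists only so that the sketch runs.

```python
def train_deep_levels(dataset, train_fn, levels):
    """Train one model per level, feeding each level the data set derived
    from the previous level's support vectors (steps 2-6 of Algorithm 3)."""
    models = []
    current = dataset
    for _ in range(levels):
        model, next_dataset = train_fn(current)  # step 3 and step 4
        models.append(model)
        current = next_dataset                   # step 5: D = D'
    return models


def toy_train_fn(data):
    """Placeholder trainer: the 'model' records how many samples it saw and
    every second sample is treated as a 'support vector'. Not a real SVM."""
    return {"trained_on": len(data)}, data[::2]


models = train_deep_levels(list(range(16)), toy_train_fn, levels=3)
print([m["trained_on"] for m in models])
```

The list of models returned by the loop is what the weight estimation in step 7 would operate on.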

The regularization parameter λ associated with the regularization term controls the trade-off between stability and accuracy. Many regularization techniques exist, and this remains a topic of ongoing research. First, L1 regularization is the norm 1 regularization factor, which penalizes all the factors equally and focuses on selecting only the relevant factors. Its numerical definition is λ1·||β||_1. The L1 penalty is linear, which tends to produce many points with zero curvature. A disadvantage of this regularizer is slow convergence on large-scale problems. Second, the L2 regularizer minimizes curvature at all points of the curve by applying a penalty that scales with the square of the curvature. Its numerical definition is λ1·||β||_2. The complexity of L2 regularization is greater than that of L1 regularization. Third, the Tikhonov regularizer is a special case of L2 regularization, numerically defined by the term (λ1)^2·(||β||_2)^2. Finally, the SVD reduced norm 2 regularization is represented as λ1·SVD(β) + λ2·||β||_2. SVD has multiple roles: it can be viewed as a method for transforming correlated variables into a set of uncorrelated ones that better expose the relationships among the original data items, as a method for identifying and ordering the dimensions along which the data points exhibit the most variation, and as a method for data reduction that finds the best approximation of the original data points using fewer dimensions. The regularization path varies with the experimental conditions.
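The three norm-based penalty terms defined above can be evaluated directly for a given weight vector, as in the sketch below. The weight vector and λ value are illustrative. The SVD term is left out because the text does not specify precisely how SVD(β) is computed from the weights.

```python
import math


def l1_norm(beta):
    return sum(abs(b) for b in beta)


def l2_norm(beta):
    return math.sqrt(sum(b * b for b in beta))


def norm1_penalty(beta, lam):
    """Norm 1 term: lam * ||beta||_1."""
    return lam * l1_norm(beta)


def norm2_penalty(beta, lam):
    """Norm 2 term: lam * ||beta||_2."""
    return lam * l2_norm(beta)


def tikhonov_penalty(beta, lam):
    """Tikhonov term as defined in the text: lam^2 * ||beta||_2^2."""
    return (lam ** 2) * l2_norm(beta) ** 2


# Illustrative ensemble weights and regularization parameter (0 < lam < 0.5).
beta = [0.5, 0.3, 0.2]
lam = 0.1
print(norm1_penalty(beta, lam))
print(norm2_penalty(beta, lam))
print(tikhonov_penalty(beta, lam))
```

For the same λ, the Tikhonov term is the square of the norm 2 term, so with λ < 0.5 and weights that sum to one it is much smaller than the other two.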

5. EXPERIMENTS

In all the experiments listed below, an SVM classifier is used because it evaluates dot products of vectors in the higher-dimensional space to construct the dividing boundary. The choice of kernel function depends on the boundary to be modelled. A polynomial kernel models feature conjunctions up to the order of the polynomial. A radial basis function (RBF) kernel allows circular boundaries in higher dimensions. A linear kernel allows linear boundaries in higher dimensions. Multiclass classification is best achieved through RBF. If γ is the kernel bandwidth parameter and (Xi, Xj) is the pair of vectors to be transformed to higher dimensions, equation 2 shows the RBF kernel:

K(Xi, Xj) = exp(-γ·||Xi - Xj||^2)    (2)



The other important algorithm used is grid search for parameter estimation. In v-fold cross-validation, the training set is divided into v subsets of equal size. Classifiers are trained on v-1 subsets and tested on the remaining subset, with each subset serving once as the test subset. Hence each instance is predicted exactly once, and the cross-validation accuracy is the percentage of the data that is correctly classified. The parameters (C, γ) are estimated using cross-validation: various combinations of (C, γ) are tried and the one with the best cross-validation accuracy is picked. In the experiments of our proposed work, the libSVM library [40, 41] is used for training multi-class SVMs with the RBF kernel. The features in the training and test datasets were scaled to the range [-1, +1]. 10-fold cross-validation is used to choose the kernel bandwidth parameter γ and the SVM parameter C through grid search. The ranges of C and γ are [2^-10, 2^-9, ..., 2^5] and [2^-5, 2^-4, ..., 2^10] respectively. The ranges of the regularization parameters are 0 < λ1 < 0.5 and 0 < λ2 < 0.5. Five cases of experiment are described below. The results of the bagging technique are listed in Table 8 for comparison.

Case 1: Bagging Ensemble of classifiers

Case 2: Ensemble of classifiers based on feature subset selection

Case 3: Ensemble of classifiers in deep learning level 1

Case 4: Ensemble of classifiers in deep learning level 2

Case 5: Ensemble of classifiers in deep learning level 1 with the feature subset selection

Cases 2, 3, 4 and 5 have subcases for the following regularization schemes:

Setting 1: SVD reduced Norm 2 regularization

Setting 2: Norm 1 regularization

Setting 3: Norm 2 regularization

Setting 4: Tikhonov regularization
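The parameter selection procedure used in all of the cases above (scaling to [-1, +1], a power-of-two grid over C and the kernel bandwidth, and v-fold cross-validation) can be sketched without committing to a particular SVM implementation. In the sketch, `cv_accuracy` is a hypothetical callback that would train and evaluate the SVM for one parameter pair, for example by calling libSVM on the folds; the toy scoring function is a placeholder so that the example runs and is not a real accuracy.

```python
import math


def power_of_two_grid(low_exp, high_exp):
    """Candidate values 2^low_exp, ..., 2^high_exp (inclusive)."""
    return [2.0 ** e for e in range(low_exp, high_exp + 1)]


def scale_to_range(column):
    """Linearly rescale one feature column to [-1, +1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0] * len(column)
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in column]


def k_fold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k folds of near-equal size."""
    return [list(range(i, n_samples, k)) for i in range(k)]


def grid_search(c_grid, gamma_grid, cv_accuracy):
    """Return (best_score, C, gamma) over all pairs in the grid."""
    best = None
    for c in c_grid:
        for gamma in gamma_grid:
            score = cv_accuracy(c, gamma)
            if best is None or score > best[0]:
                best = (score, c, gamma)
    return best


def toy_cv_accuracy(c, gamma):
    """Placeholder score peaking at C = 1, gamma = 0.5; not a real accuracy."""
    return -abs(math.log2(c)) - abs(math.log2(gamma) + 1.0)


c_grid = power_of_two_grid(-10, 5)       # 2^-10 ... 2^5
gamma_grid = power_of_two_grid(-5, 10)   # 2^-5 ... 2^10
folds = k_fold_indices(105, 10)          # e.g. the 105 Iris training instances
best = grid_search(c_grid, gamma_grid, toy_cv_accuracy)
print(len(c_grid) * len(gamma_grid), [len(f) for f in folds], best[1:])
```

In practice the scaling bounds would be computed on the training set and reused for the test set, so that no information from the test data leaks into training.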

Table 8. Results of Bagging ensemble of classifiers in all the three dataset

S.No. | Dataset | Classification accuracy (%) in Bagging Ensemble of classifiers
1 | IRIS Data set | 96.66
2 | Ionosphere Data set | 87.87
3 | Seed Data set | 95.38

For feature subset selection, the IRIS data set uses Algorithm 1, i.e. exhaustive feature subset selection, which is the optimal selection algorithm. For the Ionosphere and Seed data sets, the number of features is larger, so enumerating all possible combinations of attributes is lengthy and highly complex. Hence both use Algorithm 2, i.e. best N feature subset selection. Fig 8 shows the classification accuracy results of the experiments on the IRIS dataset. Fig 9 and fig 10 show 2D and 3D scatter plots respectively, where different colours mark the vectors of different classes. Similarly, fig 11, fig 12 and fig 13 are the corresponding results on the Ionosphere dataset, and fig 14, fig 15 and fig 16 are the corresponding results on the Seed dataset.



Figure 8. Results of experiments on IRIS Dataset

Figure 9. 2D Scatter plot between all pair of attributes in IRIS dataset

Figure 10. 3D Scatter plot between all pair of attributes in IRIS dataset



Figure 11. Results of experiment on Ionosphere dataset

Figure 12. 2D Scatter plot on the best set of features in Ionosphere dataset.

Figure 13. 3D Scatter plot on the best set of features in Ionosphere dataset



Figure 14. Results of experiment on seed dataset

Figure 15. 2D Scatter plot on the best set of features in Seed dataset.



Figure 16. 3D Scatter plot on the best set of features in Seed dataset.

6. OBSERVATION

In all three sets of experiments above, the ensemble of deep classifiers level 2 with SVD reduced norm 2 regularization improves on the major previously reported results, reaching a classification accuracy of nearly 99%. The time taken in this particular case for each dataset is reported in Table 9. The time taken for the Ionosphere data is larger than for the other two datasets because of its comparatively large number of features. Deep learning on the complete feature set gives better results than deep learning with the feature subset selection schemes, because fine tuning in the presence of all the features is better than fine tuning on a feature subset. The penalty in norm 1 regularization eliminates many noise features by driving their coefficients to zero, since it is not differentiable at zero. The penalty in norm 2 regularization, in contrast, uses all the input features in classification because it is differentiable at all points. Hence norm 2 regularization achieves higher-order smoothness for curve estimation.
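The contrast between the two penalties can be seen in a one-dimensional toy problem, which is a standard textbook illustration and not the objective used in the experiments. Minimizing 0.5·(b - a)^2 + λ·|b| gives the soft-thresholding rule, which sets small coefficients exactly to zero, while minimizing 0.5·(b - a)^2 + λ·b^2 only shrinks the coefficient proportionally. The function names are hypothetical.

```python
def l1_shrink(a, lam):
    """Minimizer of 0.5*(b - a)**2 + lam*abs(b): soft thresholding."""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0  # the kink of |b| at zero traps small coefficients at exactly 0


def l2_shrink(a, lam):
    """Minimizer of 0.5*(b - a)**2 + lam*b**2: proportional shrinkage."""
    return a / (1.0 + 2.0 * lam)


lam = 0.5
for a in (0.3, -0.4, 2.0):
    print(a, l1_shrink(a, lam), l2_shrink(a, lam))
```

Under the norm 1 penalty the two small coefficients become exactly zero, whereas under the squared norm 2 penalty every coefficient stays non-zero and is merely scaled down.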

Table 9. Time taken in deep learning level 2 with full feature set ensemble learning with SVD reduced Norm 2 regularization

S. No. | Dataset | Time (sec)
1 | IRIS Data set | 16.37
2 | Ionosphere Data set | 123.78
3 | Seed Data set | 32.78

Next, since the bagging model includes only about 63% of the original training samples in any bootstrapped set (as discussed in section 3.5), the regularization provided by this technique is not as smooth as that of the ensemble of deep classifiers. The regularizers applied above can be analysed on the basis of worst case time complexity. In Norm 1 regularization, a total of (t-1) sum operations are computed in a run of the algorithm, so a time complexity of O(t) is reported. In Norm 2 regularization, a total of (t-1) sum operations, t operations to square all the elements, and 1 square root operation are computed, so a time complexity of O(3t), which is linear in t, is reported. A one degree regularization parameter is applied. In Tikhonov regularization, the time complexity O(3t) is the same as for Norm 2 regularization, but here a two degree regularization parameter is applied. In (SVD+Norm2), two expressions are involved: O(t^2) for the SVD computation, summed with O(3t) for the Norm 2 computation. Since the quadratic term dominates, a time complexity of O(t^2) is reported.
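The operation counts for the two norms can be checked with a small sketch. It assumes the norms are evaluated naively on a vector of t elements; the function names are hypothetical, and the tally follows the counting used above (sums, squarings and one square root, with absolute values not counted). Under this tally the Norm 2 count comes to 2t operations, which like the reported O(3t) is linear in t.

```python
import math

def norm1_with_ops(vector):
    """Return (Norm 1 value, number of sum operations) for a non-empty vector."""
    total = abs(vector[0])
    ops = 0
    for x in vector[1:]:
        total += abs(x)
        ops += 1            # one addition per remaining element: t - 1 in all
    return total, ops

def norm2_with_ops(vector):
    """Return (Norm 2 value, number of operations) for a non-empty vector."""
    squares = [x * x for x in vector]
    ops = len(vector)       # t squaring operations
    total = squares[0]
    for s in squares[1:]:
        total += s
        ops += 1            # t - 1 additions
    ops += 1                # one square root
    return math.sqrt(total), ops

# Hypothetical vector with t = 3 elements.
v = [3.0, -4.0, 12.0]
n1, ops1 = norm1_with_ops(v)
n2, ops2 = norm2_with_ops(v)
print("Norm 1:", n1, "using", ops1, "operations")
print("Norm 2:", n2, "using", ops2, "operations")
```

Both counts grow linearly with t, so either norm is cheap next to the quadratic SVD step reported above.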

7.CONCLUSION

The deep learning approach to improving classification accuracy is very prevalent in the artificial neural network field, whereas the deep SVM classifier is still an emerging concept. The experiments here indicate good scope for deep learning with SVM classifiers. Regularization of deep learning has further improved the classification accuracy. Many other regularization techniques could be applied for comparison and better results. Other feature selection strategies such as SFS and SBS could also be applied for feature subset selection.

REFERENCES

[1] Rob Schapire, “Theoretical Machine Learning”, COS 511, Lec No. 1, p. 1-6, 2008

[2] R. Sathya, Annamma Abraham, “Comparison of Supervised and Unsupervised Learning Algorithms for Pattern Classification”, IJARAI, Vol 2, No. 2, 2013

[3] D. Michie, D.J. Spiegelhalter, C.C. Taylor, “Machine Learning, Neural and Statistical Classification”, Tutorial section 2.1, p. 6-16, 1994

[4] S. B. Kotsiantis, “Supervised Machine Learning: A Review of Classification Techniques”, Informatica,

Vol 31, p. 249-268, 2007

[5] Koby Crammer, Yoram Singer, “On the Algorithmic Implementation of Multiclass Kernel-based

Vector Machines”, JMLR 2, p. 256-295, 2001

[6] Hal Daume III, “A course in Machine Learning”, Ensemble learning CIML, V0-8, Ch 1,p. 148-155,

2012

[7] A. Vergara, Shankar Vembu, Tuba Ayhan, Margaret A. Ryan, Margie L. Homer, Ramon Huerta, “Chemical gas sensor drift compensation using classifier ensembles”, Sensors and Actuators B, Vol. 166-167, 2012.

[8] Fisher’s IRIS dataset, UCI repository, https://archive.ics.uci.edu/ml/datasets/Iris, 1988

[9] Vaishali Arya, R.K. Rathy, “An Efficient Neuro-Fuzzy Approach for Classification of Iris Dataset”, International Conference on Reliability, Optimization and Information Technology, p. 161-165, 2014.

[10] Xiaoyang Fu and Shuqing Zhang, “Evolving Neural Network Ensembles Using Variable String Genetic Algorithm for Pattern Classification”, Sixth International Conference on Computational Intelligence, p. 81-85, 2013.

[11] Anshu Bharadwaj, Sonajharia Minz, “Hybrid Approach for Classification using Support Vector

Machine and Decision Tree”, International Conference on Advances in Electronics, Electrical and

Computer Science Engineering, p. 337-341, 2012.

[12] Hamid Parvin, Sajad Parvin, “Robust Classifier Ensemble for Improving the Performance of

Classification”, Eleventh Mexican International Conference on Artificial Intelligence, IEEE special

session, Vol 11, p. 52-57, 2012.

[13] Xue-Fang Chen, Hong-Jie Xing, Xi-Zhao Wang, “A modified AdaBoost method for one-class SVM

and its application to novelty detection”, IEEE, Vol 11 p. 3506-3511, 2011.

[14] Hakan Cevikalp, Bill Triggs , Hasan Serhan Yavuz , Yalc, Mahide, Atalay Barkana, “Large margin

classifiers based on affine hulls” Elsevier, Vol 73, p. 3160-3168, 2010.

[15] A. Marcano-Cedeño, J. Quintanilla-Domínguez, M.G. Cortina-Januchs, D. Andina, “Feature Selection Using Sequential Forward Selection and classification applying Artificial Metaplasticity Neural Network”, IEEE, No. 36, p. 2845-2850, 2010

[16] Narendra S. Chaudhari, Aruna Tiwari, Jaya Thomas,“Performance Evaluation of SVM Based Semi-

supervised Classification Algorithm”, 10th Intl. Conf. on Control, Automation, Robotics and Vision,

No. 10, p. 1942-1947, 2008

[17] Hyun-Chul Kim, Shaoning Pang, Hong-Mo Je, Daijin Kim, Sung Yang Bang, “ Constructing support

vector machine ensemble”, The journal of the pattern recognition society, Vol 36, p. 2757-2767, 2003.



[18] Vince Sigillito, Ionosphere Dataset , UCI repository, https://archive.ics.uci.edu/ml/datasets/Ionosphere,

1989

[19] Xiaodong Zeng, Derek F. Wong, Lidia S. Chao, “Constructing Better Classifier Ensemble Based on

Weighted Accuracy and Diversity Measure”, The Scientific World Journal, Volume 2014, Article No.

961747, p. 1-12, 2014

[20] Shasha Mao, Licheng Jiao, Lin Xiong, Shuiping Gou, Bo Chen, Sai-Kit Yeung, “Weighted classifier ensemble based on quadratic form”, Elsevier, Vol 48, Issue 5, p. 1688-1706, 2014

[21] Darwin Tay, Chueh Loo Poh, Richard I. Kitney, “An Evolutionary Data-Conscious Artificial Immune Recognition System”, Proceedings of the 15th annual conference on Genetic and evolutionary computation, p. 1101-1108, 2013

[22] Eitan Menahem, Lior Rokach, Yuval Elovici, “Combining One-Class Classifiers via Meta Learning”,

Proceedings of 22 ACM international conference on information & knowledge management, No. 22,

p. 2435-2440, 2013

[23] Nicola Segata , Enrico Blanzieri, “Fast and Scalable Local Kernel Machines”, JMLR, Vol 1, p. 1883-

1926, 2010

[24] Vlado Menkovski, Ioannis T. Christou, Sofoklis Efremidis, “Oblique Decision Trees Using

Embedded Support Vector Machines in Classifier Ensembles” , Vol 11, p. 1-6, 2008

[25] Hsuan-Tien Lin , Ling Li, “Support Vector Machinery for Infinite Ensemble Learning”, JMLR , Vol

9, p. 285-312, 2008

[26] Albert Hung-Ren Ko, Robert Sabourin, Alceu de Souza Britto, “Evolving Ensemble of Classifiers in Random Subspace”, Proceedings of the 8th annual conference on Genetic and evolutionary computation, p. 1473-1480, 2006

[27] Gorzata’s Seed Data set, UCI repository, https://archive.ics.uci.edu/ml/datasets/seeds, 2012

[28] M. Charytanowicz, J. Niewczas, P. Kulczycki, P.A. Kowalski, S. Lukasik, S. Zak, “A Complete Gradient Clustering Algorithm for Feature Analysis of X-ray Images”, Information Technology in Biomedicine, Springer-Verlag, p. 15-24, 2010

[29] Gavin Brown, Encyclopaedia of Machine Learning Vol 1, p. 312-320, 2010

[30] Robi Polikar, Ensemble based systems in decision making, IEEE, Vol 6, Issue 3, p. 21-45

[31] Hyun-Chul Kim, Shaoning Pang, Hong-Mo Je, Daijin Kim, and Sung-Yang Bang, Support Vector

Machine Ensemble with Bagging, Springer, LNCS 2388, p. 397-408, 2002

[32] David W. Opitz, “Feature Selection for Ensembles”, American Association for Artificial Intelligence,

AAAI Proceeding No. 99, p.1-6, 1999

[33] Mohamed A. Aly, “Novel Methods for the Feature Subset Ensembles Approach”, International

Journal of Artificial Intelligence and Machine Learning, Vol. 6, No. 4, p. 1-7, 2006

[34] Anil K. Jain, Robert P.W. Duin, Jianchang Mao, “Statistical Pattern Recognition: A Review”, IEEE transactions on pattern analysis and machine intelligence, Vol 22, Issue 1, p. 4-37, 2000

[35] Dong Yu and Li Deng, “Deep Learning and Its Applications to Signal and Information Processing” ,

IEEE processing magazine Vol 28, Issue 1, p. 145-154, 2011

[36] Nitish Srivastava, Ruslan Salakhutdinov, “Multimodal Learning with Deep Boltzmann Machines”, ICML, 25 Annual Conference on learning theory, No. 25, p. 1-9, 2012

[37] Xue-Wen Chen, Xiaotong Lin, “Big Data Deep Learning: Challenges and Perspectives”, IEEE Access, Vol 2, p. 514-525, 2014.

[38] Azizi Abdullah, Remco C. Veltkamp, Marco A. Wiering, “An Ensemble of Deep Support Vector Machines for Image Categorization”, International Conference of Soft Computing and Pattern Recognition, p. 301-306, 2009.

[39] Hal Daume III, From zero to reproducing kernel hilbert spaces in twelve pages or less, p.1-12, 2004

[40] C.-C. Chang, C.-J. Lin, LIBSVM: A Library for Support Vector Machines, Software. Available at http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.

[41] Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin, Practical Guide to Support Vector Classification, Department of Computer Science, National Taiwan University, Taipei 106, Taiwan, p. 1-16, 2003.



Authors

Ms. Shruti Asmita (B.Tech., 2013 – KEC Ghaziabad, Uttar Pradesh Technical University, Lucknow) is an M.Tech. Computer Science scholar at Banasthali University, Jaipur, and is pursuing her research internship at IIT-BHU (CSE), Varanasi. Her research interests include data mining, image processing, machine learning and sensor networks.

Dr. K.K. Shukla (Ph.D., 1993 - Institute of Technology (BHU), Varanasi) is professor and current head of department at Indian Institute of Technology, Banaras Hindu University, Varanasi, India. He received his B.Tech. from APSU, Rewa in 1980, his M.Tech. from IT (BHU) in 1982 and his PhD from IT (BHU) in 1993. He has 30 years of research and teaching experience, more than 120 research papers in reputed journals and conferences, and more than 90 citations. His present research collaborations in India include ISRO and TCS; his research collaborations outside India include INRIA, France and ETS, Canada. He has authored many popular books on neuro-computers, RTS scheduling, fuzzy modelling and image compression. His fields of research include image processing and pattern recognition, fuzzy logic, wireless sensor networks and machine learning.