8/20/2019 Regularized Weighted Ensemble of Deep
1/19
International Journal on Computational Sciences & Applications (IJCSA) Vol.5, No.3, June 2015
DOI:10.5121/ijcsa.2015.5305 47
Regularized Weighted Ensemble of Deep Classifiers
Shruti Asmita1 and K.K. Shukla2
1Department of Computer Science, Banasthali University, Jaipur-302001, Rajasthan, India
2Department of Computer Science and Engineering, Indian Institute of Technology,
Banaras Hindu University, Varanasi-221005, Uttar Pradesh, India
ABSTRACT
An ensemble of classifiers increases classification performance, since the decisions of many experts
are fused to generate the resultant prediction. Deep learning is a classification approach in which,
along with the basic learning technique, fine tuning is performed for improved precision of learning.
Deep classifier ensemble learning offers good scope for research. Feature subset selection is another
technique for creating the individual classifiers to be fused in ensemble learning. All these ensemble
techniques face the ill-posed problem of overfitting. A regularized weighted ensemble of deep
support vector machines performs the prediction analysis on three UCI repository problems, the IRIS,
Ionosphere and Seed data sets, thereby increasing the generalization of the boundary plot between the
classes of the data set. The singular value decomposition reduced norm 2 regularization with the two-level
deep classifier ensemble gives the best result in our experiments.
KEYWORDS
Deep learning, support vector machine, feature subset selection, singular value decomposition,
regularization
1. INTRODUCTION
Machine learning is a domain of computational statistics, a specialized field of prediction making.
It aims at artificial learning, i.e. the construction of algorithms that are capable of
learning from data [1]. Such learning is based on developing a model from training data
and then making decisions on the test data using that model. Supervised machine learning [2] is
marked by the presence of a supervisor: a training set comprising a number of
inputs and their corresponding outputs, i.e. the associated labels, is provided to the machine for initial
learning and model formation. Later, with the help of this model, the required output is
generated for any input not present in the training set. Unsupervised learning [2], on the other hand,
does not have any such supervisor. It tries to find hidden relations in
unlabelled data. Classification, regression etc. are techniques of supervised learning, whereas
clustering, self-organizing neural network maps etc. are techniques of unsupervised learning.
Other learning approaches in existence are semi-supervised learning, reinforcement learning,
developmental learning etc.
In classification [3], the training data is divided into two or more classes. A model is required
that can distinguish between the categories and place new input
instances in the correct class. The performance measure of classification is the
classification accuracy. The goal of any learning lies in achieving the best possible classification
accuracy. Several classification algorithms are applied to various datasets, but there is always scope
for improving performance through new techniques. Machine
learning aims at obtaining high test accuracy. Popular classifiers in wide use
include the k-nearest neighbour classifier, decision tree classifier,
frequent pattern classifier, Bayes classifier, rule-based classifier, support vector machine (SVM)
classifier etc. [4]. Among these, the SVM [5] classifier is the most studied and implemented classifier
these days because of its high accuracy and its ability to model complex non-linear
decision boundaries by mapping non-linear data to higher dimensions. Hence both linear and
non-linear data can be classified well by this classifier. Also, because the model is defined by its
support vectors, the SVM representation is very compact. Groups of
people can often make better decisions than individuals [6]. Likewise, an ensemble of classification
models can yield better classification accuracy than an individual classifier model.
The task of prediction can be a time series problem, where the training data for model generation is recorded
over a long span of time; in such cases batch learning is done [7]. In batch learning, the models
generated on the individual batches up to the previous time unit are ensembled to form the resultant
model for testing the present batch of data. Prediction can also be non-time-series, where the
training data for model generation contains various instances at one particular time instant. Batch
learning is not feasible in such classification, since all the instances are equally related to each
other. Hence, for obtaining the ensemble of classifiers, the techniques available for individual
model generation are bagging [6] with bootstrap subsampling, deep learning, feature subset
selection etc. These techniques aim at increasing the diversity of the ensemble of classifiers.
Even in an ensemble of classifiers, the ill-posed problem of overfitting occurs. This
problem can be handled through regularization. The vector norms applied in the process of
regularization handle overfitting by reducing the mean squared distance between the training
instances.
This paper deals with three prediction problems: first, predicting the type of Iris plant from
among Iris Setosa, Iris Versicolour and Iris Virginica; second, predicting a good or
bad radar return from the ionosphere; and third, predicting the type of wheat kernel
from among the Kama, Rosa and Canadian varieties. The predictions are made through a
regularized weighted ensemble of deep support vector machine classifiers. The individual models
for the ensemble are generated through feature subset selection and deep learning.
Weights are assigned to each individual model by the majority voting technique. These weights are
then regularized through four variations, i.e. norm 1, norm 2, Tikhonov and singular value
decomposition (SVD) reduced norm 2 regularization. This form of regularization reduces the
curvature of each depression and convolution of the non-linear SVM boundary, and hence
the loss function is modified to promote generalization and provide the essential curve fitting over
the input feature vectors for classification. To the best of our knowledge, this technique of
regularizing weights together with deep learning and such ensemble learning approaches in a
supervised machine learning task, to deal with overfitting of the classifiers, has
not yet been applied to such prediction problems. The rest of the paper first discusses the
data sets and background concepts, and then presents the algorithms, framework,
experimental results and comparative analysis.
2. DATA SET
The three prediction problems used in this paper are summarized in table 1. The training set and test
set comprise 70% and 30% of the whole database respectively. This 7:3 ratio is arbitrary,
but it is chosen because it is a common practical ratio in machine learning experiments.
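The 7:3 split can be illustrated with a minimal sketch. The paper does not state how instances were assigned to the two sets, so the per-class (stratified) shuffling below is an assumption, and the function name `stratified_split` and the seed are hypothetical. For three classes of 50 instances each it reproduces the 105/45 totals listed for IRIS in table 1.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)                      # random assignment within the class
        cut = round(train_fraction * len(members))
        train_idx.extend(members[:cut])
        test_idx.extend(members[cut:])
    return sorted(train_idx), sorted(test_idx)

# Three classes of 50 instances each, as in the IRIS data set.
labels = ["setosa"] * 50 + ["versicolour"] * 50 + ["virginica"] * 50
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```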
2.1. IRIS Dataset
The Iris database was created by R.A. Fisher and donated by Michael Marshall in July 1988 [8]. It is
a popular dataset that has been used successfully in several problems related to prediction and
pattern recognition. The data set contains 3 classes specifying the type of iris plant from among
Iris Setosa, Iris Versicolour and Iris Virginica. There are 50 instances per class in the
whole dataset. The classification problem is to predict the category of Iris plant. The four
attributes or features in each record of the dataset are sepal length (cm), sepal width (cm), petal length
(cm) and petal width (cm). Table 2 gives the number of instances of each class in the total,
training and test data of the Iris data set. Table 3 lists major previous related work on the Iris
data.
Table 1. Instances distribution in training and test set of data
S.No. | Dataset | Year of data set creation | Number of classes | Number of features | Total number of instances | Training number of instances | Test number of instances
1 | IRIS | 1988 | 3 | 4 | 150 | 105 | 45
2 | Ionosphere | 1989 | 2 | 34 | 351 | 246 | 105
3 | Seed | 2012 | 3 | 7 | 210 | 147 | 63
Table 2. Number of instances of each class in total, training and test data set of Iris Data set
S. No. | Class | Total number of instances | Training number of instances | Test number of instances
1 | Iris Setosa | 50 | 35 | 15
2 | Iris Versicolour | 50 | 35 | 15
3 | Iris Virginica | 50 | 35 | 15
Table 3. Previous major experiments reported on Iris data set and classification accuracy achieved in each
case
S. No. | Year of research | Problem statement | Reported classification accuracy (%)
1. | 2014 | Neuro-fuzzy classifier system [9] | 96.70
2. | 2013 | Evolving neural network ensembles using string genetic algorithms for pattern classification [10] | 93.30
3. | 2012 | Hybrid SVM and decision tree classifier [11] | 97.08
4. | 2012 | Classifier ensemble for SVM [12] | 95.00
5. | 2011 | One class SVM weighted bagging [13] | 92.00
6. | 2010 | Large margin classifier SVM [14] | 95.30
7. | 2010 | Feature subset selection in neural network classifier [15] | 97.00
8. | 2008 | SVM based semi supervised classification [16] | 95.00
9. | 2003 | SVM ensemble with majority voting [17]: SVM / Bagging / Boosting | 96.50 / 96.80 / 97.20
2.2. Ionosphere Dataset
The Ionosphere database was created at Johns Hopkins University and donated by Vince Sigillito in 1989
[18]. The data was collected with a radar system consisting of a phased array of 16
high-frequency antennas, with the help of which the free electrons in the ionosphere are recorded.
The two classes into which the returns are categorized are “good” and “bad” ionosphere.
Predictions are made on the basis of 34 attributes. This large number of attributes sets this
dataset apart from the other two datasets mentioned in this section. Table 4 shows the number
of instances of each class in the total, training and test data of the Ionosphere data set. Table 5 shows
previous major similar contributions on the Ionosphere data set.
Table 4. Number of instances of each class in total, training and test data set of Ionosphere Data set
S. No. | Class | Total number of instances | Training number of instances | Test number of instances
1 | Good radar signal | 224 | 168 | 56
2 | Bad radar signal | 127 | 78 | 49
Table 5. Previous major experiments reported on Ionosphere data set and classification accuracy achieved
in each case
S. No. | Year of research | Problem statement | Reported classification accuracy (%)
1 | 2014 | Classifier ensemble based on weighted accuracy and diversity [19] | 94.00
2 | 2014 | Weighted classifier ensemble SVM [20] | 94.00
3 | 2013 | Artificial immune recognition through SVM classification [21] | 93.00
4 | 2013 | One class ensemble classifier majority voting approach [22] | 89.80
5 | 2010 | Fast local radial basis function kernel SVM classification [23] | 93.72
6 | 2008 | Oblique decision tree embedded with SVM classification [24] | 92.59
7 | 2008 | SVM infinite ensemble learning [25] | 92.00
8 | 2006 | Evolving ensemble of classifiers with majority voting [26] | 81.00
2.3. Seed Dataset
The Seed database is relatively new and hence has very few previous experiments reported on
it. The dataset records the geometrical properties of the kernel, which are the characteristics used to
differentiate the wheat varieties Kama, Rosa and Canadian. X-ray techniques are used for collecting
the dataset [27]. The seven parameters of wheat kernels that form the feature
set are area (A), compactness (C = 4*pi*A/P^2), perimeter (P), length of kernel,
asymmetry coefficient, width of kernel and length of kernel groove. Table 6 shows the number of
instances of each class in the total, training and test data of the seed data set. Table 7 shows the major
previous similar contribution on seed data. (To the best of our knowledge, this data has so far been
worked upon along similar lines only by its developers. Hence a single previous
work is reported in table 7.)
Table 6. Number of instances of each class in total, training and test data set of Seed Data set
S. No. | Class | Total number of instances | Training number of instances | Test number of instances
1 | Kama | 70 | 49 | 21
2 | Rosa | 70 | 49 | 21
3 | Canadian | 70 | 49 | 21
Table 7. Previous major experiments reported on seed data set and classification accuracy achieved in each
case
S. No. | Year of research | Problem statement | Reported classification accuracy (%)
1 | 2012 | Complete gradient clustering with K Mean Algorithm [28] | 92.00
3. BACKGROUND APPROACH
3.1. SVM Classifier
The origin of SVM classifiers lies in the VC dimension. The VC dimension is defined on a set of functions: it
is the maximum number of points that can be separated in all possible ways (shattered) by that set of
functions. Data that is not linearly separable is transformed to a higher-dimensional space to achieve
classification through SVM (figure 1). The margin between the classes can be a soft margin or a hard
margin (figure 2). A soft margin classifier tolerates some misclassified training instances, for which the
model pays a penalty. A hard margin, however, does not allow any
misclassification; instead, it draws a strict boundary that avoids misclassification. SVM
classifies the data by optimizing the hinge loss function. Soft margin classification is more
prevalent than hard margin classification, since the latter is highly prone to overfitting.
Figure 1. SVM
3.2. Ensemble of classifiers
Ensemble learning is the process of training multiple learning machines individually and then
combining their outputs, similar to a committee of decision makers. The principle behind this
method of decision making is that the individual predictions, combined appropriately, should have
better overall accuracy, on average, than any individual committee member [29]. The main
aggregation methods applied in ensemble learning are voting techniques such as majority
voting, Borda count aggregation, behaviour knowledge based aggregation, dynamic classifier
selection etc. [30]. Of these, our proposed learning technique uses majority voting [31]
aggregation. The three versions of majority voting are unanimous voting, simple voting and
plurality voting. Plurality voting is the optimal form of majority voting.
Majority voting as used in this paper gives higher weight to the more
qualified experts in the ensemble of classifiers. Expertise is inversely proportional to the
classification error.
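A minimal sketch of weighted voting follows, assuming each weight is simply the reciprocal of a classifier's error rate normalized to sum to one. The paper only states the inverse proportionality, so the exact weighting formula, the function names and the error rates below are illustrative assumptions.

```python
from collections import defaultdict

def expertise_weights(error_rates, eps=1e-9):
    """Weights inversely proportional to each classifier's error, normalized to sum to 1."""
    raw = [1.0 / (err + eps) for err in error_rates]   # eps avoids division by zero
    total = sum(raw)
    return [r / total for r in raw]

def weighted_vote(predictions, weights):
    """Return the label with the largest total weight among the classifiers' predictions."""
    score = defaultdict(float)
    for label, weight in zip(predictions, weights):
        score[label] += weight
    return max(score, key=score.get)

# Hypothetical error rates: one strong expert and two weaker ones.
weights = expertise_weights([0.05, 0.20, 0.20])
print(weighted_vote(["Rosa", "Kama", "Kama"], weights))
```

In this example the single low-error classifier outweighs the two weaker classifiers that agree with each other, which is the behaviour that distinguishes weighted voting from a plain head count.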
Figure 2. Hard Margin SVM plot
3.3. Feature Subset Selection
Feature selection algorithms attempt to select the features that are useful and deselect the features
that are unhelpful or detrimental to learning [32]. Feature subset selection is an important
pre-processing phase in machine learning [33]. At times, some features are
removed entirely in this phase. However, these removed features may become important when combined
with other features. This disadvantage of feature selection can be overcome by
using it in ensemble learning. Here, several combinations of features are selected by some
algorithm to form the individual models to be ensembled. The selection algorithms include
exhaustive selection (evaluation of all possible subsets of features), branch and bound selection
(evaluation using the branch and bound algorithm), sequential forward selection (SFS) (select the best
single feature and then add, one at a time, the feature that maximizes decision
accuracy in combination), sequential backward selection (SBS) (start with all the features and remove, one
at a time, the feature whose removal maximizes decision accuracy) and best individual feature selection
(evaluation of all N features individually, then taking the best set of features) etc. [34]. SFS is a bottom-up
procedure and SBS is a top-down procedure. Exhaustive selection is the ideal
approach but is feasible only when the number of attributes is small. Otherwise the number of
possible combinations grows exponentially and becomes impossible to handle.
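The exponential growth can be made concrete with a short sketch that enumerates every non-empty feature subset. The helper name `all_feature_subsets` is hypothetical; the feature counts 4, 7 and 34 are those of the three data sets in table 1.

```python
from itertools import combinations

def all_feature_subsets(features):
    """Yield every non-empty subset of the given feature names."""
    for size in range(1, len(features) + 1):
        yield from combinations(features, size)

iris_features = ["sepal length", "sepal width", "petal length", "petal width"]
subsets = list(all_feature_subsets(iris_features))
print(len(subsets))              # 15 = 2**4 - 1

# N features give 2**N - 1 non-empty subsets, so the count doubles with each feature.
for n in (4, 7, 34):
    print(n, 2 ** n - 1)
```

With 4 features the 15 subsets are easy to train exhaustively, whereas 34 features would require more than 17 billion models.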
3.4. Deep Learning
Deep SVM is inspired by the success of deep neural networks [35], deep belief networks [35]
and deep Boltzmann machines [36] etc. A multilayer perceptron with many hidden layers is an
example of deep learning. Deep learning is a type of machine learning technique that learns
multiple levels of representation in deep architectures [37]. Conventional classifiers may
get trapped in local optima of the objective function, whereas deep
architectures learn the feature representations through both supervised training and fine tuning at
deeper phases of learning. The first phase of deep SVM is the standard training process. Then, in
the second phase, the kernel activations of the support vectors of the first phase are set as inputs for
another SVM, and so on to whatever level of tuning is required [38]. Usually the
tuning starts to repeat after 3-4 levels of deep learning. This training procedure is greedy in
nature, which makes it computationally very efficient. An ensemble over the phases of
deep learning further increases the precision of the model. However, even with fine tuning,
the model function still overfits the data points because of the non-linear kernel activation
learning.
3.5. Regularization
The concept of regularization came into existence in the 1990s. In supervised machine learning
problems, accurate prediction is more important than a close fit of the function to the data.
Hence generalization is preferred, or in other words overfitting of the function has to be checked. In
figure 3 the blue curve is a degree-2 curve, the red curve is a degree-4 curve and the green curve is a
degree-8 curve, the highest of the three. The green curve plots a close-fitting
boundary between the two classes, but its test accuracy decreases. The blue curve shows
the lowest training accuracy, but the chance of improvement in test accuracy is greatest in this
case. The green curve marks overfitting. Hence it can be said that overfitting occurs when
generalization decreases. Regularization is a measure to check this overfitting, and it provides
problem stability. Regularization restricts the hypothesis space to a linear function or a
polynomial of a particular degree, according to the scenario, and smoothness of the function is
provided by placing the function in a reproducing kernel Hilbert space (RKHS). A regularization
parameter ‘λ’ is associated with the regularization term of the optimization function and controls the
trade-off between stability and accuracy.
Figure 3. Fitting of classifier on the data set
In the case of ensemble learning, regularization can be applied to the optimization of the loss
function. By doing this, the degree of the best-fit polynomial is reduced and test classification
accuracy is improved. Alternatively, overfitting can be dealt with by keeping the degree
of the best-fit function constant and regularizing the weight associated with each individual
classifier participating in the ensemble. This reduces the curvature of each positive or
negative depression in the curve without reducing the degree of the whole curve. Hence the loss
function is modified to provide the boundary fitting over the input feature vectors.
Another statistical technique is bootstrap resampling, in which a new dataset DT’ is drawn
from the previous dataset DT by random sampling with replacement. Bagging is performed by
applying this over several iterations and then performing ensemble learning on the resulting models. For a large
DT, the number of individual samples that are not present in a given bootstrapped dataset is
large. The probability that a particular training sample is not selected in one draw is (1 - 1/N), and the
probability that it is not selected at all in N draws is (1 - 1/N)^N [1]. As N -> ∞ this tends to
1/e ≈ 0.368. Hence only about 63% of the original training samples
are represented in any bootstrapped set. Since bagging reduces variance, it provides an alternative
approach to regularization [6], because even if each classifier is individually overfit, they are
likely to be overfit to different things.
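The 63% figure can be checked numerically. The sketch below evaluates (1 - 1/N)^N for a few values of N and compares it with 1/e, then draws one bootstrap sample to count the distinct instances it contains; the sample size of 105 (the IRIS training set size) and the seed are arbitrary choices for illustration.

```python
import math
import random

def fraction_left_out(n):
    """Probability that a given sample is absent from a bootstrap set of size n."""
    return (1.0 - 1.0 / n) ** n

for n in (10, 150, 10_000):
    print(n, round(fraction_left_out(n), 4))
print("limit 1/e:", round(math.exp(-1), 4))

# Empirical check: one bootstrap draw (sampling with replacement) of 105 indices.
rng = random.Random(0)
n = 105
drawn = {rng.randrange(n) for _ in range(n)}
print("unique fraction:", round(len(drawn) / n, 3))
```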
4. PROPOSED WORK
In our work, a regularized ensemble of deep SVM classifiers is used, which shows a marked
improvement in the classification accuracy of the prediction problems. For training and optimization,
we have used the popular library libSVM [40,41]. The ensemble of deep classifiers
is generated using the four frameworks shown in fig 4, fig 5, fig 6 and fig 7. Fig 4 shows the
ensemble of classifiers based on the feature subset selection framework, where the individual models
are formed by training on different feature subsets. Even features that do not
contribute well in isolation or in the full combination may work well in some combinations. This
explores the best possible decisions using feature combinations. Fig 5 shows the ensemble of
deep classifiers level 1, where each individual model is generated by the training in each phase of
deep learning. This improves on the basic training through the fine tuning of the deep phases. Fig 6
shows the ensemble of deep classifiers level 2, where fine tuning is done at a further level. Fig 7
shows a combination of the aims of fig 4 and fig 6, i.e. an ensemble of deep classifiers
learned with feature subset selection.
Figure 4. Ensemble of classifiers based on feature subset selection framework
Figure 5. Ensemble of deep classifiers level 1 framework
Figure 6. Ensemble of deep classifiers level 2 framework
Figure 7. Ensemble of deep classifiers learning with feature subset selection
For SVM, the loss function optimized is the hinge loss L(f(x), y) = max(0, 1 - y.f(x)). It has been
observed that the regularization technique that gives the best accuracy for our proposed work
is the singular value decomposition (SVD) reduced weight matrix with regularization parameter
λ1 and the square of norm 2 of the weight matrix with regularization parameter λ2. The other regularization
factors are norm 1, norm 2 and Tikhonov regularization. The objective function is described in
equation 1:
(1)
Here βi is achieved through regularized majority voting.
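As a small illustration of the hinge loss stated above, the sketch below evaluates it on an ensemble decision value. It assumes that the ensemble output is the β-weighted sum of the members' decision values; that combination rule, the function names and the numbers are assumptions for illustration and are not taken from equation 1.

```python
def hinge_loss(score, label):
    """Hinge loss max(0, 1 - y*f(x)) for a label y in {-1, +1} and a decision value f(x)."""
    return max(0.0, 1.0 - label * score)

def ensemble_score(betas, member_scores):
    """Assumed combination rule: weighted sum of the members' decision values."""
    return sum(b * s for b, s in zip(betas, member_scores))

# Hypothetical weights and member decision values for one input x.
betas = [0.5, 0.3, 0.2]
scores = [1.2, -0.4, 0.9]
f_x = ensemble_score(betas, scores)
print(round(f_x, 2), round(hinge_loss(f_x, +1), 2), round(hinge_loss(f_x, -1), 2))
```

A correctly classified point inside the margin (y = +1 here) still incurs a small loss, while a confidently correct point with y.f(x) >= 1 incurs none.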
Algorithm 1: Regularized ensemble of classifiers using exhaustive feature subset selection
1: Start
2: Find all the possible combinations of features
3: Train the SVM classifier on all combinations obtained in step 2
4: Estimate the weights {β1, ..., βt} associated with each individual model through regularized
majority voting
5: Evaluate the ensemble of classifiers model
6: Report the ensemble model, the classification accuracy on the test data set and the weights {β1, ..., βt}.
7: End
Algorithm 2: Regularized ensemble of classifiers using best N feature subset selection
1: Start
2: Train the SVM classifier on all individual features.
3: Record the accuracy generated and the corresponding feature in descending order.
4: Train SVM classifiers on Classifierset = {Best N, Best N-1, Best N-2, ..., Best 1}
5: Estimate the weights {β1, ..., βt} associated with each member of Classifierset through the
regularized majority voting technique
6: Evaluate the ensemble of classifiers model
7: Report the ensemble model, the classification accuracy on the test data set and the weights {β1, ..., βt}.
8: End
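The ranking and subset construction in Algorithm 2 can be sketched as follows. The sketch assumes that "Best k" means the k features with the highest single-feature accuracy; the function name `best_n_subsets` and the accuracy values are hypothetical and serve only as an illustration.

```python
def best_n_subsets(feature_accuracy):
    """Rank features by single-feature accuracy (descending) and return the
    nested subsets Best N, Best N-1, ..., Best 1."""
    ranked = sorted(feature_accuracy, key=feature_accuracy.get, reverse=True)
    return [ranked[:k] for k in range(len(ranked), 0, -1)]

# Hypothetical single-feature accuracies, for illustration only.
accuracy = {"area": 0.81, "perimeter": 0.78, "compactness": 0.55, "groove length": 0.84}
nested = best_n_subsets(accuracy)
for subset in nested:
    print(subset)
```

This produces only N subsets for N features, compared with 2^N - 1 for the exhaustive selection of Algorithm 1.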
Algorithm 3: Regularized ensemble of deep classifiers
1: Start
2: for level = 1 : t
3: Train the SVM classifier on data set D and record the model generated in [Model]
4: Generate new data set D' with the support vectors of the model generated
5: D = D'
6: end for
7: Estimate the weights {β1, ..., βt} associated with each member of [Model] through the regularized
majority voting technique
8: Evaluate the ensemble of classifiers model
9: Report the ensemble model, the classification accuracy on the test data set and the weights {β1, ..., βt}.
10: End
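The level-by-level loop of Algorithm 3 can be sketched with a pluggable trainer. Here `train` is a caller-supplied function returning a model and its support vectors; it stands in for an SVM trainer and is not a libSVM call, and `toy_train` is a purely illustrative stand-in so that the loop can be run without any SVM library.

```python
def train_deep_ensemble(data, train, levels):
    """Greedy level-by-level training: each level is trained on the
    support vectors produced by the previous level."""
    models = []
    for _ in range(levels):
        model, support_vectors = train(data)
        models.append(model)
        if not support_vectors:        # nothing left to train the next level on
            break
        data = support_vectors         # D = D'
    return models

def toy_train(data):
    """Stand-in trainer: the 'model' is the data size and the 'support
    vectors' are every second sample. Purely illustrative."""
    return len(data), data[::2]

models = train_deep_ensemble(list(range(16)), toy_train, levels=3)
print(models)
```

The returned list plays the role of [Model] in the algorithm; the weights for its members would then be estimated by regularized majority voting.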
The regularization parameter λ associated with the regularization term is an important term that controls
the trade-off between stability and accuracy. Many regularization techniques
exist, and this remains a topic of ongoing research. L1 regularization is the norm 1
regularization factor, which penalizes all the factors equally and focuses on selecting only the
relevant factors. Its numerical definition is λ1.||β||_1. The L1 penalty is linear, which tends to produce
many points with zero curvature. A disadvantage of this regularizer is slow convergence in
large-scale problems. Secondly, the L2 regularizer minimizes curvature at all points in the
curve by applying a penalty that scales with the square of the curvature. Its numerical definition is
λ1.||β||_2. The complexity of L2 regularization is greater than that of the L1 regularizer. Thirdly, the
Tikhonov regularizer is a special case of L2 regularization, numerically defined by the term
(λ1)^2.(||β||_2)^2. Further, the SVD reduced norm 2 regularization is represented as
λ1.SVD(β) + λ2.(||β||_2). SVD has multiple roles: it can be viewed as a method for transforming correlated
variables into a set of uncorrelated ones that better expose the various relationships among the
original data items, a method for identifying and ordering the dimensions along which data points
exhibit the most variation, and a method for data reduction by finding the best approximation of
the original data points using fewer dimensions. The regularization path varies with the experimental
conditions.
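The three norm-based penalty terms can be written out directly from their definitions above. This is a minimal sketch on a hypothetical weight vector; the SVD term is left out because the text does not specify how SVD(β) is reduced to a scalar, and the function names are illustrative.

```python
import math

def l1_penalty(beta, lam):
    """Norm 1 term: lam * sum(|beta_i|)."""
    return lam * sum(abs(b) for b in beta)

def l2_penalty(beta, lam):
    """Norm 2 term: lam * sqrt(sum(beta_i ** 2))."""
    return lam * math.sqrt(sum(b * b for b in beta))

def tikhonov_penalty(beta, lam):
    """Tikhonov term: lam**2 * (norm 2 of beta)**2."""
    return lam ** 2 * sum(b * b for b in beta)

# Hypothetical ensemble weights and a regularization parameter inside (0, 0.5).
beta = [0.6, 0.0, -0.8]
lam = 0.25
print(l1_penalty(beta, lam), l2_penalty(beta, lam), tikhonov_penalty(beta, lam))
```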
5. EXPERIMENTS
In all the experiments listed below, the SVM classifier is used because it evaluates dot products of
vectors in the higher dimension to construct the dividing boundary. The choice of kernel
function depends on the model to be plotted. A polynomial kernel can model feature conjunctions
up to the order of the polynomial. Radial basis functions (RBF) allow plotting circular
boundaries in higher dimensions. A linear kernel allows putting linear boundaries in higher
dimensions. Multiclass classification is best achieved through RBF. If γ is the kernel bandwidth
parameter and (Xi, Xj) is the pair of vectors to be transformed to higher dimensions, equation 2 shows the RBF
kernel equation.
K(Xi, Xj) = exp(-γ ||Xi - Xj||^2)    (2)
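A minimal sketch of the RBF kernel follows, using the standard form exp(-γ ||Xi - Xj||^2) that libSVM implements. The function name and the sample vectors are illustrative.

```python
import math

def rbf_kernel(x_i, x_j, gamma):
    """RBF kernel exp(-gamma * ||x_i - x_j||^2) for two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * squared_distance)

print(rbf_kernel([1.0, 0.0], [1.0, 0.0], gamma=0.5))   # identical points give 1.0
print(rbf_kernel([1.0, 0.0], [0.0, 1.0], gamma=0.5))   # exp(-1), about 0.368
```

A larger γ makes the kernel value fall off faster with distance, so the decision boundary follows the training points more closely.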
Another important algorithm used is grid search for parameter estimation. In v-fold
cross-validation, the training set is divided into v subsets of equal size. Classifiers are trained on
v-1 subsets and tested on the remaining subset. Each instance is thus predicted once, so the
cross-validation accuracy is the percentage of data that is correctly classified. The parameters
(C, γ) are estimated using cross-validation: various combinations of (C, γ) are tried and the one with
the best cross-validation accuracy is picked. In the experiments of our proposed work, the libSVM library
[40, 41] is used for training multi-class SVMs with an RBF kernel. The features in the training and
test datasets were scaled to the range [-1, +1]. 10-fold cross-validation is used for choosing the
kernel bandwidth parameter γ and the SVM C parameter through grid search. The ranges of C and γ are
[2^-10, 2^-9, ..., 2^5] and [2^-5, 2^-4, ..., 2^10] respectively. The ranges for the regularization
parameters λ1 and λ2 are 0 < λ1 < 0.5 and 0 < λ2 < 0.5 respectively. Five cases of experiment are
described below. The results of the bagging technique are listed in table 8 for comparison.
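The experimental setup described above (scaling, the (C, γ) grid and the folds) can be sketched as below. The helper names are hypothetical, the sample feature values are arbitrary, and the interleaved fold assignment is an assumption, since the paper does not say how instances are distributed over the folds. Training and scoring each pair would be done by the SVM library and is not shown.

```python
from itertools import product

def scale_to_unit_range(values):
    """Linearly rescale a feature column to [-1, +1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [2.0 * (v - low) / (high - low) - 1.0 for v in values]

def k_fold_indices(n_samples, k):
    """Split sample indices 0..n_samples-1 into k nearly equal folds."""
    return [list(range(i, n_samples, k)) for i in range(k)]

c_values = [2.0 ** e for e in range(-10, 6)]        # 2^-10 ... 2^5
gamma_values = [2.0 ** e for e in range(-5, 11)]    # 2^-5 ... 2^10
grid = list(product(c_values, gamma_values))

print(len(grid))                                    # 16 * 16 = 256 candidate pairs
scaled = scale_to_unit_range([4.3, 5.8, 7.9])
print(scaled)
folds = k_fold_indices(105, 10)
print([len(fold) for fold in folds])
```

Each of the 256 candidate pairs would be scored by its mean accuracy over the 10 folds, and the pair with the best score kept.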
Case 1: Bagging ensemble of classifiers
Case 2: Ensemble of classifiers based on feature subset selection
Case 3: Ensemble of classifiers in deep learning level 1
Case 4: Ensemble of classifiers in deep learning level 2
Case 5: Ensemble of classifiers in deep learning level 1 with the feature subset selection
Cases 2, 3, 4 and 5 have subcases for the following regularization schemes:
Setting 1: SVD reduced Norm 2 regularization
Setting 2: Norm 1 regularization
Setting 3: Norm 2 regularization
Setting 4: Tikhonov regularization
Table 8. Results of Bagging ensemble of classifiers in all the three dataset
S.No. | Dataset | Classification accuracy (%) in bagging ensemble of classifiers
1. | IRIS Data set | 96.66
2. | Ionosphere Data set | 87.87
3. | Seed Data set | 95.38
For feature subset selection, the IRIS data set uses Algorithm 1, i.e. exhaustive feature subset
selection, which is the optimal selection algorithm. For the Ionosphere and Seed data sets, since the
number of features is large, finding all the possible combinations of attributes is very lengthy and
highly complex. Hence both use Algorithm 2, i.e. best N
feature subset selection. Fig 8 presents the classification accuracy results of the experiments on the
IRIS dataset. Fig 9 and fig 10 present the 2D and 3D scatter plots respectively, where different colours
mark the different class vectors. Similarly, fig 11, fig 12 and fig 13 are the corresponding results
on the Ionosphere dataset, and fig 14, fig 15 and fig 16 are the corresponding results on the Seed dataset.
Figure 8. Results of experiments on IRIS Dataset
Figure 9. 2D Scatter plot between all pair of attributes in IRIS dataset
Figure 10. 3D Scatter plot between all pair of attributes in IRIS dataset
Figure 11. Results of experiment on Ionosphere dataset
Figure 12. 2D Scatter plot on the best set of features in Ionosphere dataset.
Figure 13. 3D Scatter plot on the best set of features in Ionosphere dataset
Figure 14. Results of experiment on seed dataset
Figure 15. 2D Scatter plot on the best set of features in Seed dataset.
Figure 16. 3D Scatter plot on the best set of features in Seed dataset.
6.OBSERVATION
The results of all three sets of experiments above show classification accuracy higher than the major
previously reported results in the case of the level 2 ensemble of deep classifiers with SVD reduced
Norm 2 regularization, where the accuracy is nearly 99%. The time taken in this particular case
for the various datasets is reported in Table 9. Note that the time taken for the Ionosphere data is
larger than for the other two datasets because of its comparatively large number of features.
Deep learning on the complete dataset generates better results than deep learning
on the feature subset selection schemes, because fine tuning in the presence of all the features is
better than fine tuning on a feature subset. The penalty in Norm 1 regularization deletes
many noise features by estimating their coefficients as zero, since it is not differentiable at zero.
The penalty in Norm 2 regularization, in contrast, uses all the input features in classification because
it is differentiable at every point. Hence Norm 2 regularization achieves higher order smoothness for curve estimation.
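The contrast between the two penalties can be illustrated with a minimal sketch. It is not the procedure used in the experiments: it assumes the standard one-dimensional shrinkage rules, namely soft-thresholding for the Norm 1 penalty lam*|w| and the closed-form scaling w/(1+lam) for the squared Norm 2 penalty (lam/2)*w^2. The function names, the coefficient values and the value of lam are hypothetical and chosen only for illustration.

```python
def l1_shrink(coeffs, lam):
    """Soft-thresholding: shrinkage step for the Norm 1 penalty lam*|w| (assumed form)."""
    out = []
    for w in coeffs:
        magnitude = abs(w) - lam
        if magnitude <= 0.0:
            out.append(0.0)  # small coefficients are set exactly to zero
        else:
            out.append(magnitude if w > 0 else -magnitude)
    return out


def l2_shrink(coeffs, lam):
    """Shrinkage step for the squared Norm 2 penalty (lam/2)*w^2 (assumed form)."""
    return [w / (1.0 + lam) for w in coeffs]


# Hypothetical coefficients: two strong features and three weak (noise-like) ones.
coeffs = [2.0, -0.3, 0.1, -1.5, 0.05]
lam = 0.5

l1_result = l1_shrink(coeffs, lam)
l2_result = l2_shrink(coeffs, lam)

print("Norm 1:", l1_result)
print("Norm 2:", l2_result)
print("zeroed by Norm 1:", sum(1 for w in l1_result if w == 0.0))
print("zeroed by Norm 2:", sum(1 for w in l2_result if w == 0.0))
```

In this sketch the Norm 1 rule removes the three weak coefficients entirely, while the Norm 2 rule scales every coefficient down and keeps all of them non-zero, which matches the behaviour described above.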
Table 9. Time taken in deep learning level 2 with full feature set ensemble learning with SVD reduced Norm 2 regularization

S. No.   Dataset               Time (sec)
1.       IRIS Data set         16.37
2.       Ionosphere Data set   123.78
3.       Seed Data set         32.78
Next, since the bagging model includes only about 63% of the original training samples in any
bootstrapped set (as discussed in section 3.5), the regularization provided by this technique is not
as smooth as that of the ensemble of deep classifiers. The regularizers applied above can be analysed
on the basis of worst case time complexity, where t is the number of elements. In Norm 1 regularization,
a total of (t-1) sum operations are computed in a run of the algorithm, so a time complexity of O(t) is
reported. In Norm 2 regularization, there are (t-1) sum operations, t operations to square all the
elements, and 1 square root operation. A time complexity of O(3t) is reported, which is still linear in t.
A regularization parameter of degree one is applied. In Tikhonov regularization, the time complexity
O(3t) is the same as for L2 regularization, but here a regularization parameter of degree two is applied.
In (SVD+Norm2), two expressions are involved: O(t^2) for the SVD computation, summed with
O(3t) for the Norm 2 computation. Hence a time complexity of O(t^2) is reported.
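Two quantitative statements in the paragraph above can be checked with a short sketch. It assumes the usual bagging setup, in which a bootstrapped set of size n is drawn with replacement from n training samples, and it counts operations for the two norms in the way the text describes them. All function names, the seed and the sample size are hypothetical and used only for illustration.

```python
import math
import random


def expected_unique_fraction(n):
    """Expected share of n samples that appear at least once in a bootstrap set of size n."""
    return 1.0 - (1.0 - 1.0 / n) ** n


def simulated_unique_fraction(n, rng):
    """Draw one bootstrap set of size n with replacement and measure its coverage."""
    drawn = {rng.randrange(n) for _ in range(n)}
    return len(drawn) / n


def norm1_with_ops(values):
    """Norm 1 of a vector, counting the (t-1) additions mentioned in the text."""
    total = abs(values[0])
    ops = 0
    for x in values[1:]:
        total += abs(x)
        ops += 1
    return total, ops


def norm2_with_ops(values):
    """Norm 2 of a vector, counting t squarings, (t-1) additions and 1 square root."""
    squares = [x * x for x in values]
    ops = len(values)
    total = squares[0]
    for s in squares[1:]:
        total += s
        ops += 1
    return math.sqrt(total), ops + 1


rng = random.Random(0)  # fixed seed, chosen arbitrarily for repeatability
print("expected coverage :", expected_unique_fraction(1000))
print("simulated coverage:", simulated_unique_fraction(1000, rng))
print("Norm 1 (value, ops):", norm1_with_ops([3.0, -4.0]))
print("Norm 2 (value, ops):", norm2_with_ops([3.0, -4.0]))
```

The expected coverage approaches 1 - 1/e, which is about 0.632 and agrees with the 63% figure quoted for bagging. Under this way of counting, Norm 1 needs t-1 operations and Norm 2 needs 2t operations, so both grow linearly in t.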
7.CONCLUSION
The deep learning approach for improving classification accuracy is very prevalent
in the artificial neural network field. The deep SVM classifier is still an emerging concept. Here
the experiments indicate good scope for deep learning with SVM classifiers. Regularization of
deep learning has further improved the classification accuracy. Many other
regularization techniques could be applied for comparison and better results. Other feature
selection strategies such as SFS and SBS could also be applied for feature subset selection.
REFERENCES
[1] Rob Schapire, “Theoretical Machine Learning”, COS 511, Lec No. 1, p. 1-6, 2008
[2] R. Sathya, Annamma Abraham, “Comparison of Supervised and Unsupervised Learning Algorithms
for Pattern Classification”, IJARAI, Vol 2, No. 2, 2013
[3] D. Michie, D.J. Spiegelhalter, C.C. Taylor, “Machine Learning, Neural and Statistical Classification”,
Tutorial section 2.1, p. 6-16, 1994
[4] S. B. Kotsiantis, “Supervised Machine Learning: A Review of Classification Techniques”, Informatica,
Vol 31, p. 249-268, 2007
[5] Koby Crammer, Yoram Singer, “On the Algorithmic Implementation of Multiclass Kernel-based
Vector Machines”, JMLR 2, p. 256-295, 2001
[6] Hal Daume III, “A course in Machine Learning”, Ensemble learning CIML, V0-8, Ch 1, p. 148-155,
2012
[7] A. Vergara, Shankar Vembu, Tuba Ayhan, Margaret A. Ryan, Margie L. Homer, Ramon Huerta,
“Chemical gas sensor drift compensation using classifier ensembles”, Sensors and Actuators B, p.
166-167, 2012.
[8] Fisher’s IRIS dataset, UCI repository, https://archive.ics.uci.edu/ml/datasets/Iris, 1988
[9] Vaishali Arya, R.K.Rathy, “An Efficient Neuro-Fuzzy Approach for Classification of Iris Dataset”,
International Conference on Reliability, Optimization and Information Technology, p. 161-165, 2014.
[10] Xiaoyang Fu and Shuqing Zhang, “Evolving Neural Network Ensembles Using Variable String
Genetic Algorithm for Pattern Classification”, Sixth International Conference on Computational
Intelligence, p. 81-85, 2013.
[11] Anshu Bharadwaj, Sonajharia Minz, “Hybrid Approach for Classification using Support Vector
Machine and Decision Tree”, International Conference on Advances in Electronics, Electrical and
Computer Science Engineering, p. 337-341, 2012.
[12] Hamid Parvin, Sajad Parvin, “Robust Classifier Ensemble for Improving the Performance of
Classification”, Eleventh Mexican International Conference on Artificial Intelligence, IEEE special
session, Vol 11, p. 52-57, 2012.
[13] Xue-Fang Chen, Hong-Jie Xing, Xi-Zhao Wang, “A modified AdaBoost method for one-class SVM
and its application to novelty detection”, IEEE, Vol 11 p. 3506-3511, 2011.
[14] Hakan Cevikalp, Bill Triggs , Hasan Serhan Yavuz , Yalc, Mahide, Atalay Barkana, “Large margin
classifiers based on affine hulls” Elsevier, Vol 73, p. 3160-3168, 2010.
[15] A. Marcano-Cedeño, J. Quintanilla-Domínguez, M.G. Cortina-Januchs, D. Andina, “Feature Selection
Using Sequential Forward Selection and classification applying Artificial Metaplasticity Neural
Network”, IEEE, No. 36, p. 2845-2850, 2010
[16] Narendra S. Chaudhari, Aruna Tiwari, Jaya Thomas, “Performance Evaluation of SVM Based
Semi-supervised Classification Algorithm”, 10th Intl. Conf. on Control, Automation, Robotics and Vision,
No. 10, p. 1942-1947, 2008
[17] Hyun-Chul Kim, Shaoning Pang, Hong-Mo Je, Daijin Kim, Sung Yang Bang, “ Constructing support
vector machine ensemble”, The journal of the pattern recognition society, Vol 36, p. 2757-2767, 2003.
[18] Vince Sigillito, Ionosphere Dataset , UCI repository, https://archive.ics.uci.edu/ml/datasets/Ionosphere,
1989
[19] Xiaodong Zeng, Derek F. Wong, Lidia S. Chao, “Constructing Better Classifier Ensemble Based on
Weighted Accuracy and Diversity Measure”, The Scientific World Journal, Volume 2014, Article No.
961747, p. 1-12, 2014
[20] Shasha Mao, Licheng Jiao, Lin Xiong, Shuiping Gou, Bo Chen, Sai-Kit Yeung, “Weighted classifier
ensemble based on quadratic form”, Elsevier Vol 48, Issue 5, p. 1688-1706, 2014
[21] Darwin Tay, Chueh Loo Poh, Richard I. Kitney, “An Evolutionary Data-Conscious Artificial Immune
Recognition System”, Proceedings of the 15th annual conference on Genetic and evolutionary
computation, p. 1101-1108, 2013
[22] Eitan Menahem, Lior Rokach, Yuval Elovici, “Combining One-Class Classifiers via Meta Learning”,
Proceedings of 22 ACM international conference on information & knowledge management, No. 22,
p. 2435-2440, 2013
[23] Nicola Segata , Enrico Blanzieri, “Fast and Scalable Local Kernel Machines”, JMLR, Vol 1, p. 1883-
1926, 2010
[24] Vlado Menkovski, Ioannis T. Christou, Sofoklis Efremidis, “Oblique Decision Trees Using
Embedded Support Vector Machines in Classifier Ensembles” , Vol 11, p. 1-6, 2008
[25] Hsuan-Tien Lin , Ling Li, “Support Vector Machinery for Infinite Ensemble Learning”, JMLR , Vol
9, p. 285-312, 2008
[26] Albert Hung-Ren Ko, Robert Sabourin, Alceu de Souza Britto, “Evolving Ensemble of Classifiers in
Random Subspace”, Proceedings of the 8th annual conference on Genetic and evolutionary
computation, p. 1473-1480, 2006
[27] Gorzata’s Seed Data set, UCI repository, https://archive.ics.uci.edu/ml/datasets/seeds, 2012
[28] M. Charytanowicz, J. Niewczas, P. Kulczycki, P.A. Kowalski, S. Lukasik, S. Zak, “A Complete
Gradient Clustering Algorithm for Feature Analysis of X-ray Images”, Information Technology in
Biomedicine, Springer-Verlag, p. 15-24, 2010
[29] Gavin Brown, Encyclopaedia of Machine Learning Vol 1, p. 312-320, 2010
[30] Robi Polikar, Ensemble based systems in decision making, IEEE, Vol 6, Issue 3, p. 21-45
[31] Hyun-Chul Kim, Shaoning Pang, Hong-Mo Je, Daijin Kim, and Sung-Yang Bang, Support Vector
Machine Ensemble with Bagging, Springer, LNCS 2388, p. 397-408, 2002
[32] David W. Opitz, “Feature Selection for Ensembles”, American Association for Artificial Intelligence,
AAAI Proceeding No. 99, p.1-6, 1999
[33] Mohamed A. Aly, “Novel Methods for the Feature Subset Ensembles Approach”, International
Journal of Artificial Intelligence and Machine Learning, Vol. 6, No. 4, p. 1-7, 2006
[34] Anil K. Jain, Robert P.W. Duin, Jianchang Mao, “Statistical Pattern Recognition: A Review”, IEEE
transactions on pattern analysis and machine intelligence, Vol 22, Issue 1, p. 4-37, 2000
[35] Dong Yu and Li Deng, “Deep Learning and Its Applications to Signal and Information Processing” ,
IEEE processing magazine Vol 28, Issue 1, p. 145-154, 2011
[36] Nitish Srivastava, Ruslan Salakhutdinov, “Multimodal Learning with Deep Boltzmann Machines”,
ICML, 25 Annual Conference on learning theory, No. 25, p. 1-9, 2012
[37] Xue-Wen Chen, Xiaotong Lin, “Big Data Deep Learning: Challenges and Perspectives”, IEEE Access,
Vol 2, p. 514-525, 2014.
[38] Azizi Abdullah, Remco C. Veltkamp, Marco A. Wiering, “An Ensemble of Deep Support Vector
Machines for Image Categorization”, International Conference of Soft Computing and Pattern
Recognition, p. 301-306, 2009.
[39] Hal Daume III, From zero to reproducing kernel hilbert spaces in twelve pages or less, p.1-12, 2004
[40] C.-C. Chang, C.-J. Lin, LIBSVM: A Library for Support Vector Machines, Software. Available at
http://www.csie.ntu.edu.tw/cjlin/libsvm, 2001.
[41] Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin, Practical Guide to Support Vector Classification,
Department of Computer Science, National Taiwan University, Taipei 106, Taiwan, p. 1-16, 2003.
Authors
Ms. Shruti Asmita (B.Tech., 2013 – KEC Ghaziabad, Uttar Pradesh Technical University, Lucknow) is an
M.Tech. Computer Science scholar at Banasthali University, Jaipur, pursuing her research internship at
IIT-BHU (CSE), Varanasi. Her research interests include data mining, image processing, machine learning
and sensor networks.
Dr. K.K. Shukla (Ph.D., 1993 - Institute of Technology (BHU), Varanasi) is professor and current head of
department at Indian Institute of Technology, Banaras Hindu University, Varanasi, India. He received
his B.Tech. from APSU, Rewa in 1980, his M.Tech. from IT (BHU) in 1982 and his PhD from IT (BHU) in
1993. He has 30 years of research and teaching experience, more than 120 research
papers in reputed journals and conferences, and more than 90 citations. His present research collaborations
in India include ISRO and TCS. His research collaborations outside India include INRIA, France and ETS,
Canada. He has authored many popular books on the subjects of neuro-computers, RTS scheduling,
fuzzy modelling and image compression. His fields of research include image processing and pattern
recognition, fuzzy logic, wireless sensor networks and machine learning.