EigenValue Decay a New Method for Neural Network Regularization

Eigenvalue decay: A new method for neural network regularization

Oswaldo Ludwig n, Urbano Nunes, Rui AraujoISR-Institute of Systems and Robotics, University of Coimbra, Portugal

a r t i c l e i n f o

Article history:Received 17 October 2012Received in revised form28 April 2013Accepted 14 August 2013Communicated by M.T. ManryAvailable online 13 September 2013

Keywords:TransductionRegularizationGenetic algorithmClassification marginNeural network

a b s t r a c t

This paper proposes two new training algorithms for multilayer perceptrons based on evolutionarycomputation, regularization, and transduction. Regularization is a commonly used technique forpreventing the learning algorithm from overfitting the training data. In this context, this work introducesand analyzes a novel regularization scheme for neural networks (NNs) named eigenvalue decay, whichaims at improving the classification margin. The introduction of eigenvalue decay led to the developmentof a new training method based on the same principles of SVM, and so named Support Vector NN(SVNN). Finally, by analogy with the transductive SVM (TSVM), it is proposed a transductive NN (TNN), byexploiting SVNN in order to address transductive learning. The effectiveness of the proposed algorithmsis evaluated on seven benchmark datasets.

& 2013 Elsevier B.V. All rights reserved.

1. Introduction

One of the problems that occur during syntactic classifiertraining is called overfitting. The error on the training dataset isdriven to a small value; however, the error is large when new dataare presented to the trained classifier. It occurs because theclassifier does not learn to generalize when new situations arepresented. This phenomenon is related to the classifier complexity,which can be minimized by using regularization techniques [1]and [2]. In this paper we will apply regularization to improve theclassification margin, which is an effective strategy to decrease theclassifier complexity, in Vapnik sense, by exploiting the geometricstructure in the feature space of the training examples.

There are three usual regularization techniques for neuralnetworks (NNs): early stopping [3], curvature-driven smoothing[4], and weight decay [5]. In the early stopping criterion thelabeled data are divided into training and validation datasets. Aftersome number of iterations the NN begins to overfit the data andthe error on the validation dataset begins to rise. When thevalidation error increases during a specified number of iterations,the algorithm stops the training section and applies the weightsand biases at the minimum of the validation error to the NN.Curvature-driven smoothing includes smoothness requirementson the cost function of learning algorithms, which depend on thederivatives of the network mapping. Weight decay is implementedby including additional terms in the cost function of learning

algorithms, which penalize overly high values of weights andbiases, in order to control the classifier complexity, which forcesthe NN response to be smoother and less likely to overfit. Thiswork introduces and analyzing a novel regularization scheme,named eigenvalue decay, aiming at improving the classificationmargin, as will be shown in Section 3.

In the context of some on-the-fly applications, the use of SVMwith nonlinear kernels requires a prohibitive computational cost,since its decision function requires a summation of nonlinearfunctions that demands a large amount of time when the numberof support vectors is big. Therefore, a maximal-margin neuralnetwork [6–9] can be a suitable option for such kind of applica-tion, since it can offer a fast nonlinear classification with goodgeneralization capacity. This work introduces a novel algorithm formaximum margin training that is based on regularization andevolutionary computing. Such method is exploited in order tointroduce a transductive training method for NN.

The paper is organized as follows. In Section 2 we propose theeigenvalue decay, while the relationship between such regulariza-tion method and the classification margin is analyzed in Section 3.In Section 4 it is proposed a new maximum-margin trainingmethod based on genetic algorithms (GA) that is extended to thetransductive approach in Section 5. Section 6 reports the experi-ments, while Section 7 summarizes some conclusions.

2. Eigenvalue decay

A multilayer perceptron (MLP) with one sigmoidal hidden layerand linear output layer is a universal approximator, because the

Contents lists available at ScienceDirect

journal homepage: www.elsevier.com/locate/neucom

Neurocomputing

0925-2312/$ - see front matter & 2013 Elsevier B.V. All rights reserved.http://dx.doi.org/10.1016/j.neucom.2013.08.005

n Corresponding author. Tel.: þ351 239796205.E-mail addresses: [email protected], [email protected] (O. Ludwig).

Neurocomputing 124 (2014) 33–42

www.sciencedirect.com/science/journal/09252312

www.elsevier.com/locate/neucom

http://dx.doi.org/10.1016/j.neucom.2013.08.005



http://crossmark.crossref.org/dialog/?doi=10.1016/j.neucom.2013.08.005&domain=pdf



mailto:[email protected]

mailto:[email protected]


sigmoidal hidden units of such model compose a basis of linearlyindependent soft functions [10]; therefore, this work focuses insuch NN, whose model is given by

yh ¼φðW1 � xþb1Þy ¼WT

2yhþb2 ð1Þwhere yh is the output vector of the hidden layer, W1 is a matrixwhose elements are the synaptic weights that connect the inputelements with the hidden neurons, W2 is vector whose elementsare the synaptic weights that connect the hidden neurons to theoutput, b1 is the bias vector of the hidden layer, b2 is the bias of theoutput layer, x is the input vector, and φð�Þ is the sigmoid function.The most usual objective function for MLPs is the MSE:

E¼ 1N

∑N

i ¼ 1ðyi� yiÞ2 ð2Þ

where N is the cardinality of the training dataset, yi is the targetoutput, yi is the output estimated by the MLP for the input xibelonging to the training dataset. However, in case of the usualweight decay method [11], additional terms which penalize overlyhigh values of weights and biases are included. Therefore, thegeneric form for the objective function is

En ¼ Eþκ1 ∑wi AW1

w2i þκ2 ∑

wj AW2

w2j þκ3 ∑

bð1;kÞ Ab1

b2ð1;kÞ þκ4b22 ð3Þ

whereW1,W2, b1, and b2 are the MLP parameters, according to (1),and κi40, i¼ ð1…4Þ are regularization hyperparameters. Suchmethod was theoretically analyzed by [12], which concludes thatthe bounds on the expected risk of MLPs depend on the magnitudeof the parameters rather than the number of parameters. Namely,in [12] the author showed that the misclassification probabilitycan be bounded in terms of the empirical risk, the number oftraining examples, and a scale-sensitive version of the VC-dimen-sion, known as the fat-shattering dimension,1 which can be upper-bounded in terms of the magnitudes of the network parameters,independently from the number of parameters.2 In short, asregards weight-decay, [12] only shows that such method can beapplied to control the capacity of the classifier space. However, thebest known way to minimize the capacity of the classifier spacewithout committing the accuracy on the training data is tomaximize the classification margin, which is the SVM principle.Unfortunately, from the best of our knowledge, there is no proofthat weight-decay can maximize the margin. Therefore, wepropose the eigenvalue-decay, for which it is possible to establisha relationship between the eigenvalue minimization and theclassification margin. The objective function of eigenvalue-decay is

Enn ¼ EþκðλminþλmaxÞ ð4Þwhere λmin is the smallest eigenvalue of W1W

T1 and λmax is the

biggest eigenvalue of W1WT1.

3. Analysis on eigenvalue decay

In this section we show a relationship between eigenvaluedecay and the classification margin, mi. Our analysis requires thefollowing lemma:

Lemma 1 (Horn and Johnson [13]). Let K denote the field of realnumbers, Kn�n a vector space containing all matrices with n rows andn columns with entries in K, AAKn�n be a symmetric positive-semidefinite matrix, λmin be the smallest eigenvalue of A, and λmax

be the largest eigenvalue of A. Therefore, for any xAKn, the following

inequalities hold true:

λminxTxrxTAxrλmaxxTx ð5Þ

Proof. The upper bound on xTAx, i.e. the second inequality of (5),is well known; however, this work also requires the lower boundon xTAx, i.e. the first inequality of (5). Therefore, since this proof isquite compact, we will save a small space in this work to presentthe derivation of both bounds as follows:Let V ¼ ½v1;…; vn� be the square n�nmatrix whose ith column is

the eigenvector vi of A, and Λ be the diagonal matrix whose ithdiagonal element is the eigenvalue λi of A; therefore, the followingrelations hold:

xTAx¼ xTVV �1AVV �1x¼ xTVΛV �1x¼ xTVΛVTx ð6ÞTaking into account that A is positive-semidefinite, i.e. 8 i, λiZ0:

xTVðλminIÞVTxrxTVΛVTxrxTVðλmaxIÞVTx ð7Þwhere I is the identity matrix. Note that xTVðλminIÞVTx¼ λminxTxand xTVðλmaxIÞVTx¼ λmaxxTx; therefore, substituting (6) into (7)yields (5). □

The following theorem gives a lower and an upper bound onthe classification margin:

Theorem 1. Let mi be the margin of the training example i, i.e. thesmallest orthogonal distance between the classifier separating hyper-surface and the training example i, λmax be the biggest eigenvalue ofW1W

T1 , and λmin be the smallest eigenvalue of W1W

T1; then, for

mi40, i.e. an example correctly classified, the following inequalitieshold true:

1ffiffiffiffiffiffiffiffiffiffiλmax

p μrmir1ffiffiffiffiffiffiffiffiffiλmin

p μ ð8Þ

where

μ¼minj

yiWT

2ΓjW1ðxi�xjprojÞffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiWT

2ΓjΓTj W2

q0B@

1CA; ð9Þ

Γj ¼

φ′ðv1Þ 0 ⋯ 00 φ′ðv2Þ ⋯ 0⋮ ⋮ ⋱ ⋮0 0 ⋯ φ′ðvnÞ

266664

377775;

½v1; v2;…; vn�T ¼W1 � xkþb1; φ′ðvnÞ ¼∂φ∂vn xjproj

; xjproj

��

is the jth projection of the ith training example, xi, on the separatinghypersurface, as illustrated in Fig.1, and yi is the target class of xi.

Proof. The first step in this proof is the calculation of the gradientof the NN output y in relation to the input vector x at the projectedpoint, xjproj. From (1) we have

∇yði;jÞ ¼∂y∂x xjproj

¼WT2ΓjW1

�� ð10Þ

The vector

p!ði;jÞ ¼∇yði;jÞ

J∇yði;jÞ Jð11Þ

is normal to the separating surface, giving the direction from xi toxproj

j; therefore

xi�xjproj ¼ dði;jÞ p!

ði;jÞ ð12Þ1 See Theorem 2 of [12].2 See Theorem 13 of [12].

O. Ludwig et al. / Neurocomputing 124 (2014) 33–4234

where dði;jÞ is the scalar distance between xi and xprojj. From (12) we

have

∇yði;jÞðxi�xjprojÞ ¼ dði;jÞ∇yði;jÞ p!

ði;jÞ ð13Þ

Substituting (11) into (13) and solving for dði;jÞ yields

dði;jÞ ¼∇yði;jÞðxi�xjprojÞ

J∇yði;jÞ Jð14Þ

The sign of dði;jÞ depends on which side of the decision surface theexample, xi, is placed. It means that an example, xi, correctlyclassified whose target class is �1 corresponds to dði;jÞo0. On theother hand, the classification margin must be positive in case ofexamples correctly classified, and negative in case of misclassifiedexamples, independently from their target classes. Therefore, themargin is defined as a function of yidði;jÞ, where yiAf�1;1g is thetarget class of the ith example. More specifically, the margin, mi, isthe smallest value of yidði;jÞ in relation to j, that is

mi ¼minj

ðyidði;jÞÞ ð15Þ

Substituting (14) in (15) yields

mi ¼minj

yi∇yði;jÞðxi�xjprojÞ

J∇yði;jÞ J

!ð16Þ

Substituting (10) in (16) yields

mi ¼minj

yiWT

2ΓjW1ðxi�xjprojÞffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiWT

2ΓjW1WT1Γ

Tj W2

q0B@

1CA: ð17Þ

Note that W1WT1 is a symmetric positive-semidefinite matrix,

therefore, from Lemma 1, the inequalities:

λminWT2ΓjΓ

Tj W2rWT

2ΓjW1WT1Γ

Tj W2rλmaxW

T2ΓjΓ

Tj W2 ð18Þ

hold true for any Γj and any W2. From (18) and (17) it is easy toderive (8). □

Taking into account that λmax and λmin are the denominators ofthe bounds in (8), the training method based on eigenvalue decaydecreases λmax and λmin aiming at increasing the lower and theupper bounds on the classification margin. However, eigenvaluedecay does not assure, by itself, increasing the margin, becauseμ is a function of W1 among other NN parameters. Therefore, toevaluate the effect of eigenvalue decay on the classification margin,we performed comparative experiments with real world data-sets (see Section 6). Fig. 2 illustrates the separating surface for the

toy examples and Fig. 3 illustrates the margins3 generated by NNstrained without and with eigenvalue decay. The boundary betweenwhite and colored areas represents the SVM-like classificationmargin, i.e. for input data belonging to the yellow area, the NNmodel outputs 0r yio1, while for input data belonging to the greenarea, the model outputs 04 yi4�1. The boundary between coloredareas represents the separating surface. The training methodsproposed in this paper are similar to SVM, i.e. the data which lieinto the colored areas, or fall on the wrong side of the separatingsurface, increase a penalty term. The algorithm minimizes thepenalty term in such a way to move the colored area, and so theseparating surface, away from the training data. Therefore, the largeris the colored area, i.e. the smaller the eigenvalues, the larger thedistance between the training data and the separating surface.

4. Maximum-margin training by GA

Theorem 1 allows us to propose a maximal-margin trainingmethod quite similar to SVM [14], in the sense that the proposedmethod also minimizes values related with the parameters of theclassifier model, in order to maximize the margin, allowing theminimization of the classifier complexity without committing theaccuracy on the training data.

The main idea of our method is not only to avoid nonlinearSVM kernels, in such a way as to offer a faster nonlinear classifier,but also to be based on the maximal-margin principle; moreover,the proposed method is more suitable for on-the-fly applications,such as object detection [15,16]. The SVM decision function isgiven by

cðxÞ ¼ sgn ∑Nsv

i ¼ 1yiαiKðxi; xÞþb

!ð19Þ

where αi and b are SVM parameters, ðxi; yiÞ is the ith supportvector data pair, sgn ð�Þ is 1 if the argument is greater than zeroand �1 if it is less than zero, and Kð�; �Þ is a nonlinear kernel

Fig. 1. A feature space representing a nonlinear separating surface with theprojections, xjproj, of the ith training example, xi, and examples of orthogonaldistances dði;jÞ .

Fig. 2. First toy example on the effect of eigenvalue decay on the margin: (a) NNtrained without eigenvalue decay, (b) NN trained with eigenvalue decay. Theboundary between white and colored areas represents the SVM-like classificationmargin, i.e. for input data belonging to the yellow area, the NN model outputs0r y io1, while for input data belonging to the green area, the model outputs04 y i4�1. (For interpretation of the references to color in this figure caption, thereader is referred to the web version of this article.)

3 The margins were generated by a dense grid of points. The output of thetrained NN is calculated for each point. If the output is �1oyo1, the pointreceives a colored pixel.

O. Ludwig et al. / Neurocomputing 124 (2014) 33–42 35

function, i.e. the algorithm fits the maximum-margin hyperplanein a transformed feature space, in order to enable a nonlinearclassification. Notice that (19) requires a large amount of timewhen the number of support vectors, Nsv, is big. This factmotivated this new SVM-like training method for NN, namedSupport Vector NN (SVNN), here proposed.

In order to better understand our method, it is convenient totake into account the soft margin SVM optimization problem, asfollows:

minw;ξi

12‖w‖2þC ∑

N

i ¼ 1ξi

!ð20Þ

subject to

8 ijyiðw � xi�bÞZ1�ξi ð21Þ

8 ijξiZ0 ð22Þ

wherew and b compose the separating hyperplane, C is a constant,yi is the target class of the ith training example, and ξi are slackvariables, which measure the degree of misclassification of thevector xi. The optimization is a trade off between a large margin(min‖w‖2), and a small error penalty (min C∑N

i ¼ 1ξi).We propose to train the NN by solving the similar optimization

problem:

minW1 ;ξi

λminþλmaxþC1

N∑N

i ¼ 1ξi

!ð23Þ

subject to

8 ijyiyiZ1�ξi ð24Þ

8 ijξiZ0 ð25Þwhere y is given by (1), C1 is a regularization hyperparameter, yi isthe target class of the ith training example, and ξi are also slackvariables, which measure the degree of misclassification of thevector xi.

The constrained optimization problem (23)–(25) is replaced bythe equivalent unconstrained optimization problem (26) [17], thathas the discontinuous objective function Φ, which disables thegradient-based optimization methods; therefore, a real-coded GAis applied to solve (26), using Φ as fitness function [18]:

minW1 ;W2 ;b1 ;b2

Φ ð26Þ

where

Φ¼ λminþλmaxþC1

N∑N

i ¼ 1HðyiyiÞ ð27Þ

and HðtÞ ¼maxð0;1�tÞ is the Hinge loss.Note that the last term of (27) penalizes models whose

estimated outputs do not fit the constraint yiyiZ1, in such away as to save a minimal margin, while the minimization of thefirst two terms of (27) aims at the enlargement of such minimalmargin by eigenvalue decay, as suggested by Theorem 1.

Algorithm 1 details the proposed optimization process. Thechromosome of each individual is coded into a vertical vectorcomposed by the concatenation of all the columns of W1 with W2,b1, and b2. The algorithm starts by randomly generating the initialpopulation of Npop individuals in a uniform distribution, accordingto the Nguyen–Widrow criterion [19]. During the loop overgenerations the fitness value of each individual is evaluated onthe training dataset, according to (26) and (27). Then, the indivi-duals are ranked according to their fitness values, and the cross-over operator is applied to generate new individuals by randomlyselecting the parents by their ranks, according to the randomvariable pA ½1;Npop� proposed in our previous work [20]:

p¼ ðNpop�1Þeaϑ�1ea�1

þ1 ð28Þ

where ϑA ½0;1� is a random variable with uniform distribution anda40 sets the selective pressure, more specifically, the larger a, thelarger the probability of low values of p, which are related to high-ranked individuals.

Algorithm 1. Maximal margin training by GA.

Input: X, y: matrices with N training datapairs;nneu: number of hidden neurons;C1 : regularization hyperparameter;a: selective pressure;maxgener: maximum number of generations;Npop: population sizeOutput: W1, W2, b1, and b2: NN parametersgenerate a set of Npop chromosomes, {Cr}, for the initial

population, taking into account the number of inputelements and nneu; therefore, each chromosome is a verticalvector Cr¼ ½w1;…;wnw ;b1;…; bnb �T containing all the Ng

synaptic weights and biases randomly generated accordingto the Nguyen–Widrow criterion [19];

for generation¼ 1 : maxgener doevaluating the population:for ind¼ 1 : Npop do

rearrange the genes, Crind, of individual ind, in order tocompose the NN parameters W1, W2, b1, and b2.

Fig. 3. Second toy example: (a) NN trained without eigenvalue decay, (b) NNtrained with eigenvalue decay. (For interpretation of the references to color in thisfigure caption, the reader is referred to the web version of this article.)


for i¼ 1 : N docalculate yi , according to (1), using the weights and

biases of individual ind;end forcalculateΦ for the individual ind, according to (26), using y

and the set of NN outputs fyi g previously calculated;Φind←Φ:storing the fitness of individual ind;

end forrank the individuals according to their fitness Φind;store the genes of the best individual in Crbest;performing the crossover:k←0;for ind¼ 1 : Npop dok←kþ1;randomly selecting the indexes of parents by using the

asymmetric distribution proposed in [20], and alsoapplied in [21]:ϑj← random number A ½0;1� with uniform distribution,

j¼ ð1;2Þ;parentj←roundððNpop�1Þeaϑj �1

ea �1 þ1Þ, j¼ ð1;2Þ;assembling the chromosome Crsonk :

for n¼ 1 : Ng doη← random number A ½0;1� with uniform distribution;Crsonðk;nÞ←ηCrðparent1 ;nÞ þð1�ηÞCrðparent2 ;nÞ: calculating the nth

gene to compose the chromosome of the kth individual ofthe new generation, by means of weighted average;end for

end forend forrearrange the genes of the best individual, Crbest, in order to

compose the NN parameters W1, W2, b1, and b2.

5. Transductive neural networks

This section deals with transduction, a concept in which nogeneral decision rule is inferred. Differently from inductive infer-ence, in the case of transduction the inferred decision rule aimsonly at the labels of the unlabeled testing data.

The SVM-like training method, introduced in the previoussection, can be exploited to address transductive learning. There-fore, we propose the transductive NN (TNN), which is similar tothe transductive SVM (TSVM) [14]. The transductive algorithmtakes advantage of the unlabeled data similar to the inductivesemi-supervised learning algorithm. However, differently from theinductive semi-supervised learning, transduction is based on theVapnik principle, which states that when trying to solve someproblem, one should not solve a more difficult problem, such asthe induction of a general decision rule, as an intermediate step.

The proposed TNN accomplishes transduction by finding thosetest labels for which, after training a NN on the combined trainingand test datasets, the margin on the both datasets is maximal.Therefore, similar to TSVM, TNN exploits the geometric structurein the feature vectors of the test examples, by taking into accountthe principle of low density separation. Such principle assumesthat the decision boundary should lie in a low-density region ofthe feature space, because a decision boundary that cuts a datacluster into two different classes is not in accordance with thecluster assumption, which can be stated as follows: if points are inthe same data cluster, they are likely to be of the same class.

The TNN training method can be easily implemented byincluding in (26) an additional term that penalizes all the unla-beled data which are near to the decision boundary, in order toplace the decision boundary away from high-density regions.

Therefore, the new optimization problem is

minW1 ;W2 ;b1 ;b2

Φn ð29Þ

where

Φn ¼ λminþλmaxþC1

N∑N

i ¼ 1HðyiyiÞþ

C2

Nu∑Nu

j ¼ 1Hð yj Þ;

�� ð30Þ

C1 and C2 are constants, yj is the NN output for the unlabeled dataxj, and Nu is the cardinality of the unlabeled dataset. Notice thatthe operator j � j makes this additional term independent of theclass assigned by the NN for the unlabeled example, i.e. indepen-dent from the signal of yj, since we are interested only in thedistance from the unlabeled data to the decision boundary.

In order to illustrate the effect of the last term of (30), weintroduce two toy examples which enable a comparative study onthe decision boundaries generated by SVNN and TNN, as can beseen in Figs. 4 and 5, where circles represent training data andpoints represent testing (unlabeled) data.

Note that both toy examples are in accordance with the clusterassumption, i.e. there are low-density regions surrounding dataclusters whose elements belong to the same class. TNN places theseparating-surface along such low-density regions (see Fig. 4(b)),in order to increase the absolute value of the margin of theunlabeled data, in such a way as to decrease the last term of (30).

Empirically, it is sometimes observed that the solution to (30) isunbalanced, since it is possible to decrease the last term of (30) by

Fig. 4. Separating surfaces generated by two NNs with five hidden neurons. Circlesrepresent training data and points represent testing (unlabeled) data: (a) NNtrained by SVNN, (b) NN trained by TNN.


placing the separating-surface away from all the testing instances,as can be seen in Fig. 6. In this case, all the testing instances arepredicted in only one of the classes. Such problem can also be

observed in case of TSVM, for which a heuristic solution is toconstrain the predicted class proportion on the testing data, sothat it is the same as the class proportion on the training data. Thiswork adopts a similar solution for TNN, by including in (30) a termthat penalizes models whose predicted class proportion on thetesting data is different from the class proportion on the trainingdata. Therefore, we rewrite (30) as

Φn ¼ λminþλmaxþC1

N∑N

i ¼ 1HðyiyiÞþ

C2

Nu∑Nu

j ¼ 1Hð yj Þ

��

þC31N

∑N

i ¼ 1yi�

1Nu

∑Nu

j ¼ 1sgnðyjÞ

�� ð31Þ

where C3 is a penalization coefficient. Fig. 7 shows the separating-surface of TNN after the inclusion of the last term of (31).

6. Experiments

In this section our methods are evaluated by means of experi-ments in three UCI benchmark datasets4 and four datasets from[22].5 Table 1 details the applied datasets.

Fig. 5. Separating surfaces generated by two NNs with four hidden neurons. Circlesrepresent training data and points represent testing (unlabeled) data: (a) NNtrained by SVNN, (b) NN trained by TNN.

Fig. 6. Toy experiment using TNN trained without the last term of (31). Circlesrepresent training data and points represent testing (unlabeled) data.

Fig. 7. Toy experiment using TNN trained with the last term of (31). Circlesrepresent training data and points represent testing (unlabeled) data.

Table 1Datasets used in the experiments.

Dataset Attributes # data train # data test Total

Breast Cancer 9 200 77 277Haberman 3 153 153 306Hepatitis 19 77 78 155BCI 117 200 200 400Digit1 241 750 750 1500g241c 241 750 750 1500Text 11 960 750 750 1500

Table 2Number of hidden neurons of both SVNN and TNN.

Breast Cancer Haberman Hepatitis BCI Digit1 g241c Text

2 4 2 2 2 2 1

4 http://archive.ics.uci.edu/ml/datasets/5 http://olivier.chapelle.cc/ssl-book/benchmarks.html


http://archive.ics.uci.edu/ml/datasets/

http://olivier.chapelle.cc/ssl-book/benchmarks.html

The highly unbalanced datasets, Breast Cancer, Haberman, andHepatitis, were introduced in our experimental analysis in order toverify the behavior of the optimization algorithms of transductivemethods when working under the constraint on the predictedclass proportion, i.e. the last term of (31). The other datasets areusually applied to evaluate semi-supervised learning algorithms.Each dataset was randomly divided into 10-folds, thus, all theexperimental results were averaged over 10 runs.

The datasets were randomly divided into 10-folds, in order toaverage the results over 10 runs. Each fold contains all the datadivided into two subsets: half for training and half for testing. Foreach training dataset, i.e. half of the data, it was performed 10-folds cross-validation to set the classifier parameters. Therefore,the parameter setting does not take into account information fromthe testing dataset. In the case of SVM, the soft-margin parameterC and the RBF parameter γ (in the case of RBF kernel) were chosenin the set f10�3;10�2;…;103g. In case of SVNN and TNN, theparameters C1, C2, and C3 were chosen in the set f102;103;104;105g.Table 2 gives the averaged number of hidden neurons adopted afterthe cross-validation.

In our experiments SVNN and TNN were compared with NNstrained by the usual Levenberg–Maquardt (LM), as well as, SVMand TSVM.6 Since SVNN and TNN can perform nonlinear classifica-tion, we also evaluated the performance of SVM and TSVM withRBF kernel. In order to verify the capability of transductivealgorithms in learning from few labeled data, all the algorithmswere trained by using only 10 labeled training points, as well as, allthe training data. In both cases, the transductive algorithms madeuse of all the testing (unlabeled) data. Regarding the GA, theselective pressure was set to a¼6 and the population Npop ¼ 3000.

Table 3Accuracy (Acc), balanced error rate (BER), and positive predictive value (PPV) ofinductive methods with 10 labeled training points.

Index SVNN NN-LM SVM-l SVM-rbf

Breast Cancer datasetAcc 62.8679.16 60.2679.42 61.6977.83 62.7375.25BER 38.9276.65 42.2478.11 55.0473.81 42.1574.55PPV 36.37710.51 33.2479.82 14.9678.94 34.3475.71

Haberman datasetAcc 26.6070.00 26.6070.00 26.6070.00 26.6070.00BER 50.0070.00 50.0070.00 50.0070.00 50.0070.00PPV 26.4770.00 26.4770.00 26.4770.00 26.4770.00

Hepatitis datasetAcc 61.0373.44 58.3373.82 61.7973.97 53.5974.50BER 42.9972.87 44.2172.84 47.2572.92 54.9474.04PPV 26.3672.29 24.9772.42 23.2273.16 16.3273.36

BCI datasetAcc 51.7074.18 51.0075.26 50.1575.11 50.9075.67BER 48.3074.18 49.0075.26 49.8575.11 49.1075.67PPV 51.2073.49 50.5074.35 49.6575.08 50.4074.26

Digit1 datasetAcc 74.9774.63 73.7675.70 69.4774.22 73.8174.53BER 25.5474.05 26.4174.98 30.9374.06 26.3274.45PPV 96.7373.68 77.3274.11 79.8373.52 76.2073.65

g241c datasetAcc 61.6873.74 55.8473.92 57.7073.17 55.7673.82BER 38.3273.74 44.1673.92 42.3073.17 44.2473.82PPV 61.6172.91 55.7773.52 57.6372.52 55.6973.35

Text datasetAcc 57.6873.89 57.1574.07 55.1773.91 54.3374.02BER 42.8373.80 43.7174.01 44.2173.88 44.3073.97PPV 57.4073.53 56.8673.72 56.5273.06 55.1973.32

Table 4Accuracy (Acc), balanced error rate (BER), and positive predictive value (PPV) oftransductive methods with 10 labeled training points.

Index TNN TSVM-l TSVM-rbf

Breast Cancer datasetAcc 71.4371.83 74.0370.00 74.0370.00BER 30.7471.53 50.0070.00 50.0070.00PPV 46.4172.88 25.9770.00 25.9770.00

Haberman datasetAcc 75.9572.49 73.4070.00 73.4070.00BER 33.4772.31 50.0070.00 50.0070.00PPV 55.4573.27 26.4770.00 26.4770.00

Hepatitis datasetAcc 78.7272.21 78.0870.00 78.0870.00BER 34.0372.10 50.0070.00 50.0070.00PPV 47.9773.38 20.5170.00 20.5170.00

BCI datasetAcc 52.2073.42 50.9073.55 51.0074.85BER 47.8073.42 49.1073.55 49.0074.85PPV 51.9571.58 50.6571.59 50.7572.23

Digit1 datasetAcc 82.4374.02 80.1874.26 82.2073.97BER 17.8473.94 21.5874.12 18.3374.06PPV 92.7776.29 90.1275.01 91.8674.34

g241c datasetAcc 77.7973.18 76.1173.29 75.2973.02BER 22.2173.18 23.8973.29 24.7173.02PPV 77.7472.58 76.0672.55 75.2472.28

Text datasetAcc 68.8273.67 69.0373.22 65.4474.01BER 32.1073.21 31.7273.38 35.0673.43PPV 68.5373.21 68.9573.05 64.6374.08

Table 5Accuracy (Acc), balanced error rate (BER), and positive predictive value (PPV) ofinductive methods with all the labeled training points.

Index SVNN NN-LM SVM-l SVM-rbf

Breast Cancer datasetAcc 73.7774.11 65.5874.61 72.7273.83 73.6474.61BER 28.2673.52 42.2274.27 37.0473.71 37.2275.46PPV 49.6375.23 35.9374.88 47.2173.79 49.0975.97

Haberman datasetAcc 75.2972.88 74.1273.19 72.0372.76 74.5173.16BER 32.0572.67 33.5173.05 35.4472.32 32.8772.84PPV 53.3973.01 51.1374.22 47.2573.17 51.8673.87

Hepatitis datasetAcc 71.7972.85 68.9773.11 64.2372.34 71.1573.34BER 30.2672.81 32.9372.95 32.6772.12 30.3673.20PPV 38.9673.20 35.6774.14 33.0672.59 38.3774.08

BCI datasetAcc 80.1572.23 72.5572.83 77.1072.12 79.3072.60BER 19.8572.23 27.4572.83 22.9072.12 20.7072.60PPV 79.9971.95 72.3572.00 76.9271.69 79.1372.21

Digit1 datasetAcc 95.3372.61 93.6472.72 90.7772.45 94.9872.13BER 6.4272.32 7.7272.68 10.8272.29 5.1272.09PPV 88.2172.51 87.7572.31 75.2573.47 87.7672.51

g241c datasetAcc 79.6072.64 68.8872.83 78.7972.67 78.6672.81BER 20.4072.64 31.1272.83 21.2172.67 21.3472.81PPV 79.5572.26 68.8271.82 78.7472.23 78.3272.07

Text datasetAcc 85.5771.87 75.1372.22 86.8471.82 78.7372.12BER 15.4971.76 26.5572.13 15.3271.68 23.0571.64PPV 84.9271.48 74.1672.06 86.5771.55 77.3571.82

6 http://svmlight.joachims.org/


http://svmlight.joachims.org/

Tables 3–6 summarize the experimental results. Eqs. (32)–(34)define the three indexes adopted to assess the learning perfor-mance, i.e. accuracy (Acc), balanced error rate (BER), and a measureof precision named positive predictive value (PPV).

Acc¼ TPþTNðnpþnnÞ ð32Þ

BER¼ 12

FPnn

þFNnp

� �ð33Þ

PPV ¼ TPðTPþFPÞ ð34Þ

where TP is the number of positive examples correctly classified,TN is the number of negative examples correctly classified, FP isthe number of negative examples classified as positive, FN is thenumber of positive examples classified as negative, np is thenumber of positive examples, and nn is the number of negativeexamples.

Tables 7 and 8 report the training and testing time in seconds,averaged on all the cross-validation runs.

In order to evaluate the influence of eigenvalue decay on theperformance of SVNN, two sets of experiments were performed. Inthe first set of experiments a SVNNwas trained by minimizing (27)without the first term; therefore, this model was namedSVNN�λmin. In the second set of experiments a SVNN was rainedby minimizing (27) without the first two terms; hence, this modelwas named SVNN�λmin�λmax. Both models were evaluated on allthe datasets of Table 1. Moreover, it was investigated the influence ofthe term about class proportion on the accuracy of TNN. To do so, aTNN was trained by minimizing only the first four terms of (31);

hence, this model was named TNN-C3. The results are summarized inTable 9.

As regards the inductive training methods, SVNN had the bestperformance and in the majority of the evaluated datasets byusing only 10 labeled data. In the case of the Haberman dataset, allthe algorithms fail, predicting all the testing data in the same class(see the value of BER¼50% in Table 3). By using all the trainingdata, SVNN only was less accurate than SVM in the Text dataset(see Table 5). We believe that this fact is due to the high-dimensional feature space of Text dataset, since such fact canfavor linear classifiers, such as the linear SVM.

Table 6Accuracy (Acc), balanced error rate (BER), and positive predictive value (PPV) oftransductive methods with all the labeled training points.

Index TNN TSVM-l TSVM-rbf





Digit1 datasetAcc 95.4472.12 94.4372.04 92.3772.21BER 5.4772.25 6.0671.94 8.2472.18PPV 88.7872.54 88.1272.96 87.5572.64



Table 7Training and testing time, in seconds, of NNs with all the labeled training points.

Data type SVNN NN-LM TNN

Breast Cancer datasetTrain 24.53 12.55 74.28Test 0.01 0.01 0.01

Haberman datasetTrain 69.33 14.12 78.37Test 0.01 0.01 0.01

Hepatitis datasetTrain 29.56 16.76 65.54Test 0.01 0.01 0.01

BCI datasetTrain 185.34 32.58 214.76Test 0.01 0.01 0.01

Digit1 datasetTrain 684.95 92.67 996.47Test 0.02 0.02 0.02

g241c datasetTrain 673.65 89.73 989.75Test 0.02 0.02 0.02

Text datasetTrain 990.35 193.18 1287.83Test 0.06 0.06 0.06

Table 8Training and testing time, in seconds, of SVMs with all the labeled training points.

Data type SVM-l SVM-rbf TSVM-l TSVM-rbf

Breast Cancer datasetTrain 0.02 0.13 0.28 1.32Test 0.01 0.02 0.01 0.03

Haberman datasetTrain 37.39 52.45 368.71 87.42Test 0.01 0.05 0.01 0.05

Hepatitis datasetTrain 1.23 241.36 22.32 287.22Test 0.01 0.12 0.01 0.12

BCI datasetTrain 5.45 10.36 9.22 75.36Test 0.01 0.08 0.01 0.06

Digit1 datasetTrain 2.53 19.74 838.29 1442.94Test 0.04 2.56 0.06 4.16

g241c datasetTrain 1.56 32.04 84.19 242.90Test 0.03 1.59 0.06 3.92

Text datasetTrain 8.36 3764.61 27.98 5138.61Test 0.43 3.05 0.63 3.16


As regards the transductive training methods, TSVM and TSVM-rbf predicted all the testing data of the UCI datasets in the majorityclass when using only 10 labeled data, i.e. the constraint on thepredicted class proportion was violated (see the value of BER¼50%in the first three rows of Table 4). Therefore, TNN was the bestapproach for all the datasets, excepting the Text dataset, for whichthe linear TSVM was the best approach. By using all the trainingdata, TNN had the best values of accuracy, BER, and PPV in five ofthe seven datasets.

As regards the training and testing time, SVM was, in most of theexperiments, less expensive in training than the proposed methods;however, the testing time reveals the main advantage of SVNN andTNN, which can perform nonlinear classification a few hundred timesfaster than SVM with nonlinear kernels, as can be seen, for instance,in the fifth row of Table 7. In this case, TSVM has 231 support-vectors,while the TNN has only two hidden neurons; therefore, taking intoaccount that Digit1 dataset has 241 attributes, from the models(1) and (19) it is possible to realize that the decision function of TSVMrequires the calculation of 56 133 products, 55 672 sums, and 232nonlinear functions, while the decision function of TNN only requiresthe calculation of 484 products, 487 sums, and 2 nonlinear functions.Such fact is especially relevant in applications such as on-the-flyobject detection, in which each image frame has to be scanned by asliding window, generating several thousands of cropped images tobe classified.

By comparing Tables 9 and 5, it is possible to verify the positiveinfluence of eigenvalue decay on the performance of SVNN. Table 9also reveals the importance of the term about class proportion onthe performance of TNN. Note that TNN-C3 is unsatisfactory inclassifying the first two datasets, i.e. TNN predicted all the testingdata of datasets Breast Cancer and Haberman in the majority class(see the value of BER¼50% in the last cells of the first two rows ofTable 9).

7. Conclusion

The analysis presented in this paper indicates that by applyingeigenvalue decay it is possible to increase the classification margin,which improves the generalization capability of NNs. The intro-duction of eigenvalue decay allowed the synthesis of two novelSVM-like training methods for NNs, including a transductivealgorithm. These methods are suitable options for a faster non-linear classification, by avoiding the time expensive decision-function of nonlinear SVMs, which may hinder on-the-fly applica-tions, such as pedestrian detection (e.g. see Section 4.2 of [15]).The experiments indicate that, regarding the classificationaccuracy, SVNN and TNN are similar to nonlinear SVM andTSVM; however, regarding the testing time, the proposed methodswere significantly faster than nonlinear SVMs. The experimentsalso indicate that TNN can take advantage of unlabeled data,especially when few labeled data are available, as can be seen inTable 4.

References

[1] F. Dan Foresee, M. Hagan, Gauss–Newton approximation to Bayesian learning,in: International Conference on Neural Networks, 1997, vol. 3, IEEE, 1997,pp. 1930–1935.

[2] D. MacKay, Bayesian interpolation, Neural Computation 4 (3) (1992) 415–447.[3] N. Treadgold, T. Gedeon, Exploring constructive cascade networks, IEEE

Transactions on Neural Networks 10 (6) (1999) 1335–1350, http://dx.doi.org/10.1109/72.809079.

[4] C. Bishop, Neural Networks for Pattern Recognition, 1st ed., Oxford UniversityPress, USA, 1996.

[5] Y. Jin, Neural network regularization and ensembling using multi-objectiveevolutionary algorithms, in: Congress on Evolutionary Computation (CEC'04),IEEE, IEEE Press, 2004, pp. 1–8.

[6] O. Ludwig, Study on Non-parametric Methods for Fast Pattern Recognitionwith Emphasis on Neural Networks and Cascade Classifiers, Ph.D. Dissertation,University of Coimbra, Coimbra, Portugal, 2012.

[7] O. Ludwig, U. Nunes, Novel maximum-margin training algorithms for super-vised neural networks, IEEE Transactions on Neural Networks 21 (6) (2010)972–984.

[8] A. Deb, M. Gopal, S. Chandra, SVM-based tree-type neural networks as a criticin adaptive critic designs for control, IEEE Transactions on Neural Networks 18(4) (2007) 1016–1030, http://dx.doi.org/10.1109/TNN.2007.899255.

[9] S. Abe, Support Vector Machines for Pattern Classification (Advances in PatternRecognition), Springer-Verlag New York Inc., Secaucus, NJ, USA, 2005.

[10] K. Hornik, M. Stinchcombe, H. White, Universal approximation of an unknownmapping and its derivatives using multilayer feedforward networks, NeuralNetworks 3 (5) (1990) 551–560.

[11] H. Demuth, M. Beale, Neural Network Toolbox User's Guide, The MathWorksInc. (2013), URL: ohttp://www.mathworks.com/help/pdf_doc/nnet/nnet_ug.pdf4 .

[12] P. Bartlett, The sample complexity of pattern classification with neuralnetworks: the size of the weights is more important than the size of thenetwork, IEEE Transactions on Information Theory 44 (2) (1998) 525–536.

[13] R. Horn, C. Johnson, Matrix Analysis, Cambridge University Press, 1990.[14] V. Vapnik, Statistical Learning Theory, John Wiley, 1998.[15] M. Enzweiler, D. Gavrila, Monocular pedestrian detection: survey and experi-

ments, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (12)(2009) 2179–2195.

[16] O. Ludwig, U. Nunes, B. Ribeiro, C. Premebida, Improving the generalizationcapacity of cascade classifiers, IEEE Transactions on Cybernetics PP (99) (2013)1–12. http://dx.doi.org/10.1109/TCYB.2013.2240678.

[17] S. Shalev-Shwartz, Y. Singer, N. Srebro, Pegasos: primal estimated sub-gradientsolver for svm, in: Proceedings of the 24th International Conference onMachine Learning, ACM, 2007, pp. 807–814.

[18] M. Adankon, M. Cheriet, Genetic algorithm-based training for semi-supervisedsvm, Neural Computing & Applications 19 (8) (2010) 1197–1206.

[19] D. Nguyen, B. Widrow, Improving the learning speed of 2-layer neuralnetworks by choosing initial values of the adaptive weights, in: InternationalJoint Conference on Neural Networks, IJCNN'90, IEEE, 1990, pp. 21–26.

[20] O. Ludwig, P. Gonzalez, A. Lima, Optimization of ANN applied to non-linearsystem identification, in: Proceedings of the 25th IASTED InternationalConference on Modeling, Identification, and Control, ACTA Press, Anaheim,CA, USA, 2006, pp. 402–407.

[21] O. Ludwig, U. Nunes, R. Araujo, L. Schnitman, H. Lepikson, Applications ofinformation theory, genetic algorithms, and neural models to predict oil flow,Communications in Nonlinear Science and Numerical Simulation 14 (7) (2009)2870–2885.

[22] O. Chapelle, B. Schölkopf, A. Zien, Semi-Supervised Learning, vol. 2, MIT PressCambridge, MA, 2006.

Table 9Performance of SVNN without eigenvalue decay and TNN without the term aboutclass proportion.

Data type SVNN �λmin SVNN-λmin�λmax TNN �C3





Digit1 datasetAcc 94.6772.71 90.9372.63 95.3372.20BER 5.8372.75 10.1672.50 5.7272.28PPV 8772.65 84.0473.06 89.2473.21




http://refhub.elsevier.com/S0925-2312(13)00833-3/othref0005



http://refhub.elsevier.com/S0925-2312(13)00833-3/sbref2

http://dx.doi.org/10.1109/72.809079

http://dx.doi.org/10.1109/72.809079

http://dx.doi.org/10.1109/72.809079

http://dx.doi.org/10.1109/72.809079












http://dx.doi.org/10.1109/TNN.2007.899255











http://www.mathworks.com/help/pdf_doc/nnet/nnet_ug.pdf

http://www.mathworks.com/help/pdf_doc/nnet/nnet_ug.pdf











dx.doi.org/10.1109/TCYB.2013.2240678




















Oswaldo Ludwig received the M.Sc. degree in electricalengineering from Federal University of Bahia, Salvador,Brazil, in 2004 and the Ph.D. degree in electricalengineering from the University of Coimbra, Coimbra,Portugal, in 2012. He is an Assistant Professor ofComputer and Electrical Engineering at the Universityof Coimbra. His current research focuses on machinelearning with application on several fields, such aspedestrian detection in the domain of intelligent vehi-cles and biomedical data mining.

Urbano Nunes received the Lic. and Ph.D. degrees inelectrical engineering from the University of Coimbra,Coimbra, Portugal, in 1983 and 1995, respectively. He isa Full Professor with the Computer and ElectricalEngineering Department of Coimbra University. He isalso the Vice Director of the Institute for Systems andRobotics where he is the Coordinator of the Automationand Mobile Robotics Group. He has been involved with/responsible for several funded projects at both nationaland international levels in the areas of mobile robotics,intelligent vehicles, and intelligent transportation sys-tems. His research interests are several areas in con-nection with intelligent vehicles and human-centered

mobile robotics with more than 120 published papers in these areas.

Rui Araujo received the M.Sc. degree in Systems andAutomation, in 1994 and the Ph.D. degree in electricalengineering from the University of Coimbra, in 2000,both from the University of Coimbra, Coimbra, Portugal.He is an Assistant Professor at the Department ofElectrical and Computer Engineering of the Universityof Coimbra. His research interests include robotics,learning systems, control systems and theory, fuzzysystems, neural networks, real-time systems, architec-tures, and systems for controlling robots.


Date post:	25-Apr-2017
Category:	Documents
Upload:	frangky200787
View:	223 times
Download:	4 times

EigenValue Decay a New Method for Neural Network Regularization

Documents