Parameter Optimization for Support Vector
Machine Based on Nested Genetic Algorithms
Pin Liao, Xin Zhang, and Kunlun Li College of Science and Technology, Nanchang University, Nanchang, China
Email: [email protected], [email protected], [email protected]
Yang Fu
Tian Ge Interactive Holdings Limited, Beijing, China
Email: [email protected]
Mingyan Wang and Sensen Wang
Information Engineering School, Nanchang University, Nanchang, China
Email: [email protected], [email protected]
Abstract—Support Vector Machine (SVM) is a popular and
landmark classification method based on the idea of
structural risk minimization, which has obtained extensive
adoption across numerous domains such as pattern
recognition, regression, ranking, etc. In order to achieve
satisfying generalization, penalty and kernel function
parameters of SVM must be carefully determined. This
paper presents an original method based on two nested real-
valued genetic algorithms (NRGA), which can optimize the
parameters of SVM efficiently and speed up the parameter
optimization by orders of magnitude compared to the
traditional methods which optimize all the parameters
simultaneously. As illustrated by the experimental results on
gender classification of facial images, the proposed
parameter optimization method, NRGA, can quickly develop
an SVM classifier with superior classification accuracy,
owing to its efficiency and the resulting search power.
Index Terms—support vector machine, parameter
optimization, genetic algorithm, nested optimization method,
gender classification of facial images
I. INTRODUCTION
Support Vector Machine (SVM) [1], [2] is a well-
known and successful two-class classification method
based on statistical learning theory. SVM is also known
as maximum margin classifier, which maximizes the
geometric margin between two classes in order to reach a
minimum generalization error. Thus, SVM has been
successfully employed for many diverse classification
applications, mainly because of its extraordinary
generalization performance.
Manuscript received October 12, 2014; revised December 26, 2014.

Properly chosen SVM parameters are critical for
achieving high generalization performance. The parameters
of SVM to be optimized are the penalty parameter C and
the kernel function parameters. The penalty parameter is
introduced to make a trade-off between minimizing the
empirical risk and minimizing the model complexity, while
kernel function parameters, such as the gamma γ of the
radial basis function (RBF) kernel, define the non-linear
mapping from the input space to some high-dimensional
feature space. For convenience and without loss of
generality, this work uses RBF as the kernel function of SVM.
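For concreteness, the RBF kernel K(x_i, x_j) = exp(−γ‖x_i − x_j‖²) can be sketched as the following helper (an illustrative function of ours, not code from this work):

```python
import numpy as np

def rbf_kernel(x1, x2, gamma):
    """RBF (Gaussian) kernel value: exp(-gamma * ||x1 - x2||^2)."""
    diff = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))
```

A larger γ makes the kernel more local, so tuning it jointly with C is what the rest of the paper addresses.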
Hence this paper proposes a novel method to optimize
the parameters of SVM by nesting two real-valued
genetic algorithms (NRGA). NRGA is tremendously
efficient compared to the existing parameter optimization
techniques which simultaneously search all the
parameters. In NRGA, the inner-loop real-valued genetic
algorithm (RGA) involves the optimization of penalty
factor C with fixed kernel function parameters, and the
outer-loop RGA is in charge of optimizing kernel
function parameters. Thus, at each step of the outer
loop, the kernel values remain unchanged and can be
reused for all iterations of the inner loop. Therefore,
NRGA can reduce the computing time of the SVM
parameter optimization by orders of magnitude
(dependent on the iteration number of the inner loop)
compared to the traditional optimization approaches.
The remainder of this paper is organized as follows.
Section II reviews pertinent literature on parameter
optimization for SVM. Section III gives a brief
introduction to SVM. Section IV describes the parameter
optimization method based on two nested RGAs. Section
V presents the experimental results from using the
proposed method for gender classification of facial
images. Conclusions are given in Section VI.
II. RELATED WORK
The optimal parameter search on SVM plays a crucial
role in establishing an efficient SVM model with high
prediction accuracy and stability. Improper parameter
settings result in poor classification performance, while
the optimal categorization accuracy of SVM arises from
seeking optimal parameters.
507
Journal of Automation and Control Engineering Vol. 3, No. 6, December 2015
©2015 Engineering and Technology Publishing    doi: 10.12720/joace.3.6.507-511
It was suggested by Vapnik in [2] that the parameters
can be set directly using a priori knowledge of the
specific problem to be solved. However, usually the
method is neither practical nor effective, and is seldom
adopted by researchers.

As a simple and direct algorithm for finding suitable values for the SVM parameters [3]-[5], grid search is apt to be time consuming, and the obtained solutions depend heavily on the selected grid granularity and search space. In [3] a "grid-search" on C and γ using cross-validation was recommended: the authors were not confident in methods that avoid an exhaustive parameter search via approximations or heuristics, and they noted that the grid search is easily parallelized since each (C, γ) pair is independent. An approach was presented in [4] for selecting SVM parameters based on ideas from design of experiments, which starts with a very coarse grid covering the entire search range and iteratively refines both the grid resolution and the search boundaries, keeping the number of samples at each step nearly constant. A heuristic search strategy was presented in [5] to determine a good combination of parameter values while avoiding the high computational cost of grid search.
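As a sketch of the grid-search procedure just described, the following uses scikit-learn's GridSearchCV on toy two-class data (a stand-in for the setups in [3]-[5], not the authors' own code) with an exponentially spaced grid over (C, γ):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Toy two-class data; in practice these would be real feature vectors.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(40, 2) + 2, rng.randn(40, 2) - 2])
y = np.array([1] * 40 + [-1] * 40)

# Exponentially spaced grid over (C, gamma), in the spirit of [3].
param_grid = {
    "C": [2.0 ** k for k in range(-5, 16, 4)],
    "gamma": [2.0 ** k for k in range(-15, 4, 4)],
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Every (C, γ) cell is trained independently, which is why the grid parallelizes trivially but its cost grows with the product of the two grid resolutions.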
Genetic algorithm (GA) has been widely adopted for parameter setting of SVM [6]-[8], as it tends to be good at finding generally good global solutions. In [6], a GA-based method was proposed to simultaneously optimize the feature subset and the parameters for SVM. In [7], an RGA was employed to optimize the parameters of SVM for predicting bankruptcy, because RGA is more straightforward, faster and more efficient than the binary GA. In [8], the asymptotic behaviors of SVM are fused with GA, which thereby directs the search to the straight line of optimal generalization error in the hyperparameter space.
Gradient descent methods have also been employed to determine the SVM parameters [9], [10]. In [9] the Broyden–Fletcher–Goldfarb–Shanno (BFGS) quasi-Newton algorithm was used to arrive at good values of the hyperparameters. The method in [10] tuned SVM parameters with a gradient descent algorithm over the parameter set to minimize estimates of the generalization error. As is well known, gradient descent is sensitive to the initial parameters and usually converges to a local optimum, which makes it impractical for complex optimization problems.
As a generic probabilistic metaheuristic for global optimization, simulated annealing (SA) has been applied to optimize SVM parameters [11], [12], because it is excellent at escaping local optima and is commonly good at finding an approximate global optimum. Two classical techniques, GA and SA, were evaluated for optimizing SVM parameters in [11]; the two methods obtained similar results, while GA tended to be faster and SA needed fewer parameter settings. In [12] a modified SA (MSA) algorithm was developed and combined with the linear least-squares and gradient descent paradigms for parameter optimization in SVM.
As a new optimization method based on the biologic
immune principle of living beings, artificial immune
technique can effectively avoid premature
convergence and preserve the diversity of solutions. Thus,
some artificial immune algorithm (AIA) based
approaches were presented for SVM parameter
optimization [13]-[16]. An artificial immune based
parameter optimization of SVM was introduced in [13],
which was able to effectively resolve the conflict between
local and global searching in the process of feature
selection and parameter selection. An artificial
immunization algorithm was used in [14] to optimize the
parameters of SVM, improving the overall capability of
the SVM classifier for fault diagnosis of a turbo-pump
rotor; there, the concentration and variety of the
antibodies are utilized to automatically select the
parameters. In [15], an immune algorithm was given to
search optimal parameter of sphere-shaped SVM for
constructing 3D model for encephalic tissues, which not
only has fast convergence speed and powerful searching
ability, but also avoids the phenomena of
degeneration and immaturity. A multi-objective
artificial immune algorithm was developed to obtain
optimal multiple-kernel and penalty parameters in [16],
which not only maximizes the accuracy rate but also
minimizes the number of support vectors.

Particle swarm optimization (PSO) is a method for global numerical optimization based on swarm intelligence, which makes few or no assumptions about the optimization problem. An approach based on PSO was developed for parameter determination of SVM [17]. The developed approach not only tunes the parameter values of SVM, but also selects an optimal feature subset, maximizing the classification accuracy rate of SVM. The experimental results demonstrate that the classification accuracy rate of the developed approach surpasses those of grid search and many other approaches, and that the PSO + SVM approach yields results similar to GA + SVM.
III. SUPPORT VECTOR MACHINE FOR CLASSIFICATION
In this section a brief introduction to SVM for classification is given; a more comprehensive description of SVM is provided in [2]. SVM is a supervised learning algorithm which can solve linear and nonlinear binary classification problems. The standard two-class soft-margin SVM classification problem is considered in this paper, which tries to find a maximum-margin hyperplane that separates the two classes of samples with low generalization error.
Given n training samples of two classes $x_i \in \mathbb{R}^d$, $i = 1, \dots, n$, with their corresponding labels $y_i \in \{-1, 1\}$, where d is the dimension of the data points $x_i$, training SVM requires the solution of the following (primal) quadratic programming (QP) optimization problem to obtain the weight vector w and the offset b:

$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i$$
$$\text{s.t.}\quad y_i\big(w \cdot \phi(x_i) + b\big) \ge 1 - \xi_i,\qquad \xi_i \ge 0,\ i = 1, \dots, n \qquad (1)$$
Data point $x_i$ can be mapped into a high-dimensional
space by the mapping function $\phi(\cdot)$ if the problem is nonlinear. The
penalty coefficient C controls the width of the margin, and the
non-negative slack variables $\xi_i$ measure the degree of
misclassification of the data points $x_i$. The separating
hyper-plane’s generalization ability is determined by the
margin, calculated as $2/\|w\|$. The minimization process in
formulation (1) is known as structure risk minimization,
which optimizes the trade-off between empirical error
minimization and margin maximization. Usually the
primal problem of (1) can be solved by solving its dual
problem (2):

$$\min_{\alpha}\ \frac{1}{2}\alpha^{T} Q \alpha - e^{T}\alpha$$
$$\text{s.t.}\quad y^{T}\alpha = 0,\qquad 0 \le \alpha_i \le C,\ i = 1, \dots, n \qquad (2)$$
where e is the vector of all ones, αi is the Lagrange
multiplier, and Q is an n by n symmetric and positive
semidefinite matrix, whose elements are given by
$Q_{ij} = y_i y_j K(x_i, x_j)$, where $K(x_i, x_j) = \phi(x_i)^{T}\phi(x_j)$
is called the kernel function.
SVM is trained to get the value of each αi by solving
the QP problem (2) using training data points.
Subsequently, the following function is used to classify
any new data point x:

$$g(x) = \operatorname{sign}\Big(\sum_{i=1}^{n} y_i \alpha_i K(x_i, x) + \beta\Big)$$
where the bias term β is computed during the training as
well. Depending on the selected kernel, β may implicitly
be part of the kernel function, and it is not
required if the RBF kernel is chosen.
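A minimal sketch of evaluating this decision function with an RBF kernel, assuming the support vectors, multipliers α_i, labels y_i and bias β have already been obtained from training (function name and signature are ours, for illustration):

```python
import numpy as np

def svm_decision(x, support_vectors, alphas, labels, gamma, beta=0.0):
    """Evaluate g(x) = sign(sum_i alpha_i * y_i * K(x_i, x) + beta)
    with the RBF kernel K(x_i, x) = exp(-gamma * ||x_i - x||^2)."""
    x = np.asarray(x, dtype=float)
    sv = np.asarray(support_vectors, dtype=float)
    # Kernel values between every support vector and the query point.
    k = np.exp(-gamma * np.sum((sv - x) ** 2, axis=1))
    score = np.dot(np.asarray(alphas) * np.asarray(labels), k) + beta
    return 1 if score >= 0 else -1
```

Only the support vectors (those with α_i > 0) contribute to the sum, which is why SVM prediction cost scales with the number of support vectors rather than with the training set size.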
IV. PARAMETER OPTIMIZATION BASED ON NESTING
REAL-VALUED GENETIC ALGORITHMS
Linear, polynomial, sigmoid and RBF kernels are usually
employed in SVM. RBF is particularly widely used, for its
simplicity with only one parameter γ and its tendency to
lead to good generalization. Thus, RBF is adopted in this study as the
kernel function in SVM without loss of generality, since
the parameters of other kernel functions can also be
sought with the same technique.
Training SVM involves the solution of a large and
convex QP optimization problem. With the purpose of
decreasing the optimization cost, some heuristic
algorithms are developed to decompose the large QP
problem into a series of sub-problems. Typically the
sequential minimal optimization (SMO) [18] algorithm
employs only two working samples, and optimizes the
reduced sub-problems with a simple analytic method.
And LIBSVM [19] implements an improved SMO algorithm,
which makes use of second-order information to accelerate
convergence. Therefore, LIBSVM is adopted to train
SVM in this work.
As a search heuristic based on Darwinian natural
selection and genetics in biological systems, GA has been
popularly and effectively applied to many optimization
problems. GA is well suited to solve complex non-linear
optimization problems without requiring gradient
information or a priori knowledge about model properties.
In GA, a population of candidate solutions to an
optimization problem is evolved toward better solutions
using operators inspired by natural evolution, such as
selection, crossover, and mutation. Solutions are
traditionally coded in binary as strings of 0s and 1s.
Real-valued GA (RGA) differs from the traditional
binary GA in that it uses real values directly as the
parameters in a chromosome, so RGA needs no
encoding and decoding process to compute the fitness
values of individuals. Hence, RGA is more direct, faster
and more convenient than the binary GA. Accordingly, the two
parameters C and γ of SVM directly form the
chromosome in this work.
The fitness value of RGA is determined by the
accuracy rate of cross-validation. Crossover,
mutation and selection operators are employed to produce
the offspring of the current population. The roulette
wheel technique is used to select which chromosomes can
be the survivors to the next generation.
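The roulette-wheel selection step can be sketched as follows (a generic implementation of ours; the experiments actually use GAlib's built-in operators):

```python
import numpy as np

def roulette_select(population, fitness, n_survivors, rng):
    """Pick chromosomes with probability proportional to fitness."""
    fitness = np.asarray(fitness, dtype=float)
    probs = fitness / fitness.sum()
    idx = rng.choice(len(population), size=n_survivors, p=probs)
    return [population[i] for i in idx]
```

Fitter chromosomes occupy a larger slice of the "wheel" and are therefore more likely to survive, while low-fitness chromosomes still have a nonzero chance, which preserves diversity.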
The crossover operation on real-valued chromosomes is
given as follows:

$$G_1^{new} = \min(G_1^{old}, G_2^{old}) - 0.5\,\big|G_1^{old} - G_2^{old}\big|$$
$$G_2^{new} = \max(G_1^{old}, G_2^{old}) + 0.5\,\big|G_1^{old} - G_2^{old}\big|$$

where the pair of chromosomes before the crossover operation
are denoted $G_1^{old}$ and $G_2^{old}$, and the pair of new
chromosomes after the crossover operation are denoted
$G_1^{new}$ and $G_2^{new}$.
The mutation operation adopted here creates a random
real value in a given range.
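A sketch of these two operators, under the assumption (our reading of the crossover formulas) that the offspring straddle the parents' interval by half their distance:

```python
import random

def crossover(g1, g2):
    """Real-valued crossover: offspring straddle the parents' interval
    by half the distance between the parents on each side."""
    lo, hi = min(g1, g2), max(g1, g2)
    half = 0.5 * abs(g1 - g2)
    return lo - half, hi + half

def mutate(gene, low, high, rng=random):
    """Mutation: replace the gene with a random value in [low, high]."""
    return rng.uniform(low, high)
```

In practice the offspring would also be clipped to the search range of the parameter, since the crossover can step outside it.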
The computation complexity of SVM parameter
optimization can be reduced by orders of magnitude
using NRGA compared to the traditional methods [3]-[17]
all of which optimize the penalty factor and the kernel
parameter simultaneously. In a different way, NRGA
optimizes the SVM parameters by nesting two loops of
RGA. In the traditional methods, the kernel
matrix values must be updated for each chromosome
{C, γ} in the population, which is the most computationally
expensive part of SMO. In NRGA, however, the inner-loop
RGA optimizes C and the outer-loop RGA optimizes γ, so
that at each outer step with one fixed γ the kernel matrix
values remain fixed and can be reused for all the iterations
in the inner loop, corresponding to many different values
of C in a population. As a remarkable result, the
computation cost can be significantly reduced by orders of
magnitude, dependent on the size of the C population. It is
also notable that NRGA searches the same global range as
the traditional methods do.
Fig. 1 shows the flowchart of NRGA for SVM
parameter-optimization by nesting two loops of RGA.
The algorithm of NRGA can be expressed as follows:
1. Initialize a real-valued γ population.
2. Evaluate the fitness of each γ. For each γ:
2.1 Initialize a real-valued C population.
2.2 Evaluate the fitness of each C with the fixed γ,
i.e.
a) Train SVM classifiers using cross
validation with each C and the fixed γ.
b) Calculate the fitness corresponding to the
accuracy rate of cross validation.
2.3 Set the best fitness over the C population, with
the fixed γ, as the fitness of that γ.
2.4 Go to Step 3 if the termination criterion is satisfied,
or go to Step 2.5 otherwise.
2.5 Evolve a new C population by crossover,
mutation and selection. Go to Step 2.2.

Figure 1. The flowchart of NRGA
3. Go to Step 5 if termination criterion is satisfied, or
go to Step 4 otherwise.
4. Evolve a new γ population by crossover, mutation
and selection. Go to Step 2.
5. Select the final {C, γ} corresponding to the best
fitness (i.e. the highest accuracy of cross validation).
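The nested structure above, and in particular the reuse of the kernel matrix across the whole inner C loop, can be sketched as follows. The `evaluate(K, c)` hook stands in for LIBSVM cross-validation accuracy, and random resampling stands in for the crossover/mutation/selection operators; the population sizes and iteration counts are illustrative assumptions:

```python
import numpy as np

def nrga(X, evaluate, gamma_pop_size=4, c_pop_size=4,
         outer_iters=3, inner_iters=3, seed=0):
    """Sketch of the nested search: the outer loop evolves gamma, and
    for each fixed gamma the kernel matrix is computed ONCE and reused
    while the inner loop evolves C."""
    rng = np.random.default_rng(seed)
    # Pairwise squared distances, shared by every gamma evaluation.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    best = (-np.inf, None, None)  # (fitness, C, gamma)
    gammas = 2.0 ** rng.uniform(-15, 3, gamma_pop_size)
    for _ in range(outer_iters):
        for g in gammas:
            K = np.exp(-g * sq)            # computed once per gamma ...
            cs = 2.0 ** rng.uniform(-5, 15, c_pop_size)
            for _ in range(inner_iters):   # ... reused for every C here
                for c in cs:
                    fit = evaluate(K, c)
                    if fit > best[0]:
                        best = (fit, c, g)
                # Resampling stands in for evolving the C population.
                cs = 2.0 ** rng.uniform(-5, 15, c_pop_size)
        # Resampling stands in for evolving the gamma population.
        gammas = 2.0 ** rng.uniform(-15, 3, gamma_pop_size)
    return best
```

Each fixed γ amortizes one kernel-matrix computation over inner_iters × c_pop_size fitness evaluations, which is the source of the claimed orders-of-magnitude speed-up.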
V. EXPERIMENTS
The experimental platform is a Dell x86 server with dual
2.4 GHz Intel Xeon E5620 CPUs and 72 GB RAM, running
Windows Server 2003. Microsoft Visual C++ 2010 is the
development environment. GAlib, a C++ genetic algorithm
library [20], is adopted; it was developed for easily trying
various objective functions, representations, genetic
operators, and genetic algorithms.
To assess the performance of NRGA for SVM
parameter optimization, we conduct experiments for
gender classification of facial images on a large labeled
database created by us. The large-scale database includes
over 20,000 facial images collected from the Internet.
And some synthetic facial images are derived from
original images with slight geometric transforms of
translation, scaling, rotation and mirror-reflection, so that
a training set with 80,000 samples and a cross-validation
set with 20,000 samples are constructed. In addition, the
testing set consists of 4400 original facial images, which
are independent of the training samples and the cross-
validation samples. A feature vector of dimension 4880
is extracted using Gabor filters.
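The Gabor feature extraction is not detailed in the paper; a single real Gabor filter might be generated as below (all parameter values are purely illustrative), with responses from a bank of scales and orientations stacked to form the 4880-dimensional feature vector:

```python
import numpy as np

def gabor_kernel(ksize=9, sigma=2.0, theta=0.0, lambd=4.0, gamma=0.5):
    """One real Gabor filter: a cosine carrier modulated by a
    Gaussian envelope, oriented at angle theta."""
    half = ksize // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    env = np.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
    return env * np.cos(2 * np.pi * xr / lambd)
```

Convolving a facial image with such filters at several orientations θ and wavelengths λ yields the texture responses that serve as classification features.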
In the experiments, the search range of C is set as
[2^{-5}, 2^{15}] and the search range of γ as [2^{-15}, 2^{3}], as
recommended by [3]. For each RGA in NRGA, the population
size is set to 200, the crossover rate to 0.9, and the
mutation rate to 0.05.
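Initializing the real-valued populations over these ranges might look like the following sketch; sampling uniformly in log2 space is our assumption, not stated in the paper:

```python
import numpy as np

rng = np.random.default_rng(42)
POP_SIZE = 200        # population size used in the experiments
CROSSOVER_RATE = 0.9
MUTATION_RATE = 0.05

# Populations drawn uniformly in log2 space over the recommended
# ranges C in [2^-5, 2^15] and gamma in [2^-15, 2^3].
c_pop = 2.0 ** rng.uniform(-5, 15, POP_SIZE)
gamma_pop = 2.0 ** rng.uniform(-15, 3, POP_SIZE)
```

Log-scale sampling spreads the initial candidates evenly across the orders of magnitude that both parameters span, rather than crowding them near the upper end of the range.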
TABLE I. ACCURACY RATES OF DIFFERENT METHODS

Method             C        γ         Accuracy (%)
Grid Search        4        0.00225   96.9119
Traditional RGA    1.213    0.00829   97.1617
NRGA               1.095    0.00894   97.2071
Table I shows that NRGA outperforms grid search and
the traditional RGA, which search all the parameters
simultaneously. NRGA is far more efficient than the
traditional methods that optimize all the parameters
together, while searching the same global range; its
computation complexity is reduced by orders of magnitude,
dependent on the size of the C population. The superior
performance of NRGA likely results from this efficiency,
which allows it to examine many more candidate solutions
and thus to find a better solution than the traditional
methods.
VI. CONCLUSION
In this paper, a novel SVM parameter optimization
method, NRGA, is proposed. Based on two nested RGA,
NRGA reduces the computation cost significantly by
orders of magnitude (related to the size of C population)
compared to the traditional optimization techniques
which search all the parameters all together. The
remarkable efficiency and effectiveness of NRGA are
illustrated by the experimental results on gender
classification of facial images.
ACKNOWLEDGMENT
The authors wish to thank Professor Chih-Jen Lin and
his research team members for kindly providing the
LIBSVM tool, and to thank Dr. Matthew Wall for GAlib.
This research is partially sponsored by the Jiangxi
Province Education Department science and technology
project (GJJ13086), and Nanchang Key Laboratory of
Pattern Recognition.
REFERENCES
[1] C. Cortes and V. Vapnik, “Support-vector networks,” Machine
Learning, vol. 20, no. 3, pp. 273-297, 1995. [2] V. N. Vapnik, The Nature of Statistical Learning Theory, 2nd ed.
New York: Springer-Verlag, 1999.
[3] C. W. Hsu, C. C. Chang, and C. J. Lin. (2003). A practical guide to support vector classification. Technical Report. [Online].
Available: http://www.csie.ntu.edu.tw/˜cjlin/libsvm/ [4] C. Staelin, “Parameter selection for support vector machines,”
Technical Report HPL-2002-354 (R.1), HP Laboratories Israel,
2003.
[5] J. Wang, X. Wu, and C. Zhang, “Support vector machines based
on k-means clustering for real-time business intelligence systems,” International Journal of Business Intelligence and Data Mining,
vol. 1, no. 1, pp. 54–64, 2005.
[6] C. L. Huang and C. J. Wang, “A GA-based feature selection and parameters optimization for support vector machine,” Expert
Systems with Applications, vol. 31, no. 2, pp. 231–240, 2006. [7] C. H. Wu, G. H. Tzeng, Y. J. Goo, and W. C. Fang, “A real-
valued genetic algorithm to optimize the parameters of support
vector machine for predicting bankruptcy,” Expert Systems with Applications, vol. 32, no. 2, pp. 397–408, 2007.
[8] M. Y. Zhao, C. Fu, L. P. Ji, K. Tang, and M. T. Zhou, “Feature selection and parameter optimization for support vector machines:
A new approach based on genetic algorithm with feature
chromosomes,” Expert Systems with Applications, vol. 38, pp. 5197–5204, 2011.
[9] S. S. Keerthi, “Efficient tuning of SVM hyper parameters using radius/margin bound and iterative algorithms,” IEEE Trans. on
Neural Networks, vol. 13, no. 5, pp. 1225–1229, 2002.
[10] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee, “Choosing multiple parameters for support vector machines,”
Machine Learning, vol. 46, no. 1, pp. 131–159, 2002. [11] F. Imbault and K. Lebart, “A stochastic optimization approach for
parameter tuning of support vector machines,” in Proc. 17th
International Conf. Pattern Recognition, vol. 4, 2004, pp. 597–600.
[12] H. Q. Wang, D. S. Huang, and B. Wang, “Optimisation of radial basis function classifiers using simulated annealing algorithm for
cancer classification,” Electronics Letters, vol. 41, no. 11, pp. 630-
632, 2005. [13] H. G. Zhou and C. D. Yang, “Using immune algorithm to
optimize anomaly detection based on SVM,” in Proc. IEEE International Machine Learning and Cybernetics Conference,
Dalian, China, 2006, pp. 4257–4261.
[14] S. Yuan and F. Chu, “Fault diagnosis based on support vector machine with parameter optimization by artificial immunisation
algorithm,” Mechanical Systems and Signal Processing, vol. 21, no. 3, pp. 1318–1330, 2007.
[15] L. Guo, L. Wang, Y. Wu, W. Yan, and X. Shen, “Research on 3D
modeling for head MRI image based on immune sphere-shaped support vector machine,” in Proc. IEEE Eng. Med. Biol. Soc,
Lyon, France, 2007, pp. 1082–1085. [16] I. Aydin, M. Karakose, and E. Akin, “A multi-objective artificial
immune algorithm for parameter optimization in support vector
machine,” Applied Soft Computing, vol. 11, pp. 120–129, 2011. [17] S. W. Lin, K. C. Ying, S. C. Chen, and Z. J. Lee, “Particle swarm
optimization for parameter determination and feature selection of support vector machines,” Expert Systems with Applications, vol.
35, pp. 1817–1824, 2008.
[18] J. C. Platt, “Fast training of support vector machines using sequential minimal optimization,” in Advances in Kernel Methods-
Support Vector Learning, B. Schölkopf, C. Burges, and A. Smola, ed. MIT Press, 1999, pp. 185–208.
[19] C. C. Chang and C. J. Lin, “LIBSVM: A library for support vector
machines,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, no. 3, pp. 27, 2011.
[20] GAlib Web Site. [Online]. Available: http://lancet.mit.edu/ga [21] P. Liao, et al., “Nesting genetic algorithms for parameter
optimization of support vector machine,” in Proc. International
Academic Conf. Information Science and Communication Engineering, 2014.
Pin Liao was born in 1975. He received the B.S. degree in computer
science from Nanchang University, Nanchang, China in 1996, the M.S. degree in pattern recognition and intelligent system from Beijing
Institute of Technology, Beijing, China in 1999, and the Ph.D. degree in computer science from Institute of Computing Technology, Chinese
Academy of Sciences, China in 2003. He joined the faculty of College
of Science and Technology, Nanchang University, China, in 2005, where he is currently a Professor. His current research interests include
face recognition, computer vision, neural networks and machine learning.
Yang Fu received the B.S. degree in software engineering in 1996, and the M.S. degree in computer system organization in 2012, from
Nanchang University, Nanchang, China. He is now with Tian Ge Interactive Holdings Limited, Beijing, China.
Xin Zhang received the B.S. degree in computing mathematics in 1999, and the M.S. degree in computer science in 2005, from Southwest
Jiaotong University, Chengdu, China. He joined the faculty of College of Science and Technology, Nanchang University, China, in 2005,
where he is currently an Associate Professor. His current research
interests include pattern recognition, data mining and machine learning.