arXiv:2007.15745v1 [cs.LG] 30 Jul 2020

On Hyperparameter Optimization of Machine Learning Algorithms: Theory and Practice

Li Yang and Abdallah Shami

Department of Electrical and Computer Engineering, University of Western Ontario, 1151 Richmond St, London, Ontario, Canada N6A 3K7

Abstract

Machine learning algorithms have been used widely in various applications and areas. To fit a machine learning model into different problems, its hyper-parameters must be tuned. Selecting the best hyper-parameter configuration for machine learning models has a direct impact on the model's performance. It often requires deep knowledge of machine learning algorithms and appropriate hyper-parameter optimization techniques. Although several automatic optimization techniques exist, they have different strengths and drawbacks when applied to different types of problems. In this paper, optimizing the hyper-parameters of common machine learning models is studied. We introduce several state-of-the-art optimization techniques and discuss how to apply them to machine learning algorithms. Many available libraries and frameworks developed for hyper-parameter optimization problems are provided, and some open challenges of hyper-parameter optimization research are also discussed in this paper. Moreover, experiments are conducted on benchmark datasets to compare the performance of different optimization methods and provide practical examples of hyper-parameter optimization. This survey paper will help industrial users, data analysts, and researchers to better develop machine learning models by identifying the proper hyper-parameter configurations effectively.

Keywords: Hyper-parameter optimization, machine learning, Bayesian optimization, particle swarm optimization, genetic algorithm, grid search.

Email address: {lyang339, abdallah.shami}@uwo.ca (Li Yang and Abdallah Shami)

Preprint submitted to Neurocomputing August 3, 2020


1. Introduction

Machine learning (ML) algorithms have been widely used in many application domains, including advertising, recommendation systems, computer vision, natural language processing, and user behavior analytics [1]. This is because they are generic and demonstrate high performance in data analytics problems. Different ML algorithms are suitable for different types of problems or datasets [2]. In general, building an effective machine learning model is a complex and time-consuming process that involves determining the appropriate algorithm and obtaining an optimal model architecture by tuning its hyper-parameters (HPs) [3].

Two types of parameters exist in machine learning models. Model parameters can be initialized and updated through the data learning process (e.g., the weights of neurons in neural networks). Hyper-parameters cannot be directly estimated from data learning and must be set before training a ML model because they define the architecture of the model [4]. Hyper-parameters are used either to configure a ML model (e.g., the penalty parameter C in a support vector machine, and the learning rate to train a neural network) or to specify the algorithm used to minimize the loss function (e.g., the activation function and optimizer types in a neural network, and the kernel type in a support vector machine) [5].

To build an optimal ML model, a range of possibilities must be explored. The process of designing the ideal model architecture with an optimal hyper-parameter configuration is named hyper-parameter tuning. Tuning hyper-parameters is considered a key component of building an effective ML model, especially for tree-based ML models and deep neural networks, which have many hyper-parameters [6]. The hyper-parameter tuning process differs among ML algorithms due to their different types of hyper-parameters, including categorical, discrete, and continuous hyper-parameters [7]. Manual testing is a traditional way to tune hyper-parameters and is still prevalent in graduate student research, although it requires a deep understanding of the ML algorithms and their hyper-parameter value settings [8]. However, manual tuning is ineffective for many problems due to certain factors, including a large number of hyper-parameters, complex models, time-consuming model evaluations, and non-linear hyper-parameter interactions. These factors have inspired increased research in techniques for the automatic optimization of hyper-parameters, so-called hyper-parameter optimization (HPO) [9]. The main aim of HPO is to automate the hyper-parameter tuning process and make it possible for users to apply machine learning models to practical problems effectively [3]. The optimal model architecture of a ML model is expected to be obtained after a HPO process. Some important reasons for applying HPO techniques to ML models are as follows [6]:

1. It reduces the human effort required, since many ML developers spend considerable time tuning the hyper-parameters, especially for large datasets or complex ML algorithms with a large number of hyper-parameters.

2. It improves the performance of ML models. Many ML hyper-parameters have different optima to achieve optimal performance on different datasets or problems.

3. It makes the models and research more reproducible. Only when the same level of hyper-parameter tuning is implemented can different ML algorithms be compared fairly; hence, using the same HPO method on different ML algorithms also helps to determine the most suitable ML model for a specific problem.

It is crucial to select an appropriate optimization technique to detect optimal hyper-parameters. Traditional optimization techniques may not be suitable for HPO problems, since many HPO problems are not convex or differentiable optimization problems, and such techniques may return a local instead of a global optimum [10]. Gradient descent-based methods are a common type of traditional optimization algorithm that can be used to tune continuous hyper-parameters by calculating their gradients [11]. For example, the learning rate in a neural network can be optimized by a gradient-based method.

Compared with traditional optimization methods like gradient descent, many other optimization techniques are more suitable for HPO problems, including decision-theoretic approaches, Bayesian optimization models, multi-fidelity optimization techniques, and metaheuristic algorithms [7]. Apart from detecting continuous hyper-parameters, many of these algorithms also have the capacity to effectively identify discrete, categorical, and conditional hyper-parameters.

Decision-theoretic methods are based on the concept of defining a hyper-parameter search space and then detecting the hyper-parameter combinations in the search space, ultimately selecting the best-performing hyper-parameter combination. Grid search (GS) [12] is a decision-theoretic approach that exhaustively searches a fixed domain of hyper-parameter values. Random search (RS) [13] is another decision-theoretic method that randomly selects hyper-parameter combinations in the search space, given limited execution time and resources. In GS and RS, each hyper-parameter configuration is treated independently.
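As an illustration of the two decision-theoretic methods, the following minimal Python sketch runs grid search and random search against a toy objective. The function `validation_error`, the hyper-parameter names `lr` and `depth`, and all value ranges are hypothetical stand-ins for a real model evaluation; they are not taken from the cited works.

```python
import itertools
import random


def validation_error(config):
    """Hypothetical objective: smallest at lr=0.1 and depth=6."""
    return (config["lr"] - 0.1) ** 2 + 0.01 * (config["depth"] - 6) ** 2


def grid_search(grid):
    """Evaluate every combination of the listed values (exhaustive)."""
    names = list(grid)
    candidates = [dict(zip(names, values))
                  for values in itertools.product(*(grid[n] for n in names))]
    return min(candidates, key=validation_error)


def random_search(sampler, n_iter, seed=0):
    """Evaluate n_iter independently sampled configurations."""
    rng = random.Random(seed)
    candidates = [sampler(rng) for _ in range(n_iter)]
    return min(candidates, key=validation_error)


def sampler(rng):
    # Assumed ranges: log-uniform lr in [1e-3, 1], integer depth in [2, 8].
    return {"lr": 10 ** rng.uniform(-3, 0), "depth": rng.randint(2, 8)}


best_gs = grid_search({"lr": [0.001, 0.01, 0.1, 1.0], "depth": [2, 4, 6, 8]})
best_rs = random_search(sampler, n_iter=16)
print(best_gs, best_rs)
```

Both searches use 16 evaluations here. The grid can only return values that were listed in advance, whereas random search can land anywhere in the sampled ranges; in both cases no evaluation uses the result of another.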

Unlike GS and RS, Bayesian optimization (BO) [14] models determine the next hyper-parameter value based on the previous results of tested hyper-parameter values, which avoids many unnecessary evaluations; thus, BO can detect the best hyper-parameter combination within fewer iterations than GS and RS. To be applied to different problems, BO can model the distribution of the objective function using different models as the surrogate function, including Gaussian process (GP), random forest (RF), and tree-structured Parzen estimator (TPE) models [15]. BO-RF and BO-TPE can retain the conditionality of variables [15]. Thus, they can be used to optimize conditional hyper-parameters, like the kernel type and the penalty parameter C in a support vector machine (SVM). However, since BO models work sequentially to balance the exploration of unexplored areas and the exploitation of currently-tested regions, it is difficult to parallelize them.

Training a ML model often takes considerable time and space. Multi-fidelity optimization algorithms are developed to tackle problems with limited resources, the most common being bandit-based algorithms. Hyperband [16] is a popular bandit-based optimization technique that can be considered an improved version of RS. It generates small versions of datasets and allocates the same budget to each hyper-parameter combination. In each iteration of Hyperband, poorly-performing hyper-parameter configurations are eliminated to save time and resources.
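The elimination step can be sketched with successive halving, the subroutine on which Hyperband is built. The sketch below is a simplification and not Hyperband itself, which runs several such brackets with different starting budgets. The function `validation_loss`, the `noise` field, and the values of `min_budget` and `eta` are assumptions chosen for illustration.

```python
import random


def validation_loss(config, budget):
    """Hypothetical loss: the evaluation noise shrinks as the budget grows."""
    return (config["lr"] - 0.3) ** 2 + config["noise"] / budget


def successive_halving(configs, min_budget=1, eta=2):
    """Keep the best 1/eta of the configurations at each rung."""
    survivors, budget, rung_sizes = list(configs), min_budget, []
    while len(survivors) > 1:
        rung_sizes.append(len(survivors))
        # Every surviving configuration receives the same budget in this rung.
        ranked = sorted(survivors, key=lambda c: validation_loss(c, budget))
        survivors = ranked[:max(1, len(ranked) // eta)]
        budget *= eta  # survivors are re-evaluated with a larger budget
    rung_sizes.append(len(survivors))
    return survivors[0], rung_sizes


rng = random.Random(0)
configs = [{"lr": rng.random(), "noise": rng.uniform(0.0, 0.05)}
           for _ in range(8)]
best, rung_sizes = successive_halving(configs)
print(best, rung_sizes)
```

With eight starting configurations and `eta=2`, the number of surviving configurations per rung is 8, 4, 2, 1, so most of the total budget is spent on the configurations that looked promising at low cost.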

Metaheuristic algorithms are a set of techniques used to solve complex, large-search-space, and non-convex optimization problems, to which HPO problems belong [17]. Among all metaheuristic methods, genetic algorithm (GA) [18] and particle swarm optimization (PSO) [19] are the two most prevalent for HPO problems. Genetic algorithms detect well-performing hyper-parameter combinations in each generation, and pass them to the next generation to identify the best-performing combination. In PSO algorithms, each particle communicates with other particles to detect and update the current global optimum in each iteration until the final optimum is detected. Metaheuristics can efficiently explore the search space to detect optimal or near-optimal solutions. Hence, they are particularly suitable for HPO problems with a large configuration space due to their high efficiency. For instance, a deep neural network (DNN) often has a large configuration space with multiple hyper-parameters, including the activation and optimizer types, the learning rate, the drop-out rate, etc.

Although using HPO algorithms to tune the hyper-parameters of ML models greatly improves the model performance, certain other aspects, like their computational complexity, still have much room for improvement. On the other hand, since different HPO models have their own advantages and suitable problems, overviewing them is necessary for proper optimization algorithm selection in terms of different types of ML models and problems.

This paper makes the following contributions:

1. It reviews common ML algorithms and their important hyper-parameters.

2. It analyzes common HPO techniques, including their benefits and drawbacks, to help apply them to different ML models by appropriate algorithm selection in practical problems.

3. It surveys common HPO libraries and frameworks for practical use.

4. It discusses the open challenges and research directions of the HPO research domain.

In this survey paper, we begin with a comprehensive introduction of the common optimization techniques used in ML hyper-parameter tuning problems. Section 2 introduces the main concepts of mathematical optimization and hyper-parameter optimization, as well as the general HPO process. In Section 3, we discuss the key hyper-parameters of common ML models that need to be tuned. Section 4 covers the various state-of-the-art optimization approaches that have been proposed for tackling HPO problems. In Section 5, we analyze different HPO methods and discuss how they can be applied to ML algorithms. In Section 6, we provide an introduction to various public libraries and frameworks that are developed to implement HPO. Section 7 presents and discusses the experimental results of using HPO on benchmark datasets for HPO method comparison and practical use case demonstration. In Section 8, we discuss several research directions and open challenges that should be considered to improve current HPO models or develop new HPO approaches. We conclude the paper in Section 9.

2. Mathematical Optimization and Hyper-parameter Optimization Problems

The key process of machine learning is to solve optimization problems. To build a ML model, its weight parameters are initialized and optimized by an optimization method until the objective function value approaches a minimum value and the accuracy rate approaches a maximum value [20]. Similarly, hyper-parameter optimization methods aim to optimize the architecture of a ML model by evaluating the optimal hyper-parameter configurations. In this section, the main concepts of mathematical optimization and hyper-parameter optimization for machine learning models are discussed.

2.1. Mathematical Optimization

Mathematical optimization is the process of finding the best solution from a set of available candidates that enables the objective function to be maximized or minimized [20]. Generally, optimization problems can be classified as constrained or unconstrained optimization problems based on whether they have constraints for the decision variables or the solution variables.

In unconstrained optimization problems, a decision variable, $x$, can take any value from the one-dimensional space of all real numbers, $\mathbb{R}$. An unconstrained optimization problem can be denoted by [21]:

$$\min_{x \in \mathbb{R}} f(x), \tag{1}$$

where $f(x)$ is the objective function.

On the other hand, most real-life optimization problems are constrained optimization problems. The decision variable $x$ for constrained optimization problems should be subject to certain constraints, which could be mathematical equalities or inequalities. Therefore, constrained optimization problems or general optimization problems can be expressed as [21]:

$$\begin{aligned} \min_{x} \quad & f(x) \\ \text{subject to} \quad & g_i(x) \le 0, \; i = 1, 2, \cdots, m, \\ & h_j(x) = 0, \; j = 1, 2, \cdots, p, \\ & x \in X, \end{aligned} \tag{2}$$

where $g_i(x)$, $i = 1, 2, \cdots, m$, are the inequality constraint functions; $h_j(x)$, $j = 1, 2, \cdots, p$, are the equality constraint functions; and $X$ is the domain of $x$.

The role of constraints is to limit the possible values of the optimal solution to certain areas of the search space, named the feasible region [21]. Thus, the feasible region $D$ of $x$ can be represented by:

$$D = \{x \in X \mid g_i(x) \le 0, \; h_j(x) = 0\}. \tag{3}$$


To conclude, an optimization problem consists of three major components: a set of decision variables $x$, an objective function $f(x)$ to be either minimized or maximized, and a set of constraints that allow the variables to take on values in certain ranges (if it is a constrained optimization problem). Therefore, the goal of optimization tasks is to obtain the set of variable values that minimize or maximize the objective function while satisfying any applicable constraints.

Many HPO problems have certain constraints, like the feasible domain of the number of clusters in k-means, as well as time and space constraints. Therefore, constrained optimization techniques are widely used in HPO problems [3].

For optimization problems, in many cases, only a local instead of a global optimum can be obtained. For example, to obtain the minimum of a problem, assuming $D$ is the feasible region of a decision variable $x$, a global minimum is a point $x^* \in D$ satisfying $f(x^*) \le f(x)$ for all $x \in D$, while a local minimum is a point $x^* \in D$ with a neighborhood $N$ satisfying $f(x^*) \le f(x)$ for all $x \in N \cap D$ [21]. Thus, a local optimum may only be an optimum in a small range instead of being the optimal solution in the entire feasible region.

A local optimum is guaranteed to be a global optimum when the function is convex [22]: every local minimum of a convex function is also a global minimum. Therefore, continuing to search along the direction in which the objective function decreases can detect the global optimal value. A function $f(x)$ is a convex function if [22], for all $x_1, x_2 \in X$ and all $t \in [0, 1]$,

$$f(t x_1 + (1 - t) x_2) \le t f(x_1) + (1 - t) f(x_2), \tag{4}$$

where $X$ is the domain of decision variables, and $t$ is a coefficient in the range of [0, 1].
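Inequality (4) can be probed numerically. The sketch below samples random pairs of points and interpolation coefficients and reports whether any sample violates the inequality. The helper name `violates_convexity`, the tolerance, and the two test functions are illustrative choices. Finding a violation proves non-convexity on the interval, while finding none is only evidence of convexity, not a proof.

```python
import random


def violates_convexity(f, low, high, trials=10000, seed=0, tol=1e-12):
    """Return True if some sampled (x1, x2, t) breaks inequality (4)."""
    rng = random.Random(seed)
    for _ in range(trials):
        x1, x2 = rng.uniform(low, high), rng.uniform(low, high)
        t = rng.random()
        lhs = f(t * x1 + (1 - t) * x2)
        rhs = t * f(x1) + (1 - t) * f(x2)
        if lhs > rhs + tol:  # tol absorbs floating-point rounding
            return True
    return False


def quadratic(x):
    return x * x  # convex


def double_well(x):
    return x ** 4 - 3 * x ** 2 + x  # non-convex near x = 0


print(violates_convexity(quadratic, -3, 3))
print(violates_convexity(double_well, -2, 2))
```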

An optimization problem is a convex optimization problem only when the objective function $f(x)$ is a convex function and the feasible region $C$ is a convex set, denoted by [22]:

$$\min_{x} f(x) \quad \text{subject to} \quad x \in C. \tag{5}$$

On the other hand, nonconvex functions can have multiple local optima, of which not all are global optima. Most ML and HPO problems are nonconvex optimization problems. Thus, inappropriate optimization methods often detect only a local instead of a global optimum.


There are many traditional methods that can be used to solve optimization problems, including gradient descent, Newton's method, conjugate gradient, and heuristic optimization methods [20]. Gradient descent is a commonly used optimization method that uses the negative gradient direction as the search direction to move towards the optimum. However, gradient descent is not guaranteed to detect the global optimum unless the objective function is convex. Newton's method uses the inverse of the Hessian matrix to obtain the optimum. Newton's method converges faster than gradient descent, but often requires more time and space to store and calculate the Hessian matrix. Conjugate gradient searches along the conjugate directions constructed from the gradients of known data points to detect the optimum. Conjugate gradient converges faster than gradient descent, but its calculation of the conjugate directions is more complex. Unlike other traditional methods, heuristic methods use empirical rules to solve the optimization problems instead of following systematic steps to obtain the solution. Heuristic methods can often detect an approximate global optimum within a few iterations, but are not guaranteed to detect the global optimum [20].
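The dependence of gradient descent on its starting point can be seen on a small non-convex example. The function $f(x) = x^4 - 3x^2 + x$ used below is an illustrative choice, not taken from the cited works; it has a global minimum near $x \approx -1.30$ and a local minimum near $x \approx 1.13$. The step size and iteration count are likewise assumptions.

```python
def f(x):
    return x ** 4 - 3 * x ** 2 + x


def grad_f(x):
    return 4 * x ** 3 - 6 * x + 1


def gradient_descent(x0, learning_rate=0.01, n_steps=1000):
    """Repeatedly step along the negative gradient direction."""
    x = x0
    for _ in range(n_steps):
        x -= learning_rate * grad_f(x)
    return x


x_left = gradient_descent(-2.0)   # ends in the global minimum
x_right = gradient_descent(2.0)   # ends in a local minimum only
print(x_left, f(x_left))
print(x_right, f(x_right))
```

Both runs converge to a stationary point, but only the run started at -2.0 reaches the global minimum; the run started at 2.0 stops at a point with a higher objective value.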

2.2. Hyper-parameter Optimization Problem Statement

During the design process of ML models, effectively searching the hyper-parameter space using optimization techniques can identify the optimal hyper-parameters for the models. The hyper-parameter optimization process consists of four main components: an estimator (a regressor or a classifier) with its objective function, a search space (configuration space), a search or optimization method used to find hyper-parameter combinations, and an evaluation function to compare the performance of different hyper-parameter configurations.

The domain of a hyper-parameter can be continuous (e.g., learning rate), discrete (e.g., number of clusters), binary (e.g., whether to use early stopping or not), or categorical (e.g., type of optimizer). Therefore, hyper-parameters are classified as continuous, discrete, and categorical hyper-parameters. For continuous and discrete hyper-parameters, their domains are usually bounded in practical applications [12] [23]. On the other hand, the hyper-parameter configuration space sometimes contains conditionality. A hyper-parameter that needs to be used or tuned only for certain values of another hyper-parameter is called a conditional hyper-parameter [10]. For instance, in SVM, the degree of the polynomial kernel function only needs to be tuned when the kernel type is chosen to be polynomial.
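A conditional search space of this kind can be sketched as a sampling function in which the conditional hyper-parameter is drawn only when its parent takes the relevant value. The keys follow the sklearn-style names `C`, `kernel`, and `degree`; the value ranges and the function name `sample_svm_config` are assumptions made for this example.

```python
import random


def sample_svm_config(rng):
    """Draw one configuration from a conditional SVM search space."""
    config = {
        "C": 10 ** rng.uniform(-2, 2),                               # continuous
        "kernel": rng.choice(["linear", "poly", "rbf", "sigmoid"]),  # categorical
    }
    # 'degree' is conditional: it only exists for the polynomial kernel.
    if config["kernel"] == "poly":
        config["degree"] = rng.randint(2, 5)                         # discrete
    return config


rng = random.Random(0)
samples = [sample_svm_config(rng) for _ in range(200)]
print(samples[:3])
```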

In simple cases, all hyper-parameters can take unrestricted real values, and the feasible set $X$ of hyper-parameters can be a real-valued $n$-dimensional vector space. However, in most cases, the hyper-parameters of a ML model take on values from different domains and have different constraints, so their optimization problems are often complex constrained optimization problems [24]. For instance, the number of considered features in a decision tree should be in the range of 0 to the number of features, and the number of clusters in k-means should not be larger than the number of data points. Additionally, categorical hyper-parameters can often only take a few specific values, like the limited choices of the activation function and the optimizer of a neural network. Therefore, the feasible domain $X$ often has a complex structure, which increases the problems' complexity [24].

In general, for a hyper-parameter optimization problem, the aim is to obtain [19]:

$$x^* = \arg\min_{x \in X} f(x), \tag{6}$$

where $f(x)$ is the objective function to be minimized, such as the error rate or the root mean squared error (RMSE); $x^*$ is the hyper-parameter configuration that produces the optimum value of $f(x)$; and a hyper-parameter configuration $x$ can take any value in the search space $X$.

The aim of HPO is to achieve optimal or near-optimal model performance by tuning hyper-parameters within the given budgets [3]. The mathematical expression of the function $f$ varies, depending on the objective function of the chosen ML algorithm and the performance metric function. Model performance can be evaluated by various metrics, like accuracy, RMSE, F1-score, and false alarm rate. On the other hand, in practice, time budgets are an essential constraint for optimizing HPO models and must be considered. It often requires a massive amount of time to optimize the objective function of a ML model with a reasonable number of hyper-parameter configurations. Every time a hyper-parameter value is tested, the entire ML model needs to be retrained, and the validation set needs to be processed to generate a score that reflects the model performance.

After selecting a ML algorithm, the main process of HPO is as follows [10]:

1. Select the objective function and the performance metrics;

2. Select the hyper-parameters that require tuning, summarize their types, and determine the appropriate optimization technique;

3. Train the ML model using the default hyper-parameter configuration or common values as the baseline model;

4. Start the optimization process with a large search space as the hyper-parameter feasible domain determined by manual testing and/or domain knowledge;

5. Narrow the search space based on the regions of currently-tested well-performing hyper-parameter values, or explore new search spaces if necessary;

6. Return the best-performing hyper-parameter configuration as the final solution.
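Steps 3 to 6 of this process can be sketched as a coarse-to-fine search over a single continuous hyper-parameter. The objective `validation_error`, the default value 0.5, the search ranges, and the narrowing width of 0.1 are all hypothetical; random search is used here only because it is the simplest optimizer to write down, not because the process prescribes it.

```python
import random


def validation_error(lr):
    """Hypothetical objective with its minimum at lr = 0.05."""
    return (lr - 0.05) ** 2


def random_search(low, high, n_iter, rng):
    """Return the best of n_iter uniformly sampled values in [low, high]."""
    return min((rng.uniform(low, high) for _ in range(n_iter)),
               key=validation_error)


rng = random.Random(0)

default_lr = 0.5
baseline_error = validation_error(default_lr)       # step 3: baseline model

coarse_best = random_search(0.0, 1.0, 20, rng)      # step 4: large search space

low = max(0.0, coarse_best - 0.1)                   # step 5: narrow the space
high = min(1.0, coarse_best + 0.1)
fine_best = random_search(low, high, 20, rng)

best_lr = min([default_lr, coarse_best, fine_best],  # step 6: final solution
              key=validation_error)
print(baseline_error, validation_error(best_lr))
```

Keeping the default configuration among the final candidates ensures that the returned configuration is never worse than the baseline on the chosen metric.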

However, most traditional optimization techniques [25] are unsuitable for HPO, since HPO problems are different from traditional optimization problems in the following aspects [10]:

1. The optimization target, the objective function of ML models, is usually a non-convex and non-differentiable function. Therefore, many traditional optimization methods designed to solve convex or differentiable optimization problems are often unsuitable for HPO problems, since these methods may return a local optimum instead of a global optimum. Additionally, an optimization target lacking smoothness makes certain traditional derivative-free optimization models perform poorly for HPO problems [26].

2. The hyper-parameters of ML models include continuous, discrete, categorical, and conditional hyper-parameters. Thus, many traditional numerical optimization methods [27] that only aim to tackle numerical or continuous variables are unsuitable for HPO problems.

3. It is often computationally expensive to train a ML model on a large-scale dataset. HPO techniques sometimes use data sampling to obtain approximate values of the objective function. Thus, effective optimization techniques for HPO problems should be able to use these approximate values. However, function evaluation time is often ignored in many black-box optimization (BBO) models, so they often require exact instead of approximate objective function values. Consequently, many BBO algorithms are often unsuitable for HPO problems with limited time and resource budgets.


Therefore, appropriate optimization algorithms should be applied to HPO problems to identify optimal hyper-parameter configurations for ML models.

3. Hyper-parameters in Machine Learning Models

To boost ML models by HPO, we first need to identify the key hyper-parameters that must be tuned to fit the ML models to specific problems or datasets.

In general, ML models can be classified as supervised and unsupervised learning algorithms, based on whether they are built to model labeled or unlabeled datasets [28]. Supervised learning algorithms are a set of machine learning algorithms that map input features to a target by training on labeled data, and mainly include linear models, k-nearest neighbors (KNN), support vector machines (SVM), naive Bayes (NB), decision-tree-based models, and deep learning (DL) algorithms [29]. Unsupervised learning algorithms are used to find patterns from unlabeled data and can be divided into clustering and dimensionality reduction algorithms based on their aims. Clustering methods mainly include k-means, density-based spatial clustering of applications with noise (DBSCAN), hierarchical clustering, and expectation-maximization (EM); while two common dimensionality reduction algorithms are principal component analysis (PCA) and linear discriminant analysis (LDA) [30]. Moreover, there are several ensemble learning methods that combine different individual models to further improve model performance, like voting, bagging, and AdaBoost. In this paper, the important hyper-parameters of common ML models are studied based on their names in Python libraries, including scikit-learn (sklearn) [31], XGBoost [32], and Keras [33].

3.1. Supervised Learning Algorithms

In supervised learning, both the input $x$ and the output $y$ are available, and the goal is to obtain an optimal predictive model function $f^*$ to minimize the cost function $L(f(x), y)$ that models the error between the estimated output and ground-truth labels. The predictive model function $f$ varies based on its model structure. With limited model structures determined by different hyper-parameter configurations, the domain of the ML model function $f$ is restricted to a set of functions $F$. Thus, the optimal predictive model $f^*$ can be obtained by [34]:

$$f^* = \arg\min_{f \in F} \frac{1}{n} \sum_{i=1}^{n} L(f(x_i), y_i), \tag{7}$$

where $n$ is the number of training data points, $x_i$ is the feature vector of the $i$-th instance, $y_i$ is the corresponding actual output, and $L$ is the cost function value of each sample.
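Equation (7) can be made concrete with a small finite function set. In the sketch below, the three candidate functions, the squared-error loss, and the data are all made up for illustration; the candidate with the lowest average loss over the training points is selected as $f^*$.

```python
def squared_loss(prediction, target):
    return (prediction - target) ** 2


def empirical_risk(f, xs, ys, loss):
    """Average loss of f over the n training points, as in Eq. (7)."""
    return sum(loss(f(x), y) for x, y in zip(xs, ys)) / len(xs)


def constant(x):
    return 4.0


def slope_one(x):
    return x + 1.0


def slope_two(x):
    return 2.0 * x + 1.0


# Hypothetical training data generated by y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

F = {"constant": constant, "slope_one": slope_one, "slope_two": slope_two}
risks = {name: empirical_risk(f, xs, ys, squared_loss) for name, f in F.items()}
best_name = min(risks, key=risks.get)
print(risks, best_name)
```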

Many different loss functions exist in supervised learning algorithms, including the square of Euclidean distance, cross-entropy, information gain, etc. [34]. On the other hand, different ML algorithms generate different predictive model architectures based on different hyper-parameter configurations, which will be discussed in detail in this subsection.

3.1.1. Linear Models

In general, supervised learning models can be classified as regression and classification techniques when used to predict continuous or discrete target variables, respectively. Linear regression [35] is a typical regression model to predict a target $y$ by the following equation:

$$\hat{y}(w, x) = w_0 + w_1 x_1 + \ldots + w_p x_p, \tag{8}$$

where the target value $y$ is expected to be a linear combination of $p$ input features $x = (x_1, \cdots, x_p)$, and $\hat{y}$ is the predicted value. The weight vector $w = (w_1, \cdots, w_p)$ is designated as an attribute 'coef_', and $w_0$ is defined as another attribute 'intercept_' in the linear model of sklearn. Usually, no hyper-parameter needs to be tuned in linear regression. A linear model's performance mainly depends on how well the problem or data follows a linear relationship.

To improve the original linear regression models, ridge regression was proposed in [36]. Ridge regression imposes a penalty on the coefficients, and aims to minimize the objective function [37]:

$$\alpha \|w\|_2^2 + \sum_{i=1}^{p} (y_i - w_i \cdot x_i)^2, \tag{9}$$

where $\|w\|_2$ is the L2-norm of the coefficient vector, and $\alpha$ is the regularization strength. A larger value of $\alpha$ indicates a larger amount of shrinkage; thus, the coefficients are also more robust to collinearity.

Lasso regression [38] is another linear model used to estimate sparse coefficients, consisting of a linear model with an added L1-norm regularization term. It aims to minimize the objective function [37]:

$$\alpha \|w\|_1 + \sum_{i=1}^{p} (y_i - w_i \cdot x_i)^2, \tag{10}$$

where $\alpha$ is a constant and $\|w\|_1$ is the L1-norm of the coefficient vector. Therefore, the regularization strength $\alpha$ is a crucial hyper-parameter of both ridge and lasso regression models.
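The effect of the regularization strength can be worked out in closed form for a single feature without an intercept. The sketch below assumes the squared residuals are summed over the training samples, so that ridge minimizes $\alpha w^2 + \sum_i (y_i - w x_i)^2$ and lasso minimizes $\alpha |w| + \sum_i (y_i - w x_i)^2$; this convention, the function names, and the data are assumptions of the example. Setting the derivative to zero gives $w = \sum_i x_i y_i / (\sum_i x_i^2 + \alpha)$ for ridge, and a soft-thresholded solution for lasso.

```python
def ridge_weight(xs, ys, alpha):
    """Minimizer of alpha * w**2 + sum((y - w * x)**2) for one feature."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)


def lasso_weight(xs, ys, alpha):
    """Minimizer of alpha * |w| + sum((y - w * x)**2) for one feature."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    shrunk = max(abs(sxy) - alpha / 2.0, 0.0)  # soft-thresholding
    sign = 1.0 if sxy >= 0 else -1.0
    return sign * shrunk / sxx


# Hypothetical data generated by y = 2x, so the unregularized weight is 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

for alpha in [0.0, 14.0, 28.0, 56.0]:
    print(alpha, ridge_weight(xs, ys, alpha), lasso_weight(xs, ys, alpha))
```

Ridge shrinks the weight smoothly towards zero as $\alpha$ grows but never reaches it, whereas lasso sets the weight to exactly zero once $\alpha$ is large enough, which is the source of its sparse coefficients.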

Logistic regression (LR) [39] is a linear model used for classification problems. In LR, the cost function may differ, depending on the regularization method chosen for the penalization. There are three main types of regularization methods in LR: L1-norm, L2-norm, and elastic-net regularization [40].

Therefore, the first hyper-parameter that needs to be tuned in LR is the regularization method used in the penalization, 'l1', 'l2', 'elasticnet' or 'none', which is called 'penalty' in sklearn. The coefficient, 'C', is another essential hyper-parameter that determines the regularization strength of the model. In addition, the 'solver' type, representing the optimization algorithm type, can be set to 'newton-cg', 'lbfgs', 'liblinear', 'sag', or 'saga' in LR. The 'solver' type has correlations with 'penalty' and 'C', so they are conditional hyper-parameters.
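The conditionality between 'solver' and 'penalty' can be encoded as a compatibility table that filters the search space before any model is trained. The table below reflects the solver/penalty combinations listed in the scikit-learn documentation for releases contemporary with this paper; this is an assumption of the sketch, and other releases may differ (for example in how the unpenalized case is spelled), so it should be checked against the installed version. The helper name `valid_configurations` and the default `l1_ratio` value are hypothetical.

```python
# Penalties accepted by each solver (assumed from the scikit-learn
# documentation of that period; verify against the installed release).
SUPPORTED_PENALTIES = {
    "newton-cg": {"l2", "none"},
    "lbfgs": {"l2", "none"},
    "liblinear": {"l1", "l2"},
    "sag": {"l2", "none"},
    "saga": {"l1", "l2", "elasticnet", "none"},
}


def valid_configurations(c_values, l1_ratio=0.5):
    """Enumerate only the solver/penalty/C combinations that are valid."""
    configs = []
    for solver, penalties in SUPPORTED_PENALTIES.items():
        for penalty in sorted(penalties):
            if penalty == "none":
                # Without a penalty, the strength 'C' has no effect.
                configs.append({"solver": solver, "penalty": penalty})
                continue
            for c in c_values:
                config = {"solver": solver, "penalty": penalty, "C": c}
                if penalty == "elasticnet":
                    config["l1_ratio"] = l1_ratio  # needed for elastic-net
                configs.append(config)
    return configs


configs = valid_configurations([0.1, 1.0, 10.0])
print(len(configs))
```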

3.1.2. KNN

K-nearest neighbor (KNN) is a simple ML algorithm that is used to classify data points by calculating the distances between different data points. In KNN, the predicted class of each test sample is set to the class to which most of its k-nearest neighbors in the training set belong.

Assuming the training set $T = \{(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)\}$, where $x_i$ is the feature vector of an instance and $y_i \in \{c_1, c_2, \cdots, c_m\}$ is the class of the instance, $i = 1, 2, \cdots, n$, the class $y$ of a test instance $x$ can be denoted by [41]:

$$y = \arg\max_{c_j} \sum_{x_i \in N_k(x)} I(y_i = c_j), \quad i = 1, 2, \cdots, n; \; j = 1, 2, \cdots, m, \tag{11}$$

where $I$ is an indicator function, with $I = 1$ when $y_i = c_j$ and $I = 0$ otherwise; and $N_k(x)$ is the neighborhood containing the k-nearest neighbors of $x$.

In KNN, the number of considered nearest neighbors, k, is the most crucial hyper-parameter [42]. If k is too small, the model tends to over-fit the training data; if k is too large, the model tends to under-fit and requires more computational time. In addition, the weighting function used in the prediction can be chosen to be 'uniform' (points are weighted equally) or 'distance' (points are weighted by the inverse of their distance), depending on the specific problem. The distance metric and the power parameter of the Minkowski metric can also be tuned, as they can result in minor improvement. Lastly, the 'algorithm' used to compute the nearest neighbors can be chosen from a ball tree, a k-dimensional (KD) tree, or a brute force search. Typically, the model can determine the most appropriate algorithm itself by setting the 'algorithm' to 'auto' in sklearn [31].
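The roles of k and of the weighting function can be seen in a minimal from-scratch classifier. This is an illustrative sketch using Euclidean distance and a brute force search, not sklearn's implementation; the function name `knn_predict`, the small constant that guards against division by zero, and the data are assumptions.

```python
import math
from collections import defaultdict


def knn_predict(train, x, k=3, weights="uniform"):
    """Classify x by a (weighted) vote among its k nearest training points."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], x))[:k]
    votes = defaultdict(float)
    for features, label in neighbors:
        if weights == "uniform":
            votes[label] += 1.0                      # every neighbor counts equally
        else:
            distance = math.dist(features, x)
            votes[label] += 1.0 / (distance + 1e-12)  # closer neighbors count more
    return max(votes, key=votes.get)


# Hypothetical training set: one very close 'a' point and two farther 'b' points.
train = [((0.1, 0.0), "a"), ((1.0, 0.0), "b"), ((0.0, 1.0), "b")]
query = (0.0, 0.0)

print(knn_predict(train, query, k=3, weights="uniform"))
print(knn_predict(train, query, k=3, weights="distance"))
print(knn_predict(train, query, k=1))
```

With uniform weights and k=3 the two farther points outvote the closest one, while distance weighting or k=1 follows the closest point, which shows how these hyper-parameters change the prediction on the same data.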

3.1.3. SVM

A support vector machine (SVM) [43] is a supervised learning algorithm that can be used for both classification and regression problems. SVM algorithms are based on the concept of mapping data points from a low-dimensional into a high-dimensional space to make them linearly separable; a hyperplane is then generated as the classification boundary to partition data points [44]. Assuming there are $n$ data points, the objective function of SVM is [45]:

$$\arg\min_{w} \left\{ \frac{1}{n} \sum_{i=1}^{n} \max\{0, 1 - y_i f(x_i)\} + C w^T w \right\}, \tag{12}$$

where $w$ is the normal vector of the hyperplane, and $C$ is the penalty parameter of the error term, which is an important hyper-parameter of all SVM models.

The kernel function $f(x)$, which is used to measure the similarity between two data points $x_i$ and $x_j$, can be set to different types in SVM models, including several common kernel types, or even customized kernels. Therefore, the kernel type is a vital hyper-parameter to be tuned. Common kernel types in SVM include linear kernels, radial basis function (RBF) kernels, polynomial kernels, and sigmoid kernels.

The different kernel functions can be denoted as follows [46]:

1. Linear kernel:

$$f(x) = x_i^T x_j; \tag{13}$$

2. Polynomial kernel:

$$f(x) = \left(\gamma x_i^T x_j + r\right)^d; \tag{14}$$

3. RBF kernel:

$$f(x) = \exp\left(-\gamma \|x_i - x_j\|^2\right); \tag{15}$$

4. Sigmoid kernel:

$$f(x) = \tanh\left(\gamma x_i^T x_j + r\right). \tag{16}$$
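The four kernels translate directly into code, which also makes visible which hyper-parameters each kernel depends on. The sketch below uses plain Python tuples as vectors; the function names and default values are illustrative and are not sklearn's defaults.

```python
import math


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def linear_kernel(xi, xj):
    return dot(xi, xj)                                  # Eq. (13): no extra HP


def polynomial_kernel(xi, xj, gamma=1.0, r=0.0, d=3):
    return (gamma * dot(xi, xj) + r) ** d               # Eq. (14): gamma, r, d


def rbf_kernel(xi, xj, gamma=1.0):
    squared_distance = sum((p - q) ** 2 for p, q in zip(xi, xj))
    return math.exp(-gamma * squared_distance)          # Eq. (15): gamma


def sigmoid_kernel(xi, xj, gamma=1.0, r=0.0):
    return math.tanh(gamma * dot(xi, xj) + r)           # Eq. (16): gamma, r


xi, xj = (1.0, 2.0), (3.0, 4.0)
print(linear_kernel(xi, xj))
print(polynomial_kernel(xi, xj, gamma=1.0, r=1.0, d=2))
print(rbf_kernel(xi, xj, gamma=0.5))
print(sigmoid_kernel(xi, xj, gamma=0.01))
```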


As shown in the kernel function equations, a few other hyper-parameters need to be tuned after a kernel type is chosen. The hyper-parameter $\gamma$, denoted by 'gamma' in sklearn, is a conditional hyper-parameter of the 'kernel' type hyper-parameter when it is set to polynomial, RBF, or sigmoid; $r$, specified by 'coef0' in sklearn, is a conditional hyper-parameter of polynomial and sigmoid kernels. Moreover, the polynomial kernel has an additional conditional hyper-parameter $d$ representing the 'degree' of the polynomial kernel function. In support vector regression (SVR) models, there is another hyper-parameter, 'epsilon', specifying the margin of tolerance within which errors are not penalized by the loss function [31].

3.1.4. Naïve Bayes

Naïve Bayes (NB) [47] algorithms are supervised learning algorithms based on Bayes’ theorem. Assuming there are n features x_1, ..., x_n that are conditionally independent given a target variable y, the objective function of naïve Bayes can be denoted by:

$$\hat{y} = \arg\max_{y} P(y) \prod_{i=1}^{n} P(x_i \mid y), \tag{17}$$

where P(y) is the probability of a value y, and P(x_i|y) is the conditional probability of x_i given the value of y. Depending on the assumed distribution of P(x_i|y), there are different types of naïve Bayes classifiers. The four main types of NB models are: Bernoulli NB, Gaussian NB, multinomial NB, and complement NB [48].

For Gaussian NB [49], the likelihood of features is assumed to follow a Gaussian distribution:

$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^{2}}} \exp\left(-\frac{(x_i - \mu_y)^{2}}{2\sigma_y^{2}}\right). \tag{18}$$

The maximum likelihood method is used to calculate the mean value, μ_y, and the variance, σ_y^2. Normally, there is not any hyper-parameter that needs to be tuned for Gaussian NB. The performance of a Gaussian NB model mainly depends on how well the dataset follows Gaussian distributions.
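The Gaussian likelihood and its maximum likelihood estimates can be sketched as follows. This is an illustrative sketch for a single feature within a single class; the function names are assumptions, and the variance is the maximum likelihood estimate (dividing by the number of samples, not by that number minus one).

```python
import math

def fit_gaussian(values):
    """Maximum likelihood estimates (mu_y, sigma_y^2) for one feature in one class."""
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / n  # MLE divides by n
    return mu, var

def gaussian_likelihood(x, mu, var):
    """P(x_i | y) for a Gaussian with mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

mu, var = fit_gaussian([1.0, 2.0, 3.0])
print(mu, var, gaussian_likelihood(2.0, mu, var))
```

Nothing in this computation is left to the user to choose, which is consistent with Gaussian NB normally having no hyper-parameter to tune.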

Multinomial NB [50] is designed for multinomially-distributed data based on the naïve Bayes algorithm. Assume that there are n features and that θ_y is the distribution for each value of the target variable y, whose component θ_yi equals the conditional probability P(x_i|y) that feature i appears in a data point belonging to the class y. Based on the concept of relative frequency counting, θ_y can be estimated by a smoothed version of θ_yi [31]:

$$\hat{\theta}_{yi} = \frac{N_{yi} + \alpha}{N_y + \alpha n}, \tag{19}$$

where N_yi is the number of times that feature i appears in a data point belonging to class y, and N_y is the sum of all N_yi (i = 1, 2, ..., n). The smoothing prior α ≥ 0 accounts for features that are not in the learning samples. When α = 1, it is called Laplace smoothing; when α < 1, it is called Lidstone smoothing.

Complement NB [51] is an improved version of the standard multinomial NB algorithm and is suitable for processing imbalanced data, while Bernoulli NB [52] requires samples to have binary-valued feature vectors so that the data can follow multivariate Bernoulli distributions. They both have the additive (Laplace/Lidstone) smoothing parameter, α, as the main hyper-parameter that needs tuning. To conclude, for naïve Bayes algorithms, users often do not need to tune hyper-parameters or only need to tune the smoothing parameter α, which is a continuous hyper-parameter.
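A short sketch of the smoothed estimate in Eq. (19) is given below. The function name and the example counts are assumptions for illustration; the counts list holds N_yi for one class y.

```python
def smoothed_theta(counts, alpha=1.0):
    """Smoothed estimate of theta_yi for one class y.

    counts[i] is N_yi, the number of times feature i occurs in class y;
    alpha is the additive (Laplace/Lidstone) smoothing parameter.
    """
    n = len(counts)        # number of features
    n_y = sum(counts)      # N_y, the sum of all N_yi
    return [(n_yi + alpha) / (n_y + alpha * n) for n_yi in counts]

print(smoothed_theta([3, 0, 1], alpha=1.0))  # Laplace smoothing
print(smoothed_theta([3, 0, 1], alpha=0.1))  # Lidstone smoothing
```

With α > 0, the second feature, which never occurs in this class, still receives a non-zero probability, so a single unseen feature does not force the product in Eq. (17) to zero.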

3.1.5. Tree-based Models

Decision tree (DT) [53] is a common classification method that uses a tree-structure to model decisions and possible consequences by summarizing a set of classification rules from the data. A DT has three main components: a root node representing the whole data; multiple decision nodes indicating decision tests and sub-node splits over each feature; and several leaf nodes representing the result classes [54]. DT algorithms recursively split the training set with better feature values to achieve good decisions on each subset. Pruning, which means removing some of the sub-nodes of decision nodes, is used in DT to avoid over-fitting. Since a deeper tree has more sub-trees to make more accurate decisions, the maximum tree depth, ’max_depth’, is an essential hyper-parameter of DT algorithms [55].

There are many other important hyper-parameters to be tuned to build effective DT models [56]. Firstly, the quality of splits can be measured by setting a measuring function, denoted by ’criterion’ in sklearn. Gini impurity and information gain are the two main types of measuring functions. The split selection method, ’splitter’, can also be set to ’best’ to choose the best split, or ’random’ to select a random split. The number of considered features to generate the best split, ’max_features’, can also be tuned as a feature selection process. Moreover, there are several discrete hyper-parameters related to the splitting process, which need to be tuned to achieve better performance: the minimum number of data points to split a decision node or to obtain a leaf node, denoted by ’min_samples_split’ and ’min_samples_leaf’, respectively; the ’max_leaf_nodes’, indicating the maximum number of leaf nodes, and the ’min_weight_fraction_leaf’, which means the minimum weighted fraction of the total weights, can also be tuned to improve model performance [31] [56].

Based on the concept of DT models, many decision-tree-based ensemble algorithms have been proposed to improve model performance by combining multiple decision trees, including random forest (RF), extra trees (ET), and extreme gradient boosting (XGBoost) models. RF [57] is an ensemble learning method that uses the bagging method to combine multiple decision trees. In RF, basic DTs are built on many randomly-generated subsets, and the class with the majority voting will be selected to be the final classification result [58]. ET [59] is another tree-based ensemble learning method that is similar to RF, but it uses all samples to build DTs and randomly selects the feature sets. In addition, RF optimizes splits on DTs while ET randomly makes the splits. XGBoost [32] is a popular tree-based ensemble model designed for speed and performance improvement, which uses the boosting and gradient descent methods to combine basic DTs. In XGBoost, the next input sample of a new DT will be related to the results of previous DTs. XGBoost aims to minimize the following objective function [55]:

$$Obj = -\frac{1}{2} \sum_{j=1}^{t} \frac{G_j^{2}}{H_j + \lambda} + \gamma t, \tag{20}$$

where t is the number of leaves in a decision tree, G and H are the sums of the first and second order gradient statistics of the cost function, and γ and λ are the penalty coefficients.

Since tree-based ensemble models are built with decision trees as base learners, they have the same hyper-parameters as DT models, described in this subsection. Apart from these hyper-parameters, RF, ET, and XGBoost all have another crucial hyper-parameter to be tuned, which is the number of decision trees to be combined, denoted by ’n_estimators’ in sklearn. XGBoost has several additional hyper-parameters, including [60]: ’min_child_weight’, which means the minimum sum of weights in a child node; ’subsample’ and ’colsample_bytree’, used to control the subsampling ratio of instances and features, respectively; and four continuous hyper-parameters ’gamma’, ’alpha’, ’lambda’, and ’learning_rate’, indicating the minimum loss reduction for a split, the L1 and L2 regularization terms on weights, and the learning rate, respectively.
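To make the role of the two penalty coefficients concrete, Eq. (20) can be evaluated directly for a given tree. The following is a minimal sketch, not the XGBoost implementation: the function name, the default values, and the example gradient sums are assumptions for illustration.

```python
def xgb_objective(G, H, lam=1.0, gamma=0.0):
    """Evaluate Eq. (20) for one tree.

    G[j] and H[j] are the sums of first- and second-order gradient
    statistics over the samples in leaf j; lam and gamma are the penalty
    coefficients (the 'lambda' and 'gamma' hyper-parameters).
    """
    t = len(G)  # number of leaves
    fit_term = -0.5 * sum(g * g / (h + lam) for g, h in zip(G, H))
    return fit_term + gamma * t

# Two leaves with assumed gradient sums.
print(xgb_objective([2.0, -1.0], [1.0, 1.0], lam=1.0, gamma=0.5))
```

A larger λ shrinks the contribution of each leaf, while a larger γ charges a fixed cost per leaf, so both discourage overly complex trees.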

3.1.6. Ensemble Learning Algorithms

Apart from tree-based ensemble models, there are several other general ensemble learning methods that combine multiple singular ML models to achieve better model performance than any singular algorithm alone. The three general ensemble learning models, voting, bagging, and AdaBoost, are introduced in this subsection [61].

Voting [61] is a basic ensemble learning algorithm that uses the majority voting rule to combine singular estimators and generate a comprehensive estimator with improved accuracy. In sklearn, the voting method can be set to be ’hard’ or ’soft’, indicating whether to use majority voting or averaged predicted probabilities to determine the classification result. The list of selected single ML estimators and their weights can also be tuned in certain cases. For instance, a higher weight can be assigned to a better-performing singular ML model.

Bootstrap aggregating [61], also named bagging, trains multiple base estimators on different randomly-extracted subsets to construct a final predictor [62]. When using bagging methods, the first consideration should be the type and number of base estimators in the ensemble, denoted by ’base_estimator’ and ’n_estimators’, respectively. Then, the ’max_samples’ and ’max_features’, indicating the sample size and feature size to generate different subsets, can also be tuned.

AdaBoost [61], short for adaptive boosting, is an ensemble learning method that trains multiple base learners (weak learners) consecutively, with later learners emphasizing the samples mis-classified by previous learners; ultimately, a final strong learner is obtained. During this process, incorrectly-classified instances are retrained with other new instances, and their weights are adjusted so that the subsequent classifiers focus more on difficult cases, thereby gradually building a stronger classifier. In AdaBoost, the type of base estimator, ’base_estimator’, can be set to a decision tree or other methods. In addition, the maximum number of estimators at which boosting is terminated, ’n_estimators’, and the learning rate that shrinks the contribution of each classifier, should also be tuned to achieve a trade-off between these two hyper-parameters.
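The difference between ’hard’ and ’soft’ voting can be sketched for a single sample as follows. The function names and the example probabilities are assumptions for illustration, and the sketch does not reproduce the tie-breaking rules of any particular library.

```python
from collections import Counter

def hard_vote(labels):
    """Majority vote over the class labels predicted by each estimator."""
    return Counter(labels).most_common(1)[0][0]

def soft_vote(probas, weights=None):
    """Class with the highest (weighted) average predicted probability."""
    weights = weights or [1.0] * len(probas)
    total = sum(weights)
    n_classes = len(probas[0])
    avg = [sum(w * p[c] for w, p in zip(weights, probas)) / total
           for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)

# Three estimators, two classes; each row is [P(class 0), P(class 1)].
probas = [[0.9, 0.1], [0.4, 0.6], [0.45, 0.55]]
labels = [max(range(2), key=p.__getitem__) for p in probas]

print(hard_vote(labels))                    # two of three estimators predict class 1
print(soft_vote(probas))                    # the confident first estimator dominates
print(soft_vote(probas, weights=[1, 3, 3])) # re-weighting changes the outcome
```

The example shows why the voting type and the estimator weights are worth tuning: the same three estimators produce different ensemble predictions under each setting.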


3.1.7. Deep Learning Models

Deep learning (DL) algorithms are widely applied to various areas like computer vision, natural language processing, and machine translation since they have had great success solving many types of problems. DL models are based on the theory of artificial neural networks (ANNs). Common types of DL architectures include deep neural networks (DNNs), feedforward neural networks (FFNNs), deep belief networks (DBNs), convolutional neural networks (CNNs), recurrent neural networks (RNNs) and many more [63]. All these DL models have similar hyper-parameters since they have similar underlying neural network models. Compared with other ML models, DL models benefit more from HPO since they often have many hyper-parameters that require tuning.

The first set of hyper-parameters is related to the construction of the structure of a DL model; hence, they are named model design hyper-parameters. Since all neural network models have an input layer and an output layer, the complexity of a deep learning model mainly depends on the number of hidden layers and the number of neurons of each layer, which are two main hyper-parameters to build DL models [64]. These two hyper-parameters are set and tuned according to the complexity of the datasets or the problems. DL models need to have enough capacity to model objective functions (or prediction tasks) while avoiding over-fitting. At the next stage, certain function types need to be set or tuned. The first function type to configure is the loss function type, which is chosen mainly based on the problem type (e.g., binary cross-entropy for binary classification problems, multi-class cross-entropy for multi-classification problems, and RMSE for regression problems). Another important hyper-parameter is the activation function type used to model non-linear functions, which can be set to ’softmax’, ’rectified linear unit (ReLU)’, ’sigmoid’, ’tanh’, or ’softsign’. Lastly, the optimizer type can be set to stochastic gradient descent (SGD), adaptive moment estimation (Adam), root mean square propagation (RMSprop), etc. [65].

On the other hand, some other hyper-parameters are related to the optimization and training process of DL models; hence, they are categorized as optimizer hyper-parameters. The learning rate is one of the most important hyper-parameters in DL models [66]. It determines the step size at each iteration, which enables the objective function to converge. A large learning rate speeds up the learning process, but the gradient may oscillate around the local minimum value or even fail to converge. On the other hand, a small learning rate converges smoothly, but will largely increase model training time by requiring more training epochs. An appropriate learning rate enables the objective function to converge to a global minimum in a reasonable amount of time. Another common hyper-parameter is the drop-out rate. Drop-out is a standard regularization method for DL models proposed to reduce over-fitting. In drop-out, a proportion of neurons are randomly selected and removed, and the percentage of neurons to be removed should be tuned.

Mini-batch size and the number of epochs are the other two DL hyper-parameters that represent the number of processed samples before updating the model, and the number of complete passes through the entire training set, respectively [67]. Mini-batch size is affected by the resource requirements of the training process, speed, and the number of iterations. The number of epochs depends on the size of the training set and should be tuned by slowly increasing its value until validation accuracy starts to decrease, which indicates over-fitting. On the other hand, DL models often converge within a few epochs, and the following epochs may lead to unnecessary additional execution time and over-fitting, which can be avoided by the early stopping method. Early stopping is a form of regularization whereby model training stops in advance when validation accuracy does not increase after a certain number of consecutive epochs. The number of waiting epochs, called early stop patience, can also be tuned to reduce model training time.

Apart from traditional DL models, transfer learning (TL) is a technology that obtains a pre-trained model on the data in a related domain and transfers it to other target tasks [68]. To transfer a DL model from one problem to another problem, a certain number of top layers are frozen, and only the remaining layers are retrained to fit the new problem. Therefore, the number of frozen layers is a vital hyper-parameter to tune if TL is used.
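The early stopping rule described above can be sketched independently of any DL framework. In the sketch below, the function name and the validation-accuracy history are assumptions for illustration; in practice the accuracies would be produced one epoch at a time during training.

```python
def early_stopping_epoch(val_accuracies, patience=3):
    """Return (stop_epoch, best_epoch) for a sequence of validation accuracies.

    Training stops once the validation accuracy has not improved for
    'patience' consecutive epochs (the early stop patience).
    """
    best_acc, best_epoch, waited = float("-inf"), -1, 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best_acc:
            best_acc, best_epoch, waited = acc, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_accuracies) - 1, best_epoch

history = [0.60, 0.70, 0.75, 0.74, 0.75, 0.73, 0.72, 0.80]
print(early_stopping_epoch(history, patience=3))   # stops before the late improvement
print(early_stopping_epoch(history, patience=10))  # runs through all epochs
```

The two calls illustrate the trade-off behind the patience value: a small patience saves training time but can stop before a late improvement, whereas a large patience keeps training longer.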

3.2. Unsupervised Learning Algorithms

Unsupervised learning algorithms are a set of ML algorithms used to identify unknown patterns in unlabeled datasets. Clustering and dimensionality-reduction algorithms are the two main types of unsupervised learning methods. Clustering methods include k-means, DBSCAN, EM, hierarchical clustering, etc.; while PCA and LDA are two commonly-used dimensionality reduction algorithms [30].


3.2.1. Clustering Algorithms

In most clustering algorithms, including k-means, EM, and hierarchical clustering, the number of clusters is the most important hyper-parameter to tune [69].

The k-means algorithm [70] uses k prototypes, indicating the centroids of clusters, to cluster data. In k-means algorithms, the number of clusters, ’n_clusters’, must be specified, and the clusters are determined by minimizing the sum of squared errors [70]:

$$\sum_{i=0}^{n_k} \min_{u_j \in C_k} \left(x_i - u_j\right)^{2}, \tag{21}$$

where (x_1, ..., x_n) is the data matrix; u_j, also called the centroid of the cluster C_k, is the mean of the samples in the cluster; and n_k is the number of sample points in the cluster C_k.

To tune k-means, ’n_clusters’ is the most crucial hyper-parameter. Besides this, the method for centroid initialization, ’init’, could be set to ’k-means++’, ’random’ or a human-defined array, which slightly affects model performance. In addition, ’n_init’, denoting the number of times that the k-means algorithm will be executed with different centroid seeds, and ’max_iter’, the maximum number of iterations in a single execution of k-means, also have slight impacts on model performance [31].

The expectation-maximization (EM) algorithm [71] is an iterative algorithm used to find maximum likelihood estimates of parameters. The Gaussian mixture model is a clustering method that uses a mixture of Gaussian distributions to model data by implementing the EM method. Similar to k-means, its major hyper-parameter to be tuned is ’n_components’, indicating the number of clusters or Gaussian distributions. Additionally, different methods can be chosen to constrain the covariance of the estimated classes in Gaussian mixture models, including full covariance (’full’), ’tied’, diagonal (’diag’), or ’spherical’ [72]. Other hyper-parameters could also be tuned, including ’max_iter’ and ’tol’, representing the number of EM iterations to perform and the convergence threshold, respectively [31].
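The sum of squared errors can be computed for a fixed set of centroids as in the sketch below, where every point contributes its squared distance to the nearest centroid. The function names and the toy data are assumptions for illustration; the sketch only evaluates the criterion and does not perform the k-means iterations themselves.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def sum_squared_errors(points, centroids):
    """Each point contributes its squared distance to the nearest centroid."""
    return sum(min(sq_dist(x, u) for u in centroids) for x in points)

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
print(sum_squared_errors(points, [(5.0, 1.0)]))               # one cluster
print(sum_squared_errors(points, [(0.0, 1.0), (10.0, 1.0)]))  # two clusters
```

Because this quantity generally keeps decreasing as more centroids are added, it cannot be minimized over the number of clusters directly, which is one reason ’n_clusters’ has to be treated as a hyper-parameter.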

Hierarchical clustering [73] methods build clusters by continuously merging or splitting the built-in clusters. The hierarchy of clusters is represented by a tree-structure; its root indicates the unique cluster gathering all samples, and its leaves represent the clusters with only one sample [73]. In sklearn, the class ’AgglomerativeClustering’ implements a common type of hierarchical clustering. In agglomerative clustering, the linkage criterion, ’linkage’, determines the distance between sets of observations and can be set to ’ward’, ’complete’, ’average’, or ’single’, indicating whether to minimize the variance of all clusters, or use the maximum, average, or minimum distance between every two clusters, respectively. Like other clustering methods, its main hyper-parameter is the number of clusters, ’n_clusters’. However, ’n_clusters’ cannot be set if we choose to set the ’distance_threshold’, the linkage distance threshold for merging clusters, since if so, ’n_clusters’ will be determined automatically.

DBSCAN [74] is a density-based clustering method that determines the clusters by dividing data into clusters with sufficiently high density. Unlike other clustering models, the number of clusters does not need to be configured before training. Instead, DBSCAN has two significant conditional hyper-parameters, the scan radius represented by ’eps’ and the minimum number of considered neighbor points represented by ’min_samples’, which together define the cluster density [75]. DBSCAN works by starting with an unvisited point and detecting all its neighbor points within a pre-defined distance ’eps’. If the number of neighbor points reaches the value of ’min_samples’, this unvisited point and all its neighbors are defined as a cluster. The procedures are executed recursively until all data points are visited. A higher ’min_samples’ or a lower ’eps’ indicates a higher density to form a cluster.

3.2.2. Dimensionality Reduction Algorithms

The increasing amount of collected data provides ample information, while increasing problem complexity. In reality, many features are irrelevant or redundant to predict target variables. Dimensionality reduction algorithms often serve as feature engineering methods to extract important features and eliminate insignificant or redundant features. Two common dimensionality-reduction algorithms are principal component analysis (PCA) and linear discriminant analysis (LDA). In PCA and LDA, the number of features to be extracted, represented by ’n_components’ in sklearn, is the main hyper-parameter to be tuned.

Principal component analysis (PCA) [76] is a widely used linear dimensionality reduction method. PCA is based on the concept of mapping the original n-dimensional features into k-dimensional features as the new orthogonal features, also called the principal components. PCA works by calculating the covariance matrix of the data matrix to obtain the eigenvectors of the covariance matrix. The projection matrix comprises the k eigenvectors with the largest eigenvalues (i.e., the largest variance). Consequently, the data matrix can be transformed into a new space with reduced dimensionality. Singular value decomposition (SVD) [77] is a popular method used to obtain the eigenvalues and eigenvectors of the covariance matrix of PCA. Therefore, in addition to ’n_components’, the SVD solver type is another hyper-parameter of PCA to be tuned, which can be assigned to ’auto’, ’full’, ’arpack’ or ’randomized’ [31].

Linear discriminant analysis (LDA) [78] is another common dimensionality reduction method that projects the features onto the most discriminative directions. Unlike PCA, which obtains the direction with the largest variance as the principal component, LDA optimizes the feature subspace of classification. The objective of LDA is to minimize the variance inside each class and maximize the variance between different classes after projection. Thus, the projection points in each class should be as close as possible, and the distance between the center points of different classes should be as large as possible. Similar to PCA, the number of features to be extracted, ’n_components’, should be tuned in LDA models. Additionally, the solver type of LDA can also be set to ’svd’ for SVD, ’lsqr’ for least-squares solution, or ’eigen’ for eigenvalue decomposition [79]. LDA also has a conditional hyper-parameter, the shrinkage parameter, ’shrinkage’, which can be set to a float value along with ’lsqr’ and ’eigen’ solvers.

4. Hyper-parameter Optimization Techniques

4.1. Model-free Algorithms

4.1.1. Babysitting

Babysitting, also called ’Trial and Error’ or grad student descent (GSD), is a basic hyper-parameter tuning method [8]. This method relies entirely on manual tuning and is widely used by students and researchers. The workflow is simple: after building a ML model, the student tests many possible hyper-parameter values based on experience, guessing, or the analysis of previously-evaluated results; the process is repeated until this student runs out of time (often reaching a deadline) or is satisfied with the results. As such, this approach requires a sufficient amount of prior knowledge and experience to identify optimal hyper-parameter values with limited time.

Manual tuning is infeasible for many problems due to several factors, like a large number of hyper-parameters, complex models, time-consuming model evaluations, and non-linear hyper-parameter interactions [9]. These factors inspired increased research into techniques for the automatic optimization of hyper-parameters [80].

4.1.2. Grid Search

Grid search (GS) is one of the most commonly-used methods to explore hyper-parameter configuration space [81]. GS can be considered an exhaustive search or a brute-force method that evaluates all the hyper-parameter combinations given to the grid of configurations [82]. GS works by evaluating the Cartesian product of a user-specified finite set of values [6].

GS cannot exploit the well-performing regions further by itself. Therefore, to identify the global optimums, the following procedure needs to be performed manually [2]:

1. Start with a large search space and step size.

2. Narrow the search space and step size based on the previous results of well-performing hyper-parameter configurations.

3. Repeat step 2 multiple times until an optimum is reached.

GS can be easily implemented and parallelized. However, the main drawback of GS is its inefficiency for high-dimensionality hyper-parameter configuration space, since the number of evaluations increases exponentially as the number of hyper-parameters grows. This exponential growth is referred to as the curse of dimensionality [83]. For GS, assuming that there are k parameters, and each of them has n distinct values, its computational complexity increases exponentially at a rate of O(n^k) [19]. Thus, only when the hyper-parameter configuration space is small can GS be an effective HPO method.
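A minimal sketch of GS over a Cartesian product is shown below. The objective toy_loss is a hypothetical stand-in for an expensive model evaluation (lower is better), and the grid values are assumptions chosen for illustration.

```python
from itertools import product

def grid_search(objective, grid):
    """Evaluate every combination in the Cartesian product of the grid."""
    names = list(grid)
    best_config, best_score, n_evals = None, float("inf"), 0
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = objective(config)
        n_evals += 1
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score, n_evals

def toy_loss(config):
    # Hypothetical objective with its minimum at x = 0.3, y = -0.5.
    return (config["x"] - 0.3) ** 2 + (config["y"] + 0.5) ** 2

grid = {"x": [-1.0, -0.5, 0.0, 0.5, 1.0], "y": [-1.0, -0.5, 0.0, 0.5, 1.0]}
best_config, best_score, n_evals = grid_search(toy_loss, grid)
print(best_config, n_evals)  # n_evals = 5 ** 2 for k = 2 parameters with n = 5 values
```

The search returns the grid point closest to the true minimum but cannot refine it further, which is why the manual narrowing procedure above is needed; adding a third parameter with five values would raise the number of evaluations from 25 to 125.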

4.1.3. Random Search

To overcome certain limitations of GS, random search (RS) was proposed in [13]. RS is similar to GS; but, instead of testing all values in the search space, RS randomly selects a pre-defined number of samples between the upper and lower bounds as candidate hyper-parameter values, and then trains these candidates until the defined budget is exhausted. The theoretical basis of RS is that if the configuration space is large enough, then the global optimums, or at least their approximations, can be detected. With a limited budget, RS is able to explore a larger search space than GS [13].

The main advantage of RS is that it is easily parallelized and resource-allocated since each evaluation is independent. Unlike GS, RS samples a fixed number of parameter combinations from the specified distribution, which improves system efficiency by reducing the probability of wasting much time on an unimportant small search space. Since the number of total evaluations in RS is set to a certain value n before the optimization process starts, the computational complexity of RS is O(n) [84]. In addition, RS can detect the global optimum or the near-global optimum when given enough budgets [6].

Although RS is more efficient than GS for large search spaces, there are still a large number of unnecessary function evaluations since it does not exploit the previously well-performing regions [2].

To conclude, the main limitation of both RS and GS is that every evaluation in their iterations is independent of previous evaluations; thus, they waste massive time evaluating poorly-performing areas of the search space. This issue can be solved by other optimization methods, like Bayesian optimization that uses previous evaluation records to determine the next evaluation [14].
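RS can be sketched in a few lines. As before, toy_loss is a hypothetical stand-in for a model evaluation, and the bounds, the uniform sampling distribution, and the seed are assumptions made for the example.

```python
import random

def random_search(objective, bounds, n_iter, seed=0):
    """Sample n_iter configurations uniformly within the bounds; keep the best."""
    rng = random.Random(seed)
    best_config, best_score = None, float("inf")
    for _ in range(n_iter):
        config = {name: rng.uniform(low, high) for name, (low, high) in bounds.items()}
        score = objective(config)  # each evaluation is independent of the others
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score

def toy_loss(config):
    # Hypothetical objective with its minimum at x = 0.3, y = -0.5.
    return (config["x"] - 0.3) ** 2 + (config["y"] + 0.5) ** 2

bounds = {"x": (-1.0, 1.0), "y": (-1.0, 1.0)}
for budget in (20, 200):
    print(budget, random_search(toy_loss, bounds, n_iter=budget))
```

The number of evaluations is fixed by n_iter regardless of the number of hyper-parameters, which is the O(n) cost noted above. No sample uses information from earlier samples, which is the limitation that motivates model-based methods.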

4.2. Gradient-based Optimization

Gradient descent [85] is a traditional optimization technique that calculates the gradient of variables to identify the optimal direction and moves towards the optimum. After randomly selecting a point, the technique moves towards the opposite direction of the largest gradient to locate the next point. Therefore, a local optimum can be reached after convergence. The local optimum is also the global optimum for convex functions. Gradient-based algorithms have a time complexity of O(n^k) for optimizing k hyper-parameters [86].

For specific machine learning algorithms, the gradient of certain hyper-parameters can be calculated, and then gradient descent can be used to optimize these hyper-parameters. Although gradient-based algorithms have a faster convergence speed to reach a local optimum than the previously-presented methods in Section 4.1, they have several limitations. Firstly, they can only be used to optimize continuous hyper-parameters because other types of hyper-parameters, like categorical hyper-parameters, do not have gradient directions. Secondly, they are only efficient for convex functions because a local instead of a global optimum may be reached for non-convex functions [2]. Therefore, the gradient-based algorithms can only be used in some cases where it is possible to obtain the gradient of hyper-parameters; e.g., the learning rate in neural networks (NN) [11]. Still, it is not guaranteed for these ML algorithms to identify global optimums using gradient-based optimization techniques.
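The basic gradient descent update can be sketched on a one-dimensional convex function. The quadratic objective, the starting point, and the step sizes below are assumptions for illustration; in an HPO setting the gradient would instead be taken with respect to a continuous hyper-parameter.

```python
def gradient_descent(grad, x0, learning_rate=0.1, n_steps=100):
    """Repeatedly step in the direction opposite to the gradient."""
    x = x0
    for _ in range(n_steps):
        x = x - learning_rate * grad(x)
    return x

# Convex toy function f(x) = (x - 3)^2 with gradient 2 (x - 3).
def grad_f(x):
    return 2.0 * (x - 3.0)

print(gradient_descent(grad_f, x0=0.0, learning_rate=0.1))  # converges towards 3
print(gradient_descent(grad_f, x0=0.0, learning_rate=1.1))  # step too large: diverges
```

For this convex function the local optimum found is also the global one; on a non-convex function the same procedure may stop at whichever local optimum is nearest to the starting point.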

4.3. Bayesian Optimization

Bayesian optimization (BO) [87] is an iterative algorithm that is popularly used for HPO problems. Unlike GS and RS, BO determines the future evaluation points based on the previously-obtained results. To determine the next hyper-parameter configuration, BO uses two key components: a surrogate model and an acquisition function [57]. The surrogate model aims to fit all the currently-observed points into the objective function. After obtaining the predictive distribution of the probabilistic surrogate model, the acquisition function determines the usage of different points by balancing the trade-off between exploration and exploitation. Exploration is to sample the instances in the areas that have not been sampled, while exploitation is to sample in the current regions where the global optimum is most likely to occur, based on the posterior distribution. BO models balance the exploration and the exploitation processes to detect the current most likely optimal regions and avoid missing better configurations in the unexplored areas [88].

The basic procedures of BO are as follows [87]:

1. Build a probabilistic surrogate model of the objective function.

2. Detect the optimal hyper-parameter values on the surrogate model.

3. Apply these hyper-parameter values to the real objective function to evaluate them.

4. Update the surrogate model with new results.

5. Repeat steps 2 - 4 until the maximum number of iterations is reached.

Thus, BO works by updating the surrogate model after each evaluation on the objective function. BO is more efficient than GS and RS since it can detect the optimal hyper-parameter combinations by analyzing the previously-tested values, and running a surrogate model is often much cheaper than running a real objective function.

However, since Bayesian optimization models are executed based on the previously-tested values, they belong to sequential methods that are difficult to parallelize; but they can usually detect near-optimal hyper-parameter combinations within a few iterations [7].

Common surrogate models for BO include Gaussian process (GP) [89], random forest (RF) [90], and the tree Parzen estimator (TPE) [12]. Therefore, there are three main types of BO algorithms based on their surrogate models: BO-GP, BO-RF, BO-TPE. An alternative name for BO-RF is sequential model-based algorithm configuration (SMAC) [90].

4.3.1. BO-GP

Gaussian process (GP) is a standard surrogate model for objective function modeling in BO [87]. Assuming that the function f with a mean μ and a covariance σ^2 is a realization of a GP, the predictions follow a normal distribution [91]:

$$p(y \mid x, D) = N\left(y \mid \mu, \sigma^{2}\right), \tag{22}$$

where D is the configuration space of hyper-parameters, and y = f(x) is the evaluation result of each hyper-parameter value x. After obtaining a set of predictions, the points to be evaluated next are then selected from the confidence intervals generated by the BO-GP model. Each newly-tested data point is added to the sample records, and the BO-GP model is re-built with the new information. This procedure is repeated until termination.

Applying a BO-GP to a size n dataset has a time complexity of O(n^3) and space complexity of O(n^2) [92]. One main limitation of BO-GP is that the cubic complexity to the number of instances limits the capacity for parallelization [3]. Additionally, it is mainly used to optimize continuous variables.

4.3.2. SMAC

Random forest (RF) is another popular surrogate function for BO to model the objective function using an ensemble of regression trees. BO using RF as the surrogate model is also called SMAC [90].

Assuming that there is a Gaussian model N(y|μ, σ^2), and μ and σ^2 are the mean and variance of the regression function r(x), respectively, then [90]:

$$\mu = \frac{1}{|B|} \sum_{r \in B} r(x), \tag{23}$$

$$\sigma^{2} = \frac{1}{|B| - 1} \sum_{r \in B} \left(r(x) - \mu\right)^{2}, \tag{24}$$

where B is a set of regression trees in the forest. The major procedures of SMAC are as follows [3]:

1. RF starts with building B regression trees, each constructed by sampling n instances from the training set with replacement.

2. A split node is selected from d hyper-parameters for each tree.

3. To maintain a low computational cost, both the minimum number of instances considered for further split and the number of trees to grow are set to a certain value.

4. Finally, the mean and variance for each new configuration are estimated by RF.
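The final step, estimating the mean and variance of Eqs. (23) and (24) from the individual tree predictions, can be sketched as follows. The function name and the example predictions are assumptions for illustration; building the regression trees themselves is outside the scope of this sketch.

```python
from statistics import mean, variance

def forest_mean_and_variance(tree_predictions):
    """Empirical mean and variance across the predictions r(x) of the trees in B.

    statistics.variance divides by |B| - 1, matching Eq. (24);
    at least two tree predictions are required.
    """
    mu = mean(tree_predictions)
    sigma2 = variance(tree_predictions, mu)
    return mu, sigma2

# Assumed predictions of three regression trees for one candidate configuration.
mu, sigma2 = forest_mean_and_variance([0.2, 0.4, 0.6])
print(mu, sigma2)
```

A large variance indicates that the trees disagree about a configuration, which an acquisition function can interpret as a region that is still worth exploring.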

Compared with BO-GP, the main advantage of SMAC is its support for all types of variables, including continuous, discrete, categorical, and conditional hyper-parameters [91]. The time complexities of using SMAC to fit and predict variances are O(n log n) and O(log n), respectively, which are much lower than the complexities of BO-GP [3].

4.3.3. BO-TPE

Tree-structured Parzen estimator (TPE) [12] is another common surrogate model for BO. Instead of defining a predictive distribution used in BO-GP, BO-TPE creates two density functions, l(x) and g(x), to act as the generative models for all domain variables [3]. To apply TPE, the observation results are divided into good results and poor results by a pre-defined percentile y*, and the two sets of results are modeled by simple Parzen windows [12]:

$$p(x \mid y, D) = \begin{cases} l(x), & \text{if } y < y^{*} \\ g(x), & \text{if } y > y^{*} \end{cases}. \tag{25}$$

After that, the expected improvement in the acquisition function is reflected by the ratio between the two density functions, which is used to determine the new configurations for evaluation. The Parzen estimators are organized in a tree structure, so the specified conditional dependencies are retained. Therefore, TPE naturally supports specified conditional hyper-parameters [91]. The time complexity of BO-TPE is O(n log n), which is lower than the complexity of BO-GP [3].

BO methods are effective for many HPO problems, even if the objective function f is stochastic, non-convex, or non-continuous. However, the main drawback of BO models is that, if they fail to achieve the balance between exploration and exploitation, they might only reach a local instead of a global optimum. RS does not have this limitation since it does not focus on any specific area. Additionally, it is difficult to parallelize BO models since their intermediate results are dependent on each other [7].
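The density-ratio idea can be sketched for a single continuous hyper-parameter. This is a heavily simplified illustration and not the algorithm of [12]: it uses fixed-bandwidth Gaussian Parzen windows, treats the objective as a loss to be minimized, and scores a supplied list of candidates instead of sampling them from l(x). The function names, the quantile, and the bandwidth are all assumptions.

```python
import math

def parzen_density(x, samples, bandwidth):
    """Gaussian Parzen-window density estimate at x."""
    norm = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm

def tpe_suggest(observations, candidates, quantile=0.25, bandwidth=0.1):
    """Pick the candidate with the largest ratio l(x) / g(x).

    observations: list of (x, y) pairs with y a loss; at least two are needed
    so that both the good and the poor group are non-empty.
    """
    ranked = sorted(observations, key=lambda pair: pair[1])
    n_good = min(len(ranked) - 1, max(1, math.ceil(quantile * len(ranked))))
    good = [x for x, _ in ranked[:n_good]]   # y below the threshold y*: models l(x)
    poor = [x for x, _ in ranked[n_good:]]   # y above the threshold y*: models g(x)

    def ratio(x):
        return parzen_density(x, good, bandwidth) / (parzen_density(x, poor, bandwidth) + 1e-12)

    return max(candidates, key=ratio)

# Assumed history: a toy loss (x - 0.7)^2 observed on a coarse grid.
observations = [(i / 10, (i / 10 - 0.7) ** 2) for i in range(11)]
candidates = [i * 0.05 for i in range(21)]
print(tpe_suggest(observations, candidates))
```

The suggested point lies where good observations are dense and poor observations are sparse, which is how the ratio stands in for expected improvement.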


4.4. Multi-fidelity Optimization Algorithms

One major issue with HPO is the long execution time, which increases with a larger number of hyper-parameter values and larger datasets. The execution time may be several hours, several days, or even more [93]. Multi-fidelity optimization techniques are common approaches to solve the constraint of limited time and resources. To save time, people can use a subset of the original dataset or a subset of the features [94]. Multi-fidelity involves low-fidelity and high-fidelity evaluations and combines them for practical applications [95]. In low-fidelity evaluations, a relatively small subset is evaluated at a low cost but with poor generalization performance. In high-fidelity evaluations, a relatively large subset is evaluated with better generalization performance but at a higher cost than low-fidelity evaluations. In multi-fidelity optimization algorithms, poorly-performing configurations are discarded after each round of hyper-parameter evaluation on generated subsets, and only well-performing hyper-parameter configurations will be evaluated on the entire training set.

Bandit-based algorithms, which are categorized as multi-fidelity optimization algorithms, have shown success in dealing with deep learning optimization problems [3]. Two common bandit-based techniques are successive halving [96] and Hyperband [16].

4.4.1. Successive Halving

Theoretically speaking, exhaustive methods are able to identify the best hyper-parameter combination by evaluating all the given combinations. However, many factors, including limited time and resources, should be considered in practical applications. These factors are called budgets (B). To overcome the limitations of GS and RS and to improve efficiency, successive halving algorithms were proposed in [96].

The main process of using successive halving algorithms for HPO is as follows. Firstly, it is presumed that there are n sets of hyper-parameter combinations, and that they are evaluated with uniformly-allocated budgets (b = B/n). Then, according to the evaluation results for each iteration, the poorly-performing half of the hyper-parameter configurations is eliminated, and the better-performing half is passed to the next iteration with double budgets (b_{i+1} = 2 · b_i). The above process is repeated until the final optimal hyper-parameter combination is detected.
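The process described above can be sketched in a few lines. This is a minimal illustration rather than the algorithm of [96] verbatim: the `evaluate(config, budget)` callback, the toy loss, and the candidate values are hypothetical, and lower loss is assumed to be better.

```python
def successive_halving(configs, evaluate, total_budget):
    """Keep the better half each round and double the per-configuration budget.

    evaluate(config, budget) must return a loss (lower is better).
    """
    survivors = list(configs)
    budget = total_budget / len(survivors)      # b = B / n
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget))
        survivors = ranked[: max(1, len(ranked) // 2)]   # drop the worse half
        budget *= 2                              # b_{i+1} = 2 * b_i
    return survivors[0]


def toy_loss(config, budget):
    """Hypothetical loss: distance to 0.3 plus an error that shrinks with budget."""
    return abs(config - 0.3) + 1.0 / budget


candidates = [i / 10 for i in range(8)]          # 0.0, 0.1, ..., 0.7
best = successive_halving(candidates, toy_loss, total_budget=80)
```

With eight candidates the loop runs three rounds (8, 4, then 2 configurations) at budgets 10, 20, and 40 per configuration.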

Successive halving is more efficient than RS, but is affected by the trade-off between the number of hyper-parameter configurations and the budgets allocated to each configuration [6]. Thus, the main concern of successive halving is how to allocate the budget and how to determine whether to test fewer configurations with a higher budget for each or to test more configurations with a lower budget for each [2].

4.4.2. Hyperband

Hyperband [16] is then proposed to solve the dilemma of successive halving algorithms by dynamically choosing a reasonable number of configurations. It aims to achieve a trade-off between the number of hyper-parameter configurations (n) and their allocated budgets by dividing the total budgets (B) into n pieces and allocating these pieces to each configuration (b = B/n). Successive halving serves as a subroutine on each set of random configurations to eliminate the poorly-performing hyper-parameter configurations and improve efficiency. The main steps of Hyperband algorithms are shown in Algorithm 1 [2].

Algorithm 1 Hyperband

Input: b_max, b_min

1: s_max = log(b_max / b_min)

2: for s ∈ {s_max, s_max − 1, . . . , 0} do

3:     n = DetermineBudget(s)

4:     γ = SampleConfigurations(n)

5:     SuccessiveHalving(γ)

6: end for

7: return The best configuration so far.

Firstly, the budget constraints b_min and b_max are determined by the total number of data points, the minimum number of instances required to train a sensible model, and the available budgets. After that, the number of configurations n and the budget size allocated to each configuration are calculated based on b_min and b_max in steps 2-3 of Algorithm 1. The configurations are sampled based on n and b, and then passed to the successive halving model demonstrated in steps 4-5. The successive halving algorithm discards the identified poorly-performing configurations and passes the well-performing configurations on to the next iteration. This process is repeated until the final optimal hyper-parameter configuration is identified. By involving the successive halving searching method, Hyperband has a computational complexity of O(n log n) [16].
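The following sketch shows one way the outer loop of Algorithm 1 might turn b_min and b_max into concrete schedules. Algorithm 1 does not define DetermineBudget(s), so the bracket-size rule used here, n = ceil((s_max + 1) · η^s / (s + 1)) with a reduction factor η = 3, is an assumption taken from the commonly used formulation of Hyperband, and the function names are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def hyperband_brackets(b_min, b_max, eta=3):
    """Return, for each s = s_max, ..., 0, a list of (n_configs, budget) rungs.

    Assumes an integer reduction factor eta (eta = 3 is an illustrative default).
    """
    # s_max = floor(log_eta(b_max / b_min)), computed without calling log.
    s_max = 0
    while b_min * eta ** (s_max + 1) <= b_max:
        s_max += 1
    brackets = []
    for s in range(s_max, -1, -1):
        n = ceil_div((s_max + 1) * eta ** s, s + 1)   # configurations to sample
        b = b_max / eta ** s                          # starting budget per config
        # Inner successive-halving schedule: keep 1/eta, multiply budget by eta.
        rungs = [(n // eta ** i, b * eta ** i) for i in range(s + 1)]
        brackets.append(rungs)
    return brackets


brackets = hyperband_brackets(b_min=1, b_max=81)
for s, rungs in zip(range(len(brackets) - 1, -1, -1), brackets):
    print(f"s={s}: {rungs}")
```

Early brackets (large s) try many configurations on small budgets, while the last bracket (s = 0) evaluates a few configurations on the full budget, which is how Hyperband hedges between the two extremes of the trade-off.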


4.4.3. BOHB

Bayesian Optimization HyperBand (BOHB) [97] is a state-of-the-art HPO technique that combines Bayesian optimization and Hyperband to incorporate the advantages of both while avoiding their drawbacks. The original Hyperband uses a random search to search the hyper-parameter configuration space, which has a low efficiency. BOHB replaces the RS method with BO to achieve both high performance and low execution time by effectively using parallel resources to optimize all types of hyper-parameters. In BOHB, TPE is the standard surrogate model for BO, but it uses multidimensional kernel density estimators. Therefore, the complexity of BOHB is also O(n log n) [97].

It has been shown that BOHB outperforms many other optimization techniques when tuning SVM and DL models [97]. The only limitation of BOHB is that it requires the evaluations on subsets with small budgets to be representative of evaluations on the entire training set; otherwise, BOHB may have a slower convergence speed than standard BO models.

4.5. Metaheuristic Algorithms

Metaheuristic algorithms [98] are a set of algorithms mainly inspired by biological theories and widely used for optimization problems. Unlike many traditional optimization methods, metaheuristics have the capacity to solve non-convex, non-continuous, or non-smooth optimization problems.

Population-based optimization algorithms (POAs) are a major type of metaheuristic algorithm, including genetic algorithms (GAs), evolutionary algorithms, evolutionary strategies, and particle swarm optimization (PSO). POAs start by creating a population and updating it in each generation; each individual in every generation is then evaluated until the global optimum is identified [14]. The main differences between different POAs are the methods used to generate and select populations [17]. POAs can be easily parallelized since a population of N individuals can be evaluated on at most N threads or machines in parallel [6]. Genetic algorithms and particle swarm optimization are the two main POAs that are popularly used for HPO problems.

4.5.1. Genetic Algorithm

Genetic algorithm (GA) [18] is one of the common metaheuristic algorithms based on the evolutionary theory that individuals with the best survival capability and adaptability to the environment are more likely to survive and pass on their capabilities to future generations. The next generation will also inherit their parents' characteristics and may involve better and worse individuals. Better individuals will be more likely to survive and have more capable offspring, while the worse individuals will gradually disappear. After several generations, the individual with the best adaptability will be identified as the global optimum [99].

To apply GA to HPO problems, each chromosome or individual represents a hyper-parameter, and its decimal value is the actual input value of the hyper-parameter in each evaluation. Every chromosome has several genes, which are binary digits; and then crossover and mutation operations are performed on the genes of this chromosome. The population involves all possible values within the initialized chromosome/parameter ranges, while the fitness function characterizes the evaluation metrics of the parameters [99].

Since the randomly-initialized parameter values often do not include the optimal parameter values, several operations on the well-performing chromosomes, including selection, crossover, and mutation operations, must be performed to identify the optimums [18]. Chromosome selection is implemented by selecting those chromosomes with good fitness function values. To keep the population size unchanged, the chromosomes with good fitness function values are passed to the next generation with higher probability, where they generate new chromosomes with the parents' best characteristics. Chromosome selection ensures that good characteristics of each generation can be passed to later generations. Crossover is used to generate new chromosomes by exchanging a proportion of genes in different chromosomes. Mutation operations are also used to generate new chromosomes by randomly altering one or more genes of a chromosome. Crossover and mutation operations enable later generations to have different characteristics and reduce the chance of missing some good characteristics [3].

The main procedures of GA are as follows [98]:

1. Randomly initialize the population, chromosomes, and genes representing the entire search space, hyper-parameters, and hyper-parameter values, respectively.

2. Evaluate the performance of each individual in the current generation by calculating the fitness function, which indicates the objective function of a ML model.

3. Perform selection, crossover, and mutation operations on the chromosomes to produce a new generation involving the next hyper-parameter configurations to be evaluated.

4. Repeat steps 2 & 3 until the termination condition is met.

5. Terminate and output the optimal hyper-parameter configuration.
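The five steps above can be sketched as follows for a single integer hyper-parameter encoded by binary genes. The details are assumptions made for illustration and are not prescribed by [98]: tournament selection, single-point crossover, bit-flip mutation, elitism (carrying the best chromosome over unchanged), the rates, and the toy `fitness` function that stands in for a model's validation score.

```python
import random

N_GENES = 6  # six binary genes encode an integer value in [0, 63]


def decode(chromosome):
    """Decimal value of the binary genes: the actual hyper-parameter value."""
    return int("".join(map(str, chromosome)), 2)


def fitness(chromosome):
    """Hypothetical stand-in for a validation score; it peaks at the value 42."""
    return -abs(decode(chromosome) - 42)


def tournament(population, rng, k=3):
    """Selection: the fittest of k randomly drawn chromosomes becomes a parent."""
    return max(rng.sample(population, k), key=fitness)


def crossover(parent_a, parent_b, rng):
    """Single-point crossover: exchange the genes after a random cut point."""
    point = rng.randrange(1, N_GENES)
    return parent_a[:point] + parent_b[point:]


def mutate(chromosome, rng, rate=0.05):
    """Flip each gene independently with a small probability."""
    return [1 - gene if rng.random() < rate else gene for gene in chromosome]


def genetic_algorithm(pop_size=20, generations=30, seed=0):
    rng = random.Random(seed)
    # Step 1: random initialization.
    population = [[rng.randint(0, 1) for _ in range(N_GENES)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):                  # Step 4: repeat steps 2 and 3
        elite = max(population, key=fitness)      # Step 2: evaluate fitness
        history.append(fitness(elite))
        children = [elite]                        # elitism (an assumption)
        while len(children) < pop_size:           # Step 3: produce new generation
            child = crossover(tournament(population, rng),
                              tournament(population, rng), rng)
            children.append(mutate(child, rng))
        population = children
    best = max(population, key=fitness)           # Step 5: output the best found
    history.append(fitness(best))
    return best, history


best, history = genetic_algorithm()
print(decode(best), history[-1])
```

Because the elite is copied unchanged, the best fitness never decreases between generations; without elitism, crossover or mutation could destroy the best chromosome found so far.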

Among the above steps, the population initialization step is an important step of GA and PSO since it provides an initial guess of the optimal values. Although the initialized values will be iteratively improved in the optimization process, a suitable population initialization method can significantly improve the convergence speed and performance of POAs. A good initial population of hyper-parameters should involve individuals that are close to global optimums by covering the promising regions and should not be localized to an unpromising region of the search space [100].

To generate hyper-parameter configuration candidates for the initial population, random initialization that simply creates the initial population with random values in the given search space is often used in GA [101]. Thus, GA is easily implemented and does not necessitate good initializations, because its selection, crossover, and mutation operations lower the possibility of missing the global optimum.

Hence, it is useful when the data analyst does not have much experience determining a potential appropriate initial search space for the hyper-parameters. The main limitation of GA is that the algorithm itself introduces additional hyper-parameters to be configured, including the fitness function type, population size, crossover rate, and mutation rate. Moreover, GA is a sequential execution algorithm, making it difficult to parallelize. The time complexity of GA is O(n^2) [102]. As a result, sometimes, GA may be inefficient due to low convergence speed.

4.5.2. Particle Swarm Optimization

Particle swarm optimization (PSO) [103] is another set of evolutionary algorithms that are commonly used for optimization problems. PSO algorithms are inspired by biological populations that exhibit both individual and social behaviors [17]. PSO works by enabling a group of particles (swarm) to traverse the search space in a semi-random manner [9]. PSO algorithms identify the optimal solution through cooperation and information sharing among individual particles in a group.

In PSO, there is a group of n particles in a swarm S [2]:

$$S = (S_1, S_2, \cdots, S_n), \tag{26}$$

and each particle $S_i$ is represented by a vector:

$$S_i = \langle \vec{x}_i, \vec{v}_i, \vec{p}_i \rangle, \tag{27}$$

where $\vec{x}_i$ is the current position, $\vec{v}_i$ is the current velocity, and $\vec{p}_i$ is the known best position of the particle so far.

PSO initially generates each particle with a random position and a random velocity. Every particle evaluates the current position and records the position with its performance score. In the next iteration, the velocity $\vec{v}_i$ of each particle is changed based on its own best position so far $\vec{p}_i$ and the current global optimal position $\vec{p}$:

$$\vec{v}_i := \vec{v}_i + U(0, \varphi_1)(\vec{p}_i - \vec{x}_i) + U(0, \varphi_2)(\vec{p} - \vec{x}_i), \tag{28}$$

where $U(0, \varphi)$ denotes a continuous uniform distribution based on the acceleration constants $\varphi_1$ and $\varphi_2$.

After that, the particles move based on their new velocity vectors:

$$\vec{x}_i := \vec{x}_i + \vec{v}_i. \tag{29}$$

The above procedures are repeated until convergence or termination constraints are reached.
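The update rules in Eqs. (28) and (29) can be sketched as below. The velocity clamp `v_max`, the clipping of positions to the search bounds, the default constants, and the toy `sphere` objective are assumptions added to keep the illustration stable; they are not part of the equations above.

```python
import random


def pso(objective, low, high, dim=2, n_particles=15, iterations=60,
        phi1=2.0, phi2=2.0, seed=0):
    """Minimise `objective` over [low, high]^dim with a basic particle swarm."""
    rng = random.Random(seed)
    v_max = 0.2 * (high - low)                    # velocity clamp (an assumption)
    xs = [[rng.uniform(low, high) for _ in range(dim)] for _ in range(n_particles)]
    vs = [[rng.uniform(-v_max, v_max) for _ in range(dim)] for _ in range(n_particles)]
    pbest = [x[:] for x in xs]                    # p_i: best position per particle
    pbest_val = [objective(x) for x in xs]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # p: best position of the swarm
    for _ in range(iterations):
        for i in range(n_particles):
            for d in range(dim):
                # Eq. (28): pull toward the particle's best and the global best.
                v = (vs[i][d]
                     + rng.uniform(0, phi1) * (pbest[i][d] - xs[i][d])
                     + rng.uniform(0, phi2) * (gbest[d] - xs[i][d]))
                vs[i][d] = max(-v_max, min(v_max, v))
                # Eq. (29): move, then clip to the search bounds.
                xs[i][d] = max(low, min(high, xs[i][d] + vs[i][d]))
            value = objective(xs[i])
            if value < pbest_val[i]:
                pbest[i], pbest_val[i] = xs[i][:], value
                if value < gbest_val:
                    gbest, gbest_val = xs[i][:], value
    return gbest, gbest_val


def sphere(x):
    """Hypothetical objective with its minimum at the origin."""
    return sum(c * c for c in x)


best_x, best_val = pso(sphere, low=-5.0, high=5.0)
print(best_x, best_val)
```

Each particle only needs its own best position and the shared global best, so the objective evaluations within one iteration could be distributed across workers and synchronized once per iteration.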

Compared with GA, it is easier to implement PSO, since PSO does not have certain additional operations like crossover and mutation. In GA, all chromosomes share information with each other, so the entire population moves uniformly toward the optimal region; while in PSO, only information on the individual best particle and the global best particle is transmitted to others, which is a one-way flow of information sharing, and the entire search process follows the direction of the current optimal solution [2]. The computational complexity of the PSO algorithm is O(n log n) [104]. In most cases, the convergence speed of PSO is faster than that of GA. In addition, particles in PSO operate independently and only need to share information with each other after each iteration, so this process is easily parallelized to improve model efficiency [9].

The main limitation of PSO is that it requires proper population initialization; otherwise, it might only reach a local instead of a global optimum, especially for discrete hyper-parameters [105]. Proper population initialization requires developers' prior experience or can be obtained by population initialization techniques. Many population initialization techniques have been proposed to improve the performance of evolutionary algorithms, like the opposition-based optimization algorithm [101] and the space transformation search method [106]. Involving additional population initialization techniques will require more execution time and resources.

5. Applying Optimization Techniques to Machine Learning Algorithms

5.1. Optimization Techniques Analysis

Grid search (GS) is a simple method; its major limitation is that it is time-consuming and impacted by the curse of dimensionality [83]. Thus, it is unsuitable for a large number of hyper-parameters. Moreover, GS is often not able to detect the global optimum of continuous parameters, since it requires a pre-defined, finite set of hyper-parameter values. It is also not realistic for GS to be used to identify integer and continuous hyper-parameter optimums with limited time and resources. Therefore, compared with other techniques, GS is only efficient for a small number of categorical hyper-parameters.

Random search is more efficient than GS and supports all types of hyper-parameters. In practical applications, using RS to evaluate the randomly-selected hyper-parameter values helps analysts to explore a large search space. However, since RS does not consider previously-tested results, it may involve many unnecessary evaluations, which decrease its efficiency [13].

Hyperband can be considered an improved version of RS, and they both support parallel executions. Hyperband balances model performance and resource usage, so it is more efficient than RS, especially with limited time and resources [15]. However, GS, RS, and Hyperband all have a major constraint in that they treat each hyper-parameter independently and do not consider hyper-parameter correlations [107]. Thus, they will be inefficient for ML algorithms with conditional hyper-parameters, like SVM, DBSCAN, and logistic regression.

Gradient-based algorithms are not a prevalent choice for hyper-parameter optimization, since they only support continuous hyper-parameters and can only detect a local instead of a global optimum for non-convex HPO problems [2]. Therefore, gradient-based algorithms can only be used to optimize certain hyper-parameters, like the learning rate in DL models.

Bayesian optimization models are divided into three different models, BO-GP, SMAC, and BO-TPE, based on their surrogate models. BO algorithms determine the next hyper-parameter value based on the previously-evaluated results to reduce unnecessary evaluations and improve efficiency. BO-GP mainly supports continuous and discrete hyper-parameters (by rounding them), but does not support conditional hyper-parameters [14]; while SMAC and BO-TPE are both able to handle categorical, discrete, continuous, and conditional hyper-parameters. SMAC performs better when there are many categorical and conditional parameters, or cross-validation is used, while BO-GP performs better for only a few continuous parameters [15]. BO-TPE preserves the specified conditional relationships, so one advantage of BO-TPE over BO-GP is its innate support for specified conditional hyper-parameters [14].

Metaheuristic algorithms, including GA and PSO, are more complicated than many other HPO algorithms, but often perform well for complex optimization problems. They support all types of hyper-parameters and are particularly efficient for large configuration spaces, since they can obtain the near-optimal solutions even within very few iterations. However, GA and PSO have their own advantages and disadvantages in practical use. PSO is able to support large-scale parallelization, and is particularly suitable for continuous and conditional HPO problems [19]; on the other hand, GA is executed sequentially, making it difficult to be parallelized. Therefore, PSO often executes faster than GA, especially for large configuration spaces and large datasets. However, an appropriate population initialization is crucial for PSO; otherwise, it may converge slowly or only identify a local instead of a global optimum. Yet, the impact of proper population initialization is not as significant for GA as for PSO [108]. Another limitation of GA is that it introduces additional hyper-parameters, like its crossover and mutation rates [18].

The strengths and limitations of the hyper-parameter optimization algorithms involved in this paper are summarized in Table 1.

Table 1: The comparison of common HPO algorithms (n is the number of hyper-parameter values and k is the number of hyper-parameters)

| HPO Method | Strengths | Limitations | Time Complexity |
|---|---|---|---|
| GS | Simple. | Time-consuming. Only efficient with categorical HPs. | O(n^k) |
| RS | More efficient than GS. Enable parallelization. | Not consider previous results. Not efficient with conditional HPs. | O(n) |
| Gradient-based models | Fast convergence speed for continuous HPs. | Only support continuous HPs. May only detect local optimums. | O(n^k) |
| BO-GP | Fast convergence speed for continuous HPs. | Poor capacity for parallelization. | O(n^3) |
| SMAC | Efficient with all types of HPs. | Poor capacity for parallelization. | O(n log n) |
| BO-TPE | Efficient with all types of HPs. Keep conditional dependencies. | Poor capacity for parallelization. | O(n log n) |
| Hyperband | Enable parallelization. | Require subsets with small budgets to be representative. | O(n log n) |
| BOHB | Efficient with all types of HPs. Enable parallelization. | Require subsets with small budgets to be representative. | O(n log n) |
| GA | Efficient with all types of HPs. Not require good initialization. | Poor capacity for parallelization. | O(n^2) |
| PSO | Efficient with all types of HPs. Enable parallelization. | Require proper initialization. | O(n log n) |

5.2. Apply HPO Algorithms to ML Models

Since there are many different HPO methods for different use cases, it is crucial to select the appropriate optimization techniques for different ML models.

Firstly, if we have access to multiple fidelities, which means that we are able to define meaningful budgets (the performance rankings of hyper-parameter configurations evaluated on small budgets should be the same as or similar to the configuration rankings on the full budget, i.e., the original dataset), BOHB would be the best choice, since it has the advantages of both BO and Hyperband [6] [97].

On the other hand, if multiple fidelities are not applicable, which means that using the subsets of the original dataset or the subsets of original features is misleading or too noisy to reflect the performance of the entire dataset, BOHB may perform poorly with higher time complexity than standard BO models; in that case, choosing other HPO algorithms would be more efficient [97].

ML algorithms can be classified by the characteristics of their hyper-parameter configurations. Appropriate optimization algorithms can be chosen to optimize the hyper-parameters based on these characteristics.

5.2.1. One Discrete Hyper-parameter

For some ML algorithms, like certain neighbor-based, clustering, and dimensionality reduction algorithms, commonly only one discrete hyper-parameter needs to be tuned. For KNN, the major hyper-parameter is k, the number of considered neighbors. The most essential hyper-parameter of k-means, hierarchical clustering, and EM is the number of clusters. Similarly, for dimensionality reduction algorithms, including PCA and LDA, their basic hyper-parameter is 'n_components', the number of features to be extracted.

In these situations, Bayesian optimization is the best choice, and the three surrogates could be tested to find the best one. Hyperband is another good choice, which may have a fast execution speed due to its capacity for parallelization. In some cases, people may want to fine-tune the ML model by considering other less important hyper-parameters, like the distance metric of KNN and the SVD solver type of PCA; so BO-TPE, GA, or PSO could be chosen for these situations.

5.2.2. One Continuous Hyper-parameter

Some linear models, including ridge and lasso algorithms, and some naive Bayes algorithms, including multinomial NB, Bernoulli NB, and complement NB, generally only have one vital continuous hyper-parameter to be tuned. In ridge and lasso algorithms, the continuous hyper-parameter is 'alpha', the regularization strength. In the three NB algorithms mentioned above, the critical hyper-parameter is also named 'alpha', but it represents the additive (Laplace/Lidstone) smoothing parameter. In terms of these ML algorithms, BO-GP is the best choice, since it is good at optimizing a small number of continuous hyper-parameters. Gradient-based algorithms can also be used, but might only detect local optimums, so they are less effective than BO-GP.

5.2.3. A Few Conditional Hyper-parameters

It is noticeable that many ML algorithms have conditional hyper-parameters, like SVM, LR, and DBSCAN. LR has three correlated hyper-parameters, 'penalty', 'C', and the solver type. Similarly, DBSCAN has 'eps' and 'min_samples' that must be tuned in conjunction. SVM is more complex, since after setting a different kernel type, there is a separate set of conditional hyper-parameters that need to be tuned next, as described in Section 3.1.3. Hence, some HPO methods that cannot effectively optimize conditional hyper-parameters, including GS, RS, BO-GP, and Hyperband, are not suitable for ML models with conditional hyper-parameters. For these ML methods, BO-TPE is the best choice if we have pre-defined relationships among the hyper-parameters. SMAC is also a good choice, since it also performs well for tuning conditional hyper-parameters. GA and PSO can be used, as well.


5.2.4. A Large Hyper-parameter Configuration Space with Multiple Types of Hyper-parameters

In ML, tree-based algorithms, including DT, RF, ET, and XGBoost, as well as DL algorithms, like DNN, CNN, and RNN, are the most complex types of ML algorithms to be tuned, since they have many hyper-parameters of various types. For these ML models, PSO is the best choice since it enables parallel executions to improve efficiency, particularly for DL models that often require massive training time. Some other techniques, like GA, BO-TPE, and SMAC, can also be used, but they may cost more time than PSO, since it is difficult to parallelize these techniques.

5.2.5. Categorical Hyper-parameters

This category of hyper-parameters is mainly for ensemble learning algorithms, since their major hyper-parameter is a categorical hyper-parameter. For bagging and AdaBoost, the categorical hyper-parameter is 'base_estimator', which is set to be a single ML model. For voting, it is 'estimators', indicating a list of single ML models to be combined. The voting method has another categorical hyper-parameter, 'voting', which is used to choose whether to use a hard or soft voting method. If we only consider these categorical hyper-parameters, GS would be sufficient to test their suitable base machine learners. On the other hand, in many cases, other hyper-parameters need to be considered, like 'n_estimators', 'max_samples', and 'max_features' in bagging, as well as 'n_estimators' and 'learning_rate' in AdaBoost; consequently, BO algorithms would be a better choice to optimize these continuous or discrete hyper-parameters.

In conclusion, when tuning a ML model to achieve high model performance and low computational costs, the most suitable HPO algorithm should be selected based on the properties of its hyper-parameters.
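The guidelines of Section 5.2 can be condensed into a small lookup, shown here only as a reading aid. The function name, the profile keys, and the ordering within each list are hypothetical; the recommendations themselves are taken from the subsections above, with the first entry being the one named as the best choice.

```python
def recommend_hpo(profile, multi_fidelity=False):
    """Return candidate HPO methods for a hyper-parameter profile.

    `profile` is one of the hypothetical keys below; `multi_fidelity` states
    whether small-budget evaluations rank configurations like the full budget.
    """
    if multi_fidelity:
        return ["BOHB"]                                   # start of Section 5.2
    guidelines = {
        "one_discrete": ["BO-GP", "SMAC", "BO-TPE", "Hyperband"],   # 5.2.1
        "one_continuous": ["BO-GP", "Gradient-based"],              # 5.2.2
        "few_conditional": ["BO-TPE", "SMAC", "GA", "PSO"],         # 5.2.3
        "large_mixed": ["PSO", "GA", "BO-TPE", "SMAC"],             # 5.2.4
        "categorical_only": ["GS"],                                 # 5.2.5
        "categorical_plus_others": ["BO-GP", "SMAC", "BO-TPE"],     # 5.2.5
    }
    return guidelines[profile]


print(recommend_hpo("few_conditional"))        # e.g. SVM, LR, DBSCAN
print(recommend_hpo("large_mixed", multi_fidelity=True))
```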

6. Existing HPO Frameworks

To tackle HPO problems, many open-source libraries exist to apply theory into practice and lower the threshold for ML developers. In this section, we provide a brief introduction to some popular open-source HPO libraries or frameworks mainly for Python programming. The principles behind the involved optimization algorithms are provided in Section 4.


6.1. Sklearn

In sklearn [31], 'GridSearchCV' can be implemented to detect the optimal hyper-parameters using the GS algorithm. Each hyper-parameter value in the human-defined configuration space is evaluated by the program, with its performance evaluated using cross-validation. When all the instances in the configuration space have been evaluated, the optimal hyper-parameter combination in the defined search space with its performance score will be returned.

'RandomizedSearchCV' is also provided in sklearn to implement a RS method. It evaluates a pre-defined number of randomly-selected hyper-parameter values in parallel. Cross-validation is conducted to effectively evaluate the performance of each configuration.

6.2. Spearmint

Spearmint [87] is a library using Bayesian optimization with the Gaussian process as the surrogate model. Spearmint's primary deficiency is that it is not very efficient for categorical and conditional hyper-parameters.

6.3. BayesOpt

Bayesian Optimization (BayesOpt) [109] is a Python library employed to solve HPO problems using BO. BayesOpt uses a Gaussian process as its surrogate model to calculate the objective function based on past evaluations and utilizes an acquisition function to determine the next values.

6.4. Hyperopt

Hyperopt [110] is a HPO framework that involves RS and BO-TPE as the optimization algorithms. Unlike some of the other libraries that only support a single model, Hyperopt is able to use multiple instances to model hierarchical hyper-parameters. In addition, Hyperopt is parallelizable since it uses MongoDB as the central database to store the hyper-parameter combinations. hyperopt-sklearn [111] and hyperas [112] are the two libraries that can apply Hyperopt to scikit-learn and Keras libraries.

6.5. SMAC

SMAC [90][113] is another library that uses BO with random forest as the surrogate model. It supports categorical, continuous, and discrete variables.


6.6. BOHB

BOHB framework [97] is a combination of Bayesian optimization and Hyperband [15]. It overcomes one limitation of Hyperband, namely that it randomly generates the test configurations, by replacing this procedure with BO. TPE is used as the surrogate model to store and model function evaluations. Using BOHB to evaluate the instance can achieve a trade-off between model performance and the current budget.

6.7. Optunity

Optunity [83] is a popular HPO framework that provides several optimization techniques, including GS, RS, PSO, and BO-TPE. In Optunity, categorical hyper-parameters are converted to discrete hyper-parameters by indexing, and discrete hyper-parameters are processed as continuous hyper-parameters by rounding them; as such, it supports all kinds of hyper-parameters.

6.8. Skopt

Skopt (scikit-optimize) [114] is a HPO library that is built on top of the scikit-learn [31] library. It implements several sequential model-based optimization models, including RS and BO-GP. The methods exhibit good performance with small search space and proper initialization.

6.9. GpFlowOpt

GpFlowOpt [115] is a Python library for BO using GP as the surrogate model. It supports running BO-GP on GPU using the Tensorflow library. Therefore, GpFlowOpt is a good choice if BO is used in deep learning models, and GPU resources are available.

6.10. Talos

Talos [116] is a Python package designed for hyper-parameter optimization with Keras models. Talos can be fully deployed into any Keras models and implemented easily without learning any new syntax. Several optimization techniques, including GS, RS, and probabilistic reduction, can be implemented using Talos.


6.11. Sherpa

Sherpa [117] is a Python package used for HPO problems. It can be used with other ML libraries, including sklearn [31], Tensorflow [118], and Keras [33]. It supports parallel computations and has several optimization methods, including GS, RS, BO-GP (via GPyOpt), Hyperband, and population-based training (PBT).

6.12. Osprey

Osprey [119] is a Python library designed to optimize hyper-parameters. Several HPO strategies are available in Osprey, including GS, RS, BO-TPE (via Hyperopt), and BO-GP (via GPyOpt).

6.13. FAR-HO

FAR-HO [120] is a hyper-parameter optimization package that employs gradient-based algorithms with TensorFlow. FAR-HO contains a few gradient-based optimizers, like reverse hyper-gradient and forward hyper-gradient methods. This library is designed to build access to the gradient-based hyper-parameter optimizers in TensorFlow, allowing deep learning model training and hyper-parameter optimization in GPU or other tensor-optimized computing environments.

6.14. Hyperband

Hyperband [16] is a Python package for tuning hyper-parameters by Hyperband, a bandit-based approach. Similar to 'GridSearchCV' and 'RandomizedSearchCV' in scikit-learn, there is a class named 'HyperbandSearchCV' in Hyperband that can be combined with sklearn and used for HPO problems. In the 'HyperbandSearchCV' method, cross-validation is used for evaluation.

6.15. DEAP

DEAP [121] is a novel evolutionary computation package for Python that contains several evolutionary algorithms like GA and PSO. It integrates with parallelization mechanisms like multiprocessing, and machine learning packages like sklearn.


6.16. TPOT

TPOT [122] is a Python tool for auto-ML that uses genetic programmingto optimize ML pipelines. TPOT is built on top of sklearn, so it is easy toimplement TPOT on ML models. ’TPOTClassifier’ is its principal function,and several additional hyper-parameters of GA must be set to fit specificproblems.

6.17. Nevergrad

Nevergrad [123] is an open-source Python library that includes a widerange of optimizers, like fast-GA and PSO. In ML, Nevergrad can be usedto tune all types of hyper-parameters, including discrete, continuous, andcategorical hyper-parameters, by choosing different optimizers.

7. Experiments

To summarize the content of Sections 3 to 6, a comprehensive overviewof applying hyper-parameter optimization techniques to ML models is shownin Table 2. It provides a summary of common ML algorithms, their hyper-parameters, suitable optimization methods, and available Python libraries;thus, data analysts and researchers can look up this table and select suitableoptimization algorithms as well as libraries for practical use.

To put theory into practice, several experiments have been conductedbased on Table 2. This section provides the experiments of applying eightdifferent HPO techniques to three common and representative ML algorithmson two benchmark datasets. In the first part of this section, the experimentalsetup and the main process of HPO are discussed. In the second part, theresults of utilizing different HPO methods are compared and analyzed. Thesample code of the experiments has been published in [124] to illustrate theprocess of applying hyper-parameter optimization to ML models.

7.1. Experimental Setup

Based on the steps to optimize hyper-parameters discussed in Section 2.2, several steps were completed before the actual optimization experiments.

Firstly, two standard benchmarking datasets provided by the sklearn library [31], namely, the Modified National Institute of Standards and Technology dataset (MNIST) and the Boston housing dataset, are selected as the benchmark datasets for HPO method evaluation on data analytics problems. MNIST is a hand-written digit recognition dataset used as a multi-class classification problem, while the Boston housing dataset contains information about the price of houses in various places in the city of Boston and can be used as a regression dataset to predict the housing prices.

Table 2: A comprehensive overview of common ML models, their hyper-parameters, suitable optimization techniques, and available Python libraries

| ML Algorithm | Main HPs | Optional HPs | HPO methods | Libraries |
|---|---|---|---|---|
| Linear regression | - | - | - | - |
| Ridge & lasso | alpha | - | BO-GP | Skopt |
| Logistic regression | penalty, C, solver | - | BO-TPE, SMAC | Hyperopt, SMAC |
| KNN | n_neighbors | weights, p, algorithm | BOs, Hyperband | Skopt, Hyperopt, SMAC, Hyperband |
| SVM | C, kernel, epsilon (for SVR) | gamma, coef0, degree | BO-TPE, SMAC, BOHB | Hyperopt, SMAC, BOHB |
| NB | alpha | - | BO-GP | Skopt |
| DT | criterion, max_depth, min_samples_split, min_samples_leaf, max_features | splitter, min_weight_fraction_leaf, max_leaf_nodes | GA, PSO, BO-TPE, SMAC, BOHB | TPOT, Optunity, SMAC, BOHB |
| RF & ET | n_estimators, max_depth, criterion, min_samples_split, min_samples_leaf, max_features | splitter, min_weight_fraction_leaf, max_leaf_nodes | GA, PSO, BO-TPE, SMAC, BOHB | |
| XGBoost | n_estimators, max_depth, learning_rate, subsample, colsample_bytree | min_child_weight, gamma, alpha, lambda | GA, PSO, BO-TPE, SMAC, BOHB | |
| Voting | estimators, voting | weights | GS | sklearn |
| Bagging | base_estimator, n_estimators | max_samples, max_features | GS, BOs | sklearn, Skopt, Hyperopt, SMAC |
| AdaBoost | base_estimator, n_estimators, learning_rate | - | BO-TPE, SMAC | Hyperopt, SMAC |
| Deep learning | number of hidden layers, units per layer, loss, optimizer, activation, learning rate, dropout rate, epochs, batch size, early stop patience | number of frozen layers (if transfer learning is used) | PSO, BOHB | Optunity, BOHB |
| K-means | n_clusters | init, n_init, max_iter | BOs, Hyperband | Hyperband |
| Hierarchical clustering | n_clusters, distance_threshold | linkage | BOs, Hyperband | Hyperband |
| DBSCAN | eps, min_samples | - | BO-TPE, SMAC, BOHB | Hyperopt, SMAC, BOHB |
| Gaussian mixture | n_components | covariance_type, max_iter, tol | BO-GP | Skopt |
| PCA | n_components | svd_solver | BOs, Hyperband | Hyperband |
| LDA | n_components | solver, shrinkage | BOs, Hyperband | Hyperband |

At the next stage, the ML models and their objective functions need to be configured. In Section 5, all common ML models are divided into five categories based on their hyper-parameter types. Among those categories, "one discrete hyper-parameter", "a few conditional hyper-parameters", and "a large hyper-parameter configuration space with multiple types of hyper-parameters" are the three most common cases. Thus, three ML algorithms, KNN, SVM, and RF, are selected as the target models to be optimized, since their hyper-parameter types represent these three cases: KNN has one important hyper-parameter, the number of nearest neighbors considered for each sample; SVM has a few conditional hyper-parameters, like the kernel type and the penalty parameter C; RF has multiple hyper-parameters of different types, as discussed in Section 3. Moreover, KNN, SVM, and RF can all be applied to both classification and regression problems.

In the next step, the performance metric and evaluation method are configured. For each experiment on the two selected datasets, 3-fold cross-validation is implemented to evaluate the involved HPO methods. The two most commonly used performance metrics are adopted in our experiments. For classification models, accuracy is used as the classifier performance metric, which is the proportion of correctly classified data; for regression models, the mean squared error (MSE) is used as the regressor performance metric, which measures the average squared difference between the predicted values and the actual values. Additionally, the computational time (CT), the total time needed to complete an HPO process with 3-fold cross-validation, is also used as the model efficiency metric [55]. In each experiment, the ML model that has the highest accuracy or the lowest MSE is returned together with its optimal hyper-parameter configuration.
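The metrics and the evaluation procedure described above can be sketched in plain Python as follows. This is only an illustrative sketch, not the published experiment code [124]: the helper names, the contiguous fold split, and the `fit_predict` callback signature are assumptions. The timer here wraps a single cross-validation, whereas CT in the experiments covers the whole HPO process.

```python
import time


def accuracy(y_true, y_pred):
    """Proportion of correctly classified samples."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average squared difference between predictions and targets."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def k_fold_indices(n_samples, k=3):
    """Yield (train, validation) index lists for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, val
        start += size


def cross_validate(fit_predict, X, y, metric, k=3):
    """Return the mean metric over k folds and the elapsed time in seconds."""
    started = time.perf_counter()
    scores = []
    for train, val in k_fold_indices(len(X), k):
        preds = fit_predict([X[i] for i in train], [y[i] for i in train],
                            [X[i] for i in val])
        scores.append(metric([y[i] for i in val], preds))
    return sum(scores) / k, time.perf_counter() - started


def mean_predictor(X_train, y_train, X_val):
    # Hypothetical stand-in for a regressor: always predicts the training mean.
    mean = sum(y_train) / len(y_train)
    return [mean] * len(X_val)


X = [[i] for i in range(6)]
y = [1.0] * 6
mse, ct = cross_validate(mean_predictor, X, y, mean_squared_error)
```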

After that, to fairly compare different optimization algorithms and frameworks, certain constraints should be satisfied. Firstly, we compare different HPO methods using the same hyper-parameter configuration space. For KNN, the only hyper-parameter to be optimized, 'n_neighbors', is set to be in the same range of 1 to 20 for each optimization method evaluation. The hyper-parameters of SVM and RF models for classification and regression problems are also set to be in the same configuration space for each type of problem. The specifics of the configuration space for ML models are shown in Table 3. The selected hyper-parameters and their search space are determined based on the concepts in Section 3, domain knowledge, and manual testing [81]. The hyper-parameter types of each ML algorithm are also summarized in Table 3.

Table 3: Configuration space for the hyper-parameters of tested ML models

| ML Model | Hyper-parameter | Type | Search Space |
|---|---|---|---|
| RF Classifier | n_estimators | Discrete | [10,100] |
| | max_depth | Discrete | [5,50] |
| | min_samples_split | Discrete | [2,11] |
| | min_samples_leaf | Discrete | [1,11] |
| | criterion | Categorical | ['gini', 'entropy'] |
| | max_features | Discrete | [1,64] |
| SVM Classifier | C | Continuous | [0.1,50] |
| | kernel | Categorical | ['linear', 'poly', 'rbf', 'sigmoid'] |
| KNN Classifier | n_neighbors | Discrete | [1,20] |
| RF Regressor | n_estimators | Discrete | [10,100] |
| | max_depth | Discrete | [5,50] |
| | min_samples_split | Discrete | [2,11] |
| | min_samples_leaf | Discrete | [1,11] |
| | criterion | Categorical | ['mse', 'mae'] |
| | max_features | Discrete | [1,13] |
| SVM Regressor | C | Continuous | [0.1,50] |
| | kernel | Categorical | ['linear', 'poly', 'rbf', 'sigmoid'] |
| | epsilon | Continuous | [0.001,1] |
| KNN Regressor | n_neighbors | Discrete | [1,20] |
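As a minimal sketch, the RF classifier search space of Table 3 can be written as a Python dictionary, from which one random configuration (one RS step) can be drawn and the size of the full grid can be counted. The dictionary layout and the helper names are assumptions made for illustration; only the ranges come from Table 3, and they are treated as inclusive integer ranges.

```python
import math
import random

# Search space of the RF classifier as listed in Table 3.
# Tuples are inclusive integer ranges (low, high); lists are categorical choices.
RF_CLASSIFIER_SPACE = {
    "n_estimators": (10, 100),
    "max_depth": (5, 50),
    "min_samples_split": (2, 11),
    "min_samples_leaf": (1, 11),
    "criterion": ["gini", "entropy"],
    "max_features": (1, 64),
}


def sample_config(space, rng):
    """Draw one configuration uniformly at random (a single RS step)."""
    config = {}
    for name, domain in space.items():
        if isinstance(domain, tuple):
            config[name] = rng.randint(*domain)
        else:
            config[name] = rng.choice(domain)
    return config


def grid_size(space):
    """Number of configurations in the exhaustive grid over the space."""
    return math.prod(
        d[1] - d[0] + 1 if isinstance(d, tuple) else len(d)
        for d in space.values()
    )


config = sample_config(RF_CLASSIFIER_SPACE, random.Random(0))
n_grid = grid_size(RF_CLASSIFIER_SPACE)
```

Under these assumptions the exhaustive grid contains 91 x 46 x 10 x 11 x 2 x 64 = 58,938,880 configurations, which illustrates why evaluating every point of such a space is impractical.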

On the other hand, to fairly compare the performance metrics of optimization techniques, the maximum number of iterations for all HPO methods is set to 50 for RF and SVM model optimizations, and 10 for KNN model optimization, based on manual testing and domain knowledge. Moreover, to avoid the impact of randomness, all experiments are repeated ten times with different random seeds, and the results are averaged for regression problems or determined by majority vote for classification problems.
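One possible reading of this aggregation rule is sketched below: the per-seed results are averaged for regression, and the most frequently reported value is taken for classification. The function name and the example values are hypothetical and are not taken from the experiments.

```python
from collections import Counter
from statistics import mean


def aggregate_runs(results, task):
    """Combine per-seed results: mean for regression, mode for classification."""
    if task == "regression":
        return mean(results)
    # Majority vote: the value reported by the largest number of runs.
    return Counter(results).most_common(1)[0][0]


# Hypothetical per-seed results, for illustration only.
mse_over_seeds = [25.0, 26.0, 27.0]
accuracy_over_seeds = [93.38, 93.38, 93.88]

final_mse = aggregate_runs(mse_over_seeds, "regression")
final_accuracy = aggregate_runs(accuracy_over_seeds, "classification")
```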

In Section 4, more than ten HPO methods are introduced. In our experiments, eight representative HPO approaches are selected for performance comparison, including GS, RS, BO-GP, BO-TPE, Hyperband, BOHB, GA, and PSO. After setting up the fair experimental environments for each HPO method, the HPO experiments are implemented based on the steps discussed in Section 2.2.

All experiments were conducted using Python 3.5 on a machine with a 6-core i7-8700 processor and 16 gigabytes (GB) of memory. The involved ML and HPO algorithms are evaluated using multiple open-source Python libraries and frameworks introduced in Section 6, including sklearn [31], Skopt [114], Hyperopt [110], Optunity [83], Hyperband [16], BOHB [97], and TPOT [122].

Table 4: Performance evaluation of applying HPO methods to the RF classifier on the MNIST dataset

| Optimization Algorithm | Accuracy (%) | CT (s) |
|---|---|---|
| Default HPs | 90.65 | 0.09 |
| GS | 93.32 | 48.62 |
| RS | 93.38 | 16.73 |
| BO-GP | 93.38 | 20.60 |
| BO-TPE | 93.88 | 12.58 |
| Hyperband | 93.38 | 8.89 |
| BOHB | 93.38 | 9.45 |
| GA | 93.83 | 19.19 |
| PSO | 93.73 | 12.43 |

7.2. Performance Comparison

The experiments of applying eight different HPO methods to ML models are summarized in Tables 4 to 9. Tables 4 to 6 provide the performance of each optimization algorithm when applied to RF, SVM, and KNN classifiers evaluated on the MNIST dataset after a complete optimization process, while Tables 7 to 9 demonstrate the performance of each HPO method when applied to RF, SVM, and KNN regressors evaluated on the Boston housing dataset. In the first step, each ML model is trained and evaluated with its default hyper-parameter configuration as a baseline. After that, each HPO algorithm is applied to the ML models to evaluate and compare their accuracies for classification problems, or MSEs for regression problems, and their computational time (CT).

Table 5: Performance evaluation of applying HPO methods to the SVM classifier on the MNIST dataset

| Optimization Algorithm | Accuracy (%) | CT (s) |
|---|---|---|
| GS | 97.44 | 32.90 |
| RS | 97.35 | 12.48 |
| BO-GP | 97.50 | 17.56 |
| BO-TPE | 97.44 | 3.02 |
| BOHB | 97.44 | 8.18 |
| GA | 97.44 | 16.89 |
| PSO | 97.44 | 8.33 |

Table 6: Performance evaluation of applying HPO methods to the KNN classifier on the MNIST dataset

| Optimization Algorithm | Accuracy (%) | CT (s) |
|---|---|---|
| GS | 96.22 | 7.86 |
| RS | 96.33 | 6.44 |
| BO-GP | 96.83 | 1.12 |
| BO-TPE | 96.83 | 2.33 |
| BOHB | 97.44 | 3.84 |
| GA | 96.83 | 2.34 |
| PSO | 96.83 | 1.73 |

Table 7: Performance evaluation of applying HPO methods to the RF regressor on the Boston housing dataset

| Optimization Algorithm | MSE | CT (s) |
|---|---|---|
| GS | 29.02 | 4.64 |
| RS | 27.92 | 3.42 |
| BO-GP | 26.79 | 17.94 |
| BO-TPE | 25.42 | 1.53 |
| BOHB | 25.56 | 1.88 |
| GA | 26.95 | 4.73 |
| PSO | 25.69 | 3.20 |

Table 8: Performance evaluation of applying HPO methods to the SVM regressor on the Boston housing dataset

| Optimization Algorithm | MSE | CT (s) |
|---|---|---|
| GS | 67.07 | 1.33 |
| RS | 61.40 | 0.48 |
| BO-GP | 61.27 | 5.87 |
| BO-TPE | 59.40 | 0.33 |
| BOHB | 59.67 | 0.31 |
| GA | 60.17 | 1.12 |
| PSO | 58.72 | 0.53 |

Table 9: Performance evaluation of applying HPO methods to the KNN regressor on the Boston housing dataset

| Optimization Algorithm | MSE | CT (s) |
|---|---|---|
| GS | 81.53 | 0.12 |
| RS | 80.77 | 0.11 |
| BO-GP | 80.77 | 0.49 |
| BO-TPE | 80.83 | 0.08 |
| BOHB | 80.77 | 0.09 |
| GA | 80.77 | 0.33 |
| PSO | 80.74 | 0.19 |

From Tables 4 to 9, we can see that the default HP configurations do not yield the best model performance in our experiments, which emphasizes the importance of utilizing HPO methods. GS and RS can be seen as baseline models for HPO problems. The results in Tables 4 to 9 show that the computational time of GS is often much higher than that of other optimization methods. With the same search space size, RS is faster than GS, but neither of them can guarantee to detect the near-optimal hyper-parameter configurations of ML models, especially for RF and SVM models, which have a larger search space than KNN.

The performance of BO and multi-fidelity models is much better than that of GS and RS. The computational time of BO-GP is often higher than that of other HPO methods due to its cubic time complexity, but it can obtain better performance metrics for ML models with a small-size continuous hyper-parameter space, like KNN. Conversely, Hyperband is often not able to obtain the highest accuracy or the lowest MSE, but its computational time is low because it works on small-sized subsets. The performance of BO-TPE and BOHB is often better than that of the others, since they can detect the optimal or near-optimal hyper-parameter configurations within a short computational time.
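The cubic cost of BO-GP comes from its Gaussian process surrogate. As a brief reminder, using standard GP regression notation that is assumed here rather than taken from this section (a zero prior mean, n evaluated configurations, kernel matrix K, noise variance \sigma^2, observed objective values y, and k(x) the vector of kernel values between a new point x and the evaluated points), the posterior mean used to pick the next configuration is:

```latex
\mu(x) = k(x)^{\top}\left(K + \sigma^{2} I\right)^{-1} y,
\qquad K \in \mathbb{R}^{n \times n}
```

Factorizing the n x n matrix requires O(n^3) operations, so each additional evaluation makes fitting the surrogate more expensive.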

For the metaheuristic methods, GA and PSO, the accuracies are often higher than those of other HPO methods for classification problems, and the MSEs are often lower than those of other optimization techniques for regression problems. However, their computational time is often higher than that of BO-TPE and multi-fidelity models, especially for GA, which does not support parallel executions.
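For reference, a basic PSO loop for one continuous hyper-parameter can be sketched as follows. This is a generic textbook-style variant, not the implementation used in the experiments; the coefficient values, the toy error function, and the choice of the SVM parameter C as the tuned variable are assumptions made for illustration.

```python
import math
import random


def pso_minimize(objective, low, high, n_particles=8, n_iters=30, seed=0,
                 inertia=0.7, c_personal=1.5, c_global=1.5):
    """Minimize a 1-D objective on [low, high] with a basic particle swarm."""
    rng = random.Random(seed)
    position = [rng.uniform(low, high) for _ in range(n_particles)]
    velocity = [0.0] * n_particles
    personal_best = list(position)
    global_best = min(position, key=objective)
    for _ in range(n_iters):
        for i in range(n_particles):
            # Velocity blends momentum, pull to own best, and pull to swarm best.
            velocity[i] = (
                inertia * velocity[i]
                + c_personal * rng.random() * (personal_best[i] - position[i])
                + c_global * rng.random() * (global_best - position[i])
            )
            position[i] = min(high, max(low, position[i] + velocity[i]))
            if objective(position[i]) < objective(personal_best[i]):
                personal_best[i] = position[i]
        # The swarm best is refreshed once per iteration (synchronous update).
        global_best = min(personal_best, key=objective)
    return global_best


def toy_error(c):
    # Hypothetical validation error as a function of the SVM penalty C.
    return (math.log(c) - 1.0) ** 2


best_c = pso_minimize(toy_error, low=0.1, high=50.0)
```

In this synchronous variant, the objective evaluations of the particles within one iteration do not depend on each other, which is what makes them straightforward to run in parallel.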

To summarize, GS and RS are simple to implement, but they often cannot detect the optimal hyper-parameter configurations or cost much computational time. BO-GP and GA also cost more computational time than other HPO methods, but BO-GP works well on small configuration spaces, while GA is effective for large configuration spaces. Hyperband's computational time is low, but it cannot guarantee to detect the global optimum. For ML models with a large configuration space, BO-TPE, BOHB, and PSO often work well.

8. Open Issues, Challenges, and Future Research Directions

Although there are many existing HPO algorithms and practical frameworks, some issues still need to be addressed, and several aspects in this domain could be improved. In this section, we discuss the open challenges, current research questions, and potential research directions in the future. They can be classified as model complexity challenges and model performance challenges, which are summarized in Table 10.

8.1. Model Complexity

8.1.1. Costly Objective Function Evaluations

To evaluate the performance of an ML model with different hyper-parameter configurations, its objective function must be minimized in each evaluation. Depending on the scale of data, the model complexity, and available computational resources, the evaluation of each hyper-parameter configuration may cost several minutes, hours, days, or even more [93]. Additionally, the values of certain hyper-parameters have a direct impact on the execution time, like the number of considered neighbors in KNN, the number of basic decision trees in RF, and the number of hidden layers in deep neural networks [125].

Table 10: The open challenges and future directions of HPO research

| Category | Challenges & Future Requirements | Brief Description |
|---|---|---|
| Model complexity | Costly objective function evaluations | HPO methods should reduce evaluation time on large datasets. |
| | Complex search space | HPO methods should reduce execution time on high dimensionalities (large hyper-parameter search space). |
| Model performance | Strong anytime performance | HPO methods should be able to detect the optimal or near-optimal HPs even with a very limited budget. |
| | Strong final performance | HPO methods should be able to detect the global optimum when given a sufficient budget. |
| | Comparability | There should exist a standard set of benchmarks to fairly evaluate and compare different optimization algorithms. |
| | Over-fitting and generalization | The optimal HPs detected by HPO methods should have generalizability to build efficient models on unseen data. |
| | Randomness | HPO methods should reduce randomness on the obtained results. |
| | Scalability | HPO methods should be scalable to multiple libraries or platforms (e.g., distributed ML platforms). |
| | Continuous updating capability | HPO methods should consider their capacity to detect and update optimal HP combinations on continuously-updated data. |

To address this problem, BO models reduce the total number of evaluations by spending time choosing the next point to evaluate instead of simply evaluating all possible hyper-parameter configurations; however, they still require much execution time due to their poor capacity for parallelization. On the other hand, although multi-fidelity optimization methods, like Hyperband, have had some success dealing with HPO problems with limited budgets, there are still some problems that cannot be effectively solved by HPO due to the complexity of models or the scale of datasets [6]. For example, the ImageNet [126] challenge is a very popular problem in the image processing domain, but there has not been any research or work on efficiently optimizing hyper-parameters for the ImageNet challenge yet, due to its huge scale and the complexity of CNN models used on ImageNet.

8.1.2. Complex Search Space

In many problems to which ML algorithms are applied, only a few hyper-parameters have significant effects on model performance, and they are the main hyper-parameters that require tuning. However, certain less important hyper-parameters may still affect the performance slightly and may be tuned to optimize the ML model further, which increases the dimensionality of the hyper-parameter search space. As the number of hyper-parameters and of their candidate values increases, the size of the search space and the complexity of the problem grow exponentially, and the total objective function evaluation time also increases exponentially [7]. Therefore, it is necessary to reduce the influence of large search spaces on execution time by improving existing HPO methods.
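As a worked illustration of this growth, assume that each of the k hyper-parameters is discretized into n_i candidate values (the symbols \Lambda, k, n_i, and n are introduced here for illustration only). The number of configurations in the full search space \Lambda is then:

```latex
|\Lambda| = \prod_{i=1}^{k} n_i ,
\qquad\text{and if } n_i = n \text{ for all } i:\quad |\Lambda| = n^{k}
```

For example, with n = 10 candidate values per hyper-parameter, tuning k = 2 hyper-parameters gives 100 configurations, while tuning k = 6 gives 10^6, so every added hyper-parameter multiplies the cost of an exhaustive search by n.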

8.2. Model Performance

8.2.1. Strong Anytime Performance and Final Performance

HPO techniques are often expensive and sometimes require extreme resources, especially for massive datasets or complex ML models. Deep learning models are one example of resource-intensive models, since HPO methods treat their objective function evaluations as black-box functions and do not consider their complexity. However, the overall budget is often very limited in most practical situations, so practical HPO algorithms should be able to prioritize objective function evaluations and have a strong anytime performance, which indicates the capacity to detect optimal or near-optimal configurations even with a very limited budget [97]. For instance, an efficient HPO method should have a high convergence speed, so that there is no large difference between the results before and after model convergence, and should avoid random results even when time and resources are limited, which RS methods cannot do.
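One simple way to inspect anytime performance is to track the best result found after each evaluation. The sketch below is an assumption about how this could be measured, not a procedure from the paper; the function names and the example loss values are hypothetical.

```python
def incumbent_trajectory(losses):
    """Best (lowest) loss seen after each evaluation of an HPO run."""
    best_so_far = []
    for loss in losses:
        best_so_far.append(loss if not best_so_far else min(best_so_far[-1], loss))
    return best_so_far


def evaluations_to_reach(losses, target):
    """Evaluations needed until the incumbent is <= target, or None if never."""
    for n, best in enumerate(incumbent_trajectory(losses), start=1):
        if best <= target:
            return n
    return None


# Hypothetical losses observed in the order they were evaluated.
observed = [0.9, 0.5, 0.7, 0.4]
trajectory = incumbent_trajectory(observed)
```

A method with strong anytime performance reaches a given target after few evaluations, whereas strong final performance refers only to the last value of the trajectory.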

On the other hand, if conditions permit and an adequate budget is given, HPO approaches should be able to identify the globally optimal hyper-parameter configuration, which is referred to as strong final performance [97].

8.2.2. Comparability of HPO Methods

To optimize the hyper-parameters of ML models, different optimization algorithms can be applied to each ML framework. Different optimization techniques have their own strengths and drawbacks in different cases, and currently, there is no single optimization approach that outperforms all other approaches when processing different datasets with various metrics and hyper-parameter types [3]. In this paper, we have analyzed the strengths and weaknesses of common hyper-parameter optimization techniques based on their principles and their performance in practical applications, but this topic could be extended more comprehensively.

To solve this problem, a standard set of benchmarks could be designed and agreed on by the community for a better comparison of different HPO algorithms. For example, there is a platform called COCO (Comparing Continuous Optimizers) [127] that provides benchmarks and analyzes common continuous optimizers. However, there is, to date, no reliable platform that provides benchmarks and analysis of all common hyper-parameter optimization approaches. It would be easier for people to choose HPO algorithms in practical applications if a platform like COCO existed. In addition, a unified metric can also improve the comparability of different HPO algorithms, since different metrics are currently used in different practical problems [6].

On the other hand, based on the comparison of different HPO algorithms, a way to further improve HPO is to combine existing models or propose new models that contain as many benefits as possible and are more suitable for practical problems than existing singular models. For example, the BOHB method [97] has had some success by combining Bayesian optimization and Hyperband. In addition, future research should consider both model performance and time budgets to develop HPO algorithms that suit real-life applications.

8.2.3. Over-fitting and Generalization

Generalization is another issue with HPO models. Since hyper-parameters are evaluated with a finite number of evaluations on the given datasets, the optimal hyper-parameter values detected by HPO approaches might not be the optimums on previously unseen data. This is similar to the over-fitting issue of ML models, which occurs when a model fits a finite number of known data points closely but does not fit unseen data [128]. Generalization is also a common concern for multi-fidelity algorithms, like Hyperband and BOHB, since they need to extract subsets to represent the entire dataset.

One solution to reduce or avoid over-fitting is to use cross-validation to identify a stable optimum that performs best in all or most of the subsets instead of a sharp optimum that only performs well in a single validation set [6]. However, cross-validation increases the execution time several-fold. It would be beneficial if future methods could better deal with over-fitting and improve generalization.
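The difference between a stable and a sharp optimum can be sketched with a small selection rule: choose the configuration that wins the most validation folds, and break ties by the mean score. The rule, the function name, and the fold scores below are illustrative assumptions, not the procedure of [6].

```python
from statistics import mean


def pick_stable_config(fold_scores):
    """Choose the configuration that wins the most folds (ties: higher mean)."""
    n_folds = len(next(iter(fold_scores.values())))
    wins = {name: 0 for name in fold_scores}
    for fold in range(n_folds):
        winner = max(fold_scores, key=lambda name: fold_scores[name][fold])
        wins[winner] += 1
    return max(fold_scores, key=lambda name: (wins[name], mean(fold_scores[name])))


# Hypothetical accuracies of two configurations on three validation folds.
scores = {
    "sharp": [0.99, 0.80, 0.81],   # excellent on one fold only
    "stable": [0.90, 0.91, 0.90],  # consistently good on every fold
}
chosen = pick_stable_config(scores)
```

Selecting on a single validation set (the first fold here) would prefer the sharp configuration, while the fold-wise rule prefers the stable one.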

8.2.4. Randomness

There are stochastic components in the objective function of ML algorithms; thus, in some cases, the optimal hyper-parameter configuration might be different after each run. This randomness could be due to various procedures of certain ML models, like neural network initialization or the different sampled subsets of a bagging model [93], or due to certain procedures of HPO algorithms, like crossover and mutation operations in GA. In addition, it is often difficult for HPO methods to identify the global optimum, because HPO problems are mainly NP-hard problems. Because of this randomness, many existing HPO algorithms can only collect several different near-optimal values. Thus, the existing HPO models can be further improved to reduce the impact of randomness. One possible solution is to run an HPO method multiple times and select the hyper-parameter value that occurs most often as the final optimum.
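The repeated-run strategy mentioned above can be sketched as follows, using a small random search over the KNN range of 1 to 20 as the stochastic HPO method. The noisy toy objective, the function names, and the number of runs are assumptions made for illustration.

```python
import random
from collections import Counter


def random_search_knn(seed, n_iters=10):
    """One RS run over n_neighbors in [1, 20] on a hypothetical noisy objective."""
    rng = random.Random(seed)

    def noisy_error(k):
        # Assumed toy objective: best near k = 5, perturbed by evaluation noise.
        return abs(k - 5) + rng.gauss(0, 0.5)

    candidates = [rng.randint(1, 20) for _ in range(n_iters)]
    return min(candidates, key=noisy_error)


def most_frequent_optimum(run_hpo, seeds):
    """Repeat a stochastic HPO run and return the value found most often."""
    outcomes = [run_hpo(seed) for seed in seeds]
    return Counter(outcomes).most_common(1)[0][0], outcomes


best_k, all_runs = most_frequent_optimum(random_search_knn, range(10))
```

The price of this strategy is that the total computational time grows linearly with the number of repetitions.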

8.2.5. Scalability

In practice, one main limitation of many existing HPO frameworks is that they are tightly integrated with one or a couple of machine learning libraries, like sklearn and Keras, which restricts them to a single node and prevents them from handling large data volumes [3]. To tackle large datasets, some distributed machine learning platforms, like Apache SystemML [129] and Spark MLlib [130], have been developed; however, only very few HPO frameworks support distributed ML. Therefore, more research effort is needed, and scalable HPO frameworks, like ones supporting distributed ML platforms, should be developed to support more libraries.

On the other hand, future practical HPO algorithms should have the scalability to efficiently optimize hyper-parameter configuration spaces from small to large sizes, irrespective of whether they contain continuous, discrete, categorical, or conditional hyper-parameters.

8.2.6. Continuous Updating Capability

In practice, many datasets are not stationary and are constantly updated by adding new data and deleting old data. Correspondingly, the optimal hyper-parameter values or combinations may also change as the data changes. Currently, developing HPO methods with the capacity to continuously tune hyper-parameter values as the data changes has not drawn much attention, since researchers and data analysts often do not alter the ML model after achieving a currently optimal performance [3]. However, since the optimal hyper-parameter values change with the data, proper approaches should be proposed to achieve continuous updating capability.

9. Conclusion

Machine learning has become the primary strategy for tackling data-related problems and has been widely used in various applications. To apply ML models to practical problems, their hyper-parameters need to be tuned to fit specific datasets. However, since the scale of produced data has greatly increased in real life, and manually tuning hyper-parameters is extremely computationally expensive, it has become crucial to optimize hyper-parameters by an automatic process. In this survey paper, we have comprehensively discussed the state-of-the-art research in the domain of hyper-parameter optimization, as well as how to apply HPO techniques to different ML models, through theory and practical experiments. When applying optimization methods to ML models, the hyper-parameter types of an ML model are the main concern for HPO method selection. To summarize, BOHB is the recommended choice for optimizing an ML model if randomly selected subsets are highly representative of the given dataset, since it can efficiently optimize all types of hyper-parameters; otherwise, BO models are recommended for small hyper-parameter configuration spaces, while PSO is usually the best choice for large configuration spaces. Moreover, some existing useful HPO tools and frameworks, open challenges, and potential research directions are also provided and highlighted for practical use and future research purposes. We hope that our survey paper serves as a useful resource for ML users, developers, data analysts, and researchers to use and tune ML models utilizing proper HPO techniques and frameworks. We also hope that it helps to enhance understanding of the challenges that still exist within the HPO domain, and thereby further advance HPO and ML applications in future research.

References

[1] M.I. Jordan, T.M. Mitchell, Machine learning: Trends, perspectives, and prospects, Science 349 (2015) 255-260. https://doi.org/10.1126/science.aaa8415.

[2] M.-A. Zöller and M. F. Huber, Benchmark and Survey of Automated Machine Learning Frameworks, arXiv preprint arXiv:1904.12054, (2019). https://arxiv.org/abs/1904.12054.

[3] R. E. Shawi, M. Maher, S. Sakr, Automated machine learning: State-of-the-art and open challenges, arXiv preprint arXiv:1906.02287, (2019). http://arxiv.org/abs/1906.02287.

[4] M. Kuhn and K. Johnson, Applied Predictive Modeling, Springer (2013) ISBN: 9781461468493.

[5] G.I. Diaz, A. Fokoue-Nkoutche, G. Nannicini, H. Samulowitz, An effective algorithm for hyperparameter optimization of neural networks, IBM J. Res. Dev. 61 (2017) 1-20. https://doi.org/10.1147/JRD.2017.2709578.

[6] F. Hutter, L. Kotthoff, and J. Vanschoren, Eds., Automatic Machine Learning: Methods, Systems, Challenges, Springer (2019) ISBN: 9783030053185.

[7] N. DeCastro-García, Á. L. Muñoz Castañeda, D. Escudero García, and M. V. Carriegos, Effect of the Sampling of a Dataset in the Hyperparameter Optimization Phase over the Efficiency of a Machine Learning Algorithm, Complexity 2019 (2019). https://doi.org/10.1155/2019/6278908.

[8] S. Abreu, Automated Architecture Design for Deep Neural Networks, arXiv preprint arXiv:1908.10714, (2019). http://arxiv.org/abs/1908.10714.

[9] O. S. Steinholtz, A Comparative Study of Black-box Optimization Algorithms for Tuning of Hyper-parameters in Deep Neural Networks, M.S. thesis, Dept. Elect. Eng., Luleå Univ. Technol., (2018).

[10] G. Luo, A review of automatic selection methods for machine learning algorithms and hyper-parameter values, Netw. Model. Anal. Heal. Informatics Bioinforma. 5 (2016) 1-16. https://doi.org/10.1007/s13721-016-0125-6.

[11] D. Maclaurin, D. Duvenaud, R.P. Adams, Gradient-based Hyperparameter Optimization through Reversible Learning, arXiv preprint arXiv:1502.03492, (2015). http://arxiv.org/abs/1502.03492.

[12] J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, Algorithms for hyper-parameter optimization, Proc. Adv. Neural Inf. Process. Syst., (2011) 2546-2554.

[13] J. Bergstra and Y. Bengio, Random Search for Hyper-Parameter Optimization, J. Mach. Learn. Res. 13 (1) (2012) 281-305.

[14] K. Eggensperger, M. Feurer, F. Hutter, J. Bergstra, J. Snoek, H. Hoos, K. Leyton-Brown, Towards an Empirical Foundation for Assessing Bayesian Optimization of Hyperparameters, BayesOpt Work. (2013) 1-5.

[15] K. Eggensperger, F. Hutter, H.H. Hoos, K. Leyton-Brown, Efficient benchmarking of hyperparameter optimizers via surrogates, Proc. Natl. Conf. Artif. Intell. 2 (2015) 1114-1120.

[16] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar, Hyperband: A novel bandit-based approach to hyperparameter optimization, J. Mach. Learn. Res. 18 (2012) 1-52.

[17] Q. Yao et al., Taking Human out of Learning Applications: A Survey on Automated Machine Learning, arXiv preprint arXiv:1810.13306, (2018). http://arxiv.org/abs/1810.13306.

[18] S. Lessmann, R. Stahlbock, S.F. Crone, Optimizing hyperparameters of support vector machines by genetic algorithms, Proc. 2005 Int. Conf. Artif. Intell. ICAI05. 1 (2005) 74-80.

[19] P. R. Lorenzo, J. Nalepa, M. Kawulok, L. S. Ramos, and J. R. Paster, Particle swarm optimization for hyper-parameter selection in deep neural networks, Proc. ACM Int. Conf. Genet. Evol. Comput., (2017) 481-488.

[20] S. Sun, Z. Cao, H. Zhu, J. Zhao, A Survey of Optimization Methods from a Machine Learning Perspective, arXiv preprint arXiv:1906.06821, (2019). https://arxiv.org/abs/1906.06821.

[21] T.M. S. Bradley, A. Hax, Applied Mathematical Programming, Addison-Wesley, Reading, Massachusetts. (1977).

[22] S. Bubeck, Convex optimization: Algorithms and complexity, Found. Trends Mach. Learn. 8 (2015) 231-357. https://doi.org/10.1561/2200000050.

[23] B. Shahriari, A. Bouchard-Côté, and N. de Freitas, Unbounded Bayesian optimization via regularization, Proc. Artif. Intell. Statist., (2016) 1168-1176.

[24] G.I. Diaz, A. Fokoue-Nkoutche, G. Nannicini, H. Samulowitz, An effective algorithm for hyperparameter optimization of neural networks, IBM J. Res. Dev. 61 (2017) 1-20. https://doi.org/10.1147/JRD.2017.2709578.

[25] C. Gambella, B. Ghaddar, and J. Naoum-Sawaya, Optimization Models for Machine Learning: A Survey, arXiv preprint arXiv:1901.05331, (2019). http://arxiv.org/abs/1901.05331.

[26] E. R. Sparks, A. Talwalkar, D. Haas, M. J. Franklin, M. I. Jordan, and T. Kraska, Automating model search for large scale machine learning, Proc. 6th ACM Symp. Cloud Comput., (2015) 368-380.

[27] J. Nocedal and S. Wright, Numerical Optimization, (2006) Springer-Verlag, ISBN: 978-0-387-40065-5.

[28] A. Moubayed, M. Injadat, A. Shami, H. Lutfiyya, DNS Typo-Squatting Domain Detection: A Data Analytics & Machine Learning Based Approach, 2018 IEEE Glob. Commun. Conf. GLOBECOM 2018 - Proc. (2018). https://doi.org/10.1109/GLOCOM.2018.8647679.

[29] R. Caruana, A. Niculescu-Mizil, An empirical comparison of supervised learning algorithms, ACM Int. Conf. Proceeding Ser. 148 (2006) 161-168. https://doi.org/10.1145/1143844.1143865.

[30] O. Kramer, Scikit-Learn, in Machine Learning for Evolution Strategies. Cham, Switzerland: Springer International Publishing, (2016) 45-53.

[31] F. Pedregosa et al., Scikit-learn: Machine learning in Python, J. Mach. Learn. Res., 12 (2011) 2825-2830.

[32] T. Chen, C. Guestrin, XGBoost: a scalable tree boosting system, arXiv preprint arXiv:1603.02754, (2016). http://arxiv.org/abs/1603.02754.

[33] F. Chollet, Keras, 2015. https://github.com/fchollet/keras.

[34] C. Gambella, B. Ghaddar, J. Naoum-Sawaya, Optimization Models for Machine Learning: A Survey, (2019) 1-40. http://arxiv.org/abs/1901.05331.

[35] C.M. Bishop, Pattern Recognition and Machine Learning. (2006) Springer, ISBN: 978-0-387-31073-2.

[36] A.E. Hoerl, R.W. Kennard, Ridge Regression: Applications to Nonorthogonal Problems, Technometrics. 12 (1970) 69-82. https://doi.org/10.1080/00401706.1970.10488635.

[37] L.E. Melkumova, S.Y. Shatskikh, Comparing Ridge and LASSO estimators for data analysis, Procedia Eng. 201 (2017) 746-755. https://doi.org/10.1016/j.proeng.2017.09.615.

[38] R. Tibshirani, Regression Shrinkage and Selection Via the Lasso, J. R. Stat. Soc. Ser. B. 58 (1996) 267-288. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x.

[39] D.W. Hosmer Jr, S. Lemeshow, Applied logistic regression, Technometrics, 34 (1) (2013), 358-359.

[40] J.O. Ogutu, T. Schulz-Streeck, H.P. Piepho, Genomic selection using regularized linear regression models: ridge regression, lasso, elastic net and their extensions, BMC Proceedings. BioMed Cent. 6 (2012).

[41] J.M. Keller, M.R. Gray, A Fuzzy K-Nearest Neighbor Algorithm, IEEE Trans. Syst. Man Cybern. SMC-15 (1985) 580-585. https://doi.org/10.1109/TSMC.1985.6313426.

[42] W. Zuo, D. Zhang, K. Wang, On kernel difference-weighted k-nearest neighbor classification, Pattern Anal. Appl. 11 (2008) 247-257. https://doi.org/10.1007/s10044-007-0100-z.

[43] A. Smola, V. Vapnik, Support vector regression machines, Adv. Neural Inf. Process. Syst. 9 (1997) 155-161.

[44] L. Yang, R. Muresan, A. Al-Dweik, L.J. Hadjileontiadis,Image-Based Visibility Estimation Algorithm for IntelligentTransportation Systems, IEEE Access. 6 (2018) 7672876740.https://doi.org/10.1109/ACCESS.2018.2884225.

[45] L. Yang, Comprehensive Visibility Indicator Algorithm for AdaptableSpeed Limit Control in Intelligent Transportation Systems, M.A.Sc. the-sis, University of Guelph, 2018.

[46] O.S. Soliman, A.S. Mahmoud, A classification system for remote sensing satellite images using support vector machine with non-linear kernel functions, 2012 8th Int. Conf. Informatics Syst. INFOS 2012. (2012) BIO-181-BIO-187.

[47] I. Rish, An empirical study of the naive Bayes classifier, IJCAI 2001 Work. Empir. Methods Artif. Intell., (2001), 41-46.

[48] J.N. Sulzmann, J. Fürnkranz, E. Hüllermeier, On pairwise naive Bayes classifiers, Lect. Notes Comput. Sci. (Including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics). 4701 LNAI (2007) 371–381. https://doi.org/10.1007/978-3-540-74958-5_35.

[49] C. Bustamante, L. Garrido, R. Soto, Comparing fuzzy Naive Bayes and Gaussian Naive Bayes for decision making in RoboCup 3D, Lect. Notes Comput. Sci. (Including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics). 4293 LNAI (2006) 237–247. https://doi.org/10.1007/11925231_23.

[50] A.M. Kibriya, E. Frank, B. Pfahringer, G. Holmes, Multinomial naive Bayes for text categorization revisited, Lect. Notes Artif. Intell. (Subseries Lect. Notes Comput. Sci.) 3339 (2004) 488–499.

[51] J.D.M. Rennie, L. Shih, J. Teevan, D.R. Karger, Tackling the poor assumptions of Naive Bayes text classifiers, Proc. Twent. Int. Conf. Mach. Learn. ICML (2003), 616-623.

[52] V. Narayanan, I. Arora, and A. Bhatia, Fast and accurate sentiment classification using an enhanced naive Bayes model, arXiv preprint arXiv:1305.6143, (2013). https://arxiv.org/abs/1305.6143.



[53] S.R. Safavian, D. Landgrebe, A Survey of Decision Tree Classifier Methodology, IEEE Trans. Syst. Man. Cybern. 21 (1991) 660–674.

[54] D.M. Manias, M. Jammal, H. Hawilo, A. Shami, P. Heidari, A. Larabi, R. Brunner, Machine Learning for Performance-aware Virtual Network Function Placement, 2019 IEEE Glob. Commun. Conf. GLOBECOM 2019 - Proc. (2019) 12–17. https://doi.org/10.1109/GLOBECOM38437.2019.9013246.

[55] L. Yang, A. Moubayed, I. Hamieh, A. Shami, Tree-based intelligent intrusion detection system in internet of vehicles, 2019 IEEE Glob. Commun. Conf. GLOBECOM 2019 - Proc. (2019). https://doi.org/10.1109/GLOBECOM38437.2019.9013892.

[56] S. Sanders, C. Giraud-Carrier, Informing the use of hyper-parameter optimization through metalearning, Proc. - IEEE Int. Conf. Data Mining, ICDM. 2017-Novem (2017) 1051–1056. https://doi.org/10.1109/ICDM.2017.137.

[57] M. Injadat, F. Salo, A.B. Nassif, A. Essex, A. Shami, Bayesian Optimization with Machine Learning Algorithms Towards Anomaly Detection, 2018 IEEE Glob. Commun. Conf. (2018) 1–6. https://doi.org/10.1109/glocom.2018.8647714.

[58] F. Salo, M.N. Injadat, A. Moubayed, A.B. Nassif, A. Essex, Clustering Enabled Classification using Ensemble Feature Selection for Intrusion Detection, 2019 Int. Conf. Comput. Netw. Commun. ICNC 2019. (2019) 276–281. https://doi.org/10.1109/ICCNC.2019.8685636.

[59] K. Arjunan, C.N. Modi, An enhanced intrusion detection framework for securing network layer of cloud computing, ISEA Asia Secur. Priv. Conf. 2017, ISEASP 2017. (2017) 1–10. https://doi.org/10.1109/ISEASP.2017.7976988.

[60] Y. Xia, C. Liu, Y.Y. Li, N. Liu, A boosted decision tree approach using Bayesian hyper-parameter optimization for credit scoring, Expert Syst. Appl. 78 (2017) 225–241. https://doi.org/10.1016/j.eswa.2017.02.017.

[61] T. G. Dietterich, Ensemble methods in machine learning, Mult. Classif. Syst., 1857 (2000), 1-15.


[62] A. Moubayed, E. Aqeeli, A. Shami, Ensemble-based Feature Selection and Classification Model for DNS Typo-squatting Detection, in: 2020 IEEE Can. Conf. Electr. Comput. Eng., 2020.

[63] W. Yin, K. Kann, M. Yu, and H. Schütze, Comparative Study of CNN and RNN for Natural Language Processing, arXiv preprint arXiv:1702.01923, (2017). https://arxiv.org/abs/1702.01923.

[64] A. Koutsoukas, K.J. Monaghan, X. Li, J. Huan, Deep-learning: Investigating deep neural networks hyper-parameters and comparison of performance to shallow methods for modeling bioactivity data, J. Cheminform. 9 (2017) 1–13. https://doi.org/10.1186/s13321-017-0226-y.

[65] T. Domhan, J.T. Springenberg, F. Hutter, Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves, IJCAI Int. Jt. Conf. Artif. Intell. 2015-January (2015) 3460–3468.

[66] Y. Ozaki, M. Yano, M. Onishi, Effective hyperparameter optimization using Nelder-Mead method in deep learning, IPSJ Trans. Comput. Vis. Appl. 9 (2017). https://doi.org/10.1186/s41074-017-0030-7.

[67] F.C. Soon, H.Y. Khaw, J.H. Chuah, J. Kanesan, Hyper-parameters optimisation of deep CNN architecture for vehicle logo recognition, IET Intell. Transp. Syst. 12 (2018) 939–946. https://doi.org/10.1049/iet-its.2018.5127.

[68] D. Han, Q. Liu, W. Fan, A new image classification method using CNN transfer learning and web data augmentation, Expert Syst. Appl. 95 (2018) 43–56. https://doi.org/10.1016/j.eswa.2017.11.028.

[69] C. Di Francescomarino, M. Dumas, M. Federici, C. Ghidini, F.M. Maggi, W. Rizzi, L. Simonetto, Genetic algorithms for hyperparameter optimization in predictive business process monitoring, Inf. Syst. 74 (2018) 67–83. https://doi.org/10.1016/j.is.2018.01.003.

[70] A. Moubayed, M. Injadat, A. Shami, H. Lutfiyya, Student Engagement Level in e-Learning Environment: Clustering Using K-means, Am. J. Distance Educ. 34 (2020) 1–20. https://doi.org/10.1080/08923647.2020.1696140.



[71] T. K. Moon, The expectation-maximization algorithm, IEEE Signal Process. Mag. 13 (6) (1996) 47–60.

[72] S. Brahim-Belhouari, A. Bermak, M. Shi, P.C.H. Chan, Fast and Robust gas identification system using an integrated gas sensor technology and Gaussian mixture models, IEEE Sens. J. 5 (2005) 1433–1444. https://doi.org/10.1109/JSEN.2005.858926.

[73] Y. Zhao, G. Karypis, Hierarchical Clustering Algorithms for Document Dataset, Data Min. Knowl. Discov. 10 (2005) 141–168.

[74] K. Khan, S.U. Rehman, K. Aziz, S. Fong, S. Sarasvady, A. Vishwa, DBSCAN: Past, present and future, 5th Int. Conf. Appl. Digit. Inf. Web Technol. ICADIWT 2014. (2014) 232–238. https://doi.org/10.1109/ICADIWT.2014.6814687.

[75] H. Zhou, P. Wang, H. Li, Research on adaptive parameters determination in DBSCAN algorithm, J. Inf. Comput. Sci. 9 (2012) 1967–1973.

[76] J. Shlens, A Tutorial on Principal Component Analysis, arXiv preprint arXiv:1404.1100, (2014). https://arxiv.org/abs/1404.1100.

[77] N. Halko, P. Martinsson, J. Tropp, Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions, SIAM Rev. 53 (2) (2011), pp. 217-288.

[78] M. Loog, Conditional linear discriminant analysis, Proc. - Int. Conf. Pattern Recognit. 2 (2006) 387–390. https://doi.org/10.1109/ICPR.2006.402.

[79] P. Howland, J. Wang, H. Park, Solving the small sample size problem in face recognition using generalized discriminant analysis, Pattern Recognit. 39 (2006) 277–287. https://doi.org/10.1016/j.patcog.2005.06.013.

[80] I. Ilievski, T. Akhtar, J. Feng, C.A. Shoemaker, Efficient hyperparameter optimization of deep learning algorithms using deterministic RBF surrogates, 31st AAAI Conf. Artif. Intell. AAAI 2017. (2017) 822–829.

[81] M.N. Injadat, A. Moubayed, A.B. Nassif, A. Shami, Systematic Ensemble Model Selection Approach for Educational Data Mining, Knowledge-Based Syst. 200 (2020) 105992. https://doi.org/10.1016/j.knosys.2020.105992.



[82] M. Injadat, A. Moubayed, A.B. Nassif, A. Shami, Multi-split Optimized Bagging Ensemble Model Selection for Multi-class Educational Data Mining, Springer's Appl. Intell. (2020).

[83] M. Claesen, J. Simm, D. Popovic, Y. Moreau, and B. De Moor, Easy Hyperparameter Search Using Optunity, arXiv preprint arXiv:1412.1114, (2014). https://arxiv.org/abs/1412.1114.

[84] C. Witt, Worst-case and average-case approximations by simple randomized search heuristics, in: Proceedings of the 22nd Annual Symposium on Theoretical Aspects of Computer Science, STACS05, Stuttgart, Germany, 2005, pp. 44–56.

[85] Y. Bengio, Gradient-based optimization of hyperparameters, Neural Comput. 12 (8) (2000) 1889-1900.

[86] H. H. Yang and S. I. Amari, Complexity Issues in Natural Gradient Descent Method for Training Multilayer Perceptrons, Neural Comput. 10 (8) (1998) 2137–2157.

[87] J. Snoek, H. Larochelle, R. Adams, Practical Bayesian optimization of machine learning algorithms, Adv. Neural Inf. Process. Syst. 4 (2012), 2951-2959.

[88] E. Hazan, A. Klivans, and Y. Yuan, Hyperparameter optimization: a spectral approach, arXiv preprint arXiv:1706.00764, (2017). https://arxiv.org/abs/1706.00764.

[89] M. Seeger, Gaussian processes for machine learning, Int. J. Neural Syst.,14 (2004), 69-106.

[90] F. Hutter, H. H. Hoos, and K. Leyton-Brown, Sequential model-based optimization for general algorithm configuration, Proc. LION 5, (2011) 507-523.

[91] I. Dewancker, M. McCourt, S. Clark, Bayesian Optimization Primer, (2015). URL: https://sigopt.com/static/pdf/SigOpt_Bayesian_Optimization_Primer.pdf

[92] J. Hensman, N. Fusi, and N. D. Lawrence, Gaussian processes for big data, arXiv preprint arXiv:1309.6835, (2013). https://arxiv.org/abs/1309.6835.





[93] M. Claesen and B. De Moor, Hyperparameter Search in Machine Learning, arXiv preprint arXiv:1502.02127, (2015). https://arxiv.org/abs/1502.02127.

[94] L. Bottou, Large-scale machine learning with stochastic gradient descent, Proceedings of the COMPSTAT, Springer (2010) 177-186.

[95] S. Zhang, J. Xu, E. Huang, C.H. Chen, A new optimal sampling rule for multi-fidelity optimization via ordinal transformation, IEEE Int. Conf. Autom. Sci. Eng. 2016-Novem (2016) 670–674. https://doi.org/10.1109/COASE.2016.7743467.

[96] Z. Karnin, T. Koren, O. Somekh, Almost optimal exploration in multi-armed bandits, 30th Int. Conf. Mach. Learn. ICML 2013. 28 (2013) 2275–2283.

[97] S. Falkner, A. Klein, F. Hutter, BOHB: Robust and Efficient Hyperparameter Optimization at Scale, 35th Int. Conf. Mach. Learn. ICML 2018. 4 (2018) 2323–2341.

[98] A. Gogna, A. Tayal, Metaheuristics: Review and application, J. Exp. Theor. Artif. Intell. 25 (2013) 503–526. https://doi.org/10.1080/0952813X.2013.782347.

[99] F. Itano, M.A. De Abreu De Sousa, E. Del-Moral-Hernandez, Extending MLP ANN hyper-parameters Optimization by using Genetic Algorithm, Proc. Int. Jt. Conf. Neural Networks. 2018-July (2018) 1–8. https://doi.org/10.1109/IJCNN.2018.8489520.

[100] B. Kazimipour, X. Li, A.K. Qin, A Review of Population Initialization Techniques for Evolutionary Algorithms, 2014 IEEE Congr. Evol. Comput. (2014) 2585–2592. https://doi.org/10.1109/CEC.2014.6900618.

[101] S. Rahnamayan, H.R. Tizhoosh, M.M.A. Salama, A novel population initialization method for accelerating evolutionary algorithms, Comput. Math. with Appl. 53 (2007) 1605–1614. https://doi.org/10.1016/j.camwa.2006.07.013.

[102] F. G. Lobo, D. E. Goldberg, and M. Pelikan, Time complexity of genetic algorithms on exponentially scaled problems, Proc. Genet. Evol. Comput. Conf., (2000) 151-158.



[103] Y. Shi, R.C. Eberhart, Parameter Selection in Particle Swarm Optimization, Evolutionary Programming VII, Springer (1998) 591-600.

[104] X. Yan, F. He, Y. Chen, A Novel Hardware/Software Partitioning Method Based on Position Disturbed Particle Swarm Optimization with Invasive Weed Optimization, J. Comput. Sci. Technol. 32 (2017) 340–355. https://doi.org/10.1007/s11390-017-1714-2.

[105] M.Y. Cheng, K.Y. Huang, M. Hutomo, Multiobjective Dynamic-Guiding PSO for Optimizing Work Shift Schedules, J. Constr. Eng. Manag. 144 (2018) 1–7. https://doi.org/10.1061/(ASCE)CO.1943-7862.0001548.

[106] H. Wang, Z. Wu, J. Wang, X. Dong, S. Yu, G. Chen, A new population initialization method based on space transformation search, 5th Int. Conf. Nat. Comput. ICNC 2009. 5 (2009) 332–336. https://doi.org/10.1109/ICNC.2009.371.

[107] J. Wang, J. Xu, and X. Wang, Combination of Hyperband and Bayesian Optimization for Hyperparameter Optimization in Deep Learning, arXiv preprint arXiv:1801.01596, (2018). https://arxiv.org/abs/1801.01596.

[108] P. Cazzaniga, M.S. Nobile, D. Besozzi, The impact of particles initialization in PSO: Parameter estimation as a case in point, 2015 IEEE Conf. Comput. Intell. Bioinforma. Comput. Biol. CIBCB 2015. (2015) 1–8. https://doi.org/10.1109/CIBCB.2015.7300288.

[109] R. Martinez-Cantin, BayesOpt: A Bayesian optimization library for nonlinear optimization, experimental design and bandits, J. Mach. Learn. Res. 15 (2015) 3735–3739.

[110] J. Bergstra, B. Komer, C. Eliasmith, D. Yamins, D.D. Cox, Hyperopt: A Python library for model selection and hyperparameter optimization, Comput. Sci. Discov. 8 (2015). https://doi.org/10.1088/1749-4699/8/1/014008.

[111] B. Komer, J. Bergstra, and C. Eliasmith, Hyperopt-sklearn: Automatic hyperparameter configuration for scikit-learn, Proc. ICML Workshop AutoML, (2014) 34–40.



[112] M. Pumperla, Hyperas, 2019. http://maxpumperla.com/hyperas/.

[113] M. Lindauer, K. Eggensperger, M. Feurer, S. Falkner, A. Biedenkapp, and F. Hutter, SMAC v3: Algorithm configuration in Python, 2017. https://github.com/automl/SMAC3.

[114] T. Head, MechCoder, G. Louppe, et al., scikit-optimize/scikit-optimize: v0.5.2, 2018. https://doi.org/10.5281/zenodo.1207017.

[115] N. Knudde, J. van der Herten, T. Dhaene, and I. Couckuyt, GPflowOpt: A Bayesian Optimization Library using TensorFlow, arXiv preprint arXiv:1711.03845, (2017). https://arxiv.org/abs/1711.03845.

[116] Autonomio Talos [Computer software], 2019. http://github.com/autonomio/talos.

[117] L. Hertel, P. Sadowski, J. Collado, P. Baldi, Sherpa: Hyperparameter Optimization for Machine Learning Models, Conf. Neural Inf. Process. Syst. (2018).

[118] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, et al., TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems, arXiv preprint arXiv:1603.04467, (2016). https://arxiv.org/abs/1603.04467.

[119] J. Grandgirard, D. Poinsot, L. Krespi, J.P. Nénon, A.M. Cortesero, Osprey: Hyperparameter Optimization for Machine Learning, 103 (2002) 239–248. https://doi.org/10.21105/joss.00034.

[120] L. Franceschi, M. Donini, P. Frasconi, and M. Pontil, Forward and reverse gradient-based hyperparameter optimization, 34th Int. Conf. Mach. Learn. ICML 2017, 70 (2017) 1165-1173.

[121] F.A. Fortin, F.M. De Rainville, M.A. Gardner, M. Parizeau, C. Gagné, DEAP: Evolutionary algorithms made easy, J. Mach. Learn. Res. 13 (2012) 2171–2175.

[122] R. S. Olson and J. H. Moore, TPOT: A tree-based pipeline optimization tool for automating machine learning, Auto Mach. Learn. (2019) 151-160. https://doi.org/10.1007/978-3-030-05318-5_8.



[123] J. Rapin and O. Teytaud, Nevergrad - A gradient-free optimization platform, 2018. https://GitHub.com/FacebookResearch/Nevergrad.

[124] L. Yang and A. Shami, Hyperparameter Optimization of Machine Learning Algorithms, 2020. https://github.com/LiYangHart/Hyperparameter-Optimization-of-Machine-Learning-Algorithms.

[125] C.M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press (1995).

[126] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, Adv. Neural Inf. Process. Syst. 25 (2012) 1097-1105.

[127] N. Hansen, A. Auger, O. Mersmann, T. Tusar, and D. Brockhoff, COCO: A Platform for Comparing Continuous Optimizers in a Black-Box Setting, arXiv preprint arXiv:1603.08785, (2016). https://arxiv.org/abs/1603.08785.

[128] G.C. Cawley, N.L.C. Talbot, On over-fitting in model selection and subsequent selection bias in performance evaluation, J. Mach. Learn. Res. 11 (2010) 2079–2107.

[129] M. Boehm, A. Surve, S. Tatikonda, et al., SystemML: declarative machine learning on Spark, Proc. VLDB Endow. 9 (2016) 1425–1436. https://doi.org/10.14778/3007263.3007279.

[130] X. Meng, J. Bradley, B. Yavuz, et al., MLlib: machine learning in Apache Spark, J. Mach. Learn. Res. 17 (1) (2016) 1235-1241.



Li Yang received the B.E. degree in computer science from Wuhan University of Science and Technology, Wuhan, China, in 2016 and the MASc degree in Engineering from the University of Guelph, Guelph, Canada, in 2018. Since 2018 he has been working toward the Ph.D. degree in the Department of Electrical and Computer Engineering, Western University, London, Canada. His research interests include cybersecurity, machine learning, network data analytics, and time series data analytics.

Abdallah Shami is a professor with the ECE Department at Western University, Ontario, Canada. He is the Director of the Optimized Computing and Communications Laboratory at Western University (https://www.eng.uwo.ca/oc2/). He is currently an associate editor for IEEE Transactions on Mobile Computing, IEEE Network, and IEEE Communications Surveys and Tutorials. He has chaired key symposia for IEEE GLOBECOM, IEEE ICC, IEEE ICNC, and ICCIT. He was the elected Chair of the IEEE Communications Society Technical Committee on Communications Software (2016-2017) and the IEEE London Ontario Section Chair (2016-2018).
