Application of
machine learning
in manufacturing industry
MSC Degree Thesis
Written by:
Hsinyi Lin
Master of Science in Mathematics
Supervisor:
Lukacs Andras
Institute of Mathematics
Eotvos Lorand University
Faculty of Science
Budapest
2017
ABSTRACT
In the high-tech manufacturing industry, a large amount of data is collected from the production line every day. These data can be measurement data, categorical data, or time-related data. Before a product is delivered to the customer, it has to pass a quality examination. The task in this thesis is to use a given training set, combined with different boosting algorithms that already exist in Python packages such as scikit-learn, to build a model which can predict the quality test result. Building such a model can be regarded as solving a highly imbalanced binary classification problem, since for a mature product the number of "Fail" cases is very scarce compared to the number of "Pass" cases.
Boosting is very commonly used for solving this type of problem in the Kaggle community. Its general idea is to combine many weak learners (base classifiers) sequentially into a strong learner. In programming, the model is trained under certain conditions, which means that some parameters are fixed; therefore, an understanding of each parameter is needed to find the optimal parameters. To optimize the model, we minimize an objective function that depends on the corresponding algorithm. Adaboost minimizes an upper bound on the empirical error. Logitboost minimizes the least-squares regression of a residual which is updated by Newton's method. Gradient boosting minimizes a user-defined differentiable convex loss function which represents the dissimilarity between the real class and the prediction result. Besides the loss function, XGB takes a regularization term into account to prevent over-fitting, and uses a second-order Taylor expansion to approximate the loss function. Solving the problem with the given training set includes two steps: (1) use line search to find the optimal parameters; (2) find the optimal threshold to decide the result of classification. Since the purpose of the thesis is to demonstrate the procedure, and the computational ability of a laptop is limited, only a very small part of the original data is taken as the training set. Thus, the model trained on this smaller, less informative training set is not as competitive as the higher-ranked models in the competition. While doing a line search to optimize the value of a specific parameter, it may not be easy to find a proper range of values; thus, it is quite important to draw on past empirical experience. In this thesis, several models with different objective functions and different training sets are built, but no specific method is always the best. In Kaggle competitions, most participants won by combining many methods rather than using a single one.
Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I
Table of contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . II
List of figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . IV
List of tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . V
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Boosting Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1 Binary classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 General idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.3 Adaptive boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.3.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3.2 Derivation of each step . . . . . . . . . . . . . . . . . . . . . . 5
2.4 Logit boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.4.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4.2 Derivation of each step . . . . . . . . . . . . . . . . . . . . . . 9
2.5 Gradient boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.5.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5.2 Derivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.6 XGB-Extreme Gradient Boosting . . . . . . . . . . . . . . . . . . . . 16
2.6.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.6.2 Derivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3 Measurement and Experiment . . . . . . . . . . . . . . . . . . . . . . 23
3.1 Measurement of goodness of the classifier . . . . . . . . . . . . . . . . 23
3.1.1 ROC curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1.2 sigmoid function . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2 Data preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3.1 Optimization of parameters . . . . . . . . . . . . . . . . . . . 27
3.3.2 Optimal threshold . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3.3 Regularization and over-fitting . . . . . . . . . . . . . . . . . . 33
3.3.4 Model with different training set and algorithm . . . . . . . . 35
3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . VI
List of Figures
1 Receiver operating characteristics curve . . . . . . . . . . . . . . . . . 24
2 Sigmoid function maps the score to [0,1] . . . . . . . . . . . . . . . . 25
3 Partial process flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4 Default value: M = 100, λ = 1, γ = 0, objective=logistic, ν = 0.1 . . 28
5 max depth = 2, and default value λ = 1, γ = 0, objective = logistic,
ν = 0.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
6 M = 103, max depth = 2, λ = 1, γ = 0, objective = logistic . . . . . 29
7 M = 103, max depth = 2, λ = 1 , objective = logistic . . . . . . . . . 30
8 M = 103, max depth = 2, λ = 1 , objective = logistic . . . . . . . . . 30
9 M = 103, max depth = 2, λ = 1 , objective = logistic . . . . . . . . . 31
10 M = 103, max depth = 2, γ = 0 , objective = logistic . . . . . . . . . 31
11 ROC curve under optimal condition . . . . . . . . . . . . . . . . . . . 32
12 Confusion matrix while threshold is equal to 0.0644 . . . . . . . . . . 33
13 XGB with regularization λ = 1, γ = 0.1 . . . . . . . . . . . . . . . . . 34
14 XGB without regularization λ = 0, γ = 0 . . . . . . . . . . . . . . . . 34
15 Overfitting: XGB with regularization improved overfitting problem. . 35
List of Tables
1 Summary of different loss functions for gradient boosting . . . . . . 16
2 Summary of different loss functions for XGB . . . . . . . . . . . . . 22
3 Confusion matrix obtained under a certain threshold . . . . . . . . . 24
4 Size of training set . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5 Optimal condition for different given training sets using the adaboost
algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
6 Optimal condition for different given training sets using gradient
boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
7 Optimal condition for different given training sets using XGB . . . . 36
1 Introduction
1.1 Motivation
A few years of working experience in the high-tech manufacturing industry made me realize how fast data are generated on a production line. The limited capacity of the repository forces old data to be purged while new data are being generated. It is a tough task to analyse all the collected data and extract useful information from such a big data set with traditional statistical methods before the data are purged.
In order to use all the collected data effectively, more and more companies have started to introduce machine learning into their analytical tools. As we can see, many companies have recently provided their data and held competitions on Kaggle, the largest community platform for people working in fields related to machine learning [1]. By holding these competitions, they can draw on the innovative and efficient methods proposed by participants from all over the world.
In the manufacturing industry, one of the typical questions is how to predict quality test results based on past data. The quality test result decides whether the company can gain a benefit from the product, because every product needs to pass all the quality tests before being delivered to customers. If the result can be predicted precisely from a built model and certain given features, it helps the company save cost: if the predicted result is "Fail", we can simply scrap the product and stop the remaining process steps, because of the high risk of the product being scrapped anyway after finishing the whole procedure. Such a model also tells us which features are important for the prediction, so we can focus on the yield improvement of certain process steps.
To build a model for this binary classification problem, we introduce boosting methods, which are the ones most commonly used by the Kaggle community; most participants won their competitions using models built on the idea of boosting. Many kinds of boosting methods have been proposed, such as adaboost, logitboost, gradient boosting and extreme gradient boosting, and several of them are implemented in Python packages.
1.2 Outline
This thesis is organized in the following way. In Chapter 2, the general idea of boosting and four particular boosting methods, adaboost, logitboost, gradient boosting and extreme gradient boosting, are introduced. For each method, the algorithm, the underlying concept and the derivation are given in detail. The idea of the loss function, and the regularization term which appears in extreme gradient boosting, are also shown. In Chapter 3, first the ROC curve and the sigmoid function are introduced for evaluating the goodness of the model. Data preparation and the implementation environment are also presented before the programming part. The demonstrated procedure for solving the highly imbalanced classification problem with the extreme gradient boosting implementation available as a Python package consists of two steps: (1) use a line search strategy to find the optimal condition; (2) find the threshold which maximizes the Matthews correlation coefficient. After demonstrating the procedure, I show how taking the regularization into account improves the overfitting problem. At the end of the chapter, I present the results of models trained by different boosting algorithms on different training sets and give a conclusion for Chapter 3.
2 Boosting Methods
2.1 Binary classification
Classification is one of the supervised learning problem in machine learning. The
task is to find a model F to fit given labeled training set xi, yi ∀i = 1, 2, · · · , N , and
use it to predict a set with unknown class label. The labeled training set include
features xi and class label yi. The problem of predicting quality test results can be
regarded as a highly imbalanced binary classification problem because the results
of prediction have only two possibilities, – ”Pass” and ”Fail”, and non-uniform
distribution for the big disparity between two classes. In practical application, the
labels of two classes are usually encoded to -1/1 or 0/1 depend on the method being
used. In manufacturing industry, features can be any measured data during process
or time values. To formulize the problem, we denote the model F : x → F (x)
where F (x) is the prediction with given features x. The goal of finding optimal F
is equivalent to find the model which minimizes the dissimilarity between yi and
F (xi).
2.2 General idea
The general idea of boosting methods [2] [3] [4] is to sequentially ensemble many
weak learners(base classifiers) fm which can predict slightly better than flipping
uniform coin(random guessing) to a strong classifier F . In most boosting methods
use CART(classification and regression trees) as based classifiers. In general, to
find a optimal classifier in mth iteration is equivalent to find a fm which minimize
certain given objective function. In adaboost, the goal is to minimize the upper
bound of the empirical error of the ensemble classifier [5]. Since its empirical error
is bounded by exponential loss, it’s same as taking exponential loss as objective
function. In logitboost, it minimizes the least square of residual which estimated by
Newton’s update. The goal of gradient boosting is to minimize loss function which
can be interpreted as the dissimilarity between real class label y and predicted value
F (x). The loss function can be any differentiable convex function. Unlike gradient
boosting, the objective function of extreme gradient boosting has one more term,
that is regularization, which can prevent overfitting effectively.
2.3 Adaptive boosting
Adaboost was once the most popular boosting method until extreme gradient
boosting appeared. It combines many weak classifiers which are trained by weighted
3
training set sequentially to get a strong classifier. Initially, every sample in training
set is given an equal weight. In each stage, we increase the weight to misclassified
samples of predicting by ensemble classifiers and get a new weighted training set to
train next weak learner. It emphasizes the importance of those misclassified instances
so that next weak classifier is able to focus on fitting those misclassified samples.
Because the summation of the weight of each sample is equal to one, the weight of
those samples which are classified correctly will also be decreased. In general, using
stumps which are the simplest decision tree only with one node and two leaves as a
weak classifier already get quite good accuracy and efficiency. The commonly used
version of adaboost is discrete adaboost which was first time introduced in [Freund
and Schapire (1996b)]. It was developed to solve binary classification problem with
class label yi ∈ −1, 1, ∀i. The real adaboost came up two years later.
2.3.1 Algorithm
The following two algorithms are variants of adaboost. Algorithm 1 is the discrete version, presented in 1996 by Freund and Schapire, and Algorithm 2 is real adaboost. They differ slightly in how the value added at the mth iteration is calculated. Discrete adaboost uses a weak classifier f_m \in \{-1, 1\} and estimates the value from the weighted training error, while real adaboost uses a weak classifier f_m which returns a real number and estimates the value from the class probability. Both algorithms increase the weights of the misclassified samples by multiplying with an exponential factor. A detailed explanation is given in the derivation part.
Algorithm 1 Discrete adaboost algorithm
1: Initialize w_i \leftarrow \frac{1}{N}, i = 1, 2, \ldots, N   ▷ equal initial weight for each sample
2: while m \le M do
3:    Fit a classifier f_m(x) \in \{-1, 1\}   ▷ f_m minimizes the weighted training error
4:    Compute \varepsilon_m = E_w[1(y \neq f_m(x))]   ▷ weighted training error of f_m
5:    c_m = \log\frac{1 - \varepsilon_m}{\varepsilon_m}   ▷ coefficient of the weak learner of the current iteration
6:    w_i \leftarrow w_i e^{c_m 1(y_i \neq f_m(x_i))}, i = 1, 2, \ldots, N   ▷ increase the weights of the misclassified samples
7:    Re-normalize such that \sum_{i=1}^{N} w_i = 1   ▷ re-normalize all the weights
8: end while
9: return \mathrm{sign}[F(x)], where F(x) = \sum_{m=1}^{M} c_m f_m(x)   ▷ output the sign of the ensemble result

w_i: weight of sample i
f_m: optimal classifier of the mth iteration
\varepsilon_m: weighted training error of f_m
c_m: coefficient of the classifier f_m in the ensemble
F: final ensemble classifier
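To make the update rules concrete, here is a minimal NumPy sketch of Algorithm 1 with decision stumps on a toy one-dimensional data set; all data, names and the number of rounds are illustrative.

```python
import numpy as np

def fit_stump(x, y, w):
    """Return the threshold/polarity stump with minimum weighted error."""
    best = (0.0, 1, np.inf)  # (threshold, polarity, weighted error)
    for t in np.unique(x):
        for pol in (1, -1):
            pred = np.where(x < t, pol, -pol)
            err = w[pred != y].sum()
            if err < best[2]:
                best = (t, pol, err)
    return best

def discrete_adaboost(x, y, M=20):
    N = len(y)
    w = np.full(N, 1.0 / N)                      # step 1: equal weights
    stumps, coefs = [], []
    for _ in range(M):
        t, pol, eps = fit_stump(x, y, w)         # step 3: best weak learner
        eps = np.clip(eps, 1e-12, 1 - 1e-12)
        c = np.log((1 - eps) / eps)              # step 5: coefficient c_m
        pred = np.where(x < t, pol, -pol)
        w = w * np.exp(c * (pred != y))          # step 6: reweight errors
        w = w / w.sum()                          # step 7: re-normalize
        stumps.append((t, pol))
        coefs.append(c)
    def F(xq):                                   # step 9: sign of the ensemble
        s = sum(c * np.where(xq < t, p, -p)
                for (t, p), c in zip(stumps, coefs))
        return np.sign(s)
    return F

# Toy data: y = 1 exactly on [2, 5); no single stump separates it
x = np.arange(8.0)
y = np.array([-1, -1, 1, 1, 1, -1, -1, -1])
F = discrete_adaboost(x, y)
print((F(x) == y).mean())
```

A single stump cannot represent the interval [2, 5), but after a few rounds the weighted combination of stumps classifies the toy set correctly, illustrating how weak learners ensemble into a strong one.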
Algorithm 2 Real adaboost algorithm
1: Initialize w_i \leftarrow \frac{1}{N}, i = 1, 2, \ldots, N   ▷ equal initial weight for each sample
2: while m \le M do
3:    Find the optimal partition and get the class probability p_m(y = 1|x)   ▷ train a classifier with minimum weighted training error and get its class probability
4:    Compute f_m(x) \leftarrow \frac{1}{2} \ln \frac{p_m(y = 1|x)}{1 - p_m(y = 1|x)}   ▷ update the score of each leaf
5:    w_i \leftarrow w_i e^{-y_i f_m(x_i)}, i = 1, 2, \ldots, N   ▷ update the weight of each sample
6:    Re-normalize such that \sum_{i=1}^{N} w_i = 1   ▷ re-normalize all the weights
7: end while
8: return \mathrm{sign}[F(x)], where F(x) = \sum_{m=1}^{M} f_m(x)   ▷ output the sign of the ensemble result

w_i: weight of sample i
p_m(y = 1|x): class probability of each leaf at the mth iteration
f_m: classifier with the optimal score at the mth iteration
F: final ensemble classifier
2.3.2 Derivation of each step
Firstly, it sets equal weight to each sample w0(i) = 1/N , where N is the number of all
samples.
Secondly, it comes the iteration step m. There are three steps at mth stage in discrete
adaboost. It includes (1) Find a optimal classifier fm with minimal weighted training
error. (2) Find the optimal coefficient cm of weak classifier fm. (3) Update weight for
next iteration. In practical case, we usually find a classifier fm(x) with minimum weighted
training error under certain given condition which can also be optimized by doing line-
search. For example, the most popular condition is to set the base classifier as stump,
which is the simplest decision tree for its efficiency and accuracy of final classifier.
f_m(x) = \arg\min_{f} E_w[I(y \neq f(x))] = \arg\min_{f} \sum_{i=1}^{N} w_m(i) I(y_i \neq f(x_i))

Next we have to find the optimal c_m which minimizes the empirical training error \frac{1}{N} \sum_{i=1}^{N} I(y_i F(x_i) < 0) while f_m is known. To minimize the empirical training error, we consider minimizing its upper bound. Since

I(y_i \neq F(x_i)) \le e^{-k y_i F(x_i)} \quad \forall k > 0,

the objective function becomes \frac{1}{N} \sum_{i=1}^{N} e^{-k y_i F(x_i)}. According to the definition in the algorithm, the weights of the misclassified samples are increased. In fact, the weights of the correctly classified samples are also reduced, because of the re-normalization. The updated weight for the next iteration can be expressed as

w_{m+1}(i) = \frac{w_m(i)\, e^{c_m I(y_i \neq f_m(x_i))}}{\sum_{j=1}^{N} w_m(j)\, e^{c_m I(y_j \neq f_m(x_j))}}   (1)

The denominator \sum_{i=1}^{N} w_m(i) e^{c_m I(y_i \neq f_m(x_i))} is the normalization term, and it can be simplified to (1 - \varepsilon_m) + \varepsilon_m e^{c_m} by the following derivation:

\sum_{i=1}^{N} w_m(i)\, e^{c_m I(y_i \neq f_m(x_i))}
  = \sum_{y_i = f_m(x_i)} w_m(i)\, e^{0} + \sum_{y_i \neq f_m(x_i)} w_m(i)\, e^{c_m}
  = \sum_{y_i = f_m(x_i)} w_m(i) + e^{c_m} \sum_{y_i \neq f_m(x_i)} w_m(i)
  = (1 - \varepsilon_m) + \varepsilon_m e^{c_m},

where the weighted training error is \varepsilon_m = \sum_{y_i \neq f_m(x_i)} w_m(i).

When y_i = f_m(x_i), which is the same as y_i f_m(x_i) = 1,

w_{m+1}(i) = \frac{w_m(i)}{(1 - \varepsilon_m) + \varepsilon_m e^{c_m}} = \frac{w_m(i)\, e^{-c_m/2}}{(1 - \varepsilon_m) e^{-c_m/2} + \varepsilon_m e^{c_m/2}}.   (2)

Otherwise, y_i \neq f_m(x_i) (or y_i f_m(x_i) = -1), and

w_{m+1}(i) = \frac{w_m(i)\, e^{c_m}}{(1 - \varepsilon_m) + \varepsilon_m e^{c_m}} = \frac{w_m(i)\, e^{c_m/2}}{(1 - \varepsilon_m) e^{-c_m/2} + \varepsilon_m e^{c_m/2}}.   (3)

We can combine equations (2) and (3) into a single equation:

w_{m+1}(i) = \frac{w_m(i)\, e^{-c_m y_i f_m(x_i)/2}}{Z_m},   (4)

where Z_m = (1 - \varepsilon_m) e^{-c_m/2} + \varepsilon_m e^{c_m/2}. By equation (4), we get

e^{-c_m y_i f_m(x_i)/2} = \frac{w_{m+1}(i)}{w_m(i)}\, Z_m.   (5)
By (5), the upper bound of the empirical error can be written as

\frac{1}{N} \sum_{i=1}^{N} e^{-y_i F(x_i)/2} = \prod_{m=1}^{M} Z_m   (6)

when choosing k = \frac{1}{2}. The derivation is as follows:

\frac{1}{N} \sum_{i=1}^{N} e^{-y_i F(x_i)/2}
  = \frac{1}{N} \sum_{i=1}^{N} e^{-y_i \sum_{m=1}^{M} c_m f_m(x_i)/2}
  = \frac{1}{N} \sum_{i=1}^{N} \prod_{m=1}^{M} e^{-y_i c_m f_m(x_i)/2}
  = \frac{1}{N} \sum_{i=1}^{N} \prod_{m=1}^{M} \frac{w_{m+1}(i)}{w_m(i)}\, Z_m
  = \prod_{m=1}^{M} Z_m,

using \sum_{i=1}^{N} w_{m+1}(i) = 1 and w_1(i) = \frac{1}{N}.

Thus, by equation (6), minimizing the upper bound of the empirical error is equivalent to minimizing Z_m at each iteration. The minimum of Z_m occurs when \frac{\partial Z_m}{\partial c_m} = 0:

\frac{\partial Z_m}{\partial c_m} = -\frac{1}{2}(1 - \varepsilon_m) e^{-c_m/2} + \frac{1}{2}\varepsilon_m e^{c_m/2} = 0
\Rightarrow e^{c_m} = \frac{1 - \varepsilon_m}{\varepsilon_m}
\Rightarrow c_m = \log\frac{1 - \varepsilon_m}{\varepsilon_m}   (7)

And, with c_m = \log\frac{1 - \varepsilon_m}{\varepsilon_m},

Z_m = (1 - \varepsilon_m)\, e^{-\frac{1}{2}\log\frac{1 - \varepsilon_m}{\varepsilon_m}} + \varepsilon_m\, e^{\frac{1}{2}\log\frac{1 - \varepsilon_m}{\varepsilon_m}}
    = (1 - \varepsilon_m)\sqrt{\frac{\varepsilon_m}{1 - \varepsilon_m}} + \varepsilon_m\sqrt{\frac{1 - \varepsilon_m}{\varepsilon_m}}
    = 2\sqrt{\varepsilon_m(1 - \varepsilon_m)}.

Thus, the empirical error is bounded:

\frac{1}{N} \sum_{i=1}^{N} I(y_i F(x_i) < 0) \le \prod_{m=1}^{M} 2\sqrt{\varepsilon_m(1 - \varepsilon_m)}   (8)
This result can be interpreted as follows. When c_m = 0, the classifier f_m is equivalent to random guessing, because its weighted training error is \varepsilon_m = 0.5, so it is useless for the final prediction. c_m > 0 means that f_m predicts correctly more often than not (\varepsilon_m < 0.5), so we multiply it by a positive coefficient to emphasize its positive influence on the ensemble classifier. If c_m < 0, then, not surprisingly, f_m is wrong more often than not (\varepsilon_m > 0.5); by reversing its prediction, it still helps the ensemble prediction.
In real adaboost, f_m returns a real number rather than a value in \{-1, 1\}. To find the optimal value of f_m(x), note that the weighted training error satisfies

E_w[1(y f_m(x) < 0)] \le E_w[e^{-y f_m(x)}],

so minimizing the weighted training error is equivalent to minimizing its upper bound E_w[e^{-y f_m(x)}]:

E_w[e^{-y f_m(x)}] = P_w(y = 1|x)\, e^{-f_m(x)} + P_w(y = -1|x)\, e^{f_m(x)}

\frac{\partial E_w[e^{-y f_m(x)}]}{\partial f_m(x)} = -P_w(y = 1|x)\, e^{-f_m(x)} + P_w(y = -1|x)\, e^{f_m(x)}

\frac{\partial E_w[e^{-y f_m(x)}]}{\partial f_m(x)} = 0 \iff f_m(x) = \frac{1}{2}\log\frac{P_w(y = 1|x)}{P_w(y = -1|x)}

Hence, the optimal base classifier f_m is half the logarithm of the ratio of the two class probabilities.

Lastly, both algorithms output the sign of the ensemble value.
2.4 Logit boosting
Logitboost is a method used only for binary classification problems with class labels y^* \in \{0, 1\}, while the other methods use class labels y \in \{-1, 1\}. It fits an additive symmetric logistic likelihood function using an adaptive Newton approach. At every iteration step, a working response z_i, \forall i = 1, 2, \ldots, N, is calculated for each sample from its y_i and the estimated class probability. The working response can be regarded as an approximate residual between the real class label and the ensemble value; its formula is derived using Newton's method. The optimal regression tree f_m is obtained by minimizing a weighted least-squares regression of the z_i. In the last step, the algorithm outputs the sign of the ensemble value.
2.4.1 Algorithm
Algorithm 3 Logit boosting algorithm
1: Initialize w_i \leftarrow \frac{1}{N}, p_0(x_i) = \frac{1}{2}, i = 1, 2, \ldots, N, and F_0(x_i) = 0   ▷ equal initial weight for each sample, and the initial score of the single-node tree
2: while m \le M do
3:    z_m(i) = \frac{y^*_i - p_{m-1}(x_i)}{p_{m-1}(x_i)(1 - p_{m-1}(x_i))}   ▷ compute the working response z
4:    w_m(i) = p_{m-1}(x_i)(1 - p_{m-1}(x_i))   ▷ update the weight of each instance
5:    Fit f_m(x) by minimizing E_w[(f_m(x) - z)^2]   ▷ weighted least-squares regression of the z_m(i) using the weights w_m(i)
6:    F_m(x) = F_{m-1}(x) + \frac{1}{2} f_m(x), and p_m(x) = \frac{e^{F_m(x)}}{e^{F_m(x)} + e^{-F_m(x)}}   ▷ estimate the class probability for computing the next working response
7: end while
8: return \mathrm{sign}[F(x)]   ▷ output the sign of the ensemble result

w_i: initial weight of sample i
w_m(i): weight of sample i, updated using the previous class probability
z_m(i): the Newton update, or estimated residual
f_m: optimal regressor minimizing the weighted least-squares error of the z_i
F: final ensemble classifier
2.4.2 Derivation of each step
Firstly, equal weight \frac{1}{N} is given to every training instance, and F(x) = 0, p(x_i) = \frac{1}{2} are initialized, which means that the probability of y = 1 for every instance is the same as random guessing.

In the mth iteration, the goal is to fit f_m by minimizing the weighted least-squares regression of the z_i; F(x) is gradually improved by adding f(x) after each iteration. Here z_m(i) is regarded as a residual (a Newton update) relative to the previously estimated probability of y = 1 given x = x_i. The questions we are most concerned with are: why is

z_m(i) = \frac{y^*_i - p_{m-1}(x_i)}{p_{m-1}(x_i)(1 - p_{m-1}(x_i))},

and how can an optimal f(x) be found which maximizes the logistic likelihood at the next step?

The logistic likelihood function is denoted as

l(p(x)) = y^* \log p(x) + (1 - y^*) \log(1 - p(x)),
where

p(x) = \frac{e^{F(x)}}{e^{F(x)} + e^{-F(x)}} = \frac{1}{e^{-2F(x)} + 1}   (9)

\Rightarrow l(p(x)) = l(F(x)) = \log\left[\left(\frac{e^{F(x)}}{e^{F(x)} + e^{-F(x)}}\right)^{y^*}\left(\frac{e^{-F(x)}}{e^{F(x)} + e^{-F(x)}}\right)^{1-y^*}\right] = 2y^* F(x) - \log(1 + e^{2F(x)})

To find the optimal f(x) for the current iteration, we need to maximize the expected logistic likelihood:

f(x) = \arg\max_{f} E[l(F(x) + f(x))],

which occurs when \frac{\partial E[l(F(x) + f(x))]}{\partial f(x)} = 0. Here we denote

g(F(x) + f(x)) = \frac{\partial E[l(F(x) + f(x))]}{\partial f(x)} = E\left[2y^* - \frac{2}{1 + e^{-2(F(x) + f(x))}}\right]   (10)

and

h(F(x) + f(x)) = \frac{\partial g(F(x) + f(x))}{\partial f(x)} = E\left[-\frac{4 e^{-2(F(x) + f(x))}}{(1 + e^{-2(F(x) + f(x))})^2}\right].   (11)

Thus the problem can be reduced to solving g(f) = 0. Logitboost uses Newton's method to find an approximate solution.

Newton's method finds approximate roots r of a function q(r) such that q(r) \approx 0. By Taylor expansion,

q(r) = q(r_0) + q'(r_0)(r - r_0) + O((r - r_0)^2).

It may not be easy to find the root of q(r) = 0 directly, but we can find an r such that 0 < |q(r)| < |q(r_0)|, which means that r is closer to the root than r_0, and repeat this step until q(r) \approx 0. If r_0 is already a point near the root, and

q(r) \approx q(r_0) + q'(r_0)(r - r_0) \approx 0
\Rightarrow r \approx r_0 - \frac{q(r_0)}{q'(r_0)},   (12)

then r is a better approximation than r_0. To solve g(F + f) = 0, we can consider r_0 = F(x) and r = F(x) + f(x). By (10), (11) and (12),

F(x) + f(x) \approx F(x) - \left.\frac{g(F(x) + f(x))}{h(F(x) + f(x))}\right|_{f(x) = 0}
F(x) + f(x) \approx F(x) - \frac{g(F(x))}{h(F(x))}   (13)

Since

\frac{g(F(x))}{h(F(x))} = \frac{E\left[2y^* - \frac{2}{1 + e^{-2F(x)}} \mid x\right]}{E\left[-\frac{4 e^{-2F(x)}}{(1 + e^{-2F(x)})^2} \mid x\right]} = -\frac{E\left[y^* - \frac{1}{1 + e^{-2F(x)}} \mid x\right]}{2 E\left[\frac{e^{-2F(x)}}{1 + e^{-2F(x)}} \cdot \frac{1}{1 + e^{-2F(x)}} \mid x\right]},

and by (9), we get

\frac{g(F(x))}{h(F(x))} = -\frac{1}{2} E\left[\frac{y^* - p(x)}{(1 - p(x)) p(x)} \mid x\right].   (14)

Then (13) becomes

F(x) + f(x) \approx F(x) + \frac{1}{2} E\left[\frac{y^* - p(x)}{(1 - p(x)) p(x)} \mid x\right],   (15)

so the update can be fitted by the weighted least-squares problem

f(x) = \arg\min_{f} E_w\left[\left(f(x) - \frac{1}{2} \cdot \frac{y^* - p(x)}{(1 - p(x)) p(x)}\right)^2\right].   (16)

Thus we denote z_m(i) = \frac{y^*_i - p_{m-1}(x_i)}{(1 - p_{m-1}(x_i)) p_{m-1}(x_i)}, and minimize the weighted least-squares error to z to find the optimal f_m(x).
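The working response z, the weights w and the half-step update above can be demonstrated numerically. The sketch below runs a few logitboost-style iterations with a single fixed stump split as the base learner; the data, the split point and the number of iterations are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
x = rng.normal(size=N)
y_star = (x + 0.3 * rng.normal(size=N) > 0).astype(float)  # labels in {0, 1}

F = np.zeros(N)          # F_0(x) = 0
p = np.full(N, 0.5)      # p_0(x) = 1/2

for m in range(10):
    z = (y_star - p) / (p * (1 - p))   # working response (Newton update)
    w = p * (1 - p)                    # weight of each instance
    left = x < 0.0                     # fixed, illustrative stump split
    # weighted least-squares fit of z with a stump = weighted mean per region
    f = np.where(left,
                 np.average(z[left], weights=w[left]),
                 np.average(z[~left], weights=w[~left]))
    F = F + 0.5 * f                                  # F_m = F_{m-1} + f_m / 2
    p = np.exp(F) / (np.exp(F) + np.exp(-F))         # updated class probability

pred = (p > 0.5).astype(float)
print((pred == y_star).mean())
```

After a few Newton steps, the per-region probability p converges toward the empirical frequency of y^* = 1 in each region, which is exactly the fixed point of the working-response update.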
2.5 Gradient boosting
Gradient boosting [6] [8] is a method built on the idea of function estimation. The goal is to find an additive function which fits the training data. In gradient boosting, the objective function is a loss function which can be interpreted as the dissimilarity between the real class label and the predicted value of the additive function. In every iteration, the negative gradient of the loss function is used to approximate the residual of the previous iteration, and the goal is to find a regression tree or classifier f, with the score (function value) of each of its leaves, which minimizes the sum over all instances of the loss L(y, F_{m-1}(x) + f_m(x)). Because a gradient has to be computed, the loss function is required to be a differentiable convex function. In a binary classification problem with class labels y \in \{-1, 1\}, the sign of the final ensemble score is output as the prediction.
2.5.1 Algorithm
Algorithm 4 Gradient boosting algorithm
1: F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} \psi(y_i, \gamma)   ▷ give an initial value to F
2: while m \le M do
3:    \tilde{y}_{im} = -\left[\frac{\partial \psi(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x_i) = F_{m-1}(x_i)}   ▷ calculate the negative gradient of the loss function
4:    \{R_{lm}\}_{l=1}^{L} = L disjoint regions trained on \{(x_i, \tilde{y}_{im})\}_{i=1}^{N}   ▷ use the negative gradients just calculated as new training labels to train a new classifier
5:    \gamma_{lm} = \arg\min_{\gamma} \sum_{x_i \in R_{lm}} \psi(y_i, F_{m-1}(x_i) + \gamma)   ▷ calculate the optimal score of each leaf
6:    F_m(x) = F_{m-1}(x) + \nu \cdot \gamma_{lm} 1(x \in R_{lm})   ▷ add the base classifier to the additive model (ensemble classifier)
7: end while

F_0: initial value of the ensemble classifier
\psi: loss function
\tilde{y}_{im}: estimated residual of sample i at the mth iteration
R_{lm}: set of samples classified to the leaf with index l at the mth iteration
\gamma_{lm}: optimal score of leaf l under the L-disjoint partition
\nu: learning rate
F_m: ensemble score of the mth iteration
2.5.2 Derivation
In the gradient boosting algorithm, the user can choose an arbitrary differentiable convex function as the loss function. Firstly, an initial value is given to the ensemble classifier; it is computed from the loss function being used. The initial classifier can be regarded as a tree with only a single node, which means that all samples are classified to this single node, and the initial value F_0 is the optimal score of the node. Denote the score of the single-node tree by \gamma:

F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} \psi(y_i, \gamma)   (17)

The minimum of \sum_{i=1}^{N} \psi(y_i, \gamma) occurs at

\frac{\partial \sum_{i=1}^{N} \psi(y_i, \gamma)}{\partial \gamma} = 0   (18)

By equations (17) and (18), F_0 for the different loss functions can be derived as follows.

Least-squares loss: \psi(y, F(x)) = \frac{1}{2}(y - F(x))^2

\frac{\partial \sum_{i=1}^{N} \frac{1}{2}(y_i - \gamma)^2}{\partial \gamma} = N\left(\gamma - \frac{\sum_{i=1}^{N} y_i}{N}\right) = 0
\Rightarrow F_0(x) = \frac{\sum_{i=1}^{N} y_i}{N}

According to this derivation, F_0 is the average of all y_i in the training set.
Exponential loss: \psi(y, F(x)) = e^{-y F(x)}

\frac{\partial \sum_{i=1}^{N} e^{-y_i \gamma}}{\partial \gamma} = \sum_{i=1}^{N} -y_i e^{-y_i \gamma} = -e^{-\gamma} \sum_{i=1}^{N} 1(y_i = 1) + e^{\gamma} \sum_{i=1}^{N} 1(y_i = -1) = 0
\Rightarrow -e^{-\gamma} N P(y = 1) + e^{\gamma} N P(y = -1) = 0
\Rightarrow F_0(x) = \gamma_0 = \frac{1}{2} \log\frac{P(y = 1)}{P(y = -1)}
Logistic loss: \psi(y, F(x)) = \log(1 + e^{-2y F(x)})

\frac{\partial \sum_{i=1}^{N} \log(1 + e^{-2y_i \gamma})}{\partial \gamma} = \sum_{i=1}^{N} \frac{-2y_i}{1 + e^{2y_i \gamma}} = 0
\Rightarrow -2N\left[\frac{P(y = 1)}{1 + e^{2\gamma}} - \frac{P(y = -1)}{1 + e^{-2\gamma}}\right] = 0
\Rightarrow F_0(x) = \frac{1}{2} \log\frac{P(y = 1)}{P(y = -1)}

For the exponential and logistic losses, F_0 is half the logarithm of the ratio of the probabilities of the two classes.
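The closed form F_0 = \frac{1}{2}\log\frac{P(y=1)}{P(y=-1)} can be checked numerically against a grid search over \gamma, here for the exponential loss with toy class frequencies:

```python
import numpy as np

y = np.array([1.0] * 20 + [-1.0] * 80)   # P(y=1) = 0.2, P(y=-1) = 0.8
f0 = 0.5 * np.log(0.2 / 0.8)             # claimed minimizer of the mean loss

# Brute-force minimization of the mean exponential loss over a gamma grid
gammas = np.linspace(-2.0, 2.0, 2001)
mean_loss = [np.exp(-y * gam).mean() for gam in gammas]
best = gammas[np.argmin(mean_loss)]
print(f0, best)
```

The grid minimizer agrees with the closed form up to the grid resolution, confirming the derivation above.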
Secondly, the mth iteration includes four steps: (1) calculate the negative gradient \tilde{y}_{im} of the loss function to approximate the residual of each sample; (2) train a classifier f_m which partitions the features \{x_i \mid i = 1, 2, \ldots, N\} into L disjoint regions R_{lm}, \forall l = 1, 2, \ldots, L, given the training set \{(x_i, \tilde{y}_{im})\}; (3) find the optimal score \gamma_{lm} of leaf l which minimizes the loss function over the leaf; (4) update the ensemble score.

(1) Calculate \tilde{y}_{im} for the different loss functions:

\tilde{y}_{im} = -\left[\frac{\partial \psi(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x_i) = F_{m-1}(x_i)}   (19)

Least-squares loss:

\tilde{y}_{im} = -\left[\frac{\partial \frac{1}{2}(y_i - F(x_i))^2}{\partial F(x_i)}\right]_{F(x_i) = F_{m-1}(x_i)} = y_i - F_{m-1}(x_i)   (20)

Exponential loss:

\tilde{y}_{im} = -\left[\frac{\partial e^{-y_i F(x_i)}}{\partial F(x_i)}\right]_{F(x_i) = F_{m-1}(x_i)} = y_i e^{-y_i F_{m-1}(x_i)}   (21)

Logistic loss:

\tilde{y}_{im} = -\left[\frac{\partial \log(1 + e^{-2y_i F(x_i)})}{\partial F(x_i)}\right]_{F(x_i) = F_{m-1}(x_i)} = \frac{2y_i}{1 + e^{2y_i F_{m-1}(x_i)}}   (22)

(2) Train a classifier (tree) f_m with L leaves using the training set \{(x_i, \tilde{y}_{im})\}; L can vary from iteration to iteration.

(3) Find the optimal score f_m(x_i) = \gamma_{lm}, \forall x_i \in R_{lm}, of leaf l, which minimizes the loss over the leaf \sum_{x_i \in R_{lm}} \psi(y_i, F_m(x_i)):

\sum_{x_i \in R_{lm}} \psi(y_i, F_m(x_i)) = \sum_{x_i \in R_{lm}} \psi(y_i, F_{m-1}(x_i) + \gamma_{lm})   (23)

\gamma_{lm} = \arg\min_{\gamma} \sum_{x_i \in R_{lm}} \psi(y_i, F_{m-1}(x_i) + \gamma)   (24)

The minimum occurs while

\frac{\partial \sum_{x_i \in R_{lm}} \psi(y_i, F_{m-1}(x_i) + \gamma)}{\partial \gamma} = 0   (25)
Least-squares loss: by equation (20), \frac{1}{2}(y_i - F_{m-1}(x_i) - \gamma)^2 can be rewritten as \frac{1}{2}(\gamma^2 - 2\tilde{y}_{im}\gamma + \tilde{y}_{im}^2). By equations (24) and (25), we get

\frac{\partial \sum_{x_i \in R_{lm}} \frac{1}{2}(\gamma^2 - 2\tilde{y}_{im}\gamma + \tilde{y}_{im}^2)}{\partial \gamma} = 0
\Rightarrow N_l\left(\gamma - \frac{\sum_{x_i \in R_{lm}} \tilde{y}_{im}}{N_l}\right) = 0
\Rightarrow \gamma_{lm} = \frac{\sum_{x_i \in R_{lm}} \tilde{y}_{im}}{N_l},

where N_l = N(R_{lm}) is the number of samples classified to leaf l.
Exponential loss: by equations (21), (24) and (25),

\frac{\partial \sum_{x_i \in R_{lm}} e^{-y_i(F_{m-1}(x_i) + \gamma)}}{\partial \gamma} = 0
\Rightarrow \sum_{x_i \in R_{lm}} -y_i e^{-y_i F_{m-1}(x_i)} e^{-y_i \gamma} = 0
\Rightarrow \sum_{x_i \in R_{lm}} -\tilde{y}_{im} e^{-y_i \gamma} = 0
\Rightarrow -e^{-\gamma} \sum_{x_i \in R_{lm}} \tilde{y}_{im} 1(y_i = 1) - e^{\gamma} \sum_{x_i \in R_{lm}} \tilde{y}_{im} 1(y_i = -1) = 0,

and we get

\gamma_{lm} = \frac{1}{2} \log \frac{-\sum_{x_i \in R_{lm}} \tilde{y}_{im} 1(y_i = 1)}{\sum_{x_i \in R_{lm}} \tilde{y}_{im} 1(y_i = -1)}.
Logistic loss: by equation (24), we need to solve

\frac{\partial \sum_{x_i \in R_{lm}} \log(1 + e^{-2y_i(F_{m-1}(x_i) + \gamma)})}{\partial \gamma} = -\sum_{x_i \in R_{lm}} \frac{2y_i}{1 + e^{2y_i(F_{m-1}(x_i) + \gamma)}} = 0,

but there is no closed form for the solution. In practical cases, a numerical method is used to solve it.

(4) Add the score to the ensemble classifier, depending on which leaf x belongs to:

F_m(x) = F_{m-1}(x) + \nu \cdot \gamma_{lm} 1(x \in R_{lm})

Here \nu is the learning rate, a coefficient applied to \gamma_{lm} when it is added to the ensemble classifier. In practical applications it is a parameter that needs to be fine-tuned to build the model conservatively.

Lastly, the prediction value is output in a way that depends on the loss function. For the least-squares loss, the sign of the ensemble result is output. For the logistic and exponential losses, the sigmoid function, which will be introduced in the next chapter, maps the ensemble value to [0, 1]; if the result is larger than 0.5, the prediction is 1. Table 1 summarizes the results for the different loss functions.
Table 1: Summary of the different loss functions for gradient boosting

Least squares:
  loss function: \frac{1}{2}(y - F(x))^2
  F_0(x) = \frac{\sum_{i=1}^{N} y_i}{N}
  \tilde{y}_{im} = y_i - F_{m-1}(x_i)
  \gamma_{lm} = \frac{\sum_{x_i \in R_{lm}} \tilde{y}_{im}}{N_l}

Exponential loss:
  loss function: e^{-y F(x)}
  F_0(x) = \frac{1}{2}\log\frac{P(y = 1)}{P(y = -1)}
  \tilde{y}_{im} = y_i e^{-y_i F_{m-1}(x_i)}
  \gamma_{lm} = \frac{1}{2}\log\frac{-\sum_{x_i \in R_{lm}} \tilde{y}_{im} 1(y_i = 1)}{\sum_{x_i \in R_{lm}} \tilde{y}_{im} 1(y_i = -1)}

Logistic loss:
  loss function: \log(1 + e^{-2y F(x)})
  F_0(x) = \frac{1}{2}\log\frac{P(y = 1)}{P(y = -1)}
  \tilde{y}_{im} = \frac{2y_i}{1 + e^{2y_i F_{m-1}(x_i)}}
  \gamma_{lm}: no closed form
2.6 XGB-Extreme Gradient Boosting
XGB, which stands for "extreme gradient boosting", has been developed since 2014. It first appeared in a competition held on Kaggle and was proposed by Tianqi Chen [7] at the University of Washington. The idea of XGB originates from gradient boosting, and the biggest difference between gradient boosting and XGB is the objective function. In addition to the training loss, the objective function of XGB has a regularization term which does not exist in traditional gradient boosting; it can prevent overfitting. Like the other boosting methods, the final classifier is the ensemble of many base classifiers, and the ensemble classifier always gives a better prediction than that of the previous iteration. XGB uses a second-order Taylor expansion around the previous ensemble value to approximate the objective function. We discuss the binary classification problem with labels y \in \{1, -1\} and training set \{(x_i, y_i) \mid i = 1, 2, \ldots, N\}, where x_i is a vector with d dimensions (d features).
2.6.1 Algorithm
Algorithm 5 Extreme gradient boosting algorithm
1: Initialize $F_0(x_i) = \gamma_0\ \forall i = 1, 2, \dots, N$. ▷ Give an initial value to $F$
2: while $m \le M$ do
3: $g_{im} = \frac{\partial \psi(y_i, F(x_i))}{\partial F(x_i)}\big|_{F(x_i)=F_{(m-1)}(x_i)}$, $h_{im} = \frac{\partial^2 \psi(y_i, F(x_i))}{\partial F(x_i)^2}\big|_{F(x_i)=F_{(m-1)}(x_i)}$ ▷ Calculate $g_{im}$ and $h_{im}$ for the 2nd-order Taylor expansion approximation
4: Train a new classifier which partitions $\{x_i\}_{i=1}^N$ into $L$ disjoint regions $\{R_{lm}\}_{l=1}^L$ ▷ Train a classifier with training set $\{x_i, g_{im}, h_{im}\}$
5: $w_{lm} = -\frac{\sum_{x_i\in R_{lm}} g_{im}}{\sum_{x_i\in R_{lm}} h_{im} + \lambda}$ ▷ Calculate the optimal score of each region
6: $F_m(x) = F_{(m-1)}(x) + \nu\, w_{lm}1(x \in R_{lm})$ ▷ Add the base classifier to the additive model (ensemble classifier)
7: end while
$\gamma_0$: initial score of each sample
$g_{im}$: value of the first derivative of the loss function for sample $i$ at the $m$th iteration
$h_{im}$: value of the second derivative of the loss function for sample $i$ at the $m$th iteration
$R_{lm}$: $\forall x_i$ classified to region $l$ at the $m$th iteration
$w_{lm}$: optimal score of leaf $l$ at the $m$th iteration
2.6.2 Derivation
The objective function of XGB includes two terms: the loss function and a regularization term. Denote the objective function at the $m$th iteration by $L^{(m)}$:
\[
L^{(m)} = \sum_{i=1}^N \psi(y_i, F_{(m-1)}(x_i) + f_m(x_i)) + \Omega(f_m) \tag{26}
\]
\[
\Omega(f_m) = \gamma L + \frac{1}{2}\lambda \|w\|^2 \tag{27}
\]
are the loss function and the regularization term respectively. $f_m$ is a regression tree trained on $\{x_i, g_{im}, h_{im}\}$. It partitions all samples $\{x_i \mid i = 1, 2, \dots, N\}$ into $L$ disjoint regions $\{R_{lm}\}_{l=1}^L$. The goal is to minimize the objective function $L^{(m)}$ under a fixed partition $\{R_{lm}\}_{l=1}^L$ and to find the optimal weight $w_{lm}$ of region $l$ such that $f_m(x_i \in R_{lm}) = w_{lm}$. The regions can be regarded as leaves.
Different from traditional gradient boosting, XGB uses a 2nd-order Taylor expansion to approximate the loss function; thus $\psi$ is required to be twice differentiable.
\[
\psi(y_i, F_{(m-1)}(x_i) + f_m(x_i)) \cong \psi(y_i, F_{(m-1)}(x_i)) + g_{im} f_m(x_i) + \frac{1}{2} h_{im} f_m^2(x_i) \tag{28}
\]
where
\[
g_{im} = \partial_F \psi(y_i, F)\,\big|_{F=F_{(m-1)}(x_i)} \tag{29}
\]
\[
h_{im} = \partial_F^2 \psi(y_i, F)\,\big|_{F=F_{(m-1)}(x_i)} \tag{30}
\]
Eq. (26) becomes
\[
L^{(m)} \cong \sum_{i=1}^N \left[\psi(y_i, F_{(m-1)}(x_i)) + g_{im} f_m(x_i) + \frac{1}{2} h_{im} f_m^2(x_i)\right] + \gamma L + \frac{1}{2}\lambda \|w\|^2. \tag{31}
\]
Since $\psi(y_i, F_{(m-1)}(x_i))$ is known, it is equivalent to optimize
\[
L^{(m)} \cong \sum_{i=1}^N \left[g_{im} f_m(x_i) + \frac{1}{2} h_{im} f_m^2(x_i)\right] + \gamma L + \frac{1}{2}\lambda \|w\|^2. \tag{32}
\]
Let $f_m(x_i \in R_{lm}) = w_l$; equation (32) can be rewritten as
\[
L^{(m)} \cong \sum_{l=1}^L \left[\sum_{x_i\in R_{lm}} \left(g_{im} w_l + \frac{1}{2} h_{im} w_l^2\right) + \gamma\right] + \sum_{l=1}^L \frac{1}{2}\lambda w_l^2
= \sum_{l=1}^L \left[w_l \left(\sum_{x_i\in R_{lm}} g_{im}\right) + \frac{1}{2} w_l^2 \left(\sum_{x_i\in R_{lm}} h_{im} + \lambda\right) + \gamma\right].
\]
Let
\[
L_l^{(m)} = w_l \left(\sum_{x_i\in R_{lm}} g_{im}\right) + \frac{1}{2} w_l^2 \left(\sum_{x_i\in R_{lm}} h_{im} + \lambda\right) + \gamma. \tag{33}
\]
Finding the optimal value of the regression tree $f_m$ is equivalent to finding, for each leaf $l$, the optimal $w_l$ which minimizes $L_l^{(m)}$:
\[
w_{lm} = \arg\min_{w_l} L_l^{(m)}. \tag{34}
\]
The minimal value occurs when $\frac{\partial L_l^{(m)}}{\partial w_l} = 0$. Denote $a_{lm} = \frac{1}{2}\left(\sum_{x_i\in R_{lm}} h_{im} + \lambda\right)$ and $b_{lm} = \sum_{x_i\in R_{lm}} g_{im}$. Equation (33) becomes
\[
L_l^{(m)} = a_{lm} w_l^2 + b_{lm} w_l + \gamma
\]
and
\[
\frac{\partial L_l^{(m)}}{\partial w_l} = 2 a_{lm} w_l + b_{lm} = 0,
\]
so we get
\[
w_{lm} = -\frac{b_{lm}}{2 a_{lm}} = -\frac{\sum_{x_i\in R_{lm}} g_{im}}{\sum_{x_i\in R_{lm}} h_{im} + \lambda}. \tag{35}
\]
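Equation (35), and the per-leaf optimal objective value it implies, can be sketched in a few lines of Python; `g`, `h` and `lam` below are illustrative placeholders for the per-leaf gradients, Hessians and the L2 coefficient λ:

```python
def leaf_weight(g, h, lam):
    # w_lm = -sum(g) / (sum(h) + lambda), equation (35)
    return -sum(g) / (sum(h) + lam)

def leaf_objective(g, h, lam, gamma):
    # Optimal(L_l^(m)) = gamma - (sum g)^2 / (2 * (sum h + lambda))
    return gamma - sum(g) ** 2 / (2.0 * (sum(h) + lam))

# toy gradient/Hessian statistics for one leaf
print(leaf_weight([1.0, 1.0], [1.0, 1.0], 2.0))
print(leaf_objective([1.0, 1.0], [1.0, 1.0], 2.0, 0.0))
```

Note how a larger λ shrinks the leaf weight toward zero, which is exactly the conservative effect of the regularization term.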
Since
\[
\mathrm{Optimal}(L_l^{(m)}) = \gamma - \frac{b_{lm}^2}{4 a_{lm}} = \gamma - \frac{\left(\sum_{x_i\in R_{lm}} g_{im}\right)^2}{2\left(\sum_{x_i\in R_{lm}} h_{im} + \lambda\right)},
\]
it turns out that
\[
\mathrm{Optimal}(L^{(m)}) \cong \sum_{l=1}^L \left[\gamma - \frac{\left(\sum_{x_i\in R_{lm}} g_{im}\right)^2}{2\left(\sum_{x_i\in R_{lm}} h_{im} + \lambda\right)}\right]
= \gamma L - \sum_{l=1}^L \frac{\left(\sum_{x_i\in R_{lm}} g_{im}\right)^2}{2\left(\sum_{x_i\in R_{lm}} h_{im} + \lambda\right)}, \tag{36}
\]
which is important for the later derivation of the splitting criterion used when training the new classifier at each iteration.
In the algorithm, an initial value is given to the ensemble classifier; in practical applications it can be defined by the user, or simply take the same value as in gradient boosting. The $m$th iteration includes four steps:
(1) Compute the first derivative $g_{im}$ and the second derivative $h_{im}$ by using $y_i$ and the ensemble value $F_{(m-1)}(x_i)$ of the previous iteration, for all samples.
(2) Fit a classifier with the given training set $\{x_i, g_{im}, h_{im}\}$.
(3) Compute the optimal score $w_{lm}$.
(4) Add the score to the ensemble classifier.
(1) Compute $g_{im}$ and $h_{im}$ for the different loss functions by equations (29), (30).
Least square loss $(y - F(x))^2$:
\[
g_{im} = 2F_{(m-1)}(x_i) - 2y_i, \qquad h_{im} = 2.
\]
Exponential loss $e^{-yF(x)}$:
\[
g_{im} = -y_i e^{-y_iF_{(m-1)}(x_i)}, \qquad h_{im} = y_i^2 e^{-y_iF_{(m-1)}(x_i)} = e^{-y_iF_{(m-1)}(x_i)},
\]
because $y_i^2 = 1$ for $y_i \in \{-1, 1\}$.
Logistic loss $\log(1 + e^{-2yF(x)})$:
\[
g_{im} = \frac{-2y_i e^{-2y_iF_{(m-1)}(x_i)}}{1 + e^{-2y_iF_{(m-1)}(x_i)}} = \frac{-2y_i}{1 + e^{2y_iF_{(m-1)}(x_i)}},
\qquad
h_{im} = \frac{4y_i^2 e^{2y_iF_{(m-1)}(x_i)}}{\left(1 + e^{2y_iF_{(m-1)}(x_i)}\right)^2} = \frac{4e^{-2y_iF_{(m-1)}(x_i)}}{\left(1 + e^{-2y_iF_{(m-1)}(x_i)}\right)^2}.
\]
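The logistic-loss derivatives above can be sanity-checked numerically. The following sketch (toy values, helper names illustrative) compares $g_{im}$ and $h_{im}$ against finite differences of the loss:

```python
import math

def logistic_loss(y, F):
    # psi(y, F) = log(1 + exp(-2yF))
    return math.log(1.0 + math.exp(-2.0 * y * F))

def g_logistic(y, F):
    # first derivative w.r.t. F: -2y / (1 + exp(2yF))
    return -2.0 * y / (1.0 + math.exp(2.0 * y * F))

def h_logistic(y, F):
    # second derivative: 4*exp(-2yF) / (1 + exp(-2yF))^2
    e = math.exp(-2.0 * y * F)
    return 4.0 * e / (1.0 + e) ** 2

# central finite differences at a toy point (y=1, F=0.3)
eps = 1e-6
num_g = (logistic_loss(1, 0.3 + eps) - logistic_loss(1, 0.3 - eps)) / (2 * eps)
print(num_g, g_logistic(1, 0.3))   # should agree to ~1e-5
```

The same second-difference check works for $h_{im}$, confirming in particular that $h_{im}$ is positive, as it must be for a convex loss.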
(2) and (3) Train a classifier to partition $\{x_i \mid i = 1, 2, \dots, N\}$ into $\{R_{lm}\}_{l=1}^L$ and find the optimal score $w_{lm}$ which minimizes the objective function.
In step (2), $g_{im}$ and $h_{im}$ are used in a stopping/splitting criterion to train a new classifier with $L$ leaves: the classifier stops splitting after evaluating a further split of each leaf. To measure the goodness of the classifier, we can use an idea similar to decision trees: grow a tree from a single node and stop when the impurity (entropy) would increase after a split; the optimal classifier is the one before that split. In XGB, the objective function $L^{(m)}$ serves as the index of the goodness of a classifier. If the objective function value increases after splitting leaf $l$, then $l$ is not split; otherwise splitting continues until no leaf can be split further. To formulate the stopping criterion, we first partition the samples in $R_{lm}$ into $R_L$ and $R_R$ with leaf indices $l_L, l_R$. If $l$ is a leaf of the optimal classifier, then no possible partition (split) of $R_{lm}$ decreases the objective function value. Since the objective contributions of the other leaves do not change after splitting $R_{lm}$, the difference of the objectives equals the difference between $\mathrm{Optimal}(L_l^{(m)})$ and $\mathrm{Optimal}(L_{l_L\cup l_R}^{(m)})$. Thus we can formulate the stopping criterion as
\[
\delta L^{(m)} = \mathrm{Optimal}(L_l^{(m)}) - \mathrm{Optimal}(L_{l_L\cup l_R}^{(m)}) < 0 \tag{37}
\]
where
\[
\mathrm{Optimal}(L_l^{(m)}) = \gamma - \frac{\left(\sum_{x_i\in R_{lm}} g_{im}\right)^2}{2\left(\sum_{x_i\in R_{lm}} h_{im} + \lambda\right)},
\]
and
\[
\mathrm{Optimal}(L_{l_L\cup l_R}^{(m)}) = 2\gamma - \left[\frac{\left(\sum_{x_i\in R_R} g_{im}\right)^2}{2\left(\sum_{x_i\in R_R} h_{im} + \lambda\right)} + \frac{\left(\sum_{x_i\in R_L} g_{im}\right)^2}{2\left(\sum_{x_i\in R_L} h_{im} + \lambda\right)}\right].
\]
Inequality (37) becomes
\[
\frac{\left(\sum_{x_i\in R_L} g_{im}\right)^2}{2\left(\sum_{x_i\in R_L} h_{im} + \lambda\right)} + \frac{\left(\sum_{x_i\in R_R} g_{im}\right)^2}{2\left(\sum_{x_i\in R_R} h_{im} + \lambda\right)} - \frac{\left(\sum_{x_i\in R_{lm}} g_{im}\right)^2}{2\left(\sum_{x_i\in R_{lm}} h_{im} + \lambda\right)} - \gamma < 0 \tag{38}
\]
If $l$ is a leaf of the optimal classifier, all possible partitions satisfy inequality (38). If $\delta L^{(m)} > 0$, which means $l$ can be split further, then we have to find the best split for leaf $l$. The best split in a decision tree is the partition which reduces the impurity (entropy) score the most. The exact greedy algorithm for split finding in XGB follows a similar idea: it chooses the split which maximizes $\delta L^{(m)}$ (decreases the objective function value the most) as the optimal split. The algorithm sorts all samples in $R_{lm}$ by $x_{ik}$, the value of the $k$th feature, so that $R_L = \{x_s \mid x_{sk} \le x_{ik}\}$ and $R_R = \{x_j \mid x_{jk} > x_{ik}\}$, and always keeps the partition with the larger $\delta L^{(m)}$. Repeating this for every feature finally yields the best split for leaf $l$. After getting $L$ and $\{R_{lm}\}_{l=1}^L$, calculate the optimal weight $w_{lm}$ of each leaf $l$ by equation (35).
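The exact greedy search on one feature can be sketched as follows. This is a minimal illustration of the gain in inequality (38), assuming distinct feature values; the function name and candidate-threshold convention (midpoints) are our own choices:

```python
def best_split(xs, g, h, lam, gamma):
    """Scan all thresholds on one feature; return (threshold, gain)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    G, H = sum(g), sum(h)
    GL = HL = 0.0
    best_gain, best_thr = 0.0, None
    for rank, i in enumerate(order[:-1]):
        GL += g[i]            # left accumulates samples in sorted order
        HL += h[i]
        GR, HR = G - GL, H - HL
        # delta-L of inequality (38): left + right - parent - gamma
        gain = (GL ** 2 / (2.0 * (HL + lam)) + GR ** 2 / (2.0 * (HR + lam))
                - G ** 2 / (2.0 * (H + lam)) - gamma)
        if gain > best_gain:
            best_gain = gain
            best_thr = (xs[i] + xs[order[rank + 1]]) / 2.0
    return best_thr, best_gain

# toy leaf: negative gradients on the left, positive on the right
print(best_split([1.0, 2.0, 3.0, 4.0], [-1.0, -1.0, 1.0, 1.0],
                 [1.0, 1.0, 1.0, 1.0], 0.0, 0.0))
```

Repeating this scan over every feature and taking the overall maximum gives the split actually used; a positive γ acts as the minimum gain required to split at all.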
In the last step, output the ensemble classifier. For the least square loss in a binary classification problem with $y_i \in \{-1, 1\}$, the prediction is the sign of the ensemble value. For exponential and logistic loss, the sigmoid function can be used to map the value to [0, 1], returning 1 if the value is larger than 0.5. Table 2 summarizes the results for the different objective functions.
Table 2: Summary of different loss functions for XGB

|               | Least square | Exponential loss | Logistic loss |
|---------------|--------------|------------------|---------------|
| loss function | $(y - F(x))^2$ | $e^{-yF(x)}$ | $\log(1 + e^{-2yF(x)})$ |
| $g_{im}$ | $2(F_{(m-1)}(x_i) - y_i)$ | $-y_i e^{-y_iF_{(m-1)}(x_i)}$ | $\frac{-2y_i}{1 + e^{2y_iF_{(m-1)}(x_i)}}$ |
| $h_{im}$ | $2$ | $e^{-y_iF_{(m-1)}(x_i)}$ | $\frac{4e^{-2y_iF_{(m-1)}(x_i)}}{\left(1 + e^{-2y_iF_{(m-1)}(x_i)}\right)^2}$ |
| $w_{lm}$ | $\frac{\sum_{x_i\in R_{lm}} 2(y_i - F_{(m-1)}(x_i))}{2N_l + \lambda}$ | $\frac{\sum_{x_i\in R_{lm}} y_i e^{-y_iF_{(m-1)}(x_i)}}{\sum_{x_i\in R_{lm}} e^{-y_iF_{(m-1)}(x_i)} + \lambda}$ | $\frac{\sum_{x_i\in R_{lm}} \frac{2y_i}{1 + e^{2y_iF_{(m-1)}(x_i)}}}{\sum_{x_i\in R_{lm}} \frac{4e^{-2y_iF_{(m-1)}(x_i)}}{(1 + e^{-2y_iF_{(m-1)}(x_i)})^2} + \lambda}$ |
3 Measurement and Experiment
3.1 Measurement of goodness of the classifier
In general, the goodness of a model can be measured by accuracy, the ratio between the number of correct predictions and the total count. But in some specific cases accuracy cannot truly represent the model's goodness, such as quality-test prediction for a high-yield product, where "Fail" samples are always rare compared to "Pass" samples. In the dataset used for demonstration in this thesis, the failed samples of the quality test are less than 1% of all products, so the two classes have a big disparity. A model trained on such an imbalanced dataset will tend to predict the majority class, yet it can still reach high accuracy, which varies with the class ratio of the given test set. For example, if the ratio between the two classes (Fail/Pass) is 0.01 (1:100), the model will be prone to predict "Pass", and the accuracy is higher than 99% when the class distribution of the test set is the same as that of the training set. But with a test set containing more failure cases, this model cannot predict precisely. To effectively measure the goodness of a model learned from an imbalanced two-class dataset, the area under the ROC curve will be used throughout this thesis.
3.1.1 ROC curve
The ROC (receiver operating characteristic) curve is a curve for measuring the goodness of a binary classifier. Its $x$-axis and $y$-axis represent the false positive rate and the true positive rate respectively. The area under the curve, called "AUC", can serve as an index of the goodness of the model. In binary classification we define the two classes as "+" and "-", and the combination of predicted result and true class label has 4 different cases: TP (true positive), FP (false positive), TN (true negative) and FN (false negative); this can be represented by the confusion matrix in Table 3. True positive means a condition-positive case is predicted as positive, and false positive means a negative case is predicted as positive. To draw the ROC curve, we adjust the threshold and obtain different pairs of false positive rate and true positive rate (FPR, TPR), defined by equations (39) and (40). For example, if an instance's probability of being "+" is $p$ and $p >$ threshold, it is classified to the positive class. Thus, the higher the threshold, the smaller the false positive rate.
Figure 1: Receiver operating characteristics curve
| Confusion matrix | Predicted "+" | Predicted "-" |
|------------------|---------------|---------------|
| Condition "+" | True positive (TP) | False negative (FN) |
| Condition "-" | False positive (FP) | True negative (TN) |

Table 3: Confusion matrix obtained under a certain threshold.
\[
TPR = \frac{TP}{TP + FN} \tag{39}
\]
\[
FPR = \frac{FP}{FP + TN} \tag{40}
\]
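Equations (39) and (40) amount to the following small computation; scores, labels and the threshold below are toy values, and the function name is illustrative:

```python
def roc_point(scores, labels, threshold):
    """One (FPR, TPR) point of the ROC curve at a given threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 1)
    fn = sum(1 for s, y in zip(scores, labels) if s <= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == -1)
    tn = sum(1 for s, y in zip(scores, labels) if s <= threshold and y == -1)
    return fp / (fp + tn), tp / (tp + fn)

# sweeping the threshold traces out the ROC curve
print(roc_point([0.9, 0.8, 0.3, 0.1], [1, -1, 1, -1], 0.5))
```

In practice scikit-learn's `roc_curve` performs this sweep over all thresholds at once; the sketch just makes equations (39)-(40) concrete.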
3.1.2 Sigmoid function
The curve of the sigmoid function, an antisymmetric function, looks like an "S" as in Fig 2. A special case of the sigmoid function is the logistic function, which maps a real number to [0, 1]:
\[
S(x) = \frac{1}{1 + e^{-x}}. \tag{41}
\]
Some boosting methods return the sign of the ensemble classifier $F(x)$ as the prediction result, where $F(x)$ is an additive logistic function. In this sense, the threshold has already been chosen as 0.5: as we can see in Fig 2, $S(x) > 0.5$ when $x$ is positive. To cut different thresholds for drawing the ROC curve, the sigmoid function will be used in this thesis to map $F$ to [0, 1]. Discrete adaboost is not suitable for this, because after mapping by the sigmoid function the value would represent accuracy (1 − error) rather than a class probability. Basically, this mapping will be used only when the objective function is exponential or logistic, as in real adaboost, gradient boosting and extreme gradient boosting.
Figure 2: Sigmoid function maps the score to [0,1]
3.2 Data preparation
The original dataset includes three parts: categorical, numerical and timestamp data. The three datasets together have 4267 columns × 1183748 rows. Because of the limited computational ability of a laptop, only numerical and time-related data from a partial process flow will be used for the later demonstration. Before starting to train the model, I construct three training sets as in Table 4. The first training set is constructed by cutting a partial consecutive process flow and leaving out the rows with missing values. The dataset does not reveal the real physical meaning of each feature in the numerical data, but there is still some useful information, such as the process sequence of each product through the machines. Some processes can be done on more than one machine. For example, in Fig 3, B1 and B2 are equivalent machines; at this station a product can be processed either on B1 or on B2, except for some special cases needing rework. In the first training set, data with the same feature from equivalent machines were put into different columns, so I merged them into the same column. The second training set is constructed by adding extra features to the first training set: the machine number and the machine's idling time before the product starts processing. If the product passes through B1, the added feature is B1. It is possible that a machine has a long queueing time before a product comes; the machine's idling time is the time difference between two consecutive products processed on that machine. There are many missing values in each column, which means the measurement sampling rate is not 100%. For the third training set, two consecutive partial process flows are cut and the missing values are kept; XGB is used later for its demonstration, since among all the packages only XGB supports cells with missing values.
| Training set | # of rows | # of columns | Note |
|--------------|-----------|--------------|------|
| 1 | 80000 | 42 | Partial process flow without missing data |
| 2 | 80000 | 56 | Partial process flow and extra added features |
| 3 | 22000 | 267 | Partial process flow with missing data |

Table 4: Size of training sets
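The machine-idling-time feature described above can be illustrated with a small sketch. Machine names and timestamps here are invented (the real data uses anonymized station columns), and the function name is our own:

```python
def idle_times(records):
    """records: list of (machine, start_time) pairs sorted by start_time.

    Returns, per product, the gap since the previous product was
    processed on the same machine (0.0 for a machine's first product).
    """
    last_seen = {}
    out = []
    for machine, t in records:
        out.append(t - last_seen[machine] if machine in last_seen else 0.0)
        last_seen[machine] = t
    return out

# B1 sits idle from t=0.0 until its next product at t=5.0
print(idle_times([("B1", 0.0), ("B2", 1.0), ("B2", 2.0), ("B1", 5.0)]))
```

A long idle time may correlate with machine state (e.g. cooling down), which is why it is plausible as an extra feature.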
Figure 3: Partial process flow
3.3 Implementation
The demonstration is implemented on an ASUS laptop under Windows 10, a 64-bit system with 4 GB RAM and an x64 processor. I code in Python, and the programming part is executed in Jupyter Notebook, an interactive platform supporting multiple programming languages in which code can be executed in a single cell independently. The version of Python is 3.3 and that of Jupyter Notebook is 4.1.
Compared to the original dataset, the training set used for demonstration is very small, so the prediction results are not as good as the higher-ranked results in the competition. The purpose of this thesis is not high accuracy but a demonstration of the procedure. Programming uses the scikit-learn package in Python, where most of the methods can be found, such as adaboost and gradient boosting; extreme gradient boosting is provided by the separate xgboost package with a scikit-learn-compatible interface.
To build the training model, the procedure includes two steps: (1) find the optimal parameters to train the model; (2) find an optimal threshold so that we can output the result of the prediction.
3.3.1 Optimization of parameters
In programming, the model is optimized under a given condition with fixed parameters. To find these optimal parameters before training the model, I took a line-search strategy, a method that looks for a local extremum by adjusting one specific parameter while the other parameters stay fixed. I will demonstrate the procedure using extreme gradient boosting with the third training set in Table 4, because XGB is the only one of the four methods able to handle a training set with missing data. The line-search strategy is shown in Figs 4 to 8. The order in which parameters are picked to maximize AUC does not matter.
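The line-search strategy can be sketched generically. Here `score` stands in for the cross-validated AUC of a model trained with the given parameters; in this sketch it is replaced by a toy function, and all names are illustrative:

```python
def line_search(params, name, grid, score):
    """Vary parameter `name` over `grid` with the others fixed;
    keep the value that maximizes `score`."""
    best_val, best_score = params[name], score(params)
    for v in grid:
        trial = dict(params, **{name: v})   # copy params, override one key
        s = score(trial)
        if s > best_score:
            best_val, best_score = v, s
    return best_val, best_score

# toy objective peaking at max_depth = 2, mimicking Fig 4
toy_score = lambda p: -(p["max_depth"] - 2) ** 2
print(line_search({"max_depth": 5, "eta": 0.1}, "max_depth", [1, 2, 3, 4], toy_score))
```

Repeating this for each parameter in turn, fixing the winner each time, reproduces the procedure of Figs 4 to 8; note it only finds a local optimum along each coordinate.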
Firstly, we fix all other parameters and find a local extremum by adjusting the maximal depth of the base classifier. Fig 4 shows that the local maximum of ROC accuracy occurs when max_depth = 2.
Figure 4: Default value: M = 100, λ = 1, γ = 0, objective=logistic, ν = 0.1
Secondly, fix max_depth = 2 and pick the number of base classifiers (n_estimators) for the line search, finding the value which maximizes the ROC accuracy while the other parameters stay fixed. In Fig 5, the result shows that the maximum AUC occurs at 103.
Figure 5: max_depth = 2, and default values λ = 1, γ = 0, objective = logistic, ν = 0.1
Figure 6: M = 103, max depth = 2, λ = 1, γ = 0, objective = logistic
Fig 6 shows that the learning rate ν reaches its maximum AUC at 0.1 and keeps trending down after 0.1; in empirical experience it lies between 0.01 and 0.2. γ is a parameter which only appears in XGB. Since the AUC does not change between 0.01 and 0.2 while programming, I took 0.2 as the increment of each iteration; the maximum value happens while γ is between 0 and 0.2. In practice it can happen that the optimal γ is either very close to zero or very far from zero, but the result in Fig 7 shows the model is almost the same as random guessing when γ is very large.
Figure 7: M = 103, max depth = 2, λ = 1 , objective = logistic
Figure 8: M = 103, max depth = 2, λ = 1 , objective = logistic
Figure 9: M = 103, max depth = 2, λ = 1 , objective = logistic
λ is also a parameter that only appears in XGB; it represents the coefficient of the L2-regularization.
Figure 10: M = 103, max depth = 2, γ = 0 , objective = logistic
Finally, we get the optimal condition with AUC 0.68. The optimal condition is shown in Table 9.
3.3.2 Optimal threshold
The ROC curve under the optimal condition is shown in Fig 11. To find an optimal threshold, we maximize the MCC (Matthews correlation coefficient), which strongly represents the correlation between the imbalanced two classes; its value is always between −1 and 1. It can be computed by equation (42) from the confusion matrix obtained at each threshold:
\[
MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{42}
\]
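Equation (42) and the threshold scan can be sketched as follows; the scores below are toy values (in the thesis they would come from the sigmoid-mapped ensemble), and the convention of returning MCC = 0 when the denominator vanishes is an assumption:

```python
import math

def mcc(tp, fp, tn, fn):
    # equation (42); 0 when the denominator is 0 (a common convention)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def best_threshold(scores, labels, thresholds):
    """Pick the threshold maximizing MCC over the candidate list."""
    best_t, best_m = None, -1.0
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s > t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s > t and y == -1)
        fn = sum(1 for s, y in zip(scores, labels) if s <= t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s <= t and y == -1)
        m = mcc(tp, fp, tn, fn)
        if m > best_m:
            best_t, best_m = t, m
    return best_t, best_m

print(best_threshold([0.9, 0.8, 0.2, 0.1], [1, 1, -1, -1], [0.05, 0.5, 0.95]))
```

Because MCC balances all four cells of the confusion matrix, it stays near zero for a trivial majority-class predictor, which is why it is the right criterion here instead of accuracy.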
Figure 11: ROC curve under optimal condition
Figure 12: Confusion matrix while threshold is equal to 0.0644
3.3.3 Regularization and over-fitting
The most significant difference of the XGB algorithm from the other boosting methods is the regularization term, which can prevent over-fitting. Figs 13 and 14 show that the model becomes slightly more stable when the number of estimators is larger than 100, while the training error continues to decay in the meantime. Fig 15 shows that not only the over-fitting but also the AUC improved when the regularization is taken into account.
Figure 13: XGB with regularization λ = 1, γ = 0.1
Figure 14: XGB without regularization λ = 0, γ = 0
Figure 15: Overfitting: XGB with regularization improved overfitting problem.
3.3.4 Model with different training set and algorithm
Tables 5, 6 and 7 show the optimal conditions and the AUC for models trained by a fixed algorithm on the different training sets. The best model is, not surprisingly, XGB trained on the third training set, because it has more information (features) than the other two training sets. The model trained on the second training set shows no improvement over those trained on the first set.
| Training set | max_depth | ν | max_features | M | AUC |
|--------------|-----------|------|--------------|----|-------|
| 1 | 3 | 0.06 | 30 | 16 | 0.612 |
| 2 | 3 | 0.06 | 46 | 16 | 0.628 |
| 3 | — | — | — | — | — |

Table 5: Optimal condition for the different training sets using the adaboost algorithm
| Training set | loss function | max_depth | ν | max_features | M | AUC |
|--------------|---------------|-----------|------|--------------|----|-------|
| 1 | exponential | 3 | 0.06 | 31 | 28 | 0.62 |
| 1 | logistic | 3 | 0.11 | 44 | 27 | 0.632 |
| 2 | exponential | 3 | 0.09 | 46 | 26 | 0.624 |
| 2 | logistic | 3 | 0.13 | 50 | 69 | 0.629 |
| 3 | — | — | — | — | — | — |

Table 6: Optimal condition for the different training sets using gradient boosting
| Training set | Objective | colsample_bytree | max_depth | ν | γ | λ | M | AUC |
|--------------|-----------|------------------|-----------|------|------|------|-----|-------|
| 1 | logistic | 1 | 3 | 0.12 | 7.4 | 0.99 | 49 | 0.617 |
| 2 | logistic | 1 | 4 | 0.1 | 2.25 | 1 | 75 | 0.608 |
| 3 | logistic | 1 | 2 | 0.1 | 0.1 | 0.06 | 103 | 0.676 |
| 1 | exponential | 1 | 6 | 0.1 | 0 | 3.96 | 37 | 0.613 |
| 2 | exponential | 1 | 6 | 0.1 | 0 | 1 | 88 | 0.61 |
| 3 | exponential | 0.5 | 6 | 0.03 | 0 | 1 | 29 | 0.658 |

Table 7: Optimal condition for the different training sets using XGB
3.4 Summary
Due to the limited computational ability of the hardware, the original dataset could not be fully used. The third training set contains only 260 features out of the 4127 features of the original data, yet the model trained on this comparatively small dataset already achieved an AUC of 0.68. In such a highly imbalanced classification problem, choosing an optimal threshold must be considered; otherwise all the data tend to be classified into the majority class if we simply take the sign of the model's output score as the prediction. XGB is quite suitable for handling a large dataset with many missing values. In the dataset from the Bosch company, if we only handle rows with complete information, many rows are deleted: for productivity reasons, the measurement sampling rate of each process is usually not 100%, so deleting rows with incomplete information leaves few rows when the sampling rate is very low. When doing a line search, it may not be easy to find a proper range for each parameter. Take γ as an example: the value of AUC shows no obvious variation between γ = 0 and γ = 4, and based on empirical experience the optimal value can occur either very close to or very far from zero. After optimizing the parameters, the optimal MCC we got is 0.15, which is only slightly better than random guessing, since the training set gave little information. The optimal MCC of the first rank in the competition is 0.52, which is also not very highly correlated. We also observed that both over-fitting and AUC improved when the regularization term was considered, that is, λ ≠ 0 and γ ≠ 0; this result matches the theoretical part. The models trained by the different methods on training set 2 are no better than those trained on training set 1; both training sets carry little information, but if we knew more about the mechanism or the real physical meaning of the features, it would be quite helpful for data preparation. Tables 5, 6 and 7 show the results of the models trained by the different algorithms and training sets; not surprisingly the results are not very competitive because of the limited information (features). In gradient boosting and extreme gradient boosting, only "exponential" and "logistic" were selected as objective functions, because it makes more sense to map their outputs to [0, 1] with the sigmoid function. No specific method is always the best, and in Kaggle competitions many participants won by combining many different methods rather than using a single algorithm.
References
[1] Mohri, Mehryar; Rostamizadeh, Afshin; Talwalkar, Ameet, Foundations of Machine Learning, 2012.
[2] Yoav Freund and Robert E. Schapire, A Short Introduction to Boosting, Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, 1999.
[3] Robert E. Schapire, Theoretical Views of Boosting and Applications, 1999.
[4] Friedman, J.; Hastie, T.; Tibshirani, R., Additive Logistic Regression: a Statistical View of Boosting, Annals of Statistics, 1998.
[5] Freund, Yoav and Schapire, Robert E., A Decision-theoretic Generalization of On-line Learning and an Application to Boosting, J. Comput. Syst. Sci., 1997, 119-139.
[6] Jerome H. Friedman, Stochastic Gradient Boosting, Computational Statistics and Data Analysis, 1999.
[7] Tianqi Chen and Carlos Guestrin, XGBoost: A Scalable Tree Boosting System, KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.