Application of
machine learning
in manufacturing industry
MSC Degree Thesis
Written by:
Hsinyi Lin
Master of Science in Mathematics
Supervisor:
Lukacs Andras
Institute of Mathematics
Eotvos Lorand University
Faculty of Science
Budapest
2017
ABSTRACT
In the high-tech manufacturing industry, a large amount of data is collected from the production line every day. These data can be measurement data, categorical data, or time-related data. Before a product is delivered to the customer, it has to pass a quality examination. The task in this thesis is to use a given training set, combined with different boosting algorithms that already exist in Python packages such as scikit-learn, to build a model which can predict the quality test result. Building such a model can be regarded as solving a highly imbalanced binary classification problem, since for a mature product the number of "Fail" cases is very scarce compared to the number of "Pass" cases.
Boosting is very commonly used for solving this type of problem in the Kaggle community. Its general idea is to combine many weak learners (base classifiers) sequentially into a strong learner. In programming, the model is trained under certain conditions, which means that some parameters are fixed; therefore, an understanding of each parameter is needed to find the optimal parameters. To optimize the model, we minimize an objective function that depends on the corresponding algorithm. Adaboost minimizes an upper bound on the empirical error. Logitboost minimizes the least-squares regression of a residual which is updated by Newton's method. Gradient boosting minimizes a user-defined differentiable convex loss function which represents the dissimilarity between the real class and the prediction result. Besides the loss function, XGB takes a regularization term into account to prevent over-fitting, and uses a second-order Taylor expansion to approximate the loss function. Solving the problem with the given training set includes two steps: (1) use line search to find the optimal parameters; (2) find the optimal threshold to decide the result of classification. Since the purpose of the thesis is to demonstrate the procedure, and the computational ability of a laptop is limited, only a very small part of the original data is taken as the training set. Thus, the model trained on this smaller, less informative training set is not as competitive as the higher-ranked models in the competition. While doing a line search to optimize the value of a specific parameter, it may not be easy to find a proper range of values; thus, it is quite important to draw on past empirical experience. In this thesis, several models with different objective functions and different training sets are built, but no specific method is always the best. In Kaggle competitions, most participants won by combining many methods rather than using a single one.
Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I
Table of contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . II
List of figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . IV
List of tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . V
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Boosting Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1 Binary classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 General idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.3 Adaptive boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.3.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3.2 Derivation of each step . . . . . . . . . . . . . . . . . . . . . . 5
2.4 Logit boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.4.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4.2 Derivation of each step . . . . . . . . . . . . . . . . . . . . . . 9
2.5 Gradient boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.5.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5.2 Derivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.6 XGB-Extreme Gradient Boosting . . . . . . . . . . . . . . . . . . . . 16
2.6.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.6.2 Derivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3 Measurement and Experiment . . . . . . . . . . . . . . . . . . . . . . 23
3.1 Measurement of goodness of the classifier . . . . . . . . . . . . . . . . 23
3.1.1 ROC curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1.2 sigmoid function . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2 Data preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3.1 Optimization of parameters . . . . . . . . . . . . . . . . . . . 27
3.3.2 Optimal threshold . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3.3 Regularization and over-fitting . . . . . . . . . . . . . . . . . . 33
3.3.4 Model with different training set and algorithm . . . . . . . . 35
3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . VI
List of Figures
1 Receiver operating characteristics curve . . . . . . . . . . . . . . . . . 24
2 Sigmoid function maps the score to [0,1] . . . . . . . . . . . . . . . . 25
3 Partial process flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4 Default value: M = 100, λ = 1, γ = 0, objective=logistic, ν = 0.1 . . 28
5 max depth = 2, and default value λ = 1, γ = 0, objective = logistic,
ν = 0.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
6 M = 103, max depth = 2, λ = 1, γ = 0, objective = logistic . . . . . 29
7 M = 103, max depth = 2, λ = 1 , objective = logistic . . . . . . . . . 30
8 M = 103, max depth = 2, λ = 1 , objective = logistic . . . . . . . . . 30
9 M = 103, max depth = 2, λ = 1 , objective = logistic . . . . . . . . . 31
10 M = 103, max depth = 2, γ = 0 , objective = logistic . . . . . . . . . 31
11 ROC curve under optimal condition . . . . . . . . . . . . . . . . . . . 32
12 Confusion matrix while threshold is equal to 0.0644 . . . . . . . . . . 33
13 XGB with regularization λ = 1, γ = 0.1 . . . . . . . . . . . . . . . . . 34
14 XGB without regularization λ = 0, γ = 0 . . . . . . . . . . . . . . . . 34
15 Overfitting: XGB with regularization improved overfitting problem. . 35
List of Tables
1 Summary of different loss functions for gradient boosting . . . . . . 16
2 Summary of different loss functions for XGB . . . . . . . . . . . . . 22
3 Confusion matrix obtained under a certain threshold . . . . . . . . . 24
4 Size of training set . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5 Optimal condition for different given training sets using the adaboost
algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
6 Optimal condition for different given training sets using gradient
boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
7 Optimal condition for different given training sets using XGB . . . . 36
1 Introduction
1.1 Motivation
A few years of working experience in the high-tech manufacturing industry made me realize how fast data are generated on a production line. The limited capacity of the repository forces old data to be purged while new data are being generated. It is a tough task to analyse all the collected data and extract useful information from such a big data set with traditional statistical methods before the data are purged.
In order to use all the collected data effectively, more and more companies have started to introduce machine learning into their analytical tools. As we can see, many companies have recently provided their data and held competitions on Kaggle, the largest community platform for people working in fields related to machine learning [1]. By holding these competitions, they can draw on the innovative and efficient methods proposed by participants from all over the world.
In the manufacturing industry, one of the typical questions is how to predict quality test results based on past data. The quality test result decides whether the company can gain a benefit from the product, because every product needs to pass all the quality tests before being delivered to customers. If the result can be predicted precisely from a built model and certain given features, it helps the company save cost: if the predicted result is "Fail", we can simply scrap the product and stop the remaining process steps, because of the high risk of the product being scrapped anyway after finishing the whole procedure. Such a model also tells us which features are important for the prediction, so we can focus on the yield improvement of certain process steps.
To build a model for this binary classification problem, we introduce boosting methods, which are the ones most commonly used by the Kaggle community; most participants won their competitions using models built on the idea of boosting. Many kinds of boosting methods have been proposed, such as adaboost, logitboost, gradient boosting and extreme gradient boosting, and several of them are implemented in Python packages.
1.2 Outline
This thesis is organized in the following way. In Chapter 2, the general idea of boosting and four particular boosting methods, adaboost, logitboost, gradient boosting and extreme gradient boosting, are introduced. For each method, the algorithm, the underlying concept and the derivation are given in detail. The idea of the loss function, and the regularization term which appears in extreme gradient boosting, are also shown. In Chapter 3, first the ROC curve and the sigmoid function are introduced for evaluating the goodness of the model. Data preparation and the implementation environment are also presented before the programming part. The demonstrated procedure for solving the highly imbalanced classification problem with the extreme gradient boosting implementation available as a Python package consists of two steps: (1) use a line search strategy to find the optimal condition; (2) find the threshold which maximizes the Matthews correlation coefficient. After demonstrating the procedure, I show how taking the regularization into account improves the overfitting problem. At the end of the chapter, I present the results of models trained by different boosting algorithms on different training sets and give a conclusion for Chapter 3.
2 Boosting Methods
2.1 Binary classification
Classification is one of the supervised learning problem in machine learning. The
task is to find a model F to fit given labeled training set xi, yi ∀i = 1, 2, · · · , N , and
use it to predict a set with unknown class label. The labeled training set include
features xi and class label yi. The problem of predicting quality test results can be
regarded as a highly imbalanced binary classification problem because the results
of prediction have only two possibilities, – ”Pass” and ”Fail”, and non-uniform
distribution for the big disparity between two classes. In practical application, the
labels of two classes are usually encoded to -1/1 or 0/1 depend on the method being
used. In manufacturing industry, features can be any measured data during process
or time values. To formulize the problem, we denote the model F : x → F (x)
where F (x) is the prediction with given features x. The goal of finding optimal F
is equivalent to find the model which minimizes the dissimilarity between yi and
F (xi).
2.2 General idea
The general idea of boosting methods [2] [3] [4] is to sequentially ensemble many
weak learners(base classifiers) fm which can predict slightly better than flipping
uniform coin(random guessing) to a strong classifier F . In most boosting methods
use CART(classification and regression trees) as based classifiers. In general, to
find a optimal classifier in mth iteration is equivalent to find a fm which minimize
certain given objective function. In adaboost, the goal is to minimize the upper
bound of the empirical error of the ensemble classifier [5]. Since its empirical error
is bounded by exponential loss, it’s same as taking exponential loss as objective
function. In logitboost, it minimizes the least square of residual which estimated by
Newton’s update. The goal of gradient boosting is to minimize loss function which
can be interpreted as the dissimilarity between real class label y and predicted value
F (x). The loss function can be any differentiable convex function. Unlike gradient
boosting, the objective function of extreme gradient boosting has one more term,
that is regularization, which can prevent overfitting effectively.
2.3 Adaptive boosting
Adaboost was once the most popular boosting method until extreme gradient
boosting appeared. It combines many weak classifiers which are trained by weighted
3
training set sequentially to get a strong classifier. Initially, every sample in training
set is given an equal weight. In each stage, we increase the weight to misclassified
samples of predicting by ensemble classifiers and get a new weighted training set to
train next weak learner. It emphasizes the importance of those misclassified instances
so that next weak classifier is able to focus on fitting those misclassified samples.
Because the summation of the weight of each sample is equal to one, the weight of
those samples which are classified correctly will also be decreased. In general, using
stumps which are the simplest decision tree only with one node and two leaves as a
weak classifier already get quite good accuracy and efficiency. The commonly used
version of adaboost is discrete adaboost which was first time introduced in [Freund
and Schapire (1996b)]. It was developed to solve binary classification problem with
class label yi ∈ −1, 1, ∀i. The real adaboost came up two years later.
2.3.1 Algorithm
The following two algorithms are variants of adaboost. Algorithm 1 is the discrete version, presented in 1996 by Freund and Schapire, and Algorithm 2 is real adaboost. They differ slightly in how the value added at the mth iteration is calculated. Discrete adaboost uses a weak classifier f_m \in \{-1, 1\} and estimates the value from the weighted training error, while real adaboost uses a weak classifier f_m which returns a real number and estimates the value from the class probability. Both algorithms increase the weights of the misclassified samples by multiplying with an exponential factor. A detailed explanation is given in the derivation part.
Algorithm 1 Discrete adaboost algorithm
1: Initialize w_i \leftarrow \frac{1}{N}, i = 1, 2, \ldots, N   ▷ equal initial weight for each sample
2: while m \le M do
3:    Fit a classifier f_m(x) \in \{-1, 1\}   ▷ f_m minimizes the weighted training error
4:    Compute \varepsilon_m = E_w[1(y \neq f_m(x))]   ▷ weighted training error of f_m
5:    c_m = \log\frac{1 - \varepsilon_m}{\varepsilon_m}   ▷ coefficient of the weak learner of the current iteration
6:    w_i \leftarrow w_i e^{c_m 1(y_i \neq f_m(x_i))}, i = 1, 2, \ldots, N   ▷ increase the weights of the misclassified samples
7:    Re-normalize such that \sum_{i=1}^{N} w_i = 1   ▷ re-normalize all the weights
8: end while
9: return \mathrm{sign}[F(x)], where F(x) = \sum_{m=1}^{M} c_m f_m(x)   ▷ output the sign of the ensemble result

w_i: weight of sample i
f_m: optimal classifier of the mth iteration
\varepsilon_m: weighted training error of f_m
c_m: coefficient of the classifier f_m in the ensemble
F: final ensemble classifier
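To make the update rules concrete, here is a minimal NumPy sketch of Algorithm 1 with decision stumps on a toy one-dimensional data set; all data, names and the number of rounds are illustrative.

```python
import numpy as np

def fit_stump(x, y, w):
    """Return the threshold/polarity stump with minimum weighted error."""
    best = (0.0, 1, np.inf)  # (threshold, polarity, weighted error)
    for t in np.unique(x):
        for pol in (1, -1):
            pred = np.where(x < t, pol, -pol)
            err = w[pred != y].sum()
            if err < best[2]:
                best = (t, pol, err)
    return best

def discrete_adaboost(x, y, M=20):
    N = len(y)
    w = np.full(N, 1.0 / N)                      # step 1: equal weights
    stumps, coefs = [], []
    for _ in range(M):
        t, pol, eps = fit_stump(x, y, w)         # step 3: best weak learner
        eps = np.clip(eps, 1e-12, 1 - 1e-12)
        c = np.log((1 - eps) / eps)              # step 5: coefficient c_m
        pred = np.where(x < t, pol, -pol)
        w = w * np.exp(c * (pred != y))          # step 6: reweight errors
        w = w / w.sum()                          # step 7: re-normalize
        stumps.append((t, pol))
        coefs.append(c)
    def F(xq):                                   # step 9: sign of the ensemble
        s = sum(c * np.where(xq < t, p, -p)
                for (t, p), c in zip(stumps, coefs))
        return np.sign(s)
    return F

# Toy data: y = 1 exactly on [2, 5); no single stump separates it
x = np.arange(8.0)
y = np.array([-1, -1, 1, 1, 1, -1, -1, -1])
F = discrete_adaboost(x, y)
print((F(x) == y).mean())
```

A single stump cannot represent the interval [2, 5), but after a few rounds the weighted combination of stumps classifies the toy set correctly, illustrating how weak learners ensemble into a strong one.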
Algorithm 2 Real adaboost algorithm
1: Initialize w_i \leftarrow \frac{1}{N}, i = 1, 2, \ldots, N   ▷ equal initial weight for each sample
2: while m \le M do
3:    Find the optimal partition and get the class probability p_m(y = 1|x)   ▷ train a classifier with minimum weighted training error and get its class probability
4:    Compute f_m(x) \leftarrow \frac{1}{2} \ln \frac{p_m(y = 1|x)}{1 - p_m(y = 1|x)}   ▷ update the score of each leaf
5:    w_i \leftarrow w_i e^{-y_i f_m(x_i)}, i = 1, 2, \ldots, N   ▷ update the weight of each sample
6:    Re-normalize such that \sum_{i=1}^{N} w_i = 1   ▷ re-normalize all the weights
7: end while
8: return \mathrm{sign}[F(x)], where F(x) = \sum_{m=1}^{M} f_m(x)   ▷ output the sign of the ensemble result

w_i: weight of sample i
p_m(y = 1|x): class probability of each leaf at the mth iteration
f_m: classifier with the optimal score at the mth iteration
F: final ensemble classifier
2.3.2 Derivation of each step
Firstly, it sets equal weight to each sample w0(i) = 1/N , where N is the number of all
samples.
Secondly, it comes the iteration step m. There are three steps at mth stage in discrete
adaboost. It includes (1) Find a optimal classifier fm with minimal weighted training
error. (2) Find the optimal coefficient cm of weak classifier fm. (3) Update weight for
next iteration. In practical case, we usually find a classifier fm(x) with minimum weighted
training error under certain given condition which can also be optimized by doing line-
search. For example, the most popular condition is to set the base classifier as stump,
which is the simplest decision tree for its efficiency and accuracy of final classifier.
f_m(x) = \arg\min_{f} E_w[I(y \neq f(x))] = \arg\min_{f} \sum_{i=1}^{N} w_m(i) I(y_i \neq f(x_i))

Next we have to find the optimal c_m which minimizes the empirical training error \frac{1}{N} \sum_{i=1}^{N} I(y_i F(x_i) < 0) while f_m is known. To minimize the empirical training error, we consider minimizing its upper bound. Since

I(y_i \neq F(x_i)) \le e^{-k y_i F(x_i)} \quad \forall k > 0,

the objective function becomes \frac{1}{N} \sum_{i=1}^{N} e^{-k y_i F(x_i)}. According to the definition in the algorithm, the weights of the misclassified samples are increased. In fact, the weights of the correctly classified samples are also reduced, because of the re-normalization. The updated weight for the next iteration can be expressed as

w_{m+1}(i) = \frac{w_m(i)\, e^{c_m I(y_i \neq f_m(x_i))}}{\sum_{j=1}^{N} w_m(j)\, e^{c_m I(y_j \neq f_m(x_j))}}   (1)

The denominator \sum_{i=1}^{N} w_m(i) e^{c_m I(y_i \neq f_m(x_i))} is the normalization term, and it can be simplified to (1 - \varepsilon_m) + \varepsilon_m e^{c_m} by the following derivation:

\sum_{i=1}^{N} w_m(i)\, e^{c_m I(y_i \neq f_m(x_i))}
  = \sum_{y_i = f_m(x_i)} w_m(i)\, e^{0} + \sum_{y_i \neq f_m(x_i)} w_m(i)\, e^{c_m}
  = \sum_{y_i = f_m(x_i)} w_m(i) + e^{c_m} \sum_{y_i \neq f_m(x_i)} w_m(i)
  = (1 - \varepsilon_m) + \varepsilon_m e^{c_m},

where the weighted training error is \varepsilon_m = \sum_{y_i \neq f_m(x_i)} w_m(i).

When y_i = f_m(x_i), which is the same as y_i f_m(x_i) = 1,

w_{m+1}(i) = \frac{w_m(i)}{(1 - \varepsilon_m) + \varepsilon_m e^{c_m}} = \frac{w_m(i)\, e^{-c_m/2}}{(1 - \varepsilon_m) e^{-c_m/2} + \varepsilon_m e^{c_m/2}}.   (2)

Otherwise, y_i \neq f_m(x_i) (or y_i f_m(x_i) = -1), and

w_{m+1}(i) = \frac{w_m(i)\, e^{c_m}}{(1 - \varepsilon_m) + \varepsilon_m e^{c_m}} = \frac{w_m(i)\, e^{c_m/2}}{(1 - \varepsilon_m) e^{-c_m/2} + \varepsilon_m e^{c_m/2}}.   (3)

We can combine equations (2) and (3) into a single equation:

w_{m+1}(i) = \frac{w_m(i)\, e^{-c_m y_i f_m(x_i)/2}}{Z_m},   (4)

where Z_m = (1 - \varepsilon_m) e^{-c_m/2} + \varepsilon_m e^{c_m/2}. By equation (4), we get

e^{-c_m y_i f_m(x_i)/2} = \frac{w_{m+1}(i)}{w_m(i)}\, Z_m.   (5)
By (5), the upper bound of the empirical error can be written as

\frac{1}{N} \sum_{i=1}^{N} e^{-y_i F(x_i)/2} = \prod_{m=1}^{M} Z_m   (6)

when choosing k = \frac{1}{2}. The derivation is as follows:

\frac{1}{N} \sum_{i=1}^{N} e^{-y_i F(x_i)/2}
  = \frac{1}{N} \sum_{i=1}^{N} e^{-y_i \sum_{m=1}^{M} c_m f_m(x_i)/2}
  = \frac{1}{N} \sum_{i=1}^{N} \prod_{m=1}^{M} e^{-y_i c_m f_m(x_i)/2}
  = \frac{1}{N} \sum_{i=1}^{N} \prod_{m=1}^{M} \frac{w_{m+1}(i)}{w_m(i)}\, Z_m
  = \prod_{m=1}^{M} Z_m,

using \sum_{i=1}^{N} w_{m+1}(i) = 1 and w_1(i) = \frac{1}{N}.

Thus, by equation (6), minimizing the upper bound of the empirical error is equivalent to minimizing Z_m at each iteration. The minimum of Z_m occurs when \frac{\partial Z_m}{\partial c_m} = 0:

\frac{\partial Z_m}{\partial c_m} = -\frac{1}{2}(1 - \varepsilon_m) e^{-c_m/2} + \frac{1}{2}\varepsilon_m e^{c_m/2} = 0
\Rightarrow e^{c_m} = \frac{1 - \varepsilon_m}{\varepsilon_m}
\Rightarrow c_m = \log\frac{1 - \varepsilon_m}{\varepsilon_m}   (7)

And, with c_m = \log\frac{1 - \varepsilon_m}{\varepsilon_m},

Z_m = (1 - \varepsilon_m)\, e^{-\frac{1}{2}\log\frac{1 - \varepsilon_m}{\varepsilon_m}} + \varepsilon_m\, e^{\frac{1}{2}\log\frac{1 - \varepsilon_m}{\varepsilon_m}}
    = (1 - \varepsilon_m)\sqrt{\frac{\varepsilon_m}{1 - \varepsilon_m}} + \varepsilon_m\sqrt{\frac{1 - \varepsilon_m}{\varepsilon_m}}
    = 2\sqrt{\varepsilon_m(1 - \varepsilon_m)}.

Thus, the empirical error is bounded:

\frac{1}{N} \sum_{i=1}^{N} I(y_i F(x_i) < 0) \le \prod_{m=1}^{M} 2\sqrt{\varepsilon_m(1 - \varepsilon_m)}   (8)
This result can be interpreted as follows. When c_m = 0, the classifier f_m is equivalent to random guessing, because its weighted training error is \varepsilon_m = 0.5, so it is useless for the final prediction. c_m > 0 means that f_m predicts correctly more often than not (\varepsilon_m < 0.5), so we multiply it by a positive coefficient to emphasize its positive influence on the ensemble classifier. If c_m < 0, then, not surprisingly, f_m is wrong more often than not (\varepsilon_m > 0.5); by reversing its prediction, it still helps the ensemble prediction.
In real adaboost, f_m returns a real number rather than a value in \{-1, 1\}. To find the optimal value of f_m(x), note that the weighted training error satisfies

E_w[1(y f_m(x) < 0)] \le E_w[e^{-y f_m(x)}],

so minimizing the weighted training error is equivalent to minimizing its upper bound E_w[e^{-y f_m(x)}]:

E_w[e^{-y f_m(x)}] = P_w(y = 1|x)\, e^{-f_m(x)} + P_w(y = -1|x)\, e^{f_m(x)}

\frac{\partial E_w[e^{-y f_m(x)}]}{\partial f_m(x)} = -P_w(y = 1|x)\, e^{-f_m(x)} + P_w(y = -1|x)\, e^{f_m(x)}

\frac{\partial E_w[e^{-y f_m(x)}]}{\partial f_m(x)} = 0 \iff f_m(x) = \frac{1}{2}\log\frac{P_w(y = 1|x)}{P_w(y = -1|x)}

Hence, the optimal base classifier f_m is half the logarithm of the ratio of the two class probabilities.

Lastly, both algorithms output the sign of the ensemble value.
2.4 Logit boosting
Logitboost is a method used only for binary classification problems with class labels y^* \in \{0, 1\}, while the other methods use class labels y \in \{-1, 1\}. It fits an additive symmetric logistic likelihood function using an adaptive Newton approach. At every iteration step, a working response z_i, \forall i = 1, 2, \ldots, N, is calculated for each sample from its y_i and the estimated class probability. The working response can be regarded as an approximate residual between the real class label and the ensemble value; its formula is derived using Newton's method. The optimal regression tree f_m is obtained by minimizing a weighted least-squares regression of the z_i. In the last step, the algorithm outputs the sign of the ensemble value.
2.4.1 Algorithm
Algorithm 3 Logit boosting algorithm
1: Initialize w_i \leftarrow \frac{1}{N}, p_0(x_i) = \frac{1}{2}, i = 1, 2, \ldots, N, and F_0(x_i) = 0   ▷ equal initial weight for each sample, and the initial score of the single-node tree
2: while m \le M do
3:    z_m(i) = \frac{y^*_i - p_{m-1}(x_i)}{p_{m-1}(x_i)(1 - p_{m-1}(x_i))}   ▷ compute the working response z
4:    w_m(i) = p_{m-1}(x_i)(1 - p_{m-1}(x_i))   ▷ update the weight of each instance
5:    Fit f_m(x) by minimizing E_w[(f_m(x) - z)^2]   ▷ weighted least-squares regression of the z_m(i) using the weights w_m(i)
6:    F_m(x) = F_{m-1}(x) + \frac{1}{2} f_m(x), and p_m(x) = \frac{e^{F_m(x)}}{e^{F_m(x)} + e^{-F_m(x)}}   ▷ estimate the class probability for computing the next working response
7: end while
8: return \mathrm{sign}[F(x)]   ▷ output the sign of the ensemble result

w_i: initial weight of sample i
w_m(i): weight of sample i, updated using the previous class probability
z_m(i): the Newton update, or estimated residual
f_m: optimal regressor minimizing the weighted least-squares error of the z_i
F: final ensemble classifier
2.4.2 Derivation of each step
Firstly, equal weight \frac{1}{N} is given to every training instance, and F(x) = 0, p(x_i) = \frac{1}{2} are initialized, which means that the probability of y = 1 for every instance is the same as random guessing.

In the mth iteration, the goal is to fit f_m by minimizing the weighted least-squares regression of the z_i; F(x) is gradually improved by adding f(x) after each iteration. Here z_m(i) is regarded as a residual (a Newton update) relative to the previously estimated probability of y = 1 given x = x_i. The questions we are most concerned with are: why is

z_m(i) = \frac{y^*_i - p_{m-1}(x_i)}{p_{m-1}(x_i)(1 - p_{m-1}(x_i))},

and how can an optimal f(x) be found which maximizes the logistic likelihood at the next step?

The logistic likelihood function is denoted as

l(p(x)) = y^* \log p(x) + (1 - y^*) \log(1 - p(x)),
where

p(x) = \frac{e^{F(x)}}{e^{F(x)} + e^{-F(x)}} = \frac{1}{e^{-2F(x)} + 1}   (9)

\Rightarrow l(p(x)) = l(F(x)) = \log\left[\left(\frac{e^{F(x)}}{e^{F(x)} + e^{-F(x)}}\right)^{y^*}\left(\frac{e^{-F(x)}}{e^{F(x)} + e^{-F(x)}}\right)^{1-y^*}\right] = 2y^* F(x) - \log(1 + e^{2F(x)})

To find the optimal f(x) for the current iteration, we need to maximize the expected logistic likelihood:

f(x) = \arg\max_{f} E[l(F(x) + f(x))],

which occurs when \frac{\partial E[l(F(x) + f(x))]}{\partial f(x)} = 0. Here we denote

g(F(x) + f(x)) = \frac{\partial E[l(F(x) + f(x))]}{\partial f(x)} = E\left[2y^* - \frac{2}{1 + e^{-2(F(x) + f(x))}}\right]   (10)

and

h(F(x) + f(x)) = \frac{\partial g(F(x) + f(x))}{\partial f(x)} = E\left[-\frac{4 e^{-2(F(x) + f(x))}}{(1 + e^{-2(F(x) + f(x))})^2}\right].   (11)

Thus the problem can be reduced to solving g(f) = 0. Logitboost uses Newton's method to find an approximate solution.

Newton's method finds approximate roots r of a function q(r) such that q(r) \approx 0. By Taylor expansion,

q(r) = q(r_0) + q'(r_0)(r - r_0) + O((r - r_0)^2).

It may not be easy to find the root of q(r) = 0 directly, but we can find an r such that 0 < |q(r)| < |q(r_0)|, which means that r is closer to the root than r_0, and repeat this step until q(r) \approx 0. If r_0 is already a point near the root, and

q(r) \approx q(r_0) + q'(r_0)(r - r_0) \approx 0
\Rightarrow r \approx r_0 - \frac{q(r_0)}{q'(r_0)},   (12)

then r is a better approximation than r_0. To solve g(F + f) = 0, we can consider r_0 = F(x) and r = F(x) + f(x). By (10), (11) and (12),

F(x) + f(x) \approx F(x) - \left.\frac{g(F(x) + f(x))}{h(F(x) + f(x))}\right|_{f(x) = 0}
F(x) + f(x) \approx F(x) - \frac{g(F(x))}{h(F(x))}   (13)

Since

\frac{g(F(x))}{h(F(x))} = \frac{E\left[2y^* - \frac{2}{1 + e^{-2F(x)}} \mid x\right]}{E\left[-\frac{4 e^{-2F(x)}}{(1 + e^{-2F(x)})^2} \mid x\right]} = -\frac{E\left[y^* - \frac{1}{1 + e^{-2F(x)}} \mid x\right]}{2 E\left[\frac{e^{-2F(x)}}{1 + e^{-2F(x)}} \cdot \frac{1}{1 + e^{-2F(x)}} \mid x\right]},

and by (9), we get

\frac{g(F(x))}{h(F(x))} = -\frac{1}{2} E\left[\frac{y^* - p(x)}{(1 - p(x)) p(x)} \mid x\right].   (14)

Then (13) becomes

F(x) + f(x) \approx F(x) + \frac{1}{2} E\left[\frac{y^* - p(x)}{(1 - p(x)) p(x)} \mid x\right],   (15)

so the update can be fitted by the weighted least-squares problem

f(x) = \arg\min_{f} E_w\left[\left(f(x) - \frac{1}{2} \cdot \frac{y^* - p(x)}{(1 - p(x)) p(x)}\right)^2\right].   (16)

Thus we denote z_m(i) = \frac{y^*_i - p_{m-1}(x_i)}{(1 - p_{m-1}(x_i)) p_{m-1}(x_i)}, and minimize the weighted least-squares error to z to find the optimal f_m(x).
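The working response z, the weights w and the half-step update above can be demonstrated numerically. The sketch below runs a few logitboost-style iterations with a single fixed stump split as the base learner; the data, the split point and the number of iterations are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
x = rng.normal(size=N)
y_star = (x + 0.3 * rng.normal(size=N) > 0).astype(float)  # labels in {0, 1}

F = np.zeros(N)          # F_0(x) = 0
p = np.full(N, 0.5)      # p_0(x) = 1/2

for m in range(10):
    z = (y_star - p) / (p * (1 - p))   # working response (Newton update)
    w = p * (1 - p)                    # weight of each instance
    left = x < 0.0                     # fixed, illustrative stump split
    # weighted least-squares fit of z with a stump = weighted mean per region
    f = np.where(left,
                 np.average(z[left], weights=w[left]),
                 np.average(z[~left], weights=w[~left]))
    F = F + 0.5 * f                                  # F_m = F_{m-1} + f_m / 2
    p = np.exp(F) / (np.exp(F) + np.exp(-F))         # updated class probability

pred = (p > 0.5).astype(float)
print((pred == y_star).mean())
```

After a few Newton steps, the per-region probability p converges toward the empirical frequency of y^* = 1 in each region, which is exactly the fixed point of the working-response update.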
2.5 Gradient boosting
Gradient boosting [6] [8] is a method built on the idea of function estimation. The goal is to find an additive function which fits the training data. In gradient boosting, the objective function is a loss function which can be interpreted as the dissimilarity between the real class label and the predicted value of the additive function. In every iteration, the negative gradient of the loss function is used to approximate the residual of the previous iteration, and the goal is to find a regression tree or classifier f, with the score (function value) of each of its leaves, which minimizes the sum over all instances of the loss L(y, F_{m-1}(x) + f_m(x)). Because a gradient has to be computed, the loss function is required to be a differentiable convex function. In a binary classification problem with class labels y \in \{-1, 1\}, the sign of the final ensemble score is output as the prediction.
2.5.1 Algorithm
Algorithm 4 Gradient boosting algorithm
1: F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} \psi(y_i, \gamma)   ▷ give an initial value to F
2: while m \le M do
3:    \tilde{y}_{im} = -\left[\frac{\partial \psi(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x_i) = F_{m-1}(x_i)}   ▷ calculate the negative gradient of the loss function
4:    \{R_{lm}\}_{l=1}^{L} = L disjoint regions trained on \{(x_i, \tilde{y}_{im})\}_{i=1}^{N}   ▷ use the negative gradients just calculated as new training labels to train a new classifier
5:    \gamma_{lm} = \arg\min_{\gamma} \sum_{x_i \in R_{lm}} \psi(y_i, F_{m-1}(x_i) + \gamma)   ▷ calculate the optimal score of each leaf
6:    F_m(x) = F_{m-1}(x) + \nu \cdot \gamma_{lm} 1(x \in R_{lm})   ▷ add the base classifier to the additive model (ensemble classifier)
7: end while

F_0: initial value of the ensemble classifier
\psi: loss function
\tilde{y}_{im}: estimated residual of sample i at the mth iteration
R_{lm}: set of samples classified to the leaf with index l at the mth iteration
\gamma_{lm}: optimal score of leaf l under the L-disjoint partition
\nu: learning rate
F_m: ensemble score of the mth iteration
2.5.2 Derivation
In the gradient boosting algorithm, the user can choose an arbitrary differentiable convex function as the loss function. Firstly, an initial value is given to the ensemble classifier; it is computed from the loss function being used. The initial classifier can be regarded as a tree with only a single node, which means that all samples are classified to this single node, and the initial value F_0 is the optimal score of the node. Denote the score of the single-node tree by \gamma:

F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} \psi(y_i, \gamma)   (17)

The minimum of \sum_{i=1}^{N} \psi(y_i, \gamma) occurs at

\frac{\partial \sum_{i=1}^{N} \psi(y_i, \gamma)}{\partial \gamma} = 0   (18)

By equations (17) and (18), F_0 for the different loss functions can be derived as follows.

Least-squares loss: \psi(y, F(x)) = \frac{1}{2}(y - F(x))^2

\frac{\partial \sum_{i=1}^{N} \frac{1}{2}(y_i - \gamma)^2}{\partial \gamma} = N\left(\gamma - \frac{\sum_{i=1}^{N} y_i}{N}\right) = 0
\Rightarrow F_0(x) = \frac{\sum_{i=1}^{N} y_i}{N}

According to this derivation, F_0 is the average of all y_i in the training set.
Exponential loss: \psi(y, F(x)) = e^{-y F(x)}

\frac{\partial \sum_{i=1}^{N} e^{-y_i \gamma}}{\partial \gamma} = \sum_{i=1}^{N} -y_i e^{-y_i \gamma} = -e^{-\gamma} \sum_{i=1}^{N} 1(y_i = 1) + e^{\gamma} \sum_{i=1}^{N} 1(y_i = -1) = 0
\Rightarrow -e^{-\gamma} N P(y = 1) + e^{\gamma} N P(y = -1) = 0
\Rightarrow F_0(x) = \gamma_0 = \frac{1}{2} \log\frac{P(y = 1)}{P(y = -1)}
Logistic loss: \psi(y, F(x)) = \log(1 + e^{-2y F(x)})

\frac{\partial \sum_{i=1}^{N} \log(1 + e^{-2y_i \gamma})}{\partial \gamma} = \sum_{i=1}^{N} \frac{-2y_i}{1 + e^{2y_i \gamma}} = 0
\Rightarrow -2N\left[\frac{P(y = 1)}{1 + e^{2\gamma}} - \frac{P(y = -1)}{1 + e^{-2\gamma}}\right] = 0
\Rightarrow F_0(x) = \frac{1}{2} \log\frac{P(y = 1)}{P(y = -1)}

For the exponential and logistic losses, F_0 is half the logarithm of the ratio of the probabilities of the two classes.
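The closed form F_0 = \frac{1}{2}\log\frac{P(y=1)}{P(y=-1)} can be checked numerically against a grid search over \gamma, here for the exponential loss with toy class frequencies:

```python
import numpy as np

y = np.array([1.0] * 20 + [-1.0] * 80)   # P(y=1) = 0.2, P(y=-1) = 0.8
f0 = 0.5 * np.log(0.2 / 0.8)             # claimed minimizer of the mean loss

# Brute-force minimization of the mean exponential loss over a gamma grid
gammas = np.linspace(-2.0, 2.0, 2001)
mean_loss = [np.exp(-y * gam).mean() for gam in gammas]
best = gammas[np.argmin(mean_loss)]
print(f0, best)
```

The grid minimizer agrees with the closed form up to the grid resolution, confirming the derivation above.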
Secondly, the mth iteration includes four steps: (1) calculate the negative gradient \tilde{y}_{im} of the loss function to approximate the residual of each sample; (2) train a classifier f_m which partitions the features \{x_i \mid i = 1, 2, \ldots, N\} into L disjoint regions R_{lm}, \forall l = 1, 2, \ldots, L, given the training set \{(x_i, \tilde{y}_{im})\}; (3) find the optimal score \gamma_{lm} of leaf l which minimizes the loss function over the leaf; (4) update the ensemble score.

(1) Calculate \tilde{y}_{im} for the different loss functions:

\tilde{y}_{im} = -\left[\frac{\partial \psi(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x_i) = F_{m-1}(x_i)}   (19)

Least-squares loss:

\tilde{y}_{im} = -\left[\frac{\partial \frac{1}{2}(y_i - F(x_i))^2}{\partial F(x_i)}\right]_{F(x_i) = F_{m-1}(x_i)} = y_i - F_{m-1}(x_i)   (20)

Exponential loss:

\tilde{y}_{im} = -\left[\frac{\partial e^{-y_i F(x_i)}}{\partial F(x_i)}\right]_{F(x_i) = F_{m-1}(x_i)} = y_i e^{-y_i F_{m-1}(x_i)}   (21)

Logistic loss:

\tilde{y}_{im} = -\left[\frac{\partial \log(1 + e^{-2y_i F(x_i)})}{\partial F(x_i)}\right]_{F(x_i) = F_{m-1}(x_i)} = \frac{2y_i}{1 + e^{2y_i F_{m-1}(x_i)}}   (22)

(2) Train a classifier (tree) f_m with L leaves using the training set \{(x_i, \tilde{y}_{im})\}; L can vary from iteration to iteration.

(3) Find the optimal score f_m(x_i) = \gamma_{lm}, \forall x_i \in R_{lm}, of leaf l, which minimizes the loss over the leaf \sum_{x_i \in R_{lm}} \psi(y_i, F_m(x_i)):

\sum_{x_i \in R_{lm}} \psi(y_i, F_m(x_i)) = \sum_{x_i \in R_{lm}} \psi(y_i, F_{m-1}(x_i) + \gamma_{lm})   (23)

\gamma_{lm} = \arg\min_{\gamma} \sum_{x_i \in R_{lm}} \psi(y_i, F_{m-1}(x_i) + \gamma)   (24)

The minimum occurs while

\frac{\partial \sum_{x_i \in R_{lm}} \psi(y_i, F_{m-1}(x_i) + \gamma)}{\partial \gamma} = 0   (25)
Least-squares loss: by equation (20), \frac{1}{2}(y_i - F_{m-1}(x_i) - \gamma)^2 can be rewritten as \frac{1}{2}(\gamma^2 - 2\tilde{y}_{im}\gamma + \tilde{y}_{im}^2). By equations (24) and (25), we get

\frac{\partial \sum_{x_i \in R_{lm}} \frac{1}{2}(\gamma^2 - 2\tilde{y}_{im}\gamma + \tilde{y}_{im}^2)}{\partial \gamma} = 0
\Rightarrow N_l\left(\gamma - \frac{\sum_{x_i \in R_{lm}} \tilde{y}_{im}}{N_l}\right) = 0
\Rightarrow \gamma_{lm} = \frac{\sum_{x_i \in R_{lm}} \tilde{y}_{im}}{N_l},

where N_l = N(R_{lm}) is the number of samples classified to leaf l.
Exponential loss: by equations (21), (24) and (25),

\frac{\partial \sum_{x_i \in R_{lm}} e^{-y_i(F_{m-1}(x_i) + \gamma)}}{\partial \gamma} = 0
\Rightarrow \sum_{x_i \in R_{lm}} -y_i e^{-y_i F_{m-1}(x_i)} e^{-y_i \gamma} = 0
\Rightarrow \sum_{x_i \in R_{lm}} -\tilde{y}_{im} e^{-y_i \gamma} = 0
\Rightarrow -e^{-\gamma} \sum_{x_i \in R_{lm}} \tilde{y}_{im} 1(y_i = 1) - e^{\gamma} \sum_{x_i \in R_{lm}} \tilde{y}_{im} 1(y_i = -1) = 0,

and we get

\gamma_{lm} = \frac{1}{2} \log \frac{-\sum_{x_i \in R_{lm}} \tilde{y}_{im} 1(y_i = 1)}{\sum_{x_i \in R_{lm}} \tilde{y}_{im} 1(y_i = -1)}.
Logistic loss: by equation (24), we need to solve

\frac{\partial \sum_{x_i \in R_{lm}} \log(1 + e^{-2y_i(F_{m-1}(x_i) + \gamma)})}{\partial \gamma} = -\sum_{x_i \in R_{lm}} \frac{2y_i}{1 + e^{2y_i(F_{m-1}(x_i) + \gamma)}} = 0,

but there is no closed form for the solution. In practical cases, a numerical method is used to solve it.

(4) Add the score to the ensemble classifier, depending on which leaf x belongs to:

F_m(x) = F_{m-1}(x) + \nu \cdot \gamma_{lm} 1(x \in R_{lm})

Here \nu is the learning rate, a coefficient applied to \gamma_{lm} when it is added to the ensemble classifier. In practical applications it is a parameter that needs to be fine-tuned to build the model conservatively.

Lastly, the prediction value is output in a way that depends on the loss function. For the least-squares loss, the sign of the ensemble result is output. For the logistic and exponential losses, the sigmoid function, which will be introduced in the next chapter, maps the ensemble value to [0, 1]; if the result is larger than 0.5, the prediction is 1. Table 1 summarizes the results for the different loss functions.
Table 1: Summary of the different loss functions for gradient boosting

Least squares:
  loss function: \frac{1}{2}(y - F(x))^2
  F_0(x) = \frac{\sum_{i=1}^{N} y_i}{N}
  \tilde{y}_{im} = y_i - F_{m-1}(x_i)
  \gamma_{lm} = \frac{\sum_{x_i \in R_{lm}} \tilde{y}_{im}}{N_l}

Exponential loss:
  loss function: e^{-y F(x)}
  F_0(x) = \frac{1}{2}\log\frac{P(y = 1)}{P(y = -1)}
  \tilde{y}_{im} = y_i e^{-y_i F_{m-1}(x_i)}
  \gamma_{lm} = \frac{1}{2}\log\frac{-\sum_{x_i \in R_{lm}} \tilde{y}_{im} 1(y_i = 1)}{\sum_{x_i \in R_{lm}} \tilde{y}_{im} 1(y_i = -1)}

Logistic loss:
  loss function: \log(1 + e^{-2y F(x)})
  F_0(x) = \frac{1}{2}\log\frac{P(y = 1)}{P(y = -1)}
  \tilde{y}_{im} = \frac{2y_i}{1 + e^{2y_i F_{m-1}(x_i)}}
  \gamma_{lm}: no closed form
2.6 XGB-Extreme Gradient Boosting
XGB, which stands for "extreme gradient boosting", has been developed since 2014. It first appeared in a competition held on Kaggle and was proposed by Tianqi Chen [7] at the University of Washington. The idea of XGB originates from gradient boosting, and the biggest difference between gradient boosting and XGB is the objective function. In addition to the training loss, the objective function of XGB has a regularization term which does not exist in traditional gradient boosting; it can prevent overfitting. Like the other boosting methods, the final classifier is the ensemble of many base classifiers, and the ensemble classifier always gives a better prediction than that of the previous iteration. XGB uses a second-order Taylor expansion around the previous ensemble value to approximate the objective function. We discuss the binary classification problem with labels y \in \{1, -1\} and training set \{(x_i, y_i) \mid i = 1, 2, \ldots, N\}, where x_i is a vector with d dimensions (d features).
2.6.1 Algorithm
Algorithm 5 Extreme gradient boosting algorithm
1: Initialize $F_0(x_i) = \gamma_0\ \forall i = 1, 2, \dots, N$. ▷ Give an initial value to $F$
2: while $m \le M$ do
3: $g_{im} = \frac{\partial \psi(y_i, F(x_i))}{\partial F(x_i)}\big|_{F(x_i)=F_{(m-1)}(x_i)}$, $h_{im} = \frac{\partial^2 \psi(y_i, F(x_i))}{\partial F(x_i)^2}\big|_{F(x_i)=F_{(m-1)}(x_i)}$ ▷ Calculate $g_{im}$ and $h_{im}$ for the 2nd-order Taylor expansion approximation
4: Train a new classifier which partitions $\{x_i\}_{i=1}^N$ into $L$ disjoint regions $\{R_{lm}\}_{l=1}^L$ ▷ Train a classifier with training set $\{x_i, g_{im}, h_{im}\}$
5: $w_{lm} = -\frac{\sum_{x_i\in R_{lm}} g_{im}}{\sum_{x_i\in R_{lm}} h_{im} + \lambda}$ ▷ Calculate the optimal score of each region
6: $F_m(x) = F_{(m-1)}(x) + \nu\, w_{lm}1(x \in R_{lm})$ ▷ Add the base classifier to the additive model (ensemble classifier)
7: end while
$\gamma_0$: initial score of each sample
$g_{im}$: value of the first derivative of the loss function for sample $i$ at the $m$th iteration
$h_{im}$: value of the second derivative of the loss function for sample $i$ at the $m$th iteration
$R_{lm}$: $\forall x_i$ classified to region $l$ at the $m$th iteration
$w_{lm}$: optimal score of leaf $l$ at the $m$th iteration
2.6.2 Derivation
The objective function of XGB includes two terms: the loss function and a regularization term. Denote the objective function at the $m$th iteration by $L^{(m)}$:
\[
L^{(m)} = \sum_{i=1}^N \psi(y_i, F_{(m-1)}(x_i) + f_m(x_i)) + \Omega(f_m) \tag{26}
\]
\[
\Omega(f_m) = \gamma L + \frac{1}{2}\lambda \|w\|^2 \tag{27}
\]
are the loss function and the regularization term respectively. $f_m$ is a regression tree trained on $\{x_i, g_{im}, h_{im}\}$. It partitions all samples $\{x_i \mid i = 1, 2, \dots, N\}$ into $L$ disjoint regions $\{R_{lm}\}_{l=1}^L$. The goal is to minimize the objective function $L^{(m)}$ under a fixed partition $\{R_{lm}\}_{l=1}^L$ and to find the optimal weight $w_{lm}$ of region $l$ such that $f_m(x_i \in R_{lm}) = w_{lm}$. The regions can be regarded as leaves.
Different from traditional gradient boosting, XGB uses a 2nd-order Taylor expansion to approximate the loss function; thus $\psi$ is required to be twice differentiable.
\[
\psi(y_i, F_{(m-1)}(x_i) + f_m(x_i)) \cong \psi(y_i, F_{(m-1)}(x_i)) + g_{im} f_m(x_i) + \frac{1}{2} h_{im} f_m^2(x_i) \tag{28}
\]
where
\[
g_{im} = \partial_F \psi(y_i, F)\,\big|_{F=F_{(m-1)}(x_i)} \tag{29}
\]
\[
h_{im} = \partial_F^2 \psi(y_i, F)\,\big|_{F=F_{(m-1)}(x_i)} \tag{30}
\]
Eq. (26) becomes
\[
L^{(m)} \cong \sum_{i=1}^N \left[\psi(y_i, F_{(m-1)}(x_i)) + g_{im} f_m(x_i) + \frac{1}{2} h_{im} f_m^2(x_i)\right] + \gamma L + \frac{1}{2}\lambda \|w\|^2. \tag{31}
\]
Since $\psi(y_i, F_{(m-1)}(x_i))$ is known, it is equivalent to optimize
\[
L^{(m)} \cong \sum_{i=1}^N \left[g_{im} f_m(x_i) + \frac{1}{2} h_{im} f_m^2(x_i)\right] + \gamma L + \frac{1}{2}\lambda \|w\|^2. \tag{32}
\]
Let $f_m(x_i \in R_{lm}) = w_l$; equation (32) can be rewritten as
\[
L^{(m)} \cong \sum_{l=1}^L \left[\sum_{x_i\in R_{lm}} \left(g_{im} w_l + \frac{1}{2} h_{im} w_l^2\right) + \gamma\right] + \sum_{l=1}^L \frac{1}{2}\lambda w_l^2
= \sum_{l=1}^L \left[w_l \left(\sum_{x_i\in R_{lm}} g_{im}\right) + \frac{1}{2} w_l^2 \left(\sum_{x_i\in R_{lm}} h_{im} + \lambda\right) + \gamma\right].
\]
Let
\[
L_l^{(m)} = w_l \left(\sum_{x_i\in R_{lm}} g_{im}\right) + \frac{1}{2} w_l^2 \left(\sum_{x_i\in R_{lm}} h_{im} + \lambda\right) + \gamma. \tag{33}
\]
Finding the optimal value of the regression tree $f_m$ is equivalent to finding, for each leaf $l$, the optimal $w_l$ which minimizes $L_l^{(m)}$:
\[
w_{lm} = \arg\min_{w_l} L_l^{(m)}. \tag{34}
\]
The minimal value occurs when $\frac{\partial L_l^{(m)}}{\partial w_l} = 0$. Denote $a_{lm} = \frac{1}{2}\left(\sum_{x_i\in R_{lm}} h_{im} + \lambda\right)$ and $b_{lm} = \sum_{x_i\in R_{lm}} g_{im}$. Equation (33) becomes
\[
L_l^{(m)} = a_{lm} w_l^2 + b_{lm} w_l + \gamma
\]
and
\[
\frac{\partial L_l^{(m)}}{\partial w_l} = 2 a_{lm} w_l + b_{lm} = 0,
\]
so we get
\[
w_{lm} = -\frac{b_{lm}}{2 a_{lm}} = -\frac{\sum_{x_i\in R_{lm}} g_{im}}{\sum_{x_i\in R_{lm}} h_{im} + \lambda}. \tag{35}
\]
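Equation (35), and the per-leaf optimal objective value it implies, can be sketched in a few lines of Python; `g`, `h` and `lam` below are illustrative placeholders for the per-leaf gradients, Hessians and the L2 coefficient λ:

```python
def leaf_weight(g, h, lam):
    # w_lm = -sum(g) / (sum(h) + lambda), equation (35)
    return -sum(g) / (sum(h) + lam)

def leaf_objective(g, h, lam, gamma):
    # Optimal(L_l^(m)) = gamma - (sum g)^2 / (2 * (sum h + lambda))
    return gamma - sum(g) ** 2 / (2.0 * (sum(h) + lam))

# toy gradient/Hessian statistics for one leaf
print(leaf_weight([1.0, 1.0], [1.0, 1.0], 2.0))
print(leaf_objective([1.0, 1.0], [1.0, 1.0], 2.0, 0.0))
```

Note how a larger λ shrinks the leaf weight toward zero, which is exactly the conservative effect of the regularization term.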
Since
\[
\mathrm{Optimal}(L_l^{(m)}) = \gamma - \frac{b_{lm}^2}{4 a_{lm}} = \gamma - \frac{\left(\sum_{x_i\in R_{lm}} g_{im}\right)^2}{2\left(\sum_{x_i\in R_{lm}} h_{im} + \lambda\right)},
\]
it turns out that
\[
\mathrm{Optimal}(L^{(m)}) \cong \sum_{l=1}^L \left[\gamma - \frac{\left(\sum_{x_i\in R_{lm}} g_{im}\right)^2}{2\left(\sum_{x_i\in R_{lm}} h_{im} + \lambda\right)}\right]
= \gamma L - \sum_{l=1}^L \frac{\left(\sum_{x_i\in R_{lm}} g_{im}\right)^2}{2\left(\sum_{x_i\in R_{lm}} h_{im} + \lambda\right)}, \tag{36}
\]
which is important for the later derivation of the splitting criterion used when training the new classifier at each iteration.
In the algorithm, an initial value is given to the ensemble classifier; in practical applications it can be defined by the user, or simply take the same value as in gradient boosting. The $m$th iteration includes four steps:
(1) Compute the first derivative $g_{im}$ and the second derivative $h_{im}$ by using $y_i$ and the ensemble value $F_{(m-1)}(x_i)$ of the previous iteration, for all samples.
(2) Fit a classifier with the given training set $\{x_i, g_{im}, h_{im}\}$.
(3) Compute the optimal score $w_{lm}$.
(4) Add the score to the ensemble classifier.
(1) Compute $g_{im}$ and $h_{im}$ for the different loss functions by equations (29), (30).
Least square loss $(y - F(x))^2$:
\[
g_{im} = 2F_{(m-1)}(x_i) - 2y_i, \qquad h_{im} = 2.
\]
Exponential loss $e^{-yF(x)}$:
\[
g_{im} = -y_i e^{-y_iF_{(m-1)}(x_i)}, \qquad h_{im} = y_i^2 e^{-y_iF_{(m-1)}(x_i)} = e^{-y_iF_{(m-1)}(x_i)},
\]
because $y_i^2 = 1$ for $y_i \in \{-1, 1\}$.
Logistic loss $\log(1 + e^{-2yF(x)})$:
\[
g_{im} = \frac{-2y_i e^{-2y_iF_{(m-1)}(x_i)}}{1 + e^{-2y_iF_{(m-1)}(x_i)}} = \frac{-2y_i}{1 + e^{2y_iF_{(m-1)}(x_i)}},
\qquad
h_{im} = \frac{4y_i^2 e^{2y_iF_{(m-1)}(x_i)}}{\left(1 + e^{2y_iF_{(m-1)}(x_i)}\right)^2} = \frac{4e^{-2y_iF_{(m-1)}(x_i)}}{\left(1 + e^{-2y_iF_{(m-1)}(x_i)}\right)^2}.
\]
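The logistic-loss derivatives above can be sanity-checked numerically. The following sketch (toy values, helper names illustrative) compares $g_{im}$ and $h_{im}$ against finite differences of the loss:

```python
import math

def logistic_loss(y, F):
    # psi(y, F) = log(1 + exp(-2yF))
    return math.log(1.0 + math.exp(-2.0 * y * F))

def g_logistic(y, F):
    # first derivative w.r.t. F: -2y / (1 + exp(2yF))
    return -2.0 * y / (1.0 + math.exp(2.0 * y * F))

def h_logistic(y, F):
    # second derivative: 4*exp(-2yF) / (1 + exp(-2yF))^2
    e = math.exp(-2.0 * y * F)
    return 4.0 * e / (1.0 + e) ** 2

# central finite differences at a toy point (y=1, F=0.3)
eps = 1e-6
num_g = (logistic_loss(1, 0.3 + eps) - logistic_loss(1, 0.3 - eps)) / (2 * eps)
print(num_g, g_logistic(1, 0.3))   # should agree to ~1e-5
```

The same second-difference check works for $h_{im}$, confirming in particular that $h_{im}$ is positive, as it must be for a convex loss.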
(2) and (3) Train a classifier to partition $\{x_i \mid i = 1, 2, \dots, N\}$ into $\{R_{lm}\}_{l=1}^L$ and find the optimal score $w_{lm}$ which minimizes the objective function.
In step (2), $g_{im}$ and $h_{im}$ are used in a stopping/splitting criterion to train a new classifier with $L$ leaves: the classifier stops splitting after evaluating a further split of each leaf. To measure the goodness of the classifier, we can use an idea similar to decision trees: grow a tree from a single node and stop when the impurity (entropy) would increase after a split; the optimal classifier is the one before that split. In XGB, the objective function $L^{(m)}$ serves as the index of the goodness of a classifier. If the objective function value increases after splitting leaf $l$, then $l$ is not split; otherwise splitting continues until no leaf can be split further. To formulate the stopping criterion, we first partition the samples in $R_{lm}$ into $R_L$ and $R_R$ with leaf indices $l_L, l_R$. If $l$ is a leaf of the optimal classifier, then no possible partition (split) of $R_{lm}$ decreases the objective function value. Since the objective contributions of the other leaves do not change after splitting $R_{lm}$, the difference of the objectives equals the difference between $\mathrm{Optimal}(L_l^{(m)})$ and $\mathrm{Optimal}(L_{l_L\cup l_R}^{(m)})$. Thus we can formulate the stopping criterion as
\[
\delta L^{(m)} = \mathrm{Optimal}(L_l^{(m)}) - \mathrm{Optimal}(L_{l_L\cup l_R}^{(m)}) < 0 \tag{37}
\]
where
\[
\mathrm{Optimal}(L_l^{(m)}) = \gamma - \frac{\left(\sum_{x_i\in R_{lm}} g_{im}\right)^2}{2\left(\sum_{x_i\in R_{lm}} h_{im} + \lambda\right)},
\]
and
\[
\mathrm{Optimal}(L_{l_L\cup l_R}^{(m)}) = 2\gamma - \left[\frac{\left(\sum_{x_i\in R_R} g_{im}\right)^2}{2\left(\sum_{x_i\in R_R} h_{im} + \lambda\right)} + \frac{\left(\sum_{x_i\in R_L} g_{im}\right)^2}{2\left(\sum_{x_i\in R_L} h_{im} + \lambda\right)}\right].
\]
Inequality (37) becomes
\[
\frac{\left(\sum_{x_i\in R_L} g_{im}\right)^2}{2\left(\sum_{x_i\in R_L} h_{im} + \lambda\right)} + \frac{\left(\sum_{x_i\in R_R} g_{im}\right)^2}{2\left(\sum_{x_i\in R_R} h_{im} + \lambda\right)} - \frac{\left(\sum_{x_i\in R_{lm}} g_{im}\right)^2}{2\left(\sum_{x_i\in R_{lm}} h_{im} + \lambda\right)} - \gamma < 0 \tag{38}
\]
If $l$ is a leaf of the optimal classifier, all possible partitions satisfy inequality (38). If $\delta L^{(m)} > 0$, which means $l$ can be split further, then we have to find the best split for leaf $l$. The best split in a decision tree is the partition which reduces the impurity (entropy) score the most. The exact greedy algorithm for split finding in XGB follows a similar idea: it chooses the split which maximizes $\delta L^{(m)}$ (decreases the objective function value the most) as the optimal split. The algorithm sorts all samples in $R_{lm}$ by $x_{ik}$, the value of the $k$th feature, so that $R_L = \{x_s \mid x_{sk} \le x_{ik}\}$ and $R_R = \{x_j \mid x_{jk} > x_{ik}\}$, and always keeps the partition with the larger $\delta L^{(m)}$. Repeating this for every feature finally yields the best split for leaf $l$. After getting $L$ and $\{R_{lm}\}_{l=1}^L$, calculate the optimal weight $w_{lm}$ of each leaf $l$ by equation (35).
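The exact greedy search on one feature can be sketched as follows. This is a minimal illustration of the gain in inequality (38), assuming distinct feature values; the function name and candidate-threshold convention (midpoints) are our own choices:

```python
def best_split(xs, g, h, lam, gamma):
    """Scan all thresholds on one feature; return (threshold, gain)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    G, H = sum(g), sum(h)
    GL = HL = 0.0
    best_gain, best_thr = 0.0, None
    for rank, i in enumerate(order[:-1]):
        GL += g[i]            # left accumulates samples in sorted order
        HL += h[i]
        GR, HR = G - GL, H - HL
        # delta-L of inequality (38): left + right - parent - gamma
        gain = (GL ** 2 / (2.0 * (HL + lam)) + GR ** 2 / (2.0 * (HR + lam))
                - G ** 2 / (2.0 * (H + lam)) - gamma)
        if gain > best_gain:
            best_gain = gain
            best_thr = (xs[i] + xs[order[rank + 1]]) / 2.0
    return best_thr, best_gain

# toy leaf: negative gradients on the left, positive on the right
print(best_split([1.0, 2.0, 3.0, 4.0], [-1.0, -1.0, 1.0, 1.0],
                 [1.0, 1.0, 1.0, 1.0], 0.0, 0.0))
```

Repeating this scan over every feature and taking the overall maximum gives the split actually used; a positive γ acts as the minimum gain required to split at all.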
In the last step, output the ensemble classifier. For the least square loss in a binary classification problem with $y_i \in \{-1, 1\}$, the prediction is the sign of the ensemble value. For exponential and logistic loss, the sigmoid function can be used to map the value to [0, 1], returning 1 if the value is larger than 0.5. Table 2 summarizes the results for the different objective functions.
Table 2: Summary of different loss functions for XGB

|               | Least square | Exponential loss | Logistic loss |
|---------------|--------------|------------------|---------------|
| loss function | $(y - F(x))^2$ | $e^{-yF(x)}$ | $\log(1 + e^{-2yF(x)})$ |
| $g_{im}$ | $2(F_{(m-1)}(x_i) - y_i)$ | $-y_i e^{-y_iF_{(m-1)}(x_i)}$ | $\frac{-2y_i}{1 + e^{2y_iF_{(m-1)}(x_i)}}$ |
| $h_{im}$ | $2$ | $e^{-y_iF_{(m-1)}(x_i)}$ | $\frac{4e^{-2y_iF_{(m-1)}(x_i)}}{\left(1 + e^{-2y_iF_{(m-1)}(x_i)}\right)^2}$ |
| $w_{lm}$ | $\frac{\sum_{x_i\in R_{lm}} 2(y_i - F_{(m-1)}(x_i))}{2N_l + \lambda}$ | $\frac{\sum_{x_i\in R_{lm}} y_i e^{-y_iF_{(m-1)}(x_i)}}{\sum_{x_i\in R_{lm}} e^{-y_iF_{(m-1)}(x_i)} + \lambda}$ | $\frac{\sum_{x_i\in R_{lm}} \frac{2y_i}{1 + e^{2y_iF_{(m-1)}(x_i)}}}{\sum_{x_i\in R_{lm}} \frac{4e^{-2y_iF_{(m-1)}(x_i)}}{(1 + e^{-2y_iF_{(m-1)}(x_i)})^2} + \lambda}$ |
3 Measurement and Experiment
3.1 Measurement of goodness of the classifier
In general, the goodness of a model can be measured by accuracy, the ratio between the number of correct predictions and the total count. But in some specific cases accuracy cannot truly represent the model's goodness, such as quality-test prediction for a high-yield product, where "Fail" samples are always rare compared to "Pass" samples. In the dataset used for demonstration in this thesis, the failed samples of the quality test are less than 1% of all products, so the two classes have a big disparity. A model trained on such an imbalanced dataset will tend to predict the majority class, yet it can still reach high accuracy, which varies with the class ratio of the given test set. For example, if the ratio between the two classes (Fail/Pass) is 0.01 (1:100), the model will be prone to predict "Pass", and the accuracy is higher than 99% when the class distribution of the test set is the same as that of the training set. But with a test set containing more failure cases, this model cannot predict precisely. To effectively measure the goodness of a model learned from an imbalanced two-class dataset, the area under the ROC curve will be used throughout this thesis.
3.1.1 ROC curve
The ROC (receiver operating characteristic) curve is a curve for measuring the goodness of a binary classifier. Its $x$-axis and $y$-axis represent the false positive rate and the true positive rate respectively. The area under the curve, called "AUC", can serve as an index of the goodness of the model. In binary classification we define the two classes as "+" and "-", and the combination of predicted result and true class label has 4 different cases: TP (true positive), FP (false positive), TN (true negative) and FN (false negative); this can be represented by the confusion matrix in Table 3. True positive means a condition-positive case is predicted as positive, and false positive means a negative case is predicted as positive. To draw the ROC curve, we adjust the threshold and obtain different pairs of false positive rate and true positive rate (FPR, TPR), defined by equations (39) and (40). For example, if an instance's probability of being "+" is $p$ and $p >$ threshold, it is classified to the positive class. Thus, the higher the threshold, the smaller the false positive rate.
Figure 1: Receiver operating characteristics curve
| Confusion matrix | Predicted "+" | Predicted "-" |
|------------------|---------------|---------------|
| Condition "+" | True positive (TP) | False negative (FN) |
| Condition "-" | False positive (FP) | True negative (TN) |

Table 3: Confusion matrix obtained under a certain threshold.
\[
TPR = \frac{TP}{TP + FN} \tag{39}
\]
\[
FPR = \frac{FP}{FP + TN} \tag{40}
\]
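Equations (39) and (40) amount to the following small computation; scores, labels and the threshold below are toy values, and the function name is illustrative:

```python
def roc_point(scores, labels, threshold):
    """One (FPR, TPR) point of the ROC curve at a given threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 1)
    fn = sum(1 for s, y in zip(scores, labels) if s <= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == -1)
    tn = sum(1 for s, y in zip(scores, labels) if s <= threshold and y == -1)
    return fp / (fp + tn), tp / (tp + fn)

# sweeping the threshold traces out the ROC curve
print(roc_point([0.9, 0.8, 0.3, 0.1], [1, -1, 1, -1], 0.5))
```

In practice scikit-learn's `roc_curve` performs this sweep over all thresholds at once; the sketch just makes equations (39)-(40) concrete.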
3.1.2 Sigmoid function
The curve of the sigmoid function, an antisymmetric function, looks like an "S" as in Fig 2. A special case of the sigmoid function is the logistic function, which maps a real number to [0, 1]:
\[
S(x) = \frac{1}{1 + e^{-x}}. \tag{41}
\]
Some boosting methods return the sign of the ensemble classifier $F(x)$ as the prediction result, where $F(x)$ is an additive logistic function. In this sense, the threshold has already been chosen as 0.5: as we can see in Fig 2, $S(x) > 0.5$ when $x$ is positive. To cut different thresholds for drawing the ROC curve, the sigmoid function will be used in this thesis to map $F$ to [0, 1]. Discrete adaboost is not suitable for this, because after mapping by the sigmoid function the value would represent accuracy (1 − error) rather than a class probability. Basically, this mapping will be used only when the objective function is exponential or logistic, as in real adaboost, gradient boosting and extreme gradient boosting.
Figure 2: Sigmoid function maps the score to [0,1]
3.2 Data preparation
The original dataset includes three parts: categorical, numerical and timestamp data. The three datasets together have 4267 columns × 1183748 rows. Because of the limited computational ability of a laptop, only numerical and time-related data from a partial process flow will be used for the later demonstration. Before starting to train the model, I construct three training sets as in Table 4. The first training set is constructed by cutting a partial consecutive process flow and leaving out the rows with missing values. The dataset does not reveal the real physical meaning of each feature in the numerical data, but there is still some useful information, such as the process sequence of each product through the machines. Some processes can be done on more than one machine. For example, in Fig 3, B1 and B2 are equivalent machines; at this station a product can be processed either on B1 or on B2, except for some special cases needing rework. In the first training set, data with the same feature from equivalent machines were put into different columns, so I merged them into the same column. The second training set is constructed by adding extra features to the first training set: the machine number and the machine's idling time before the product starts processing. If the product passes through B1, the added feature is B1. It is possible that a machine has a long queueing time before a product comes; the machine's idling time is the time difference between two consecutive products processed on that machine. There are many missing values in each column, which means the measurement sampling rate is not 100%. For the third training set, two consecutive partial process flows are cut and the missing values are kept; XGB is used later for its demonstration, since among all the packages only XGB supports cells with missing values.
| Training set | # of rows | # of columns | Note |
|--------------|-----------|--------------|------|
| 1 | 80000 | 42 | Partial process flow without missing data |
| 2 | 80000 | 56 | Partial process flow and extra added features |
| 3 | 22000 | 267 | Partial process flow with missing data |

Table 4: Size of training sets
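The machine-idling-time feature described above can be illustrated with a small sketch. Machine names and timestamps here are invented (the real data uses anonymized station columns), and the function name is our own:

```python
def idle_times(records):
    """records: list of (machine, start_time) pairs sorted by start_time.

    Returns, per product, the gap since the previous product was
    processed on the same machine (0.0 for a machine's first product).
    """
    last_seen = {}
    out = []
    for machine, t in records:
        out.append(t - last_seen[machine] if machine in last_seen else 0.0)
        last_seen[machine] = t
    return out

# B1 sits idle from t=0.0 until its next product at t=5.0
print(idle_times([("B1", 0.0), ("B2", 1.0), ("B2", 2.0), ("B1", 5.0)]))
```

A long idle time may correlate with machine state (e.g. cooling down), which is why it is plausible as an extra feature.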
Figure 3: Partial process flow
3.3 Implementation
The demonstration is implemented on an ASUS laptop under Windows 10, a 64-bit system with 4 GB RAM and an x64 processor. I code in Python, and the programming part is executed in Jupyter Notebook, an interactive platform supporting multiple programming languages in which code can be executed in a single cell independently. The version of Python is 3.3 and that of Jupyter Notebook is 4.1.
Compared to the original dataset, the training set used for demonstration is very small, so the prediction results are not as good as the higher-ranked results in the competition. The purpose of this thesis is not high accuracy but a demonstration of the procedure. Programming uses the scikit-learn package in Python, where most of the methods can be found, such as adaboost and gradient boosting; extreme gradient boosting is provided by the separate xgboost package with a scikit-learn-compatible interface.
To build the training model, the procedure includes two steps: (1) find the optimal parameters to train the model; (2) find an optimal threshold so that we can output the result of the prediction.
3.3.1 Optimization of parameters
In programming, the model is optimized under a given condition with fixed parameters. To find these optimal parameters before training the model, I took a line-search strategy, a method that looks for a local extremum by adjusting one specific parameter while the other parameters stay fixed. I will demonstrate the procedure using extreme gradient boosting with the third training set in Table 4, because XGB is the only one of the four methods able to handle a training set with missing data. The line-search strategy is shown in Figs 4 to 8. The order in which parameters are picked to maximize AUC does not matter.
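The line-search strategy can be sketched generically. Here `score` stands in for the cross-validated AUC of a model trained with the given parameters; in this sketch it is replaced by a toy function, and all names are illustrative:

```python
def line_search(params, name, grid, score):
    """Vary parameter `name` over `grid` with the others fixed;
    keep the value that maximizes `score`."""
    best_val, best_score = params[name], score(params)
    for v in grid:
        trial = dict(params, **{name: v})   # copy params, override one key
        s = score(trial)
        if s > best_score:
            best_val, best_score = v, s
    return best_val, best_score

# toy objective peaking at max_depth = 2, mimicking Fig 4
toy_score = lambda p: -(p["max_depth"] - 2) ** 2
print(line_search({"max_depth": 5, "eta": 0.1}, "max_depth", [1, 2, 3, 4], toy_score))
```

Repeating this for each parameter in turn, fixing the winner each time, reproduces the procedure of Figs 4 to 8; note it only finds a local optimum along each coordinate.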
Firstly, we fix all other parameters and find a local extremum by adjusting the maximal depth of the base classifier. Fig 4 shows that the local maximum of ROC accuracy occurs when max_depth = 2.
Figure 4: Default value: M = 100, λ = 1, γ = 0, objective=logistic, ν = 0.1
Secondly, fix max_depth = 2 and pick the number of base classifiers (n_estimators) for the line search, finding the value which maximizes the ROC accuracy while the other parameters stay fixed. In Fig 5, the result shows that the maximum AUC occurs at 103.
Figure 5: max_depth = 2, and default values λ = 1, γ = 0, objective = logistic, ν = 0.1
Figure 6: M = 103, max depth = 2, λ = 1, γ = 0, objective = logistic
Fig 6 shows that the learning rate ν reaches its maximum AUC at 0.1 and keeps trending down after 0.1; in empirical experience it lies between 0.01 and 0.2. γ is a parameter which only appears in XGB. Since the AUC does not change between 0.01 and 0.2 while programming, I took 0.2 as the increment of each iteration; the maximum value happens while γ is between 0 and 0.2. In practice it can happen that the optimal γ is either very close to zero or very far from zero, but the result in Fig 7 shows the model is almost the same as random guessing when γ is very large.
Figure 7: M = 103, max depth = 2, λ = 1 , objective = logistic
Figure 8: M = 103, max depth = 2, λ = 1 , objective = logistic
Figure 9: M = 103, max depth = 2, λ = 1 , objective = logistic
λ is also a parameter that only appears in XGB; it represents the coefficient of the L2-regularization.
Figure 10: M = 103, max depth = 2, γ = 0 , objective = logistic
Finally, we get the optimal condition with AUC 0.68. The optimal condition is shown in Table 9.
3.3.2 Optimal threshold
The ROC curve under the optimal condition is shown in Fig 11. To find an optimal threshold, we maximize the MCC (Matthews correlation coefficient), which strongly represents the correlation between the imbalanced two classes; its value is always between −1 and 1. It can be computed by equation (42) from the confusion matrix obtained at each threshold:
\[
MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{42}
\]
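Equation (42) and the threshold scan can be sketched as follows; the scores below are toy values (in the thesis they would come from the sigmoid-mapped ensemble), and the convention of returning MCC = 0 when the denominator vanishes is an assumption:

```python
import math

def mcc(tp, fp, tn, fn):
    # equation (42); 0 when the denominator is 0 (a common convention)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def best_threshold(scores, labels, thresholds):
    """Pick the threshold maximizing MCC over the candidate list."""
    best_t, best_m = None, -1.0
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s > t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s > t and y == -1)
        fn = sum(1 for s, y in zip(scores, labels) if s <= t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s <= t and y == -1)
        m = mcc(tp, fp, tn, fn)
        if m > best_m:
            best_t, best_m = t, m
    return best_t, best_m

print(best_threshold([0.9, 0.8, 0.2, 0.1], [1, 1, -1, -1], [0.05, 0.5, 0.95]))
```

Because MCC balances all four cells of the confusion matrix, it stays near zero for a trivial majority-class predictor, which is why it is the right criterion here instead of accuracy.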
Figure 11: ROC curve under optimal condition
Figure 12: Confusion matrix while threshold is equal to 0.0644
3.3.3 Regularization and over-fitting
The most significant difference of the XGB algorithm from the other boosting methods is the regularization term, which can prevent over-fitting. Figs 13 and 14 show that the model becomes slightly more stable when the number of estimators is larger than 100, while the training error continues to decay in the meantime. Fig 15 shows that not only the over-fitting but also the AUC improved when the regularization is taken into account.
Figure 13: XGB with regularization λ = 1, γ = 0.1
Figure 14: XGB without regularization λ = 0, γ = 0
Figure 15: Overfitting: XGB with regularization improved overfitting problem.
3.3.4 Model with different training set and algorithm
Tables 5, 6 and 7 show the optimal conditions and the AUC for models trained by a fixed algorithm on the different training sets. The best model is, not surprisingly, XGB trained on the third training set, because it has more information (features) than the other two training sets. The model trained on the second training set shows no improvement over those trained on the first set.
| Training set | max_depth | ν | max_features | M | AUC |
|--------------|-----------|------|--------------|----|-------|
| 1 | 3 | 0.06 | 30 | 16 | 0.612 |
| 2 | 3 | 0.06 | 46 | 16 | 0.628 |
| 3 | — | — | — | — | — |

Table 5: Optimal condition for the different training sets using the adaboost algorithm
| Training set | loss function | max_depth | ν | max_features | M | AUC |
|--------------|---------------|-----------|------|--------------|----|-------|
| 1 | exponential | 3 | 0.06 | 31 | 28 | 0.62 |
| 1 | logistic | 3 | 0.11 | 44 | 27 | 0.632 |
| 2 | exponential | 3 | 0.09 | 46 | 26 | 0.624 |
| 2 | logistic | 3 | 0.13 | 50 | 69 | 0.629 |
| 3 | — | — | — | — | — | — |

Table 6: Optimal condition for the different training sets using gradient boosting
| Training set | Objective | colsample_bytree | max_depth | ν | γ | λ | M | AUC |
|--------------|-----------|------------------|-----------|------|------|------|-----|-------|
| 1 | logistic | 1 | 3 | 0.12 | 7.4 | 0.99 | 49 | 0.617 |
| 2 | logistic | 1 | 4 | 0.1 | 2.25 | 1 | 75 | 0.608 |
| 3 | logistic | 1 | 2 | 0.1 | 0.1 | 0.06 | 103 | 0.676 |
| 1 | exponential | 1 | 6 | 0.1 | 0 | 3.96 | 37 | 0.613 |
| 2 | exponential | 1 | 6 | 0.1 | 0 | 1 | 88 | 0.61 |
| 3 | exponential | 0.5 | 6 | 0.03 | 0 | 1 | 29 | 0.658 |

Table 7: Optimal condition for the different training sets using XGB
3.4 Summary
Due to the limited computational ability of the hardware, the original dataset could not be fully used. The third training set contains only 260 features out of the 4127 features of the original data, yet the model trained on this comparatively small dataset already achieved an AUC of 0.68. In such a highly imbalanced classification problem, choosing an optimal threshold must be considered; otherwise all the data tend to be classified into the majority class if we simply take the sign of the model's output score as the prediction. XGB is quite suitable for handling a large dataset with many missing values. In the dataset from the Bosch company, if we only handle rows with complete information, many rows are deleted: for productivity reasons, the measurement sampling rate of each process is usually not 100%, so deleting rows with incomplete information leaves few rows when the sampling rate is very low. When doing a line search, it may not be easy to find a proper range for each parameter. Take γ as an example: the value of AUC shows no obvious variation between γ = 0 and γ = 4, and based on empirical experience the optimal value can occur either very close to or very far from zero. After optimizing the parameters, the optimal MCC we got is 0.15, which is only slightly better than random guessing, since the training set gave little information. The optimal MCC of the first rank in the competition is 0.52, which is also not very highly correlated. We also observed that both over-fitting and AUC improved when the regularization term was considered, that is, λ ≠ 0 and γ ≠ 0; this result matches the theoretical part. The models trained by the different methods on training set 2 are no better than those trained on training set 1; both training sets carry little information, but if we knew more about the mechanism or the real physical meaning of the features, it would be quite helpful for data preparation. Tables 5, 6 and 7 show the results of the models trained by the different algorithms and training sets; not surprisingly the results are not very competitive because of the limited information (features). In gradient boosting and extreme gradient boosting, only "exponential" and "logistic" were selected as objective functions, because it makes more sense to map their outputs to [0, 1] with the sigmoid function. No specific method is always the best, and in Kaggle competitions many participants won by combining many different methods rather than using a single algorithm.
References
[1] Mohri, Mehryar; Rostamizadeh, Afshin; Talwalkar, Ameet, Foundations of Machine Learning, 2012.
[2] Yoav Freund and Robert E. Schapire, A Short Introduction to Boosting, Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, 1999.
[3] Robert E. Schapire, Theoretical Views of Boosting and Applications, 1999.
[4] Friedman, J.; Hastie, T.; Tibshirani, R., Additive Logistic Regression: a Statistical View of Boosting, Annals of Statistics, 1998.
[5] Freund, Yoav and Schapire, Robert E., A Decision-theoretic Generalization of On-line Learning and an Application to Boosting, J. Comput. Syst. Sci., 1997, 119-139.
[6] Jerome H. Friedman, Stochastic Gradient Boosting, Computational Statistics and Data Analysis, 1999.
[7] Tianqi Chen and Carlos Guestrin, XGBoost: A Scalable Tree Boosting System, KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.