Customer Relationship Management
using Ensemble Methods
by
Yue Cui
A thesis submitted in conformity with the requirements
for the degree of Master of Applied Science
Department of Chemical Engineering & Applied Chemistry
University of Toronto
© Copyright by Yue Cui 2018
Customer Relationship Management using Ensemble Methods
Yue Cui
Master of Applied Science
Department of Chemical Engineering & Applied Chemistry
University of Toronto
2018
Abstract
This thesis aims to provide a method for customer relationship management prediction to beat
the in-house classification method implemented by the data provider company. By reviewing
the common machine learning algorithms, we recommend ensemble methods to predict the
targets. Three different ensemble methods are implemented in the thesis: random forest, gradient
boosting decision trees and ensemble selection. With the results, we conclude that all ensemble
methods outperform the benchmark, reduce the positive predictions and increase the true
positive rates. Ensemble selection performs the best, followed by the gradient boosting decision
trees. In addition, the results indicate that the ensemble methods' running time significantly
increase when compared to the benchmark. The results also indicate that careful feature
selection can significantly simplify the training and prediction process. We further discuss the
potential applications in marketing using the prediction results and the trade-off between
accuracy and computational complexity when applying ensemble methods.
Acknowledgements
Immeasurable appreciation and deepest gratitude for the academic, educational and human
support and belief in me from the following persons, who have contributed to making this thesis
possible:
Professor Joseph C. Paradi, my respected thesis supervisor, for his sincere support, guidance,
valuable suggestions and comments that benefited me in the completion of this work.
My parents, whose love and support made it possible for me to complete this project.
My friends and my fellow CMTE candidates, for their companionship and encouragement.
Lastly, I humbly extend my thanks to all concerned persons who helped and encouraged me
through my graduate studies.
Table of Contents
Acknowledgements
List of Tables
List of Figures
Introduction
1.1 Motivation
1.2 Objectives
1.3 Scope
Literature Review
2.1 Supervised Learning
2.1.1 Basic Naïve Bayes Classifier
2.1.2 Logistic Regression
2.1.3 Decision Tree
2.1.4 Support Vector Machine
2.1.5 Artificial Neural Network
2.1.6 K-Nearest Neighbors Algorithm
2.2 Ensemble Learning
2.2.1 Bagging
2.2.2 Random Forest
2.2.3 Boosting
2.2.4 Gradient Boosting
2.2.5 Ensemble Selection
Data
Technology
4.1 Python
4.2 Libraries
4.2.1 Scikit-learn
4.2.2 TensorFlow
4.2.3 TFlearn
4.2.4 XGBoost
4.3 Cloud Computing
Methodology
5.1 Evaluation Score: Area Under the Curve (AUC)
5.2 Data Processing
5.2.1 Data Preprocessing
5.2.2 Encoding Categorical Data
5.2.3 Feature Selection
5.3 Classification Methods
5.3.1 Naïve Bayes Classification
5.3.2 Random Forest
5.3.3 Gradient Boosting Decision Tree
5.3.4 Ensemble Selection
5.4 Parameter Optimization
5.5 Feature Importance and Model Performance
Results and Analysis
6.1 Classification Results
6.1.1 Naïve Bayes Classification
6.1.2 Random Forest
6.1.3 Gradient Boosting Decision Trees
6.1.4 Ensemble Selection
6.1.5 Overall Comparison
6.2 Feature selection
6.3 Target Audience
Discussion and Conclusions
Future Works
References
List of Tables
Table 1 Frequency Distribution of Target Variables
Table 2 Confusion matrix of a binary classification
Table 3 Confusion Matrix for Churn from Naïve Bayes Classifier
Table 4 Confusion Matrix for Appetency from Naïve Bayes Classifier
Table 5 Confusion Matrix for Upselling from Naïve Bayes Classifier
Table 6 Confusion Matrix for Churn from Random Forest
Table 7 Confusion Matrix for Appetency from Random Forest
Table 8 Confusion Matrix for Upselling from Random Forest
Table 9 Optimal Hyperparameters for Targets
Table 10 Confusion Matrix for Churn from Gradient Boosted Decision Trees
Table 11 Confusion Matrix for Appetency from Gradient Boosted Decision Trees
Table 12 Confusion Matrix for Upselling from Gradient Boosted Decision Trees
Table 13 Optimal Hyperparameters for Targets
Table 13 Confusion Matrix for Churn from Ensemble Selection
Table 14 Confusion Matrix for Appetency from Ensemble Selection
Table 15 Confusion Matrix for Upselling from Ensemble Selection
Table 16 Feature Importance of the Top 20 Variables for Each Target
List of Figures
Figure 1 Sigmoid Function
Figure 2 A decision tree to determine the species of a citrus fruit.
Figure 3 How kernel trick transforms a nonlinear classification (Jain, 2017)
Figure 4 Maximum-margin hyperplane and margins for an SVM trained with samples from two classes (Meyer, 2001).
Figure 5 Examples of (a) a neuron, and (b) a neural network.
Figure 6 An Example of k-NN Classification
Figure 7 The General Structure of a Random Forest (He, Chaney, Sheffield, & Schleiss, 2016)
Figure 8 Bagging vs Boosting (ETS Asset Management Factory, 2016)
Figure 9 Residual Fitting Process in Gradient Boosted Decision Trees (Grover, 2017)
Figure 10 Frequency Distribution of Variable 2382
Figure 11 Frequency Distribution of Variable 1933
Figure 12 Area Under Curve and Balanced Accuracy for Sensitivity vs Specificity
Figure 13 A Numeric/Ordinal Encoding Example (Laurae, 2017)
Figure 14 A One Hot Encoding Example (Laurae, 2017)
Figure 15 A Binary Encoding Example (Laurae, 2017)
Figure 16 Overview of All the Classification Model
Figure 17 AUC Score for Random Forest with Different Number of Features
Figure 18 AUC Score for Gradient Boosting Decision Tree with Different Number of Features
Figure 19 Relationship between Prediction and Truth and Connection with Marketing
Figure 20 Positive Prediction and True Positive Rate for Each Classifier
Figure 21 Trade-off between Accuracy and Time
Chapter 1
Introduction
1.1 Motivation
In current times, our social reality is in a state of flux, developing from an industrial society via
an information society towards a knowledge-based society. Knowledge is the basis for making
decisions and taking action, and relies on data and information (Hohenegger, Bufardi, & Xirouchakis, 2008). Making a good decision requires gathering the relevant information quickly through well-organized and thoroughly analyzed data. Knowledge management examines how to actively influence the knowledge resources within a company (Wilde, 2011). Customers are the basis of a company's economic success. Therefore, it is important to manage customers' data and information appropriately. Furthermore, understanding the connections between this knowledge and the company's actions or strategies can lead to improvements in customer relationships and increase the company's
profit.
The KDD Cup 2009 offered a large marketing database (15,000 × 50,000, >1GB) provided by
the French Telecom company Orange (the database or the Orange database). The competition
required the participants to produce predictions for the propensity of customers to switch
providers (churn), buy new products or services (appetency), or buy upgrades or add-ons
proposed to them to make the sale more profitable (up-selling) (Guyon, Lemaire, Boullé, Dror,
& Vogel, 2009). Customers who are more loyal to the company and more likely to purchase new products and upgrades can be considered the "optimal target audiences" for marketing
campaigns.
This well cleaned and tested database provides an extraordinary opportunity to apply machine
learning algorithms on a large-scale industrial application for customer relationship
management.
This thesis focuses on constructing a feasible prediction implementation of the target variables,
churn, appetency and upselling, for the Orange database based on the comprehension and
evaluation of current popular machine learning methods.
1.2 Objectives
As the data source is obtained from a Knowledge Discovery and Data Mining competition,
some of the objectives follow the requirements of the challenge.
The objectives of this thesis are as follows:
➢ Assess the academic literature, industry practices and successful competition entries in machine learning applications to determine the fastest and most efficient
methods that can be applied to develop predictions based on this database.
➢ Propose a reasonable methodology or framework that predicts target variables based on
large-scale inputs.
➢ Construct an algorithm that can outperform the baseline predicted using a basic Naïve
Bayes classifier, and improve the algorithm to surpass the results from the in-house
system developed by the Orange Labs.
➢ Further analyze the results generated from the above application to identify the optimal
target audience for future marketing campaigns.
1.3 Scope
The goal of the project is to compare some of the common ensemble methods, and provide
suggestions based on the results.
The thesis focuses exclusively on the structure of the database provided by Orange, which is a
flat file database.
This thesis provides predictions about target variables, churn, appetency and up-selling, and will
not predict other information related to other fields within the database.
This work is designed to work solely with the Orange database but can be modified to fit
most classification tasks.
Chapter 2
Literature Review
For machine learning tasks, the training data consist of a set of examples. In supervised learning,
each example is a pair consisting of an input vector and a desired output vector, also called the
supervisory signal. A supervised learning algorithm analyzes the training data and produces an
inferred function, which can be used for mapping new examples (Russel & Norvig, 2010). This
thesis project falls into the supervised learning category. The chapter first reviews the widely
used supervised learning algorithms and states the strengths and weaknesses for each one. In
addition, this chapter studies ensemble learning, a class of algorithms that combines simple, mediocre learners into a much stronger classifier without requiring fundamentally new algorithms, and provides the background and foundation for Chapter 5.
2.1 Supervised Learning
In supervised learning, a label from the output vector is the explanation of its respective input
example from the input vector. The output vector consists of labels for each training example
present in the training data. These labels for the output vector are provided by the supervisor,
and that’s why this type of learning is called supervised learning (Mohammed, Khan, & Bashier,
2016). Two general groups of algorithms fall under the umbrella of supervised learning:
classification and regression. This project, predicting customer decisions, is a classification
problem.
2.1.1 Basic Naïve Bayes Classifier
The benchmark of the KDD Cup 2009 was provided based on the basic Naïve Bayes classifier.
Naïve Bayes classifiers are a group of simple probabilistic classifiers based on the application of
Bayes’ theorem with the assumption of strong (naïve) independence among the features (Good,
1965), and votes among features with a voting score capturing the correlation of the feature to
the target. This type of classifier has been widely studied since the 1950s and became a popular method for classification in the early 1960s (Russel & Norvig, 2010). The Naïve Bayes model is a good representative of generative models, as it models the joint distribution of the features and the target.
Compared to other classification algorithms, Naïve Bayes is easy to implement, and only
requires a small amount of training data to estimate the parameters necessary for classification. However, the class-conditional independence assumption causes a loss of accuracy. In practice, dependencies exist among variables, and these cannot be modelled by a Naïve Bayes classifier
(Rish, 2001).
2.1.2 Logistic Regression
Logistic regression was developed by statistician David Cox in 1958 (Cox, 1958). Logistic
regression generalizes the linear model by passing the linear combination used in linear regression through the sigmoid, or logistic, function to obtain a binary outcome. It follows the general form
y(x) = σ(wᵀx + w₀)
where x is the input vector (the independent variables), y is the output (the dependent variable), w is a parameter vector, and w₀ is a constant offset term. The sigmoid is defined as
σ(z) = 1 / (1 + e⁻ᶻ)
and is plotted in Figure 1.
Figure 1 Sigmoid Function
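As a minimal illustration of the formulas above (the weight vector, offset and input below are made-up numbers, not values from the Orange data), a binary logistic prediction can be written in a few lines of Python:

import numpy as np

def sigmoid(z):
    # Logistic function: maps any real-valued score into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.8, -1.2, 0.3])   # hypothetical parameter vector w
w0 = 0.5                         # hypothetical offset term
x = np.array([1.0, 0.2, 2.5])    # hypothetical input vector

p = sigmoid(np.dot(w, x) + w0)   # predicted probability of the positive class
label = 1 if p >= 0.5 else -1    # thresholding gives the binary outcome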
The binary logistic model is used to estimate the probability of a binary response based on one
or more predictor variables. This model is a good example of discriminative models, which
model the dependence of the target variables on the observed variables.
Logistic regression is incredibly easy to apply and very efficient to train. In addition, the
conditional probabilities are determined through the training process, which can be very
valuable in certain applications. Since it is a generalized linear model, it cannot solve non-linear
problems with its linear decision surface.
2.1.3 Decision Tree
The first decision tree algorithm, Automatic Interaction Detection (AID), was published in 1963
by Morgan and Sonquist to produce piecewise constant prediction of a regression function. In
1972, Messenger and Mandell introduced THeta Automatic Interaction Detection (THAID),
which extended the application to classification. Despite the novelty, AID and THAID did not
draw much interest within the statistics community. Later in 1984, Breiman et al. proposed the
Classification And Regression Trees (CART), which regenerated interest in this subject (Loh,
2014).
Decision tree (DT) classifies data in a dataset by flowing through a query flowchart-like tree
structure from the root through internal nodes (non-leaf nodes) until it reaches a leaf node
(terminal node), where each internal node denotes a test on an attribute, each branch represents
an outcome of the test, and each leaf holds a class label (Han, Kamber, & Pei, 2012). A typical
decision tree is shown in Figure 2. It predicts whether a citrus fruit is a lemon or an orange.
Figure 2 A decision tree to determine the species of a citrus fruit.
In general, decision trees are easy to fit, easy to use, and easy to interpret as a fixed sequence of
simple tests. They are non-linear, so they work much better than linear models for highly non-
linear functions. On the other hand, as the decision tree classifies by rectangular partitioning, it
does not handle nonnumeric data, and when dealing with data sets with a large number of features, the size of the tree can be rather large and the model may suffer from over-fitting.
2.1.4 Support Vector Machine
The fundamental algorithm of support vector machine was initially presented by Boser, Guyon
and Vapnik as a training algorithm for optimal margin classifiers in 1992 (Boser, Guyon, &
Vapnik, 1992), and was later published as Support Vector Networks by Cortes and Vapnik in
1995 for binary classification (Cortes & Vapnik, 1995).
In a nutshell, a support vector machine (SVM) first uses the kernel trick, essentially a mapping
function, to transform the original training data (input space) into a high-dimensional feature
space as shown in Figure 3. Within this feature space, it searches for the linear optimal
separating hyperplane, a “decision boundary” separating the tuples of one class from another.
As in Figure 4, the SVM finds this hyperplane using support vectors (“essential” training tuples)
and margins (defined by the support vectors).
Figure 3 How kernel trick transforms a nonlinear classification (Jain, 2017).
Figure 4 Maximum-margin hyperplane and margins for an SVM trained with samples from two
classes (Meyer, 2001).
SVMs provide a good out-of-sample generalization, if the hyperparameters are appropriately
chosen. By introducing the kernel, SVMs gain the flexibility of including expert knowledge via
engineering the kernel. An SVM is defined by a convex optimality problem which has efficient
solution methods. But, as SVM is a non-parametric technique, one major disadvantage is the
lack of transparency of results (Auria & Moro, 2008). Also, the choice of the kernel and the
determination of the hyperparameter are relatively important to avoid over-fitting.
2.1.5 Artificial Neural Network
The computational model for neural network was created by McCulloch and Pitts in 1943
(McCulloch & Pitts, 1943). Through the past 70 years, many algorithms, techniques and
hardware were designed to improve the simulation and accelerate the training of neural
networks, such as back propagation (Werbos, 1974), max-pooling (Weng, Ahuja, & Huang,
1992) and GPU implementations (Steinkraus, Simard, & Buck, 2005).
Figure 5 shows examples of (a) a neuron, and (b) a neural network.
An artificial neural network (ANN) is a system based on the operational paradigm of biological
neural networks. Generally, an artificial neural network is a system of “neurons”, each of which
represents a transfer function, and the commonly used transfer functions are the sigmoid and
logistic functions. This system has a structure that receives an input, processes the data, and
provides an output. Commonly, the input consists of a data array which can consist of any kind
of data that can be represented in an array. Once an input is presented to the neural network, and
a corresponding desired or target response is set at the output, an error is composed from the
difference between the desired response and the real system output. The error information is fed
back to the system, which makes adjustments to all of its parameters in a systematic fashion
(commonly known as the learning rule). This process is repeated until the desired output is
acceptable (Priddy & Keller, 2005). This model is widely adopted to estimate patterns from a
large set of inputs with a large portion of unknowns, such as face identification, object
recognition, etc.
2.1.6 K-Nearest Neighbors Algorithm
K-nearest neighbors (k-NN) algorithm is a nonparametric method used for classification and
regression. This algorithm is created primarily based on the nearest neighbor decision rule,
which was first formally introduced under that name by Cover and Hart in 1967 (Cover & Hart, 1967). Before that, similar rules were mentioned by Nilsson (1965) as "minimum distance
classifier” (Nilsson, 1965) and by Sebestyen (1962) as “proximity algorithm” (Sebestyen, 1962).
The first formulation of the nearest neighbor rule and analysis of its properties appears to have been made
by Fix and Hodges in their very early discussion on non-parametric discrimination in 1951 (Fix
& Hodges, 1951).
Figure 6 An Example of k-NN Classification
In k-NN classification, the output is a class membership. An object is classified by a majority
vote of its neighbors, with the object being assigned to the class most common among its k
nearest neighbors. Take the green square in Figure 6 as an example: when k = 5, it would be classified as a blue triangle; however, when k = 10, it would be classified as a red star.
The k-NN algorithm is the most basic of all Instance-Based Learning (IBL) methods, where the
function is only approximated locally and all computation is deferred until classification.
Though the k-NN algorithm is among the simplest of all machine learning algorithms, its
computational complexity makes it relatively expensive (in terms of both memory and time) to
work on large datasets (Mohri, Rostamizadeh, & Talwalkar, 2012).
2.2 Ensemble Learning
Ensemble methods are learning algorithms that construct a set of classifiers and then classify
new data points by taking a weighted vote of their predictions. The aim is to improve the
predictive performance of a given statistical learning or model fitting technique. The research of
ensembles was initiated at the end of the 1980s (Zhou, Ensemble Learning, 2009).
Statistically, by constructing an ensemble out of accurate classifiers, the algorithm can
“average” their votes and reduce the risk of choosing the wrong classifier. Computationally, an
ensemble constructed by running from different starting points provides a better approximation
to the true unknown function than any of the individual classifiers. Representationally, as the
true function for the variables may not be represented by any of the classifiers, by performing
the ensemble method, it is possible to expand the space of representable function (Dietterich,
2000).
Thus, the ensemble method is considered as the suitable technique to obtain a better predictive
performance.
The variance-bias decomposition is an important general tool for analyzing the performance of
learning algorithms, where the bias measures the error from erroneous assumptions in the
learning algorithm, and the variance measures the error from sensitivity to small variations in
the training set. Ensemble methods aim to reduce the generalization error in learning algorithms
focusing on these two aspects.
2.2.1 Bagging
Bagging (or Bootstrap aggregating) was proposed by Leo Breiman in 1994 to improve the
classification by combining classifications of randomly generated training sets.
Given a set, D, of d tuples, bagging works as follows: repeatedly draw n samples with
replacement from D; for each set of samples, estimate a statistic; the bootstrap estimate is the
mean (or the majority vote) of the individual estimates.
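A minimal sketch of this procedure in Python, assuming ±1 labels (as used for the Orange targets), decision trees as the base classifiers and numpy arrays as inputs; the function and parameter names are illustrative rather than the implementation used later in the thesis:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_predict(X_train, y_train, X_test, n_estimators=25, seed=0):
    # Train one tree per bootstrap sample and return the majority vote
    rng = np.random.RandomState(seed)
    n = len(X_train)
    votes = []
    for _ in range(n_estimators):
        idx = rng.randint(0, n, size=n)   # draw n samples with replacement from D
        tree = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
        votes.append(tree.predict(X_test))
    # Majority vote (sign of the sum) over the individual estimates
    return np.sign(np.sum(votes, axis=0) + 1e-9)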
The bagged classifier often has significantly greater accuracy than a single classifier derived
from D. The increased accuracy occurs because the composite model reduces the variance of the
individual classifiers without affecting bias, which means it reduces the sensitivity to individual
data points. For prediction, it was theoretically proven that a bagged predictor will always have
improved accuracy over a single predictor derived from D (Breiman, 1996).
2.2.2 Random Forest
Random forests (RF) are the ensemble of decision trees and the general method was first
proposed by Ho in 1995 (Ho, 1995). The algorithm follows similar steps as bagging: divide
training examples into multiple training sets (bagging), then train a decision tree on each set
(randomly selecting a subset of the variables to consider), and finally aggregate the predictions of each tree to make the classification decision.
Figure 7 The General Structure of a Random Forest (He, Chaney, Sheffield, & Schleiss, 2016)
As each decision tree can be trained in parallel, random forests are fairly efficient on large data
sets, and they can handle a large number of variables without variable deletion. In addition, they
give an estimate of which variables are more important in classification. For categorical variables with different numbers of levels, random forests are biased in favour of those attributes with
more levels. Therefore, the variable importance scores from random forests are not reliable for
this type of data.
2.2.3 Boosting
Boosting is a type of algorithm that converts the weaker learners to stronger ones. In 1989,
Kearns and Valiant raised an open question: what is the relationship between weakly learnable and strongly learnable problems (Kearns & Valiant, 1989). In 1990, Schapire proved that strong
and weak learnability are equivalent notions (Schapire, 1990). In other words, a weak learning
algorithm that works just slightly better than random guess can be “boosted” into an arbitrarily
accurate strong learning algorithm.
In boosting, weak classifiers are trained sequentially (Zhou, Ensemble Methods: Foundations
and Algorithms, 2012). Every time a weak classifier is trained, it is given knowledge of the
performance of previously trained classifiers: misclassified examples gain weight and correctly classified examples lose weight. Therefore, future classifiers will focus more on the mistakes from the previous learners. The final classifier is a weighted sum of the component weak
classifiers. Figure 8 shows the comparison between a single learner, bagging and boosting.
Figure 8 Bagging vs Boosting (ETS Asset Management Factory, 2016)
For simple models, an average of models has much greater capacity than a single model.
Boosting can reduce bias substantially by increasing capacity, and control variance by fitting
one component at a time.
There are many different approaches for boosting, some widely known ones are: AdaBoost,
LogitBoost and Gradient Boosting.
2.2.4 Gradient Boosting
Gradient boosting was generalized from adaptive boosting (AdaBoost) by Friedman in 1999
(Friedman, 2001). Gradient boosting transforms a set of weak learners to a strong learner with
the help of gradient descent optimization. At each stage a weak learner is fitted to the remaining
errors (also known as pseudo-residuals) of the current strong learner. Figure 9 shows the changes in
the residuals through a training process. At the first several iterations, the residuals are relatively
large and vary significantly between data groups. As the iteration increases to 18, the residuals
decrease to around zero and are around the same size. The residuals keep decreasing and
become stable around zero as the iteration comes to 50. Then, the contribution of the weak
learner to the strong one is computed using gradient descent to minimize the overall error of the
strong learner. The well-known AdaBoost is a special case of gradient boosting where the
sample distribution is modified to emphasize the hard cases and the contribution of the weak
learners is determined by their performance. One of the most used models is the gradient
boosting decision trees (GBDT), which can handle a mixture of feature types and does not require
feature scaling.
Figure 9 Residual Fitting Process in Gradient Boosted Decision Trees (Grover, 2017)
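The residual-fitting idea can be sketched directly. The toy implementation below assumes a squared-error loss and shallow regression trees; XGBoost, used later in the thesis, implements a far more elaborate and regularized version of the same idea:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_rounds=50, learning_rate=0.1):
    # Each round fits a small regression tree to the current pseudo-residuals
    prediction = np.full(len(y), y.mean())   # start from a constant model
    trees = []
    for _ in range(n_rounds):
        residuals = y - prediction           # pseudo-residuals for squared error
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return trees, y.mean()

def gradient_boost_predict(trees, base, X, learning_rate=0.1):
    pred = np.full(len(X), base)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred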
2.2.5 Ensemble Selection
Ensemble selection is a method to construct ensembles from a library of various models
proposed by Caruana, Niculescu, Crew and Ksikes in 2004. It selects models added to the
ensemble in a forward stepwise manner to maximize the performance.
Selection starts with a library of desirable models and an empty ensemble. For each iteration, a
model in the library that maximizes the ensemble's performance with respect to the error metric on the
validation set is added to the ensemble. The selection is repeated until it reaches a fixed number
of iterations or all models have been used (Caruana, Niculescu, Crew, & Ksikes, 2004).
Compared to bagging and boosting, where the weight of each model needs to be determined
manually, ensemble selection automatically weights the selected models based on their ability to
improve the performance of the ensemble, as determined by the error metric.
Chapter 3
Data
The Orange dataset was provided by a French multinational telecommunication corporation
Orange S.A. for the 2009 Knowledge Discovery in Data Competition (KDD Cup 2009). To
protect the privacy of the customers whose records were used, the data were anonymized by
replacing actual text or labels by meaningless codes and not revealing the meaning of the
variables.
The dataset used throughout the thesis is the training set of the competition, which contains
50,000 instances described by 15,000 variables (a 50,000×15,000 matrix), the first 14,740 of which
are numerical and the last 260 are categorical, and three binary target variables corresponding to
churn, appetency and up-selling.
Figure 10 Frequency Distribution of Variable 2382
All 15,000 variables of the dataset are analyzed through frequency distributions to determine the
characteristics among the variables. Most numerical variables have skewed distributions as
shown in Figure 10 and 11. In addition, many numerical variables have a common factor, which
is an indication that these variables are artificially encoded. Take variable 2382 as an example,
all the values are multiples of three, as shown in Figure 10. Similarly, the values for variable
1933 increment by two.
Figure 11 Frequency Distribution of Variable 1933
Another observation is that many variables only have a few discrete values. For example, about
50% of all numerical variables have fewer than 3 discrete values, approximately 80% of all
categorical variables have fewer than 10 categories and 12% of numerical variables and 28% of
categorical variables are constant. Furthermore, numerical values are heavily populated by 0s. It
was discovered that 80% of the numerical variables have more than 98% 0s. These results
suggest that a large number of variables can be removed since they are constant or close to
constant.
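A sketch of how such a profile can be computed with pandas; the file name, separator and thresholds below are placeholders rather than the exact code used for the analysis:

import pandas as pd

df = pd.read_csv("orange_train.data", sep="\t")

n_unique = df.nunique(dropna=True)   # number of distinct values per variable
zero_frac = (df == 0).mean()         # fraction of zeros per variable

print("constant variables:", (n_unique <= 1).sum())
print("variables with fewer than 3 distinct values:", (n_unique < 3).sum())
print("variables with more than 98% zeros:", (zero_frac > 0.98).sum())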
Table 1 Frequency Distribution of Target Variables
Churn Appetency Up-selling Frequency Percentage
-1 -1 -1 41756 83.51%
-1 -1 1 3682 7.36%
-1 1 -1 890 1.78%
1 -1 -1 3672 7.34%
The frequency distributions of the three target variables are shown in Table 1. All three targets are highly imbalanced, with only 1-7% positive cases. There is no overlap between any pair of labels; all the instances within the dataset have at most one positive
target variable.
Chapter 4
Technology
This chapter discusses the programming language and platform used in the thesis and justifies
the advantages of choosing them. This chapter also includes the packages, libraries and APIs
that are used in the thesis.
4.1 Python
Python is a very popular open source programming language, created by Guido van Rossum and
first released in 1991. While many languages are used in data science, for instance, C++, Java,
R, and MATLAB, Python is dominant; codeeval.com rated Python “the most popular language”
for the fifth year in a row (CodeEval, 2016). Python, an interpreted language, has a design
philosophy which emphasizes code readability, and a syntax which allows programmers to
express concepts in fewer lines of code than possible in languages such as C++ or Java
(Summerfield, 2007).
Furthermore, Python has packages for almost any conceivable math function for machine
learning. And most of the popular machine learning libraries are either written in Python (scikit-
learn, TensorFlow) or have Python bindings (Caffe, OpenCV). Python 2.7 is used for all
programming in this thesis.
4.2 Libraries
4.2.1 Scikit-learn
Scikit-learn is a free machine learning library for the Python programming language initiated by
David Cournapeau in 2007. It features various classification, regression and clustering algorithms, including Naïve Bayes, support vector machines, decision trees, ensemble methods, clustering, etc. It is designed to interoperate with the Python numerical and scientific libraries
NumPy and SciPy. This package focuses on bringing machine learning to non-specialists using
a general-purpose high-level language (Pedregosa, et al., 2011).
4.2.2 TensorFlow
In November 2015, Google released TensorFlow, an open source deep learning software library
for defining, training and deploying machine learning models. It provides support for both the
research and the engineering sides in Google, as it can advance the state of the art on existing
problems and bring understanding to new problems as well as take the insight from the research
community to enable innovative products and product features (Google, 2015). Aside from
supporting the internal product in Google, TensorFlow provides a platform for collaboration and
communication among researchers. There are numerous libraries and open source projects built on top of TensorFlow that allow a clearer understanding and more accessible applications of deep
learning.
4.2.3 TFlearn
TFlearn is a modular and transparent deep learning library built on top of TensorFlow. It was
designed to provide a higher-level API to TensorFlow in order to facilitate and speed-up
experimentations, while remaining fully transparent and compatible with it.
Compared to TensorFlow, TFlearn allows fast prototyping through highly modular built-in
neural network layers, regularizers, optimizers, and metrics. The high-level API currently
supports most recent deep learning models, such as convolutional neural networks (CNN), bidirectional recurrent neural networks (BRNN), batch normalization, PReLU, residual networks, generative adversarial networks (GANs), etc.
4.2.4 XGBoost
XGBoost is an open-source software library which provides the gradient boosting framework
for C++, Java, Python, R, and Julia. XGBoost initially started as a research project by Tianqi
Chen as part of the Distributed (Deep) Machine Learning Community (DMLC) group and was
first released on March 27, 2014. XGBoost is designed to provide an efficient, flexible and
portable gradient boosting library that works on major distributed environments (Hadoop, SGE,
MPI) and solves data science problems in a fast and accurate manner (Distributed (Deep)
Machine Learning Community, 2015).
4.3 Cloud Computing
Cloud computing was defined by National Institute of Standards and Technology in 2011 as “a
model for enabling convenient, on-demand network access to a shared pool of configurable
computing resources (e.g., networks, servers, storage, applications, and services) that can be
rapidly provisioned and released with minimal management effort or service provider
interaction” (Mell & Grance, 2011).
Cloud computing in general provides a more cost-efficient system for data centralization with
better software integration and various access options. The main reason the thesis work is run on the cloud is that the models can run without interference from other tasks on the computer, which helps guarantee memory allocation and increases the computation
speed.
The cloud used in the thesis is the IBM Data Scientist Workbench, a virtual lab environment for
people to practice data science and cognitive computing. The free access was provided through
the registration of Cognitive Class (https://cognitiveclass.ai), formerly known as the Big Data
University initiated by IBM in 2010. The workbench provides elastic compute environments
with the best possible capacity of 16 vCPU and 64 GB RAM.
Chapter 5
Methodology
This chapter explains the steps employed to achieve the objectives of the thesis. It is organized
as follows: section 5.1 introduces how the models are evaluated, section 5.2 describes how the data are processed, section 5.3 explains the models used in the thesis and how they are constructed, section 5.4 covers parameter optimization, and section 5.5 examines feature importance and model performance.
5.1 Evaluation Score: Area Under the Curve (AUC)
One of the main objectives of the thesis is to make good predictions of the target variables. The
prediction of each target variable is thought of as a separate classification problem. The results
of classification, obtained by thresholding the prediction score, may be represented in a
confusion matrix, where tp (true positive), fn (false negative), tn (true negative) and fp (false
positive) represent the number of examples falling into each possible outcome, as shown in
Table 2.
Table 2 Confusion matrix of a binary classification
Predictions
Class +1 Class –1
Truth Class +1 tp fn
Class –1 fp tn
The results will be evaluated with the Area Under Curve, which corresponds to the area under
the curve obtained by plotting sensitivity against specificity by varying a threshold on the
prediction values to determine the classification result. Another common curve used in machine
learning to determine the diagnostic ability of a binary classifier system is the receiver operating
characteristic (ROC) curve, created by plotting the true positive rate against the false positive
rate. The ROC curve evaluates a classifier based on its performance only on the predictions of
the true target, but this method is problematic especially when the data is highly skewed
(Swamidass, Azencott, Daily, & Baldi, 2012). The AUC used in the thesis represents the trade-off between sensitivity and specificity and is therefore more suitable for model evaluation using
the Orange dataset.
The AUC is calculated using the trapezoid method. When binary scores are supplied for the classification instead of discriminant values, the curve is given by {(0,1), (tn/(tn+fp), tp/(tp+fn)), (1,0)} and the AUC is simply the Balanced ACcuracy (BAC), i.e. the average of the true positive rate (sensitivity) and the true negative rate (specificity). The widely used average accuracy may give a misleading idea of generalization performance when a classifier is tested on an imbalanced dataset, and such a shortcoming can be overcome by replacing the average accuracy with the balanced accuracy (Brodersen, Ong, Stephany, & Buhmann, 2010).
Figure 12 Area Under Curve and Balanced Accuracy for Sensitivity vs Specificity
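For binary predictions the score therefore reduces to a one-line calculation. The sketch below is a minimal balanced-accuracy helper; the counts in the example call are made up for illustration, not results from the thesis:

def balanced_accuracy(tp, fn, tn, fp):
    # AUC of the sensitivity-vs-specificity curve when only binary predictions are available
    sensitivity = tp / float(tp + fn)   # true positive rate
    specificity = tn / float(tn + fp)   # true negative rate
    return 0.5 * (sensitivity + specificity)

print(balanced_accuracy(tp=150, fn=210, tn=4330, fp=310))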
5.2 Data Processing
5.2.1 Data Preprocessing
A significant portion of the variables are highly skewed; in other words, some variables contain a considerable amount of missing values. Many techniques are available to address the
missing value problem, such as deletion, mean/mode substitution, maximum likelihood
estimation, Bayesian estimation, multiple imputation, etc. (Enders, 2010). In this thesis, the
missing values are either substituted by mean or mode, depending on the type of variables, or
considered as a standalone entry ‘missing’.
After handling the missing values, the dataset was cleaned by removing the 1531 constant
variables and 5874 quasi-constant variables (where a single value occupies more than 99.98%
of the population).
For categorical data, although most of the features have fewer than 10 categories, about 5% of them have more than 100 categories; in such cases, the categorical variables with more than 100 distinct values were grouped into 20 categories.
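A sketch of these preprocessing steps with pandas is shown below; the thresholds and the handling of rare categories are illustrative (keeping the most frequent levels plus an 'other' group), not a reproduction of the thesis's exact code:

import pandas as pd

def preprocess(df, quasi_constant_threshold=0.9998, max_levels=100, n_bins=20):
    df = df.copy()
    for col in df.columns:
        if df[col].dtype == object:
            # Categorical: treat missing values as a standalone 'missing' category
            df[col] = df[col].fillna("missing")
            # Group the levels of very high-cardinality variables into n_bins categories
            if df[col].nunique() > max_levels:
                top = df[col].value_counts().index[:n_bins - 1]
                df[col] = df[col].where(df[col].isin(top), "other")
        else:
            # Numerical: substitute missing values with the column mean
            df[col] = df[col].fillna(df[col].mean())
    # Drop constant and quasi-constant variables
    keep = [c for c in df.columns
            if df[c].value_counts(normalize=True, dropna=False).iloc[0]
            < quasi_constant_threshold]
    return df[keep]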
5.2.2 Encoding Categorical Data
Some models in section 5.3 do not handle categorical data, such as Naïve Bayes Classifier and k
Nearest Neighbors. Thus, in addition to the binning process above, an encoding process of
converting categorical data into numerical data was performed for each of the categorical
variables. The encoding methods used are: ordinal/numeric encoding, one hot encoding and
binary encoding, which are visualized in Figures 13, 14 and 15 respectively.
Ordinal encoding simply converts each value in the column to a unique number. This is
the simplest encoding method. However, for most categorical data, the category assigned a larger number is not necessarily more "important" or "heavier-weighted" than the category with a smaller number, and this would most likely lead to misinterpretation by some of the algorithms.
This method was used when there are only a small number of categories for the advantage of
easy implementation.
Figure 13 A Numeric/Ordinal Encoding Example (Laurae, 2017)
A common alternative approach is one hot encoding, which converts each category value into a
new binary column by assigning 1 or 0 to the corresponding column. This avoids improper
weighting of categories but has the downside of including more columns in the dataset, thus
increasing the potential computational complexity for future analysis. This method is used when
the number of categories is relatively small.
Figure 14 A One Hot Encoding Example (Laurae, 2017)
The binary encoding first assigns a unique value to each category, then converts that value into its binary representation. Because each additional binary digit doubles the number of representable values, a feature with N cardinalities (categories) can be stored using ceil(log(N+1)/log(2)) features. This will significantly reduce the memory
increase from one hot encoding while ensuring the representation of each category. However,
the encoding process is the most complicated among the three methods mentioned and takes the
longest to encode. This method is used for data with a large number of categories.
Figure 15 A Binary Encoding Example (Laurae, 2017)
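The three encodings can be compared on a small toy column; the column content, prefix and bit-column names below are invented for illustration:

import math
import pandas as pd

colors = pd.Series(["red", "green", "blue", "green", "red"])

# Ordinal/numeric encoding: one integer code per category (1, 2, 3, ...)
ordinal = colors.astype("category").cat.codes.values + 1

# One hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(colors, prefix="color")

# Binary encoding: write the ordinal code in base 2, one column per bit
n_bits = int(math.ceil(math.log(colors.nunique() + 1, 2)))
binary = pd.DataFrame({"bit_%d" % b: (ordinal >> b) & 1 for b in range(n_bits)})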
5.2.3 Feature Selection
No feature selection was performed before obtaining the results from the classification models. Some of the models used later are based on the construction of decision trees, where the trees are split according to the significance of attributes as determined by information gain or relative entropy. These attribute significances were later used as a form of feature selection to examine the performance of different models as the number of variables changes.
5.3 Classification Methods
5.3.1 Naïve Bayes Classification
Naive Bayes Classifier is a classification technique based on Bayes’ Theorem with an
assumption of strong (naive) independence between the features. In simple terms, a Naïve Bayes
classifier assumes that the presence of a particular feature in a class is unrelated to the presence
of any other feature. Gaussian Naïve Bayes additionally assumes that the features follow a normal
distribution.
The advantage is that Naive Bayes Classifier is easy to implement and fast to predict. It also
performs well in multiclass classification. One major limitation of Naive Bayes is the
assumption of independent features. In real life, it is almost impossible to get a dataset whose
fields are completely independent. Thus, Gaussian Naive Bayes classifier is chosen as the
benchmark for the later models.
The Naïve Bayes classification is constructed using the GaussianNB class from the naive_bayes module in the scikit-learn library.
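A minimal sketch of how this benchmark can be set up is shown below; synthetic data stand in for the encoded Orange matrix, and ROC-AUC scoring is used here only as a convenient stand-in for the evaluation described in section 5.1:

from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

# Synthetic, highly imbalanced stand-in for the features and one target (e.g. churn)
X, y = make_classification(n_samples=5000, n_features=50, weights=[0.93],
                           random_state=0)

benchmark = GaussianNB()
scores = cross_val_score(benchmark, X, y, cv=10, scoring="roc_auc")
print("mean AUC over 10 folds:", scores.mean())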
5.3.2 Random Forest
Random forest is an ensemble learning method that operates by constructing a multitude of
decision trees at training time and outputting the class that is the mode of the classes of the
individual trees. Each tree is trained independently, using a random sample of the data. This
randomness helps to make the model more robust than a single decision tree, and less likely to
overfit on the training data.
Random decision forests correct for decision trees' habit of overfitting to their training set, as
well as decreasing test error by lowering prediction variance. It also naturally handles missing
and categorical values. Similar to most ensemble methods, random forest is a black box
algorithm, so it is not easy to visually interpret. Another main limitation is that a large number
of trees may make the algorithm slow for real-time prediction. Random forest is chosen as it is
an ensemble method with reasonably good accuracy.
Random forest classifier is constructed using the function RandomForestClassifier under the
Ensemble Methods in scikit-learn (Scikit-learn, 2017). The model is tuned using grid search to
determine the optimal number of trees in the forest, the number of features to consider when
looking for the best split, the maximum depth of the trees, etc.
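A sketch of such a tuned random forest, reusing the synthetic X and y from the previous sketch; the grid values are illustrative and not the values actually searched in the thesis:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 300, 500],    # number of trees in the forest
    "max_features": ["sqrt", "log2"],   # features considered at each split
    "max_depth": [None, 10, 20],        # maximum depth of the trees
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, scoring="roc_auc", cv=5)
search.fit(X, y)
print(search.best_params_)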
5.3.3 Gradient Boosting Decision Tree
Gradient boosting is a machine learning technique, which produces a prediction model in the
form of an ensemble of weak prediction models, typically decision trees. Tianqi Chen, the
creator of XGBoost, explained the detailed mathematical formulations and algorithm
construction in his 2016 KDD conference paper (Chen & Guestrin, 2016). In general, gradient
boosting decision tree builds the model in a stage-wise fashion like other boosting methods do,
and it generalizes them by allowing optimization of an arbitrary differentiable loss function.
Gradient boosting decision tree (GBDT) builds trees one at a time, where each new tree helps to
correct errors made by previously trained trees. With each tree added, the model becomes even
more expressive.
Gradient boosting decision trees have similar advantages and disadvantages to random forests,
as they are both ensemble methods based on decision trees. Gradient boosting decision trees
usually have higher accuracy, do not require feature scaling and can handle a mixture of
features. However, the training times are usually longer as they require significant computation,
the final model can be hard to understand, and it is prone to overfitting and requires careful
tuning. Gradient Boosting Decision Tree is selected as it is generally a better learner than
Random Forest. Gradient boosting decision tree classifier is trained using the XGBoost library.
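A sketch of the XGBoost classifier, again on the synthetic stand-in data; the hyperparameter values are illustrative and the tuned values from section 5.4 are not shown here:

import xgboost as xgb
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

gbdt = xgb.XGBClassifier(n_estimators=500, max_depth=6, learning_rate=0.05,
                         subsample=0.8, colsample_bytree=0.8)
gbdt.fit(X_train, y_train)
scores = gbdt.predict_proba(X_test)[:, 1]   # scores later thresholded for the AUC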
5.3.4 Ensemble Selection
Ensemble selection is essentially a bagging-like process in which models are added only when they improve the results, rather than taking a simple average of outcomes from a set of predetermined model types
(Caruana, Niculescu, Crew, & Ksikes, 2004). Ensemble selection can include other ensemble
methods as one of the base models in its library and further improve the result through the
selection process. On the other hand, this means the training of this model will take considerably
more time than all the base models as each model included in the model library is trained
separately and each would require specific tuning to achieve the best result.
5.3.4.1 Pseudocode
The pseudocode is created based on the understanding of Caruana et al.’s papers in the
construction and optimizing ensemble selection (Caruana, Niculescu, Crew, & Ksikes, 2004)
(Caruana, Munson, & Niculescu-Mizil, Getting the Most Out of Ensemble Selection, 2006).
Input: A library of classification models Ω, a training dataset T, a validation dataset V,
maximum iteration number n, predetermined performance metrics
Output: An ensemble E of models from the library Ω
Procedure:
1. Initiate with an empty ensemble E
2. Train all the models in Ω based on the training set T
3. for i ← 1 to a predefined iteration number n do
4. Perform predictions on validation set V with all the models
5. Evaluate the performance of adding each model (Ωi) to the ensemble according to the performance metrics
6. Add the model that maximizes the performance to the ensemble
7. end
8. output updated ensemble E
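A minimal Python sketch of this pseudocode is given below. It assumes the library is a list of already fitted scikit-learn-style classifiers with predict_proba, uses ROC-AUC on the validation set as the performance metric, and selects models with replacement; the names are illustrative, not the thesis's implementation:

import numpy as np
from sklearn.metrics import roc_auc_score

def ensemble_selection(library, X_val, y_val, n_iter=50):
    # Cache each model's validation scores so the forward selection is cheap
    preds = [m.predict_proba(X_val)[:, 1] for m in library]
    ensemble, ensemble_sum = [], np.zeros(len(y_val))
    for _ in range(n_iter):
        best_score, best_idx = -np.inf, None
        for i, p in enumerate(preds):
            # Score of the ensemble if model i were added (simple average of scores)
            score = roc_auc_score(y_val, (ensemble_sum + p) / (len(ensemble) + 1))
            if score > best_score:
                best_score, best_idx = score, i
        ensemble.append(library[best_idx])
        ensemble_sum += preds[best_idx]
    return ensemble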
5.3.4.2 Model Library
Naïve Bayes: The same Naïve Bayes classifier used as benchmark was included in the library.
Logistic regression is included as a simple machine learning model under the discriminative
class. Ng and Jordan compared logistic regressions and Naïve Bayes classifiers and concluded
that when the training size reaches infinity the discriminative model performs better than the
generative model. On the other hand, the generative model reaches its asymptote faster than the
discriminative model (Ng & Jordan, 2001). Two types (L1 and L2) of regularized logistic
regression were included using the function LogisticRegression under the Generalized Linear
Models in sklearn.
Support vector machine and L2-regularized logistic regression are closely related. It is possible
to derive SVM by asking a logistic regression to make the right decisions. They differ in that
logistic regression seeks to maximize the likelihood while support vector machines seek to
minimize the constraint violations (Schmidt, 2009). When using the kernel trick, SVMs have better scalability and can produce sparse solutions. The functions LinearSVC, NuSVC and SVC under the Support Vector Machines in sklearn were used to train different types of SVMs:
linear SVMs and kernel SVMs with different kernel types.
K Nearest Neighbors algorithm is an instance-based learner and does not produce a model. It is
generally the easiest to understand but is computationally intensive to classify large data sets.
Function KNeighborsClassifier under the Nearest Neighbors in sklearn was used to implement
several different types of kNN models, including different weight functions (uniform and
distance) and different algorithms to compute the nearest neighbor.
Traditionally, clustering is regarded as unsupervised learning, as opposed to supervised learning. The goal of clustering is to determine the internal grouping in a set of data. It is still possible to solve supervised classification problems using clustering (Eick, Zeidat, & Zhao, 2004). Several clustering functions under the Clustering models in sklearn were used, including k-means clustering, co-clustering, bi-clustering, etc.
Random Forest Classifier: The same Random Forest Classifier in section 5.3.2 was included in
the library.
AdaBoost is a special case of the gradient boosting models, where boosting is performed by re-weighting problematic cases. The model was programmed using the function
AdaBoostClassifier under the Ensemble Methods in sklearn.
Gradient Boosting Decision Tree: The same Gradient Boosting Decision Tree in section 5.3.3
was included in the library.
Deep neural networks have become very popular in the last several years, especially for their outstanding power in computer vision and natural language processing. Even so, Dreiseitl and Ohno-Machado's review showed that for general data classification scenarios neural networks and logistic regression perform at about the same level more often than not, with the more flexible neural
networks generally outperforming logistic regression in the remaining cases (Dreiseitl & Ohno-
Machado, 2002). Raschka also suggested the strengths of neural networks are in image
classification, natural language processing, and speech recognition (Raschka, 2016). Neural
network models were built using the TFLearn libraries, models with different structures are
included in the libraries.
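As a concrete illustration, the model library can be assembled as a simple mapping from names to estimators. The sketch below is indicative only: the hyperparameter values are placeholders rather than the tuned settings reported later, GaussianNB stands in for the benchmark Naïve Bayes classifier, the clustering-based and TFLearn neural network members are omitted, and XGBClassifier represents the gradient boosting decision tree.

    # Illustrative construction of the model library (hyperparameter values are placeholders).
    from sklearn.naive_bayes import GaussianNB
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import LinearSVC, SVC, NuSVC
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
    from xgboost import XGBClassifier

    library = {
        "naive_bayes":   GaussianNB(),
        "logreg_l1":     LogisticRegression(penalty="l1", solver="liblinear"),
        "logreg_l2":     LogisticRegression(penalty="l2"),
        "svm_linear":    LinearSVC(),
        "svm_rbf":       SVC(kernel="rbf", probability=True),
        "svm_nu":        NuSVC(probability=True),
        "knn_uniform":   KNeighborsClassifier(n_neighbors=15, weights="uniform"),
        "knn_distance":  KNeighborsClassifier(n_neighbors=15, weights="distance"),
        "random_forest": RandomForestClassifier(n_estimators=500, class_weight="balanced"),
        "adaboost":      AdaBoostClassifier(n_estimators=200),
        "gbdt":          XGBClassifier(n_estimators=200),
    }

    # Step 2 of the selection procedure: fit every library member on the training set.
    # for clf in library.values():
    #     clf.fit(X_train, y_train)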
5.4 Parameter Optimization
The same kind of machine learning model can require different constraints, weights or learning
rates to be applied on different dataset. These measures are called hyperparameters and have to
be tuned so that the model can optimally solve the machine learning problem. The idea of
parameter optimization or tuning aims to find a tuple of hyperparameters that optimizes the
performance of the model, which is usually evaluated through loss functions. Hyperparameters
can significantly impact performance and tuning the parameters for the models is as important
as choosing the right models.
In this thesis, all models mentioned in section 5.3 are carefully tuned by grid search, where the optimal hyperparameters are determined through an exhaustive search over a parameter grid and evaluated with the help of cross-validation.
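A minimal sketch of this procedure for the random forest model is shown below; the scikit-learn GridSearchCV utility is assumed, and the grid values and the roc_auc scorer are illustrative rather than the exact grid used in the thesis.

    # Illustrative grid search with cross-validation (grid values are examples only).
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    param_grid = {
        "n_estimators":      [300, 500, 600],
        "max_depth":         [50, 60, 75],
        "max_features":      ["sqrt"],
        "min_samples_leaf":  [1, 3, 5],
        "min_samples_split": [5, 10],
        "class_weight":      ["balanced"],
    }

    search = GridSearchCV(
        RandomForestClassifier(n_jobs=-1),
        param_grid,
        scoring="roc_auc",   # AUC-style score; the exact scorer used in the thesis is assumed
        cv=5,                # every hyperparameter tuple is evaluated by cross-validation
    )
    # search.fit(X_train, y_train)
    # best_model, best_params = search.best_estimator_, search.best_params_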
5.5 Feature Importance and Model Performance
After training all models and obtaining the first-round results, a further feature selection analysis, based on the feature importance determined by the GBDT, was applied to the GBDT and random forest models for the three target variables. This analysis aims to determine how the prediction changes as the number of features increases and whether the prediction accuracy converges once the number of features reaches a certain range.
Chapter 6
Results and Analysis
This chapter is organized according to the models introduced in section 4.2. It provides discussion regarding the optimal target audience based on the results from all models, and further explores how feature selection (feature importance) can help with the prediction process.
6.1 Classification Results
The classification results are determined as the average over a 10-fold cross-validation, and each target variable is trained separately in this section. The following results were achieved using the models discussed in section 5.3.
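For reference, a minimal sketch of how the averaged confusion matrices reported below can be produced is shown here; the StratifiedKFold splitter, the random seed and the assumption that the features and labels are NumPy arrays are illustrative choices, not necessarily those used in the thesis.

    # Illustrative 10-fold cross-validation that averages the per-fold confusion matrices.
    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn.metrics import confusion_matrix

    def averaged_confusion_matrix(model, X, y, n_splits=10):
        # X and y are assumed to be NumPy arrays, with classes coded +1 / -1.
        folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
        matrices = []
        for train_idx, test_idx in folds.split(X, y):
            model.fit(X[train_idx], y[train_idx])
            pred = model.predict(X[test_idx])
            # labels=[1, -1] matches the row/column order of the tables in this chapter
            matrices.append(confusion_matrix(y[test_idx], pred, labels=[1, -1]))
        return np.mean(matrices, axis=0)   # fractional counts, as in Tables 3 to 16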
6.1.1 Naïve Bayes Classification
For the Naïve Bayes Classifier, the confusion matrices shown in Tables 3 to 5 were obtained for each of the target variables.
Table 3 Confusion Matrix for Churn from Naïve Bayes Classifier

                             Naïve Bayes Churn Predictions
                             Class +1        Class –1
  Truth      Class +1        157.4           209.8
             Class –1        304.8           4328
For churn the true positive rate is 34.05%, true negative rate is 95.38%, and the corresponding
AUC is 0.6472.
Table 4 Confusion Matrix for Appetency from Naïve Bayes Classifier

                             Naïve Bayes Appetency Predictions
                             Class +1        Class –1
  Truth      Class +1        40.2            48.8
             Class –1        92.6            4818.4
For appetency the true positive rate is 30.27%, true negative rate is 98.99%, and the
corresponding AUC is 0.6463.
Table 5 Confusion Matrix for Upselling from Naïve Bayes Classifier

                             Naïve Bayes Upselling Predictions
                             Class +1        Class –1
  Truth      Class +1        117.7           121.2
             Class –1        250.5           4510.6
For upselling the true positive rate is 49.27%, true negative rate is 94.74%, and the
corresponding AUC is 0.7200.
As mentioned in section 5.3.1, the Naïve Bayes Classifier was considered the benchmark for all the following ensemble methods, and the results achieved by applying the Naïve Bayes method match the results provided by the Orange company, which confirms that the data preprocessing is valid. In addition, as the data is highly skewed, all of the target variables have much higher true negative rates compared to true positive rates. In the worst-case scenario for data as imbalanced as this, the naive guess would be to predict all cases as negative, which still gives a very good average accuracy (>90%) but completely defeats the purpose of the prediction. Similarly, since the Naïve Bayes classifier is one of the simpler classifiers, it is more accurate in predicting the category with the larger amount of training data.
Furthermore, as the AUC is calculated using the balanced accuracy, the scores for all the models show equal influence from the true positive and true negative cases, which is more meaningful for finding the target customers this thesis focuses on.
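The rates quoted throughout this chapter follow directly from the averaged confusion matrices; a minimal sketch of the computation is given below. The row/column layout is assumed to match the tables above, and the balanced accuracy shown is one common way to summarize a single operating point, which may differ slightly from the per-fold averaging actually used.

    # Generic computation of the rates reported in this chapter from a 2x2 confusion matrix.
    # Assumed layout: rows = truth (+1, -1), columns = predictions (+1, -1).
    def rates(cm):
        tp, fn = cm[0][0], cm[0][1]
        fp, tn = cm[1][0], cm[1][1]
        tpr = tp / (tp + fn)              # true positive rate (sensitivity)
        tnr = tn / (tn + fp)              # true negative rate (specificity)
        balanced_acc = (tpr + tnr) / 2.0  # balanced accuracy, giving both classes equal weight
        return tpr, tnr, balanced_acc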
The running time for Naïve Bayes Classifier was about 20 min per fold.
6.1.2 Random Forest
For the Random Forest Classifier, the confusion matrices shown in Tables 6 to 8 were obtained for each of the target variables.
Table 6 Confusion Matrix for Churn from Random Forest

                             Random Forest Churn Predictions
                             Class +1        Class –1
  Truth      Class +1        151.9           174.5
             Class –1        215.3           4458.3
For churn the true positive rate is 46.54%, true negative rate is 95.39%, and the corresponding
AUC is 0.7097.
Table 7 Confusion Matrix for Appetency from Random Forest

                             Random Forest Appetency Predictions
                             Class +1        Class –1
  Truth      Class +1        56.9            26.6
             Class –1        32.1            4884.4
For appetency the true positive rate is 68.14%, true negative rate is 99.35%, and the
corresponding AUC is 0.8374.
Table 8 Confusion Matrix for Upselling from Random Forest

                             Random Forest Upselling Predictions
                             Class +1        Class –1
  Truth      Class +1        130.8           47.3
             Class –1        237.4           4584.5
For upselling the true positive rate is 73.44%, true negative rate is 95.08%, and the
corresponding AUC is 0.8426.
The optimal hyperparameters for each of the targets are shown in Table 9.
Table 9 Optimal Hyperparameters for Targets

  Hyperparameters      Churn      Appetency   Upselling
  bootstrap            True       True        True
  class_weight         balanced   balanced    balanced
  criterion            gini       gini        gini
  max_depth            50         75          60
  max_features         sqrt       sqrt        sqrt
  min_samples_leaf     3          1           3
  min_samples_split    10         5           10
  n_estimators         500        500         600
  n_jobs               -1         -1          -1
bootstrap=True means bootstrap samples are used when building the trees.
criterion='gini' is one of the functions used to measure the quality of a split; another common choice is 'entropy'.
n_jobs represents the number of jobs to run in parallel for both fit and predict; when it is equal to -1, the number of jobs is set to the number of cores.
n_estimators represents the number of trees in the forest. When the number is relatively small, adding trees improves the performance gradually, but eventually the result converges to an optimal value beyond which increasing the number of trees in the forest does not affect the result significantly.
max_depth represents the maximum depth of the trees. The deeper/bigger a tree is, the more features it considers when it is constructed, so it is generally more desirable to grow large individual trees for high-dimensional problems. However, growing large trees does not always give the best performance, especially when dealing with noisy data (Lin & Jeon, 2006). Hence, the models all included an upper bound on the tree depth.
min_samples_leaf represents the minimum number of samples required at a leaf node, and min_samples_split represents the minimum number of samples required to split an internal node. These two values are closely related, as together they determine the general structure of a leaf. As analyzed in Chapter 3, appetency is relatively more imbalanced than churn and upselling, with about 70% fewer positive cases, which makes it reasonable for appetency to have a lower requirement to split a node or create a leaf.
max_features represents the number of features to consider when looking for the best split, and the best result was obtained with √(number of features). This significantly reduces the time needed to train the model, especially when the size of the data set and the number of trees in the forest are large.
When class_weight is set to 'balanced', the model uses the values of the target to automatically adjust weights inversely proportional to the class frequencies in the input data. As the dataset is highly skewed, it is wise to 'balance' the weights through this process.
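For illustration, the churn column of Table 9 can be expressed directly as a scikit-learn estimator; the sketch below simply instantiates those settings and is not intended as the exact training script used in the thesis.

    # The tuned churn settings from Table 9, expressed as a scikit-learn estimator (sketch).
    from sklearn.ensemble import RandomForestClassifier

    rf_churn = RandomForestClassifier(
        bootstrap=True,
        class_weight="balanced",
        criterion="gini",
        max_depth=50,
        max_features="sqrt",
        min_samples_leaf=3,
        min_samples_split=10,
        n_estimators=500,
        n_jobs=-1,
    )
    # rf_churn.fit(X_train, y_churn_train)      # training data names are placeholders
    # churn_pred = rf_churn.predict(X_test)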
Compared to the Naïve Bayes classifier, there were obvious improvements in the true positive rate for all three targets, with the most significant rise in appetency. The increase in sensitivity led to the increase in AUC score, as the change in specificity is relatively small.
For both appetency and upselling the number of true positive cases increases, whereas the number of true positive cases for churn decreases slightly. This is because the Naïve Bayes classifier has the lowest true positive rate for churn: an excessive number of samples were classified as +1 when they are actually -1. The decrease in positive predictions for churn indicates that the tree structure is comparatively better at classifying such cases.
The running time for Random Forest was approximately 4 hours per fold.
6.1.3 Gradient Boosting Decision Trees
For the Gradient Boosting Decision Tree Classifier, the confusion matrices shown in Tables 10 to 12 were obtained for each of the target variables.
Table 10 Confusion Matrix for Churn from Gradient Boosted Decision Trees

                             GBDT Churn Predictions
                             Class +1        Class –1
  Truth      Class +1        155.1           138.3
             Class –1        212.1           4494.5
For churn the true positive rate is 52.86%, true negative rate is 95.49%, and the corresponding
AUC is 0.7418.
Table 11 Confusion Matrix for Appetency from Gradient Boosted Decision Trees

                             GBDT Appetency Predictions
                             Class +1        Class –1
  Truth      Class +1        56.6            28.2
             Class –1        32.4            4915.2
For appetency the true positive rate is 66.75%, true negative rate is 99.34%, and the
corresponding AUC is 0.8304.
Table 12 Confusion Matrix for Upselling from Gradient Boosted Decision Trees

                             GBDT Upselling Predictions
                             Class +1        Class –1
  Truth      Class +1        132.3           26.8
             Class –1        235.9           4605
For upselling the true positive rate is 83.15%, true negative rate is 95.13%, and the
corresponding AUC is 0.8914.
The optimal hyperparameters for each of the targets are shown in Table 13.
Table 13 Optimal Hyperparameters for Targets

  Hyperparameters      Churn      Appetency   Upselling
  eta                  0.1        0.05        0.1
  gamma                2          1           2
  max_depth            2          3           2
  min_child_weight     5          3           5
  num_round            100        150         120
  scale_pos_weight     12         20          12
  subsample            0.7        0.6         0.7
In XGBoost, the hyperparameter eta represents the learning rate of the boosting.
A node is split only when the resulting split gives a positive reduction in the loss function; gamma specifies the minimum loss reduction required to make a split, and the larger the value, the more conservative the algorithm will be.
max_depth means the same as in random forest. In gradient boosting decision trees it is not essential to have very deep trees, as later trees improve on the results from the previous trees; this is why the tree depth is much smaller in gradient boosting decision trees than in random forest.
min_child_weight plays essentially the same role as min_samples_leaf in random forest. Appetency again has the smallest value among the targets.
num_round stands for the number of boosting iterations. As gradient boosted decision trees are prone to overfitting, the results are not guaranteed to improve as the number of rounds increases.
scale_pos_weight controls the balance of positive and negative weights, which is useful for unbalanced classes. XGBoost suggests using sum(negative cases) / sum(positive cases) as the weight; the optimal weights for churn and upselling are approximately that value (~12.5), but for appetency the optimal weight is much smaller than the suggested ratio (~55).
subsample is the fraction of the training instances that XGBoost randomly samples from the training dataset to grow each tree, and this value helps to prevent overfitting.
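For illustration, the churn column of Table 13 maps directly onto an xgboost training call, as sketched below; the objective, evaluation metric and data handles are assumptions added for the sketch and are not stated in Table 13.

    # The tuned churn settings from Table 13 as an xgboost training call (sketch).
    import xgboost as xgb

    params = {
        "eta": 0.1,
        "gamma": 2,
        "max_depth": 2,
        "min_child_weight": 5,
        "scale_pos_weight": 12,
        "subsample": 0.7,
        "objective": "binary:logistic",   # assumption: binary classification objective
        "eval_metric": "auc",             # assumption: AUC as the evaluation metric
    }
    # dtrain = xgb.DMatrix(X_train, label=y_churn_train)          # labels assumed coded 0/1
    # booster = xgb.train(params, dtrain, num_boost_round=100)    # num_round from Table 13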
Compared to random forest, there were improvements in churn and upselling and a slight decrease in appetency. The number of positive predictions continues to shrink for both churn and upselling while the number of true positive cases increases, which matches the rising trend in true positive rates. For appetency, both the positive predictions and the true positive cases are relatively close to the results from random forest, which indicates good prediction ability from both ensemble methods.
The running time for Gradient Boosted Decision Trees was in the range of 4-5 hours per fold.
6.1.4 Ensemble Selection
For the Ensemble Selection Classifier, the confusion matrices shown in Tables 14 to 16 were obtained for each of the target variables.
Table 14 Confusion Matrix for Churn from Ensemble Selection

                             Ensemble Selection Churn Predictions
                             Class +1        Class –1
  Truth      Class +1        156.9           118.5
             Class –1        210.3           4514.3
For churn the true positive rate is 56.97%, true negative rate is 95.55%, and the corresponding
AUC is 0.76. The top 3 models included in the ensemble are gradient boosting decision trees and random forest, followed by L2-regularized logistic regression.
Table 15 Confusion Matrix for Appetency from Ensemble Selection

                             Ensemble Selection Appetency Predictions
                             Class +1        Class –1
  Truth      Class +1        60.3            20
             Class –1        28.7            4891
For appetency the true positive rate is 75.09%, true negative rate is 99.42%, and the corresponding AUC is 0.8725. The top 3 models included in the ensemble are random forest and gradient boosted decision trees, together with logistic regression.
Table 16 Confusion Matrix for Upselling from Ensemble Selection

                             Ensemble Selection Upselling Predictions
                             Class +1        Class –1
  Truth      Class +1        133.4           21.5
             Class –1        234.8           4610.3
For upselling the true positive rate is 86.12%, true negative rate is 95.15%, and the
corresponding AUC is 0.9064. The top three models in the ensemble are: gradient boosted
decision tree, random forest and logistic regression.
The idea of ensemble selection is to improve the performance of the ensemble through the selection iterations. As expected, the best-performing base model for each target is also the most heavily weighted model in the ensemble. The results for all three targets improve compared to both GBDT and random forest, with further shrinkage in positive predictions and an increase in true positive cases.
The running time for Ensemble Selection varied from 20-22 hours per fold.
6.1.5 Overall Comparison
Figure 16 shows the overall comparison of all the classification models used in the thesis. It is evident that the ensemble methods (RF, GBDT and ensemble selection) achieved significant improvements over the benchmark Naïve Bayes classifier. As introduced earlier, the purpose of ensemble methods is to improve the performance of simple classifiers by decreasing the variance (bagging) or the bias (boosting). However, not only did the models improve the predictions, they also very significantly increased the time needed to produce them. Random forest and the gradient boosting decision tree can be seen as ensembles of decision trees, and compared to a single classifier their running time increased roughly tenfold. Ensemble selection can be considered an ensemble of ensemble methods, as both random forest and gradient boosting decision trees are included in the library along with many other models, and this pushes the running time up by another factor of about four. In spite of the fact that the
ensemble selection model provided us with the best result among all the models tested, the question would be how to achieve a good balance between prediction outcomes and time and computational complexity. From a practical perspective, time and computational complexity make the ensemble selection model much more expensive compared to both random forest and the gradient boosting decision tree.
Figure 16 Overview of All the Classification Models
6.2 Feature selection
As mentioned in section 5.2.2, there was no initial feature selection for any of the results shown in section 6.1. Meanwhile, it is also apparent that the running time for the ensemble methods is much longer compared to any of the base learners. In this section, feature selection based on the feature importance from XGBoost is implemented to discover how the performance of random forest and the gradient boosting decision tree changes as the number of features varies.
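A minimal sketch of this analysis is shown below: features are ranked by the fitted XGBoost model's importance scores and the random forest is retrained on the top k features for several values of k. The variable names, the assumption that the tuned model exposes the scikit-learn feature_importances_ attribute, and the grid of k values are illustrative.

    # Sketch: rank features by XGBoost feature importance and retrain on the top k features.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    importance = xgb_model.feature_importances_      # xgb_model: tuned GBDT from section 6.1.3
    ranked = np.argsort(importance)[::-1]            # feature indices, most important first

    for k in (20, 50, 100, 200, 300, 500):
        top_k = ranked[:k]
        rf = RandomForestClassifier(n_estimators=500, class_weight="balanced", n_jobs=-1)
        rf.fit(X_train[:, top_k], y_train)           # X_train assumed to be a NumPy array
        # evaluate the AUC on held-out folds and record it against k, as in Figures 17 and 18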
Using the final gradient boosting decision tree model from section 6.1.3, the top 20 variables by feature importance for each target are listed in Table 17. It is apparent that these variables are far more related to the results than the ones not listed in the
table. Figure 17 and Figure 18 show the change in the AUC score when using different numbers of important features for random forest and the gradient boosted decision tree, respectively.
Table 17 Feature Importance of the Top 20 Variables for Each Target
Rank Churn Appetency Upselling
Variable Importance Variable Importance Variable Importance
1 Var8981 0.20131 Var9045 0.23783 Var9045 0.25517
2 Var14990 0.10253 Var8032 0.13564 Var14990 0.07866
3 Var10533 0.04649 Var14995 0.10791 Var8981 0.05324
4 Var14970 0.04602 Var14990 0.06068 Var12507 0.04962
5 Var5331 0.02358 Var5826 0.03720 Var6808 0.04653
6 Var14995 0.02190 Var8981 0.03233 Var1194 0.02581
7 Var14822 0.02111 Var10256 0.03030 Var14970 0.02157
8 Var9045 0.02004 Var12641 0.02721 Var14871 0.01331
9 Var2570 0.02001 Var14772 0.01718 Var1782 0.01149
10 Var14923 0.01883 Var14939 0.01689 Var10256 0.01052
11 Var14765 0.01194 Var14867 0.01623 Var5026 0.00963
12 Var14904 0.01138 Var14970 0.01423 Var8032 0.00913
13 Var5702 0.01133 Var11781 0.01145 Var14786 0.00807
14 Var11047 0.01121 Var14871 0.00891 Var7476 0.00622
15 Var14778 0.00972 Var14788 0.00860 Var11781 0.00591
16 Var14795 0.00903 Var13379 0.00812 Var14795 0.00574
17 Var990 0.00898 Var5216 0.00711 Var6255 0.00572
18 Var12580 0.00863 Var14795 0.00703 Var5216 0.00503
19 Var9075 0.00860 Var11315 0.00664 Var2591 0.00497
20 Var647 0.00847 Var12702 0.00627 Var12641 0.00465
For random forest, the performance improves as the number of features included increases and eventually converges once the number reaches 300. The gradient boosting decision tree follows a similar trend and converges at around 200 features. In both models, the performances for the three targets fluctuate differently but ultimately converge, as expected. In addition, as the number of features decreases, the running time for both models decreases proportionally.
Figure 17 AUC Score for Random Forest with Different Number of Features
Figure 18 AUC Score for Gradient Boosting Decision Tree with Different Number of Features
6.3 Target Audience
The above results can potentially help the marketing and sales departments of the company with the targeting, preparation and improvement of marketing campaigns. Figure 19 visualizes the relationship between prediction and truth. For marketing purposes, the positive predictions can be treated as the actual target customers, and the true positive rate represents the response rate from those customers. The objective for the company would be to minimize the cost of the campaign while maximizing the number of responses.
Figure 19 Relationship between Prediction and Truth and Connection with Marketing
As shown in Figure 20, there are obvious trends in all three targets. The ensemble methods are able to narrow the range of positive predictions by 30% while doubling the true positive rate. For example, the prediction for upselling can be applied in a marketing campaign to promote upgrades and add-ons. With the help of ensemble learning methods, the target customer sizes are between 150 and 180, compared to over 240 when using the Naïve Bayes method and 5000 without machine learning analysis. Meanwhile, the response rates are steady between 70% and 90%, compared to around 50% when applying the benchmark classifier and 7.4% from the total customer population. In other words, ensemble methods can very likely minimize the campaign cost by decreasing the size of the target customer group while improving the feedback from customers.
Figure 20 Positive Prediction and True Positive Rate for Each Classifier
Furthermore, with the identification of churn among existing customers, a survey can be directed at the positively predicted individuals to understand what aspects of the service or product might lead them to consider switching providers, and what improvements they would like to see from the company. At the same time, surveys can also be directed at potential customers with positive classifications to study which promotions they appreciate the most and how they rank companies in the same industry. Similarly, with the predictions of appetency and upselling, studies can examine customers' behaviours after advertisements, such as what level of upgrade they chose, what kind of new product they purchased and what kind of promotion they responded to, in order to support future marketing plans.
As shown in section 6.2, these targets are influenced by different features. Unfortunately, further
analysis based on a single variable would be hard to understand as all the values in the dataset
are encrypted. In practice, the company can presumably further analyze the top-ranked features from the ensemble methods, especially the demographic-related ones. Understanding these features may lead to more personalized marketing plans that improve the response from customers and strengthen the relationship between customers and the company.
Chapter 7
Discussion and Conclusions
The dataset used in the thesis was obtained from the 2009 KDD Cup, during which IBM
Research was able to achieve the highest score. The winning entry from IBM Research
consisted of an ensemble selection model of various classifiers. Other winning entries all
included ensemble methods to some extent, especially random forest and boosted decision trees. The models used in the thesis were selected based on the study and understanding of these ensemble methods and the results from the competition entries (Guyon, Lemaire, Boullé, Dror, & Vogel, 2009).
Based on the results from Chapter 6, the presented ensemble models (random forest, gradient boosting decision tree and ensemble selection) are able to outperform the benchmark obtained from the Naïve Bayes classifier, with the ensemble selection algorithm producing the best outcomes of all. The overall score from the ensemble selection algorithm (0.8492) also measures up to the best score produced by IBM Research in the competition (0.8521). The minor deviation is acceptable given the differences in data preprocessing and library construction and the stochasticity in ensemble methods.
For companies which plan on implementing machine learning for data analysis, ensemble methods are very likely to produce better results than a single classifier.
On the whole, ensemble methods are constructed from weak learners to improve performance. A weak learner is defined as a classifier that is only slightly correlated with the true classification, or in simple terms, one that can produce a result a little better than random guessing. Ensemble methods, whether bagging or boosting, increase the computational complexity in both the space (memory) and time aspects compared to a single weak learner. For bagging, the training process is relatively easier: the estimate is produced by the average or majority vote of all the weak learners, and the number of weak learners trained is the main cause of the increase in complexity. For instance, if it takes O(t) time and m memory to train a single weak learner, then it will take O(nt) time to train n similar weak learners and n×m memory to store them for training and prediction. For boosting, each weak learner is trained in sequence, and the later learners in the ensemble are 'boosted' using the knowledge of the mistakes made by previous learners in order to minimize a predetermined loss function; the computational complexity mainly depends on the number of boosting iterations and the complexity of the boosting function (Zhou, Ensemble Learning, 2009). From the Naïve Bayes classifier, to random forest and gradient boosting (ensembles of simple learners), to ensemble selection (an ensemble of ensemble models), the prediction result improves gradually. For a model like ensemble selection, which is essentially an ensemble of ensemble models, it is understandable that the running time grows dramatically relative to the base learner.
When implementing ensemble methods in real life, the trade-off would be between accuracy and computational complexity. In general, as a model becomes more complex, the training process requires more powerful hardware and extra memory for storage and computation, or it risks freezing the whole system and losing all the work done up to that point. Likewise, the more complex the algorithm is, the longer it takes to construct, train and predict.
Figure 21 Trade-off between Accuracy and Time
Figure 21 illustrates one possible approach to evaluating the ensemble methods. As all of the models were run in the same environment, the only trade-off is between accuracy
(AUC score) and time. Time efficiency is calculated as the multiplicative inverse of the running time, so that a model with a shorter running time (more time efficient) has a higher value. The optimal model is the one with a balanced AUC score and time efficiency, in this case random forest or the gradient boosting decision tree. Although ensemble selection has the highest score, it is comparatively inefficient and may not be the best model for every problem.
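As a small illustration of this calculation, the approximate per-fold running times reported in section 6.1 translate into the time-efficiency values plotted in Figure 21 roughly as follows; the figures used here are approximations taken from the text.

    # Time efficiency as the reciprocal of the approximate per-fold running times reported above.
    running_time_hr = {"NB": 20 / 60, "RF": 4.0, "GBDT": 4.5, "ES": 21.0}   # approximate values
    time_efficiency = {name: 1.0 / t for name, t in running_time_hr.items()}
    # roughly: NB ~ 3.0, RF ~ 0.25, GBDT ~ 0.22, ES ~ 0.05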
For companies, the trade-off would be more relevant to the financial resources of the company,
the objectives of the analysis and the details of the actual dataset.
If the dataset is not as large, complex and noisy as the Orange dataset, the base learners would be relatively easy to train. Under this circumstance, the number of learners needed for random forest and the number of iterations for gradient boosting would decrease before the models converge. At the same time, the weak learner is not likely to produce very poor results, which leaves little room for improvement by the ensemble methods. Hence, such a large difference between ensemble selection and random forest or the gradient boosting decision tree, in both running time and prediction results, is unlikely. The most costly aspect would be the additional programming required for the model library in ensemble selection.
Also, if the data analysis department of the company is well financed, it most likely has the computational power and human resources to produce the analysis using ensemble selection. Similarly, if the primary purpose of the company is to predict the outcomes, that is, if the most important objective is accuracy, then ensemble selection would be the best option. On the other hand, if the company does not have strong computational power, or the analysis results are time sensitive, or the dataset has a relatively small number of features, general ensemble methods like random forest and gradient boosting machines can still produce trustworthy results with much less effort. For example, a multinational technology company like Apple Inc. certainly has the computational strength as well as the financial support for the most complicated models; this type of company will tend to choose the model with the highest accuracy. By contrast, a retail startup is unlikely to have the resources required to set up a complicated ensemble method, and it would be wise for such a company to choose a basic ensemble method.
Furthermore, the feature selection process can significantly reduce the running time, and a well-constructed selection can maintain the same level of prediction with far fewer features. The features were selected based on the feature importance produced by the gradient boosting decision tree. With feature selection, the models were able to produce comparable predictions with only 1.5% – 2% of the features, which also leads to a dramatic decrease in running time and memory. Feature selection is helpful not only because it simplifies the computational process, but also because it can decrease the future cost of information gathering and processing: the company could ask for only about 2% of the original 15,000 fields for future predictions.
Moreover, as the predictions from this dataset are most likely to be used for marketing and sales purposes, the results from the ensemble methods have the potential to adjust the commercials to a more personalized level, reduce the marketing campaign size and cost, and still maintain or improve the number of targeted customers. At the same time, ensemble methods are also good at identifying customers with a propensity to switch to another firm, which gives the company the opportunity to keep customers, or attract customers from other companies with targeted advertisements or promotions, and also to study the reasons why they are considering changing suppliers.
Ensemble methods are widely used in many industries. In the biomedical industry, ensemble methods can be used for disease diagnosis and prediction, drug discovery, gene expression analysis, and more. In high energy physics, many of the predictions and analyses for the Large Hadron Collider (LHC) rely heavily on machine learning.
To sum up, the ensemble methods tested in the thesis were able to produce better predictions at the cost of additional time and computational power. The trade-off between accuracy, time and money should be evaluated in light of the details of the dataset, the resources of the company and the main objective of the analysis.
number of features required to make good predictions and thus reduce the computational
complexity of the ensemble models. The results from ensemble models can be applied to
marketing and planning purposes. In addition to the predictions, the models can also provide
insights about feature importance that can be used for personalized/targeted promotion and
advertising.
Chapter 8
Future Works
The ensemble methods were only tested on the Orange dataset. A more general result could be obtained by applying these ensemble methods to different datasets, preferably datasets with different sizes, structures and objectives. In addition to generalizing the trends discovered in this work, the results could also be used to identify which ensemble method is most suitable for which kind of dataset or which specific objectives. Besides the models tested in the thesis, other ensemble methods could also be tested under the same constraints to further discover the “best” of the ensemble methods.
The Orange dataset used in the thesis is encrypted, so it is impossible to understand the actual meaning of each field/feature. It would be helpful to apply the ensemble methods to non-encrypted datasets and uncover insights through the mining process. With the help of feature importance, it is possible to reveal which demographic information affects the prediction most, to study how the most significant features vary from target to target, and to examine why they have the most influence on that target. These further studies can be measurably helpful for management and marketing decisions, especially if the features obtained are gender, age or demographic oriented. As an illustration, if an analysis of customers indicates that females in the 18-24 age group living in the Yonge–Eglinton neighbourhood are more accepting and welcoming of new fashion items, a fast fashion retailer Y may consider opening a pop-up store within that area with exclusive or limited-edition products. This type of research can lead to further reduction of the target customer group and result in a better designed marketing campaign.
Additionally, the current dataset does not contain chronological data. More often than not, real-world data will have a timestamp, and the time of the recording presumably contains useful information, such as seasonal trends and market directions; this is where reinforcement learning can be beneficial. Reinforcement learning, also known as approximate dynamic programming, is an area of machine learning that maximizes the cumulative reward through actions taken in a given environment. In simple terms, a reinforcement learning model is trained on sequences of actions with time steps and corresponding rewards, and the model is updated as more data arrive, so it is constantly updating as it predicts new data samples and receives feedback. If a chronological dataset regarding customer information can be obtained, ensemble learning and reinforcement learning can be combined to produce predictions with trends and sequential actions that could result in the best outcomes based on the definition of the project. One beneficial type of chronological data to collect to supplement the Orange dataset would be the responses from customers together with the details of the marketing plans. Reinforcement learning can be trained using this feedback to predict what kind of promotion or commercial elicits the most responses from different customers (the highest rewards). Two consumers both identified as likely to purchase a new product could then be approached through different media and receive different promotions based on their historical data.
References
Auria, L., & Moro, R. A. (2008). Support Vector Machines (SVM) as a Technique for Solvency Analysis
(DIW Berlin Discussion Paper No. 811). Berlin: DIW Berlin.
Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers.
Proceedings of the fifth annual workshop on Computational learning theory (pp. 144-152).
Pittsburgh: ACM.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123-140.
Brodersen, K. H., Ong, C. S., Stephan, K. E., & Buhmann, J. M. (2010). The Balanced Accuracy and Its
Posterior Distribution. 2010 20th International Conference on Pattern Recognition (pp. 3121-
3124). Istanbul: IEEE.
Caruana, R., Munson, A., & Niculescu-Mizil, A. (2006). Getting the Most Out of Ensemble Selection.
Proceedings of the 6th IEEE International Conference on Data Mining (ICDM 2006) (pp. 828-
833). Hong Kong: IEEE.
Caruana, R., Niculescu, A., Crew, G., & Ksikes, A. (2004). Ensemble Selection from Libraries of Models. In
Proceedings of the twenty-first international conference on Machine learning (ICML '04) (pp. 18-
26). Banff: ACM.
Chen, T., & Guestrin, C. (2016, June 3). XGBoost: A Scalable Tree Boosting System. KDD '16 Proceedings
of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
(pp. 785-794). San Francisco.
CodeEval. (2016, February 2). Most Popular Coding Languages of 2016. Retrieved from CodeEval:
http://blog.codeeval.com/codeevalblog/2016/2/2/most-popular-coding-languages-of-2016
Cortes, C., & Vapnik, V. (1995). Support-Vector Networks. Machine Learning, 20(3), 273-297.
Cover, T., & Hart, P. (1967, January). Nearest Neighbor Pattern Classification. IEEE Transactions on
Information Theory, 13(1), 21-27.
Cox, D. R. (1958, January). The Regression Analysis of Binary Sequences: Discussion on the Paper.
Journal of the Royal Statistical Society. Series B (Methodological), 20(2), 232-242.
Dietterich, T. G. (2000). Ensemble Methods in Machine Learning. Proceedings of the First International
Workshop on Multiple Classifier Systems (pp. 1-15). London: Springer-Verlag.
Distributed (Deep) Machine Learning Community. (2015). Introduction to Boosted Trees. Retrieved from
XGBoost Documents: http://xgboost.readthedocs.io/en/latest/get_started/
Dreiseitl, S., & Ohno-Machado, L. (2002, October). Logistic regression and artificial neural network
classification models: a methodology review. Journal of Biomedical Informatics, 35(5-6), 352-
359.
Eick, C. F., Zeidat, N., & Zhao, Z. (2004). Supervised clustering - algorithms and benefits. 16th IEEE
International Conference on Tools with Artificial Intelligence (pp. 774-776). Boca Raton: IEEE
Computer Society.
Enders, C. K. (2010). Applied Missing Data Analysis. New York: Guilford Press.
ETS Asset Management Factory. (2016, April 20). What is the difference between Bagging and
Boosting? Retrieved from QuantDare: https://quantdare.com/what-is-the-difference-between-
bagging-and-boosting/
Fix, E., & Hodges, J. L. (1951). Discriminatory Analysis. Nonparametric Discrimination: Consistency
Properties. Randolph Field: USAF School of Aviation Medicine.
Friedman, J. H. (2001, October). Greedy Function Approximation: A Gradient Boosting Machine. The
Annals of Statistics, 29(5), 1189-1232.
Good, I. J. (1965). The estimation of probabilities: An essay on modern Bayesian methods. Cambridge:
M.I.T. Press.
Google. (2015, November 9). TensorFlow: Open source machine learning. Retrieved June 11, 2017, from
YouTube: https://www.youtube.com/watch?v=oZikw5k_2FM
Grover, P. (2017, December 9). Gradient Boosting from scratch. Retrieved from Medium:
https://medium.com/mlreview/gradient-boosting-from-scratch-1e317ae4587d
Guyon, I., Lemaire, V., Boullé, M., Dror, G., & Vogel, D. (2009). Analysis of the KDD Cup 2009: Fast
Scoring on a Large Orange Customer Database. Proceedings of 2009 International Conference
on KDD Cup (pp. 1-22). Paris: JMLR.org.
Han, J., Kamber, M., & Pei, J. (2012). Data Mining: Concepts and Techniques (3rd ed.). Amsterdam:
Elsevier/Morgan Kaufmann.
He, X., Chaney, N. W., Sheffield, J., & Schleiss, M. (2016, October). Spatial Downscaling of Precipitation
Using Adaptable Random Forests. Water Resources Research, 52(10), 8217–8237.
Hebb, D. O. (1949). The Organization of Behavior. New York: Wiley & Sons.
Ho, T. K. (1995). Random Decision Forests. Proceedings of the 3rd International Conference on
Document Analysis and Recognition (pp. 278-282). Montreal: IEEE Computer Society.
Hohenegger, J., Bufardi, A., & Xirouchakis, P. (2008). Compatibility Knowledge in Fuzzy Front End. In A.
Bernard, & S. Tichkiewitch, Methods and tools for effective knowledge life-cycle-management
(pp. 243-258). Berlin: Springer.
Jain, R. (2017, February 21). Simple Tutorial on SVM and Parameter Tuning in Python and R. Retrieved
from Hackerearth blog: http://blog.hackerearth.com/simple-tutorial-svm-parameter-tuning-
python-r
Kearns, M., & Valiant, L. (1989). Cryptographic Limitations on Learning Boolean Formulae and Finite
Automata. Proceedings of the 21st ACM Symposium on the Theory of Computing (pp. 434-444).
Seattle: ACM.
Laurae. (2017, April 23). Categorical Features and Encoding in Decision Trees. Retrieved from Medium:
https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-
53400fa65931
Lin, Y., & Jeon, Y. (2006). Random Forests and Adaptive Nearest Neighbors. Journal of the American
Statistical Association, 101(474), 578-590.
Loh, W. Y. (2014). Fifty Years of Classification and Regression Trees. International Statistical Review,
82(3), 329-348.
McCulloch, W., & Pitts, W. (1943). A Logical Calculus of Ideas Immanent in Nervous Activity. Bulletin of
Mathematical Biophysics, 5(3), 115-133.
Mell, P., & Grance, T. (2011). The NIST definition of cloud computing. Gaithersburg, MD: Computer
Security Division, Information Technology Laboratory, National Institute of Standards and
Technology. Retrieved from http://purl.fdlp.gov/GPO/gpo17628
Meyer, D. (2001). Support Vector Machines: The Interface to libsvm in Package e1071. R News, 1/3, 23-
26.
Mohammed, M., Khan, M. B., & Bashier, E. B. (2016). Machine Learning: Algorithms and Applications.
Boca Raton: CRC Press.
Mohri, M., Rostamizadeh, A., & Talwalkar, A. (2012). Foundations of Machine Learning. Cambridge, MA:
MIT Press.
Ng, A. Y., & Jordan, M. I. (2001). On Discriminative vs. Generative Classifiers: A comparison of logistic
regression and Naive Bayes. Advances in Neural Information Processing Systems 14 (NIPS 2001)
(pp. 841-848). Vancouver: MIT Press.
Nilsson, N. J. (1965). Learning Machines : foundations of trainable pattern-classifying systems. New
York: McGraw-Hill.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., . . . Brucher, M. (2011).
Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825-2830.
Priddy, K., & Keller, P. (2005). Artificial Neural Networks: An Introduction. Bellingham: SPIE Press.
Raschka, S. (2016, April). When Does Deep Learning Work Better Than SVMs or Random Forests?
Retrieved from KDnuggets: https://www.kdnuggets.com/2016/04/deep-learning-vs-svm-
random-forest.html
Rish, I. (2001). An empirical study of the naive Bayes classifier. Proceedings of IJCAI-2001 workshop on
Empirical Methods in AI (pp. 41-46). Seattle: IBM New York.
Russell, S., & Norvig, P. (2010). Artificial Intelligence: A Modern Approach (3rd ed.). Upper Saddle River,
NJ: Prentice Hall.
Schapire, R. E. (1990). The Strength of Weak Learnability. Machine Learning, 5(2), 197-227.
Schmidt, M. (2009, March 29). A Note on Structural Extensions of SVMs. Retrieved from
http://www.cs.ubc.ca/~schmidtm/Documents/2009_Notes_StructuredSVMs.pdf
Scikit-learn. (2017). 1.11. Ensemble methods. Retrieved from Scikit-learn: http://scikit-
learn.org/stable/modules/ensemble.html#gradient-boosting
Scikit-learn. (2017). 3.2.4.3.1. sklearn.ensemble.RandomForestClassifier. Retrieved from Scikit-learn:
http://scikit-
learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
Sebestyen, G. S. (1962). Decision Making Processes in Pattern Recognition. New York: Macmillan.
Steinkraus, D., Simard, P., & Buck, I. (2005). Using GPUs for Machine Learning Algorithms. Eighth
International Conference on Document Analysis and Recognition (ICDAR'05). 2, pp. 1115-1120.
Seoul: IEEE Computer Society.
Summerfield, M. (2007). Rapid GUI Programming with Python and Qt: The Definitive Guide to PyQt
Programming. Upper Saddle River: Prentice Hall Press.
Swamidass, S., Azencott, C., Daily, K., & Baldi, P. (2012, April). A CROC stronger than ROC: measuring,
visualizing and optimizing early retrieval. Bioinformatics, 1348-1356.
Weng, J., Ahuja, N., & Huang, T. S. (1992). Cresceptron: A Self-organizing Neural Network. IJCNN
International Joint Conference on Neural Networks. 1, pp. 576-581. Baltimore: IEEE.
Werbos, P. J. (1974). Beyond Regression: New Tools for Prediction and Analysis in the Behavioral
Sciences (Doctoral dissertation). Harvard University.
Wilde, S. (2011). Customer Knowledge Management: Improving Customer Relationship through
Knowledge Application. Berlin: Springer.
Zhou, Z.-H. (2009). Ensemble Learning. In S. Z. Li, & A. Jain (Eds.), Encyclopedia of Biometrics (pp. 270-
273). New York: Springer US.
Zhou, Z.-H. (2012). Ensemble Methods: Foundations and Algorithms. Boca Raton: Chapman & Hall/CRC.