Ensemble Modeling with R
Matthew A. Lanham
Doctoral Candidate/Merchandise Data Scientist
MatthewALanham.com
Virginia Tech, Department of Business Information Technology
Advance Auto Parts, Inc.
Outline
My Background and Research
Pros and Cons of R for Data Science
Modeling Using CRISP-DM Framework
What is Ensemble Modeling?
Fitting Models
Bagging a decision tree
Optimal Decision Cut Points for Binary Classification
SEPTEMBER 15, 2014 2
Background
(2005) B.A. Economics/Mathematics, Indiana University-Bloomington
(2005 - 2010) Genscape, Inc., Louisville, KY – Energy transparency start-up
(2008 - 2010) M.S. Biostatistics-Decision Science, University of Louisville
(2010 - 2012) M.S. Statistics, Virginia Tech
(2011 - Current) Ph.D. Business Information Technology, Virginia Tech
(2014 - Current) Advance Auto Parts, Inc. – Fortune 500 Retailer (#402.. for now)
Research Focus
• How can we build predictive models that are empirically sound (statistics) to serve as
input parameters to prescriptive models that are process-representative (optimization),
providing the “best” (maintainable, timely, scalable, KPI-fused) decision support for a
retailer’s assortment plan?
Why is assortment planning so important?
Why is assortment planning problem so challenging?
Where does Data Science & Big Data Analytics (BDA) come into play?
Where does Information Technology (IT) come into play?
Where does Business come into play?
INTEGRATING PREDICTIVE AND PRESCRIPTIVE ANALYTICS
[Diagram: Predictive model(s), namely a demand model built from market specs via an estimation model, a utility model (preference structure, similarity measures), and a sales model, produce demand forecast parameter(s) that feed a prescriptive decision model. The decision model loops a search algorithm over optimality conditions and decision criteria (max profit), with decision variables (1. assortment, 2. prices, 3. promotion, 4. shelf space*) and performance measures (objective function: revenue; constraints: budget(s)). Performance measures that are functions over a time horizon, or random variables that must be summarized over their distributions, are rolled up as time summary (TS/TSPM) and scenario summary (SS/SSPM) performance measures (e.g., sum, average) to determine the optimal solution.]
WHAT I’M WORKING WITH CURRENTLY
My opinion: IBM SPSS Modeler and SAS Enterprise Miner are:
1) Great for teaching
2) Great for stand-alone data mining projects
3) Visually appealing to management
4) Not great for real-time production analytics
5) Not great for customized solutions
6) Not designed for prescriptive analytics
An example of an IBM SPSS Modeler stream building predictive models
DATA SCIENCE WITH R
R is an open-source software language for statistical and mathematical computing, freely available under the GNU General Public License, version 2 (Ihaka & Gentleman, 1996).
R is compatible with many operating systems, such as Windows, Macintosh, Unix, and Linux.
According to Eric Siegel, author of Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die, R is “The leading free, open-source software tool for PA (Predictive Analytics), … has a rapidly expanding base of users as well as enthusiastic volunteer developers who add to and support its functionalities” (Siegel, 2013).
Today there are several thousand user-developed packages (also referred to as libraries) available.
Packages are collections of R functions, compiled code, and data put together in a specific format following CRAN’s guidelines. You can search for packages by application area here (http://cran.r-project.org/web/views/).
As of July 2014, there are 33 different application areas. In the Machine Learning application area alone there are 72 packages, with functions covering nearly any methodology. Many newer techniques are available here that are not available in commercial software packages.
Cons – Memory, memory, memory!
See memory_example.R
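The original memory_example.R is not reproduced here; as a stand-in, this small sketch illustrates why memory is the pain point: R keeps objects fully in RAM and copies on modify.

```r
## Illustrative stand-in for memory_example.R (not the original file).
x <- rnorm(1e6)                      # one million doubles, held entirely in RAM
print(object.size(x), units = "MB")  # ~7.6 MB for a single numeric vector

## Copy-on-modify: this modification briefly needs memory for BOTH copies
y <- x
y[1] <- 0

## Reclaim memory and report usage
rm(x, y)
invisible(gc())
```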
WHAT IS ENSEMBLE MODELING?
Ensemble methods train multiple predictive models and then combine the
predictions to achieve a higher overall performance and stability.
Pros
Ensemble methods require little tuning
Ensemble methods operate on a variety of input types (categorical variables, integers, and real numbers)
Ensemble methods can be used on a variety of problems (binary and multi-class classification, rankings, regression, etc.)
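As a minimal illustration of "combine the predictions," the hypothetical probability vectors below stand in for the test-set outputs of three fitted models; the ensemble simply averages them and thresholds at 0.50:

```r
## Hypothetical probability outputs from three models on three test cases
p1 <- c(0.9, 0.4, 0.2)
p2 <- c(0.8, 0.6, 0.1)
p3 <- c(0.7, 0.5, 0.3)

ensemble_prob  <- (p1 + p2 + p3) / 3          # average the model confidences
ensemble_class <- ifelse(ensemble_prob > 0.5, 1, 0)
ensemble_class  # 1 0 0
```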
DATA MINING FRAMEWORK
Cross-Industry Standard Process for Data Mining (CRISP-DM) is a general data mining process model that can be applied to solve any business problem.
There are other popular data mining and analytics process models, such as Sample-Explore-Modify-Model-Assess (SEMMA), but in my opinion CRISP-DM is more structured and detailed.
CRISP-DM was created and refined over time by leading practitioners and researchers in the data mining field, and has been shown to lead to analytical results that align with business objectives.
The CRISP-DM process model and techniques primarily fall under the predictive analytics domain in business
analytics, where the objective is to help organizations predict future events and proactively act upon such
insights in a systematic fashion to drive better business outcomes (Provost & Fawcett, 2013). However, this
process could be extended to prescriptive (i.e. optimization) analytics endeavors as well.
CRISP-DM DETAILED VIEW
CRISP-DM Model Phases and Tasks (Source: Modified from www.crisp-dm.org)
BUSINESS UNDERSTANDING
Business Objectives
Retail “assortment planning,” at the most basic level, asks which products to offer and how many (Mantrala et al., 2009).
Assortment planning is one of the most important decisions faced by retailers (Sauré & Zeevi, 2013). Because of financial and physical capacity constraints, a retailer operationally does not have the ability to stock, let alone hold in store, every possible product a consumer may desire (Sauré & Zeevi, 2013).
You must get the project sponsor to detail the business success criteria. It’s
not some predictive model accuracy statistic. Examples:
Increased Sales of X% at Stores Y and Z.
Reduced non-working inventory of W% at Stores Y and Z.
Assess Situation
May use R and any of its available packages
Deadline is September 17th at Meetup
Competition winner gets $100, losers learn something, Speaker gets feedback
Data Mining Goals
Determine the best overall test accuracy on a 10% out-of-training set
Neither sensitivity nor specificity may fall below 0.70 on the out-of-training set to qualify.
Project Plan
Layout your expected work schedule, breaks, etc.
Will vary depending on your experience using R
DATA UNDERSTANDING
Collect Data
http://www.matthewalanham.com/Presentations/skus.xlsx
Describe Data
Usually you will create such a table yourself, but make it more descriptive.
The data scientist will ask the domain expert(s) questions such as:
What are the variables’ units of measure?
Where does the data come from? When is it updated?
How and why was clustering performed a particular way?
How and why was a variable adjusted?
Role  Variable  Description
      store_number  A unique store identifier
      sku_number  A unique SKU identifier
Y     SOLD  Whether the SKU sold in a respective store (1=yes, 0=no) in the last 13 periods after it was replenished/maxed.
      NUM_SOLD  The number of realized unit SKU sales for a respective store over the past 1-13 periods.
X     NUM_SOLD_LAST  The number of realized unit SKU sales for a respective store over the past 14-26 periods.
X     application_count  The total number of different year-make-model vehicle options that the respective SKU could be used for.
X     projected_growth_pct  The projected percentage growth for this SKU in the next 13 periods, based on financial experts.
X     offset  For each store-SKU, the positive deviation, based on unit sales, from the center of the part-type-specific distribution.
X     adjusted_offset  For each store-SKU, the positive deviation, based on unit sales, from the center of the part-type-specific distribution, adjusted via an ad-hoc calculation.
X     unit_sales_py  The total number of units sold for this SKU over all stores between 27 and 39 periods ago.
X     unit_sales_cy  The total number of units sold for this SKU over all stores between 14 and 26 periods ago.
X     unit_sales_fy  The total number of units sold for this SKU over all stores over the past 13 periods.
X     total_vio  The total number of "estimated" vehicles in operation associated with a particular store, based on an ad-hoc calculation.
X     adjusted_total_vio  The total number of "estimated" vehicles in operation associated with a particular store, based on an ad-hoc calculation.
X     vio_compared_to_cluster  The percentage of vehicles in operation (VIO) for a respective store compared to the total VIO for all stores in its cluster over the past 14 to 26 periods.
X     avg_cluster_cy_unit_sales  The average number of SKUs sold, based on a clustering of all stores, over the past 13 to 26 periods.
X     avg_cluster_cy_total_sales  The average total sales (a combination of unit and lost sales), based on store clusters, over the past 14 to 26 periods.
X     avg_cluster_cy_lost_sales  The average number of lost sales, clustered over all stores, over the past 14 to 26 periods.
X     pop_est_cy  Estimated number of persons in the population where the store is located, based on the latest period.
X     pop_density_cy  Estimated density (a percentage) of the population where the store is located, based on the latest period.
X     pct_white  Estimated percentage of Caucasian-identified persons where the store is located, based on the latest period.
X     age  Estimated median person-age where the store is located, based on the latest period.
X     pct_college  Estimated percentage of college-educated persons where the store is located, based on the latest period.
X     pct_blue_collar  Estimated percentage of blue-collar workers where the store is located, based on the latest period.
X     median_household_income  Estimated median household income where the store is located, based on the latest period.
X     establishments  Estimated number of physical locations where business is conducted or where services or industrial operations are performed, where the store is located, based on the latest period.
X     road_quality_index  A measure of the quality of the roads in the area where the store is located.
DATA UNDERSTANDING (CONT.)
Explore Data
Matt’s source code: https://github.com/malanham/Datathon.git
Find the main.R
Data Quality (custom helper functions from main.R):
DataQualityReport(skus)
DataQualityReportOverall(dataSetName=skus)
DATA PREPARATION
Data Description
(Same variable description table as shown under Data Understanding.)
## keep only complete cases (rows with no missing values)
skus = skus[complete.cases(skus),]
DataQualityReportOverall(dataSetName=skus)
MODELING
Modeling Techniques
C5.0 Decision tree
Logistic Regression
CART Decision tree
MODELING (CONT.)
Design
When building and testing predictive models using observational data (i.e., data that is not controlled as in laboratory experimentation), the question that must be answered is: how valid is my model with regard to what will happen next?
In a properly designed and controlled experiment, data
(samples) used in the experiment are used to make inferences
about the population.
Regardless of how large or small the sample is compared to the
true size of the population, this single randomly selected subset
of the population allows for generalizability of the remaining
subset of data not used in the study.
Cross-validation is the most practical and cost-effective means of obtaining a proxy for truth in predictive analytics.
## Randomly partition data into training and test sets
my_seed = 1234567
skus = GenerateTTV(dataSetName=skus, response='SOLD', trainPct=.90, testPct=.10, my_seed)
GeneratePartitionPcts(dataSetName=skus)
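GenerateTTV and GeneratePartitionPcts are custom helpers from the author's main.R. A plain base-R sketch of the same 90/10 split, on a made-up stand-in data frame (skus_demo), might look like this:

```r
set.seed(1234567)
skus_demo <- data.frame(SOLD = rbinom(100, 1, 0.5))  # stand-in for the skus data

n   <- nrow(skus_demo)
idx <- sample(seq_len(n), size = floor(0.9 * n))     # 90% of the row indices

skus_demo$SPSS_Partition      <- "Test"              # everything starts as Test
skus_demo$SPSS_Partition[idx] <- "Train"             # sampled rows become Train

table(skus_demo$SPSS_Partition) / n                  # Train 0.9, Test 0.1
```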
Using the training-data error rate as a proxy for a model’s generalization error is not wise, especially when the training error is low to almost perfect. Most likely the model has been overfit (or over-trained) and will not perform as well when new examples are fed through and evaluated from a validation data set (Zhou, 2012).
MODELING (CONT.)
Design
## Percentage of target equal to 1 (Y='SOLD') in the total data set
dim(skus[which(skus$SOLD==1),])[[1]] / dim(skus[which(skus$SOLD==0 | skus$SOLD==1),])[[1]]
## Percentage of target equal to 1 in the training set
dim(skus[which(skus$SOLD==1 & skus$SPSS_Partition=='Train'),])[[1]] / dim(skus[which((skus$SOLD==1 | skus$SOLD==0) & skus$SPSS_Partition=='Train'),])[[1]]
## Percentage of target equal to 1 in the test set
dim(skus[which(skus$SOLD==1 & skus$SPSS_Partition=='Test'),])[[1]] / dim(skus[which((skus$SOLD==1 | skus$SOLD==0) & skus$SPSS_Partition=='Test'),])[[1]]
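The same class-balance ratios can be computed more compactly with table() and prop.table(); a sketch on a made-up stand-in data frame:

```r
## demo stands in for the skus data frame
demo <- data.frame(SOLD = c(1, 1, 0, 0, 1, 0, 1, 0),
                   SPSS_Partition = rep(c("Train", "Test"), each = 4))

prop.table(table(demo$SOLD))                          # overall class balance
prop.table(table(demo$SPSS_Partition, demo$SOLD), 1)  # balance within each partition (row-wise)
```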
## remove independent variables that you don't want to use
names(skus)
skus2 = skus[,c(3,5:19,21:28)]
head(skus2)
skus2$SOLD = as.factor(skus2$SOLD)
MODELING (CONT.)
## set up data for algorithms
trainX = skus2[which(skus2$SPSS_Partition=='Train'),2:(length(skus2)-1)]
trainY = skus2[which(skus2$SPSS_Partition=='Train'),'SOLD']
train = cbind(trainY,trainX)
testX = skus2[which(skus2$SPSS_Partition=='Test'),2:(length(skus2)-1)]
testY = skus2[which(skus2$SPSS_Partition=='Test'),'SOLD']
test = cbind(testY,testX)
MODELING (CONT.)
Build Models
C5.0 Decision tree
require(C50) # fit classification tree models or rule-based models using Quinlan's C5.0 algorithm
C5Params = C5.0Control(…)
C5 = C5.0(x=trainX, y=trainY
          ,control=C5Params # control parameters defined above
          ,trials=1 # number of boosting iterations; 1 implies a single model is used
          )
summary(C5)
The confusion matrix is based on a decision cutoff threshold of 0.50
Overall error rate
Variables that were used to create the tree
Changing the trials from 1 to 1000 doesn't change the result in this case
MODELING (CONT.)
Build Models
C5.0 Decision tree
## training probabilities and predicted classes
C5trainp = predict(C5
             ,newdata = trainX
             ,trials = C5$trials["Actual"]
             ,type = "prob" # either "class" for the predicted class or "prob" for model confidence values
             ,na.action = na.pass)[,2]
C5trainc = predict(C5
             ,newdata = trainX
             ,trials = C5$trials["Actual"]
             ,type = "class"
             ,na.action = na.pass)
## testing probabilities and predicted classes
C5testp = predict(C5
             ,newdata = testX
             ,trials = C5$trials["Actual"]
             ,type = "prob"
             ,na.action = na.pass)[,2]
C5testc = predict(C5
             ,newdata = testX
             ,trials = C5$trials["Actual"]
             ,type = "class"
             ,na.action = na.pass)
MODELING (CONT.)
Build Models
Logistic Regression
logit.fit = glm(trainY ~ NUM_SOLD_LAST + TOTAL_VIO + ADJ_TOTAL_VIO + VIO_COMPARED_TO_CLUSTER
    + POP_EST_CY + POP_DENSITY_CY + PCT_WHITE + AGE + PCT_COLLEGE + PCT_BLUE_COLLAR
    + MEDIAN_HOUSEHOLD_INCOME + ESTABLISHMENTS + ROAD_QUALITY_INDEX + APPLICATION_COUNT
    + PROJECTED_GROWTH_PCT + UNIT_SALES_CY + UNIT_SALES_PY + OFFSET + ADJUSTED_OFFSET
    + AVG_CLUSTER_CY_UNIT_SALES + AVG_CLUSTER_CY_TOTAL_SALES #+ AVG_CLUSTER_CY_LOST_SALES
    ,family = binomial
    ,data = train)
summary(logit.fit)
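To score the logit model you would call predict() with type = "response", since glm with family = binomial returns log-odds by default. Because the skus data are proprietary, this sketch demonstrates the pattern on the built-in mtcars data:

```r
## Stand-in for scoring logit.fit on the test set (mtcars replaces skus)
demo.fit  <- glm(vs ~ mpg + wt, family = binomial, data = mtcars)
demo.prob <- predict(demo.fit, newdata = mtcars, type = "response")  # probabilities, not log-odds
demo.class <- ifelse(demo.prob > 0.5, 1, 0)                          # 0.50 decision cutoff
table(predicted = demo.class, actual = mtcars$vs)                    # confusion matrix
```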
MODELING (CONT.)
Build Models
CART
require(rpart)
set.seed(1234567)
tree = rpart(trainY ~.
,data = train
,method = "class"
,cp = 0
,minsplit = 4
,minbucket = 2
,parms = list(prior=c(0.5, 0.5)))
# summary(tree) # this will take a while
## find the best pruned tree
i.min = which.min(tree$cptable[,"xerror"])
i.se = which.min(abs(tree$cptable[,"xerror"]
- (tree$cptable[i.min,"xerror"]
+ tree$cptable[i.min,"xstd"])))
alpha.best = tree$cptable[i.se, "CP"]
tree.p = prune(tree, cp=alpha.best)
## obtain predictions
treeTrainp = predict(tree.p, train)[,2]
treeTrainc = treeTrainp
treeTrainc[which(treeTrainc>.5)] = 1
treeTrainc[which(treeTrainc<=.5)] = 0
treeTestp = predict(tree.p, test)[,2]
treeTestc = treeTestp
treeTestc[which(treeTestc>.5)] = 1
treeTestc[which(treeTestc<=.5)] = 0
ENSEMBLING VIA BAGGING
Bootstrap aggregation (Bagging)
Bagging is a simple way to increase the predictive power of a model
Pros
Useful when the base learner is unstable, meaning small changes in the training data produce large changes in the fitted model
Cons
Smaller bootstrap samples yield more instability, and samples that are too small yield poor models
How
Take several random samples, with replacement, from the training data set
Use each sample to construct a separate predictive model and generate predictions for the test set
Average the predictions to come up with one final predicted value
my_list = getBS_samples(seed=my_seed, Ntrees=100, SampleSize=1000)
bagged_tree = BS_Trees(response=train[,1], datasetName=train, sampleList=my_list, Ntrees=100)
bagged_probs = getFinalPredictions(bagged_tree, dataSetName=train, Ntrees=100)
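getBS_samples, BS_Trees, and getFinalPredictions are custom helpers from the author's main.R, not CRAN functions. A minimal base-R + rpart sketch of the same bootstrap-fit-average loop, on a stand-in binary problem (iris with setosa dropped), might look like this:

```r
library(rpart)
set.seed(1234567)

demo <- iris[iris$Species != "setosa", ]           # stand-in binary classification problem
demo$Species <- factor(demo$Species)               # drop the unused level

Ntrees <- 25
probs <- sapply(seq_len(Ntrees), function(i) {
  boot <- demo[sample(nrow(demo), replace = TRUE), ]  # bootstrap sample of the training data
  fit  <- rpart(Species ~ ., data = boot, method = "class")
  predict(fit, newdata = demo)[, 2]                   # P(second class) for every row
})

bagged_prob  <- rowMeans(probs)                       # average the trees' predictions
bagged_class <- ifelse(bagged_prob > 0.5,
                       levels(demo$Species)[2], levels(demo$Species)[1])
mean(bagged_class == demo$Species)                    # training accuracy of the bagged model
```

For ready-made implementations, CRAN packages such as ipred (its bagging() function), adabag, and randomForest cover bagging directly.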
What R packages are available for bagging?
MODEL ASSESSMENT - TRAINING
C5.0
Logit
CART
Bagged CART
Why does the bagged tree perform worse?
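The per-model confusion matrices appeared as screenshots in the original slides. A base-R sketch of the metrics they report (accuracy, sensitivity, specificity), using hypothetical toy vectors:

```r
## Hypothetical predicted vs. actual class labels
actual    <- c(1, 1, 1, 0, 0, 0, 1, 0)
predicted <- c(1, 0, 1, 0, 0, 1, 1, 0)

cm <- table(predicted, actual)               # confusion matrix
accuracy    <- sum(diag(cm)) / sum(cm)       # overall proportion correct
sensitivity <- cm["1", "1"] / sum(cm[, "1"]) # true positive rate
specificity <- cm["0", "0"] / sum(cm[, "0"]) # true negative rate
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)  # all 0.75 here
```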
MODEL ASSESSMENT - TESTING
C5.0
Logit
CART
Bagged CART
How do we store a Bagged Model in R?
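One simple answer: a bagged model is just a list of fitted trees, and any R object can be serialized with saveRDS() and reloaded with readRDS(). A sketch with a stand-in model list:

```r
## bagged_model stands in for a real list of fitted trees
bagged_model <- list(trees  = list(lm(dist ~ speed, data = cars)),
                     Ntrees = 1,
                     seed   = 1234567)

path <- file.path(tempdir(), "bagged_model.rds")
saveRDS(bagged_model, path)   # write the whole object to disk
restored <- readRDS(path)     # read it back, e.g. in a later scoring job

identical(names(restored), names(bagged_model))  # TRUE
```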
OPTIMAL DECISION CUTPOINTS
Why use a decision cutoff threshold of 0.50?
Example of an ROC curve
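As a sketch of what an ROC curve plots: sensitivity (true positive rate) against 1 - specificity (false positive rate) across every decision threshold. The toy scores below are hypothetical; packages such as pROC and ROCR do this (plus AUC) for real work.

```r
## Hypothetical model scores and actual classes
actual <- c(1, 1, 0, 1, 0, 0)
score  <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.2)

thresholds <- sort(unique(score), decreasing = TRUE)
roc <- t(sapply(thresholds, function(t) {
  pred <- as.numeric(score >= t)                              # classify at this threshold
  c(fpr = sum(pred == 1 & actual == 0) / sum(actual == 0),    # 1 - specificity
    tpr = sum(pred == 1 & actual == 1) / sum(actual == 1))    # sensitivity
}))
roc  # plot(roc[, "fpr"], roc[, "tpr"], type = "s") draws the curve
```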
OPTIMAL DECISION CUTPOINTS
See require(OptimalCutpoints)
## Define my methods list
methodList = list(
     "Youden" #1 (Youden Index)
    ,"ROC01" #2 (minimizes distance between ROC plot and point (0,1))
    ,"PROC01" #3 (minimizes distance between PROC plot and point (0,1))
    #,"MaxAccuracyArea" #4 (maximizes Accuracy Area)
    #,"AUC" #5 (maximizes concordance, which is a function of AUC)
    ,"MaxEfficiency" #6 (maximizes Efficiency or Accuracy)
    ,"MaxKappa" #7 (maximizes Kappa Index)
    #,"MinErrorRate" #8 (minimizes Error Rate)
    )
C5.0 Training using the Youden cutoff method
Using other cutoff methods…..
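To make the Youden method concrete, here is a base-R sketch (hypothetical toy data, not the OptimalCutpoints implementation) of choosing the cutoff that maximizes sensitivity + specificity - 1:

```r
## Hypothetical model scores and actual classes
actual <- c(1, 1, 0, 1, 0, 0)
score  <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.2)

cuts <- sort(unique(score))
youden <- sapply(cuts, function(t) {
  pred <- as.numeric(score >= t)
  sens <- sum(pred == 1 & actual == 1) / sum(actual == 1)  # sensitivity at this cutoff
  spec <- sum(pred == 0 & actual == 0) / sum(actual == 0)  # specificity at this cutoff
  sens + spec - 1                                          # Youden index J
})
cuts[which.max(youden)]  # the Youden-optimal cutoff: 0.6 here
```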