Ensemble Modeling with R
Matthew A. Lanham
Doctoral Candidate/Merchandise Data Scientist
MatthewALanham.com
Virginia Tech, Department of Business Information Technology
Advance Auto Parts, Inc.
Outline
My Background and Research
Pros and Cons of R for Data Science
Modeling Using CRISP-DM Framework
What is Ensemble Modeling?
Fitting Models
Bagging a decision tree
Optimal Decision Cut Points for Binary Classification
SEPTEMBER 15, 2014 2
Background
(2005) B.A. Economics/Mathematics, Indiana University-Bloomington
(2005 - 2010) Genscape, Inc., Louisville, KY – Energy transparency start-up
(2008 - 2010) M.S. Biostatistics-Decision Science, University of Louisville
(2010 - 2012) M.S. Statistics, Virginia Tech
(2011 - Current) Ph.D. Business Information Technology, Virginia Tech
(2014 - Current) Advance Auto Parts, Inc. – Fortune 500 Retailer (#402.. for now)
Research Focus
• How can we build predictive models that are empirically sound (statistics) to serve as
input parameters to prescriptive models that are process-representative (optimization),
providing the “best” (maintainable, timely, scalable, KPI-fused) decision support for a
retailer’s assortment plan?
Why is assortment planning so important?
Why is assortment planning problem so challenging?
Where does Data Science & Big Data Analytics (BDA) come into play?
Where does Information Technology (IT) come into play?
Where does Business come into play?
INTEGRATING PREDICTIVE AND PRESCRIPTIVE ANALYTICS
[Diagram: Predictive model(s), namely a demand model built from market specs via an estimation model, a utility model (preference structure, similarity measures), and a sales model, produce demand forecast parameter(s) that feed a prescriptive decision model. The decision model loops a search algorithm over optimality conditions and decision criteria (max profit), with decision variables (1. assortment, 2. prices, 3. promotion, 4. shelf space*) and performance measures (objective function: revenue; constraints: budget(s)). Performance measures that are functions over a time horizon, or random variables that must be summarized over their distributions, are rolled up as time summary (TS/TSPM) and scenario summary (SS/SSPM) performance measures (e.g., sum, average) to determine the optimal solution.]
WHAT I’M WORKING WITH CURRENTLY
My opinion: IBM SPSS Modeler and SAS Enterprise Miner are:
1) Great for teaching
2) Great for stand-alone data mining projects
3) Visually appealing to management
4) Not great for real-time production analytics
5) Not great for customized solutions
6) Not designed for prescriptive analytics
An example of an IBM SPSS Modeler stream building predictive models
DATA SCIENCE WITH R
R is an open-source software language for statistical and mathematical computing, freely available under the GNU General Public License, version 2 (Ihaka & Gentleman, 1996).
R is compatible with many operating systems, such as Windows, Macintosh, Unix, and Linux.
According to Eric Siegel, author of Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die, R is “The leading free, open-source software tool for PA (Predictive Analytics), … has a rapidly expanding base of users as well as enthusiastic volunteer developers who add to and support its functionalities” (Siegel, 2013).
Today there are several thousand user-developed packages (also referred to as libraries) available.
Packages are collections of R functions, compiled code, and data put together in a specific format following CRAN’s guidelines. You can search for packages by application area here (http://cran.r-project.org/web/views/).
As of July 2014, there are 33 different application areas. In the Machine Learning application area alone there are 72 packages, with functions covering nearly any methodology. Many newer techniques are available here that are not available in commercial software packages.
Cons – Memory, memory, memory!
See memory_example.R
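The original memory_example.R is not reproduced here; as a stand-in, this small sketch illustrates why memory is the pain point: R keeps objects fully in RAM and copies on modify.

```r
## Illustrative stand-in for memory_example.R (not the original file).
x <- rnorm(1e6)                      # one million doubles, held entirely in RAM
print(object.size(x), units = "MB")  # ~7.6 MB for a single numeric vector

## Copy-on-modify: this modification briefly needs memory for BOTH copies
y <- x
y[1] <- 0

## Reclaim memory and report usage
rm(x, y)
invisible(gc())
```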
WHAT IS ENSEMBLE MODELING?
Ensemble methods train multiple predictive models and then combine the
predictions to achieve a higher overall performance and stability.
Pros
Ensemble methods require little tuning
Ensemble methods operate on a variety of input types (categorical variables, integers, and real numbers)
Ensemble methods can be used on a variety of problems (binary and multi-class classification, rankings, regression, etc.)
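As a minimal illustration of "combine the predictions," the hypothetical probability vectors below stand in for the test-set outputs of three fitted models; the ensemble simply averages them and thresholds at 0.50:

```r
## Hypothetical probability outputs from three models on three test cases
p1 <- c(0.9, 0.4, 0.2)
p2 <- c(0.8, 0.6, 0.1)
p3 <- c(0.7, 0.5, 0.3)

ensemble_prob  <- (p1 + p2 + p3) / 3          # average the model confidences
ensemble_class <- ifelse(ensemble_prob > 0.5, 1, 0)
ensemble_class  # 1 0 0
```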
DATA MINING FRAMEWORK
Cross-Industry Standard Process for Data Mining (CRISP-DM) is a general data mining process model that can be applied to solve any business problem.
There are other popular data mining and analytics process models, such as Sample-Explore-Modify-Model-Assess (SEMMA), but in my opinion CRISP-DM is more structured and detailed.
CRISP-DM was created and refined over time by leading practitioners and researchers in the data mining field, and has been shown to lead to analytical results that align with business objectives.
The CRISP-DM process model and techniques primarily fall under the predictive analytics domain in business
analytics, where the objective is to help organizations predict future events and proactively act upon such
insights in a systematic fashion to drive better business outcomes (Provost & Fawcett, 2013). However, this
process could be extended to prescriptive (i.e. optimization) analytics endeavors as well.
CRISP-DM DETAILED VIEW
CRISP-DM Model Phases and Tasks (Source: Modified from www.crisp-dm.org)
BUSINESS UNDERSTANDING
Business Objectives
Retail “assortment planning,” at the most basic level, asks which products to offer and how many (Mantrala et al., 2009).
Assortment planning is one of the most important decisions faced by retailers (Sauré & Zeevi, 2013). Because of financial and physical capacity constraints, a retailer operationally does not have the ability to stock, let alone hold in store, every possible product a consumer may desire (Sauré & Zeevi, 2013).
You must get the project sponsor to detail the business success criteria. It’s
not some predictive model accuracy statistic. Examples:
Increased Sales of X% at Stores Y and Z.
Reduced non-working inventory of W% at Stores Y and Z.
Assess Situation
May use R and any of its available packages
Deadline is September 17th at Meetup
Competition winner gets $100, losers learn something, Speaker gets feedback
Data Mining Goals
Determine the best overall test accuracy on a 10% out-of-training set
Neither sensitivity nor specificity may fall below 0.70 on the out-of-training set to qualify.
Project Plan
Layout your expected work schedule, breaks, etc.
Will vary depending on your experience using R
DATA UNDERSTANDING
Collect Data
http://www.matthewalanham.com/Presentations/skus.xlsx
Describe Data
Usually you will create such a table yourself, but make it more descriptive.
The data scientist will ask the domain expert(s) questions such as:
What are the variables’ units of measure?
Where does the data come from? When is it updated?
How and why was clustering performed a particular way?
How and why was a variable adjusted?
Role  Variable  Description
      store_number  A unique store identifier
      sku_number  A unique SKU identifier
Y     SOLD  Whether the SKU sold in a respective store (1=yes, 0=no) in the last 13 periods after it was replenished/maxed.
      NUM_SOLD  The number of realized unit SKU sales for a respective store over the past 1-13 periods.
X     NUM_SOLD_LAST  The number of realized unit SKU sales for a respective store over the past 14-26 periods.
X     application_count  The total number of different year-make-model vehicle options that the respective SKU could be used for.
X     projected_growth_pct  The projected percentage growth for this SKU in the next 13 periods, based on financial experts.
X     offset  For each store-SKU, the positive deviation, based on unit sales, from the center of the part-type-specific distribution.
X     adjusted_offset  For each store-SKU, the positive deviation, based on unit sales, from the center of the part-type-specific distribution, adjusted via an ad-hoc calculation.
X     unit_sales_py  The total number of units sold for this SKU over all stores between 27 and 39 periods ago.
X     unit_sales_cy  The total number of units sold for this SKU over all stores between 14 and 26 periods ago.
X     unit_sales_fy  The total number of units sold for this SKU over all stores over the past 13 periods.
X     total_vio  The total number of "estimated" vehicles in operation associated with a particular store, based on an ad-hoc calculation.
X     adjusted_total_vio  The total number of "estimated" vehicles in operation associated with a particular store, based on an ad-hoc calculation.
X     vio_compared_to_cluster  The percentage of vehicles in operation (VIO) for a respective store compared to the total VIO for all stores in its cluster over the past 14 to 26 periods.
X     avg_cluster_cy_unit_sales  The average number of SKUs sold, based on a clustering of all stores, over the past 13 to 26 periods.
X     avg_cluster_cy_total_sales  The average total sales (a combination of unit and lost sales), based on store clusters, over the past 14 to 26 periods.
X     avg_cluster_cy_lost_sales  The average number of lost sales, clustered over all stores, over the past 14 to 26 periods.
X     pop_est_cy  Estimated number of persons in the population where the store is located, based on the latest period.
X     pop_density_cy  Estimated density (a percentage) of the population where the store is located, based on the latest period.
X     pct_white  Estimated percentage of Caucasian-identified persons where the store is located, based on the latest period.
X     age  Estimated median person-age where the store is located, based on the latest period.
X     pct_college  Estimated percentage of college-educated persons where the store is located, based on the latest period.
X     pct_blue_collar  Estimated percentage of blue-collar workers where the store is located, based on the latest period.
X     median_household_income  Estimated median household income where the store is located, based on the latest period.
X     establishments  Estimated number of physical locations where business is conducted or where services or industrial operations are performed, where the store is located, based on the latest period.
X     road_quality_index  A measure of the quality of the roads in the area where the store is located.
DATA UNDERSTANDING (CONT.)
Explore Data
Matt’s source code: https://github.com/malanham/Datathon.git
Find the main.R
Data Quality (custom helper functions from main.R):
DataQualityReport(skus)
DataQualityReportOverall(dataSetName=skus)
DATA PREPARATION
Data Description
(Same variable description table as shown under Data Understanding.)
## keep only complete cases (rows with no missing values)
skus = skus[complete.cases(skus),]
DataQualityReportOverall(dataSetName=skus)
MODELING
Modeling Techniques
C5.0 Decision tree
Logistic Regression
CART Decision tree
MODELING (CONT.)
Design
When building and testing predictive models using observational data (i.e., data that is not controlled as in laboratory experimentation), the question that must be answered is: how valid is my model with regard to what will happen next?
In a properly designed and controlled experiment, data
(samples) used in the experiment are used to make inferences
about the population.
Regardless of how large or small the sample is compared to the
true size of the population, this single randomly selected subset
of the population allows for generalizability of the remaining
subset of data not used in the study.
Cross-validation is the most practical and cost-effective means of obtaining a proxy for truth in predictive analytics.
## Randomly partition data into training and test sets
my_seed = 1234567
skus = GenerateTTV(dataSetName=skus, response='SOLD', trainPct=.90, testPct=.10, my_seed)
GeneratePartitionPcts(dataSetName=skus)
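GenerateTTV and GeneratePartitionPcts are custom helpers from the author's main.R. A plain base-R sketch of the same 90/10 split, on a made-up stand-in data frame (skus_demo), might look like this:

```r
set.seed(1234567)
skus_demo <- data.frame(SOLD = rbinom(100, 1, 0.5))  # stand-in for the skus data

n   <- nrow(skus_demo)
idx <- sample(seq_len(n), size = floor(0.9 * n))     # 90% of the row indices

skus_demo$SPSS_Partition      <- "Test"              # everything starts as Test
skus_demo$SPSS_Partition[idx] <- "Train"             # sampled rows become Train

table(skus_demo$SPSS_Partition) / n                  # Train 0.9, Test 0.1
```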
Using the training-data error rate as a proxy for a model’s generalization error is not wise, especially when the training error is low to almost perfect. Most likely the model has been overfit (or over-trained) and will not perform as well when new examples are fed through and evaluated from a validation data set (Zhou, 2012).
MODELING (CONT.)
Design
## Percentage of target equal to 1 (Y='SOLD') in the total data set
dim(skus[which(skus$SOLD==1),])[[1]] / dim(skus[which(skus$SOLD==0 | skus$SOLD==1),])[[1]]
## Percentage of target equal to 1 in the training set
dim(skus[which(skus$SOLD==1 & skus$SPSS_Partition=='Train'),])[[1]] / dim(skus[which((skus$SOLD==1 | skus$SOLD==0) & skus$SPSS_Partition=='Train'),])[[1]]
## Percentage of target equal to 1 in the test set
dim(skus[which(skus$SOLD==1 & skus$SPSS_Partition=='Test'),])[[1]] / dim(skus[which((skus$SOLD==1 | skus$SOLD==0) & skus$SPSS_Partition=='Test'),])[[1]]
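The same class-balance ratios can be computed more compactly with table() and prop.table(); a sketch on a made-up stand-in data frame:

```r
## demo stands in for the skus data frame
demo <- data.frame(SOLD = c(1, 1, 0, 0, 1, 0, 1, 0),
                   SPSS_Partition = rep(c("Train", "Test"), each = 4))

prop.table(table(demo$SOLD))                          # overall class balance
prop.table(table(demo$SPSS_Partition, demo$SOLD), 1)  # balance within each partition (row-wise)
```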
## remove independent variables that you don't want to use
names(skus)
skus2 = skus[,c(3,5:19,21:28)]
head(skus2)
skus2$SOLD = as.factor(skus2$SOLD)
MODELING (CONT.)
## set up data for algorithms
trainX = skus2[which(skus2$SPSS_Partition=='Train'),2:(length(skus2)-1)]
trainY = skus2[which(skus2$SPSS_Partition=='Train'),'SOLD']
train = cbind(trainY,trainX)
testX = skus2[which(skus2$SPSS_Partition=='Test'),2:(length(skus2)-1)]
testY = skus2[which(skus2$SPSS_Partition=='Test'),'SOLD']
test = cbind(testY,testX)
MODELING (CONT.)
Build Models
C5.0 Decision tree
require(C50) # fit classification tree models or rule-based models using Quinlan's C5.0 algorithm
C5Params = C5.0Control(…)
C5 = C5.0(x=trainX, y=trainY
          ,control=C5Params # control parameters defined above
          ,trials=1 # number of boosting iterations; 1 implies a single model is used
          )
summary(C5)
The confusion matrix is based on a decision cutoff threshold of 0.50
Overall error rate
Variables that were used to create the tree
Changing the trials from 1 to 1000 doesn't change the result in this case
MODELING (CONT.)
Build Models
C5.0 Decision tree
## training probabilities and predicted classes
C5trainp = predict(C5
             ,newdata = trainX
             ,trials = C5$trials["Actual"]
             ,type = "prob" # either "class" for the predicted class or "prob" for model confidence values
             ,na.action = na.pass)[,2]
C5trainc = predict(C5
             ,newdata = trainX
             ,trials = C5$trials["Actual"]
             ,type = "class"
             ,na.action = na.pass)
## testing probabilities and predicted classes
C5testp = predict(C5
             ,newdata = testX
             ,trials = C5$trials["Actual"]
             ,type = "prob"
             ,na.action = na.pass)[,2]
C5testc = predict(C5
             ,newdata = testX
             ,trials = C5$trials["Actual"]
             ,type = "class"
             ,na.action = na.pass)
MODELING (CONT.)
Build Models
Logistic Regression
logit.fit = glm(trainY ~ NUM_SOLD_LAST + TOTAL_VIO + ADJ_TOTAL_VIO + VIO_COMPARED_TO_CLUSTER
    + POP_EST_CY + POP_DENSITY_CY + PCT_WHITE + AGE + PCT_COLLEGE + PCT_BLUE_COLLAR
    + MEDIAN_HOUSEHOLD_INCOME + ESTABLISHMENTS + ROAD_QUALITY_INDEX + APPLICATION_COUNT
    + PROJECTED_GROWTH_PCT + UNIT_SALES_CY + UNIT_SALES_PY + OFFSET + ADJUSTED_OFFSET
    + AVG_CLUSTER_CY_UNIT_SALES + AVG_CLUSTER_CY_TOTAL_SALES #+ AVG_CLUSTER_CY_LOST_SALES
    ,family = binomial
    ,data = train)
summary(logit.fit)
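To score the logit model you would call predict() with type = "response", since glm with family = binomial returns log-odds by default. Because the skus data are proprietary, this sketch demonstrates the pattern on the built-in mtcars data:

```r
## Stand-in for scoring logit.fit on the test set (mtcars replaces skus)
demo.fit  <- glm(vs ~ mpg + wt, family = binomial, data = mtcars)
demo.prob <- predict(demo.fit, newdata = mtcars, type = "response")  # probabilities, not log-odds
demo.class <- ifelse(demo.prob > 0.5, 1, 0)                          # 0.50 decision cutoff
table(predicted = demo.class, actual = mtcars$vs)                    # confusion matrix
```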
MODELING (CONT.)
Build Models
CART
require(rpart)
set.seed(1234567)
tree = rpart(trainY ~.
,data = train
,method = "class"
,cp = 0
,minsplit = 4
,minbucket = 2
,parms = list(prior=c(0.5, 0.5)))
# summary(tree) # this will take a while
## find the best pruned tree
i.min = which.min(tree$cptable[,"xerror"])
i.se = which.min(abs(tree$cptable[,"xerror"]
- (tree$cptable[i.min,"xerror"]
+ tree$cptable[i.min,"xstd"])))
alpha.best = tree$cptable[i.se, "CP"]
tree.p = prune(tree, cp=alpha.best)
## obtain predictions
treeTrainp = predict(tree.p, train)[,2]
treeTrainc = treeTrainp
treeTrainc[which(treeTrainc>.5)] = 1
treeTrainc[which(treeTrainc<=.5)] = 0
treeTestp = predict(tree.p, test)[,2]
treeTestc = treeTestp
treeTestc[which(treeTestc>.5)] = 1
treeTestc[which(treeTestc<=.5)] = 0
ENSEMBLING VIA BAGGING
Bootstrap aggregation (Bagging)
Bagging is a simple way to increase the predictive power of a model
Pros
Useful when the base learner is unstable, meaning small changes in the training data produce large changes in the fitted model
Cons
Smaller bootstrap samples yield more instability, and samples that are too small yield poor models
How
Take several random samples, with replacement, from the training data set
Use each sample to construct a separate predictive model and generate predictions for the test set
Average the predictions to come up with one final predicted value
my_list = getBS_samples(seed=my_seed, Ntrees=100, SampleSize=1000)
bagged_tree = BS_Trees(response=train[,1], datasetName=train, sampleList=my_list, Ntrees=100)
bagged_probs = getFinalPredictions(bagged_tree, dataSetName=train, Ntrees=100)
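getBS_samples, BS_Trees, and getFinalPredictions are custom helpers from the author's main.R, not CRAN functions. A minimal base-R + rpart sketch of the same bootstrap-fit-average loop, on a stand-in binary problem (iris with setosa dropped), might look like this:

```r
library(rpart)
set.seed(1234567)

demo <- iris[iris$Species != "setosa", ]           # stand-in binary classification problem
demo$Species <- factor(demo$Species)               # drop the unused level

Ntrees <- 25
probs <- sapply(seq_len(Ntrees), function(i) {
  boot <- demo[sample(nrow(demo), replace = TRUE), ]  # bootstrap sample of the training data
  fit  <- rpart(Species ~ ., data = boot, method = "class")
  predict(fit, newdata = demo)[, 2]                   # P(second class) for every row
})

bagged_prob  <- rowMeans(probs)                       # average the trees' predictions
bagged_class <- ifelse(bagged_prob > 0.5,
                       levels(demo$Species)[2], levels(demo$Species)[1])
mean(bagged_class == demo$Species)                    # training accuracy of the bagged model
```

For ready-made implementations, CRAN packages such as ipred (its bagging() function), adabag, and randomForest cover bagging directly.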
What R packages are available for bagging?
MODEL ASSESSMENT - TRAINING
C5.0
Logit
CART
Bagged CART
Why does the bagged tree perform worse?
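The per-model confusion matrices appeared as screenshots in the original slides. A base-R sketch of the metrics they report (accuracy, sensitivity, specificity), using hypothetical toy vectors:

```r
## Hypothetical predicted vs. actual class labels
actual    <- c(1, 1, 1, 0, 0, 0, 1, 0)
predicted <- c(1, 0, 1, 0, 0, 1, 1, 0)

cm <- table(predicted, actual)               # confusion matrix
accuracy    <- sum(diag(cm)) / sum(cm)       # overall proportion correct
sensitivity <- cm["1", "1"] / sum(cm[, "1"]) # true positive rate
specificity <- cm["0", "0"] / sum(cm[, "0"]) # true negative rate
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)  # all 0.75 here
```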
MODEL ASSESSMENT - TESTING
C5.0
Logit
CART
Bagged CART
How do we store a Bagged Model in R?
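One simple answer: a bagged model is just a list of fitted trees, and any R object can be serialized with saveRDS() and reloaded with readRDS(). A sketch with a stand-in model list:

```r
## bagged_model stands in for a real list of fitted trees
bagged_model <- list(trees  = list(lm(dist ~ speed, data = cars)),
                     Ntrees = 1,
                     seed   = 1234567)

path <- file.path(tempdir(), "bagged_model.rds")
saveRDS(bagged_model, path)   # write the whole object to disk
restored <- readRDS(path)     # read it back, e.g. in a later scoring job

identical(names(restored), names(bagged_model))  # TRUE
```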
OPTIMAL DECISION CUTPOINTS
Why use a decision cutoff threshold of 0.50?
Example of an ROC curve
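As a sketch of what an ROC curve plots: sensitivity (true positive rate) against 1 - specificity (false positive rate) across every decision threshold. The toy scores below are hypothetical; packages such as pROC and ROCR do this (plus AUC) for real work.

```r
## Hypothetical model scores and actual classes
actual <- c(1, 1, 0, 1, 0, 0)
score  <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.2)

thresholds <- sort(unique(score), decreasing = TRUE)
roc <- t(sapply(thresholds, function(t) {
  pred <- as.numeric(score >= t)                              # classify at this threshold
  c(fpr = sum(pred == 1 & actual == 0) / sum(actual == 0),    # 1 - specificity
    tpr = sum(pred == 1 & actual == 1) / sum(actual == 1))    # sensitivity
}))
roc  # plot(roc[, "fpr"], roc[, "tpr"], type = "s") draws the curve
```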
OPTIMAL DECISION CUTPOINTS
See require(OptimalCutpoints)
## Define my methods list
methodList = list(
     "Youden" #1 (Youden Index)
    ,"ROC01" #2 (minimizes distance between ROC plot and point (0,1))
    ,"PROC01" #3 (minimizes distance between PROC plot and point (0,1))
    #,"MaxAccuracyArea" #4 (maximizes Accuracy Area)
    #,"AUC" #5 (maximizes concordance, which is a function of AUC)
    ,"MaxEfficiency" #6 (maximizes Efficiency or Accuracy)
    ,"MaxKappa" #7 (maximizes Kappa Index)
    #,"MinErrorRate" #8 (minimizes Error Rate)
    )
C5.0 Training using the Youden cutoff method
Using other cutoff methods…..
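To make the Youden method concrete, here is a base-R sketch (hypothetical toy data, not the OptimalCutpoints implementation) of choosing the cutoff that maximizes sensitivity + specificity - 1:

```r
## Hypothetical model scores and actual classes
actual <- c(1, 1, 0, 1, 0, 0)
score  <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.2)

cuts <- sort(unique(score))
youden <- sapply(cuts, function(t) {
  pred <- as.numeric(score >= t)
  sens <- sum(pred == 1 & actual == 1) / sum(actual == 1)  # sensitivity at this cutoff
  spec <- sum(pred == 0 & actual == 0) / sum(actual == 0)  # specificity at this cutoff
  sens + spec - 1                                          # Youden index J
})
cuts[which.max(youden)]  # the Youden-optimal cutoff: 0.6 here
```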