Benchmarking_ML_Tools

Benchmarking 20 Machine Learning Models Accuracyand Speed

Marc Borowczak, PRC Consulting LLCMarch 29, 2016

Contents

Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

Step 0: Selection & Reproducibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

Step 1: Retrieve 1st Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

Step 2. Data Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

Step 3. Classification Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

Step 4. Performance Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

Step 5: Retrieve 2nd Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

Step 6. Data Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

Step 7. Classification Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

Step 8. Performance Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

Step 9. Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

Step 10. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

Summary

As Machine Learning tools become mainstream, and ever-growing choice of these is available to data scientistsand analysts, the need to assess those best suited becomes challenging. In this study, 20 Machine Learningmodels were benchmarked for their accuracy and speed performance on a multi-core hardware, when appliedto 2 multinomial datasets differing broadly in size and complexity. It was observed that BAG-CART, RF andBOOST-C50 top the list at more than 99% accuracy while NNET, PART, GBM, SVM and C45 exceeded95% accuracy on the small Car Evaluation dataset. On the larger and more complex Nursery dataset, weobserved BAG-CART, BOOST-C50, PART, SVM and RF exceeded 99% accuracy, while JRIP, NNET, H2O,C45, and KNN exceeded 95% accuracy. However, overwhelming dependencies on Speed (determined on anaverage of 5-runs) were observed on a multicore hardware, with only CART, MDA and GBM as contendersfor the Car Evaluation dataset. For the more complex Nursery dataset, a different outcome was observed,with MDA, ONE-R and BOOST-C50 as fastest and overall best predictors. The implications for the DataAnalytics Leaders are to continue allocating resources to insure Machine Learning benchmarks are conductedregularly, documented and communicated thru the Analysts teams, and to insure the most efficient toolsbased on established criteria are applied on day-to-day operations. The implications of these findings fordata scientists are to retain benchmarking tasks in the front- and not on the back-burner of activities’ list,and to continue monitoring new, more efficient and distributed and/or parallellized algorithms and theireffects on various hardware platforms. Ultimately, finding the best tool depends strongly on criteria selectionand certainly on hardware platforms available. Therefore this benchmarking task may well rest on the dataanalyst leaders’ and engineers’ to-do list for the foreseeable future.

1

http://www.prcconsulting.net

Step 0: Selection & Reproducibility

As Machine Learning gains attention, more applications and models are being used, and often speed oraccuracy of the predicted models lack comparisons. In this analysis, we’ll compare the accuracy and speedof 20 Machine learning models commonly selected. We’ll exercise these models on two multinomial UCIreference datasets, differing in size, predictor levels and number of dependent variables. The first and smallestdataset is Car_Evaluation and we’ll compare results when applying on the larger and more complex Nurserydataset.

Sys.info()[1:5]

## sysname release version nodename machine## "Windows" "10 x64" "build 10586" "STALLION" "x86-64"

sessionInfo()

## R version 3.2.4 Revised (2016-03-16 r70336)## Platform: x86_64-w64-mingw32/x64 (64-bit)## Running under: Windows 10 x64 (build 10586)#### locale:## [1] LC_COLLATE=English_United States.1252## [2] LC_CTYPE=English_United States.1252## [3] LC_MONETARY=English_United States.1252## [4] LC_NUMERIC=C## [5] LC_TIME=English_United States.1252#### attached base packages:## [1] stats graphics grDevices utils datasets methods base#### loaded via a namespace (and not attached):## [1] magrittr_1.5 formatR_1.3 tools_3.2.4 htmltools_0.3## [5] yaml_2.1.13 stringi_1.0-1 rmarkdown_0.9.5 knitr_1.12.3## [9] stringr_1.0.0 digest_0.6.9 evaluate_0.8.3

library(stringr)library(knitr)userdir <- getwd()set.seed(123)

Step 1: Retrieve 1st Dataset

We will mirror the approach used in the formulation challenge and use first the Car Evaluation dataset hostedon UCI Machine Learning Repository. We will use R to reproducibly and quickly download the dataset, andits full description. We continue to maintain reproducibility of the analysis as a general practice. The analysistool and platform are documented, all libraries clearly listed, while data is retrieved programmatically anddate stamped from the repository.

We will display a structure of the Car_Evaluation dataset and the corresponding dictionary to translate theattribute factors.

2

http://archive.ics.uci.edu/ml/datasets.html

http://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data

https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.names

datadir <- "./data"if (!file.exists("data")){dir.create("data")}uciUrl <- "http://archive.ics.uci.edu/ml/machine-learning-databases/"fileUrl <- paste0(uciUrl,"car/car.data?accessType=DOWNLOAD")download.file(fileUrl, destfile="./data/cardata.csv", method = "curl")dateDownloaded <- date()car_eval <- read.csv("./data/cardata.csv",header=FALSE)fileUrl <- paste0(uciUrl,"car/car.names?accessType=DOWNLOAD")download.file(fileUrl, destfile = "./data/carnames.txt")txt <- readLines("./data/carnames.txt")lns <- data.frame(beg=which(grepl("buying v-high",txt)),end=which(grepl("med, high",txt)))# we now capture all lines of text between beg and end from txtres <- lapply(seq_along(lns$beg),function(l){paste(txt[seq(from=lns$beg[l],to=lns$end[l],by=1)],collapse=" ")})res <- gsub(" ", ":", res, fixed = TRUE)res <- gsub(" ", ":", res, fixed = TRUE)res <- gsub(" ", ":", res, fixed = TRUE)res <- gsub(" ", ":", res, fixed = TRUE)res <- gsub(" ", "\n", res, fixed = TRUE)res <- gsub(" ", "", res, fixed = TRUE)res <- gsub(" ", "", res, fixed = TRUE)res <- str_c(res,"\n")writeLines(res, "./data/parsed_attr.csv")attrib <- readLines("./data/parsed_attr.csv")nv <- length(attrib) # number of attributesattrib <- sapply (1:nv,function(i) {gsub(":"," ",attrib[i],fixed=TRUE)})dictionary <- sapply (1:nv,function(i) {strsplit(attrib[i],' ')})dictionary[[nv]][1]<-"class"colnames(car_eval)<-sapply(1:nv,function(i) {colnames(car_eval)[i]<-dictionary[[i]][1]})cm<-list()x<-car_eval[,1:(nv-1)]y<-car_eval[,nv]fmla<-paste(colnames(car_eval)[1:(nv-1)],collapse="+")fmla<-paste0(colnames(car_eval)[nv],"~",fmla)fmla<-as.formula(fmla)nlev<-nlevels(y) # number of factors describing class

Step 2. Data Exploration

head(car_eval)

## buying maint doors persons lug_boot safety class## 1 vhigh vhigh 2 2 small low unacc## 2 vhigh vhigh 2 2 small med unacc## 3 vhigh vhigh 2 2 small high unacc## 4 vhigh vhigh 2 2 med low unacc## 5 vhigh vhigh 2 2 med med unacc## 6 vhigh vhigh 2 2 med high unacc

summary(car_eval)

## buying maint doors persons lug_boot safety

3

## high :432 high :432 2 :432 2 :576 big :576 high:576## low :432 low :432 3 :432 4 :576 med :576 low :576## med :432 med :432 4 :432 more:576 small:576 med :576## vhigh:432 vhigh:432 5more:432## class## acc : 384## good : 69## unacc:1210## vgood: 65

str(car_eval)

## 'data.frame': 1728 obs. of 7 variables:## $ buying : Factor w/ 4 levels "high","low","med",..: 4 4 4 4 4 4 4 4 4 4 ...## $ maint : Factor w/ 4 levels "high","low","med",..: 4 4 4 4 4 4 4 4 4 4 ...## $ doors : Factor w/ 4 levels "2","3","4","5more": 1 1 1 1 1 1 1 1 1 1 ...## $ persons : Factor w/ 3 levels "2","4","more": 1 1 1 1 1 1 1 1 1 2 ...## $ lug_boot: Factor w/ 3 levels "big","med","small": 3 3 3 2 2 2 1 1 1 3 ...## $ safety : Factor w/ 3 levels "high","low","med": 2 3 1 2 3 1 2 3 1 2 ...## $ class : Factor w/ 4 levels "acc","good","unacc",..: 3 3 3 3 3 3 3 3 3 3 ...

cm<-list() # initialize

Exploration reveals the multinomial car experience dataset comprises 6 attributes factors we can use topredict the 4-factor car recommendation class, with no missing data.

Step 3. Classification Analysis

20 Models will be sequentially selected to represent 3 groups: A)Linear, B)Non-linear and C)Non-LinearClassification with decision trees. From the collected Confusion Matrix performance, we will build a resultsdata frame and compare the prediction accuracies.

For each analysis, we’ll follow the sampe protocol: invoque the package library, build the model, summarizeit and predict the class (dependent variable), then save the accuracy data and predicted values in a list.

3.A Linear Classification

3.A1 Multinomial

library(nnet)library(caret)model<-multinom(fmla, data = car_eval, maxit = 500, trace=FALSE)prob<-predict(model,x,type="probs")pred<-apply(prob,1,which.max)pred[which(pred=="1")]<-levels(y)[1]pred[which(pred=="2")]<-levels(y)[2]pred[which(pred=="3")]<-levels(y)[3]pred[which(pred=="4")]<-levels(y)[4]pred<-as.factor(pred)l<-union(pred,y)mtab<-table(factor(pred,l),factor(y,l))cm[[1]]<-c("Multinomial","MULTINOM",confusionMatrix(mtab))cm[[1]]$table

4

#### unacc acc vgood good## unacc 1166 32 0 0## acc 43 346 2 6## vgood 0 2 63 4## good 1 4 0 59

cm[[1]]$overall[1]

## Accuracy## 0.9456019

3.A2 Logistic Regression

library(VGAM)model<-vglm(fmla, family = "multinomial", data = car_eval, maxit = 100)prob<-predict(model,x,type="response")pred<-apply(prob,1,which.max)pred[which(pred=="1")]<-levels(y)[1]pred[which(pred=="2")]<-levels(y)[2]pred[which(pred=="3")]<-levels(y)[3]pred[which(pred=="4")]<-levels(y)[4]pred<-as.factor(pred)l<-union(pred,y)mtab<-table(factor(pred,l),factor(y,l))cm[[2]]<-c("Logistic Regression","GLM",confusionMatrix(mtab))cm[[2]]$table


cm[[2]]$overall[1]

## Accuracy## 0.9456019

3.A3 Linear Discriminant Analysis

library(MASS)model<-lda(fmla,data=car_eval)pred<-predict(model,x)$classl<-union(pred,y)mtab<-table(factor(pred,l),factor(y,l))cm[[3]]<-c("Linear Discriminant Analysis","LDA",confusionMatrix(mtab))cm[[3]]$table

5


cm[[3]]$overall[1]

## Accuracy## 0.9010417

3.B Non-Linear Classification

3.B1 Mixture Discriminant Analysis

library(mda)model<-mda(fmla,data=car_eval)pred<-predict(model,x)l<-union(pred,y)mtab<-table(factor(pred,l),factor(y,l))cm[[4]]<-c("Mixture Discriminant Analysis","MDA",confusionMatrix(mtab))cm[[4]]$table


cm[[4]]$overall[1]

## Accuracy## 0.9212963

3.B2 Regularized Discriminant Analysis

library(klaR)model<-rda(fmla,data=car_eval,gamma = 0.05,lambda = 0.01)pred<-predict(model,x)$classl<-union(pred,y)mtab<-table(factor(pred,l),factor(y,l))cm[[5]]<-c("Regularized Discriminant Analysis","RDA",confusionMatrix(mtab))cm[[5]]$table


6

cm[[5]]$overall[1]

## Accuracy## 0.8778935

3.B3 Neural Network

library(nnet)library(devtools)model<-nnet(fmla,data=car_eval,size = 4, decay = 0.0001, maxit = 700, trace = FALSE)#import the function from Github# source_url('https://gist.githubusercontent.com/Peque/41a9e20d6687f2f3108d/raw/85e14f3a292e126f1454864427e3a189c2fe33f3/nnet_plot_update.r')# plot.nnet(model, alpha.val = 0.5, cex= 0.7, circle.col = list('lightblue', 'white'), bord.col = 'black')pred<-predict(model,x,type="class")pred<-as.factor(pred)l<-union(pred,y)mtab<-table(factor(pred,l),factor(y,l))cm[[6]]<-c("Neural Network","NNET",confusionMatrix(mtab))cm[[6]]$table


cm[[6]]$overall[1]

## Accuracy## 0.9895833

3.B4 Flexible Discriminant Analysis

library(mda)model<-fda(fmla,data=car_eval)pred<-predict(model,x,type="class")l<-union(pred,y)mtab<-table(factor(pred,l),factor(y,l))cm[[7]]<-c("Flexible Discriminant Analysis","FDA",confusionMatrix(mtab))cm[[7]]$table


7

cm[[7]]$overall[1]

## Accuracy## 0.8998843

3.B5 Support Vector Machine

library(kernlab)model<-ksvm(fmla,data=car_eval)pred<-predict(model,x,type="response")l<-union(pred,y)mtab<-table(factor(pred,l),factor(y,l))cm[[8]]<-c("Support Vector Machine","SVM",confusionMatrix(mtab))cm[[8]]$table


cm[[8]]$overall[1]

## Accuracy## 0.96875

3.B6 k-Nearest Neighbors

library(caret)model<-knn3(fmla,data=car_eval,k=nlev+1)pred<-predict(model,x,type="class")l<-union(pred,y)mtab<-table(factor(pred,l),factor(y,l))cm[[9]]<-c("k-Nearest Neighbors","KNN",confusionMatrix(mtab))cm[[9]]$table

#### unacc acc good vgood## unacc 1199 41 4 1## acc 11 336 45 15## good 0 6 15 4## vgood 0 1 5 45

cm[[9]]$overall[1]

## Accuracy## 0.9230324

8

3.B8 Naive Bayes

library(e1071)model<-naiveBayes(fmla,data=car_eval,k=nlev+1)pred<-predict(model,x)l<-union(pred,y)mtab<-table(factor(pred,l),factor(y,l))cm[[10]]<-c("Naive Bayes","NBAYES",confusionMatrix(mtab))cm[[10]]$table


cm[[10]]$overall[1]

## Accuracy## 0.8738426

3.C Non-Linear Classification with Decision Trees

3.C1 Classification and Regression Trees(CART)

library(rpart)library(rpart.plot)model<-rpart(fmla,data=car_eval)# prp(model, faclen=3)pred<-predict(model, x ,type="class")l<-union(pred,y)mtab<-table(factor(pred,l),factor(y,l))cm[[11]]<-c("classification and Regression Trees","CART",confusionMatrix(mtab))cm[[11]]$table

#### unacc acc good vgood## unacc 1161 7 0 0## acc 45 358 0 13## good 4 16 60 0## vgood 0 3 9 52

cm[[11]]$overall[1]

## Accuracy## 0.9438657

3.C2 OneR

9

library(RWeka)model<-OneR(fmla,data=car_eval)pred<-predict(model,x,type="class")l<-union(pred,y)mtab<-table(factor(pred,l),factor(y,l))cm[[12]]<-c("One R","ONE-R",confusionMatrix(mtab))cm[[12]]$table


cm[[12]]$overall[1]

## Accuracy## 0.7002315

3.C3 C4.5

library(RWeka)model<-J48(fmla,data=car_eval)pred<-predict(model,x)l<-union(pred,y)mtab<-table(factor(pred,l),factor(y,l))cm[[13]]<-c("C4.5","C45",confusionMatrix(mtab))cm[[13]]$table


cm[[13]]$overall[1]

## Accuracy## 0.962963

3.C4 PART

library(RWeka)model<-PART(fmla,data=car_eval)pred<-predict(model,x)l<-union(pred,y)mtab<-table(factor(pred,l),factor(y,l))cm[[14]]<-c("PART","PART",confusionMatrix(mtab))cm[[14]]$table

10


cm[[14]]$overall[1]

## Accuracy## 0.9826389

3.C5 Bagging CART

library(ipred)model<-bagging(fmla,data=car_eval)pred<-predict(model,x)l<-union(pred,y)mtab<-table(factor(pred,l),factor(y,l))cm[[15]]<-c("Bagging CART","BAG-CART",confusionMatrix(mtab))cm[[15]]$table


cm[[15]]$overall[1]

## Accuracy## 1

3.C6 Random Forest

library(randomForest)model<-randomForest(fmla,data=car_eval)pred<-predict(model,x)l<-union(pred,y)mtab<-table(factor(pred,l),factor(y,l))cm[[16]]<-c("Random Forest","RF",confusionMatrix(mtab))cm[[16]]$table


11

cm[[16]]$overall[1]

## Accuracy## 0.9988426

3.C7 Gradient Boosted Machine

library(gbm)model<-gbm(fmla,data=car_eval,n.trees=5000,interaction.depth=nlev,shrinkage=0.001,bag.fraction=0.8,distribution="multinomial",verbose=FALSE,n.cores=4)prob<-predict(model,x,n.trees=5000,type="response")pred<-apply(prob,1,which.max)pred[which(pred=="1")]<-levels(y)[1]pred[which(pred=="2")]<-levels(y)[2]pred[which(pred=="3")]<-levels(y)[3]pred[which(pred=="4")]<-levels(y)[4]pred<-as.factor(pred)l<-union(pred,y)mtab<-table(factor(pred,l),factor(y,l))cm[[17]]<-c("Gradient Boosted Machine","GBM",confusionMatrix(mtab))cm[[17]]$table


cm[[17]]$overall[1]

## Accuracy## 0.9791667

3.C8 Boosted C5.0

library(C50)model<-C5.0(fmla,data=car_eval,trials=10)# tree <- rpart(model,data=car_eval,control=rpart.control(minsplit=20,cp=0,digits=6))# prp(tree,faclen=3)pred<-predict(model,x)l<-union(pred,y)mtab<-table(factor(pred,l),factor(y,l))cm[[18]]<-c("Boosted C5.0","BOOST-C50",confusionMatrix(mtab))cm[[18]]$table


12

cm[[18]]$overall[1]

## Accuracy## 0.9965278

3.C9 JRip

library(RWeka)model<-JRip(fmla,data=car_eval)pred<-predict(model,x)l<-union(pred,y)mtab<-table(factor(pred,l),factor(y,l))cm[[19]]<-c("JRip","JRIP",confusionMatrix(mtab))cm[[19]]$table


cm[[19]]$overall[1]

## Accuracy## 0.9450231

3.C10 H2O Deep Learning

#### H2O is not running yet, starting it now...#### Note: In case of errors look at the following log files:## C:\Users\MARC_B~1\AppData\Local\Temp\Rtmpglsya5/h2o_Marc_Borowczak_started_from_r.out## C:\Users\MARC_B~1\AppData\Local\Temp\Rtmpglsya5/h2o_Marc_Borowczak_started_from_r.err###### Starting H2O JVM and connecting: ... Connection successful!#### R is connected to the H2O cluster:## H2O cluster uptime: 9 seconds 10 milliseconds## H2O cluster version: 3.8.1.3## H2O cluster name: H2O_started_from_R_Marc_Borowczak_dvn595## H2O cluster total nodes: 1## H2O cluster total memory: 3.44 GB## H2O cluster total cores: 4## H2O cluster allowed cores: 4## H2O cluster healthy: TRUE## H2O Connection ip: localhost## H2O Connection port: 54321## H2O Connection proxy: NA## R Version: R version 3.2.4 Revised (2016-03-16 r70336)

13

## acc good unacc vgood## acc 319 26 36 3## good 0 66 0 3## unacc 24 9 1177 0## vgood 0 0 0 65

## Accuracy## 0.9415509

Step 4. Performance Comparison

We now compare all analysis results gathered in the ConfusionMatrix list into a results data frame.

library(h2o)localH2O=h2o.init(nthreads=-1)

#### H2O is not running yet, starting it now...#### Note: In case of errors look at the following log files:## C:\Users\MARC_B~1\AppData\Local\Temp\RtmpUrf7Ob/h2o_Marc_Borowczak_started_from_r.out## C:\Users\MARC_B~1\AppData\Local\Temp\RtmpUrf7Ob/h2o_Marc_Borowczak_started_from_r.err###### Starting H2O JVM and connecting: .... Connection successful!#### R is connected to the H2O cluster:## H2O cluster uptime: 13 seconds 909 milliseconds## H2O cluster version: 3.8.1.3## H2O cluster name: H2O_started_from_R_Marc_Borowczak_abx142## H2O cluster total nodes: 1## H2O cluster total memory: 3.44 GB## H2O cluster total cores: 4## H2O cluster allowed cores: 4## H2O cluster healthy: TRUE## H2O Connection ip: localhost## H2O Connection port: 54321## H2O Connection proxy: NA## R Version: R version 3.2.4 Revised (2016-03-16 r70336)

h2o.no_progress()car_eval.hex=h2o.uploadFile(path=paste0(userdir,"/car_eval.csv"),destination_frame="car_eval.hex")library(microbenchmark)mbm<-microbenchmark(

m1<-multinom(fmla, data = car_eval, maxit = 500, trace=FALSE),m2<-vglm(fmla, family = "multinomial", data = car_eval, maxit = 100),m3<-lda(fmla,data=car_eval),m4<-mda(fmla,data=car_eval),m5<-rda(fmla,data=car_eval,gamma = 0.05,lambda = 0.01),m6<-nnet(fmla,data=car_eval,size = 4, decay = 0.0001, maxit = 700,trace=FALSE),m7<-fda(fmla,data=car_eval),m8<-ksvm(fmla,data=car_eval),

14

m9<-knn3(fmla,data=car_eval,k=nlev+1),m10<-naiveBayes(fmla,data=car_eval,k=nlev+1),m11<-rpart(fmla,data=car_eval),m12<-OneR(fmla,data=car_eval),m13<-J48(fmla,data=car_eval),m14<-PART(fmla,data=car_eval),m15<-bagging(fmla,data=car_eval),m16<-randomForest(fmla,data=car_eval),m17<-gbm(fmla,data=car_eval,n.trees=5000,interaction.depth=nlev,shrinkage=0.001,bag.fraction=0.8,distribution="multinomial",verbose=FALSE,n.cores=4),m18<-C5.0(fmla,data=car_eval,trials=10),m19<-JRip(fmla,data=car_eval),m20<-h2o.deeplearning(x=1:(nv-1), y=nv, training_frame = car_eval.hex, variable_importances = TRUE),times=5)

h2o.shutdown(prompt = FALSE)

## [1] TRUE

#library(dplyr)models<-length(cm)mbm$expr<-rep(sapply(1:models, function(i) {cm[[i]][[2]]}),5)mbm<-aggregate(x=mbm$time,by=list(Model=mbm$expr),FUN=mean)mbm$x<-mbm$x/min(mbm$x)results<-sapply (1:models, function(i) {c(cm[[i]][[1]],cm[[i]][[2]],mbm$x[i],cm[[i]]$overall[1:6])})row.names(results)<-c("Description","Model","Model_Time_X",names(cm[[1]]$overall[1:6]))results<-as.data.frame(t(results))results[,3:9]<-sapply(3:9,function(i){results[,i]<-as.numeric(levels(results[,i])[results[,i]])})results<-results[,-(8:9)]results<-arrange(results,desc(Accuracy))

We are now ready to chart on faceted pie charts, with decreasing value order to ease comparisons.

library(ggplot2)library(gridExtra)res<-data.frame(Model=results$Model,Accuracy=results$Accuracy,Speed=1/results$Model_Time_X,Overall=results$Accuracy/results$Model_Time_X)library(RColorBrewer)myPalette <- colorRampPalette(rev(brewer.pal(12, "Set3")))sc <- scale_colour_gradientn(colours = myPalette(256), limits=c(0.8, 1))g<-ggplot(res,aes(x=reorder(Model,-Accuracy),y=Accuracy,fill=Model))+

geom_bar(stat="identity")+coord_polar(theta="x",direction=1)+labs(x="Machine Learning Model",y="Prediction Accuracy")+theme(legend.position="bottom",legend.box="horizontal")+ggtitle('Car Evaluation Dataset Accuracy Performance')

g

15

BAG−CARTRF

BOOST−C50

NNET

PART

GBM

SVM

C45

GLMMULTINOMJRIP

CART

H2O

KNN

MDA

LDA

FDA

RDA

NBAYESONE−R

0.00

0.25

0.50

0.75

1.00

Machine Learning Model

Pre

dict

ion

Acc

urac

y

Model

BAG−CART

BOOST−C50

C45

CART

FDA

GBM

GLM

H2O

JRIP

KNN

LDA

MDA

MULTINOM

NBAYES

NNET

ONE−R

PART

RDA

RF

SVM

Car Evaluation Dataset Accuracy Performance

g<-ggplot(res,aes(x=reorder(Model,-Speed),y=Speed,fill=Model))+geom_bar(stat="identity")+coord_polar(theta="x",direction=1)+labs(x="Machine Learning Model",y="Prediction Speed")+theme(legend.position="bottom",legend.box="horizontal")+ggtitle('Car Evaluation Dataset Speed Performance')

g

16

CARTMDA

GBM

NNET

JRIP

RF

SVM

MULTINOM

BOOST−C50LDAC45

KNN

PART

NBAYES

BAG−CART

RDA

ONE−R

FDA

H2OGLM

0.00

0.25

0.50

0.75

1.00


Pre

dict

ion

Spe

ed

Model

BAG−CART

BOOST−C50

C45

CART

FDA

GBM

GLM

H2O

JRIP

KNN

LDA

MDA

MULTINOM

NBAYES

NNET

ONE−R

PART

RDA

RF

SVM

Car Evaluation Dataset Speed Performance

g<-ggplot(res,aes(x=reorder(Model,-Overall),y=Overall,fill=Model))+geom_bar(stat="identity")+coord_polar(theta="x",direction=1)+labs(x="Machine Learning Model",y="Prediction Overall")+theme(legend.position="bottom",legend.box="horizontal")+ggtitle('Car Evaluation Dataset Overall Performance')

g

17

CARTMDA

GBM

NNET

JRIP

RF

SVM

MULTINOM

BOOST−C50LDAC45

KNN

PART

NBAYES

BAG−CART

RDA

ONE−R

H2O

FDAGLM

0.00

0.25

0.50

0.75


Pre

dict

ion

Ove

rall

Model

BAG−CART

BOOST−C50

C45

CART

FDA

GBM

GLM

H2O

JRIP

KNN

LDA

MDA

MULTINOM

NBAYES

NNET

ONE−R

PART

RDA

RF

SVM

Car Evaluation Dataset Overall Performance

We conclude this dataset analysis by tabulating the results obtained on this dataset.

kable(res)

Model Accuracy Speed OverallBAG-CART 1.0000000 0.0093808 0.0093808RF 0.9988426 0.0966751 0.0965632BOOST-C50 0.9965278 0.0624463 0.0622294NNET 0.9895833 0.1358213 0.1344065PART 0.9826389 0.0272376 0.0267647GBM 0.9791667 0.2219264 0.2173029SVM 0.9687500 0.0893670 0.0865743C45 0.9629630 0.0434614 0.0418517MULTINOM 0.9456019 0.0685611 0.0648315GLM 0.9456019 0.0011533 0.0010906JRIP 0.9450231 0.1148750 0.1085596CART 0.9438657 1.0000000 0.9438657H2O 0.9415509 0.0013525 0.0012735KNN 0.9230324 0.0398154 0.0367509MDA 0.9212963 0.3604409 0.3320729LDA 0.9010417 0.0472147 0.0425424FDA 0.8998843 0.0013615 0.0012252RDA 0.8778935 0.0062438 0.0054814NBAYES 0.8738426 0.0263502 0.0230260ONE-R 0.7002315 0.0045960 0.0032182

18

Step 5: Retrieve 2nd Dataset

We now repeat with the nursery dataset. Again, We will use R to demonstrate quickly the approach onthis dataset, and its full description. We continue to maintain reproducibility of the analysis as a generalpractice. The analysis tool and platform are documented, all libraries clearly listed, while data is retrievedprogrammatically and date stamped from the repository.

We will display a structure of the Nursery dataset and the corresponding dictionary to translate the propertyfactors.

datadir <- "./data"if (!file.exists("data")){dir.create("data")}uciUrl <- "http://archive.ics.uci.edu/ml/machine-learning-databases/"fileUrl <- paste0(uciUrl,"nursery/nursery.data?accessType=DOWNLOAD")download.file(fileUrl,destfile="./data/nurserydata.csv",method="curl")dateDownloaded <- date()nursery <- read.csv("./data/nurserydata.csv",header=FALSE)fileUrl <- paste0(uciUrl,"nursery/nursery.names?accessType=DOWNLOAD")download.file(fileUrl,destfile="./data/nurserynames.txt")txt <- readLines("./data/nurserynames.txt")lns <- data.frame(beg=which(grepl("parents usual",txt)),end=which(grepl("priority, not_recom",txt)))# capture text between beg and end from txtres <- lapply(seq_along(lns$beg),

function(l){paste(txt[seq(from=lns$beg[l],to=lns$end[l],by=1)],collapse=" ")})res <- gsub(" ", ":", res, fixed = TRUE)res <- gsub(" ", ":", res, fixed = TRUE)res <- gsub(" ", ":", res, fixed = TRUE)res <- gsub(" ", ":", res, fixed = TRUE)res <- gsub(" ", "\n", res, fixed = TRUE)res <- gsub(" ", "", res, fixed = TRUE)res <- gsub(" ", "", res, fixed = TRUE)res <- gsub(" ", "", res, fixed = TRUE)res <- str_c(res,"\n")writeLines(res,"./data/n_parsed_attr.csv")attrib <- readLines("./data/n_parsed_attr.csv")nv <- length(attrib) # number of attributesattrib <- sapply (1:nv,function(i) {gsub(":"," ",attrib[i],fixed=TRUE)})dictionary <- sapply (1:nv,function(i) {strsplit(attrib[i],' ')})dictionary[[nv]][1]<-"class"colnames(nursery)<-sapply(1:nv,function(i) {colnames(nursery)[i]<-dictionary[[i]][1]})x<-nursery[,1:(nv-1)]y<-nursery[,nv]fmla<-paste(colnames(nursery)[1:(nv-1)],collapse="+")fmla<-paste0(colnames(nursery)[nv],"~",fmla)fmla<-as.formula(fmla)nlev<-nlevels(y) # number of factors describing class

Step 6. Data Exploration

head(nursery)

## parents has_nurs form children housing finance social

19

http://archive.ics.uci.edu/ml/machine-learning-databases/nursery/nursery.data

https://archive.ics.uci.edu/ml/machine-learning-databases/nursery/nursery.names

## 1 usual proper complete 1 convenient convenient nonprob## 2 usual proper complete 1 convenient convenient nonprob## 3 usual proper complete 1 convenient convenient nonprob## 4 usual proper complete 1 convenient convenient slightly_prob## 5 usual proper complete 1 convenient convenient slightly_prob## 6 usual proper complete 1 convenient convenient slightly_prob## health class## 1 recommended recommend## 2 priority priority## 3 not_recom not_recom## 4 recommended recommend## 5 priority priority## 6 not_recom not_recom

summary(nursery)

## parents has_nurs form children## great_pret :4320 critical :2592 complete :3240 1 :3240## pretentious:4320 improper :2592 completed :3240 2 :3240## usual :4320 less_proper:2592 foster :3240 3 :3240## proper :2592 incomplete:3240 more:3240## very_crit :2592## housing finance social## convenient:4320 convenient:6480 nonprob :4320## critical :4320 inconv :6480 problematic :4320## less_conv :4320 slightly_prob:4320###### health class## not_recom :4320 not_recom :4320## priority :4320 priority :4266## recommended:4320 recommend : 2## spec_prior:4044## very_recom: 328

str(nursery)

## 'data.frame': 12960 obs. of 9 variables:## $ parents : Factor w/ 3 levels "great_pret","pretentious",..: 3 3 3 3 3 3 3 3 3 3 ...## $ has_nurs: Factor w/ 5 levels "critical","improper",..: 4 4 4 4 4 4 4 4 4 4 ...## $ form : Factor w/ 4 levels "complete","completed",..: 1 1 1 1 1 1 1 1 1 1 ...## $ children: Factor w/ 4 levels "1","2","3","more": 1 1 1 1 1 1 1 1 1 1 ...## $ housing : Factor w/ 3 levels "convenient","critical",..: 1 1 1 1 1 1 1 1 1 1 ...## $ finance : Factor w/ 2 levels "convenient","inconv": 1 1 1 1 1 1 1 1 1 2 ...## $ social : Factor w/ 3 levels "nonprob","problematic",..: 1 1 1 3 3 3 2 2 2 1 ...## $ health : Factor w/ 3 levels "not_recom","priority",..: 3 2 1 3 2 1 3 2 1 3 ...## $ class : Factor w/ 5 levels "not_recom","priority",..: 3 2 1 3 2 1 2 2 1 5 ...

m<-cm<-list() # initialize

Exploration reveals the multinomial Nursery dataset comprises 8 attribute variables to predict the 5-factornurse recommendation class, with no missing data.

20

Step 7. Classification Analysis

We will perform a total of 20 analysis using the same 3 groups: A)Linear, B)Non-linear and C)Non-LinearClassification with decision trees, sequentially and then compare their outcome in terms of prediction accuracy,comparing predicted vs. actual dependent variable.

For each analysis, we follow the sampe protocol: invoque the package library, build the model, summarize itand predict the dependent variable, the save the accuracy data and predicted values.

#### H2O is not running yet, starting it now...#### Note: In case of errors look at the following log files:## C:\Users\MARC_B~1\AppData\Local\Temp\Rtmpglsya5/h2o_Marc_Borowczak_started_from_r.out## C:\Users\MARC_B~1\AppData\Local\Temp\Rtmpglsya5/h2o_Marc_Borowczak_started_from_r.err###### Starting H2O JVM and connecting: ... Connection successful!#### R is connected to the H2O cluster:## H2O cluster uptime: 8 seconds 961 milliseconds## H2O cluster version: 3.8.1.3## H2O cluster name: H2O_started_from_R_Marc_Borowczak_doh554## H2O cluster total nodes: 1## H2O cluster total memory: 3.44 GB## H2O cluster total cores: 4## H2O cluster allowed cores: 4## H2O cluster healthy: TRUE## H2O Connection ip: localhost## H2O Connection port: 54321## H2O Connection proxy: NA## R Version: R version 3.2.4 Revised (2016-03-16 r70336)

## not_recom priority recommend spec_prior## not_recom 3316 0 0 2## priority 0 3105 0 165## recommend 0 0 0 0## spec_prior 0 11 0 3067

## Accuracy## 0.9815849

Step 8. Performance Comparison

library(h2o)localH2O=h2o.init(nthreads=-1)

#### H2O is not running yet, starting it now...#### Note: In case of errors look at the following log files:

21

## C:\Users\MARC_B~1\AppData\Local\Temp\RtmpUrf7Ob/h2o_Marc_Borowczak_started_from_r.out## C:\Users\MARC_B~1\AppData\Local\Temp\RtmpUrf7Ob/h2o_Marc_Borowczak_started_from_r.err###### Starting H2O JVM and connecting: .... Connection successful!#### R is connected to the H2O cluster:## H2O cluster uptime: 9 seconds 761 milliseconds## H2O cluster version: 3.8.1.3## H2O cluster name: H2O_started_from_R_Marc_Borowczak_gcl376## H2O cluster total nodes: 1## H2O cluster total memory: 3.44 GB## H2O cluster total cores: 4## H2O cluster allowed cores: 4## H2O cluster healthy: TRUE## H2O Connection ip: localhost## H2O Connection port: 54321## H2O Connection proxy: NA## R Version: R version 3.2.4 Revised (2016-03-16 r70336)

h2o.no_progress()nursery.hex=h2o.uploadFile(path=paste0(userdir,"/nursery.csv"),destination_frame="nursery.hex")library(microbenchmark)mbm<-microbenchmark(

m1<-multinom(fmla, data = nursery, maxit = 1200, trace=FALSE),m2<-vglm(fmla, family = "multinomial", data = nursery, maxit = 100),m3<-lda(fmla,data=nursery),m4<-mda(fmla,data=nursery),m5<-rda(fmla,data=nursery,gamma = 0.05,lambda = 0.01),m6<-nnet(fmla,data=nursery,size = 4, decay = 0.0001, maxit = 1200,trace=FALSE),m7<-fda(fmla,data=nursery),m8<-ksvm(fmla,data=nursery),m9<-knn3(fmla,data=nursery,k=nlev+1),m10<-naiveBayes(fmla,data=nursery,k=nlev+1),m11<-rpart(fmla,data=nursery),m12<-OneR(fmla,data=nursery),m13<-J48(fmla,data=nursery),m14<-PART(fmla,data=nursery),m15<-bagging(fmla,data=nursery),m16<-randomForest(fmla,data=nursery),m17<-gbm(fmla,data=nursery,n.trees=5000,interaction.depth=nlev,shrinkage=0.001,bag.fraction=0.8,distribution="multinomial",verbose=FALSE,n.cores=4),m18<-C5.0(fmla,data=nursery,trials=10),m19<-JRip(fmla,data=nursery),m20<-h2o.deeplearning(x=1:(nv-1), y=nv, training_frame = nursery.hex, variable_importances = TRUE),times=5)

h2o.shutdown(prompt = FALSE)

## [1] TRUE

#

22

library(dplyr)models<-length(cm)mbm$expr<-rep(sapply(1:models, function(i) {cm[[i]][[2]]}),5)mbm<-aggregate(x=mbm$time,by=list(Model=mbm$expr),FUN=mean)mbm$x<-mbm$x/min(mbm$x)results<-sapply (1:models, function(i) {c(cm[[i]][[1]],cm[[i]][[2]],mbm$x[i],cm[[i]]$overall[1:6])})row.names(results)<-c("Description","Model","Model_Time_X",names(cm[[1]]$overall[1:6]))results<-as.data.frame(t(results))results[,3:9]<-sapply(3:9,function(i){results[,i]<-as.numeric(levels(results[,i])[results[,i]])})results<-results[,-(8:9)]results<-arrange(results,desc(Accuracy))## Performance Plotlibrary(ggplot2)library(gridExtra)res<-data.frame(Model=results$Model,Accuracy=results$Accuracy,Speed=1/results$Model_Time_X,Overall=results$Accuracy/results$Model_Time_X)library(RColorBrewer)myPalette <- colorRampPalette(rev(brewer.pal(12, "Set3")))sc <- scale_colour_gradientn(colours = myPalette(256), limits=c(0.8, 1))

We are now ready to chart and will again compare on faceted pie charts.

BAG−CARTBOOST−C50

PART

SVM

RF

JRIP

NNET

H2O

C45KNNMDA

GLM

MULTINOM

FDA

NBAYES

RDA

CART

GBM

ONE−RLDA

0.00

0.25

0.50

0.75


Pre

dict

ion

Acc

urac

y

Model

BAG−CART

BOOST−C50

C45

CART

FDA

GBM

GLM

H2O

JRIP

KNN

LDA

MDA

MULTINOM

NBAYES

NNET

ONE−R

PART

RDA

RF

SVM

Nursery Dataset Accuracy Performance

23

ONE−RMDA

BOOST−C50

GLM

PART

KNN

C45

GBM

SVMBAG−CARTFDA

RDA

CART

NBAYES

RF

JRIP

LDA

NNET

MULTINOMH2O

0.00

0.25

0.50

0.75

1.00


Pre

dict

ion

Spe

ed

Model

BAG−CART

BOOST−C50

C45

CART

FDA

GBM

GLM

H2O

JRIP

KNN

LDA

MDA

MULTINOM

NBAYES

NNET

ONE−R

PART

RDA

RF

SVM

Nursery Dataset Speed Performance

24

MDAONE−R

BOOST−C50

GLM

PART

KNN

C45

GBM

SVMBAG−CARTFDA

RDA

CART

NBAYES

RF

JRIP

LDA

NNET

MULTINOMH2O

0.00

0.25

0.50

0.75


Pre

dict

ion

Ove

rall

Model

BAG−CART

BOOST−C50

C45

CART

FDA

GBM

GLM

H2O

JRIP

KNN

LDA

MDA

MULTINOM

NBAYES

NNET

ONE−R

PART

RDA

RF

SVM

Nursery Dataset Overall Performance

We conclude this second dataset analysis by tabulating the results obtained.

Model Accuracy Speed OverallBAG-CART 0.9999228 0.0094075 0.0094068BOOST-C50 0.9997685 0.3000380 0.2999686PART 0.9986883 0.1309383 0.1307665SVM 0.9932099 0.0125290 0.0124439RF 0.9916667 0.0045268 0.0044891JRIP 0.9821759 0.0040615 0.0039892NNET 0.9818673 0.0006675 0.0006554H2O 0.9815849 0.0005785 0.0005678C45 0.9813272 0.0491118 0.0481947KNN 0.9768519 0.0699498 0.0683306MDA 0.9331019 0.9991693 0.9323268MULTINOM 0.9262346 0.0006674 0.0006182GLM 0.9262346 0.1569507 0.1453732FDA 0.9130401 0.0077249 0.0070531NBAYES 0.9030864 0.0052343 0.0047270RDA 0.8983025 0.0076060 0.0068325CART 0.8726852 0.0055864 0.0048752GBM 0.8726852 0.0284867 0.0248599ONE-R 0.7097222 1.0000000 0.7097222LDA 0.5483796 0.0037381 0.0020499

25

Step 9. Data Analysis

Comparing these 2 datasets’ accuracy, we observe that BAG-CART, RF and BOOST-C50 top the list atmore than 99% accuracy while NNET, PART, GBM, SVM and C45 exceeded 95% accuracy on the smallestdataset. On the second dataset, we observe BAG-CART, BOOST-C50, PART, SVM and RF exceed 99%accuracy, while JRIP, NNET, H2O, C45 and KNN exceed 95% accuracy.

From these observations, we should definitely include BAG-CART, BOOST-C50 and RF as prime models totackle multinomial data. However, including NNET, PART, SVM and C45 as 2nd tier seems also a goodidea. To be complete, the third group of models should include JRIP, H2O and KNN. For reference, if themain dependency is desired, ONE-R can pinpoint it with more than 70% accuracy.

Speed can also be determinant when selecting a model. Microbenchmark data allow us to compare averagetime used by the 20 models using a 5-run average. Normalized data is obtained by dividing all times by theminimum time recorded. Transforming time into Speed involves taking the reciprocal values.

Overall ranking is obtained here by merely forming the product of Accuracy x Speed.

We observe overwhelming dependencies on Speed, with only KNN, BOOST-C50, NNET, JRIP and CARTcontenders for the Car Evaluation dataset. For the more complex Nursery dataset, a different outcome isobserved, with ONE-R, MDA and BOOST-C50 as fastest and overall best predictors.

Step 10. Conclusions

We have compared 20 Machine Learning models and benchmarked their accuracy and speed on 2 multinomialdatasets. Although ranking accuracy seemed consistent across these 2 models, and forming a 3-tier ranking,model execution speed which is often also a factor showed strong dependencies, so that combined rankingremains strongly dataset dependent.

What it means for data scientists is that benchmarking should remain in the front- and not on the back-burner of activities’ list, as well as continuous monitoring of new and more efficient and distributed and/orparallellized algorithms and their effects on different hardware platforms.

We have evaluated the accuracy on 2 multinomial datasets and the analysis led to a 3-tier grouping. However,Speed ranking could reduce our options. We will continue to monitor benchmark new Machine Learningtools by applying them to broader datasets.

References

The following sources are referenced as they provided significant help and information to develop this MachineLearning analysis applied to formulations:

1. UCI Machine Learning Repository Machine Learning Repository2. Car recommendation documentation Car Evaluation Database donated by [email protected] June

19973. Nursery documentation Nursery Database donated by [email protected] June 19974. stringr Simple, Consistent Wrappers for Common String Operations5. RWeka R/Weka interface6. C50 Decision Trees and Rule-Based Models7. rpart Recursive Partitioning and Regression Trees.8. rpart.plot Plot ‘rpart’ Models: An Enhanced Version of ‘plot.rpart’9. rattle A Graphical User Interface for Data Mining using R

10. VGAM Vector Generalized Linear and Additive Models11. MASS Support Functions and Datasets for Venables and Ripley’s MASS12. mda Mixture and Flexible Discriminant Analysis

26

http://archive.ics.uci.edu/ml/datasets.html

https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.names

mailto:[email protected]

https://archive.ics.uci.edu/ml/machine-learning-databases/nursery/nursery.names

mailto:[email protected]

https://cran.r-project.org/web/packages/stringr/stringr.pdf

https://cran.r-project.org/web/packages/RWeka/RWeka.pdf

https://cran.r-project.org/web/packages/C50/C50.pdf

https://cran.r-project.org/web/packages/rpart/rpart.pdf

https://cran.r-project.org/web/packages/rpart.plot/rpart.plot.pdf

http://rattle.togaware.com/

https://cran.r-project.org/web/packages/VGAM/VGAM.pdf

https://cran.r-project.org/web/packages/MASS/MASS.pdf

https://cran.r-project.org/web/packages/mda/mda.pdf

13. klaR Classification and visualization14. nnet Feed-forward Neural Networks and Multinomial Log-Linear Models15. kernlab Kernel-based Machine Learning Lab16. caret Package (short for Classification And REgression Training) is a set of functions that attempt to

streamline the process for creating predictive models17. e1071 Functions for latent class analysis, short time Fourier transform, fuzzy clustering, support vector

machines, shortest path computation, bagged clustering, naive Bayes classifier etc18. ipred Improved Predictors19. randomForest Breiman and Cutler’s random forests for classification and regression20. gbm Generalized Boosted Regression Models21. H2O Deep Learning R Interface for H2O22. H2O H2O.ai documentation23. gridExtra Miscellaneous Functions for Grid Graphics24. knitr A General-Purpose Package for Dynamic Report Generation in R25. RStudio Open Source and enterprise-ready professional software for R

27

https://cran.r-project.org/web/packages/klaR

https://cran.r-project.org/web/packages/nnet

https://cran.r-project.org/web/packages/kernlab

https://cran.r-project.org/web/packages/caret

https://cran.r-project.org/web/packages/e1071

https://cran.r-project.org/web/packages/ipred

https://cran.r-project.org/web/packages/randomForest

https://cran.r-project.org/web/packages/gbm/gbm.pdf

https://cran.r-project.org/web/packages/h2o/h2o.pdf

http://www.h2o.ai/docs/

https://cran.r-project.org/web/packages/gridExtra/gridExtra.pdf

https://cran.r-project.org/web/packages/knitr/knitr.pdf

https://www.rstudio.com

Date post:	26-Jan-2017
Category:	Documents
Upload:	marc-borowczak
View:	75 times
Download:	1 times

Benchmarking_ML_Tools

Documents