Pima Diabetes Analysis
Intro
Objective
• The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset.
• The data is split into a training set (70%) and a test set (30%) to evaluate the accuracy of the models.
Dataset Background
• The ‘Pima Indians Diabetes Database’, published by the National Institute of Diabetes and Digestive and Kidney Diseases in the United States, is a set of data that examines and records the physical condition of adult women of Pima Indian heritage, including pregnancies, blood pressure, and insulin levels.
• It records eight predictor variables for the 768 women surveyed, from which the occurrence of diabetes can be predicted.
Data Sources and Links
• Pima Indians Diabetes Database (Kaggle)
• https://www.kaggle.com/uciml/pima-indians-diabetes-database
Variable Explanations
Variable Name              Explanation
Pregnancies                Number of times pregnant
Glucose                    Glucose level in an oral glucose tolerance test
BloodPressure              Diastolic blood pressure (mm Hg)
SkinThickness              Triceps skin fold thickness (mm)
Insulin                    2-Hour serum insulin (mu U/ml)
BMI                        Body mass index (weight in kg / (height in m)^2)
DiabetesPedigreeFunction   Diabetes pedigree function
Age                        Age (years)
Outcome                    1: Yes, 0: No
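Before the EDA below, the data is assumed to have been loaded into a data frame named data; a minimal loading sketch (the file name diabetes.csv is an assumption based on the Kaggle download, not something shown in the report):

# Minimal loading sketch -- "diabetes.csv" is an assumed file name for the
# Kaggle download; the report itself only shows the already-loaded `data`.
library(dplyr)   # used later for mutate() / %>%

data <- read.csv("diabetes.csv")
str(data)  # 768 obs. of 9 variables; Outcome is still coded 0/1 here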
Diabetes Pedigree Function
DPF = (∑_i K_i (88 − ADM_i) + 20) / (∑_j K_j (ALC_j − 14) + 50)

• i : runs over the relatives with diabetes
• j : runs over the relatives without diabetes
• K : degree of genetic match with a particular relative
  – 0.5 : parent or sibling
  – 0.25 : grandparent, aunt, or uncle (sibling of a parent)
  – 0.125 : first cousin (child of a parent's sibling)
• ADM_i : the age at which diabetic relative i developed diabetes
• ALC_j : the age at which non-diabetic relative j was last examined for diabetes
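To make the formula concrete, here is a small illustrative R function (not part of the original analysis); the argument names adm, alc, k_dm, and k_ndm are hypothetical:

# Illustrative sketch of the DPF formula above (not code from the report).
# adm: ages at diagnosis of diabetic relatives; alc: ages at last
# non-diabetic exam; k_dm / k_ndm: the genetic-share weights (0.5/0.25/0.125).
dpf <- function(adm, k_dm, alc, k_ndm) {
  (sum(k_dm * (88 - adm)) + 20) / (sum(k_ndm * (alc - 14)) + 50)
}

# Hypothetical example: one diabetic parent diagnosed at 45,
# one non-diabetic sibling last examined at 30.
dpf(adm = 45, k_dm = 0.5, alc = 30, k_ndm = 0.5)  # ~ 0.72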
EDA
Exploratory Data Analysis
Summary
Original Data Set
summary(data)
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00
## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00
## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
## Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
When I open the data and check it, a significant number of values are recorded as zero. Human skin thickness, blood sugar, and blood pressure cannot be zero, so these are actually unmeasured values, i.e. missing values. In particular, there are many missing values for the triceps skin fold thickness and the insulin concentration, which makes the raw data insufficient as training data. There are several ways to handle missing values, such as excluding the zero values from the analysis, depending on the type of missingness (MCAR, MAR, NMAR). However, in this case I used the Multiple Imputation (MI) method to deal with the missing values in five columns (Insulin, BMI, BloodPressure, SkinThickness, Glucose).
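To quantify the problem before imputing, the zeros in the five affected columns can be counted directly; a small sketch added for illustration (not from the original report):

# Count the physiologically impossible zeros per column (illustrative check;
# the counts match the NA counts after recoding below: Glucose 5,
# BloodPressure 35, SkinThickness 227, Insulin 374, BMI 11).
zero_cols <- c("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")
colSums(data[zero_cols] == 0)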
Imputing Missing Values
Replace zero values with NA for each of the five variables (Insulin, BMI, BloodPressure, SkinThickness, Glucose).
data <- data %>%
  mutate(Insulin = ifelse(Insulin == 0, NA, Insulin),
         BMI = ifelse(BMI == 0, NA, BMI),
         BloodPressure = ifelse(BloodPressure == 0, NA, BloodPressure),
         SkinThickness = ifelse(SkinThickness == 0, NA, SkinThickness),
         Glucose = ifelse(Glucose == 0, NA, Glucose))
summary(data)
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 44.0 Min. : 24.00 Min. : 7.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 64.00 1st Qu.:22.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :29.00
## Mean : 3.845 Mean :121.7 Mean : 72.41 Mean :29.15
## 3rd Qu.: 6.000 3rd Qu.:141.0 3rd Qu.: 80.00 3rd Qu.:36.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## NA's :5 NA's :35 NA's :227
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 14.00 Min. :18.20 Min. :0.0780 Min. :21.00
## 1st Qu.: 76.25 1st Qu.:27.50 1st Qu.:0.2437 1st Qu.:24.00
## Median :125.00 Median :32.30 Median :0.3725 Median :29.00
## Mean :155.55 Mean :32.46 Mean :0.4719 Mean :33.24
## 3rd Qu.:190.00 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.00 Max. :67.10 Max. :2.4200 Max. :81.00
## NA's :374 NA's :11
## Outcome
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
##
sum(is.na(data));mean(is.na(data));sum(complete.cases(data))
## [1] 652
## [1] 0.0943287
## [1] 392
• Number of NAs : 652
• % of NAs in the total : 9.43%
• Rows without NA : 392 (51%)
• Rows with NA : 376 (49%)
aggr <- VIM::aggr(data,
                  numbers = TRUE,
                  prop = c(TRUE, TRUE),
                  sortVars = TRUE, # sort according to number of missing values
                  sortCombs = TRUE,
                  only.miss = TRUE,
                  labels = names(data),
                  cex.axis = .7,
                  gap = 2,
                  col = c("navyblue", "red"),
                  ylab = c("Histogram of Missings", "Pattern"))
[Figure: VIM aggregation plot. Left panel "Histogram of Missings": proportion of missing values per variable (Insulin highest, then SkinThickness, BloodPressure, BMI, Glucose). Right panel "Pattern": missing-data combinations with proportions 0.5104 (complete cases), 0.2500 (Insulin + SkinThickness), 0.1823 (Insulin only), and smaller patterns down to 0.0013.]
##
## Variables sorted by number of missings:
## Variable Count
## Insulin 0.486979167
## SkinThickness 0.295572917
## BloodPressure 0.045572917
## BMI 0.014322917
## Glucose 0.006510417
## Pregnancies 0.000000000
## DiabetesPedigreeFunction 0.000000000
## Age 0.000000000
## Outcome 0.000000000
Almost 51% of the samples are not missing any information, 25% are missing both the Insulin and SkinThickness values, 18% are missing only the Insulin value, and the remaining ones show other missing patterns.
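As a numeric cross-check of the aggregation plot, mice can list the same patterns directly; a minimal sketch (not part of the original analysis):

# One row per missing-data pattern, with per-variable indicators
# (1 = observed, 0 = missing) and counts in the margins.
mice::md.pattern(data, plot = FALSE)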
Replace NA values using the Predictive Mean Matching method
micedia <- mice::mice(data,
                      seed = 1234,
                      m = 5, # the number of imputed datasets
                      maxit = 50,
                      method = 'pmm') # the imputation method (predictive mean matching)
##
## iter imp variable
## 1 1 Glucose BloodPressure SkinThickness Insulin BMI
## 1 2 Glucose BloodPressure SkinThickness Insulin BMI
## 1 3 Glucose BloodPressure SkinThickness Insulin BMI
## 1 4 Glucose BloodPressure SkinThickness Insulin BMI
## 1 5 Glucose BloodPressure SkinThickness Insulin BMI
## ... (the same trace repeats for iterations 2 through 50)
micedia
## Class: mids
## Number of multiple imputations: 5
## Imputation methods:
## Pregnancies Glucose BloodPressure
## "" "pmm" "pmm"
## SkinThickness Insulin BMI
## "pmm" "pmm" "pmm"
## DiabetesPedigreeFunction Age Outcome
## "" "" ""
## PredictorMatrix:
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## Pregnancies 0 1 1 1 1 1
## Glucose 1 0 1 1 1 1
## BloodPressure 1 1 0 1 1 1
## SkinThickness 1 1 1 0 1 1
## Insulin 1 1 1 1 0 1
## BMI 1 1 1 1 1 0
## DiabetesPedigreeFunction Age Outcome
## Pregnancies 1 1 1
## Glucose 1 1 1
## BloodPressure 1 1 1
## SkinThickness 1 1 1
## Insulin 1 1 1
## BMI 1 1 1
Check that the imputed values are properly replaced
xyplot(micedia,
       Insulin ~ Glucose + BloodPressure + SkinThickness + BMI,
       pch = 18,
       cex = 1,
       col = c("red", "navyblue"))
[Figure: lattice xyplot of Insulin against Glucose, BloodPressure, SkinThickness, and BMI, with observed values in red and imputed values in navy blue.]
The location of the navy blue points (imputed) matches the shape of the red ones (observed). The matching shape tells us that the imputed values can be considered plausible.
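A complementary diagnostic is to overlay the densities of observed and imputed values for each incomplete variable; a minimal sketch using mice's built-in lattice method (not shown in the original report):

# Blue curve: observed values; red curves: one density per imputed dataset.
mice::densityplot(micedia)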
Check the number of zero, NA values
• q_zeros : quantity of zeros (p_zeros : in percentage)
• q_na : quantity of NAs (p_na : in percentage)
• type : factor or numeric
• unique : quantity of unique values
data <- complete(micedia, 1)
data$Outcome <- as.factor(data$Outcome)
levels(data$Outcome) <- c("No", "Yes")
funModeling::df_status(data)
## variable q_zeros p_zeros q_na p_na q_inf p_inf type
## 1 Pregnancies 111 14.45 0 0 0 0 integer
## 2 Glucose 0 0.00 0 0 0 0 integer
## 3 BloodPressure 0 0.00 0 0 0 0 integer
## 4 SkinThickness 0 0.00 0 0 0 0 integer
## 5 Insulin 0 0.00 0 0 0 0 integer
## 6 BMI 0 0.00 0 0 0 0 numeric
## 7 DiabetesPedigreeFunction 0 0.00 0 0 0 0 numeric
## 8 Age 0 0.00 0 0 0 0 integer
## 9 Outcome 0 0.00 0 0 0 0 factor
## unique
## 1 17
## 2 135
## 3 46
## 4 50
## 5 185
## 6 247
## 7 517
## 8 52
## 9 2
DataExplorer::profile_missing(data)
## feature num_missing pct_missing
## 1 Pregnancies 0 0
## 2 Glucose 0 0
## 3 BloodPressure 0 0
## 4 SkinThickness 0 0
## 5 Insulin 0 0
## 6 BMI 0 0
## 7 DiabetesPedigreeFunction 0 0
## 8 Age 0 0
## 9 Outcome 0 0
ggpairs
# library(ggplot2)
# library(GGally)
ggpairs(data, aes(colour = Outcome, alpha = 0.8),
        lower = list(combo = wrap("facethist", binwidth = 1)))
[Figure: GGally::ggpairs matrix of all nine variables coloured by Outcome, showing pairwise scatterplots, densities, and correlations overall and per class (No/Yes).]
BoxPlot
par(mfrow = c(2, 4), mar = c(4, 4, 2, 1), oma = c(0, 0, 2, 0)) # 2 rows, 4 columns
with(data, {
  plot(Outcome, Pregnancies, main = "Pregnancies and Outcome")
  plot(Outcome, Age, main = "Age and Outcome")
  plot(Outcome, Glucose, main = "Glucose and Outcome")
  plot(Outcome, BMI, main = "BMI and Outcome")
  plot(Outcome, SkinThickness, main = "SkinThickness and Outcome")
  plot(Outcome, BloodPressure, main = "BloodPressure and Outcome")
  plot(Outcome, DiabetesPedigreeFunction, main = "DPF and Outcome")
  plot(Outcome, Insulin, main = "Insulin and Outcome")
  mtext("Relationship of Variables and Outcomes", outer = T)
})
[Figure: "Relationship of Variables and Outcomes" — eight box plots of Pregnancies, Age, Glucose, BMI, SkinThickness, BloodPressure, DPF, and Insulin split by Outcome (No/Yes).]
Distribution
# library(ggpubr)
b1 <- qplot(Age, data = data, geom = "density", col = Outcome)
b2 <- qplot(BMI, data = data, geom = "density", col = Outcome)
b3 <- qplot(Pregnancies, data = data, geom = "density", col = Outcome)
b4 <- qplot(Glucose, data = data, geom = "density", col = Outcome)
b5 <- qplot(Insulin, data = data, geom = "density", col = Outcome)
b6 <- qplot(BloodPressure, data = data, geom = "density", col = Outcome)
b7 <- qplot(SkinThickness, data = data, geom = "density", col = Outcome)
b8 <- qplot(DiabetesPedigreeFunction, data = data, geom = "density", col = Outcome)
ggpubr::ggarrange(b1, b2, b3, b4, b5, b6, b7, b8, nrow = 4, ncol = 2, heights = c(3, 3, 3))
[Figure: density plots of Age, BMI, Pregnancies, Glucose, Insulin, BloodPressure, SkinThickness, and DiabetesPedigreeFunction, each split by Outcome (No/Yes).]
Correlation Matrix
Correlation Matrix between variables
corr <- data[ , -9]
corr <- round(cor(corr), 3)
corr
## Pregnancies Glucose BloodPressure SkinThickness
## Pregnancies 1.000 0.130 0.226 0.114
## Glucose 0.130 1.000 0.227 0.191
## BloodPressure 0.226 0.227 1.000 0.236
## SkinThickness 0.114 0.191 0.236 1.000
## Insulin 0.019 0.604 0.085 0.216
## BMI 0.030 0.240 0.291 0.639
## DiabetesPedigreeFunction -0.034 0.138 0.002 0.119
## Age 0.544 0.269 0.327 0.176
## Insulin BMI DiabetesPedigreeFunction Age
## Pregnancies 0.019 0.030 -0.034 0.544
## Glucose 0.604 0.240 0.138 0.269
## BloodPressure 0.085 0.291 0.002 0.327
## SkinThickness 0.216 0.639 0.119 0.176
## Insulin 1.000 0.259 0.155 0.145
## BMI 0.259 1.000 0.152 0.032
## DiabetesPedigreeFunction 0.155 0.152 1.000 0.034
## Age 0.145 0.032 0.034 1.000
# library(ggcorrplot)
ggcorrplot::ggcorrplot(corr,
                       method = "square", # default
                       type = "lower",
                       show.legend = TRUE, # default
                       legend.title = "Correlation",
                       show.diag = TRUE,
                       hc.order = TRUE,
                       outline.color = "white",
                       ggtheme = ggplot2::theme_minimal, # default
                       colors = c("blue", "white", "red"),
                       lab = TRUE, # add correlation coefficients
                       lab_size = 4,
                       sig.level = 0.05)
[Figure: lower-triangle correlation heatmap (ggcorrplot) with coefficients, hierarchically clustered; the strongest off-diagonal values are SkinThickness–BMI (0.64), Glucose–Insulin (0.60), and Pregnancies–Age (0.54).]
Glucose and Insulin, Pregnancies and Age, and SkinThickness and BMI show relatively high correlations. DiabetesPedigreeFunction, in particular, appears to have little correlation with the other variables.
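If strongly correlated predictors were a concern for a particular model, they could also be flagged automatically; a sketch using caret (the 0.6 cutoff is an arbitrary choice for illustration):

# Flag predictors involved in pairwise correlations above |r| = 0.6
# (here this picks up the Glucose/Insulin and SkinThickness/BMI pairs).
caret::findCorrelation(cor(data[, -9]), cutoff = 0.6, names = TRUE)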
Machine Learning Modeling
Machine Learning Methods
Data Separation
Check the ratio of the target variable before splitting the dataset
data.scale <- data.frame(scale(data[, -9]))
data.scale$Outcome <- data$Outcome
summary(data.scale)
## Pregnancies Glucose BloodPressure SkinThickness
## Min. :-1.1411 Min. :-2.5502 Min. :-3.90316 Min. :-2.10798
## 1st Qu.:-0.8443 1st Qu.:-0.7210 1st Qu.:-0.67834 1st Qu.:-0.66782
## Median :-0.2508 Median :-0.1551 Median :-0.03338 Median : 0.00425
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.6395 3rd Qu.: 0.6324 3rd Qu.: 0.61158 3rd Qu.: 0.67632
## Max. : 3.9040 Max. : 2.5353 Max. : 3.99764 Max. : 6.72499
## Insulin BMI DiabetesPedigreeFunction
## Min. :-1.1789 Min. :-2.04948 Min. :-1.1888
## 1st Qu.:-0.6751 1st Qu.:-0.72139 1st Qu.:-0.6885
## Median :-0.2464 Median :-0.03569 Median :-0.2999
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.3060 3rd Qu.: 0.60669 3rd Qu.: 0.4659
## Max. : 6.1748 Max. : 5.00959 Max. : 5.8797
## Age Outcome
## Min. :-1.0409 No :500
## 1st Qu.:-0.7858 Yes:268
## Median :-0.3606
## Mean : 0.0000
## 3rd Qu.: 0.6598
## Max. : 4.0611
table(data.scale$Outcome)
##
## No Yes
## 500 268
Split the scaled data set into train and test sets (7:3) for KNN & check the proportions of the train and test sets
set.seed(123)
index <- caret::createDataPartition(y = data.scale$Outcome,
                                    p = 0.7,
                                    list = FALSE)
train <- data.scale[index, ]
test <- data.scale[-index, ]

# Check
train$Outcome %>% table() %>% prop.table()
## .
## No Yes
## 0.6505576 0.3494424
test$Outcome %>% table() %>% prop.table()
## .
## No Yes
## 0.6521739 0.3478261
Split the entire data set into train and test sets (7:3) for the other models & check the proportions of the train and test sets
set.seed(123)
index1 <- caret::createDataPartition(y = data$Outcome,
                                     p = 0.7,
                                     list = FALSE)
train1 <- data[index1, ]
test1 <- data[-index1, ]

# Check
train1$Outcome %>% table() %>% prop.table()
## .
## No Yes
## 0.6505576 0.3494424
test1$Outcome %>% table() %>% prop.table()
## .
## No Yes
## 0.6521739 0.3478261
KNN
To find the best tuning parameter k for KNN, make a grid ranging from 2 to 20 in steps of 1.
# library(class)
grid1 <- expand.grid(.k = seq(from = 2, to = 20, by = 1))
control <- caret::trainControl(method = "cv")
set.seed(123)
knn.train <- caret::train(Outcome ~ .,
                          data = train,
                          method = "knn",
                          trControl = control,
                          tuneGrid = grid1)
knn.train
## k-Nearest Neighbors
##
## 538 samples
## 8 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 484, 485, 484, 484, 484, 484, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 2 0.6653040 0.2522687
## 3 0.7135570 0.3525008
## 4 0.7044025 0.3239550
## 5 0.7284416 0.3718371
## 6 0.7266946 0.3691580
## 7 0.7396925 0.4016165
## 8 0.7358141 0.3896026
## 9 0.7395877 0.3995073
## 10 0.7432565 0.4118207
## 11 0.7451083 0.4113598
## 12 0.7619846 0.4462147
## 13 0.7712089 0.4679011
## 14 0.7675402 0.4588197
## 15 0.7563941 0.4330841
## 16 0.7674703 0.4589493
## 17 0.7600280 0.4380374
## 18 0.7786513 0.4798966
## 19 0.7561845 0.4311233
## 20 0.7544025 0.4236375
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 18.
• Best k = 18
• Accuracy = 0.779
• Kappa = 0.480
Apply this model to test set
knn.test <- class::knn(train = train[, -9],
                       test = test[, -9],
                       cl = train[, 9],
                       k = 18)
knn.real <- test$Outcome
caret::confusionMatrix(data = knn.test,
                       reference = knn.real,
                       positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 130 39
## Yes 20 41
##
## Accuracy : 0.7435
## 95% CI : (0.6819, 0.7986)
## No Information Rate : 0.6522
## P-Value [Acc > NIR] : 0.001872
##
## Kappa : 0.4014
##
## Mcnemar's Test P-Value : 0.019109
##
## Sensitivity : 0.5125
## Specificity : 0.8667
## Pos Pred Value : 0.6721
## Neg Pred Value : 0.7692
## Prevalence : 0.3478
## Detection Rate : 0.1783
## Detection Prevalence : 0.2652
## Balanced Accuracy : 0.6896
##
## 'Positive' Class : Yes
##
• Accuracy : 0.744
• Sensitivity : 0.513
• Specificity : 0.867
• Kappa : 0.401

Both the Accuracy and the Kappa of the test set decrease compared with the train set (77.9% -> 74.4%, 48.0% -> 40.1%).
knn.predobj <- ROCR::prediction(predictions = as.numeric(x = knn.test),
                                labels = as.numeric(x = knn.real))
knn.perform <- ROCR::performance(prediction.obj = knn.predobj,
                                 measure = 'tpr',
                                 x.measure = 'fpr')
plot(x = knn.perform, main = 'ROC curve')
[Figure: ROC curve for the KNN model (true positive rate vs false positive rate).]
MLmetrics::F1_Score(y_pred = knn.test,
                    y_true = knn.real,
                    positive = "Yes")
pROC::auc(response = as.numeric(x = knn.real),
          predictor = as.numeric(x = knn.test))
## [1] 0.5815603
## Area under the curve: 0.6896
• F1 Score : 0.582
• AUC : 0.690
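Since the same three evaluation steps (confusion matrix, F1 score, AUC) are repeated for every model below, they could be bundled into a small helper; this is an optional sketch added for illustration, not part of the original analysis:

# Optional helper (sketch): bundle the repeated evaluation steps.
# pred and real are factor vectors with levels c("No", "Yes").
eval_model <- function(pred, real) {
  print(caret::confusionMatrix(data = pred, reference = real, positive = "Yes"))
  cat("F1 :", MLmetrics::F1_Score(y_pred = pred, y_true = real, positive = "Yes"), "\n")
  print(pROC::auc(response = as.numeric(real), predictor = as.numeric(pred)))
}

# e.g. eval_model(knn.test, knn.real)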
Weighted KNN
Weighted kernels used: triangular and epanechnikov (rectangular is included as well for comparison)
# library(class)
# library(kknn)
set.seed(123)
kknn.train <- kknn::train.kknn(formula = Outcome ~ .,
                               data = train,
                               kmax = 25,
                               distance = 2,
                               kernel = c("rectangular", "triangular", "epanechnikov"))
plot(kknn.train)
abline(h = 0.2286, col = "red")
[Figure: misclassification rate vs k for the rectangular, triangular, and epanechnikov kernels, with the minimal misclassification (0.2286) marked in red.]
kknn.train
##
## Call:
## kknn::train.kknn(formula = Outcome ~ ., data = train, kmax = 25, distance = 2, kernel = c("rectangular", "triangular", "epanechnikov"))
##
## Type of response variable: nominal
## Minimal misclassification: 0.2286245
## Best kernel: rectangular
## Best k: 15
• Best k = 15
• Best kernel = rectangular
• Minimal error = 0.229 (0.771 Accuracy)
Apply this model to test set
kknn.pred <- predict(kknn.train, newdata = test)
kknn.real <- test$Outcome
caret::confusionMatrix(data = kknn.pred,
                       reference = kknn.real,
                       positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 129 38
## Yes 21 42
##
## Accuracy : 0.7435
## 95% CI : (0.6819, 0.7986)
## No Information Rate : 0.6522
## P-Value [Acc > NIR] : 0.001872
##
## Kappa : 0.4051
##
## Mcnemar's Test P-Value : 0.037249
##
## Sensitivity : 0.5250
## Specificity : 0.8600
## Pos Pred Value : 0.6667
## Neg Pred Value : 0.7725
## Prevalence : 0.3478
## Detection Rate : 0.1826
## Detection Prevalence : 0.2739
## Balanced Accuracy : 0.6925
##
## 'Positive' Class : Yes
##
• Accuracy : 0.744
• Sensitivity : 0.525
• Specificity : 0.860
• Kappa : 0.405

The Accuracy of the test set still decreases compared with the train set (77.1% -> 74.4%).
MLmetrics::F1_Score(y_pred = kknn.pred,
                    y_true = kknn.real,
                    positive = "Yes")
pROC::auc(response = as.numeric(x = kknn.real),
          predictor = as.numeric(x = kknn.pred))
## [1] 0.5874126
## Area under the curve: 0.6925
• F1 Score : 0.587
• AUC : 0.693

As a result, with this data, weighting the distances during training does not significantly improve the model's performance.
Linear SVM
Store the test set's target variable as svm.real so it can serve as the reference when evaluating the trained models.
svm.real <- test1$Outcome
# library(e1071)
set.seed(123)
linear.tune <- e1071::tune.svm(Outcome ~ .,
                               data = train1,
                               kernel = 'linear',
                               cost = c(0.001, 0.01, 0.1, 1, 5, 10))
summary(linear.tune)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 0.1
##
## - best performance: 0.219427
##
## - Detailed performance results:
## cost error dispersion
## 1 1e-03 0.3493012 0.05066509
## 2 1e-02 0.2286513 0.04797243
## 3 1e-01 0.2194270 0.05047919
## 4 1e+00 0.2231307 0.06144298
## 5 5e+00 0.2230957 0.06190707
## 6 1e+01 0.2230957 0.06190707
• Best parameter (cost) : 0.1
• Best error rate : 0.219
Use test data set to predict
best.linear <- linear.tune$best.model
tune.test <- predict(best.linear, newdata = test1)
caret::confusionMatrix(data = tune.test,
                       reference = svm.real,
                       positive = 'Yes')
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 130 34
## Yes 20 46
##
## Accuracy : 0.7652
## 95% CI : (0.705, 0.8184)
## No Information Rate : 0.6522
## P-Value [Acc > NIR] : 0.0001394
##
## Kappa : 0.4605
##
## Mcnemar's Test P-Value : 0.0768812
##
## Sensitivity : 0.5750
## Specificity : 0.8667
## Pos Pred Value : 0.6970
## Neg Pred Value : 0.7927
## Prevalence : 0.3478
## Detection Rate : 0.2000
## Detection Prevalence : 0.2870
## Balanced Accuracy : 0.7208
##
## 'Positive' Class : Yes
##
• Accuracy : 0.765
• Sensitivity : 0.575
• Specificity : 0.867
• Kappa : 0.460
MLmetrics::F1_Score(y_pred = tune.test,
                    y_true = svm.real,
                    positive = "Yes")
pROC::auc(predictor = as.numeric(x = tune.test),
          response = as.numeric(x = svm.real))
## [1] 0.630137
## Area under the curve: 0.7208
• F1_Score : 0.630
• AUC : 0.721
Nonlinear SVM
Three kernels are used in nonlinear SVM: Polynomial, Radial Basis, and Sigmoid
1. Polynomial
• polynomial degree : 3, 4, 5
• kernel coefficient (coef0) : 0.1, 0.5, 1, 2, 3, 4
set.seed(123)
poly.tune <- e1071::tune.svm(Outcome ~ .,
                             data = train1,
                             kernel = 'polynomial',
                             degree = c(3, 4, 5),
                             coef0 = c(0.1, 0.5, 1, 2, 3, 4))
summary(poly.tune)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## degree coef0
## 3 3
##
## - best performance: 0.2471349
##
## - Detailed performance results:
## degree coef0 error dispersion
## 1 3 0.1 0.2657582 0.05376570
## 2 4 0.1 0.2824598 0.04826251
## 3 5 0.1 0.2935360 0.05667382
## 4 3 0.5 0.2472397 0.04302385
## 5 4 0.5 0.2601677 0.03112229
## 6 5 0.5 0.2731656 0.05011586
## 7 3 1.0 0.2564990 0.04257450
## 8 4 1.0 0.2693920 0.03763831
## 9 5 1.0 0.2806429 0.04061729
## 10 3 2.0 0.2508735 0.03694279
## 11 4 2.0 0.2750874 0.02886593
## 12 5 2.0 0.2993012 0.05806619
## 13 3 3.0 0.2471349 0.03870198
## 14 4 3.0 0.2750874 0.03467953
## 15 5 3.0 0.3048917 0.06606766
## 16 3 4.0 0.2508735 0.03694279
## 17 4 4.0 0.2713836 0.04226771
## 18 5 4.0 0.3067086 0.06479947
• Best parameters : degree = 3, coef0 = 3
• Best error rate : 0.247
Use test data set to predict
best.poly <- poly.tune$best.model
poly.test <- predict(best.poly, newdata = test1)
caret::confusionMatrix(data = poly.test,
                       reference = svm.real,
                       positive = 'Yes')
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 130 30
## Yes 20 50
##
## Accuracy : 0.7826
## 95% CI : (0.7236, 0.8341)
## No Information Rate : 0.6522
## P-Value [Acc > NIR] : 1.156e-05
##
## Kappa : 0.5064
##
## Mcnemar's Test P-Value : 0.2031
##
## Sensitivity : 0.6250
## Specificity : 0.8667
## Pos Pred Value : 0.7143
## Neg Pred Value : 0.8125
## Prevalence : 0.3478
## Detection Rate : 0.2174
## Detection Prevalence : 0.3043
## Balanced Accuracy : 0.7458
##
## 'Positive' Class : Yes
##
• Accuracy : 0.783
• Sensitivity : 0.625
• Specificity : 0.867
• Kappa : 0.506
MLmetrics::F1_Score(y_pred = poly.test,
                    y_true = svm.real,
                    positive = "Yes")
pROC::auc(predictor = as.numeric(x = poly.test),
          response = as.numeric(x = svm.real))
## [1] 0.6666667
## Area under the curve: 0.7458
• F1_Score : 0.667
• AUC : 0.746
2. Radial Basis
• gamma : 0.1, 0.5, 1, 2, 3, 4
set.seed(123)
rbf.tune <- e1071::tune.svm(Outcome ~ .,
                            data = train1,
                            kernel = 'radial',
                            gamma = c(0.1, 0.5, 1, 2, 3, 4))
summary(rbf.tune)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## gamma
## 0.1
##
## - best performance: 0.2473445
##
## - Detailed performance results:
## gamma error dispersion
## 1 0.1 0.2473445 0.04970510
## 2 0.5 0.2751223 0.04198407
## 3 1.0 0.3141859 0.05103077
## 4 2.0 0.3474843 0.05043974
## 5 3.0 0.3474843 0.05118960
## 6 4.0 0.3493012 0.05066509
• Best parameter (gamma) : 0.1
• Best error rate : 0.247
Use test data set to predict
best.rbf <- rbf.tune$best.model
rbf.test <- predict(best.rbf, newdata = test1)
caret::confusionMatrix(data = rbf.test,
                       reference = svm.real,
                       positive = 'Yes')
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 129 32
## Yes 21 48
##
## Accuracy : 0.7696
## 95% CI : (0.7097, 0.8224)
## No Information Rate : 0.6522
## P-Value [Acc > NIR] : 7.748e-05
##
## Kappa : 0.4752
##
## Mcnemar's Test P-Value : 0.1696
##
## Sensitivity : 0.6000
## Specificity : 0.8600
## Pos Pred Value : 0.6957
## Neg Pred Value : 0.8012
## Prevalence : 0.3478
## Detection Rate : 0.2087
## Detection Prevalence : 0.3000
## Balanced Accuracy : 0.7300
##
## 'Positive' Class : Yes
##
• Accuracy : 0.770
• Sensitivity : 0.600
• Specificity : 0.860
• Kappa : 0.476
MLmetrics::F1_Score(y_pred = rbf.test,
                    y_true = svm.real,
                    positive = "Yes")
pROC::auc(predictor = as.numeric(x = rbf.test),
          response = as.numeric(x = svm.real))
## [1] 0.6442953
## Area under the curve: 0.73
• F1_Score : 0.644
• AUC : 0.730
3. Sigmoid
• gamma : 0.1, 0.5, 1, 2, 3, 4
• kernel coefficient (coef0) : 0.1, 0.5, 1, 2, 3, 4
set.seed(123)
sigmoid.tune <- e1071::tune.svm(Outcome ~ .,
                                data = train1,
                                kernel = 'sigmoid',
                                gamma = c(0.1, 0.5, 1, 2, 3, 4),
                                coef0 = c(0.1, 0.5, 1, 2, 3, 4))
summary(sigmoid.tune)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## gamma coef0
## 0.1 2
##
## - best performance: 0.2248777
##
## - Detailed performance results:
## gamma coef0 error dispersion
## 1 0.1 0.1 0.2602376 0.06427357
## 2 0.5 0.1 0.3381551 0.04324919
## 3 1.0 0.1 0.3437456 0.03814373
## 4 2.0 0.1 0.3622642 0.05621633
## 5 3.0 0.1 0.3603774 0.06885972
## 6 4.0 0.1 0.3603424 0.06590605
## 7 0.1 0.5 0.2991265 0.08565041
## 8 0.5 0.5 0.3140811 0.03378694
## 9 1.0 0.5 0.3455975 0.03701595
## 10 2.0 0.5 0.3621943 0.05450888
## 11 3.0 0.5 0.3751572 0.07258214
## 12 4.0 0.5 0.3751572 0.07716241
## 13 0.1 1.0 0.3045423 0.07752164
## 14 0.5 1.0 0.3289308 0.04890969
## 15 1.0 1.0 0.3547519 0.07719423
## 16 2.0 1.0 0.3641509 0.05061305
## 17 3.0 1.0 0.3714885 0.05898717
## 18 4.0 1.0 0.3677149 0.07815268
## 19 0.1 2.0 0.2248777 0.05255027
## 20 0.5 2.0 0.3343816 0.06232761
## 21 1.0 2.0 0.3809224 0.04189985
## 22 2.0 2.0 0.3678546 0.04947844
## 23 3.0 2.0 0.3640461 0.07139168
## 24 4.0 2.0 0.3696716 0.05947753
## 25 0.1 3.0 0.3493012 0.05066509
## 26 0.5 3.0 0.3923829 0.07090637
## 27 1.0 3.0 0.3659679 0.07323244
## 28 2.0 3.0 0.3937456 0.06471812
## 29 3.0 3.0 0.3585954 0.04914676
## 30 4.0 3.0 0.3585255 0.06827537
## 31 0.1 4.0 0.3493012 0.05066509
## 32 0.5 4.0 0.3848707 0.07178028
## 33 1.0 4.0 0.3639413 0.07919411
## 34 2.0 4.0 0.3529350 0.05726504
## 35 3.0 4.0 0.3584906 0.05978758
## 36 4.0 4.0 0.3436408 0.06669583
• Best parameters : gamma = 0.1, coef0 = 2
• Best error rate : 0.225
Use test data set to predict
best.sigmoid <- sigmoid.tune$best.model
sigmoid.test <- predict(best.sigmoid, newdata = test1)
caret::confusionMatrix(data = sigmoid.test,
                       reference = svm.real,
                       positive = 'Yes')
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 125 32
## Yes 25 48
##
## Accuracy : 0.7522
## 95% CI : (0.6912, 0.8066)
## No Information Rate : 0.6522
## P-Value [Acc > NIR] : 0.0007075
##
## Kappa : 0.4424
##
## Mcnemar's Test P-Value : 0.4267767
##
## Sensitivity : 0.6000
## Specificity : 0.8333
## Pos Pred Value : 0.6575
## Neg Pred Value : 0.7962
## Prevalence : 0.3478
## Detection Rate : 0.2087
## Detection Prevalence : 0.3174
## Balanced Accuracy : 0.7167
##
## 'Positive' Class : Yes
##
• Accuracy : 0.752
• Sensitivity : 0.600
• Specificity : 0.833
• Kappa : 0.442
MLmetrics::F1_Score(y_pred = sigmoid.test,
                    y_true = svm.real,
                    positive = "Yes")
pROC::auc(predictor = as.numeric(x = sigmoid.test),
          response = as.numeric(x = svm.real))
## [1] 0.627451
## Area under the curve: 0.7167
• F1_Score : 0.627
• AUC : 0.717
Random Forest
Grid search for finding the best parameters (because the tuning process may take a long time on large datasets, it is recommended to tune on a smaller sample)

• ntree = 100, 300, 500, 700, 1000
• mtry = 3, 4, 5, 6, 7
rf.grid <- expand.grid(ntree = c(100, 300, 500, 700, 1000),
                       mtry = c(3, 4, 5, 6, 7))
rf.tuned <- data.frame()
for(i in 1:nrow(rf.grid)){
  set.seed(123)
  fit <- randomForest(Outcome ~ .,
                      data = train1,
                      xtest = test1[, -9],
                      ytest = test1[, 9],
                      ntree = rf.grid[i, 'ntree'],
                      mtry = rf.grid[i, 'mtry'],
                      importance = TRUE,
                      do.trace = 100,
                      keep.forest = TRUE)
  mcSum <- sum(fit$predicted != train1$Outcome)
  mcRate <- mcSum / nrow(train1)
  df <- data.frame(index = i, misClassRate = mcRate)
  rf.tuned <- rbind(rf.tuned, df)
  cat('\n')
}
## ntree OOB 1 2| Test 1 2
## 100: 25.28% 14.86% 44.68%| 26.52% 15.33% 47.50%
## 200: 24.54% 14.57% 43.09%| 24.35% 14.00% 43.75%
## 300: 24.16% 14.00% 43.09%| 23.91% 14.00% 42.50%
## ... (similar do.trace output for the remaining grid combinations is omitted)
plot(x = rf.tuned, xlab = '', ylab = 'MisClassification Rate')
abline(h = min(rf.tuned$misClassRate), col = 'red', lty = 2)
[Figure: misclassification rate for each of the 25 grid combinations, with the minimum marked by a dashed red line.]
min <- which(rf.tuned$misClassRate == min(rf.tuned$misClassRate))
bestPara <- rf.grid[min, ]
bestPara
## ntree mtry
## 2 300 3
• Best parameter : ntree = 300, mtry = 3
# library(randomForest)
set.seed(123)
rf.train <- randomForest::randomForest(Outcome ~ .,
                                       data = train1,
                                       xtest = test1[, -9],
                                       ytest = test1[, 9],
                                       ntree = bestPara$ntree,
                                       mtry = bestPara$mtry,
                                       importance = TRUE,
                                       do.trace = 100,
                                       keep.forest = TRUE)
## ntree OOB 1 2| Test 1 2
## 100: 25.28% 14.86% 44.68%| 26.52% 15.33% 47.50%
## 200: 24.54% 14.57% 43.09%| 24.35% 14.00% 43.75%
## 300: 24.16% 14.00% 43.09%| 23.91% 14.00% 42.50%
rf.train
##
## Call:
## randomForest(formula = Outcome ~ ., data = train1, xtest = test1[, -9], ytest = test1[, 9], ntree = bestPara$ntree, mtry = bestPara$mtry, importance = TRUE, do.trace = 100, keep.forest = TRUE)
## Type of random forest: classification
## Number of trees: 300
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 24.16%
## Confusion matrix:
## No Yes class.error
## No 301 49 0.1400000
## Yes 81 107 0.4308511
## Test set error rate: 23.91%
## Confusion matrix:
## No Yes class.error
## No 129 21 0.140
## Yes 34 46 0.425
• Train OOB error : 0.242
• Test set error : 0.239
plot(rf.train)
[Figure: plot(rf.train) — OOB and per-class error rates vs the number of trees.]
plot(rf.train$err.rate[, 1], type = 'l')
[Figure: overall OOB error rate (rf.train$err.rate[, 1]) vs tree index.]
Use test data set to predict
rf.test <- rf.train$test$predicted
rf.real <- test1$Outcome
caret::confusionMatrix(data = rf.real,
                       reference = rf.test,
                       positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 129 21
## Yes 34 46
##
## Accuracy : 0.7609
## 95% CI : (0.7004, 0.8145)
## No Information Rate : 0.7087
## P-Value [Acc > NIR] : 0.04559
##
## Kappa : 0.4521
##
## Mcnemar's Test P-Value : 0.10565
##
## Sensitivity : 0.6866
## Specificity : 0.7914
## Pos Pred Value : 0.5750
## Neg Pred Value : 0.8600
## Prevalence : 0.2913
## Detection Rate : 0.2000
## Detection Prevalence : 0.3478
## Balanced Accuracy : 0.7390
##
## 'Positive' Class : Yes
##
• Accuracy : 0.761
• Sensitivity : 0.687
• Specificity : 0.791
• Kappa : 0.452
MLmetrics::F1_Score(y_pred = rf.real,
                    y_true = rf.test,
                    positive = "Yes")
pROC::auc(predictor = as.numeric(x = rf.real),
          response = as.numeric(x = rf.test))
## [1] 0.6258503
## Area under the curve: 0.739
• F1_Score : 0.626
• AUC : 0.739
Variable Importance
The mean is used because each variable's contribution to accuracy and to Gini impurity is evaluated across the many trees in the forest and then averaged.

• MeanDecreaseAccuracy
• MeanDecreaseGini
rf.train %>% treesize(terminal = TRUE) %>% hist(main = 'Number of Terminal Nodes')
[Figure: "Number of Terminal Nodes" — histogram of the number of terminal nodes per tree in rf.train.]
randomForest::varImpPlot(rf.train, main = "Variable Importance Plot")
randomForest::importance(rf.train)
[Figure: "Variable Importance Plot" — variables ranked by MeanDecreaseAccuracy (Glucose, Age, BMI at the top) and by MeanDecreaseGini (Glucose, BMI, Age at the top).]
## No Yes MeanDecreaseAccuracy
## Pregnancies 14.983572 -1.074802 13.519390
## Glucose 26.224744 25.567881 35.470165
## BloodPressure 3.180638 -0.334135 2.153196
## SkinThickness 3.933890 2.077106 4.419072
## Insulin 5.753626 1.214205 5.366615
## BMI 7.234006 14.228060 14.997752
## DiabetesPedigreeFunction 5.325098 5.737122 7.278320
## Age 15.988667 8.236952 18.590614
## MeanDecreaseGini
## Pregnancies 16.08648
## Glucose 61.03068
## BloodPressure 18.42026
## SkinThickness 20.33625
## Insulin 28.43677
## BMI 35.81051
## DiabetesPedigreeFunction 29.40971
## Age 34.10992
• MeanDecreaseAccuracy : Glucose > Age > BMI
• MeanDecreaseGini : Glucose > BMI > Age
XGBoost
Grid search for finding the best parameters (because the tuning process may take a long time on large datasets, it is recommended to tune on a smaller sample)
xg.grid <- expand.grid(nrounds = c(75, 100),
                       colsample_bytree = 1,
                       min_child_weight = 1,
                       eta = c(0.01, 0.1, 0.3), # 0.3 is the default value
                       gamma = c(0.5, 0.25),
                       subsample = 0.5,
                       max_depth = c(2, 3))
cntrl <- caret::trainControl(method = "cv",
                             number = 5, # 5-fold cross-validation
                             verboseIter = TRUE,
                             returnData = FALSE,
                             returnResamp = "final")
set.seed(123)
train.xgb <- train(x = train1[, 1:8],
                   y = train1[, 9],
                   trControl = cntrl,
                   tuneGrid = xg.grid,
                   method = "xgbTree")
## + Fold1: eta=0.01, max_depth=2, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100
## - Fold1: eta=0.01, max_depth=2, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100
## ... (analogous verbose lines for the remaining parameter combinations and folds omitted) ...
## Aggregating results
## Selecting tuning parameters
## Fitting nrounds = 100, max_depth = 2, eta = 0.01, gamma = 0.5, colsample_bytree = 1, min_child_weight = 1, subsample = 0.5 on full training set
train.xgb
## eXtreme Gradient Boosting
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 431, 430, 431, 430, 430
## Resampling results across tuning parameters:
##
##   eta   max_depth  gamma  nrounds  Accuracy   Kappa
##   0.01  2          0.25    75      0.7657494  0.4538471
##   0.01  2          0.25   100      0.7676012  0.4587285
##   0.01  2          0.50    75      0.7694704  0.4623929
##   0.01  2          0.50   100      0.7732087  0.4699355
##   0.01  3          0.25    75      0.7582728  0.4420598
##   0.01  3          0.25   100      0.7582901  0.4433717
##   0.01  3          0.50    75      0.7601765  0.4445163
##   0.01  3          0.50   100      0.7601765  0.4471754
##   0.10  2          0.25    75      0.7527518  0.4374509
##   0.10  2          0.25   100      0.7509173  0.4393334
##   0.10  2          0.50    75      0.7713396  0.4775565
##   0.10  2          0.50   100      0.7657494  0.4712296
##   0.10  3          0.25    75      0.7416753  0.4233366
##   0.10  3          0.25   100      0.7472655  0.4383261
##   0.10  3          0.50    75      0.7620630  0.4673431
##   0.10  3          0.50   100      0.7528037  0.4516339
##   0.30  2          0.25    75      0.7416234  0.4260231
##   0.30  2          0.25   100      0.7323468  0.4075123
##   0.30  2          0.50    75      0.7379716  0.4130307
##   0.30  2          0.50   100      0.7342506  0.4005421
##   0.30  3          0.25    75      0.7379370  0.4054532
##   0.30  3          0.25   100      0.7174801  0.3671057
##   0.30  3          0.50    75      0.7230183  0.3915425
##   0.30  3          0.50   100      0.7192800  0.3784477
##
## Tuning parameter 'colsample_bytree' was held constant at a value of 1
## Tuning parameter 'min_child_weight' was held constant at a value of 1
## Tuning parameter 'subsample' was held constant at a value of 0.5
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 100, max_depth = 2,
## eta = 0.01, gamma = 0.5, colsample_bytree = 1, min_child_weight = 1
## and subsample = 0.5.
• Accuracy : 0.773
• Kappa : 0.470
• nrounds : 100 (maximum number of iterations, i.e. trees in the final model)
• max_depth : 2 (maximum depth of an individual tree)
• eta : 0.01 (learning rate; the contribution of each tree to the solution)
• gamma : 0.5 (minimum loss reduction required to make a further split)
• colsample_bytree : 1 (fraction of features sampled when building each tree)
• min_child_weight : 1 (minimum sum of instance weight required in a child node)
• subsample : 0.5 (fraction of observations sampled per tree)
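Because train.xgb is an ordinary caret train object, the selected parameters and the tuning profile can also be inspected directly; a short usage sketch:

train.xgb$bestTune # the selected parameter combination
plot(train.xgb)    # accuracy across the tuning grid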
Variable Importance
• Gain : how much a feature improves accuracy when it is used in a split, averaged over all trees
• Cover : the relative number of observations associated with splits on the feature
• Frequency : the percentage of times the feature appears across all trees
# library(xgboost)
param <- list(objective = "binary:logistic",
              booster = "gbtree",
              eval_metric = "error",
              eta = 0.01,
              max_depth = 3,
              subsample = 0.5,
              colsample_bytree = 1,
              gamma = 0.25)

x <- as.matrix(train1[, 1:8])
y <- ifelse(train1$Outcome == "Yes", 1, 0)
train.mat <- xgboost::xgb.DMatrix(data = x, label = y)

set.seed(123)
xgb.fit <- xgboost::xgb.train(params = param,
                              data = train.mat,
                              nrounds = 75)
impMatrix <- xgboost::xgb.importance(feature_names = dimnames(x)[[2]], model = xgb.fit)
impMatrix
##                     Feature       Gain      Cover  Frequency
## 1:                  Glucose 0.53674056 0.39745044 0.25154639
## 2:                      BMI 0.15500238 0.16984326 0.21030928
## 3:                      Age 0.14802032 0.19479942 0.17731959
## 4: DiabetesPedigreeFunction 0.06119871 0.09038095 0.12371134
## 5:              Pregnancies 0.02771003 0.04512527 0.04948454
## 6:            SkinThickness 0.02707855 0.04490084 0.06804124
## 7:                  Insulin 0.02393599 0.03204323 0.06185567
## 8:            BloodPressure 0.02031346 0.02545660 0.05773196
xgboost::xgb.plot.importance(impMatrix, main = "Gain by Feature")
[Figure: "Gain by Feature". Horizontal bar chart of feature importance (Gain), ordered from Glucose (largest) down through BMI, Age, DiabetesPedigreeFunction, Pregnancies, SkinThickness, and Insulin to BloodPressure (smallest); x-axis from 0.0 to 0.5.]
• Glucose > BMI > Age
Performance on the Test Set
# install.packages("InformationValue")# library(InformationValue)xgb.pred <- predict(xgb.fit, x)InformationValue::optimalCutoff(y, xgb.pred) # Optimal threshold to minimize error
## [1] 0.4817243
• Cutoff : 0.482
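Conceptually, optimalCutoff performs a one-dimensional search of this kind (a rough base-R sketch of the idea, not the package's actual implementation):

cutoffs <- seq(0.01, 0.99, by = 0.01) # candidate thresholds
err <- sapply(cutoffs, function(t) mean((xgb.pred >= t) != y)) # misclassification error at each threshold
cutoffs[which.min(err)] # threshold with the lowest training error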
xgb.testMat <- as.matrix(test1[, 1:8])
xgb.test <- predict(xgb.fit, xgb.testMat)
y.test <- ifelse(test1$Outcome == "Yes", 1, 0)
InformationValue::confusionMatrix(actuals = y.test,
                                  predictedScores = xgb.test,
                                  threshold = 0.48)
1 - InformationValue::misClassError(y.test, xgb.test, threshold = 0.48)
##     0  1
## 0 123 31
## 1  27 49
## [1] 0.7478
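The same figures can be sanity-checked in base R (note that table() below puts actuals in rows, which may be transposed relative to InformationValue's layout):

pred.class <- as.integer(xgb.test >= 0.48) # same 0.48 threshold as above
table(actual = y.test, predicted = pred.class) # confusion matrix by hand
mean(pred.class == y.test) # accuracy; should agree with the 0.7478 above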
xgb.test <- as.numeric(xgb.test > 0.5) %>% as.factor()
MLmetrics::F1_Score(y_pred = xgb.test,
                    y_true = y.test,
                    positive = "1")
pROC::auc(predictor = as.numeric(x = xgb.test),
          response = as.numeric(x = y.test))
## [1] 0.6206897
## Area under the curve: 0.7146
• Accuracy : 0.761
• Error : 0.239
• F1 Score : 0.626
• AUC : 0.718
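For reference, the F1 score is the harmonic mean of precision and recall; recomputing it by hand at the same 0.5 threshold (a sketch) should reproduce the MLmetrics value printed above:

# F1 = 2 * precision * recall / (precision + recall)
tp <- sum(xgb.test == "1" & y.test == 1) # true positives
fp <- sum(xgb.test == "1" & y.test == 0) # false positives
fn <- sum(xgb.test == "0" & y.test == 1) # false negatives
precision <- tp / (tp + fp)
recall <- tp / (tp + fn)
2 * precision * recall / (precision + recall)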
Model Evaluation
Model Comparison
Accuracy

Model          Train   Test   Difference
KNN            0.779   0.744   0.035
Weighted KNN   0.771   0.744   0.027
Linear SVM     0.781   0.765   0.016
Poly SVM       0.753   0.783  -0.030
Radial SVM     0.753   0.770  -0.017
Sigmoid SVM    0.775   0.752   0.023
Randomforest   0.758   0.761  -0.003
Xgboost        0.773   0.761   0.012
• Model Showing Best Accuracy in Train Set : Linear SVM (0.781)
• Model Showing Best Accuracy in Test Set : Poly SVM (0.783)
• Model Showing Lowest Difference between Train and Test Set : Poly SVM (-0.03)
F1 Score & AUC
F1 Score & AUC
Model F1 Score AUCKNN 0.582 0.690
Weighted KNN 0.587 0.693Linear SVM 0.630 0.721Poly SVM 0.667 0.746Radial SVM 0.644 0.730
47
Model F1 Score AUCSigmoid SVM 0.627 0.717Randomforest 0.626 0.739
Xgboost 0.626 0.718
• Model Showing Highest F1 Score : Poly SVM (0.667)
• Model Showing Highest AUC : Poly SVM (0.746)
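For convenience, the two comparison tables can be combined into a single data frame (values transcribed from the results above, not recomputed here):

comparison <- data.frame(
  Model = c("KNN", "Weighted KNN", "Linear SVM", "Poly SVM",
            "Radial SVM", "Sigmoid SVM", "Randomforest", "Xgboost"),
  F1 = c(0.582, 0.587, 0.630, 0.667, 0.644, 0.627, 0.626, 0.626),
  AUC = c(0.690, 0.693, 0.721, 0.746, 0.730, 0.717, 0.739, 0.718))
comparison[order(-comparison$AUC), ] # Poly SVM ranks first on both F1 and AUC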