
Pima Diabetes Analysis

Intro

Objective

• The objective is to predict diagnostically whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset.

• The data is split into a training set (70%) and a test set (30%) to evaluate the accuracy of the models.

Dataset Background

• The ‘Pima Indians Diabetes Database’, published by the National Institute of Diabetes and Digestive and Kidney Diseases in the United States, records the physical condition of adult women of Pima Indian heritage, including pregnancies, blood pressure and insulin levels.

• It contains eight predictor variables and a diabetes outcome label for the 768 women surveyed, so that the occurrence of diabetes can be predicted from the eight variables.

Data Sources and Links

• Pima Indians Diabetes Database (Kaggle)

• https://www.kaggle.com/uciml/pima-indians-diabetes-database

Variable Explanation

Variable Name               Explanation
Pregnancies                 Number of times pregnant
Glucose                     Glucose level in an oral glucose tolerance test
BloodPressure               Diastolic blood pressure (mm Hg)
SkinThickness               Triceps skin fold thickness (mm)
Insulin                     2-hour serum insulin (mu U/ml)
BMI                         Body mass index (weight in kg / (height in m)^2)
DiabetesPedigreeFunction    Diabetes pedigree function
Age                         Age (years)
Outcome                     1: Yes, 0: No

Diabetes Pedigree Function

$$
\mathrm{DPF} = \frac{\sum_i K_i\,(88 - \mathrm{ADM}_i) + 20}{\sum_j K_j\,(\mathrm{ACL}_j - 14) + 50}
$$

• i : index over the relatives with diabetes
• j : index over the relatives without diabetes
• K_x : degree of genetic match with a particular relative x: 0.5 for a parent or sibling, 0.25 for a grandparent or a sibling of a parent, 0.125 for a child of a parent's sibling
• ADM_i : the age at which diabetic relative i developed diabetes
• ACL_j : the age at which non-diabetic relative j was tested for diabetes
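As a worked illustration of the pedigree function, the sketch below evaluates it for a made-up family. It assumes the formula is a ratio, with the diabetic-relative sum plus 20 in the numerator and the non-diabetic-relative sum plus 50 in the denominator; the relatives, ages and the helper name `dpf` are hypothetical and do not come from the dataset.

```r
# Hypothetical family history (illustrative values only).
diabetic <- data.frame(
  K   = c(0.5),        # one parent
  ADM = c(50)          # age at which that relative developed diabetes
)
non_diabetic <- data.frame(
  K   = c(0.5, 0.25),  # one sibling, one grandparent
  ACL = c(40, 70)      # age at which each was tested for diabetes
)

# Assumed ratio form of the pedigree function.
dpf <- function(diabetic, non_diabetic) {
  numerator   <- sum(diabetic$K * (88 - diabetic$ADM)) + 20
  denominator <- sum(non_diabetic$K * (non_diabetic$ACL - 14)) + 50
  numerator / denominator
}

dpf_value <- dpf(diabetic, non_diabetic)
print(round(dpf_value, 4))
```

Under these assumptions, relatives who developed diabetes young raise the score, while relatives who were still non-diabetic at an older age lower it.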


EDA

Exploratory Data Analysis

Summary

Original Data Set

summary(data)

##   Pregnancies        Glucose       BloodPressure    SkinThickness
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00
##     Insulin           BMI        DiabetesPedigreeFunction      Age
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00
##  Median : 30.5   Median :32.00   Median :0.3725           Median :29.00
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00
##     Outcome
##  Min.   :0.000
##  1st Qu.:0.000
##  Median :0.000
##  Mean   :0.349
##  3rd Qu.:1.000
##  Max.   :1.000

On inspecting the data, a significant number of values are recorded as zero. Skin thickness, blood glucose and blood pressure cannot be zero, so these zeros are actually unmeasured values, i.e. missing values. Triceps skin thickness and insulin concentration in particular have so many missing values that dropping the affected rows would leave too little data for training. There are several ways to handle missing values, such as excluding them from the analysis, and the appropriate choice depends on the type of missingness (MCAR, MAR, NMAR). In this case, I used multiple imputation (MI) to deal with the missing values in five columns (Insulin, BMI, BloodPressure, SkinThickness, Glucose).

Imputing Missing Values

Replace zero values with NA in each of the five variables (Insulin, BMI, BloodPressure, SkinThickness, Glucose).

# library(dplyr)
# Zeros in these measurement columns are unmeasured values, so recode them as NA
data <- data %>%
  mutate(Insulin = ifelse(Insulin == 0, NA, Insulin),
         BMI = ifelse(BMI == 0, NA, BMI),
         BloodPressure = ifelse(BloodPressure == 0, NA, BloodPressure),
         SkinThickness = ifelse(SkinThickness == 0, NA, SkinThickness),
         Glucose = ifelse(Glucose == 0, NA, Glucose))
summary(data)

##   Pregnancies        Glucose       BloodPressure    SkinThickness
##  Min.   : 0.000   Min.   : 44.0   Min.   : 24.00   Min.   : 7.00
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 64.00   1st Qu.:22.00
##  Median : 3.000   Median :117.0   Median : 72.00   Median :29.00
##  Mean   : 3.845   Mean   :121.7   Mean   : 72.41   Mean   :29.15
##  3rd Qu.: 6.000   3rd Qu.:141.0   3rd Qu.: 80.00   3rd Qu.:36.00
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00
##                   NA's   :5       NA's   :35       NA's   :227
##     Insulin            BMI        DiabetesPedigreeFunction      Age
##  Min.   : 14.00   Min.   :18.20   Min.   :0.0780           Min.   :21.00
##  1st Qu.: 76.25   1st Qu.:27.50   1st Qu.:0.2437           1st Qu.:24.00
##  Median :125.00   Median :32.30   Median :0.3725           Median :29.00
##  Mean   :155.55   Mean   :32.46   Mean   :0.4719           Mean   :33.24
##  3rd Qu.:190.00   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00
##  Max.   :846.00   Max.   :67.10   Max.   :2.4200           Max.   :81.00
##  NA's   :374      NA's   :11
##     Outcome
##  Min.   :0.000
##  1st Qu.:0.000
##  Median :0.000
##  Mean   :0.349
##  3rd Qu.:1.000
##  Max.   :1.000
##

sum(is.na(data));mean(is.na(data));sum(complete.cases(data))

## [1] 652

## [1] 0.0943287

## [1] 392

• Number of NAs : 652
• % of NAs in the total : 9.43%
• Rows not including NA : 392 (51%)
• Rows including NA : 376 (49%)
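The same counts can be reproduced on a small scale. The sketch below uses a made-up data frame (`toy` and its values are purely illustrative) to show how zeros are recoded as NA and how the three summary numbers are obtained with base R. On the real data the share of NAs is 652 / (768 × 9) ≈ 0.0943.

```r
# Toy data frame standing in for the real dataset (values are made up).
toy <- data.frame(
  Glucose = c(148, 0, 183, 89),
  Insulin = c(0, 0, 94, 168),
  Age     = c(50, 31, 32, 21)
)

# Treat zeros as missing only in the measurement columns.
zero_as_na <- c("Glucose", "Insulin")
toy[zero_as_na] <- lapply(toy[zero_as_na],
                          function(col) replace(col, col == 0, NA))

n_na       <- sum(is.na(toy))           # total number of NA cells
share_na   <- mean(is.na(toy))          # NA cells divided by all cells
n_complete <- sum(complete.cases(toy))  # rows without any NA
n_partial  <- nrow(toy) - n_complete    # rows with at least one NA

print(c(n_na = n_na, share_na = share_na,
        n_complete = n_complete, n_partial = n_partial))
```

Note that the number of rows containing an NA is simply the row count minus the number of complete cases.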

aggr <- VIM::aggr(data,
                  numbers = TRUE,
                  prop = c(TRUE, TRUE),
                  sortVars = TRUE,  # sort according to the number of missing values
                  sortCombs = TRUE,
                  only.miss = TRUE,
                  labels = names(data),
                  cex.axis = .7,
                  gap = 2,
                  col = c("navyblue", "red"),
                  ylab = c("Histogram of Missings", "Pattern"))

Figure: VIM::aggr output. Left panel: "Histogram of Missings" (proportion of missing values per variable, from Insulin down to Outcome). Right panel: "Pattern" (missingness combinations, with proportions 0.5104, 0.2500, 0.1823, 0.0339, 0.0091, 0.0052, 0.0026, 0.0026, 0.0013, 0.0013, 0.0013).

##
##  Variables sorted by number of missings:
##                  Variable       Count
##                   Insulin 0.486979167
##             SkinThickness 0.295572917
##             BloodPressure 0.045572917
##                       BMI 0.014322917
##                   Glucose 0.006510417
##               Pregnancies 0.000000000
##  DiabetesPedigreeFunction 0.000000000
##                       Age 0.000000000
##                   Outcome 0.000000000

Almost 51% of the samples are not missing any information, 25% are missing both the Insulin and SkinThickness values, 18% are missing only the Insulin value, and the remaining samples show other missing patterns.

Replace NA values by using Predictive Mean Matching Method

micedia <- mice::mice(data,
                      seed = 1234,
                      m = 5,           # the number of imputed datasets
                      maxit = 50,
                      method = 'pmm')  # the imputation method (predictive mean matching)

##
##  iter imp variable
##   1   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
##   1   2  Glucose  BloodPressure  SkinThickness  Insulin  BMI
##   1   3  Glucose  BloodPressure  SkinThickness  Insulin  BMI
##   1   4  Glucose  BloodPressure  SkinThickness  Insulin  BMI
##   1   5  Glucose  BloodPressure  SkinThickness  Insulin  BMI
##   ...
##   50   5  Glucose  BloodPressure  SkinThickness  Insulin  BMI

micedia

## Class: mids
## Number of multiple imputations:  5
## Imputation methods:
##              Pregnancies                  Glucose            BloodPressure
##                       ""                    "pmm"                    "pmm"
##            SkinThickness                  Insulin                      BMI
##                    "pmm"                    "pmm"                    "pmm"
## DiabetesPedigreeFunction                      Age                  Outcome
##                       ""                       ""                       ""
## PredictorMatrix:
##               Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## Pregnancies             0       1             1             1       1   1
## Glucose                 1       0             1             1       1   1
## BloodPressure           1       1             0             1       1   1
## SkinThickness           1       1             1             0       1   1
## Insulin                 1       1             1             1       0   1
## BMI                     1       1             1             1       1   0
##               DiabetesPedigreeFunction Age Outcome
## Pregnancies                          1   1       1
## Glucose                              1   1       1
## BloodPressure                        1   1       1
## SkinThickness                        1   1       1
## Insulin                              1   1       1
## BMI                                  1   1       1

Check that the imputed values are properly replaced

xyplot(micedia,
       Insulin ~ Glucose + BloodPressure + SkinThickness + BMI,
       pch = 18,
       cex = 1,
       col = c("red", "navyblue"))

Figure: scatter plots of Insulin (y-axis, 0 to 800) against Glucose, BloodPressure, SkinThickness and BMI, one panel per predictor, with observed and imputed points in different colors.

The navy blue points (imputed) follow the same shape as the red ones (observed). This matching shape suggests that the imputed values can be considered plausible.
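For intuition about why imputed points fall on top of observed ones, here is a deliberately simplified sketch of predictive mean matching on simulated data. It is not the procedure `mice` runs internally (which, among other things, also accounts for uncertainty in the regression coefficients); the function name `pmm_sketch`, the simulated variables and the choice of k = 5 donors are all assumptions for illustration.

```r
set.seed(1)

# Simulated data: y depends on x, and 40 of the 200 y values are missing.
n <- 200
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
y[sample.int(n, 40)] <- NA

# Simplified predictive mean matching (illustrative, not mice's algorithm):
# 1. regress y on x using the observed rows,
# 2. predict y for every row,
# 3. for each missing row, find the k observed rows with the closest
#    prediction and copy the observed y of one of them at random.
pmm_sketch <- function(y, x, k = 5) {
  df      <- data.frame(y = y, x = x)
  obs_idx <- which(!is.na(df$y))
  fit     <- lm(y ~ x, data = df[obs_idx, ])
  pred    <- predict(fit, newdata = df)

  y_imp <- df$y
  for (i in which(is.na(df$y))) {
    nearest  <- order(abs(pred[obs_idx] - pred[i]))[seq_len(k)]
    donors   <- obs_idx[nearest]
    y_imp[i] <- df$y[donors[sample.int(k, 1)]]
  }
  y_imp
}

y_imp <- pmm_sketch(y, x)
summary(y_imp)
```

Because every imputed value is borrowed from an actually observed case, imputations stay within the observed range, which is consistent with the overlap seen in the plot.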

Check the number of zero, NA values

• q_zeros : quantity of zeros (p_zeros : as a percentage)
• q_na : quantity of NAs (p_na : as a percentage)
• type : factor or numeric
• unique : quantity of unique values

data <- mice::complete(micedia, 1)  # use the first of the five imputed datasets
data$Outcome <- as.factor(data$Outcome)
levels(data$Outcome) <- c("No", "Yes")
funModeling::df_status(data)

##                   variable q_zeros p_zeros q_na p_na q_inf p_inf    type
## 1              Pregnancies     111   14.45    0    0     0     0 integer
## 2                  Glucose       0    0.00    0    0     0     0 integer
## 3            BloodPressure       0    0.00    0    0     0     0 integer
## 4            SkinThickness       0    0.00    0    0     0     0 integer
## 5                  Insulin       0    0.00    0    0     0     0 integer
## 6                      BMI       0    0.00    0    0     0     0 numeric
## 7 DiabetesPedigreeFunction       0    0.00    0    0     0     0 numeric
## 8                      Age       0    0.00    0    0     0     0 integer
## 9                  Outcome       0    0.00    0    0     0     0  factor
##   unique
## 1     17
## 2    135
## 3     46
## 4     50
## 5    185
## 6    247
## 7    517
## 8     52
## 9      2

DataExplorer::profile_missing(data)

##                    feature num_missing pct_missing
## 1              Pregnancies           0           0
## 2                  Glucose           0           0
## 3            BloodPressure           0           0
## 4            SkinThickness           0           0
## 5                  Insulin           0           0
## 6                      BMI           0           0
## 7 DiabetesPedigreeFunction           0           0
## 8                      Age           0           0
## 9                  Outcome           0           0

ggpairs

# library(ggplot2)
# library(GGally)
ggpairs(data,
        aes(colour = Outcome, alpha = 0.8),
        lower = list(combo = wrap("facethist", binwidth = 1)))

Figure: ggpairs plot matrix of all nine variables, colored by Outcome (No/Yes), showing pairwise scatter plots, per-variable distributions, and pairwise correlations (overall and per Outcome class).

BoxPlot

par(mfrow = c(2, 4), mar = c(4, 4, 2, 1), oma = c(0, 0, 2, 0))  # 2 rows, 4 columns
with(data, {
  plot(Outcome, Pregnancies, main = "Pregnancies and Outcome")
  plot(Outcome, Age, main = "Age and Outcome")
  plot(Outcome, Glucose, main = "Glucose and Outcome")
  plot(Outcome, BMI, main = "BMI and Outcome")
  plot(Outcome, SkinThickness, main = "SkinThickness and Outcome")
  plot(Outcome, BloodPressure, main = "BloodPressure and Outcome")
  plot(Outcome, DiabetesPedigreeFunction, main = "DPF and Outcome")
  plot(Outcome, Insulin, main = "Insulin and Outcome")
  mtext("Relationship of Variables and Outcomes", outer = TRUE)
})

Figure: Relationship of Variables and Outcomes. Eight box plots by Outcome (No/Yes): Pregnancies, Age, Glucose, BMI, SkinThickness, BloodPressure, DPF and Insulin.

Distribution

# library(ggpubr)
b1 <- qplot(Age, data = data, geom = "density", col = Outcome)
b2 <- qplot(BMI, data = data, geom = "density", col = Outcome)
b3 <- qplot(Pregnancies, data = data, geom = "density", col = Outcome)
b4 <- qplot(Glucose, data = data, geom = "density", col = Outcome)
b5 <- qplot(Insulin, data = data, geom = "density", col = Outcome)
b6 <- qplot(BloodPressure, data = data, geom = "density", col = Outcome)
b7 <- qplot(SkinThickness, data = data, geom = "density", col = Outcome)
b8 <- qplot(DiabetesPedigreeFunction, data = data, geom = "density", col = Outcome)
ggpubr::ggarrange(b1, b2, b3, b4, b5, b6, b7, b8,
                  nrow = 4, ncol = 2, heights = c(3, 3, 3))

Figure: density plots of Age, BMI, Pregnancies, Glucose, Insulin, BloodPressure, SkinThickness and DiabetesPedigreeFunction, colored by Outcome (No/Yes).

Correlation Matrix

Correlation Matrix between variables

corr <- data[, -9]  # drop the Outcome column
corr <- round(cor(corr), 3)
corr

##                          Pregnancies Glucose BloodPressure SkinThickness
## Pregnancies                    1.000   0.130         0.226         0.114
## Glucose                        0.130   1.000         0.227         0.191
## BloodPressure                  0.226   0.227         1.000         0.236
## SkinThickness                  0.114   0.191         0.236         1.000
## Insulin                        0.019   0.604         0.085         0.216
## BMI                            0.030   0.240         0.291         0.639
## DiabetesPedigreeFunction      -0.034   0.138         0.002         0.119
## Age                            0.544   0.269         0.327         0.176
##                          Insulin   BMI DiabetesPedigreeFunction   Age
## Pregnancies                0.019 0.030                   -0.034 0.544
## Glucose                    0.604 0.240                    0.138 0.269
## BloodPressure              0.085 0.291                    0.002 0.327
## SkinThickness              0.216 0.639                    0.119 0.176
## Insulin                    1.000 0.259                    0.155 0.145
## BMI                        0.259 1.000                    0.152 0.032
## DiabetesPedigreeFunction   0.155 0.152                    1.000 0.034
## Age                        0.145 0.032                    0.034 1.000

# library(ggcorrplot)
ggcorrplot::ggcorrplot(corr,
                       method = "square",  # default
                       type = "lower",
                       show.legend = TRUE,  # default
                       legend.title = "Correlation",
                       show.diag = TRUE,
                       hc.order = TRUE,
                       outline.color = "white",
                       ggtheme = ggplot2::theme_minimal,  # default
                       colors = c("blue", "white", "red"),
                       lab = TRUE,  # add correlation coefficients
                       lab_size = 4,
                       sig.level = 0.05)

Figure: lower-triangle correlation heatmap of the eight predictors (ordered DiabetesPedigreeFunction, Glucose, Insulin, Pregnancies, Age, BloodPressure, SkinThickness, BMI), labeled with correlation coefficients on a scale from -1.0 to 1.0.

Glucose and Insulin (0.604), Pregnancies and Age (0.544), and SkinThickness and BMI (0.639) have relatively high correlations. DiabetesPedigreeFunction, in particular, appears to have little correlation with the other variables.

Machine Learning Modeling

Machine Learning Methods

Data Separation

Check the ratio of the target variable before splitting the dataset

data.scale <- data.frame(scale(data[, -9]))
data.scale$Outcome <- data$Outcome
summary(data.scale)

##   Pregnancies         Glucose        BloodPressure      SkinThickness
##  Min.   :-1.1411   Min.   :-2.5502   Min.   :-3.90316   Min.   :-2.10798
##  1st Qu.:-0.8443   1st Qu.:-0.7210   1st Qu.:-0.67834   1st Qu.:-0.66782
##  Median :-0.2508   Median :-0.1551   Median :-0.03338   Median : 0.00425
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000
##  3rd Qu.: 0.6395   3rd Qu.: 0.6324   3rd Qu.: 0.61158   3rd Qu.: 0.67632
##  Max.   : 3.9040   Max.   : 2.5353   Max.   : 3.99764   Max.   : 6.72499
##     Insulin             BMI           DiabetesPedigreeFunction
##  Min.   :-1.1789   Min.   :-2.04948   Min.   :-1.1888
##  1st Qu.:-0.6751   1st Qu.:-0.72139   1st Qu.:-0.6885
##  Median :-0.2464   Median :-0.03569   Median :-0.2999
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000
##  3rd Qu.: 0.3060   3rd Qu.: 0.60669   3rd Qu.: 0.4659
##  Max.   : 6.1748   Max.   : 5.00959   Max.   : 5.8797
##       Age          Outcome
##  Min.   :-1.0409   No :500
##  1st Qu.:-0.7858   Yes:268
##  Median :-0.3606
##  Mean   : 0.0000
##  3rd Qu.: 0.6598
##  Max.   : 4.0611

table(data.scale$Outcome)

##
##  No Yes
## 500 268

Split the scaled data set into train and test sets (7:3) for KNN and check the class proportions of each set

set.seed(123)
index <- caret::createDataPartition(y = data.scale$Outcome,
                                    p = 0.7,
                                    list = FALSE)

train <- data.scale[index, ]
test <- data.scale[-index, ]

# Check
train$Outcome %>% table() %>% prop.table()

## .
##        No       Yes
## 0.6505576 0.3494424

test$Outcome %>% table() %>% prop.table()

## .
##        No       Yes
## 0.6521739 0.3478261

Split the unscaled data set into train and test sets (7:3) for the other models and check the class proportions of each set

set.seed(123)
index1 <- caret::createDataPartition(y = data$Outcome,
                                     p = 0.7,
                                     list = FALSE)

# Use index1 here; with the same seed and labels it selects the same rows as index
train1 <- data[index1, ]
test1 <- data[-index1, ]

# Check
train1$Outcome %>% table() %>% prop.table()

## .
##        No       Yes
## 0.6505576 0.3494424

test1$Outcome %>% table() %>% prop.table()

## .
##        No       Yes
## 0.6521739 0.3478261
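The train and test proportions match because the partition is stratified by Outcome. The sketch below shows the idea in base R on a synthetic label vector with the same class counts (500 No, 268 Yes). It only illustrates stratified sampling and is not the internals of `caret::createDataPartition`; the object names are hypothetical.

```r
set.seed(123)

# Synthetic labels with the same class counts as the dataset.
outcome <- factor(rep(c("No", "Yes"), times = c(500, 268)))

# Sample about 70% of the row indices within each class separately.
train_idx <- unlist(
  lapply(split(seq_along(outcome), outcome), function(idx) {
    idx[sample.int(length(idx), ceiling(0.7 * length(idx)))]
  }),
  use.names = FALSE
)

train_y <- outcome[train_idx]
test_y  <- outcome[-train_idx]

# Class proportions stay close to the overall 65/35 split in both sets.
print(prop.table(table(train_y)))
print(prop.table(table(test_y)))
```

Sampling within each class keeps the minority class from being over- or under-represented in the test set by chance.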

KNN

To find the best tuning parameter k for KNN, build a grid of candidate values from 2 to 20 in steps of 1.

# library(class)

grid1 <- expand.grid(.k = seq(from = 2, to = 20, by = 1))
control <- caret::trainControl(method = "cv")
set.seed(123)
knn.train <- caret::train(Outcome ~ .,
                          data = train,
                          method = "knn",
                          trControl = control,
                          tuneGrid = grid1)

knn.train

## k-Nearest Neighbors
##
## 538 samples
##   8 predictor
##   2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 484, 485, 484, 484, 484, 484, ...
## Resampling results across tuning parameters:
##
##   k   Accuracy   Kappa
##    2  0.6653040  0.2522687
##    3  0.7135570  0.3525008
##    4  0.7044025  0.3239550
##    5  0.7284416  0.3718371
##    6  0.7266946  0.3691580
##    7  0.7396925  0.4016165
##    8  0.7358141  0.3896026
##    9  0.7395877  0.3995073
##   10  0.7432565  0.4118207
##   11  0.7451083  0.4113598
##   12  0.7619846  0.4462147
##   13  0.7712089  0.4679011
##   14  0.7675402  0.4588197
##   15  0.7563941  0.4330841
##   16  0.7674703  0.4589493
##   17  0.7600280  0.4380374
##   18  0.7786513  0.4798966
##   19  0.7561845  0.4311233
##   20  0.7544025  0.4236375
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 18.

• Best k = 18
• Accuracy = 0.779
• Kappa = 0.480

Apply this model to test set

knn.test <- class::knn(train = train[, -9],
                       test = test[, -9],
                       cl = train[, 9],
                       k = 18)

knn.real <- test$Outcome
caret::confusionMatrix(data = knn.test,
                       reference = knn.real,
                       positive = "Yes")

## Confusion Matrix and Statistics
##
##           Reference
## Prediction  No Yes
##        No  130  39
##        Yes  20  41
##
##                Accuracy : 0.7435
##                  95% CI : (0.6819, 0.7986)
##     No Information Rate : 0.6522
##     P-Value [Acc > NIR] : 0.001872
##
##                   Kappa : 0.4014
##
##  Mcnemar's Test P-Value : 0.019109
##
##             Sensitivity : 0.5125
##             Specificity : 0.8667
##          Pos Pred Value : 0.6721
##          Neg Pred Value : 0.7692
##              Prevalence : 0.3478
##          Detection Rate : 0.1783
##    Detection Prevalence : 0.2652
##       Balanced Accuracy : 0.6896
##
##        'Positive' Class : Yes
##

• Accuracy : 0.744
• Sensitivity : 0.513
• Specificity : 0.867
• Kappa : 0.401

Both accuracy and kappa are lower on the test set than in cross-validation on the train set (77.9% -> 74.4%, 48.0% -> 40.1%).

knn.predobj <- ROCR::prediction(predictions = as.numeric(x = knn.test),
                                labels = as.numeric(x = knn.real))

knn.perform <- ROCR::performance(prediction.obj = knn.predobj,
                                 measure = 'tpr',
                                 x.measure = 'fpr')

plot(x = knn.perform, main = 'ROC curve')

Figure: ROC curve (true positive rate against false positive rate, both from 0.0 to 1.0) for the KNN predictions on the test set.

MLmetrics::F1_Score(y_pred = knn.test,
                    y_true = knn.real,
                    positive = "Yes")
pROC::auc(response = as.numeric(x = knn.real),
          predictor = as.numeric(x = knn.test))

## [1] 0.5815603

## Area under the curve: 0.6896

• F1 Score : 0.582
• AUC : 0.690
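The reported metrics can be checked by hand from the four cells of the KNN confusion matrix above. The sketch below recomputes them in base R, with variable names chosen for illustration. It also shows why the AUC equals the balanced accuracy here: the predictor passed to the AUC function is a hard class label rather than a probability, so the ROC curve has a single interior point.

```r
# Cell counts from the KNN test-set confusion matrix (positive class = "Yes").
tp <- 41   # predicted Yes, actually Yes
fn <- 39   # predicted No,  actually Yes
fp <- 20   # predicted Yes, actually No
tn <- 130  # predicted No,  actually No
n  <- tp + fn + fp + tn

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)   # recall for the "Yes" class
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)   # positive predictive value
f1          <- 2 * precision * sensitivity / (precision + sensitivity)

# With hard class labels, the area under the ROC curve reduces to this value.
balanced_accuracy <- (sensitivity + specificity) / 2

# Cohen's kappa: agreement beyond what the marginal totals give by chance.
p_expected <- ((tp + fp) / n) * ((tp + fn) / n) +
              ((tn + fn) / n) * ((tn + fp) / n)
kappa <- (accuracy - p_expected) / (1 - p_expected)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, f1 = f1,
              balanced_accuracy = balanced_accuracy, kappa = kappa), 4))
```

A threshold-free AUC would need class probabilities or scores as the predictor instead of the predicted labels.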

Weighted KNN

Weighted kernels used: triangular and epanechnikov (the rectangular kernel is included as well for comparison)

# library(class)# library(kknn)

set.seed(123)
kknn.train <- kknn::train.kknn(formula = Outcome ~ .,
                               data = train,
                               kmax = 25,
                               distance = 2,
                               kernel = c("rectangular", "triangular", "epanechnikov"))

plot(kknn.train)
abline(h = 0.2286, col = "red")

Figure: misclassification rate against k (up to 25) for the rectangular, triangular and epanechnikov kernels, with a red horizontal line at 0.2286.

kknn.train

##
## Call:
## kknn::train.kknn(formula = Outcome ~ ., data = train, kmax = 25, distance = 2, kernel = c("rectangular", "triangular", "epanechnikov"))
##
## Type of response variable: nominal
## Minimal misclassification: 0.2286245
## Best kernel: rectangular
## Best k: 15

• Best k = 15
• Best Kernel = rectangular
• Minimal Error = 0.229 (0.771 Accuracy)

Apply this model to test set

kknn.pred <- predict(kknn.train, newdata = test)
kknn.real <- test$Outcome
caret::confusionMatrix(data = kknn.pred,
                       reference = kknn.real,
                       positive = "Yes")

## Confusion Matrix and Statistics
##
##           Reference
## Prediction  No Yes
##        No  129  38
##        Yes  21  42
##
##                Accuracy : 0.7435
##                  95% CI : (0.6819, 0.7986)
##     No Information Rate : 0.6522
##     P-Value [Acc > NIR] : 0.001872
##
##                   Kappa : 0.4051
##
##  Mcnemar's Test P-Value : 0.037249
##
##             Sensitivity : 0.5250
##             Specificity : 0.8600
##          Pos Pred Value : 0.6667
##          Neg Pred Value : 0.7725
##              Prevalence : 0.3478
##          Detection Rate : 0.1826
##    Detection Prevalence : 0.2739
##       Balanced Accuracy : 0.6925
##
##        'Positive' Class : Yes
##

• Accuracy : 0.744
• Sensitivity : 0.525
• Specificity : 0.860
• Kappa : 0.405

Accuracy on the test set is again lower than the accuracy estimated on the train set (77.1% -> 74.4%).

MLmetrics::F1_Score(y_pred = kknn.pred,
                    y_true = kknn.real,
                    positive = "Yes")
pROC::auc(response = as.numeric(x = kknn.real),
          predictor = as.numeric(x = kknn.pred))

## [1] 0.5874126


• F1 Score : 0.587
• AUC : 0.693

As a result, with this data, weighting the neighbours by distance during training does not significantly improve model performance.

Linear SVM

Store the test set's true Outcome labels in svm.real so that the predictions of the trained models can be compared against them.

svm.real <- test1$Outcome

# library(e1071)
set.seed(123)
linear.tune <- e1071::tune.svm(Outcome ~ .,
                               data = train1,
                               kernel = 'linear',
                               cost = c(0.001, 0.01, 0.1, 1, 5, 10))

summary(linear.tune)

##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
##  cost
##   0.1
##
## - best performance: 0.219427
##
## - Detailed performance results:
##    cost     error dispersion
## 1 1e-03 0.3493012 0.05066509
## 2 1e-02 0.2286513 0.04797243
## 3 1e-01 0.2194270 0.05047919
## 4 1e+00 0.2231307 0.06144298
## 5 5e+00 0.2230957 0.06190707
## 6 1e+01 0.2230957 0.06190707

• Best parameters (cost) : 0.1
• Best error rate : 0.219

Use test data set to predict

best.linear <- linear.tune$best.model
tune.test <- predict(best.linear, newdata = test1)
caret::confusionMatrix(data = tune.test,
                       reference = svm.real,
                       positive = 'Yes')

## Confusion Matrix and Statistics
##
##           Reference
## Prediction  No Yes
##        No  130  34
##        Yes  20  46
##
##                Accuracy : 0.7652
##                  95% CI : (0.705, 0.8184)
##     No Information Rate : 0.6522
##     P-Value [Acc > NIR] : 0.0001394
##
##                   Kappa : 0.4605
##
##  Mcnemar's Test P-Value : 0.0768812
##
##             Sensitivity : 0.5750
##             Specificity : 0.8667
##          Pos Pred Value : 0.6970
##          Neg Pred Value : 0.7927
##              Prevalence : 0.3478
##          Detection Rate : 0.2000
##    Detection Prevalence : 0.2870
##       Balanced Accuracy : 0.7208
##
##        'Positive' Class : Yes
##

• Accuracy : 0.765
• Sensitivity : 0.575
• Specificity : 0.867
• Kappa : 0.460

MLmetrics::F1_Score(y_pred = tune.test,
                    y_true = svm.real,
                    positive = "Yes")
pROC::auc(predictor = as.numeric(x = tune.test),
          response = as.numeric(x = svm.real))

## [1] 0.630137


• F1_Score : 0.630
• AUC : 0.721

Nonlinear SVM

Three kernels are used in nonlinear SVM: Polynomial, Radial Basis, and Sigmoid
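To make the tuning parameters concrete, the sketch below writes out the three kernel functions in the form given in the `e1071::svm` documentation, where `gamma`, `coef0` and `degree` are the parameters being tuned. The function names and the example vectors are hypothetical and serve only to show how each parameter enters the kernel.

```r
# Polynomial kernel: (gamma * <u, v> + coef0)^degree
poly_kernel <- function(u, v, gamma, coef0, degree) {
  (gamma * sum(u * v) + coef0)^degree
}

# Radial basis kernel: exp(-gamma * |u - v|^2)
rbf_kernel <- function(u, v, gamma) {
  exp(-gamma * sum((u - v)^2))
}

# Sigmoid kernel: tanh(gamma * <u, v> + coef0)
sigmoid_kernel <- function(u, v, gamma, coef0) {
  tanh(gamma * sum(u * v) + coef0)
}

# Two made-up observations with two features each.
u <- c(1, 2)
v <- c(3, 4)

k_poly    <- poly_kernel(u, v, gamma = 0.5, coef0 = 3, degree = 3)
k_rbf     <- rbf_kernel(u, v, gamma = 0.1)
k_sigmoid <- sigmoid_kernel(u, v, gamma = 0.5, coef0 = 0)

print(c(poly = k_poly, rbf = k_rbf, sigmoid = k_sigmoid))
```

A larger `gamma` makes the radial basis kernel decay faster with distance, so each training point influences a smaller neighbourhood.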

1. Polynomial

• polynomial degree : 3, 4, 5
• kernel coefficient (coef0) : 0.1, 0.5, 1, 2, 3, 4

set.seed(123)
poly.tune <- e1071::tune.svm(Outcome ~ .,
                             data = train1,
                             kernel = 'polynomial',
                             degree = c(3, 4, 5),
                             coef0 = c(0.1, 0.5, 1, 2, 3, 4))

summary(poly.tune)

##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
##  degree coef0
##       3     3
##
## - best performance: 0.2471349
##
## - Detailed performance results:
##    degree coef0     error dispersion
## 1       3   0.1 0.2657582 0.05376570
## 2       4   0.1 0.2824598 0.04826251
## 3       5   0.1 0.2935360 0.05667382
## 4       3   0.5 0.2472397 0.04302385
## 5       4   0.5 0.2601677 0.03112229
## 6       5   0.5 0.2731656 0.05011586
## 7       3   1.0 0.2564990 0.04257450
## 8       4   1.0 0.2693920 0.03763831
## 9       5   1.0 0.2806429 0.04061729
## 10      3   2.0 0.2508735 0.03694279
## 11      4   2.0 0.2750874 0.02886593
## 12      5   2.0 0.2993012 0.05806619
## 13      3   3.0 0.2471349 0.03870198
## 14      4   3.0 0.2750874 0.03467953
## 15      5   3.0 0.3048917 0.06606766
## 16      3   4.0 0.2508735 0.03694279
## 17      4   4.0 0.2713836 0.04226771
## 18      5   4.0 0.3067086 0.06479947

• Best parameters : degree = 3, coef0 = 3
• Best error rate : 0.247


best.poly <- poly.tune$best.model
poly.test <- predict(best.poly, newdata = test1)
caret::confusionMatrix(data = poly.test,
                       reference = svm.real,
                       positive = "Yes")


## Confusion Matrix and Statistics
##
##           Reference
## Prediction  No Yes
##        No  130  30
##        Yes  20  50
##
##                Accuracy : 0.7826
##                  95% CI : (0.7236, 0.8341)
##     No Information Rate : 0.6522
##     P-Value [Acc > NIR] : 1.156e-05
##
##                   Kappa : 0.5064
##
##  Mcnemar's Test P-Value : 0.2031
##
##             Sensitivity : 0.6250
##             Specificity : 0.8667
##          Pos Pred Value : 0.7143
##          Neg Pred Value : 0.8125
##              Prevalence : 0.3478
##          Detection Rate : 0.2174
##    Detection Prevalence : 0.3043
##       Balanced Accuracy : 0.7458
##
##        'Positive' Class : Yes
##


MLmetrics::F1_Score(y_pred = poly.test,
                    y_true = svm.real,
                    positive = "Yes")
pROC::auc(predictor = as.numeric(x = poly.test),
          response = as.numeric(x = svm.real))


## [1] 0.6666667


• F1_Score : 0.667
• AUC : 0.746

2. Radial Basis

• gamma : 0.1, 0.5, 1, 2, 3, 4

set.seed(123)
rbf.tune <- e1071::tune.svm(Outcome ~ .,
                            data = train1,
                            kernel = 'radial',
                            gamma = c(0.1, 0.5, 1, 2, 3, 4))

summary(rbf.tune)

##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
##  gamma
##    0.1
##
## - best performance: 0.2473445
##
## - Detailed performance results:
##   gamma     error dispersion
## 1   0.1 0.2473445 0.04970510
## 2   0.5 0.2751223 0.04198407
## 3   1.0 0.3141859 0.05103077
## 4   2.0 0.3474843 0.05043974
## 5   3.0 0.3474843 0.05118960
## 6   4.0 0.3493012 0.05066509

• Best parameters (gamma) : 0.1
• Best error rate : 0.247


best.rbf <- rbf.tune$best.model
rbf.test <- predict(best.rbf, newdata = test1)
caret::confusionMatrix(data = rbf.test,
                       reference = svm.real,
                       positive = "Yes")


## Confusion Matrix and Statistics
##
##           Reference
## Prediction  No Yes
##        No  129  32
##        Yes  21  48
##
##                Accuracy : 0.7696
##                  95% CI : (0.7097, 0.8224)
##     No Information Rate : 0.6522
##     P-Value [Acc > NIR] : 7.748e-05
##
##                   Kappa : 0.4752
##
##  Mcnemar's Test P-Value : 0.1696
##
##             Sensitivity : 0.6000
##             Specificity : 0.8600
##          Pos Pred Value : 0.6957
##          Neg Pred Value : 0.8012
##              Prevalence : 0.3478
##          Detection Rate : 0.2087
##    Detection Prevalence : 0.3000
##       Balanced Accuracy : 0.7300
##
##        'Positive' Class : Yes
##



MLmetrics::F1_Score(y_pred = rbf.test,
                    y_true = svm.real,
                    positive = "Yes")
pROC::auc(predictor = as.numeric(x = rbf.test),
          response = as.numeric(x = svm.real))


## [1] 0.6442953


• F1_Score : 0.644
• AUC : 0.730

3. Sigmoid

• gamma : 0.1, 0.5, 1, 2, 3, 4
• kernel coefficient (coef0) : 0.1, 0.5, 1, 2, 3, 4

set.seed(123)
sigmoid.tune <- e1071::tune.svm(Outcome ~ .,
                                data = train1,
                                kernel = 'sigmoid',
                                gamma = c(0.1, 0.5, 1, 2, 3, 4),
                                coef0 = c(0.1, 0.5, 1, 2, 3, 4))

summary(sigmoid.tune)

##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
##  gamma coef0
##    0.1     2
##
## - best performance: 0.2248777
##
## - Detailed performance results:
##    gamma coef0     error dispersion
## 1    0.1   0.1 0.2602376 0.06427357
## 2    0.5   0.1 0.3381551 0.04324919
## 3    1.0   0.1 0.3437456 0.03814373
## 4    2.0   0.1 0.3622642 0.05621633
## 5    3.0   0.1 0.3603774 0.06885972
## 6    4.0   0.1 0.3603424 0.06590605
## 7    0.1   0.5 0.2991265 0.08565041
## 8    0.5   0.5 0.3140811 0.03378694
## 9    1.0   0.5 0.3455975 0.03701595
## 10   2.0   0.5 0.3621943 0.05450888
## 11   3.0   0.5 0.3751572 0.07258214
## 12   4.0   0.5 0.3751572 0.07716241
## 13   0.1   1.0 0.3045423 0.07752164
## 14   0.5   1.0 0.3289308 0.04890969
## 15   1.0   1.0 0.3547519 0.07719423
## 16   2.0   1.0 0.3641509 0.05061305
## 17   3.0   1.0 0.3714885 0.05898717
## 18   4.0   1.0 0.3677149 0.07815268
## 19   0.1   2.0 0.2248777 0.05255027
## 20   0.5   2.0 0.3343816 0.06232761
## 21   1.0   2.0 0.3809224 0.04189985
## 22   2.0   2.0 0.3678546 0.04947844
## 23   3.0   2.0 0.3640461 0.07139168
## 24   4.0   2.0 0.3696716 0.05947753
## 25   0.1   3.0 0.3493012 0.05066509
## 26   0.5   3.0 0.3923829 0.07090637
## 27   1.0   3.0 0.3659679 0.07323244
## 28   2.0   3.0 0.3937456 0.06471812
## 29   3.0   3.0 0.3585954 0.04914676
## 30   4.0   3.0 0.3585255 0.06827537
## 31   0.1   4.0 0.3493012 0.05066509
## 32   0.5   4.0 0.3848707 0.07178028
## 33   1.0   4.0 0.3639413 0.07919411
## 34   2.0   4.0 0.3529350 0.05726504
## 35   3.0   4.0 0.3584906 0.05978758
## 36   4.0   4.0 0.3436408 0.06669583

• Best parameters : gamma = 0.1, coef0 = 2
• Best error rate : 0.225


best.sigmoid <- sigmoid.tune$best.model
sigmoid.test <- predict(best.sigmoid, newdata = test1)
caret::confusionMatrix(data = sigmoid.test,
                       reference = svm.real,
                       positive = "Yes")


## Confusion Matrix and Statistics
##
##           Reference
## Prediction  No Yes
##        No  125  32
##        Yes  25  48
##
##                Accuracy : 0.7522
##                  95% CI : (0.6912, 0.8066)
##     No Information Rate : 0.6522
##     P-Value [Acc > NIR] : 0.0007075
##
##                   Kappa : 0.4424
##
##  Mcnemar's Test P-Value : 0.4267767
##
##             Sensitivity : 0.6000
##             Specificity : 0.8333
##          Pos Pred Value : 0.6575
##          Neg Pred Value : 0.7962
##              Prevalence : 0.3478
##          Detection Rate : 0.2087
##    Detection Prevalence : 0.3174
##       Balanced Accuracy : 0.7167
##
##        'Positive' Class : Yes
##


MLmetrics::F1_Score(y_pred = sigmoid.test,
                    y_true = svm.real,
                    positive = "Yes")
pROC::auc(predictor = as.numeric(x = sigmoid.test),
          response = as.numeric(x = svm.real))


## [1] 0.627451


• F1_Score : 0.627
• AUC : 0.717
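To compare the SVM variants side by side, the test-set figures reported above can be collected in one table. This is only a sketch: the data frame `svm_results` is a hypothetical helper, its values are copied from the outputs in this section, and the first row is assumed to be the linear-kernel model evaluated just before the nonlinear kernels.

```r
# Test-set metrics copied from the outputs above (not recomputed here).
svm_results <- data.frame(
  kernel      = c("linear", "polynomial", "radial", "sigmoid"),
  accuracy    = c(0.7652, 0.7826, 0.7696, 0.7522),
  sensitivity = c(0.5750, 0.6250, 0.6000, 0.6000),
  specificity = c(0.8667, 0.8667, 0.8600, 0.8333),
  f1          = c(0.630, 0.667, 0.644, 0.627),
  auc         = c(0.721, 0.746, 0.730, 0.717),
  stringsAsFactors = FALSE
)

# Rank the kernels by test accuracy, best first.
svm_ranked  <- svm_results[order(-svm_results$accuracy), ]
best_kernel <- svm_ranked$kernel[1]
svm_ranked
```

On these figures the polynomial kernel leads on every metric. The sigmoid kernel ranks last on the test set even though it had the lowest cross-validated error (0.225), so the cross-validation ranking does not carry over to the test set here.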

Random Forest

Grid search for the best parameters. (Because tuning can take a long time on large datasets, it is advisable to tune on a smaller sample of the data.)

• ntree = 100, 300, 500, 700, 1000
• mtry = 3, 4, 5, 6, 7

rf.grid <- expand.grid(ntree = c(100, 300, 500, 700, 1000),
                       mtry = c(3, 4, 5, 6, 7))

rf.tuned <- data.frame()
for (i in 1:nrow(rf.grid)) {
  set.seed(123)
  fit <- randomForest(Outcome ~ .,
                      data = train1,
                      xtest = test1[, -9],
                      ytest = test1[, 9],
                      ntree = rf.grid[i, 'ntree'],
                      mtry = rf.grid[i, 'mtry'],
                      importance = TRUE,
                      do.trace = 100,
                      keep.forest = TRUE)
  # fit$predicted holds the out-of-bag predictions, so this is the OOB
  # misclassification rate on the training set
  mcSum <- sum(fit$predicted != train1$Outcome)
  mcRate <- mcSum / nrow(train1)
  df <- data.frame(index = i, misClassRate = mcRate)
  rf.tuned <- rbind(rf.tuned, df)
  cat('\n')
}

## ntree OOB 1 2| Test 1 2## 100: 25.28% 14.86% 44.68%| 26.52% 15.33% 47.50%#### ntree OOB 1 2| Test 1 2## 100: 25.28% 14.86% 44.68%| 26.52% 15.33% 47.50%## 200: 24.54% 14.57% 43.09%| 24.35% 14.00% 43.75%## 300: 24.16% 14.00% 43.09%| 23.91% 14.00% 42.50%#### ntree OOB 1 2| Test 1 2## 100: 25.28% 14.86% 44.68%| 26.52% 15.33% 47.50%## 200: 24.54% 14.57% 43.09%| 24.35% 14.00% 43.75%## 300: 24.16% 14.00% 43.09%| 23.91% 14.00% 42.50%## 400: 24.91% 14.86% 43.62%| 24.78% 14.67% 43.75%## 500: 24.72% 15.14% 42.55%| 23.48% 14.00% 41.25%#### ntree OOB 1 2| Test 1 2## 100: 25.28% 14.86% 44.68%| 26.52% 15.33% 47.50%## 200: 24.54% 14.57% 43.09%| 24.35% 14.00% 43.75%## 300: 24.16% 14.00% 43.09%| 23.91% 14.00% 42.50%## 400: 24.91% 14.86% 43.62%| 24.78% 14.67% 43.75%## 500: 24.72% 15.14% 42.55%| 23.48% 14.00% 41.25%## 600: 25.65% 15.71% 44.15%| 23.48% 14.00% 41.25%## 700: 25.84% 16.00% 44.15%| 23.48% 13.33% 42.50%#### ntree OOB 1 2| Test 1 2## 100: 25.28% 14.86% 44.68%| 26.52% 15.33% 47.50%## 200: 24.54% 14.57% 43.09%| 24.35% 14.00% 43.75%## 300: 24.16% 14.00% 43.09%| 23.91% 14.00% 42.50%## 400: 24.91% 14.86% 43.62%| 24.78% 14.67% 43.75%## 500: 24.72% 15.14% 42.55%| 23.48% 14.00% 41.25%## 600: 25.65% 15.71% 44.15%| 23.48% 14.00% 41.25%## 700: 25.84% 16.00% 44.15%| 23.48% 13.33% 42.50%## 800: 26.02% 16.29% 44.15%| 23.04% 13.33% 41.25%## 900: 26.02% 16.29% 44.15%| 23.48% 14.00% 41.25%## 1000: 25.65% 15.71% 44.15%| 23.91% 14.67% 41.25%#### ntree OOB 1 2| Test 1 2## 100: 25.46% 17.14% 40.96%| 23.04% 14.67% 38.75%#### ntree OOB 1 2| Test 1 2## 100: 25.46% 17.14% 40.96%| 23.04% 14.67% 38.75%## 200: 24.91% 16.86% 39.89%| 23.91% 15.33% 40.00%## 300: 25.28% 16.00% 42.55%| 23.48% 14.67% 40.00%##


## ntree OOB 1 2| Test 1 2## 100: 25.46% 17.14% 40.96%| 23.04% 14.67% 38.75%## 200: 24.91% 16.86% 39.89%| 23.91% 15.33% 40.00%## 300: 25.28% 16.00% 42.55%| 23.48% 14.67% 40.00%## 400: 25.28% 16.29% 42.02%| 23.04% 13.33% 41.25%## 500: 24.91% 16.00% 41.49%| 23.04% 14.00% 40.00%#### ntree OOB 1 2| Test 1 2## 100: 25.46% 17.14% 40.96%| 23.04% 14.67% 38.75%## 200: 24.91% 16.86% 39.89%| 23.91% 15.33% 40.00%## 300: 25.28% 16.00% 42.55%| 23.48% 14.67% 40.00%## 400: 25.28% 16.29% 42.02%| 23.04% 13.33% 41.25%## 500: 24.91% 16.00% 41.49%| 23.04% 14.00% 40.00%## 600: 25.09% 15.43% 43.09%| 22.61% 13.33% 40.00%## 700: 24.91% 15.71% 42.02%| 23.04% 14.67% 38.75%#### ntree OOB 1 2| Test 1 2## 100: 25.46% 17.14% 40.96%| 23.04% 14.67% 38.75%## 200: 24.91% 16.86% 39.89%| 23.91% 15.33% 40.00%## 300: 25.28% 16.00% 42.55%| 23.48% 14.67% 40.00%## 400: 25.28% 16.29% 42.02%| 23.04% 13.33% 41.25%## 500: 24.91% 16.00% 41.49%| 23.04% 14.00% 40.00%## 600: 25.09% 15.43% 43.09%| 22.61% 13.33% 40.00%## 700: 24.91% 15.71% 42.02%| 23.04% 14.67% 38.75%## 800: 25.28% 15.71% 43.09%| 23.04% 14.00% 40.00%## 900: 25.28% 16.00% 42.55%| 23.04% 14.00% 40.00%## 1000: 25.28% 15.71% 43.09%| 23.04% 14.00% 40.00%#### ntree OOB 1 2| Test 1 2## 100: 24.91% 14.29% 44.68%| 21.30% 13.33% 36.25%#### ntree OOB 1 2| Test 1 2## 100: 24.91% 14.29% 44.68%| 21.30% 13.33% 36.25%## 200: 25.28% 16.00% 42.55%| 23.04% 14.67% 38.75%## 300: 25.28% 15.43% 43.62%| 23.04% 14.67% 38.75%#### ntree OOB 1 2| Test 1 2## 100: 24.91% 14.29% 44.68%| 21.30% 13.33% 36.25%## 200: 25.28% 16.00% 42.55%| 23.04% 14.67% 38.75%## 300: 25.28% 15.43% 43.62%| 23.04% 14.67% 38.75%## 400: 25.28% 16.29% 42.02%| 23.48% 15.33% 38.75%## 500: 25.46% 16.29% 42.55%| 23.48% 14.67% 40.00%#### ntree OOB 1 2| Test 1 2## 100: 24.91% 14.29% 44.68%| 21.30% 13.33% 36.25%## 200: 25.28% 16.00% 42.55%| 23.04% 14.67% 38.75%## 300: 25.28% 15.43% 43.62%| 23.04% 14.67% 38.75%## 400: 25.28% 16.29% 42.02%| 23.48% 15.33% 38.75%## 500: 25.46% 16.29% 42.55%| 23.48% 14.67% 40.00%## 
600: 25.65% 16.29% 43.09%| 23.04% 14.00% 40.00%## 700: 25.28% 16.57% 41.49%| 23.48% 14.00% 41.25%#### ntree OOB 1 2| Test 1 2## 100: 24.91% 14.29% 44.68%| 21.30% 13.33% 36.25%


## 200: 25.28% 16.00% 42.55%| 23.04% 14.67% 38.75%## 300: 25.28% 15.43% 43.62%| 23.04% 14.67% 38.75%## 400: 25.28% 16.29% 42.02%| 23.48% 15.33% 38.75%## 500: 25.46% 16.29% 42.55%| 23.48% 14.67% 40.00%## 600: 25.65% 16.29% 43.09%| 23.04% 14.00% 40.00%## 700: 25.28% 16.57% 41.49%| 23.48% 14.00% 41.25%## 800: 25.09% 16.00% 42.02%| 23.48% 14.00% 41.25%## 900: 25.09% 15.71% 42.55%| 23.48% 14.00% 41.25%## 1000: 25.09% 16.29% 41.49%| 23.48% 14.00% 41.25%#### ntree OOB 1 2| Test 1 2## 100: 25.65% 17.43% 40.96%| 23.48% 12.67% 43.75%#### ntree OOB 1 2| Test 1 2## 100: 25.65% 17.43% 40.96%| 23.48% 12.67% 43.75%## 200: 24.35% 15.43% 40.96%| 23.48% 14.00% 41.25%## 300: 25.09% 16.29% 41.49%| 23.04% 14.00% 40.00%#### ntree OOB 1 2| Test 1 2## 100: 25.65% 17.43% 40.96%| 23.48% 12.67% 43.75%## 200: 24.35% 15.43% 40.96%| 23.48% 14.00% 41.25%## 300: 25.09% 16.29% 41.49%| 23.04% 14.00% 40.00%## 400: 24.91% 15.71% 42.02%| 23.04% 13.33% 41.25%## 500: 24.54% 15.14% 42.02%| 23.04% 13.33% 41.25%#### ntree OOB 1 2| Test 1 2## 100: 25.65% 17.43% 40.96%| 23.48% 12.67% 43.75%## 200: 24.35% 15.43% 40.96%| 23.48% 14.00% 41.25%## 300: 25.09% 16.29% 41.49%| 23.04% 14.00% 40.00%## 400: 24.91% 15.71% 42.02%| 23.04% 13.33% 41.25%## 500: 24.54% 15.14% 42.02%| 23.04% 13.33% 41.25%## 600: 25.46% 16.00% 43.09%| 22.61% 12.67% 41.25%## 700: 25.46% 16.00% 43.09%| 22.61% 13.33% 40.00%#### ntree OOB 1 2| Test 1 2## 100: 25.65% 17.43% 40.96%| 23.48% 12.67% 43.75%## 200: 24.35% 15.43% 40.96%| 23.48% 14.00% 41.25%## 300: 25.09% 16.29% 41.49%| 23.04% 14.00% 40.00%## 400: 24.91% 15.71% 42.02%| 23.04% 13.33% 41.25%## 500: 24.54% 15.14% 42.02%| 23.04% 13.33% 41.25%## 600: 25.46% 16.00% 43.09%| 22.61% 12.67% 41.25%## 700: 25.46% 16.00% 43.09%| 22.61% 13.33% 40.00%## 800: 24.91% 15.71% 42.02%| 23.48% 14.67% 40.00%## 900: 25.09% 16.00% 42.02%| 23.04% 14.00% 40.00%## 1000: 24.91% 16.00% 41.49%| 22.17% 12.67% 40.00%#### ntree OOB 1 2| Test 1 2## 100: 26.77% 17.43% 44.15%| 21.74% 14.00% 36.25%#### ntree OOB 1 2| Test 1 
2## 100: 26.77% 17.43% 44.15%| 21.74% 14.00% 36.25%## 200: 25.65% 16.29% 43.09%| 22.61% 14.67% 37.50%## 300: 25.84% 16.86% 42.55%| 22.61% 14.00% 38.75%##


## ntree OOB 1 2| Test 1 2## 100: 26.77% 17.43% 44.15%| 21.74% 14.00% 36.25%## 200: 25.65% 16.29% 43.09%| 22.61% 14.67% 37.50%## 300: 25.84% 16.86% 42.55%| 22.61% 14.00% 38.75%## 400: 26.21% 16.86% 43.62%| 22.17% 14.67% 36.25%## 500: 25.65% 16.86% 42.02%| 22.17% 14.00% 37.50%#### ntree OOB 1 2| Test 1 2## 100: 26.77% 17.43% 44.15%| 21.74% 14.00% 36.25%## 200: 25.65% 16.29% 43.09%| 22.61% 14.67% 37.50%## 300: 25.84% 16.86% 42.55%| 22.61% 14.00% 38.75%## 400: 26.21% 16.86% 43.62%| 22.17% 14.67% 36.25%## 500: 25.65% 16.86% 42.02%| 22.17% 14.00% 37.50%## 600: 26.02% 17.43% 42.02%| 22.17% 14.00% 37.50%## 700: 25.46% 17.14% 40.96%| 22.61% 14.67% 37.50%#### ntree OOB 1 2| Test 1 2## 100: 26.77% 17.43% 44.15%| 21.74% 14.00% 36.25%## 200: 25.65% 16.29% 43.09%| 22.61% 14.67% 37.50%## 300: 25.84% 16.86% 42.55%| 22.61% 14.00% 38.75%## 400: 26.21% 16.86% 43.62%| 22.17% 14.67% 36.25%## 500: 25.65% 16.86% 42.02%| 22.17% 14.00% 37.50%## 600: 26.02% 17.43% 42.02%| 22.17% 14.00% 37.50%## 700: 25.46% 17.14% 40.96%| 22.61% 14.67% 37.50%## 800: 25.84% 17.43% 41.49%| 22.61% 14.67% 37.50%## 900: 25.46% 17.14% 40.96%| 22.61% 14.00% 38.75%## 1000: 25.09% 16.86% 40.43%| 22.61% 14.00% 38.75%

plot(x = rf.tuned, xlab = '', ylab = 'MisClassification Rate')
abline(h = min(rf.tuned$misClassRate), col = 'red', lty = 2)

[Figure: MisClassification Rate (y-axis) for each of the 25 grid combinations (x-axis, index 1 to 25); the dashed red line marks the minimum.]

min <- which(rf.tuned$misClassRate == min(rf.tuned$misClassRate))
bestPara <- rf.grid[min, ]
bestPara

##   ntree mtry
## 2   300    3

• Best parameters : ntree = 300, mtry = 3
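The row label `2` in the output refers to the position in the tuning grid. A short sketch, rebuilding the same grid with `expand.grid` under the name `grid_demo` (a hypothetical name chosen so as not to overwrite `rf.grid`), shows how that index maps back to a parameter pair.

```r
# Same grid as above: the first argument varies fastest, so rows 1 to 5
# hold mtry = 3 with ntree = 100, 300, 500, 700, 1000.
grid_demo <- expand.grid(ntree = c(100, 300, 500, 700, 1000),
                         mtry  = c(3, 4, 5, 6, 7))

n_combinations <- nrow(grid_demo)   # 5 x 5 combinations
best_row       <- grid_demo[2, ]    # the row selected by the search
best_row
```

Note that `which(x == min(x))` returns every index that ties for the minimum, so `bestPara` could contain more than one row on other data; `which.min(x)` would return only the first.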

# library(randomForest)
set.seed(123)
rf.train <- randomForest::randomForest(Outcome ~ .,
                                       data = train1,
                                       xtest = test1[, -9],
                                       ytest = test1[, 9],
                                       ntree = bestPara$ntree,
                                       mtry = bestPara$mtry,
                                       importance = TRUE,
                                       do.trace = 100,
                                       keep.forest = TRUE)

## ntree      OOB      1      2|    Test      1      2
##   100:  25.28% 14.86% 44.68%|  26.52% 15.33% 47.50%
##   200:  24.54% 14.57% 43.09%|  24.35% 14.00% 43.75%
##   300:  24.16% 14.00% 43.09%|  23.91% 14.00% 42.50%

rf.train

##
## Call:
##  randomForest(formula = Outcome ~ ., data = train1, xtest = test1[, -9], ytest = test1[, 9], ntree = bestPara$ntree, mtry = bestPara$mtry, importance = TRUE, do.trace = 100, keep.forest = TRUE)
##                Type of random forest: classification
##                      Number of trees: 300
## No. of variables tried at each split: 3
##
##         OOB estimate of  error rate: 24.16%
## Confusion matrix:
##      No Yes class.error
## No  301  49   0.1400000
## Yes  81 107   0.4308511
##                 Test set error rate: 23.91%
## Confusion matrix:
##      No Yes class.error
## No  129  21       0.140
## Yes  34  46       0.425

• OOB error rate (training set) : 0.242
• Test set error rate : 0.239
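Both error rates can be recovered from the two confusion matrices printed by `rf.train`. The sketch below uses base R only; the matrix names `oob_cm` and `test_cm` and the helper `error_rate` are illustrative.

```r
# Confusion matrices copied from the rf.train output above
# (rows = true class, columns = predicted class).
oob_cm  <- matrix(c(301, 49,
                     81, 107), nrow = 2, byrow = TRUE,
                  dimnames = list(c("No", "Yes"), c("No", "Yes")))
test_cm <- matrix(c(129, 21,
                     34, 46), nrow = 2, byrow = TRUE,
                  dimnames = list(c("No", "Yes"), c("No", "Yes")))

# Overall error = off-diagonal count divided by the total count.
error_rate <- function(cm) 1 - sum(diag(cm)) / sum(cm)

oob_error  <- error_rate(oob_cm)    # (49 + 81) / 538
test_error <- error_rate(test_cm)   # (21 + 34) / 230

# Per-class error = misclassified share within each true class (row).
oob_class_error <- 1 - diag(oob_cm) / rowSums(oob_cm)

round(c(oob = oob_error, test = test_error), 4)
```

The per-class errors show that the "Yes" class is misclassified far more often (about 43%) than the "No" class (14%).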

plot(rf.train)

[Figure: plot of rf.train; Error (y-axis) against number of trees (x-axis, 0 to 300).]

plot(rf.train$err.rate[, 1], type = 'l')

[Figure: OOB error rate, rf.train$err.rate[, 1] (y-axis), against tree index (x-axis, 0 to 300).]


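rf.test <- rf.train$test$predicted
rf.real <- test1$Outcome
# Note: caret expects `data` = predictions and `reference` = true labels.
# Here the two are swapped, so the Sensitivity and Pos Pred Value reported
# below are interchanged (likewise Specificity and Neg Pred Value).
caret::confusionMatrix(data = rf.real,
                       reference = rf.test,
                       positive = "Yes")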

## Confusion Matrix and Statistics
##
##           Reference
## Prediction  No Yes
##        No  129  21
##        Yes  34  46
##
##                Accuracy : 0.7609
##                  95% CI : (0.7004, 0.8145)
##     No Information Rate : 0.7087
##     P-Value [Acc > NIR] : 0.04559
##
##                   Kappa : 0.4521
##
##  Mcnemar's Test P-Value : 0.10565
##
##             Sensitivity : 0.6866
##             Specificity : 0.7914
##          Pos Pred Value : 0.5750
##          Neg Pred Value : 0.8600
##              Prevalence : 0.2913
##          Detection Rate : 0.2000
##    Detection Prevalence : 0.3478
##       Balanced Accuracy : 0.7390
##
##        'Positive' Class : Yes
##


MLmetrics::F1_Score(y_pred = rf.real,
                    y_true = rf.test,
                    positive = "Yes")
pROC::auc(predictor = as.numeric(x = rf.real),
          response = as.numeric(x = rf.test))

## [1] 0.6258503


• F1_Score : 0.626
• AUC : 0.739
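In the random forest calls above, the true labels (`rf.real`) are passed where the predictions are expected and vice versa, unlike in the SVM sections. The sketch below recomputes the metrics from the test-set confusion matrix with the roles assigned the usual way (true labels as reference). It uses base R and illustrative variable names; the counts are taken from the `rf.train` output.

```r
# Test-set counts from rf.train: true No -> 129 predicted No, 21 predicted Yes;
# true Yes -> 34 predicted No, 46 predicted Yes.
tn <- 129; fp <- 21
fn <- 34;  tp <- 46

sensitivity <- tp / (tp + fn)   # detected share of true "Yes" cases
specificity <- tn / (tn + fp)   # detected share of true "No" cases
ppv         <- tp / (tp + fp)   # precision
npv         <- tn / (tn + fn)

f1           <- 2 * ppv * sensitivity / (ppv + sensitivity)
balanced_acc <- (sensitivity + specificity) / 2   # equals AUC for hard labels

round(c(sensitivity = sensitivity, specificity = specificity,
        ppv = ppv, npv = npv, f1 = f1, balanced_acc = balanced_acc), 4)
```

Accuracy, kappa, and F1 are unaffected by the swap, because they are symmetric in predictions and labels. Sensitivity and positive predictive value trade places, however, and the balanced accuracy (and hence the label-based AUC) comes out at about 0.718 rather than 0.739.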

Variable Importance

Both importance measures are reported as means because each variable's contribution is averaged over all trees in the forest: its contribution to accuracy and its contribution to reducing Gini impurity.

• MeanDecreaseAccuracy
• MeanDecreaseGini

rf.train %>% treesize(terminal = TRUE) %>% hist(main = 'Number of Terminal Nodes')

[Figure: histogram titled "Number of Terminal Nodes"; x-axis: terminal nodes per tree (about 60 to 100), y-axis: Frequency.]

randomForest::varImpPlot(rf.train, main = "Variable Importance Plot")
randomForest::importance(rf.train)

[Figure: Variable Importance Plot with two panels ranking the predictors by MeanDecreaseAccuracy and by MeanDecreaseGini; Glucose ranks highest in both panels.]

##                                 No       Yes MeanDecreaseAccuracy
## Pregnancies              14.983572 -1.074802            13.519390
## Glucose                  26.224744 25.567881            35.470165
## BloodPressure             3.180638 -0.334135             2.153196
## SkinThickness             3.933890  2.077106             4.419072
## Insulin                   5.753626  1.214205             5.366615
## BMI                       7.234006 14.228060            14.997752
## DiabetesPedigreeFunction  5.325098  5.737122             7.278320
## Age                      15.988667  8.236952            18.590614
##                          MeanDecreaseGini
## Pregnancies                      16.08648
## Glucose                          61.03068
## BloodPressure                    18.42026
## SkinThickness                    20.33625
## Insulin                          28.43677
## BMI                              35.81051
## DiabetesPedigreeFunction         29.40971
## Age                              34.10992

• MeanDecreaseAccuracy : Glucose > Age > BMI
• MeanDecreaseGini : Glucose > BMI > Age
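The two rankings can be reproduced by sorting the columns of the importance table. The sketch below copies the values from the output above into two named vectors (`mda` and `gini` are illustrative names) instead of calling `randomForest::importance` again.

```r
# Values copied from the importance table above.
vars <- c("Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
          "Insulin", "BMI", "DiabetesPedigreeFunction", "Age")
mda  <- setNames(c(13.519390, 35.470165, 2.153196, 4.419072,
                   5.366615, 14.997752, 7.278320, 18.590614), vars)
gini <- setNames(c(16.08648, 61.03068, 18.42026, 20.33625,
                   28.43677, 35.81051, 29.40971, 34.10992), vars)

# Top three predictors under each measure.
top_mda  <- names(sort(mda,  decreasing = TRUE))[1:3]
top_gini <- names(sort(gini, decreasing = TRUE))[1:3]

list(MeanDecreaseAccuracy = top_mda, MeanDecreaseGini = top_gini)
```

The two measures agree on the top three variables and differ only in the order of Age and BMI.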


XGBoost

Grid search for the best parameters. (Because tuning can take a long time on large datasets, it is advisable to tune on a smaller sample of the data.)

xg.grid <- expand.grid(nrounds = c(75, 100),
                       colsample_bytree = 1,
                       min_child_weight = 1,
                       eta = c(0.01, 0.1, 0.3), # 0.3 is the default value
                       gamma = c(0.5, 0.25),
                       subsample = 0.5,
                       max_depth = c(2, 3))

cntrl <- caret::trainControl(method = "cv",
                             number = 5, # 5-fold cross-validation
                             verboseIter = TRUE,
                             returnData = FALSE,
                             returnResamp = "final")

set.seed(123)
train.xgb <- caret::train(x = train1[, 1:8],
                          y = train1[, 9],
                          trControl = cntrl,
                          tuneGrid = xg.grid,
                          method = "xgbTree")

## + Fold1: eta=0.01, max_depth=2, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold1: eta=0.01, max_depth=2, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold1: eta=0.01, max_depth=2, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold1: eta=0.01, max_depth=2, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold1: eta=0.01, max_depth=3, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold1: eta=0.01, max_depth=3, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold1: eta=0.01, max_depth=3, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold1: eta=0.01, max_depth=3, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold1: eta=0.10, max_depth=2, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold1: eta=0.10, max_depth=2, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold1: eta=0.10, max_depth=2, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold1: eta=0.10, max_depth=2, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold1: eta=0.10, max_depth=3, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold1: eta=0.10, max_depth=3, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold1: eta=0.10, max_depth=3, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold1: eta=0.10, max_depth=3, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold1: eta=0.30, max_depth=2, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold1: eta=0.30, max_depth=2, gamma=0.25, colsample_bytree=1, 
min_child_weight=1, subsample=0.5, nrounds=100## + Fold1: eta=0.30, max_depth=2, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold1: eta=0.30, max_depth=2, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold1: eta=0.30, max_depth=3, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold1: eta=0.30, max_depth=3, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold1: eta=0.30, max_depth=3, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold1: eta=0.30, max_depth=3, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold2: eta=0.01, max_depth=2, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold2: eta=0.01, max_depth=2, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold2: eta=0.01, max_depth=2, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100


## - Fold2: eta=0.01, max_depth=2, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold2: eta=0.01, max_depth=3, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold2: eta=0.01, max_depth=3, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold2: eta=0.01, max_depth=3, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold2: eta=0.01, max_depth=3, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold2: eta=0.10, max_depth=2, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold2: eta=0.10, max_depth=2, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold2: eta=0.10, max_depth=2, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold2: eta=0.10, max_depth=2, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold2: eta=0.10, max_depth=3, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold2: eta=0.10, max_depth=3, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold2: eta=0.10, max_depth=3, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold2: eta=0.10, max_depth=3, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold2: eta=0.30, max_depth=2, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold2: eta=0.30, max_depth=2, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold2: eta=0.30, max_depth=2, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold2: eta=0.30, max_depth=2, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold2: eta=0.30, max_depth=3, gamma=0.25, colsample_bytree=1, 
min_child_weight=1, subsample=0.5, nrounds=100## - Fold2: eta=0.30, max_depth=3, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold2: eta=0.30, max_depth=3, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold2: eta=0.30, max_depth=3, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold3: eta=0.01, max_depth=2, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold3: eta=0.01, max_depth=2, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold3: eta=0.01, max_depth=2, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold3: eta=0.01, max_depth=2, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold3: eta=0.01, max_depth=3, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold3: eta=0.01, max_depth=3, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold3: eta=0.01, max_depth=3, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold3: eta=0.01, max_depth=3, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold3: eta=0.10, max_depth=2, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold3: eta=0.10, max_depth=2, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold3: eta=0.10, max_depth=2, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold3: eta=0.10, max_depth=2, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold3: eta=0.10, max_depth=3, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold3: eta=0.10, max_depth=3, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold3: eta=0.10, 
max_depth=3, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold3: eta=0.10, max_depth=3, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold3: eta=0.30, max_depth=2, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold3: eta=0.30, max_depth=2, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold3: eta=0.30, max_depth=2, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold3: eta=0.30, max_depth=2, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold3: eta=0.30, max_depth=3, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold3: eta=0.30, max_depth=3, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold3: eta=0.30, max_depth=3, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold3: eta=0.30, max_depth=3, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold4: eta=0.01, max_depth=2, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold4: eta=0.01, max_depth=2, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold4: eta=0.01, max_depth=2, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold4: eta=0.01, max_depth=2, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold4: eta=0.01, max_depth=3, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold4: eta=0.01, max_depth=3, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold4: eta=0.01, max_depth=3, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold4: eta=0.01, max_depth=3, gamma=0.50, colsample_bytree=1, min_child_weight=1, 
subsample=0.5, nrounds=100## + Fold4: eta=0.10, max_depth=2, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100


## - Fold4: eta=0.10, max_depth=2, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold4: eta=0.10, max_depth=2, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold4: eta=0.10, max_depth=2, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold4: eta=0.10, max_depth=3, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold4: eta=0.10, max_depth=3, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold4: eta=0.10, max_depth=3, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold4: eta=0.10, max_depth=3, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold4: eta=0.30, max_depth=2, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold4: eta=0.30, max_depth=2, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold4: eta=0.30, max_depth=2, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold4: eta=0.30, max_depth=2, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold4: eta=0.30, max_depth=3, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold4: eta=0.30, max_depth=3, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold4: eta=0.30, max_depth=3, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold4: eta=0.30, max_depth=3, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold5: eta=0.01, max_depth=2, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold5: eta=0.01, max_depth=2, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold5: eta=0.01, max_depth=2, gamma=0.50, colsample_bytree=1, 
min_child_weight=1, subsample=0.5, nrounds=100## - Fold5: eta=0.01, max_depth=2, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold5: eta=0.01, max_depth=3, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold5: eta=0.01, max_depth=3, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold5: eta=0.01, max_depth=3, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold5: eta=0.01, max_depth=3, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold5: eta=0.10, max_depth=2, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold5: eta=0.10, max_depth=2, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold5: eta=0.10, max_depth=2, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold5: eta=0.10, max_depth=2, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold5: eta=0.10, max_depth=3, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold5: eta=0.10, max_depth=3, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold5: eta=0.10, max_depth=3, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold5: eta=0.10, max_depth=3, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold5: eta=0.30, max_depth=2, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold5: eta=0.30, max_depth=2, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold5: eta=0.30, max_depth=2, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold5: eta=0.30, max_depth=2, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold5: eta=0.30, 
max_depth=3, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold5: eta=0.30, max_depth=3, gamma=0.25, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## + Fold5: eta=0.30, max_depth=3, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## - Fold5: eta=0.30, max_depth=3, gamma=0.50, colsample_bytree=1, min_child_weight=1, subsample=0.5, nrounds=100## Aggregating results## Selecting tuning parameters## Fitting nrounds = 100, max_depth = 2, eta = 0.01, gamma = 0.5, colsample_bytree = 1, min_child_weight = 1, subsample = 0.5 on full training set

train.xgb

## eXtreme Gradient Boosting
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 431, 430, 431, 430, 430
## Resampling results across tuning parameters:
##
##   eta   max_depth  gamma  nrounds  Accuracy   Kappa
##   0.01  2          0.25    75      0.7657494  0.4538471
##   0.01  2          0.25   100      0.7676012  0.4587285
##   0.01  2          0.50    75      0.7694704  0.4623929
##   0.01  2          0.50   100      0.7732087  0.4699355
##   0.01  3          0.25    75      0.7582728  0.4420598
##   0.01  3          0.25   100      0.7582901  0.4433717
##   0.01  3          0.50    75      0.7601765  0.4445163
##   0.01  3          0.50   100      0.7601765  0.4471754
##   0.10  2          0.25    75      0.7527518  0.4374509
##   0.10  2          0.25   100      0.7509173  0.4393334
##   0.10  2          0.50    75      0.7713396  0.4775565
##   0.10  2          0.50   100      0.7657494  0.4712296
##   0.10  3          0.25    75      0.7416753  0.4233366
##   0.10  3          0.25   100      0.7472655  0.4383261
##   0.10  3          0.50    75      0.7620630  0.4673431
##   0.10  3          0.50   100      0.7528037  0.4516339
##   0.30  2          0.25    75      0.7416234  0.4260231
##   0.30  2          0.25   100      0.7323468  0.4075123
##   0.30  2          0.50    75      0.7379716  0.4130307
##   0.30  2          0.50   100      0.7342506  0.4005421
##   0.30  3          0.25    75      0.7379370  0.4054532
##   0.30  3          0.25   100      0.7174801  0.3671057
##   0.30  3          0.50    75      0.7230183  0.3915425
##   0.30  3          0.50   100      0.7192800  0.3784477
##
## Tuning parameter 'colsample_bytree' was held constant at a value of 1
## Tuning parameter 'min_child_weight' was held constant at a value of 1
## Tuning parameter 'subsample' was held constant at a value of 0.5
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 100, max_depth = 2,
##  eta = 0.01, gamma = 0.5, colsample_bytree = 1, min_child_weight = 1
##  and subsample = 0.5.

• Accuracy : 0.773
• Kappa : 0.470
• nrounds : 100 (number of boosting iterations, i.e. trees in the final model)
• max_depth : 2 (maximum depth of each individual tree)
• eta : 0.01 (learning rate; scales the contribution of each tree to the final prediction)
• gamma : 0.5 (minimum loss reduction required to make a further split)
• colsample_bytree : 1 (fraction of features sampled when building each tree)
• min_child_weight : 1 (minimum sum of instance weights required in a child node)
• subsample : 0.5 (fraction of training observations sampled for each tree)
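The 24 rows of the resampling table correspond to every combination of the tuned values. As a minimal sketch, the grid can be rebuilt with base R's expand.grid(). The name xgb_grid is hypothetical, and the candidate values are read off the table above, not taken from the author's tuning code, which is not shown in this section.

```r
# Candidate values read off the resampling table
# (assumed to be the grid that was used for tuning).
xgb_grid <- expand.grid(
  nrounds          = c(75, 100),
  max_depth        = c(2, 3),
  eta              = c(0.01, 0.10, 0.30),
  gamma            = c(0.25, 0.50),
  colsample_bytree = 1,    # held constant
  min_child_weight = 1,    # held constant
  subsample        = 0.5   # held constant
)

# 2 * 2 * 3 * 2 = 24 candidate models, one per row of the table
nrow(xgb_grid)
```

A data frame with these seven columns is the shape that caret::train() accepts through its tuneGrid argument when method = "xgbTree".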

Variable Importance

• Gain : The relative contribution of a feature to the model, measured as the improvement (loss reduction) brought by the splits that use it
• Cover : The relative number of observations affected by splits on the feature
• Frequency : The percentage of all splits, across all trees, in which the feature is used

# library(xgboost)
# Note: max_depth, gamma and nrounds below differ from the values selected by caret above
# (max_depth = 2, gamma = 0.5, nrounds = 100).
param <- list(objective = "binary:logistic",
              booster = "gbtree",
              eval_metric = "error",
              eta = 0.01,
              max_depth = 3,
              subsample = 0.5,
              colsample_bytree = 1,
              gamma = 0.25)

x <- as.matrix(train1[, 1:8])                # predictor matrix
y <- ifelse(train1$Outcome == "Yes", 1, 0)   # binary 0/1 label
train.mat <- xgboost::xgb.DMatrix(data = x, label = y)

set.seed(123)
xgb.fit <- xgboost::xgb.train(params = param,
                              data = train.mat,
                              nrounds = 75)

impMatrix <- xgboost::xgb.importance(feature_names = dimnames(x)[[2]], model = xgb.fit)
impMatrix

##                     Feature       Gain      Cover  Frequency
## 1:                  Glucose 0.53674056 0.39745044 0.25154639
## 2:                      BMI 0.15500238 0.16984326 0.21030928
## 3:                      Age 0.14802032 0.19479942 0.17731959
## 4: DiabetesPedigreeFunction 0.06119871 0.09038095 0.12371134
## 5:              Pregnancies 0.02771003 0.04512527 0.04948454
## 6:            SkinThickness 0.02707855 0.04490084 0.06804124
## 7:                  Insulin 0.02393599 0.03204323 0.06185567
## 8:            BloodPressure 0.02031346 0.02545660 0.05773196

xgboost::xgb.plot.importance(impMatrix, main = "Gain by Feature")

[Figure: "Gain by Feature". Bar chart of gain per feature (x-axis from 0.0 to 0.5); features shown: Glucose, BMI, Age, DiabetesPedigreeFunction, Pregnancies, SkinThickness, Insulin, BloodPressure.]

• Importance by gain : Glucose > BMI > Age
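To put the ranking in numbers, the sketch below sums the Gain column from the importance table using base R. The values are copied from the output above; the variable names are hypothetical. Since the gains are normalised shares, they should sum to roughly 1, and the top three features account for about 84% of the total.

```r
# Gain values copied from the importance table above
gain <- c(Glucose = 0.53674056, BMI = 0.15500238, Age = 0.14802032,
          DiabetesPedigreeFunction = 0.06119871, Pregnancies = 0.02771003,
          SkinThickness = 0.02707855, Insulin = 0.02393599,
          BloodPressure = 0.02031346)

total_gain <- sum(gain)                                  # shares sum to ~1
top3_share <- sum(sort(gain, decreasing = TRUE)[1:3])    # Glucose + BMI + Age

round(total_gain, 4)
round(top3_share, 3)
```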

Results of the performance with respect to the test set

# install.packages("InformationValue")
# library(InformationValue)
xgb.pred <- predict(xgb.fit, x)   # predicted probabilities on the training data
InformationValue::optimalCutoff(y, xgb.pred) # Optimal threshold to minimize error

## [1] 0.4817243

• Cutoff : 0.482
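The idea behind an error-minimising cutoff can be sketched in base R: scan a set of candidate thresholds and keep the one with the lowest misclassification rate. This is an illustration of the principle on hypothetical toy data, not the InformationValue implementation; misclass_error and the toy vectors are made up for the example.

```r
# Hypothetical toy data: true labels and predicted probabilities
actual <- c(0, 0, 0, 0, 1, 0, 1, 1, 1, 1)
score  <- c(0.12, 0.23, 0.31, 0.47, 0.42, 0.56, 0.61, 0.72, 0.83, 0.91)

# Share of observations misclassified when predicting 1 for score > threshold
misclass_error <- function(actual, score, threshold) {
  mean(as.integer(score > threshold) != actual)
}

# Evaluate the error on a grid of candidate cutoffs and keep the best one
cutoffs <- seq(0.05, 0.95, by = 0.05)
errors  <- vapply(cutoffs, function(t) misclass_error(actual, score, t), numeric(1))

best_cutoff <- cutoffs[which.min(errors)]
best_cutoff
min(errors)
```

On this toy data the lowest error (one observation in ten) is reached at a cutoff of 0.60.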

xgb.testMat <- as.matrix(test1[, 1:8])
xgb.test <- predict(xgb.fit, xgb.testMat)
y.test <- ifelse(test1$Outcome == "Yes", 1, 0)
InformationValue::confusionMatrix(actuals = y.test,
                                  predictedScores = xgb.test,
                                  threshold = 0.48)
1 - InformationValue::misClassError(y.test, xgb.test, threshold = 0.48)  # accuracy

##     0  1
## 0 123 31
## 1  27 49

## [1] 0.7478
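The accuracy printed above follows directly from the confusion matrix. The sketch below recomputes it in base R, together with sensitivity and specificity. It assumes that rows are predicted classes and columns are actual classes; the object names are hypothetical.

```r
# Counts copied from the confusion matrix above.
# Assumption: rows = predicted class, columns = actual class.
cm <- matrix(c(123, 27, 31, 49), nrow = 2,
             dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))

tn <- cm["0", "0"]; fp <- cm["1", "0"]
fn <- cm["0", "1"]; tp <- cm["1", "1"]

accuracy    <- (tp + tn) / sum(cm)   # 172 / 230
sensitivity <- tp / (tp + fn)        # 49 / 80
specificity <- tn / (tn + fp)        # 123 / 150

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 4)
```

Under that assumption the accuracy is 0.7478, matching the printed value, with a sensitivity of 0.6125 and a specificity of 0.82.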


xgb.test <- as.numeric(xgb.test > 0.5) %>% as.factor()   # hard class labels at a 0.5 threshold
MLmetrics::F1_Score(y_pred = xgb.test,
                    y_true = y.test,
                    positive = "1")
pROC::auc(predictor = as.numeric(x = xgb.test),
          response = as.numeric(x = y.test))

## [1] 0.6206897


• Accuracy : 0.761
• Error : 0.239
• F1 Score : 0.626
• AUC : 0.718
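In the code above, both the F1 score and the AUC are computed from class labels thresholded at 0.5, not from predicted probabilities. With a binary predictor the ROC curve has a single operating point, so the AUC reduces to the mean of sensitivity and specificity. The base R sketch below checks this on hypothetical toy labels; the rank formula stands in for pROC::auc, and all names are made up for the example.

```r
# Hypothetical toy labels (not the report's data)
y_true <- c(1, 1, 1, 0, 0, 0, 0, 0, 0, 1)
y_pred <- c(1, 1, 0, 0, 0, 1, 0, 0, 0, 1)

tp <- sum(y_pred == 1 & y_true == 1)
fp <- sum(y_pred == 1 & y_true == 0)
fn <- sum(y_pred == 0 & y_true == 1)
tn <- sum(y_pred == 0 & y_true == 0)

# F1 is the harmonic mean of precision and recall
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

# AUC via the rank (Mann-Whitney) formula; tied predictions get average ranks
r   <- rank(y_pred)
n1  <- sum(y_true == 1)
n0  <- sum(y_true == 0)
auc <- (sum(r[y_true == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

# Mean of sensitivity and specificity
balanced_accuracy <- (recall + tn / (tn + fp)) / 2

c(f1 = f1, auc = auc, balanced_accuracy = balanced_accuracy)
```

Here the AUC and the balanced accuracy coincide (19/24), which is why AUC values computed this way sit well below what a probability-based ROC curve would typically give.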

Models Evaluation

Models Comparison

Accuracy

Model          Train   Test    Difference (Train - Test)
KNN            0.779   0.744    0.035
Weighted KNN   0.771   0.744    0.027
Linear SVM     0.781   0.765    0.016
Poly SVM       0.753   0.783   -0.030
Radial SVM     0.753   0.770   -0.017
Sigmoid SVM    0.775   0.752    0.023
Randomforest   0.758   0.761   -0.003
Xgboost        0.773   0.761    0.012

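• Best accuracy on the train set : Linear SVM (0.781)
• Best accuracy on the test set : Poly SVM (0.783)
• Lowest (most negative) Train - Test difference : Poly SVM (-0.030), i.e. its test accuracy exceeds its train accuracy by the largest margin
• Smallest absolute gap between train and test : Randomforest (-0.003)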
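The comparison can be reproduced from the table values with a few lines of base R. This is a sketch with a hypothetical data frame named acc; it also separates the most negative signed difference from the smallest absolute gap, which are two different readings of "lowest difference".

```r
# Accuracy values copied from the table above
acc <- data.frame(
  model = c("KNN", "Weighted KNN", "Linear SVM", "Poly SVM",
            "Radial SVM", "Sigmoid SVM", "Randomforest", "Xgboost"),
  train = c(0.779, 0.771, 0.781, 0.753, 0.753, 0.775, 0.758, 0.773),
  test  = c(0.744, 0.744, 0.765, 0.783, 0.770, 0.752, 0.761, 0.761),
  stringsAsFactors = FALSE
)
acc$difference <- round(acc$train - acc$test, 3)

acc$model[which.max(acc$train)]             # highest train accuracy
acc$model[which.max(acc$test)]              # highest test accuracy
acc$model[which.min(acc$difference)]        # most negative train - test difference
acc$model[which.min(abs(acc$difference))]   # smallest absolute gap
```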

F1 Score & AUC

Model          F1 Score   AUC
KNN            0.582      0.690
Weighted KNN   0.587      0.693
Linear SVM     0.630      0.721
Poly SVM       0.667      0.746
Radial SVM     0.644      0.730
Sigmoid SVM    0.627      0.717
Randomforest   0.626      0.739
Xgboost        0.626      0.718

• Model Showing Highest F1 Score : Poly SVM (0.667)
• Model Showing Highest AUC : Poly SVM (0.746)


Date post:	11-Aug-2020