Fitting Classification and Regression Trees Using Statgraphics and R Presented by Dr. Neil W. Polhemus
Page 1: Fitting Classification and Statgraphics and R › webinars › decision trees webinar.pdf · 2018-12-13 · Classification and Regression Trees •Machine learning methods used to

Fitting Classification and Regression Trees Using Statgraphics and R

Presented by Dr. Neil W. Polhemus


Classification and Regression Trees

• Machine learning methods used to construct predictive models from data.

• Recursively partition the data space using simple binary decisions.

• Commonly portrayed as a tree with a split at each decision node.


Example – Fisher’s Iris Data

[Figure: classification tree for Species; at each split the left branch is listed first]

Petal.length<2.45
    setosa (p=1.0)
    Petal.width<1.75
        Petal.length<4.95
            Sepal.length<5.15
                versicolor (p=0.8)
                versicolor (p=1.0)
            virginica (p=0.666667)
        Petal.length<4.95
            virginica (p=0.833333)
            virginica (p=1.0)


Basic Model Structure

• Y: variable to be predicted

– If categorical, we construct a classification tree.

– If continuous, we construct a regression tree.

• X1, X2, … Xp: predictor variables

– May be either categorical or continuous.


Partitioning Algorithm

• Start at the root node.

• Amongst all variables Xj, find the split that minimizes the resulting average within-node deviance.

– For a continuous variable, the split is of the form Xj < c.

– For a discrete variable, the split divides the possible values into 2 distinct groups.

• If one or more stopping criteria are met after the split, stop. Otherwise consider splitting the child nodes.
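The search for a split of the form Xj < c can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the routine Statgraphics or R actually runs: it handles a single continuous predictor, uses the sum of squared deviations from the node mean as the regression deviance, tries midpoints between consecutive distinct values as candidate cut points, and minimizes the summed deviance of the two child nodes. The function names and the data are made up for the example.

```python
def node_deviance(values):
    """Regression deviance of one node: squared deviations from the node mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (c, deviance) for the split x < c with the smallest child deviance."""
    distinct = sorted(set(x))
    best_c, best_dev = None, float("inf")
    # Candidate cut points (an assumption): midpoints between distinct x values.
    for lo, hi in zip(distinct, distinct[1:]):
        c = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi < c]
        right = [yi for xi, yi in zip(x, y) if xi >= c]
        dev = node_deviance(left) + node_deviance(right)
        if dev < best_dev:
            best_c, best_dev = c, dev
    return best_c, best_dev


# Made-up illustration data, not taken from the webinar.
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [5.0, 6.0, 5.5, 20.0, 21.0, 19.5]
c, dev = best_split(x, y)
print(f"split at x < {c}, child deviance {dev:.4f}")
```

A full partitioning routine would repeat this search over every predictor, apply the stopping criteria, and then recurse into the two child nodes.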


RMS Titanic


Sample Data File

Source: Frank Harrell and Thomas Cason, University of Virginia
http://biostat.mc.vanderbilt.edu/twiki/pub/Main/DataSets/titanic.html

n=1,309 observations (passengers only)


Data Input


Analysis Options


Partitioning Options


Node Impurity

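• Measure the impurity in a tree using the residual mean deviance (RMD).

– Classification trees: $\mathrm{RMD} = -\dfrac{2}{n-k}\sum_{i=1}^{n}\ln p_{i,j}$

– Regression trees: $\mathrm{RMD} = \dfrac{1}{n-k}\sum_{i=1}^{n}\left(Y_i-\hat{Y}_i\right)^2$

n = number of observations in training set

k = number of leaves

$p_{i,j}$ = proportion of data of same type as i at its assigned leaf j

$Y_i$ = observed value of observation i

$\hat{Y}_i$ = predicted value of observation i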
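As a rough numerical illustration of the residual mean deviance, the sketch below assumes the usual definitions: for classification trees, -2/(n-k) times the sum over training observations of ln p, where p is the proportion of the observation's own class at its assigned leaf; for regression trees, the sum of squared residuals divided by (n-k). The function names and the small data sets are hypothetical.

```python
import math


def rmd_classification(own_class_proportions, k):
    """RMD for a classification tree.

    own_class_proportions holds, for each training observation, the proportion
    of its own class at the leaf it was assigned to; k is the number of leaves.
    """
    n = len(own_class_proportions)
    return -2.0 * sum(math.log(p) for p in own_class_proportions) / (n - k)


def rmd_regression(observed, predicted, k):
    """RMD for a regression tree: residual sum of squares over (n - k)."""
    n = len(observed)
    rss = sum((y - yhat) ** 2 for y, yhat in zip(observed, predicted))
    return rss / (n - k)


# Made-up example: 4 observations, 2 leaves.
# One leaf is pure (p = 1.0); the other is evenly mixed (p = 0.5).
rmd_c = rmd_classification([1.0, 1.0, 0.5, 0.5], k=2)

# Made-up example: each prediction is the mean of its leaf.
rmd_r = rmd_regression([1.0, 2.0, 3.0, 4.0], [1.5, 1.5, 3.5, 3.5], k=2)
print(rmd_c, rmd_r)
```

A pure leaf contributes ln(1) = 0 to the classification sum, so only mixed leaves add to the deviance.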


Analysis Window


Decision Tree

[Figure: classification tree for survived; at each split the left branch is listed first]

sex=female
    pclass=3
        fare<23.0875
            Yes
            No
        Yes
    age<9.5
        sibsp=3,4,5
            No
            Yes
        pclass=2,3
            No
            No


Decision Tree Options


Tree Structure

* Note: based on complete cases only.


Node Probabilities

* Note: based on complete cases only.


Classification Table

* Note: based on both complete and partial cases.


Training and Validation Sets

• May separate the data into 2 sets:

– Training set used to build the tree.

– Validation (test) set used to estimate the tree's misclassification percentages.
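The idea of holding out part of the data can be sketched as follows. This is a minimal illustration, not the Statgraphics procedure: the 30% validation fraction, the fixed random seed, and the function names are all assumptions chosen for the example.

```python
import random


def split_train_validation(rows, validation_fraction=0.3, seed=0):
    """Randomly split rows into (training, validation) lists."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_val = int(round(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]


def misclassification_percent(actual, predicted):
    """Percentage of cases whose predicted class differs from the actual class."""
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return 100.0 * wrong / len(actual)


# Made-up row identifiers standing in for passengers.
rows = list(range(10))
train, validation = split_train_validation(rows)

# Made-up labels: one of four validation cases is misclassified.
pct = misclassification_percent(
    ["Yes", "No", "No", "Yes"],
    ["Yes", "No", "Yes", "Yes"],
)
print(len(train), len(validation), pct)
```

The tree is fitted on the training rows only; computing the misclassification percentage on the validation rows then gives an estimate that is not flattered by having seen those cases during fitting.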


Compare Results


Pruning Options

• Reduces the complexity of the tree by removing branches.


Pruning by Cross-validation

• Runs a 10-fold cross-validation experiment. Builds 10 trees, leaving out 10% of the data each time, and averages the results.

• Uses all of the observations for both training and validation.

• Can be used to determine the optimal size for the tree by increasing the number of leaves until you see warnings such as:


Pruning Example

• First reduce within-node deviance to fit a complex tree.


Pruning Example

[Figure: large unpruned classification tree for survived, with the root split on sex followed by many further splits on pclass, fare, embarked, age, sibsp and parch]


Pruning Example

• Now reduce the number of leaves to 10 and select cross-validation.

[Figure: 10-leaf classification tree for survived; at each split the left branch is listed first]

sex=female
    pclass=3
        fare<23.0875
            Yes
            No
        fare<32.0896
            Yes
            Yes
    age<9.5
        sibsp=3,4,5
            No
            Yes
        pclass=2,3
            age<32.25
                No
                No
            age<54.5
                No
                No


10 Leaves is Too Complex


Finding Optimal Number of Leaves

• Step 1: Copy script from Statgraphics to Word and modify last line.


Finding Optimal Number of Leaves

• Step 2: Copy modified script to R and run it. Look for size with minimum deviance.
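Picking the size with minimum deviance is a simple lookup once the candidate sizes and their cross-validated deviances are listed side by side. The sketch below assumes the script output can be read as two parallel lists; the numbers are invented for illustration and are not the webinar's results. Breaking ties in favor of the smaller tree is a design choice of this sketch, not something the slides specify.

```python
def size_with_min_deviance(sizes, deviances):
    """Return the tree size with the smallest deviance.

    Ties are broken in favor of the smaller tree, because tuples of
    (deviance, size) compare on deviance first and size second.
    """
    best_deviance, best_size = min(zip(deviances, sizes))
    return best_size


# Hypothetical values only: number of leaves and matching deviance.
sizes = [10, 8, 6, 4, 2]
deviances = [12.0, 11.0, 9.5, 9.5, 14.0]
best = size_with_min_deviance(sizes, deviances)
print(best)
```

Here sizes 6 and 4 tie on deviance, so the sketch returns 4.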


Prune Tree


Final Tree

[Figure: final pruned classification tree for survived; at each split the left branch is listed first]

sex=female
    pclass=3
        No
        Yes
    age<9.5
        sibsp=3,4,5
            No
            Yes
        No


Predict Additional Cases

• To make predictions for additional cases, add them to the bottom of the original data, leaving the cell for Y blank.
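To show what predicting an additional case amounts to, the sketch below hard-codes one reading of the final five-leaf tree from the earlier slide as nested conditions. Two things are assumptions: that the figure's labels are read in the right order, and that the left branch at each split corresponds to the condition being true. The function name and the new cases are hypothetical, and a blank Y cell is represented by None.

```python
def predict_survived(passenger):
    """Apply an assumed reading of the final pruned tree to one passenger."""
    if passenger["sex"] == "female":
        # Assumed split: pclass=3 goes left (No), other classes go right (Yes).
        return "No" if passenger["pclass"] == 3 else "Yes"
    if passenger["age"] < 9.5:
        # Assumed split: sibsp in {3, 4, 5} goes left (No), otherwise Yes.
        return "No" if passenger["sibsp"] in (3, 4, 5) else "Yes"
    return "No"


# Hypothetical additional cases with the response left blank (None).
new_cases = [
    {"sex": "female", "pclass": 1, "age": 30.0, "sibsp": 0, "survived": None},
    {"sex": "male", "pclass": 3, "age": 5.0, "sibsp": 4, "survived": None},
    {"sex": "male", "pclass": 2, "age": 40.0, "sibsp": 0, "survived": None},
]

# Only rows with a blank response receive a prediction.
predictions = [
    predict_survived(case) for case in new_cases if case["survived"] is None
]
print(predictions)
```

Rows with a blank response take no part in fitting the tree; they are simply passed down the fitted splits to a leaf.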


Predictions and Residuals


Example 2: World Bank Demographics


Decision Tree

[Figure: regression tree for Life Expectancy; at each split the left branch is listed first, and leaves show the predicted value]

Fertility.Rate<3.3
    GDP.per.Capita<11068.0
        GDP.per.Capita<4282.75
            Pop..Density<40.37
                66.883
                71.666
            74.499
        79.518
    Fertility.Rate<4.585
        Female.Percentage<49.97
            66.652
            Age.Dependency.Ratio<72.925
                62.228
                54.003
        Pop..Density<31.005
            50.515
            55.507


References

• StatFolios and data files are at: www.statgraphics.com/webinars

• R Package “tree” (2015) https://cran.r-project.org/web/packages/tree/tree.pdf

• Classic text: Breiman, L., Friedman, J., Stone, C.J. and Olshen, R.A. (1984) Classification and Regression Trees. Wadsworth.

