Introduction Method R Implementation Data Preparation Conclusion
Decision Tree in R
December 7, 2011
Ming Shan R/Predictive Analytics Meetup - December 7, 2011
Content I
1 Introduction
   Example 1: Titanic Data
   Example 2: New York City Air Quality Data
   Example 3: An Artificial Data
2 Method
   General Comment
   Regression Tree
   Classification Tree
   Tree Stop and Pruning
   Missing Data
3 R Implementation
   rpart
   party package
Content II
   RWeka package
   evtree package
   mvpart package
   partykit package
4 Data Preparation
   Overview
   Example
5 Conclusion
Agenda
Introduction through examples
Algorithm - a conceptual view
R implementation
Data preparation
Conclusion
Example 1: Titanic Data
Titanic Passenger Survival Data
Data (n = 1046 complete records)
survived yes, no
sex female, male
pclass passenger class on the ship
age continuous
Question
Who survived and who perished?
Example 1: Titanic Data
Figure: Survival of Titanic Passengers (n = 1046). Root split on sex (p < 0.001). Females: split on pclass (3 vs. {1, 2}, p < 0.001) into Node 3 (n = 152) and Node 4 (n = 236). Males: split on pclass ({2, 3} vs. 1, p < 0.001); for classes 2-3, a split on age (≤ 9 vs. > 9, p < 0.001) gives Node 7 (n = 40) and Node 8 (n = 467); for class 1, a split on age (≤ 54 vs. > 54, p = 0.008) gives Node 10 (n = 123) and Node 11 (n = 28). Each terminal node shows the proportion of yes/No on survival.
Example 2: New York City Air Quality Data
Air quality data
Data (n = 111 days May - Sep 1976)
Ozone Mean ozone in parts per billion
Solar.R Solar radiation
Wind Average wind speed
Temp Maximum daily temperature
Month Month (1-12)
Day Day (1-31)
Question
What explains the variation of Ozone level in New York City?
Example 2: New York City Air Quality Data
> data(airquality) # load data
> air <- airquality
> air <- na.omit(air) # exclude missing data
> head(air) # display a few records
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
Example 2: New York City Air Quality Data
Figure: NYC Air Quality, May−Sept 1976 (n = 111). Root split on Temp (≤ 82 vs. > 82, p < 0.001). For Temp ≤ 82: a split on Wind (≤ 9.2 vs. > 9.2, p < 0.001); Wind ≤ 9.2 gives Node 3 (n = 24), and Wind > 9.2 splits again on Temp (≤ 75 vs. > 75, p = 0.003) into Node 5 (n = 32) and Node 6 (n = 21). Temp > 82 gives Node 7 (n = 34). Terminal nodes show boxplots of Ozone (0-150).
Example 3: An Artificial Data
> ## Create an artificial data (Hastie etc.)
> z <- matrix(0, 40, 40) # Create a 40x40 matrix
> z[1:16, 1:12] <- 2
> z[17:24, ] <- 5
> z[25:40, 1:28] <- 8
> z[25:40, 29:40] <- 10
> ## Data set: 40x40 = 1600 rows
> ds <- data.frame(y = as.vector(z), expand.grid(1:40, 1:40))
> colnames(ds) <- c("y", "x1", "x2")
Example 3: An Artificial Data
Figure: Artificial Data (breaks: x1 = 16.5, 24.5; x2 = 12.5, 28.5) - a perspective plot of y over the (x1, x2) grid. Example in Hastie et al.
Example 3: An Artificial Data
Figure: Artificial Data Completely Recovered by Tree Model (n = 40 x 40 = 1600). Splits: x1 < 16.5; then x2 >= 12.5 vs. x1 < 24.5; then x2 < 28.5. Leaves: y = 0 (n = 448), y = 2 (n = 192), y = 5 (n = 320), y = 8 (n = 448), y = 10 (n = 192).
General Comment
The Basic Idea of Decision Tree
A dependent variable (y): continuous or categorical
Multiple (can be many) independent variables (x's): continuous or categorical
The tree looks for the split at each node that leads to the most differentiation on y
The tree stops when further splitting becomes ineffective
A decision tree can:
Serve as a model (e.g. create rules)
Make predictions
Segment the data
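The basic idea above can be sketched in a few lines of R. This is an illustrative example on the built-in iris data, not part of the original slides:

```r
## Minimal sketch: fit a tree, read the rules, and use it for prediction
## and segmentation (rpart ships with R as a recommended package).
library(rpart)

fit  <- rpart(Species ~ ., data = iris)   # y = Species; x's = 4 measurements
print(fit)                                # the rules: one split per line
pred <- predict(fit, type = "class")      # prediction
table(pred, iris$Species)                 # terminal nodes segment the data
```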
General Comment
History
Social scientists: Morgan and Sonquist (1963); Morgan and Messenger (1973)
Statistics: Breiman et al. (1984) - CART
Machine learning: Quinlan (1979 and after)
Ripley (1996)
General Comment
Type of Algorithms
By dependent variable type
Classification tree Dependent variable is discrete
Ex: purchase (yes/no), types of disease treatment, . . .
Regression tree Dependent variable is continuous
Ex: spending, likelihood to buy, . . .
Popular Implementations
CHAID CHi-squared Automatic Interaction Detector
CART Classification And Regression Tree
C4.5 and C5.0 and some newer ones
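In rpart the tree type follows the dependent variable: a factor response gives a classification tree, a numeric response a regression tree. A small sketch on the built-in iris data (illustrative, not from the slides):

```r
## rpart picks its method from the response type.
library(rpart)

cfit <- rpart(Species ~ ., data = iris)        # factor y -> classification tree
rfit <- rpart(Sepal.Length ~ ., data = iris)   # numeric y -> regression tree

cfit$method   # "class"
rfit$method   # "anova"
```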
General Comment
Branch Split
CHAID allows multi-way splits - a wider tree
CART uses binary splits
All major tree implementations in R enforce binary splits
A multi-way split can be achieved by several binary splits
Binary splits avoid potential issues with multi-way splits:
The need to normalize for size when comparing splits
Quick fragmentation of the sample size
Note
The tree grows by optimizing only the split at the current node rather than optimizing the entire tree
Regression Tree
Regression Tree
The dependent variable is continuous
Fit a simple constant model to minimize the sum of squared deviations from the constant
Like fitting an ANOVA model
Equivalent to fitting a Gaussian GLM
A greedy algorithm makes the search for the best split computationally easy
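The greedy search can be made concrete with a short base-R sketch that scans every candidate cut point on one predictor and keeps the one minimizing the within-node sum of squares - the step a regression tree repeats at each node. `best_split` is a made-up helper for illustration, not a package function:

```r
## Exhaustive search for the best binary split on a single numeric x,
## minimizing SSE(left) + SSE(right) around each child's mean.
best_split <- function(x, y) {
  xs   <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2          # midpoints between values
  sse  <- function(v) sum((v - mean(v))^2)
  cost <- vapply(cuts, function(cp) sse(y[x <= cp]) + sse(y[x > cp]), 0)
  c(cut = cuts[which.min(cost)], sse = min(cost))
}

air <- na.omit(airquality)
best_split(air$Temp, air$Ozone)   # a single Temp cut, near the tree's first split
```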
Regression Tree
Figure: Air Quality Data − Ozone (Numeric) Split on Temperature Alone. Root split on Temp (≤ 82 vs. > 82, p < 0.001). For Temp ≤ 82: a split at 77 (p < 0.001) into Node 3 (n = 50) and Node 4 (n = 27). For Temp > 82: a split at 87 (p = 0.017) into Node 6 (n = 17) and Node 7 (n = 17). Terminal nodes show Ozone boxplots (0-80).
Regression Tree
Figure: Matching Tree Split to Raw Data (n = 111) - scatter plot of Ozone (0-150) against Temp (60-90), with a loess fit on the raw data and the breaks identified by the tree marked.
Classification Tree
Classification Tree
The dependent variable is categorical
Common node impurity measures used:
Misclassification error
Gini index
Cross-entropy or deviance
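The three measures are easy to compute for a node's class proportions; here is a small base-R sketch (the `impurity` function is illustrative, not from a package):

```r
## Node impurity for class proportions p (summing to 1).
impurity <- function(p) {
  p <- p[p > 0]                         # drop empty classes (0 * log 0 = 0)
  c(misclass = 1 - max(p),              # misclassification error
    gini     = sum(p * (1 - p)),        # Gini index
    entropy  = -sum(p * log2(p)))       # cross-entropy
}

impurity(c(0.5, 0.5))   # most impure two-class node: 0.5, 0.5, 1
impurity(c(0.9, 0.1))   # purer node: every measure is smaller
```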
Classification Tree
> air0 <- air <- na.omit(airquality)
> vars<-c("Ozone","Solar.R","Wind","Temp")
> for (v in vars) {
+ air[,paste(v,".Cat",sep="")] <- cut(air[, v],
+ breaks=c(-Inf, median(air[, v]), Inf),
+ label = c("Low", "High"))
+ }
> air$Month.Cat <- as.factor(air$Month)
> air <- subset(air, select = -c(Ozone, Month, Day))
> head(air)
Solar.R Wind Temp Ozone.Cat Solar.R.Cat Wind.Cat Temp.Cat Month.Cat
1 190 7.4 67 High Low Low Low 5
2 118 8.0 72 High Low Low Low 5
3 149 12.6 74 Low Low High Low 5
4 313 11.5 62 Low High High Low 5
7 299 8.6 65 Low High Low Low 5
8 99 13.8 59 Low Low High Low 5
Classification Tree
Figure: Air Quality − ALL Categorical Variables. Root split on Temp.Cat (p < 0.001). Temp.Cat = High: a split on Solar.R.Cat (p = 0.074) into Node 3 (n = 25, Low) and Node 4 (n = 29, High). Temp.Cat = Low gives Node 5 (n = 57). Terminal nodes show the proportion of High/Low Ozone.
Classification Tree
Figure: Single-variable splits of Ozone.Cat (High/Low), one tree per predictor:
Split on Solar.R (p = 0.03): High (n = 55) vs. Low (n = 56)
Split on Temperature (p < 0.001): High (n = 54) vs. Low (n = 57)
Split on Wind (p < 0.001): Low (n = 58) vs. High (n = 53)
Split on Month (p < 0.001): {7, 8} (n = 49) vs. {5, 6, 9} (n = 62)
Tree Stop and Pruning
Tree Stop and Pruning
General strategy:
Grow the tree first and then prune
Implement cost-complexity pruning
Use a tuning parameter α
Estimate α through 10-fold cross-validation
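With rpart this strategy is a grow-then-prune pair: inspect the cross-validated CP table and prune at the CP that minimizes the cross-validated error. A hedged sketch on the built-in iris data:

```r
## Grow a deliberately large tree, then prune via the CP table
## (xerror comes from rpart's built-in 10-fold cross-validation).
library(rpart)

fit <- rpart(Species ~ ., data = iris, control = rpart.control(cp = 0.001))
printcp(fit)                                             # CP vs. xerror table
best.cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best.cp)
```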
Missing Data
Missing Predictor Values
Several strategies other than casewise deletion:
Code missing values as a separate category
Construct surrogate variables - use highly correlated variables without missing values
Split cases with missing values when passing them down a branch
Missing value imputation
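rpart uses the surrogate strategy by default, so cases with a missing split variable still pass down the tree; a short sketch on the un-cleaned airquality data:

```r
## No na.omit needed for the predictors: rpart drops cases missing the
## response but routes cases with missing x's via surrogate splits.
library(rpart)

fit  <- rpart(Ozone ~ ., data = airquality)   # airquality contains NAs
pred <- predict(fit, newdata = airquality)    # surrogates let every row score
length(pred)                                  # one prediction per row
```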
Everything in R is an object. - John Chambers
Key Packages
rpart Classic but updated workhorse for decision trees in R
party Conditional Inference Tree
RWeka R/Weka interface
tree Another regression and classification tree
evtree Global optimization
mvpart Multivariate dependent variable tree
partykit A general tree infrastructure
rpart
> #library(rpart)
> op <- par(mfrow = c(1,2)) # print two plots on one screen
> ## run rpart model
> rp1 <- rpart(survived ~ sex + age + pclass, data = Titanic)
> ## simple plot. branch proportional to error in the fit.
> plot(rp1, main = "Simple Display") # simple on the left
> text(rp1) # add text label
> ## Fancier plot. equal branch spacing.
> plot(rp1, branch = 0.5, uniform = TRUE, main = "Pretty Display")
> text(rp1, pretty = 0, fancy = TRUE, use.n=TRUE, all = TRUE)
> par(op) # reset par
rpart
Figure: Two displays of the rpart Titanic tree. Left ("Simple Display"): default plot with branch length proportional to fit improvement and abbreviated split labels. Right ("Pretty Display"): uniform spacing with full labels - splits sex=male/female, age>=9.5, pclass=3 vs. 1,2, age>=1.5 - and No/yes counts at every node (root: No 619/427; leaves include No 505/110, yes 0/14, No 79/66, yes 1/6, yes 16/220).
rpart
> print(rp1) # print rpart object
n= 1046
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 1046 427 No (0.59177820 0.40822180)
2) sex=male 658 135 No (0.79483283 0.20516717)
4) age>=9.5 615 110 No (0.82113821 0.17886179) *
5) age< 9.5 43 18 yes (0.41860465 0.58139535)
10) pclass=3 29 11 No (0.62068966 0.37931034) *
11) pclass=1,2 14 0 yes (0.00000000 1.00000000) *
3) sex=female 388 96 yes (0.24742268 0.75257732)
6) pclass=3 152 72 No (0.52631579 0.47368421)
12) age>=1.5 145 66 No (0.54482759 0.45517241) *
13) age< 1.5 7 1 yes (0.14285714 0.85714286) *
7) pclass=1,2 236 16 yes (0.06779661 0.93220339) *
rpart
> path.rpart(rp1, node=c(4, 7)) ##
node number: 4
root
sex=male
age>=9.5
node number: 7
root
sex=female
pclass=1,2
rpart
> head(predict(rp1)) ## predicted probability of survival
No yes
1 0.06779661 0.9322034
2 0.00000000 1.0000000
3 0.06779661 0.9322034
4 0.82113821 0.1788618
5 0.06779661 0.9322034
6 0.82113821 0.1788618
> ## actual vs. predicted probability in one data frame
> tmp <- cbind(actual=as.numeric(Titanic$survived)-1, pred=predict(rp1)[, 2])
> cor(tmp) ## correlation
actual pred
actual 1.000000 0.640939
pred 0.640939 1.000000
> aggregate(tmp, by=list(Titanic$survived), mean) # compare by survival
Group.1 actual pred
1 No 0 0.2405231
2 yes 1 0.6513260
> aggregate(tmp, by=list(Titanic$sex), mean) # compare by gender
rpart
Group.1 actual pred
1 female 0.7525773 0.7525773
2 male 0.2051672 0.2051672
> aggregate(tmp, by=list(Titanic$pclass), mean) #compare by pclass
Group.1 actual pred
1 1 0.6373239 0.5403331
2 2 0.4406130 0.5107649
3 3 0.2614770 0.2799117
rpart
> # use all predictors with control change
> rp1 <- rpart(survived ~ . , control = rpart.control(minsplit=30,
+ minbucket=15, cp=0.012), data = Titanic)
> plot(rp1, branch=0.3, uniform=TRUE, margin = 0.1,
+ main = "minsplit=30, minbucket=15, cp=0.012")
> text(rp1, pretty=0, fancy=TRUE, use.n=TRUE, all = TRUE)
rpart
Figure: minsplit=30, minbucket=15, cp=0.012. Splits: sex=male/female; males split on age >= 9.5; females split on pclass=3 vs. 1,2, with pclass=3 split further on age >= 27.5. Node counts (No/yes): root 619/427; leaves No 505/110, yes 18/25, No 30/16, yes 50/56, yes 16/220.
rpart
> ## Pruning tree by a high CP
> rp1 <- prune(rp1, cp=0.018)
> plot(rp1, branch=0.3, uniform=TRUE, margin = 0.2,
+ main = "After Pruning by cp = 0.018")
> text(rp1, pretty=0, fancy=TRUE, use.n=TRUE, all = TRUE)
rpart
Figure: After Pruning by cp = 0.018. Only two splits remain: sex=male/female, then (for females) pclass=3 vs. 1,2. Nodes (No/yes): root 619/427; males No 523/135; females yes 96/292, split into No 80/72 and yes 16/220.
rpart
Comments on rpart
The tree model is an object, so everything is accessible
Finer control of the tree is available through function parameters
rpart function: method, model, parms, controls
rpart.control function controls:
Minimum node size for attempting a split
Minimum number of records in a node
Complexity Parameter (CP)
Depth of the tree
. . .
More CP-related functions for tree size determination
party package
party package Overview
Use conditional inference
A framework for general tree model
Powerful and flexible tree graphics
Many types of dependent variables:
nominal
ordinal
numeric
censored
multivariate
Covariates in arbitrary measurement scales
party package
Key tree modeling functions in party
ctree Conditional Inference Tree
mob Model-based Recursive Partitioning
cforest Random forest
party package
Edgar Anderson’s Iris Data (in R)
Data
Species Factor of 3 classes: setosa, versicolor, virginica
Sepal.Length continuous
Sepal.Width continuous
Petal.Length continuous
Petal.Width continuous
Question
Use a tree model to predict species from the 4 measurements
party package
> str(iris) ## show object structure
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
> levels(iris$Species) <- c("setos", "versi", "virgi")
> ### classification
> iris.ct <- ctree(Species ~ .,data = iris)
> iris.ct
Conditional inference tree with 4 terminal nodes
Response: Species
Inputs: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
Number of observations: 150
1) Petal.Length <= 1.9; criterion = 1, statistic = 140.264
2)* weights = 50
1) Petal.Length > 1.9
3) Petal.Width <= 1.7; criterion = 1, statistic = 67.894
party package
4) Petal.Length <= 4.8; criterion = 0.999, statistic = 13.865
5)* weights = 46
4) Petal.Length > 4.8
6)* weights = 8
3) Petal.Width > 1.7
7)* weights = 46
> table(predict(iris.ct), iris$Species)
setos versi virgi
setos 50 0 0
versi 0 49 5
virgi 0 1 45
> plot(iris.ct, main = "Predict iris Species by ctree")
party package
Figure: Predict iris Species by ctree. Root split on Petal.Length (≤ 1.9 vs. > 1.9): Node 2 (n = 50). For Petal.Length > 1.9: a split on Petal.Width (≤ 1.7 vs. > 1.7); for ≤ 1.7, a split on Petal.Length (≤ 4.8 vs. > 4.8) into Node 5 (n = 46) and Node 6 (n = 8); > 1.7 gives Node 7 (n = 46). Terminal nodes show class proportions (setos/versi/virgi).
> ### survival analysis
party package
> data("GBSG2", package = "ipred")
> ct1 <- ctree(Surv(time, cens) ~ .,data = GBSG2)
> plot(ct1)
party package
Figure: ctree on the GBSG2 survival data. Root split on pnodes (≤ 3 vs. > 3, p < 0.001). pnodes ≤ 3: a split on horTh (no/yes, p = 0.035) into Node 3 (n = 248) and Node 4 (n = 128). pnodes > 3: a split on progrec (≤ 20 vs. > 20, p < 0.001) into Node 6 (n = 144) and Node 7 (n = 166). Terminal nodes show survival curves over 0-2500 days.
party package
> ex1<-ctree(Ozone ~ ., data=air0, controls=ctree_control(
+ maxdepth=3, mincriterion=0.95, minbucket=20))
> plot(ex1, inner_panel = node_inner(ex1, fill = "pink2"),
+ terminal_panel = node_hist(ex1, ymax = 0.07,
+ xscale = c(0, 200), fill = "cyan"),
+ main="NYC Air Quality - Different Tree Display")
party package
Figure: NYC Air Quality − Different Tree Display. Same splits as before (Temp ≤ 82 at the root, then Wind ≤ 9.2 and Temp ≤ 75) with Node 3 (n = 24), Node 5 (n = 32), Node 6 (n = 21), Node 7 (n = 34); terminal nodes now show histograms of Ozone (0-200) using the customized fill colors.
party package
mob - Model-based Recursive Partitioning
Typical tree algorithms partition data on differences in the dependent variable
mob partitions data on model differences
It relies on tests of parameter instability
The outcome is still a tree, whose nodes display different model patterns
party package
> ## recursive partitioning of a logistic regression model
> ## load data
> data("PimaIndiansDiabetes", package = "mlbench")
> ## partition logistic regression diabetes ~ glucose
> ## wth respect to all remaining variables
> fmPID <- mob(diabetes ~ glucose | pregnant + pressure + triceps +
+ insulin + mass + pedigree + age,
+ data = PimaIndiansDiabetes, model = glinearModel,
+ family = binomial())
> ## fitted model
> coef(fmPID)
(Intercept) glucose
2 -9.951510 0.05870786
4 -6.705586 0.04683748
5 -2.770954 0.02353582
> plot(fmPID, main = "Pima Indians Diabetic Data (n = 768)")
party package
Figure: Pima Indians Diabetic Data (n = 768). mob partitions the logistic regression diabetes ~ glucose: the root split on mass (≤ 26.3 vs. > 26.3, p < 0.001) gives Node 2 (n = 167); mass > 26.3 splits on age (≤ 30 vs. > 30, p < 0.001) into Node 4 (n = 304) and Node 5 (n = 297). Each terminal node plots the fitted pos/neg probability against glucose.
RWeka package
RWeka package Overview
Weka (http://www.cs.waikato.ac.nz/ml/weka/) offers a collection of machine learning algorithms for data mining
Weka is written in Java
Tree learners offered by Weka: C4.5, Naive Bayes trees, M5, logistic model tree
The RWeka package provides an R interface to Weka
RWeka package
> library(RWeka)
> w1 <- J48(survived ~ ., data=Titanic,
+ control = Weka_control(R = TRUE, B= TRUE))
> plot(w1, main="Tree by Weka J48 Model - Titanic Data")
> w1 ## print J48 model
J48 pruned tree
------------------
sex = female: yes (257.0/64.0)
sex != female
| age <= 9.0
| | pclass = 3: No (21.0/10.0)
| | pclass != 3: yes (9.0)
| age > 9.0: No (411.0/73.0)
Number of Leaves : 4
Size of the tree : 7
> summary(w1) ## Summary of J48 model
=== Summary ===
RWeka package
Correctly Classified Instances 829 79.2543 %
Incorrectly Classified Instances 217 20.7457 %
Kappa statistic 0.5667
Mean absolute error 0.3244
Root mean squared error 0.4028
Relative absolute error 67.1335 %
Root relative squared error 81.9435 %
Coverage of cases (0.95 level) 100 %
Mean rel. region size (0.95 level) 99.3308 %
Total Number of Instances 1046
=== Confusion Matrix ===
a b <-- classified as
523 96 | a = No
121 306 | b = yes
RWeka package
Figure: Tree by Weka J48 Model − Titanic Data. Root split on sex (female: yes, 257.0/64.0); males split on age (≤ 9 vs. > 9); age ≤ 9 splits on pclass (3: No, 21.0/10.0; {1, 2}: yes, 9.0); age > 9: No (411.0/73.0).
evtree package
evtree package Overview
Global optimization
Evolutionary algorithm
Classification and regression tree
Use partykit for tree structure
Computationally demanding
evtree package
> library(evtree)
> iris.ev <-evtree(Species ~ ., data=iris) ## evtree
> iris.ev
Model formula:
Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
Fitted party:
[1] root
| [2] Petal.Width < 1: setos (n = 50, err = 0.0%)
| [3] Petal.Width >= 1
| | [4] Petal.Length < 5
| | | [5] Petal.Width < 1.7: versi (n = 47, err = 0.0%)
| | | [6] Petal.Width >= 1.7: virgi (n = 7, err = 14.3%)
| | [7] Petal.Length >= 5: virgi (n = 46, err = 4.3%)
Number of inner nodes: 3
Number of terminal nodes: 4
> plot(iris.ev, main = "Iris data using evtree")
evtree package
Figure: Iris data using evtree. Root split on Petal.Width (< 1 vs. >= 1): Node 2 (n = 50). Petal.Width >= 1 splits on Petal.Length (< 5 vs. >= 5); < 5 splits on Petal.Width (< 1.7 vs. >= 1.7) into Node 5 (n = 47) and Node 6 (n = 7); >= 5 gives Node 7 (n = 46). Terminal nodes show class proportions (setos/versi/virgi).
mvpart package
> ##~ Use the mvpart function to fit a multiple response model
> ##~ Automobile Data from 'Consumer Reports' 1990 (n = 49 cars)
>
> library(mvpart)
> ## Data set up
> data(car.test.frame) ## Consumer Reports car data in mvpart package
> car <- na.omit(car.test.frame) # use a short name
> head(car) # display a few records
Price Country Reliability Mileage Type Weight Disp. HP
Eagle Summit 4 8895 USA 4 33 Small 2560 97 113
Ford Escort 4 7402 USA 2 33 Small 2345 114 90
Ford Festiva 4 6319 Korea 4 37 Small 1845 81 63
Honda Civic 4 6635 Japan/USA 5 32 Small 2260 91 92
Mazda Protege 4 6599 Japan 5 32 Small 2440 113 103
Mercury Tracer 4 8672 Mexico 4 26 Small 2285 97 82
> car <- cbind(as.data.frame(scale(car[, c(1,3:4,6:7)])),
+ car[, c(2,5)]) # rescale 5 dependent variables
> # fit and display a tree using "mvpart"
> car.mv <- mvpart(data.matrix(car[, 1:5]) ~ Country + Type,
+ data = car, uniform = TRUE, prn = TRUE, all.leaves = TRUE)
rpart(formula = form, data = data)
mvpart package
Variables actually used in tree construction:
[1] Country Type
Root node error: 240/49 = 4.898
n= 49
CP nsplit rel error xerror xstd
1 0.356769 0 1.00000 1.04325 0.119931
2 0.142423 1 0.64323 0.73462 0.099250
3 0.083626 2 0.50081 0.59171 0.072715
4 0.068475 3 0.41718 0.57669 0.073553
5 0.024982 4 0.34871 0.49260 0.065112
> # PCA biplot of 5 group means (leaves)
> rpart.pca(car.mv, wgt.ave = FALSE)
mvpart package
Figure: Multivariate tree on the car data, splitting on Type and Country. Node labels give deviance : n (root 251 : n=49). The first split separates Type=Small (19.5 : n=12) from the rest (142 : n=37); further splits on Country (USA vs. Japan/Japan-USA/Sweden) and Type (Compact/Medium/Sporty vs. Large/Van, etc.) yield leaves 3.62 : n=5, 26.4 : n=11, 8.35 : n=5, 16.8 : n=11, 6.63 : n=5. Error : 0.349, CV Error : 0.521, SE : 0.0694.
mvpart package
Figure: PCA biplot of the five group means (Dim 1: 77.99 % [0.892]; Dim 2: 19.88 % [0.8]), with variable vectors for Price, Reliability, Mileage, Weight, and Disp.
partykit package
partykit - a toolkit for tree infrastructure in R
Represent tree models (objects)
Summarize results
Visualize tree structure
Read/coerce tree models from other sources (rpart, RWeka, PMML)
Offer standard methods for tree manipulation (print, plot, predict . . . )
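For example, an rpart fit can be coerced into partykit's unified representation, which then supplies the standard methods. A sketch (assumes the partykit package is installed):

```r
## Coerce an rpart tree to a "party" object and reuse partykit's methods.
library(rpart)
library(partykit)

fit  <- rpart(Species ~ ., data = iris)
pfit <- as.party(fit)     # same tree, partykit infrastructure
print(pfit)               # partykit-style rule listing
width(pfit)               # number of terminal nodes
```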
Overview
Data import
foreign R package that reads data stored by Minitab, S, SAS, SPSS, Stata, Systat, dBase, . . .
ASCII file read.table statement
manual The official manual R Data Import/Export (http://cran.wustl.edu/doc/manuals/R-data.pdf)
database See the Relational databases section of that manual
Excel See the Reading Excel spreadsheets section of that manual
Overview
Data inspection
Common R statements
summary Summary statistics
table Frequency or crosstab
hist Histogram
str Display object structure
head Display a few records
dsn[i:j, m:n] Display rows i-j and columns m-n
describe A function in package Hmisc
. . .
Overview
Data manipulation
Statements for recoding
ifelse Conditional statement
as.factor Coerce to a (nominal) factor data type
as.ordered Coerce to an ordinal factor data type
cut Cut numerical data into categorical variables(factor)
as.data.frame Coerce into a data frame
apply By rows or columns: apply, lapply, sapply, by, aggregate, . . .
. . .
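A tiny recoding sketch combining the statements above on the built-in airquality data (the new variable names are illustrative):

```r
## ifelse, cut, and as.factor in one short recoding pass.
air <- na.omit(airquality)
air$Hot      <- ifelse(air$Temp > 80, "yes", "no")          # ifelse
air$Temp.Cat <- cut(air$Temp, breaks = c(-Inf, 72, 85, Inf),
                    labels = c("Low", "Mid", "High"))       # cut -> factor
air$Month    <- as.factor(air$Month)                        # nominal factor
str(air[, c("Hot", "Temp.Cat", "Month")])
```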
Overview
Data manipulation
Statements data subsetting
dsn[1:n, ] Row indexing
dsn[, 10:6 ] Column indexing
subset Select record, rows and columns
. . .
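The same air quality data illustrates the subsetting statements (illustrative sketch):

```r
## Row indexing, column indexing, and subset() in one pass.
air <- na.omit(airquality)
air[1:5, ]                                               # row indexing
head(air[, c("Ozone", "Temp")])                          # column indexing
hot <- subset(air, Temp > 85, select = c(Ozone, Temp))   # rows + columns
```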
Example
> ## ----- Example: Titanic Passenger Data
> Titanic <- read.excel("C:\\Project\\MyTree\\titanic3.xlsx", "titanic3")
> str(Titanic)
'data.frame': 1310 obs. of 14 variables:
$ pclass : num 1 1 1 1 1 1 1 1 1 1 ...
$ survived : num 1 1 0 0 0 1 1 0 1 0 ...
$ name : chr "Allen, Miss. Elisabeth Walton" "Allison, Master. Hudson Trevor" "Allison, Miss. Helen Loraine" "Allison, Mr. Hudson Joshua Creighton" ...
$ sex : chr "female" "male" "female" "male" ...
$ age : num 29 0.917 2 30 25 ...
$ sibsp : num 0 1 1 1 1 0 1 0 2 0 ...
$ parch : num 0 2 2 2 2 0 0 0 0 0 ...
$ ticket : chr "24160" "113781" "113781" "113781" ...
$ fare : num 211 152 152 152 152 ...
$ cabin : chr "B5" "C22 C26" "C22 C26" "C22 C26" ...
$ embarked : chr "S" "S" "S" "S" ...
$ boat : chr "2" "11" NA NA ...
$ body : chr NA NA NA "135" ...
$ home#dest: chr "St Louis, MO" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" ...
> ds <- subset(Titanic, select = c(survived, sex, age, pclass)) ## select variables
> sort(colSums(is.na(ds))) ## check missing
survived sex pclass age
1 1 1 264
> ds <- na.omit(ds) # use only complete records
> str(ds)
'data.frame': 1046 obs. of 4 variables:
$ survived: num 1 1 0 0 0 1 1 0 1 0 ...
$ sex : chr "female" "male" "female" "male" ...
$ age : num 29 0.917 2 30 25 ...
$ pclass : num 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "na.action")=Class 'omit' Named int [1:264] 16 38 41 47 60 70 71 75 81 107 ...
.. ..- attr(*, "names")= chr [1:264] "16" "38" "41" "47" ...
> # change to factor
> ds$survived <- factor(ds$survived, labels = c("No", "yes"))
> ds$sex <- as.factor(ds$sex)
> ds$pclass <- as.factor(ds$pclass)
> # run ctree
> # cf <- ctree(survived ~ ., data = ds, controls =
> # ctree_control(maxdepth = 3, mincriterion = 0.95, minbucket = 20))
>
> #plot(cf, main = "Titanic Passengers (n = 1046)")
>
> Titanic <- ds
> save(Titanic, file = "C:\\Project\\MyTree\\Titanic.RData") ## save R data
> #load("C:\\Project\\MyTree\\Titanic.RData") ## load data next time
Advantages
Requires little statistics background
Modest data requirements (even allows missing data!)
Captures nonlinear relationships
Accommodates interactions
Runs very fast
Good interpretability and visualization
A convenient method for data segmentation
Limitations
Less stable (or reproducible)
Uses only a limited number of variables
Lacks parametric information (such as variable importance)
Typically requires a relatively large sample size
Not a good way to identify variable importance
Thank you for attending!