Post on 09-Aug-2020
transcript
Classification Trees and MARS
STA450S/4000S: Topics in Statistics. Statistical Aspects of Data Mining
Ana-Maria Staicu
Classification Trees and MARS – p. 1/19
Recap: Regression Trees
CART (Classification and Regression Trees) is a method developed by Breiman, Friedman, Olshen and Stone to classify data on the basis of some of the variables. It is also known as Recursive Partitioning.
Basic idea: construct a tree that separates the data in the "best" way by finding binary splits on the variables; at each stage, find the best splitting variable and the best splitting point. The routine is recursive. Usually the process stops when some minimum node size, say 5 observations per node, is reached.
Once the tree has been grown, a cost-complexity criterion is used to prune it. The tuning parameter α governs the tradeoff between tree size and its goodness of fit to the data.
Classification Trees
For trees, R uses either the package tree or rpart.
The target variable Y takes values 1, 2, . . . , K.
One basic difference between classification and regression trees is the action taken at the splits:
Regression tree: we try to minimize the sum of squared errors between the true values and the "predicted" values (the "predicted" value is the mean of all responses on either side of the split).
Classification tree: we try to minimize a measure of node impurity (a loss function):
Node impurity measures:
Misclassification error: (1/N_m) ∑_{i∈R_m} I(y_i ≠ k(m)), where k(m) = argmax_k p̂_mk.
Note p̂_mk = (1/N_m) ∑_{i∈R_m} I(y_i = k) is the proportion of class k observations in node m.
Gini index: ∑_{k≠k′} p̂_mk p̂_mk′ = ∑_{k=1}^K p̂_mk (1 − p̂_mk).
Cross-entropy or deviance: −∑_{k=1}^K p̂_mk log p̂_mk.
When growing the tree: choose either the Gini index or cross-entropy. One reason is their differentiability. The Gini index is the default in R.
When pruning the tree: any of the three can be used; misclassification error is typical.
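The three impurity measures can be checked numerically from the class proportions in a node. A minimal Python sketch (the function name is mine, not from the slides):

```python
import numpy as np

def node_impurities(y, n_classes):
    """Misclassification error, Gini index, and cross-entropy for the
    integer class labels y falling in one node."""
    y = np.asarray(y)
    # p_hat[k] = proportion of class k observations in the node
    p_hat = np.bincount(y, minlength=n_classes) / len(y)
    misclass = 1.0 - p_hat.max()           # 1 - p_hat of the majority class
    gini = np.sum(p_hat * (1.0 - p_hat))   # sum_k p_hat_k (1 - p_hat_k)
    nz = p_hat[p_hat > 0]                  # drop zero proportions: 0 log 0 := 0
    entropy = -np.sum(nz * np.log(nz))     # -sum_k p_hat_k log p_hat_k
    return misclass, gini, entropy
```

For a node with labels [0, 0, 0, 1], the proportions are (0.75, 0.25), giving misclassification error 0.25 and Gini index 0.375.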
§9.2.4 Other Issues
Handling Unordered Inputs
If an input X_j has q ordered possible values, there are q − 1 possible partitions into 2 groups.
If an input X_j is categorical, with q unordered possible values, there are 2^(q−1) − 1 possible partitions into 2 groups.
Solution (for a 0-1 or quantitative outcome): order the predictor classes according to the proportion falling in outcome class 1, then split the predictor X_j as if its values were ordered. This gives the optimal split in terms of squared error or the Gini index. See Breiman et al., Classification and Regression Trees.
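The ordering trick above reduces the 2^(q−1) − 1 candidate partitions to q − 1. A Python sketch for a 0-1 outcome (function name and representation of a split as a "left-hand" category set are mine):

```python
import numpy as np

def ordered_category_splits(x_cat, y):
    """For a categorical predictor x_cat and a 0-1 outcome y, order the
    q categories by the proportion of y == 1 within each category; the
    candidate binary splits are then the q - 1 prefixes of this ordering."""
    cats = np.unique(x_cat)
    props = np.array([np.mean(y[x_cat == c]) for c in cats])
    order = cats[np.argsort(props)]
    # split i sends the first i categories to one child, the rest to the other
    return [set(order[:i]) for i in range(1, len(order))]
```

For categories a, b, c with class-1 proportions 1.0, 0.0, 0.5, the ordering is b < c < a and the two candidate splits are {b} vs {c, a} and {b, c} vs {a}.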
§9.2.4 The Loss Matrix
The cost of misclassifying an observation may vary with the classes involved.
Define a K × K loss matrix L, where L_kk′ is the loss for misclassifying a class-k observation as class k′. Evidently L_kk = 0.
How to incorporate the losses into the modeling process?
Case K = 2: weight observations in class 1 by L_12 and observations in class 2 by L_21.
Case K > 2: if L_kk′ is a function only of k, not of k′, weight observations in class k by that loss. In a terminal node m, the class assigned is k(m) = argmin_k ∑_l L_lk p̂_ml. To incorporate the loss into the growing process, modify the Gini index to ∑_{k≠k′} L_kk′ p̂_mk p̂_mk′.
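The loss-modified Gini index and the loss-aware node assignment above can be sketched in a few lines of Python (function names are mine; the formulas are the ones on the slide):

```python
import numpy as np

def weighted_gini(p_hat, L):
    """Loss-modified Gini index sum_{k != k'} L_{kk'} p_mk p_mk'.
    Since diag(L) = 0, the k == k' terms vanish and this is just p^T L p."""
    p_hat = np.asarray(p_hat, dtype=float)
    L = np.asarray(L, dtype=float)
    return float(p_hat @ L @ p_hat)

def assign_class(p_hat, L):
    """Class assigned to a terminal node: k(m) = argmin_k sum_l L_{lk} p_ml."""
    return int(np.argmin(np.asarray(p_hat, dtype=float) @ np.asarray(L, dtype=float)))
```

With proportions (0.6, 0.4) and L_21 = 5 (misclassifying class 2 is five times as costly as misclassifying class 1), the node is assigned class 2 even though class 1 is the majority; with symmetric losses it is assigned class 1.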
§9.2.4 Missing Predictor Values
In general, two approaches:
1) discard the observations with missing values;
2) impute the missing values, e.g. by the mean of the predictor over the non-missing observations.
Tree-based methods:
1) make a new category "NA" for the missing values of a categorical predictor;
2) use surrogate variables.
At any split, alternative splitting variables and corresponding splitting points are determined when building the model. The first surrogate split is the one that best mimics the split of the training data achieved by the primary split, and so on. Surrogate splits are used, in order, when the primary splitting predictor is absent.
Binary splits
Multi-way splits would fragment the data too quickly, leaving insufficient data at the next level down. A multi-way split can always be expressed as a series of binary splits.
Linear combination splits
Choose a split of the form ∑_j a_j X_j ≤ c instead of the form X_j ≤ c. Consequences:
1) it improves the predictive power of the tree;
2) it reduces its interpretability.
Alternative: HME (hierarchical mixtures of experts).
Advantages
Trees are easy to interpret.
Trees can handle multicollinearity.
The tree method is nonparametric (essentially assumption-free).
Disadvantages
High variance, caused by the hierarchical nature of the process: an error in the top split is propagated down to all of the splits below it. Even a more stable split criterion does not remove this instability.
Lack of smoothness of the predictor surface (MARS alleviates this).
Difficulty in modeling additive structure (MARS captures this).
Some code for trees
library(MASS)
library(rpart)
cpus.rp <- rpart(log10(perf) ~ ., cpus[, 2:8], cp = 1e-3)
summary(cpus.rp)
printcp(cpus.rp)
# Regression tree:
# rpart(formula = log10(perf) ~ ., data = cpus[, 2:8], cp = 0.001)
# Variables actually used in tree construction:
# [1] cach chmax chmin mmax syct
# Root node error: 43.116/209 = 0.20629
#          CP nsplit rel error  xerror     xstd
# 1 0.5492697      0   1.00000 1.02128 0.098176
# 2 0.0893390      1   0.45073 0.47818 0.048282
cpus.rp.pr <- prune(cpus.rp, cp = 0.006)
post(cpus.rp.pr, title = "Plot of rpart object cpus.rp.pr",
     filename = "C:\\AM\\CpusTree.eps", horizontal = FALSE, pointsize = 8)
[Figure: "Plot of rpart object cpus.rp.pr" — the pruned regression tree for the cpus data (n = 209 at the root), splitting on cach, mmax, syct and chmin; each terminal node shows the mean of log10(perf) and its node size.]
Some code for trees
library(MASS)
library(tree)
fgl.tr <- tree(type ~ ., fgl)
summary(fgl.tr)
# Classification tree: tree(formula = type ~ ., data = fgl)
# Number of terminal nodes: 20
# Residual mean deviance: 0.6853 = 133 / 194
# Misclassification error rate: 0.1542 = 33 / 214
fgl.tr1 <- snip.tree(fgl.tr, nodes = 9)
# The nodes could be snipped off interactively, by clicking with
# the mouse on a terminal node: fgl.tr1 <- snip.tree(fgl.tr)
fgl.cv <- cv.tree(fgl.tr, , FUN = prune.tree, K = 10)
# The loop below averages over repeated random divisions of the training set.
for (i in 2:5)
  {fgl.cv$dev <- fgl.cv$dev + cv.tree(fgl.tr, , prune.tree)$dev}
fgl.cv$dev <- fgl.cv$dev / 5
plot(fgl.cv)
title("Cross-validation plot for pruning")
[Figure: "Pruning: choosing parameter cp" — X-val relative error versus tree size (1 to 16) and cp (Inf down to 0.0012) for the cpus tree; and "Cross-validation plot for pruning" — deviance versus tree size (up to 20) for the fgl tree.]
§9.4 MARS
For the regression tree process, the data were partitioned in the way that produced the "best" split with reference to the deviations from the mean on either side of the split.
For MARS, a similar process is used to find the best split, with reference to the deviations from a spline function on either side of the split.
There are commercial versions with an interface to R on Jerome Friedman's home page.
Friedman, J. (1991): Multivariate Adaptive Regression Splines. Annals of Statistics, 19:1, 1-141.
A free version comes with the package mda.
The spline functions used by MARS are:
(X − t)_+ = x − t if x > t, 0 otherwise; and (t − X)_+ = t − x if x < t, 0 otherwise.
Each function is piecewise linear. By multiplying these splines together it is possible to produce quadratic or cubic curves. The pair of functions (X − t)_+, (t − X)_+ is called a reflected pair, while t is called a knot.
MARS uses the collection of basis functions
C = {(X_j − t)_+, (t − X_j)_+}, with t ∈ {x_1j, . . . , x_Nj}, j = 1, 2, . . . , p.
Recall: a regression tree uses the basis functions I(X_j > c) and I(X_j ≤ c).
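A reflected pair is one line of code; a minimal Python sketch (function name is mine):

```python
import numpy as np

def reflected_pair(x, t):
    """The MARS reflected pair of piecewise-linear basis functions
    with knot t, evaluated at x: (x - t)_+ and (t - x)_+."""
    return np.maximum(x - t, 0.0), np.maximum(t - x, 0.0)
```

For x = (1, 2, 3) and knot t = 2, the pair evaluates to (0, 0, 1) and (1, 0, 0): each function is zero on one side of the knot and linear on the other.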
The model is of the form
f(X) = β_0 + ∑_{m=1}^M β_m h_m(X)   (1)
where each h_m(X) is in C or is a product of functions in C. M = {h_0(X), . . . , h_M(X)} is the set of all functions included in the model.
How would we build the model if the model functions were known? If the functions h_m(X) were known, the coefficients β_0, . . . , β_M would be determined by minimizing the residual sum of squares. The model-building strategy is similar to stepwise linear regression, using functions of the form h_m(X) instead of the original inputs.
MARS Model Functions
Step 1
Start with h_0(X) = 1; f̂^(1) = β̂_0^(1). M^(1) = {h_0(X)}.
Step 2
Add to the model a function of the form b_1 (X_j − t)_+ + b_2 (t − X_j)_+, with t ∈ {x_1j, . . . , x_Nj}, that produces the largest decrease in training error. Say this is achieved by j = J and t = x_kJ.
Model: f̂^(2) = β̂_0^(2) + β̂_1^(2) (X_J − x_kJ)_+ + β̂_2^(2) (x_kJ − X_J)_+.
M^(2) = {h_0(X), h_1(X), h_2(X)}, with h_1(X) = (X_J − x_kJ)_+, etc.
Step m + 1
Add to the model a function of the form b_{2m−1} h_l(X)(X_j − t)_+ + b_{2m} h_l(X)(t − X_j)_+ with h_l(X) ∈ M^(m) that produces the largest decrease in training error. Say this is achieved by j = J′, t = x_k′J′ and l = L.
M^(m+1) = M^(m) ∪ {h_{2m−1}(X), h_{2m}(X)}, where h_{2m−1} = h_L(X)(X_J′ − x_k′J′)_+ and h_{2m} = h_L(X)(x_k′J′ − X_J′)_+.
The algorithm stops when the model set contains some preset number of terms.
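The forward pass can be illustrated with a toy Python sketch. This is a simplification, not Friedman's full algorithm: it adds main-effect reflected pairs only (no products of basis functions), and it refits by brute-force least squares at every candidate knot.

```python
import numpy as np

def mars_forward(X, y, max_terms=5):
    """Toy MARS forward pass, main effects only. At each step, try every
    reflected pair (X_j - t)_+, (t - X_j)_+ with t at an observed value of
    X_j, and keep the pair giving the largest decrease in training RSS."""
    N, p = X.shape
    B = np.ones((N, 1))                   # model matrix; h_0(X) = 1
    terms = [("intercept", None, None)]   # (kind, predictor index, knot)
    while B.shape[1] < max_terms:
        best = None
        for j in range(p):
            for t in np.unique(X[:, j]):
                pair = np.column_stack([np.maximum(X[:, j] - t, 0.0),
                                        np.maximum(t - X[:, j], 0.0)])
                Bc = np.column_stack([B, pair])
                beta, *_ = np.linalg.lstsq(Bc, y, rcond=None)
                rss = np.sum((y - Bc @ beta) ** 2)
                if best is None or rss < best[0]:
                    best = (rss, j, t, pair)
        _, j, t, pair = best
        B = np.column_stack([B, pair])
        terms += [("hinge+", j, t), ("hinge-", j, t)]
    beta, *_ = np.linalg.lstsq(B, y, rcond=None)
    return terms, beta
```

On data generated as y = (x − 0.5)_+ with x observed on a grid containing 0.5, the first reflected pair is placed at the true knot 0.5, since only a basis function with a kink there can fit the kink in y exactly.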
At the end of this process we have a large model of the form (1), which most probably overfits the data, so a backward deletion procedure is applied.
At each stage, the term whose removal causes the smallest increase in the residual sum of squares is deleted from the model.
The tuning parameter λ governs the tradeoff between the size of the model and its goodness of fit to the data. The optimal value of λ is estimated by the generalized cross-validation criterion:
GCV(λ) = ∑_{i=1}^N (y_i − f̂_λ(x_i))² / (1 − M(λ)/N)²
M(λ) is the effective number of parameters used in the model; namely the number of terms in the model plus the number of parameters used to select the optimal positions of the knots (3 parameters per knot).
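The GCV criterion is a one-line computation once the fitted values and the effective number of parameters M(λ) are in hand. A minimal Python sketch of the formula above (function name is mine):

```python
import numpy as np

def gcv(y, y_hat, m_lambda):
    """Generalized cross-validation criterion
    GCV(lambda) = sum_i (y_i - f_hat_lambda(x_i))^2 / (1 - M(lambda)/N)^2,
    where m_lambda = M(lambda) is the effective number of parameters."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    N = len(y)
    rss = np.sum((y - y_hat) ** 2)  # residual sum of squares
    return float(rss / (1.0 - m_lambda / N) ** 2)
```

The denominator inflates the RSS as the effective model size grows, so models that fit only slightly better at the cost of many more effective parameters are penalized.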
Advantages:
Using piecewise linear basis functions, the regression surface is built up parsimoniously.
MARS is not computationally intensive: for the piecewise linear functions, the reflected pair with the rightmost knot is fitted first, and the knot is then moved successively one position at a time to the left.
Limitations:
Hierarchical (forward) modeling strategy: the philosophy is that a higher-order interaction will likely exist only if some of its "footprints" (lower-order terms) exist as well.
Restriction in the formation of model terms: each input can appear at most once in a product.