Decision Trees
Sewoong Oh
CSE/STAT 416, University of Washington
Decision trees: An interpretable predictor
Example: predicting potential loan defaults
• Data: discrete features (for now)
• Goal: given a new loan application, predict whether the applicant will default on the loan
• Learning: fitting a model to data
• Inference: making predictions using a learned model
Credit      Term    Income  y
excellent   3 yrs   high    safe
fair        5 yrs   low     risky
fair        3 yrs   high    safe
poor        5 yrs   high    risky
excellent   3 yrs   low     risky
fair        5 yrs   low     safe
poor        3 yrs   high    risky
poor        5 yrs   low     safe
fair        3 yrs   high    safe
The features correspond to the questions asked about each applicant:
• Did I pay previous loans on time? Example values: excellent, good, or fair
• What's my income? Example: > $80K per year
• How soon do I need to pay the loan? Example: 3 years, 5 years, …
predictor f(x) ≈ y
Decision tree
f(poor credit, high income, 3 years) = ?
[Figure: an example decision tree. Starting from the root, Credit is tested first: the excellent branch leads directly to a Safe leaf, the fair branch tests Term (3 years vs. 5 years), and the poor branch tests Income, with the high-income branch tested further on Term. Leaves are labeled Safe or Risky.]
• Each internal node tests a feature x[i]
• Each branch assigns a feature value x[i] = fair (or a subset of feature values, e.g. {fair, poor})
• Each leaf node assigns a class y
• To predict, traverse the tree from the root to a leaf
• Decision trees are naturally human interpretable!
What functions can be represented?
• For discrete input and output data, any function of the input can be represented as a decision tree
• However, in general, it can require exponentially many nodes to represent an arbitrary function (exponential in the dimension of the input)
• For example, a function that is sensitive to a small change in the input, such as the parity function below
Parity function (Y = 1 if an odd number of inputs are 1):

x[1]  x[2]  x[3]  Y
0     0     0     0
0     0     1     1
0     1     0     1
1     0     0     1
0     1     1     0
1     0     1     0
1     1     0     0
1     1     1     1

Representing this function requires a full tree: split on x[1], then on x[2] in both branches, then on x[3] in all four branches, giving eight leaves labeled 0, 1, 1, 0, 1, 0, 0, 1.

Simple function (Y depends only on x[1]):

x[1]  x[2]  x[3]  Y
0     0     0     1
0     0     1     1
0     1     0     1
1     0     0     0
0     1     1     1
1     0     1     0
1     1     0     0
1     1     1     0

This function is represented by a single split on x[1], with leaf 1 for x[1] = 0 and leaf 0 for x[1] = 1.
Which tree is better?
• Trade-off between accuracy and simplicity
• Accuracy is measured by

    accuracy = (# of correct predictions) / (# of examples)

• Simplicity can be measured by depth, number of leaves, etc.
• If you have just the root node for your decision tree, what should the decision be, and what are the accuracy and the error?
Visualizing a decision tree and data

[Figure: the training data (all N = 40 examples) contain 22 Safe loans and 18 Risky loans; the root node records these counts as (22, 18), shown throughout as (# of Safe loans, # of Risky loans).]
Decision stump: single-level tree
• We grow the tree by adding one more level of branching, and deciding which hypothesis (or feature) to test at the branch
• At an intermediate node, the prediction is determined by the majority rule
• In a greedy approach, we choose the hypothesis that gives better accuracy at the intermediate nodes; credit: 32/40 = 0.8, term: 30/40 = 0.75 (see the sketch after the figure below)
[Figure: two candidate decision stumps on the 40 training examples, counts shown as (Safe, Risky).
Choice 1: split on Credit: excellent (9, 0) → Safe, fair (9, 4) → Safe, poor (4, 14) → Risky.
Choice 2: split on Term: 3 years (16, 4) → Safe, 5 years (6, 14) → Risky.
The root is an internal node (it branches on a feature); the branch endpoints, not yet expanded, are intermediate nodes.]
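As a concrete illustration (not part of the original slides), here is a minimal Python sketch of the majority-rule accuracy computation behind the greedy choice; the counts are taken from the figure above:

def stump_accuracy(branch_counts):
    """Accuracy of a one-level split, given (safe, risky) counts per branch;
    each branch predicts its majority class."""
    correct = sum(max(safe, risky) for safe, risky in branch_counts)
    total = sum(safe + risky for safe, risky in branch_counts)
    return correct / total

# Counts from the slide:
print(stump_accuracy([(9, 0), (9, 4), (4, 14)]))   # split on Credit -> 32/40 = 0.80
print(stump_accuracy([(16, 4), (6, 14)]))          # split on Term   -> 30/40 = 0.75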
Greedy algorithm for growing a decision tree
• Start with the root node as an intermediate node
• Repeat while there exists an intermediate node:
  • Choose a feature x[i] to split at the intermediate node that maximizes the accuracy
  • Change the intermediate node into an internal node branching on x[i]
  • Add intermediate nodes to each branch
  • If an intermediate node meets the stopping rule, change it to a leaf node and make a prediction
• Stopping rule:
  • 1. Do not branch if, at that intermediate node, all data have the same label (perfect prediction)
  • 2. Do not branch if there is no feature left to branch on
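A minimal Python sketch of this greedy procedure (illustrative, not the course's reference implementation); it assumes the data is a list of dicts mapping feature names to discrete values, with the label stored under the key "y":

from collections import Counter

def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]

def grow_tree(data, features, target="y"):
    """Greedy growth: stop if all labels agree or no features remain;
    otherwise branch on the feature that maximizes training accuracy."""
    labels = [row[target] for row in data]
    if len(set(labels)) == 1 or not features:      # stopping rules 1 and 2
        return {"predict": majority(labels)}       # leaf node

    def split_accuracy(feature):
        groups = {}
        for row in data:
            groups.setdefault(row[feature], []).append(row[target])
        correct = sum(Counter(g).most_common(1)[0][1] for g in groups.values())
        return correct / len(data)

    best = max(features, key=split_accuracy)
    remaining = [f for f in features if f != best]
    children = {
        value: grow_tree([row for row in data if row[best] == value], remaining, target)
        for value in {row[best] for row in data}
    }
    return {"split": best, "children": children}   # internal node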
Greedy approach

[Figure: the stump splitting on Credit: excellent (9, 0), fair (9, 4), poor (4, 14).]
The excellent branch becomes a Safe leaf: all of its data points are Safe, so there is nothing else to do with this subset of data.
Greedy approach (continued)

[Figure: the same stump, with the excellent branch now a Safe leaf.]
Next, build a decision stump with the subset of data where Credit = fair, and another with the subset where Credit = poor.
End of second level

[Figure: the tree after the second level of splits, counts shown as (Safe, Risky).
Credit = excellent (9, 0) → Safe leaf.
Credit = fair → split on Term: 3 years (0, 4) → Risky, 5 years (9, 0) → Safe.
Credit = poor → split on Income: low (0, 9) → Risky, high (4, 5) → build another stump with these data points.]
Final decision tree
• Branching only increases the accuracy (and decreases the error)
• The final accuracy is 37/40
• Why not branch further?

[Figure: the final tree, counts shown as (Safe, Risky).
Credit = excellent (9, 0) → Safe.
Credit = fair → Term: 3 years (0, 4) → Risky, 5 years (9, 0) → Safe.
Credit = poor → Income: low (0, 9) → Risky; high (4, 5) → Term: 3 years (0, 2) → Risky, 5 years (4, 3) → Safe.]
Another potential early stopping rule
• If a branching does not increase accuracy, should we still branch? Yes.

XOR example:

x[1]   x[2]   y
FALSE  FALSE  FALSE
FALSE  TRUE   TRUE
TRUE   FALSE  TRUE
TRUE   TRUE   FALSE

The root node contains 2 examples of each class, so its classification error is 0.5. Splitting on x[1] sends (1, 1) to each branch, so the majority-rule error is still 0.5. But splitting each branch again on x[2] makes every leaf pure, and the error drops to 0.

Tree                 Classification error
(root)               0.5
Split on x[1]        0.5
Split on x[1], x[2]  0
Decision trees on real-valued data

Binary branching on real-valued data
Income   Credit     Term   y
$105K    excellent  3 yrs  Safe
$112K    good       5 yrs  Risky
$73K     fair       3 yrs  Safe
$69K     excellent  5 yrs  Safe
$217K    excellent  3 yrs  Risky
$120K    good       5 yrs  Safe
$64K     fair       3 yrs  Risky
$340K    excellent  5 yrs  Safe
$60K     good       3 yrs  Risky
[Figure: a stump branching on Income at $60K: the root (22 Safe, 18 Risky) splits into Income < $60K (8, 13) and Income >= $60K (14, 5); the right branch holds the subset of data with Income >= $60K.]
• Is there any gain in ternary or k-ary branching?
• Suppose we want to branch on a real-valued feature x[i]; how do we choose the threshold?
• Between two data points, it does not make any difference where you split
• So there are really only a finite number of candidate thresholds
• We choose the one with the lowest error (or maximum accuracy); a sketch is given after the figure below
[Figure: Safe and Risky examples plotted along the Income axis (from $10K to $120K); two candidate thresholds vA and vB falling between the same pair of adjacent data points induce exactly the same split.]
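A small Python sketch of this threshold search (illustrative; the function name and the toy income values below are my own):

from collections import Counter

def best_threshold(values, labels):
    """Candidate thresholds are midpoints between consecutive (distinct) sorted
    values; pick the one whose two-way split makes the fewest mistakes under
    majority-rule predictions on each side."""
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:]) if a != b]

    def num_errors(t):
        errors = 0
        for side in ([y for v, y in pairs if v < t], [y for v, y in pairs if v >= t]):
            if side:  # mistakes = points not in this side's majority class
                errors += len(side) - Counter(side).most_common(1)[0][1]
        return errors

    return min(candidates, key=num_errors)

# e.g. incomes in $K with their labels:
print(best_threshold([60, 64, 69, 73, 105], ["Risky", "Risky", "Safe", "Safe", "Safe"]))  # 66.5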
Growing a decision tree for real-valued data

[Figure: training data plotted as Income vs. Age. Step 1: the root (counts 9, 8) is split on age at 38, and the age < 38 branch (counts 5, 3) becomes a leaf. Step 2: the age >= 38 branch (counts 4, 5) is split on income at $48K into pure leaves (4, 0) and (0, 5). The resulting tree predicts Safe in one region of the (Age, Income) plane and Risky in the other, with axis-aligned boundaries.]
Class is 31 body parts: LU/RU/LW/RW head, neck, L/R shoulder, LU/RU/LW/RW arm, L/R elbow, L/R wrist, L/R hand, LU/RU/LW/RW torso, LU/RU/LW/RW leg, L/R knee, L/R ankle, L/R foot (Left, Right, Upper, loWer).
[Figure: the features and the learned decision tree for this body-part classification task.]
Classification with decision trees
• Data: points with real-valued features x[1], x[2] and labels y ∈ {−, +} (shown in the figure; counts below are listed as (−, +))

[Figure: the root node (18, 13) is split on x[1] at −0.07 into x[1] < −0.07 (13, 3) and x[1] >= −0.07 (4, 11).]

[Figure: growing further, the x[1] < −0.07 branch is split again on x[1] at −1.66 into (7, 0) and (6, 3), and the x[1] >= −0.07 branch is split on x[2] at 1.55 into (1, 11) and (3, 0).]

For threshold splits, the same feature can be used multiple times.
Evolution of decision boundaries
• Decision boundaries get more complicated as we increase the model complexity (here measured by the depth of the decision tree)

[Figure: decision boundaries of logistic regression with degree 1, 2, and 6 features, compared to decision trees of depth 1, 3, and 10.]
Probabilistic prediction with decision trees
• By taking the fraction of each class among the training data in a leaf's subset, we can make probabilistic predictions

[Figure: the depth-1 tree from before. The branch x[1] >= −0.07 contains 4 negative and 11 positive training points, so P(y = +1) = 11/15; the branch x[1] < −0.07 contains 13 negative and 3 positive points, so P(y = −1) = 13/16. The deeper tree assigns probabilities the same way at each of its leaves.]
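A one-line sketch of this estimate in Python (illustrative), reproducing the 11/15 from the figure:

def leaf_probability(leaf_labels, label=+1):
    """Probability estimate at a leaf: the fraction of training examples
    reaching that leaf that carry the given label."""
    return sum(1 for y in leaf_labels if y == label) / len(leaf_labels)

# The branch x[1] >= -0.07 contains 4 negative and 11 positive training points:
print(leaf_probability([-1] * 4 + [+1] * 11))   # 11/15 ~ 0.733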
Overfitting
Trade-off between training error and depth
• As in other regression and classification approaches, training error monotonically decreases (more precisely, is non-increasing) with model complexity (usually measured by depth)
• A training error that is too small is a sign of overfitting

Tree depth       1     2     3     5     10
Training error   0.22  0.13  0.10  0.03  0.00

[Figure: the corresponding decision boundaries; training error reduces with depth.]
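For a rough illustration of the table above (not from the slides), one could measure training error against depth with scikit-learn on a synthetic dataset; the exact numbers will differ from the slide's:

from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

# Synthetic 2-D data; deeper trees drive the training error toward 0.
X, y = make_moons(n_samples=200, noise=0.3, random_state=0)
for depth in (1, 2, 3, 5, 10):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X, y)
    print(depth, round(1 - tree.score(X, y), 2))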
Two ways to prevent overfitting
• 1. Early stopping: stop before the tree gets too complicated
• 2. Pruning: simplify after training a complex model

[Figure: classification error vs. tree depth: training error keeps decreasing toward complex (deep) trees, while true error eventually turns back up; max_depth is chosen in between, before the trees get too complex.]
1. Early stopping
• Stopping rules (recap):
  • 1. All examples in the subset have the same label
  • 2. No more features left to split on
• Early stopping rules:
  • Only grow up to max_depth (which is chosen via validation)
    • It is hard to figure out the right depth, and an imbalanced tree can be a better one
  • Do not split if the split does not give a sufficient decrease in error
    • There are cases where it takes some depth to get the gain, e.g. the XOR example above
  • Do not split intermediate nodes with too few data points
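As a hedged illustration, these early-stopping rules correspond roughly to hyperparameters of scikit-learn's DecisionTreeClassifier (the slides do not use scikit-learn, and min_impurity_decrease controls impurity rather than classification error):

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=5,                 # only grow up to max_depth (choose via validation)
    min_impurity_decrease=0.01,  # do not split unless the split improves (impurity) enough
    min_samples_split=10,        # do not split intermediate nodes with too few data points
)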
2. Pruning
• Train until the tree overfits, and then simplify
• Pruning is guided by a quality metric that balances fitting the data and simplicity
  • Data fit is measured by the error
  • Simplicity is measured by the number of leaves in the tree

    Loss(T) = Error(T) + λ · (number of leaf nodes in T)

• Simplify the complex tree into a simpler tree: with λ = 0 the complex tree is kept unchanged, and as λ → ∞ pruning collapses it to the simplest tree.
Pruning algorithm
• 1. Compute the current loss, Loss(T) = Error(T) + λ · (number of leaf nodes), and find a candidate split to remove
• 2. Replace the split by a leaf node and recompute the loss
• 3. Repeat for all splits (there are 4 splits in this example), keeping the pruned tree whenever its loss is lower

[Figure: the final loan tree from before. The first candidate prune replaces the Term split under Credit = poor, Income = high (counts 4, 5) by a single Risky leaf.]

With λ = 0.03:

Tree      Error  #Leaves  Total
T         3/40   6        0.255
T-term    4/40   5        0.25
T-credit
T-income
T-term'
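A tiny Python sketch of the pruning objective, reproducing the totals in the table above (λ = 0.03):

def pruning_loss(error_rate, num_leaves, lam=0.03):
    """Loss(T) = Error(T) + lambda * (number of leaf nodes), as in the slides."""
    return error_rate + lam * num_leaves

print(pruning_loss(3 / 40, 6))   # T:      0.075 + 0.18 = 0.255
print(pruning_loss(4 / 40, 5))   # T-term: 0.100 + 0.15 = 0.250 -> prefer the pruned tree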
Decision trees for regression

[Figure: a one-dimensional regression dataset with input X and target Y.]

• Same process as classification, but:
  • error is measured by squared error
  • the prediction at a leaf is the mean value of the samples in that partition
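A minimal Python sketch of a regression leaf (illustrative values):

def leaf_prediction(y_values):
    """Regression leaf: predict the mean of the targets in that partition."""
    return sum(y_values) / len(y_values)

def leaf_squared_error(y_values):
    """Fit is measured by the squared error around the leaf's mean prediction."""
    mean = leaf_prediction(y_values)
    return sum((y - mean) ** 2 for y in y_values)

print(leaf_prediction([1.0, 2.0, 3.0]), leaf_squared_error([1.0, 2.0, 3.0]))  # 2.0 2.0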