
Lecture 6: Decision Tree, Random Forest, and Boosting

Tuo Zhao

Schools of ISyE and CSE, Georgia Tech

Page 2

Tree?

Page 3

Decision Trees

Page 4

Decision Trees

Decision trees have a long history in machine learning

The first popular algorithm dates back to 1979

Very popular in many real-world problems

Intuitive to understand

Easy to build

Page 5

History

EPAM: Elementary Perceiver and Memorizer

1961: Cognitive simulation model of human concept learning

1966: CLS, an early algorithm for decision tree construction

1979: ID3 based on information theory

1993: C4.5 improved over ID3

Also has a history in statistics as CART (Classification and Regression Trees)

Page 6

Motivation

How do people make decisions?

Consider a variety of factors

Follow a logical path of checks

Should I eat at this restaurant?

If there is no wait: Yes

If there is a short wait and I am hungry: Yes

Else: No

Page 7

Decision Flowchart

Page 8

Example: Should We Play Tennis?

If temperature is not hot: Play tennis

If outlook is overcast: Play tennis

Otherwise: Don't play tennis

Page 9

Decision Tree

Page 10

Structure

A decision tree consists of:

Nodes: tests for variables

Branches: results of the tests

Leaves: classifications

Page 11

Function Class

What functions can decision trees model?

Non-linear: very powerful function class

A decision tree can encode any Boolean function

Proof

Given a truth table for a function

Construct a path in the tree for each row of the table

Given a row as input, follow that path to the desired leaf (output)

Problem: exponentially large trees!

Page 12

Smaller Trees

Can we produce smaller decision trees for functions?

Yes (Possible)

Key decision factors

Counterexamples

Parity function: return 1 if an even number of inputs are 1, 0 otherwise

Majority function: return 1 if more than half of the inputs are 1

Bias-Variance Tradeoff

Bias: the representation power of decision trees

Variance: fitting deep trees requires a sample size exponential in the depth

Page 13

Fitting Decision Trees

Page 14

What Makes a Good Tree?

Small tree:

Occam’s razor: Simpler is better

Avoids over-fitting

A decision tree may be human readable, but it need not use human logic!

How do we build small trees that accurately capture the data?

Learning an optimal decision tree is computationally intractable

Page 15

Greedy Algorithm

We can get good trees by simple greedy algorithms

Adjustments are usually made to fix problems caused by greedy selection

Recursive:

Select the "best" variable, and generate child nodes: one for each possible value;

Partition the samples using the possible values, and assign these subsets of samples to the child nodes;

Repeat for each child node until all samples associated with a node are either all positive or all negative.
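
A minimal Python sketch of this greedy recursion (not from the lecture; build_tree and split_score are illustrative names). The "best" variable is scored here by the reduction in misclassification error, one of the candidate criteria discussed on the following slides; information gain is the more common choice.

```python
import numpy as np
from collections import Counter

def misclass_error(y):
    """Fraction of samples not in the majority class."""
    return 1.0 - Counter(y).most_common(1)[0][1] / len(y)

def split_score(x, y):
    """Reduction in misclassification error when splitting labels y on categorical feature x."""
    after = sum((x == v).mean() * misclass_error(y[x == v]) for v in np.unique(x))
    return misclass_error(y) - after

def build_tree(X, y, features):
    """Greedy recursion: pick the 'best' feature, create one child per value,
    partition the samples, and recurse until a node is all positive or all negative."""
    if len(set(y)) == 1:                          # pure node: stop
        return y[0]
    if not features:                              # no variables left: majority label
        return Counter(y).most_common(1)[0][0]
    best = max(features, key=lambda j: split_score(X[:, j], y))
    children = {}
    for v in np.unique(X[:, best]):
        m = X[:, best] == v
        children[v] = build_tree(X[m], y[m], [j for j in features if j != best])
    return {"split_on": best, "children": children}
```

Calling build_tree(X, y, list(range(X.shape[1]))) on a small categorical dataset returns a nested dictionary of splits with class labels at the leaves.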

Page 16

Variable Selection

The best variable for partition

The most informative variable

Select the variable that is most informative about the labels

Classification error?

Example:

P(Y = 1 | X1 = 1) = 0.75 vs. P(Y = 1 | X2 = 1) = 0.55

Which to choose?

Page 17

Information Theory

The quantification of information

Founded by Claude Shannon

Simple concepts:

Entropy: H(X) = −∑_x P(X = x) log P(X = x)

Conditional Entropy: H(Y|X) = ∑_x P(X = x) H(Y|X = x)

Information Gain: IG(Y|X) = H(Y) − H(Y|X)

Select the variable with the highest information gain

Page 18

Example: Should We Play Tennis?

H(Tennis) = −3/5 log2(3/5) − 2/5 log2(2/5) = 0.97

H(Tennis|Out. = Sunny) = −2/2 log2(2/2) − 0/2 log2(0/2) = 0

H(Tennis|Out. = Overcast) = −0/1 log2(0/1) − 1/1 log2(1/1) = 0

H(Tennis|Out. = Rainy) = −0/2 log2(0/2) − 2/2 log2(2/2) = 0

H(Tennis|Out.) = 2/5 × 0 + 1/5 × 0 + 2/5 × 0 = 0
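
A quick numeric check of these values (a sketch, not part of the slides). The 5-day sample below assumes a label assignment consistent with the counts above (e.g. both Sunny days and the Overcast day are Play, both Rainy days are Don't); the exact assignment is not shown on the slide.

```python
import numpy as np
from collections import Counter

def H(labels):
    """Entropy (base 2) of a list of labels."""
    n = len(labels)
    return -sum(c / n * np.log2(c / n) for c in Counter(labels).values())

# Hypothetical 5-day sample matching the counts on this slide.
outlook = ["Sunny", "Sunny", "Overcast", "Rainy", "Rainy"]
tennis  = ["Play",  "Play",  "Play",     "Dont",  "Dont"]

h_tennis = H(tennis)                           # 0.97
h_cond = sum(outlook.count(v) / len(outlook) *
             H([t for o, t in zip(outlook, tennis) if o == v])
             for v in set(outlook))            # 0
print(h_tennis, h_cond, h_tennis - h_cond)     # information gain: 0.97
```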

Page 19

Example: Should We Play Tennis?

IG(Tennis|Out.) = 0.97 − 0 = 0.97

If we knew the Outlook we’d be able to predict Tennis!

Outlook is the variable to pick for our decision tree

Page 20

When to Stop Training?

All data have the same label: return that label

No examples: return the majority label of all the data

No further splits possible: return the majority label of the data passed to the node

What if max IG = 0?

Page 21

Example: No Information Gain?

Y   X1   X2
0    0    0
1    0    1
1    1    0
0    1    1

Both features give 0 IG

Once we divide the data, perfect classification!

We need a little exploration sometimes (Unstable Equilibrium)
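
A short check of this example (a sketch, not from the slides): on the full table both features have zero information gain, yet after splitting on X1 each branch is separated perfectly by X2.

```python
import numpy as np

# The four rows from the table above: Y is the XOR of X1 and X2.
Y  = np.array([0, 1, 1, 0])
X1 = np.array([0, 0, 1, 1])
X2 = np.array([0, 1, 0, 1])

def H(y):
    """Entropy (base 2) of a 0/1 label vector."""
    p = np.bincount(y) / len(y)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def IG(x, y):
    """Information gain of splitting y on binary feature x."""
    return H(y) - sum((x == v).mean() * H(y[x == v]) for v in np.unique(x))

print(IG(X1, Y), IG(X2, Y))                     # 0.0 and 0.0: neither feature helps alone
for v in (0, 1):                                # but within each X1-branch, X2 has IG = 1.0
    print(v, IG(X2[X1 == v], Y[X1 == v]))
```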

Page 22

Decision Tree with Continuous Features

Decision Stump

Threshold for binary classification:

Check Xj ≥ δj or Xj < δj.

More than two classes:

Threshold for three classes:

Check Xj ≥ δj, γj ≤ Xj < δj, or Xj < γj.

Decompose one node into two nodes:

First node: check Xj ≥ δj or Xj < δj;

Second node: if Xj < δj, check Xj ≥ γj or Xj < γj.
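
A minimal sketch (not from the lecture) of how the threshold δj for one continuous feature can be chosen by exhaustive search over observed values; best_stump and the toy data are illustrative.

```python
import numpy as np

def best_stump(x, y):
    """Search thresholds delta and return the split 'x >= delta' (in either polarity)
    that minimizes misclassification error: a depth-1 tree, i.e. a decision stump."""
    best_delta, best_err = None, 1.0
    for delta in np.unique(x):
        pred = (x >= delta).astype(int)
        err = min(np.mean(pred != y), np.mean((1 - pred) != y))  # try both labelings
        if err < best_err:
            best_delta, best_err = delta, err
    return best_delta, best_err

# Hypothetical 1-D data where class 1 tends to have larger feature values.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(2, 1, 100)])
y = np.concatenate([np.zeros(100, dtype=int), np.ones(100, dtype=int)])
print(best_stump(x, y))   # a threshold between the two class means, and its error
```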

Page 23

Decision Tree for Spam Classification

[Figure (from Trevor Hastie's boosting lecture): a large classification tree grown on the SPAM training data. Internal nodes split on word/character-frequency features such as ch$ < 0.0555, remove < 0.06, ch! < 0.191, george < 0.005, hp < 0.03, CAPAVE < 2.7505, free < 0.065, and edu < 0.045; each node shows its counts (e.g. 600/1536 at the root), and the leaves are labeled spam or email.]

Page 24

Decision Tree for Spam Classification

[Figure: ROC curve for the pruned tree on the SPAM data (Sensitivity vs. Specificity); TREE error: 8.7%.]

SPAM Data

Overall error rate on the test data: 8.7%.

The ROC curve is obtained by varying the threshold c0 of the classifier: C(X) = +1 if P(+1|X) > c0.

Sensitivity: proportion of true spam identified.

Specificity: proportion of true email identified.

We may want specificity to be high, and suffer some spam: Specificity 95% ⇒ Sensitivity 79%.
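
A sketch of how such a curve can be traced in Python (an assumption on my part; the lecture's figures come from R, and the SPAM data itself is not reproduced here, so a synthetic stand-in dataset is used).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, roc_auc_score

# Synthetic stand-in for a binary spam/email problem.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=0)

tree = DecisionTreeClassifier(max_depth=6, random_state=0).fit(Xtr, ytr)
scores = tree.predict_proba(Xte)[:, 1]          # estimates of P(+1 | X), thresholded at c0

fpr, tpr, thresholds = roc_curve(yte, scores)   # one (fpr, tpr) point per threshold c0
sensitivity = tpr                               # proportion of true positives identified
specificity = 1 - fpr                           # proportion of true negatives identified
print("AUC:", roc_auc_score(yte, scores))
print(sensitivity[specificity >= 0.95].max())   # sensitivity attainable at ~95% specificity
```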

Page 25

Decision Tree vs. SVM

[Figure: ROC curves for TREE (error 8.7%) and SVM (error 6.7%) on the SPAM data (Sensitivity vs. Specificity).]

Comparing ROC curves on the test data is a good way to compare classifiers. SVM dominates TREE here.

SVM outperforms Decision Tree.

AUC: Area Under Curve.

Page 26

Bagging and Random Forest

Page 27

Toy Example

Toy Example: No Noise

[Figure: scatter plot of the two classes (labeled 0 and 1) in the (X1, X2) plane; Bayes error rate: 0.]

Deterministic problem; the noise comes from the sampling distribution of X.

Use a training sample of size 200.

Here the Bayes error is 0%.

Nonlinearly separable data.

Optimal decision boundary: X1² + X2² = 1.
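
A sketch reconstructing this 2-D toy problem (assumptions: X is standard normal and class 1 lies inside the unit circle; the slide does not spell out the sampling distribution or which class is inside).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def make_data(n):
    """Deterministic labels: class 1 inside the circle X1^2 + X2^2 = 1 (assumed)."""
    X = rng.normal(size=(n, 2))
    y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)
    return X, y

Xtr, ytr = make_data(200)        # training sample of size 200
Xte, yte = make_data(10000)

tree = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr)
print("test error:", np.mean(tree.predict(Xte) != yte))
```

The test error lands in the same ballpark as the 7.3% quoted on the next slide, though the exact value depends on the sample and on how the tree is grown.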

Page 28

Toy Example: Decision Tree

Classification Tree

[Figure: a classification tree grown on the toy training data (splits on x.1 and x.2, e.g. x.2 < −1.06711), and the resulting axis-aligned decision boundary in the (X1, X2) plane; error rate: 0.073.]

When the nested spheres are in 10 dimensions, classification trees produce a rather noisy and inaccurate rule C(X), with error rates around 30%.

Sample size: 200

7 branching nodes; 6 layers.

Classification error: 7.3% when d = 2; > 30% when d = 10.

Page 29

Model Averaging

Decision trees can be simple, but they often produce noisy (bushy) or weak (stunted) classifiers.

Bagging (Breiman, 1996): Fit many large trees to bootstrap-resampled versions of the training data, and classify by majority vote.

Boosting (Freund & Schapire, 1996): Fit many large or small trees to reweighted versions of the training data. Classify by weighted majority vote.

Random Forests (Breiman, 1999): A fancier version of bagging.

In general, Boosting > Random Forests > Bagging > Single Tree.

Page 30

Bagging/Bootstrap Aggregation

Motivation: Average a given procedure over many samples to reduce the variance.

Given a classifier C(S, x), based on our training data S, producing a predicted class label at input point x.

To bag C, we draw bootstrap samples S1, ..., SB, each of size N, with replacement from the training data. Then

CBag(x) = Majority Vote {C(Sb, x) : b = 1, ..., B}.

Bagging can dramatically reduce the variance of unstable procedures (like trees), leading to improved prediction.

All simple structures in a tree are lost.
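
A minimal sketch of this procedure, assuming binary 0/1 labels; bagged_predict is an illustrative name, and in practice sklearn's BaggingClassifier implements the same idea.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_predict(X_train, y_train, X_test, B=100, seed=0):
    """Fit one large tree per bootstrap sample S_1, ..., S_B and classify by majority vote."""
    rng = np.random.default_rng(seed)
    N = len(y_train)
    votes = np.zeros((B, len(X_test)))
    for b in range(B):
        idx = rng.integers(0, N, size=N)               # bootstrap: N draws with replacement
        tree = DecisionTreeClassifier(random_state=b)  # a deep, unpruned tree
        tree.fit(X_train[idx], y_train[idx])
        votes[b] = tree.predict(X_test)
    return (votes.mean(axis=0) > 0.5).astype(int)      # majority vote over the B trees
```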

Page 31

Toy Example: Bagging Trees

[Figure: the original tree and five trees fit to bootstrap resamples of the training data (Bootstrap Trees 1-5); the resampled trees differ in their split variables and thresholds (e.g. x.2 < 0.39, x.1 < −0.965, x.4 < 0.395, x.3 < −1.575).]

2 branching nodes; 2 layers.

5 dependent trees.

Page 32

Toy Example: Bagging Trees

Decision Boundary: Bagging

[Figure: the decision boundary produced by bagging on the toy data in the (X1, X2) plane; error rate: 0.032.]

Bagging averages many trees, and produces smoother decision boundaries.

A smoother decision boundary.

Classification error: 3.2% (single deeper tree: 7.3%).

Page 33

Random Forest

Bagging features and samples simultaneously:

At each tree split, a random sample of m features is drawn, and only those m features are considered for splitting. Typically m = √d or log2 d, where d is the number of features.

For each tree grown on a bootstrap sample, the error rate for observations left out of the bootstrap sample is monitored. This is called the "out-of-bag" error rate.

Random forests try to improve on bagging by "de-correlating" the trees. Each tree has the same expectation.
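
A hedged example of these settings with scikit-learn (the lecture's SPAM experiment used 500 trees in R's randomForest package; the data below is a synthetic stand-in).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=3000, n_features=57, n_informative=10, random_state=0)

rf = RandomForestClassifier(
    n_estimators=500,      # number of trees
    max_features="sqrt",   # m = sqrt(d) features considered at each split
    oob_score=True,        # monitor the out-of-bag error rate
    random_state=0,
).fit(X, y)

print("OOB error:", 1 - rf.oob_score_)
```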

Page 34

Random Forest for Spam Classification

[Figure: ROC curves for TREE (error 8.7%), SVM (error 6.7%), and Random Forest (error 5.0%) on the SPAM data.]

Random Forest dominates both other methods on the SPAM data, with 5.0% error. 500 trees were used, with the default settings of the randomForest package in R.

RF outperforms SVM.

500 Trees.

Page 35

Random Forest: Variable Importance Scores

Spam: Variable Importance

[Figure (from Trevor Hastie's boosting lecture): relative importance scores (0 to 100) for the SPAM features; the most important variables include !, $, hp, remove, free, CAPAVE, your, CAPMAX, george, and CAPTOT.]

Each bootstrap sample leaves out roughly 1/3 of the observations (the "out-of-bag" samples).

Permute the values of each variable over these samples to check how important that variable is.
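
A sketch of the permutation idea with scikit-learn. Note an assumption: sklearn's permutation_importance permutes each variable on a dataset you pass in, rather than on each tree's own out-of-bag samples as in Breiman's original measure; the SPAM features are replaced by synthetic ones here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=0)
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0).fit(X, y)

# Permute one variable at a time and measure how much the accuracy drops.
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
ranking = result.importances_mean.argsort()[::-1]
print(ranking[:5], result.importances_mean[ranking[:5]])   # the most important variables
```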

Page 36

Boosting

Page 37

Boosting

[Figure (from Trevor Hastie's boosting lecture): the training sample is repeatedly re-weighted, and a classifier C1(x), C2(x), C3(x), ..., CM(x) is fit to each weighted sample.]

Average many trees, each grown to re-weighted versions of the training data (iteratively).

The final classifier is a weighted average of the classifiers: C(x) = sign(∑_{m=1}^{M} αm Cm(x)).

Page 38

AdaBoost vs. Bagging

[Figure: test error vs. number of terms for Bagging and AdaBoost with 100-node trees. Setup: 2000 points from nested spheres in R^10; the Bayes error rate is 0%; trees are grown best-first without pruning; the leftmost term is a single tree.]

Each classifier for bagging is a tree with a single node (decision stump).

2000 samples from nested spheres in R^10.

Page 39

AdaBoost

AdaBoost (Freund & Schapire, 1996)

1. Initialize the observation weights wi = 1/N, i = 1, 2, ..., N.

2. For m = 1 to M repeat steps (a)-(d):

(a) Fit a classifier Cm(x) to the training data using weights wi.

(b) Compute the weighted error of the newest tree: errm = [∑_{i=1}^{N} wi I(yi ≠ Cm(xi))] / [∑_{i=1}^{N} wi].

(c) Compute αm = log[(1 − errm)/errm].

(d) Update the weights for i = 1, ..., N: wi ← wi · exp[αm · I(yi ≠ Cm(xi))], and renormalize the wi to sum to 1.

3. Output C(x) = sign[∑_{m=1}^{M} αm Cm(x)].

wi’s are the weights of the samples.

errm is the weighted training error.
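
A minimal Python sketch of the algorithm above, assuming labels coded as +1/−1 and stumps (max_depth=1 trees) as the base classifiers Cm(x); the early stop for degenerate stumps is an added safeguard, not part of the slide.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, M=400):
    """AdaBoost sketch; y must be coded as +1/-1."""
    N = len(y)
    w = np.full(N, 1.0 / N)                                # 1. initialize w_i = 1/N
    stumps, alphas = [], []
    for m in range(M):
        stump = DecisionTreeClassifier(max_depth=1)        # (a) fit C_m using weights w_i
        stump.fit(X, y, sample_weight=w)
        miss = stump.predict(X) != y
        err = np.sum(w * miss) / np.sum(w)                 # (b) weighted error err_m
        if err <= 0 or err >= 0.5:                         # degenerate stump: stop early
            break
        alpha = np.log((1 - err) / err)                    # (c) alpha_m
        w = w * np.exp(alpha * miss)                       # (d) up-weight misclassified points
        w = w / np.sum(w)                                  #     and renormalize
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def boosted_predict(stumps, alphas, X):
    """3. Output C(x) = sign(sum_m alpha_m C_m(x))."""
    return np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
```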

Page 40

AdaBoost

Boosting Stumps

[Figure: test error vs. boosting iterations for boosted stumps on the nested-spheres problem, compared with a single stump and a single 400-node tree.]

A stump is a two-node tree, after a single split. Boosting stumps works remarkably well on the nested-spheres problem.

The ensemble of classifiers is much more efficient than the simple combination of classifiers.

Page 41

Overfitting of AdaBoost

[Figure: training and test error vs. number of terms for boosting on nested spheres in 10 dimensions; the Bayes error is 0%, and boosting drives the training error to zero.]

More iterations continue to improve the test error in many examples.

AdaBoost is often observed to be robust to overfitting.

Page 42

Overfitting of AdaBoost

Noisy Problems

[Figure: training and test error vs. number of terms for boosting with stumps on nested Gaussians in 10 dimensions; the Bayes error (25%) is marked.]

Optimal Bayes classification error: 25%.

The test error does increase, but quite slowly.
