
Boosting Trevor Hastie, Stanford University 1

Trees, Bagging, Random Forests and Boosting

• Classification Trees

• Bagging: Averaging Trees

• Random Forests: Cleverer Averaging of Trees

• Boosting: Cleverest Averaging of Trees

Methods for improving the performance of weak learners such as Trees. Classification trees are adaptive and robust, but do not generalize well. The techniques discussed here enhance their performance considerably.

Boosting Trevor Hastie, Stanford University 2

Two-class Classification

• Observations are classified into two or more classes, coded by a response variable Y taking values 1, 2, . . . , K.

• We have a feature vector X = (X1, X2, . . . , Xp), and we hope to build a classification rule C(X) to assign a class label to an individual with feature X.

• We have a sample of pairs (yi, xi), i = 1, . . . , N. Note that each of the xi are vectors: xi = (xi1, xi2, . . . , xip).

• Example: Y indicates whether an email is spam or not. X represents the relative frequency of a subset of specially chosen words in the email message.

• The technology described here estimates C(X) directly, or via the probability function P(C = k|X).

Boosting Trevor Hastie, Stanford University 3

Classification Trees

• Represented by a series of binary splits.

• Each internal node represents a value query on one of the variables — e.g. “Is X3 > 0.4?”. If the answer is “Yes”, go right, else go left.

• The terminal nodes are the decision nodes. Typically each terminal node is dominated by one of the classes.

• The tree is grown using training data, by recursive splitting.

• The tree is often pruned to an optimal size, evaluated by cross-validation.

• New observations are classified by passing their X down to a terminal node of the tree, and then using majority vote.
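As a concrete illustration of this recipe, the sketch below grows a bushy tree (small minimum bucket size), prunes it back by cost-complexity pruning with the size chosen by cross-validation, and classifies new points by the majority class of their terminal node. It is only a minimal sketch under assumptions: scikit-learn and the synthetic data are not part of the original slides, which use R's rpart for trees.

```python
# Sketch of the tree-growing recipe above, using scikit-learn (an assumption;
# the slides use R's rpart). The data are synthetic, for illustration only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Grow a bushy tree, then compute its cost-complexity pruning path and
# choose the pruning strength by 10-fold cross-validation.
full_tree = DecisionTreeClassifier(min_samples_leaf=5, random_state=0)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_cv = 0.0, -np.inf
for alpha in path.ccp_alphas:
    cv = cross_val_score(
        DecisionTreeClassifier(min_samples_leaf=5, ccp_alpha=alpha, random_state=0),
        X_train, y_train, cv=10).mean()
    if cv > best_cv:
        best_alpha, best_cv = alpha, cv

pruned = DecisionTreeClassifier(min_samples_leaf=5, ccp_alpha=best_alpha,
                                random_state=0).fit(X_train, y_train)
# New observations are passed down to a terminal node and classified by
# majority vote in that node; .predict does exactly this.
print("test error:", 1 - pruned.score(X_test, y_test))
```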


Boosting Trevor Hastie, Stanford University 4

Classification Tree

[Figure: a small classification tree on 30 observations. The root (10/30 misclassified, class 0) splits on x.2 < 0.39; the left branch splits again on x.3 < −1.575. Terminal nodes are labeled with their majority class and misclassification counts (2/5, 0/16, 2/9).]

Boosting Trevor Hastie, Stanford University 5

Properties of Trees

✔ Can handle huge datasets

✔ Can handle mixed predictors—quantitative and qualitative

✔ Easily ignore redundant variables

✔ Handle missing data elegantly

✔ Small trees are easy to interpret

✖ Large trees are hard to interpret

✖ Often prediction performance is poor

Boosting Trevor Hastie, Stanford University 6

Example: Predicting e-mail spam

• data from 4601 email messages

• goal: predict whether an email message is spam (junk email) or good.

• input features: relative frequencies in a message of 57 of the most commonly occurring words and punctuation marks in all the training email messages.

• for this problem not all errors are equal; we want to avoid filtering out good email, while letting spam get through is not desirable but less serious in its consequences.

• we coded spam as 1 and email as 0.

• A system like this would be trained for each user separately (e.g. their word lists would be different).

Boosting Trevor Hastie, Stanford University 7

Predictors

• 48 quantitative predictors—the percentage of words in the email that match a given word. Examples include business, address, internet, free, and george. The idea was that these could be customized for individual users.

• 6 quantitative predictors—the percentage of characters in the email that match a given character. The characters are ch;, ch(, ch[, ch!, ch$, and ch#.

• The average length of uninterrupted sequences of capital letters: CAPAVE.

• The length of the longest uninterrupted sequence of capital letters: CAPMAX.

• The sum of the length of uninterrupted sequences of capital letters: CAPTOT.

Boosting Trevor Hastie, Stanford University 8

Details

• A test set of size 1536 was randomly chosen, leaving 3065 observations in the training set.

• A full tree was grown on the training set, with splitting continuing until a minimum bucket size of 5 was reached.

• This bushy tree was pruned back using cost-complexity pruning, and the tree size was chosen by 10-fold cross-validation.

• We then compute the test error and ROC curve on the test data.

Boosting Trevor Hastie, Stanford University 9

Some important features

39% of the training data were spam.

Average percentage of words or characters in an email message equal to the indicated word or character. We have chosen the words and characters showing the largest difference between spam and email.

        george   you  your    hp  free   hpl
spam      0.00  2.26  1.38  0.02  0.52  0.01
email     1.27  1.27  0.44  0.90  0.07  0.43

             !   our    re   edu  remove
spam      0.51  0.51  0.13  0.01    0.28
email     0.11  0.18  0.42  0.29    0.01

Boosting Trevor Hastie, Stanford University 10

[Figure: the pruned classification tree for the SPAM data (root node: 600/1536). The root splits on ch$ < 0.0555; further splits include remove < 0.06, ch! < 0.191, george < 0.005, hp < 0.03, CAPMAX < 10.5, receive < 0.125, edu < 0.045, our < 1.2, CAPAVE < 2.7505, free < 0.065, business < 0.145, george < 0.15, hp < 0.405, CAPAVE < 2.907 and 1999 < 0.58. Terminal nodes are labeled spam or email, each with its misclassification count.]

Boosting Trevor Hastie, Stanford University 11

[Figure: ROC curve (Sensitivity vs Specificity) for the pruned tree on the SPAM data. TREE − Error: 8.7%.]

SPAM Data

Overall error rate on test data: 8.7%. The ROC curve is obtained by varying the threshold c0 of the classifier: C(X) = +1 if P(+1|X) > c0. Sensitivity: proportion of true spam identified. Specificity: proportion of true email identified.

We may want specificity to be high, and suffer some spam: Specificity: 95% =⇒ Sensitivity: 79%.
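The sketch below shows how such a curve is traced in practice by sweeping the threshold c0 over the estimated probabilities. It is a hedged illustration only: scikit-learn, the synthetic data and the particular classifier are assumptions, not the fit used on the slides.

```python
# Sketch: trace the ROC curve by varying the threshold c0 on P(+1|x), as
# described above. scikit-learn is an assumption; the data are synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve

X, y = make_classification(n_samples=4000, n_features=20, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
clf = DecisionTreeClassifier(min_samples_leaf=5, ccp_alpha=0.002,
                             random_state=1).fit(X_tr, y_tr)

prob = clf.predict_proba(X_te)[:, 1]          # estimated P(+1 | x)
fpr, tpr, thresholds = roc_curve(y_te, prob)  # one point per threshold c0
sensitivity, specificity = tpr, 1.0 - fpr

# e.g. read off the sensitivity attainable at roughly 95% specificity
i = np.argmin(np.abs(specificity - 0.95))
print("specificity %.2f -> sensitivity %.2f" % (specificity[i], sensitivity[i]))
```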


Boosting Trevor Hastie, Stanford University 12

[Figure: ROC curves (Sensitivity vs Specificity) for TREE vs SVM on the SPAM data. SVM − Error: 6.7%; TREE − Error: 8.7%.]

TREE vs SVM

Comparing ROC curves on the test data is a good way to compare classifiers. SVM dominates TREE here.

Boosting Trevor Hastie, Stanford University 13

Toy Classification Problem

[Figure: scatter plot of the two-class training data in (X1, X2), with points labeled 0 and 1 and the Bayes decision boundary shown. Bayes Error Rate: 0.25.]

• Data X and Y, with Y taking values +1 or −1.

• Here X = (X1, X2).

• The black boundary is the Bayes Decision Boundary - the best one can do.

• Goal: Given N training pairs (Xi, Yi), produce a classifier C(X) ∈ {−1, 1}.

• Also estimate the probability of the class labels P(Y = +1|X).

Boosting Trevor Hastie, Stanford University 14

Toy Example - No Noise

[Figure: scatter plot of the noise-free toy example in (X1, X2), with points labeled 0 and 1. Bayes Error Rate: 0.]

• Deterministic problem; noise comes from the sampling distribution of X.

• Use a training sample of size 200.

• Here Bayes Error is 0%.

Boosting Trevor Hastie, Stanford University 15

Classification Tree

[Figure: the classification tree fit to the 200-point training sample (root node: 94/200). The root splits on x.2 < −1.06711; further splits are on x.2 < 1.14988, x.1 < 1.13632, x.1 < −0.900735, x.1 < −1.1668, x.1 < −1.07831 and x.2 < −0.823968. Terminal nodes show their majority class and misclassification counts.]

Boosting Trevor Hastie, Stanford University 16

Decision Boundary: Tree

[Figure: decision boundary of the classification tree overlaid on the training data in (X1, X2). Error Rate: 0.073.]

When the nested spheres are in 10 dimensions, Classification Trees produce a rather noisy and inaccurate rule C(X), with error rates around 30%.

Boosting Trevor Hastie, Stanford University 17

Model Averaging

Classification trees can be simple, but often produce noisy (bushy) or weak (stunted) classifiers.

• Bagging (Breiman, 1996): Fit many large trees to bootstrap-resampled versions of the training data, and classify by majority vote.

• Boosting (Freund & Schapire, 1996): Fit many large or small trees to reweighted versions of the training data. Classify by weighted majority vote.

• Random Forests (Breiman 1999): Fancier version of bagging.

In general Boosting ≻ Random Forests ≻ Bagging ≻ Single Tree.

Boosting Trevor Hastie, Stanford University 18

Bagging

Bagging or bootstrap aggregation averages a given procedure over many samples, to reduce its variance — a poor man's Bayes. See pp. 246.

Suppose C(S, x) is a classifier, such as a tree, based on our training data S, producing a predicted class label at input point x.

To bag C, we draw bootstrap samples S*1, . . . , S*B, each of size N, with replacement from the training data. Then

C_bag(x) = Majority Vote { C(S*_b, x) }, b = 1, . . . , B.

Bagging can dramatically reduce the variance of unstable procedures (like trees), leading to improved prediction. However any simple structure in C (e.g. a tree) is lost.
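A minimal sketch of C_bag under assumptions (scikit-learn trees, synthetic data; the slides themselves work in R): draw B bootstrap samples, fit a tree to each, and classify test points by majority vote.

```python
# Minimal sketch of C_bag: draw B bootstrap samples, fit a tree to each,
# classify by majority vote. scikit-learn is an assumption; data are synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
B, N = 200, len(y_tr)
trees = []
for b in range(B):
    idx = rng.integers(0, N, size=N)   # bootstrap sample S*_b, size N with replacement
    trees.append(DecisionTreeClassifier(random_state=b).fit(X_tr[idx], y_tr[idx]))

votes = np.stack([t.predict(X_te) for t in trees])   # B x n_test matrix of 0/1 labels
c_bag = (votes.mean(axis=0) > 0.5).astype(int)       # majority vote

single = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print("single tree error:", np.mean(single.predict(X_te) != y_te))
print("bagged error:     ", np.mean(c_bag != y_te))
```

On unstable fits like deep trees, the bagged vote typically shows the variance reduction the slide describes, while the interpretable single-tree structure is lost.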


Boosting Trevor Hastie, Stanford University 19

[Figure: the original tree and five trees grown on bootstrap samples of the same 30 observations. The original tree splits on x.2 < 0.39 and then x.3 < −1.575; the bootstrap trees split on somewhat different variables and cutpoints (x.2 < 0.36 then x.1 < −0.965; x.2 < 0.39 then x.3 < −1.575; x.4 < 0.395 then x.3 < −1.575; x.2 < 0.255 then x.3 < −1.385; x.2 < 0.38 then x.3 < −1.61), illustrating how much trees vary across bootstrap samples.]

Boosting Trevor Hastie, Stanford University 20

Decision Boundary: Bagging

[Figure: decision boundary from bagging, overlaid on the training data in (X1, X2). Error Rate: 0.032.]

Bagging averages many trees, and produces smoother decision boundaries.

Boosting Trevor Hastie, Stanford University 21

Random forests

• refinement of bagged trees; quite popular

• at each tree split, a random sample of m features is drawn, and only those m features are considered for splitting. Typically m = √p or log2 p, where p is the number of features.

• For each tree grown on a bootstrap sample, the error rate for observations left out of the bootstrap sample is monitored. This is called the “out-of-bag” error rate.

• random forests tries to improve on bagging by “de-correlating” the trees. Each tree has the same expectation.
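The recipe above maps directly onto standard implementations. The sketch below is an assumption-laden illustration — the slides use the randomForest package in R, whereas this uses scikit-learn and synthetic data — but it shows the three ingredients: bootstrap trees, m = √p candidate features per split, and out-of-bag error monitoring.

```python
# Sketch of the random-forest recipe: bootstrap trees, m = sqrt(p) features per
# split, out-of-bag error. scikit-learn is an assumption; data are synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)

rf = RandomForestClassifier(
    n_estimators=500,        # number of bootstrap trees
    max_features="sqrt",     # m = sqrt(p) features tried at each split
    oob_score=True,          # monitor error on the out-of-bag observations
    random_state=0,
).fit(X, y)

print("out-of-bag error:", 1 - rf.oob_score_)
```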


Boosting Trevor Hastie, Stanford University 22

[Figure: ROC curves (Sensitivity vs Specificity) for TREE, SVM and Random Forest on the SPAM data. Random Forest − Error: 5.0%; SVM − Error: 6.7%; TREE − Error: 8.7%.]

TREE, SVM and RF

Random Forest dominates both other methods on the SPAM data — 5.0% error. Used 500 trees with default settings for the randomForest package in R.

Boosting Trevor Hastie, Stanford University 23

[Figure: schematic of boosting. Starting from the training sample, a sequence of re-weighted samples is produced, and a classifier C1(x), C2(x), C3(x), . . . , CM(x) is fit to each.]

Boosting

• Average many trees, each grown to re-weighted versions of the training data.

• Final Classifier is a weighted average of the classifiers:

C(x) = sign[ Σ_{m=1}^{M} α_m C_m(x) ]

Boosting Trevor Hastie, Stanford University 24

[Figure: test error vs number of terms (0 to 400) for Bagging and AdaBoost, using 100-node trees.]

Boosting vs Bagging

• 2000 points from Nested Spheres in R^10.

• Bayes error rate is 0%.

• Trees are grown best-first without pruning.

• Leftmost term is a single tree.

Boosting Trevor Hastie, Stanford University 25

AdaBoost (Freund & Schapire, 1996)

1. Initialize the observation weights w_i = 1/N, i = 1, 2, . . . , N.

2. For m = 1 to M repeat steps (a)–(d):

(a) Fit a classifier C_m(x) to the training data using the weights w_i.

(b) Compute the weighted error of the newest tree:

err_m = Σ_{i=1}^{N} w_i I(y_i ≠ C_m(x_i)) / Σ_{i=1}^{N} w_i.

(c) Compute α_m = log[(1 − err_m)/err_m].

(d) Update the weights for i = 1, . . . , N: w_i ← w_i · exp[α_m · I(y_i ≠ C_m(x_i))], and renormalize so that the w_i sum to 1.

3. Output C(x) = sign[ Σ_{m=1}^{M} α_m C_m(x) ].
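The steps above translate almost line for line into code. The sketch below uses depth-1 trees (stumps) as the weak classifier C_m; scikit-learn and the synthetic data are assumptions added for illustration and are not part of the slides.

```python
# Direct transcription of steps 1-3 above, with stumps as the weak classifier.
# scikit-learn is an assumption; the data are synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=2000, n_features=10, random_state=0)
y = 2 * y01 - 1                                   # labels in {-1, +1}
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

N, M = len(y_tr), 200
w = np.full(N, 1.0 / N)                           # 1. initialize weights
stumps, alphas = [], []
for m in range(M):                                # 2. for m = 1..M
    stump = DecisionTreeClassifier(max_depth=1).fit(X_tr, y_tr, sample_weight=w)  # (a)
    miss = (stump.predict(X_tr) != y_tr)
    err = np.sum(w * miss) / np.sum(w)            # (b) weighted error
    alpha = np.log((1 - err) / err)               # (c)
    w = w * np.exp(alpha * miss)                  # (d) up-weight misclassified points
    w = w / np.sum(w)                             #     renormalize
    stumps.append(stump)
    alphas.append(alpha)

F = sum(a * s.predict(X_te) for a, s in zip(alphas, stumps))
print("test error:", np.mean(np.sign(F) != y_te))  # 3. C(x) = sign(sum alpha_m C_m(x))
```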


Boosting Trevor Hastie, Stanford University 26

[Figure: test error vs boosting iterations (0 to 400) for boosted stumps, compared with a single stump and a single 400-node tree.]

Boosting Stumps

A stump is a two-node tree, after a single split. Boosting stumps works remarkably well on the nested-spheres problem.

Boosting Trevor Hastie, Stanford University 27

[Figure: training and test error vs number of terms (0 to 600).]

• Nested spheres in 10 dimensions.

• Bayes error is 0%.

• Boosting drives the training error to zero.

• Further iterations continue to improve test error in many examples.

Boosting Trevor Hastie, Stanford University 28

[Figure: training and test error vs number of terms (0 to 600); a horizontal line marks the Bayes Error.]

Noisy Problems

• Nested Gaussians in 10 dimensions.

• Bayes error is 25%.

• Boosting with stumps.

• Here the test error does increase, but quite slowly.

Boosting Trevor Hastie, Stanford University 29

Stagewise Additive Modeling

Boosting builds an additive model

f(x) = Σ_{m=1}^{M} β_m b(x; γ_m).

Here b(x; γ_m) is a tree, and γ_m parametrizes the splits.

We do things like that in statistics all the time!

• GAMs: f(x) = Σ_j f_j(x_j)

• Basis expansions: f(x) = Σ_{m=1}^{M} θ_m h_m(x)

Traditionally the parameters f_m, θ_m are fit jointly (i.e. least squares, maximum likelihood).

With boosting, the parameters (β_m, γ_m) are fit in a stagewise fashion. This slows the process down, and overfits less quickly.

Boosting Trevor Hastie, Stanford University 30

Additive Trees

• Simple example: stagewise least-squares?

• Fix the past M − 1 functions, and update the Mth using a tree:

min_{f_M ∈ Tree(x)} E( Y − Σ_{m=1}^{M−1} f_m(x) − f_M(x) )²

• If we define the current residuals to be

R = Y − Σ_{m=1}^{M−1} f_m(x)

then at each stage we fit a tree to the residuals:

min_{f_M ∈ Tree(x)} E( R − f_M(x) )²

Boosting Trevor Hastie, Stanford University 31

Stagewise Least Squares

Suppose we have available a basis family b(x; γ) parametrized by γ.

• After m − 1 steps, suppose we have the model f_{m−1}(x) = Σ_{j=1}^{m−1} β_j b(x; γ_j).

• At the mth step we solve

min_{β,γ} Σ_{i=1}^{N} ( y_i − f_{m−1}(x_i) − β b(x_i; γ) )²

• Denoting the residuals at the mth stage by r_im = y_i − f_{m−1}(x_i), the previous step amounts to

min_{β,γ} Σ_{i=1}^{N} ( r_im − β b(x_i; γ) )²

• Thus the term β_m b(x; γ_m) that best fits the current residuals is added to the expansion at each step.
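With trees as the basis b(x; γ), the stagewise least-squares loop is just “fit a small tree to the current residuals and add it to the expansion.” The sketch below shows this loop, with a shrinkage factor ν applied to each added term (anticipating the shrinkage variant a few slides later). scikit-learn trees, the synthetic regression data, and the particular parameter values are assumptions for illustration.

```python
# Sketch of stagewise least squares with small regression trees as b(x; gamma):
# each step fits a tree to the current residuals and adds it (shrunk by nu).
# scikit-learn is an assumption; the data are synthetic.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(500, 5))
y = X[:, 0] ** 2 + np.sin(3 * X[:, 1]) + 0.1 * rng.normal(size=500)

M, nu = 200, 0.1
f = np.zeros(len(y))                    # f_0(x) = 0
trees = []
for m in range(M):
    r = y - f                           # current residuals r_im = y_i - f_{m-1}(x_i)
    tree = DecisionTreeRegressor(max_depth=2).fit(X, r)
    f = f + nu * tree.predict(X)        # add the best-fitting term, shrunk by nu
    trees.append(tree)

print("training MSE after %d terms: %.4f" % (M, np.mean((y - f) ** 2)))
```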


Boosting Trevor Hastie, Stanford University 32

Adaboost: Stagewise Modeling

• AdaBoost builds an additive logistic regression model

f(x) = log[ Pr(Y = 1|x) / Pr(Y = −1|x) ] = Σ_{m=1}^{M} α_m G_m(x)

by stagewise fitting using the loss function

L(y, f(x)) = exp(−y f(x)).

• Given the current f_{m−1}(x), our solution for (β_m, G_m) is

arg min_{β,G} Σ_{i=1}^{N} exp[ −y_i ( f_{m−1}(x_i) + β G(x_i) ) ]

where G_m(x) ∈ {−1, 1} is a tree classifier and β_m is a coefficient.

Boosting Trevor Hastie, Stanford University 33

• With w_i^(m) = exp(−y_i f_{m−1}(x_i)), this can be re-expressed as

arg min_{β,G} Σ_{i=1}^{N} w_i^(m) exp(−β y_i G(x_i))

• We can show that this leads to the AdaBoost algorithm; see pp. 305.

Boosting Trevor Hastie, Stanford University 34

[Figure: loss as a function of the margin y · f, for misclassification, exponential, binomial deviance, squared error and support vector losses.]

Why Exponential Loss?

• e^{−y F(x)} is a monotone, smooth upper bound on misclassification loss at x.

• Leads to a simple reweighting scheme.

• Has the logit transform as population minimizer:

f*(x) = ½ log[ Pr(Y = 1|x) / Pr(Y = −1|x) ]

• Other more robust loss functions, like binomial deviance.

Boosting Trevor Hastie, Stanford University 35

General Stagewise Algorithm

We can do the same for more general loss functions, not only least squares.

1. Initialize f_0(x) = 0.

2. For m = 1 to M:

(a) Compute (β_m, γ_m) = arg min_{β,γ} Σ_{i=1}^{N} L(y_i, f_{m−1}(x_i) + β b(x_i; γ)).

(b) Set f_m(x) = f_{m−1}(x) + β_m b(x; γ_m).

Sometimes we replace step (b) in item 2 by

(b*) Set f_m(x) = f_{m−1}(x) + ν β_m b(x; γ_m).

Here ν is a shrinkage factor, and often ν < 0.1. Shrinkage slows the stagewise model-building even more, and typically leads to better performance.

Boosting Trevor Hastie, Stanford University 36

Gradient Boosting

• General boosting algorithm that works with a variety of different loss functions. Models include regression, resistant regression, K-class classification and risk modeling.

• Gradient Boosting builds additive tree models, for example, for representing the logits in logistic regression.

• Tree size is a parameter that determines the order of interaction (next slide).

• Gradient Boosting inherits all the good features of trees (variable selection, missing data, mixed predictors), and improves on the weak features, such as prediction performance.

• Gradient Boosting is described in detail in section 10.10.

Boosting Trevor Hastie, Stanford University 37

[Figure: test error vs number of terms (0 to 400) for boosted stumps, 10-node trees, 100-node trees, and AdaBoost.]

Tree Size

The tree size J determines the interaction order of the model:

η(X) = Σ_j η_j(X_j) + Σ_{jk} η_{jk}(X_j, X_k) + Σ_{jkl} η_{jkl}(X_j, X_k, X_l) + · · ·

Boosting Trevor Hastie, Stanford University 38

Stumps win!

Since the true decision boundary is the surface of a sphere, the function that describes it has the form

f(X) = X_1² + X_2² + . . . + X_p² − c = 0.

Boosted stumps via Gradient Boosting return reasonable approximations to these quadratic functions.

[Figure: Coordinate Functions for Additive Logistic Trees: the estimated coordinate functions f1(x1), . . . , f10(x10).]

Boosting Trevor Hastie, Stanford University 39

Spam Example Results

With 3000 training and 1500 test observations, Gradient Boosting fits an additive logistic model

f(x) = log[ Pr(spam|x) / Pr(email|x) ]

using trees with J = 6 terminal nodes.

Gradient Boosting achieves a test error of 4%, compared to 5.3% for an additive GAM, 5.0% for Random Forests, and 8.7% for CART.
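A fit in the spirit of the spam model above can be sketched as follows. Everything here is an assumption for illustration — the slides use the gbm package in R on the real spam data, whereas this uses scikit-learn on a synthetic stand-in with 57 features — but it shows the knobs that matter: the number of terms, the shrinkage ν, and trees with J = 6 terminal nodes.

```python
# Hedged sketch of a gradient-boosted logistic (binomial deviance) model with
# J = 6 terminal-node trees. scikit-learn is an assumption; data are synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4600, n_features=57, n_informative=20,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1500, random_state=0)

gbm = GradientBoostingClassifier(
    n_estimators=500,      # number of trees in the additive expansion
    learning_rate=0.1,     # shrinkage factor nu
    max_leaf_nodes=6,      # J = 6 terminal nodes per tree
    random_state=0,
).fit(X_tr, y_tr)

print("test error:", 1 - gbm.score(X_te, y_te))
```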


Boosting Trevor Hastie, Stanford University 40

Spam: Variable Importance

[Figure: relative importance (0 to 100) of the spam predictors. The most important variables include !, $, hp, remove, free, CAPAVE, your, CAPMAX, george, CAPTOT, edu, you, our, money, will, 1999, business, re and (, with the remaining variables trailing off in importance.]

Boosting Trevor Hastie, Stanford University 41

Spam: Partial Dependence

[Figure: partial dependence of f(x) on four predictors: !, remove, edu and hp.]

Boosting Trevor Hastie, Stanford University 42

Comparison of Learning Methods

Some characteristics of different learning methods.

[Table: characteristics of learning methods (Neural Nets, SVM, CART, GAM, KNN/Kernel, Gradient Boost), each rated good, fair or poor on: natural handling of data of “mixed” type; handling of missing values; robustness to outliers in input space; insensitivity to monotone transformations of inputs; computational scalability (large N); ability to deal with irrelevant inputs; ability to extract linear combinations of features; interpretability; and predictive power.]

Boosting Trevor Hastie, Stanford University 43

Software

• R: free GPL statistical computing environment available from CRAN, implements the S language. Includes:

– randomForest: implementation of Leo Breiman's algorithms.

– rpart: Terry Therneau's implementation of classification and regression trees.

– gbm: Greg Ridgeway's implementation of Friedman's gradient boosting algorithm.

• Salford Systems: Commercial implementation of trees, random forests and gradient boosting.

• Splus (Insightful): Commercial version of S.

• Weka: GPL software from the University of Waikato, New Zealand. Includes Trees, Random Forests and many other procedures.

Ensembles

Leon Bottou

COS 424 – 4/8/2010


Readings

• T. G. Dietterich (2000)

“Ensemble Methods in Machine Learning”.

• R. E. Schapire (2003):

“The Boosting Approach to Machine Learning”.

Sections 1,2,3,4,6.

Leon Bottou 2/33 COS 424 – 4/8/2010


Summary

1. Why ensembles?

2. Combining outputs.

3. Constructing ensembles.

4. Boosting.

Leon Bottou 3/33 COS 424 – 4/8/2010

I. Ensembles

Leon Bottou 4/33 COS 424 – 4/8/2010

Ensemble of classifiers

Ensemble of classifiers

– Consider a set of classifiers h1, h2, . . . , hL.

– Construct a classifier by combining their individual decisions.

– For example by voting their outputs.

Accuracy

– The ensemble works if the classifiers have low error rates.

Diversity

– No gain if all classifiers make the same mistakes.

– What if classifiers make different mistakes?

Leon Bottou 5/33 COS 424 – 4/8/2010

Uncorrelated classifiers

Assume ∀ r ≠ s: Cov[ 1{h_r(x) = y}, 1{h_s(x) = y} ] = 0.

The tally of classifier votes follows a binomial distribution.

Example: twenty-one uncorrelated classifiers with 30% error rate.
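Under that independence assumption, the majority vote of 21 classifiers errs only when 11 or more of them are wrong, which is a binomial tail probability. The sketch below computes it; scipy is an assumption added here, and the quoted value is approximate.

```python
# Sketch of the example above: 21 uncorrelated classifiers, each wrong 30% of
# the time; the majority vote errs when 11 or more are wrong. scipy assumed.
from scipy.stats import binom

p_single = 0.30
p_majority_wrong = binom.sf(10, 21, p_single)   # P(at least 11 of 21 are wrong)
print("single classifier error: %.2f" % p_single)
print("majority-vote error:     %.3f" % p_majority_wrong)   # roughly 0.026
```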

Leon Bottou 6/33 COS 424 – 4/8/2010


Statistical motivation

blue: classifiers that work well on the training set(s); f: the best classifier.

Leon Bottou 7/33 COS 424 – 4/8/2010

Computational motivation

blue: classifier search may reach local optima; f: the best classifier.

Leon Bottou 8/33 COS 424 – 4/8/2010

Representational motivation

blue: classifier space may not contain the best classifier; f: the best classifier.

Leon Bottou 9/33 COS 424 – 4/8/2010

Practical success

Recommendation system

– Netflix “movies you may like”.

– Customers sometimes rate movies they rent.

– Input: (movie, customer)

– Output: rating

Netflix competition

– 1M$ for the first team to do 10% better than their system.

Winner: BellKor team and friends

– Ensemble of more than 800 rating systems.

Runner-up: everybody else

– Ensemble of all the rating systems built by the other teams.

Leon Bottou 10/33 COS 424 – 4/8/2010


Bayesian ensembles

Let D represent the training data.

Enumerating all the classifiers

P(y|x, D) = Σ_h P(y, h|x, D)
          = Σ_h P(h|x, D) P(y|h, x, D)
          = Σ_h P(h|D) P(y|x, h)

P(h|D): how well h matches the training data.

P(y|x, h): what h predicts for pattern x.

Note that this is a weighted average.

Leon Bottou 11/33 COS 424 – 4/8/2010

II. Combining Outputs

Leon Bottou 12/33 COS 424 – 4/8/2010

Simple averaging

[Figure: schematic of simple (unweighted) averaging of the classifier outputs.]

Leon Bottou 13/33 COS 424 – 4/8/2010


Weighted averaging a priori

[Figure: schematic of weighted averaging of the classifier outputs with fixed (a priori) weights.]

Weights derived from the training errors, e.g. exp(−β TrainingError(h_t)). Approximate Bayesian ensemble.

Leon Bottou 14/33 COS 424 – 4/8/2010


Weighted averaging with trained weights

[Figure: schematic of weighted averaging with weights trained on a validation set.]

Train the weights on the validation set. Training the weights on the training set overfits easily. You need another validation set to estimate the performance!

Leon Bottou 15/33 COS 424 – 4/8/2010


Stacked classifiers

[Figure: schematic of stacking: the classifier outputs are fed to a second-tier classifier.]

Second tier classifier trained on the validation set.

You need another validation set to estimate the performance!

Leon Bottou 16/33 COS 424 – 4/8/2010

III. Constructing Ensembles

Leon Bottou 17/33 COS 424 – 4/8/2010

Diversification

Cause of the mistake                      Diversification strategy
Pattern was difficult                     hopeless
Overfitting (?)                           vary the training sets
Some features were noisy                  vary the set of input features
Multiclass decisions were inconsistent    vary the class encoding

Leon Bottou 18/33 COS 424 – 4/8/2010


Manipulating the training examples

Bootstrap replication simulates training set selection

– Given a training set of size n, construct a new training set by sampling n examples with replacement.

– About 30% of the examples are excluded.

Bagging

– Create bootstrap replicates of the training set.

– Build a decision tree for each replicate.

– Estimate tree performance using out-of-bootstrap data.

– Average the outputs of all decision trees.

Boosting

– See part IV.

Leon Bottou 19/33 COS 424 – 4/8/2010


Manipulating the features

Random forests

– Construct decision trees on bootstrap replicas. Restrict the node decisions to a small subset of features picked randomly for each node.

– Do not prune the trees.

– Estimate tree performance using out-of-bootstrap data.

– Average the outputs of all decision trees.

Multiband speech recognition

– Filter speech to eliminate a random subset of the frequencies.

– Train speech recognizer on filtered data.

– Repeat and combine with a second tier classifier.

– Resulting recognizer is more robust to noise.

Leon Bottou 20/33 COS 424 – 4/8/2010


Manipulating the output codes

Reducing multiclass problems to binary classification

– We have seen one versus all.

– We have seen all versus all.

Error correcting codes for multiclass problems

– Code the class numbers with an error correcting code.

– Construct a binary classifier for each bit of the code.

– Run the error correction algorithm on the binary classifier outputs.
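A hedged sketch of the output-code idea: encode each class as a bit string, train one binary classifier per bit, and decode a new point to the nearest codeword. scikit-learn's OutputCodeClassifier (which uses random rather than hand-designed error-correcting codes) and the synthetic data are assumptions, not part of the slides.

```python
# Sketch of output coding for multiclass problems: one binary classifier per
# code bit, decoding to the nearest codeword. scikit-learn is an assumption.
from sklearn.datasets import make_classification
from sklearn.multiclass import OutputCodeClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1500, n_features=20, n_informative=10,
                           n_classes=5, random_state=0)

ecoc = OutputCodeClassifier(
    estimator=DecisionTreeClassifier(max_depth=5),  # the per-bit binary classifier
    code_size=2.0,        # number of code bits as a multiple of the class count
    random_state=0,
).fit(X, y)

print("training accuracy:", ecoc.score(X, y))
```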

Leon Bottou 21/33 COS 424 – 4/8/2010

IV. Boosting

Leon Bottou 22/33 COS 424 – 4/8/2010

Motivation

• Easy to come up with rough rules of thumb for classifying data

– email contains more than 50% capital letters.

– email contains expression “buy now”.

• Each alone isn't great, but better than random.

• Boosting converts rough rules of thumb into an accurate classifier.

Boosting was invented by Prof. Schapire.

Leon Bottou 23/33 COS 424 – 4/8/2010


Adaboost

Given examples (x1, y1) . . . (xn, yn) with yi = ±1.

Let D1(i) = 1/n for i = 1 . . . n.

For t = 1 . . . T do

• Run weak learner using examples with weights Dt.

• Get weak classifier h_t.

• Compute error: ε_t = Σ_i D_t(i) 1{h_t(x_i) ≠ y_i}.

• Compute magic coefficient α_t = ½ log( (1 − ε_t) / ε_t ).

• Update weights: D_{t+1}(i) = D_t(i) e^{−α_t y_i h_t(x_i)} / Z_t.

Output the final classifier f_T(x) = sign( Σ_{t=1}^{T} α_t h_t(x) ).

Leon Bottou 24/33 COS 424 – 4/8/2010


Toy example

Weak classifiers: vertical or horizontal half-planes.

Leon Bottou 25/33 COS 424 – 4/8/2010

Adaboost round 1

Leon Bottou 26/33 COS 424 – 4/8/2010

Adaboost round 2

Leon Bottou 27/33 COS 424 – 4/8/2010

Adaboost round 3

Leon Bottou 28/33 COS 424 – 4/8/2010

Adaboost final classifier

Leon Bottou 29/33 COS 424 – 4/8/2010

From weak learner to strong classifier (1)

Preliminary

D_{T+1}(i) = D_1(i) · e^{−α_1 y_i h_1(x_i)}/Z_1 · · · e^{−α_T y_i h_T(x_i)}/Z_T = (1/n) e^{−y_i f_T(x_i)} / Π_t Z_t

Bounding the training error

(1/n) Σ_i 1{f_T(x_i) ≠ y_i} ≤ (1/n) Σ_i e^{−y_i f_T(x_i)} = Σ_i D_{T+1}(i) Π_t Z_t = Π_t Z_t

Idea: make Z_t as small as possible.

Z_t = Σ_{i=1}^{n} D_t(i) e^{−α_t y_i h_t(x_i)} = (1 − ε_t) e^{−α_t} + ε_t e^{α_t}

1. Pick h_t to minimize ε_t.

2. Pick α_t to minimize Z_t.

Leon Bottou 30/33 COS 424 – 4/8/2010


From weak learner to strong classifier (2)

Pick αt to minimize Zt (the magic coefficient)

∂Z_t/∂α_t = −(1 − ε_t) e^{−α_t} + ε_t e^{α_t} = 0  =⇒  α_t = ½ log( (1 − ε_t) / ε_t )

Weak learner assumption: γ_t = ½ − ε_t is positive and small.

Z_t = (1 − ε_t) √( ε_t / (1 − ε_t) ) + ε_t √( (1 − ε_t) / ε_t ) = √( 4 ε_t (1 − ε_t) ) = √( 1 − 4 γ_t² ) ≤ exp( −2 γ_t² )

TrainingError(f_T) ≤ Π_{t=1}^{T} Z_t ≤ exp( −2 Σ_{t=1}^{T} γ_t² )

The training error decreases exponentially if inf γ_t > 0.

But that does not happen beyond a certain point…

Leon Bottou 31/33 COS 424 – 4/8/2010


Boosting and exponential loss

Proofs are instructive

We obtain the bound

TrainingError(f_T) ≤ (1/n) Σ_i e^{−y_i H(x_i)} = Π_{t=1}^{T} Z_t

– without saying how D_t relates to h_t
– without using the value of α_t

Conclusion

– Round T chooses the h_T and α_T that maximize the exponential loss reduction from f_{T−1} to f_T.

Exercise

– Tweak Adaboost to minimize the log loss instead of the exp loss.

Leon Bottou 32/33 COS 424 – 4/8/2010


Boosting and margins

margin_H(x, y) = y H(x) / Σ_t |α_t| = Σ_t α_t y h_t(x) / Σ_t |α_t|

Remember support vector machines?

Leon Bottou 33/33 COS 424 – 4/8/2010


Ensemble learning: Lecture 12

David Sontag, New York University

Slides adapted from Luke Zettlemoyer, Vibhav Gogate, Rob Schapire, and Tommi Jaakkola


Ensemble methods

Machine learning competition with a $1 million prize

3

Bias/Variance Tradeoff

Hastie, Tibshirani, Friedman “Elements of Statistical Learning” 2001

4

Reduce Variance Without Increasing Bias

• Averaging reduces variance: Var(X̄) = Var(X)/N (when predictions are independent).

Average models to reduce model variance.

One problem: only one training set. Where do multiple models come from?
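A quick numerical check of that claim, under the stated independence assumption (pure numpy; the simulation setup is made up for illustration):

```python
# Averaging N independent, equal-variance predictions divides the variance by N.
import numpy as np

rng = np.random.default_rng(0)
N_models, n_trials, sigma = 25, 100000, 1.0

# each row: N independent model predictions of the same quantity (noise only)
preds = rng.normal(0.0, sigma, size=(n_trials, N_models))
avg = preds.mean(axis=1)

print("variance of one model:   %.4f" % preds[:, 0].var())
print("variance of the average: %.4f" % avg.var())   # about sigma^2 / N
```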


5

Bagging: Bootstrap Aggregation

• Leo Breiman (1994)

• Take repeated bootstrap samples from training set D.

• Bootstrap sampling: Given set D containing N training examples, create D' by drawing N examples at random with replacement from D.

• Bagging:
  – Create k bootstrap samples D1 … Dk.
  – Train a distinct classifier on each Di.
  – Classify a new instance by majority vote / average.

6

Bagging

• Best case: averaging N independent models reduces the variance by a factor of N, Var(X̄) = Var(X)/N.

In practice: models are correlated, so the reduction is smaller than 1/N, and the variance of models trained on fewer training cases is usually somewhat larger.

7

8

decision tree learning algorithm; very similar to ID3

shades of blue/red indicate strength of vote for particular classification

10

Reduce Bias² and Decrease Variance?

• Bagging reduces variance by averaging

• Bagging has little effect on bias

• Can we average and reduce bias?

• Yes: Boosting

Theory and Applications of Boosting

Rob Schapire


Example: “How May I Help You?” [Gorin et al.]

• goal: automatically categorize type of call requested by phone customer (Collect, CallingCard, PersonToPerson, etc.)

• “yes I'd like to place a collect call long distance please” (Collect)

• “operator I need to make a call but I need to bill it to my office” (ThirdNumber)

• “yes I'd like to place a call on my master card please” (CallingCard)

• “I just called a number in sioux city and I musta rang the wrong number because I got the wrong party and I would like to have that taken off of my bill” (BillingCredit)

• observation:
  – easy to find “rules of thumb” that are “often” correct, e.g.: “IF ‘card’ occurs in utterance THEN predict ‘CallingCard’ ”
  – hard to find a single highly accurate prediction rule

The Boosting Approach

• devise computer program for deriving rough rules of thumb

• apply procedure to subset of examples

• obtain rule of thumb

• apply to 2nd subset of examples

• obtain 2nd rule of thumb

• repeat T times


Key Details

• how to choose examples on each round?
  – concentrate on “hardest” examples (those most often misclassified by previous rules of thumb)

• how to combine rules of thumb into single prediction rule?
  – take (weighted) majority vote of rules of thumb

Boosting

• boosting = general method of converting rough rules of thumb into highly accurate prediction rule

• technically:
  – assume given “weak” learning algorithm that can consistently find classifiers (“rules of thumb”) at least slightly better than random, say, accuracy ≥ 55% (in two-class setting) [the “weak learning assumption”]
  – given sufficient data, a boosting algorithm can provably construct a single classifier with very high accuracy, say, 99%

Preamble: Early History

Strong and Weak Learnability

• boosting’s roots are in “PAC” learning model [Valiant ’84]

• get random examples from unknown, arbitrary distribution

• strong PAC learning algorithm:
  – for any distribution, with high probability, given polynomially many examples (and polynomial time), can find classifier with arbitrarily small generalization error

• weak PAC learning algorithm:
  – same, but generalization error only needs to be slightly better than random guessing (1/2 − γ)

• [Kearns & Valiant '88]: does weak learnability imply strong learnability?

If Boosting Possible, Then...

• can use (fairly) wild guesses to produce highly accurate predictions

• if can learn “part way” then can learn “all the way”

• should be able to improve any learning algorithm

• for any learning problem:
  – either can always learn with nearly perfect accuracy
  – or there exist cases where cannot learn even slightly better than random guessing

First Boosting AlgorithmsFirst Boosting AlgorithmsFirst Boosting AlgorithmsFirst Boosting AlgorithmsFirst Boosting Algorithms

• [Schapire ’89]:• first provable boosting algorithm

• [Freund ’90]:• “optimal” algorithm that “boosts by majority”

• [Drucker, Schapire & Simard ’92]:• first experiments using boosting• limited by practical drawbacks

• [Freund & Schapire ’95]:
  • introduced “AdaBoost” algorithm
  • strong practical advantages over previous boosting algorithms


Application: Detecting Faces [Viola & Jones]

• problem: find faces in photograph or movie

• weak classifiers: detect light/dark rectangles in image

• many clever tricks to make extremely fast and accurate


Basic Algorithm and Core Theory

• introduction to AdaBoost

• analysis of training error

• analysis of test error and the margins theory

• experiments and applications



A Formal Description of Boosting

• given training set (x1, y1), . . . , (xm, ym)

• yi ∈ {−1, +1} correct label of instance xi ∈ X

• for t = 1, . . . , T :
  • construct distribution Dt on {1, . . . , m}
  • find weak classifier (“rule of thumb”)

        ht : X → {−1, +1}

    with error εt on Dt :

        εt = Pr_{i∼Dt}[ ht(xi) ≠ yi ]

• output final/combined classifier Hfinal


AdaBoost [with Freund]

• constructing Dt :
  • D1(i) = 1/m
  • given Dt and ht :

        Dt+1(i) = (Dt(i) / Zt) × { e^{−αt}   if yi = ht(xi)
                                   e^{+αt}   if yi ≠ ht(xi) }

                = (Dt(i) / Zt) · exp( −αt yi ht(xi) )

    where Zt = normalization factor and

        αt = (1/2) ln( (1 − εt) / εt ) > 0

• final classifier:

        Hfinal(x) = sign( Σt αt ht(x) )
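To make the update above concrete, here is a minimal Python sketch of a single AdaBoost round (an illustration, not code from the slides): it assumes numpy, labels and weak-classifier outputs in {−1, +1}, and 0 < εt < 1; the names adaboost_round, D, preds are hypothetical.

    import numpy as np

    def adaboost_round(D, y, preds):
        """One round of the weight update on this slide.
        D: current distribution D_t over the m training examples (sums to 1)
        y: true labels in {-1,+1};  preds: weak classifier outputs h_t(x_i) in {-1,+1}"""
        eps = np.sum(D * (preds != y))            # eps_t = Pr_{i~D_t}[h_t(x_i) != y_i]
        alpha = 0.5 * np.log((1 - eps) / eps)     # alpha_t = 1/2 ln((1 - eps_t)/eps_t), > 0 when eps_t < 1/2
        D_next = D * np.exp(-alpha * y * preds)   # shrink weights on correct points, grow them on mistakes
        D_next /= D_next.sum()                    # dividing by Z_t makes D_{t+1} a distribution again
        return alpha, D_next

The final classifier is then sign(Σt αt ht(x)), accumulated over the returned αt values.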


Toy Example

[Figure: toy training points under the initial uniform distribution D1]

weak classifiers = vertical or horizontal half-planes


Round 1

[Figure: weak classifier h1 with ε1 = 0.30, α1 = 0.42, and the reweighted distribution D2]


Round 2

[Figure: weak classifier h2 with ε2 = 0.21, α2 = 0.65, and the reweighted distribution D3]


Round 3

[Figure: weak classifier h3 with ε3 = 0.14, α3 = 0.92]


Final Classifier

        Hfinal = sign( 0.42 h1 + 0.65 h2 + 0.92 h3 )

[Figure: decision regions of the combined classifier]
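As a quick arithmetic check (my own illustration, not part of the slides), the three vote weights follow from the round errors via αt = ½ ln((1 − εt)/εt):

    import math

    for t, eps in enumerate([0.30, 0.21, 0.14], start=1):
        alpha = 0.5 * math.log((1 - eps) / eps)   # alpha_t = 1/2 ln((1 - eps_t)/eps_t)
        print(f"round {t}: eps = {eps:.2f}  alpha = {alpha:.2f}")

This prints roughly 0.42, 0.66 and 0.91, matching the 0.42, 0.65, 0.92 shown above up to rounding in the slides.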


Voted combination of classifiers

• The general problem here is to try to combine many simple “weak” classifiers into a single “strong” classifier

• We consider voted combinations of simple binary ±1 component classifiers

hm(x) = α1 h(x; θ1) + . . . + αm h(x; θm)

where the (non-negative) votes αi can be used to emphasize component classifiers that are more reliable than others

Tommi Jaakkola, MIT CSAIL


Components: decision stumps

• Consider the following simple family of component classifiers generating ±1 labels:

        h(x; θ) = sign( w1 xk − w0 )

  where θ = {k, w1, w0}. These are called decision stumps.

• Each decision stump pays attention to only a single component of the input vector

[Figure: a decision stump thresholding a single component of the input]

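Below is a minimal sketch of how such a stump could be fit to weighted data; it assumes numpy, restricts w1 to ±1, and searches thresholds by brute force — the function name fit_stump and these simplifications are mine, not the slides’.

    import numpy as np

    def fit_stump(X, y, w):
        """Return (k, w1, w0) minimizing the weighted error of sign(w1 * x_k - w0).
        X: (n, p) inputs;  y: labels in {-1,+1};  w: nonnegative example weights."""
        best_err, best_theta = np.inf, None
        for k in range(X.shape[1]):                   # a stump looks at one component only
            xs = np.sort(X[:, k])
            cuts = (xs[:-1] + xs[1:]) / 2.0           # candidate boundaries between sorted values
            for t in cuts:
                for w1 in (-1.0, 1.0):
                    w0 = w1 * t                       # boundary at x_k = t for either orientation
                    pred = np.sign(w1 * X[:, k] - w0)
                    err = np.sum(w * (pred != y))     # weighted 0/1 error
                    if err < best_err:
                        best_err, best_theta = err, (k, w1, w0)
        return best_theta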


Voted combination cont’d

• We need to define a loss function for the combination so we can determine which new component h(x; θ) to add and how many votes it should receive

        hm(x) = α1 h(x; θ1) + . . . + αm h(x; θm)

• While there are many options for the loss function we consider here only a simple exponential loss

        exp{ −y hm(x) }



Modularity, errors, and loss

• Consider adding the mth component:

        Σ_{i=1}^{n} exp{ −yi [ hm−1(xi) + αm h(xi; θm) ] }

      = Σ_{i=1}^{n} exp{ −yi hm−1(xi) − yi αm h(xi; θm) }

      = Σ_{i=1}^{n} exp{ −yi hm−1(xi) } · exp{ −yi αm h(xi; θm) }
                    (the first factor is fixed at stage m)

      = Σ_{i=1}^{n} W_i^(m−1) exp{ −yi αm h(xi; θm) }

  where W_i^(m−1) = exp{ −yi hm−1(xi) }.

So at the mth iteration the new component (and the votes) should optimize a weighted loss (weighted towards mistakes).



Empirical exponential loss cont’d

• To increase modularity we’d like to further decouple the optimization of h(x; θm) from the associated votes αm

• To this end we select h(x; θm) that optimizes the rate at which the loss would decrease as a function of αm:

        ∂/∂αm |αm=0  Σ_{i=1}^{n} W_i^(m−1) exp{ −yi αm h(xi; θm) }

      = [ Σ_{i=1}^{n} W_i^(m−1) exp{ −yi αm h(xi; θm) } · ( −yi h(xi; θm) ) ]  evaluated at αm = 0

      = Σ_{i=1}^{n} W_i^(m−1) ( −yi h(xi; θm) )

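A small numerical check of this derivative-at-zero identity (purely illustrative; the random weights, labels and predictions below are my own choices, assuming numpy):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 20
    W = rng.random(n); W /= W.sum()                  # weights W_i^(m-1)
    y = rng.choice([-1.0, 1.0], size=n)              # labels
    h = rng.choice([-1.0, 1.0], size=n)              # candidate component's predictions

    loss = lambda a: np.sum(W * np.exp(-y * a * h))  # weighted exponential loss as a function of alpha_m

    delta = 1e-6
    print((loss(delta) - loss(-delta)) / (2 * delta))  # finite-difference slope at alpha_m = 0
    print(np.sum(W * (-y * h)))                        # the slide's closed-form expression; the two agree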


Empirical exponential loss cont’d

• We find h(x; θm) that minimizes

        − Σ_{i=1}^{n} W_i^(m−1) yi h(xi; θm)

  We can also normalize the weights,

        − Σ_{i=1}^{n} ( W_i^(m−1) / Σ_{j=1}^{n} W_j^(m−1) ) yi h(xi; θm)  =  − Σ_{i=1}^{n} W_i^(m−1) yi h(xi; θm),

  so that (with the normalized weights again written W_i^(m−1)) we have Σ_{i=1}^{n} W_i^(m−1) = 1.

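With normalized weights, minimizing −Σ W_i y_i h(x_i) is the same as minimizing the weighted classification error, since W_i y_i h(x_i) is +W_i on correct predictions and −W_i on mistakes; this is exactly the relation εm = 0.5 − 0.5 Σ W_i y_i h(x_i) used in step 1) of the AdaBoost algorithm below. A tiny check (my own illustration, assuming numpy):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 50
    W = rng.random(n); W /= W.sum()                  # normalized weights
    y = rng.choice([-1.0, 1.0], size=n)
    h = rng.choice([-1.0, 1.0], size=n)              # some candidate classifier's predictions

    weighted_error = np.sum(W * (h != y))
    correlation = np.sum(W * y * h)
    print(weighted_error, 0.5 - 0.5 * correlation)   # the same number (up to floating-point rounding)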


Selecting a new component: summary

• We find h(x; θm) that minimizes

        − Σ_{i=1}^{n} W_i^(m−1) yi h(xi; θm)

  where Σ_{i=1}^{n} W_i^(m−1) = 1.

• αm is subsequently chosen to minimize

        Σ_{i=1}^{n} W_i^(m−1) exp{ −yi αm h(xi; θm) }



The AdaBoost algorithm

0) Set W_i^(0) = 1/n for i = 1, . . . , n

1) At the mth iteration we find (any) classifier h(x; θm) for which the weighted classification error εm,

        εm = 0.5 − 0.5 Σ_{i=1}^{n} W_i^(m−1) yi h(xi; θm),

   is better than chance.

2) The new component is assigned votes based on its error:

        αm = 0.5 log( (1 − εm) / εm )

3) The weights are updated according to (Zm is chosen so that the new weights W_i^(m) sum to one):

        W_i^(m) = (1/Zm) · W_i^(m−1) · exp{ −yi αm h(xi; θm) }
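To tie steps 0)–3) together, here is a self-contained Python sketch that runs the loop on a small synthetic problem; everything here (the data, the threshold-stump weak learner, the round budget M and all names) is my own illustrative choice, assuming numpy — a sketch of the procedure above, not the authors’ code.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200
    X = rng.random((n, 2))                                    # toy inputs in the unit square
    y = np.where(X[:, 0] + X[:, 1] + 0.1 * rng.standard_normal(n) > 1.0, 1.0, -1.0)

    def stump_predict(X, k, w1, w0):
        return np.where(w1 * X[:, k] - w0 > 0, 1.0, -1.0)     # decision stump sign(w1 * x_k - w0)

    def best_stump(X, y, W):
        # step 1): maximize the weighted correlation sum_i W_i y_i h(x_i), i.e. minimize eps_m
        best, best_corr = None, -np.inf
        for k in range(X.shape[1]):
            for t in np.unique(X[:, k]):
                for w1 in (-1.0, 1.0):
                    w0 = w1 * t                               # boundary at x_k = t, either orientation
                    corr = np.sum(W * y * stump_predict(X, k, w1, w0))
                    if corr > best_corr:
                        best, best_corr = (k, w1, w0), corr
        return best

    M = 10
    W = np.full(n, 1.0 / n)                                   # step 0): uniform weights
    stumps, alphas = [], []
    for m in range(M):
        k, w1, w0 = best_stump(X, y, W)
        h = stump_predict(X, k, w1, w0)
        eps = 0.5 - 0.5 * np.sum(W * y * h)                   # step 1): weighted error
        eps = np.clip(eps, 1e-12, 1 - 1e-12)                  # guard against log(0) below
        alpha = 0.5 * np.log((1 - eps) / eps)                 # step 2): votes from the error
        W = W * np.exp(-y * alpha * h)                        # step 3): reweight ...
        W /= W.sum()                                          # ... and renormalize (this is Z_m)
        stumps.append((k, w1, w0)); alphas.append(alpha)
        H = np.sign(sum(a * stump_predict(X, kk, ww1, ww0)
                        for a, (kk, ww1, ww0) in zip(alphas, stumps)))
        print(f"round {m+1}: eps = {eps:.3f}  alpha = {alpha:.2f}  train error = {np.mean(H != y):.3f}")

The per-round training error of the voted combination typically falls well below that of any single stump, which is the behaviour boosting is designed to produce.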


