
Machine Learning Workshop

Presentation on Decision Trees and Random Forests given for the Boston Predictive Analytics Machine Learning Workshop on December 2, 2012. Code to accompany the slides is available at www.github.com/dgerlanc/mlclass or http://www.enplusadvisors.com/wp-content/uploads/2012/12/mlclass_1.0.tar.gz
Transcript
Page 1: Machine Learning Workshop

Hands-on Classification: Decision Trees and Random Forests

Daniel Gerlanc, Managing Director
Enplus Advisors, Inc.
www.enplusadvisors.com
[email protected]

Predictive Analytics Meetup Group
Machine Learning Workshop
December 2, 2012

Page 2: Machine Learning Workshop

© Daniel Gerlanc, 2012. All rights reserved.

If you’d like to use this material for any purpose, please contact [email protected]

Page 3: Machine Learning Workshop

What You’ll Learn

• Intuition behind decision trees and random forests

• Implementation in R

• Assessing the results

Page 4: Machine Learning Workshop

Dataset

• Chemical Analysis of Italian Wines

• http://www.parvus.unige.it/

• 178 records, 14 attributes

Page 5: Machine Learning Workshop

Follow along

> library(mlclass)
> data(wine)
> str(wine)
'data.frame': 178 obs. of 14 variables:
 $ Type       : Factor w/ 2 levels "Grig","No": 2 2 2 2 2 2 2 2 2 2 ...
 $ Alcohol    : num 14.2 13.2 13.2 14.4 13.2 ...
 $ Malic      : num 1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
 $ Ash        : num 2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
 $ Alcalinity : num 15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...

Page 6: Machine Learning Workshop

What are Decision Trees?

• A model that recursively partitions an input space into regions and attaches a prediction to each region

Page 7: Machine Learning Workshop

What’s partitioning?

See rf-1.R
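rf-1.R itself is not reproduced here; the following is a minimal sketch of the kind of plot it might produce. The variable pair (Alcohol, Malic) and the split point are chosen purely for illustration, not taken from the workshop code.

library(mlclass)   # workshop package providing data(wine)
data(wine)
# Scatter two attributes, colored by the class we want to predict.
plot(wine$Alcohol, wine$Malic, col=wine$Type,
     xlab="Alcohol", ylab="Malic", pch=19)
# A single vertical line is one possible "first split" of this space.
abline(v=13, lty=2)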

Page 8: Machine Learning Workshop

Create the 1st split.

(Plot: the first split separates a "G" region from a "Not G" region.)

See rf-1.R

Page 9: Machine Learning Workshop

Create the 2nd Split

(Plot: after two splits there are three regions, labeled "G", "Not G", and "G".)

See rf-1.R

Page 10: Machine Learning Workshop

Create more splits…

(Plot: additional splits carve out more "G" and "Not G" regions; the last boundary is hand-drawn. "I drew this one in.")

Page 11: Machine Learning Workshop

Another view of partitioning

See rf-2.R

Page 12: Machine Learning Workshop

Use R to do the partitioning.

library(rpart)
library(rpart.plot)

tree.1 <- rpart(Type ~ ., data=wine)
prp(tree.1, type=4, extra=2)

• See the 'rpart' and 'rpart.plot' R packages.
• Many parameters are available to control the fit (see the sketch below).

See rf-2.R
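The defaults often work well, but rpart exposes its fit controls through rpart.control. A brief sketch; the values shown are rpart's documented defaults, not workshop recommendations:

library(rpart)
library(rpart.plot)
tree.ctrl <- rpart(Type ~ ., data=wine,
                   control=rpart.control(
                     minsplit=20,   # min records needed to attempt a split
                     cp=0.01,       # min complexity improvement to keep a split
                     maxdepth=30))  # maximum depth of any node
prp(tree.ctrl, type=4, extra=2)

Lowering cp or minsplit grows a deeper, more overfit tree; raising them prunes more aggressively.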

Page 13: Machine Learning Workshop

Make predictions on a test dataset

predict(tree.1, newdata=wine, type="vector")
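For a classification tree, type="class" returns the predicted labels directly, which makes results easy to tabulate. A small sketch (note that these are predictions on the same data the tree was fit to):

pred <- predict(tree.1, newdata=wine, type="class")
table(Predicted=pred, Actual=wine$Type)   # confusion matrix
mean(pred == wine$Type)                   # accuracy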

Page 14: Machine Learning Workshop

How’d it do?

Guessing (always predicting the majority class): 60.11%

CART: 94.38% Accuracy
• Precision: 92.95% (66 / 71)
• Sensitivity/Recall: 92.95% (66 / 71)

Confusion matrix (rows = predicted, columns = actual):

Predicted    Grig    No
Grig           66     5
No              5   102

Page 15: Machine Learning Workshop

Decision Tree Problems

• Overfitting the data

• May not use all relevant features

• Perpendicular (axis-aligned) decision boundaries

Page 16: Machine Learning Workshop

Random Forests

One Decision Tree

Many Decision Trees (Ensemble)

Page 17: Machine Learning Workshop

Random Forest Fixes

• Overfitting the data

• May not use all relevant features

• Perpendicular (axis-aligned) decision boundaries

Page 18: Machine Learning Workshop

Building RF

For each tree:

Draw a bootstrap sample of the records

At each split, sample from the available variables

Page 19: Machine Learning Workshop

Bootstrap Sampling
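Each tree is grown on a sample of n records drawn with replacement, so on average only about 63.2% of the original records appear in any one sample (1 - 1/e ≈ 0.632). A quick check in R:

set.seed(1)
idx <- sample(nrow(wine), replace=TRUE)   # one bootstrap sample
mean(seq_len(nrow(wine)) %in% idx)        # fraction of records included, ~0.632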

Page 20: Machine Learning Workshop

Sample Attributes at each split
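At each split only a random subset of the predictors is considered; for classification the default subset size is the square root of the number of predictors. A sketch of one such draw:

predictors <- setdiff(names(wine), "Type")   # 13 candidate variables
mtry <- floor(sqrt(length(predictors)))      # default for classification: 3
sample(predictors, mtry)                     # variables available at one split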

Page 21: Machine Learning Workshop

Motivations for RF

• Create uncorrelated trees

• Variance reduction

• Subspace exploration

Page 22: Machine Learning Workshop

Random Forests

library(randomForest)
rffit.1 <- randomForest(Type ~ ., data=wine)

See rf-3.R
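rf-3.R is not reproduced here, but the fitted object already carries out-of-bag (OOB) estimates, which are a more honest accuracy measure than re-predicting the training data. A sketch:

print(rffit.1)       # includes the OOB error estimate and confusion matrix
rffit.1$confusion    # OOB confusion matrix with per-class error rates
predict(rffit.1, newdata=wine[1:5, ])   # class predictions for new records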

Page 23: Machine Learning Workshop

RF Parameters in R

The most important parameters are:

Variable   Description                                            Default
ntree      Number of trees                                        500
mtry       Number of variables to randomly select at each node    square root of # predictors for classification;
                                                                  # predictors / 3 for regression
nodesize   Minimum number of records in a terminal node           1 for classification; 5 for regression
sampsize   Number of records to select in each bootstrap sample   n, drawn with replacement (so about 63.2% of
                                                                  records are unique in each sample)
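Written out with the defaults made explicit for this dataset (13 predictors, so mtry = floor(sqrt(13)) = 3), the earlier call is roughly equivalent to the following sketch:

rffit.2 <- randomForest(Type ~ ., data=wine,
                        ntree=500,            # number of trees
                        mtry=3,               # variables tried at each split
                        nodesize=1,           # min records in a terminal node
                        sampsize=nrow(wine),  # bootstrap sample size
                        replace=TRUE)         # sample with replacement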

Page 24: Machine Learning Workshop

How’d it do?

Guessing Accuracy (majority class): 60.11%

Random Forest: 98.31% Accuracy
• Precision: 95.77% (68 / 71)
• Sensitivity/Recall: 100% (68 / 68)

Confusion matrix (rows = predicted, columns = actual):

Predicted    Grig    No
Grig           68     3
No              0   107

Page 25: Machine Learning Workshop

Tuning RF: Grid Search

See rf-4.R

(Plot of the grid-search results; annotation: "This is the default.")
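rf-4.R is not shown here; the following is a minimal sketch of a grid search that scores each parameter combination by its OOB error. The grid values are arbitrary illustrations:

grid <- expand.grid(mtry=c(2, 3, 5), nodesize=c(1, 5, 10))
grid$oob.error <- NA
for (i in seq_len(nrow(grid))) {
  fit <- randomForest(Type ~ ., data=wine,
                      mtry=grid$mtry[i], nodesize=grid$nodesize[i])
  # err.rate has one row per tree; the last row is the final OOB estimate.
  grid$oob.error[i] <- fit$err.rate[nrow(fit$err.rate), "OOB"]
}
grid[order(grid$oob.error), ]   # best-scoring settings first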

Page 26: Machine Learning Workshop

Tuning is Expensive

• The number of model fits is the product of the number of values tried for each parameter (e.g., 5 values for each of 3 parameters is 125 combinations)

• Plus repeated model fitting in cross-validation multiplies that again by the number of folds

Page 27: Machine Learning Workshop

Benefits of RF

• Good performance with default settings

• Relatively easy to make parallel (see the sketch after this list)

• Many implementations

• R, Weka, RapidMiner, Mahout
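Because the trees are independent, sub-forests can be grown on separate cores and merged. A sketch using foreach with randomForest::combine, a standard pattern (not necessarily how the workshop code parallelizes it):

library(foreach)
library(doParallel)
library(randomForest)
registerDoParallel(cores=4)
# Grow four sub-forests of 125 trees each in parallel, then merge them.
rf.par <- foreach(nt=rep(125, 4), .combine=randomForest::combine,
                  .packages="randomForest") %dopar%
  randomForest(Type ~ ., data=wine, ntree=nt)
rf.par   # a single 500-tree forest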

Page 28: Machine Learning Workshop

References

• A. Liaw and M. Wiener (2002). Classification and Regression by randomForest. R News 2(3), 18-22.

• Breiman, Leo, Jerome Friedman, Richard Olshen, and Charles Stone. Classification and Regression Trees. Belmont, Calif.: Wadsworth International Group, 1984. Print.

• Breiman, Leo and Adele Cutler. Random Forests. http://www.stat.berkeley.edu/~breiman/RandomForests/cc_contact.htm

