Decision Tree Ensembles


Decision Tree Ensembles (Random Forest & Gradient Boosting)
by Devin Didericksen


About Me

Education:
● B.S. Math - University of Utah
● M.S. Statistics - Utah State University

Career:
● Sr. Pricing Strategist - Overstock.com

Hobbies:
● Traveling, movies, food, and my two rabbits

[Photo: my wife, Sara, and me]

What Some of the Leading Data Scientists Say

“Random Forests have been the most successful general-purpose algorithm in modern times” - Jeremy Howard, CEO of Enlitic

“Gradient Boosting Machines are the best!” - Gilberto Titericz, #1 Ranked Kaggler

“When in doubt, use GBM” - Owen Zhang, #2 Ranked Kaggler, Chief Product Officer at DataRobot

“We achieve state of the art accuracy and demonstrate improved generalization [with Random Forest]” - Microsoft Xbox Kinect Research Team on human pose recognition

Kaggle in a Nutshell

Kaggle is a data competition marketplace.
● 400,000+ members across the globe
● Companies will sponsor a competition using their own data
● Prizes have ranged from an interview to $500k (but usually around $20k)
● Competition data is sliced into training, validation, and holdout sets (sketched in code below)
● Labels are hidden from the validation and holdout sets
● Winners are determined by performance against the holdout set
● Competitions are very competitive!
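To make the slicing concrete, here is a minimal R sketch of this kind of split (the toy data frame and the 60/20/20 proportions are illustrative assumptions, not Kaggle's actual ratios):

```r
# Toy data standing in for a competition dataset.
full_data <- data.frame(x = rnorm(100), label = rbinom(100, 1, 0.5))

# Randomly assign each row to one of three slices.
set.seed(42)
slice <- sample(c("train", "validation", "holdout"), nrow(full_data),
                replace = TRUE, prob = c(0.6, 0.2, 0.2))

train_set      <- full_data[slice == "train", ]
# On Kaggle, the labels of these two slices would be hidden from competitors.
validation_set <- full_data[slice == "validation", ]
holdout_set    <- full_data[slice == "holdout", ]
```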


Last 10 Kaggle Competitions

Competition | Prize Money | End Date | Best Single Model
SpringLeaf Marketing | $100,000 | 10/19 | Gradient Boosting
Dato Truly Native | $10,000 | 10/14 | Gradient Boosting
Flavours of Physics | $15,000 | 10/12 | Gradient Boosting
Recruit Coupon Purchase | $50,000 | 9/30 | Gradient Boosting
Caterpillar Tube Pricing | $30,000 | 8/31 | Gradient Boosting
Grasp-and-Lift EEG | $10,000 | 8/31 | Recurrent Convolutional Neural Net
Liberty Mutual Property | $25,000 | 8/28 | Gradient Boosting
Drawbridge Cross-Device | $10,000 | 8/24 | Gradient Boosting
Avito Context Ad Clicks | $20,000 | 7/28 | Gradient Boosting
Diabetic Retinopathy | $100,000 | 7/27 | Convolutional Neural Net

*It doesn’t take a data scientist to identify a trend here

Kaggle Competition - Titanic

Can you predict which people survived the Titanic tragedy?

Kaggle Titanic Data

PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S
5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | | S
6 | 0 | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.4583 | | Q
7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54 | 0 | 0 | 17463 | 51.8625 | E46 | S

Kaggle Titanic Data - Training Variable Selection

From the table above, Survived is kept as the label, while PassengerId, Name, Ticket, and Cabin are dropped as training features.

Kaggle Titanic Data - Training Set

Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked
0 | 3 | male | 22 | 1 | 0 | 7.25 | S
1 | 1 | female | 38 | 1 | 0 | 71.2833 | C
1 | 3 | female | 26 | 0 | 0 | 7.925 | S
1 | 1 | female | 35 | 1 | 0 | 53.1 | S
0 | 3 | male | 35 | 0 | 0 | 8.05 | S
0 | 3 | male | NA | 0 | 0 | 8.4583 | Q
0 | 1 | male | 54 | 0 | 0 | 51.8625 | S
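A minimal R sketch of this preparation step (assuming the competition's train.csv is in the working directory; the column drops match the slide above):

```r
# Load the Kaggle training file and drop the identifier/free-text columns.
train <- read.csv("train.csv", stringsAsFactors = TRUE)
train <- train[, !(names(train) %in% c("PassengerId", "Name", "Ticket", "Cabin"))]

# Survived is the label; make it a factor for classification models.
train$Survived <- as.factor(train$Survived)
```

This `train` data frame is reused in the sketches below.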

Decision Tree Pioneers

Leo Breiman, PhD (U. of California Berkeley): CART, Random Forest, Gradient Boosting
Adele Cutler, PhD (Utah State University): Random Forest
Jerome Friedman, PhD (Stanford University): Gradient Boosting

Simple Decision Tree

Classification and Regression Tree (CART)

[Figure: Titanic survival classification tree]

Classification and Regression Tree (CART)

Like Mr. Bean’s car, CART is:
● Super Simple - Trees are often easier to interpret than linear models.
● Very Efficient - The computational cost is minimal.
● Weak - A single tree has low predictive power on its own; CART belongs to a class of models called “weak learners” (see the sketch below).
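As a rough sketch, a tree like the Titanic survival tree above can be grown with R's rpart package (using the `train` data frame prepared earlier):

```r
library(rpart)

# Grow a classification tree; rpart handles the missing ages
# via surrogate splits, so no imputation is needed.
tree <- rpart(Survived ~ ., data = train, method = "class")

# Inspect the splits and get class probabilities.
print(tree)
probs <- predict(tree, newdata = train, type = "prob")
```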


Random Forest

Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked
0 | 3 | male | 22 | 1 | 0 | 7.25 | S
1 | 1 | female | 38 | 1 | 0 | 71.2833 | C
1 | 3 | female | 26 | 0 | 0 | 7.925 | S
1 | 1 | female | 35 | 1 | 0 | 53.1 | S
0 | 3 | male | 35 | 0 | 0 | 8.05 | S
0 | 3 | male | NA | 0 | 0 | 8.4583 | Q
0 | 1 | male | 54 | 0 | 0 | 51.8625 | S
0 | 3 | male | 2 | 3 | 1 | 21.075 | S
1 | 3 | female | 27 | 0 | 2 | 11.1333 | S
1 | 2 | female | 14 | 1 | 0 | 30.0708 | C

1. Randomly sample the rows (with replacement) and, at each node, randomly sample the columns (without replacement), then build a tree.
2. Repeat many times (1,000+).
3. Ensemble the trees by majority vote (i.e., if 300 out of 1,000 trees predict that a given individual dies, the predicted probability of death is 30%). See the sketch below.
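A minimal sketch of these steps with R's randomForest package (na.roughfix is one simple way to handle the missing ages, since randomForest does not accept NAs directly):

```r
library(randomForest)

# Grow 1,000 trees; mtry controls how many columns are sampled per node.
rf <- randomForest(Survived ~ ., data = train,
                   ntree = 1000, mtry = 3,
                   na.action = na.roughfix, importance = TRUE)

# Out-of-bag class probabilities: the fraction of trees voting
# for each class, exactly the majority-vote ensembling above.
head(predict(rf, type = "prob"))
```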

Random Forest - Tree 1, Tree 2, Several Trees

[Figures: the same training table shown with a different random sample of rows and columns highlighted for Tree 1, Tree 2, and so on; each tree is grown on its own random slice of the data.]

Random Forest

Like a Honda CR-V, Random Forest is:
● Versatile - It can do classification, regression, missing value imputation, clustering, and feature importance, and it works well on most data sets right out of the box.
● Efficient - Trees can be grown in parallel.
● Low Maintenance - There is only one parameter to tune: the number of columns to sample at each node (mtry in R). See the tuning sketch below.
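Tuning that single parameter can be sketched with the tuneRF helper from the same package (the step factor and improvement threshold here are illustrative):

```r
library(randomForest)

# Impute missing values, then search over mtry by out-of-bag error.
x <- na.roughfix(train[, setdiff(names(train), "Survived")])
y <- train$Survived
tuned <- tuneRF(x, y, ntreeTry = 500, stepFactor = 1.5, improve = 0.01)
```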

Gradient Boosting Machines (GBM)

Gradient Boosting Machines (GBM) - Tree 1, Boost 1, Tree 2, Boost 2, Tree 3, Probabilities

[Figures: each new tree is fit to the errors of the current ensemble, the ensemble's predictions are updated ("boosted"), and after the final tree the combined model outputs class probabilities.]


Gradient Boosting Machines (GBM)

Given this process, how quickly do you think this will overfit?

The surprising answer is: not very quickly. Boosting keeps improving far longer than intuition suggests, and there is even mathematical theory behind this behavior.
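Still, one practical safeguard is to choose the number of trees by cross-validation. A minimal sketch with R's gbm package (all parameter values are illustrative; `train` is the data frame prepared earlier):

```r
library(gbm)

# gbm's bernoulli loss expects a numeric 0/1 label.
train_gbm <- transform(train, Survived = as.numeric(as.character(Survived)))

# Small shrinkage plus many trees, with 5-fold CV to find the stopping point.
fit <- gbm(Survived ~ ., data = train_gbm, distribution = "bernoulli",
           n.trees = 2000, interaction.depth = 3,
           shrinkage = 0.01, cv.folds = 5)

# CV-optimal number of trees, i.e. where overfitting would begin.
best <- gbm.perf(fit, method = "cv")
probs <- predict(fit, newdata = train_gbm, n.trees = best, type = "response")
```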

Gradient Boosting Machines (GBM)

Like the original Hummer, GBM is:
● Powerful - On most real-world data sets, it is hard to beat in predictive power, and it handles missing values natively.
● High Maintenance - There are many parameters to tune, and extra precautions must be taken to prevent overfitting.
● Expensive - Boosting is inherently sequential and computationally expensive. However, GBM is a lot faster than it used to be with the arrival of XGBoost (see the sketch below).
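For completeness, a minimal XGBoost sketch in R (one-hot encoding via model.matrix is one simple choice; the parameter values are illustrative):

```r
library(xgboost)

# XGBoost wants a numeric matrix; one-hot encode the factors and keep
# rows with missing Age, since XGBoost treats NA as "missing" natively.
prev <- options(na.action = "na.pass")
X <- model.matrix(~ . - Survived - 1, data = train)
options(prev)
y <- as.numeric(as.character(train$Survived))

fit <- xgboost(data = X, label = y, nrounds = 200,
               objective = "binary:logistic",
               max_depth = 3, eta = 0.1, verbose = 0)

probs <- predict(fit, X)
```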

Deep Learning

Deep learning is like the 2008 Tesla Roadster…

The technology holds a lot of promise for the future, but in its current form it is usually not the most practical tool, outside of speech and image recognition.

Kaggle Titanic - R Code

Some (simple) CART and Random Forest R code can be found here:

https://github.com/didericksen/titanic/blob/master/titanic.R

End

Questions?

Contact: didericksen@gmail.com

linkedin.com/in/didericksen