Introduction to RandomForests 2004

Data Mining with Random Forests

An Introduction to RandomForests™

Salford Systems

Dan Steinberg, Mikhail Golovnya, N. Scott Cardell

http://www.salford-systems.com/

New approach for many data analytical tasks developed by Leo Breiman of University of California, Berkeley

◦ Co-author of CART® with Friedman, Olshen, and Stone

◦ Author of Bagging and Arcing approaches to combining trees

Good for classification and regression problems

◦ Also for clustering, density estimation

◦ Outlier and anomaly detection

◦ Explicit missing value imputation

Builds on the notions of committees of experts but is substantially different in key implementation details

Introduction to Random Forests

The term usually refers to pattern discovery in large databases

Initially appeared in the late twentieth century and directly associated with the PC boom

◦ Spread of data collection devices

◦ Dramatically increased data storage capacity

◦ Exponential growth in computational power of CPUs

The need to go well beyond standard statistical techniques in data analysis

◦ Dealing with extremely large numbers of variables

◦ Dealing with highly non-linear dependency structures

◦ Dealing with missing values and dirty data

Data Mining

The following major classes of problems are usually considered:

◦ Supervised Learning (interested in predicting some outcome variable based on observed predictors)

Regression (quantitative outcome)

Classification (nominal or categorical outcome)

◦ Unsupervised Learning (no single target variable available- interested in partitioning data into clusters, finding association rules, etc.)

Data Mining Problems

Relating gene expressions to the presence of a certain disease based upon microarray data

Identifying potential fraud cases in credit card transactions (binary target)

Predicting level of user satisfaction as poor, average, good, excellent (4-level target)

Optical Digit Recognition (10-level target)

Predicting consumer preferences towards different kinds of vehicles (could be as many as several hundred level target)

Classification Examples

Predicting efficacy of a drug based upon demographic factors

Predicting the amount of sales (target) based on current observed conditions

Predicting user energy consumption (target) depending on the season, business type, location, etc.

Predicting median house value (target) based on the crime rate, pollution level, proximity, age, industrialization level, etc.

Regression Examples

DNA Microarray Data- which samples cluster together? Which genes cluster together?

Market Basket Analysis- which products do customers tend to buy together?

Clustering For Classification- Handwritten zip code problem: can we find prototype digits for 1,2, etc. to use for classification?

Unsupervised Learning Problem Examples

The answer usually has two sides:

◦ Understanding the relationship

◦ Predictive accuracy

Some algorithms dominate one side (understanding)

◦ Classical methods

◦ Single trees

◦ Nearest neighbor

◦ MARS

Others dominate the other side (predicting)

◦ Neural nets

◦ TreeNet

◦ Random Forests

What do We Want from a Model- Traditional Approach

Leo Breiman says:

◦ Framing the question as the choice between accuracy and interpretability is an incorrect interpretation of what the goal of a statistical analysis is

The goal is NOT interpretability, but accurate information

Nature’s mechanisms are generally complex and cannot be summarized by a relatively simple stochastic model, even as a first approximation

The better the model fits the data, the more sound the inferences about the phenomenon are

New Way of Thinking!

The only way to attain the best predictive accuracy on real life data is to build a complex model

Analyzing this model will also provide the most accurate insight!

At the same time, the model complexity makes it far more difficult to analyze

◦ A random forest may contain 3,000 trees jointly contributing to the overall prediction

◦ There could be 5,000 association rules found in a typical unsupervised learning algorithm

Modeling Consequences

(Insert table) Example of a classification tree for UCSD heart disease study

Tree- Basic Building Block

Relatively fast

Requires minimal supervision by analyst

Produces easy to understand models

Conducts automatic variable selection

Handles missing values via surrogate splits

Invariant to monotonic transformations of predictors

Impervious to outliers

Single Tree- Major Strengths

Piece-wise constant models

“Sharp” decision boundaries

Exponential data exhaustion

Difficulties capturing global linear patterns

Models tend to evolve around the strongest effects

Not the best predictive accuracy

Single Tree- Major Weaknesses

A random forest is a collection of single trees grown in a special way

The overall prediction is determined by voting (in classification) or averaging (in regression)

The law of Large Numbers ensures convergence

The key to accuracy is low correlation and bias

To keep bias low, trees are grown to maximum depth

Random Forest
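A standard identity, not stated on the slide, may help explain why low correlation between trees matters. Assume T trees whose predictions at a point x each have variance σ² and pairwise correlation ρ (both symbols are introduced here for illustration). The variance of the averaged prediction is then:

```latex
\operatorname{Var}\!\left(\frac{1}{T}\sum_{t=1}^{T} f_t(x)\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{T}\,\sigma^{2}
```

As T grows the second term vanishes, so the remaining variance is governed by ρ. Growing trees to maximum depth keeps bias low, and decorrelating the trees keeps the variance of the average low.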

Each tree is grown on a bootstrap sample from the learning set

A number R is specified (the square root of the number of predictors by default) such that it is noticeably smaller than the total number of available predictors

During tree growing phase, at each node only R predictors are randomly selected and tried

The Algorithm
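The two sources of randomness described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the rounding of the square root is a choice made here, and the sizes (772 records, 111 predictors) are borrowed from the prostate example used later in the deck.

```python
import math
import random

def bootstrap_indices(n_records, rng):
    """Draw n_records row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_records) for _ in range(n_records)]

def candidate_predictors(n_predictors, rng, r=None):
    """Pick R distinct predictor indices to try at a single node.

    R defaults to round(sqrt(M)); the rounding rule is an assumption.
    """
    if r is None:
        r = max(1, round(math.sqrt(n_predictors)))
    return rng.sample(range(n_predictors), r)

rng = random.Random(0)
rows = bootstrap_indices(772, rng)       # records used to grow one tree
oob = set(range(772)) - set(rows)        # records this tree never saw
cols = candidate_predictors(111, rng)    # redrawn at every node in a real forest
print(len(rows), len(oob), sorted(cols))
```

In a full implementation the candidate predictors would be redrawn at each node while the bootstrap sample stays fixed for the whole tree.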

All major advantages of a single tree are automatically preserved

Since each tree is grown on a bootstrap sample, one can

◦ Use out of bag samples to compute an unbiased estimate of the accuracy

◦ Use out of bag samples to determine variable importances

There is no overfitting as the number of trees increases

Performance

It is possible to compute generalized proximity between any pair of cases

Based on proximities one can

◦ Proceed with a well-defined clustering solution

◦ Detect outliers

◦ Generate informative data views/projections using scaling coordinates

◦ Do missing value imputation

Easy expansion into the unsupervised learning domain

Further Derivatives

High levels of predictive accuracy delivered automatically

◦ Only a few control parameters to experiment with

◦ Strong for both regression and classification

Resistant to overtraining (overfitting)- generalizes well to new data

Trains rapidly even with thousands of potential predictors

◦ No need for prior feature (variable) selection

Diagnostics pinpoint multivariate outliers

Offers a revolutionary new approach to clustering using tree-based between-record distance measures

Built on CART® inspired trees and thus

◦ Results invariant to monotone transformations of variables

Benefits of Random Forests

Method intended to generate a large number of substantially different models

◦ Randomness introduced in two simultaneous ways

◦ By row: records selected for training at random with replacement (as in bootstrap resampling of the bagger)

◦ By column: candidate predictors at any node are chosen at random and best splitter selected from the random subset

Each tree is grown out to maximal size and left unpruned

◦ Trees are deliberately overfit, becoming a form of nearest neighbor predictor

◦ Experiments convincingly show that pruning these trees hurts performance

◦ Overfit individual trees combine to yield properly fit ensembles

Random Forests: Key Innovations-1

Self-testing possible even if all data is used for training

◦ Only 63% of available training data will be used to grow any one tree

◦ A 37% portion of training data always unused

The unused portion of the training data is known as Out-Of-Bag (OOB) data and can be used to provide an ongoing dynamic assessment of model performance

◦ Allows fitting to small data sets without explicitly holding back any data for testing

◦ All training data is used cumulatively in training, but only a 63% portion used at any one time

◦ Similar to cross-validation but unstructured

Random Forests: Key Innovations-2

Intensive post processing of data to extract more insight into data

◦ Most important is introduction of distance metric between any two data records

◦ The more similar two records are the more often they will land in same terminal node of a tree

◦ With a large number of different trees simply count the number of times they co-locate in same leaf nodes

◦ Distance metric can be used to construct dissimilarity matrix input into hierarchical clustering

Random Forests: Key Innovations- 3

Ultimately in modeling our goal is to produce a single score, prediction, forecast, or class assignment

The motivation for generating multiple models is the hope that combining them will give better results than relying on a single model

When multiple models are generated they are normally combined by

◦ Voting in classification problems, perhaps weighted

◦ Averaging in regression problems, perhaps weighted

Combining Trees

Combining trees via averaging or voting will only be beneficial if the trees are different from each other

In original bootstrap aggregation paper Breiman noted bagging worked best for high variance (unstable) techniques

◦ If results of each model are near identical little to be gained by averaging

Resampling of the bagger from the training data intended to induce differences in trees

◦ Accomplished essentially by varying the weight on any data record

Random Forests and Uncorrelated Trees

Bootstrap sample is fairly similar to taking a 63% sample from the original training data

If you grow many trees each based on a different 63% random sample of your data you expect some variation in the trees produced

Bootstrap sample goes a bit further in ensuring that the new sample is of the same size as the original by allowing some records to be selected multiple times

In practice the different samples induce different trees but trees are not that different

Bootstrap and Random Sampling

The bagger was limited by the fact that even with resampling trees are likely to be somewhat similar to each other, particularly with strong data structure

Random Forests induces vastly more between tree differences by forcing splits to be based on different predictors

◦ Accomplished by introducing randomness into split selection

Random Forests Key Insight: How to minimize inter-tree dependence

Breiman points out tradeoff:

◦ As R increases strength of individual tree should increase

◦ However, correlation between trees also increases reducing advantage of combining

Want to select R to optimally balance the two effects

◦ Can only be determined via experimentation

Breiman has suggested three values to test:

◦ R= 1/2 sqrt(M)

◦ R= sqrt(M)

◦ R= 2 sqrt(M)

◦ For M= 100 test values for R: 5, 10, 20

◦ For M= 400 test values for R: 10, 20, 40

Trade-Off: Individual tree strength vs advantage of the ensemble
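The three suggested test values can be computed directly from the number of predictors M. The sketch below assumes rounding to the nearest integer with a floor of 1; the function name is hypothetical.

```python
import math

def candidate_r_values(n_predictors):
    """Return the three values of R to try: half, one and two times sqrt(M)."""
    root = math.sqrt(n_predictors)
    # Rounding to the nearest integer (minimum 1) is an assumption of this sketch
    return [max(1, round(factor * root)) for factor in (0.5, 1.0, 2.0)]

print(candidate_r_values(100))  # [5, 10, 20]
print(candidate_r_values(400))  # [10, 20, 40]
```

Each candidate would then be compared on the out-of-bag error, since the best balance between tree strength and tree correlation can only be found by experiment.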

Random Forests machinery unlike CART in that

◦ Only one splitting rule: Gini

◦ Class weight concept but no explicit priors or costs

◦ No surrogates: Missing values imputed for data first automatically

◦ Default fast imputation just uses means

◦ Compute intensive method uses tree-based nearest neighbors to base imputation on (discussed later)

◦ None of the display and reporting machinery or tree refinement services of CART

Does follow CART in that all splits are binary

Random Forests vs CART tree

Trees combined via voting (classification) or averaging (regression)

Classification trees “vote”

◦ Recall that classification trees classify: assign each case to ONE class only

◦ With 50 trees, 50 class assignments for each case

◦ Winner is the class with the most votes

◦ Votes could be weighted- say by accuracy of individual trees

Regression trees assign a real predicted value for each case

◦ Predictions are combined via averaging

◦ Results will be much smoother than from a single tree

Trees Can be Combined By Voting or Averaging
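A minimal sketch of the two combination rules, assuming each tree's output is already available as a class label or a number. The function names and the tie-breaking rule are choices made for this illustration only.

```python
from collections import Counter

def majority_vote(class_predictions, weights=None):
    """Return the class with the largest (optionally weighted) vote total."""
    if weights is None:
        weights = [1.0] * len(class_predictions)
    totals = Counter()
    for label, weight in zip(class_predictions, weights):
        totals[label] += weight
    # Ties go to the label that was seen first; other rules are possible
    return max(totals, key=totals.get)

def average_prediction(values):
    """Combine regression tree outputs by simple averaging."""
    return sum(values) / len(values)

print(majority_vote([0, 2, 2, 1, 2]))              # 2
print(majority_vote([0, 1], weights=[0.9, 0.6]))   # 0
print(average_prediction([10.0, 12.0, 14.0]))      # 12.0
```

The weighted call shows the variant in which votes are weighted, for instance by the accuracy of the individual trees.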

Probability of being omitted in a single draw is (1 - 1/n)

Probability of being omitted in all n draws is (1 - 1/n)^n

Limit of series as n increases is 1/e = 0.368

◦ Approximately 36.8% of the sample is excluded: 0% of resample

◦ 36.8% of the sample is included once: 36.8% of resample

◦ 18.4% of the sample is included twice: 36.8% of resample

◦ 6.1% of the sample is included three times: 18.4% of resample

◦ 1.9% of the sample is included four or more times: 8% of resample

◦ Together these account for 100% of the resample

◦ Example: distribution of weights in a 2,000 record resample: (insert table)

Bootstrap Resampling Effectively Reweights Training Data (Randomly and Independently)
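The percentages above can be checked with a small simulation. This sketch draws one bootstrap resample of 2,000 records and tabulates how often each original record was selected; the function name and the seed are arbitrary choices, and the simulated shares will only approximate the limiting values.

```python
import math
import random
from collections import Counter

def resample_weight_distribution(n_records, seed=0):
    """Map 'times selected' to the share of original records selected that often."""
    rng = random.Random(seed)
    draws = Counter(rng.randrange(n_records) for _ in range(n_records))
    counts = Counter(draws[i] for i in range(n_records))  # 0 for records left out
    return {k: counts[k] / n_records for k in sorted(counts)}

dist = resample_weight_distribution(2000)
for times, share in dist.items():
    print(f"selected {times} time(s): {share:.3f} of the original records")

exact = (1 - 1 / 2000) ** 2000
print("exact share left out for n=2000:", exact, "limit 1/e:", math.exp(-1))
```

The share selected zero times is the out-of-bag fraction for that tree, and the other rows show the implicit weights the resample puts on the remaining records.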

Want to use mass spectrometer data to classify different types of prostate cancer

◦ 772 observations available

398- healthy samples

178- 1st type of cancer samples

196- 2nd type of cancer samples

◦ 111 mass spectra measurements are recorded for each sample

A Working Example:Mass Spectra Analysis

(insert table) The above table shows cross-validated prediction success results of a single CART tree for the prostate data

The run was conducted under PRIORS DATA to facilitate comparisons with subsequent RF run

◦ The relative error corresponds to the absolute error of 30.4%

Conventional CART a Baseline

Topic discussed by several Machine Learning researchers

Possibilities:

◦ Select splitter, split point, or both at random

◦ Choose splitter at random from the top K splitters

Random Forests: Suppose we have M available predictors

◦ Select R eligible splitters at random and let the best one split the node

◦ If R=1 this is just random splitter selection

◦ If R=M this becomes Breiman’s bagger

◦ If R<< M then we get Breiman’s Random Forests

Breiman suggests R=sqrt(M) as a good rule of thumb

Randomness in split selection

The performance of a single tree will be somewhat driven by the number of candidate predictors allowed at each node

Consider R=1: the splitter is always chosen at random and performance could be quite weak

As relevant splitters get into tree and tree is allowed to grow massively, single tree can be predictive even if R=1

As R is allowed to increase quality of splits can improve as there will be better (and more relevant) splitters

Strength of a Single Random Forest Tree

(insert graph) In this experiment, we ran RF with 100 trees on the prostate data using different values for the number of variables Nvars searched at each split

Performance of RF versus Single Tree as a function of NVars

RF clearly outperforms single tree for any number of Nvars

◦ We saw above that a properly pruned tree gives cross-validated absolute error of 30.4% (the very right end of the red curve)

The performance of a single tree tends to deviate substantially with the number of predictors allowed to be searched (a single tree is a high variance object)

The RF reaches the nearly stable error rate of about 20% when only 10 variables are searched in each node (marked by the blue color)

Discounting the minor fluctuations, the error rate also remains stable for Nvars above 10

◦ This generally agrees with Breiman’s suggestion to use the square root of the number of predictors (here N=111, so about 10.5) as a rough estimate of the optimal value for Nvars

The performance for small Nvars can be usually further improved by increasing the number of runs

Comments

(insert graph)

Initial RF Run- GUI

(insert table) The above results correspond to a standard RF run with 500 trees, Nvars=15, and unit class weights

Note that the overall error rate is 19.4% which is 2/3 of the baseline CART error of 30.4%

Initial RF Run- Classic

RF does not use a test dataset to report accuracy

For every tree grown, about 37% of data are left out-of-bag (OOB)

This means that these cases can be safely used in place of the test data to evaluate the performance of the current tree

For any tree in RF, its own OOB sample is used- hence no bias is ever introduced into the estimates

The final OOB estimate for the entire RF can be simply obtained by averaging individual OOB estimates

Consequently, this estimate is unbiased and behaves as if we had an independent test sample of the same size as the learn sample

Can We Trust the OOB Estimate?
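The averaging of per-tree OOB estimates described above can be sketched as follows. The representation of a tree as a pair (in-bag indices, prediction function) is an assumption made for this illustration, and the two toy "trees" are simple threshold rules rather than fitted models.

```python
def oob_error(trees, X, y):
    """Average the per-tree error rates, each measured on that tree's own OOB cases."""
    per_tree = []
    for in_bag, predict in trees:
        oob = [i for i in range(len(X)) if i not in in_bag]
        if not oob:
            continue  # a tree that saw every record has no OOB cases
        wrong = sum(predict(X[i]) != y[i] for i in oob)
        per_tree.append(wrong / len(oob))
    return sum(per_tree) / len(per_tree)

X = [[0.1], [0.4], [0.6], [0.9]]
y = [0, 0, 1, 1]
trees = [
    ({0, 1, 2}, lambda row: int(row[0] > 0.5)),  # OOB case 3 is classified correctly
    ({0, 2, 3}, lambda row: int(row[0] > 0.3)),  # OOB case 1 is misclassified
]
print(oob_error(trees, X, y))  # 0.5
```

Because every tree is scored only on records it never saw during training, no separate test set is needed.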

(insert table)

Comparing Within-Class Performance

The prostate dataset is somewhat unbalanced- class 1 contains fewer records than the remaining classes

Under the default RF settings, the minority classes will have higher misclassification rates than the dominant classes

Misbalance in the individual class error rates may also be caused by other data specific issues

Class weights are used in RF to boost the accuracy of the specified classes

General Rule of Thumb: to increase accuracy in the given class, one should increase the corresponding class weight

In many ways this is similar to the PRIORS control used in CART for the same purpose

Class weights

Our next run sets the weight for class one to 2

As a result, class 1 is classified with a much better accuracy at the cost of slightly reduced accuracy in the remaining classes

Manipulating Class Weights

At the end of an RF run, the proportion of votes for each class is recorded

We can define Margin of a case simply as the proportion of votes for the true class minus the maximum proportion of votes for the other classes

The larger the margin, the higher the confidence of classification

Class Vote Proportions and Margin
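The margin definition translates directly into code. In this sketch the vote proportions are passed as a dictionary from class label to share of votes; that layout and the function name are assumptions for illustration.

```python
def margin(vote_proportions, true_class):
    """Share of votes for the true class minus the largest share for any other class."""
    true_share = vote_proportions[true_class]
    best_other = max(
        share for label, share in vote_proportions.items() if label != true_class
    )
    return true_share - best_other

confident = margin({0: 0.80, 1: 0.15, 2: 0.05}, true_class=0)
misclassified = margin({0: 0.30, 1: 0.60, 2: 0.10}, true_class=0)
print(confident, misclassified)  # roughly 0.65 and -0.30
```

A positive margin means the forest votes for the true class, and a negative margin means another class received more votes.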

(insert table) This extract shows percent votes for the top 30 records in the dataset along with the corresponding margins

The green lines have high margins and therefore high confidence of predictions

The pink lines have negative margins, which means that these observations are not classified correctly

Example

The concept of margin allows new “unbiased” definition of variable importance

To estimate the importance of the mth variable:

◦ Take the OOB cases for the kth tree, and assume that we already know the margin M for those cases

◦ Randomly permute all values of the variable m

◦ Apply the kth tree to the OOB cases with the permuted values

◦ Compute the new margin M'

◦ Compute the difference M - M'

The variable importance is defined as the average lowering of the margin across all OOB cases and all trees in the RF

This procedure is fundamentally different from the intrinsic variable importance scores computed by CART- the latter are always based on the LEARN data and are subject to the overfitting issues

Variable Importance
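The permutation procedure can be sketched as below. Several things here are assumptions: each tree is represented as a pair (OOB indices, prediction function), and the margin of a single tree is taken to be +1 when its one vote goes to the true class and -1 otherwise. The toy tree splits on variable 0 only, so variable 1 should receive zero importance.

```python
import random

def tree_margin(predict, row, true_class):
    """Margin of a single tree's vote: +1 if it picks the true class, else -1."""
    return 1.0 if predict(row) == true_class else -1.0

def permutation_importance(trees, X, y, var, seed=0):
    """Average lowering of the margin after permuting one variable among OOB cases."""
    rng = random.Random(seed)
    drops = []
    for oob, predict in trees:
        permuted = [X[i][var] for i in oob]
        rng.shuffle(permuted)
        for i, new_value in zip(oob, permuted):
            before = tree_margin(predict, X[i], y[i])
            row = list(X[i])
            row[var] = new_value          # only variable `var` is disturbed
            after = tree_margin(predict, row, y[i])
            drops.append(before - after)
    return sum(drops) / len(drops)

X = [[i / 10, ((i * 7) % 10) / 10] for i in range(10)]
y = [int(row[0] > 0.45) for row in X]
rule = lambda row: int(row[0] > 0.45)     # a stand-in for one fitted tree
trees = [(list(range(10)), rule)]
imp_used = permutation_importance(trees, X, y, var=0)
imp_unused = permutation_importance(trees, X, y, var=1)
print(imp_used, imp_unused)
```

Because the scores are computed on OOB cases only, they are not inflated by the fit to the training data.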

The top portion of the variable importance list for the data is shown here

Analysis of the complete list reveals that all 111 variables are nearly equally strongly contributing to the model predictions

This is in a striking contrast with the single CART tree that has no choice but to use a limited subset of variables by tree’s construction

The above explains why the RF model has a significantly lower error rate (20%) when compared to a single CART tree (30%)

Example

RF introduces a novel way to define proximity between two observations

◦ Initialize proximities to zeroes

◦ For any given tree, apply the tree to all cases

◦ If cases i and j both end up in the same node, increase proximity prox(ij) between i and j by one

◦ Accumulate over all trees in RF and normalize by twice the number of trees in RF

The resulting matrix of size NxN provides intrinsic measure of proximity

◦ The measure is invariant to monotone transformations

◦ The measure is clearly defined for any type of independent variables, including categorical

Proximity Measure
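A small sketch of the accumulation step, assuming the terminal node reached by every case in every tree is already known. The input layout and function name are hypothetical. Note one deliberate difference from the slide: this sketch divides by the number of trees rather than twice that number, so that every case has proximity exactly one to itself.

```python
def proximity_matrix(leaf_ids):
    """leaf_ids[t][i] is the terminal node that case i reaches in tree t."""
    n_trees = len(leaf_ids)
    n_cases = len(leaf_ids[0])
    prox = [[0.0] * n_cases for _ in range(n_cases)]
    for leaves in leaf_ids:
        for i in range(n_cases):
            for j in range(n_cases):
                if leaves[i] == leaves[j]:
                    prox[i][j] += 1.0
    # Normalizing by the number of trees is an assumption of this sketch
    return [[value / n_trees for value in row] for row in prox]

leaf_ids = [
    [0, 0, 1, 1],   # tree 1: cases 0,1 share a leaf; cases 2,3 share a leaf
    [0, 1, 1, 1],   # tree 2: cases 1,2,3 share a leaf
    [2, 2, 2, 3],   # tree 3: cases 0,1,2 share a leaf
]
p = proximity_matrix(leaf_ids)
for row in p:
    print([round(value, 2) for value in row])
```

The nested loop makes the quadratic cost visible, which is why the full matrix becomes impractical for large datasets.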

(insert graph) The above extract shows the proximity matrix for the top 10 records of the prostate dataset

◦ Note ones on the main diagonal- any case has “perfect” proximity to itself

◦ Observations that are “alike” will have proximities close to one; these cells have a green background

◦ The closer the proximity is to 0, the more dissimilar cases i and j are; these cells have a pink background

Example

Having the full intrinsic proximity matrix opens new horizons

◦ Informative data views using metric scaling

◦ Missing value imputation

◦ Outlier detection

Unfortunately, things get out of control when dataset size exceeds 5,000 observations (25,000,000+ cells are needed)

RF switches to “compressed” form of the proximity matrix to handle large datasets- for any case, only M closest cases are recorded. M is usually less than 100.

Using Proximities

The values 1-prox(ij) can be treated as Euclidean distances in a high dimensional space

The theory of metric scaling solves the problem of finding the most representative projections of the underlying data “cloud” onto low dimensional space using the data proximities

◦ The theory is similar in spirit to the principal components analysis and discriminant analysis

The solution is given in the form of ordered “scaling coordinates”

Looking at the scatter plots of the top scaling coordinates provides informative views of the data

Scaling Coordinates
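For reference, the textbook form of classical metric scaling is sketched below. It is not taken from the slides. It assumes d_ij = 1 - prox(ij) is used as the distance between cases i and j, as stated above, and the symbols J, B, V, Λ and z_k are introduced here for illustration.

```latex
D^{(2)}_{ij} = d_{ij}^{2}, \qquad
J = I - \tfrac{1}{N}\,\mathbf{1}\mathbf{1}^{\top}, \qquad
B = -\tfrac{1}{2}\, J\, D^{(2)} J, \qquad
B = V \Lambda V^{\top}, \qquad
z_k = \sqrt{\lambda_k}\; v_k
```

The k-th scaling coordinate z_k comes from the k-th largest positive eigenvalue λ_k and its eigenvector v_k, so plotting z_1 against z_2 gives the most informative two-dimensional view under these assumptions.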

(insert graph)

This extract shows five initial scaling coordinates for the top 30 records of the prostate data

We will look at the scatter plots among the first, second, and third scaling coordinates

The following color codes will be used for the target classes:

◦ Green- class 0

◦ Red- class 1

◦ Blue- class 2

Example

(insert graphs)

A nearly perfect separation of all three classes is clearly seen

From this we conclude that the outcome variable admits clear prediction using RF model which utilizes 111 original predictors

The residual error is mostly due to the presence of the “focal” point where all the three rays meet

2-D Scatter Plots

(insert graph)

Sample GUI Display

(insert graphs)

Again, three distinct target classes show up as separate clusters

The “focal” point represents a cluster of records that can’t be distinguished from each other

3-D Rotation Plots

Outliers are defined as cases having small proximities to all other cases belonging to the same target class

The following algorithm is used:

◦ For a case n, compute the sum of the squares of prox(nk) for all k in the same class as n

◦ Take the inverse- it will be large if the case is “far away” from the rest

◦ Standardize using the median and standard deviation

◦ Look at the cases with the largest values- those are potential outliers

Generally, a value above 10 is reason to suspect the case of being an outlier

Outlier Detection
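The outlier measure can be sketched as follows. Three details are assumptions of this illustration: a case's proximity to itself is left out of the sum, the standardization is done separately within each class, and the population standard deviation is used. The tiny proximity matrix is made up for the example.

```python
import statistics

def outlier_scores(prox, classes):
    """Inverse sum of squared same-class proximities, standardized within class."""
    n = len(prox)
    raw = []
    for i in range(n):
        total = sum(
            prox[i][k] ** 2 for k in range(n) if k != i and classes[k] == classes[i]
        )
        raw.append(1.0 / max(total, 1e-12))  # floor avoids division by zero
    scores = [0.0] * n
    for label in set(classes):
        members = [i for i in range(n) if classes[i] == label]
        values = [raw[i] for i in members]
        center = statistics.median(values)
        spread = statistics.pstdev(values)
        for i in members:
            scores[i] = (raw[i] - center) / spread if spread > 0 else 0.0
    return scores

prox = [
    [1.0, 0.9, 0.9, 0.1],
    [0.9, 1.0, 0.9, 0.1],
    [0.9, 0.9, 1.0, 0.1],
    [0.1, 0.1, 0.1, 1.0],   # case 3 rarely shares a leaf with the others
]
scores = outlier_scores(prox, classes=[0, 0, 0, 0])
print([round(s, 2) for s in scores])
```

With only four cases the score cannot approach the threshold of 10 mentioned above; the example only shows that the isolated case receives the largest value.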

This extract shows top 30 records of the prostate dataset sorted descending by the outlier measure

Clearly the top 6 cases (class 2 with IDs: 771, 683, 539, and class 0 with IDs 127, 281, 282) are suspicious

All of these seem to be located at the “focal point” on the corresponding scaling coordinate plots

Example

(insert graph)

Sample GUI Display

RF offers two ways of missing value imputation

The Cheap Way- conventional median imputation for continuous variables and mode imputation for categorical variables

The Right Way:

◦ Suppose case n has x coordinate missing

◦ Do the Cheap Way imputation for starters

◦ Grow a full size RF

◦ We can now re-estimate the missing value by a weighted average over all cases k with non-missing x using weights prox(nk)

◦ Repeat the last two steps (grow the RF, re-estimate) several times to ensure convergence

Missing Value Imputation
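The re-estimation step of the Right Way is a proximity-weighted average, sketched here for a single continuous variable. The function name and the use of None to mark missing values are assumptions for illustration.

```python
def impute_by_proximity(values, prox_row):
    """Proximity-weighted average of the observed values of one variable.

    values[k] is None where the variable is missing; prox_row[k] is prox(nk).
    """
    pairs = [(w, v) for w, v in zip(prox_row, values) if v is not None]
    total_weight = sum(w for w, _ in pairs)
    if total_weight == 0:
        return None  # no similar case with an observed value; keep the rough fill
    return sum(w * v for w, v in pairs) / total_weight

values = [None, 10.0, 20.0, 40.0]     # case 0 is the one with x missing
prox_row = [1.0, 0.5, 0.5, 0.0]       # proximities of case 0 to every case
print(impute_by_proximity(values, prox_row))  # 15.0
```

In the full procedure this estimate replaces the rough fill, the forest is regrown, and the step is repeated.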

An alternative display to view how the target classes are different with respect to the individual predictors

◦ Recall that at the end of an RF run each case in the dataset has K separate vote counts for class membership (assuming K target classes)

◦ Take any target class and sort all observations by the count of votes for this class descending

◦ Take the top 50 observations and the bottom 50 observations, those are correspondingly the most likely and the least likely members of the given target class

◦ Parallel coordinate plots report uniformly (0,1) scaled values of all predictors for the top 50 and bottom 50 sorted records, along with the 25th, 50th and 75th percentiles within each predictor

Parallel Coordinates

(insert graph)

This is a detailed display of the normalized values of the initial 20 predictors for the top voted 50 records in each target class (this gives 50x3=150 graphs)

Class 0 generally has normalized values of the initial 20 predictors close to 0 (left side), except perhaps M9X11

Example

(insert graph) It is easier to see this when looking at the quartile plots only

Note that class 2 tends to have the largest values of the corresponding predictors

The graph can be scrolled forward to view all of the 111 predictors

Example (continued)

(insert graph)

The least likely plots lead to roughly similar conclusions: small predictor values are the least likely for class 2, etc.

Example (continued)

RF admits an interesting possibility to solve unsupervised learning problems, in particular, clustering problems and missing value imputation in the general sense

Recall that in the unsupervised learning the concept of target is not defined

RF generates a synthetic target variable in order to proceed with a regular run:

◦ Give class label 1 to the original data

◦ Create a copy of the data such that each variable is sampled independently from the values available in the original dataset

◦ Give class label 2 to the copy of the data

◦ Note that the second copy has marginal distributions identical to the first copy, whereas the possible dependency among predictors is completely destroyed

◦ A necessary drawback is that the resulting dataset is twice as large as the original

The Road to Clustering
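The construction of the synthetic two-class problem can be sketched as below. Sampling each column with replacement is an assumption of this illustration (permuting each column independently would be another reading of the slide), and the function name and toy data are hypothetical.

```python
import random

def make_synthetic_problem(data, seed=0):
    """Stack the original rows (label 1) on a column-wise resampled copy (label 2)."""
    rng = random.Random(seed)
    n_rows, n_cols = len(data), len(data[0])
    synthetic_columns = []
    for c in range(n_cols):
        observed = [row[c] for row in data]
        # Each column is sampled on its own, which breaks dependence between columns
        synthetic_columns.append([rng.choice(observed) for _ in range(n_rows)])
    synthetic = [[synthetic_columns[c][r] for c in range(n_cols)] for r in range(n_rows)]
    X = [list(row) for row in data] + synthetic
    y = [1] * n_rows + [2] * n_rows
    return X, y

data = [[1, 10], [2, 20], [3, 30], [4, 40]]   # second column is 10x the first
X, y = make_synthetic_problem(data)
print(len(X), y)
print(X[4:])   # synthetic rows: same column values, dependence mostly broken
```

A forest that separates label 1 from label 2 well is evidence that the original predictors are dependent on one another.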

We now have a clear binary supervised learning problem

Running an RF on this dataset may provide the following insights:

◦ When the resulting misclassification error is high (above 50%), the variables are basically independent- no interesting structure exists

◦ Otherwise, the dependency structure can be further studied by looking at the scaling coordinates and exploiting the proximity matrix in other ways

◦ For instance, the resulting proximity matrix can be used as an important starting point for the subsequent hierarchical clustering analysis

Recall that the proximity measures are invariant to monotone transformations and naturally support categorical variables

The same missing value imputation procedure as before can now be employed

These techniques work extremely well for small datasets

Analyzing the Synthetic Data

We generated a synthetic dataset based on the prostate data

The resulting dataset still has 111 predictors but twice the number of records- the first half being the exact replica of the original data

The final error is only 0.2% which is an indication of a very strong dependency among the predictors

Example

(insert graph)

The resulting plots resemble what we had before

However, this distance is in terms of how dependent the predictors are, whereas previously it was in terms of having the same target class

In view of this, the non cancerous tissue (green) appears to stand apart from the cancerous

Scaling Coordinates

+ Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140.

+ Breiman, L. (1996). Arcing classifiers (Technical Report). Berkeley: Statistics Department, University of California.

+ Buntine, W. (1991). Learning classification trees. In D.J. Hand, ed., Artificial Intelligence Frontiers in Statistics, Chapman and Hall: London, 182-201.

+ Dietterich, T. (1998). An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, Boosting, and Randomization. Machine Learning, 40, 139-158.

+ Freund, Y. & Schapire, R. E. (1996). Experiments with a new boosting algorithm. In L. Saitta, ed., Machine Learning: Proceedings of the Thirteenth National Conference, Morgan Kaufmann, pp. 148-156.

+ Friedman, J.H. (1999). RandomForests. Stanford: Statistics Department, Stanford University.

+ Friedman, J.H. (1999). Greedy function approximation: a gradient boosting machine. Stanford: Statistics Department, Stanford University.

+ Heath, D., Kasif, S., and Salzberg, S. (1993) k-dt: A multi-tree learning method. Proceedings of the Second International Workshop on Multistrategy Learning, 1002-1007, Morgan Kaufman: Chambery, France.

+ Kwok, S., and Carter, C. (1990). Multiple decision trees. In Shachter, R., Levitt, T., Kanal, L., and Lemmer, J., eds. Uncertainty in Artificial Intelligence 4, North-Holland, 327-335.

References
