On the Optimality of Probability Estimation by Random Decision Trees
Wei Fan, IBM T.J. Watson

Transcript
Page 1:

On the Optimality of Probability Estimation by Random Decision Trees

Wei Fan, IBM T.J. Watson

Page 2:

Some important facts about inductive learning

Given a set of labeled data items, e.g. credit-card transactions described by features such as amount, merchant category, outstanding balance, date/time, and so on, where the label says whether each transaction is a fraud or not.

Inductive model: predicts whether a new transaction is a fraud or not.

Perfect model: never makes mistakes. Not always possible due to:
the stochastic nature of the problem, noise in the training data, and insufficient data.

Page 3:

Optimal Model

A loss function L(t, y) is used to evaluate performance, where t is the true label and y is the prediction.
The optimal decision y* is the label that minimizes the expected loss when x is sampled many times:
0-1 loss: y* is the label that appears most often, i.e., if P(fraud|x) > 0.5, predict fraud.
Cost-sensitive loss: y* is the label that minimizes the "empirical risk"; if P(fraud|x) * $1000 > $90, i.e., P(fraud|x) > 0.09, predict fraud.
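A minimal sketch of these two decision rules in Python, assuming the probability estimate P(fraud|x) is already available and using the $1000 / $90 figures from the slide (the function and parameter names are illustrative):

```python
def predict_zero_one(p_fraud):
    """0-1 loss: predict the label that appears most often."""
    return "fraud" if p_fraud > 0.5 else "non-fraud"

def predict_cost_sensitive(p_fraud, gain=1000.0, overhead=90.0):
    """Cost-sensitive loss: predict fraud when the expected recovered amount
    p_fraud * gain exceeds the overhead, i.e. when p_fraud > overhead / gain."""
    return "fraud" if p_fraud * gain > overhead else "non-fraud"

# A transaction with an estimated 10% fraud probability:
p = 0.10
print(predict_zero_one(p))        # non-fraud (0.10 <= 0.5)
print(predict_cost_sensitive(p))  # fraud (0.10 * $1000 = $100 > $90)
```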

Page 4:

How do we look for optimal models?

Finding the optimal model is NP-hard for most "model representations".
We assume that the simplest hypothesis that fits the data is the best, and we employ all kinds of heuristics to look for it:
splitting criteria such as information gain and the Gini index;
pruning such as MDL pruning, reduced-error pruning, and cost-based pruning.
In reality this is tractable, but still very expensive.

Page 5:

How many optimal models are out there?

0-1 loss, binary problem: if the truth is P(positive|x) > 0.5, we predict x to be positive, so estimates of P(positive|x) = 0.6 and P(positive|x) = 0.9 make no difference to the final prediction.

Cost-sensitive problems: if the truth is P(fraud|x) * $1000 > $90, we predict x to be fraud. Rewriting the condition as P(fraud|x) > 0.09, estimates of P(fraud|x) = 1.0 and P(fraud|x) = 0.091 make no difference either.

There are really many, many optimal models out there.

Page 6:

Random Decision Tree: Outline

Train multiple trees (details to follow).
When classifying an example x, each tree outputs a posterior probability.
The probability outputs of the individual trees are averaged to form the final probability estimate.
The loss function and the estimated probability are then used to make the best prediction.
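A minimal sketch of the averaging step, assuming each tree has already produced its own estimate of P(fraud|x) for the same example (the helper name is illustrative):

```python
def ensemble_probability(per_tree_probs):
    """Final estimate: the simple average of the posterior probabilities
    output by the individual random trees for the same example x."""
    return sum(per_tree_probs) / len(per_tree_probs)

# Suppose three random trees estimate P(fraud|x) for one transaction as:
tree_outputs = [0.05, 0.20, 0.10]
print(ensemble_probability(tree_outputs))  # ~0.117; the loss function then picks the label
```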

Page 7:

Training

At each node, an unused feature is chosen at random.
A discrete feature is unused if it has never been chosen previously on the decision path from the root to the current node.
A continuous feature can be chosen multiple times on the same decision path, but each time a different threshold value is chosen.
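A minimal sketch of this split-selection rule, assuming features are described by a name, a discrete/continuous flag, and (for continuous features) a value range to draw a threshold from; all names here are illustrative:

```python
import random

def choose_split(features, used_discrete):
    """Pick a random feature to test at the current node.

    features: list of (name, is_discrete, (low, high)) tuples; (low, high)
    is only used to draw a threshold for continuous features.
    used_discrete: names of discrete features already tested on the path
    from the root to this node; these may not be chosen again.
    """
    candidates = [f for f in features if not (f[1] and f[0] in used_discrete)]
    if not candidates:
        return None  # no unused feature left; the node becomes a leaf
    name, is_discrete, (low, high) = random.choice(candidates)
    if is_discrete:
        return name, None                      # split on the values of the discrete feature
    return name, random.uniform(low, high)     # continuous: a fresh random threshold

# 'gender' was already used on this path, so only 'age' can be chosen,
# possibly again later with a different threshold.
features = [("gender", True, (0, 0)), ("age", False, (18, 80))]
print(choose_split(features, used_discrete={"gender"}))
```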

Page 9:

Example

[Tree diagram: one random tree whose root tests Gender? (M / F); a branch then tests Age > 30 (y / n), with leaf class counts such as P: 100, N: 150 and P: 1, N: 9; another node tests Age > 25; further subtrees omitted.]

Page 10:

Training: Continued

We stop growing when one of the following happens:
a node becomes empty, or the total height of the tree exceeds a threshold, currently set to the total number of features.
Each node of the tree keeps the number of examples belonging to each class.
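A sketch of the recursive growth loop with these two stopping rules, assuming examples are dicts with a 'label' key; for brevity it draws any feature uniformly at random rather than tracking used discrete features as in the selection sketch above:

```python
import random
from collections import Counter

class Node:
    def __init__(self, examples, depth):
        # every node keeps the number of examples of each class that reach it
        self.counts = Counter(ex["label"] for ex in examples)
        self.depth = depth
        self.children = {}

def grow(examples, depth, feature_names, max_depth):
    """Grow one random tree; max_depth is set to the total number of features."""
    node = Node(examples, depth)
    if not examples or depth >= max_depth:   # stop: empty node, or height limit reached
        return node
    feature = random.choice(feature_names)   # random split, no purity function
    by_value = {}
    for ex in examples:
        by_value.setdefault(ex[feature], []).append(ex)
    for value, subset in by_value.items():
        node.children[value] = grow(subset, depth + 1, feature_names, max_depth)
    return node

data = [{"gender": "M", "age_gt_30": "y", "label": "P"},
        {"gender": "F", "age_gt_30": "n", "label": "N"}]
root = grow(data, 0, ["gender", "age_gt_30"], max_depth=2)
print(root.counts)   # Counter({'P': 1, 'N': 1})
```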

Page 11:

Classification

Each tree outputs a membership probability p(fraud|x) = n_fraud / (n_fraud + n_normal).
If a leaf node is empty (very likely when a discrete feature is tested near the end of a path): use the parent node's probability estimate; never output 0 or NaN.
The membership probabilities from the multiple random trees are averaged to form the final output.
A loss function is required to make a decision:
0-1 loss: if p(fraud|x) > 0.5, predict fraud.
Cost-sensitive loss: if p(fraud|x) * $1000 > $90, predict fraud.
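A sketch of the per-tree estimate with the empty-leaf fallback and the averaging across trees, assuming each node stores its class counts (here a plain dict) and a link to its parent; the concrete numbers are illustrative:

```python
def node_probability(node, cls="fraud"):
    """P(cls|x) at a node: n_cls / n_total.  An empty leaf falls back to the
    nearest ancestor that holds examples, so we never output 0/0 (NaN)."""
    while node is not None:
        total = sum(node["counts"].values())
        if total > 0:
            return node["counts"].get(cls, 0) / total
        node = node["parent"]
    return 0.0  # defensive default; the root is never empty in practice

def rdt_probability(leaves, cls="fraud"):
    """Average the per-tree estimates for the leaves that example x reaches."""
    return sum(node_probability(leaf, cls) for leaf in leaves) / len(leaves)

# x lands in a populated leaf in tree 1 and in an empty leaf in tree 2.
root2 = {"counts": {"fraud": 1, "normal": 9}, "parent": None}
leaf1 = {"counts": {"fraud": 100, "normal": 150}, "parent": None}
leaf2 = {"counts": {}, "parent": root2}              # empty leaf -> use the parent's estimate
p = rdt_probability([leaf1, leaf2])                  # (0.4 + 0.1) / 2 = 0.25
print(p, "fraud" if p * 1000 > 90 else "non-fraud")  # 0.25 -> predict fraud
```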

Page 12:

Credit Card Fraud

Detect whether a transaction is a fraud or not.
There is an overhead to investigate a potential fraud, varied over {$60, $70, $80, $90}.
Loss function.

Page 13:

Result

Page 14:

Donation Dataset

Decide whom to send a charity solicitation letter to. About 5% of the examples are positive.
It costs $0.68 to send a letter.
Loss function.

Page 15:

Result

Page 16:

Independent study and implementation of random decision tree

Kai Ming Ting and Tony Liu from Monash University, Australia, on UCI datasets.
Edward Greengrass from the DOD on their own data sets: 100 to 300 features, both categorical and continuous, some features with many values, 2000 to 3000 examples, and both binary and multi-class problems (16 and 25 classes).

Page 17:

Why does the random decision tree work?

Original explanation: the error-tolerance property. If the truth is P(positive|x) > 0.5, we predict x to be positive, so estimates of P(positive|x) = 0.6 and P(positive|x) = 0.9 make no difference to the final prediction.
New discovery: the averaged posterior probability, such as P(positive|x), is a better estimate than the one produced by the single best tree.

Page 18:

Credit Card Fraud

Page 19:

Adult Dataset

Page 20:

Donation

Page 21:

Overfitting

Page 22:

Non-overfitting

Page 23:

Selectivity

Page 24:

Tolerance to data insufficiency

Page 25:

Other related applications of random decision tree

n-fold cross-validation, stream mining, and multi-class probability estimation.

Page 26:

Implementation issues

When the number of features and feature values is not astronomical, we can build the empty tree structures first and then feed the data through them in one simple scan to finalize the construction, as sketched below.
Otherwise, build each tree iteratively, just like traditional tree construction but WITHOUT any expensive purity function check.
Both ways are very efficient, since we never evaluate an expensive purity function.
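A minimal sketch of the one-scan variant, assuming the random tree structures (split features) have already been generated independently of the data; the dict-based node layout and helper names are illustrative:

```python
from collections import Counter

def route_to_leaf(tree, example):
    """Follow the pre-generated random splits down to a leaf."""
    node = tree
    while "children" in node:
        node = node["children"][example[node["feature"]]]
    return node

def fill_counts(trees, data):
    """One pass over the data: no purity function is ever evaluated, we only
    increment the class counts at the leaf each example reaches in each tree."""
    for example in data:
        for tree in trees:
            leaf = route_to_leaf(tree, example)
            leaf.setdefault("counts", Counter())[example["label"]] += 1

# One tiny pre-built structure that splits on 'gender'.
tree = {"feature": "gender", "children": {"M": {}, "F": {}}}
data = [{"gender": "M", "label": "fraud"}, {"gender": "F", "label": "normal"}]
fill_counts([tree], data)
print(tree["children"]["M"].get("counts"))  # Counter({'fraud': 1})
```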

Page 27:

On the other hand

Occam's Razor interpretation: given two hypotheses with the same loss, we should prefer the simpler one.
Yet very complicated hypotheses can be highly accurate: meta-learning, boosting (weighted voting), bagging (bootstrap sampling with replacement).
None of the purity functions really obeys Occam's Razor. Their philosophy is: simpler is better, and we hope that simpler also brings high accuracy. That is not true!

