Big Data Analytics: Evaluating Classification Performance
R. Bohn, April 2016
Some overheads from Galit Shmueli and Peter Bruce (2010)
Page 1

Big Data Analytics: Evaluating Classification Performance

April, 2016
R. Bohn

Some overheads from Galit Shmueli and Peter Bruce 2010

Win-Vector Blog: A bit on the F1 score floor
April 2, 2016, by John Mount. Categories: Mathematics, Opinion, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, Tutorials. Tags: AUC, F1, python, R, SymPy.

At Strata+Hadoop World "R Day" Tutorial, Tuesday, March 29 2016, San Jose, California we spent some time on classifier measures derived from the so-called "confusion matrix."

We repeated our usual admonition to not use "accuracy" as a project goal (business people tend to ask for it as it is the word they are most familiar with, but it usually isn't what they really want).

One reason not to use accuracy: an example where a classifier that does nothing is "more accurate" than one that actually has some utility. (slides here)

And we worked through the usual bestiary of other metrics (precision, recall, sensitivity, specificity, AUC, balanced accuracy, and many more).

Please read on to see what stood out.


Page 2

Most accurate ≠ Best!



Which is more accurate??



Page 3

Why Evaluate a classifier?

- Multiple methods are available to classify or predict
- For each method, multiple choices are available for settings
- To choose best model, need to assess each model's performance
- When you use it, how good will it be?
- How "reliable" are the results?

Page 4

Misclassification error for discrete classification problems

- Error = classifying a record as belonging to one class when it belongs to another class.
- Error rate = percent of misclassified records out of the total records in the validation data

Page 5

Examples

Page 6

Confusion Matrix

201 1's correctly classified as "1"
85 1's incorrectly classified as "0"
25 0's incorrectly classified as "1"
2689 0's correctly classified as "0"

Note asymmetry: many more 0s than 1s. Generally 1 = "interesting" or rare case.

Classification Confusion Matrix

                 Predicted Class
Actual Class     1        0
1                201      85
0                25       2689

Page 7

Naïve Rule

- Often used as benchmark: we hope to do better than that
- Exception: when goal is to identify high-value but rare outcomes, we may prefer to do worse than the naïve rule (see "lift" – later)

Naïve rule: classify all records as belonging to the most prevalent class

Page 8

Evaluation Measures for Classification
The Contingency Table / Confusion Matrix

TP, FP, FN, TN are absolute counts of true positives, false positives, false negatives and true negatives.

N = sample size
N+ = FN + TP, the number of positive examples
N− = FP + TN, the number of negative examples
O+ = TP + FP, the number of positive predictions
O− = FN + TN, the number of negative predictions

outputs \ labeling    y = +1    y = −1    Σ
f(x) = +1             TP        FP        O+
f(x) = −1             FN        TN        O−
Σ                     N+        N−        N

x = inputs, f(x) = prediction, y = true result

© Gunnar Rätsch (cBio@MSKCC), Introduction to Kernels, MLSS 2012, Santa Cruz, slide 61. Memorial Sloan-Kettering Cancer Center.
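
As a concrete illustration (added here, not part of the original slides), the counts and marginals above can be computed directly from arrays of true labels and predictions coded as +1/−1. A minimal Python sketch with hypothetical inputs:

    import numpy as np

    def contingency_counts(y_true, y_pred):
        """Absolute counts TP, FP, FN, TN for labels/predictions coded as +1 / -1."""
        tp = int(np.sum((y_pred == 1) & (y_true == 1)))
        fp = int(np.sum((y_pred == 1) & (y_true == -1)))
        fn = int(np.sum((y_pred == -1) & (y_true == 1)))
        tn = int(np.sum((y_pred == -1) & (y_true == -1)))
        return tp, fp, fn, tn

    y_true = np.array([1, 1, -1, -1, 1, -1])   # y: true labels (hypothetical)
    y_pred = np.array([1, -1, -1, 1, 1, -1])   # f(x): predictions (hypothetical)
    tp, fp, fn, tn = contingency_counts(y_true, y_pred)

    n_pos = fn + tp     # N+ : number of positive examples
    n_neg = fp + tn     # N- : number of negative examples
    o_pos = tp + fp     # O+ : number of positive predictions
    o_neg = fn + tn     # O- : number of negative predictions
    n = n_pos + n_neg   # N  : sample size
    print(tp, fp, fn, tn, n_pos, n_neg, o_pos, o_neg, n)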

Page 9

Error Rate Overall

Overall error rate = (25 + 85)/3000 = 3.67%
Accuracy = 1 − err = (201 + 2689)/3000 = 96.33%

Expand to an n x n matrix, where n = # of alternatives. If there are multiple classes, error rate is:

(sum of misclassified records)/(total records)

Classification Confusion Matrix

                 Predicted Class
Actual Class     1        0
1                201      85
0                25       2689
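
A quick Python check of this arithmetic (an added illustration; the counts are the ones in the matrix above):

    # Counts from the confusion matrix above
    tp, fn, fp, tn = 201, 85, 25, 2689
    total = tp + fn + fp + tn              # 3000 records
    error_rate = (fp + fn) / total         # (25 + 85) / 3000 = 0.0367
    accuracy = (tp + tn) / total           # (201 + 2689) / 3000 = 0.9633
    print(f"error rate = {error_rate:.2%}, accuracy = {accuracy:.2%}")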

Page 10

Cutoff for Classification

Most DM algorithms classify via a 2-step process. For each record:

1. Compute probability of belonging to class "1"
2. Compare to cutoff value, and classify accordingly

- Default cutoff value is often chosen = 0.50: if the probability is >= 0.50, classify as "1"; if < 0.50, classify as "0"
- May prefer a different cutoff value

Page 11

Page 12

Example: 24 cases classified. Estimated probabilities range from 0.004 to 0.996

Obs.#   Est. prob.   Actual      Obs.#   Est. prob.   Actual
1       0.996        1           13      0.506        0
2       0.988        1           14      0.471        0
3       0.984        1           15      0.337        0
4       0.980        1           16      0.218        1
5       0.948        1           17      0.199        0
6       0.889        1           18      0.149        0
7       0.949        1           19      0.048        0
8       0.762        0           20      0.038        0
9       0.707        1           21      0.025        0
10      0.681        0           22      0.022        0
11      0.656        1           23      0.016        0
12      0.622        0           24      0.004        0

Page 13

Set cutoff = 50%. Forecast = 13 type 1s (IDs 1-13).

(Estimated probabilities as in the table on Page 12.)

Page 14

Set cutoff = 50%. Forecast = 13 type 1s (IDs 1-13).

(Estimated probabilities and actual classes as in the table on Page 12.)

Result = 4 false positives, 1 false negative, 5/24 total errors

Page 15

Set cutoff = 70%. Forecast = 9 type 1s (IDs 1-9).

(Estimated probabilities and actual classes as in the table on Page 12.)

Result = 1 false positive, 2 false negatives (we were lucky)
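
Both cutoff results can be reproduced from the 24 probabilities and actual classes on Page 12. A short Python sketch (an added illustration, not from the deck):

    import numpy as np

    # Estimated probabilities and actual classes for the 24 cases (Page 12)
    prob = np.array([0.996, 0.988, 0.984, 0.980, 0.948, 0.889, 0.949, 0.762,
                     0.707, 0.681, 0.656, 0.622, 0.506, 0.471, 0.337, 0.218,
                     0.199, 0.149, 0.048, 0.038, 0.025, 0.022, 0.016, 0.004])
    actual = np.array([1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0,
                       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0])

    for cutoff in (0.50, 0.70):
        pred = (prob >= cutoff).astype(int)
        fp = int(np.sum((pred == 1) & (actual == 0)))
        fn = int(np.sum((pred == 0) & (actual == 1)))
        print(f"cutoff {cutoff:.2f}: {fp} false positives, {fn} false negatives, "
              f"{fp + fn}/{len(actual)} total errors")

At 0.50 this prints 4 false positives and 1 false negative; at 0.70, 1 false positive and 2 false negatives, matching the slides.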

Page 16

Which is better??

- High cutoff (70%) => more false negatives, fewer false positives
- Lower cutoff (50%) gives the opposite
- Which is better is not a statistics question; it's a business application question
- Total number of errors is usually NOT the right measure
- Which kind of error is more serious?
- Only rarely are the costs the same
- Statistical decision theory, Bayesian logic

Page 17

When One Class is More Important

- Tax fraud
- Credit default
- Response to promotional offer
- Detecting electronic network intrusion
- Predicting delayed flights

In many cases it is more important to identify members of one class

In such cases, we are willing to tolerate greater overall error, in return for better identifying the important class for further attention

Page 18

Always a tradeoff: false positives vs. false negatives
- How do we estimate these two types of errors?
- Answer: with our set-aside data

Validation data set.

Page 19

The contingency table / confusion matrix and its notation (TP, FP, FN, TN, N+, N−, O+, O−), repeated from Page 8.

Page 20

Common Performance Measures
Evaluation Measures for Classification II

Several commonly used performance measures:

Name                                    Computation
Accuracy                                ACC = (TP + TN) / N
Error rate (1 − accuracy)               ERR = (FP + FN) / N
Balanced error rate                     BER = (1/2) · (FN/(FN + TP) + FP/(FP + TN))
Weighted relative accuracy              WRACC = TP/(TP + FN) − FP/(FP + TN)
F1 score                                F1 = 2·TP / (2·TP + FP + FN)
Cross-correlation coefficient           CC = (TP·TN − FP·FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
Sensitivity / recall                    TPR = TP/N+ = TP/(TP + FN)
Specificity                             TNR = TN/N− = TN/(TN + FP)
1 − sensitivity                         FNR = FN/N+ = FN/(FN + TP)
1 − specificity                         FPR = FP/N− = FP/(FP + TN)
Positive predictive value / precision   PPV = TP/O+ = TP/(TP + FP)
False discovery rate                    FDR = FP/O+ = FP/(FP + TP)

© Gunnar Rätsch (cBio@MSKCC), Introduction to Kernels, MLSS 2012, Santa Cruz, slide 62. Memorial Sloan-Kettering Cancer Center.

None of these is a hypothesis test (for example: is p(positive) > 0?). Reason: we don't care about the "overall"; we care about prediction of individual cases.
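
For concreteness (an added sketch, not part of the original slide), most of these measures are one-line computations once TP, FP, FN, TN are known:

    def classification_measures(tp, fp, fn, tn):
        """Measures from the table above, computed from the four confusion-matrix counts."""
        n = tp + fp + fn + tn
        return {
            "ACC":   (tp + tn) / n,                              # accuracy
            "ERR":   (fp + fn) / n,                              # error rate
            "BER":   0.5 * (fn / (fn + tp) + fp / (fp + tn)),    # balanced error rate
            "WRACC": tp / (tp + fn) - fp / (fp + tn),            # weighted relative accuracy
            "F1":    2 * tp / (2 * tp + fp + fn),                # F1 score
            "TPR":   tp / (tp + fn),                             # sensitivity / recall
            "TNR":   tn / (tn + fp),                             # specificity
            "FPR":   fp / (fp + tn),                             # 1 - specificity
            "FNR":   fn / (fn + tp),                             # 1 - sensitivity
            "PPV":   tp / (tp + fp),                             # precision
            "FDR":   fp / (fp + tp),                             # false discovery rate
        }

    # Counts from the Page 6 / Page 9 example
    print(classification_measures(tp=201, fp=25, fn=85, tn=2689))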

Page 21

Alternate Accuracy Measures

If “C1” is the important class,

Sensitivity = % of actual "C1" cases correctly classified as "C1"
Specificity = % of actual "C0" cases correctly classified as "C0"

False positive rate = % of actual "C0" cases incorrectly classified as "C1"

False negative rate = % of actual "C1" cases incorrectly classified as "C0"

Page 22

Medical Terms

- In medical diagnostics, test sensitivity is the ability of a test to correctly identify those with the disease (true positive rate),
- whereas test specificity is the ability of the test to correctly identify those without the disease (true negative rate).

Page 23

Significance vs. Power (statistics)

- False positive rate (α) = P(type I error) = 1 − specificity = FP / (FP + TN) = significance level. For simple hypotheses, this is the test's probability of incorrectly rejecting the null hypothesis: the false positive rate.
- False negative rate (β) = P(type II error) = 1 − sensitivity = FN / (TP + FN)
- Power = sensitivity = 1 − β: the test's probability of correctly rejecting the null hypothesis, i.e., the complement of the false negative rate β.

Page 24

Evaluation Measures for Classification III

(Figure, left: ROC curve, true positive rate vs. false positive rate. Figure, right: precision-recall curve, positive predictive value (PPV) vs. true positive rate. Both panels compare the proposed method with firstef, eponine, and mcpromotor.)

(Obtained by varying bias and recording TPR/FPR or PPV/TPR.)

Use a bias-independent scalar evaluation measure:
Area under ROC Curve (auROC)
Area under Precision Recall Curve (auPRC)

© Gunnar Rätsch (cBio@MSKCC), Introduction to Kernels, MLSS 2012, Santa Cruz, slide 63. Memorial Sloan-Kettering Cancer Center.
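
As an added illustration (the deck itself uses XLMiner, so using scikit-learn here is an assumption of convenience), both curves and their areas can be computed from scores and labels, for example the 24 cases from Page 12:

    import numpy as np
    from sklearn.metrics import roc_curve, roc_auc_score, precision_recall_curve, auc

    # Actual classes and estimated probabilities for the 24 cases on Page 12
    actual = np.array([1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0,
                       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0])
    prob = np.array([0.996, 0.988, 0.984, 0.980, 0.948, 0.889, 0.949, 0.762,
                     0.707, 0.681, 0.656, 0.622, 0.506, 0.471, 0.337, 0.218,
                     0.199, 0.149, 0.048, 0.038, 0.025, 0.022, 0.016, 0.004])

    # Each cutoff ("bias") gives one (FPR, TPR) point; auROC summarizes the whole curve
    fpr, tpr, cutoffs = roc_curve(actual, prob)
    print("auROC =", roc_auc_score(actual, prob))

    # Precision-recall curve and its area (auPRC)
    precision, recall, _ = precision_recall_curve(actual, prob)
    print("auPRC =", auc(recall, precision))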

Page 25

ROC Curve

(ROC curve figure; axis label: false positive rate.)

Page 26

Lift and Decile Charts: Goal

Useful for assessing performance in terms of identifying the most important class

Helps evaluate, e.g.:
- How many tax records to examine
- How many loans to grant
- How many customers to mail offer to

Page 27

Lift and Decile Charts – Cont.

Compare performance of DM model to “no model, pick randomly”

Measures ability of DM model to identify the important class, relative to its average prevalence

Charts give explicit assessment of results over a large number of cutoffs

Page 28

Lift and Decile Charts: How to Use

Compare lift to “no model” baseline

In lift chart: compare step function to straight line

In decile chart compare to ratio of 1

Page 29

Lift Chart – cumulative performance

After examining (e.g.,) 10 cases (x-axis), 9 owners (y-axis) have been correctly identified

(Figure: lift chart (training data set). X-axis: # cases, 0 to 30; y-axis: cumulative, 0 to 14. Two series: cumulative ownership when sorted using predicted values, and cumulative ownership using average.)

Page 30

Decile Chart

(Figure: decile-wise lift chart (training data set). X-axis: deciles 1 through 10; y-axis: decile mean / global mean, 0 to 2.5.)

In the "most probable" (top) decile, the model is twice as likely to identify the important class (compared to average prevalence).

Page 31

Lift Charts: How to Compute

- Using the model's classifications, sort records from most likely to least likely members of the important class
- Compute lift: accumulate the correctly classified "important class" records (Y axis) and compare to the number of total records (X axis), as in the sketch below
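
A minimal Python sketch of the cumulative lift and decile-wise lift computations (an added illustration; the 24-case sample from Page 12 is reused only to make it runnable):

    import numpy as np

    def cumulative_lift(actual, prob):
        """Sort by predicted probability (descending); accumulate the important-class
        ("1") records found (y-axis) against the number of cases examined (x-axis)."""
        actual = np.asarray(actual, dtype=float)
        prob = np.asarray(prob, dtype=float)
        order = np.argsort(-prob)                  # most likely first
        hits = actual[order]
        x = np.arange(1, len(hits) + 1)            # cases examined
        y = np.cumsum(hits)                        # important-class records found
        baseline = x * hits.mean()                 # "no model, pick randomly" reference line
        return x, y, baseline

    def decile_lift(actual, prob, n_bins=10):
        """Decile-wise lift: each decile's mean response divided by the overall mean."""
        actual = np.asarray(actual, dtype=float)
        prob = np.asarray(prob, dtype=float)
        hits = actual[np.argsort(-prob)]
        return [chunk.mean() / hits.mean() for chunk in np.array_split(hits, n_bins)]

    actual = [1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    prob = [0.996, 0.988, 0.984, 0.980, 0.948, 0.889, 0.949, 0.762, 0.707, 0.681,
            0.656, 0.622, 0.506, 0.471, 0.337, 0.218, 0.199, 0.149, 0.048, 0.038,
            0.025, 0.022, 0.016, 0.004]
    x, y, baseline = cumulative_lift(actual, prob)
    print(list(y))              # cumulative count of 1's found after 1, 2, ... cases
    print(decile_lift(actual, prob))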

Page 32

Lift vs. Decile Charts

Both embody concept of “moving down” through the records, starting with the most probable

Decile chart does this in decile chunks of data. Y axis shows ratio of decile mean to overall mean.

Lift chart shows continuous cumulative results. Y axis shows number of important-class records identified.

Page 33

Asymmetric Costs

Page 34

Misclassification Costs May Differ

The cost of making a misclassification error may be higher for one class than the other(s)

Looked at another way, the benefit of making a correct classification may be higher for one class than the other(s)

Page 35

Example – Response to Promotional Offer

� “Naïve rule” (classify everyone as “0”) has error rate of 1% (seems good)

� With lower cutoff, suppose we can correctly classify eight 1’s as 1’s

It comes at the cost of misclassifying twenty 0’s as 1’s and two 0’s as 1’s.

Suppose we send an offer to 1000 people, with 1% average response rate (“1” = response, “0” = nonresponse)

Page 36

New Confusion Matrix

             Predict as 1    Predict as 0
Actual 1     8               2
Actual 0     20              970

Error rate = (2 + 20)/1000 = 2.2% (higher than naïve rate)

Page 37

Introducing Costs & Benefits

Suppose:
- Profit from a "1" is $10
- Cost of sending an offer is $1

Then:
- Under the naïve rule, all are classified as "0", so no offers are sent: no cost, no profit
- Under the DM predictions, 28 offers are sent:
  8 respond, with profit of $10 each
  20 fail to respond, cost $1 each
  972 receive nothing (no cost, no profit)
- Net profit = $60
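
The same arithmetic as a short Python check (an added illustration; all figures come from the slides above):

    # Promotion example: $10 profit per responder reached, $1 cost per offer sent
    profit_per_response, cost_per_offer = 10, 1

    # Confusion-matrix counts from Page 36
    tp, fn = 8, 2          # actual responders predicted as 1 / as 0
    fp, tn = 20, 970       # actual non-responders predicted as 1 / as 0

    offers_sent = tp + fp                                        # 28 offers
    net_profit = tp * profit_per_response - fp * cost_per_offer  # 8 * $10 - 20 * $1
    print(offers_sent, "offers, net profit $", net_profit)       # 28 offers, $60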

Page 38

Profit Matrix

             Predict as 1    Predict as 0
Actual 1     $80             $0
Actual 0     ($20)           $0

Page 39

Lift (again)

Adding costs to the mix, as above, does not change the actual classifications

Better: Use the lift curve and change the cutoff value for “1” to maximize profit

Page 40

Generalize to Cost Ratio

Sometimes actual costs and benefits are hard to estimate

- Need to express everything in terms of costs (i.e., cost of misclassification per record)
- Goal is to minimize the average cost per record

A good practical substitute for individual costs is the ratio of misclassification costs (e.g., "misclassifying fraudulent firms is 5 times worse than misclassifying solvent firms")

Page 41

Minimizing Cost Ratio

q1 = cost of misclassifying an actual “1”,

q0 = cost of misclassifying an actual “0”

Minimizing cost using only the ratio q1/q0 is identical to minimizing the average cost per record

Software* may provide option for user to specify cost ratio

*Currently unavailable in XLMiner

Page 42

Note: Opportunity costs

- As we see, best to convert everything to costs, as opposed to a mix of costs and benefits
- E.g., instead of "benefit from sale" refer to "opportunity cost of lost sale"
- Leads to same decisions, but referring only to costs allows greater applicability

Page 43

Cost Matrix (inc. opportunity costs)

Recall original confusion matrix (profit from a “1” = $10, cost of sending offer = $1):

Costs:
             Predict as 1    Predict as 0
Actual 1     $8              $20
Actual 0     $20             $0

Counts (original confusion matrix):
             Predict as 1    Predict as 0
Actual 1     8               2
Actual 0     20              970

Page 44

Adding Cost/Benefit to Lift Curve

- Sort records in descending probability of success
- For each case, record cost/benefit of actual outcome
- Also record cumulative cost/benefit
- Plot all records

X-axis is the index number (1 for 1st case, n for nth case)
Y-axis is cumulative cost/benefit
Reference line from origin to y_n (y_n = total net benefit); see the sketch below
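
A minimal Python sketch of these steps (an added illustration with hypothetical records and a hypothetical $10 / -$1 cost-benefit assignment):

    import numpy as np

    def cost_benefit_lift(prob, benefit):
        """Sort by predicted probability (descending), accumulate each case's actual
        cost/benefit, and build the reference line from the origin to y_n."""
        prob = np.asarray(prob, dtype=float)
        benefit = np.asarray(benefit, dtype=float)
        order = np.argsort(-prob)
        cum = np.cumsum(benefit[order])            # y-axis: cumulative cost/benefit
        x = np.arange(1, len(cum) + 1)             # x-axis: case index 1..n
        reference = x * cum[-1] / len(cum)         # straight line from origin to y_n
        return x, cum, reference

    # Hypothetical: benefit of $10 for an actual "1", cost of $1 for an actual "0"
    actual = np.array([1, 0, 1, 1, 0, 0, 1, 0])
    prob = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2])
    benefit = np.where(actual == 1, 10.0, -1.0)
    x, cum, reference = cost_benefit_lift(prob, benefit)
    print(list(cum))
    print(list(reference))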

Page 45

Multiple Classes

- Theoretically, there are m(m−1) misclassification costs, since any case could be misclassified in m−1 ways
- In practice, often too many to work with
- In a decision-making context, though, such complexity rarely arises: one class is usually of primary interest

For m classes, confusion matrix has m rows and m columns

Page 46

Negative slope to reference curve

Page 47

Lift Curve May Go Negative

If total net benefit from all cases is negative, reference line will have negative slope

Nonetheless, goal is still to use cutoff to select the point where net benefit is at a maximum

Page 48

Oversampling and Asymmetric Costs

Caution: most real DM problems have asymmetric costs!

Page 49

Rare Cases

- Responder to mailing
- Someone who commits fraud
- Debt defaulter

- Often we oversample rare cases to give the model more information to work with
- Typically use 50% "1" and 50% "0" for training

Asymmetric costs/benefits typically go hand in hand with presence of rare but important class

Page 50

Example

The following graphs show optimal classification under three scenarios:
- assuming equal costs of misclassification
- assuming that misclassifying "o" is five times the cost of misclassifying "x"
- an oversampling scheme allowing DM methods to incorporate asymmetric costs

Page 51

Classification: equal costs

Page 52

Classification: Unequal costs

Page 53

Oversampling Scheme

Oversample "o" to appropriately weight misclassification costs

Page 54

An Oversampling Procedure

1. Separate the responders (rare) from non-responders
2. Randomly assign half the responders to the training sample, plus an equal number of non-responders
3. Remaining responders go to the validation sample
4. Add non-responders to validation data, to maintain the original ratio of responders to non-responders
5. Randomly take a test set (if needed) from the validation data

(A code sketch of steps 1-4 follows.)
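
A minimal sketch of steps 1-4, assuming the response is a 0/1 numpy array y; step 5, carving a test set out of the validation data, is omitted. This is an added illustration, not code from the deck:

    import numpy as np

    def oversampled_split(y, rng):
        """Return (training indices, validation indices): training is ~50/50
        responders/non-responders; validation keeps the original class ratio."""
        y = np.asarray(y)
        responders = rng.permutation(np.flatnonzero(y == 1))       # step 1
        nonresponders = rng.permutation(np.flatnonzero(y == 0))

        half = len(responders) // 2
        train_idx = np.concatenate([responders[:half],              # step 2: half the responders
                                    nonresponders[:half]])          # plus an equal number of 0's

        val_resp = responders[half:]                                # step 3: remaining responders
        ratio = len(nonresponders) / max(len(responders), 1)        # original 0 : 1 ratio
        n_val_nonresp = int(round(len(val_resp) * ratio))           # step 4: keep that ratio
        val_idx = np.concatenate([val_resp,
                                  nonresponders[half:half + n_val_nonresp]])
        return train_idx, val_idx

    rng = np.random.default_rng(1)
    y = (rng.random(1000) < 0.05).astype(int)       # hypothetical data, ~5% responders
    train_idx, val_idx = oversampled_split(y, rng)
    print(y[train_idx].mean(), y[val_idx].mean())   # ~0.5 in training, ~0.05 in validation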

Page 55

Classification Using Triage

- Instead of classifying as C1 or C0, we classify as:
  C1
  C0
  Can't say

The third category might receive special human review

Take into account a gray area in making classification decisions

Page 56

Evaluating Predictive Performance

Page 57

Measuring Predictive error

- Not the same as "goodness-of-fit"
- We want to know how well the model predicts new data, not how well it fits the data it was trained with
- Key component of most measures is the difference between actual y and predicted y ("error")

Page 58

Some Measures of Error

MAE or MAD: Mean absolute error (deviation)
Gives an idea of the magnitude of errors

Average error
Gives an idea of systematic over- or under-prediction

MAPE: Mean absolute percentage error

RMSE (root mean squared error): Square the errors, find their average, take the square root

Total SSE: Total sum of squared errors
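
A small Python sketch of these measures (an added illustration; the actual and predicted values are hypothetical):

    import numpy as np

    def prediction_error_measures(y_actual, y_pred):
        """The error measures listed above, computed from (actual, predicted) pairs."""
        y_actual = np.asarray(y_actual, dtype=float)
        y_pred = np.asarray(y_pred, dtype=float)
        err = y_actual - y_pred
        return {
            "MAE":           np.mean(np.abs(err)),                   # magnitude of errors
            "Average error": np.mean(err),                           # systematic over/under-prediction
            "MAPE":          np.mean(np.abs(err / y_actual)) * 100,  # requires nonzero actuals
            "RMSE":          np.sqrt(np.mean(err ** 2)),
            "Total SSE":     np.sum(err ** 2),
        }

    print(prediction_error_measures([120, 95, 240, 60], [110, 100, 250, 75]))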

Page 59

Lift Chart for Predictive Error

Similar to lift chart for classification, except…

Y axis is cumulative value of numeric target variable (e.g., revenue), instead of cumulative count of “responses”

Page 60

Lift chart example – spending

Page 61

Summary

- Evaluation metrics are important for comparing across DM models, for choosing the right configuration of a specific DM model, and for comparing to the baseline
- Major metrics: confusion matrix, error rate, predictive error
- Other metrics when one class is more important or when costs are asymmetric
- When the important class is rare, use oversampling
- In all cases, metrics are computed from validation data
