Big Data Analytics: Evaluating Classification Performance
April 2016, R. Bohn
Some overheads from Galit Shmueli and Peter Bruce, 2010
Win-Vector Blog: "A bit on the F1 score floor"
April 2, 2016, John Mount. Filed under: Mathematics, Opinion, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, Tutorials. Tags: AUC, F1, python, R, SymPy.

At the Strata+Hadoop World "R Day" tutorial (Tuesday, March 29, 2016, San Jose, California) we spent some time on classifier measures derived from the so-called "confusion matrix."

We repeated our usual admonition not to use "accuracy" as a project goal (business people tend to ask for it, as it is the word they are most familiar with, but it usually isn't what they really want).

One reason not to use accuracy: an example where a classifier that does nothing is "more accurate" than one that actually has some utility. (slides here)

And we worked through the usual bestiary of other metrics (precision, recall, sensitivity, specificity, AUC, balanced accuracy, and many more).

Please read on to see what stood out.
Most accurate ≠ Best!
Which is more accurate??
Why Evaluate a Classifier?
• Multiple methods are available to classify or predict
• For each method, multiple choices are available for settings
• To choose the best model, we need to assess each model's performance
• When you use it, how good will it be?
• How "reliable" are the results?
Misclassification Error for Discrete Classification Problems
• Error = classifying a record as belonging to one class when it belongs to another class
• Error rate = percent of misclassified records out of the total number of records in the validation data
Examples
Confusion Matrix

Classification confusion matrix:

                    Predicted class
Actual class          1        0
     1              201       85
     0               25     2689

• 201 1's correctly classified as "1"
• 85 1's incorrectly classified as "0"
• 25 0's incorrectly classified as "1"
• 2689 0's correctly classified as "0"

Note the asymmetry: there are many more 0's than 1's. Generally, 1 = the "interesting" or rare case.
Naïve Rule

Naïve rule: classify all records as belonging to the most prevalent class.
• Often used as a benchmark: we hope to do better than that
• Exception: when the goal is to identify high-value but rare outcomes, we may prefer to do worse than the naïve rule (see "lift," later)
Evaluation Measures for Classification: The Contingency Table / Confusion Matrix

TP, FP, FN, TN are absolute counts of true positives, false positives, false negatives, and true negatives.

N = sample size
N+ = FN + TP, the number of positive examples
N− = FP + TN, the number of negative examples
O+ = TP + FP, the number of positive predictions
O− = FN + TN, the number of negative predictions

outputs \ labels     y = +1    y = −1     Σ
f(x) = +1              TP        FP       O+
f(x) = −1              FN        TN       O−
Σ                      N+        N−       N

x = inputs; f(x) = prediction; y = true result

© Gunnar Rätsch (cBio@MSKCC), Memorial Sloan Kettering Cancer Center, Introduction to Kernels @ MLSS 2012, Santa Cruz
Error Rate: Overall

Classification confusion matrix:

                    Predicted class
Actual class          1        0
     1              201       85
     0               25     2689

Overall error rate = (25 + 85) / 3000 = 3.67%
Accuracy = 1 − err = (201 + 2689) / 3000 = 96.33%

This expands to an n × n matrix, where n = the number of alternative classes. With multiple classes:
Error rate = (sum of misclassified records) / (total records)
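To make the arithmetic concrete, here is a minimal Python sketch; the counts are the ones from the example above.

```python
# Confusion-matrix counts from the example above
tp, fn = 201, 85    # actual 1's correctly / incorrectly classified
fp, tn = 25, 2689   # actual 0's incorrectly / correctly classified

n = tp + fn + fp + tn          # 3000 records in total
error_rate = (fp + fn) / n     # (25 + 85) / 3000   = 0.0367
accuracy = (tp + tn) / n       # (201 + 2689) / 3000 = 0.9633

print(f"error rate = {error_rate:.2%}, accuracy = {accuracy:.2%}")
```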
Cutoff for Classification

Most DM algorithms classify via a two-step process. For each record:
1. Compute the probability of belonging to class "1"
2. Compare it to the cutoff value, and classify accordingly

• The default cutoff value is often 0.50: if p >= 0.50, classify as "1"; if p < 0.50, classify as "0"
• We may prefer a different cutoff value
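A minimal sketch of this two-step rule in Python; the probabilities below are hypothetical stand-ins for any model's class-1 estimates.

```python
def classify(probs, cutoff=0.50):
    """Step 2: label a record 1 if its estimated P(class = 1) meets the cutoff."""
    return [1 if p >= cutoff else 0 for p in probs]

probs = [0.91, 0.62, 0.48, 0.07]     # hypothetical step-1 outputs
print(classify(probs))               # [1, 1, 0, 0]
print(classify(probs, cutoff=0.70))  # [1, 0, 0, 0]
```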
Example: 24 cases classified. Estimated probabilities range from 0.004 to 0.996.
Obs. #   Est. prob.   Actual       Obs. #   Est. prob.   Actual
   1        0.996        1            13       0.506        0
   2        0.988        1            14       0.471        0
   3        0.984        1            15       0.337        0
   4        0.980        1            16       0.218        1
   5        0.948        1            17       0.199        0
   6        0.889        1            18       0.149        0
   7        0.949        1            19       0.048        0
   8        0.762        0            20       0.038        0
   9        0.707        1            21       0.025        0
  10        0.681        0            22       0.022        0
  11        0.656        1            23       0.016        0
  12        0.622        0            24       0.004        0
Set cutoff = 0.50. Forecast = 13 type-1's (obs. 1-13; see the table above).

Result: 4 false positives (obs. 8, 10, 12, 13), 1 false negative (obs. 16); 5/24 total errors.
Set cutoff = 0.70. Forecast = 9 type-1's (obs. 1-9; same table as above).

Result: 1 false positive (obs. 8), 2 false negatives (obs. 11 and 16). (We were lucky.)
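A small Python sketch reproduces both error counts; the (probability, actual) pairs are transcribed from the 24-case table above.

```python
# (estimated probability, actual class) pairs from the 24-case table
cases = [(0.996, 1), (0.988, 1), (0.984, 1), (0.980, 1), (0.948, 1), (0.889, 1),
         (0.949, 1), (0.762, 0), (0.707, 1), (0.681, 0), (0.656, 1), (0.622, 0),
         (0.506, 0), (0.471, 0), (0.337, 0), (0.218, 1), (0.199, 0), (0.149, 0),
         (0.048, 0), (0.038, 0), (0.025, 0), (0.022, 0), (0.016, 0), (0.004, 0)]

for cutoff in (0.50, 0.70):
    fp = sum(1 for p, y in cases if p >= cutoff and y == 0)  # false positives
    fn = sum(1 for p, y in cases if p < cutoff and y == 1)   # false negatives
    print(f"cutoff {cutoff}: {fp} FP, {fn} FN, {fp + fn}/24 total errors")
# cutoff 0.5: 4 FP, 1 FN, 5/24 total errors
# cutoff 0.7: 1 FP, 2 FN, 3/24 total errors
```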
Which is better??
• The high cutoff (0.70) gives more false negatives, fewer false positives
• The lower cutoff (0.50) gives the opposite
• Which is better is not a statistics question; it's a business application question
• The total number of errors is usually NOT the right criterion
• Which kind of error is more serious? Only rarely are the costs the same
• Statistical decision theory, Bayesian logic
When One Class Is More Important

In many cases it is more important to identify members of one class:
• Tax fraud
• Credit default
• Response to a promotional offer
• Detecting electronic network intrusion
• Predicting delayed flights

In such cases, we are willing to tolerate greater overall error in return for better identification of the important class for further attention.
There is always a tradeoff: false positives vs. false negatives.
• How do we estimate these two types of errors?
• Answer: with our set-aside data, the validation data set.
Common Performance Measures (Evaluation Measures for Classification II)

Several commonly used performance measures:

Name                                   Computation
Accuracy                               ACC = (TP + TN) / N
Error rate (1 − accuracy)              ERR = (FP + FN) / N
Balanced error rate                    BER = (1/2) [FN/(FN + TP) + FP/(FP + TN)]
Weighted relative accuracy             WRACC = TP/(TP + FN) − FP/(FP + TN)
F1 score                               F1 = 2·TP / (2·TP + FP + FN)
Cross-correlation coefficient          CC = (TP·TN − FP·FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))
Sensitivity / recall                   TPR = TP/N+ = TP/(TP + FN)
Specificity                            TNR = TN/N− = TN/(TN + FP)
1 − sensitivity                        FNR = FN/N+ = FN/(FN + TP)
1 − specificity                        FPR = FP/N− = FP/(FP + TN)
Positive predictive value / precision  PPV = TP/O+ = TP/(TP + FP)
False discovery rate                   FDR = FP/O+ = FP/(FP + TP)

© Gunnar Rätsch (cBio@MSKCC), Memorial Sloan Kettering Cancer Center, Introduction to Kernels @ MLSS 2012, Santa Cruz
None of these is a hypothesis test (for example, "is p(positive) > 0?"). Reason: we don't care about the "overall"; we care about prediction of individual cases.
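A minimal sketch of these formulas as plain Python functions of the four confusion-matrix counts; the function and key names are mine, not from any particular library.

```python
import math

def classification_metrics(tp, fp, fn, tn):
    """Common performance measures computed from confusion-matrix counts."""
    n = tp + fp + fn + tn
    return {
        "accuracy":    (tp + tn) / n,
        "error_rate":  (fp + fn) / n,
        "balanced_error_rate": 0.5 * (fn / (fn + tp) + fp / (fp + tn)),
        "f1":          2 * tp / (2 * tp + fp + fn),
        "recall":      tp / (tp + fn),    # sensitivity, TPR
        "specificity": tn / (tn + fp),    # TNR
        "precision":   tp / (tp + fp),    # positive predictive value
        "fdr":         fp / (fp + tp),    # false discovery rate
        "cc": (tp * tn - fp * fn) / math.sqrt(
            (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }

# The running example: 201 / 85 / 25 / 2689
print(classification_metrics(tp=201, fp=25, fn=85, tn=2689))
```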
Alternate Accuracy Measures

If "C1" is the important class:
• Sensitivity = % of the "C1" class correctly classified
• Specificity = % of the "C0" class correctly classified
• False positive rate = % of predicted "C1"s that were not "C1"s
• False negative rate = % of predicted "C0"s that were not "C0"s

(Careful: defined this way, the last two are rates among predictions, like the false discovery rate above, not the FPR = FP/N− and FNR = FN/N+ rates in the earlier table.)
Medical Terms

In medical diagnostics, test sensitivity is the ability of a test to correctly identify those with the disease (the true positive rate), whereas test specificity is the ability of the test to correctly identify those without the disease (the true negative rate). [2]
Significance vs. Power (Statistics)
• False positive rate (α) = P(type I error) = 1 − specificity = FP / (FP + TN) = the significance level. For simple hypotheses, this is the test's probability of incorrectly rejecting the null hypothesis.
• False negative rate (β) = P(type II error) = 1 − sensitivity = FN / (TP + FN)
• Power = sensitivity = 1 − β: the test's probability of correctly rejecting the null hypothesis; the complement of the false negative rate β.
Evaluation Measures for Classification III

[Figure: left, ROC curve (true positive rate vs. false positive rate); right, precision-recall curve (positive predictive value vs. true positive rate). Each panel compares the proposed method against FirstEF, Eponine, and McPromoter.]

(The curves are obtained by varying the bias/cutoff and recording TPR/FPR or PPV/TPR.)

Use a bias-independent scalar evaluation measure:
• Area under the ROC curve (auROC)
• Area under the precision-recall curve (auPRC)

© Gunnar Rätsch (cBio@MSKCC), Memorial Sloan Kettering Cancer Center, Introduction to Kernels @ MLSS 2012, Santa Cruz
ROC Curve

[Figure: ROC curve; x-axis: false positive rate, y-axis: true positive rate.]
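A sketch of how such a curve (and auROC) can be computed, assuming (probability, actual) pairs like the 24-case data above; auROC is approximated with the trapezoidal rule.

```python
def roc_points(cases):
    """Sweep the cutoff over every observed probability; record (FPR, TPR) pairs."""
    pos = sum(y for _, y in cases)          # actual positives
    neg = len(cases) - pos                  # actual negatives
    cutoffs = sorted({p for p, _ in cases} | {0.0, 1.1}, reverse=True)
    points = []
    for c in cutoffs:
        tp = sum(1 for p, y in cases if p >= c and y == 1)
        fp = sum(1 for p, y in cases if p >= c and y == 0)
        points.append((fp / neg, tp / pos))
    return points                           # runs from (0, 0) up to (1, 1)

def auc(points):
    """Trapezoidal area under a curve given as (x, y) points."""
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# e.g. auc(roc_points(cases)) with the 24-case data gives the auROC
```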
Lift and Decile Charts: Goal

Useful for assessing performance in terms of identifying the most important class.

Helps evaluate, e.g.:
• How many tax records to examine
• How many loans to grant
• How many customers to mail an offer to
Lift and Decile Charts (cont.)

• Compare the performance of the DM model to "no model, pick randomly"
• Measure the ability of the DM model to identify the important class, relative to its average prevalence
• The charts give an explicit assessment of results over a large number of cutoffs
Lift and Decile Charts: How to Use

Compare lift to the "no model" baseline:
• In a lift chart, compare the step function to the straight line
• In a decile chart, compare to a ratio of 1
Lift Chart: Cumulative Performance

[Figure: lift chart (training data set). X-axis: cumulative # of cases (0-30); y-axis: cumulative ownership (0-14). Two series: cumulative ownership when sorted using predicted values, and cumulative ownership using the average.]

After examining, e.g., 10 cases (x-axis), 9 owners (y-axis) have been correctly identified.
Decile Chart

[Figure: decile-wise lift chart (training data set). X-axis: deciles 1-10; y-axis: decile mean / global mean (0-2.5).]

In the "most probable" (top) decile, the model is twice as likely to identify the important class, compared to average prevalence.
Lift Charts: How to Compute
• Using the model's classifications, sort the records from most likely to least likely members of the important class
• Compute lift: accumulate the correctly classified "important class" records (y-axis) and compare to the number of total records (x-axis)
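A minimal sketch of that computation, plus the decile ratio discussed below; `cases` is again a list of hypothetical (probability, actual) pairs.

```python
def cumulative_lift(cases):
    """Sort by predicted probability, then accumulate important-class hits."""
    ranked = sorted(cases, key=lambda c: c[0], reverse=True)
    curve, running = [], 0
    for _, actual in ranked:
        running += actual
        curve.append(running)  # y-value after examining one more case
    return curve               # "no model" baseline at case k: k * curve[-1] / len(curve)

def decile_means(cases):
    """Ratio of each decile's hit rate to the overall hit rate.
    For simplicity, assumes len(cases) is divisible by 10."""
    ranked = sorted(cases, key=lambda c: c[0], reverse=True)
    n, size = len(ranked), len(ranked) // 10
    overall = sum(y for _, y in ranked) / n
    return [sum(y for _, y in ranked[i * size:(i + 1) * size]) / size / overall
            for i in range(10)]
```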
Lift vs. Decile Charts

Both embody the concept of "moving down" through the records, starting with the most probable.
• The decile chart does this in decile chunks of data; its y-axis shows the ratio of the decile mean to the overall mean
• The lift chart shows continuous cumulative results; its y-axis shows the number of important-class records identified
Asymmetric Costs
Misclassification Costs May Differ
The cost of making a misclassification error may be higher for one class than the other(s)
Looked at another way, the benefit of making a correct classification may be higher for one class than the other(s)
Example: Response to Promotional Offer

Suppose we send an offer to 1000 people, with a 1% average response rate ("1" = response, "0" = nonresponse).
• The "naïve rule" (classify everyone as "0") has an error rate of 1% (seems good)
• With a lower cutoff, suppose we can correctly classify eight 1's as 1's
• This comes at the cost of misclassifying twenty 0's as 1's and two 1's as 0's
New Confusion Matrix

                Predict as 1    Predict as 0
Actual 1              8               2
Actual 0             20             970

Error rate = (2 + 20) / 1000 = 2.2% (higher than the naïve rate)
Introducing Costs & Benefits

Suppose:
• Profit from a "1" is $10
• Cost of sending an offer is $1

Then:
• Under the naïve rule, all are classified as "0", so no offers are sent: no cost, no profit
• Under the DM predictions, 28 offers are sent:
  8 respond, with a profit of $10 each
  20 fail to respond, at a cost of $1 each
  972 receive nothing (no cost, no profit)
• Net profit = 8 × $10 − 20 × $1 = $60
Profit Matrix

                Predict as 1    Predict as 0
Actual 1            $80              $0
Actual 0           ($20)             $0
Lift (Again)

Adding costs to the mix, as above, does not change the actual classifications.

Better: use the lift curve and change the cutoff value for "1" to maximize profit.
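A minimal sketch of that cutoff search, assuming (probability, actual) pairs as before and the accounting of the profit matrix above ($10 per responder offered, −$1 per non-responder offered):

```python
def net_profit(cases, cutoff, profit_per_response=10.0, cost_per_offer=1.0):
    """Per the profit matrix: each responder offered earns $10,
    each non-responder offered costs $1, everyone else contributes $0."""
    offered = [y for p, y in cases if p >= cutoff]
    responders = sum(offered)
    return (profit_per_response * responders
            - cost_per_offer * (len(offered) - responders))

def best_cutoff(cases):
    """Try every observed probability as a cutoff; keep the most profitable one."""
    return max(((c, net_profit(cases, c)) for c in sorted({p for p, _ in cases})),
               key=lambda pair: pair[1])
```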
Generalize to a Cost Ratio

Sometimes actual costs and benefits are hard to estimate:
• We need to express everything in terms of costs (i.e., the cost of misclassification per record)
• The goal is to minimize the average cost per record

A good practical substitute for individual costs is the ratio of misclassification costs (e.g., "misclassifying fraudulent firms is 5 times worse than misclassifying solvent firms").
Minimizing the Cost Ratio

q1 = cost of misclassifying an actual "1"
q0 = cost of misclassifying an actual "0"

Minimizing cost using the ratio q1/q0 is equivalent to minimizing the average cost per record.

Software* may provide an option for the user to specify a cost ratio.

*Currently unavailable in XLMiner
Note: Opportunity Costs
• As we have seen, it is best to convert everything to costs, as opposed to a mix of costs and benefits
• E.g., instead of "benefit from sale," refer to the "opportunity cost of a lost sale"
• This leads to the same decisions, but referring only to costs allows greater applicability
Cost Matrix (incl. opportunity costs)

Recall the original confusion matrix (profit from a "1" = $10, cost of sending an offer = $1):

                Predict as 1    Predict as 0
Actual 1              8               2
Actual 0             20             970

The corresponding cost matrix:

                Predict as 1    Predict as 0
Actual 1             $8             $20
Actual 0            $20              $0

(Each cell is a total: the eight offers sent to responders cost 8 × $1 = $8, the two missed responders represent 2 × $10 = $20 in forgone profit, and the twenty wasted offers cost 20 × $1 = $20.)
Adding Cost/Benefit to the Lift Curve
• Sort the records in descending probability of success
• For each case, record the cost/benefit of the actual outcome
• Also record the cumulative cost/benefit
• Plot all records:
  The x-axis is the index number (1 for the 1st case, n for the nth case)
  The y-axis is the cumulative cost/benefit
  Draw a reference line from the origin to y_n (y_n = total net benefit)
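A minimal sketch of this construction, assuming (probability, cost/benefit) pairs where the second element is the realized cost or benefit of each case's actual outcome:

```python
def cost_benefit_curve(cases):
    """cases: (predicted probability, realized cost/benefit of actual outcome).
    Returns the cumulative curve and the straight reference line."""
    ranked = sorted(cases, key=lambda c: c[0], reverse=True)
    curve, total = [], 0.0
    for _, benefit in ranked:
        total += benefit
        curve.append(total)
    n = len(curve)
    reference = [curve[-1] * (i + 1) / n for i in range(n)]  # origin to y_n
    return curve, reference

# The profit-maximizing cutoff corresponds to the peak of `curve`.
```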
Multiple Classes

For m classes, the confusion matrix has m rows and m columns.
• Theoretically, there are m(m − 1) misclassification costs, since any case could be misclassified in m − 1 ways
• In practice, that is often too many to work with
• In a decision-making context, though, such complexity rarely arises: one class is usually of primary interest
Negative slope to reference curve
Lift Curve May Go Negative

If the total net benefit from all cases is negative, the reference line will have a negative slope.

Nonetheless, the goal is still to use the cutoff to select the point where net benefit is at its maximum.
Oversampling and Asymmetric Costs

Caution: most real DM problems have asymmetric costs!
Rare Cases

Asymmetric costs/benefits typically go hand in hand with the presence of a rare but important class:
• Responder to a mailing
• Someone who commits fraud
• Debt defaulter

• Often we oversample rare cases to give the model more information to work with
• Typically we use 50% "1" and 50% "0" for training
Example

The following graphs show optimal classification under three scenarios:
• assuming equal costs of misclassification
• assuming that misclassifying "o" costs five times as much as misclassifying "x"
• an oversampling scheme allowing DM methods to incorporate asymmetric costs
Classification: equal costs
Classification: Unequal costs
Oversampling Scheme

Oversample "o" to appropriately weight misclassification costs.
An Oversampling Procedure
1. Separate the responders (rare) from the non-responders
2. Randomly assign half the responders to the training sample, plus an equal number of non-responders
3. The remaining responders go to the validation sample
4. Add non-responders to the validation data, to maintain the original ratio of responders to non-responders
5. Randomly take a test set (if needed) from the validation data
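A minimal sketch of steps 1-5, assuming `records` is a list of (features, label) pairs with label 1 for the rare responders; the helper name is mine.

```python
import random

def oversample_split(records, seed=0):
    """Steps 1-5 above: a 50/50 training sample; validation keeps the original ratio."""
    rng = random.Random(seed)
    responders = [r for r in records if r[1] == 1]       # step 1
    nonresponders = [r for r in records if r[1] == 0]
    rng.shuffle(responders)
    rng.shuffle(nonresponders)

    half = len(responders) // 2
    train = responders[:half] + nonresponders[:half]     # step 2: 50% "1", 50% "0"

    valid_resp = responders[half:]                       # step 3
    ratio = len(nonresponders) / len(responders)         # original 0-to-1 ratio
    k = round(ratio * len(valid_resp))                   # step 4
    valid = valid_resp + nonresponders[half:half + k]
    return train, valid    # step 5: carve a test set out of `valid` if needed
```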
Classification Using Triage

Instead of classifying as C1 or C0, we classify as:
• C1
• C0
• Can't say

The third category might receive special human review. This takes a gray area into account when making classification decisions.
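A minimal sketch of such a triage rule; the 0.35/0.65 gray-zone bounds are illustrative assumptions, not from the text.

```python
def triage(prob, low=0.35, high=0.65):
    """Return 'C1', 'C0', or "can't say" for records in the gray zone."""
    if prob >= high:
        return "C1"
    if prob <= low:
        return "C0"
    return "can't say"   # route these to special human review

print([triage(p) for p in (0.90, 0.50, 0.10)])  # ['C1', "can't say", 'C0']
```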
Evaluating Predictive Performance
Measuring Predictive Error
• Not the same as "goodness-of-fit"
• We want to know how well the model predicts new data, not how well it fits the data it was trained with
• The key component of most measures is the difference between the actual y and the predicted y (the "error")
Some Measures of Error
• MAE or MAD (mean absolute error / deviation): gives an idea of the magnitude of errors
• Average error: gives an idea of systematic over- or under-prediction
• MAPE: mean absolute percentage error
• RMSE (root mean squared error): square the errors, find their average, take the square root
• Total SSE: total sum of squared errors
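For reference, the standard formulas behind these names, writing e_i = y_i − ŷ_i for the error on record i of n validation records:

```latex
\begin{aligned}
\text{MAE} &= \frac{1}{n}\sum_{i=1}^{n} |e_i|
  & \text{Average error} &= \frac{1}{n}\sum_{i=1}^{n} e_i \\
\text{MAPE} &= \frac{100\%}{n}\sum_{i=1}^{n} \left|\frac{e_i}{y_i}\right|
  & \text{RMSE} &= \sqrt{\frac{1}{n}\sum_{i=1}^{n} e_i^2} \\
\text{SSE} &= \sum_{i=1}^{n} e_i^2 & &
\end{aligned}
```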
Lift Chart for Predictive Error

Similar to the lift chart for classification, except that the y-axis is the cumulative value of the numeric target variable (e.g., revenue), instead of the cumulative count of "responses".
Lift chart example: spending
Summary
• Evaluation metrics are important for comparing across DM models, for choosing the right configuration of a specific DM model, and for comparing to the baseline
• Major metrics: confusion matrix, error rate, predictive error
• Other metrics apply when one class is more important or when costs are asymmetric
• When the important class is rare, use oversampling
• In all cases, metrics are computed from validation data