Big Data Analytics: Evaluating Classification Performance
April 2016, R. Bohn
Some overheads from Galit Shmueli and Peter Bruce, 2010
Win-Vector Blog: "A bit on the F1 score floor"
April 2, 2016, John Mount. Filed under: Mathematics, Opinion, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, Tutorials. Tags: AUC, F1, python, R, SymPy.

At the Strata+Hadoop World "R Day" tutorial (Tuesday, March 29, 2016, San Jose, California) we spent some time on classifier measures derived from the so-called "confusion matrix."

We repeated our usual admonition not to use "accuracy" as a project goal (business people tend to ask for it, as it is the word they are most familiar with, but it usually isn't what they really want).

One reason not to use accuracy: an example where a classifier that does nothing is "more accurate" than one that actually has some utility. (slides here)

And we worked through the usual bestiary of other metrics (precision, recall, sensitivity, specificity, AUC, balanced accuracy, and many more).

Please read on to see what stood out.
Most accurate ≠ Best!
Which is more accurate??
Why Evaluate a Classifier?
• Multiple methods are available to classify or predict
• For each method, multiple choices are available for settings
• To choose the best model, we need to assess each model's performance
• When you use it, how good will it be?
• How "reliable" are the results?
Misclassification Error for Discrete Classification Problems
• Error = classifying a record as belonging to one class when it belongs to another class
• Error rate = percent of misclassified records out of the total number of records in the validation data
Examples
Confusion Matrix

Classification confusion matrix:

                    Predicted class
Actual class          1        0
     1              201       85
     0               25     2689

• 201 1's correctly classified as "1"
• 85 1's incorrectly classified as "0"
• 25 0's incorrectly classified as "1"
• 2689 0's correctly classified as "0"

Note the asymmetry: there are many more 0's than 1's. Generally, 1 = the "interesting" or rare case.
Naïve Rule

Naïve rule: classify all records as belonging to the most prevalent class.
• Often used as a benchmark: we hope to do better than that
• Exception: when the goal is to identify high-value but rare outcomes, we may prefer to do worse than the naïve rule (see "lift," later)
Evaluation Measures for Classification: The Contingency Table / Confusion Matrix

TP, FP, FN, TN are absolute counts of true positives, false positives, false negatives, and true negatives.

N = sample size
N+ = FN + TP, the number of positive examples
N− = FP + TN, the number of negative examples
O+ = TP + FP, the number of positive predictions
O− = FN + TN, the number of negative predictions

outputs \ labels     y = +1    y = −1     Σ
f(x) = +1              TP        FP       O+
f(x) = −1              FN        TN       O−
Σ                      N+        N−       N

x = inputs; f(x) = prediction; y = true result

© Gunnar Rätsch (cBio@MSKCC), Memorial Sloan Kettering Cancer Center, Introduction to Kernels @ MLSS 2012, Santa Cruz
Error Rate: Overall

Classification confusion matrix:

                    Predicted class
Actual class          1        0
     1              201       85
     0               25     2689

Overall error rate = (25 + 85) / 3000 = 3.67%
Accuracy = 1 − err = (201 + 2689) / 3000 = 96.33%

This expands to an n × n matrix, where n = the number of alternative classes. With multiple classes:
Error rate = (sum of misclassified records) / (total records)
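To make the arithmetic concrete, here is a minimal Python sketch; the counts are the ones from the example above.

```python
# Confusion-matrix counts from the example above
tp, fn = 201, 85    # actual 1's correctly / incorrectly classified
fp, tn = 25, 2689   # actual 0's incorrectly / correctly classified

n = tp + fn + fp + tn          # 3000 records in total
error_rate = (fp + fn) / n     # (25 + 85) / 3000   = 0.0367
accuracy = (tp + tn) / n       # (201 + 2689) / 3000 = 0.9633

print(f"error rate = {error_rate:.2%}, accuracy = {accuracy:.2%}")
```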
Cutoff for Classification

Most DM algorithms classify via a two-step process. For each record:
1. Compute the probability of belonging to class "1"
2. Compare it to the cutoff value, and classify accordingly

• The default cutoff value is often 0.50: if p >= 0.50, classify as "1"; if p < 0.50, classify as "0"
• We may prefer a different cutoff value
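A minimal sketch of this two-step rule in Python; the probabilities below are hypothetical stand-ins for any model's class-1 estimates.

```python
def classify(probs, cutoff=0.50):
    """Step 2: label a record 1 if its estimated P(class = 1) meets the cutoff."""
    return [1 if p >= cutoff else 0 for p in probs]

probs = [0.91, 0.62, 0.48, 0.07]     # hypothetical step-1 outputs
print(classify(probs))               # [1, 1, 0, 0]
print(classify(probs, cutoff=0.70))  # [1, 0, 0, 0]
```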
Example: 24 cases classified. Estimated probabilities range from 0.004 to 0.996.
Obs. #   Est. prob.   Actual       Obs. #   Est. prob.   Actual
   1        0.996        1            13       0.506        0
   2        0.988        1            14       0.471        0
   3        0.984        1            15       0.337        0
   4        0.980        1            16       0.218        1
   5        0.948        1            17       0.199        0
   6        0.889        1            18       0.149        0
   7        0.949        1            19       0.048        0
   8        0.762        0            20       0.038        0
   9        0.707        1            21       0.025        0
  10        0.681        0            22       0.022        0
  11        0.656        1            23       0.016        0
  12        0.622        0            24       0.004        0
Set cutoff = 0.50. Forecast = 13 type-1's (obs. 1-13; see the table above).

Result: 4 false positives (obs. 8, 10, 12, 13), 1 false negative (obs. 16); 5/24 total errors.
Set cutoff = 0.70. Forecast = 9 type-1's (obs. 1-9; same table as above).

Result: 1 false positive (obs. 8), 2 false negatives (obs. 11 and 16). (We were lucky.)
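A small Python sketch reproduces both error counts; the (probability, actual) pairs are transcribed from the 24-case table above.

```python
# (estimated probability, actual class) pairs from the 24-case table
cases = [(0.996, 1), (0.988, 1), (0.984, 1), (0.980, 1), (0.948, 1), (0.889, 1),
         (0.949, 1), (0.762, 0), (0.707, 1), (0.681, 0), (0.656, 1), (0.622, 0),
         (0.506, 0), (0.471, 0), (0.337, 0), (0.218, 1), (0.199, 0), (0.149, 0),
         (0.048, 0), (0.038, 0), (0.025, 0), (0.022, 0), (0.016, 0), (0.004, 0)]

for cutoff in (0.50, 0.70):
    fp = sum(1 for p, y in cases if p >= cutoff and y == 0)  # false positives
    fn = sum(1 for p, y in cases if p < cutoff and y == 1)   # false negatives
    print(f"cutoff {cutoff}: {fp} FP, {fn} FN, {fp + fn}/24 total errors")
# cutoff 0.5: 4 FP, 1 FN, 5/24 total errors
# cutoff 0.7: 1 FP, 2 FN, 3/24 total errors
```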
Which is better??
• The high cutoff (0.70) gives more false negatives, fewer false positives
• The lower cutoff (0.50) gives the opposite
• Which is better is not a statistics question; it's a business application question
• The total number of errors is usually NOT the right criterion
• Which kind of error is more serious? Only rarely are the costs the same
• Statistical decision theory, Bayesian logic
When One Class Is More Important

In many cases it is more important to identify members of one class:
• Tax fraud
• Credit default
• Response to a promotional offer
• Detecting electronic network intrusion
• Predicting delayed flights

In such cases, we are willing to tolerate greater overall error in return for better identification of the important class for further attention.
There is always a tradeoff: false positives vs. false negatives.
• How do we estimate these two types of errors?
• Answer: with our set-aside data, the validation data set.
Common Performance Measures (Evaluation Measures for Classification II)

Several commonly used performance measures:

Name                                   Computation
Accuracy                               ACC = (TP + TN) / N
Error rate (1 − accuracy)              ERR = (FP + FN) / N
Balanced error rate                    BER = (1/2) [FN/(FN + TP) + FP/(FP + TN)]
Weighted relative accuracy             WRACC = TP/(TP + FN) − FP/(FP + TN)
F1 score                               F1 = 2·TP / (2·TP + FP + FN)
Cross-correlation coefficient          CC = (TP·TN − FP·FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))
Sensitivity / recall                   TPR = TP/N+ = TP/(TP + FN)
Specificity                            TNR = TN/N− = TN/(TN + FP)
1 − sensitivity                        FNR = FN/N+ = FN/(FN + TP)
1 − specificity                        FPR = FP/N− = FP/(FP + TN)
Positive predictive value / precision  PPV = TP/O+ = TP/(TP + FP)
False discovery rate                   FDR = FP/O+ = FP/(FP + TP)

© Gunnar Rätsch (cBio@MSKCC), Memorial Sloan Kettering Cancer Center, Introduction to Kernels @ MLSS 2012, Santa Cruz
None of these is a hypothesis test (for example, "is p(positive) > 0?"). Reason: we don't care about the "overall"; we care about prediction of individual cases.
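A minimal sketch of these formulas as plain Python functions of the four confusion-matrix counts; the function and key names are mine, not from any particular library.

```python
import math

def classification_metrics(tp, fp, fn, tn):
    """Common performance measures computed from confusion-matrix counts."""
    n = tp + fp + fn + tn
    return {
        "accuracy":    (tp + tn) / n,
        "error_rate":  (fp + fn) / n,
        "balanced_error_rate": 0.5 * (fn / (fn + tp) + fp / (fp + tn)),
        "f1":          2 * tp / (2 * tp + fp + fn),
        "recall":      tp / (tp + fn),    # sensitivity, TPR
        "specificity": tn / (tn + fp),    # TNR
        "precision":   tp / (tp + fp),    # positive predictive value
        "fdr":         fp / (fp + tp),    # false discovery rate
        "cc": (tp * tn - fp * fn) / math.sqrt(
            (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }

# The running example: 201 / 85 / 25 / 2689
print(classification_metrics(tp=201, fp=25, fn=85, tn=2689))
```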
Alternate Accuracy Measures

If "C1" is the important class:
• Sensitivity = % of the "C1" class correctly classified
• Specificity = % of the "C0" class correctly classified
• False positive rate = % of predicted "C1"s that were not "C1"s
• False negative rate = % of predicted "C0"s that were not "C0"s

(Careful: defined this way, the last two are rates among predictions, like the false discovery rate above, not the FPR = FP/N− and FNR = FN/N+ rates in the earlier table.)
Medical Terms

In medical diagnostics, test sensitivity is the ability of a test to correctly identify those with the disease (the true positive rate), whereas test specificity is the ability of the test to correctly identify those without the disease (the true negative rate). [2]
Significance vs. Power (Statistics)
• False positive rate (α) = P(type I error) = 1 − specificity = FP / (FP + TN) = the significance level. For simple hypotheses, this is the test's probability of incorrectly rejecting the null hypothesis.
• False negative rate (β) = P(type II error) = 1 − sensitivity = FN / (TP + FN)
• Power = sensitivity = 1 − β: the test's probability of correctly rejecting the null hypothesis; the complement of the false negative rate β.
Evaluation Measures for Classification III

[Figure: left, ROC curve (true positive rate vs. false positive rate); right, precision-recall curve (positive predictive value vs. true positive rate). Each panel compares the proposed method against FirstEF, Eponine, and McPromoter.]

(The curves are obtained by varying the bias/cutoff and recording TPR/FPR or PPV/TPR.)

Use a bias-independent scalar evaluation measure:
• Area under the ROC curve (auROC)
• Area under the precision-recall curve (auPRC)

© Gunnar Rätsch (cBio@MSKCC), Memorial Sloan Kettering Cancer Center, Introduction to Kernels @ MLSS 2012, Santa Cruz
ROC Curve

[Figure: ROC curve; x-axis: false positive rate, y-axis: true positive rate.]
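A sketch of how such a curve (and auROC) can be computed, assuming (probability, actual) pairs like the 24-case data above; auROC is approximated with the trapezoidal rule.

```python
def roc_points(cases):
    """Sweep the cutoff over every observed probability; record (FPR, TPR) pairs."""
    pos = sum(y for _, y in cases)          # actual positives
    neg = len(cases) - pos                  # actual negatives
    cutoffs = sorted({p for p, _ in cases} | {0.0, 1.1}, reverse=True)
    points = []
    for c in cutoffs:
        tp = sum(1 for p, y in cases if p >= c and y == 1)
        fp = sum(1 for p, y in cases if p >= c and y == 0)
        points.append((fp / neg, tp / pos))
    return points                           # runs from (0, 0) up to (1, 1)

def auc(points):
    """Trapezoidal area under a curve given as (x, y) points."""
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# e.g. auc(roc_points(cases)) with the 24-case data gives the auROC
```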
Lift and Decile Charts: Goal

Useful for assessing performance in terms of identifying the most important class.

Helps evaluate, e.g.:
• How many tax records to examine
• How many loans to grant
• How many customers to mail an offer to
Lift and Decile Charts (cont.)

• Compare the performance of the DM model to "no model, pick randomly"
• Measure the ability of the DM model to identify the important class, relative to its average prevalence
• The charts give an explicit assessment of results over a large number of cutoffs
Lift and Decile Charts: How to Use

Compare lift to the "no model" baseline:
• In a lift chart, compare the step function to the straight line
• In a decile chart, compare to a ratio of 1
Lift Chart: Cumulative Performance

[Figure: lift chart (training data set). X-axis: cumulative # of cases (0-30); y-axis: cumulative ownership (0-14). Two series: cumulative ownership when sorted using predicted values, and cumulative ownership using the average.]

After examining, e.g., 10 cases (x-axis), 9 owners (y-axis) have been correctly identified.
Decile Chart

[Figure: decile-wise lift chart (training data set). X-axis: deciles 1-10; y-axis: decile mean / global mean (0-2.5).]

In the "most probable" (top) decile, the model is twice as likely to identify the important class, compared to average prevalence.
Lift Charts: How to Compute
• Using the model's classifications, sort the records from most likely to least likely members of the important class
• Compute lift: accumulate the correctly classified "important class" records (y-axis) and compare to the number of total records (x-axis)
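A minimal sketch of that computation, plus the decile ratio discussed below; `cases` is again a list of hypothetical (probability, actual) pairs.

```python
def cumulative_lift(cases):
    """Sort by predicted probability, then accumulate important-class hits."""
    ranked = sorted(cases, key=lambda c: c[0], reverse=True)
    curve, running = [], 0
    for _, actual in ranked:
        running += actual
        curve.append(running)  # y-value after examining one more case
    return curve               # "no model" baseline at case k: k * curve[-1] / len(curve)

def decile_means(cases):
    """Ratio of each decile's hit rate to the overall hit rate.
    For simplicity, assumes len(cases) is divisible by 10."""
    ranked = sorted(cases, key=lambda c: c[0], reverse=True)
    n, size = len(ranked), len(ranked) // 10
    overall = sum(y for _, y in ranked) / n
    return [sum(y for _, y in ranked[i * size:(i + 1) * size]) / size / overall
            for i in range(10)]
```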
Lift vs. Decile Charts

Both embody the concept of "moving down" through the records, starting with the most probable.
• The decile chart does this in decile chunks of data; its y-axis shows the ratio of the decile mean to the overall mean
• The lift chart shows continuous cumulative results; its y-axis shows the number of important-class records identified
Asymmetric Costs
Misclassification Costs May Differ
The cost of making a misclassification error may be higher for one class than the other(s)
Looked at another way, the benefit of making a correct classification may be higher for one class than the other(s)
Example: Response to Promotional Offer

Suppose we send an offer to 1000 people, with a 1% average response rate ("1" = response, "0" = nonresponse).
• The "naïve rule" (classify everyone as "0") has an error rate of 1% (seems good)
• With a lower cutoff, suppose we can correctly classify eight 1's as 1's
• This comes at the cost of misclassifying twenty 0's as 1's and two 1's as 0's
New Confusion Matrix

                Predict as 1    Predict as 0
Actual 1              8               2
Actual 0             20             970

Error rate = (2 + 20) / 1000 = 2.2% (higher than the naïve rate)
Introducing Costs & Benefits

Suppose:
• Profit from a "1" is $10
• Cost of sending an offer is $1

Then:
• Under the naïve rule, all are classified as "0", so no offers are sent: no cost, no profit
• Under the DM predictions, 28 offers are sent:
  8 respond, with a profit of $10 each
  20 fail to respond, at a cost of $1 each
  972 receive nothing (no cost, no profit)
• Net profit = 8 × $10 − 20 × $1 = $60
Profit Matrix

                Predict as 1    Predict as 0
Actual 1            $80              $0
Actual 0           ($20)             $0
Lift (Again)

Adding costs to the mix, as above, does not change the actual classifications.

Better: use the lift curve and change the cutoff value for "1" to maximize profit.
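A minimal sketch of that cutoff search, assuming (probability, actual) pairs as before and the accounting of the profit matrix above ($10 per responder offered, −$1 per non-responder offered):

```python
def net_profit(cases, cutoff, profit_per_response=10.0, cost_per_offer=1.0):
    """Per the profit matrix: each responder offered earns $10,
    each non-responder offered costs $1, everyone else contributes $0."""
    offered = [y for p, y in cases if p >= cutoff]
    responders = sum(offered)
    return (profit_per_response * responders
            - cost_per_offer * (len(offered) - responders))

def best_cutoff(cases):
    """Try every observed probability as a cutoff; keep the most profitable one."""
    return max(((c, net_profit(cases, c)) for c in sorted({p for p, _ in cases})),
               key=lambda pair: pair[1])
```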
Generalize to a Cost Ratio

Sometimes actual costs and benefits are hard to estimate:
• We need to express everything in terms of costs (i.e., the cost of misclassification per record)
• The goal is to minimize the average cost per record

A good practical substitute for individual costs is the ratio of misclassification costs (e.g., "misclassifying fraudulent firms is 5 times worse than misclassifying solvent firms").
Minimizing the Cost Ratio

q1 = cost of misclassifying an actual "1"
q0 = cost of misclassifying an actual "0"

Minimizing cost using the ratio q1/q0 is equivalent to minimizing the average cost per record.

Software* may provide an option for the user to specify a cost ratio.

*Currently unavailable in XLMiner
Note: Opportunity Costs
• As we have seen, it is best to convert everything to costs, as opposed to a mix of costs and benefits
• E.g., instead of "benefit from sale," refer to the "opportunity cost of a lost sale"
• This leads to the same decisions, but referring only to costs allows greater applicability
Cost Matrix (incl. opportunity costs)

Recall the original confusion matrix (profit from a "1" = $10, cost of sending an offer = $1):

                Predict as 1    Predict as 0
Actual 1              8               2
Actual 0             20             970

The corresponding cost matrix:

                Predict as 1    Predict as 0
Actual 1             $8             $20
Actual 0            $20              $0

(Each cell is a total: the eight offers sent to responders cost 8 × $1 = $8, the two missed responders represent 2 × $10 = $20 in forgone profit, and the twenty wasted offers cost 20 × $1 = $20.)
Adding Cost/Benefit to the Lift Curve
• Sort the records in descending probability of success
• For each case, record the cost/benefit of the actual outcome
• Also record the cumulative cost/benefit
• Plot all records:
  The x-axis is the index number (1 for the 1st case, n for the nth case)
  The y-axis is the cumulative cost/benefit
  Draw a reference line from the origin to y_n (y_n = total net benefit)
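A minimal sketch of this construction, assuming (probability, cost/benefit) pairs where the second element is the realized cost or benefit of each case's actual outcome:

```python
def cost_benefit_curve(cases):
    """cases: (predicted probability, realized cost/benefit of actual outcome).
    Returns the cumulative curve and the straight reference line."""
    ranked = sorted(cases, key=lambda c: c[0], reverse=True)
    curve, total = [], 0.0
    for _, benefit in ranked:
        total += benefit
        curve.append(total)
    n = len(curve)
    reference = [curve[-1] * (i + 1) / n for i in range(n)]  # origin to y_n
    return curve, reference

# The profit-maximizing cutoff corresponds to the peak of `curve`.
```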
Multiple Classes

For m classes, the confusion matrix has m rows and m columns.
• Theoretically, there are m(m − 1) misclassification costs, since any case could be misclassified in m − 1 ways
• In practice, that is often too many to work with
• In a decision-making context, though, such complexity rarely arises: one class is usually of primary interest
Negative slope to reference curve
Lift Curve May Go Negative

If the total net benefit from all cases is negative, the reference line will have a negative slope.

Nonetheless, the goal is still to use the cutoff to select the point where net benefit is at its maximum.
Oversampling and Asymmetric Costs

Caution: most real DM problems have asymmetric costs!
Rare Cases

Asymmetric costs/benefits typically go hand in hand with the presence of a rare but important class:
• Responder to a mailing
• Someone who commits fraud
• Debt defaulter

• Often we oversample rare cases to give the model more information to work with
• Typically we use 50% "1" and 50% "0" for training
Example

The following graphs show optimal classification under three scenarios:
• assuming equal costs of misclassification
• assuming that misclassifying "o" costs five times as much as misclassifying "x"
• an oversampling scheme allowing DM methods to incorporate asymmetric costs
Classification: equal costs
Classification: Unequal costs
Oversampling Scheme

Oversample "o" to appropriately weight misclassification costs.
An Oversampling Procedure
1. Separate the responders (rare) from the non-responders
2. Randomly assign half the responders to the training sample, plus an equal number of non-responders
3. The remaining responders go to the validation sample
4. Add non-responders to the validation data, to maintain the original ratio of responders to non-responders
5. Randomly take a test set (if needed) from the validation data
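A minimal sketch of steps 1-5, assuming `records` is a list of (features, label) pairs with label 1 for the rare responders; the helper name is mine.

```python
import random

def oversample_split(records, seed=0):
    """Steps 1-5 above: a 50/50 training sample; validation keeps the original ratio."""
    rng = random.Random(seed)
    responders = [r for r in records if r[1] == 1]       # step 1
    nonresponders = [r for r in records if r[1] == 0]
    rng.shuffle(responders)
    rng.shuffle(nonresponders)

    half = len(responders) // 2
    train = responders[:half] + nonresponders[:half]     # step 2: 50% "1", 50% "0"

    valid_resp = responders[half:]                       # step 3
    ratio = len(nonresponders) / len(responders)         # original 0-to-1 ratio
    k = round(ratio * len(valid_resp))                   # step 4
    valid = valid_resp + nonresponders[half:half + k]
    return train, valid    # step 5: carve a test set out of `valid` if needed
```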
Classification Using Triage

Instead of classifying as C1 or C0, we classify as:
• C1
• C0
• Can't say

The third category might receive special human review. This takes a gray area into account when making classification decisions.
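A minimal sketch of such a triage rule; the 0.35/0.65 gray-zone bounds are illustrative assumptions, not from the text.

```python
def triage(prob, low=0.35, high=0.65):
    """Return 'C1', 'C0', or "can't say" for records in the gray zone."""
    if prob >= high:
        return "C1"
    if prob <= low:
        return "C0"
    return "can't say"   # route these to special human review

print([triage(p) for p in (0.90, 0.50, 0.10)])  # ['C1', "can't say", 'C0']
```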
Evaluating Predictive Performance
Measuring Predictive Error
• Not the same as "goodness-of-fit"
• We want to know how well the model predicts new data, not how well it fits the data it was trained with
• The key component of most measures is the difference between the actual y and the predicted y (the "error")
Some Measures of Error
• MAE or MAD (mean absolute error / deviation): gives an idea of the magnitude of errors
• Average error: gives an idea of systematic over- or under-prediction
• MAPE: mean absolute percentage error
• RMSE (root mean squared error): square the errors, find their average, take the square root
• Total SSE: total sum of squared errors
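For reference, the standard formulas behind these names, writing e_i = y_i − ŷ_i for the error on record i of n validation records:

```latex
\begin{aligned}
\text{MAE} &= \frac{1}{n}\sum_{i=1}^{n} |e_i|
  & \text{Average error} &= \frac{1}{n}\sum_{i=1}^{n} e_i \\
\text{MAPE} &= \frac{100\%}{n}\sum_{i=1}^{n} \left|\frac{e_i}{y_i}\right|
  & \text{RMSE} &= \sqrt{\frac{1}{n}\sum_{i=1}^{n} e_i^2} \\
\text{SSE} &= \sum_{i=1}^{n} e_i^2 & &
\end{aligned}
```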
Lift Chart for Predictive Error

Similar to the lift chart for classification, except that the y-axis is the cumulative value of the numeric target variable (e.g., revenue), instead of the cumulative count of "responses".
Lift chart example: spending
Summary
• Evaluation metrics are important for comparing across DM models, for choosing the right configuration of a specific DM model, and for comparing to the baseline
• Major metrics: confusion matrix, error rate, predictive error
• Other metrics apply when one class is more important or when costs are asymmetric
• When the important class is rare, use oversampling
• In all cases, metrics are computed from validation data