Classification Evaluation
Estimating Future Accuracy
• Given available data, how can we reliably predict accuracy on future, unseen data?
• Three basic approaches
– Training set
– Hold-out set (2 variants)
– Cross-validation
Estimating with Training Set
• Simplest approach
– Build model from training set
– Compute accuracy on training set
• Pros and cons
– Easy
– Likely to overestimate accuracy (think overfitting)
Estimating with Hold-out Set (1)
• Method 1
– Two distinct data sets are made available a priori
– One is used to build the model
– The other is used to test the model
• Pros and cons
– No “bias”
– Not always feasible
Estimating with Hold-out Set (2)
• Method 2:
– Randomly partition data into training and test set
– Training set used to train/build the model
– Test set used to evaluate the model
• Pros and cons
– Easy
– Less likely to overestimate accuracy
– Reduces amount of training data
Holding out data
• The holdout method reserves a certain amount of data for testing and uses the remainder for training
– Usually: one third for testing, the rest for training
• For “unbalanced” datasets, random samples might not be representative
– Few or no instances of some classes
• Stratified sample:
– Make sure that each class is represented in approximately equal proportions in both subsets
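A stratified holdout split can be sketched in a few lines of Python. This is only an illustration using the standard library; the function name `stratified_holdout`, the seed, and the toy labels are assumptions, and the one-third test fraction follows the rule of thumb above.

```python
import random
from collections import defaultdict

def stratified_holdout(labels, test_fraction=1/3, seed=0):
    """Return (train_idx, test_idx) with class proportions roughly preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    # Sample the test portion separately within each class.
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical unbalanced data: 30 "pos" and 60 "neg" instances.
labels = ["pos"] * 30 + ["neg"] * 60
train_idx, test_idx = stratified_holdout(labels)
```

With these toy labels, the test set receives 10 "pos" and 20 "neg" instances, so both subsets keep the 1:2 class ratio.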
Repeated holdout method
• Holdout estimate can be made more reliable by repeating the process with different subsamples
– In each iteration, a certain proportion is randomly selected for training (possibly with stratification)
– The error rates on the different iterations are averaged to yield an overall error rate
• This is called the repeated holdout method
Cross-validation
• Most popular and effective type of repeated holdout is cross-validation
• Cross-validation avoids overlapping test sets
– First step: data is split into k subsets of equal size
– Second step: each subset in turn is used for testing and the remainder for training
• This is called k-fold cross-validation
• Often the subsets are stratified before the cross-validation is performed
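The two steps can be sketched as follows. This is a minimal, non-stratified illustration; `k_fold_indices`, `cross_val_error`, and the `train_and_test` callable (which is expected to build a model on the training indices and return its error rate on the test indices) are hypothetical names, not part of any library.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each instance is tested exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k folds of near-equal size
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def cross_val_error(train_and_test, n, k=10, seed=0):
    """Average the per-fold error returned by a user-supplied callable."""
    errors = [train_and_test(tr, te) for tr, te in k_fold_indices(n, k, seed)]
    return sum(errors) / len(errors)

splits = list(k_fold_indices(20, 5))  # 5 folds of 4 test instances each
```

Because the folds partition the shuffled indices, no two test sets overlap, which is the property that distinguishes cross-validation from plain repeated holdout.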
Cross-validation example:
More on cross-validation
• Standard data-mining method for evaluation: stratified ten-fold cross-validation
• Why ten?
– Good choice to get an accurate estimate
• Stratification reduces the estimate’s variance
• Even better: repeated stratified cross-validation
– E.g., ten-fold cross-validation is repeated ten times and results are averaged (reduces the sampling variance)
• Error estimate is the mean across all repetitions
Leave-One-Out cross-validation
• Leave-One-Out: a particular form of cross-validation
– Set number of folds to number of training instances
– I.e., for n training instances, build classifier n times
• Makes best use of the data
• Involves no random subsampling
• Computationally expensive, but good performance
Leave-One-Out-CV and stratification
• Disadvantage of Leave-One-Out-CV: stratification is not possible
– It guarantees a non-stratified sample because there is only one instance in the test set!
• Extreme example: random dataset split equally into two classes
– Best model predicts majority class
– 50% accuracy on fresh data
– Leave-One-Out-CV estimate is 100% error!
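The extreme example can be checked directly. The sketch below assumes a classifier that always predicts the majority class of its training set; the function names and the 50/50 toy labels are illustrative.

```python
from collections import Counter

def majority_class(labels):
    """Return the most frequent label in the list."""
    return Counter(labels).most_common(1)[0][0]

def loo_error_majority(labels):
    """Leave-one-out error rate of a majority-class predictor."""
    errors = 0
    for i, held_out in enumerate(labels):
        train = labels[:i] + labels[i + 1:]  # all instances except instance i
        if majority_class(train) != held_out:
            errors += 1
    return errors / len(labels)

labels = ["a"] * 50 + ["b"] * 50
print(loo_error_majority(labels))  # 1.0
```

Removing one instance always makes the other class the majority in the training set, so every held-out instance is misclassified, even though the same predictor would score about 50% on fresh data.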
Three-way Data Splits
• One problem with CV is that, because the same data is used both to fit the model and to estimate error, the error estimate can be biased downward.
• If the goal is a realistic estimate of error (as opposed to choosing which model is best), you may want a three-way split:
– Training set: examples used for learning
– Validation set: used to tune parameters
– Test set: never used in the model fitting process, used at the end for an unbiased estimate of hold-out error
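A three-way split might be sketched as below. The 60/20/20 proportions, the seed, and the function name `three_way_split` are assumptions chosen for illustration; the slides do not prescribe particular sizes.

```python
import random

def three_way_split(n, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Return disjoint (train_idx, val_idx, test_idx) index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(n * test_fraction)
    n_val = int(n * val_fraction)
    test_idx = idx[:n_test]                # set aside, used once at the end
    val_idx = idx[n_test:n_test + n_val]   # used to tune parameters
    train_idx = idx[n_test + n_val:]       # used for learning
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(100)
```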
Nested Cross-validation
Issues with Accuracy
• Measuring accuracy
– Is 99% accuracy good? Is 20% accuracy bad?
– Can be excellent, good, mediocre, poor, terrible
• Why?
– Depends on problem complexity
– Depends on base accuracy (i.e., majority learner)
– Depends on cost of error (e.g., ICU, etc.)
• Problem: assumes equal cost for all errors
Confusion Matrix
                       Predicted 1                          Predicted 0
True (target) 1        True Positive (TP): hits             False Negative (FN): misses
True (target) 0        False Positive (FP): false alarms    True Negative (TN): correct rejections
Accuracy = (TP+TN)/(TP+TN+FP+FN)
Single number: loses information
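As a small illustration, the four cells and the accuracy can be counted from paired label lists. The helper names and the toy labels below are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count (TP, FP, TN, FN) for a binary problem."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1   # hit
            else:
                fp += 1   # false alarm
        else:
            if t == positive:
                fn += 1   # miss
            else:
                tn += 1   # correct rejection
    return tp, fp, tn, fn

def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + tn + fp + fn)

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(tp, fp, tn, fn, accuracy(tp, fp, tn, fn))  # 1 1 4 2 0.625
```

The accuracy of 0.625 hides the fact that two of the three positives were missed, which is the information a single number loses.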
Precision
Precision = TP/(TP+FP)
The percentage of predicted positives that are actually positive
(of those I predict true, how many are actually true)
Recall
Recall = TP/(TP+FN)
The percentage of actual positives that were predicted as positive
(of those that are true, how many do I predict are true)
P/R Trade-off (I)
• ICU monitoring:
– Is precision the goal? Not so much; we would rather not miss any
– Recall is the goal: don’t want to miss any, and would rather err towards accepting some false positives (check patient when unnecessary) and minimize false negatives (not check on a needy patient)
• Google search:
– Is recall the goal? Not really, because we never get to the millionth page; we would rather get a few very good results early
– Precision is the goal: don’t want to see irrelevant documents (false positives), can tolerate missing some (false negatives); there are plenty of sites anyway and we don’t need to get all
• Trade-off:
– Easy to maximize precision: only classify the one or few most confident candidates as true
– Easy to maximize recall: classify everything as true
– Neither is particularly useful!
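The two degenerate strategies can be seen on a toy ranking. The sketch below assumes instances are already sorted by decreasing model confidence and that the top k are classified as true; the ranking itself is made up for illustration.

```python
def precision_recall_at_k(ranked_labels, k):
    """ranked_labels: true labels (1/0) sorted by decreasing confidence.
    The top k instances are classified positive, the rest negative."""
    tp = sum(ranked_labels[:k])
    fp = k - tp
    fn = sum(ranked_labels[k:])
    return tp / (tp + fp), tp / (tp + fn)

ranked = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]  # hypothetical toy ranking
for k in (1, 5, 10):
    print(k, precision_recall_at_k(ranked, k))
# 1 (1.0, 0.25)
# 5 (0.6, 0.75)
# 10 (0.4, 1.0)
```

Classifying only the single most confident candidate gives perfect precision with poor recall, and classifying everything as true gives perfect recall with precision equal to the base rate.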
P/R Trade-off (II)
Complete P/R curve
Breakeven point: defined by P = R
Alternatively, F-measure:
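The balanced F-measure (F1) is commonly defined as the harmonic mean of precision and recall, F1 = 2PR / (P + R). A minimal Python sketch follows; the function names are illustrative, and returning 0.0 when a ratio is undefined is an assumed convention rather than part of the definition.

```python
def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Hypothetical counts: P = 0.8, R = 0.5, F1 is roughly 0.615
print(precision(8, 2), recall(8, 8), round(f1(8, 2, 8), 3))
```

The harmonic mean stays close to the smaller of the two values, so a classifier cannot score well by maximizing only precision or only recall.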
Other Measures
• Sensitivity (Recall):
– TP / (TP + FN)
• Specificity:
– TN / (TN + FP)
• Positive Predictive Value (Precision):
– TP / (TP + FP)
• Negative Predictive Value:
– TN / (TN + FN)
ROC Curves
• Receiver Operating Characteristic curve
– Developed in WWII to statistically model false positive and false negative detections of radar operators
• Standard measure in medicine and biology
• Graphs true positive rate (sensitivity) vs. false positive rate (1 - specificity)
• Goal: maximize TPR and minimize FPR
– Max TPR: classify everything positive
– Min FPR: classify everything negative
– Neither is acceptable, of course!
Several Points in ROC Space
• Lower left point (0, 0) represents the strategy of never issuing a positive classification
– No FP but also no TP
• Upper right corner (1, 1) represents the opposite strategy, of unconditionally issuing positive classifications
• Point (0, 1) represents perfect classification
– D's performance is perfect as shown
• Informally, one point in ROC space is better than another if it is to the northwest of the first
– TP rate is higher, FP rate is lower, or both
ROC Curves and AUC (II)
Each point on the ROC curve represents a different tradeoff (cost ratio) between TPR and FPR
AUC is area under the curve: represents performance averaged over all possible cost ratios
Single summary number
Perfect model has AUC = 1.0
Random model has AUC = 0.5
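One common way to compute AUC without drawing the curve uses its equivalence to a ranking statistic: the fraction of (positive, negative) pairs in which the positive instance receives the higher score, with ties counted as one half. The sketch below relies on that equivalence; the function name `auc` and the toy scores are illustrative.

```python
def auc(scores, labels):
    """AUC as the fraction of correctly ordered (positive, negative) pairs."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # 1.0 (perfect ranking)
print(auc([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))  # 0.5 (uninformative scores)
```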
Specific Example
[Figure: distributions of test results for patients with the disease and patients without the disease. A threshold on the test result separates the patients called “negative” (one side) from the patients called “positive” (the other side).]
Some definitions ...
• True positives: patients with the disease who are called “positive”
• False positives: patients without the disease who are called “positive”
• True negatives: patients without the disease who are called “negative”
• False negatives: patients with the disease who are called “negative”
How to Construct ROC Curve for one Classifier
• Sort the instances according to their Ppos
• Move a threshold on the sorted instances
• For each threshold define a classifier with confusion matrix
• Plot the TPR and FPR of the classifier
Ppos    True Class
0.99    pos
0.98    pos
0.7     neg
0.6     pos
0.43    neg
Confusion matrix when instances with Ppos of 0.7 or higher are predicted pos:
             Predicted pos    Predicted neg
True pos     2                1
True neg     1                1
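The construction can be sketched in Python using the five instances from the table. The function name `roc_points` is illustrative, and the sketch assumes there are no tied scores.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points from sweeping a threshold down the
    instances sorted by decreasing score. Assumes no tied scores."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted pos
    tp = fp = 0
    for _, y in ranked:
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

p_pos = [0.99, 0.98, 0.7, 0.6, 0.43]
true_class = [1, 1, 0, 1, 0]  # 1 = pos, 0 = neg
for fpr, tpr in roc_points(p_pos, true_class):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

The fourth point, FPR = 0.50 and TPR = 0.67, corresponds to the threshold at 0.7, which yields TP = 2, FP = 1, FN = 1, TN = 1.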
ROC Properties
• AUC properties
– 1.0 - Perfect prediction
– .9 - Excellent
– .7 - Mediocre
– .5 - Random
• ROC Curve properties
– If two ROC curves do not intersect then one method dominates the other
– If they do intersect then one method is better for some cost ratios, and is worse for others
• Blue alg better for precision, yellow alg for recall, red neither
• Can choose method and balance based on goals
Lift (I)
• In some situations, we are not interested in the accuracy over the entire data set
– Accurate predictions for 5%, 10%, or 20% of data
– Don’t care about the rest
• Prototypical application: direct marketing
– Baseline: random targeting of population
– Can we do better?
• Want to know how much better a targeted offer is on a fraction of the population
Lift (II)
Lift = [TP / (TP+FN)] / [(TP+FP) / (TP+TN+FP+FN)]
How much better a model is than random predictions
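A small sketch of lift computed from confusion-matrix counts, taking lift as the fraction of all positives the model captures divided by the fraction of the population it targets. The counts in the example are hypothetical.

```python
def lift(tp, fp, tn, fn):
    """Share of positives captured, relative to the share of population targeted."""
    captured = tp / (tp + fn)                   # fraction of positives found
    targeted = (tp + fp) / (tp + tn + fp + fn)  # fraction predicted positive
    return captured / targeted

# Hypothetical campaign: 1000 prospects, 50 responders, 100 targeted, 30 hits.
print(round(lift(tp=30, fp=70, tn=880, fn=20), 2))
```

Targeting at random would capture positives in proportion to the share targeted, giving a lift of 1; values above 1 indicate the model beats random targeting.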
Lift (III)
Lift(t) = CR(t) / t
E.g., Lift(25%) = CR(25%) / 25% = 62% / 25% ≈ 2.5
If we select 25% of prospects using our model, they are 2.5 times more likely to respond than if we selected them randomly
Can vary t to make decisions (e.g., cost/benefit analysis)
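Lift as a function of t can be sketched from a ranked list of prospects. The sketch assumes CR(t) is the share of all responders found among the top fraction t of prospects ranked by model score; the function name `lift_at` and the toy ranking are illustrative.

```python
def lift_at(ranked_labels, t):
    """ranked_labels: responses (1/0) sorted by decreasing model score.
    t: fraction of the population targeted, 0 < t <= 1."""
    n = len(ranked_labels)
    n_selected = max(1, round(t * n))
    cr = sum(ranked_labels[:n_selected]) / sum(ranked_labels)  # CR(t)
    return cr / (n_selected / n)

# Hypothetical ranking of 20 prospects containing 4 responders.
ranked = [1, 1, 0, 1, 0] + [0] * 14 + [1]
print(lift_at(ranked, 0.25))  # 3.0
print(lift_at(ranked, 1.0))   # 1.0
```

Evaluating `lift_at` over a range of t values gives the lift curve, which can then be weighed against the cost of contacting each additional fraction of the population.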
Summary
• Several measures
– Single value vs. range of thresholds
• Restricted to binary classification
– Could always cast problem as a set of two-class problems, but that can be inconvenient
• Accuracy handles multi-class outputs
• Key point:
– The measure you optimize makes a difference
– The measure you report makes a difference
– Measure what you want to optimize/report (i.e., use measure appropriate to task/domain)