
Chapter 5 – Evaluating Predictive Performance

Data Mining for Business Analytics
Shmueli, Patel & Bruce

Why Evaluate?
- Multiple methods are available to classify or predict
- For each method, multiple choices are available for settings
- To choose the best model, need to assess each model's performance

Types of outcome
- Predicted numerical value: outcome variable is numerical and continuous
- Predicted class membership: outcome variable is categorical
- Propensity: probability of class membership when the outcome variable is categorical

Accuracy Measures (Continuous outcome)

Evaluating Predictive Performance
- Not the same as goodness-of-fit ($R^2$, $s_{y|x}$)
- Predictive performance is measured using the validation dataset
- Benchmark is the average $\bar{y}$ used as prediction

Prediction accuracy measures
$e_i = y_i - \hat{y}_i$, where $y_i$ = actual y value and $\hat{y}_i$ = predicted y value

MAE (Mean Absolute Error) $= \frac{1}{n}\sum_{i=1}^{n}|e_i|$; MAE is also known as MAD (Mean Absolute Deviation)

Average Error $= \frac{1}{n}\sum_{i=1}^{n} e_i$

MAPE (Mean Absolute Percent Error) $= 100\% \times \frac{1}{n}\sum_{i=1}^{n}\left|\frac{e_i}{y_i}\right|$

RMSE (Root-Mean-Squared Error) $= \sqrt{\frac{1}{n}\sum_{i=1}^{n} e_i^2}$

Total SSE (Total Sum of Squared Errors) $= \sum_{i=1}^{n} e_i^2$
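As a minimal Python sketch of these measures (NumPy assumed; the arrays are made-up illustrative values, not output from the Boston Housing example):

```python
import numpy as np

def accuracy_measures(y, y_hat):
    """Prediction accuracy measures from the residuals e_i = y_i - yhat_i."""
    e = y - y_hat
    return {
        "MAE":  np.mean(np.abs(e)),             # mean absolute error (a.k.a. MAD)
        "AvgError": np.mean(e),                 # signed average error
        "MAPE": 100 * np.mean(np.abs(e / y)),   # mean absolute percent error
        "RMSE": np.sqrt(np.mean(e ** 2)),       # root-mean-squared error
        "TotalSSE": np.sum(e ** 2),             # total sum of squared errors
    }

# Benchmark: use the average ybar as the prediction for every case
y = np.array([24.0, 21.6, 34.7, 33.4, 36.2])
y_hat = np.array([25.1, 22.0, 33.0, 35.2, 34.9])
print(accuracy_measures(y, y_hat))                      # model predictions
print(accuracy_measures(y, np.full_like(y, y.mean())))  # ybar benchmark
```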

XLMiner
Example: Multiple Regression using the Boston Housing dataset

XLMiner
Residuals for training dataset; residuals for validation dataset

SAS Enterprise Miner
Example: Multiple Regression using the Boston Housing dataset

SAS Enterprise Miner
Residuals for validation dataset

SAS Enterprise Miner
Boxplot of residuals

Lift Chart
- Order predicted y-values from highest to lowest
- X-axis: cases from 1 to n
- Y-axis: cumulative predicted value of Y
- Two lines are plotted: one with $\bar{y}$ (the average) as the predicted value, and one with $\hat{y}$ found using a prediction model

XLMiner: Lift Chart

Decile-wise Lift Chart
- Order predicted y-values from highest to lowest
- X-axis: % of cases from 10% to 100%, i.e., 1st decile to 10th decile
- Y-axis: cumulative predicted value of Y
- For each decile, the ratio of the sum of $\hat{y}$ to the sum of $\bar{y}$ is plotted (see the sketch below)
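A sketch of the decile-wise computation as the slide defines it (NumPy assumed; `decile_lift` is a hypothetical helper, not part of any tool named here):

```python
import numpy as np

def decile_lift(y, y_hat):
    """For each decile of cases sorted by predicted value (highest first),
    the ratio of the sum of predictions to what the benchmark ybar gives."""
    y_hat_sorted = np.sort(y_hat)[::-1]          # order predictions high to low
    ybar = np.mean(y)                            # naive benchmark prediction
    deciles = np.array_split(y_hat_sorted, 10)   # 1st decile ... 10th decile
    return [d.sum() / (len(d) * ybar) for d in deciles]
```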

XLMINER: Decile-wise lift Chart

Accuracy Measures (Classification)

Misclassification error

Error = classifying a record as belonging to one class when it belongs to another class.

Error rate = percent of misclassified records out of the total records in the validation data

Benchmark: Naïve Rule
- Naïve rule: classify all records as belonging to the most prevalent class
- Used only as a benchmark to evaluate more complex classifiers

Separation of Records
- “High separation of records” means that using the predictor variables attains low error
- “Low separation of records” means that using the predictor variables does not improve much on the naïve rule

High Separation of Records

Low Separation of Records

Classification Confusion Matrix

201 1’s correctly classified as “1” (True positives)

85 1’s incorrectly classified as “0” (False negatives)

25 0’s incorrectly classified as “1” (False positives)

2689 0’s correctly classified as “0” (True negatives)

             Predicted 1   Predicted 0
Actual 1         201            85
Actual 0          25          2689

Error Rate

Overall error rate = (25 + 85)/3000 = 3.67%

Accuracy = 1 - err = (201 + 2689)/3000 = 96.33%

If there are multiple classes, the error rate is (sum of misclassified records)/(total records)


Classification Matrix: Meaning of Each Cell

             Predicted C1   Predicted C2
Actual C1      n_{1,1}        n_{1,2}
Actual C2      n_{2,1}        n_{2,2}

- $n_{1,1}$ = no. of C1 cases classified correctly as C1
- $n_{1,2}$ = no. of C1 cases classified incorrectly as C2
- $n_{2,1}$ = no. of C2 cases classified incorrectly as C1
- $n_{2,2}$ = no. of C2 cases classified correctly as C2

Misclassification rate $= \text{err} = \dfrac{n_{1,2} + n_{2,1}}{n}$

Accuracy $= 1 - \text{err} = \dfrac{n_{1,1} + n_{2,2}}{n}$

(where $n = n_{1,1} + n_{1,2} + n_{2,1} + n_{2,2}$ is the total number of records)
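A plain-Python sketch that tallies these cells and reproduces the error rate from the earlier slide (the function is illustrative, binary classes coded 1 for C1 and 0 for C2):

```python
def confusion_counts(actual, predicted):
    """Tally n11, n12, n21, n22 of the 2x2 classification confusion matrix."""
    n11 = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    n12 = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
    n21 = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
    n22 = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
    return n11, n12, n21, n22

# Counts from the confusion matrix above
n11, n12, n21, n22 = 201, 85, 25, 2689
n = n11 + n12 + n21 + n22            # 3000
err = (n12 + n21) / n                # 0.0367 -> 3.67%
accuracy = 1 - err                   # 0.9633 -> 96.33%
```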

Propensity
- Propensities are estimated probabilities that a case belongs to each of the classes
- Used in two ways: to generate predicted class membership, and to rank-order cases by probability of belonging to a particular class of interest

Most data mining algorithms classify via a two-step process. For each case:
1. Compute the probability of belonging to the class of interest
2. Compare it to the cutoff value, and classify accordingly

The default cutoff value is 0.50:
- If probability >= 0.50, classify as “1”
- If probability < 0.50, classify as “0”

Different cutoff values can be used; typically, the error rate is lowest for cutoff = 0.50 (see the sketch below).
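A one-function sketch of step 2 (the cutoff comparison), using a few propensities from the cutoff table that follows:

```python
def classify(propensities, cutoff=0.50):
    """Classify as 1 when the propensity meets the cutoff, else 0."""
    return [1 if p >= cutoff else 0 for p in propensities]

probs = [0.996, 0.988, 0.762, 0.506, 0.471, 0.337]
print(classify(probs))          # default cutoff 0.50 -> [1, 1, 1, 1, 0, 0]
print(classify(probs, 0.80))    # stricter cutoff     -> [1, 1, 0, 0, 0, 0]
```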

Propensity and Cutoff for Classification

Cutoff Table

Actual Class   Prob. of “1”     Actual Class   Prob. of “1”
     1            0.996              1            0.506
     1            0.988              0            0.471
     1            0.984              0            0.337
     1            0.980              1            0.218
     1            0.948              0            0.199
     1            0.889              0            0.149
     1            0.848              0            0.048
     0            0.762              0            0.038
     1            0.707              0            0.025
     1            0.681              0            0.022
     1            0.656              0            0.016
     0            0.622              0            0.004

If cutoff is 0.50: thirteen records are classified as “1”

If cutoff is 0.80: seven records are classified as “1”

Confusion Matrix for Different Cutoffs

Cutoff = 0.25

                    Predicted owner   Predicted non-owner
Actual owner              11                  1
Actual non-owner           4                  8

Cutoff = 0.75

                    Predicted owner   Predicted non-owner
Actual owner               7                  5
Actual non-owner           1                 11


When One Class is More Important
In many cases it is more important to identify members of one class:
- Tax fraud
- Credit default
- Response to a promotional offer
- Predicting delayed flights

In such cases, overall accuracy is not a good measure for evaluating the classifier; use sensitivity and specificity instead.

Sensitivity
Suppose that “C1” is the important class. Sensitivity is the ability (probability) to detect membership in the “C1” class correctly:

$\text{Sensitivity} = \dfrac{n_{1,1}}{n_{1,1} + n_{1,2}}$

Sensitivity is also called the hit rate or true positive rate.

Specificity
Suppose that “C1” is the important class. Specificity is the ability (probability) to correctly rule out members who belong to the “C2” class:

$\text{Specificity} = \dfrac{n_{2,2}}{n_{2,1} + n_{2,2}}$

1 - Specificity = false positive rate.
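A short sketch using the notation above, evaluated on the counts from the error-rate slide:

```python
def sensitivity(n11, n12):
    """Hit rate / true positive rate: share of actual C1 detected as C1."""
    return n11 / (n11 + n12)

def specificity(n21, n22):
    """Share of actual C2 correctly ruled out (classified as C2)."""
    return n22 / (n21 + n22)

print(sensitivity(201, 85))      # 201/286   ~ 0.70
print(specificity(25, 2689))     # 2689/2714 ~ 0.99
```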

ROC Curve
Receiver Operating Characteristic curve: plots sensitivity on the y-axis against 1 - specificity (the false positive rate) on the x-axis as the cutoff is varied. The closer the curve is to the top-left corner, the better the classifier.
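As an illustration only (scikit-learn is my assumption here; the slides themselves use XLMiner and SAS Enterprise Miner), the curve can be traced from actual classes and propensities:

```python
from sklearn.metrics import roc_curve, auc

actual = [1, 1, 0, 1, 0, 1, 0, 0]                  # illustrative labels
propensity = [0.95, 0.85, 0.70, 0.60, 0.45, 0.40, 0.25, 0.10]

fpr, tpr, cutoffs = roc_curve(actual, propensity)  # fpr = 1 - specificity
print(auc(fpr, tpr))                               # area under the ROC curve
```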

Asymmetric Misclassification Costs

Misclassification Costs May Differ

The cost of making a misclassification error may be higher for one class than the other(s)

Looked at another way, the benefit of making a correct classification may be higher for one class than the other(s)

Example – Response to Promotional Offer

Suppose we send an offer to 1000 people, with a 1% average response rate (“1” = response, “0” = nonresponse).

The “naïve rule” (classify everyone as “0”) has an error rate of 1% (seems good):

             Predict Class 0   Predict Class 1
Actual 0          990                 0
Actual 1           10                 0

The Confusion Matrix

Suppose that using DM we can correctly classify eight 1’s as 1’s. It comes at the cost of misclassifying twenty 0’s as 1’s and two 1’s as 0’s:

             Predict Class 0   Predict Class 1
Actual 0          970                20
Actual 1            2                 8

Error rate = (2 + 20)/1000 = 2.2% (higher than the naïve rate)

Introducing Costs & Benefits
Suppose:
- Profit from a “1” is $10
- Cost of sending an offer is $1

Then:
- Under the naïve rule, all are classified as “0”, so no offers are sent: no cost, no profit
- Under the DM predictions, 28 offers are sent:
  - 8 respond, with profit of $10 each
  - 20 fail to respond, at a cost of $1 each
  - 972 receive nothing (no cost, no profit)

Net profit = $80 - $20 = $60

Profit Matrix

             Predict Class 0   Predict Class 1
Actual 0           0                -$20
Actual 1           0                 $80

Minimize Opportunity Costs
- As we see, it is best to convert everything to costs, as opposed to a mix of costs and benefits
- E.g., instead of “benefit from sale”, refer to “opportunity cost of lost sale”
- Leads to the same decisions, but referring only to costs allows greater applicability

Cost Matrix with Opportunity Costs

Recall the original confusion matrix (profit from a “1” = $10, cost of sending an offer = $1):

             Predict Class 0   Predict Class 1
Actual 0          970                20
Actual 1            2                 8

Costs        Predict Class 0    Predict Class 1
Actual 0     970 x $0 = $0      20 x $1 = $20
Actual 1     2 x $10 = $20       8 x $1 = $8

Total opportunity cost = $0 + $20 + $20 + $8 = $48

Average Misclassification Cost

$q_1$ = cost of misclassifying an actual C1 case as belonging to C2
$q_2$ = cost of misclassifying an actual C2 case as belonging to C1

Average misclassification cost $= \dfrac{q_1 n_{1,2} + q_2 n_{2,1}}{n}$

Look for a classifier that minimizes this average cost.
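A quick worked sketch tying this to the promotional-offer matrix above (note that only the two error cells count here; I am assuming q1 = $10 of lost profit per missed responder and q2 = $1 per wasted offer):

```python
def avg_misclassification_cost(n12, n21, n, q1, q2):
    """Average cost per record, counting only the misclassification cells."""
    return (q1 * n12 + q2 * n21) / n

# 2 responders missed (q1 = 10), 20 nonresponders mailed (q2 = 1), n = 1000
print(avg_misclassification_cost(2, 20, 1000, 10, 1))   # 0.04 dollars/record
```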

Generalize to Cost Ratio
Sometimes actual costs and benefits are hard to estimate

Need to express everything in terms of costs (i.e., cost of misclassification per record)

A good practical substitute for individual costs is the ratio of misclassification costs (e.g., “misclassifying fraudulent firms is 5 times worse than misclassifying solvent firms”)

Multiple Classes

Theoretically, there are m(m-1) misclassification costs, since any case from one of the m classes could be misclassified into any one of the m-1 other classes

- Practically too many to work with
- In a decision-making context, though, such complexity rarely arises: one class is usually of primary interest

For m classes, confusion matrix has m rows and m columns

Judging Ranking Performance

Lift Chart for Binary Data
- Input: scored validation dataset (actual class and propensity, i.e., probability of belonging to the class of interest C1)
- Sort records in descending order of propensity to belong to the class of interest
- Compute the cumulative number of C1 members for each row
- Lift chart is the plot with row number (no. of records) on the x-axis and cumulative number of C1 members on the y-axis (sketched below)
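A minimal sketch of this construction (NumPy assumed; the short arrays are illustrative, while the full 24-record example follows):

```python
import numpy as np

def cumulative_gains(actual, propensity):
    """Sort by propensity (descending) and accumulate class-of-interest counts."""
    order = np.argsort(-np.asarray(propensity))
    sorted_actual = np.asarray(actual)[order]
    return np.cumsum(sorted_actual)            # y-values of the lift chart

actual = [1, 1, 1, 0, 1, 0]
propensity = [0.99, 0.95, 0.76, 0.62, 0.51, 0.34]
print(cumulative_gains(actual, propensity))    # [1 2 3 3 4 4]
```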

Lift Chart for Binary Data – Example

Case No.   Propensity (predicted prob.    Actual   Cumulative
           of belonging to class “1”)     class    actual classes
   1            0.9959767                    1          1
   2            0.9875331                    1          2
   3            0.9844564                    1          3
   4            0.9804396                    1          4
   5            0.9481164                    1          5
   6            0.8892972                    1          6
   7            0.8476319                    1          7
   8            0.7628063                    0          7
   9            0.7069919                    1          8
  10            0.6807541                    1          9
  11            0.6563437                    1         10
  12            0.6224195                    0         10
  13            0.5055069                    1         11
  14            0.4713405                    0         11
  15            0.3371174                    0         11
  16            0.2179678                    1         12
  17            0.1992404                    0         12
  18            0.1494827                    0         12
  19            0.0479626                    0         12
  20            0.0383414                    0         12
  21            0.0248510                    0         12
  22            0.0218060                    0         12
  23            0.0161299                    0         12
  24            0.0035600                    0         12

Lift Chart with Costs and Benefits
- Sort records in descending order of probability of success (success = belonging to the class of interest)
- For each record, compute the cost/benefit associated with the actual outcome
- Compute a column of cumulative cost/benefit
- Plot cumulative cost/benefit on the y-axis against row number (no. of records) on the x-axis (sketched below)
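A sketch of this variant, with default values matching the example that follows (+$10 for a “1”, -$1 for a “0”; NumPy assumed):

```python
import numpy as np

def cumulative_cost_benefit(actual, propensity, benefit=10, cost=-1):
    """Sort by propensity (high to low), score each record by actual outcome,
    then accumulate; defaults match the example table below."""
    order = np.argsort(-np.asarray(propensity))
    outcomes = np.asarray(actual)[order]
    values = np.where(outcomes == 1, benefit, cost)
    return np.cumsum(values)
```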

Lift Chart with Cost/Benefit – Example

Case No.   Propensity   Actual class   Cost/Benefit   Cumulative Cost/Benefit
   1       0.9959767        1              10               10
   2       0.9875331        1              10               20
   3       0.9844564        1              10               30
   4       0.9804396        1              10               40
   5       0.9481164        1              10               50
   6       0.8892972        1              10               60
   7       0.8476319        1              10               70
   8       0.7628063        0              -1               69
   9       0.7069919        1              10               79
  10       0.6807541        1              10               89
  11       0.6563437        1              10               99
  12       0.6224195        0              -1               98
  13       0.5055069        1              10              108
  14       0.4713405        0              -1              107
  15       0.3371174        0              -1              106
  16       0.2179678        1              10              116
  17       0.1992404        0              -1              115
  18       0.1494827        0              -1              114
  19       0.0479626        0              -1              113
  20       0.0383414        0              -1              112
  21       0.0248510        0              -1              111
  22       0.0218060        0              -1              110
  23       0.0161299        0              -1              109
  24       0.0035600        0              -1              108

Lift Curve May Go Negative
- If the total net benefit from all cases is negative, the reference line will have a negative slope
- Nonetheless, the goal is still to use the cutoff to select the point where net benefit is at a maximum

Negative slope to reference curve

Oversampling and Asymmetric Costs

Rare Cases
Examples of rare but important classes:
- Responder to a mailing
- Someone who commits fraud
- Debt defaulter

Often we oversample rare cases to give the model more information to work with. Typically we use 50% “1” and 50% “0” for training. Asymmetric costs/benefits typically go hand in hand with the presence of a rare but important class.

Example
The following graphs show optimal classification under three scenarios:
1. assuming equal costs of misclassification
2. assuming that misclassifying “o” is five times the cost of misclassifying “x”
3. an oversampling scheme allowing DM methods to incorporate asymmetric costs

Classification: equal costs

Classification: Unequal Costs
Suppose that failing to catch “o” is 5 times as costly as failing to catch “x”.

Oversampling for asymmetric costs

Oversample “o” to appropriately weight misclassification costs, with or without replacement.

Equal Number of Responders
Sample an equal number of responders and non-responders.

An Oversampling Procedure
1. Separate the responders (rare) from the non-responders
2. Randomly assign half the responders to the training sample, plus an equal number of non-responders
3. Remaining responders go to the validation sample
4. Add non-responders to the validation data, to maintain the original ratio of responders to non-responders
5. Randomly take the test set (if needed) from the validation data (see the sketch below)
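A minimal sketch of steps 1-4 over plain Python lists of record IDs (the function name and details are my own, not from the book):

```python
import random

def oversample_split(responders, nonresponders, seed=1):
    """Build a 50/50 training sample; validation keeps the original ratio."""
    rng = random.Random(seed)
    resp, nonresp = responders[:], nonresponders[:]   # step 1: already separated
    rng.shuffle(resp)
    rng.shuffle(nonresp)

    half = len(resp) // 2
    train = resp[:half] + nonresp[:half]      # step 2: half the 1's + equal 0's
    valid_resp = resp[half:]                  # step 3: remaining responders

    ratio = len(nonresponders) / len(responders)   # original 0's per 1
    k = int(round(ratio * len(valid_resp)))        # step 4 (assumes enough 0's remain)
    valid = valid_resp + nonresp[half:half + k]
    return train, valid          # step 5: split a test set off valid if needed
```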

Assessing Model Performance
1. Score the model with a validation dataset selected without oversampling
2. Score the model with an oversampled validation dataset and reweight the results to remove the effects of oversampling

Method 1 is straightforward and easier to implement.

Adjusting the Confusion Matrix for Oversampling

                 Whole data   Sample
Responders           2%        50%
Nonresponders       98%        50%

Example:
- One responder in the whole data corresponds to 50/2 = 25 responders in the sample
- One nonresponder in the whole data corresponds to 50/98 = 0.5102 nonresponders in the sample

Adjusting the Confusion Matrix for Oversampling

Suppose that the confusion matrix with the oversampled validation dataset is as follows:

Classification matrix, oversampled data (validation)

             Predicted 0   Predicted 1   Total
Actual 0         390           110        500
Actual 1          80           420        500
Total            470           530       1000

Misclassification rate = (80 + 110)/1000 = 0.19 or 19%
Percentage of records predicted as “1” = 530/1000 = 0.53 or 53%

Adjusting the Confusion Matrix for Oversampling

Weight for responders (Actual 1) = 25
Weight for nonresponders (Actual 0) = 0.5102

Classification matrix, reweighted

             Predicted 0            Predicted 1            Total
Actual 0     390/0.5102 = 764.4     110/0.5102 = 215.6      980
Actual 1     80/25 = 3.2            420/25 = 16.8            20
Total        767.6                  232.4                  1000

(Note that the reweighted Actual 1 row totals 20, matching the 2% response rate in the whole data.)

Misclassification rate = (3.2 + 215.6)/1000 = 0.219 or 21.9%
Percentage of records predicted as “1” = 232.4/1000 = 0.2324 or 23.24%
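A small sketch of the reweighting that reproduces the numbers above (NumPy assumed):

```python
import numpy as np

# Oversampled validation confusion matrix: rows = actual 0/1, cols = predicted 0/1
cm = np.array([[390.0, 110.0],
               [ 80.0, 420.0]])

w_nonresp = 50 / 98    # 0.5102: sample nonresponders per whole-data nonresponder
w_resp = 50 / 2        # 25: sample responders per whole-data responder

reweighted = np.vstack([cm[0] / w_nonresp,   # Actual 0 row: divide by 0.5102
                        cm[1] / w_resp])     # Actual 1 row: divide by 25
print(reweighted)                            # [[764.4 215.6] [  3.2  16.8]]
print((reweighted[1, 0] + reweighted[0, 1]) / reweighted.sum())  # ~0.219
```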

Adjusting the Lift Curve for Oversampling
- Sort records in descending order of probability of success (success = belonging to the class of interest)
- For each record, compute the cost/benefit associated with the actual outcome
- Divide that value by the oversampling rate of the actual class
- Compute a cumulative column of the weighted cost/benefit
- Plot the cumulative weighted cost/benefit on the y-axis against row number (no. of records) on the x-axis (sketched below)
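A sketch of the weighted version, using the values and weights from the example that follows (50/20 = 2.50 for a “1”, -3/0.6 = -5.00 for a “0”; NumPy assumed):

```python
import numpy as np

def weighted_cumulative_cost_benefit(actual, propensity, value, weight):
    """Divide each record's cost/benefit by its class's oversampling weight,
    then accumulate in descending order of propensity."""
    order = np.argsort(-np.asarray(propensity))
    outcomes = np.asarray(actual)[order]
    scores = np.array([value[a] / weight[a] for a in outcomes])
    return np.cumsum(scores)

cum = weighted_cumulative_cost_benefit(
    actual=[1, 1, 0, 0], propensity=[0.99, 0.76, 0.62, 0.03],
    value={1: 50, 0: -3}, weight={1: 20, 0: 0.6})
print(cum)    # [ 2.5  5.   0.  -5. ]
```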

Adjusting Lift curve for Oversampling -- Example

Suppose that the cost/benefit values and oversampling weights are as follows:

             Cost/Benefit   Oversampling weight
Actual 0         -3                0.6
Actual 1         50                 20

Case No.   Propensity   Actual class   Weighted Cost/Benefit   Cumulative weighted Cost/Benefit
   1       0.9959767        1               2.50                  2.50
   2       0.9875331        1               2.50                  5.00
   3       0.9844564        1               2.50                  7.50
   4       0.9804396        1               2.50                 10.00
   5       0.9481164        1               2.50                 12.50
   6       0.8892972        1               2.50                 15.00
   7       0.8476319        1               2.50                 17.50
   8       0.7628063        0              -5.00                 12.50
   9       0.7069919        1               2.50                 15.00
  10       0.6807541        1               2.50                 17.50
  11       0.6563437        1               2.50                 20.00
  12       0.6224195        0              -5.00                 15.00
  13       0.5055069        1               2.50                 17.50
  14       0.4713405        0              -5.00                 12.50
  15       0.3371174        0              -5.00                  7.50
  16       0.2179678        1               2.50                 10.00
  17       0.1992404        0              -5.00                  5.00
  18       0.1494827        0              -5.00                  0.00
  19       0.0479626        0              -5.00                 -5.00
  20       0.0383414        0              -5.00                -10.00
  21       0.0248510        0              -5.00                -15.00
  22       0.0218060        0              -5.00                -20.00
  23       0.0161299        0              -5.00                -25.00
  24       0.0035600        0              -5.00                -30.00
