evaluation and credibility-Part 2
Page 1: evaluation and credibility-Part 2

Tilani Gunawardena

Machine Learning and Data Mining

Evaluation and Credibility

Page 2: evaluation and credibility-Part 2

Outline

• Introduction
• Train, Test and Validation sets
• Evaluation on Large data / Unbalanced data
• Evaluation on Small data
  – Cross validation
  – Bootstrap
• Comparing data mining schemes
  – Significance test
  – Lift Chart / ROC curve
• Numeric Prediction Evaluation

Page 3: evaluation and credibility-Part 2

Model’s Evaluation in the KDD Process

Page 4: evaluation and credibility-Part 2

How to Estimate the Metrics?

• We can use:
  – Training data;
  – Independent test data;
  – Hold-out method;
  – k-fold cross-validation method;
  – Leave-one-out method;
  – Bootstrap method;
  – And many more…

Page 5: evaluation and credibility-Part 2

Estimation with Training Data

• The accuracy/error estimates on the training data are not good indicators of performance on future data.
  – Q: Why?
  – A: Because new data will probably not be exactly the same as the training data!
• The accuracy/error estimates on the training data measure the degree of the classifier's overfitting.

[Diagram: Training set → Classifier → evaluated on the same Training set]

Page 6: evaluation and credibility-Part 2

Estimation with Independent Test Data

• Estimation with independent test data is used when we have plenty of data and there is a natural way of forming training and test data.
• For example: Quinlan in 1987 reported experiments in a medical domain for which the classifiers were trained on data from 1985 and tested on data from 1986.

[Diagram: Training set → Classifier → evaluated on a separate Test set]

Page 7: evaluation and credibility-Part 2

Hold-out Method

• The hold-out method splits the data into training data and test data (usually 2/3 for training, 1/3 for testing). We then build a classifier using the training data and test it using the test data.
• The hold-out method is usually used when we have thousands of instances, including several hundred instances from each class.

[Diagram: Data is split into a Training set (used to build the Classifier) and a Test set (used to evaluate it)]
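To make the hold-out procedure concrete, here is a minimal sketch assuming scikit-learn is available; X and y are hypothetical feature and label arrays, and the decision tree is just a placeholder classifier.

```python
# Minimal hold-out sketch (X, y are hypothetical NumPy arrays; assumes scikit-learn)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 2/3 for training, 1/3 for testing; stratify keeps class proportions similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=42)

clf = DecisionTreeClassifier().fit(X_train, y_train)
print("hold-out accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```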

Page 8: evaluation and credibility-Part 2

Classification: Train, Validation, Test Split

[Diagram: Data with known results (labels) is split three ways — a Training set fed to the Model Builder / Classifier Builder, a Validation set used to evaluate and tune the classifier, and a Final Test Set used only for the Final Evaluation of its predictions.]

The test data can't be used for parameter tuning!

Page 9: evaluation and credibility-Part 2

k-Fold Cross-Validation

• k-fold cross-validation avoids overlapping test sets:
  – First step: the data is split into k subsets of equal size;
  – Second step: each subset in turn is used for testing and the remainder for training.
• The estimates are averaged to yield an overall estimate.

[Diagram: the Data is split into folds; in each round one fold is the test set and the remaining folds form the training set for the Classifier, e.g. train/train/test, train/test/train, test/train/train.]
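A minimal k-fold cross-validation sketch (again assuming scikit-learn, with hypothetical X, y and a placeholder classifier); each fold serves once as the test set and the per-fold accuracies are averaged.

```python
# k-fold cross-validation sketch (assumes scikit-learn; X, y are hypothetical arrays)
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

kf = KFold(n_splits=10, shuffle=True, random_state=42)           # k = 10 folds
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=kf)  # one accuracy per fold
print("per-fold accuracy:", scores)
print("overall estimate :", np.mean(scores))
```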

Page 10: evaluation and credibility-Part 2

Example

Collect data from the real world (photographs and labels)

Page 11: evaluation and credibility-Part 2
Page 12: evaluation and credibility-Part 2

Method 1: Training Process

Page 13: evaluation and credibility-Part 2

Giving students the answers before giving them the exam

Page 14: evaluation and credibility-Part 2

Method 2

Page 15: evaluation and credibility-Part 2
Page 16: evaluation and credibility-Part 2
Page 17: evaluation and credibility-Part 2
Page 18: evaluation and credibility-Part 2

Cross Validation Error

Page 19: evaluation and credibility-Part 2
Page 20: evaluation and credibility-Part 2
Page 21: evaluation and credibility-Part 2

Method 3

Page 22: evaluation and credibility-Part 2
Page 23: evaluation and credibility-Part 2
Page 24: evaluation and credibility-Part 2
Page 25: evaluation and credibility-Part 2
Page 26: evaluation and credibility-Part 2
Page 27: evaluation and credibility-Part 2
Page 28: evaluation and credibility-Part 2
Page 29: evaluation and credibility-Part 2
Page 30: evaluation and credibility-Part 2
Page 31: evaluation and credibility-Part 2
Page 32: evaluation and credibility-Part 2

If the world happens to be well represented by our dataset

Page 33: evaluation and credibility-Part 2

• Model Selection
• Evaluating our selection method

CV

Page 34: evaluation and credibility-Part 2


The Bootstrap

• CV uses sampling without replacement
  – The same instance, once selected, cannot be selected again for a particular training/test set
• The bootstrap uses sampling with replacement to form the training set
  – Sample a dataset of n instances n times with replacement to form a new dataset of n instances
  – Use this data as the training set
  – Use the instances from the original dataset that don't occur in the new training set for testing
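The sampling step might be sketched with NumPy as follows; X and y are hypothetical NumPy arrays of n instances, and the out-of-bag instances form the test set.

```python
# Bootstrap sampling sketch: train on an n-sized sample drawn with replacement,
# test on the instances that were never drawn ("out-of-bag"). X, y are hypothetical arrays.
import numpy as np

rng = np.random.default_rng(0)
n = len(X)
train_idx = rng.integers(0, n, size=n)              # n draws with replacement
oob_idx = np.setdiff1d(np.arange(n), train_idx)     # instances not drawn at all

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[oob_idx], y[oob_idx]             # ~36.8% of instances on average
```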

Page 35: evaluation and credibility-Part 2

Example

• Draw a sample of the same size N (with replacement)
• e.g. N = 4, M = 3
• e.g. N = 150, M = 5000
• This gives M = 5000 means of random samples of X

Page 36: evaluation and credibility-Part 2


The 0.632 Bootstrap

• A particular instance has a probability of 1 − 1/n of not being picked in a single draw
• Thus its probability of ending up in the test data (i.e. never being picked in n draws) is:

  (1 − 1/n)^n ≈ e^(−1) ≈ 0.368

• This means the training data will contain approximately 63.2% of the (distinct) instances
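As a quick sanity check of the 63.2% figure, the following illustrative snippet (not from the slides) compares the fraction of distinct instances in a simulated bootstrap sample against 1 − (1 − 1/n)^n for a few values of n.

```python
# Quick numeric check of the 0.632 result (illustrative only)
import numpy as np

rng = np.random.default_rng(0)
for n in (10, 100, 10_000):
    sample = rng.integers(0, n, size=n)             # one bootstrap sample of size n
    frac_distinct = len(np.unique(sample)) / n      # fraction of original instances seen
    print(n, round(frac_distinct, 3), "theory:", round(1 - (1 - 1/n) ** n, 3))
```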

Page 37: evaluation and credibility-Part 2


Estimating Error with the Bootstrap

• The error estimate on the test data will be very pessimistic
  – the classifier is trained on just ~63% of the instances
• Therefore, combine it with the resubstitution error:

  err = 0.632 × e(test instances) + 0.368 × e(training instances)

• The resubstitution error gets less weight than the error on the test data
• Repeat the process several times with different replacement samples; average the results
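Putting the pieces together, one possible sketch of the repeated 0.632 bootstrap estimate (hypothetical X, y arrays and a placeholder scikit-learn classifier; B resamples are averaged):

```python
# 0.632 bootstrap error estimate sketch: average over B resamples of
# 0.632 * out-of-bag error + 0.368 * resubstitution error.  X, y are hypothetical arrays.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bootstrap632_error(X, y, B=50, seed=0):
    rng = np.random.default_rng(seed)
    n, errors = len(X), []
    for _ in range(B):
        train = rng.integers(0, n, size=n)
        oob = np.setdiff1d(np.arange(n), train)
        clf = DecisionTreeClassifier().fit(X[train], y[train])
        e_test = np.mean(clf.predict(X[oob]) != y[oob])       # error on out-of-bag instances
        e_train = np.mean(clf.predict(X[train]) != y[train])  # resubstitution error
        errors.append(0.632 * e_test + 0.368 * e_train)
    return np.mean(errors)
```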

Page 38: evaluation and credibility-Part 2


More on the Bootstrap

• Probably the best way of estimating performance for very small datasets
• However, it has some problems
  – Consider a completely random dataset with two classes of equal size. The true error rate is 50% for any prediction rule.
  – A classifier that simply memorizes the training data achieves 0% resubstitution error and ~50% error on the test data
  – Bootstrap estimate for this classifier:
    err = 0.632 × 50% + 0.368 × 0% = 31.6%
  – True expected error: 50%

Page 39: evaluation and credibility-Part 2

• The bootstrap is also a straightforward way to derive estimates of standard errors and confidence intervals for complex estimators of complex parameters of the distribution

Page 40: evaluation and credibility-Part 2


Evaluation Summary:

• Use Train, Test, Validation sets for “LARGE” data

• Balance "un-balanced" data
• Use cross-validation for middle-size/small data
• Use the leave-one-out and bootstrap methods for small data
• Don't use test data for parameter tuning - use separate validation data

Page 41: evaluation and credibility-Part 2

Agenda

• Quantifying learner performance
  – Cross validation
  – Error vs. loss
  – Precision & recall

• Model selection

Page 42: evaluation and credibility-Part 2

Accuracy Vs Precision

Accuracy refers to the closeness of a measurement or estimate to the TRUE value.

Precision (the inverse of variance) refers to the degree of agreement among a series of measurements.

Page 43: evaluation and credibility-Part 2

Precision Vs Recall

precision: Percentage of retrieved documents that are relevant.

recall: Percentage of relevant documents that are returned.

Page 44: evaluation and credibility-Part 2

Scenario

• We use a dataset with known classes to build a model
• We use another dataset with known classes to evaluate the model (this dataset could be part of the original dataset)
• We compare/count the predicted classes against the actual classes

Page 45: evaluation and credibility-Part 2

Confusion Matrix

• A confusion matrix shows the number of correct and incorrect predictions made by the classification model compared to the actual outcomes (target values) in the data
• The matrix is N×N, where N is the number of target values (classes)
• The performance of such models is commonly evaluated using the data in the matrix

Page 46: evaluation and credibility-Part 2

Two Types of Error

False negative ("miss"), FN: the alarm doesn't sound but the person is carrying metal

False positive ("false alarm"), FP: the alarm sounds but the person is not carrying metal

Page 47: evaluation and credibility-Part 2

How to evaluate the Classifier’s Generalization Performance?

• Assume that we test a classifier on some test set and derive at the end the following confusion matrix (two-class)
• Also called a contingency table

                        Predicted class
                        Pos     Neg
  Actual class   Pos    TP      FN      (P)
                 Neg    FP      TN      (N)

Page 48: evaluation and credibility-Part 2

Measures in Two-Class Classification

Page 49: evaluation and credibility-Part 2

Example:

1) How many images of Gerhard Schroeder are in the data set?
2) How many predictions of Gerhard Schroeder are there?
3) What is the probability that Hugo Chavez is classified correctly by our learning algorithm?
4) Your learning algorithm predicted/classified an image as Hugo Chavez. What is the probability it is actually Hugo Chavez?
5) Recall("Hugo Chavez") =
6) Precision("Hugo Chavez") =
7) Recall("Colin Powell") =
8) Precision("Colin Powell") =
9) Recall("George W Bush") =
10) Precision("George W Bush") =

Page 50: evaluation and credibility-Part 2

1) True Positive ("Tony Blair") =
2) False Positive ("Tony Blair") =
3) False Negative ("Tony Blair") =
4) True Positive ("Donald Rumsfeld") =
5) False Positive ("Donald Rumsfeld") =
6) False Negative ("Donald Rumsfeld") =

Page 51: evaluation and credibility-Part 2

Metrics for Classifier’s Evaluation

                        Predicted class
                        Pos     Neg
  Actual class   Pos    TP      FN      (P)
                 Neg    FP      TN      (N)

• Accuracy = (TP+TN)/(P+N)
• Error = (FP+FN)/(P+N)
• Precision = TP/(TP+FP)
• Recall / TP rate = TP/P
• FP Rate = FP/N
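These formulas translate directly into code; below is a small illustrative computation using the counts of Classifier 1 from the next slide (TP=60, FN=40, FP=20, TN=80).

```python
# Two-class metrics from confusion-matrix counts (illustrative values)
TP, FN, FP, TN = 60, 40, 20, 80
P, N = TP + FN, FP + TN

accuracy  = (TP + TN) / (P + N)
error     = (FP + FN) / (P + N)
precision = TP / (TP + FP)
recall    = TP / P          # also the TP rate
fp_rate   = FP / N

print(accuracy, error, precision, recall, fp_rate)
```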

Page 52: evaluation and credibility-Part 2

Example: 3 classifiers

(rows = true class, columns = predicted class)

Classifier 1             Classifier 2             Classifier 3
      pos  neg                 pos  neg                 pos  neg
pos    60   40           pos    70   30           pos    40   60
neg    20   80           neg    50   50           neg    30   70

Classifier 1: TPR =      FPR =
Classifier 2: TPR =      FPR =
Classifier 3: TPR =      FPR =

Page 53: evaluation and credibility-Part 2

Example: 3 classifiers

(rows = true class, columns = predicted class)

Classifier 1             Classifier 2             Classifier 3
      pos  neg                 pos  neg                 pos  neg
pos    60   40           pos    70   30           pos    40   60
neg    20   80           neg    50   50           neg    30   70

Classifier 1: TPR = 0.6   FPR = 0.2
Classifier 2: TPR = 0.7   FPR = 0.5
Classifier 3: TPR = 0.4   FPR = 0.3

Page 54: evaluation and credibility-Part 2

Multiclass - Things to Notice

• The total number of test examples of any class is the sum of the corresponding row (i.e. the TP + FN for that class)
• The total number of FNs for a class is the sum of the values in the corresponding row (excluding the TP)
• The total number of FPs for a class is the sum of the values in the corresponding column (excluding the TP)
• The total number of TNs for a certain class is the sum of all columns and rows excluding that class's column and row

              Predicted
  Actual    A     B     C     D     E
    A      TPA   EAB   EAC   EAD   EAE
    B      EBA   TPB   EBC   EBD   EBE
    C      ECA   ECB   TPC   ECD   ECE
    D      EDA   EDB   EDC   TPD   EDE
    E      EEA   EEB   EEC   EED   TPE
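The row/column rules above can be expressed with NumPy; the sketch below uses the 3-class confusion matrix from the later example (rows = actual, columns = predicted) as illustrative input.

```python
# Per-class TP, FN, FP, TN from an N x N confusion matrix
# (rows = actual class, columns = predicted class)
import numpy as np

cm = np.array([[25,  5,  2],
               [ 3, 32,  4],
               [ 1,  0, 15]])

TP = np.diag(cm)                   # diagonal entries
FN = cm.sum(axis=1) - TP           # rest of each class's row
FP = cm.sum(axis=0) - TP           # rest of each class's column
TN = cm.sum() - TP - FN - FP       # everything outside that class's row and column
print(TP, FN, FP, TN)
```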

Page 55: evaluation and credibility-Part 2

              Predicted
  Actual    A     B     C     D     E
    A      TPA   EAB   EAC   EAD   EAE
    B      EBA   TPB   EBC   EBD   EBE
    C      ECA   ECB   TPC   ECD   ECE
    D      EDA   EDB   EDC   TPD   EDE
    E      EEA   EEB   EEC   EED   TPE

Page 56: evaluation and credibility-Part 2

Multi-class

              Predicted                          Predicted class
  Actual    A     B     C                            P     N
    A      TPA   EAB   EAC        Actual      P     TP    FN
    B      EBA   TPB   EBC        class       N     FP    TN
    C      ECA   ECB   TPC

One-vs-rest reductions (to be filled in):

              Predicted                Predicted                Predicted
  Actual    A    Not A      Actual    B    Not B      Actual    C    Not C
    A                         B                         C
    Not A                     Not B                     Not C

Page 57: evaluation and credibility-Part 2

Multi-class

              Predicted                          Predicted class
  Actual    A     B     C                            P     N
    A      TPA   EAB   EAC        Actual      P     TP    FN
    B      EBA   TPB   EBC        class       N     FP    TN
    C      ECA   ECB   TPC

One-vs-rest reduction for class A:
              Predicted
  Actual      A              Not A
    A        TPA             EAB + EAC
    Not A    EBA + ECA       TPB + EBC + ECB + TPC

For class B:
              Predicted
  Actual      B              Not B
    B        TPB             EBA + EBC
    Not B    EAB + ECB       TPA + EAC + ECA + TPC

For class C:
              Predicted
  Actual      C              Not C
    C        TPC             ECA + ECB
    Not C    EAC + EBC       TPA + EAB + EBA + TPB

Page 58: evaluation and credibility-Part 2

Example:

              Predicted
  Actual    A     B     C
    A       25     5     2
    B        3    32     4
    C        1     0    15

Overall Accuracy =
Precision A =
Recall B =

Page 59: evaluation and credibility-Part 2

Example:

              Predicted
  Actual    A     B     C
    A       25     5     2
    B        3    32     4
    C        1     0    15

Overall Accuracy = (25+32+15)/(25+5+2+3+32+4+1+0+15) = 72/87 ≈ 0.83
Precision A = 25/(25+3+1) = 25/29 ≈ 0.86
Recall B = 32/(3+32+4) = 32/39 ≈ 0.82
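The same numbers can be verified with a short NumPy check (illustrative only, not part of the original slides):

```python
# Verify the 3-class example: overall accuracy, Precision(A), Recall(B)
import numpy as np

cm = np.array([[25,  5,  2],    # actual A
               [ 3, 32,  4],    # actual B
               [ 1,  0, 15]])   # actual C

accuracy    = np.trace(cm) / cm.sum()       # (25+32+15)/87 ≈ 0.828
precision_A = cm[0, 0] / cm[:, 0].sum()     # 25/29 ≈ 0.862
recall_B    = cm[1, 1] / cm[1, :].sum()     # 32/39 ≈ 0.821
print(round(accuracy, 3), round(precision_A, 3), round(recall_B, 3))
```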

Page 60: evaluation and credibility-Part 2

Counting the Costs

• In practice, different types of classification errors often incur different costs
• Examples:
  – Terrorist profiling ("Not a terrorist" is correct 99.99% of the time)
  – Loan decisions
  – Fault diagnosis
  – Promotional mailing

Page 61: evaluation and credibility-Part 2

Cost Matrices

                        Hypothesized class
                        Pos        Neg
  True class     Pos    TP Cost    FN Cost
                 Neg    FP Cost    TN Cost

Usually, TP Cost and TN Cost are set equal to 0

Page 62: evaluation and credibility-Part 2

Lift Charts

• In practice, decisions are usually made by comparing possible scenarios, taking different costs into account.
• Example: Promotional mail-out to 1,000,000 households. If we mail to all households, we get a 0.1% response (1000).
• A data mining tool identifies:
  – a subset of 100,000 households with a 0.4% response (400); or
  – a subset of 400,000 households with a 0.2% response (800)
• Depending on the costs, we can make the final decision using lift charts!
• A lift chart allows a visual comparison for measuring model performance

Page 63: evaluation and credibility-Part 2

Generating a Lift Chart

• Given a scheme that outputs probabilities, sort the instances in descending order according to the predicted probability
• In a lift chart, the x axis is the sample size and the y axis is the number of true positives.

  Rank   Predicted Probability   Actual Class
  1      0.95                    Yes
  2      0.93                    Yes
  3      0.93                    No
  4      0.88                    Yes
  …      …                       …
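A sketch of this procedure in NumPy, using the four ranked instances from the table above (the arrays are just those example values):

```python
# Lift/gains data sketch: sort by predicted probability, accumulate true positives
import numpy as np

probs  = np.array([0.95, 0.93, 0.93, 0.88])   # predicted probabilities (rows 1-4 of the table)
actual = np.array([1,    1,    0,    1])      # actual class: 1 = Yes, 0 = No

order = np.argsort(-probs)                    # descending by predicted probability
sample_size = np.arange(1, len(probs) + 1)    # x axis: number of instances included
true_pos = np.cumsum(actual[order])           # y axis: cumulative true positives
print(list(zip(sample_size, true_pos)))       # (1, 1), (2, 2), (3, 2), (4, 3)
```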

Page 64: evaluation and credibility-Part 2

Gains Chart

Page 65: evaluation and credibility-Part 2

Example 01: Direct Marketing

• A company wants to do a mail marketing campaign
• It costs the company $1 for each item mailed
• They have information on 100,000 customers
• Create cumulative gains and lift charts from the following data
• Overall Response Rate: If we assume we have no model other than the prediction of the overall response rate, then we can predict the number of positive responses as a fraction of the total customers contacted
• Suppose the response rate is 20%
• If all 100,000 customers are contacted, we will receive around 20,000 positive responses

Page 66: evaluation and credibility-Part 2

  Cost ($)    Total Customers Contacted    Positive Responses
  100,000     100,000                      20,000

• Prediction of a Response Model: A response model predicts who will respond to a marketing campaign
• If we have a response model, we can make more detailed predictions
• For example, we use the response model to assign a score to all 100,000 customers and predict the results of contacting only the top 10,000 customers, the top 20,000 customers, etc.

  Cost ($)    Total Customers Contacted    Positive Responses
  10,000      10,000                        6,000
  20,000      20,000                       10,000
  30,000      30,000                       13,000
  40,000      40,000                       15,800
  50,000      50,000                       17,000
  60,000      60,000                       18,000
  70,000      70,000                       18,800
  80,000      80,000                       19,400
  90,000      90,000                       19,800
  100,000     100,000                      20,000

Page 67: evaluation and credibility-Part 2

Cumulative Gains Chart

• The y-axis shows the percentage of positive responses. This is a percentage of the total possible positive responses (20,000, as the overall response rate shows)
• The x-axis shows the percentage of customers contacted, which is a fraction of the 100,000 total customers
• Baseline (overall response rate): if we contact X% of customers then we will receive X% of the total positive responses
• Lift curve: using the predictions of the response model, calculate the percentage of positive responses for each percentage of customers contacted and map these points to create the lift curve

Page 68: evaluation and credibility-Part 2

  Cost ($)    Total Customers Contacted    Positive Responses
  10,000      10,000                        6,000
  20,000      20,000                       10,000
  30,000      30,000                       13,000
  40,000      40,000                       15,800
  50,000      50,000                       17,000
  60,000      60,000                       18,000
  70,000      70,000                       18,800
  80,000      80,000                       19,400
  90,000      90,000                       19,800
  100,000     100,000                      20,000
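From this table, the cumulative gains percentages and lift values can be computed as in the sketch below (plain NumPy; the 20,000 total responders and 100,000 customers come from the example):

```python
# Cumulative gains and lift from the direct-marketing table
import numpy as np

contacted = np.arange(10_000, 100_001, 10_000)          # top 10k, 20k, ..., 100k customers
responses = np.array([6000, 10000, 13000, 15800, 17000,
                      18000, 18800, 19400, 19800, 20000])

pct_contacted = contacted / 100_000 * 100    # x axis (% of customers contacted)
pct_responses = responses / 20_000 * 100     # cumulative gains (% of all responders)
lift = pct_responses / pct_contacted         # lift over the no-model baseline

for c, g, l in zip(pct_contacted, pct_responses, lift):
    print(f"{c:5.0f}%  gains={g:5.1f}%  lift={l:.2f}")   # e.g. 10% -> gains=30.0%, lift=3.00
```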

Page 69: evaluation and credibility-Part 2

Lift Chart

• Shows the actual lift.
• To plot the chart: calculate the points on the lift curve by determining the ratio between the result predicted by our model and the result using no model.
• Example: For contacting 10% of customers, using no model we should get 10% of the responders, and using the given model we get 30% of the responders. The y-value of the lift curve at 10% is therefore 30 / 10 = 3.

Page 70: evaluation and credibility-Part 2

Lift Chart

Cumulative gains and lift charts are a graphical representation of the advantage of using a predictive model to choose which customers to contact

Page 71: evaluation and credibility-Part 2

Example 2:

• Using the response model P(x)=100-AGE(x) for customer x and the data table shown below, construct the cumulative gains and lift charts.

Page 72: evaluation and credibility-Part 2

1. Calculate P(x) for each person x
2. Order the people according to rank P(x)
3. Calculate the percentage of total responses for each cutoff point

   Response Rate = Number of Responses / Total Number of Responses

  Total Customers Contacted    # of Responses    Response Rate
   2
   4
   6
   8
  10
  12
  14
  16
  18
  20

Page 73: evaluation and credibility-Part 2

1. Calculate P(x) for each person x
2. Order the people according to rank P(x)
3. Calculate the percentage of total responses for each cutoff point

   Response Rate = Number of Responses / Total Number of Responses

Page 74: evaluation and credibility-Part 2

Cumulative Gains vs Lift Chart

The lift curve and the baseline have the same values for 10%-20% and 90%-100%.

Page 75: evaluation and credibility-Part 2

ROC Curves

• ROC curves are similar to lift charts
  – Stands for "receiver operating characteristic"
  – Used in signal detection to show the tradeoff between hit rate and false alarm rate over a noisy channel
• Differences from the gains chart:
  – the x axis shows the percentage of false positives in the sample, rather than the sample size

Page 76: evaluation and credibility-Part 2

ROC Curve

Page 77: evaluation and credibility-Part 2


[Figure: score distributions of non-diseased cases and diseased cases, with a decision threshold between them.]

Page 78: evaluation and credibility-Part 2

ROC Curves and Analysis

(rows = true class, columns = predicted class)

Classifier 1             Classifier 2             Classifier 3
      pos  neg                 pos  neg                 pos  neg
pos    60   40           pos    70   30           pos    40   60
neg    20   80           neg    50   50           neg    30   70

Classifier 1: TPr = 0.6   FPr = 0.2
Classifier 2: TPr = 0.7   FPr = 0.5
Classifier 3: TPr = 0.4   FPr = 0.3

Page 79: evaluation and credibility-Part 2

ROC analysis

• True Positive Rate
  – TPR = TP / (TP+FN)
  – also called sensitivity
  – true abnormals called abnormal by the observer
• False Positive Rate
  – FPR = FP / (FP+TN)
• Specificity (TNR) = TN / (TN+FP)
  – true normals called normal by the observer
• FPR = 1 - specificity
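Assuming scikit-learn is available, an ROC curve can be obtained from true labels and predicted scores roughly as follows (y_true and y_score are hypothetical example arrays):

```python
# ROC curve sketch (assumes scikit-learn; y_true and y_score are hypothetical)
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = np.array([1, 1, 0, 1, 0, 0, 1, 0])                     # actual classes
y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.35, 0.1])   # predicted scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # FPR = 1 - specificity, TPR = sensitivity
print("AUC:", roc_auc_score(y_true, y_score))       # area under the ROC curve
```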

Page 80: evaluation and credibility-Part 2
Page 81: evaluation and credibility-Part 2

Evaluating classifiers (via their ROC curves)

Classifier A can’t distinguish between normal and abnormal.

B is better but makes some mistakes.

C makes very few mistakes.

Page 82: evaluation and credibility-Part 2

“Perfect” means no false positives and no false negatives.

Page 83: evaluation and credibility-Part 2

Quiz 4:

1) How many images of Gerhard Schroeder are in the data set?
2) How many predictions of Gerhard Schroeder are there?
3) What is the probability that Hugo Chavez is classified correctly by our learning algorithm?
4) Your learning algorithm predicted/classified an image as Hugo Chavez. What is the probability it is actually Hugo Chavez?
5) Recall("Hugo Chavez") =
6) Precision("Hugo Chavez") =
7) Recall("Colin Powell") =
8) Precision("Colin Powell") =
9) Recall("George W Bush") =
10) Precision("George W Bush") =

