evaluation and credibility-Part 2
Page 1: evaluation and credibility-Part 2

Tilani Gunawardena

Machine Learning and Data Mining

Evaluation and Credibility

Page 2: evaluation and credibility-Part 2

Outline

• Introduction
• Train, Test and Validation sets
• Evaluation on Large data / Unbalanced data
• Evaluation on Small data
  – Cross validation
  – Bootstrap
• Comparing data mining schemes
  – Significance test
  – Lift Chart / ROC curve
• Numeric Prediction Evaluation

Page 3: evaluation and credibility-Part 2

Model’s Evaluation in the KDD Process

Page 4: evaluation and credibility-Part 2

How to Estimate the Metrics?

• We can use:
  – Training data;
  – Independent test data;
  – Hold-out method;
  – k-fold cross-validation method;
  – Leave-one-out method;
  – Bootstrap method;
  – And many more…

Page 5: evaluation and credibility-Part 2

Estimation with Training Data

• The accuracy/error estimates on the training data are not good indicators of performance on future data.
  – Q: Why?
  – A: Because new data will probably not be exactly the same as the training data!
• The accuracy/error estimates on the training data measure the degree of the classifier's overfitting.

[Diagram: Training set → Classifier → evaluated on the same Training set]

Page 6: evaluation and credibility-Part 2

Estimation with Independent Test Data

• Estimation with independent test data is used when we have plenty of data and there is a natural way of forming training and test data.
• For example: Quinlan in 1987 reported experiments in a medical domain for which the classifiers were trained on data from 1985 and tested on data from 1986.

[Diagram: Training set → Classifier → evaluated on a separate Test set]

Page 7: evaluation and credibility-Part 2

Hold-out Method

• The hold-out method splits the data into training data and test data (usually 2/3 for training, 1/3 for testing). We then build a classifier using the training data and test it using the test data.
• The hold-out method is usually used when we have thousands of instances, including several hundred instances from each class.

[Diagram: Data is split into a Training set (used to build the Classifier) and a Test set (used to evaluate it)]
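To make the hold-out procedure concrete, here is a minimal sketch assuming scikit-learn is available; X and y are hypothetical feature and label arrays, and the decision tree is just a placeholder classifier.

```python
# Minimal hold-out sketch (X, y are hypothetical NumPy arrays; assumes scikit-learn)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 2/3 for training, 1/3 for testing; stratify keeps class proportions similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=42)

clf = DecisionTreeClassifier().fit(X_train, y_train)
print("hold-out accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```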

Page 8: evaluation and credibility-Part 2

Classification: Train, Validation, Test Split

[Diagram: Data with known results (labels) is split three ways — a Training set fed to the Model Builder / Classifier Builder, a Validation set used to evaluate and tune the classifier, and a Final Test Set used only for the Final Evaluation of its predictions.]

The test data can't be used for parameter tuning!

Page 9: evaluation and credibility-Part 2

k-Fold Cross-Validation

• k-fold cross-validation avoids overlapping test sets:
  – First step: the data is split into k subsets of equal size;
  – Second step: each subset in turn is used for testing and the remainder for training.
• The estimates are averaged to yield an overall estimate.

[Diagram: the Data is split into folds; in each round one fold is the test set and the remaining folds form the training set for the Classifier, e.g. train/train/test, train/test/train, test/train/train.]
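A minimal k-fold cross-validation sketch (again assuming scikit-learn, with hypothetical X, y and a placeholder classifier); each fold serves once as the test set and the per-fold accuracies are averaged.

```python
# k-fold cross-validation sketch (assumes scikit-learn; X, y are hypothetical arrays)
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

kf = KFold(n_splits=10, shuffle=True, random_state=42)           # k = 10 folds
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=kf)  # one accuracy per fold
print("per-fold accuracy:", scores)
print("overall estimate :", np.mean(scores))
```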

Page 10: evaluation and credibility-Part 2

Example

Collect data from the real world (photographs and labels)

Page 11: evaluation and credibility-Part 2
Page 12: evaluation and credibility-Part 2

Method 1: Training Process

Page 13: evaluation and credibility-Part 2

Giving students the answers before giving them the exam

Page 14: evaluation and credibility-Part 2

Method 2

Page 15: evaluation and credibility-Part 2
Page 16: evaluation and credibility-Part 2
Page 17: evaluation and credibility-Part 2
Page 18: evaluation and credibility-Part 2

Cross Validation Error

Page 19: evaluation and credibility-Part 2
Page 20: evaluation and credibility-Part 2
Page 21: evaluation and credibility-Part 2

Method 3

Page 22: evaluation and credibility-Part 2
Page 23: evaluation and credibility-Part 2
Page 24: evaluation and credibility-Part 2
Page 25: evaluation and credibility-Part 2
Page 26: evaluation and credibility-Part 2
Page 27: evaluation and credibility-Part 2
Page 28: evaluation and credibility-Part 2
Page 29: evaluation and credibility-Part 2
Page 30: evaluation and credibility-Part 2
Page 31: evaluation and credibility-Part 2
Page 32: evaluation and credibility-Part 2

If the world happens to be well represented by our dataset

Page 33: evaluation and credibility-Part 2

• Model Selection
• Evaluating our selection method

CV

Page 34: evaluation and credibility-Part 2


The Bootstrap

• CV uses sampling without replacement
  – The same instance, once selected, cannot be selected again for a particular training/test set
• The bootstrap uses sampling with replacement to form the training set
  – Sample a dataset of n instances n times with replacement to form a new dataset of n instances
  – Use this data as the training set
  – Use the instances from the original dataset that don't occur in the new training set for testing
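The sampling step might be sketched with NumPy as follows; X and y are hypothetical NumPy arrays of n instances, and the out-of-bag instances form the test set.

```python
# Bootstrap sampling sketch: train on an n-sized sample drawn with replacement,
# test on the instances that were never drawn ("out-of-bag"). X, y are hypothetical arrays.
import numpy as np

rng = np.random.default_rng(0)
n = len(X)
train_idx = rng.integers(0, n, size=n)              # n draws with replacement
oob_idx = np.setdiff1d(np.arange(n), train_idx)     # instances not drawn at all

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[oob_idx], y[oob_idx]             # ~36.8% of instances on average
```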

Page 35: evaluation and credibility-Part 2

Example

• Draw a sample of the same size N (with replacement)
• e.g. N = 4, M = 3
• e.g. N = 150, M = 5000
• This gives M = 5000 means of random samples of X

Page 36: evaluation and credibility-Part 2


The 0.632 Bootstrap

• A particular instance has a probability of 1 − 1/n of not being picked in a single draw
• Thus its probability of ending up in the test data (i.e. never being picked in n draws) is:

  (1 − 1/n)^n ≈ e^(−1) ≈ 0.368

• This means the training data will contain approximately 63.2% of the (distinct) instances
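As a quick sanity check of the 63.2% figure, the following illustrative snippet (not from the slides) compares the fraction of distinct instances in a simulated bootstrap sample against 1 − (1 − 1/n)^n for a few values of n.

```python
# Quick numeric check of the 0.632 result (illustrative only)
import numpy as np

rng = np.random.default_rng(0)
for n in (10, 100, 10_000):
    sample = rng.integers(0, n, size=n)             # one bootstrap sample of size n
    frac_distinct = len(np.unique(sample)) / n      # fraction of original instances seen
    print(n, round(frac_distinct, 3), "theory:", round(1 - (1 - 1/n) ** n, 3))
```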

Page 37: evaluation and credibility-Part 2


Estimating Error with the Bootstrap

• The error estimate on the test data will be very pessimistic
  – the classifier is trained on just ~63% of the instances
• Therefore, combine it with the resubstitution error:

  err = 0.632 × e(test instances) + 0.368 × e(training instances)

• The resubstitution error gets less weight than the error on the test data
• Repeat the process several times with different replacement samples; average the results
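Putting the pieces together, one possible sketch of the repeated 0.632 bootstrap estimate (hypothetical X, y arrays and a placeholder scikit-learn classifier; B resamples are averaged):

```python
# 0.632 bootstrap error estimate sketch: average over B resamples of
# 0.632 * out-of-bag error + 0.368 * resubstitution error.  X, y are hypothetical arrays.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bootstrap632_error(X, y, B=50, seed=0):
    rng = np.random.default_rng(seed)
    n, errors = len(X), []
    for _ in range(B):
        train = rng.integers(0, n, size=n)
        oob = np.setdiff1d(np.arange(n), train)
        clf = DecisionTreeClassifier().fit(X[train], y[train])
        e_test = np.mean(clf.predict(X[oob]) != y[oob])       # error on out-of-bag instances
        e_train = np.mean(clf.predict(X[train]) != y[train])  # resubstitution error
        errors.append(0.632 * e_test + 0.368 * e_train)
    return np.mean(errors)
```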

Page 38: evaluation and credibility-Part 2


More on the Bootstrap

• Probably the best way of estimating performance for very small datasets
• However, it has some problems
  – Consider a completely random dataset with two classes of equal size. The true error rate is 50% for any prediction rule.
  – A classifier that simply memorizes the training data achieves 0% resubstitution error and ~50% error on the test data
  – Bootstrap estimate for this classifier:
    err = 0.632 × 50% + 0.368 × 0% = 31.6%
  – True expected error: 50%

Page 39: evaluation and credibility-Part 2

• The bootstrap is also a straightforward way to derive estimates of standard errors and confidence intervals for complex estimators of complex parameters of the distribution

Page 40: evaluation and credibility-Part 2


Evaluation Summary:

• Use Train, Test, Validation sets for “LARGE” data

• Balance "un-balanced" data
• Use cross-validation for middle-size/small data
• Use the leave-one-out and bootstrap methods for small data
• Don't use test data for parameter tuning - use separate validation data

Page 41: evaluation and credibility-Part 2

Agenda

• Quantifying learner performance
  – Cross validation
  – Error vs. loss
  – Precision & recall

• Model selection

Page 42: evaluation and credibility-Part 2

Accuracy Vs Precision

Accuracy refers to the closeness of a measurement or estimate to the TRUE value.

Precision (the inverse of variance) refers to the degree of agreement among a series of measurements.

Page 43: evaluation and credibility-Part 2

Precision Vs Recall

precision: Percentage of retrieved documents that are relevant.

recall: Percentage of relevant documents that are returned.

Page 44: evaluation and credibility-Part 2

Scenario

• We use a dataset with known classes to build a model
• We use another dataset with known classes to evaluate the model (this dataset could be part of the original dataset)
• We compare/count the predicted classes against the actual classes

Page 45: evaluation and credibility-Part 2

Confusion Matrix

• A confusion matrix shows the number of correct and incorrect predictions made by the classification model compared to the actual outcomes (target values) in the data
• The matrix is N×N, where N is the number of target values (classes)
• The performance of such models is commonly evaluated using the data in the matrix

Page 46: evaluation and credibility-Part 2

Two Types of Error

False negative ("miss"), FN: the alarm doesn't sound but the person is carrying metal

False positive ("false alarm"), FP: the alarm sounds but the person is not carrying metal

Page 47: evaluation and credibility-Part 2

How to evaluate the Classifier’s Generalization Performance?

• Assume that we test a classifier on some test set and derive at the end the following confusion matrix (two-class)
• Also called a contingency table

                        Predicted class
                        Pos     Neg
  Actual class   Pos    TP      FN      (P)
                 Neg    FP      TN      (N)

Page 48: evaluation and credibility-Part 2

Measures in Two-Class Classification

Page 49: evaluation and credibility-Part 2

Example:

1) How many images of Gerhard Schroeder are in the data set?
2) How many predictions of Gerhard Schroeder are there?
3) What is the probability that Hugo Chavez is classified correctly by our learning algorithm?
4) Your learning algorithm predicted/classified an image as Hugo Chavez. What is the probability it is actually Hugo Chavez?
5) Recall("Hugo Chavez") =
6) Precision("Hugo Chavez") =
7) Recall("Colin Powell") =
8) Precision("Colin Powell") =
9) Recall("George W Bush") =
10) Precision("George W Bush") =

Page 50: evaluation and credibility-Part 2

1) True Positive ("Tony Blair") =
2) False Positive ("Tony Blair") =
3) False Negative ("Tony Blair") =
4) True Positive ("Donald Rumsfeld") =
5) False Positive ("Donald Rumsfeld") =
6) False Negative ("Donald Rumsfeld") =

Page 51: evaluation and credibility-Part 2

Metrics for Classifier’s Evaluation

                        Predicted class
                        Pos     Neg
  Actual class   Pos    TP      FN      (P)
                 Neg    FP      TN      (N)

• Accuracy = (TP+TN)/(P+N)
• Error = (FP+FN)/(P+N)
• Precision = TP/(TP+FP)
• Recall / TP rate = TP/P
• FP Rate = FP/N
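These formulas translate directly into code; below is a small illustrative computation using the counts of Classifier 1 from the next slide (TP=60, FN=40, FP=20, TN=80).

```python
# Two-class metrics from confusion-matrix counts (illustrative values)
TP, FN, FP, TN = 60, 40, 20, 80
P, N = TP + FN, FP + TN

accuracy  = (TP + TN) / (P + N)
error     = (FP + FN) / (P + N)
precision = TP / (TP + FP)
recall    = TP / P          # also the TP rate
fp_rate   = FP / N

print(accuracy, error, precision, recall, fp_rate)
```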

Page 52: evaluation and credibility-Part 2

Example: 3 classifiers

(rows = true class, columns = predicted class)

Classifier 1             Classifier 2             Classifier 3
      pos  neg                 pos  neg                 pos  neg
pos    60   40           pos    70   30           pos    40   60
neg    20   80           neg    50   50           neg    30   70

Classifier 1: TPR =      FPR =
Classifier 2: TPR =      FPR =
Classifier 3: TPR =      FPR =

Page 53: evaluation and credibility-Part 2

Example: 3 classifiers

(rows = true class, columns = predicted class)

Classifier 1             Classifier 2             Classifier 3
      pos  neg                 pos  neg                 pos  neg
pos    60   40           pos    70   30           pos    40   60
neg    20   80           neg    50   50           neg    30   70

Classifier 1: TPR = 0.6   FPR = 0.2
Classifier 2: TPR = 0.7   FPR = 0.5
Classifier 3: TPR = 0.4   FPR = 0.3

Page 54: evaluation and credibility-Part 2

Multiclass - Things to Notice

• The total number of test examples of any class is the sum of the corresponding row (i.e. the TP + FN for that class)
• The total number of FNs for a class is the sum of the values in the corresponding row (excluding the TP)
• The total number of FPs for a class is the sum of the values in the corresponding column (excluding the TP)
• The total number of TNs for a certain class is the sum of all columns and rows excluding that class's column and row

              Predicted
  Actual    A     B     C     D     E
    A      TPA   EAB   EAC   EAD   EAE
    B      EBA   TPB   EBC   EBD   EBE
    C      ECA   ECB   TPC   ECD   ECE
    D      EDA   EDB   EDC   TPD   EDE
    E      EEA   EEB   EEC   EED   TPE
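The row/column rules above can be expressed with NumPy; the sketch below uses the 3-class confusion matrix from the later example (rows = actual, columns = predicted) as illustrative input.

```python
# Per-class TP, FN, FP, TN from an N x N confusion matrix
# (rows = actual class, columns = predicted class)
import numpy as np

cm = np.array([[25,  5,  2],
               [ 3, 32,  4],
               [ 1,  0, 15]])

TP = np.diag(cm)                   # diagonal entries
FN = cm.sum(axis=1) - TP           # rest of each class's row
FP = cm.sum(axis=0) - TP           # rest of each class's column
TN = cm.sum() - TP - FN - FP       # everything outside that class's row and column
print(TP, FN, FP, TN)
```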

Page 55: evaluation and credibility-Part 2

              Predicted
  Actual    A     B     C     D     E
    A      TPA   EAB   EAC   EAD   EAE
    B      EBA   TPB   EBC   EBD   EBE
    C      ECA   ECB   TPC   ECD   ECE
    D      EDA   EDB   EDC   TPD   EDE
    E      EEA   EEB   EEC   EED   TPE

Page 56: evaluation and credibility-Part 2

Multi-class

              Predicted                          Predicted class
  Actual    A     B     C                            P     N
    A      TPA   EAB   EAC        Actual      P     TP    FN
    B      EBA   TPB   EBC        class       N     FP    TN
    C      ECA   ECB   TPC

One-vs-rest reductions (to be filled in):

              Predicted                Predicted                Predicted
  Actual    A    Not A      Actual    B    Not B      Actual    C    Not C
    A                         B                         C
    Not A                     Not B                     Not C

Page 57: evaluation and credibility-Part 2

Multi-class

              Predicted                          Predicted class
  Actual    A     B     C                            P     N
    A      TPA   EAB   EAC        Actual      P     TP    FN
    B      EBA   TPB   EBC        class       N     FP    TN
    C      ECA   ECB   TPC

One-vs-rest reduction for class A:
              Predicted
  Actual      A              Not A
    A        TPA             EAB + EAC
    Not A    EBA + ECA       TPB + EBC + ECB + TPC

For class B:
              Predicted
  Actual      B              Not B
    B        TPB             EBA + EBC
    Not B    EAB + ECB       TPA + EAC + ECA + TPC

For class C:
              Predicted
  Actual      C              Not C
    C        TPC             ECA + ECB
    Not C    EAC + EBC       TPA + EAB + EBA + TPB

Page 58: evaluation and credibility-Part 2

Example:

              Predicted
  Actual    A     B     C
    A       25     5     2
    B        3    32     4
    C        1     0    15

Overall Accuracy =
Precision A =
Recall B =

Page 59: evaluation and credibility-Part 2

Example:

              Predicted
  Actual    A     B     C
    A       25     5     2
    B        3    32     4
    C        1     0    15

Overall Accuracy = (25+32+15)/(25+5+2+3+32+4+1+0+15) = 72/87 ≈ 0.83
Precision A = 25/(25+3+1) = 25/29 ≈ 0.86
Recall B = 32/(3+32+4) = 32/39 ≈ 0.82
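The same numbers can be verified with a short NumPy check (illustrative only, not part of the original slides):

```python
# Verify the 3-class example: overall accuracy, Precision(A), Recall(B)
import numpy as np

cm = np.array([[25,  5,  2],    # actual A
               [ 3, 32,  4],    # actual B
               [ 1,  0, 15]])   # actual C

accuracy    = np.trace(cm) / cm.sum()       # (25+32+15)/87 ≈ 0.828
precision_A = cm[0, 0] / cm[:, 0].sum()     # 25/29 ≈ 0.862
recall_B    = cm[1, 1] / cm[1, :].sum()     # 32/39 ≈ 0.821
print(round(accuracy, 3), round(precision_A, 3), round(recall_B, 3))
```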

Page 60: evaluation and credibility-Part 2

Counting the Costs

• In practice, different types of classification errors often incur different costs
• Examples:
  – Terrorist profiling ("Not a terrorist" is correct 99.99% of the time)
  – Loan decisions
  – Fault diagnosis
  – Promotional mailing

Page 61: evaluation and credibility-Part 2

Cost Matrices

                        Hypothesized class
                        Pos        Neg
  True class     Pos    TP Cost    FN Cost
                 Neg    FP Cost    TN Cost

Usually, TP Cost and TN Cost are set equal to 0

Page 62: evaluation and credibility-Part 2

Lift Charts

• In practice, decisions are usually made by comparing possible scenarios, taking different costs into account.
• Example: Promotional mail-out to 1,000,000 households. If we mail to all households, we get a 0.1% response (1000).
• A data mining tool identifies:
  – a subset of 100,000 households with a 0.4% response (400); or
  – a subset of 400,000 households with a 0.2% response (800)
• Depending on the costs, we can make the final decision using lift charts!
• A lift chart allows a visual comparison for measuring model performance

Page 63: evaluation and credibility-Part 2

Generating a Lift Chart

• Given a scheme that outputs probabilities, sort the instances in descending order according to the predicted probability
• In a lift chart, the x axis is the sample size and the y axis is the number of true positives.

  Rank   Predicted Probability   Actual Class
  1      0.95                    Yes
  2      0.93                    Yes
  3      0.93                    No
  4      0.88                    Yes
  …      …                       …
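A sketch of this procedure in NumPy, using the four ranked instances from the table above (the arrays are just those example values):

```python
# Lift/gains data sketch: sort by predicted probability, accumulate true positives
import numpy as np

probs  = np.array([0.95, 0.93, 0.93, 0.88])   # predicted probabilities (rows 1-4 of the table)
actual = np.array([1,    1,    0,    1])      # actual class: 1 = Yes, 0 = No

order = np.argsort(-probs)                    # descending by predicted probability
sample_size = np.arange(1, len(probs) + 1)    # x axis: number of instances included
true_pos = np.cumsum(actual[order])           # y axis: cumulative true positives
print(list(zip(sample_size, true_pos)))       # (1, 1), (2, 2), (3, 2), (4, 3)
```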

Page 64: evaluation and credibility-Part 2

Gains Chart

Page 65: evaluation and credibility-Part 2

Example 01: Direct Marketing

• A company wants to do a mail marketing campaign
• It costs the company $1 for each item mailed
• They have information on 100,000 customers
• Create cumulative gains and lift charts from the following data
• Overall Response Rate: If we assume we have no model other than the prediction of the overall response rate, then we can predict the number of positive responses as a fraction of the total customers contacted
• Suppose the response rate is 20%
• If all 100,000 customers are contacted, we will receive around 20,000 positive responses

Page 66: evaluation and credibility-Part 2

  Cost ($)    Total Customers Contacted    Positive Responses
  100,000     100,000                      20,000

• Prediction of a Response Model: A response model predicts who will respond to a marketing campaign
• If we have a response model, we can make more detailed predictions
• For example, we use the response model to assign a score to all 100,000 customers and predict the results of contacting only the top 10,000 customers, the top 20,000 customers, etc.

  Cost ($)    Total Customers Contacted    Positive Responses
  10,000      10,000                        6,000
  20,000      20,000                       10,000
  30,000      30,000                       13,000
  40,000      40,000                       15,800
  50,000      50,000                       17,000
  60,000      60,000                       18,000
  70,000      70,000                       18,800
  80,000      80,000                       19,400
  90,000      90,000                       19,800
  100,000     100,000                      20,000

Page 67: evaluation and credibility-Part 2

Cumulative Gains Chart

• The y-axis shows the percentage of positive responses. This is a percentage of the total possible positive responses (20,000, as the overall response rate shows)
• The x-axis shows the percentage of customers contacted, which is a fraction of the 100,000 total customers
• Baseline (overall response rate): if we contact X% of customers then we will receive X% of the total positive responses
• Lift curve: using the predictions of the response model, calculate the percentage of positive responses for each percentage of customers contacted and map these points to create the lift curve

Page 68: evaluation and credibility-Part 2

  Cost ($)    Total Customers Contacted    Positive Responses
  10,000      10,000                        6,000
  20,000      20,000                       10,000
  30,000      30,000                       13,000
  40,000      40,000                       15,800
  50,000      50,000                       17,000
  60,000      60,000                       18,000
  70,000      70,000                       18,800
  80,000      80,000                       19,400
  90,000      90,000                       19,800
  100,000     100,000                      20,000
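From this table, the cumulative gains percentages and lift values can be computed as in the sketch below (plain NumPy; the 20,000 total responders and 100,000 customers come from the example):

```python
# Cumulative gains and lift from the direct-marketing table
import numpy as np

contacted = np.arange(10_000, 100_001, 10_000)          # top 10k, 20k, ..., 100k customers
responses = np.array([6000, 10000, 13000, 15800, 17000,
                      18000, 18800, 19400, 19800, 20000])

pct_contacted = contacted / 100_000 * 100    # x axis (% of customers contacted)
pct_responses = responses / 20_000 * 100     # cumulative gains (% of all responders)
lift = pct_responses / pct_contacted         # lift over the no-model baseline

for c, g, l in zip(pct_contacted, pct_responses, lift):
    print(f"{c:5.0f}%  gains={g:5.1f}%  lift={l:.2f}")   # e.g. 10% -> gains=30.0%, lift=3.00
```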

Page 69: evaluation and credibility-Part 2

Lift Chart

• Shows the actual lift.
• To plot the chart: calculate the points on the lift curve by determining the ratio between the result predicted by our model and the result using no model.
• Example: For contacting 10% of customers, using no model we should get 10% of the responders, and using the given model we get 30% of the responders. The y-value of the lift curve at 10% is therefore 30 / 10 = 3.

Page 70: evaluation and credibility-Part 2

Lift Chart

Cumulative gains and lift charts are a graphical representation of the advantage of using a predictive model to choose which customers to contact

Page 71: evaluation and credibility-Part 2

Example 2:

• Using the response model P(x)=100-AGE(x) for customer x and the data table shown below, construct the cumulative gains and lift charts.

Page 72: evaluation and credibility-Part 2

1. Calculate P(x) for each person x
2. Order the people according to rank P(x)
3. Calculate the percentage of total responses for each cutoff point

   Response Rate = Number of Responses / Total Number of Responses

  Total Customers Contacted    # of Responses    Response Rate
   2
   4
   6
   8
  10
  12
  14
  16
  18
  20

Page 73: evaluation and credibility-Part 2

1. Calculate P(x) for each person x
2. Order the people according to rank P(x)
3. Calculate the percentage of total responses for each cutoff point

   Response Rate = Number of Responses / Total Number of Responses

Page 74: evaluation and credibility-Part 2

Cumulative Gains vs Lift Chart

The lift curve and the baseline have the same values for 10%-20% and 90%-100%.

Page 75: evaluation and credibility-Part 2

ROC Curves

• ROC curves are similar to lift charts
  – Stands for "receiver operating characteristic"
  – Used in signal detection to show the tradeoff between hit rate and false alarm rate over a noisy channel
• Differences from the gains chart:
  – the x axis shows the percentage of false positives in the sample, rather than the sample size

Page 76: evaluation and credibility-Part 2

ROC Curve

Page 77: evaluation and credibility-Part 2


[Figure: score distributions of non-diseased cases and diseased cases, with a decision threshold between them.]

Page 78: evaluation and credibility-Part 2

ROC Curves and Analysis

(rows = true class, columns = predicted class)

Classifier 1             Classifier 2             Classifier 3
      pos  neg                 pos  neg                 pos  neg
pos    60   40           pos    70   30           pos    40   60
neg    20   80           neg    50   50           neg    30   70

Classifier 1: TPr = 0.6   FPr = 0.2
Classifier 2: TPr = 0.7   FPr = 0.5
Classifier 3: TPr = 0.4   FPr = 0.3

Page 79: evaluation and credibility-Part 2

ROC analysis

• True Positive Rate
  – TPR = TP / (TP+FN)
  – also called sensitivity
  – true abnormals called abnormal by the observer
• False Positive Rate
  – FPR = FP / (FP+TN)
• Specificity (TNR) = TN / (TN+FP)
  – true normals called normal by the observer
• FPR = 1 - specificity
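Assuming scikit-learn is available, an ROC curve can be obtained from true labels and predicted scores roughly as follows (y_true and y_score are hypothetical example arrays):

```python
# ROC curve sketch (assumes scikit-learn; y_true and y_score are hypothetical)
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = np.array([1, 1, 0, 1, 0, 0, 1, 0])                     # actual classes
y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.35, 0.1])   # predicted scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # FPR = 1 - specificity, TPR = sensitivity
print("AUC:", roc_auc_score(y_true, y_score))       # area under the ROC curve
```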

Page 80: evaluation and credibility-Part 2
Page 81: evaluation and credibility-Part 2

Evaluating classifiers (via their ROC curves)

Classifier A can’t distinguish between normal and abnormal.

B is better but makes some mistakes.

C makes very few mistakes.

Page 82: evaluation and credibility-Part 2

“Perfect” means no false positives and no false negatives.

Page 83: evaluation and credibility-Part 2

Quiz 4:

1) How many images of Gerhard Schroeder are in the data set?
2) How many predictions of Gerhard Schroeder are there?
3) What is the probability that Hugo Chavez is classified correctly by our learning algorithm?
4) Your learning algorithm predicted/classified an image as Hugo Chavez. What is the probability it is actually Hugo Chavez?
5) Recall("Hugo Chavez") =
6) Precision("Hugo Chavez") =
7) Recall("Colin Powell") =
8) Precision("Colin Powell") =
9) Recall("George W Bush") =
10) Precision("George W Bush") =

