Tilani Gunawardena
Machine Learning and Data Mining
Evaluation and Credibility
Outline
• Introduction
• Train, Test and Validation sets
• Evaluation on Large data / Unbalanced data
• Evaluation on Small data
  – Cross validation
  – Bootstrap
• Comparing data mining schemes
  – Significance test
  – Lift Chart / ROC curve
• Numeric Prediction Evaluation
Model’s Evaluation in the KDD Process
How to Estimate the Metrics?
• We can use:
  – Training data
  – Independent test data
  – Hold-out method
  – k-fold cross-validation method
  – Leave-one-out method
  – Bootstrap method
  – And many more…
Estimation with Training Data
• The accuracy/error estimates on the training data are not good indicators of performance on future data.
  – Q: Why?
  – A: Because new data will probably not be exactly the same as the training data!
• The accuracy/error estimates on the training data measure the degree of the classifier's overfitting.
[Diagram: the same training set is used both to build the classifier and to evaluate it.]
Estimation with Independent Test Data
• Estimation with independent test data is used when we have plenty of data and there is a natural way of forming training and test sets.
• For example: Quinlan in 1987 reported experiments in a medical domain for which the classifiers were trained on data from 1985 and tested on data from 1986.
[Diagram: the classifier is built on the training set and evaluated on a separate test set.]
Hold-out Method
• The hold-out method splits the data into training data and test data (usually 2/3 for training, 1/3 for testing). We then build a classifier using the training data and test it using the test data.
• The hold-out method is usually used when we have thousands of instances, including several hundred instances from each class.
[Diagram: the data is split into a training set used to build the classifier and a test set used to evaluate it.]
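A minimal sketch of the hold-out method (not from the original slides), assuming scikit-learn is available and using the iris data purely as a placeholder:

```python
# Hold-out evaluation: roughly 2/3 train, 1/3 test
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Hold out 1/3 of the instances for testing; stratify keeps class proportions similar
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("Hold-out accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```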
Classification: Train, Validation, Test Split
[Diagram: data with known class labels is split into a training set (used by the classifier builder), a validation set (used to evaluate and tune the model), and a final test set (used only for the final evaluation).]
The test data can't be used for parameter tuning!
k-Fold Cross-Validation
• k-fold cross-validation avoids overlapping test sets:
  – First step: the data is split into k subsets of equal size.
  – Second step: each subset in turn is used for testing and the remainder for training.
• The k estimates are averaged to yield an overall estimate.
Data
train train test
train test train
test train train
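A minimal k-fold cross-validation sketch (illustrative only; assumes scikit-learn and uses the iris data and a decision tree as stand-ins):

```python
# k-fold cross-validation: each fold is used once for testing, the rest for training
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

scores = cross_val_score(clf, X, y,
                         cv=KFold(n_splits=10, shuffle=True, random_state=0))
print("Per-fold accuracy:", scores)
print("Overall estimate (mean of the folds):", scores.mean())
```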
Example
• Collect data from the real world (photographs and labels).
• Method 1: evaluate on the training data itself.
  – This is like giving students the answers before giving them the exam.
• Method 2: cross-validation error.
• Method 3: assume the world happens to be well represented by our dataset.
• Cross-validation (CV) is used both for model selection and for evaluating our selection method.
The Bootstrap
• Cross-validation uses sampling without replacement:
  – The same instance, once selected, cannot be selected again for a particular training/test set.
• The bootstrap uses sampling with replacement to form the training set:
  – Sample a dataset of n instances n times with replacement to form a new dataset of n instances.
  – Use this data as the training set.
  – Use the instances from the original dataset that don't occur in the new training set for testing.
Example
• Draw bootstrap samples of the same size N (with replacement).
  – e.g., N = 4 instances, M = 3 bootstrap samples
  – e.g., N = 150 instances, M = 5000 bootstrap samples
• With M = 5000 samples this gives 5000 means of random samples of X.
The 0.632 bootstrap
• This method is also called the 0.632 bootstrap:
  – A particular instance has a probability of 1 - 1/n of not being picked in one draw.
  – Thus its probability of ending up in the test data (never being picked in n draws) is:
    (1 - 1/n)^n ≈ e^(-1) ≈ 0.368
  – This means the training data will contain approximately 63.2% of the distinct instances.
Estimating error with the bootstrap
• The error estimate on the test data will be very pessimistic:
  – The classifier was trained on just ~63% of the instances.
• Therefore, combine it with the resubstitution error (the error on the training data):
    err = 0.632 × err_test + 0.368 × err_train
• The resubstitution error gets less weight than the error on the test data.
• Repeat the process several times with different replacement samples and average the results.
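A rough sketch of this 0.632 bootstrap estimate (illustrative only; assumes NumPy and scikit-learn, with a decision tree and the iris data used purely as placeholders):

```python
# 0.632 bootstrap error estimate, averaged over repeated bootstrap samples
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
n = len(X)
rng = np.random.default_rng(0)
estimates = []

for _ in range(100):                                      # repeat with different samples
    train_idx = rng.integers(0, n, size=n)                # n draws with replacement
    test_idx = np.setdiff1d(np.arange(n), train_idx)      # instances never picked (~36.8%)
    clf = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    err_test = 1 - clf.score(X[test_idx], y[test_idx])
    err_train = 1 - clf.score(X[train_idx], y[train_idx])  # resubstitution error
    estimates.append(0.632 * err_test + 0.368 * err_train)

print("Bootstrap error estimate:", np.mean(estimates))
```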
More on the bootstrap
• Probably the best way of estimating performance for very small datasets.
• However, it has some problems:
  – Consider a completely random dataset with two classes of equal size. The true error rate is 50% for any prediction rule.
  – A classifier that memorizes the training data achieves 0% resubstitution error and ~50% error on the test data.
  – Bootstrap estimate for this classifier: 0.632 × 50% + 0.368 × 0% = 31.6%
  – True expected error: 50%
• The bootstrap is also a straightforward way to derive estimates of standard errors and confidence intervals for complex estimators of parameters of the distribution.
Evaluation Summary:
• Use train, test, and validation sets for "LARGE" data.
• Balance "un-balanced" data.
• Use cross-validation for middle-sized/small data.
• Use the leave-one-out and bootstrap methods for very small data.
• Don't use the test data for parameter tuning; use separate validation data.
Agenda
• Quantifying learner performance
  – Cross validation
  – Error vs. loss
  – Precision & recall
• Model selection
Accuracy vs. Precision
• Accuracy refers to the closeness of a measurement or estimate to the TRUE value.
• Precision (or variance) refers to the degree of agreement for a series of measurements.
Precision vs. Recall
• Precision: percentage of retrieved documents that are relevant.
• Recall: percentage of relevant documents that are returned.
Scenario
• We use a dataset with known classes to build a model.
• We use another dataset with known classes to evaluate the model (this dataset could be part of the original dataset).
• We compare/count the predicted classes against the actual classes.
Confusion Matrix
• A confusion matrix shows the number of correct and incorrect predictions made by the classification model compared to the actual outcomes (target values) in the data.
• The matrix is N×N, where N is the number of target values (classes).
• The performance of such models is commonly evaluated using the data in the matrix.
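A small sketch of building a confusion matrix (illustrative only; assumes scikit-learn and uses made-up labels):

```python
# Confusion matrix from actual vs. predicted classes
from sklearn.metrics import confusion_matrix

y_actual    = ["pos", "pos", "pos", "neg", "neg", "neg", "neg", "pos"]
y_predicted = ["pos", "neg", "pos", "neg", "pos", "neg", "neg", "pos"]

# Rows = actual class, columns = predicted class
cm = confusion_matrix(y_actual, y_predicted, labels=["pos", "neg"])
print(cm)
# [[3 1]    TP = 3, FN = 1
#  [1 3]]   FP = 1, TN = 3
```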
Two Types of Error
• False negative ("miss"), FN: the alarm doesn't sound but the person is carrying metal.
• False positive ("false alarm"), FP: the alarm sounds but the person is not carrying metal.
How to evaluate the Classifier’s Generalization Performance?
• Assume that we test a classifier on some test set and derive the following confusion matrix (two-class), also called a contingency table:

                       Predicted class
                       Pos    Neg
  Actual class   Pos    TP     FN    (P = TP + FN)
                 Neg    FP     TN    (N = FP + TN)
Measures in Two-Class Classification
Example:
1) How many images of Gerhard Schroeder are in the data set?
2) How many predictions of Gerhard Schroeder are there?
3) What is the probability that Hugo Chavez is classified correctly by our learning algorithm?
4) Your learning algorithm predicted/classified an image as Hugo Chavez. What is the probability that it is actually Hugo Chavez?
5) Recall("Hugo Chavez") =
6) Precision("Hugo Chavez") =
7) Recall("Colin Powell") =
8) Precision("Colin Powell") =
9) Recall("George W Bush") =
10) Precision("George W Bush") =

1) True Positives("Tony Blair") =
2) False Positives("Tony Blair") =
3) False Negatives("Tony Blair") =
4) True Positives("Donald Rumsfeld") =
5) False Positives("Donald Rumsfeld") =
6) False Negatives("Donald Rumsfeld") =
Metrics for Classifier’s Evaluation
                       Predicted class
                       Pos    Neg
  Actual class   Pos    TP     FN    (P = TP + FN)
                 Neg    FP     TN    (N = FP + TN)

• Accuracy = (TP + TN) / (P + N)
• Error = (FP + FN) / (P + N)
• Precision = TP / (TP + FP)
• Recall / TP rate = TP / P
• FP rate = FP / N
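For illustration only, the same formulas computed in code, using the counts of Classifier 3 from the example below (TP = 60, FN = 40, FP = 20, TN = 80):

```python
# Two-class metrics from confusion-matrix counts
TP, FN, FP, TN = 60, 40, 20, 80

P, N = TP + FN, FP + TN
accuracy  = (TP + TN) / (P + N)
error     = (FP + FN) / (P + N)
precision = TP / (TP + FP)
recall    = TP / P                # also the TP rate
fp_rate   = FP / N

print(accuracy, error, precision, recall, fp_rate)
# 0.7 0.3 0.75 0.6 0.2
```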
Example: 3 classifiers
Classifier 1:
  True \ Predicted   pos   neg
  pos                 40    60
  neg                 30    70

Classifier 2:
  True \ Predicted   pos   neg
  pos                 70    30
  neg                 50    50

Classifier 3:
  True \ Predicted   pos   neg
  pos                 60    40
  neg                 20    80

Classifier 1: TPR = ?   FPR = ?
Classifier 2: TPR = ?   FPR = ?
Classifier 3: TPR = ?   FPR = ?
Example: 3 classifiers
Classifier 1:
  True \ Predicted   pos   neg
  pos                 40    60
  neg                 30    70

Classifier 2:
  True \ Predicted   pos   neg
  pos                 70    30
  neg                 50    50

Classifier 3:
  True \ Predicted   pos   neg
  pos                 60    40
  neg                 20    80

Classifier 1: TPR = 0.4   FPR = 0.3
Classifier 2: TPR = 0.7   FPR = 0.5
Classifier 3: TPR = 0.6   FPR = 0.2
Multiclass: Things to Notice
• The total number of test examples of any class is the sum of the corresponding row (i.e., the TP + FN for that class).
• The total number of FNs for a class is the sum of the values in the corresponding row (excluding the TP).
• The total number of FPs for a class is the sum of the values in the corresponding column (excluding the TP).
• The total number of TNs for a certain class is the sum of all columns and rows excluding that class's column and row.

            Predicted
  Actual    A      B      C      D      E
  A         TPA    EAB    EAC    EAD    EAE
  B         EBA    TPB    EBC    EBD    EBE
  C         ECA    ECB    TPC    ECD    ECE
  D         EDA    EDB    EDC    TPD    EDE
  E         EEA    EEB    EEC    EED    TPE
Multi-class
            Predicted
  Actual    A      B      C
  A         TPA    EAB    EAC
  B         EBA    TPB    EBC
  C         ECA    ECB    TPC

Each class can be reduced to a two-class matrix (P = that class, N = all other classes):

                       Predicted class
  Actual class          P      N
  P                     TP     FN
  N                     FP     TN

            Predicted
  Actual    A                Not A
  A         TPA              EAB + EAC
  Not A     EBA + ECA        TPB + EBC + ECB + TPC

            Predicted
  Actual    B                Not B
  B         TPB              EBA + EBC
  Not B     EAB + ECB        TPA + EAC + ECA + TPC

            Predicted
  Actual    C                Not C
  C         TPC              ECA + ECB
  Not C     EAC + EBC        TPA + EAB + EBA + TPB
Example:
            Predicted
  Actual    A     B     C
  A         25    5     2
  B         3     32    4
  C         1     0     15

Overall Accuracy = ?   Precision(A) = ?   Recall(B) = ?

Overall Accuracy = (25 + 32 + 15) / (25 + 5 + 2 + 3 + 32 + 4 + 1 + 0 + 15) = 72/87 ≈ 0.83
Precision(A) = 25 / (25 + 3 + 1) = 25/29 ≈ 0.86
Recall(B) = 32 / (3 + 32 + 4) = 32/39 ≈ 0.82
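A small sketch computing the same quantities for every class from the matrix above (illustrative only; assumes NumPy):

```python
# Per-class precision and recall from a multi-class confusion matrix
import numpy as np

# Rows = actual class, columns = predicted class (A, B, C), as in the example above
cm = np.array([[25,  5,  2],
               [ 3, 32,  4],
               [ 1,  0, 15]])

accuracy  = np.trace(cm) / cm.sum()        # (25+32+15)/87
precision = np.diag(cm) / cm.sum(axis=0)   # TP / column sum, per class
recall    = np.diag(cm) / cm.sum(axis=1)   # TP / row sum, per class

print("Accuracy:", accuracy)                  # ~0.83
print("Precision (A, B, C):", precision)      # 25/29, 32/37, 15/21
print("Recall    (A, B, C):", recall)         # 25/32, 32/39, 15/16
```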
Counting the Costs
• In practice, different types of classification errors often incur different costs
• Examples:
  – Terrorist profiling ("not a terrorist" is correct 99.99% of the time)
  – Loan decisions
  – Fault diagnosis
  – Promotional mailing
Cost Matrices

                      Hypothesized class
  True class           Pos         Neg
  Pos                  TP cost     FN cost
  Neg                  FP cost     TN cost

Usually, TP cost and TN cost are set equal to 0.
Lift Charts
• In practice, decisions are usually made by comparing possible scenarios, taking different costs into account.
• Example: promotional mail-out to 1,000,000 households.
  – If we mail to all households, we get a 0.1% response rate (1,000 responses).
  – A data mining tool identifies:
    - a subset of 100,000 households with a 0.4% response rate (400 responses); or
    - a subset of 400,000 households with a 0.2% response rate (800 responses).
• Depending on the costs, we can make the final decision using lift charts!
• A lift chart allows a visual comparison for measuring model performance
Generating a Lift Chart
• Given a scheme that outputs probabilities, sort the instances in descending order of predicted probability.
• In a lift chart, the x axis is the sample size and the y axis is the number of true positives.

  Rank   Predicted Probability   Actual Class
  1      0.95                    Yes
  2      0.93                    Yes
  3      0.93                    No
  4      0.88                    Yes
  …      …                       …
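A minimal sketch of this sorting step and the cumulative true-positive counts (illustrative only; the probabilities and labels below are made up):

```python
# Points for a lift / cumulative gains chart
import numpy as np

probs  = np.array([0.55, 0.95, 0.93, 0.88, 0.30, 0.93, 0.76, 0.62])
actual = np.array([0,    1,    1,    1,    0,    0,    1,    0   ])  # 1 = Yes

order = np.argsort(-probs)                   # sort instances by descending probability
cum_tp = np.cumsum(actual[order])            # true positives found so far (y axis)
sample_size = np.arange(1, len(probs) + 1)   # instances taken so far (x axis)

for n, tp in zip(sample_size, cum_tp):
    print(f"top {n} instances -> {tp} true positives")
```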
Gains Chart
Example 01: Direct Marketing
• A company wants to do a mail marketing campaign.
• It costs the company $1 for each item mailed.
• They have information on 100,000 customers.
• Create cumulative gains and lift charts from the following data.
• Overall response rate: if we have no model other than the prediction of the overall response rate, we can predict the number of positive responses as a fraction of the total customers contacted.
• Suppose the response rate is 20%.
• If all 100,000 customers are contacted, we will receive around 20,000 positive responses.

  Cost ($)   Total Customers Contacted   Positive Responses
  100,000    100,000                     20,000
• Prediction with a response model: a response model predicts who will respond to a marketing campaign.
• If we have a response model, we can make more detailed predictions.
• For example, we use the response model to assign a score to all 100,000 customers and predict the results of contacting only the top 10,000 customers, the top 20,000 customers, etc.

  Cost ($)   Total Customers Contacted   Positive Responses
  10,000     10,000                      6,000
  20,000     20,000                      10,000
  30,000     30,000                      13,000
  40,000     40,000                      15,800
  50,000     50,000                      17,000
  60,000     60,000                      18,000
  70,000     70,000                      18,800
  80,000     80,000                      19,400
  90,000     90,000                      19,800
  100,000    100,000                     20,000
Cumulative Gains Chart
• The y-axis shows the percentage of positive responses. This is a percentage of the total possible positive responses (20,000, as the overall response rate shows).
• The x-axis shows the percentage of customers contacted, which is a fraction of the 100,000 total customers.
• Baseline (overall response rate): if we contact X% of customers, we will receive X% of the total positive responses.
• Lift curve: using the predictions of the response model, calculate the percentage of positive responses for each percentage of customers contacted and map these points to create the lift curve.
Lift Chart
• Shows the actual lift.
• To plot the chart: calculate the points on the lift curve by determining the ratio between the result predicted by our model and the result using no model.
• Example: for contacting 10% of customers, using no model we should get 10% of the responders; using the given model we should get 30% of the responders (6,000 of 20,000). The y-value of the lift curve at 10% is 30 / 10 = 3.
Lift Chart
Cumulative gains and lift charts are a graphical representation of the advantage of using a predictive model to choose which customers to contact
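As an illustrative sketch (not from the original slides), the gains and lift values can be computed directly from the campaign table above:

```python
# Cumulative gains and lift values from the direct-marketing table
contacted = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]          # % of customers contacted
responses = [6000, 10000, 13000, 15800, 17000, 18000,
             18800, 19400, 19800, 20000]                        # positive responses (model)
total_responses = 20000

for pct, resp in zip(contacted, responses):
    gain = 100 * resp / total_responses    # % of all positive responses captured
    lift = gain / pct                      # ratio vs. the no-model baseline
    print(f"{pct:>3}% contacted: gain = {gain:.0f}%, lift = {lift:.2f}")
# 10% contacted: gain = 30%, lift = 3.00  (matches the example above)
```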
Example 2:
• Using the response model P(x) = 100 - AGE(x) for customer x and the data table shown below, construct the cumulative gains and lift charts.

1. Calculate P(x) for each person x.
2. Order the people according to rank P(x).
3. Calculate the percentage of total responses for each cutoff point:
   Response Rate = Number of Responses / Total Number of Responses
  Total Customers Contacted   # of Responses   Response Rate
  2
  4
  6
  8
  10
  12
  14
  16
  18
  20
Cumulative Gains vs. Lift Chart
• The lift curve and the baseline have the same values for 10%-20% and 90%-100%.
ROC Curves
• ROC curves are similar to lift charts.
  – ROC stands for "receiver operating characteristic".
  – Used in signal detection to show the tradeoff between hit rate and false alarm rate over a noisy channel.
• Differences from the gains chart:
  – The x axis shows the percentage of false positives in the sample, rather than the sample size.
ROC Curve
[Figure: score distributions for non-diseased cases and diseased cases, separated by a decision threshold.]
ROC Curves and Analysis
Classifier 1:
  True \ Predicted   pos   neg
  pos                 40    60
  neg                 30    70

Classifier 2:
  True \ Predicted   pos   neg
  pos                 70    30
  neg                 50    50

Classifier 3:
  True \ Predicted   pos   neg
  pos                 60    40
  neg                 20    80

Classifier 1: TPr = 0.4   FPr = 0.3
Classifier 2: TPr = 0.7   FPr = 0.5
Classifier 3: TPr = 0.6   FPr = 0.2
ROC analysis
• True Positive Rate
  – TPR = TP / (TP + FN)
  – Also called sensitivity
  – True abnormals called abnormal by the observer
• False Positive Rate
  – FPR = FP / (FP + TN)
• Specificity (TNR) = TN / (TN + FP)
  – True normals called normal by the observer
• FPR = 1 - specificity
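A small sketch that traces an ROC curve by sweeping a decision threshold over hypothetical classifier scores (illustrative only; the scores and labels are made up):

```python
# ROC curve points obtained by sweeping the decision threshold
import numpy as np

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1])
labels = np.array([1,   1,   0,   1,   1,    0,   0,   1,   0,   0  ])  # 1 = abnormal

P, N = labels.sum(), (1 - labels).sum()
for thr in sorted(set(scores), reverse=True):
    pred = (scores >= thr).astype(int)
    tpr = ((pred == 1) & (labels == 1)).sum() / P   # sensitivity
    fpr = ((pred == 1) & (labels == 0)).sum() / N   # 1 - specificity
    print(f"threshold {thr:.2f}: TPR = {tpr:.1f}, FPR = {fpr:.1f}")
```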
Evaluating classifiers (via their ROC curves)
Classifier A can’t distinguish between normal and abnormal.
B is better but makes some mistakes.
C makes very few mistakes.
“Perfect” means no false positives and no false negatives.
Quiz 4:
1) How many images of Gerhard Schroeder are in the data set?
2) How many predictions of Gerhard Schroeder are there?
3) What is the probability that Hugo Chavez is classified correctly by our learning algorithm?
4) Your learning algorithm predicted/classified an image as Hugo Chavez. What is the probability that it is actually Hugo Chavez?
5) Recall("Hugo Chavez") =
6) Precision("Hugo Chavez") =
7) Recall("Colin Powell") =
8) Precision("Colin Powell") =
9) Recall("George W Bush") =
10) Precision("George W Bush") =