Computing & Information Sciences, Kansas State University
CIS 732 / 830: Machine Learning / Advanced Topics in AI
Lecture 19 of 42
Wednesday, 05 March 2008
William H. Hsu, Department of Computing and Information Sciences, KSU
KSOL course pages: http://snurl.com/1ydii / http://snipurl.com/1y5ih
Course web site: http://www.kddresearch.org/Courses/Spring-2008/CIS732
Instructor home page: http://www.cis.ksu.edu/~bhsu
Reading:
Today: Section 6.15, Han & Kamber 2e
Friday: Sections 7.1 – 7.3, Han & Kamber 2e
Model Selection and Performance Measurement: Confidence and ROC Curves
Eric-Jan Wagenmakers
Practical Methods for Model Selection:
Cross-Validation, Bootstrap, Prequential
Problem
How to compare rival models that differ in complexity (e.g., the number and functional form of parameters)?
Model X?
Model Y?
Model Z?
Solution?
Use model selection methods such as:
Minimum Description Length (MDL)
Bayesian Model Selection (BMS)
These methods have a sound theoretical basis.
“I like MDL!” “I like BMS!”
New Problems!
What if it is impossible to calculate what MDL/BMS require?
What if your favorite model does not have a likelihood function?
“I like MDL!” “I like BMS!”
Aim of this Presentation
Stress the importance of prediction.
Present practical methods for model selection. These methods are data-driven, replace theoretical analyses with computing power, and are relatively easy to implement.
Highlight pros and cons.
Advocate the prequential approach.
Why Prediction?
The ideal model captures all of the replicable structure in the data, and none of the noise.
Models capturing too little structure (underfitting): poor prediction.
Models capturing too much noise (overfitting): poor prediction.
MDL/BMS have a predictive interpretation.
Data-Driven Methods
Split-sample or hold-out method
Cross-validation: split-half, leave-one-out, k-fold, delete-d
Bootstrap model selection
Prequential
These methods can give contradictory results, even asymptotically.
Split-Sample Method [1]
Data is split into:
Training set (also called the calibration set)
Test set (also called the validation set)
Main idea: predictive virtues can only be assessed for unseen data
Split-Sample Method [2]
Often used in neural networks.
The training set and the test set do not change roles: there is no “crossing”.
Only one part of the data is ever used for fitting.
High variance.
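A minimal sketch of the hold-out split, assuming a simple list-based dataset (the helper name `holdout_split` is illustrative, not from the course materials):

```python
import random

def holdout_split(data, test_fraction=0.3, seed=0):
    """One fixed split: the two parts never swap roles ("no crossing")."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (training set, test set)

train, test = holdout_split(range(10), test_fraction=0.3)
```

Only `train` is ever used for fitting; predictive performance is read off `test`.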
Cross-Validation: Split-half
Data is split in half; the two halves then swap roles, each serving once as the training set and once as the test set.
Prediction error is the average performance on the two test sets.
Cross-Validation: Split-half
All the data are used for fitting (but not at the same time, of course).
Variance may be reduced by repeatedly splitting the data in different halves and averaging the results.
Prediction is based on a relatively small data set; this gives relatively large prediction errors.
Cross-Validation: Leave-one-out [1]
Data (n observations):
Test set = a single observation; training set = all the rest.
Prediction error is the average performance on the n held-out test observations.
All the data are used for fitting (again, not at the same time).
Prediction is based on a large data set; this gives small prediction errors.
Problem: as n grows large, the method overfits (i.e., it does not converge on the correct model, in the case that there is one – that is, the method is not consistent).
Sometimes, the method can have high variance.
Cross-Validation: Leave-one-out [2]
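A minimal sketch of leave-one-out cross-validation; `fit` and `loss` are hypothetical user-supplied helpers (any model and loss function will do):

```python
def loo_cv_error(data, fit, loss):
    """Leave-one-out CV: each observation serves as the test set exactly once;
    the model is refit n times on the remaining n-1 points."""
    errors = []
    for i, x in enumerate(data):
        train = data[:i] + data[i + 1:]
        errors.append(loss(fit(train), x))
    return sum(errors) / len(data)

# Toy usage: the "model" is just the training mean, with squared-error loss.
mean = lambda xs: sum(xs) / len(xs)
sq_loss = lambda m, x: (m - x) ** 2
err = loo_cv_error([1, 2, 3], mean, sq_loss)
```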
Successively set apart a block of data (instead of a single observation); each of the k blocks serves once as the test set.
Cross-Validation: k-fold
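A sketch of k-fold cross-validation under simplifying assumptions (folds are formed by striding rather than shuffling; `fit` and `loss` are hypothetical helpers as before):

```python
def kfold_cv_error(data, fit, loss, k=5):
    """k-fold CV: every k-th point forms the held-out block for one fold;
    prediction error is averaged over the k test blocks."""
    fold_errors = []
    for fold in range(k):
        test = data[fold::k]
        train = [x for i, x in enumerate(data) if i % k != fold]
        model = fit(train)
        fold_errors.append(sum(loss(model, x) for x in test) / len(test))
    return sum(fold_errors) / k

mean = lambda xs: sum(xs) / len(xs)
sq_loss = lambda m, x: (m - x) ** 2
err = kfold_cv_error([1, 2, 3, 4, 5, 6], mean, sq_loss, k=3)
```

Setting k = 2 recovers the split-half method; k = n recovers leave-one-out.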
Same as k-fold, except that the test blocks consist of every subset of d observations from the data.
How should we choose d?
Suppose you want the method to be consistent. Then, as n grows large, d/n should approach 1 (Shao, 1997).
Cross-Validation: Delete-d (or Leave-d-Out)
Data: {5, 6, 7, 8}
Bootstrap sample 1: {5, 5, 6, 6}Bootstrap sample 2: {6, 7, 7, 7}Bootstrap sample 3: {5, 7, 8, 8}
Bootstrap: sampling with replacement from the original data (Efron).
Bootstrap Model Selection [1]: Example
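The resampling step above can be sketched in a few lines (a minimal sketch; the helper name `bootstrap_sample` is illustrative):

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) points from data with replacement (Efron's bootstrap)."""
    return [rng.choice(data) for _ in data]

rng = random.Random(0)
samples = [bootstrap_sample([5, 6, 7, 8], rng) for _ in range(3)]
```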
Take a large number of bootstrap samples.
For all samples that do not contain observation i, calculate the prediction error for i.
Do this separately for all observations (i = 1,…,n).
Average the prediction errors.
Bootstrap Model Selection [2]: Example
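The recipe above, as a hedged sketch (this is the leave-one-out bootstrap; `fit` and `loss` are hypothetical helpers, and `n_boot` is a free choice):

```python
import random

def loo_bootstrap_error(data, fit, loss, n_boot=200, seed=0):
    """Leave-one-out bootstrap: score each observation i only with models
    fitted on bootstrap samples that do not contain i, then average."""
    rng = random.Random(seed)
    n = len(data)
    index_samples = [[rng.randrange(n) for _ in range(n)] for _ in range(n_boot)]
    per_point = []
    for i in range(n):
        errs = [loss(fit([data[j] for j in idx]), data[i])
                for idx in index_samples if i not in idx]
        if errs:  # skip i in the rare case it appeared in every sample
            per_point.append(sum(errs) / len(errs))
    return sum(per_point) / len(per_point)

mean = lambda xs: sum(xs) / len(xs)
sq_loss = lambda m, x: (m - x) ** 2
err = loo_bootstrap_error([1.0, 2.0, 3.0, 4.0, 5.0], mean, sq_loss)
```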
The previous recipe is close to split-half CV. Why? Bootstrap samples are supported on about 0.632·n of the original data points.
A correction is known as the .632 estimator.
A better correction is known as the .632+ estimator.
Main idea of bootstrap model selection is to smooth the leave-one-out cross-validation estimate of prediction error.
Bootstrap Complications
Cross-validation is intuitively attractive, but it leaves open a crucial decision: How big should the training set be? Relatively large training sets lead to overfitting, relatively small training sets lead to underfitting.
Cross-validation lacks a guiding principle. What works in one situation is not guaranteed to work in a different situation.
Interim Conclusion
In BMS, one calculates the probability of the observed data under the model of interest.
The model that assigns the highest probability to the observed data is best supported by those data.
It does not matter whether the data arrive one-by-one or all at once!
Back to Bayes [1]
P(x1, x2) = P(x2 | x1) P(x1)
It does not matter whether the data arrive one-by-one or all at once!
P(x1, x2, x3) = P(x3 | x1, x2) P(x2 | x1) P(x1)
Sequential probability forecasts = BMS.
Back to Bayes [2]
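A toy check of this identity, assuming a Bernoulli model with a uniform prior on the success rate (Laplace's rule of succession): multiplying the one-step-ahead Bayesian forecasts recovers the closed-form marginal likelihood h! t! / (n+1)!.

```python
from math import factorial

def sequential_marginal(data):
    """Multiply one-step-ahead Bayesian forecasts for 0/1 data
    under a uniform prior on the Bernoulli rate."""
    joint, heads = 1.0, 0
    for n, x in enumerate(data):
        p_heads = (heads + 1) / (n + 2)  # Laplace's rule of succession
        joint *= p_heads if x == 1 else 1 - p_heads
        heads += x
    return joint

data = [1, 1, 0]
seq = sequential_marginal(data)
h, t = sum(data), len(data) - sum(data)
closed_form = factorial(h) * factorial(t) / factorial(len(data) + 1)
```

The data can arrive one-by-one or all at once; the product is the same joint probability.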
Sequential probability forecasts = MDL.
Rissanen showed that MDL can be given a predictive interpretation (PMDL) – through sequential forecasts.
Back to Minimum Description Length (MDL)
Sequential forecasts should form the core of statistical inference.
Prequential principle, prequential probability.
Somehow, the idea never really caught on.
The idea is very simple: assess the one-step-ahead accumulative prediction error (APE).
APE has a firm theoretical foundation (BMS, MDL, prequential).
Prequential Approach (Dawid, 1984)
Assess prediction error on unseen data by pretending that a subset of the data is unseen.
As in cross-validation, a model is fitted only to a subset of the data. Predictive performance is evaluated for the data the model did not fit.
Unlike cross-validation, predictive performance only concerns the very next data point. Also, the data used for fitting the model gradually increase (i.e., sequential prediction).
Accumulative One-Step-Ahead Prediction Error (APE)
Now let’s calculate the accumulative prediction error for a given weather forecasting method M.
APE Example [1]
Given Monday, what will the weather on Tuesday be like, as predicted by forecasting method M?
APE Example [2]
Given Monday and Tuesday, what will the weather on Wednesday be like, as predicted by M?
APE Example [3]
Given Monday-Wednesday, what will the weather on Thursday be like, as predicted by M?
APE Example [4]
Given Monday-Thursday, what will the weather on Friday be like, as predicted by M?
APE Example [5]
APE Example [6]
Given Monday-Friday, what will the weather on Saturday be like, as predicted by M?
Given Monday-Saturday, what will the weather on Sunday be like, as predicted by M?
APE Example [7]
Hence, APE = 4 + 6 + 3 + 2 + 1 + 0 = 16 (assuming point prediction and absolute loss).
APE Example [8]
Based on the first i–1 observations, with i–1 always large enough to make the model identifiable, calculate a prediction for the next observation i.
Calculate the prediction error for observation i, for instance, take the squared difference between the predicted value and the observed value.
Increase i by 1 and repeat the above two steps until i = n. Sum all of the one-step-ahead prediction errors.
APE Recipe
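The recipe above can be sketched as follows (a minimal sketch: `fit`, `predict`, and `loss` are hypothetical user-supplied helpers, and `min_n` marks the smallest sample on which the model is identifiable):

```python
def ape(data, fit, predict, loss, min_n=1):
    """Accumulative one-step-ahead prediction error: fit on the first i
    observations, predict observation i+1, and sum the losses."""
    total = 0.0
    for i in range(min_n, len(data)):
        model = fit(data[:i])  # the data used for fitting grow each step
        total += loss(predict(model), data[i])
    return total

# Toy usage: predict the training mean, score with absolute loss.
mean = lambda xs: sum(xs) / len(xs)
identity = lambda m: m
abs_loss = lambda pred, obs: abs(pred - obs)
total_ape = ape([1, 2, 3, 4], mean, identity, abs_loss)
```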
APE is a data-driven method that quite directly assesses the quantity of interest, that is, prediction error on unseen data. It does not require an explicit penalty term: the penalty is in the prediction!
APE is inspired by Predictive Minimum Description Length, Bayesian Model Selection, and Dawid’s prequential principle.
Accumulative Prediction Error: Advantages [1]
APE quantifies predictive performance for a particular data set, without relying on asymptotics, and without requiring that the data-generating model is among the candidate models.
Should I go APE?
Accumulative Prediction Error: Advantages [2]
APE is sensitive to the order of the observations (though not asymptotically).
For large data sets and many candidate models the computational effort can be a burden.
ML point estimation may give extreme predictions with very few data (i.e., early in the prediction sequence).
For many loss functions, it may be difficult to relate ΔAPE to the relative confidence in the models.
Accumulative Prediction Error: Disadvantages
Apply APE to hierarchical designs.
Apply the different methods to problems of realistic complexity.
Challenges
Shao, J. (1997). An asymptotic theory for linear model selection. Statistica Sinica, 7, 221-264.
Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society: Series B, 36, 111-147.
Stone, M. (1977). Asymptotics for and against cross-validation. Biometrika, 64, 29-35.
References [1]: Cross-Validation
Efron, B., & Tibshirani, R. (1997). Improvements on cross-validation: The .632+ bootstrap method. Journal of the American Statistical Association, 92, 548-560.
Efron, B., & Gong, G. (1983). A leisurely look at the bootstrap, the jackknife, and cross-validation. The American Statistician, 37, 36-48.
Shao, J. (1996). Bootstrap model selection. Journal of the American Statistical Association, 91, 655-665.
References [2]: Bootstrap
Dawid, A. P. (1984). The prequential approach. Journal of the Royal Statistical Society, Series A, 147, 278-292.
Wagenmakers, E.-J., Grünwald, P., & Steyvers, M. (2006). Accumulative prediction error and the selection of time series models. Journal of Mathematical Psychology, 50, 149-166.[Contains many other references]
References [3]: Prequential
Confusion Matrix
                   Actual True          Actual False
Predicted True     a = true positives   b = false positives
Predicted False    c = false negatives  d = true negatives
Sensitivity vs. Specificity: Natural Science Terminology
Sensitivity (aka recall): actual positives that were correctly predicted, out of all actual positives = true positives / (true positives + false negatives)
Specificity: actual negatives that were correctly predicted, out of all actual negatives = true negatives / (false positives + true negatives)
Sensitivity = a / (a + c); Specificity = d / (b + d)
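These two formulas, directly as code (counts follow the slide's labels a = TP, b = FP, c = FN, d = TN):

```python
def sensitivity_specificity(a, b, c, d):
    """Sensitivity and specificity from confusion-matrix counts
    (a = true pos, b = false pos, c = false neg, d = true neg)."""
    sensitivity = a / (a + c)  # true positive rate
    specificity = d / (b + d)  # true negative rate
    return sensitivity, specificity

sens, spec = sensitivity_specificity(a=8, b=1, c=2, d=9)
```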
Rows = predicted (search hit returned / search hit not returned); columns = actual (desired / not desired).
Precision vs. Recall: Information Science Terminology
Precision: actual positives among all predicted positives = true positives / (true positives + false positives)
Recall (aka sensitivity): actual positives that were correctly predicted, out of all actual positives = true positives / (true positives + false negatives)
Precision = a / (a + b); Recall = a / (a + c)
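And the same for the information-retrieval pair (again with a = TP, b = FP, c = FN):

```python
def precision_recall(a, b, c):
    """Precision and recall from confusion-matrix counts
    (a = true pos, b = false pos, c = false neg)."""
    precision = a / (a + b)  # fraction of returned hits that are desired
    recall = a / (a + c)     # fraction of desired items that are returned
    return precision, recall

prec, rec = precision_recall(a=6, b=2, c=4)
```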
Receiver Operating Characteristic (ROC)
aka sensitivity vs. specificity curve; a line plot.
Ideal point: (0, 1), upper left, where false positive rate (FPR) = 0 and true positive rate (TPR) = 1. Closer to ideal = better (2-D dominance).
Note: really a plot of TPR (sensitivity) vs. FPR (1 - specificity). Plots are generated by thresholding the predictions of a trained inducer.
Wikipedia (2008). “Receiver Operating Characteristic”. Wikimedia Foundation. Retrieved March 5, 2008 from http://en.wikipedia.org/wiki/Image:Roc.png.
The controllable parameter is the predicted positive rate (therefore, every point represents one confusion matrix).
Area Under Curve (ROC AUC): Sensitivity vs. Specificity/Recall
Scalar measure of “performance”: a simultaneous measure of sensitivity and specificity/recall.
Range: 0.0 (worst case) to 1.0 (an ideal curve hugging the upper-left corner).
ROC AUC = P(score(actual positive) > score(actual negative)). Also called the discrimination of the inducer.
Wikipedia (2008). “Receiver Operating Characteristic”. Wikimedia Foundation. Retrieved March 5, 2008 from http://en.wikipedia.org/wiki/Image:Roc_space.png.
Examples are picked uniformly at random (according to the actual distribution): a relative rate, not an absolute rate.
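The probabilistic reading of AUC above can be computed directly by comparing all positive-negative score pairs (the Mann-Whitney / rank-sum view, with ties counted as one half; a sketch, not an efficient implementation):

```python
def auc_from_scores(pos_scores, neg_scores):
    """ROC AUC as P(score(actual positive) > score(actual negative)),
    estimated over all positive-negative pairs; ties count one half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else (0.5 if p == n else 0.0)
    return wins / (len(pos_scores) * len(neg_scores))

auc = auc_from_scores([0.9, 0.8, 0.4], [0.5, 0.3])
```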
Summary: End of Chapter 6 (Classification)! What Have We Learned So Far?
Supervised: so far. Unsupervised and semi-supervised: next. Reinforcement learning: later (and perhaps in other courses).
Supervised Learning and Inductive Bias
Representation-based vs. search-based (“H-based vs. L-based”)
Bias : generalization :: heuristic : search
Reference: Mitchell, “Generalization as Search”
Topics Covered: “Inducers”
H: conjunctive concepts, decision trees, frequency tables, rule sets, linear threshold gates, multi-layer perceptrons, example sets, associations
L: candidate elimination, ID3/C4.5/J48, Naïve Bayes, CN2/OneR & GABIL, Perceptron/Winnow & SVM, backprop, nearest-neighbor & RBF, a priori
Meta-L: genetic algorithms and evolutionary computation, committee machines (WM, bagging, boosting, stacking, mixture models/HME)
Terminology
Performance Element: part of the system that applies the result of learning
Flavors of Learning
Supervised: with “teacher” (often, classification from labeled examples)
Unsupervised: from data, using a similarity measure (unlabeled instances)
Reinforcement: “by doing”, with a reward/penalty signal
Supervised Learning: Target Functions
Target function: function c or f to be learned
Target: desired value y to be predicted (sometimes “target function”)
Example / labeled instance: tuples of the form <x, f(x)>
Classification function, classifier: nominal-valued f (enumerated return type)
Clustering: Application of Unsupervised Learning
Concepts and Hypotheses
Concept: function c from observations to TRUE or FALSE (membership)
Class label: output of a classification function
Hypothesis: proposed function h believed to be similar to c (or f)