TAMING THE LEARNING ZOO
SUPERVISED LEARNING ZOO
- Bayesian learning (find parameters of a probabilistic model): maximum likelihood, maximum a posteriori
- Classification: decision trees (discrete attributes, few relevant), support vector machines (continuous attributes)
- Regression: least squares (known structure, easy to interpret), neural nets (unknown structure, hard to interpret)
- Nonparametric approaches: k-nearest-neighbors, locally-weighted averaging / regression
VERY APPROXIMATE “CHEAT-SHEET” FOR TECHNIQUES DISCUSSED IN CLASS

| Technique                  | Task | Attributes | N scalability | D scalability | Capacity   |
| Bayes nets                 | C    | D          | Good          | Good          | Good       |
| Naïve Bayes                | C    | D          | Excellent     | Excellent     | Low        |
| Decision trees             | C    | D,C        | Excellent     | Excellent     | Fair       |
| Linear least squares       | R    | C          | Excellent     | Excellent     | Low        |
| Nonlinear LS               | R    | C          | Poor          | Poor          | Good       |
| Neural nets                | R    | C          | Poor          | Good          | Good       |
| SVMs                       | C    | C          | Good          | Good          | Good       |
| Nearest neighbors          | C    | D,C        | L:E, E:P      | Poor          | Excellent* |
| Locally-weighted averaging | R    | C          | L:E, E:P      | Poor          | Excellent* |
| Boosting                   | C    | D,C        | ?             | ?             | Excellent* |

(Task: C = classification, R = regression. Attributes: D = discrete, C = continuous. L:E, E:P = learning is excellent, evaluation is poor.)
Note: we have looked at a limited subset of existing techniques in this class (typically, the “classical” versions). Most techniques extend to:
- both C/R tasks (e.g., support vector regression)
- both continuous and discrete attributes
- better scalability for certain types of problem

* “Excellent” capacity holds with “sufficiently large” data sets (nearest neighbors, locally-weighted averaging) or with “sufficiently diverse” weak learners (boosting).
AGENDA
- Quantifying learner performance: cross-validation, error vs. loss, precision & recall
- Model selection
CROSS-VALIDATION
ASSESSING PERFORMANCE OF A LEARNING ALGORITHM
- Fresh samples from the underlying distribution X are typically unavailable
- Instead, take out some of the training set: train on the remaining training set, then test on the excluded instances
- This procedure is called cross-validation
CROSS-VALIDATION
- Split the original set of examples and train:
  [Figure: training examples D, labeled + and -, and a hypothesis learned from hypothesis space H]
- Evaluate the hypothesis on the held-out testing set:
  [Figure: the learned hypothesis applied to the testing set]
- Compare the true concept against the prediction:
  [Figure: predicted vs. true labels on the testing set; 9/13 correct]
COMMON SPLITTING STRATEGIES
- k-fold cross-validation
  [Figure: the dataset split into k folds; each fold serves once as the test set, the remainder as the training set]
- Leave-one-out (n-fold cross-validation)
  [Figure: each single example serves once as the test set]
COMPUTATIONAL COMPLEXITY
k-fold cross-validation requires:
- k training steps on n(k-1)/k datapoints each
- k testing steps on n/k datapoints each
The average of the k results is reported. (There are efficient ways of computing leave-one-out estimates for some nonparametric techniques, e.g. nearest neighbors.)
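As a concrete illustration, here is a minimal k-fold cross-validation sketch in Python; the train and test arguments are hypothetical callables standing in for whatever learner is being evaluated.

```python
import numpy as np

def k_fold_cv(train, test, X, y, k=10, seed=0):
    """Average test error over k splits.

    train(X, y) -> hypothesis h; test(h, X, y) -> error rate.
    Both are hypothetical placeholders for the learner under evaluation.
    """
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        held_out = folds[i]                               # ~n/k test points
        rest = np.concatenate(folds[:i] + folds[i + 1:])  # n(k-1)/k training points
        h = train(X[rest], y[rest])                       # one of k training steps
        errors.append(test(h, X[held_out], y[held_out]))  # one of k testing steps
    return float(np.mean(errors))                         # average result reported
```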
BOOTSTRAPPING
- A similar technique for estimating the confidence in the model parameters θ
- Procedure:
  1. Draw k hypothetical datasets from the original data, either via cross-validation or by sampling with replacement
  2. Fit the model for each dataset to compute its parameters θ_j
  3. Return the standard deviation of θ_1, …, θ_k (or a confidence interval)
- Can also estimate confidence in a prediction y = f(x)
SIMPLE EXAMPLE: AVERAGE OF N NUMBERS
- Data D = {x(1), …, x(N)}; the model is a constant θ
- Learning: minimize E(θ) = Σ_i (x(i) − θ)² ⟹ compute the average
- Repeat for j = 1, …, k:
  - Randomly sample a subset x(1)′, …, x(N)′ from D
  - Learn θ_j = (1/N) Σ_i x(i)′
- Return the histogram of θ_1, …, θ_k
[Plot: bootstrapped average with lower/upper range (y ≈ 0.44 to 0.56) vs. data set size |Data set| from 10 to 10000; the range tightens around the average as the data set grows]
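A minimal sketch of this example in Python; the synthetic data, k, and seed values are illustrative assumptions.

```python
import numpy as np

def bootstrap_mean(data, k=1000, seed=0):
    """Bootstrap the sample mean: resample with replacement k times,
    recompute theta_j each time, and summarize the spread of theta_1..theta_k."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    thetas = np.array([rng.choice(data, size=len(data), replace=True).mean()
                       for _ in range(k)])
    # The standard deviation (or percentiles) of the thetas estimates confidence
    return thetas.mean(), thetas.std()

# Illustrative data: as |Data set| grows, the spread around the average shrinks,
# matching the plot above.
avg, spread = bootstrap_mean(np.random.default_rng(1).normal(0.5, 0.1, 100))
```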
BEYOND ERROR RATES
- Predicting security risk: predicting “low risk” for a terrorist is far worse than predicting “high risk” for an innocent bystander (but maybe not for 5 million of them)
- Searching for images: returning irrelevant images is worse than omitting relevant ones
BIASED SAMPLE SETS
- Often there are orders of magnitude more negative examples than positive ones
- E.g., all images of Kris on Facebook: if I classify all images as “not Kris”, I’ll have >99.99% accuracy
- Examples of Kris should count much more than non-Kris examples!
FALSE POSITIVES
[Figure: true concept vs. learned concept in (x1, x2) attribute space; a new query falls where the two disagree]
- A false positive: an example incorrectly predicted to be positive
FALSE NEGATIVES
[Figure: true concept vs. learned concept in (x1, x2) attribute space; a new query falls where the two disagree]
- A false negative: an example incorrectly predicted to be negative
PRECISION VS. RECALL
- Precision: # of relevant documents retrieved / # of total documents retrieved
- Recall: # of relevant documents retrieved / # of total relevant documents
- Both are numbers between 0 and 1
Equivalently, for a classifier:
- Precision: # of true positives / (# true positives + # false positives)
- Recall: # of true positives / (# true positives + # false negatives)
- A precise classifier is selective; a classifier with high recall is inclusive
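A minimal sketch of these definitions in Python, assuming binary labels where 1 marks a positive; returning 1.0 for an empty denominator is a convention chosen here, not something from the slides.

```python
def precision_recall(y_true, y_pred):
    """Precision = TP/(TP+FP): selectivity. Recall = TP/(TP+FN): inclusiveness."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 1.0  # assumed convention: no positive predictions
    recall = tp / (tp + fn) if tp + fn else 1.0     # assumed convention: no positive examples
    return precision, recall
```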
REDUCING FALSE POSITIVE RATE
[Figure: the learned concept shrunk relative to the true concept in (x1, x2) space, so fewer examples are predicted positive]
REDUCING FALSE NEGATIVE RATE
[Figure: the learned concept enlarged relative to the true concept in (x1, x2) space, so fewer examples are predicted negative]
PRECISION-RECALL CURVES
- Measure precision vs. recall as the classification boundary is tuned
  [Figure: a precision-recall curve; a perfect classifier sits at precision = recall = 1, while actual performance traces a curve below it]
- Different points along the curve penalize false negatives, penalize false positives, or weight both equally
- A curve that dominates another indicates better learning performance
  [Figure: two precision-recall curves, the higher one showing better learning performance]
OPTION 1: CLASSIFICATION THRESHOLDS
- Many learning algorithms (e.g., linear models, neural nets, Bayes nets, SVMs) give a real-valued output v(x) that needs thresholding for classification:
  v(x) > t ⟹ positive label given to x
  v(x) < t ⟹ negative label given to x
- The threshold t may be tuned to get fewer false positives or fewer false negatives
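A sketch of threshold tuning, reusing the precision_recall function sketched earlier: sweeping the threshold t over the real-valued outputs v(x) traces out the precision-recall curve from the previous slides.

```python
import numpy as np

def pr_points(scores, y_true, thresholds):
    """For each threshold t, label x positive iff v(x) > t and record
    the resulting (precision, recall) operating point."""
    scores = np.asarray(scores)
    points = []
    for t in thresholds:
        y_pred = (scores > t).astype(int)   # v(x) > t  =>  positive label
        p, r = precision_recall(y_true, y_pred)
        points.append((t, p, r))            # raising t trades recall for precision
    return points
```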
OPTION 2: LOSS FUNCTIONS & WEIGHTED DATASETS
- General learning problem: “Given data D and loss function L, find the hypothesis from hypothesis class H that minimizes L”
- Loss functions: L may contain weights to favor accuracy on positive or negative examples, e.g. L = 10·E₊ + 1·E₋ (with E₊, E₋ the errors on positive and negative examples)
- Weighted datasets: attach a weight w to each example to indicate how important it is
- Alternatively, construct a resampled dataset D′ where each example is duplicated proportionally to its weight w, as in the sketch below
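A minimal sketch of the resampling construction, assuming numpy arrays and weights scaled so the smallest weight yields one copy; rounding weights to integer duplicate counts is an approximation made here.

```python
import numpy as np

def resample_by_weight(X, y, w):
    """Build D' by duplicating each example in proportion to its weight w.
    An unweighted learner on D' then approximately minimizes the weighted
    loss on D (e.g., w=10 on positives, w=1 on negatives ~ L = 10*E+ + 1*E-)."""
    w = np.asarray(w, dtype=float)
    copies = np.maximum(1, np.round(w / w.min()).astype(int))  # relative duplicate counts
    idx = np.repeat(np.arange(len(y)), copies)
    return X[idx], y[idx]
```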
MODEL SELECTION
COMPLEXITY VS. GOODNESS OF FIT
- More complex models can fit the data better, but can overfit
- Model selection: enumerate several possible hypothesis classes of increasing complexity, and stop when the cross-validated error levels off
- Regularization: explicitly define a metric of complexity and penalize it in addition to the loss
MODEL SELECTION WITH K-FOLD CROSS-VALIDATION
- Parameterize the learner by a complexity level C
- Model selection pseudocode:
  For increasing levels of complexity C:
    errT[C], errV[C] = Cross-Validate(Learner, C, examples)
    if errT has converged: stop
  Find the value Cbest that minimizes errV[C]
  Return Learner(Cbest, examples)
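The pseudocode made concrete as a Python sketch; Cross-Validate is assumed to be a callable returning (training error, validation error) for a given complexity C, and the tolerance-based convergence test on errT is an illustrative choice.

```python
def select_model(learner, cross_validate, examples, max_C=20, tol=1e-3):
    """Increase complexity C until training error converges, then return
    the learner trained at the C with the lowest validation error."""
    errT, errV = {}, {}
    for C in range(1, max_C + 1):
        errT[C], errV[C] = cross_validate(learner, C, examples)
        if C > 1 and abs(errT[C] - errT[C - 1]) < tol:  # errT has converged
            break
    C_best = min(errV, key=errV.get)                    # minimize errV[C]
    return learner(C_best, examples)
```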
REGULARIZATION
- Minimize: Cost(h) = Loss(h) + Complexity(h)
- Example with linear models y = θᵀx:
  - L2 error: Loss(θ) = Σ_i (y(i) − θᵀx(i))²
  - Lq regularization: Complexity(θ) = Σ_j |θ_j|^q
- L2 and L1 are the most popular regularizers for linear models
- L2 regularization leads to a simple computation of the optimal θ, as in the sketch below
- L1 is more complex to optimize, but produces sparse models in which many coefficients are 0!
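A minimal sketch of that simple computation for L2 (ridge) regularization, assuming a fixed regularization weight lam and no unregularized bias term.

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Minimize sum_i (y(i) - theta^T x(i))^2 + lam * sum_j theta_j^2.
    Setting the gradient to zero gives the closed form
    theta = (X^T X + lam*I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```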
DATA DREDGING
- As the number of attributes increases, so does the likelihood that a learner picks up on patterns that arise purely by chance
- In the extreme case where there are more attributes than datapoints (e.g., pixels in a video), even very simple hypothesis classes, such as linear classifiers, can overfit (see the demonstration below)
- Many opportunities for charlatans in the big data age!
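A small demonstration of the extreme case, with purely random attributes and labels as an assumption: when attributes outnumber datapoints, even plain linear least squares fits the noise perfectly.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 200                                  # more attributes than datapoints
X = rng.normal(size=(N, D))                     # attributes are pure noise
y = rng.choice([-1.0, 1.0], size=N)             # labels are pure chance

theta, *_ = np.linalg.lstsq(X, y, rcond=None)   # linear least-squares "classifier"
print(np.mean(np.sign(X @ theta) == y))         # 1.0: perfect fit, zero real signal
```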
OTHER TOPICS IN MACHINE LEARNING
- Unsupervised learning: dimensionality reduction, clustering
- Reinforcement learning: an agent that acts and learns how to act in an environment by observing rewards
- Learning from demonstration: an agent that learns how to act in an environment by observing demonstrations from an expert
ISSUES IN PRACTICE
- The distinctions between learning algorithms diminish when you have a lot of data
- The web has made it much easier to gather large-scale datasets than in the early days of ML
- Understanding data with many more attributes than examples is still a major challenge! Do humans just have really great priors?
NEXT LECTURES
- Temporal sequence models (R&N Ch. 15)
- Decision-theoretic planning
- Reinforcement learning
- Applications of AI