START OF DAY 5
Reading: Chap. 8
Support Vector Machine
Revisiting Linear Classification
• Recall:
– Perceptron can only solve linearly-separable tasks
– Non-linear dimensions (e.g., x², xy) can be added that may make the new task linearly separable
• Two questions here:
– Is there an optimal way to do linear classification?
– Is there a systematic way to leverage it in higher dimensions?
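The point about added non-linear dimensions can be made concrete with XOR: in R² the four points are not linearly separable, but adding the product feature x1·x2 makes them so. A minimal sketch in Python (the added feature and the hand-picked hyperplane are illustrative choices, not from the slides):

```python
import numpy as np

# XOR in R^2: no single line separates the two classes
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])  # XOR labels

# Add the non-linear dimension x1*x2, mapping each point into R^3
phi = np.column_stack([X, X[:, 0] * X[:, 1]])

# A separating hyperplane in the new space (chosen by hand for illustration)
w = np.array([1.0, 1.0, -2.0])
b = -0.5
pred = np.sign(phi @ w + b)  # matches y on all four points
```

In the lifted space the plane w.φ(x) + b = 0 classifies all four XOR points correctly, which a perceptron in the original R² cannot do.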
Maximal-Margin Classification (I)
• Consider a 2-class problem in R^d
• As needed (and without loss of generality), relabel the classes to -1 and +1
• Suppose we have a separating hyperplane
– Its equation is: w.x + b = 0
• w is normal to the hyperplane
• |b|/||w|| is the perpendicular distance from the hyperplane to the origin
• ||w|| is the Euclidean norm of w
Maximal-Margin Classification (II)
• We can certainly choose w and b in such a way that:
– w.xi + b > 0 when yi = +1
– w.xi + b < 0 when yi = -1
• Rescaling w and b so that the closest points to the hyperplane satisfy |w.xi + b| = 1, we can rewrite the above as:
– w.xi + b ≥ +1 when yi = +1 (1)
– w.xi + b ≤ -1 when yi = -1 (2)
Maximal-Margin Classification (III)
• Consider the case when (1) is an equality
– w.xi + b = +1 (H+)
• Normal w
• Distance from origin: |1-b|/||w||
• Similarly for (2)
– w.xi + b = -1 (H-)
• Normal w
• Distance from origin: |-1-b|/||w||
• We now have two hyperplanes (parallel to the original)
Maximal-Margin Classification (IV)
Maximal-Margin Classification (V)
• Note that the points on H- and H+ are sufficient to define H- and H+ and therefore are sufficient to build a linear classifier
• Define the margin as the distance between H- and H+
• What would be a good choice for w and b?
– Maximize the margin
Support Vectors
Maximal-Margin Classification (VI)
• From the equations of H- and H+ (using signed distances along w), we have:
– Margin = (1-b)/||w|| - (-1-b)/||w|| = 2/||w||
• So, we can maximize the margin by:
– Minimizing ||w||²
– Subject to: yi(w.xi + b) - 1 ≥ 0 (see (1) and (2) above)
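This constrained minimization can be sanity-checked numerically on a toy dataset. The sketch below uses scipy's general-purpose constrained solver rather than a dedicated SVM optimizer; the data and starting point are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data: two points per class
X = np.array([[1.0, 0.0], [2.0, 1.0], [-1.0, 0.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Parameter vector p = (w1, w2, b); minimize ||w||^2
objective = lambda p: p[0] ** 2 + p[1] ** 2

# One inequality constraint per training instance: yi(w.xi + b) - 1 >= 0
cons = [{"type": "ineq",
         "fun": lambda p, xi=xi, yi=yi: yi * (p[:2] @ xi + p[2]) - 1}
        for xi, yi in zip(X, y)]

res = minimize(objective, x0=np.array([1.0, 1.0, 0.0]), constraints=cons)
w, b = res.x[:2], res.x[2]
margin = 2 / np.linalg.norm(w)  # should approach 2 for this data
```

For this data the optimum is w = (1, 0), b = 0, giving the maximal margin 2/||w|| = 2.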
Minimizing ||w||²
• Use Lagrange multipliers, one per constraint (i.e., one per training instance)
– For constraints of the form ci ≥ 0 (see above):
• The constraint equations are multiplied by non-negative Lagrange multipliers αi, and
• Subtracted from the objective function
• Hence, we have the Lagrangian:
LP = ½||w||² - Σi αi [yi(w.xi + b) - 1]
Maximizing LD
• It turns out, after some transformations beyond the scope of our discussion, that minimizing LP is equivalent to maximizing the following dual Lagrangian:
LD = Σi αi - ½ Σi Σj αi αj yi yj <xi,xj>
– Where <xi,xj> denotes the dot product
• Subject to: αi ≥ 0 and Σi αi yi = 0
• Support vectors are those instances for which αi ≠ 0
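On a two-point toy problem the dual can be maximized by brute force: the constraint Σi αi yi = 0 forces the two multipliers to be equal, so LD becomes a function of a single α. A minimal sketch (the data is illustrative):

```python
import numpy as np

x = np.array([[1.0, 0.0], [-1.0, 0.0]])  # one point per class
y = np.array([1.0, -1.0])

# sum_i alpha_i y_i = 0 forces alpha_1 = alpha_2 = a, so grid-search over a
a = np.linspace(0.0, 2.0, 2001)
K = x @ x.T  # matrix of dot products <xi, xj>

# LD = sum_i a_i - 1/2 sum_ij a_i a_j y_i y_j <xi,xj>; with a_i = a this is
# 2a - (a^2 / 2) * sum_ij y_i y_j <xi,xj>
LD = 2 * a - 0.5 * a**2 * (y[:, None] * y[None, :] * K).sum()

a_star = a[np.argmax(LD)]  # maximizer, 0.5 here
w = (a_star * y) @ x       # recover w = sum_i alpha_i y_i xi, giving (1, 0)
```

Both points get αi = 0.5 ≠ 0, so both are support vectors, and the recovered w matches the maximal-margin hyperplane.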
SVM Learning (I)
• We could stop here and we would have a nice linear classification algorithm.
• SVM goes one step further:
– It assumes that non-linearly separable problems in low dimensions may become linearly separable in higher dimensions (e.g., XOR)
SVM Learning (II)
• SVM thus:
– Creates a non-linear mapping from the low dimensional space to a higher dimensional space
– Uses MM learning in the new space
• Computation is efficient when “good” transformations are selected
– The kernel trick
Choosing a Transformation (I)
• Recall the formula for LD
• Note that it involves a dot product
– Expensive to compute in high dimensions
– Gets worse if we transform to more dimensions
• What if we did not have to?
Choosing a Transformation (II)
• It turns out that it is possible to design transformations φ such that:
– <φ(x), φ(y)> can be expressed in terms of <x,y>
• Hence, one needs only compute in the original lower dimensional space
• Example:
– φ: R² → R³ where φ(x) = (x1², √2·x1x2, x2²)
– Then <φ(x), φ(y)> = <x,y>²
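The identity <φ(x), φ(y)> = <x,y>² for this φ can be verified numerically; a minimal sketch with illustrative random inputs:

```python
import numpy as np

rng = np.random.default_rng(42)
x, y = rng.normal(size=2), rng.normal(size=2)

def phi(v):
    # phi: R^2 -> R^3 from the example above
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

lhs = phi(x) @ phi(y)  # dot product computed in R^3
rhs = (x @ y) ** 2     # same quantity computed entirely in R^2
```

Expanding phi(x).phi(y) gives x1²y1² + 2·x1x2y1y2 + x2²y2² = (x1y1 + x2y2)², so the two sides agree for every x and y, which is exactly the kernel trick: the high-dimensional dot product never has to be formed explicitly.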
Choosing a Kernel
• Can start from a desired feature space and try to construct a kernel for it
• More often, one starts from a reasonable kernel and may not analyze the feature space
• Some kernels are a better fit for certain problems; domain knowledge can be helpful
• Common kernels:
– Polynomial
– Gaussian
– Sigmoidal
– Application specific
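The first three common kernels can be written directly as functions of two input vectors; the parameter names and default values below are illustrative, not canonical:

```python
import numpy as np

def polynomial_kernel(x, y, degree=3, c=1.0):
    # (x.y + c)^degree
    return (x @ y + c) ** degree

def gaussian_kernel(x, y, sigma=1.0):
    # exp(-||x - y||^2 / (2 sigma^2)); equals 1 when x == y
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=1.0, theta=0.0):
    # tanh(kappa * x.y + theta); output always in (-1, 1)
    return np.tanh(kappa * (x @ y) + theta)

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])
```

With degree = 1 and c = 0 the polynomial kernel reduces to the plain dot product, recovering the linear classifier as a special case.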
SVM Notes
• Excellent empirical and theoretical potential
• Multi-class problems not handled naturally
• How to choose the kernel – the main learning parameter
– Also includes other parameters to be defined (degree of polynomials, variance of Gaussians, etc.)
• Speed and size: both training and testing; how to handle very large training sets is not yet solved
• MM can lead to overfitting due to noise, or the problem may not be linearly separable within a reasonable feature space
– Soft Margin is a common solution: it allows slack variables
– αi constrained to be ≥ 0 and ≤ C; the C allows outliers. How to pick C?
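One way to see the role of C is to train the soft-margin objective ½||w||² + C·Σi max(0, 1 - yi(w.xi + b)) by subgradient descent. This is a hedged sketch on synthetic data, not the solver real SVM packages use (SMO or similar); the data, learning rate, and epoch count are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated Gaussian blobs, 20 points per class
X = np.vstack([rng.normal([2, 2], 0.5, size=(20, 2)),
               rng.normal([-2, -2], 0.5, size=(20, 2))])
y = np.array([1.0] * 20 + [-1.0] * 20)

def train_soft_margin(X, y, C, epochs=500, lr=0.01):
    # Minimize 1/2 ||w||^2 + C * sum_i max(0, 1 - yi(w.xi + b))
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1  # instances with non-zero slack
        grad_w = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

w, b = train_soft_margin(X, y, C=1.0)
acc = np.mean(np.sign(X @ w + b) == y)
```

In practice C is picked by cross-validation: small C tolerates outliers (wide margin, more slack), large C approaches the hard-margin behavior.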
Chunking
• Start with a reasonably sized subset of the data set (one that fits in memory and does not take too long during training)
• Train on this subset and keep just the support vectors or the m patterns with the highest αi values
• Grab another subset, add the current support vectors to it, and continue training
• Note that this training may allow previous support vectors to be dropped as better ones are discovered
• Repeat until all data is used and no new support vectors are added, or some other stopping criterion is fulfilled
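The chunking loop can be sketched with scikit-learn's SVC standing in for the SVM trainer, assuming sklearn is available; the dataset and chunk size are illustrative:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)  # linearly separable labels

chunk_size = 200
sv_X = np.empty((0, 2))   # support vectors carried between chunks
sv_y = np.empty((0,))
clf = None
for start in range(0, len(X), chunk_size):
    # Current chunk plus the support vectors kept so far
    Xc = np.vstack([sv_X, X[start:start + chunk_size]])
    yc = np.concatenate([sv_y, y[start:start + chunk_size]])
    clf = SVC(kernel="linear", C=1.0).fit(Xc, yc)
    # Keep only the support vectors; earlier ones may be dropped here
    sv_X = clf.support_vectors_
    sv_y = yc[clf.support_]

acc = (clf.predict(X) == y).mean()  # final model scored on the full set
```

Each fit only ever sees one chunk plus the surviving support vectors, so memory stays bounded even though the final model is evaluated on all the data.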
Comparing Classifiers
Statistical Significance
• How do we know that some measurement is statistically significant vs. being just a random perturbation?
– How good a predictor of generalization accuracy is the sample accuracy on a test set?
– Is a particular hypothesis really better than another one because its accuracy is higher on a validation set?
– When can we say that one learning algorithm is better than another for a particular task or set of tasks?
• For example, if learning algorithm 1 gets 95% accuracy and learning algorithm 2 gets 93% on a task, can we say with some confidence that algorithm 1 is superior in general for that task?
• Question becomes: What is the likely difference between the sample error (estimator of the parameter) and the true error (true parameter value)?
• Key point – What is the probability that the differences observed in our results are just due to chance, and thus not significant?
Sample Error
• Error of hypothesis h with respect to function f and sample S:

errorS(h) = (1/|S|) Σx∈S δ(h(x), f(x))

where δ(h(x), f(x)) = 1 if h(x) ≠ f(x), and 0 otherwise
True Error
• Error of hypothesis h with respect to function f and distribution D:

errorD(h) = Prx∈D[h(x) ≠ f(x)]
The Question
• We wish to know errorD(h)
• We can only measure errorS(h)
How good an estimate of errorD(h) is provided by errorS(h)?
Confidence Interval
• If h is a discrete-valued hypothesis, |S| = n ≥ 30, and the examples are drawn independently of h and of one another, then with N% probability, errorD(h) lies in the interval:

errorS(h) ± zN √(errorS(h)(1 - errorS(h))/n)

Confidence level N%: 50%  68%  80%  90%  95%  98%  99%
Constant zN:         0.67 1.00 1.28 1.64 1.96 2.33 2.58
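The interval can be computed directly from the formula above; the helper name below is illustrative:

```python
import math

def confidence_interval(error_s, n, z_n=1.96):
    # N% confidence interval for errorD(h), given sample error errorS(h)
    # measured on n examples; z_n = 1.96 corresponds to N = 95%
    half = z_n * math.sqrt(error_s * (1.0 - error_s) / n)
    return error_s - half, error_s + half

low, high = confidence_interval(0.10, 100)  # about (0.041, 0.159)
```

So a 10% sample error on 100 test examples only pins the true error down to roughly the 4%-16% range at 95% confidence; quadrupling n halves the width of the interval.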
A Few Useful Facts (I)
• The expected value of a random variable X, also known as the mean, is defined by:

E[X] = Σi xi Pr(X = xi)

• The Binomial distribution gives the probability of observing r heads in a series of n independent coin tosses, if the probability of heads in a single toss is p:

Pr(X = r) = (n! / (r!(n-r)!)) p^r (1-p)^(n-r)
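A direct transcription of the binomial formula, using Python's exact binomial coefficient:

```python
from math import comb

def binomial_pmf(r, n, p):
    # Pr(X = r): probability of r heads in n independent tosses,
    # where a single toss comes up heads with probability p
    return comb(n, r) * p**r * (1 - p)**(n - r)

prob = binomial_pmf(5, 10, 0.5)  # 252/1024 = 0.24609375
```

Summing the pmf over r = 0..n returns 1, as it must for a probability distribution.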
A Few Useful Facts (II)
• The Normal distribution is the well-known bell-shaped distribution, arising often in nature:

Pr(x) = (1/√(2πσ²)) e^(-(x-μ)²/(2σ²))

• Expected values:
– E[XBinomial] = np
– E[XNormal] = μ
A Few Useful Facts (III)
• Estimating p from a random sample of coin tosses is equivalent to estimating errorD(h) from testing h on a random sample from D:
– Single coin toss ↔ single instance drawing
– Probability p that a single coin toss is heads ↔ probability that a single instance is misclassified
– Number r of heads observed over a sample of n coin tosses ↔ number of misclassifications observed over n randomly drawn instances
• Hence, we have:
– p = errorD(h)
– r/n = errorS(h)
A Few Useful Facts (IV)
• A random variable can be viewed as the name of an experiment whose outcome is probabilistic
• An estimator is a random variable X used to estimate some parameter p (e.g., the mean) of an underlying population
• The bias of an estimator X is defined by:

bias = E[X] - p

• An estimator X is unbiased if and only if:

E[X] = p
A Few Useful Facts (V)
• Since we have:
– p = errorD(h)
– r/n = errorS(h)
– E[XBinomial] = np
• It follows that:
– E[errorS(h)] = E[r/n] = E[r]/n = np/n = p = errorD(h)
• Hence, errorS(h) is an unbiased estimator of errorD(h)
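The unbiasedness claim can be checked by simulation: draw many size-n test samples, count misclassifications as Binomial(n, p) draws, and the mean of the sample errors r/n should approach the true error p (the values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.2  # true error, errorD(h)
n = 50   # test-set size

# Each trial: test h on n random instances; the misclassification count r
# is Binomial(n, p), and the sample error is r/n
sample_errors = rng.binomial(n, p, size=20000) / n

mean_error = sample_errors.mean()  # close to p = 0.2
```

Individual sample errors scatter widely around 0.2 (any multiple of 1/50 is possible), but their average converges to the true error, which is what unbiasedness promises.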
Comparing Hypotheses
• We wish to estimate errorD(h1) - errorD(h2)
• We measure errorS1(h1) – errorS2(h2), which turns out to be an unbiased estimator
• In practice, it is OK to measure on the same test set errorS(h1) – errorS(h2), which has lower variance
Comparing Classifiers (I)
• We wish to estimate:

ES⊂D[errorD(LA(S)) - errorD(LB(S))]

• In practice, we have a sample D0. Split it into a training set S0 and a test set T0, and measure:

errorT0(LA(S0)) - errorT0(LB(S0))
Comparing Classifiers (II)
• Problems:
– We use errorT0(h) to estimate errorD(h)
– We measure the difference on S0 alone (not the expected value over all samples)
• Improvement:
– Repeatedly partition the dataset D0 into N disjoint train (Si) / test (Ti) sets (e.g., using n-fold cross-validation)
– Compute:

δi = errorTi(LA(Si)) - errorTi(LB(Si)) and δ̄ = (1/N) Σi δi
Comparing Classifiers (III)
• Derive:

t = δ̄ / √((1/(N(N-1))) Σi (δi - δ̄)²)

• t is known as the Student's t statistic
– Choose a significance level q
– With N-1 degrees of freedom, if t exceeds the critical value tc in the table, then the difference is statistically significant at that level
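The per-fold computation can be sketched directly; the δi values below are made up for illustration:

```python
import numpy as np

# Per-fold error differences delta_i = errorTi(LA(Si)) - errorTi(LB(Si)),
# illustrative values for N = 10 folds
deltas = np.array([0.02, 0.03, 0.01, 0.04, 0.02, 0.03, 0.02, 0.03, 0.01, 0.04])

N = len(deltas)
d_bar = deltas.mean()  # mean difference, 0.025 here
# Standard error of the mean difference: sqrt((1/(N(N-1))) * sum (d_i - d_bar)^2)
s = np.sqrt(((deltas - d_bar) ** 2).sum() / (N * (N - 1)))
t = d_bar / s  # about 7.32
```

With N-1 = 9 degrees of freedom, the 95% critical value is 2.262, so a t around 7.3 means algorithm A's advantage over B is statistically significant at that level.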
END OF DAY 5
Homework: SVM, Neural Networks