START OF DAY 5
Reading: Chap. 8

Support Vector Machine

Revisiting Linear Classification

• Recall:
– The perceptron can only solve linearly-separable tasks
– Non-linear features (e.g., x², x·y) can be added that may make the new task linearly separable
• Two questions here:
– Is there an optimal way to do linear classification?
– Is there a systematic way to leverage it in higher dimensions?

Maximal-Margin Classification (I)

• Consider a 2-class problem in R^d

• As needed (and without loss of generality), relabel the classes to -1 and +1

• Suppose we have a separating hyperplane
– Its equation is: w·x + b = 0
• w is normal to the hyperplane
• |b|/||w|| is the perpendicular distance from the hyperplane to the origin
• ||w|| is the Euclidean norm of w

Maximal-Margin Classification (II)

• We can certainly choose w and b in such a way that:
– w·xi + b > 0 when yi = +1
– w·xi + b < 0 when yi = -1
• Rescaling w and b so that the closest points to the hyperplane satisfy |w·xi + b| = 1, we can rewrite the above as:
– w·xi + b ≥ +1 when yi = +1 (1)
– w·xi + b ≤ -1 when yi = -1 (2)

Maximal-Margin Classification (III)

• Consider the case when (1) is an equality: w·xi + b = +1 (H+)
– Normal: w
– Distance from origin: |1-b|/||w||
• Similarly for (2): w·xi + b = -1 (H-)
– Normal: w
– Distance from origin: |-1-b|/||w||
• We now have two hyperplanes (parallel to the original)

Maximal-Margin Classification (IV)

Maximal-Margin Classification (V)

• Note that the points lying on H- and H+ suffice to define these two hyperplanes, and therefore suffice to build a linear classifier

• Define the margin as the distance between H- and H+

• What would be a good choice for w and b?
– Maximize the margin

Support Vectors

Maximal-Margin Classification (VI)

• From the equations of H- and H+, the margin (the perpendicular distance between the two parallel hyperplanes) is:
– Margin = 2/||w||
• So, we can maximize the margin by:
– Minimizing ||w||²
– Subject to: yi(w·xi + b) - 1 ≥ 0 (combining (1) and (2) above)

Minimizing ||w||²

• Use Lagrange multipliers, one per constraint (i.e., one per training instance)
– For constraints of the form ci ≥ 0 (see above):
• The constraint equations are multiplied by non-negative Lagrange multipliers, and
• Subtracted from the objective function
• Hence, we have the (primal) Lagrangian:

LP = ½||w||² - Σi αi [yi(w·xi + b) - 1], with αi ≥ 0

Maximizing LD

• It turns out, after some transformations beyond the scope of our discussion, that minimizing LP is equivalent to maximizing the following dual Lagrangian:

LD = Σi αi - ½ Σi Σj αi αj yi yj <xi, xj>

– Where <xi, xj> denotes the dot product

subject to: αi ≥ 0 and Σi αi yi = 0

• Support vectors are those instances for which αi ≠ 0 (see the sketch below)
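To make the formulation concrete, here is a minimal sketch of maximal-margin linear classification, assuming scikit-learn (the library choice and the toy data are illustrative, not from the slides); a very large C approximates the hard-margin problem:

```python
# A minimal maximal-margin sketch, assuming scikit-learn; toy data is made up.
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data, labels relabeled to {-1, +1} as in the slides
X = np.array([[1.0, 1.0], [2.0, 2.5], [0.5, 2.0],
              [3.0, 0.5], [4.0, 1.0], [3.5, -0.5]])
y = np.array([+1, +1, +1, -1, -1, -1])

# A very large C approximates the hard-margin formulation
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]           # hyperplane w.x + b = 0
print("w =", w, "b =", b)
print("margin = 2/||w|| =", 2 / np.linalg.norm(w))
print("support vectors:", clf.support_vectors_)  # instances with alpha != 0
```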

SVM Learning (I)

• We could stop here and we would have a nice linear classification algorithm.

• SVM goes one step further:
– It assumes that non-linearly separable problems in low dimensions may become linearly separable in higher dimensions (e.g., XOR)

SVM Learning (II)

• SVM thus:
– Creates a non-linear mapping from the low-dimensional space to a higher-dimensional space
– Uses MM learning in the new space
• Computation is efficient when "good" transformations are selected
– The kernel trick

Choosing a Transformation (I)

• Recall the formula for LD

• Note that it involves a dot product
– Expensive to compute in high dimensions
– Gets worse if we transform to more dimensions

• What if we did not have to?

Choosing a Transformation (II)

• It turns out that it is possible to design transformations φ such that:
– <φ(x), φ(y)> can be expressed in terms of <x, y>
• Hence, one needs only compute in the original, lower-dimensional space
• Example:
– φ: R² → R³ where φ(x) = (x1², √2·x1·x2, x2²), for which <φ(x), φ(y)> = <x, y>²
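A quick numeric check of this identity (illustrative, not from the slides):

```python
# Verifying <phi(x), phi(y)> = <x, y>^2 for phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2).
import numpy as np

def phi(v):
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

lhs = phi(x) @ phi(y)   # dot product computed in the 3-D feature space
rhs = (x @ y) ** 2      # same value computed in the original 2-D space
print(lhs, rhs)         # both print 1.0: (1*3 + 2*(-1))^2 = 1
```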

Choosing a Kernel

• One can start from a desired feature space and try to construct a kernel

• More often, one starts from a reasonable kernel and may not analyze the feature space

• Some kernels are a better fit for certain problems; domain knowledge can be helpful

• Common kernels (compared in the sketch below):
– Polynomial
– Gaussian
– Sigmoidal
– Application-specific
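As an illustration of trying these kernels in practice, here is a sketch assuming scikit-learn; the XOR-like data and the parameter values are made up for the example:

```python
# Comparing common kernels on the same (non-linearly-separable) data.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)   # XOR-like labeling

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel, degree=3, gamma="scale")  # degree affects 'poly' only
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{kernel:8s} cross-validated accuracy = {acc:.2f}")
```

The linear kernel should do poorly here while rbf and poly recover the XOR structure, illustrating why kernel choice is domain-dependent.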

SVM Notes

• Excellent empirical and theoretical potential
• Multi-class problems not handled naturally
• How to choose the kernel is the main learning parameter
– It also brings other parameters to be set (degree of polynomials, variance of Gaussians, etc.)
• Speed and size affect both training and testing; how to handle very large training sets is not yet solved
• MM can lead to overfitting due to noise, or the problem may not be linearly separable within a reasonable feature space
– Soft Margin is a common solution: it allows slack variables
– Each αi is then constrained to satisfy 0 ≤ αi ≤ C; the upper bound C is what tolerates outliers

How to pick C? Typically via a validation set or cross-validation, as in the sketch below.
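A minimal cross-validation sketch for picking C, assuming scikit-learn (the grid values and data are illustrative):

```python
# Picking the soft-margin constant C by cross-validated grid search.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 2))
y = np.where(X[:, 0] + 0.3 * rng.normal(size=150) > 0, 1, -1)  # noisy labels

# Search a logarithmic grid: larger C = harder margin, fewer tolerated outliers
grid = GridSearchCV(SVC(kernel="rbf", gamma="scale"),
                    param_grid={"C": [0.01, 0.1, 1, 10, 100]},
                    cv=5)
grid.fit(X, y)
print("best C:", grid.best_params_["C"],
      "cv accuracy:", round(grid.best_score_, 3))
```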

Chunking

• Start with a reasonably sized subset of the data set (one that fits in memory and does not take too long during training)
• Train on this subset and keep just the support vectors, or the m patterns with the highest αi values
• Grab another subset, add the current support vectors to it, and continue training
• Note that this training may allow previous support vectors to be dropped as better ones are discovered
• Repeat until all data is used and no new support vectors are added, or some other stopping criterion is fulfilled (see the sketch below)
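A minimal sketch of this chunking loop, assuming scikit-learn's SVC as the inner solver (the chunk size and data are illustrative; a production version would also check the stopping criteria above):

```python
# Chunking: train on manageable subsets, carrying support vectors forward.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(5000, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.5, 1, -1)

chunk_size = 500
sv_X = np.empty((0, 2))              # support vectors carried between rounds
sv_y = np.empty(0, dtype=int)

for start in range(0, len(X), chunk_size):
    # Current chunk plus the support vectors found so far
    Xc = np.vstack([X[start:start + chunk_size], sv_X])
    yc = np.concatenate([y[start:start + chunk_size], sv_y])
    clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(Xc, yc)
    # Keep only the new support vectors; earlier ones may be dropped here
    sv_X, sv_y = clf.support_vectors_, yc[clf.support_]
    print(f"after chunk ending at {start + chunk_size}: {len(sv_X)} support vectors")
```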

Comparing Classifiers

Statistical Significance

• How do we know that some measurement is statistically significant vs. being just a random perturbation?
– How good a predictor of generalization accuracy is the sample accuracy on a test set?
– Is a particular hypothesis really better than another one because its accuracy is higher on a validation set?
– When can we say that one learning algorithm is better than another for a particular task or set of tasks?

• For example, if learning algorithm 1 gets 95% accuracy and learning algorithm 2 gets 93% on a task, can we say with some confidence that algorithm 1 is superior in general for that task?

• Question becomes: What is the likely difference between the sample error (estimator of the parameter) and the true error (true parameter value)?

• Key point – What is the probability that the differences observed in our results are just due to chance, and thus not significant?

Sample Error

• Error of hypothesis h with respect to function f and sample S

errorS(h) = (1/n) Σ_{x∈S} δ(f(x), h(x))

where n = |S| and δ(f(x), h(x)) = 1 if h(x) ≠ f(x), and 0 otherwise
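A small sketch of this definition (the target f and hypothesis h below are hypothetical stand-ins):

```python
# Sample error: fraction of instances in S on which h disagrees with f.
import numpy as np

def sample_error(h, f, S):
    """Mean of delta(f(x), h(x)) over the sample S."""
    return np.mean([h(x) != f(x) for x in S])

f = lambda x: 1 if x > 0.5 else 0   # true target function (made up)
h = lambda x: 1 if x > 0.4 else 0   # learned hypothesis (made up)
S = np.linspace(0, 1, 100)          # test sample

print(sample_error(h, f, S))        # ~0.1: h and f disagree only on (0.4, 0.5]
```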

True Error

• Error of hypothesis h with respect to function f and distribution D

errorD(h) = Pr_{x∈D} [h(x) ≠ f(x)]

The Question

• We wish to know errorD(h)

• We can only measure errorS(h)

How good an estimate of errorD(h) is provided by errorS(h)?

Confidence Interval

• If h is a discrete-valued hypothesis, |S| = n ≥ 30, and the examples are drawn independently of h and of one another, then with N% probability, errorD(h) lies in the interval:

errorS(h) ± zN √( errorS(h)(1 - errorS(h)) / n )

Confidence level N%:  50%   68%   80%   90%   95%   98%   99%
Constant zN:          0.67  1.00  1.28  1.64  1.96  2.33  2.58
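A direct implementation of this interval, using only the standard library (the example counts are made up):

```python
# N% confidence interval for errorD(h) given the sample error on n instances.
import math

def error_confidence_interval(error_s, n, z_n=1.96):
    """errorS(h) +/- z_N * sqrt(errorS(h) * (1 - errorS(h)) / n)."""
    half_width = z_n * math.sqrt(error_s * (1 - error_s) / n)
    return error_s - half_width, error_s + half_width

# Example: 13 misclassifications out of n = 100 test instances, 95% level
lo, hi = error_confidence_interval(13 / 100, n=100)
print(f"errorD(h) in [{lo:.3f}, {hi:.3f}] with ~95% probability")
```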

A Few Useful Facts (I)

• The expected value of a random variable X, also known as the mean, is defined by:

E[X] = Σi xi · Pr(X = xi)

• The Binomial distribution gives the probability of observing r heads in a series of n independent coin tosses, if the probability of heads in a single toss is p:

Pr(X = r) = ( n! / (r!(n - r)!) ) · p^r (1 - p)^(n-r)
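Evaluating this formula directly (a sketch using only the standard library; math.comb computes n!/(r!(n-r)!)):

```python
# Binomial probability of exactly r heads in n tosses.
from math import comb

def binomial_pr(r, n, p):
    return comb(n, r) * p**r * (1 - p)**(n - r)

print(binomial_pr(3, 10, 0.5))   # 0.1171875
```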

A Few Useful Facts (II)

• The Normal distribution is the well-known bell-shaped distribution, arising often in nature:

Pr(x) = (1 / √(2πσ²)) · e^(-(x - μ)² / (2σ²))

• Expected values:
– E[XBinomial] = np
– E[XNormal] = μ

A Few Useful Facts (III)

• Estimating p from a random sample of coin tosses is equivalent to estimating errorD(h) from testing h on a random sample from D:
– Single coin toss ↔ single instance drawing
– Probability p that a single coin toss is heads ↔ probability that a single instance is misclassified
– Number r of heads observed over a sample of n coin tosses ↔ number of misclassifications observed over n randomly drawn instances

• Hence, we have:
– p = errorD(h)
– r/n = errorS(h)

A Few Useful Facts (IV)

• A random variable can be viewed as the name of an experiment whose outcome is probabilistic

• An estimator is a random variable X used to estimate some parameter p (e.g., the mean) of an underlying population

• The bias of an estimator X is defined by:

Bias(X) = E[X] - p

• An estimator X is unbiased if and only if: E[X] = p

A Few Useful Facts (V)

• Since we have:
– p = errorD(h)
– r/n = errorS(h)
– E[XBinomial] = np

• It follows that:
– E[errorS(h)] = E[r/n] = E[r]/n = np/n = p = errorD(h)

• Hence, errorS(h) is an unbiased estimator of errorD(h) (see the simulation sketch below)
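A quick simulation of this fact (illustrative; the values of p and n are made up):

```python
# Empirically checking that errorS(h) = r/n averages to p = errorD(h).
import numpy as np

rng = np.random.default_rng(3)
p, n = 0.2, 30                       # true error and test-sample size (made up)

# Draw r for many independent test samples and average the sample errors
sample_errors = rng.binomial(n, p, size=100_000) / n
print(sample_errors.mean())          # ~0.2 = p, as unbiasedness predicts
```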

Comparing Hypotheses

• We wish to estimate errorD(h1) - errorD(h2)

• We measure errorS1(h1) – errorS2(h2), which turns out to be an unbiased estimator

• In practice, it is OK to measure on the same test set errorS(h1) – errorS(h2), which has lower variance

Comparing Classifiers (I)

• We wish to estimate:

E_{S⊂D} [ errorD(LA(S)) - errorD(LB(S)) ]

where LA(S) denotes the hypothesis output by learner LA trained on sample S

• In practice, we have a single sample D0: split it into a training set S0 and a test set T0, and measure:

errorT0(LA(S0)) - errorT0(LB(S0))

Comparing Classifiers (II)

• Problems:
– We use errorT0(h) to estimate errorD(h)
– We measure the difference on a single split (S0, T0) alone (not the expected value over all samples)

• Improvement:
– Repeatedly partition the dataset D0 into N disjoint train (Si) / test (Ti) sets (e.g., using N-fold cross-validation)
– Compute the average difference:

δ̄ = (1/N) Σi δi, where δi = errorTi(LA(Si)) - errorTi(LB(Si))

Comparing Classifiers (III)

• Derive:

t = δ̄ / √( (1/(N(N-1))) Σi (δi - δ̄)² )

• t is known as the Student's t statistic:
– Choose a significance level q
– With N-1 degrees of freedom, if t exceeds the critical value tc from the table, then the difference is statistically significant at that level (a sketch follows below)
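A sketch of the whole procedure, assuming scikit-learn and SciPy, with two illustrative learners standing in for LA and LB (none of these choices come from the slides):

```python
# Paired Student-t test of two learners over N cross-validation folds.
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

deltas = []   # delta_i = errorTi(LA(Si)) - errorTi(LB(Si))
for train, test in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    err_a = 1 - SVC().fit(X[train], y[train]).score(X[test], y[test])
    err_b = 1 - DecisionTreeClassifier(random_state=0).fit(
        X[train], y[train]).score(X[test], y[test])
    deltas.append(err_a - err_b)

d = np.array(deltas)
N = len(d)
t = d.mean() / np.sqrt(d.var(ddof=1) / N)   # matches the formula above
print("t =", round(t, 3),
      "| critical value at q = 0.05, N-1 dof:",
      round(stats.t.ppf(0.975, df=N - 1), 3))
```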

END OF DAY 5
Homework: SVM, Neural Networks