START OF DAY 5 Reading: Chap. 8. Support Vector Machine.
Page 1: START OF DAY 5 Reading: Chap. 8. Support Vector Machine.

START OF DAY 5
Reading: Chap. 8

Page 2: START OF DAY 5 Reading: Chap. 8. Support Vector Machine.

Support Vector Machine

Page 3: START OF DAY 5 Reading: Chap. 8. Support Vector Machine.

Revisiting Linear Classification

• Recall:
  – The perceptron can only solve linearly separable tasks
  – Non-linear features (e.g., x², xy) can be added that may make the new task linearly separable
• Two questions arise here:
  – Is there an optimal way to do linear classification?
  – Is there a systematic way to leverage it in higher dimensions?

Page 4: START OF DAY 5 Reading: Chap. 8. Support Vector Machine.

Maximal-Margin Classification (I)

• Consider a 2-class problem in R^d
• As needed (and without loss of generality), relabel the classes to -1 and +1
• Suppose we have a separating hyperplane
  – Its equation is: w.x + b = 0
  – w is normal to the hyperplane
  – |b|/||w|| is the perpendicular distance from the hyperplane to the origin
  – ||w|| is the Euclidean norm of w

Page 5: START OF DAY 5 Reading: Chap. 8. Support Vector Machine.

Maximal-Margin Classification (II)

• We can certainly choose w and b in such a way that:
  – w.xi + b > 0 when yi = +1
  – w.xi + b < 0 when yi = -1
• Rescaling w and b so that the closest points to the hyperplane satisfy |w.xi + b| = 1, we can rewrite the above as:
  – w.xi + b ≥ +1 when yi = +1   (1)
  – w.xi + b ≤ -1 when yi = -1   (2)

Page 6: START OF DAY 5 Reading: Chap. 8. Support Vector Machine.

Maximal-Margin Classification (III)

• Consider the case when (1) is an equality: w.xi + b = +1 (call this hyperplane H+)
  – Normal w
  – Distance from the origin: |1-b|/||w||
• Similarly for (2): w.xi + b = -1 (call this hyperplane H-)
  – Normal w
  – Distance from the origin: |-1-b|/||w||
• We now have two hyperplanes, both parallel to the original one

Page 7: START OF DAY 5 Reading: Chap. 8. Support Vector Machine.

Maximal-Margin Classification (IV)

Page 8: START OF DAY 5 Reading: Chap. 8. Support Vector Machine.

Maximal-Margin Classification (V)

• Note that the points lying on H- and H+ are sufficient to define H- and H+, and therefore are sufficient to build a linear classifier. These points are called support vectors.
• Define the margin as the distance between H- and H+
• What would be a good choice for w and b? Maximize the margin.

Page 9: START OF DAY 5 Reading: Chap. 8. Support Vector Machine.

Maximal-Margin Classification (VI)

• From the equations of H- and H+, their distances from the origin are |1-b|/||w|| and |-1-b|/||w||, so the margin (the distance between H- and H+) is 2/||w||
• So, we can maximize the margin by:
  – Minimizing ||w||²
  – Subject to: yi(w.xi + b) - 1 ≥ 0 for all i (this combines (1) and (2) above)
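As an illustration (not part of the original slides), the following sketch fits a nearly hard-margin linear SVM with scikit-learn on a hypothetical toy dataset; a very large C approximates the hard-margin problem above, and the margin 2/||w|| is recovered from the learned w.

```python
# Sketch only: assumes scikit-learn and NumPy are available; the data is a made-up toy set.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 3.0],
              [6.0, 5.0], [7.0, 7.5], [8.0, 6.0]])
y = np.array([-1, -1, -1, +1, +1, +1])

# A very large C approximates the hard-margin formulation:
# minimize ||w||^2 subject to y_i (w.x_i + b) >= 1.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]          # normal vector of the separating hyperplane
b = clf.intercept_[0]     # offset
margin = 2.0 / np.linalg.norm(w)

print("w =", w, "b =", b)
print("margin (2/||w||) =", margin)
print("support vectors:\n", clf.support_vectors_)
```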

Page 10: START OF DAY 5 Reading: Chap. 8. Support Vector Machine.

Minimizing ||w||2

• Use one Lagrange multiplier per constraint (i.e., one per training instance)
  – For constraints of the form ci ≥ 0 (see above):
    • The constraint equations are multiplied by positive Lagrange multipliers, and
    • Subtracted from the objective function
• Hence, we have the (primal) Lagrangian:

  LP = ½||w||² - Σi αi [yi(w.xi + b) - 1]

  (the factor ½ is for convenience and does not change the minimizer)

Page 11: START OF DAY 5 Reading: Chap. 8. Support Vector Machine.

Maximizing LD

• It turns out (after some transformations beyond the scope of our discussion) that minimizing LP is equivalent to maximizing the following dual Lagrangian:

  LD = Σi αi - ½ Σi Σj αi αj yi yj <xi, xj>

  – Where <xi, xj> denotes the dot product

• Subject to: αi ≥ 0 and Σi αi yi = 0

• Support vectors are those instances for which αi ≠ 0
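As an aside (not in the slides), scikit-learn exposes the quantities yi·αi for the support vectors as dual_coef_, so the relation w = Σi αi yi xi can be checked numerically; this sketch assumes the clf object from the previous example.

```python
# Sketch only: continues the previous example (assumes clf has been fit with a linear kernel).
import numpy as np

# dual_coef_ holds y_i * alpha_i for the support vectors only;
# all other training points have alpha_i = 0.
alpha_times_y = clf.dual_coef_[0]
sv = clf.support_vectors_

# w = sum_i alpha_i y_i x_i (summed over the support vectors)
w_from_dual = alpha_times_y @ sv
print("w from dual coefficients:", w_from_dual)
print("w from coef_:            ", clf.coef_[0])  # should match closely
```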

Page 12: START OF DAY 5 Reading: Chap. 8. Support Vector Machine.

SVM Learning (I)

• We could stop here, and we would have a nice linear classification algorithm
• SVM goes one step further:
  – It assumes that non-linearly separable problems in low dimensions may become linearly separable in higher dimensions (e.g., XOR)

Page 13: START OF DAY 5 Reading: Chap. 8. Support Vector Machine.

SVM Learning (II)

• SVM thus:
  – Creates a non-linear mapping from the low-dimensional space to a higher-dimensional space
  – Uses maximal-margin (MM) learning in the new space
• Computation is efficient when "good" transformations are selected
  – The kernel trick

Page 14: START OF DAY 5 Reading: Chap. 8. Support Vector Machine.

Choosing a Transformation (I)

• Recall the formula for LD
• Note that it involves a dot product
  – Expensive to compute in high dimensions
  – Gets worse if we transform to more dimensions
• What if we did not have to?

Page 15: START OF DAY 5 Reading: Chap. 8. Support Vector Machine.

Choosing a Transformation (II)

• It turns out that it is possible to design transformations φ such that <φ(x), φ(y)> can be expressed in terms of <x, y>
• Hence, one only needs to compute dot products in the original, lower-dimensional space
• Example:
  – φ: R² → R³ where φ(x) = (x₁², √2·x₁x₂, x₂²)
  – For this φ, <φ(x), φ(y)> = <x, y>²
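A quick numerical check of this identity (illustrative, not from the slides):

```python
# Sketch only: verify that <phi(x), phi(y)> equals <x, y>^2 for the mapping above.
import numpy as np

def phi(v):
    """Map R^2 -> R^3: (v1, v2) -> (v1^2, sqrt(2)*v1*v2, v2^2)."""
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x = np.array([1.5, -2.0])
y = np.array([0.5, 3.0])

lhs = phi(x) @ phi(y)        # dot product in the transformed space
rhs = (x @ y) ** 2           # kernel computed in the original space
print(lhs, rhs)              # the two values agree (up to rounding)
```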

Page 16: START OF DAY 5 Reading: Chap. 8. Support Vector Machine.

Choosing a Kernel

• One can start from a desired feature space and try to construct a kernel for it
• More often, one starts from a reasonable kernel and may not analyze the feature space at all
• Some kernels are a better fit for certain problems; domain knowledge can be helpful
• Common kernels:
  – Polynomial
  – Gaussian
  – Sigmoidal
  – Application-specific
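One common way to compare candidate kernels in practice is cross-validation; this is a hedged sketch (not from the slides) assuming scikit-learn and some dataset X, y such as the toy data above.

```python
# Sketch only: compare common kernels by cross-validated accuracy.
# Assumes X (features) and y (labels) are already defined.
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    scores = cross_val_score(SVC(kernel=kernel), X, y, cv=3)
    print(f"{kernel:8s} mean accuracy = {scores.mean():.3f}")
```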

Page 17: START OF DAY 5 Reading: Chap. 8. Support Vector Machine.

SVM Notes

• Excellent empirical and theoretical potential
• Multi-class problems are not handled naturally
• How to choose the kernel is the main learning parameter
  – Each kernel also brings further parameters to be set (degree of the polynomial, variance of the Gaussian, etc.)
• Speed and size: both training and testing can be costly; how to handle very large training sets is not yet solved
• MM can lead to overfitting due to noise, or the problem may not be linearly separable within a reasonable feature space
  – Soft Margin is a common solution: it allows slack variables
  – Each αi is then constrained to satisfy 0 ≤ αi ≤ C; the parameter C controls how much slack (i.e., how many outliers) is tolerated
  – How to pick C?
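One standard, illustrative answer (not prescribed by the slides) is to pick C by cross-validated grid search; a minimal sketch assuming scikit-learn and data X, y:

```python
# Sketch only: choose the soft-margin parameter C (and a kernel parameter) by grid search.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "C": [0.01, 0.1, 1, 10, 100],   # candidate soft-margin penalties
    "gamma": ["scale", 0.1, 1.0],   # Gaussian (RBF) kernel width
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)
search.fit(X, y)
print("best parameters:", search.best_params_)
print("best CV accuracy:", search.best_score_)
```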

Page 18: START OF DAY 5 Reading: Chap. 8. Support Vector Machine.

Chunking
• Start with a reasonably sized subset of the data set (one that fits in memory and does not take too long during training)
• Train on this subset and keep just the support vectors, or the m patterns with the highest αi values
• Grab another subset, add the current support vectors to it, and continue training
• Note that this training may allow previous support vectors to be dropped as better ones are discovered
• Repeat until all data has been used and no new support vectors are added, or some other stopping criterion is fulfilled
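A rough Python sketch of this chunking loop (the function name and the use of scikit-learn's SVC as the inner solver are assumptions, not part of the original slides; for brevity it makes a single pass over the chunks rather than repeating until no new support vectors appear):

```python
# Sketch only: a simplified chunking loop. Assumes NumPy, scikit-learn,
# and arrays X, y holding the full (large) training set.
import numpy as np
from sklearn.svm import SVC

def chunked_svm(X, y, chunk_size=1000):
    """Train an SVM chunk by chunk, carrying support vectors forward."""
    n = len(X)
    order = np.random.permutation(n)           # visit the data in random chunks
    sv_X = np.empty((0, X.shape[1]))            # current support vectors
    sv_y = np.empty((0,), dtype=y.dtype)
    clf = None
    for start in range(0, n, chunk_size):
        idx = order[start:start + chunk_size]
        # Train on the new chunk plus the support vectors kept so far.
        X_cur = np.vstack([X[idx], sv_X])
        y_cur = np.concatenate([y[idx], sv_y])
        clf = SVC(kernel="rbf", C=1.0).fit(X_cur, y_cur)
        # Keep only the support vectors; earlier ones may be dropped here.
        sv_X = clf.support_vectors_
        sv_y = y_cur[clf.support_]
    return clf
```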

Page 19: START OF DAY 5 Reading: Chap. 8. Support Vector Machine.

Comparing Classifiers

Page 20: START OF DAY 5 Reading: Chap. 8. Support Vector Machine.

Statistical Significance
• How do we know that some measurement is statistically significant, rather than just a random perturbation?
  – How good a predictor of generalization accuracy is the sample accuracy on a test set?
  – Is a particular hypothesis really better than another one because its accuracy is higher on a validation set?
  – When can we say that one learning algorithm is better than another for a particular task or set of tasks?
• For example, if learning algorithm 1 gets 95% accuracy and learning algorithm 2 gets 93% on a task, can we say with some confidence that algorithm 1 is superior in general for that task?
• The question becomes: what is the likely difference between the sample error (the estimator of the parameter) and the true error (the true parameter value)?
• Key point: what is the probability that the differences observed in our results are just due to chance, and thus not significant?

Page 21: START OF DAY 5 Reading: Chap. 8. Support Vector Machine.

Sample Error

• Error of hypothesis h with respect to a target function f and a sample S:

  errorS(h) = (1/|S|) · Σ_{x∈S} δ(f(x), h(x)),   where δ(f(x), h(x)) = 1 if f(x) ≠ h(x), and 0 otherwise
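A one-line illustration of this definition (assuming NumPy arrays of hypothesis outputs and true labels; the values are made up):

```python
# Sketch only: the sample error is just the misclassification rate on S.
import numpy as np

h_of_x = np.array([+1, -1, +1, +1, -1])   # hypothesis outputs h(x) on the sample
f_of_x = np.array([+1, -1, -1, +1, -1])   # true labels f(x)

error_S = np.mean(h_of_x != f_of_x)        # fraction of x in S with h(x) != f(x)
print(error_S)                             # 0.2 for this toy sample
```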

Page 22: START OF DAY 5 Reading: Chap. 8. Support Vector Machine.

True Error

• Error of hypothesis h with respect to a target function f and a distribution D:

  errorD(h) = Pr_{x∼D}[ f(x) ≠ h(x) ]

Page 23: START OF DAY 5 Reading: Chap. 8. Support Vector Machine.

The Question

• We wish to know errorD(h)

• We can only measure errorS(h)

How good an estimate of errorD(h) is provided by errorS(h)?

Page 24: START OF DAY 5 Reading: Chap. 8. Support Vector Machine.

Confidence Interval

• If h is a discrete-valued hypothesis, |S| = n ≥ 30, and the examples are drawn independently of h and of one another, then with N% probability, errorD(h) lies in the interval:

  errorS(h) ± zN · sqrt( errorS(h)·(1 - errorS(h)) / n )

  Confidence level N%:  50%   68%   80%   90%   95%   98%   99%
  Constant zN:          0.67  1.00  1.28  1.64  1.96  2.33  2.58
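A small sketch computing this interval (assumes NumPy; the sample size and observed error are illustrative):

```python
# Sketch only: 95% confidence interval for the true error from a sample error.
import numpy as np

n = 100            # test-set size (should be reasonably large, e.g. >= 30)
error_s = 0.12     # observed sample error error_S(h)
z_95 = 1.96        # z_N for a 95% confidence level (from the table above)

half_width = z_95 * np.sqrt(error_s * (1 - error_s) / n)
print(f"error_D(h) is in [{error_s - half_width:.3f}, {error_s + half_width:.3f}] with ~95% probability")
```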

Page 25: START OF DAY 5 Reading: Chap. 8. Support Vector Machine.

A Few Useful Facts (I)

• The expected value of a random variable X, also known as the mean, is defined by:

  E[X] = Σ_i xi · Pr(X = xi)

• The Binomial distribution gives the probability of observing r heads in a series of n independent coin tosses, if the probability of heads in a single toss is p:

  Pr(r) = ( n! / (r!·(n-r)!) ) · p^r · (1-p)^(n-r)

Page 26: START OF DAY 5 Reading: Chap. 8. Support Vector Machine.

A Few Useful Facts (II)

• The Normal distribution is the well-known bell-shaped distribution, arising often in nature:

  Pr(x) = (1 / sqrt(2πσ²)) · e^( -(x-μ)² / (2σ²) )

• Expected values:

  E[XBinomial] = n·p        E[XNormal] = μ

Page 27: START OF DAY 5 Reading: Chap. 8. Support Vector Machine.

A Few Useful Facts (III)

• Estimating p from a random sample of coin tosses is equivalent to estimating errorD(h) from testing h on a random sample from D:
  – A single coin toss corresponds to drawing a single instance
  – The probability p that a single coin toss is heads corresponds to the probability that a single instance is misclassified
  – The number r of heads observed over a sample of n coin tosses corresponds to the number of misclassifications observed over n randomly drawn instances
• Hence, we have:
  – p = errorD(h)
  – r/n = errorS(h)

Page 28: START OF DAY 5 Reading: Chap. 8. Support Vector Machine.

A Few Useful Facts (IV)

• A random variable can be viewed as the name of an experiment whose outcome is probabilistic
• An estimator is a random variable X used to estimate some parameter p (e.g., the mean) of an underlying population
• The bias of an estimator X is defined by:

  bias(X) = E[X] - p

• An estimator X is unbiased if and only if E[X] = p

Page 29: START OF DAY 5 Reading: Chap. 8. Support Vector Machine.

A Few Useful Facts (V)

• Since we have:
  – p = errorD(h)
  – r/n = errorS(h)
  – E[XBinomial] = n·p
• It follows that:
  – E[errorS(h)] = E[r/n] = E[r]/n = n·p/n = p = errorD(h)
• Hence, errorS(h) is an unbiased estimator of errorD(h)
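An illustrative simulation of this fact (assumptions: NumPy, a hypothetical true error of 0.2, and test samples of size 50):

```python
# Sketch only: averaging many sample errors recovers the true error,
# illustrating that error_S(h) is an unbiased estimator of error_D(h).
import numpy as np

rng = np.random.default_rng(0)
true_error = 0.2     # hypothetical error_D(h)
n = 50               # size of each test sample S

# Each trial draws n instances; each is misclassified with probability true_error.
sample_errors = rng.binomial(n, true_error, size=10_000) / n
print("mean of error_S(h) over trials:", sample_errors.mean())   # close to 0.2
```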

Page 30: START OF DAY 5 Reading: Chap. 8. Support Vector Machine.

Comparing Hypotheses

• We wish to estimate errorD(h1) - errorD(h2)
• We measure errorS1(h1) - errorS2(h2), which turns out to be an unbiased estimator of that difference
• In practice, it is OK to measure errorS(h1) - errorS(h2) on the same test set S, which has lower variance

Page 31: START OF DAY 5 Reading: Chap. 8. Support Vector Machine.

Comparing Classifiers (I)

• We wish to estimate the expected difference over samples S drawn from D:

  E_{S⊂D}[ errorD(LA(S)) - errorD(LB(S)) ]

  where LA(S) and LB(S) are the hypotheses produced by learning algorithms A and B when trained on S

• In practice, we only have a sample D0: split it into a training set S0 and a test set T0, and measure:

  errorT0(LA(S0)) - errorT0(LB(S0))

Page 32: START OF DAY 5 Reading: Chap. 8. Support Vector Machine.

Comparing Classifiers (II)

• Problems:
  – We use errorT0(h) to estimate errorD(h)
  – We measure the difference on S0 alone (not the expected value over all samples)
• Improvement:
  – Repeatedly partition the dataset D0 into N disjoint train (Si) / test (Ti) pairs (e.g., using N-fold cross-validation)
  – Compute the average difference:

    δ̄ = (1/N) Σ_{i=1..N} δi,   where δi = errorTi(LA(Si)) - errorTi(LB(Si))

Page 33: START OF DAY 5 Reading: Chap. 8. Support Vector Machine.

Comparing Classifiers (III)

• Derive the statistic:

  t = δ̄ / sqrt( (1/(N(N-1))) Σ_{i=1..N} (δi - δ̄)² )

• t is known as the (paired) Student's t statistic
  – Choose a significance level q
  – With N-1 degrees of freedom, if the computed t exceeds the corresponding value in the t table, then the difference is statistically significant at that level
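A sketch of this procedure using SciPy's paired t-test (the per-fold error arrays below are hypothetical):

```python
# Sketch only: paired t-test on per-fold error differences between two learners.
import numpy as np
from scipy import stats

# Hypothetical per-fold test errors for learners A and B over N = 10 folds.
errors_A = np.array([0.12, 0.10, 0.15, 0.11, 0.13, 0.09, 0.14, 0.12, 0.10, 0.13])
errors_B = np.array([0.14, 0.13, 0.16, 0.13, 0.15, 0.12, 0.15, 0.14, 0.13, 0.15])

deltas = errors_A - errors_B
t_stat, p_value = stats.ttest_rel(errors_A, errors_B)   # paired Student's t-test, N-1 df
print("mean difference:", deltas.mean())
print("t =", t_stat, " p =", p_value)
# If p is below the chosen significance level q, the difference is significant.
```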

Page 34: START OF DAY 5 Reading: Chap. 8. Support Vector Machine.

END OF DAY 5
Homework: SVM, Neural Networks
