Transcript
Slide 1/47

SVMC: An introduction to Support Vector Machines Classification

Lorenzo Rosasco ([email protected])

Department of Brain and Cognitive Science, MIT

    6.783, Biomedical Decision Support

    Friday, October 30, 2009

Slide 2/47

    A typical problem

We have a cohort of patients from two groups, say A and B.

We wish to devise a classification rule to distinguish patients of one group from patients of the other group.

Slide 3/47

Learning and Generalization

Goal: correctly classify new patients

Slide 4/47

    Plan

    1. Linear SVM

    2. Non Linear SVM: Kernels

    3. Tuning SVM

    4. Beyond SVM: Regularization Networks

Slide 5/47

    Learning from Data

To make predictions we need information about the patients:

patient 1: x = (x1, . . . , xn)

patient 2: x = (x1, . . . , xn)

....

patient ℓ: x = (x1, . . . , xn)

Slide 6/47

Linear model

Patients of class A are labeled y = 1

Patients of class B are labeled y = -1

Classification rule: sign(w · x), where

w · x = Σ_{j=1}^n w_j x_j
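To make the linear rule concrete, here is a minimal numpy sketch (illustrative only; the toy values and the weight vector w are made up, since w would in practice be learned as described in the following slides):

```python
import numpy as np

# Toy data: 3 patients, n = 2 features each (hypothetical values).
X = np.array([[0.5, 1.2],
              [-0.3, 0.8],
              [1.1, -0.4]])

w = np.array([0.7, -0.2])      # weight vector (assumed already learned)

scores = X @ w                 # w . x for every patient
predictions = np.sign(scores)  # classification rule: sign(w . x)
print(predictions)             # +1 -> class A, -1 -> class B
```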

Slide 7/47

    1D Case

[Figure: 1D example. Points labeled y = 1 and y = -1 on the X axis, with the regions w · x > 0 and w · x < 0 separated by the decision boundary w · x = 0.]

Slide 8/47

    How do we find a good solution?

    2D Classification Problem

[Figure: points x = (x1, x2) labeled y = 1 and y = -1.]

Slide 9/47

    How do we find a good solution?

[Figure: a candidate separating hyperplane w · x = 0 with the halfspaces w · x > 0 and w · x < 0.]

Slide 10/47

    How do we find a good solution?

Slide 11/47

    How do we find a good solution?

Slide 12/47

    How do we find a good solution?

Slide 13/47

    How do we find a good solution?

[Figure: a separating hyperplane with margin M.]

Slide 14/47

    Maximum Margin Hyperplane

...with little effort... one can show that maximizing the margin M is equivalent to minimizing ||w||.

Slide 15/47

    SVM

    Linear and Separable SVM

min_{w ∈ R^n} ||w||²

subject to: y_i (w · x_i) ≥ 1,  i = 1, . . . , ℓ

Typically an offset term is added to the solution: f(x) = sign(w · x + b).

Slide 16/47

A more general algorithm

There are two things we would like to improve:

    Allow for errors

    Non Linear Models

Slide 17/47

    Measuring errors

Slide 18/47

    Measuring errors (cont)

[Figure: points violating the margin; the slack variables ξ_i measure the violations.]

Slide 19/47

    Linear SVM

min_{w ∈ R^n, ξ ∈ R^ℓ, b ∈ R}  C Σ_{i=1}^ℓ ξ_i + (1/2) ||w||²

subject to: y_i (w · x_i + b) ≥ 1 − ξ_i,  i = 1, . . . , ℓ

ξ_i ≥ 0,  i = 1, . . . , ℓ
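As a hedged illustration (not part of the original slides), the same soft-margin linear problem can be solved with scikit-learn, whose SVC class wraps libSVM; the toy data are made up and C below is the regularization parameter of the formulation above:

```python
import numpy as np
from sklearn.svm import SVC

# Toy training set: rows are patients, columns are features (hypothetical data).
X = np.array([[0.2, 1.0], [0.5, 1.5], [1.8, 0.3], [2.0, 0.1]])
y = np.array([1, 1, -1, -1])          # class A -> 1, class B -> -1

clf = SVC(kernel="linear", C=1.0)     # linear soft-margin SVM
clf.fit(X, y)

print(clf.coef_, clf.intercept_)      # w and b of the learned rule sign(w . x + b)
print(clf.predict([[1.0, 0.8]]))      # classify a new patient
```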

Slide 20/47

    Optimization

    How do we solve this minimization problem?

    (...and why do we call it SVM anyway?)

Slide 21/47

    Some facts

    Representer Theorem

    Dual Formulation

    Box Constraints and Support Vectors

Slide 22/47

    Representer Theorem

The solution to the minimization problem can be written as

w · x = Σ_{i=1}^ℓ c_i (x · x_i)
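A small numpy sketch (with made-up coefficients c_i, not computed from the dual) just to illustrate that a solution of the form w = Σ_i c_i x_i only needs inner products with the training points:

```python
import numpy as np

X = np.array([[0.2, 1.0], [0.5, 1.5], [1.8, 0.3]])   # training points
c = np.array([0.4, -0.1, -0.3])                       # hypothetical coefficients c_i

w = X.T @ c                              # w = sum_i c_i x_i  (representer theorem form)

x_new = np.array([1.0, 0.8])
direct = w @ x_new                       # w . x
via_inner_products = c @ (X @ x_new)     # sum_i c_i (x . x_i)
print(np.isclose(direct, via_inner_products))  # True: same value either way
```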

Slide 23/47

    Dual Problem

    The coefficients can be found solving:

max_{α ∈ R^ℓ}  Σ_{i=1}^ℓ α_i − (1/2) α^T Q α

subject to: Σ_{i=1}^ℓ y_i α_i = 0

0 ≤ α_i ≤ C,  i = 1, . . . , ℓ

Here Q_{ij} = y_i y_j (x_i · x_j) and α_i = c_i / y_i.

Slide 24/47

    Optimality conditions

    with little effort ... one can show that

If y_i f(x_i) > 1, then α_i = 0.

The solution is sparse: some training points do not contribute to the solution.

Slide 25/47

Sparse Solution

Note that:

The solution depends only on the training set points (no dependence on the number of features!).
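To see this sparsity in practice, a fitted scikit-learn SVC (continuing the earlier hypothetical toy example) exposes the support vectors and their coefficients:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.2, 1.0], [0.5, 1.5], [1.8, 0.3], [2.0, 0.1], [0.1, 2.0]])
y = np.array([1, 1, -1, -1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(clf.support_)          # indices of the training points with alpha_i > 0
print(clf.n_support_)        # number of support vectors per class
print(clf.dual_coef_)        # y_i * alpha_i for the support vectors only
```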

Slide 26/47

    Feature Map

f(x) = w · Φ(x)

Slide 27/47

    A Key Observation

The solution depends only on Q:

max_{α ∈ R^ℓ}  Σ_{i=1}^ℓ α_i − (1/2) α^T Q α

subject to: Σ_{i=1}^ℓ y_i α_i = 0

0 ≤ α_i ≤ C,  i = 1, . . . , ℓ

Idea: instead of Q_{ij} = y_i y_j (x_i · x_j), use Q_{ij} = y_i y_j (Φ(x_i) · Φ(x_j)).

Slide 28/47

Kernels and Feature Maps

The crucial quantity is the inner product, called Kernel:

K(x, t) = Φ(x) · Φ(t)

A function is called a Kernel if it is:

symmetric

positive definite

Slide 29/47

    Examples of Kernels

Linear kernel: K(x, x′) = x · x′

Gaussian kernel: K(x, x′) = exp(−||x − x′||² / 2σ²),  σ > 0

Polynomial kernel: K(x, x′) = (x · x′ + 1)^d,  d ∈ N

For specific applications, designing an effective kernel is a challenging problem.
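A minimal numpy sketch of these three kernels (illustrative; the bandwidth sigma and degree d are arbitrary example values, not recommendations):

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z                              # K(x, x') = x . x'

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def polynomial_kernel(x, z, d=3):
    return (x @ z + 1) ** d                   # K(x, x') = (x . x' + 1)^d

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(x, z), gaussian_kernel(x, z), polynomial_kernel(x, z))
```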

Slide 30/47

    Non Linear SVM

    Summing up:

Define Feature Map either explicitly or via a kernel

    Find linear solution in the Feature space

    Use same solver as in the linear case

    Representer theorem now gives:

w · Φ(x) = Σ_{i=1}^ℓ c_i (Φ(x) · Φ(x_i)) = Σ_{i=1}^ℓ c_i K(x, x_i)
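As a hedged sketch of "use the same solver as in the linear case", scikit-learn's SVC accepts a precomputed Gram matrix, so any kernel K can be plugged in (the toy data and Gaussian kernel choice are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0], [1.0], [2.0], [3.0]])    # 1D toy inputs
y = np.array([1, 1, -1, -1])

def gram(A, B, sigma=1.0):
    # Gaussian kernel matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

clf = SVC(kernel="precomputed", C=1.0)
clf.fit(gram(X, X), y)                        # train on the ell x ell Gram matrix

X_new = np.array([[0.5], [2.5]])
print(clf.predict(gram(X_new, X)))            # test kernel: rows = new points, cols = training points
```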

Slide 31/47

    Example in 1D

[Figure: 1D example with points labeled y = 1 and y = -1 along the X axis.]

Slide 32/47

    Software

    SVM Light: http://svmlight.joachims.org

    SVM Torch: http://www.torch.ch

libSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/

Slide 33/47

    Model Selection

    We have to fix the Regularization parameter C

We have to choose the kernel (and its parameter)

Using default values is usually a BAD BAD idea

Slide 34/47

    Regularization Parameter

Large C: we try to minimize errors ignoring the complexity of the solution

Small C: we ignore the errors to obtain a simple solution

min_{w ∈ R^n, ξ ∈ R^ℓ, b ∈ R}  C Σ_{i=1}^ℓ ξ_i + (1/2) ||w||²
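A hedged illustration of this trade-off with scikit-learn (the toy data and the specific values of C are arbitrary assumptions): larger C typically tolerates fewer margin violations, smaller C yields a "simpler", larger-margin solution.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(1.5, 1, (20, 2))])
y = np.array([1] * 20 + [-1] * 20)            # two overlapping classes

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: training accuracy={clf.score(X, y):.2f}, "
          f"support vectors={len(clf.support_)}")
```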

Slide 35/47

    Which Kernel?

For very high dimensional data the linear kernel is often the default choice:

allows computational speed up; less prone to overfitting

Gaussian Kernel with proper tuning is another common choice

Whenever possible use prior knowledge to build problem specific features or kernels

Slide 36/47

    2D demo

Demo: http://svm.dcs.rhbnc.ac.uk/pagesnew/GPat.shtml
Slide 37/47

    Practical Rules

We can choose C (and the kernel parameter) via cross validation

    Holdout set

    K-fold cross validation

K = # of examples is called Leave One Out
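A hedged sketch of K-fold cross-validation for choosing C and the Gaussian kernel parameter with scikit-learn (the toy data and the parameter grid are arbitrary example assumptions):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(2, 1, (30, 2))])
y = np.array([1] * 30 + [-1] * 30)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}   # gamma = 1/(2 sigma^2)
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)        # 5-fold cross validation
search.fit(X, y)

print(search.best_params_, search.best_score_)
```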

Slide 38/47

    K-Fold CV

    We have to compute several solutions...

Slide 39/47

    A Rule of Thumb

This is how the CV error typically looks

    Fix a reasonable kernel, then fine tune C

[Figure: typical behavior of the cross-validation error as C and the kernel parameter vary.]

Slide 40/47

    Which values do we start from?

For the Gaussian kernel, pick sigma of the order of the average distance...

Take min (and max) C as the value for which the training set error does not increase (decrease) anymore.

k(X_i, X_j) = exp(−||X_i − X_j||² / 2σ²)
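A small numpy sketch of the "average distance" heuristic (illustrative; the toy data are made up, and using the median instead of the mean pairwise distance is an equally common variant):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))                      # toy training inputs

# Pairwise Euclidean distances between all training points.
diffs = X[:, None, :] - X[None, :, :]
dists = np.sqrt((diffs ** 2).sum(-1))

sigma = dists[dists > 0].mean()                   # average distance as a starting sigma
gamma = 1.0 / (2 * sigma ** 2)                    # corresponding scikit-learn 'gamma'
print(sigma, gamma)
```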

Slide 41/47

    Computational Considerations

The training time depends on the parameters: the more we fit, the slower the algorithm.

Typically the computational burden is in the selection of the regularization parameter (solvers for the regularization path).

Slide 42/47

    Regularization Networks

SVMs are an example of a family of algorithms of the form:

C Σ_{i=1}^ℓ V(y_i, w · Φ(x_i)) + ||w||²

V is called the loss function.

Slide 43/47

    Hinge Loss

[Figure: the 0-1 loss and the hinge loss plotted as functions of y w · Φ(x).]
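For reference, the hinge loss used by the SVM is V(y, f(x)) = max(0, 1 − y f(x)); a minimal numpy sketch comparing it to the 0-1 loss (the margin values below are arbitrary examples):

```python
import numpy as np

def hinge_loss(y, f):
    return np.maximum(0.0, 1.0 - y * f)       # max(0, 1 - y f(x))

def zero_one_loss(y, f):
    return (np.sign(f) != y).astype(float)    # 1 if misclassified, else 0

margins = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])   # values of y * f(x)
print(hinge_loss(1, margins))                     # [2.  1.  0.5 0.  0. ]
print(zero_one_loss(1, margins))                  # [1. 1. 0. 0. 0.]
```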

Slide 44/47

    Loss functions

Slide 45/47

    Representer Theorem

    For a LARGE class of loss functions:

w · Φ(x) = Σ_{i=1}^ℓ c_i (Φ(x) · Φ(x_i)) = Σ_{i=1}^ℓ c_i K(x, x_i)

The way we compute the coefficients depends on the considered loss function.

Slide 46/47

    Regularized LS

The simplest, yet powerful, algorithm is probably RLS.

Square loss: V(y, w · Φ(x)) = (y − w · Φ(x))²

Algorithm: (Q + (1/C) I) c = y,  where Q_{i,j} = K(x_i, x_j)

Leave one out can be computed at the price of one (!!!) solution
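A minimal numpy sketch of solving this RLS linear system (the toy data, the Gaussian kernel, and the values of C and sigma are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(2, 1, (10, 2))])
y = np.array([1.0] * 10 + [-1.0] * 10)
C, sigma = 1.0, 1.0

# Gram matrix Q_ij = K(x_i, x_j) with a Gaussian kernel.
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
Q = np.exp(-d2 / (2 * sigma ** 2))

# Solve (Q + (1/C) I) c = y for the coefficients c.
c = np.linalg.solve(Q + np.eye(len(y)) / C, y)

# Predictions on the training set: f(x) = sum_i c_i K(x, x_i).
print(np.sign(Q @ c))
```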

Slide 47/47

    Summary

    Separable, Linear SVM

    Non Separable, Linear SVM

    Non Separable, Non Linear SVM

    How to use SVM

