Page 1: Support vector machines

SUPPORT VECTOR MACHINES

Page 2: Support vector machines

LINEAR CLASSIFIER: ANOTHER INTUITION If you have to place a fat separator between classes, you have fewer choices of where to place it, and so the capacity of the model is decreased

Page 3: Support vector machines

SUPPORT VECTOR MACHINES SVMs maximize the margin around the separating hyperplane

The decision function is fully specified by a subset of training samples, the support vectors

Solving SVMs is a quadratic programming problem

[Figure: a maximum-margin separator with its support vectors, contrasted with a narrower-margin separator]

Page 4: Support vector machines

MAXIMUM MARGIN - FORMALIZATION w: decision hyperplane normal vector

xi: data point i

yi: class of data point i (+1 or -1)

Classifier is: f(xi) = sign(wTxi + b)

Functional margin of xi is: yi(wTxi + b). Note that we can increase this margin simply by scaling w and b

Functional margin of the dataset is twice the minimum functional margin of any point; the factor of two comes from measuring the whole width of the margin
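
As a concrete illustration of the classifier and the functional margin above, here is a small sketch; the values of w, b, and the data points are made up for illustration:

```python
import numpy as np

# Hypothetical hyperplane parameters and labeled points
w = np.array([1.0, 2.0])                 # decision hyperplane normal vector
b = -1.0
X = np.array([[2.0, 1.0], [0.0, 0.0], [-1.0, -1.0]])
y = np.array([1, -1, -1])

scores = X @ w + b                       # w^T xi + b
predictions = np.sign(scores)            # f(xi) = sign(w^T xi + b)
functional_margins = y * scores          # yi (w^T xi + b)

print(predictions)                       # [ 1. -1. -1.]
print(functional_margins)                # [3. 1. 4.]
print(2 * functional_margins.min())      # functional margin of the dataset: 2.0
# Scaling (w, b) by any c > 1 inflates these numbers without moving the hyperplane,
# which is why the functional margin alone is not a good quantity to maximize.
```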

Page 5: Support vector machines

GEOMETRIC MARGIN Distance from the example x to the separator is r = y(wTx + b) / ||w||

Examples closest to the hyperplane are the support vectors

The margin ρ of the separator is the width of separation between the support vectors of the two classes

[Figure: geometric margin r of an example x from the separator with normal vector w, and margin ρ of the separator]

Page 6: Support vector machines

LINEAR SVM MATHEMATICALLY Assume that all data points are at a functional distance of at least 1 from the hyperplane; then the following two constraints hold for a training set {(xi, yi)}:

wTxi + b ≥ 1 if yi = 1

wTxi + b ≤ -1 if yi = -1

For support vectors, the inequality becomes an equality

Since each example's distance from the hyperplane is r = y(wTx + b) / ||w||,

the margin is ρ = 2 / ||w||

Page 7: Support vector machines

LINEAR SUPPORT VECTOR MACHINE Hyperplane: wTx + b = 0

Extra scale constraint: mini=1,…,n |wTxi + b| = 1

Let xa and xb be points on the two margin hyperplanes: wTxa + b = 1 and wTxb + b = -1

Subtracting, this implies: wT(xa - xb) = 2

ρ = ||xa - xb||2 = 2/||w||2

[Figure: separating hyperplane wTx + b = 0 with margin ρ between wTx + b = +1 and wTx + b = -1]
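
A quick numeric check of this geometry, with a made-up (w, b); xa and xb are constructed by stepping from the central hyperplane along the unit normal:

```python
import numpy as np

w = np.array([3.0, 4.0])                      # ||w|| = 5, made-up example
b = 2.0
n = w / np.linalg.norm(w)                     # unit normal of the hyperplane

x0 = -b * w / (w @ w)                         # a point on w^T x + b = 0
xa = x0 + n / np.linalg.norm(w)               # lands on w^T x + b = +1
xb = x0 - n / np.linalg.norm(w)               # lands on w^T x + b = -1

print(w @ xa + b, w @ xb + b)                 # 1.0 -1.0
print(np.linalg.norm(xa - xb))                # 0.4
print(2 / np.linalg.norm(w))                  # 0.4, i.e. rho = 2 / ||w||
```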

Page 8: Support vector machines

LINEAR SVM MATHEMATICALLY (CONT.) Then we can formulate the quadratic optimization problem:

Find w and b such that

ρ = 2/||w|| is maximized; and for all {(xi, yi)}:

wTxi + b ≥ 1 if yi = 1; wTxi + b ≤ -1 if yi = -1

A better formulation (min ||w|| = max 1/||w||):

Find w and b such that

Φ(w) = ½ wTw is minimized;

and for all {(xi, yi)}: yi(wTxi + b) ≥ 1
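
The second formulation can be handed directly to a generic solver. The sketch below uses scipy's SLSQP on a tiny made-up dataset; it is only meant to make the objective and constraints concrete, not to be an efficient SVM solver:

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(params):
    w = params[:-1]
    return 0.5 * w @ w                        # Phi(w) = 1/2 w^T w

def margin_constraints(params):
    w, b = params[:-1], params[-1]
    return y * (X @ w + b) - 1.0              # yi (w^T xi + b) - 1 >= 0 for all i

res = minimize(objective,
               x0=np.zeros(X.shape[1] + 1),   # variables are (w1, w2, b)
               constraints=[{'type': 'ineq', 'fun': margin_constraints}],
               method='SLSQP')

w_opt, b_opt = res.x[:-1], res.x[-1]
print(w_opt, b_opt)
print(2.0 / np.linalg.norm(w_opt))            # resulting margin width 2 / ||w||
```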

Page 9: Support vector machines

SOLVING THE OPTIMIZATION PROBLEM This is optimizing a quadratic function subject to linear constraints

The solution involves constructing a dual problem in which a Lagrange multiplier αi is associated with every constraint in the primal problem

The solution has the form:

Find α1…αN such that

Q(α) = Σαi - ½ ΣΣ αiαjyiyjxiTxj is maximized and

(1) Σαiyi = 0  (2) αi ≥ 0 for all αi

w = Σαiyixi    b = yk - wTxk for any xk such that αk > 0

Page 10: Support vector machines

SOLVING THE OPTIMIZATION PROBLEM Each non-zero αi indicates that the corresponding xi is a support vector.

Then the classifying function will have the form:

f(x) = ΣαiyixiTx + b

Notice that it relies on an inner product between the test point x and the support vectors xi. We will return to this later.

Also keep in mind that solving the optimization problem involved computing the inner products xiTxj between all pairs of training points.
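
The same relationship can be checked against a library implementation. In scikit-learn (assumed available here), a fitted SVC stores αi·yi for the support vectors in dual_coef_, so f(x) can be rebuilt by hand and compared with decision_function; the toy data is made up:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
clf = SVC(kernel='linear', C=1e6).fit(X, y)    # very large C approximates the hard margin

alpha_y = clf.dual_coef_[0]                    # alpha_i * yi, one entry per support vector
sv = clf.support_vectors_                      # the xi with non-zero alpha_i
b = clf.intercept_[0]

x_new = np.array([0.5, 1.0])
f_manual = alpha_y @ (sv @ x_new) + b          # f(x) = sum_i alpha_i yi xi^T x + b
print(f_manual)
print(clf.decision_function([x_new])[0])       # same value as f_manual
```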

Page 11: Support vector machines

SOFT-MARGIN CLASSIFICATION If the training data is not linearly separable, slack variables ξi can be added to allow misclassification of difficult or noisy examples

Allow some errors – some points can be moved to where they belong at a cost

Try to minimize error

Place hyperplane far from each class

[Figure: two misclassified points with slack variables ξi and ξj]

Page 12: Support vector machines

SOFT MARGIN CLASSIFICATION MATHEMATICALLY Formulation involving slack variables

Parameter C can be viewed as a way to control overfitting – a regularization term

Find w, b, and slack variables ξi such that

Φ(w) = ½ wTw + CΣξi is minimized; and for all {(xi, yi)}: yi(wTxi + b) ≥ 1 - ξi and ξi ≥ 0 for all i
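
A small sketch of the role of C with scikit-learn on synthetic, overlapping data: a small C tolerates more slack (wider margin, more support vectors), while a large C penalizes violations harder; the parameter values are illustrative:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel='linear', C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_[0])          # margin width 2 / ||w||
    print(f"C={C}: {clf.n_support_.sum()} support vectors, margin width {margin:.3f}")
```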

Page 13: Support vector machines

SOFT MARGIN CLASSIFICATION - SOLUTION Dual problem for soft-margin classification; the only change from the separable case is that each αi is now bounded above by C

Find α1…αN such that

Q(α) = Σαi - ½ ΣΣ αiαjyiyjxiTxj is maximized and

(1) Σαiyi = 0  (2) 0 ≤ αi ≤ C for all αi
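
This dual is a standard box-constrained QP, so an off-the-shelf QP solver can handle it. A minimal sketch assuming the cvxopt package (which minimizes ½aᵀPa + qᵀa subject to Ga ≤ h and Aa = b, so Q(α) is negated); X is expected to be a float array and y a ±1 label vector:

```python
import numpy as np
from cvxopt import matrix, solvers

def soft_margin_svm_dual(X, y, C=1.0):
    n = len(y)
    K = X @ X.T                                        # inner products xi^T xj
    P = matrix(np.outer(y, y) * K)                     # P_ij = yi yj xi^T xj
    q = matrix(-np.ones(n))                            # maximize sum(alpha) -> minimize -sum(alpha)
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))     # encodes 0 <= alpha_i <= C ...
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))   # ... together with these bounds
    A = matrix(y.reshape(1, -1).astype(float))         # equality constraint sum_i alpha_i yi = 0
    b = matrix(0.0)

    solvers.options['show_progress'] = False
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])

    w = (alpha * y) @ X                                # w = sum_i alpha_i yi xi
    margin_sv = (alpha > 1e-6) & (alpha < C - 1e-6)    # support vectors strictly inside the box
    b0 = np.mean(y[margin_sv] - X[margin_sv] @ w)      # b = yk - w^T xk, averaged over them
    return alpha, w, b0
```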

Page 14: Support vector machines

CLASSIFICATION WITH SVM Given a new point x, we can score its projection onto the hyperplane normal: i.e., compute the score wTx + b = ΣαiyixiTx + b

Decide the class based on whether the score is > or < 0

Can set a confidence threshold t: score > t – yes; score < -t – no; else – don’t know

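
A sketch of this scoring and thresholding with a fitted linear SVM from scikit-learn; the dataset and the threshold t = 0.5 are made-up choices, and the "don't know" band is taken symmetric around 0:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
clf = SVC(kernel='linear', C=10.0).fit(X, y)

t = 0.5                                                # confidence threshold (hypothetical)
new_points = np.array([[1.5, 2.0], [0.1, -0.1], [-2.0, -2.0]])
for score in clf.decision_function(new_points):        # score = w^T x + b
    if score > t:
        print(f"{score:+.2f}: yes")
    elif score < -t:
        print(f"{score:+.2f}: no")
    else:
        print(f"{score:+.2f}: don't know")
```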

Page 15: Support vector machines

LINEAR SVM - SUMMARY The classifier is a separating hyperplane

The most important training points are the support vectors; they define the hyperplane

Quadratic optimization algorithms can identify which training points xi are support vectors with non-zero Lagrange multipliers αi

Both in the dual formulation of the problem and in the solution, training points appear only inside inner products:

Find α1…αN such that

Q(α) = Σαi - ½ ΣΣ αiαjyiyjxiTxj is maximized and

(1) Σαiyi = 0  (2) 0 ≤ αi ≤ C for all αi

f(x) = ΣαiyixiTx + b

Page 16: Support vector machines

NON-LINEAR SVM Data sets that are linearly separable work out great

But what if the data set is too hard?

How about mapping the data to a higher-dimensional space?

[Figure: 1-D data that is not separable by a threshold on the x axis becomes separable after mapping each point x to (x, x^2)]
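
The 1-D picture above can be reproduced in a few lines: the made-up points below cannot be split by a single threshold on x, but after mapping x to (x, x^2) a linear SVM separates them perfectly:

```python
import numpy as np
from sklearn.svm import SVC

x = np.array([-3.0, -2.5, -0.5, 0.0, 0.7, 2.4, 3.1])
y = np.array([ 1,    1,   -1,  -1,  -1,  1,   1 ])     # outer points vs inner points

phi = np.column_stack([x, x ** 2])                     # map each x to (x, x^2)
clf = SVC(kernel='linear', C=1e6).fit(phi, y)
print(clf.score(phi, y))                               # 1.0: separable in the mapped space
```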

Page 17: Support vector machines

NON-LINEAR SVM – FEATURE SPACES The original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:

Φ: x → φ(x)

Page 18: Support vector machines

THE KERNEL TRICK The linear classifier relies on an inner product between vectors: K(xi,xj) = xiTxj

If every data point is mapped to some higher-dimensional space via a transformation Φ: x → φ(x), the inner product becomes

K(xi,xj) = φ(xi)Tφ(xj)

A kernel function is some function that corresponds to an inner product in some expanded space

Example: 2-dimensional vectors x = [x1 x2]; let K(xi,xj) = (1 + xiTxj)^2

Need to show that K(xi,xj) = φ(xi)Tφ(xj):

K(xi,xj) = (1 + xiTxj)^2 = 1 + xi1^2 xj1^2 + 2 xi1xj1 xi2xj2 + xi2^2 xj2^2 + 2 xi1xj1 + 2 xi2xj2

= [1  xi1^2  √2 xi1xi2  xi2^2  √2 xi1  √2 xi2]T [1  xj1^2  √2 xj1xj2  xj2^2  √2 xj1  √2 xj2]

= φ(xi)Tφ(xj), where φ(x) = [1  x1^2  √2 x1x2  x2^2  √2 x1  √2 x2]
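
The algebra above is easy to verify numerically for a couple of made-up vectors: evaluating the kernel in the original 2-D space gives the same number as the inner product of the mapped 6-D vectors:

```python
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([1.0, x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

xi = np.array([0.5, -1.0])
xj = np.array([2.0, 3.0])

k_direct = (1.0 + xi @ xj) ** 2          # K(xi, xj) computed in the original space
k_mapped = phi(xi) @ phi(xj)             # inner product in the expanded feature space
print(k_direct, k_mapped)                # both print 1.0
```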

Page 19: Support vector machines

KERNELS Why use kernels?

To make non-separable problems separable, and to map data into a better representational space

Common kernels:

Linear: K(x,z) = xTz

Polynomial: K(x,z) = (1 + xTz)^d

Radial basis function (RBF) – these have not been very useful in text classification
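
A brief comparison of these kernels with scikit-learn on a synthetic "circles" dataset (parameter values are illustrative): the linear kernel cannot separate the two rings, while the polynomial and RBF kernels do much better:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)

for name, clf in [('linear',   SVC(kernel='linear')),
                  ('poly d=2', SVC(kernel='poly', degree=2, coef0=1.0)),
                  ('rbf',      SVC(kernel='rbf', gamma=1.0))]:
    clf.fit(X, y)
    print(name, clf.score(X, y))         # training accuracy for each kernel
```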

