Review: Support vector machines
COMP 875: Machine learning techniques and image analysis (lazebnik/fall09/lec06_svm.pdf, 2009-09-14)


Review: Support vector machines

Margin optimization

\min_{w, w_0} \; \tfrac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i(w_0 + w^T x_i) - 1 \ge 0, \quad i = 1, \dots, n.

What are support vectors?


Review: Support vector machines (separable case)

Margin optimization

\min_{w, w_0} \; \tfrac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i(w_0 + w^T x_i) - 1 \ge 0, \quad i = 1, \dots, n.

Add constraints as terms to the objective function:

\min_{w, w_0} \; \tfrac{1}{2}\|w\|^2 + \sum_{i=1}^{n} \max_{\alpha_i \ge 0} \alpha_i \left[ 1 - y_i(w_0 + w^T x_i) \right]

What if y_i(w_0 + w^T x_i) > 1 (not a support vector) at the solution? Then we must have α_i = 0.

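A quick aside (not on the original slide) on why this reformulation is equivalent: the inner maximization acts as a barrier for each constraint,

\max_{\alpha_i \ge 0} \alpha_i \left[ 1 - y_i(w_0 + w^T x_i) \right] =
\begin{cases}
0 & \text{if } y_i(w_0 + w^T x_i) \ge 1,\\
+\infty & \text{otherwise,}
\end{cases}

so the modified objective agrees with \tfrac{1}{2}\|w\|^2 on feasible (w, w_0) and is infinite elsewhere.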

Review: SVM optimization (separable case)

\max_{\{\alpha_i \ge 0\}} \; \min_{w, w_0} \; \underbrace{\left\{ \tfrac{1}{2}\|w\|^2 + \sum_{i=1}^{n} \alpha_i \left[ 1 - y_i(w_0 + w^T x_i) \right] \right\}}_{L(w, w_0; \alpha)}

First, we fix α and minimize L(w, w_0; α) w.r.t. w, w_0:

\partial_w L(w, w_0; \alpha) = w - \sum_{i=1}^{n} \alpha_i y_i x_i = 0, \qquad \partial_{w_0} L(w, w_0; \alpha) = -\sum_{i=1}^{n} \alpha_i y_i = 0,

which gives

w(\alpha) = \sum_{i=1}^{n} \alpha_i y_i x_i, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0.


Review: SVM optimization (separable case)

w(\alpha) = \sum_{i=1}^{n} \alpha_i y_i x_i, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0.

Now we can substitute this solution into

\max_{\{\alpha_i \ge 0,\; \sum_i \alpha_i y_i = 0\}} \left\{ \tfrac{1}{2}\|w(\alpha)\|^2 + \sum_{i=1}^{n} \alpha_i \left[ 1 - y_i(w_0(\alpha) + w(\alpha)^T x_i) \right] \right\}
= \max_{\{\alpha_i \ge 0,\; \sum_i \alpha_i y_i = 0\}} \left\{ \sum_{i=1}^{n} \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j \right\}.


Review: SVM optimization (separable case)

Dual optimization problem

\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j

subject to \sum_{i=1}^{n} \alpha_i y_i = 0 and \alpha_i \ge 0 for all i = 1, \dots, n.

Solving this quadratic program yields the optimal α. We substitute it back to get w:

w = w(\alpha) = \sum_{i=1}^{n} \alpha_i y_i x_i

What is the structure of the solution?

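As a concrete illustration (not from the slides), here is a minimal sketch that solves this dual on a tiny hand-made separable dataset using SciPy's general-purpose SLSQP solver; the data, tolerance, and choice of solver are illustrative assumptions, and real SVM packages use dedicated QP/SMO solvers instead.

```python
# Minimal sketch: dual SVM on a toy separable dataset; SLSQP stands in for a real QP solver.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [2.5, 3.0], [-1.0, -1.5], [-2.0, -1.0]])  # toy inputs
y = np.array([1.0, 1.0, -1.0, -1.0])                                # labels in {-1, +1}
n = len(y)

# G_ij = y_i y_j x_i^T x_j
G = (y[:, None] * X) @ (y[:, None] * X).T

def neg_dual(alpha):
    # negative dual objective: -( sum_i alpha_i - 1/2 alpha^T G alpha )
    return -(alpha.sum() - 0.5 * alpha @ G @ alpha)

res = minimize(neg_dual, x0=np.zeros(n), method="SLSQP",
               bounds=[(0.0, None)] * n,                              # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum_i alpha_i y_i = 0
alpha = res.x
w = (alpha * y) @ X                       # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                         # support vectors have alpha_i > 0
w0 = np.mean(y[sv] - X[sv] @ w)           # from y_i (w0 + w^T x_i) = 1 on the SVs
print("alpha:", np.round(alpha, 3), "w:", np.round(w, 3), "w0:", round(float(w0), 3))
```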

Review: SVM classification

w = \sum_{\alpha_i > 0} \alpha_i y_i x_i.

Given a test example x, how is it classified?

y = \operatorname{sign}\left(w_0 + w^T x\right) = \operatorname{sign}\left(w_0 + \Big(\sum_{\alpha_i > 0} \alpha_i y_i x_i\Big)^T x\right) = \operatorname{sign}\left(w_0 + \sum_{\alpha_i > 0} \alpha_i y_i\, x_i^T x\right)

The classifier is based on the expansion in terms of dot products of x with the support vectors.

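A small sketch (illustrative, not from the slides) of this support-vector expansion; alpha, w0, X, y are assumed to come from a dual solve like the one above.

```python
import numpy as np

def svm_predict(x, alpha, w0, X, y, tol=1e-6):
    """Classify x as sign(w0 + sum_{alpha_i > 0} alpha_i y_i x_i^T x)."""
    sv = alpha > tol                                   # keep only the support vectors
    score = w0 + np.sum(alpha[sv] * y[sv] * (X[sv] @ x))
    return np.sign(score)
```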

Review: Non-separable case

What if the training data are not linearly separable?

Basic idea: minimize

\tfrac{1}{2}\|w\|^2 + C \cdot (\text{penalty for violating margin constraints}).


Non-separable case

Rewrite the constraints with slack variables ξ_i ≥ 0:

\min_{w, w_0} \; \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i(w_0 + w^T x_i) - 1 + \xi_i \ge 0.

Whenever the margin is ≥ 1 (original constraint satisfied), ξ_i = 0.

Whenever the margin is < 1 (constraint violated), pay a linear penalty: ξ_i = 1 − y_i(w_0 + wᵀx_i).

Penalty function: \max\left(0,\; 1 - y_i(w_0 + w^T x_i)\right)

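Because ξ_i = max(0, 1 − y_i(w_0 + wᵀx_i)) at the optimum, the soft-margin problem can also be written as the unconstrained objective ½‖w‖² + C Σ_i max(0, 1 − y_i(w_0 + wᵀx_i)). Here is a hedged sketch that minimizes that form by plain subgradient descent; the step size, iteration count, and data are illustrative assumptions (real solvers use the QP dual or specialized updates).

```python
# Hedged sketch: subgradient descent on 1/2 ||w||^2 + C * sum_i max(0, 1 - y_i (w0 + w^T x_i)).
import numpy as np

def soft_margin_subgradient(X, y, C=1.0, lr=0.01, iters=1000):
    n, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(iters):
        margins = y * (w0 + X @ w)
        viol = margins < 1                                   # points violating the margin
        grad_w = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
        grad_w0 = -C * y[viol].sum()
        w, w0 = w - lr * grad_w, w0 - lr * grad_w0
    return w, w0
```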

Review: Hinge loss

\max\left(0,\; 1 - y_i(w_0 + w^T x_i)\right)


Connection between SVMs and logistic regression

Support vector machines:

Hinge loss: \max\left(0,\; 1 - y_i(w_0 + w^T x_i)\right)

Logistic regression:

P(y_i \mid x_i; w, w_0) = \frac{1}{1 + e^{-y_i(w_0 + w^T x_i)}}

Log loss: \log\left(1 + e^{-y_i(w_0 + w^T x_i)}\right)

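A tiny numerical comparison (illustrative, not from the slides) of the two losses as functions of the margin m = y_i(w_0 + wᵀx_i):

```python
# Compare hinge loss and logistic (log) loss on the same margins m.
import numpy as np

m = np.linspace(-2, 2, 9)           # a few margin values
hinge = np.maximum(0.0, 1.0 - m)    # max(0, 1 - m)
log_loss = np.log1p(np.exp(-m))     # log(1 + exp(-m))
for mi, h, l in zip(m, hinge, log_loss):
    print(f"m={mi:+.1f}  hinge={h:.3f}  log={l:.3f}")
```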

Review: Non-separable case Source: G. Shakhnarovich

Dual problem:

\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j

subject to \sum_{i=1}^{n} \alpha_i y_i = 0 and 0 \le \alpha_i \le C for all i = 1, \dots, n.

α_i = 0: not a support vector.

0 < α_i < C: support vector on the margin, ξ_i = 0.

α_i = C: over the margin, either misclassified (ξ_i > 1) or not (0 < ξ_i ≤ 1).


Nonlinear SVM

General idea: try to map the original input space into a high-dimensional feature space where the data is separable.


Example of nonlinear mapping

Not separable in 1D:

Separable in 2D:

What is φ(x)? φ(x) = (x, x²).

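A toy numerical version of this picture (the data and the separating line are illustrative assumptions): points labeled +1 inside [−1, 1] and −1 outside cannot be separated by a single threshold in 1D, but after φ(x) = (x, x²) the horizontal line x₂ = 2 separates them.

```python
# 1D data that is not linearly separable becomes separable under phi(x) = (x, x^2).
import numpy as np

x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.where(np.abs(x) <= 1.0, 1, -1)
phi = np.column_stack([x, x ** 2])       # map to 2D: (x, x^2)
# In 2D, the line x2 = 2 (i.e. w = (0, -1), w0 = 2) separates the classes:
print(np.sign(2.0 - phi[:, 1]) == y)     # all True
```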

Example of nonlinear mapping Source: G. Shakhnarovich

Consider the mapping

\phi: [x_1, x_2]^T \mapsto [1,\; \sqrt{2}x_1,\; \sqrt{2}x_2,\; x_1^2,\; x_2^2,\; \sqrt{2}x_1 x_2]^T.

The (linear) SVM classifier in the feature space:

y = \operatorname{sign}\left(w_0 + \sum_{\alpha_i > 0} \alpha_i y_i\, \phi(x_i)^T \phi(x)\right)

The dot product in the feature space:

\phi(x)^T \phi(z) = 1 + 2x_1 z_1 + 2x_2 z_2 + x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 x_2 z_1 z_2 = \left(1 + x^T z\right)^2.


Dot products and feature space Source: G. Shakhnarovich

We defined a non-linear mapping into feature space

\phi: [x_1, x_2]^T \mapsto [1,\; \sqrt{2}x_1,\; \sqrt{2}x_2,\; x_1^2,\; x_2^2,\; \sqrt{2}x_1 x_2]^T

and saw that \phi(x)^T \phi(z) = K(x, z) using the kernel

K(x, z) = \left(1 + x^T z\right)^2.

I.e., we can calculate dot products in the feature space implicitly, without ever writing out the feature expansion!

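A quick numerical check of this identity (the specific x and z are arbitrary):

```python
# Verify phi(x)^T phi(z) = (1 + x^T z)^2 for the mapping defined above.
import numpy as np

def phi(v):
    x1, x2 = v
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

x = np.array([0.7, -1.2])
z = np.array([2.0, 0.5])
print(phi(x) @ phi(z), (1 + x @ z) ** 2)   # the two numbers agree (3.24)
```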

The kernel trick Source: G. Shakhnarovich

Replace dot products in the SVM formulation with kernel values.

The optimization problem:

\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j)

Need to compute pairwise kernel values for the training data.

The classifier now defines a nonlinear decision boundary in the original space:

y = \operatorname{sign}\left(w_0 + \sum_{\alpha_i > 0} \alpha_i y_i\, K(x_i, x)\right)

Need to compute K(x_i, x) for all SVs x_i.

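A minimal sketch (illustrative) of this kernelized decision rule; the RBF kernel anticipates the later slides, and alpha, w0, X, y are assumed to come from solving the kernelized dual.

```python
# Kernelized decision rule: y = sign( w0 + sum_{alpha_i > 0} alpha_i y_i K(x_i, x) ).
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / sigma ** 2)

def kernel_predict(x, alpha, w0, X, y, kernel=rbf_kernel, tol=1e-6):
    sv = np.where(alpha > tol)[0]                       # support vector indices
    score = w0 + sum(alpha[i] * y[i] * kernel(X[i], x) for i in sv)
    return np.sign(score)
```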

Mercer’s kernels Source: G. Shakhnarovich

What kind of function K is a valid kernel, i.e., such that there exists a feature space Φ(x) in which K(x, z) = φ(x)ᵀφ(z)?

Theorem due to Mercer (1930s)

K must be

continuous;

symmetric: K(x, z) = K(z,x);

positive definite: for any x_1, \dots, x_N, the kernel matrix

K = \begin{pmatrix} K(x_1, x_1) & K(x_1, x_2) & \cdots & K(x_1, x_N) \\ \vdots & \vdots & \ddots & \vdots \\ K(x_N, x_1) & K(x_N, x_2) & \cdots & K(x_N, x_N) \end{pmatrix}

must be positive definite.

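These conditions can only be probed empirically on sampled points, but a quick sanity check (a sketch; the sample and tolerance are arbitrary) is that the kernel matrix should be symmetric with no negative eigenvalues.

```python
# Empirical sanity check of Mercer's conditions on a sampled kernel matrix.
import numpy as np

def gram_matrix(X, kernel):
    return np.array([[kernel(a, b) for b in X] for a in X])

def looks_like_mercer_kernel(X, kernel, tol=1e-10):
    K = gram_matrix(X, kernel)
    symmetric = np.allclose(K, K.T)
    eigvals = np.linalg.eigvalsh(K)           # eigenvalues of the symmetric matrix
    return symmetric and eigvals.min() >= -tol

X = np.random.randn(20, 2)
print(looks_like_mercer_kernel(X, lambda a, b: (1 + a @ b) ** 2))   # True
```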

Some popular kernels Source: G. Shakhnarovich

The linear kernel: K(x, z) = x^T z.

This leads to the original, linear SVM.

The polynomial kernel:

K(x, z; c, d) = (c + x^T z)^d.

We can write the expansion explicitly, by concatenating powers up to d and multiplying by appropriate weights.


Example: SVM with polynomial kernel Source: G. Shakhnarovich

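The figure for this slide is not reproduced here; as a rough stand-in (assuming scikit-learn is available, with illustrative data and parameters), a degree-2 polynomial kernel handles a circular class boundary:

```python
# Fit an SVM with a polynomial kernel to toy 2D data with a circular boundary.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0, 1, -1)   # circular boundary

clf = SVC(kernel="poly", degree=2, coef0=1.0, C=1.0).fit(X, y)
print(clf.n_support_, clf.score(X, y))                    # SVs per class, training accuracy
```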

Radial basis function kernel Source: G. Shakhnarovich

K(x, z; \sigma) = \exp\left(-\frac{1}{\sigma^2}\|x - z\|^2\right).

The RBF kernel is a measure of similarity between two examples.

The mapping φ(x) is infinite-dimensional!

What is the role of the parameter σ?

Consider σ → 0. Then K(x, z; σ) → 1 if x = z and → 0 if x ≠ z. The SVM simply “memorizes” the training data (overfitting, lack of generalization).

What about σ → ∞? Then K(x, z) → 1 for all x, z. The SVM underfits.

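An illustrative experiment (assuming scikit-learn's SVC, whose RBF kernel is exp(−γ‖x − z‖²), so γ plays the role of 1/σ² on this slide; the data and γ values are arbitrary choices):

```python
# Effect of the RBF width: large gamma ~ small sigma (memorization), small gamma ~ underfitting.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)        # XOR-like toy labels

for gamma in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)
    print(gamma, clf.n_support_.sum(), clf.score(X, y))
# Very large gamma tends to use many SVs and fit the training set almost perfectly;
# very small gamma tends toward underfitting.
```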

SVM with RBF (Gaussian) kernels Source: G. Shakhnarovich

Note: some SVs here are not close to the boundary.


Making more Mercer kernels

Let K_1 and K_2 be Mercer kernels. Then the following functions are also Mercer kernels:

K(x, z) = K_1(x, z) + K_2(x, z)

K(x, z) = a K_1(x, z) (a is a positive scalar)

K(x, z) = K_1(x, z) K_2(x, z)

K(x, z) = x^T B z (B is a symmetric positive semi-definite matrix)

Multiple kernel learning: learn kernel combinations as part of SVM optimization.

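A hedged sketch of these closure rules as plain kernel functions (the base kernels, the scalar a, and the matrix B are illustrative choices):

```python
# Building new kernels from existing ones: sum, positive scaling, product, x^T B z with B PSD.
import numpy as np

def k_lin(x, z):
    return x @ z

def k_rbf(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def k_sum(x, z):
    return k_lin(x, z) + k_rbf(x, z)            # K1 + K2

def k_scaled(x, z, a=2.0):
    return a * k_rbf(x, z)                      # a * K1 with a > 0

def k_prod(x, z):
    return k_lin(x, z) * k_rbf(x, z)            # K1 * K2

B = np.array([[2.0, 0.5], [0.5, 1.0]])          # symmetric positive semi-definite
def k_quad(x, z):
    return x @ B @ z                            # x^T B z
```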

Multi-class SVMs

Various “direct” formulations exist, but they are not widely used in practice. It is more common to obtain multi-class classifiers by combining two-class SVMs in various ways.

One vs. others:

Training: learn an SVM for each class vs. the others. Testing: apply each SVM to the test example and assign it the class of the SVM that returns the highest decision value.

One vs. one:

Training: learn an SVM for each pair of classes. Testing: each learned SVM “votes” for a class to assign to the test example.

Error-correcting codes, decision trees/DAGs, ...

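A hedged sketch of the one-vs.-others scheme described above, built from binary SVC classifiers (scikit-learn also ships this as OneVsRestClassifier); each binary problem encodes the target class as +1 and the rest as −1, and the highest decision value wins.

```python
# One-vs.-others multi-class classification from binary SVMs.
import numpy as np
from sklearn.svm import SVC

def ovr_fit(X, y):
    classes = np.unique(y)
    clfs = [SVC(kernel="rbf", C=1.0).fit(X, np.where(y == c, 1, -1)) for c in classes]
    return classes, clfs

def ovr_predict(X, classes, clfs):
    scores = np.column_stack([clf.decision_function(X) for clf in clfs])
    return classes[np.argmax(scores, axis=1)]   # class whose SVM gives the highest value
```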

Support Vector Regression Source: G. Shakhnarovich

The model: f(x) = w_0 + w^T x

Instead of the margin around the predicted decision boundary, we have an ε-tube around the predicted function.

[Figure: the ε-tube around the regression function y(x), bounded by y(x) + ε and y(x) − ε, plotted against x.]

ε-insensitive loss:

[Figure: the ε-insensitive loss as a function of the residual; zero on [−ε, ε], growing linearly outside.]

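A tiny numerical illustration of the ε-insensitive loss sketched above (the residuals and ε are arbitrary): errors inside the ε-tube cost nothing, errors outside grow linearly.

```python
import numpy as np

def eps_insensitive(y_true, y_pred, eps=0.5):
    return np.maximum(0.0, np.abs(y_true - y_pred) - eps)

print(eps_insensitive(np.array([0.0, 0.0, 0.0]), np.array([0.3, 0.8, -1.5])))
# -> [0.  0.3 1. ]
```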

Support Vector Regression

Optimization: introduce constraints and slack variables for going above or below the tube.

\min_{w_0, w} \; \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\xi_i + \hat{\xi}_i)

subject to (w_0 + w^T x_i) - y_i \le \varepsilon + \xi_i, \quad y_i - (w_0 + w^T x_i) \le \varepsilon + \hat{\xi}_i, \quad \xi_i, \hat{\xi}_i \ge 0, \quad i = 1, \dots, n.

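A hedged sketch of fitting this soft ε-tube model with scikit-learn's SVR (assumed available); the data are synthetic, and C and epsilon correspond to the C and ε above.

```python
# Fit support vector regression to noisy 1D data.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(2)
X = np.sort(rng.uniform(-3, 3, size=(100, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=100)

reg = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)
print(len(reg.support_), reg.score(X, y))   # number of support vectors, R^2 on the training set
```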

Support Vector Regression: Dual problem

\max_{\alpha, \hat{\alpha}} \; \sum_{i=1}^{n} (\hat{\alpha}_i - \alpha_i) y_i - \varepsilon \sum_{i=1}^{n} (\hat{\alpha}_i + \alpha_i) - \tfrac{1}{2} \sum_{i,j=1}^{n} (\hat{\alpha}_i - \alpha_i)(\hat{\alpha}_j - \alpha_j)\, x_i^T x_j,

subject to 0 \le \alpha_i, \hat{\alpha}_i \le C, \quad \sum_{i=1}^{n} (\hat{\alpha}_i - \alpha_i) = 0, \quad i = 1, \dots, n.

Note that at the solution we must have ξ_i ξ̂_i = 0 and α_i α̂_i = 0.

We can let β_i = α̂_i − α_i and simplify:

\max_{\beta} \; \sum_{i=1}^{n} y_i \beta_i - \varepsilon \sum_{i=1}^{n} |\beta_i| - \tfrac{1}{2} \sum_{i,j=1}^{n} \beta_i \beta_j\, x_i^T x_j,

subject to -C \le \beta_i \le C, \quad \sum_{i=1}^{n} \beta_i = 0, \quad i = 1, \dots, n.

Then f(x) = w_0^* + \sum_{i=1}^{n} \beta_i^*\, x_i^T x, where w_0^* is chosen so that f(x_i) − y_i = −ε for any i with 0 < β_i^* < C.


SVM: summary Source: G. Shakhnarovich

Main ideas:

Large margin classification

The kernel trick

What does the complexity/generalization ability of SVMs depend on?

The constant C

Choice of kernel and kernel parameters

Number of support vectors

A crucial component: a good QP solver.

Tons of off-the-shelf packages.


Advantages and disadvantages of SVMs

Advantages:

One of the most successful ML techniques!

Good generalization ability for small training sets

Kernel trick is powerful and flexible

Margin-based formalism can be extended to a large class of problems (regression, structured prediction, etc.)

Disadvantages:

Computational and storage complexity of training: quadratic in the size of the training set

In the worst case, can degenerate to nearest neighbor (every training point a support vector), but is much slower to train

No direct multi-class formulation
