Review: Support vector machines
COMP 875: Machine learning techniques and image analysis (lazebnik/fall09/lec06_svm.pdf, 2009-09-14)


Review: Support vector machines

Margin optimization

\min_{w, w_0} \; \tfrac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i(w_0 + w^T x_i) - 1 \ge 0, \quad i = 1, \dots, n.

What are support vectors?


Review: Support vector machines (separable case)

Margin optimization

\min_{w, w_0} \; \tfrac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i(w_0 + w^T x_i) - 1 \ge 0, \quad i = 1, \dots, n.

Add constraints as terms to the objective function:

\min_{w, w_0} \; \tfrac{1}{2}\|w\|^2 + \sum_{i=1}^{n} \max_{\alpha_i \ge 0} \alpha_i \left[ 1 - y_i(w_0 + w^T x_i) \right]

What if y_i(w_0 + w^T x_i) > 1 (not a support vector) at the solution? Then we must have α_i = 0.

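A quick aside (not on the original slide) on why this reformulation is equivalent: the inner maximization acts as a barrier for each constraint,

\max_{\alpha_i \ge 0} \alpha_i \left[ 1 - y_i(w_0 + w^T x_i) \right] =
\begin{cases}
0 & \text{if } y_i(w_0 + w^T x_i) \ge 1,\\
+\infty & \text{otherwise,}
\end{cases}

so the modified objective agrees with \tfrac{1}{2}\|w\|^2 on feasible (w, w_0) and is infinite elsewhere.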

Review: SVM optimization (separable case)

\max_{\{\alpha_i \ge 0\}} \; \min_{w, w_0} \; \underbrace{\left\{ \tfrac{1}{2}\|w\|^2 + \sum_{i=1}^{n} \alpha_i \left[ 1 - y_i(w_0 + w^T x_i) \right] \right\}}_{L(w, w_0; \alpha)}

First, we fix α and minimize L(w, w_0; α) w.r.t. w, w_0:

\partial_w L(w, w_0; \alpha) = w - \sum_{i=1}^{n} \alpha_i y_i x_i = 0, \qquad \partial_{w_0} L(w, w_0; \alpha) = -\sum_{i=1}^{n} \alpha_i y_i = 0,

which gives

w(\alpha) = \sum_{i=1}^{n} \alpha_i y_i x_i, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0.


Review: SVM optimization (separable case)

w(\alpha) = \sum_{i=1}^{n} \alpha_i y_i x_i, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0.

Now we can substitute this solution into

\max_{\{\alpha_i \ge 0,\; \sum_i \alpha_i y_i = 0\}} \left\{ \tfrac{1}{2}\|w(\alpha)\|^2 + \sum_{i=1}^{n} \alpha_i \left[ 1 - y_i(w_0(\alpha) + w(\alpha)^T x_i) \right] \right\}
= \max_{\{\alpha_i \ge 0,\; \sum_i \alpha_i y_i = 0\}} \left\{ \sum_{i=1}^{n} \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j \right\}.


Review: SVM optimization (separable case)

Dual optimization problem

\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j

subject to \sum_{i=1}^{n} \alpha_i y_i = 0 and \alpha_i \ge 0 for all i = 1, \dots, n.

Solving this quadratic program yields the optimal α. We substitute it back to get w:

w = w(\alpha) = \sum_{i=1}^{n} \alpha_i y_i x_i

What is the structure of the solution?

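As a concrete illustration (not from the slides), here is a minimal sketch that solves this dual on a tiny hand-made separable dataset using SciPy's general-purpose SLSQP solver; the data, tolerance, and choice of solver are illustrative assumptions, and real SVM packages use dedicated QP/SMO solvers instead.

```python
# Minimal sketch: dual SVM on a toy separable dataset; SLSQP stands in for a real QP solver.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [2.5, 3.0], [-1.0, -1.5], [-2.0, -1.0]])  # toy inputs
y = np.array([1.0, 1.0, -1.0, -1.0])                                # labels in {-1, +1}
n = len(y)

# G_ij = y_i y_j x_i^T x_j
G = (y[:, None] * X) @ (y[:, None] * X).T

def neg_dual(alpha):
    # negative dual objective: -( sum_i alpha_i - 1/2 alpha^T G alpha )
    return -(alpha.sum() - 0.5 * alpha @ G @ alpha)

res = minimize(neg_dual, x0=np.zeros(n), method="SLSQP",
               bounds=[(0.0, None)] * n,                              # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum_i alpha_i y_i = 0
alpha = res.x
w = (alpha * y) @ X                       # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                         # support vectors have alpha_i > 0
w0 = np.mean(y[sv] - X[sv] @ w)           # from y_i (w0 + w^T x_i) = 1 on the SVs
print("alpha:", np.round(alpha, 3), "w:", np.round(w, 3), "w0:", round(float(w0), 3))
```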

Review: SVM classification

w = \sum_{\alpha_i > 0} \alpha_i y_i x_i.

Given a test example x, how is it classified?

y = \operatorname{sign}\left(w_0 + w^T x\right) = \operatorname{sign}\left(w_0 + \Big(\sum_{\alpha_i > 0} \alpha_i y_i x_i\Big)^T x\right) = \operatorname{sign}\left(w_0 + \sum_{\alpha_i > 0} \alpha_i y_i\, x_i^T x\right)

The classifier is based on the expansion in terms of dot products of x with the support vectors.

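A small sketch (illustrative, not from the slides) of this support-vector expansion; alpha, w0, X, y are assumed to come from a dual solve like the one above.

```python
import numpy as np

def svm_predict(x, alpha, w0, X, y, tol=1e-6):
    """Classify x as sign(w0 + sum_{alpha_i > 0} alpha_i y_i x_i^T x)."""
    sv = alpha > tol                                   # keep only the support vectors
    score = w0 + np.sum(alpha[sv] * y[sv] * (X[sv] @ x))
    return np.sign(score)
```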

Review: Non-separable case

What if the training data are not linearly separable?

Basic idea: minimize

\tfrac{1}{2}\|w\|^2 + C \cdot (\text{penalty for violating margin constraints}).


Non-separable case

Rewrite the constraints with slack variables ξ_i ≥ 0:

\min_{w, w_0} \; \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i(w_0 + w^T x_i) - 1 + \xi_i \ge 0.

Whenever the margin is ≥ 1 (original constraint satisfied), ξ_i = 0.

Whenever the margin is < 1 (constraint violated), pay a linear penalty: ξ_i = 1 − y_i(w_0 + wᵀx_i).

Penalty function: \max\left(0,\; 1 - y_i(w_0 + w^T x_i)\right)

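Because ξ_i = max(0, 1 − y_i(w_0 + wᵀx_i)) at the optimum, the soft-margin problem can also be written as the unconstrained objective ½‖w‖² + C Σ_i max(0, 1 − y_i(w_0 + wᵀx_i)). Here is a hedged sketch that minimizes that form by plain subgradient descent; the step size, iteration count, and data are illustrative assumptions (real solvers use the QP dual or specialized updates).

```python
# Hedged sketch: subgradient descent on 1/2 ||w||^2 + C * sum_i max(0, 1 - y_i (w0 + w^T x_i)).
import numpy as np

def soft_margin_subgradient(X, y, C=1.0, lr=0.01, iters=1000):
    n, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(iters):
        margins = y * (w0 + X @ w)
        viol = margins < 1                                   # points violating the margin
        grad_w = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
        grad_w0 = -C * y[viol].sum()
        w, w0 = w - lr * grad_w, w0 - lr * grad_w0
    return w, w0
```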

Review: Hinge loss

\max\left(0,\; 1 - y_i(w_0 + w^T x_i)\right)


Connection between SVMs and logistic regression

Support vector machines:

Hinge loss: \max\left(0,\; 1 - y_i(w_0 + w^T x_i)\right)

Logistic regression:

P(y_i \mid x_i; w, w_0) = \frac{1}{1 + e^{-y_i(w_0 + w^T x_i)}}

Log loss: \log\left(1 + e^{-y_i(w_0 + w^T x_i)}\right)

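A tiny numerical comparison (illustrative, not from the slides) of the two losses as functions of the margin m = y_i(w_0 + wᵀx_i):

```python
# Compare hinge loss and logistic (log) loss on the same margins m.
import numpy as np

m = np.linspace(-2, 2, 9)           # a few margin values
hinge = np.maximum(0.0, 1.0 - m)    # max(0, 1 - m)
log_loss = np.log1p(np.exp(-m))     # log(1 + exp(-m))
for mi, h, l in zip(m, hinge, log_loss):
    print(f"m={mi:+.1f}  hinge={h:.3f}  log={l:.3f}")
```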

Review: Non-separable case Source: G. Shakhnarovich

Dual problem:

\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j

subject to \sum_{i=1}^{n} \alpha_i y_i = 0 and 0 \le \alpha_i \le C for all i = 1, \dots, n.

α_i = 0: not a support vector.

0 < α_i < C: support vector on the margin, ξ_i = 0.

α_i = C: over the margin, either misclassified (ξ_i > 1) or not (0 < ξ_i ≤ 1).


Nonlinear SVM

General idea: try to map the original input space into a high-dimensional feature space where the data is separable.


Example of nonlinear mapping

Not separable in 1D:

Separable in 2D:

What is φ(x)? φ(x) = (x, x²).

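A toy numerical version of this picture (the data and the separating line are illustrative assumptions): points labeled +1 inside [−1, 1] and −1 outside cannot be separated by a single threshold in 1D, but after φ(x) = (x, x²) the horizontal line x₂ = 2 separates them.

```python
# 1D data that is not linearly separable becomes separable under phi(x) = (x, x^2).
import numpy as np

x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.where(np.abs(x) <= 1.0, 1, -1)
phi = np.column_stack([x, x ** 2])       # map to 2D: (x, x^2)
# In 2D, the line x2 = 2 (i.e. w = (0, -1), w0 = 2) separates the classes:
print(np.sign(2.0 - phi[:, 1]) == y)     # all True
```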

Example of nonlinear mapping Source: G. Shakhnarovich

Consider the mapping

\phi: [x_1, x_2]^T \mapsto [1,\; \sqrt{2}x_1,\; \sqrt{2}x_2,\; x_1^2,\; x_2^2,\; \sqrt{2}x_1 x_2]^T.

The (linear) SVM classifier in the feature space:

y = \operatorname{sign}\left(w_0 + \sum_{\alpha_i > 0} \alpha_i y_i\, \phi(x_i)^T \phi(x)\right)

The dot product in the feature space:

\phi(x)^T \phi(z) = 1 + 2x_1 z_1 + 2x_2 z_2 + x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 x_2 z_1 z_2 = \left(1 + x^T z\right)^2.


Dot products and feature space Source: G. Shakhnarovich

We defined a non-linear mapping into feature space

\phi: [x_1, x_2]^T \mapsto [1,\; \sqrt{2}x_1,\; \sqrt{2}x_2,\; x_1^2,\; x_2^2,\; \sqrt{2}x_1 x_2]^T

and saw that \phi(x)^T \phi(z) = K(x, z) using the kernel

K(x, z) = \left(1 + x^T z\right)^2.

I.e., we can calculate dot products in the feature space implicitly, without ever writing out the feature expansion!

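A quick numerical check of this identity (the specific x and z are arbitrary):

```python
# Verify phi(x)^T phi(z) = (1 + x^T z)^2 for the mapping defined above.
import numpy as np

def phi(v):
    x1, x2 = v
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

x = np.array([0.7, -1.2])
z = np.array([2.0, 0.5])
print(phi(x) @ phi(z), (1 + x @ z) ** 2)   # the two numbers agree (3.24)
```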

The kernel trick Source: G. Shakhnarovich

Replace dot products in the SVM formulation with kernel values.

The optimization problem:

\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j)

Need to compute pairwise kernel values for the training data.

The classifier now defines a nonlinear decision boundary in the original space:

y = \operatorname{sign}\left(w_0 + \sum_{\alpha_i > 0} \alpha_i y_i\, K(x_i, x)\right)

Need to compute K(x_i, x) for all SVs x_i.

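A minimal sketch (illustrative) of this kernelized decision rule; the RBF kernel anticipates the later slides, and alpha, w0, X, y are assumed to come from solving the kernelized dual.

```python
# Kernelized decision rule: y = sign( w0 + sum_{alpha_i > 0} alpha_i y_i K(x_i, x) ).
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / sigma ** 2)

def kernel_predict(x, alpha, w0, X, y, kernel=rbf_kernel, tol=1e-6):
    sv = np.where(alpha > tol)[0]                       # support vector indices
    score = w0 + sum(alpha[i] * y[i] * kernel(X[i], x) for i in sv)
    return np.sign(score)
```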

Mercer’s kernels Source: G. Shakhnarovich

What kind of function K is a valid kernel, i.e., such that there exists a feature space Φ(x) in which K(x, z) = φ(x)ᵀφ(z)?

Theorem due to Mercer (1930s)

K must be

continuous;

symmetric: K(x, z) = K(z,x);

positive definite: for any x_1, \dots, x_N, the kernel matrix

K = \begin{pmatrix} K(x_1, x_1) & K(x_1, x_2) & \cdots & K(x_1, x_N) \\ \vdots & \vdots & \ddots & \vdots \\ K(x_N, x_1) & K(x_N, x_2) & \cdots & K(x_N, x_N) \end{pmatrix}

must be positive definite.

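These conditions can only be probed empirically on sampled points, but a quick sanity check (a sketch; the sample and tolerance are arbitrary) is that the kernel matrix should be symmetric with no negative eigenvalues.

```python
# Empirical sanity check of Mercer's conditions on a sampled kernel matrix.
import numpy as np

def gram_matrix(X, kernel):
    return np.array([[kernel(a, b) for b in X] for a in X])

def looks_like_mercer_kernel(X, kernel, tol=1e-10):
    K = gram_matrix(X, kernel)
    symmetric = np.allclose(K, K.T)
    eigvals = np.linalg.eigvalsh(K)           # eigenvalues of the symmetric matrix
    return symmetric and eigvals.min() >= -tol

X = np.random.randn(20, 2)
print(looks_like_mercer_kernel(X, lambda a, b: (1 + a @ b) ** 2))   # True
```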

Some popular kernels Source: G. Shakhnarovich

The linear kernel: K(x, z) = x^T z.

This leads to the original, linear SVM.

The polynomial kernel:

K(x, z; c, d) = (c + x^T z)^d.

We can write the expansion explicitly, by concatenating powers up to d and multiplying by appropriate weights.


Example: SVM with polynomial kernel Source: G. Shakhnarovich

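The figure for this slide is not reproduced here; as a rough stand-in (assuming scikit-learn is available, with illustrative data and parameters), a degree-2 polynomial kernel handles a circular class boundary:

```python
# Fit an SVM with a polynomial kernel to toy 2D data with a circular boundary.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0, 1, -1)   # circular boundary

clf = SVC(kernel="poly", degree=2, coef0=1.0, C=1.0).fit(X, y)
print(clf.n_support_, clf.score(X, y))                    # SVs per class, training accuracy
```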

Radial basis function kernel Source: G. Shakhnarovich

K(x, z; \sigma) = \exp\left(-\frac{1}{\sigma^2}\|x - z\|^2\right).

The RBF kernel is a measure of similarity between two examples.

The mapping φ(x) is infinite-dimensional!

What is the role of the parameter σ?

Consider σ → 0. Then K(x, z; σ) → 1 if x = z and → 0 if x ≠ z. The SVM simply “memorizes” the training data (overfitting, lack of generalization).

What about σ → ∞? Then K(x, z) → 1 for all x, z. The SVM underfits.

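An illustrative experiment (assuming scikit-learn's SVC, whose RBF kernel is exp(−γ‖x − z‖²), so γ plays the role of 1/σ² on this slide; the data and γ values are arbitrary choices):

```python
# Effect of the RBF width: large gamma ~ small sigma (memorization), small gamma ~ underfitting.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)        # XOR-like toy labels

for gamma in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)
    print(gamma, clf.n_support_.sum(), clf.score(X, y))
# Very large gamma tends to use many SVs and fit the training set almost perfectly;
# very small gamma tends toward underfitting.
```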

SVM with RBF (Gaussian) kernels Source: G. Shakhnarovich

Note: some SVs here are not close to the boundary.


Making more Mercer kernels

Let K_1 and K_2 be Mercer kernels. Then the following functions are also Mercer kernels:

K(x, z) = K_1(x, z) + K_2(x, z)

K(x, z) = a K_1(x, z) (a is a positive scalar)

K(x, z) = K_1(x, z) K_2(x, z)

K(x, z) = x^T B z (B is a symmetric positive semi-definite matrix)

Multiple kernel learning: learn kernel combinations as part of SVM optimization.

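A hedged sketch of these closure rules as plain kernel functions (the base kernels, the scalar a, and the matrix B are illustrative choices):

```python
# Building new kernels from existing ones: sum, positive scaling, product, x^T B z with B PSD.
import numpy as np

def k_lin(x, z):
    return x @ z

def k_rbf(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def k_sum(x, z):
    return k_lin(x, z) + k_rbf(x, z)            # K1 + K2

def k_scaled(x, z, a=2.0):
    return a * k_rbf(x, z)                      # a * K1 with a > 0

def k_prod(x, z):
    return k_lin(x, z) * k_rbf(x, z)            # K1 * K2

B = np.array([[2.0, 0.5], [0.5, 1.0]])          # symmetric positive semi-definite
def k_quad(x, z):
    return x @ B @ z                            # x^T B z
```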

Multi-class SVMs

Various “direct” formulations exist, but they are not widely used in practice. It is more common to obtain multi-class classifiers by combining two-class SVMs in various ways.

One vs. others:

Training: learn an SVM for each class vs. the others. Testing: apply each SVM to the test example and assign it the class of the SVM that returns the highest decision value.

One vs. one:

Training: learn an SVM for each pair of classes. Testing: each learned SVM “votes” for a class to assign to the test example.

Error-correcting codes, decision trees/DAGs, ...

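A hedged sketch of the one-vs.-others scheme described above, built from binary SVC classifiers (scikit-learn also ships this as OneVsRestClassifier); each binary problem encodes the target class as +1 and the rest as −1, and the highest decision value wins.

```python
# One-vs.-others multi-class classification from binary SVMs.
import numpy as np
from sklearn.svm import SVC

def ovr_fit(X, y):
    classes = np.unique(y)
    clfs = [SVC(kernel="rbf", C=1.0).fit(X, np.where(y == c, 1, -1)) for c in classes]
    return classes, clfs

def ovr_predict(X, classes, clfs):
    scores = np.column_stack([clf.decision_function(X) for clf in clfs])
    return classes[np.argmax(scores, axis=1)]   # class whose SVM gives the highest value
```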

Support Vector Regression Source: G. Shakhnarovich

The model: f(x) = w_0 + w^T x

Instead of the margin around the predicted decision boundary, we have an ε-tube around the predicted function.

[Figure: the ε-tube around the regression function y(x), bounded by y(x) + ε and y(x) − ε, plotted against x.]

ε-insensitive loss:

[Figure: the ε-insensitive loss as a function of the residual; zero on [−ε, ε], growing linearly outside.]

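A tiny numerical illustration of the ε-insensitive loss sketched above (the residuals and ε are arbitrary): errors inside the ε-tube cost nothing, errors outside grow linearly.

```python
import numpy as np

def eps_insensitive(y_true, y_pred, eps=0.5):
    return np.maximum(0.0, np.abs(y_true - y_pred) - eps)

print(eps_insensitive(np.array([0.0, 0.0, 0.0]), np.array([0.3, 0.8, -1.5])))
# -> [0.  0.3 1. ]
```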

Support Vector Regression

Optimization: introduce constraints and slack variables for going above or below the tube.

\min_{w_0, w} \; \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\xi_i + \hat{\xi}_i)

subject to (w_0 + w^T x_i) - y_i \le \varepsilon + \xi_i, \quad y_i - (w_0 + w^T x_i) \le \varepsilon + \hat{\xi}_i, \quad \xi_i, \hat{\xi}_i \ge 0, \quad i = 1, \dots, n.

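A hedged sketch of fitting this soft ε-tube model with scikit-learn's SVR (assumed available); the data are synthetic, and C and epsilon correspond to the C and ε above.

```python
# Fit support vector regression to noisy 1D data.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(2)
X = np.sort(rng.uniform(-3, 3, size=(100, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=100)

reg = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)
print(len(reg.support_), reg.score(X, y))   # number of support vectors, R^2 on the training set
```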

Support Vector Regression: Dual problem

\max_{\alpha, \hat{\alpha}} \; \sum_{i=1}^{n} (\hat{\alpha}_i - \alpha_i) y_i - \varepsilon \sum_{i=1}^{n} (\hat{\alpha}_i + \alpha_i) - \tfrac{1}{2} \sum_{i,j=1}^{n} (\hat{\alpha}_i - \alpha_i)(\hat{\alpha}_j - \alpha_j)\, x_i^T x_j,

subject to 0 \le \alpha_i, \hat{\alpha}_i \le C, \quad \sum_{i=1}^{n} (\hat{\alpha}_i - \alpha_i) = 0, \quad i = 1, \dots, n.

Note that at the solution we must have ξ_i ξ̂_i = 0 and α_i α̂_i = 0.

We can let β_i = α̂_i − α_i and simplify:

\max_{\beta} \; \sum_{i=1}^{n} y_i \beta_i - \varepsilon \sum_{i=1}^{n} |\beta_i| - \tfrac{1}{2} \sum_{i,j=1}^{n} \beta_i \beta_j\, x_i^T x_j,

subject to -C \le \beta_i \le C, \quad \sum_{i=1}^{n} \beta_i = 0, \quad i = 1, \dots, n.

Then f(x) = w_0^* + \sum_{i=1}^{n} \beta_i^*\, x_i^T x, where w_0^* is chosen so that f(x_i) − y_i = −ε for any i with 0 < β_i^* < C.


SVM: summary Source: G. Shakhnarovich

Main ideas:

Large margin classification

The kernel trick

What does the complexity/generalization ability of SVMs depend on?

The constant C

Choice of kernel and kernel parameters

Number of support vectors

A crucial component: a good QP solver.

Tons of off-the-shelf packages.


Advantages and disadvantages of SVMs

Advantages:

One of the most successful ML techniques!

Good generalization ability for small training sets

Kernel trick is powerful and flexible

Margin-based formalism can be extended to a large class of problems (regression, structured prediction, etc.)

Disadvantages:

Computational and storage complexity of training: quadratic in the size of the training set

In the worst case, can degenerate to nearest neighbor (every training point a support vector), but is much slower to train

No direct multi-class formulation
