Review: Support vector machines
COMP 875 Machine learning techniques and image analysis
(Source slides: lazebnik/fall09/lec06_svm.pdf)
Review: Support vector machines

Margin optimization

min_{w,w₀}  ½‖w‖²
subject to  yᵢ(w₀ + wᵀxᵢ) − 1 ≥ 0,   i = 1, …, n.

What are support vectors?

Review: Support vector machines (separable case)

Margin optimization

min_{w,w₀}  ½‖w‖²
subject to  yᵢ(w₀ + wᵀxᵢ) − 1 ≥ 0,   i = 1, …, n.

Add constraints as terms to the objective function:

min_{w,w₀}  ½‖w‖² + ∑_{i=1}^n max_{αᵢ≥0} αᵢ[1 − yᵢ(w₀ + wᵀxᵢ)]

What if yᵢ(w₀ + wᵀxᵢ) > 1 (not a support vector) at the solution? Then we must have αᵢ = 0.

Review: SVM optimization (separable case)

max_{αᵢ≥0} min_{w,w₀}  { ½‖w‖² + ∑_{i=1}^n αᵢ[1 − yᵢ(w₀ + wᵀxᵢ)] }

The quantity in braces is the Lagrangian L(w, w₀; α).

First, we fix α and minimize L(w, w₀; α) w.r.t. w, w₀:

∂_w L(w, w₀; α) = w − ∑_{i=1}^n αᵢyᵢxᵢ = 0,
∂_{w₀} L(w, w₀; α) = −∑_{i=1}^n αᵢyᵢ = 0.

Therefore

w(α) = ∑_{i=1}^n αᵢyᵢxᵢ,   ∑_{i=1}^n αᵢyᵢ = 0.

Review: SVM optimization (separable case)

w(α) = ∑_{i=1}^n αᵢyᵢxᵢ,   ∑_{i=1}^n αᵢyᵢ = 0.

Now we can substitute this solution into

max_{αᵢ≥0, ∑_i αᵢyᵢ=0}  { ½‖w(α)‖² + ∑_{i=1}^n αᵢ[1 − yᵢ(w₀(α) + w(α)ᵀxᵢ)] }

  = max_{αᵢ≥0, ∑_i αᵢyᵢ=0}  { ∑_{i=1}^n αᵢ − ½ ∑_{i,j=1}^n αᵢαⱼyᵢyⱼ xᵢᵀxⱼ }.

Review: SVM optimization (separable case)

Dual optimization problem

max_α  ∑_{i=1}^n αᵢ − ½ ∑_{i,j=1}^n αᵢαⱼyᵢyⱼ xᵢᵀxⱼ
subject to  ∑_{i=1}^n αᵢyᵢ = 0,   αᵢ ≥ 0 for all i = 1, …, n.

Solving this quadratic program yields the optimal α. We substitute it back to get w:

w = w(α) = ∑_{i=1}^n αᵢyᵢxᵢ

What is the structure of the solution?
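The dual above is an ordinary quadratic program, so a generic solver can handle small instances. Below is a minimal sketch that solves the hard-margin dual for a made-up, separable 2-D toy set with SciPy's SLSQP method and then recovers w and w₀ from the resulting α; in practice a dedicated QP solver would be used, and all values here are illustrative.

    # Sketch: solve the hard-margin dual with a generic solver, then recover w, w0.
    import numpy as np
    from scipy.optimize import minimize

    # Toy data: two well-separated clusters, labels in {-1, +1}.
    X = np.array([[2.0, 2.0], [2.5, 1.5], [3.0, 3.0],
                  [-2.0, -2.0], [-2.5, -1.5], [-3.0, -3.0]])
    y = np.array([1, 1, 1, -1, -1, -1], dtype=float)
    n = len(y)

    G = (y[:, None] * X) @ (y[:, None] * X).T   # G_ij = y_i y_j x_i^T x_j

    def neg_dual(alpha):
        # Minimize the negative of the dual objective.
        return -(alpha.sum() - 0.5 * alpha @ G @ alpha)

    cons = {"type": "eq", "fun": lambda a: a @ y}   # sum_i alpha_i y_i = 0
    bnds = [(0.0, None)] * n                        # alpha_i >= 0
    res = minimize(neg_dual, np.zeros(n), method="SLSQP", bounds=bnds, constraints=cons)
    alpha = res.x

    w = (alpha * y) @ X                             # w = sum_i alpha_i y_i x_i
    sv = alpha > 1e-6                               # support vectors
    w0 = np.mean(y[sv] - X[sv] @ w)                 # from y_i (w0 + w^T x_i) = 1 at SVs
    print("alpha:", np.round(alpha, 3))
    print("w:", w, "w0:", w0)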

Review: SVM classification

w = ∑_{αᵢ>0} αᵢyᵢxᵢ.

Given a test example x, how is it classified?

y = sign(w₀ + wᵀx)
  = sign(w₀ + (∑_{αᵢ>0} αᵢyᵢxᵢ)ᵀ x)
  = sign(w₀ + ∑_{αᵢ>0} αᵢyᵢ xᵢᵀx)

The classifier is based on the expansion in terms of dot products of x with the support vectors.
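A minimal sketch of this expansion, assuming the multipliers αᵢ, the bias w₀ and the support vectors are already given (the numbers below are made up rather than a trained model): a test point is classified using only its dot products with the support vectors, without ever forming w.

    # Sketch: support-vector expansion of the decision function.
    import numpy as np

    X_sv  = np.array([[2.0, 1.0], [-1.5, -1.0]])   # support vectors (illustrative)
    y_sv  = np.array([1.0, -1.0])                   # their labels
    alpha = np.array([0.4, 0.4])                    # their multipliers (illustrative)
    w0    = -0.1

    def classify(x):
        # sign(w0 + sum_i alpha_i y_i x_i^T x)
        return np.sign(w0 + np.sum(alpha * y_sv * (X_sv @ x)))

    print(classify(np.array([2.0, 2.0])))    # expected +1 side
    print(classify(np.array([-2.0, -2.0])))  # expected -1 side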

Review: Non-separable case

What if the training data are not linearly separable?

Basic idea: minimize

½‖w‖² + C · (penalty for violating margin constraints).

Non-separable case

Rewrite the constraints with slack variables ξᵢ ≥ 0:

min_{w,w₀}  ½‖w‖² + C ∑_{i=1}^n ξᵢ
subject to  yᵢ(w₀ + wᵀxᵢ) − 1 + ξᵢ ≥ 0.

Whenever the margin is ≥ 1 (original constraint is satisfied), ξᵢ = 0.

Whenever the margin is < 1 (constraint violated), pay a linear penalty: ξᵢ = 1 − yᵢ(w₀ + wᵀxᵢ).

Penalty function: max(0, 1 − yᵢ(w₀ + wᵀxᵢ))
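As a quick illustration with a made-up hyperplane (w, w₀) and toy data, the slacks ξᵢ and the soft-margin objective can be computed directly from these definitions:

    # Sketch: slacks xi_i = max(0, 1 - y_i (w0 + w^T x_i)) and the soft-margin objective.
    import numpy as np

    X = np.array([[2.0, 2.0], [0.2, 0.1], [-2.0, -2.0], [-0.3, 0.2]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    w, w0, C = np.array([1.0, 1.0]), 0.0, 1.0        # illustrative values, not a trained SVM

    margins = y * (w0 + X @ w)                        # y_i (w0 + w^T x_i)
    xi = np.maximum(0.0, 1.0 - margins)               # slack / hinge penalty per point
    objective = 0.5 * w @ w + C * xi.sum()            # (1/2)||w||^2 + C * sum_i xi_i
    print("margins:", margins)
    print("slacks: ", xi)
    print("objective:", objective)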

Review: Hinge loss

max(0, 1 − yᵢ(w₀ + wᵀxᵢ))

Connection between SVMs and logistic regression

Support vector machines:
Hinge loss: max(0, 1 − yᵢ(w₀ + wᵀxᵢ))

Logistic regression:
P(yᵢ | xᵢ; w, w₀) = 1 / (1 + e^{−yᵢ(w₀ + wᵀxᵢ)})
Log loss: log(1 + e^{−yᵢ(w₀ + wᵀxᵢ)})
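As a concrete illustration, the sketch below tabulates both losses as functions of the margin m = yᵢ(w₀ + wᵀxᵢ): both penalize small or negative margins, but only the hinge loss is exactly zero once m ≥ 1.

    # Sketch: hinge loss vs. log loss as a function of the margin m.
    import numpy as np

    m = np.linspace(-2.0, 3.0, 11)                 # margin values
    hinge = np.maximum(0.0, 1.0 - m)               # max(0, 1 - m)
    log_loss = np.log1p(np.exp(-m))                # log(1 + e^{-m})
    for mi, h, l in zip(m, hinge, log_loss):
        print(f"m = {mi:5.2f}   hinge = {h:5.2f}   log = {l:5.2f}")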

Review: Non-separable case (Source: G. Shakhnarovich)

Dual problem:

max_α  ∑_{i=1}^n αᵢ − ½ ∑_{i,j=1}^n αᵢαⱼyᵢyⱼ xᵢᵀxⱼ
subject to  ∑_{i=1}^n αᵢyᵢ = 0,   0 ≤ αᵢ ≤ C for all i = 1, …, n.

αᵢ = 0: not a support vector.
0 < αᵢ < C: support vector on the margin, ξᵢ = 0.
αᵢ = C: over the margin, either misclassified (ξᵢ > 1) or not (0 < ξᵢ ≤ 1).

Nonlinear SVM

General idea: try to map the original input space into a high-dimensional feature space where the data is separable.

Example of nonlinear mapping

Not separable in 1D. Separable in 2D.

What is φ(x)? φ(x) = (x, x²).
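A small sketch of this example with made-up 1-D points: after the mapping φ(x) = (x, x²), the two classes can be split by a line in the (x, x²) plane (the threshold below is chosen by eye for this toy data).

    # Sketch: 1-D data that is not linearly separable becomes separable under phi(x) = (x, x^2).
    import numpy as np

    x = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])
    y = np.array([ 1,    1,   -1,  -1,  -1,   1,   1 ])   # +1 outside, -1 inside

    phi = np.stack([x, x ** 2], axis=1)        # phi(x) = (x, x^2)

    # In the (x, x^2) plane the classes are split by the horizontal line x^2 = 2,
    # i.e. a linear boundary w = (0, 1), w0 = -2 in feature space.
    pred = np.sign(phi @ np.array([0.0, 1.0]) - 2.0)
    print(np.all(pred == y))                   # True: separable after the mapping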

Example of nonlinear mapping (Source: G. Shakhnarovich)

Consider the mapping:

φ : [x₁, x₂]ᵀ → [1, √2 x₁, √2 x₂, x₁², x₂², √2 x₁x₂]ᵀ.

The (linear) SVM classifier in the feature space:

y = sign(w₀ + ∑_{αᵢ>0} αᵢyᵢ φ(xᵢ)ᵀφ(x))

The dot product in the feature space:

φ(x)ᵀφ(z) = 1 + 2x₁z₁ + 2x₂z₂ + x₁²z₁² + x₂²z₂² + 2x₁x₂z₁z₂ = (1 + xᵀz)².

Dot products and feature space (Source: G. Shakhnarovich)

We defined a non-linear mapping into feature space

φ : [x₁, x₂]ᵀ → [1, √2 x₁, √2 x₂, x₁², x₂², √2 x₁x₂]ᵀ

and saw that φ(x)ᵀφ(z) = K(x, z) using the kernel

K(x, z) = (1 + xᵀz)².

I.e., we can calculate dot products in the feature space implicitly, without ever writing out the feature expansion!
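A quick numerical spot-check of this identity on random 2-D points: the explicit 6-dimensional dot product and the kernel evaluated in the original space agree.

    # Sketch: verify phi(x)^T phi(z) = (1 + x^T z)^2 for the explicit quadratic feature map.
    import numpy as np

    def phi(v):
        x1, x2 = v
        return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                         x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

    rng = np.random.default_rng(0)
    x, z = rng.normal(size=2), rng.normal(size=2)

    lhs = phi(x) @ phi(z)                  # explicit 6-D dot product
    rhs = (1.0 + x @ z) ** 2               # kernel evaluated in the original 2-D space
    print(lhs, rhs, np.isclose(lhs, rhs))  # the two agree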

The kernel trick (Source: G. Shakhnarovich)

Replace dot products in the SVM formulation with kernel values.

The optimization problem:

max_α  ∑_{i=1}^n αᵢ − ½ ∑_{i,j=1}^n αᵢαⱼyᵢyⱼ K(xᵢ, xⱼ)

Need to compute pairwise kernel values for the training data.

The classifier now defines a nonlinear decision boundary in the original space:

y = sign(w₀ + ∑_{αᵢ>0} αᵢyᵢ K(xᵢ, x))

Need to compute K(xᵢ, x) for all support vectors xᵢ.
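A minimal sketch of both computations, assuming a polynomial kernel; the multipliers α and bias w₀ below are illustrative placeholders rather than the solution of the dual.

    # Sketch: pairwise kernel matrix for training data, and the kernelized decision function.
    import numpy as np

    def poly_kernel(a, b, c=1.0, d=2):
        return (c + a @ b) ** d

    X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [-1.0, -1.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    n = len(y)

    # Pairwise kernel values needed by the dual objective.
    K = np.array([[poly_kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    print(K)

    # Kernelized decision function: sign(w0 + sum_i alpha_i y_i K(x_i, x)).
    alpha, w0 = np.array([0.3, 0.3, 0.3, 0.3]), 0.0   # placeholders, not from the QP
    def decision(x):
        return np.sign(w0 + np.sum(alpha * y * np.array([poly_kernel(xi, x) for xi in X])))

    print(decision(np.array([0.5, 0.5])))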

Mercer's kernels (Source: G. Shakhnarovich)

What kind of function K is a valid kernel, i.e. such that there exists a feature space φ(x) in which K(x, z) = φ(x)ᵀφ(z)?

Theorem due to Mercer (1930s): K must be

continuous;
symmetric: K(x, z) = K(z, x);
positive definite: for any x₁, …, x_N, the kernel matrix

K = [ K(x₁, x₁)   K(x₁, x₂)   …   K(x₁, x_N)
      ⋮            ⋮                ⋮
      K(x_N, x₁)  K(x_N, x₂)  …   K(x_N, x_N) ]

must be positive definite.
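These conditions can be spot-checked numerically. The sketch below builds an RBF kernel matrix on random points and verifies symmetry and non-negative eigenvalues (up to round-off); this is a sanity check, not a proof of the Mercer property.

    # Sketch: numerical check of symmetry and positive semi-definiteness of a kernel matrix.
    import numpy as np

    def rbf(a, b, sigma=1.0):
        return np.exp(-np.sum((a - b) ** 2) / sigma ** 2)

    rng = np.random.default_rng(1)
    pts = rng.normal(size=(8, 2))
    K = np.array([[rbf(a, b) for b in pts] for a in pts])

    print("symmetric:", np.allclose(K, K.T))
    print("min eigenvalue:", np.linalg.eigvalsh(K).min())   # >= 0 up to numerical error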

Some popular kernels (Source: G. Shakhnarovich)

The linear kernel:
K(x, z) = xᵀz.
This leads to the original, linear SVM.

The polynomial kernel:
K(x, z; c, d) = (c + xᵀz)^d.
We can write the expansion explicitly, by concatenating powers up to d and multiplying by appropriate weights.

Example: SVM with polynomial kernel (Source: G. Shakhnarovich)

Radial basis function kernel (Source: G. Shakhnarovich)

K(x, z; σ) = exp(−‖x − z‖² / σ²).

The RBF kernel is a measure of similarity between two examples.

The mapping φ(x) is infinite-dimensional!

What is the role of the parameter σ?

Consider σ → 0. Then K(x, z; σ) → 1 if x = z and → 0 if x ≠ z. The SVM simply "memorizes" the training data (overfitting, lack of generalization).

What about σ → ∞? Then K(x, z) → 1 for all x, z. The SVM underfits.
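The role of σ is easy to see numerically; the sketch below evaluates the kernel for one fixed pair of points over a range of illustrative σ values.

    # Sketch: effect of sigma on the RBF kernel value for a fixed pair of points.
    # Small sigma drives K toward 0 for x != z; large sigma drives K toward 1 for all pairs.
    import numpy as np

    def rbf(a, b, sigma):
        return np.exp(-np.sum((a - b) ** 2) / sigma ** 2)

    x = np.array([0.0, 0.0])
    z = np.array([1.0, 1.0])
    for sigma in [0.1, 0.5, 1.0, 5.0, 50.0]:
        print(f"sigma = {sigma:5.1f}   K(x, z) = {rbf(x, z, sigma):.6f}")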

SVM with RBF (Gaussian) kernels (Source: G. Shakhnarovich)

Note: some support vectors here are not close to the boundary.

Making more Mercer kernels

Let K₁ and K₂ be Mercer kernels. Then the following functions are also Mercer kernels:

K(x, z) = K₁(x, z) + K₂(x, z)
K(x, z) = a K₁(x, z)   (a is a positive scalar)
K(x, z) = K₁(x, z) K₂(x, z)
K(x, z) = xᵀBz   (B is a symmetric positive semi-definite matrix)

Multiple kernel learning: learn kernel combinations as part of the SVM optimization.
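As an illustration of these closure rules, kernels can be composed as ordinary functions and the resulting kernel matrix spot-checked for symmetry and non-negative eigenvalues; the particular combination below is arbitrary.

    # Sketch: compose kernels (positive scaling, sum, product) and sanity-check the Gram matrix.
    import numpy as np

    def linear(a, b):
        return a @ b

    def rbf(a, b, sigma=1.0):
        return np.exp(-np.sum((a - b) ** 2) / sigma ** 2)

    def k_sum(k1, k2):   return lambda a, b: k1(a, b) + k2(a, b)
    def k_scale(k1, c):  return lambda a, b: c * k1(a, b)        # c > 0
    def k_prod(k1, k2):  return lambda a, b: k1(a, b) * k2(a, b)

    combined = k_sum(k_scale(linear, 0.5), k_prod(rbf, rbf))

    rng = np.random.default_rng(2)
    pts = rng.normal(size=(6, 3))
    K = np.array([[combined(a, b) for b in pts] for a in pts])
    # Symmetric, with eigenvalues >= 0 up to round-off.
    print(np.allclose(K, K.T), np.linalg.eigvalsh(K).min())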

Multi-class SVMs

Various "direct" formulations exist, but they are not widely used in practice. It is more common to obtain multi-class classifiers by combining two-class SVMs in various ways.

One vs. others (sketched in code below):
Training: learn an SVM for each class vs. the others.
Testing: apply each SVM to the test example and assign it the class of the SVM that returns the highest decision value.

One vs. one:
Training: learn an SVM for each pair of classes.
Testing: each learned SVM "votes" for a class to assign to the test example.

Other options: error-correcting codes, decision trees/DAGs, …
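A rough sketch of the one-vs.-others scheme, assuming scikit-learn is available; SVC is used here only as a convenient two-class SVM, and the three-blob data set is made up for illustration.

    # Sketch: one-vs.-others multi-class classification built from binary SVMs.
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(3)
    # Three Gaussian blobs as a toy 3-class problem.
    X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2)) for c in ([0, 0], [3, 0], [0, 3])])
    y = np.repeat([0, 1, 2], 20)

    # Train one binary SVM per class: this class (+1) vs. all others (-1).
    models = {c: SVC(kernel="linear").fit(X, np.where(y == c, 1, -1)) for c in np.unique(y)}

    def predict(x):
        # Assign the class whose SVM returns the highest decision value.
        scores = {c: m.decision_function(x.reshape(1, -1))[0] for c, m in models.items()}
        return max(scores, key=scores.get)

    print(predict(np.array([0.1, 2.9])))   # expected: class 2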

Support Vector Regression (Source: G. Shakhnarovich)

The model: f(x) = w₀ + wᵀx

Instead of the margin around the predicted decision boundary, we have an ε-tube around the predicted function: the region between y(x) − ε and y(x) + ε.

ε-insensitive loss: zero inside the tube, growing linearly once the error exceeds ε.
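For concreteness, a small sketch of the ε-insensitive loss max(0, |f(x) − y| − ε), which is zero inside the tube and linear outside it; the residuals and ε below are arbitrary.

    # Sketch: epsilon-insensitive loss used by SVR.
    import numpy as np

    def eps_insensitive(residual, eps=0.5):
        return np.maximum(0.0, np.abs(residual) - eps)

    residuals = np.array([-1.5, -0.6, -0.3, 0.0, 0.3, 0.6, 1.5])
    print(eps_insensitive(residuals))   # [1.0, 0.1, 0.0, 0.0, 0.0, 0.1, 1.0]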

Support Vector Regression

Optimization: introduce constraints and slack variables for going above or below the tube.

min_{w₀,w}  ½‖w‖² + C ∑_{i=1}^n (ξᵢ + ξ̂ᵢ)
subject to  (w₀ + wᵀxᵢ) − yᵢ ≤ ε + ξᵢ,
            yᵢ − (w₀ + wᵀxᵢ) ≤ ε + ξ̂ᵢ,
            ξᵢ, ξ̂ᵢ ≥ 0,   i = 1, …, n.

Support Vector Regression: Dual problem

max_{α,α̂}  ∑_{i=1}^n (α̂ᵢ − αᵢ) yᵢ − ε ∑_{i=1}^n (α̂ᵢ + αᵢ) − ½ ∑_{i,j=1}^n (α̂ᵢ − αᵢ)(α̂ⱼ − αⱼ) xᵢᵀxⱼ,
subject to  0 ≤ αᵢ, α̂ᵢ ≤ C,   ∑_{i=1}^n (α̂ᵢ − αᵢ) = 0,   i = 1, …, n.

Note that at the solution we must have ξᵢ ξ̂ᵢ = 0 and αᵢ α̂ᵢ = 0.

We can let βᵢ = α̂ᵢ − αᵢ and simplify:

max_β  ∑_{i=1}^n yᵢβᵢ − ε ∑_{i=1}^n |βᵢ| − ½ ∑_{i,j=1}^n βᵢβⱼ xᵢᵀxⱼ,
subject to  −C ≤ βᵢ ≤ C,   ∑_{i=1}^n βᵢ = 0,   i = 1, …, n.

Then f(x) = w₀* + ∑_{i=1}^n βᵢ* xᵢᵀx, where w₀* is chosen so that f(xᵢ) − yᵢ = −ε for any i with 0 < βᵢ* < C.

SVM: summary (Source: G. Shakhnarovich)

Main ideas:
Large margin classification
The kernel trick

What does the complexity/generalization ability of SVMs depend on?
The constant C
Choice of kernel and kernel parameters
Number of support vectors

A crucial component: a good QP solver. Tons of off-the-shelf packages.

Advantages and disadvantages of SVMs

Advantages:
One of the most successful ML techniques!
Good generalization ability for small training sets
The kernel trick is powerful and flexible
The margin-based formalism can be extended to a large class of problems (regression, structured prediction, etc.)

Disadvantages:
Computational and storage complexity of training: quadratic in the size of the training set
In the worst case, can degenerate to nearest neighbor (every training point a support vector), but is much slower to train
No direct multi-class formulation

