Review: Support vector machines
Margin optimization
$$\min_{w, w_0} \frac{1}{2}\|w\|^2$$
subject to $y_i(w_0 + w^T x_i) - 1 \ge 0$, $i = 1, \ldots, n$.
What are support vectors?
COMP 875 Machine learning techniques and image analysis
Review: Support vector machines (separable case)
Margin optimization
$$\min_{w, w_0} \frac{1}{2}\|w\|^2$$
subject to $y_i(w_0 + w^T x_i) - 1 \ge 0$, $i = 1, \ldots, n$.
Add constraints as terms to the objective function:
$$\min_{w, w_0} \frac{1}{2}\|w\|^2 + \sum_{i=1}^{n} \max_{\alpha_i \ge 0} \alpha_i \left[1 - y_i(w_0 + w^T x_i)\right]$$
What if $y_i(w_0 + w^T x_i) > 1$ (not a support vector) at the solution?
Then we must have $\alpha_i = 0$.
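To see why the penalty terms reproduce the constraints, it may help to write out the inner maximization for a single example. The shorthand $m_i$ is introduced here only for brevity and is not part of the original notation.

```latex
m_i = y_i (w_0 + w^T x_i), \qquad
\max_{\alpha_i \ge 0} \alpha_i (1 - m_i) =
\begin{cases}
0, & m_i \ge 1 \quad (\text{attained only at } \alpha_i = 0 \text{ when } m_i > 1), \\
+\infty, & m_i < 1.
\end{cases}
```

The outer minimization therefore never selects a $(w, w_0)$ that violates a constraint, and $\alpha_i$ can be nonzero only when $m_i = 1$, i.e. for a support vector.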
Review: SVM optimization (separable case)
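$$\max_{\{\alpha_i \ge 0\}} \min_{w, w_0} \underbrace{\left\{ \frac{1}{2}\|w\|^2 + \sum_{i=1}^{n} \alpha_i \left[1 - y_i(w_0 + w^T x_i)\right] \right\}}_{L(w, w_0; \alpha)}$$
First, we fix $\alpha$ and minimize $L(w, w_0; \alpha)$ with respect to $w$, $w_0$:
$$\frac{\partial}{\partial w} L(w, w_0; \alpha) = w - \sum_{i=1}^{n} \alpha_i y_i x_i = 0,$$
$$\frac{\partial}{\partial w_0} L(w, w_0; \alpha) = -\sum_{i=1}^{n} \alpha_i y_i = 0.$$
$$w(\alpha) = \sum_{i=1}^{n} \alpha_i y_i x_i, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0.$$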
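The stationarity conditions can be checked numerically. The following is a minimal sketch in plain Python; the three data points and the multipliers `alpha` are made-up illustrative values (not the solution of any training problem), chosen only so that the sum of `alpha_i * y_i` is zero.

```python
def w_of_alpha(alpha, X, y):
    """w(alpha) = sum_i alpha_i * y_i * x_i."""
    d = len(X[0])
    return [sum(a * yi * xi[k] for a, yi, xi in zip(alpha, y, X)) for k in range(d)]

def lagrangian_grad(w, alpha, X, y):
    """Gradient of L = 0.5*||w||^2 + sum_i alpha_i*(1 - y_i*(w0 + w.x_i)) w.r.t. w and w0."""
    d = len(w)
    grad_w = [w[k] - sum(a * yi * xi[k] for a, yi, xi in zip(alpha, y, X)) for k in range(d)]
    grad_w0 = -sum(a * yi for a, yi in zip(alpha, y))
    return grad_w, grad_w0

# Hypothetical toy values for illustration only.
X = [(1.0, 1.0), (-1.0, -1.0), (2.0, 0.0)]
y = [1, -1, 1]
alpha = [0.3, 0.5, 0.2]  # 0.3 - 0.5 + 0.2 = 0

w = w_of_alpha(alpha, X, y)
grad_w, grad_w0 = lagrangian_grad(w, alpha, X, y)
print(w, grad_w, grad_w0)
```

Both gradients vanish at `w(alpha)`, which is what the two conditions above state.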
Review: SVM optimization (separable case)
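$$w(\alpha) = \sum_{i=1}^{n} \alpha_i y_i x_i, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0.$$
Now we can substitute this solution into
$$\max_{\{\alpha_i \ge 0,\ \sum_i \alpha_i y_i = 0\}} \left\{ \frac{1}{2}\|w(\alpha)\|^2 + \sum_{i=1}^{n} \alpha_i \left[1 - y_i\left(w_0(\alpha) + w(\alpha)^T x_i\right)\right] \right\}$$
$$= \max_{\{\alpha_i \ge 0,\ \sum_i \alpha_i y_i = 0\}} \left\{ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j \right\}.$$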
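The substitution step can be spelled out term by term. The abbreviation $Q(\alpha)$ is introduced here only to shorten the expressions.

```latex
Q(\alpha) = \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j
          = \|w(\alpha)\|^2
          = \sum_{i=1}^{n} \alpha_i y_i \, w(\alpha)^T x_i,
\qquad
w_0 \sum_{i=1}^{n} \alpha_i y_i = 0,
\\[6pt]
\frac{1}{2}\|w(\alpha)\|^2 + \sum_{i=1}^{n} \alpha_i
  - w_0 \sum_{i=1}^{n} \alpha_i y_i
  - \sum_{i=1}^{n} \alpha_i y_i \, w(\alpha)^T x_i
= \frac{1}{2} Q(\alpha) + \sum_{i=1}^{n} \alpha_i - 0 - Q(\alpha)
= \sum_{i=1}^{n} \alpha_i - \frac{1}{2} Q(\alpha).
```

Note that $w_0$ drops out entirely because of the constraint $\sum_i \alpha_i y_i = 0$.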
Review: SVM optimization (separable case)
Dual optimization problem
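$$\max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j$$
subject to $\sum_{i=1}^{n} \alpha_i y_i = 0$, $\alpha_i \ge 0$ for all $i = 1, \ldots, n$.
Solving this quadratic program yields the optimal $\alpha$. We substitute it back to get $w$:
$$w = w(\alpha) = \sum_{i=1}^{n} \alpha_i y_i x_i$$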
What is the structure of the solution?
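As a small illustration of the dual, the sketch below evaluates the objective on a two-point toy problem. The data are made up, and the one-dimensional grid search is only a stand-in for a real QP solver: with two points of opposite labels, the equality constraint forces both multipliers to be equal, so a single scalar `a` parameterizes the feasible set.

```python
def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def dual_objective(alpha, X, y):
    """sum_i alpha_i - 0.5 * sum_ij alpha_i alpha_j y_i y_j x_i.x_j"""
    n = len(X)
    quad = sum(alpha[i] * alpha[j] * y[i] * y[j] * dot(X[i], X[j])
               for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * quad

# Hypothetical toy data: one positive and one negative example.
X = [(1.0, 1.0), (-1.0, -1.0)]
y = [1, -1]

# Feasible alphas have the form [a, a]; search a on a grid (not a real QP solver).
grid = [k / 1000 for k in range(1001)]
best_a = max(grid, key=lambda a: dual_objective([a, a], X, y))
alpha = [best_a, best_a]

# Substitute back: w = sum_i alpha_i y_i x_i; w0 from a support vector (margin exactly 1).
w = [sum(a * yi * xi[k] for a, yi, xi in zip(alpha, y, X)) for k in range(2)]
w0 = y[0] - dot(w, X[0])
print(best_a, w, w0)
```

Here the objective reduces to `2a - 4a^2`, which is maximized at `a = 0.25`, giving `w = (0.5, 0.5)` and `w0 = 0`.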
Review: SVM classification
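$$w = \sum_{\alpha_i > 0} \alpha_i y_i x_i.$$
Given a test example $x$, how is it classified?
$$y = \operatorname{sign}\left(w_0 + w^T x\right) = \operatorname{sign}\left(w_0 + \Big(\sum_{\alpha_i > 0} \alpha_i y_i x_i\Big)^T x\right) = \operatorname{sign}\left(w_0 + \sum_{\alpha_i > 0} \alpha_i y_i x_i^T x\right)$$
The classifier is based on the expansion in terms of dot products of $x$ with support vectors.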
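The expansion over support vectors can be sketched as follows. The support set, multipliers, and bias below are hypothetical values for illustration, and assigning a score of exactly zero to the positive class is an arbitrary convention of this sketch.

```python
def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def decision_value(x, support, w0):
    """w0 + sum over support vectors of alpha_i * y_i * (x_i . x)."""
    return w0 + sum(a * yi * dot(xi, x) for a, yi, xi in support)

def classify(x, support, w0):
    # Ties (score == 0) go to +1 here; this is an assumption, not part of the slides.
    return 1 if decision_value(x, support, w0) >= 0 else -1

# Hypothetical support vectors as (alpha_i, y_i, x_i) triples.
support = [(0.25, 1, (1.0, 1.0)), (0.25, -1, (-1.0, -1.0))]
w0 = 0.0

print(classify((2.0, 0.5), support, w0))   # score 1.25
print(classify((-3.0, 1.0), support, w0))  # score -1.0
```

Only the training points with nonzero multipliers need to be stored at test time.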
Review: Non-separable case
What if the training data are not linearly separable?
Basic idea: minimize
$$\frac{1}{2}\|w\|^2 + C \cdot (\text{penalty for violating margin constraints}).$$
Non-separable case
Rewrite the constraints with slack variables $\xi_i \ge 0$:
$$\min_{w, w_0} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i$$
subject to $y_i(w_0 + w^T x_i) - 1 + \xi_i \ge 0$.
Whenever margin is $\ge 1$ (original constraint is satisfied), $\xi_i = 0$.
Whenever margin is $< 1$ (constraint violated), pay linear penalty: $\xi_i = 1 - y_i(w_0 + w^T x_i)$.
Penalty function
$$\max\left(0,\ 1 - y_i(w_0 + w^T x_i)\right)$$
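The slack values and the soft-margin objective can be computed directly from the penalty function. This is a minimal sketch with made-up data and a hand-picked hyperplane (not a trained one); the function names are illustrative.

```python
def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def slacks(w, w0, X, y):
    """xi_i = max(0, 1 - y_i * (w0 + w.x_i)) for each training example."""
    return [max(0.0, 1.0 - yi * (w0 + dot(w, xi))) for xi, yi in zip(X, y)]

def soft_margin_objective(w, w0, X, y, C):
    """0.5 * ||w||^2 + C * sum_i xi_i"""
    return 0.5 * dot(w, w) + C * sum(slacks(w, w0, X, y))

# Hypothetical values: margins are 2.0, 0.5 and -0.5 respectively.
w, w0 = (1.0, 0.0), 0.0
X = [(2.0, 0.0), (0.5, 0.0), (-0.5, 0.0)]
y = [1, 1, 1]

print(slacks(w, w0, X, y))
print(soft_margin_objective(w, w0, X, y, C=1.0))
```

The first point satisfies the original constraint and pays nothing; the second is inside the margin but correctly classified; the third is misclassified and has slack greater than 1.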
Review: Hinge loss
$$\max\left(0,\ 1 - y_i(w_0 + w^T x_i)\right)$$
Connection between SVMs and logistic regression
Support vector machines:
Hinge loss: $\max\left(0,\ 1 - y_i(w_0 + w^T x_i)\right)$
Logistic regression:
$$P(y_i \mid x_i; w, w_0) = \frac{1}{1 + e^{-y_i(w_0 + w^T x_i)}}$$
Log loss: $\log\left(1 + e^{-y_i(w_0 + w^T x_i)}\right)$
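Both losses are functions of the same quantity, the signed margin $m = y_i(w_0 + w^T x_i)$. The sketch below tabulates them side by side for a few margin values chosen arbitrarily for illustration.

```python
import math

def hinge_loss(m):
    """SVM loss as a function of the margin m = y * (w0 + w.x)."""
    return max(0.0, 1.0 - m)

def log_loss(m):
    """Logistic regression loss: negative log of P(y | x) = 1 / (1 + exp(-m))."""
    return math.log(1.0 + math.exp(-m))

for m in [-2.0, 0.0, 1.0, 3.0]:
    print(f"margin={m:+.1f}  hinge={hinge_loss(m):.4f}  log={log_loss(m):.4f}")
```

The hinge loss is exactly zero once the margin reaches 1, whereas the log loss stays positive for every finite margin and only decays towards zero.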
Review: Non-separable case Source: G. Shakhnarovich
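Dual problem:
$$\max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j$$
subject to $\sum_{i=1}^{n} \alpha_i y_i = 0$, $0 \le \alpha_i \le C$ for all $i = 1, \ldots, n$.
$\alpha_i = 0$: not a support vector.
$0 < \alpha_i < C$: SV on the margin, $\xi_i = 0$.
$\alpha_i = C$: over the margin, either misclassified ($\xi_i > 1$) or not ($0 < \xi_i \le 1$).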
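The three cases can be read off the multipliers once the dual is solved. A minimal sketch, assuming the multipliers come from some QP solver and are therefore only accurate up to a tolerance `tol` (a hypothetical parameter of this sketch):

```python
def categorize(alpha_i, C, tol=1e-8):
    """Map a dual multiplier to one of the three cases listed above."""
    if alpha_i <= tol:
        return "not a support vector"
    if alpha_i >= C - tol:
        return "support vector over the margin (alpha = C)"
    return "support vector on the margin (slack = 0)"

# Hypothetical multipliers for illustration.
C = 1.0
alphas = [0.0, 0.3, 1.0]
labels = [categorize(a, C) for a in alphas]
for a, label in zip(alphas, labels):
    print(a, "->", label)
```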
Nonlinear SVM
General idea: try to map the original input space into a high-dimensional feature space where the data is separable.
Example of nonlinear mapping
Not separable in 1D:
Separable in 2D:
What is $\phi(x)$? $\phi(x) = (x, x^2)$.
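The effect of this mapping can be checked on a small example. The four points and labels below are made up (the data in the original figures are not recoverable), and the separating line in 2D is hand-picked rather than learned.

```python
def phi(x):
    """Map a scalar x to the 2D feature vector (x, x^2)."""
    return (x, x * x)

def separable_1d(xs, ys):
    """Brute force: does any threshold t and orientation s classify every point correctly?"""
    pts = sorted(xs)
    thresholds = [pts[0] - 1.0] + [(a + b) / 2 for a, b in zip(pts, pts[1:])] + [pts[-1] + 1.0]
    return any(all(yi * s * (xi - t) > 0 for xi, yi in zip(xs, ys))
               for t in thresholds for s in (1, -1))

# Hypothetical data: outer points are positive, inner points are negative.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [1, -1, -1, 1]

# In feature space, the horizontal line x^2 = 2.5 separates the classes.
w, w0 = (0.0, 1.0), -2.5
separated_2d = all(yi * (w0 + w[0] * phi(xi)[0] + w[1] * phi(xi)[1]) > 0
                   for xi, yi in zip(xs, ys))

print(separable_1d(xs, ys), separated_2d)
```

No single threshold works on the line, but after the mapping a linear boundary separates the two classes.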
Example of nonlinear mapping Source: G. Shakhnarovich
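Consider the mapping:
$$\phi : [x_1, x_2]^T \to [1,\ \sqrt{2}x_1,\ \sqrt{2}x_2,\ x_1^2,\ x_2^2,\ \sqrt{2}x_1 x_2]^T.$$
The (linear) SVM classifier in the feature space:
$$y = \operatorname{sign}\left(w_0 + \sum_{\alpha_i > 0} \alpha_i y_i \phi(x_i)^T \phi(x)\right)$$
The dot product in the feature space:
$$\phi(x)^T \phi(z) = 1 + 2x_1 z_1 + 2x_2 z_2 + x_1^2 z_1^2 + x_2^2 z_2^2 + 2x_1 x_2 z_1 z_2 = \left(1 + x^T z\right)^2.$$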
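The identity can be verified numerically. The sketch below computes the dot product both ways for one arbitrarily chosen pair of 2D points.

```python
import math

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def phi(x):
    """Explicit 6-dimensional feature expansion of a 2D point."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return (1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2)

def quadratic_kernel(x, z):
    """(1 + x.z)^2, computed in the original 2D space."""
    return (1.0 + dot(x, z)) ** 2

x, z = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(z))      # 1 + 6 - 4 + 9 + 4 - 12
implicit = quadratic_kernel(x, z)   # (1 + 1)^2
print(explicit, implicit)
```

Both routes give 4 (up to floating-point rounding), but the second never builds the 6-dimensional vectors.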
Dot products and feature space Source: G. Shakhnarovich
We defined a non-linear mapping into feature space
$$\phi : [x_1, x_2]^T \to [1,\ \sqrt{2}x_1,\ \sqrt{2}x_2,\ x_1^2,\ x_2^2,\ \sqrt{2}x_1 x_2]^T$$
and saw that $\phi(x)^T \phi(z) = K(x, z)$ using the kernel
$$K(x, z) = \left(1 + x^T z\right)^2.$$
I.e., we can calculate dot products in the feature space implicitly, without ever writing the feature expansion!
The kernel trick Source: G. Shakhnarovich
Replace dot products in the SVM formulation with kernel values.
The optimization problem:
$$\max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$
Need to compute pairwise kernel values for training data.
The classifier now defines a nonlinear decision boundary in the original space:
$$y = \operatorname{sign}\left(w_0 + \sum_{\alpha_i > 0} \alpha_i y_i K(x_i, x)\right)$$
Need to compute $K(x_i, x)$ for all SVs $x_i$.
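The two computations named above (pairwise kernel values for training, kernel values against the support vectors for testing) can be sketched as follows. The support set, multipliers, and bias are hypothetical and do not come from a solved training problem; the quadratic kernel is used only as an example.

```python
def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def quadratic_kernel(x, z):
    return (1.0 + dot(x, z)) ** 2

def gram_matrix(X, kernel):
    """Pairwise kernel values K(x_i, x_j) needed by the dual optimization."""
    return [[kernel(a, b) for b in X] for a in X]

def kernel_classify(x, support, w0, kernel):
    """sign(w0 + sum over SVs of alpha_i * y_i * K(x_i, x)); ties go to +1 by assumption."""
    score = w0 + sum(a * yi * kernel(xi, x) for a, yi, xi in support)
    return 1 if score >= 0 else -1

# Hypothetical support vectors as (alpha_i, y_i, x_i) triples.
support = [(0.5, 1, (1.0, 0.0)), (0.5, -1, (0.0, 0.0))]
w0 = -1.0

G = gram_matrix([xi for _, _, xi in support], quadratic_kernel)
print(G)
print(kernel_classify((2.0, 0.0), support, w0, quadratic_kernel))
print(kernel_classify((0.0, 0.0), support, w0, quadratic_kernel))
```

Swapping in a different kernel only requires passing a different function; the rest of the classifier is unchanged.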
Mercer’s kernels Source: G. Shakhnarovich
What kind of function $K$ is a valid kernel, i.e. such that there exists a feature space mapping $\phi(x)$ in which $K(x, z) = \phi(x)^T \phi(z)$?
Theorem due to Mercer (1930s)
$K$ must be
continuous;
symmetric: $K(x, z) = K(z, x)$;
positive semi-definite: for any $x_1, \ldots, x_N$, the kernel matrix
$$K = \begin{bmatrix} K(x_1, x_1) & K(x_1, x_2) & \cdots & K(x_1, x_N) \\ \vdots & \vdots & \ddots & \vdots \\ K(x_N, x_1) & K(x_N, x_2) & \cdots & K(x_N, x_N) \end{bmatrix}$$
must be positive semi-definite.
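The symmetry and definiteness conditions can be spot-checked on a finite set of points. The sketch below is only a necessary-condition test, not a proof: it evaluates the quadratic form $c^T K c$ for some coefficient vectors $c$. The points, the random trial vectors, and the counterexample function `neg_sq_dist` are all illustrative choices.

```python
import random

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def gram(points, kernel):
    return [[kernel(a, b) for b in points] for a in points]

def is_symmetric(K, tol=1e-12):
    n = len(K)
    return all(abs(K[i][j] - K[j][i]) <= tol for i in range(n) for j in range(n))

def quadratic_form(K, c):
    """c^T K c, which must be >= 0 for every c if K is a valid kernel matrix."""
    n = len(K)
    return sum(c[i] * K[i][j] * c[j] for i in range(n) for j in range(n))

def poly2(x, z):
    return (1.0 + dot(x, z)) ** 2

def neg_sq_dist(x, z):
    # Symmetric, but not a valid kernel (used here as a counterexample).
    return -sum((a - b) ** 2 for a, b in zip(x, z))

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
rng = random.Random(0)
trials = [[rng.uniform(-1, 1) for _ in points] for _ in range(1000)]

K_poly = gram(points, poly2)
K_bad = gram(points, neg_sq_dist)
poly_min = min(quadratic_form(K_poly, c) for c in trials)
bad_value = quadratic_form(K_bad, [1.0, 1.0, 1.0])
print(is_symmetric(K_poly), poly_min, bad_value)
```

The quadratic kernel never produces a negative value in these trials, while the negated squared distance fails already for the all-ones vector.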
Some popular kernels Source: G. Shakhnarovich
The linear kernel: $K(x, z) = x^T z$.
This leads to the original, linear SVM.
The polynomial kernel:
$$K(x, z; c, d) = (c + x^T z)^d.$$
We can write the expansion explicitly, by concatenating powers up to $d$ and multiplying by appropriate weights.
Example: SVM with polynomial kernel Source: G. Shakhnarovich
Radial basis function kernel Source: G. Shakhnarovich
$$K(x, z; \sigma) = \exp\left(-\frac{1}{\sigma^2}\|x - z\|^2\right).$$
The RBF kernel is a measure of similarity between two examples.
The mapping $\phi(x)$ is infinite-dimensional!
What is the role of parameter $\sigma$?
Consider $\sigma \to 0$. Then $K(x, z; \sigma) \to 1$ if $x = z$ or $0$ if $x \ne z$. The SVM simply “memorizes” the training data (overfitting, lack of generalization).
What about $\sigma \to \infty$? Then $K(x, z) \to 1$ for all $x, z$. The SVM underfits.
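The two limits can be observed numerically. A minimal sketch, with an arbitrary pair of points and two arbitrary values of `sigma` standing in for "very small" and "very large":

```python
import math

def rbf(x, z, sigma):
    """exp(-||x - z||^2 / sigma^2), following the parameterization above."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / sigma ** 2)

x, z = (0.0, 0.0), (1.0, 1.0)   # squared distance 2

small = rbf(x, z, 0.1)     # distinct points look completely dissimilar
large = rbf(x, z, 100.0)   # distinct points look almost identical
same = rbf(x, x, 0.1)      # a point is always maximally similar to itself
print(small, large, same)
```

With a tiny `sigma` each training point is similar only to itself, and with a huge `sigma` all points look alike, matching the overfitting and underfitting regimes described above.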
SVM with RBF (Gaussian) kernels Source: G. Shakhnarovich
Note: some support vectors here are not close to the boundary.
Making more Mercer kernels
Let $K_1$ and $K_2$ be Mercer kernels. Then the following functions are also Mercer kernels:
$K(x, z) = K_1(x, z) + K_2(x, z)$
$K(x, z) = aK_1(x, z)$ ($a$ is a positive scalar)
$K(x, z) = K_1(x, z)K_2(x, z)$
$K(x, z) = x^T B z$ ($B$ is a symmetric positive semi-definite matrix)
Multiple kernel learning: learn kernel combinations as part of SVM optimization.
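The closure rules translate naturally into higher-order functions. In the sketch below, the helper names (`add_kernels`, `scale_kernel`, `multiply_kernels`) are made up for illustration, and the weights are fixed by hand rather than learned as in multiple kernel learning.

```python
def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def linear(x, z):
    return dot(x, z)

def add_kernels(k1, k2):
    return lambda x, z: k1(x, z) + k2(x, z)

def scale_kernel(a, k):
    if a <= 0:
        raise ValueError("the scalar a must be positive")
    return lambda x, z: a * k(x, z)

def multiply_kernels(k1, k2):
    return lambda x, z: k1(x, z) * k2(x, z)

# Combined kernel: 2 * (x.z) + (x.z)^2, built only from the rules above.
combined = add_kernels(scale_kernel(2.0, linear), multiply_kernels(linear, linear))

x, z = (1.0, 2.0), (3.0, -1.0)   # x.z = 1
print(combined(x, z))
```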
Multi-class SVMs
Various “direct” formulations exist, but they are not widely used in practice. It is more common to obtain multi-class classifiers by combining two-class SVMs in various ways.
One vs. others:
  Training: learn an SVM for each class vs. the others
  Testing: apply each SVM to the test example and assign to it the class of the SVM that returns the highest decision value
One vs. one:
  Training: learn an SVM for each pair of classes
  Testing: each learned SVM “votes” for a class to assign to the test example
Error-correcting codes, decision trees/DAGs, etc.
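The two testing rules can be sketched as follows. The per-class scoring functions and pairwise classifiers are hand-written stand-ins for trained two-class SVMs, and the class names are hypothetical; ties in the vote are broken arbitrarily by `Counter.most_common`.

```python
from collections import Counter

def one_vs_others_predict(x, scorers):
    """scorers maps class -> decision function; pick the class with the highest value."""
    return max(scorers, key=lambda c: scorers[c](x))

def one_vs_one_predict(x, pair_classifiers):
    """pair_classifiers maps (class_a, class_b) -> function returning the winning class."""
    votes = Counter(clf(x) for clf in pair_classifiers.values())
    return votes.most_common(1)[0][0]

# Hypothetical stand-ins for trained SVM decision functions on 2D inputs.
scorers = {
    "left": lambda x: -x[0],
    "right": lambda x: x[0],
    "up": lambda x: x[1] - 1.0,
}
pair_classifiers = {
    ("left", "right"): lambda x: "right" if x[0] >= 0 else "left",
    ("left", "up"): lambda x: "up" if x[1] >= 1 else "left",
    ("right", "up"): lambda x: "up" if x[1] >= 1 else "right",
}

for point in [(0.5, 3.0), (2.0, 0.0)]:
    print(point, one_vs_others_predict(point, scorers), one_vs_one_predict(point, pair_classifiers))
```

One vs. others needs one classifier per class, while one vs. one needs one per pair of classes.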
Support Vector Regression Source: G. Shakhnarovich
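The model: $f(x) = w_0 + w^T x$
Instead of the margin around the predicted decision boundary, we have an ε-tube around the predicted function.
[Figure: the ε-tube around the predicted function $y(x)$, bounded by $y(x) + \varepsilon$ and $y(x) - \varepsilon$; axes $x$ and $y$.]
ε-insensitive loss:
[Figure: plot of the ε-insensitive loss; horizontal axis from −1 to 1 with −ε and ε marked, vertical axis from 0 to 2.]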
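The ε-insensitive loss is commonly written as $\max(0, |y - f(x)| - \varepsilon)$; the slide shows it only as a plot, so this closed form is stated here as the standard definition rather than quoted from the source. A minimal sketch with arbitrary values:

```python
def eps_insensitive_loss(y, f_x, eps):
    """Zero inside the tube |y - f(x)| <= eps, linear outside it."""
    return max(0.0, abs(y - f_x) - eps)

eps = 0.5
inside = eps_insensitive_loss(1.0, 1.25, eps)   # residual 0.25, inside the tube
above = eps_insensitive_loss(1.0, 2.0, eps)     # residual 1.0, prediction too high
below = eps_insensitive_loss(1.0, -0.5, eps)    # residual 1.5, prediction too low
print(inside, above, below)
```

Errors smaller than ε cost nothing, which is what allows many training points to drop out of the solution.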
Support Vector Regression
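Optimization: introduce constraints and slack variables for going above or below the tube.
$$\min_{w_0, w} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\xi_i + \tilde{\xi}_i)$$
subject to
$$(w_0 + w^T x_i) - y_i \le \varepsilon + \xi_i,$$
$$y_i - (w_0 + w^T x_i) \le \varepsilon + \tilde{\xi}_i,$$
$$\xi_i, \tilde{\xi}_i \ge 0, \quad i = 1, \ldots, n.$$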
Support Vector Regression: Dual problem
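$$\max_{\alpha, \tilde{\alpha}} \sum_{i=1}^{n} (\tilde{\alpha}_i - \alpha_i) y_i - \varepsilon \sum_{i=1}^{n} (\tilde{\alpha}_i + \alpha_i) - \frac{1}{2} \sum_{i,j=1}^{n} (\tilde{\alpha}_i - \alpha_i)(\tilde{\alpha}_j - \alpha_j) x_i^T x_j,$$
subject to $0 \le \alpha_i, \tilde{\alpha}_i \le C$, $\sum_{i=1}^{n} (\tilde{\alpha}_i - \alpha_i) = 0$, $i = 1, \ldots, n$.
Note that at the solution, we must have $\xi_i \tilde{\xi}_i = 0$, $\alpha_i \tilde{\alpha}_i = 0$.
We can let $\beta_i = \tilde{\alpha}_i - \alpha_i$ and simplify:
$$\max_{\beta} \sum_{i=1}^{n} y_i \beta_i - \varepsilon \sum_{i=1}^{n} |\beta_i| - \frac{1}{2} \sum_{i,j=1}^{n} \beta_i \beta_j x_i^T x_j,$$
subject to $-C \le \beta_i \le C$, $\sum_{i=1}^{n} \beta_i = 0$, $i = 1, \ldots, n$.
Then $f(x) = w_0^* + \sum_{i=1}^{n} \beta_i^* x_i^T x$, where $w_0^*$ is chosen so that $f(x_i) - y_i = -\varepsilon$ for any $i$ with $0 < \beta_i^* < C$.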
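The prediction rule and the choice of the bias can be sketched as follows. The coefficients `beta` below are hypothetical (they sum to zero but are not the output of a solved QP), and the helper names are made up for illustration.

```python
def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def svr_bias(beta, X, y, eps, C, tol=1e-8):
    """Pick w0 so that f(x_i) - y_i = -eps for some i with 0 < beta_i < C."""
    for i, b in enumerate(beta):
        if tol < b < C - tol:
            return y[i] - eps - sum(bj * dot(xj, X[i]) for bj, xj in zip(beta, X))
    raise ValueError("no coefficient strictly between 0 and C")

def svr_predict(x, beta, X, w0):
    """f(x) = w0 + sum_i beta_i * (x_i . x)"""
    return w0 + sum(b * dot(xi, x) for b, xi in zip(beta, X))

# Hypothetical 1D training data and coefficients.
X = [(0.0,), (1.0,), (2.0,)]
y = [0.0, 1.1, 2.0]
beta = [-0.4, 0.4, 0.0]
eps, C = 0.1, 1.0

w0 = svr_bias(beta, X, y, eps, C)
print(w0, svr_predict(X[1], beta, X, w0) - y[1])
```

The training point used to set the bias ends up exactly on the lower edge of the tube, as the condition above requires. As in classification, the dot products could be replaced by kernel values.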
SVM: summary Source: G. Shakhnarovich
Main ideas:
  Large margin classification
  The kernel trick
What does the complexity/generalization ability of SVMs depend on?
  The constant C
  Choice of kernel and kernel parameters
  Number of support vectors
A crucial component: good QP solver.
Tons of off-the-shelf packages.
Advantages and disadvantages of SVMs
Advantages:
One of the most successful ML techniques!
Good generalization ability for small training sets
Kernel trick is powerful and flexible
Margin-based formalism can be extended to a large class of problems (regression, structured prediction, etc.)
Disadvantages:
Computational and storage complexity of training: quadratic in the size of the training set
In the worst case, can degenerate to nearest neighbor (every training point a support vector), but is much slower to train
No widely used direct multi-class formulation