Support Vector Machines
CMPUT 466/551, Nilanjan Ray
Agenda
• Linear support vector classifier
  – Separable case
  – Non-separable case
• Non-linear support vector classifier
• Kernels for classification
• SVM as a penalized method
• Support vector regression
Linear Support Vector Classifier: Separable Case
Primal problem:
$$\min_{\beta,\,\beta_0} \;\frac{1}{2}\|\beta\|^2 \quad \text{subject to: } y_i(x_i^T\beta + \beta_0) \ge 1, \;\; i = 1,\dots,N.$$

Dual problem (simpler optimization):
$$\min_{\alpha} \;\left\{\frac{1}{2}\sum_{i=1}^N\sum_{k=1}^N \alpha_i\alpha_k y_i y_k\, x_i^T x_k - \sum_{i=1}^N \alpha_i\right\} \quad \text{subject to: } \sum_{i=1}^N \alpha_i y_i = 0, \;\; \alpha_i \ge 0, \; i = 1,\dots,N.$$

Dual problem in matrix-vector form:
$$\min_{\alpha} \;\frac{1}{2}\alpha^T[\,\mathrm{diag}(y)\,XX^T\,\mathrm{diag}(y)\,]\,\alpha - \mathbf{1}^T\alpha \quad \text{subject to: } y^T\alpha = 0, \;\; \alpha \ge 0.$$

Compare the implementation simple_svm.m.
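The slides point to simple_svm.m (MATLAB) for the implementation; as a rough Python counterpart, the sketch below solves this dual as a quadratic program with the cvxopt package. The function and variable names are illustrative, not from the course code.

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual_separable(X, y):
    """X: (N, p) data, y: (N,) labels in {-1, +1}. Returns the dual variables alpha."""
    N = X.shape[0]
    # The dual above is already in cvxopt's standard QP form:
    # minimize 0.5 a'Pa + q'a with P = diag(y) X X' diag(y) and q = -1.
    P = matrix(np.outer(y, y) * (X @ X.T))
    q = matrix(-np.ones(N))
    G = matrix(-np.eye(N))                      # -alpha <= 0, i.e. alpha >= 0
    h = matrix(np.zeros(N))
    A = matrix(y.reshape(1, -1).astype(float))  # equality constraint y'alpha = 0
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.array(sol['x']).ravel()
```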
Linear SVC (AKA Optimal Hyperplane)…
After solving the dual problem we obtain the $\alpha_i$'s; how do we construct the hyperplane from them?

To obtain $\beta$, use the equation:
$$\beta = \sum_{i=1}^N \alpha_i y_i x_i.$$

How do we obtain $\beta_0$? We need the complementary slackness criteria, which result from the Karush-Kuhn-Tucker (KKT) conditions for the primal optimization problem.

Complementary slackness means:
$$\alpha_i\,[\,y_i(x_i^T\beta + \beta_0) - 1\,] = 0 \;\;\forall i, \qquad \text{so } \alpha_i > 0 \;\Rightarrow\; y_i(x_i^T\beta + \beta_0) = 1.$$

Training points corresponding to non-zero $\alpha_i$'s are the support vectors.

$\beta_0$ is computed from $y_i(x_i^T\beta + \beta_0) = 1$ for any $i$ with $\alpha_i > 0$.
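A minimal sketch of this recovery step, assuming `alpha` comes from a dual solver such as the one sketched earlier; the tolerance used to decide which $\alpha_i$ count as non-zero is an arbitrary choice.

```python
import numpy as np

def primal_from_dual(X, y, alpha, tol=1e-6):
    """Recover (beta, beta0) from the dual solution in the separable case."""
    beta = (alpha * y) @ X                  # beta = sum_i alpha_i y_i x_i
    sv = alpha > tol                        # support vectors: alpha_i > 0 (numerically > tol)
    # On support vectors, y_i (x_i' beta + beta0) = 1, so beta0 = y_i - x_i' beta;
    # averaging over all support vectors improves numerical stability.
    beta0 = float(np.mean(y[sv] - X[sv] @ beta))
    return beta, beta0
```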
Optimal Hyperplane/Support Vector Classifier
An interesting interpretation of the equality constraint in the dual problem is as follows: the $\alpha_i$'s are forces on both sides of the hyperplane, and the net force on the hyperplane is zero,
$$\sum_{i=1}^N \alpha_i y_i = 0.$$
Linear Support Vector Classifier: Non-separable Case
From Separable to Non-separable
In the separable case the margin width is $1/\|\beta\|$ on each side of the hyperplane; if in addition $\|\beta\| = 1$, then the margin width is 1. This is the reason that in the primal problem we have the following inequality constraints:
$$y_i(x_i^T\beta + \beta_0) \ge 1, \;\;\forall i. \qquad (1)$$

These inequality constraints ensure that there is no point inside the margin. In the non-separable case such constraints must be violated, so they are modified to:
$$y_i(x_i^T\beta + \beta_0) \ge 1 - \xi_i, \quad \xi_i \ge 0, \;\;\forall i.$$

So the primal optimization problem becomes:
$$\min_{\beta,\,\beta_0} \;\frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^N \xi_i \quad \text{subject to: } y_i(x_i^T\beta + \beta_0) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\;\forall i.$$

The positive parameter $C$ controls the extent to which points are allowed to violate (1).
Non-separable Case: Finding Dual Function
• Lagrangian function minimization:
$$L = \frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^N \xi_i - \sum_{i=1}^N \alpha_i\,[\,y_i(x_i^T\beta + \beta_0) - (1 - \xi_i)\,] - \sum_{i=1}^N \mu_i \xi_i, \quad \text{subject to: } \alpha_i \ge 0, \;\mu_i \ge 0, \;\xi_i \ge 0, \;\forall i.$$

• Solve:
$$\frac{\partial L}{\partial \beta} = 0 \;\Rightarrow\; \beta = \sum_{i=1}^N \alpha_i y_i x_i, \qquad (1)$$
$$\frac{\partial L}{\partial \beta_0} = 0 \;\Rightarrow\; \sum_{i=1}^N \alpha_i y_i = 0, \qquad (2)$$
$$\frac{\partial L}{\partial \xi_i} = 0 \;\Rightarrow\; \alpha_i = C - \mu_i, \;\;\forall i. \qquad (3)$$

• Substitute (1), (2) and (3) in L to form the dual function:
$$q(\alpha) = \sum_{i=1}^N \alpha_i - \frac{1}{2}\sum_{i=1}^N\sum_{k=1}^N \alpha_i\alpha_k y_i y_k\, x_i^T x_k,$$
which is maximized subject to: $\sum_{i=1}^N \alpha_i y_i = 0$ and $0 \le \alpha_i \le C$.
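Compared with the separable-case sketch shown earlier, only the feasible set changes: the constraints become $0 \le \alpha_i \le C$. A hedged cvxopt sketch under the same assumptions (names are illustrative):

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual_soft(X, y, C=1.0):
    """Soft-margin dual: maximize q(alpha) subject to y'alpha = 0 and 0 <= alpha_i <= C."""
    N = X.shape[0]
    P = matrix(np.outer(y, y) * (X @ X.T))           # diag(y) X X' diag(y)
    q = matrix(-np.ones(N))                          # maximizing q(alpha) = minimizing its negative
    G = matrix(np.vstack([-np.eye(N), np.eye(N)]))   # -alpha <= 0 and alpha <= C
    h = matrix(np.hstack([np.zeros(N), C * np.ones(N)]))
    A = matrix(y.reshape(1, -1).astype(float))       # y'alpha = 0
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.array(sol['x']).ravel()
```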
Dual optimization: dual variables to primal variables
After solving the dual problem we obtain the $\alpha_i$'s; how do we construct the hyperplane from them?

To obtain $\beta$, use the equation:
$$\beta = \sum_{i=1}^N \alpha_i y_i x_i.$$

How do we obtain $\beta_0$? Use the complementary slackness conditions for the primal optimization problem:
$$\alpha_i\,[\,y_i(x_i^T\beta + \beta_0) - (1 - \xi_i)\,] = 0, \qquad \mu_i \xi_i = 0, \;\;\forall i,$$
$$\text{so } 0 < \alpha_i < C \;\Rightarrow\; \xi_i = 0 \;\text{ and }\; y_i(x_i^T\beta + \beta_0) = 1.$$

Training points corresponding to non-zero $\alpha_i$'s are the support vectors.

$\beta_0$ is computed from the points for which $y_i(x_i^T\beta + \beta_0) = 1$ and $\xi_i = 0$, i.e. $0 < \alpha_i < C$ (the average is taken over such points).

$C$ is chosen by cross-validation; it should typically be greater than $1/N$.
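Since the slide leaves the choice of $C$ to cross-validation, here is one possible sketch using scikit-learn (a library the slides do not mention); the grid of $C$ values is purely illustrative.

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

def tune_C(X, y):
    """Pick C by 5-fold cross-validation over an illustrative grid."""
    grid = {'C': [0.01, 0.1, 1.0, 10.0, 100.0]}
    search = GridSearchCV(SVC(kernel='linear'), grid, cv=5)
    search.fit(X, y)
    return search.best_params_['C'], search.best_estimator_
```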
Example: Non-separable Case
Non-linear support vector classifier
Let's take a look at the solution of the optimal separating hyperplane in terms of the dual variables:
$$f(x) = x^T\beta + \beta_0 = \sum_{i=1}^N \alpha_i y_i\, x^T x_i + \beta_0, \qquad \beta_0 = y_j - \sum_{i=1}^N \alpha_i y_i\, x_i^T x_j \;\text{ for some } \alpha_j > 0.$$
Let's take a look at the dual cost function for the optimal separating hyperplane:
$$q(\alpha) = \sum_{i=1}^N \alpha_i - \frac{1}{2}\sum_{i=1}^N\sum_{k=1}^N \alpha_i\alpha_k y_i y_k\, x_i^T x_k, \quad \text{maximized subject to: } \sum_{i=1}^N \alpha_i y_i = 0, \;\; \alpha_i \ge 0.$$
An invaluable observation: all these equations involve “feature points” in “inner products”
Non-linear support vector classifier…
An invaluable observation: all these equations involve “feature points” in “inner products”
This is particularly convenient when the input feature space has a large dimension.
For example, suppose we want a classifier that is additive in the feature components rather than linear. Such a classifier is expected to perform better on problems with a non-linear classification boundary.
$$f(x) = \sum_{p=1}^M \beta_p h_p(x) + \beta_0.$$

The $h_p$'s are non-linear functions of the input features. For example, with input space $x = (x_1, x_2)$ and the $h$'s as second-order polynomials:
$$h_1(x_1,x_2) = 1,\quad h_2(x_1,x_2) = \sqrt{2}\,x_1,\quad h_3(x_1,x_2) = \sqrt{2}\,x_2,$$
$$h_4(x_1,x_2) = x_1^2,\quad h_5(x_1,x_2) = x_2^2,\quad h_6(x_1,x_2) = \sqrt{2}\,x_1 x_2,$$
the classifier is now non-linear:
$$f(x_1,x_2) = \beta_0 + \beta_1 + \sqrt{2}\,\beta_2 x_1 + \sqrt{2}\,\beta_3 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2 + \sqrt{2}\,\beta_6 x_1 x_2.$$
Because everything is expressed through inner products, this non-linear classifier can still be computed by the same methods used for finding the linear optimal hyperplane.
Non-linear support vector classifier…
Denote: $h(x) = [h_1(x)\;\cdots\;h_M(x)]^T$.

The non-linear classifier: $f(x) = \sum_{p=1}^M \beta_p h_p(x) + \beta_0 = h(x)^T\beta + \beta_0$.

The dual cost function:
$$q(\alpha) = \sum_{i=1}^N \alpha_i - \frac{1}{2}\sum_{i=1}^N\sum_{k=1}^N \alpha_i\alpha_k y_i y_k\, h(x_i)^T h(x_k), \quad \text{subject to: } \sum_{i=1}^N \alpha_i y_i = 0, \;\; \alpha_i \ge 0.$$

The non-linear classifier in dual variables:
$$f(x) = \sum_{i=1}^N \alpha_i y_i\, h(x)^T h(x_i) + \beta_0, \qquad \beta_0 = y_j - \sum_{i=1}^N \alpha_i y_i\, h(x_j)^T h(x_i) \;\text{ for some } \alpha_j > 0.$$
Thus, in the dual variable space the non-linear classifier is expressed just with inner products!
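To make the "inner products only" point concrete, here is a sketch of the dual-form decision function; `alpha`, `beta0`, and `kernel` are assumed to come from a fitted model and are not defined in the slides.

```python
import numpy as np

def decision_function(x, X_train, y_train, alpha, beta0, kernel):
    """f(x) = sum_i alpha_i y_i K(x, x_i) + beta0 -- only kernel evaluations are needed."""
    k = np.array([kernel(x, xi) for xi in X_train])
    return float(np.sum(alpha * y_train * k) + beta0)

# Example kernel matching the quadratic feature map above:
poly2 = lambda a, b: (1.0 + np.dot(a, b)) ** 2
```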
Non-linear support vector classifier…
With the previous non-linear feature vector,
$$h_1(x_1,x_2) = 1,\quad h_2(x_1,x_2) = \sqrt{2}\,x_1,\quad h_3(x_1,x_2) = \sqrt{2}\,x_2,$$
$$h_4(x_1,x_2) = x_1^2,\quad h_5(x_1,x_2) = x_2^2,\quad h_6(x_1,x_2) = \sqrt{2}\,x_1 x_2,$$
the inner product takes a particularly interesting form:
$$h(a_1,a_2)^T h(b_1,b_2) = 1 + 2a_1 b_1 + 2a_2 b_2 + a_1^2 b_1^2 + a_2^2 b_2^2 + 2a_1 a_2 b_1 b_2 = (1 + a_1 b_1 + a_2 b_2)^2 = K(a_1,a_2;\,b_1,b_2),$$
with $K(a_1,a_2;\,b_1,b_2) = K(b_1,b_2;\,a_1,a_2)$: a kernel function. Computational savings: instead of 6 products, we compute 3 products.
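A quick numerical check of this identity, using the explicit map $h$ from the slide; the test points are arbitrary.

```python
import numpy as np

def h(x):
    """Explicit second-order feature map from the slide, for x = (x1, x2)."""
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

a = np.array([0.3, -1.2])
b = np.array([2.0, 0.5])
lhs = h(a) @ h(b)            # inner product via the explicit 6-dimensional map
rhs = (1.0 + a @ b) ** 2     # same value via the kernel, using only 3 products
assert np.isclose(lhs, rhs)
```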
Kernel Functions
So, if the inner product can be expressed in terms of a symmetric function K,
$$h(x_i)^T h(x_j) = K(x_i, x_j),$$
then we can apply the SV tool.
Well, not quite! We need another property of K, called positive (semi-)definiteness. Why? The dual function has the answer to this question.
$$q(\alpha) = \sum_i \alpha_i - \frac{1}{2}\sum_i\sum_k \alpha_i\alpha_k y_i y_k\, h(x_i)^T h(x_k)
= \sum_i \alpha_i - \frac{1}{2}\sum_i\sum_k \alpha_i\alpha_k y_i y_k\, K(x_i, x_k)
= \mathbf{1}^T\alpha - \frac{1}{2}\alpha^T[\,\mathrm{diag}(y)\,\mathbf{K}\,\mathrm{diag}(y)\,]\,\alpha,$$
where $\mathbf{K}_{ik} = K(x_i, x_k) = h(x_i)^T h(x_k)$.
The maximization of the dual is a convex optimization problem when the matrix K is positive semi-definite.
Thus the kernel function K must satisfy two properties: symmetry and positive (semi-)definiteness.
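One way to test the second property numerically is to form the Gram matrix on the training points and inspect its eigenvalues; this sketch is an illustration of how one might check it, not part of the slides.

```python
import numpy as np

def gram_is_psd(X, kernel, tol=1e-10):
    """Build the Gram matrix K_ik = K(x_i, x_k) and test positive semi-definiteness."""
    K = np.array([[kernel(xi, xk) for xk in X] for xi in X])
    K = (K + K.T) / 2.0                       # enforce exact symmetry before the eigenvalue test
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))
```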
Kernel Functions…
Thus we need h(x)'s that define a kernel function.
In practice we don’t even need to define h(x)! All we need is the kernel function!
Example kernel functions:
$K(x, x') = (1 + x^T x')^d$ : dth-degree polynomial
$K(x, x') = \exp(-\|x - x'\|^2 / c)$ : radial kernel
$K(x, x') = \tanh(k_1 x^T x' + k_2)$ : neural network kernel
The real question now is how to design a kernel function.
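For reference, the three example kernels above written out in Python; the default parameter values (d, c, k1, k2) are placeholders.

```python
import numpy as np

def poly_kernel(x, xp, d=2):
    """d-th degree polynomial kernel: K(x, x') = (1 + x'x)^d."""
    return (1.0 + np.dot(x, xp)) ** d

def radial_kernel(x, xp, c=1.0):
    """Radial (RBF) kernel: K(x, x') = exp(-||x - x'||^2 / c)."""
    return np.exp(-np.sum((np.asarray(x) - np.asarray(xp)) ** 2) / c)

def neural_net_kernel(x, xp, k1=1.0, k2=0.0):
    """'Neural network' (sigmoid) kernel: K(x, x') = tanh(k1 * x'x + k2)."""
    return np.tanh(k1 * np.dot(x, xp) + k2)
```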
Example
SVM as a Penalty Method
With $f(x) = h(x)^T\beta + \beta_0$, the following optimization,
$$\min_{\beta_0,\,\beta} \;\sum_{i=1}^N [1 - y_i f(x_i)]_+ + \frac{\lambda}{2}\|\beta\|^2,$$
is equivalent to:
$$\min_{\beta,\,\beta_0} \;\frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^N \xi_i \quad \text{subject to: } y_i f(x_i) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\;\forall i,$$
with $C = 1/\lambda$.
SVM is a penalized optimization method for binary classification
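As a concrete reading of the penalized form above, here is a minimal sketch of the objective, assuming H is an N-by-M matrix whose rows are the basis expansions h(x_i); the names are illustrative.

```python
import numpy as np

def svm_penalized_objective(beta, beta0, H, y, lam):
    """Hinge loss plus ridge penalty: sum_i [1 - y_i f(x_i)]_+ + (lam/2) ||beta||^2,
    where f(x_i) = h(x_i)' beta + beta0 and H stacks the h(x_i) as rows."""
    f = H @ beta + beta0
    hinge = np.maximum(0.0, 1.0 - y * f)
    return hinge.sum() + 0.5 * lam * np.dot(beta, beta)
```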
Negative Binomial Log-likelihood (LR Loss Function) Example
This is essentially non-linear logistic regression.
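To see the comparison numerically, the snippet below evaluates the hinge loss and the negative binomial log-likelihood (logistic regression) loss on a few margin values y*f(x); this is an illustration, not from the slides.

```python
import numpy as np

margins = np.linspace(-3.0, 3.0, 7)           # values of y * f(x)
hinge = np.maximum(0.0, 1.0 - margins)        # SVM hinge loss
lr_loss = np.log(1.0 + np.exp(-margins))      # negative binomial log-likelihood (LR loss)
# Both losses penalize small or negative margins; the hinge loss is exactly zero
# for margins >= 1, while the logistic-regression loss only decays to zero asymptotically.
print(np.column_stack([margins, hinge, lr_loss]))
```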
SVM for Regression
The penalty view of SVM leads to regression
With $f(x) = h(x)^T\beta + \beta_0$, consider the following optimization:
$$\min_{\beta_0,\,\beta} \;\sum_{i=1}^N V(y_i - f(x_i)) + \frac{\lambda}{2}\|\beta\|^2,$$
where $V(\cdot)$ is a regression loss function.
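As an example of such a V, the sketch below uses the epsilon-insensitive loss commonly associated with support vector regression; the value of eps and the names are placeholders.

```python
import numpy as np

def eps_insensitive(r, eps=0.1):
    """V_eps(r) = max(0, |r| - eps): residuals smaller than eps are not penalized."""
    return np.maximum(0.0, np.abs(r) - eps)

def svr_objective(beta, beta0, H, y, lam, eps=0.1):
    """sum_i V_eps(y_i - f(x_i)) + (lam/2) ||beta||^2, with f(x_i) = h(x_i)' beta + beta0."""
    r = y - (H @ beta + beta0)
    return eps_insensitive(r, eps).sum() + 0.5 * lam * np.dot(beta, beta)
```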
SV Regression: Loss Functions