Support Vector Machines
Javier Bejar
LSI - FIB
Term 2012/2013
Outline
1 Support Vector Machines - Introduction
2 Maximum Margin Classification
3 Soft Margin Classification
4 Non-linear large margin classification
5 Application
Support Vector Machines - Introduction
Linear separators
As we saw with the perceptron, one way to perform classification is to find a linear separator (for binary classification)
The idea is to find the equation of a line w^T x + b = 0 that divides the set of examples into the target classes, so that:
w^T x + b > 0 ⇒ Positive class
w^T x + b < 0 ⇒ Negative class
This is just the problem that the single layer perceptron solves (b is the bias weight)
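As a minimal sketch of this decision rule (the weights and points are made up, since the slides give none):

```python
import numpy as np

# Hypothetical separator: w and b are made-up values for illustration
w = np.array([2.0, -1.0])
b = -0.5

# Two made-up examples, one on each side of the line
X = np.array([[1.0, 0.5],
              [0.2, 1.5]])

scores = X @ w + b                                    # w^T x + b for every example
labels = np.where(scores > 0, "positive", "negative")
print(scores, labels)                                 # [ 1.  -1.6] ['positive' 'negative']
```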
Optimal linear separator
In fact, any line that divides the examples could be the answer to the problem
The question we can ask is: which line is the best one?
Some boundaries seem a bad choice because they will generalize poorly
We have to maximize the probability of classifying unseen instances correctly
We want to minimize the expected generalization loss (instead of the empirical loss on the training data)
Maximum Margin Classification
Maximizing the distance of the separator to the examples seems the right choice (actually supported by PAC learning theory)
This means that only the nearest instances to the separator matter (the rest can be ignored)
Classification Margin
We can compute the distance from an example x_i to the separator as:
\[ r = \frac{w^T x_i + b}{\|w\|} \]
The examples closest to the separator are the support vectors
The margin (ρ) of the separator is the distance between the support vectors of the two classes
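A small numpy sketch of this distance computation (the separator and the points are made up; the only point is that the examples with the smallest |r| would be the support vectors):

```python
import numpy as np

w = np.array([1.0, 1.0])                 # hypothetical separator w^T x + b = 0
b = -1.0

X = np.array([[1.0, 1.0],                # made-up examples
              [0.0, 0.0],
              [2.0, 1.0]])

# Signed distance of every example to the separator
r = (X @ w + b) / np.linalg.norm(w)
print(r)                                 # smallest |r| -> closest to the separator
```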
Large Margin Decision Boundary
The separator has to be as far as possible from the examples of both classes
This means that we have to maximize the margin
We can rescale the equation of the separator so that its value at the support vectors is 1 or −1, with the distance still given by
\[ r = \frac{w^T x + b}{\|w\|} \]
The width of the optimal margin is then m = 2/‖w‖
This means that maximizing the margin is the same as minimizing the norm of the weights
Computing the decision boundary
Given a set of examples {x_1, x_2, ..., x_n} with class labels y_i ∈ {+1, −1}
A decision boundary that classifies all the examples correctly satisfies
\[ y_i (w^T x_i + b) \ge 1, \quad \forall i \]
This allows us to define the problem of learning the weights as an optimization problem
The primal formulation of the optimization problem is:
\[ \text{Minimize } \frac{1}{2}\|w\|^2 \quad \text{subject to } y_i (w^T x_i + b) \ge 1,\ \forall i \]
This problem is difficult to solve in this formulation, but we can rewrite it in a more convenient form
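Even so, for a tiny toy problem the primal can be handed directly to a generic convex solver. A minimal sketch, assuming the cvxpy package is available and using made-up separable data:

```python
import numpy as np
import cvxpy as cp

# Made-up, linearly separable toy data
X = np.array([[2.0, 2.0], [2.5, 3.0], [0.0, 0.0], [-1.0, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

n, d = X.shape
w = cp.Variable(d)
b = cp.Variable()

# Primal: minimize (1/2)||w||^2  subject to  y_i (w^T x_i + b) >= 1
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                     [cp.multiply(y, X @ w + b) >= 1])
problem.solve()
print(w.value, b.value)
```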
Constrained optimization (a little digression)
Suppose we want to minimize f (x) subject to g(x) = 0
A necessary condition for a point x_0 to be a solution is
\[ \begin{cases} \left.\dfrac{\partial}{\partial x}\big(f(x) + \alpha g(x)\big)\right|_{x=x_0} = 0 \\[6pt] g(x) = 0 \end{cases} \]
where α is the Lagrange multiplier
If we have multiple constraints, we just need one multiplier per constraint
\[ \begin{cases} \left.\dfrac{\partial}{\partial x}\big(f(x) + \sum_{i=1}^{n} \alpha_i g_i(x)\big)\right|_{x=x_0} = 0 \\[6pt] g_i(x) = 0 \quad \forall i \in 1 \ldots n \end{cases} \]
For inequality constraints g_i(x) ≤ 0 we have to add the condition that the Lagrange multipliers must be non-negative, α_i ≥ 0
If x_0 is a solution to the constrained optimization problem
\[ \min_x f(x) \quad \text{subject to } g_i(x) \le 0\ \forall i \in 1 \ldots n \]
there must exist α_i ≥ 0 such that x_0 satisfies
\[ \begin{cases} \left.\dfrac{\partial}{\partial x}\big(f(x) + \sum_{i=1}^{n} \alpha_i g_i(x)\big)\right|_{x=x_0} = 0 \\[6pt] \alpha_i\, g_i(x_0) = 0 \quad \forall i \in 1 \ldots n \end{cases} \]
The function f(x) + Σ_{i=1}^{n} α_i g_i(x) is called the Lagrangian
The solution is a point where the gradient of the Lagrangian is 0
Computing the decision boundary (we are back)
The primal formulation of the optimization problem is:
\[ \text{Minimize } \frac{1}{2}\|w\|^2 \quad \text{subject to } 1 - y_i (w^T x_i + b) \le 0,\ \forall i \]
The Lagrangian is (using ‖w‖² = wᵀw):
\[ L = \frac{1}{2} w^T w + \sum_{i=1}^{n} \alpha_i \big(1 - y_i (w^T x_i + b)\big) \]
If we compute the derivatives of L with respect to w and b and set them to zero (we are computing the gradient):
\[ w + \sum_{i=1}^{n} \alpha_i (-y_i) x_i = 0 \;\Rightarrow\; w = \sum_{i=1}^{n} \alpha_i y_i x_i \]
and
\[ \sum_{i=1}^{n} \alpha_i y_i = 0 \]
The dual formulation
If we substitute w back into the Lagrangian L:
\[
\begin{aligned}
L &= \frac{1}{2} \sum_{i=1}^{n} \alpha_i y_i x_i^T \sum_{j=1}^{n} \alpha_j y_j x_j + \sum_{i=1}^{n} \alpha_i \Big(1 - y_i \big(\sum_{j=1}^{n} \alpha_j y_j x_j^T x_i + b\big)\Big) \\
  &= \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j + \sum_{i=1}^{n} \alpha_i - \sum_{i=1}^{n} \alpha_i y_i \sum_{j=1}^{n} \alpha_j y_j x_j^T x_i - b \sum_{i=1}^{n} \alpha_i y_i
      \quad \text{(and given that } \textstyle\sum_{i=1}^{n} \alpha_i y_i = 0\text{)} \\
  &= -\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j + \sum_{i=1}^{n} \alpha_i
      \quad \text{(rearranging terms)}
\end{aligned}
\]
This problem is known as the dual problem (the original one is the primal)
This formulation depends only on the α_i
Both problems are linked: given w we can compute the α_i, and vice versa
This means that we can solve one instead of the other
The dual is now a maximization problem (with α_i ≥ 0)
The dual problem
The formulation of the dual problem is:
\[ \text{Maximize } \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j \quad \text{subject to } \sum_{i=1}^{n} \alpha_i y_i = 0,\ \alpha_i \ge 0\ \forall i \]
This is a Quadratic Programming (QP) problem
The objective is a paraboloidal surface in the parameters, so a global optimum can be found
We can obtain w by
\[ w = \sum_{i=1}^{n} \alpha_i y_i x_i \]
b can be obtained from a positive support vector (x_psv), knowing that w^T x_psv + b = 1
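The dual can be solved the same way as the primal above. A sketch, again assuming cvxpy and made-up separable data, that also recovers w and b from the α_i as just described:

```python
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [2.5, 3.0], [0.0, 0.0], [-1.0, 0.5]])   # made-up separable data
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

G = X @ X.T + 1e-9 * np.eye(n)          # Gram matrix x_i^T x_j (tiny ridge for numerical PSD)

a = cp.Variable(n)
objective = cp.Maximize(cp.sum(a) - 0.5 * cp.quad_form(cp.multiply(a, y), G))
cp.Problem(objective, [a >= 0, y @ a == 0]).solve()

alpha = a.value
w = (alpha * y) @ X                     # w = sum_i alpha_i y_i x_i
psv = np.argmax(alpha * (y > 0))        # a support vector of the positive class
b = 1.0 - w @ X[psv]                    # from w^T x_psv + b = 1
print(alpha, w, b)
```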
Geometric Interpretation
[Figure: the maximum margin separator, its margin boundaries and the weight vector w]
Characteristics of the solution
Many of the αi are zero
w is a linear combination of a small number of examples
We obtain a sparse representation of the data (data compression)
The examples x_i with non-zero α_i are the support vectors (SV)
The vector of parameters can be expressed as:
\[ w = \sum_{i \in SV} \alpha_i y_i x_i \]
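This sparsity is easy to see with an off-the-shelf solver. A sketch assuming scikit-learn is available, on synthetic data (the exact counts will vary):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs, 100 examples per class
X = np.vstack([rng.normal(-2.0, 1.0, (100, 2)), rng.normal(2.0, 1.0, (100, 2))])
y = np.hstack([-np.ones(100), np.ones(100)])

clf = SVC(kernel="linear", C=10.0).fit(X, y)
print("training examples:", len(X))
print("support vectors:  ", len(clf.support_vectors_))   # typically a small fraction of the data
print("alpha_i * y_i:    ", clf.dual_coef_)               # non-zero coefficients, one per SV
```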
Classifying new instances
In order to classify a new instance z we just have to compute
\[ \operatorname{sign}(w^T z + b) = \operatorname{sign}\Big( \sum_{i \in SV} \alpha_i y_i x_i^T z + b \Big) \]
This means that w does not need to be explicitly computed
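A sketch of this, assuming scikit-learn: the decision value is rebuilt from the stored support vectors and their coefficients, without ever forming w, and compared with the library's own decision function:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])  # synthetic data
y = np.hstack([-np.ones(50), np.ones(50)])
clf = SVC(kernel="linear", C=1.0).fit(X, y)

z = np.array([0.5, 1.0])                                   # a made-up new instance
# sum over the support vectors of (alpha_i y_i) x_i^T z, plus b
manual = clf.dual_coef_[0] @ (clf.support_vectors_ @ z) + clf.intercept_[0]
print(np.sign(manual), np.sign(clf.decision_function([z])[0]))   # the two signs agree
```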
Solving the QP problem
Quadratic programming is a well-known optimization problem
Many algorithms have been proposed for it
One of the most popular is Sequential Minimal Optimization (SMO)
It picks a pair of variables and solves the resulting two-variable QP problem analytically (which is trivial)
It repeats the procedure until convergence (the core pairwise update is sketched below)
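A sketch of the analytic two-variable update at the heart of SMO, following the usual simplified description (this is not a full implementation: the pair-selection heuristics, error caching and stopping criterion are omitted, and the names are mine):

```python
import numpy as np

def smo_pair_update(i, j, alpha, y, K, E, C):
    """One SMO step: optimize alpha[i], alpha[j] analytically with all other alphas fixed.
    K is the Gram matrix and E[k] = f(x_k) - y_k is the current prediction error."""
    # Box [L, H] implied by 0 <= alpha <= C and the equality constraint sum_k alpha_k y_k = 0
    if y[i] != y[j]:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]          # curvature along the constraint line
    if L >= H or eta <= 0:
        return None                                  # no progress possible with this pair
    aj = np.clip(alpha[j] + y[j] * (E[i] - E[j]) / eta, L, H)   # unconstrained optimum, clipped
    ai = alpha[i] + y[i] * y[j] * (alpha[j] - aj)               # restore the equality constraint
    return ai, aj
```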
Soft Margin Classification
Non-separable problems
This algorithm works well for separable problems
Sometimes the data has errors and we want to ignore them to obtain a better solution
Sometimes the data is simply not linearly separable
We can obtain better linear separators by being less strict
[Figure: a separator forced too close to the examples by an outlier vs. a better separator that tolerates the error]
We want to be permissive with certain examples, allowing their classification by the separator to diverge from their real class
This can be obtained by adding what are called slack variables (ξ_i) to the problem
These variables represent the deviation of the examples from the margin
Doing this we relax the margin: we are using a soft margin
We are allowing an error in the classification given by the separator w^T x + b
The sum of the values ξ_i approximates (upper bounds) the number of misclassifications
Soft Margin Hyperplane
In order to minimize the error, we can minimize Σ_i ξ_i, introducing the slack variables into the constraints of the problem:
\[ \begin{cases} w^T x_i + b \ge 1 - \xi_i & \text{if } y_i = 1 \\ w^T x_i + b \le -1 + \xi_i & \text{if } y_i = -1 \\ \xi_i \ge 0 & \end{cases} \]
ξ_i = 0 if there are no errors (linearly separable problem)
The number of resulting support vectors and non-zero slack variables gives an upper bound on the leave-one-out error
The primal optimization problem
We need to introduce these slack variables into the original problem, which now becomes:
\[ \text{Minimize } \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to } y_i (w^T x_i + b) \ge 1 - \xi_i,\ \xi_i \ge 0,\ \forall i \]
Now we have an additional parameter C that controls the trade-off between the error and the margin
We will need to adjust this parameter
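The effect of C is easy to observe with a library implementation. A sketch assuming scikit-learn, on synthetic overlapping classes (the exact numbers will vary):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
# Overlapping synthetic classes: no perfect linear separator exists
X = np.vstack([rng.normal(-1.0, 1.5, (100, 2)), rng.normal(1.0, 1.5, (100, 2))])
y = np.hstack([-np.ones(100), np.ones(100)])

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C: wide, permissive margin with many support vectors; large C: fewer errors tolerated
    print("C =", C, "| support vectors:", int(clf.n_support_.sum()),
          "| training accuracy:", round(clf.score(X, y), 3))
```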
The dual optimization problem
Performing the transformation to the dual problem we obtain the following:
\[ \text{Maximize } \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j \quad \text{subject to } \sum_{i=1}^{n} \alpha_i y_i = 0,\ 0 \le \alpha_i \le C\ \forall i \]
We can recover the solution as
\[ w = \sum_{i \in SV} \alpha_i y_i x_i \]
This problem is very similar to the linearly separable case, except that there is an upper bound C on the values of α_i
This is also a QP problem
Non-linear large margin classification
So far we have only considered large margin classifiers that use a linear boundary
In order to have better performance we have to be able to obtain non-linear boundaries
The idea is to transform the data from the input space (the original attributes of the examples) to a higher dimensional space using a function φ(x)
This new space is called the feature space
The advantage of the transformation is that linear operations in the feature space are equivalent to non-linear operations in the input space
Remember the RBF networks?
Feature Space transformation
[Figure: the mapping φ(x) from the input space to a higher-dimensional feature space where the classes become linearly separable]
The XOR problem
The XOR problem is not linearly separable in its original definition, but we can make it linearly separable if we add a new feature x1 · x2

x1  x2  x1·x2  x1 XOR x2
 0   0    0        0
 0   1    0        1
 1   0    0        1
 1   1    1        0

In the extended feature space the linear function h(x) = 2x1x2 − x1 − x2 + 1 takes value 1 exactly on the examples where the XOR is 0 and value 0 where it is 1, so thresholding it (for instance at 1/2) classifies all the examples correctly
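A quick numpy check of this construction, using the function h from the slide:

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
xor = np.array([0, 1, 1, 0])

h = 2 * X[:, 0] * X[:, 1] - X[:, 0] - X[:, 1] + 1   # h(x) = 2*x1*x2 - x1 - x2 + 1
pred = (h < 0.5).astype(int)                        # h is below 1/2 exactly where the XOR is 1
print(h, pred, np.array_equal(pred, xor))           # -> [1. 0. 0. 1.] [0 1 1 0] True
```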
Transforming the data
Working directly in the feature space can be costly
We have to explicitly create the feature space and operate in it
We may need infinitely many features to make a problem linearly separable
Instead, we can use what is called the kernel trick
The Kernel Trick
In the problem that defines an SVM, only the inner products between examples are needed
This means that if we can define how to compute this product in the feature space, it is not necessary to explicitly build that space
There are many geometric operations that can be expressed as inner products, like angles or distances
We can define the kernel function as:
\[ K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j) \]
Kernels - Example
We can show how the kernel trick works with an example
Let's assume a feature space defined as:
\[ \varphi\!\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\right) = \big(1,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2,\ x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2\big) \]
An inner product in this feature space is:
\[ \left\langle \varphi\!\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\right), \varphi\!\left(\begin{bmatrix} y_1 \\ y_2 \end{bmatrix}\right) \right\rangle = (1 + x_1 y_1 + x_2 y_2)^2 \]
So we can define a kernel function to compute inner products in this space as:
\[ K(x, y) = (1 + x_1 y_1 + x_2 y_2)^2 \]
and we are using only the features from the input space
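A quick numerical check of this identity, with arbitrary test points:

```python
import numpy as np

def phi(v):
    x1, x2 = v
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

x = np.array([0.3, -1.2])            # arbitrary input-space points
y = np.array([2.0, 0.7])

explicit = phi(x) @ phi(y)           # inner product computed in the feature space
kernel = (1 + x @ y) ** 2            # the same value computed in the input space
print(explicit, kernel, np.isclose(explicit, kernel))
```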
Kernel functions - Properties
Given a set of examples X = {x_1, x_2, ..., x_n} and a symmetric function k(x, z), we can define the kernel matrix K as:
\[ K = \varphi^T \varphi = \begin{bmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n) \end{bmatrix} \]
If K is a symmetric positive semi-definite matrix, then k(x, z) is a kernel function
The justification is that if K is a symmetric positive semi-definite matrix it can be decomposed as:
\[ K = V \Lambda V^T \]
V can be interpreted as a projection into the feature space (remember feature projection?)
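This condition can be checked numerically on a sample. A sketch that builds the kernel matrix for a candidate function and inspects its eigenvalues (the examples are made up):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 3))                          # made-up examples

k = lambda x, z: (x @ z + 1) ** 2                     # candidate symmetric function
K = np.array([[k(a, b) for b in X] for a in X])       # kernel matrix on the examples

print(np.allclose(K, K.T))                            # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-8)           # eigenvalues (numerically) non-negative
```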
Kernel functions
Examples of functions that satisfy this condition are:
Polynomial kernel of degree d: K(x, y) = (x^T y + 1)^d
Gaussian kernel with width σ: K(x, y) = exp(−‖x − y‖² / 2σ²)
Sigmoid (hyperbolic tangent) kernel with parameters k and θ: K(x, y) = tanh(k x^T y + θ) (only for certain values of the parameters)
Kernel functions can also be interpreted as similarity functions
A similarity function can be turned into a kernel provided it satisfies the positive semi-definite matrix condition
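These three kernels are simple to write down directly; a small sketch (scikit-learn also ships equivalent built-in kernels, e.g. SVC(kernel="poly") and SVC(kernel="rbf")):

```python
import numpy as np

def poly_kernel(x, y, d=3):
    return (x @ y + 1) ** d

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

def sigmoid_kernel(x, y, k=0.1, theta=-1.0):          # a valid kernel only for some k, theta
    return np.tanh(k * (x @ y) + theta)

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(poly_kernel(x, y), gaussian_kernel(x, y), sigmoid_kernel(x, y))
```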
The dual problem
Due to the introduction of the kernel function, the optimization problem has to be modified:
\[ \text{Maximize } \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{subject to } \sum_{i=1}^{n} \alpha_i y_i = 0,\ 0 \le \alpha_i \le C\ \forall i \]
For classifying new examples:
\[ h(z) = \operatorname{sign}\Big( \sum_{i \in SV} \alpha_i y_i K(x_i, z) + b \Big) \]
The advantage of using kernels
Since training an SVM only needs the values K(x_i, x_j), there are no constraints on how the examples are represented
We only have to define a kernel that acts as a similarity between examples
We can define similarity functions for different representations:
Strings, sequences
Graphs/Trees
Documents
...
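One way to see this in practice is scikit-learn's precomputed-kernel interface: the classifier is trained and queried only through kernel matrices, so the examples themselves can be any kind of object for which a similarity is defined. A sketch with made-up numeric stand-ins:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X_train = rng.normal(size=(60, 5))                    # stand-ins for arbitrary objects
y_train = np.sign(X_train[:, 0] + 0.1)
X_test = rng.normal(size=(10, 5))

def kernel(A, B):
    return (A @ B.T + 1.0) ** 2                       # any valid similarity would do here

clf = SVC(kernel="precomputed", C=1.0)
clf.fit(kernel(X_train, X_train), y_train)            # training sees only the Gram matrix
pred = clf.predict(kernel(X_test, X_train))           # test examples vs. training examples
print(pred)
```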
Application
Splice-Junction gene sequences
Identification of gene sequence type (Bioinformatics)
60 attributes (discrete)
Attributes: DNA base at positions 1-60
3190 instances
3 classes
Methods: Support vector machines
Validation: 10 fold cross validation
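A sketch of this kind of evaluation with scikit-learn; the data here is a random placeholder (loading the real splice-junction dataset is not shown), so only the shape of the procedure is meaningful, not the scores:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(5)
# Placeholder data: 60 discrete DNA bases per sequence and one of 3 class labels
X_raw = rng.choice(list("ACGT"), size=(300, 60))
y = rng.integers(0, 3, size=300)

X = OneHotEncoder().fit_transform(X_raw)              # encode the discrete attributes
scores = cross_val_score(SVC(kernel="linear", C=1.0), X, y, cv=10)   # 10-fold cross validation
print(scores.mean())                                  # meaningless here, the data is random
```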
Splice-Junction gene sequences: Models
SVM (linear): accuracy 90.9%, 1331 Support Vectors
SVM (linear, C=1): accuracy 91.8%, 608 Support Vectors
SVM (linear, C=5): accuracy 91.5%, 579 Support Vectors
SVM (quadratic): accuracy 91.4%, 1305 Support Vectors
SVM (quadratic, C=5): accuracy 91.6%, 1054 Support Vectors
SVM (cubic, C=5): accuracy 91.9%, 1206 Support Vectors