Intelligent Systems: Reasoning and Recognition

James L. Crowley
ENSIMAG 2 and MoSIG M1
Winter Semester 2017
Lesson 4, 10 February 2017

Support Vector Machines using Kernels

Contents

Kernel Functions
    Definition
    Radial Basis Function (RBF)
    Kernel Functions for Symbolic Data
Support Vector Machines with Kernels
Soft Margin SVMs - Non-separable training data

Sources:
"Neural Networks for Pattern Recognition", C. M. Bishop, Oxford Univ. Press, 1995.
"A Computational Biology Example using Support Vector Machines", Suzy Fei, 2009 (available online).


Kernel Functions

Linear discriminant functions can provide very efficient 2-class and multi-class classifiers, provided that the class features can be separated by a linear decision surface. For many domains, it is possible to find a "kernel" function that transforms the data into a space where the two classes are separable.

Instead of a decision surface

g(\vec{X}) = \vec{W}^T \vec{X} + b

we will use a decision surface of the form:

g(\vec{X}) = \vec{W}^T f(\vec{X}) + b

where

\vec{W} = \sum_{m=1}^{M} a_m y_m f(\vec{X}_m)

is learned from the transformed training data, and a_m is a coefficient learned from the training data that is a_m > 0 for support vectors and a_m = 0 for all others. The function f(\vec{X}) provides an implicit non-linear decision surface for the original data.


Definition

Formally, a kernel function is any function

K(\vec{Z}, \vec{X}), \qquad K : \vec{X} \times \vec{X} \to \mathbb{R}

that satisfies "Mercer's condition". Essentially, Mercer's condition tells whether a function is an inner product in some space. (The definition of this condition is beyond the scope of this class; see Wikipedia for a discussion of Mercer's condition if you are curious.) Mercer's condition tells whether, for K(\vec{Z}, \vec{X}), there exists a function f(\vec{Z}) such that

K(\vec{Z}, \vec{X}) = \left\langle f(\vec{Z}), f(\vec{X}) \right\rangle

Obviously, Mercer's condition is satisfied by inner products (dot products):

K(\vec{Z}, \vec{X}) = \vec{Z}^T \vec{X} = \langle \vec{Z}, \vec{X} \rangle = \sum_{d=1}^{D} z_d x_d

Thus K(\vec{W}, \vec{X}) = \vec{W}^T \vec{X} is a valid (but trivial) kernel function.

(This is known as the linear kernel.) We can learn the discriminant in an inner product space

K(\vec{Z}, \vec{X}) = f(\vec{Z})^T f(\vec{X})

where \vec{W} will be learned from the training data. This will give us

g(\vec{X}) = \vec{W}^T f(\vec{X}) + b

Note that Mercer’s condition can be satisfied by many other functions. Popular kernel functions include:

• Polynomial Kernels
• Radial Basis Functions
• Fisher Kernels
• Text intersection
• Bayesian Kernels

Kernel functions provide an implicit feature space. We will see that we can learn in the kernel space, and then recognize without explicitly computing the position in this implicit space!
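To make the idea of an implicit feature space concrete, here is a short illustration (in Python; not part of the original notes, and the names phi and poly_kernel are chosen only for this example). It compares an explicit degree-2 polynomial feature map with the equivalent direct kernel evaluation: the two inner products agree, but the kernel never constructs the transformed vectors.

import numpy as np

def phi(x):
    # Explicit feature map for the degree-2 homogeneous polynomial kernel in 2D:
    # phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def poly_kernel(x, z):
    # The same inner product computed implicitly: K(x, z) = (x . z)^2
    return np.dot(x, z)**2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(np.dot(phi(x), phi(z)))   # explicit: phi(x) . phi(z)
print(poly_kernel(x, z))        # implicit: (x . z)^2  -- same value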


Radial Basis Function (RBF)

Radial functions of the form f(\|\vec{X} - \vec{X}_n\|) are popular for use as kernel functions. In this case, each support vector acts as a dimension in the new feature space. Each radial basis function maps the input space to a scalar, f(\|\vec{X} - \vec{X}_n\|) : \mathbb{R}^D \to \mathbb{R}. Typically, the function is used with the Euclidean norm \|\cdot\| between a set of points to provide an approximation/interpolation of the form:

s(\vec{X}) = \sum_{n=1}^{N} W_n \, f(\|\vec{X} - \vec{X}_n\|)

where the weights W_n and the centers \vec{X}_n are learned from training data.

This can be used to learn a discriminant function:

g(\vec{X}) = \sum_{n=1}^{N} W_n \, f(\|\vec{X} - \vec{X}_n\|)

where the N sample points \vec{X}_n can be derived from the training data.

The term \|\vec{X} - \vec{X}_n\| is the Euclidean distance of \vec{X} from each of the points in the set \{\vec{X}_n\}.

The distance can be normalized by dividing by a value σ:

g(\vec{X}) = \sum_{n=1}^{N} W_n \, f\!\left(\frac{\|\vec{X} - \vec{X}_n\|}{\sigma}\right)
\qquad \text{or} \qquad
g(\vec{X}) = \sum_{n=1}^{N} W_n \, f\!\left(\frac{\|\vec{X} - \vec{X}_n\|^2}{2\sigma^2}\right)

The vectors \vec{X}_n act as center points for defining the bases. The parameter σ acts as a smoothing parameter that determines the influence of each of the basis vectors \vec{X}_n. The zero-crossings of the discriminant g(\vec{X}) define the decision surface.

The Gaussian function

f(\|\vec{x} - \vec{c}\|) = e^{-\frac{\|\vec{x} - \vec{c}\|^2}{2\sigma^2}}

is a popular radial basis function, and is often used as a kernel for support vector machines.
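As a small illustration (not from the original notes), the Gaussian RBF kernel can be transcribed directly from this formula; the function name gaussian_kernel and the value of sigma are illustrative choices.

import numpy as np

def gaussian_kernel(x, c, sigma=1.0):
    # K(x, c) = exp(-||x - c||^2 / (2 * sigma^2))
    return np.exp(-np.sum((x - c)**2) / (2.0 * sigma**2))

x = np.array([0.0, 1.0])
c = np.array([1.0, 1.0])
print(gaussian_kernel(x, c, sigma=0.5))   # approaches 0 for distant points, equals 1 when x == c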


We can use a subset of the training data to define a discriminant function, where the support vectors are drawn from the M training samples. This gives a discriminant function

g(\vec{X}) = \sum_{m=1}^{M} a_m y_m \, f(\|\vec{X} - \vec{X}_m\|) + b

The training samples for which a_m ≠ 0 are the support vectors. The distance can be normalized by dividing by σ:

g(\vec{X}) = \sum_{m=1}^{M} a_m y_m \, f\!\left(\frac{\|\vec{X} - \vec{X}_m\|}{\sigma}\right) + b

Depending on σ, this can provide a good fit or an overfit to the data. If σ is large compared to the distance between the classes, this can give an overly flat discriminant surface. If σ is small compared to the distance between classes, this will overfit the samples. A good choice for σ will be comparable to the distance between the closest members of the two classes.
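As a rough illustration of this trade-off (not part of the original notes), scikit-learn's SVC exposes the RBF width through its gamma parameter, which corresponds to gamma = 1/(2σ²) for the Gaussian written above; the toy data below is made up for the example.

import numpy as np
from sklearn.svm import SVC

# Toy 2-class data (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

for sigma in (0.1, 1.0, 10.0):
    clf = SVC(kernel="rbf", gamma=1.0 / (2.0 * sigma**2)).fit(X, y)
    # Small sigma (large gamma) tends to over-fit; large sigma tends to over-smooth.
    print(sigma, clf.score(X, y), len(clf.support_))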

(Images from "A Computational Biology Example using Support Vector Machines", Suzy Fei, 2009.) Each radial basis function is a dimension in a high-dimensional basis space.


Kernel Functions for Symbolic Data

Kernel functions can be defined over graphs, sets, strings and text! Consider, for example, a non-vector space composed of a set of words {W}. We can select a subset of discriminant words {S} ⊂ {W}. Now, given a set of words (a probe) {A} ⊂ {W}, we can define a kernel function of A and S using the intersection operation:

k(A, S) = 2^{|A \cap S|}

where |\cdot| denotes the cardinality (the number of elements) of a set.
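A direct transcription of this kernel in Python (illustrative only; the word sets are made up for the example):

def intersection_kernel(A, S):
    # k(A, S) = 2^{|A ∩ S|} -- the number of shared words determines the kernel value
    return 2 ** len(set(A) & set(S))

S = {"gene", "protein", "cell"}          # discriminant words
A = {"protein", "cell", "membrane"}      # probe
print(intersection_kernel(A, S))         # 2^2 = 4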


Support Vector Machines with Kernels

Let us assume training data composed of M training samples \{\vec{X}_m\} and their indicator variables \{y_m\}, where y_m is -1 or +1. We will seek a linear decision surface

g(\vec{X}) = \vec{W}^T f(\vec{X}) + b

such that the training data fall into two separable classes. That is,

\forall m : \; y_m (\vec{W}^T f(\vec{X}_m) + b) > 0

If we assume that the data are separable, then for all training samples:

y_m \, g(\vec{X}_m) > 0

For any training sample \vec{X}_m, the perpendicular distance to the decision surface is:

d_m = \frac{y_m \, g(\vec{X}_m)}{\|\vec{W}\|} = \frac{y_m (\vec{W}^T f(\vec{X}_m) + b)}{\|\vec{W}\|}

The margin is the smallest such distance from the decision surface:

\gamma = \min_m \left\{ \frac{y_m (\vec{W}^T f(\vec{X}_m) + b)}{\|\vec{W}\|} \right\}
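A small numerical check of these two formulas (illustrative only; the weights, bias and sample points are made up, and the identity feature map f(X) = X is assumed):

import numpy as np

W = np.array([1.0, -1.0])
b = 0.5
X = np.array([[2.0, 0.0], [0.0, 2.0], [1.5, 1.0]])   # training samples, with f(X) = X
y = np.array([+1, -1, +1])

d = y * (X @ W + b) / np.linalg.norm(W)   # perpendicular distance of each sample
gamma = d.min()                           # the margin
print(d, gamma)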

For a decision surface (\vec{W}, b), the support vectors are the subset \{\vec{X}_s\} of the training samples, \{\vec{X}_s\} \subset \{\vec{X}_m\}, that lie on the margin γ.

Our problem is to choose the support vectors \{\vec{X}_s\} \subset \{\vec{X}_m\} that maximize the margin.


We will seek to maximize the margin by finding the \{\vec{X}_s\} training samples that maximize:

(\vec{W}, b) = \arg\max_{\vec{W}, b} \left\{ \frac{1}{\|\vec{W}\|} \min_m \left[ y_m (\vec{W}^T f(\vec{X}_m) + b) \right] \right\}

The factor \frac{1}{\|\vec{W}\|} can be moved outside the minimization over m because \|\vec{W}\| does not depend on m. Direct solution can be difficult because we do not always know how many support vectors will be required, notably with radial basis functions as kernels. Fortunately the problem can be converted to an equivalent problem. Note that rescaling \vec{W} and b changes nothing. Thus we will scale the equation so that, for the sample that is closest to the decision surface (smallest margin):

y_m (\vec{W}^T f(\vec{X}_m) + b) = 1 \quad \text{that is:} \quad y_m \, g(\vec{X}_m) = 1

For all other sample points:

y_m (\vec{W}^T f(\vec{X}_m) + b) > 1

This is known as the Canonical Representation for the decision hyperplane. The training samples where

y_m (\vec{W}^T f(\vec{X}_m) + b) = 1

are said to be the "active" constraints. All other training samples are "inactive". By definition there is always at least one active constraint. Thus the optimization problem is to maximize the margin by solving

\arg\min_{\vec{W}, b} \left\{ \frac{1}{2} \|\vec{W}\|^2 \right\}

subject to the active constraints. The factor of ½ is a convenience for later analysis.


To solve this problem, we will use Lagrange multipliers, a_m ≥ 0, with one multiplier for each constraint. The method of Lagrange multipliers is a strategy for finding the local maxima and minima of a function subject to equality constraints, for instance maximizing f(x, y) subject to g(x, y) = 0. The functions f and g must have continuous first partial derivatives. The technique introduces a new variable (λ), called a Lagrange multiplier, sets up a Lagrangian function L(), and finds a solution by setting its derivatives to zero. See Wikipedia for an accessible discussion of Lagrange multipliers. For our problem, we have the Lagrangian function:

L(\vec{W}, b, \vec{a}) = \frac{1}{2}\|\vec{W}\|^2 - \sum_{m=1}^{M} a_m \left\{ y_m (\vec{W}^T f(\vec{X}_m) + b) - 1 \right\}

Setting the derivatives to zero, we obtain:

\frac{\partial L}{\partial \vec{W}} = 0 \;\Rightarrow\; \vec{W} = \sum_{m=1}^{M} a_m y_m f(\vec{X}_m)

\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{m=1}^{M} a_m y_m = 0

Eliminating \vec{W} and b from L(\vec{W}, b, \vec{a}), we obtain:

L(\vec{a}) = \sum_{m=1}^{M} a_m - \frac{1}{2} \sum_{n=1}^{M} \sum_{m=1}^{M} a_n a_m y_n y_m \, f(\vec{X}_n)^T f(\vec{X}_m)

= \sum_{m=1}^{M} a_m - \frac{1}{2} \sum_{n=1}^{M} \sum_{m=1}^{M} a_n a_m y_n y_m \, K(\vec{X}_n, \vec{X}_m)

with constraints:

a_m \geq 0 \;\text{ for } m = 1, \ldots, M \qquad \text{and} \qquad \sum_{m=1}^{M} a_m y_m = 0

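To make the dual problem concrete, here is a minimal sketch (not from the notes) that maximizes L(a) under these constraints with a generic constrained optimizer. It assumes scipy is available; the toy data, the kernel choice and the tolerance are illustrative.

import numpy as np
from scipy.optimize import minimize

def rbf(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z)**2) / (2.0 * sigma**2))

# Toy training set (illustrative)
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
M = len(y)
K = np.array([[rbf(X[n], X[m]) for m in range(M)] for n in range(M)])

def neg_dual(a):
    # -L(a) = -( sum_m a_m - 1/2 sum_{n,m} a_n a_m y_n y_m K(X_n, X_m) )
    return -(a.sum() - 0.5 * np.sum(np.outer(a * y, a * y) * K))

constraints = {"type": "eq", "fun": lambda a: np.dot(a, y)}   # sum_m a_m y_m = 0
bounds = [(0.0, None)] * M                                    # a_m >= 0
res = minimize(neg_dual, np.zeros(M), bounds=bounds, constraints=constraints)
a = res.x   # the Lagrange multipliers; nonzero entries mark the support vectors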

The solution takes the form of a quadratic programming problem in D_k variables (the dimension of the kernel space). This would normally take O(D_k^3) computations. In going to the dual formulation, we have converted this to a dual problem over M data points, requiring O(M^3) computations. This can appear to be a problem, but the solution only depends on a small number of points M_s << M. To classify a new observed point, we evaluate:

!

g(! X ) = am ym

m=1

M

" f (! X m )

T f (! X )+ b = am ym

m=1

M

" K(! X m ,! X )+ b

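In code, this evaluation needs only the training points, the labels, the multipliers a, and the bias b (a sketch continuing the illustrative example above; b is computed a little further below):

def g(x, X, y, a, b, kernel=rbf):
    # g(x) = sum_m a_m y_m K(X_m, x) + b
    return sum(a[m] * y[m] * kernel(X[m], x) for m in range(len(y))) + b

# A new point is classified by the sign of g:
# label = np.sign(g(x_new, X, y, a, b))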
The solution to optimization problems of this form satisfies the "Karush-Kuhn-Tucker" (KKT) conditions, requiring:

a_m \geq 0

y_m \, g(\vec{X}_m) - 1 \geq 0

a_m \left\{ y_m \, g(\vec{X}_m) - 1 \right\} = 0

For every observation in the training set \{\vec{X}_m\}, either

a_m = 0 \quad \text{or} \quad y_m \, g(\vec{X}_m) = 1

Any point for which a_m = 0 does not contribute to

g(\vec{X}) = \sum_{m=1}^{M} a_m y_m \, f(\vec{X}_m)^T f(\vec{X}) + b = \sum_{m=1}^{M} a_m y_m \, K(\vec{X}_m, \vec{X}) + b

and thus is not used (is not active). The remaining M_s samples, for which a_m ≠ 0, are the support vectors. These points lie on the margin of the maximum-margin hyperplane, at y_m g(\vec{X}_m) = 1.

Once the model is trained, all other points can be discarded! Let us define the support vectors as the set \{\vec{X}_s\}.

Now that we have solved for \{\vec{X}_s\} and \vec{a}, we can solve for b. We note that for any active training sample m in \{\vec{X}_s\}:


y_m \left( \sum_{n \in S} a_n y_n K(\vec{X}_n, \vec{X}_m) + b \right) = 1

Averaging over all support vectors in \{\vec{X}_s\} gives:

b = \frac{1}{M_S} \sum_{m \in S} \left( y_m - \sum_{n \in S} a_n y_n K(\vec{X}_n, \vec{X}_m) \right)

(From Bishop, p. 331.)
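Continuing the illustrative sketch above, b can be computed by exactly this average over the support vectors (here sv marks the indices whose multipliers are appreciably greater than zero; the threshold is an arbitrary choice):

sv = [m for m in range(M) if a[m] > 1e-6]   # indices of the support vectors
b = np.mean([y[m] - sum(a[n] * y[n] * K[n, m] for n in sv) for m in sv])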


Soft Margin SVMs - Non-separable training data

So far we have assumed that the data are linearly separable in f(\vec{X}). For many problems some training data may overlap. The problem is that the error function goes to ∞ for any point on the wrong side of the decision surface. This is called a "hard margin" SVM.

We will relax this by adding a "slack" variable z_m ≥ 0 for each training sample. We define z_m = 0 for training samples on the correct side of the margin, and

z_m = | y_m - g(\vec{X}_m) |

for other training samples.

For a sample inside the margin, but on the correct side of the decision surface: 0 < z_m ≤ 1
For a sample on the decision surface: z_m = 1
For a sample on the wrong side of the decision surface: z_m > 1
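A one-line illustration of how the slack values could be computed from the discriminant (illustrative; g_vals is assumed to hold the values g(X_m) for the current W and b):

# z_m = 0 where y_m * g(X_m) >= 1 (correct side of the margin), else |y_m - g(X_m)|
z = np.where(y * g_vals >= 1, 0.0, np.abs(y - g_vals))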

Soft margin SVM: see Bishop, p. 332 (note the use of ξ_n in place of z_n).


This is called a soft margin SVM. To softly penalize points on the wrong side, we minimize:

C \sum_{m=1}^{M} z_m + \frac{1}{2}\|\vec{W}\|^2

where C > 0 controls the trade-off between the slack variables and the margin. Because any misclassified point has z_m > 1, the upper bound on the number of misclassified points is

\sum_{m=1}^{M} z_m

C acts as an inverse regularization factor (note that C = ∞ recovers the SVM with hard margins). To solve for the SVM we write the Lagrangian:

L(\vec{W}, b, \vec{z}, \vec{a}, \vec{\mu}) = \frac{1}{2}\|\vec{W}\|^2 + C \sum_{m=1}^{M} z_m - \sum_{m=1}^{M} a_m \left\{ y_m \, g(\vec{X}_m) - 1 + z_m \right\} - \sum_{m=1}^{M} \mu_m z_m

where {a_m ≥ 0} and {µ_m ≥ 0} are the Lagrange multipliers. The KKT conditions are:

a_m \geq 0

y_m \, g(\vec{X}_m) - 1 + z_m \geq 0

a_m \left\{ y_m \, g(\vec{X}_m) - 1 + z_m \right\} = 0

\mu_m \geq 0, \qquad z_m \geq 0, \qquad \mu_m z_m = 0

We optimize for \vec{W}, b, and {z_m}, using

g(\vec{X}) = \vec{W}^T f(\vec{X}) + b

Setting the derivatives of L(\vec{W}, b, \vec{z}, \vec{a}, \vec{\mu}) to zero gives:


\frac{\partial L}{\partial \vec{W}} = 0 \;\Rightarrow\; \vec{W} = \sum_{m=1}^{M} a_m y_m f(\vec{X}_m)

\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{m=1}^{M} a_m y_m = 0

\frac{\partial L}{\partial z_m} = 0 \;\Rightarrow\; a_m = C - \mu_m

Using these to eliminate \vec{W}, b and {z_m} from L(\vec{W}, b, \vec{z}, \vec{a}, \vec{\mu}), we obtain:

L(\vec{a}) = \sum_{m=1}^{M} a_m - \frac{1}{2} \sum_{n=1}^{M} \sum_{m=1}^{M} a_m a_n y_m y_n \, f(\vec{X}_m)^T f(\vec{X}_n)

This appears to be the same as before, except that the constraints are different.

0 \leq a_m \leq C \qquad \text{and} \qquad \sum_{m=1}^{M} a_m y_m = 0

(This is referred to as a "box" constraint.) The solution is a quadratic programming problem, with complexity O(M^3). However, as before, a large subset of the training samples have a_m = 0, and thus do not contribute to the optimization. For the remaining points:

y_m \, g(\vec{X}_m) = 1 - z_m

For samples ON the margin, a_m < C and hence µ_m > 0, requiring that z_m = 0. For samples INSIDE the margin, a_m = C, with z_m ≤ 1 if correctly classified and z_m > 1 if misclassified. As before, to solve for b we note that:

y_m \left( \sum_{n \in S} a_n y_n \, f(\vec{X}_n)^T f(\vec{X}_m) + b \right) = 1

Averaging over all support vectors in S gives:

b = \frac{1}{M_N} \sum_{m \in N} \left( y_m - \sum_{n \in S} a_n y_n \, f(\vec{X}_n)^T f(\vec{X}_m) \right)

where N denotes the set of support vectors such that 0 < a_n < C.
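For practical use, a library implementation handles this quadratic program directly. A minimal sketch with scikit-learn's soft-margin, RBF-kernel SVC (not part of the original notes; the data and parameter values are illustrative):

import numpy as np
from sklearn.svm import SVC

# Overlapping 2-class toy data (illustrative only)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1.5, (50, 2)), rng.normal(2.5, 1.5, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

clf = SVC(C=1.0, kernel="rbf", gamma=0.5).fit(X, y)   # C controls the slack/margin trade-off
print(len(clf.support_))    # number of support vectors (samples with a_m > 0)
print(clf.dual_coef_)       # a_m * y_m for the support vectors
print(clf.intercept_)       # the bias b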

