Kernels Usman Roshan CS 675 Machine Learning
Transcript
Page 1

Kernels

Usman Roshan
CS 675 Machine Learning

Page 2

Feature space representation

• Consider two classes shown below
• Data cannot be separated by a hyperplane

Page 3

Feature space representation

• Suppose we square each coordinate
• In other words, (x_1, x_2) => (x_1^2, x_2^2)
• Now the data are well separated
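A minimal numpy sketch (not part of the original slides; the circle radii and the threshold 5 are illustrative) of why squaring the coordinates makes the two classes linearly separable:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two classes: an inner circle (radius 1) and an outer circle (radius 3).
# Neither class is linearly separable in the original (x1, x2) coordinates.
theta = rng.uniform(0, 2 * np.pi, 100)
inner = np.c_[1.0 * np.cos(theta), 1.0 * np.sin(theta)]
outer = np.c_[3.0 * np.cos(theta), 3.0 * np.sin(theta)]

# Square each coordinate: (x1, x2) -> (x1^2, x2^2).
inner_sq, outer_sq = inner ** 2, outer ** 2

# In the squared space each class lies on a line x1^2 + x2^2 = r^2,
# so a single hyperplane (here: sum of coordinates = 5) separates them.
print((inner_sq.sum(axis=1) < 5).all())   # True
print((outer_sq.sum(axis=1) > 5).all())   # True
```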

Page 4

Feature spaces/Kernel trick

• Using a linear classifier (nearest means or SVM) we solve a non-linear problem simply by working in a different feature space.

• With kernels
– we don’t have to make the new feature space explicit.
– we can implicitly work in a different space and efficiently compute dot products there.
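A quick numpy check of this point, using the degree-2 polynomial kernel that appears later (the explicit feature map phi written out here is for illustration only):

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2-D point x = (x1, x2)."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

def poly_kernel(x, z, d=2):
    """Polynomial kernel (x.z + 1)^d, computed without ever building phi."""
    return (x @ z + 1.0) ** d

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# Same number, two routes: explicit feature space vs. implicit kernel.
print(phi(x) @ phi(z))        # 4.0
print(poly_kernel(x, z))      # 4.0
```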

Page 5

Support vector machine

• Consider the hard margin SVM optimization

• Solve by applying the Karush–Kuhn–Tucker (KKT) conditions. Think of KKT as a tool for constrained convex optimization.

• Form the Lagrangian

L_p = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \big( y_i (w^T x_i + w_0) - 1 \big)

where the \alpha_i are Lagrange multipliers.

Page 6

Support vector machine

• KKT says the optimal w and w0 are given by the saddle point solution

• And the KKT conditions imply that w = \sum_i \alpha_i y_i x_i and \sum_i \alpha_i y_i = 0

\min_{w, w_0} \max_{\alpha} \; L_p = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \big( y_i (w^T x_i + w_0) - 1 \big)

subject to \alpha_i \ge 0
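For completeness (this derivation is standard but was not legible on the scraped slide), the two conditions in the bullet above come from setting the gradients of L_p to zero:

\frac{\partial L_p}{\partial w} = w - \sum_i \alpha_i y_i x_i = 0 \;\Rightarrow\; w = \sum_i \alpha_i y_i x_i

\frac{\partial L_p}{\partial w_0} = -\sum_i \alpha_i y_i = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0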

Page 7

Support vector machine

• Substituting w = \sum_i \alpha_i y_i x_i back into the primal Lagrangian gives the dual, which is maximized over the \alpha_i
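The slide's own equation did not survive extraction; the standard hard-margin dual obtained this way is:

\max_{\alpha} \; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^T x_j \quad \text{subject to } \alpha_i \ge 0, \;\; \sum_i \alpha_i y_i = 0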

Page 8

SVM and kernels

• We can rewrite the dual in a compact form:
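The compact form itself was lost in extraction; one standard way to write it, with the dot products collected into a kernel matrix K, is:

\max_{\alpha} \; \mathbf{1}^T \alpha - \frac{1}{2} \, \alpha^T \operatorname{diag}(y) \, K \, \operatorname{diag}(y) \, \alpha \quad \text{subject to } \alpha^T y = 0, \; \alpha \ge 0, \qquad K_{ij} = x_i^T x_j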

Page 9

Optimization

• The SVM dual is thus a quadratic program that can be solved by any quadratic program solver (see the sketch below).

• Platt’s Sequential Minimal Optimization (SMO) algorithm offers a simple solver tailored to the SVM dual

• The idea is to perform coordinate ascent, selecting two variables at a time to optimize (two at a time because the equality constraint \sum_i \alpha_i y_i = 0 must be preserved)
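As a sketch of the quadratic-program view (cvxopt is my choice of solver, not something the slides prescribe, and the toy data are illustrative):

```python
import numpy as np
from cvxopt import matrix, solvers

solvers.options['show_progress'] = False

def svm_dual_qp(X, y):
    """Solve the hard-margin SVM dual as a generic quadratic program:
    minimize (1/2) a' Q a - 1' a  subject to  a >= 0  and  y' a = 0,
    where Q_ij = y_i y_j (x_i . x_j).
    """
    n = X.shape[0]
    K = X @ X.T                              # linear kernel matrix
    Q = np.outer(y, y) * K
    P = matrix(Q, tc='d')
    q = matrix(-np.ones(n), tc='d')
    G = matrix(-np.eye(n), tc='d')           # -a <= 0, i.e. a >= 0
    h = matrix(np.zeros(n), tc='d')
    A = matrix(y.reshape(1, n), tc='d')
    b = matrix(0.0)
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
    w = (alpha * y) @ X                      # recover w = sum_i a_i y_i x_i
    return alpha, w

# Toy separable data: two points per class.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha, w = svm_dual_qp(X, y)
print(np.sign(X @ w))                        # [ 1.  1. -1. -1.]
```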

• Let’s look at some kernels.

Page 10

Example kernels

• Polynomial kernels of degree d give a feature space with higher order non-linear terms

• Radial basis kernel gives an infinite-dimensional feature space (via the Taylor series of the exponential)

K(x_i, x_j) = (x_i^T x_j + 1)^d

K(x_i, x_j) = e^{-\|x_i - x_j\|^2 / (2\sigma^2)}
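A small numpy sketch of both kernel matrices (function and parameter names are mine, not from the slides):

```python
import numpy as np

def poly_kernel(X, d=2):
    """Polynomial kernel matrix K_ij = (x_i . x_j + 1)^d."""
    return (X @ X.T + 1.0) ** d

def rbf_kernel(X, sigma=1.0):
    """Radial basis kernel matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
print(poly_kernel(X, d=2))
print(rbf_kernel(X, sigma=1.0))
```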

Page 11

Example kernels

• Empirical kernel map
– Define a set of reference vectors m_1, ..., m_p
– Define a score s(x_i, m_j) between x_i and m_j
– Then \Phi(x_i) = (s(x_i, m_1), ..., s(x_i, m_p))
– And K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j)
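A hypothetical sketch of this construction (the score function and reference vectors below are placeholders chosen for illustration):

```python
import numpy as np

def empirical_kernel(X, M, score):
    """Empirical kernel map: Phi(x)[j] = score(x, m_j), then K = Phi Phi^T."""
    Phi = np.array([[score(x, m) for m in M] for x in X])
    return Phi @ Phi.T

# Example: negative squared distance as the score, first two points as references.
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
M = X[:2]
K = empirical_kernel(X, M, lambda x, m: -np.sum((x - m) ** 2))
print(K.shape)   # (3, 3)
```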

Page 12

Example kernels

• Bag of words
– Given two documents D1 and D2, we define the kernel K(D1, D2) as the number of words they have in common
– To prove this is a kernel, first create a large set of words W_i. Define the mapping Φ(D1) as a high-dimensional vector where Φ(D1)[i] is 1 if the word W_i is present in the document and 0 otherwise. Then K(D1, D2) = Φ(D1)^T Φ(D2), which is exactly the number of words in common.
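A one-function Python sketch of this kernel (whitespace tokenization is an assumption made for illustration):

```python
def bow_kernel(doc1: str, doc2: str) -> int:
    """Bag-of-words kernel: number of distinct words the documents share.

    Equivalent to the dot product of binary word-presence vectors.
    """
    return len(set(doc1.lower().split()) & set(doc2.lower().split()))

print(bow_kernel("the cat sat on the mat", "a cat on a hat"))  # 2: 'cat', 'on'
```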

Page 13

SVM and kernels

• What if we make the kernel matrix K a variable and optimize the dual over it as well?

• But now there is no way to tie the kernel matrix to the training data points.

Page 14

SVM and kernels

• To tie the kernel matrix to the training data, we assume the kernel to be determined is a linear combination of existing base kernels, K = \sum_k \mu_k K_k (see the sketch below).

• Now we have a problem that is not a quadratic program anymore.

• Instead we have a semi-definite program (Lanckriet et al., 2002)
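A minimal numpy sketch of the combined-kernel construction (the weights here are fixed by hand purely for illustration; in MKL they are learned jointly with the SVM):

```python
import numpy as np

def combined_kernel(base_kernels, mu):
    """Return K = sum_k mu_k K_k for fixed, hand-picked weights mu."""
    mu = np.asarray(mu, dtype=float)
    return sum(m * K for m, K in zip(mu, base_kernels))

# Two toy base kernels on three points (weights are illustrative).
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
K_linear = X @ X.T
K_rbf = np.exp(-np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2))
K = combined_kernel([K_linear, K_rbf], mu=[0.3, 0.7])
print(K)
```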

Page 15

Theoretical foundation

• Recall the margin error theorem (7.3 from Learning with kernels)

Page 16

Theoretical foundation

• The kernel analogue of Theorem 7.3, from Lanckriet et al. 2002:

Page 17

How does MKL work in practice?

• Gönen and Alpaydın, JMLR, 2011
• Datasets:
– Digit recognition
– Internet advertisements
– Protein folding
• Form kernels with different sets of features
• Apply SVM with various kernel learning algorithms

Page 18

How does MKL work in practice?

From Gönen and Alpaydın, JMLR, 2011

Page 19

How does MKL work in practice?

From Gönen and Alpaydın, JMLR, 2011

Page 20

How does MKL work in practice?

From Gönen and Alpaydın, JMLR, 2011

Page 21

How does MKL work in practice?

• MKL better than single kernel
• Mean kernel hard to beat
• Non-linear MKL looks promising

