PATTERN RECOGNITION AND MACHINE LEARNING
CHAPTER 7: SPARSE KERNEL MACHINES
Outline
• The problem: finding a sparse decision (and regression) machine that uses kernels
• The solution: Support Vector Machines (SVMs) and Relevance Vector Machines (RVMs)
• The core ideas behind the solutions
• The mathematical details
The problem (1)
Methods introduced in chapters 3 and 4
• Take into account all data points in the training set -> cumbersome
• Do not take advantage of kernel methods -> basis functions have to be explicit
Example: Least squares and logistic regression
The problem (2)
Kernel methods require evaluation of the kernel function k(x_n, x_m) for all pairs of training points x_n, x_m -> cumbersome
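For a training set of size N, this means evaluating the full N × N Gram matrix with entries K_nm = k(x_n, x_m), i.e. O(N²) kernel evaluations.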
The solution (1)
Support vector machines (SVMs) are kernel machines that compute a decision boundary making sparse use of data points
The solution (2)
Relevance vector machines (RVMs) are kernel machines that compute a posterior class probability making sparse use of data points
The solution (3)
SVMs as well as RVMs can also be used for regression
[Figures: an SVM solution and an RVM solution on the same data; the RVM is even sparser!]
SVM: The core idea (1)
The class separator that maximizes the margin between itself and the nearest data points is expected to have the smallest generalization error:
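Here the margin is the perpendicular distance between the decision boundary and the closest of the data points.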
SVM: The core idea (2)
In input space:
SVM: The core idea (3)
For regression:
RVM: The core idea (1)
Exclude basis vectors whose presence reduces the marginal probability (evidence) of the observed data
RVM: The core idea (2)
For classification and regression:
[Figures: a classification example and a regression example]
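The RVM itself is not in scikit-learn, but its regression case amounts to sparse Bayesian linear regression with an ARD prior over a design matrix of kernel basis functions, so a rough sketch is possible that way. The toy data, gamma value, and pruning threshold below are illustrative assumptions, not part of the slides:

import numpy as np
from sklearn.linear_model import ARDRegression
from sklearn.metrics.pairwise import rbf_kernel

# Hypothetical toy data: a noisy sine curve.
rng = np.random.default_rng(2)
X = np.sort(rng.uniform(0.0, 2.0 * np.pi, 60))[:, None]
t = np.sin(X).ravel() + 0.1 * rng.standard_normal(60)

# Design matrix whose columns are kernel basis functions k(x, x_n),
# one centred on each training point.
Phi = rbf_kernel(X, X, gamma=1.0)

# ARD places one precision hyperparameter on each weight; maximizing the
# evidence drives most precisions towards infinity, pruning those basis vectors.
rvm = ARDRegression()
rvm.fit(Phi, t)

kept = np.abs(rvm.coef_) > 1e-3  # heuristic threshold for "retained"
print(f"{kept.sum()} of {len(X)} basis vectors retained")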
SVM: The details (1)
Equation of the decision surface:
Distance of a point from the decision surface:
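From PRML §7.1:
y(x) = w^T φ(x) + b, with the decision surface given by y(x) = 0.
For a correctly classified point x_n with target t_n ∈ {−1, +1}, the distance is t_n y(x_n) / ||w||.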
SVM: The details (2)
Distance of a point from the decision surface:
Maximum margin solution:
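Combining the two (PRML §7.1):
arg max_{w,b} { (1/||w||) min_n [ t_n (w^T φ(x_n) + b) ] }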
SVM: The details (3)
Distance of a point from the decision surface:
t_n y(x_n) / ||w|| = t_n (w^T φ(x_n) + b) / ||w||
This distance is unchanged under the rescaling w -> κw, b -> κb. We may therefore rescale w and b such that
t_n (w^T φ(x_n) + b) = 1
for the point closest to the surface.
SVM: The details (4)
Therefore, we can reduce
arg max_{w,b} { (1/||w||) min_n [ t_n (w^T φ(x_n) + b) ] }
to
arg min_{w,b} (1/2) ||w||²
under the constraint
t_n (w^T φ(x_n) + b) ≥ 1, n = 1, …, N.
SVM: The details (5)
To solve this, we introduce Lagrange multipliers a_n ≥ 0 and minimize with respect to w and b
L(w, b, a) = (1/2) ||w||² − Σ_n a_n { t_n (w^T φ(x_n) + b) − 1 }.
Equivalently, we can maximize the dual representation
L̃(a) = Σ_n a_n − (1/2) Σ_n Σ_m a_n a_m t_n t_m k(x_n, x_m)
subject to a_n ≥ 0 and Σ_n a_n t_n = 0, where the kernel function k(x, x′) = φ(x)^T φ(x′) can be chosen without specifying φ(x) explicitly.
SVM: The details (6)
Because of the constraint (a KKT condition)
a_n { t_n y(x_n) − 1 } = 0,
only those a_n > 0 survive for which x_n is on the margin,
i.e. t_n y(x_n) = 1.
This leads to sparsity: all other points have a_n = 0 and drop out of the solution.
SVM: The details (7)
Based on numerical optimization of the parameters a_n and b, predictions on new data points can be made by evaluating the sign of
y(x) = Σ_n a_n t_n k(x, x_n) + b,
where the sum effectively runs only over the support vectors.
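As a minimal illustration of this prediction rule (not from the slides), here is a scikit-learn sketch; the toy data, kernel choice, and C value are arbitrary assumptions:

import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two Gaussian blobs, labels t_n in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)),
               rng.normal(+1.0, 1.0, (50, 2))])
t = np.hstack([-np.ones(50), np.ones(50)])

# Soft-margin SVM with a Gaussian (RBF) kernel.
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X, t)

# Sparsity: only the support vectors carry nonzero multipliers a_n,
# so the kernel expansion y(x) = sum_n a_n t_n k(x, x_n) + b stays short.
print(f"{len(clf.support_)} of {len(X)} points are support vectors")

# decision_function evaluates y(x); the predicted class is its sign.
x_new = np.array([[0.2, -0.3]])
print(np.sign(clf.decision_function(x_new)), clf.predict(x_new))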
SVM: The details (8)
In cases where the data points are not separable in feature space, we need a soft margin, i.e. a (limited) tolerance for misclassified points.
To achieve this, we introduce slack variables ξ_n ≥ 0, one per data point, with ξ_n = 0 for points on or inside the correct margin boundary and ξ_n = |t_n − y(x_n)| otherwise, so that the constraints become t_n y(x_n) ≥ 1 − ξ_n.
SVM: The details (9)
Graphically:
SVM: The details (10)
The same procedure as before (with additional Lagrange multipliers and corresponding additional constraints) again yields a sparse kernel-based solution:
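As in the separable case (PRML §7.1.1), predictions evaluate the sign of
y(x) = Σ_n a_n t_n k(x, x_n) + b,
but the multipliers now satisfy the box constraints 0 ≤ a_n ≤ C.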
SVM: The details (11)
The soft-margin approach can be formulated as minimizing the regularized error function
C Σ_n ξ_n + (1/2) ||w||².
This formulation can be extended to use SVMs for regression:
C Σ_n (ξ_n + ξ̂_n) + (1/2) ||w||²,
where ξ_n ≥ 0 and ξ̂_n ≥ 0 are slack variables describing the position of a data point above or below a tube of width 2ϵ around the estimate y(x).
SVM: The details (12)
Graphically:
SVM: The details (13)
Again, optimization using Lagrange multipliers yields a sparse kernel-based solution:
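From PRML §7.1.4, the resulting predictions take the form
y(x) = Σ_n (a_n − â_n) k(x, x_n) + b, with 0 ≤ a_n ≤ C and 0 ≤ â_n ≤ C;
only points on or outside the ϵ-tube end up with a nonzero coefficient.

A minimal scikit-learn sketch of this (the toy data, kernel, C, and ϵ values are arbitrary assumptions):

import numpy as np
from sklearn.svm import SVR

# Hypothetical toy data: a noisy sine curve.
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 2.0 * np.pi, 80))[:, None]
t = np.sin(x).ravel() + 0.1 * rng.standard_normal(80)

# epsilon is the half-width of the tube around the estimate y(x).
reg = SVR(kernel="rbf", C=1.0, epsilon=0.1)
reg.fit(x, t)

# Points strictly inside the tube get a_n = a_hat_n = 0 and drop out.
print(f"{len(reg.support_)} of {len(x)} points are support vectors")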
SVM: Limitations
• Output is a decision, not a posterior probability
• Extension of classification to more than two classes is problematic
• The parameters C and ϵ have to be found by methods such as cross validation
• Kernel functions are required to be positive definite