PATTERN RECOGNITION AND MACHINE LEARNING
CHAPTER 7: SPARSE KERNEL MACHINES
Outline
• The problem: finding a sparse decision (and regression) machine that uses kernels
• The solution: Support Vector Machines (SVMs) and Relevance Vector Machines (RVMs)
• The core ideas behind the solutions
• The mathematical details
The problem (1)
Methods introduced in chapters 3 and 4
• Take into account all data points in the training set -> cumbersome
• Do not take advantage of kernel methods -> basis functions have to be explicit
Example: Least squares and logistic regression
The problem (2)
Kernel methods require evaluation of the kernel function k(x_n, x_m) for all pairs of training points x_n, x_m -> cumbersome
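For a training set of size N, this means evaluating the full N × N Gram matrix with entries K_nm = k(x_n, x_m), i.e. O(N²) kernel evaluations.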
The solution (1)
Support vector machines (SVMs) are kernel machines that compute a decision boundary making sparse use of data points
The solution (2)
Relevance vector machines (RVMs) are kernel machines that compute a posterior class probability making sparse use of data points
The solution (3)
SVMs as well as RVMs can also be used for regression
[Figures: an SVM solution and an RVM solution on the same data; the RVM is even sparser!]
SVM: The core idea (1)
The class separator that maximizes the margin between itself and the nearest data points is expected to have the smallest generalization error:
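Here the margin is the perpendicular distance between the decision boundary and the closest of the data points.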
SVM: The core idea (2)
In input space:
SVM: The core idea (3)
For regression:
RVM: The core idea (1)
Exclude basis vectors whose presence reduces the marginal probability (evidence) of the observed data
RVM: The core idea (2)
For classification and regression:
[Figures: a classification example and a regression example]
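The RVM itself is not in scikit-learn, but its regression case amounts to sparse Bayesian linear regression with an ARD prior over a design matrix of kernel basis functions, so a rough sketch is possible that way. The toy data, gamma value, and pruning threshold below are illustrative assumptions, not part of the slides:

import numpy as np
from sklearn.linear_model import ARDRegression
from sklearn.metrics.pairwise import rbf_kernel

# Hypothetical toy data: a noisy sine curve.
rng = np.random.default_rng(2)
X = np.sort(rng.uniform(0.0, 2.0 * np.pi, 60))[:, None]
t = np.sin(X).ravel() + 0.1 * rng.standard_normal(60)

# Design matrix whose columns are kernel basis functions k(x, x_n),
# one centred on each training point.
Phi = rbf_kernel(X, X, gamma=1.0)

# ARD places one precision hyperparameter on each weight; maximizing the
# evidence drives most precisions towards infinity, pruning those basis vectors.
rvm = ARDRegression()
rvm.fit(Phi, t)

kept = np.abs(rvm.coef_) > 1e-3  # heuristic threshold for "retained"
print(f"{kept.sum()} of {len(X)} basis vectors retained")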
SVM: The details (1)
Equation of the decision surface:
Distance of a point from the decision surface:
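From PRML §7.1:
y(x) = w^T φ(x) + b, with the decision surface given by y(x) = 0.
For a correctly classified point x_n with target t_n ∈ {−1, +1}, the distance is t_n y(x_n) / ||w||.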
SVM: The details (2)
Distance of a point from the decision surface:
Maximum margin solution:
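Combining the two (PRML §7.1):
arg max_{w,b} { (1/||w||) min_n [ t_n (w^T φ(x_n) + b) ] }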
SVM: The details (3)
Distance of a point from the decision surface:
t_n y(x_n) / ||w|| = t_n (w^T φ(x_n) + b) / ||w||
This distance is unchanged under the rescaling w -> κw, b -> κb. We may therefore rescale w and b such that
t_n (w^T φ(x_n) + b) = 1
for the point closest to the surface.
SVM: The details (4)
Therefore, we can reduce
arg max_{w,b} { (1/||w||) min_n [ t_n (w^T φ(x_n) + b) ] }
to
arg min_{w,b} (1/2) ||w||²
under the constraint
t_n (w^T φ(x_n) + b) ≥ 1, n = 1, …, N.
SVM: The details (5)
To solve this, we introduce Lagrange multipliers a_n ≥ 0 and minimize with respect to w and b
L(w, b, a) = (1/2) ||w||² − Σ_n a_n { t_n (w^T φ(x_n) + b) − 1 }.
Equivalently, we can maximize the dual representation
L̃(a) = Σ_n a_n − (1/2) Σ_n Σ_m a_n a_m t_n t_m k(x_n, x_m)
subject to a_n ≥ 0 and Σ_n a_n t_n = 0, where the kernel function k(x, x′) = φ(x)^T φ(x′) can be chosen without specifying φ(x) explicitly.
SVM: The details (6)
Because of the constraint (a KKT condition)
a_n { t_n y(x_n) − 1 } = 0,
only those a_n > 0 survive for which x_n is on the margin,
i.e. t_n y(x_n) = 1.
This leads to sparsity: all other points have a_n = 0 and drop out of the solution.
SVM: The details (7)
Based on numerical optimization of the parameters a_n and b, predictions on new data points can be made by evaluating the sign of
y(x) = Σ_n a_n t_n k(x, x_n) + b,
where the sum effectively runs only over the support vectors.
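As a minimal illustration of this prediction rule (not from the slides), here is a scikit-learn sketch; the toy data, kernel choice, and C value are arbitrary assumptions:

import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two Gaussian blobs, labels t_n in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)),
               rng.normal(+1.0, 1.0, (50, 2))])
t = np.hstack([-np.ones(50), np.ones(50)])

# Soft-margin SVM with a Gaussian (RBF) kernel.
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X, t)

# Sparsity: only the support vectors carry nonzero multipliers a_n,
# so the kernel expansion y(x) = sum_n a_n t_n k(x, x_n) + b stays short.
print(f"{len(clf.support_)} of {len(X)} points are support vectors")

# decision_function evaluates y(x); the predicted class is its sign.
x_new = np.array([[0.2, -0.3]])
print(np.sign(clf.decision_function(x_new)), clf.predict(x_new))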
SVM: The details (8)
In cases where the data points are not separable in feature space, we need a soft margin, i.e. a (limited) tolerance for misclassified points.
To achieve this, we introduce slack variables ξ_n ≥ 0, one per data point, with ξ_n = 0 for points on or inside the correct margin boundary and ξ_n = |t_n − y(x_n)| otherwise, so that the constraints become t_n y(x_n) ≥ 1 − ξ_n.
SVM: The details (9)
Graphically:
SVM: The details (10)
The same procedure as before (with additional Lagrange multipliers and corresponding additional constraints) again yields a sparse kernel-based solution:
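As in the separable case (PRML §7.1.1), predictions evaluate the sign of
y(x) = Σ_n a_n t_n k(x, x_n) + b,
but the multipliers now satisfy the box constraints 0 ≤ a_n ≤ C.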
SVM: The details (11)
The soft-margin approach can be formulated as minimizing the regularized error function
C Σ_n ξ_n + (1/2) ||w||².
This formulation can be extended to use SVMs for regression:
C Σ_n (ξ_n + ξ̂_n) + (1/2) ||w||²,
where ξ_n ≥ 0 and ξ̂_n ≥ 0 are slack variables describing the position of a data point above or below a tube of width 2ϵ around the estimate y(x).
SVM: The details (12)
Graphically:
SVM: The details (13)
Again, optimization using Lagrange multipliers yields a sparse kernel-based solution:
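From PRML §7.1.4, the resulting predictions take the form
y(x) = Σ_n (a_n − â_n) k(x, x_n) + b, with 0 ≤ a_n ≤ C and 0 ≤ â_n ≤ C;
only points on or outside the ϵ-tube end up with a nonzero coefficient.

A minimal scikit-learn sketch of this (the toy data, kernel, C, and ϵ values are arbitrary assumptions):

import numpy as np
from sklearn.svm import SVR

# Hypothetical toy data: a noisy sine curve.
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 2.0 * np.pi, 80))[:, None]
t = np.sin(x).ravel() + 0.1 * rng.standard_normal(80)

# epsilon is the half-width of the tube around the estimate y(x).
reg = SVR(kernel="rbf", C=1.0, epsilon=0.1)
reg.fit(x, t)

# Points strictly inside the tube get a_n = a_hat_n = 0 and drop out.
print(f"{len(reg.support_)} of {len(x)} points are support vectors")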
SVM: Limitations
• Output is a decision, not a posterior probability
• Extension of classification to more than two classes is problematic
• The parameters C and ϵ have to be found by methods such as cross validation
• Kernel functions are required to be positive definite