A (Short) Introduction to Support Vector Machines and Kernel-based Learning
Johan Suykens
K.U. Leuven, ESAT-SCD-SISTA, Kasteelpark Arenberg 10
B-3001 Leuven (Heverlee), Belgium
Tel: 32/16/32 18 02 - Fax: 32/16/32 19 70 - Email: [email protected]
http://www.esat.kuleuven.ac.be/sista/members/suykens.html
ESANN 2003, Bruges, April 2003
Overview
Disadvantages of classical neural nets
SVM properties and standard SVM classifier
Related kernel-based learning methods
Use of the kernel trick (Mercer Theorem)
LS-SVMs: extending the SVM framework
Towards a next generation of universally applicable models?
The problem of learning and generalization
Classical MLPs

[Figure: multilayer perceptron with inputs $x_1, x_2, x_3, \ldots, x_n$, weights $w_1, w_2, w_3, \ldots, w_n$, bias $b$, hidden activations $h(\cdot)$ and output $y$]
Multilayer Perceptron (MLP) properties:
Universal approximation of continuous nonlinear functions
Learning from input-output patterns; either off-line or on-line learning
Parallel network architecture, multiple inputs and outputs
Use in feedforward and recurrent networks
Use in supervised and unsupervised learning applications
Problems: existence of many local minima! How many neurons are needed for a given task?
Support Vector Machines (SVM)
[Figure: cost function versus weights for an MLP (many local minima) and for an SVM (convex, one global minimum)]
Nonlinear classification and function estimation by convex optimization, with a unique solution and primal-dual interpretations.
The number of neurons follows automatically from a convex program.
Learning and generalization in huge-dimensional input spaces (able to avoid the curse of dimensionality!).
Use of kernels (e.g. linear, polynomial, RBF, MLP, splines, ...). Application-specific kernels are possible (e.g. text mining, bioinformatics).
SVM: support vectors
[Figure: two toy classification problems plotted over axes $x_1$, $x_2$; in each case the decision boundary is determined by a small subset of the training points]
The decision boundary can be expressed in terms of a limited number of support vectors (a subset of the given training data); sparseness property.
The classifier follows from the solution to a convex QP problem.
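The sparseness property is easy to check numerically. A minimal sketch (an illustration, not part of the original slides) using scikit-learn's SVC, which solves this convex QP internally; the toy data and parameter values are made up:

```python
# A minimal sketch of the sparseness property, using scikit-learn's SVC
# (the slides themselves name no software; data and parameters are made up).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two Gaussian clouds as a stand-in for the toy problems in the figure.
X = np.vstack([rng.normal(-1.0, 0.7, size=(100, 2)),
               rng.normal(+1.0, 0.7, size=(100, 2))])
y = np.hstack([-np.ones(100), np.ones(100)])

clf = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, y)

# Only a subset of the 200 training points become support vectors;
# the decision boundary depends on these alone.
print("number of support vectors:", clf.support_vectors_.shape[0])
```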
SVMs: living in two worlds ...
[Figure: '+' and 'x' classes, not linearly separable in the input space, mapped by $\varphi(x)$ to a feature space where a linear separation exists]

Primal space (parametric; suited for large data sets): estimate $w \in \mathbb{R}^{n_h}$ in
$$y(x) = \mathrm{sign}[w^T \varphi(x) + b]$$

Dual space (non-parametric; suited for high-dimensional inputs): estimate $\alpha \in \mathbb{R}^{N}$ in
$$y(x) = \mathrm{sign}\Big[\sum_{i=1}^{\#sv} \alpha_i y_i K(x, x_i) + b\Big]$$

Kernel trick: $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$

[Figure: the primal model drawn as a network with hidden units $\varphi_1(x), \ldots, \varphi_{n_h}(x)$ and weights $w_1, \ldots, w_{n_h}$; the dual model as a network with units $K(x, x_1), \ldots, K(x, x_{\#sv})$ and weights $\alpha_1, \ldots, \alpha_{\#sv}$]
Standard SVM classifier (1)
Training set $\{x_i, y_i\}_{i=1}^{N}$: inputs $x_i \in \mathbb{R}^{n}$; class labels $y_i \in \{-1, +1\}$

Classifier: $y(x) = \mathrm{sign}[w^T \varphi(x) + b]$
with $\varphi(\cdot): \mathbb{R}^{n} \to \mathbb{R}^{n_h}$ a mapping to a high-dimensional feature space (which can be infinite-dimensional!)

For separable data, assume
$$w^T \varphi(x_i) + b \ge +1, \ \text{if } y_i = +1; \qquad w^T \varphi(x_i) + b \le -1, \ \text{if } y_i = -1$$
$$\Rightarrow \quad y_i [w^T \varphi(x_i) + b] \ge 1, \ \forall i$$

Optimization problem (non-separable case):
$$\min_{w,b,\xi} \ J(w, \xi) = \frac{1}{2} w^T w + c \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad \begin{cases} y_i [w^T \varphi(x_i) + b] \ge 1 - \xi_i \\ \xi_i \ge 0, \ i = 1, \ldots, N \end{cases}$$
Standard SVM classifier (2)
Lagrangian:
$$\mathcal{L}(w, b, \xi; \alpha, \nu) = J(w, \xi) - \sum_{i=1}^{N} \alpha_i \{ y_i [w^T \varphi(x_i) + b] - 1 + \xi_i \} - \sum_{i=1}^{N} \nu_i \xi_i$$

Find the saddle point: $\max_{\alpha, \nu} \min_{w, b, \xi} \mathcal{L}(w, b, \xi; \alpha, \nu)$

One obtains
$$\frac{\partial \mathcal{L}}{\partial w} = 0 \ \Rightarrow \ w = \sum_{i=1}^{N} \alpha_i y_i \varphi(x_i)$$
$$\frac{\partial \mathcal{L}}{\partial b} = 0 \ \Rightarrow \ \sum_{i=1}^{N} \alpha_i y_i = 0$$
$$\frac{\partial \mathcal{L}}{\partial \xi_i} = 0 \ \Rightarrow \ 0 \le \alpha_i \le c, \ i = 1, \ldots, N \quad \text{(via } c - \alpha_i - \nu_i = 0 \text{ and } \alpha_i, \nu_i \ge 0\text{)}$$
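Substituting these conditions back into the Lagrangian eliminates the primal variables, a step the slides leave implicit: $\sum_i \alpha_i y_i = 0$ removes the $b$ term and $c - \alpha_i - \nu_i = 0$ removes the $\xi$ terms, leaving
$$\mathcal{L} = \frac{1}{2} w^T w - \sum_{i=1}^{N} \alpha_i y_i\, w^T \varphi(x_i) + \sum_{i=1}^{N} \alpha_i = -\frac{1}{2} \sum_{i,j=1}^{N} y_i y_j\, \varphi(x_i)^T \varphi(x_j)\, \alpha_i \alpha_j + \sum_{j=1}^{N} \alpha_j,$$
which is the dual objective $Q(\alpha)$ of the next slide.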
Standard SVM classifier (3)
Dual problem: QP problem
$$\max_{\alpha} \ Q(\alpha) = -\frac{1}{2} \sum_{i,j=1}^{N} y_i y_j K(x_i, x_j)\, \alpha_i \alpha_j + \sum_{j=1}^{N} \alpha_j \quad \text{s.t.} \quad \begin{cases} \sum_{i=1}^{N} \alpha_i y_i = 0 \\ 0 \le \alpha_i \le c, \ \forall i \end{cases}$$

with the kernel trick (Mercer theorem): $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$

Obtained classifier: $y(x) = \mathrm{sign}\big[\sum_{i=1}^{N} \alpha_i y_i K(x, x_i) + b\big]$

Some possible kernels $K(\cdot, \cdot)$:
$K(x, x_i) = x_i^T x$ (linear SVM)
$K(x, x_i) = (x_i^T x + \tau)^d$ (polynomial SVM of degree $d$)
$K(x, x_i) = \exp\{-\|x - x_i\|_2^2 / \sigma^2\}$ (RBF kernel)
$K(x, x_i) = \tanh(\kappa\, x_i^T x + \theta)$ (MLP kernel)
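A hedged sketch of the pipeline above, solving the dual QP numerically with SciPy's general-purpose SLSQP solver on the RBF kernel. In practice a dedicated QP solver would be used, and all names here (`rbf`, `svm_dual`, `predict`) are illustrative rather than taken from any toolbox:

```python
# A sketch: the SVM dual QP above, solved with SciPy's SLSQP solver.
import numpy as np
from scipy.optimize import minimize

def rbf(X1, X2, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / sigma^2), the RBF kernel above
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma**2)

def svm_dual(X, y, c=1.0, sigma=1.0):
    N = len(y)
    P = (y[:, None] * y[None, :]) * rbf(X, X, sigma)    # y_i y_j K(x_i, x_j)

    # max Q(alpha) = -1/2 alpha^T P alpha + 1^T alpha  <=>  min of its negation
    fun = lambda a: 0.5 * a @ P @ a - a.sum()
    jac = lambda a: P @ a - 1.0

    cons = {"type": "eq", "fun": lambda a: a @ y}       # sum_i alpha_i y_i = 0
    res = minimize(fun, np.zeros(N), jac=jac, method="SLSQP",
                   bounds=[(0.0, c)] * N, constraints=[cons])
    alpha = res.x

    # b from a support vector with 0 < alpha_i < c (KKT conditions);
    # assumes at least one such interior support vector exists.
    i = np.flatnonzero((alpha > 1e-6) & (alpha < c - 1e-6))[0]
    b = y[i] - (alpha * y) @ rbf(X, X[i:i+1], sigma).ravel()
    return alpha, b

def predict(X, y, alpha, b, Xnew, sigma=1.0):
    # y(x) = sign[sum_i alpha_i y_i K(x, x_i) + b]
    return np.sign((alpha * y) @ rbf(X, Xnew, sigma) + b)
```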
Kernel-based learning: many related methods and fields
[Figure: SVMs, LS-SVMs, regularization networks, Gaussian processes, kriging, kernel ridge regression and RKHS methods shown as closely related]
Some early history on RKHS:
1910-1920: Moore
1940: Aronszajn
1951: Krige
1970: Parzen
1971: Kimeldorf & Wahba
SVMs are closely related to learning in Reproducing Kernel Hilbert Spaces
Wider use of the kernel trick
Angle between vectors:
Input space:
$$\cos \theta_{x,z} = \frac{x^T z}{\|x\|_2 \|z\|_2}$$
Feature space:
$$\cos \theta_{\varphi(x), \varphi(z)} = \frac{\varphi(x)^T \varphi(z)}{\|\varphi(x)\|_2 \|\varphi(z)\|_2} = \frac{K(x, z)}{\sqrt{K(x, x)}\sqrt{K(z, z)}}$$

Distance between vectors:
Input space:
$$\|x - z\|_2^2 = (x - z)^T (x - z) = x^T x + z^T z - 2 x^T z$$
Feature space:
$$\|\varphi(x) - \varphi(z)\|_2^2 = K(x, x) + K(z, z) - 2 K(x, z)$$
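Both identities use only kernel evaluations, never $\varphi(x)$ itself. A small numpy sketch (illustrative; the slides prescribe no code):

```python
# Feature-space angle and squared distance computed from kernel
# evaluations alone, without ever forming phi(x).
import numpy as np

def rbf(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / sigma**2)

def feature_angle(x, z, K=rbf):
    # cos(theta) = K(x, z) / sqrt(K(x, x) K(z, z))
    return K(x, z) / np.sqrt(K(x, x) * K(z, z))

def feature_dist2(x, z, K=rbf):
    # ||phi(x) - phi(z)||^2 = K(x, x) + K(z, z) - 2 K(x, z)
    return K(x, x) + K(z, z) - 2.0 * K(x, z)

x, z = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(feature_angle(x, z), feature_dist2(x, z))
```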
LS-SVM models: extending the SVM framework
Linear and nonlinear classification and function estimation, applicable in high-dimensional input spaces; primal-dual optimization formulations.
Solving linear systems (a sketch follows after this list); link with Gaussian processes, regularization networks and kernel versions of Fisher discriminant analysis.
Sparse approximation and robust regression (robust statistics).
Bayesian inference (probabilistic interpretations, inference of hyperparameters, model selection, automatic relevance determination for input selection).
Extensions to unsupervised learning: kernel PCA (and the related methods of kernel PLS, CCA), density estimation (clustering).
Fixed-size LS-SVMs: large-scale problems; adaptive learning machines; transductive inference.
Extensions to recurrent networks and control.
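The slides only state that LS-SVMs replace the QP by a linear system. As a hedged sketch, the system below follows Suykens' LS-SVM classifier formulation with regularization constant gamma; the helper name `lssvm_train` is illustrative:

```python
# LS-SVM classifier: training reduces to one linear system instead of a QP.
import numpy as np

def lssvm_train(K, y, gamma=1.0):
    # Solve  [0        y^T          ] [b    ]   [0]
    #        [y   Omega + I / gamma ] [alpha] = [1]
    # with Omega_ij = y_i y_j K(x_i, x_j)  (Suykens' formulation).
    N = len(y)
    Omega = (y[:, None] * y[None, :]) * K
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(N) / gamma
    rhs = np.concatenate(([0.0], np.ones(N)))
    sol = np.linalg.solve(A, rhs)
    b, alpha = sol[0], sol[1:]
    return alpha, b   # y(x) = sign(sum_i alpha_i y_i K(x, x_i) + b)
```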
Towards a next generation of universal models?
[Figure: one framework spanning FDA, PCA, PLS and CCA; classifiers, regression, clustering and recurrent models; each in linear, robust linear, kernel and robust kernel versions, with LS-SVM and SVM at the core]

Research issues:
Large-scale methods
Adaptive processing
Robustness issues
Statistical aspects
Application-specific kernels
Fixed-size LS-SVM (1)
[Figure: fixed-size LS-SVM scheme linking primal and dual space: Nystrom method, kernel PCA, density estimation, entropy criteria, eigenfunctions, SV selection, regression]

Modelling in view of primal-dual representations.
Link between the Nystrom approximation (GP), kernel PCA and density estimation.
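A minimal numpy sketch of the Nystrom idea named above: approximate the full N x N kernel matrix from M << N selected points. Random selection stands in here for the entropy-based selection used in fixed-size LS-SVM; all names are illustrative:

```python
# Nystrom approximation of a kernel matrix from M landmark points.
import numpy as np

def rbf(X1, X2, sigma=1.0):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma**2)

def nystrom(X, kernel, M, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=M, replace=False)   # "support vector" subset
    K_NM = kernel(X, X[idx])                          # N x M
    K_MM = K_NM[idx]                                  # M x M block
    # K  approx  K_NM  K_MM^{-1}  K_NM^T
    return K_NM @ np.linalg.pinv(K_MM) @ K_NM.T
```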
Fixed-size LS-SVM (2)
High-dimensional inputs, large data sets, adaptive learning machines (using LS-SVMlab)

[Figure: sinc function estimated from 20,000 data points using only 10 support vectors; two panels plotting y versus x]

[Figure: Santa Fe laser data; $y_k$ versus discrete time $k$, and predictions $\hat{y}_k$ overlaid on the true $y_k$]
Fixed-size LS-SVM (3)
[Figure: four panels of two-dimensional toy data on $[-2.5, 2.5] \times [-2.5, 2.5]$]
The problem of learning and generalization (1)
Different mathematical settings exist, e.g.

Vapnik et al.:
Predictive learning problem (inductive inference)
Estimating values of functions at given points (transductive inference)
Vapnik V. (1998) Statistical Learning Theory, John Wiley & Sons, New York.

Poggio et al., Smale:
Estimate the true function $f$ with an analysis of the approximation error and the sample error (e.g. in an RKHS, a Sobolev space)
Cucker F., Smale S. (2002) On the mathematical foundations of learning theory, Bulletin of the AMS, 39(1), 1-49.

Goal: deriving bounds on the generalization error (these can be used to determine regularization parameters and other tuning constants). Important for practical applications is trying to get sharp bounds.
The problem of learning and generalization (2)
(see Pontil, ESANN 2003)
Random variables $x \in X$, $y \in Y \subseteq \mathbb{R}$. Draw i.i.d. samples from the (unknown) probability distribution $\rho(x, y)$.

Generalization error:
$$\mathcal{E}[f] = \int_{X,Y} L(y, f(x))\, \rho(x, y)\, dx\, dy$$

Loss function $L(y, f(x))$; empirical error
$$\mathcal{E}_N[f] = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i))$$

$f^* := \arg\min_f \mathcal{E}[f]$ (true function); $f_N := \arg\min_f \mathcal{E}_N[f]$

If $L(y, f) = (f - y)^2$ then $f^*(x) = \int_Y y\, \rho(y|x)\, dy$ (the regression function)

Consider a hypothesis space $\mathcal{H}$ with $f_{\mathcal{H}} := \arg\min_{f \in \mathcal{H}} \mathcal{E}[f]$
The problem of learning and generalization (3)
generalization error = sample error + approximation error:
$$\mathcal{E}[f_N] - \mathcal{E}[f^*] = (\mathcal{E}[f_N] - \mathcal{E}[f_{\mathcal{H}}]) + (\mathcal{E}[f_{\mathcal{H}}] - \mathcal{E}[f^*])$$

The approximation error depends only on $\mathcal{H}$ (not on the sampled examples). For the sample error one has, with probability at least $1 - \delta$,
$$\mathcal{E}[f_N] - \mathcal{E}[f_{\mathcal{H}}] \le \varepsilon(1/N, h, 1/\delta)$$
where $\varepsilon$ is a non-decreasing function of each argument and $h$ measures the size of the hypothesis space $\mathcal{H}$.

Overfitting occurs when $h$ is large and $N$ is small (large sample error). Goal: obtain a good trade-off between sample error and approximation error (see the sketch below).
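An illustrative numpy experiment of this trade-off (not from the slides): polynomial degree plays the role of $h$; the empirical error keeps decreasing with $h$, while the error on fresh samples eventually grows:

```python
# Empirical error vs. error on fresh samples as model capacity h grows.
import numpy as np

rng = np.random.default_rng(0)
N = 20
x = rng.uniform(-1, 1, N)
y = np.sin(3 * x) + 0.2 * rng.normal(size=N)               # N noisy samples
x_test = rng.uniform(-1, 1, 1000)
y_test = np.sin(3 * x_test) + 0.2 * rng.normal(size=1000)  # fresh samples

for h in [1, 3, 5, 9, 15]:
    coef = np.polyfit(x, y, deg=h)                          # fit f_N in H_h
    emp = np.mean((np.polyval(coef, x) - y) ** 2)           # E_N[f_N]
    gen = np.mean((np.polyval(coef, x_test) - y_test) ** 2) # estimate of E[f_N]
    print(f"h={h:2d}  empirical={emp:.3f}  test={gen:.3f}")
```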
Interdisciplinary challenges
NATO-ASI on Learning Theory and Practice, Leuven, July 2002
http://www.esat.kuleuven.ac.be/sista/natoasi/ltp2002.html
[Figure: SVM & kernel methods at the crossroads of linear algebra, mathematics, statistics, systems and control theory, signal processing, optimization, machine learning, pattern recognition, data mining and neural networks]
J.A.K. Suykens, G. Horvath, S. Basu, C. Micchelli, J. Vandewalle (Eds.), Advances in Learning Theory:
Methods, Models and Applications, NATO-ASI Series Computer and Systems Sciences, IOS Press, 2003.
Books, software, papers ...
www.kernel-machines.org & www.esat.kuleuven.ac.be/sista/lssvmlab/
Introductory papers:
C.J.C. Burges (1998) A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery, 2(2), 121-167.
A.J. Smola, B. Schölkopf (1998) A tutorial on support vector regression, NeuroCOLT Technical Report NC-TR-98-030, Royal Holloway College, University of London, UK.
T. Evgeniou, M. Pontil, T. Poggio (2000) Regularization networks and support vector machines, Advances in Computational Mathematics, 13(1), 1-50.
K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, B. Schölkopf (2001) An introduction to kernel-based learning algorithms, IEEE Transactions on Neural Networks, 12(2), 181-201.