A (Short) Introduction to Support Vector Machines and Kernel-based Learning
Johan Suykens
K.U. Leuven, ESAT-SCD-SISTA, Kasteelpark Arenberg 10
B-3001 Leuven (Heverlee), Belgium
Tel: 32/16/32 18 02 - Fax: 32/16/32 19 70 - Email: [email protected]
http://www.esat.kuleuven.ac.be/sista/members/suykens.html
ESANN 2003, Bruges, April 2003
Overview
Disadvantages of classical neural nets
SVM properties and standard SVM classifier
Related kernel-based learning methods
Use of the kernel trick (Mercer Theorem)
LS-SVMs: extending the SVM framework
Towards a next generation of universally applicable models?
The problem of learning and generalization
Classical MLPs

[Figure: multilayer perceptron with inputs $x_1, x_2, x_3, \ldots, x_n$, weights $w_1, w_2, w_3, \ldots, w_n$, bias $b$, hidden activations $h(\cdot)$ and output $y$]
Multilayer Perceptron (MLP) properties:
Universal approximation of continuous nonlinear functions
Learning from input-output patterns; either off-line or on-line learning
Parallel network architecture, multiple inputs and outputs
Use in feedforward and recurrent networks
Use in supervised and unsupervised learning applications
Problems: existence of many local minima! How many neurons are needed for a given task?
Support Vector Machines (SVM)
[Figure: cost function versus weights for an MLP (many local minima) and for an SVM (convex, one global minimum)]
Nonlinear classification and function estimation by convex optimization, with a unique solution and primal-dual interpretations.
The number of neurons follows automatically from a convex program.
Learning and generalization in huge-dimensional input spaces (able to avoid the curse of dimensionality!).
Use of kernels (e.g. linear, polynomial, RBF, MLP, splines, ...). Application-specific kernels are possible (e.g. text mining, bioinformatics).
SVM: support vectors
[Figure: two toy classification problems plotted over axes $x_1$, $x_2$; in each case the decision boundary is determined by a small subset of the training points]
The decision boundary can be expressed in terms of a limited number of support vectors (a subset of the given training data); sparseness property.
The classifier follows from the solution to a convex QP problem.
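The sparseness property is easy to check numerically. A minimal sketch (an illustration, not part of the original slides) using scikit-learn's SVC, which solves this convex QP internally; the toy data and parameter values are made up:

```python
# A minimal sketch of the sparseness property, using scikit-learn's SVC
# (the slides themselves name no software; data and parameters are made up).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two Gaussian clouds as a stand-in for the toy problems in the figure.
X = np.vstack([rng.normal(-1.0, 0.7, size=(100, 2)),
               rng.normal(+1.0, 0.7, size=(100, 2))])
y = np.hstack([-np.ones(100), np.ones(100)])

clf = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, y)

# Only a subset of the 200 training points become support vectors;
# the decision boundary depends on these alone.
print("number of support vectors:", clf.support_vectors_.shape[0])
```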
SVMs: living in two worlds ...
[Figure: '+' and 'x' classes, not linearly separable in the input space, mapped by $\varphi(x)$ to a feature space where a linear separation exists]

Primal space (parametric; suited for large data sets): estimate $w \in \mathbb{R}^{n_h}$ in
$$y(x) = \mathrm{sign}[w^T \varphi(x) + b]$$

Dual space (non-parametric; suited for high-dimensional inputs): estimate $\alpha \in \mathbb{R}^{N}$ in
$$y(x) = \mathrm{sign}\Big[\sum_{i=1}^{\#sv} \alpha_i y_i K(x, x_i) + b\Big]$$

Kernel trick: $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$

[Figure: the primal model drawn as a network with hidden units $\varphi_1(x), \ldots, \varphi_{n_h}(x)$ and weights $w_1, \ldots, w_{n_h}$; the dual model as a network with units $K(x, x_1), \ldots, K(x, x_{\#sv})$ and weights $\alpha_1, \ldots, \alpha_{\#sv}$]
Standard SVM classifier (1)
Training set $\{x_i, y_i\}_{i=1}^{N}$: inputs $x_i \in \mathbb{R}^{n}$; class labels $y_i \in \{-1, +1\}$

Classifier: $y(x) = \mathrm{sign}[w^T \varphi(x) + b]$
with $\varphi(\cdot): \mathbb{R}^{n} \to \mathbb{R}^{n_h}$ a mapping to a high-dimensional feature space (which can be infinite-dimensional!)

For separable data, assume
$$w^T \varphi(x_i) + b \ge +1, \ \text{if } y_i = +1; \qquad w^T \varphi(x_i) + b \le -1, \ \text{if } y_i = -1$$
$$\Rightarrow \quad y_i [w^T \varphi(x_i) + b] \ge 1, \ \forall i$$

Optimization problem (non-separable case):
$$\min_{w,b,\xi} \ J(w, \xi) = \frac{1}{2} w^T w + c \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad \begin{cases} y_i [w^T \varphi(x_i) + b] \ge 1 - \xi_i \\ \xi_i \ge 0, \ i = 1, \ldots, N \end{cases}$$
Standard SVM classifier (2)
Lagrangian:
$$\mathcal{L}(w, b, \xi; \alpha, \nu) = J(w, \xi) - \sum_{i=1}^{N} \alpha_i \{ y_i [w^T \varphi(x_i) + b] - 1 + \xi_i \} - \sum_{i=1}^{N} \nu_i \xi_i$$

Find the saddle point: $\max_{\alpha, \nu} \min_{w, b, \xi} \mathcal{L}(w, b, \xi; \alpha, \nu)$

One obtains
$$\frac{\partial \mathcal{L}}{\partial w} = 0 \ \Rightarrow \ w = \sum_{i=1}^{N} \alpha_i y_i \varphi(x_i)$$
$$\frac{\partial \mathcal{L}}{\partial b} = 0 \ \Rightarrow \ \sum_{i=1}^{N} \alpha_i y_i = 0$$
$$\frac{\partial \mathcal{L}}{\partial \xi_i} = 0 \ \Rightarrow \ 0 \le \alpha_i \le c, \ i = 1, \ldots, N \quad \text{(via } c - \alpha_i - \nu_i = 0 \text{ and } \alpha_i, \nu_i \ge 0\text{)}$$
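Substituting these conditions back into the Lagrangian eliminates the primal variables, a step the slides leave implicit: $\sum_i \alpha_i y_i = 0$ removes the $b$ term and $c - \alpha_i - \nu_i = 0$ removes the $\xi$ terms, leaving
$$\mathcal{L} = \frac{1}{2} w^T w - \sum_{i=1}^{N} \alpha_i y_i\, w^T \varphi(x_i) + \sum_{i=1}^{N} \alpha_i = -\frac{1}{2} \sum_{i,j=1}^{N} y_i y_j\, \varphi(x_i)^T \varphi(x_j)\, \alpha_i \alpha_j + \sum_{j=1}^{N} \alpha_j,$$
which is the dual objective $Q(\alpha)$ of the next slide.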
Standard SVM classifier (3)
Dual problem: QP problem
$$\max_{\alpha} \ Q(\alpha) = -\frac{1}{2} \sum_{i,j=1}^{N} y_i y_j K(x_i, x_j)\, \alpha_i \alpha_j + \sum_{j=1}^{N} \alpha_j \quad \text{s.t.} \quad \begin{cases} \sum_{i=1}^{N} \alpha_i y_i = 0 \\ 0 \le \alpha_i \le c, \ \forall i \end{cases}$$

with the kernel trick (Mercer theorem): $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$

Obtained classifier: $y(x) = \mathrm{sign}\big[\sum_{i=1}^{N} \alpha_i y_i K(x, x_i) + b\big]$

Some possible kernels $K(\cdot, \cdot)$:
$K(x, x_i) = x_i^T x$ (linear SVM)
$K(x, x_i) = (x_i^T x + \tau)^d$ (polynomial SVM of degree $d$)
$K(x, x_i) = \exp\{-\|x - x_i\|_2^2 / \sigma^2\}$ (RBF kernel)
$K(x, x_i) = \tanh(\kappa\, x_i^T x + \theta)$ (MLP kernel)
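A hedged sketch of the pipeline above, solving the dual QP numerically with SciPy's general-purpose SLSQP solver on the RBF kernel. In practice a dedicated QP solver would be used, and all names here (`rbf`, `svm_dual`, `predict`) are illustrative rather than taken from any toolbox:

```python
# A sketch: the SVM dual QP above, solved with SciPy's SLSQP solver.
import numpy as np
from scipy.optimize import minimize

def rbf(X1, X2, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / sigma^2), the RBF kernel above
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma**2)

def svm_dual(X, y, c=1.0, sigma=1.0):
    N = len(y)
    P = (y[:, None] * y[None, :]) * rbf(X, X, sigma)    # y_i y_j K(x_i, x_j)

    # max Q(alpha) = -1/2 alpha^T P alpha + 1^T alpha  <=>  min of its negation
    fun = lambda a: 0.5 * a @ P @ a - a.sum()
    jac = lambda a: P @ a - 1.0

    cons = {"type": "eq", "fun": lambda a: a @ y}       # sum_i alpha_i y_i = 0
    res = minimize(fun, np.zeros(N), jac=jac, method="SLSQP",
                   bounds=[(0.0, c)] * N, constraints=[cons])
    alpha = res.x

    # b from a support vector with 0 < alpha_i < c (KKT conditions);
    # assumes at least one such interior support vector exists.
    i = np.flatnonzero((alpha > 1e-6) & (alpha < c - 1e-6))[0]
    b = y[i] - (alpha * y) @ rbf(X, X[i:i+1], sigma).ravel()
    return alpha, b

def predict(X, y, alpha, b, Xnew, sigma=1.0):
    # y(x) = sign[sum_i alpha_i y_i K(x, x_i) + b]
    return np.sign((alpha * y) @ rbf(X, Xnew, sigma) + b)
```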
Kernel-based learning: many related methods and fields
[Figure: SVMs, LS-SVMs, regularization networks, Gaussian processes, kriging, kernel ridge regression and RKHS methods shown as closely related]
Some early history on RKHS:
1910-1920: Moore
1940: Aronszajn
1951: Krige
1970: Parzen
1971: Kimeldorf & Wahba
SVMs are closely related to learning in Reproducing Kernel Hilbert Spaces
Wider use of the kernel trick
Angle between vectors:
Input space:
$$\cos \theta_{x,z} = \frac{x^T z}{\|x\|_2 \|z\|_2}$$
Feature space:
$$\cos \theta_{\varphi(x), \varphi(z)} = \frac{\varphi(x)^T \varphi(z)}{\|\varphi(x)\|_2 \|\varphi(z)\|_2} = \frac{K(x, z)}{\sqrt{K(x, x)}\sqrt{K(z, z)}}$$

Distance between vectors:
Input space:
$$\|x - z\|_2^2 = (x - z)^T (x - z) = x^T x + z^T z - 2 x^T z$$
Feature space:
$$\|\varphi(x) - \varphi(z)\|_2^2 = K(x, x) + K(z, z) - 2 K(x, z)$$
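Both identities use only kernel evaluations, never $\varphi(x)$ itself. A small numpy sketch (illustrative; the slides prescribe no code):

```python
# Feature-space angle and squared distance computed from kernel
# evaluations alone, without ever forming phi(x).
import numpy as np

def rbf(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / sigma**2)

def feature_angle(x, z, K=rbf):
    # cos(theta) = K(x, z) / sqrt(K(x, x) K(z, z))
    return K(x, z) / np.sqrt(K(x, x) * K(z, z))

def feature_dist2(x, z, K=rbf):
    # ||phi(x) - phi(z)||^2 = K(x, x) + K(z, z) - 2 K(x, z)
    return K(x, x) + K(z, z) - 2.0 * K(x, z)

x, z = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(feature_angle(x, z), feature_dist2(x, z))
```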
LS-SVM models: extending the SVM framework
Linear and nonlinear classification and function estimation, applicable in high-dimensional input spaces; primal-dual optimization formulations.
Solving linear systems (a sketch follows after this list); link with Gaussian processes, regularization networks and kernel versions of Fisher discriminant analysis.
Sparse approximation and robust regression (robust statistics).
Bayesian inference (probabilistic interpretations, inference of hyperparameters, model selection, automatic relevance determination for input selection).
Extensions to unsupervised learning: kernel PCA (and the related methods of kernel PLS, CCA), density estimation (clustering).
Fixed-size LS-SVMs: large-scale problems; adaptive learning machines; transductive inference.
Extensions to recurrent networks and control.
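The slides only state that LS-SVMs replace the QP by a linear system. As a hedged sketch, the system below follows Suykens' LS-SVM classifier formulation with regularization constant gamma; the helper name `lssvm_train` is illustrative:

```python
# LS-SVM classifier: training reduces to one linear system instead of a QP.
import numpy as np

def lssvm_train(K, y, gamma=1.0):
    # Solve  [0        y^T          ] [b    ]   [0]
    #        [y   Omega + I / gamma ] [alpha] = [1]
    # with Omega_ij = y_i y_j K(x_i, x_j)  (Suykens' formulation).
    N = len(y)
    Omega = (y[:, None] * y[None, :]) * K
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(N) / gamma
    rhs = np.concatenate(([0.0], np.ones(N)))
    sol = np.linalg.solve(A, rhs)
    b, alpha = sol[0], sol[1:]
    return alpha, b   # y(x) = sign(sum_i alpha_i y_i K(x, x_i) + b)
```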
Towards a next generation of universal models?
[Figure: one framework spanning FDA, PCA, PLS and CCA; classifiers, regression, clustering and recurrent models; each in linear, robust linear, kernel and robust kernel versions, with LS-SVM and SVM at the core]

Research issues:
Large-scale methods
Adaptive processing
Robustness issues
Statistical aspects
Application-specific kernels
Fixed-size LS-SVM (1)
[Figure: fixed-size LS-SVM scheme linking primal and dual space: Nystrom method, kernel PCA, density estimation, entropy criteria, eigenfunctions, SV selection, regression]

Modelling in view of primal-dual representations.
Link between the Nystrom approximation (GP), kernel PCA and density estimation.
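A minimal numpy sketch of the Nystrom idea named above: approximate the full N x N kernel matrix from M << N selected points. Random selection stands in here for the entropy-based selection used in fixed-size LS-SVM; all names are illustrative:

```python
# Nystrom approximation of a kernel matrix from M landmark points.
import numpy as np

def rbf(X1, X2, sigma=1.0):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma**2)

def nystrom(X, kernel, M, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=M, replace=False)   # "support vector" subset
    K_NM = kernel(X, X[idx])                          # N x M
    K_MM = K_NM[idx]                                  # M x M block
    # K  approx  K_NM  K_MM^{-1}  K_NM^T
    return K_NM @ np.linalg.pinv(K_MM) @ K_NM.T
```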
Fixed-size LS-SVM (2)
High-dimensional inputs, large data sets, adaptive learning machines (using LS-SVMlab)

[Figure: sinc function estimated from 20,000 data points using only 10 support vectors; two panels plotting y versus x]

[Figure: Santa Fe laser data; $y_k$ versus discrete time $k$, and predictions $\hat{y}_k$ overlaid on the true $y_k$]
Fixed-size LS-SVM (3)
[Figure: four panels of two-dimensional toy data on $[-2.5, 2.5] \times [-2.5, 2.5]$]
The problem of learning and generalization (1)
Different mathematical settings exist, e.g.

Vapnik et al.:
Predictive learning problem (inductive inference)
Estimating values of functions at given points (transductive inference)
Vapnik V. (1998) Statistical Learning Theory, John Wiley & Sons, New York.

Poggio et al., Smale:
Estimate the true function $f$ with an analysis of the approximation error and the sample error (e.g. in an RKHS, a Sobolev space)
Cucker F., Smale S. (2002) On the mathematical foundations of learning theory, Bulletin of the AMS, 39(1), 1-49.

Goal: deriving bounds on the generalization error (these can be used to determine regularization parameters and other tuning constants). Important for practical applications is trying to get sharp bounds.
The problem of learning and generalization (2)
(see Pontil, ESANN 2003)
Random variables $x \in X$, $y \in Y \subseteq \mathbb{R}$. Draw i.i.d. samples from the (unknown) probability distribution $\rho(x, y)$.

Generalization error:
$$\mathcal{E}[f] = \int_{X,Y} L(y, f(x))\, \rho(x, y)\, dx\, dy$$

Loss function $L(y, f(x))$; empirical error
$$\mathcal{E}_N[f] = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i))$$

$f^* := \arg\min_f \mathcal{E}[f]$ (true function); $f_N := \arg\min_f \mathcal{E}_N[f]$

If $L(y, f) = (f - y)^2$ then $f^*(x) = \int_Y y\, \rho(y|x)\, dy$ (the regression function)

Consider a hypothesis space $\mathcal{H}$ with $f_{\mathcal{H}} := \arg\min_{f \in \mathcal{H}} \mathcal{E}[f]$
The problem of learning and generalization (3)
generalization error = sample error + approximation error:
$$\mathcal{E}[f_N] - \mathcal{E}[f^*] = (\mathcal{E}[f_N] - \mathcal{E}[f_{\mathcal{H}}]) + (\mathcal{E}[f_{\mathcal{H}}] - \mathcal{E}[f^*])$$

The approximation error depends only on $\mathcal{H}$ (not on the sampled examples). For the sample error one has, with probability at least $1 - \delta$,
$$\mathcal{E}[f_N] - \mathcal{E}[f_{\mathcal{H}}] \le \varepsilon(1/N, h, 1/\delta)$$
where $\varepsilon$ is a non-decreasing function of each argument and $h$ measures the size of the hypothesis space $\mathcal{H}$.

Overfitting occurs when $h$ is large and $N$ is small (large sample error). Goal: obtain a good trade-off between sample error and approximation error (see the sketch below).
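An illustrative numpy experiment of this trade-off (not from the slides): polynomial degree plays the role of $h$; the empirical error keeps decreasing with $h$, while the error on fresh samples eventually grows:

```python
# Empirical error vs. error on fresh samples as model capacity h grows.
import numpy as np

rng = np.random.default_rng(0)
N = 20
x = rng.uniform(-1, 1, N)
y = np.sin(3 * x) + 0.2 * rng.normal(size=N)               # N noisy samples
x_test = rng.uniform(-1, 1, 1000)
y_test = np.sin(3 * x_test) + 0.2 * rng.normal(size=1000)  # fresh samples

for h in [1, 3, 5, 9, 15]:
    coef = np.polyfit(x, y, deg=h)                          # fit f_N in H_h
    emp = np.mean((np.polyval(coef, x) - y) ** 2)           # E_N[f_N]
    gen = np.mean((np.polyval(coef, x_test) - y_test) ** 2) # estimate of E[f_N]
    print(f"h={h:2d}  empirical={emp:.3f}  test={gen:.3f}")
```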
Interdisciplinary challenges
NATO-ASI on Learning Theory and Practice, Leuven, July 2002
http://www.esat.kuleuven.ac.be/sista/natoasi/ltp2002.html
[Figure: SVM & kernel methods at the crossroads of linear algebra, mathematics, statistics, systems and control theory, signal processing, optimization, machine learning, pattern recognition, data mining and neural networks]
J.A.K. Suykens, G. Horvath, S. Basu, C. Micchelli, J. Vandewalle (Eds.), Advances in Learning Theory:
Methods, Models and Applications, NATO-ASI Series Computer and Systems Sciences, IOS Press, 2003.
Books, software, papers ...
www.kernel-machines.org & www.esat.kuleuven.ac.be/sista/lssvmlab/
Introductory papers:
C.J.C. Burges (1998) A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery, 2(2), 121-167.
A.J. Smola, B. Schölkopf (1998) A tutorial on support vector regression, NeuroCOLT Technical Report NC-TR-98-030, Royal Holloway College, University of London, UK.
T. Evgeniou, M. Pontil, T. Poggio (2000) Regularization networks and support vector machines, Advances in Computational Mathematics, 13(1), 1-50.
K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, B. Schölkopf (2001) An introduction to kernel-based learning algorithms, IEEE Transactions on Neural Networks, 12(2), 181-201.