Advanced Computing Seminar: Data Mining and Its Industrial Applications
— Chapter 8 —
Support Vector Machines
Zhongzhi Shi, Markus Stumptner, Yalei Hao, Gerald Quirchmayr
Knowledge and Software Engineering Lab
Advanced Computing Research Centre, School of Computer and Information Science
University of South Australia
Outline
Introduction
Support Vector Machine
Non-linear Classification
SVM and PAC
Applications
Summary
History
SVM is a classifier derived from statistical learning theory by Vapnik and Chervonenkis.
SVMs were introduced by Boser, Guyon, and Vapnik in COLT-92.
Initially popularized in the NIPS community, SVMs are now an important and active field of machine learning research.
Special issues of the Machine Learning journal and of the Journal of Machine Learning Research have been devoted to them.
What is SVM?
SVMs are learning systems that:
use a hypothesis space of linear functions
in a high-dimensional feature space — kernel functions
are trained with a learning algorithm from optimization theory — Lagrange multipliers
and implement a learning bias derived from statistical learning theory — generalisation
Linear Classifiers
f(x, w, b) = sign(w · x − b), mapping an input x to an estimated label y_est (legend: one marker denotes +1, the other denotes −1).
[Figure: the same scatterplot of positive and negative points, repeated over several slides with a different candidate separating line drawn on each.]
How would you classify this data?
Copyright © 2001, 2003, Andrew W. Moore
Maximum Margin
f(x, w, b) = sign(w · x − b)
The maximum margin linear classifier is the linear classifier with the maximum margin.
This is the simplest kind of SVM, called a linear SVM (LSVM).
Copyright © 2001, 2003, Andrew W. Moore
Model of Linear Classification
Binary classification is frequently performed by using a real-valued hypothesis function:
$$f(x) = \langle w \cdot x \rangle + b = \sum_{i=1}^{n} w_i x_i + b$$
The input x is assigned to the positive class if $f(x) \ge 0$, and otherwise to the negative class.
The Concept of a Hyperplane
For a binary linearly separable training set, we can find at least one hyperplane (w, b) that divides the space into two half-spaces.
The hyperplane is defined by $f(x) = 0$; points with $f(x) > 0$ lie in one half-space and points with $f(x) < 0$ in the other.
Tuning the Hyperplane (w, b)
The Perceptron Algorithm, proposed by Frank Rosenblatt in 1956.
Preliminary definition: the functional margin of an example $(x_i, y_i)$ is
$$\gamma_i = y_i(\langle w \cdot x_i \rangle + b)$$
and $\gamma_i > 0$ implies correct classification of $(x_i, y_i)$.
The Perceptron Algorithm
The number of mistakes is at most
$$\left(\frac{2R}{\gamma}\right)^2$$
where $R$ is the radius of the smallest ball containing the training points and $\gamma$ is the margin of the training set.
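To make the algorithm concrete, here is a minimal Python sketch of the perceptron (not the slides' own code: the learning rate eta and the epoch cap are illustrative choices, and scaling the bias update by R² follows one common variant of the algorithm).

```python
# A minimal perceptron sketch; eta and max_epochs are illustrative choices.
import numpy as np

def perceptron(X, y, eta=1.0, max_epochs=100):
    """X: (l, n) inputs; y: (l,) labels in {-1, +1}. Returns (w, b)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    R = np.max(np.linalg.norm(X, axis=1))      # radius appearing in the mistake bound
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:  # functional margin not positive: a mistake
                w += eta * yi * xi             # Rosenblatt's update
                b += eta * yi * R ** 2         # bias update (one common variant)
                mistakes += 1
        if mistakes == 0:                      # no mistakes in a full sweep: done
            break
    return w, b
```

Note that the solution found depends on the order in which examples are processed, a drawback revisited a few slides below.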
The Geometric Margin
The Euclidean distance of an example $(x_i, y_i)$ from the decision boundary:
$$\gamma_i = y_i\left(\left\langle \frac{w}{\|w\|} \cdot x_i \right\rangle + \frac{b}{\|w\|}\right)$$
The Geometric Margin
The margin of a training set S: $\gamma = \min_{1 \le i \le l} \gamma_i$
Maximal margin hyperplane: a hyperplane realising the maximum geometric margin.
The optimal linear classifier is the one forming the maximal margin hyperplane.
How to Find the Optimal Solution?
The drawback of the perceptron algorithm: it may give a different solution depending on the order in which the examples are processed.
The superiority of SVMs: these learning machines tune the solution based on optimization theory.
The Maximal Margin Classifier
The simplest model of SVM: it finds the maximal margin hyperplane in a chosen kernel-induced feature space.
This is a convex optimization problem: minimizing a quadratic function under linear inequality constraints.
Support Vector Classifiers
Support vector machines (Cortes and Vapnik, 1995) are well suited for high-dimensional data and binary classification.
Training set $D = \{(x_i, y_i),\ i = 1, \ldots, n\}$, with $x_i \in \mathbb{R}^m$ and $y_i \in \{-1, 1\}$.
Linear discriminant classifier with separating hyperplane $\{x : g(x) = w^T x + w_0 = 0\}$, with model parameters $w \in \mathbb{R}^m$ and $w_0 \in \mathbb{R}$.
Formalizing the Geometric Margin
Assume that for the closest positive and negative examples $x^+, x^- \in S$:
$$\langle w \cdot x^+ \rangle + b = +1, \qquad \langle w \cdot x^- \rangle + b = -1$$
The geometric margin is then
$$\gamma = \frac{1}{2}\left(\left\langle \frac{w}{\|w\|} \cdot x^+ \right\rangle - \left\langle \frac{w}{\|w\|} \cdot x^- \right\rangle\right) = \frac{1}{\|w\|}$$
In order to find the maximum $\gamma$, we must find the minimum $\|w\|$.
Minimizing the Norm
Because $\|w\|^2 = \langle w \cdot w \rangle$, we can re-formalize the optimization problem: minimize $\langle w \cdot w \rangle$ subject to $y_i(\langle w \cdot x_i \rangle + b) \ge 1$ for $i = 1, \ldots, l$.
Minimizing the Norm
Use the Lagrangian function
$$L(w, b, \alpha) = \frac{1}{2}\langle w \cdot w \rangle - \sum_{i=1}^{l} \alpha_i\left[y_i(\langle w \cdot x_i \rangle + b) - 1\right]$$
Setting its derivatives with respect to w and b to zero, we obtain $w = \sum_{i=1}^{l} y_i \alpha_i x_i$ and $\sum_{i=1}^{l} y_i \alpha_i = 0$.
Resubstituting into the primal, we obtain the dual objective
$$W(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j \langle x_i \cdot x_j \rangle$$
Minimizing the Norm
Finding the minimum of $\langle w \cdot w \rangle$ is equivalent to finding the maximum of $W(\alpha)$.
Strategies for optimizing this differentiable function (a sketch follows below): decomposition; Sequential Minimal Optimization (SMO).
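As a concrete illustration of the optimization view (a sketch under assumed details, not the slides' implementation: it maximizes the dual with scipy's general-purpose SLSQP solver rather than decomposition or SMO), the hard-margin dual can be solved directly for a small data set.

```python
# Solve the hard-margin dual: max_a sum(a) - 1/2 sum_ij y_i y_j a_i a_j <x_i, x_j>
# subject to a_i >= 0 and sum_i a_i y_i = 0.
import numpy as np
from scipy.optimize import minimize

def fit_hard_margin_svm(X, y):
    """X: (l, n) data matrix; y: (l,) labels in {-1, +1}."""
    l = X.shape[0]
    G = (y[:, None] * X) @ (y[:, None] * X).T          # G[i, j] = y_i y_j <x_i, x_j>

    def neg_dual(a):                                   # minimize the negated dual W(a)
        return 0.5 * a @ G @ a - a.sum()

    cons = {"type": "eq", "fun": lambda a: a @ y}      # sum_i a_i y_i = 0
    res = minimize(neg_dual, np.zeros(l), bounds=[(0, None)] * l,
                   constraints=cons, method="SLSQP")
    a = res.x
    w = ((a * y)[:, None] * X).sum(axis=0)             # w* = sum_i a_i y_i x_i
    sv = a > 1e-6                                      # support vectors: a_i > 0
    b = np.mean(y[sv] - X[sv] @ w)                     # from y_i(<w, x_i> + b) = 1
    return w, b, a

# Toy usage on a linearly separable set
X = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b, a = fit_hard_margin_svm(X, y)
print(w, b, np.sign(X @ w + b))
```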
The Support Vectors
The Karush-Kuhn-Tucker condition of the optimization problem states that
$$\alpha_i^*\left[y_i(\langle w^* \cdot x_i \rangle + b^*) - 1\right] = 0$$
This implies that $\alpha_i^* \ne 0$ only for inputs $x_i$ whose functional margin is one, i.e., those lying closest to the hyperplane.
The Optimal Hypothesis (w*, b*)
The two parameters can be obtained from
$$w^* = \sum_{i=1}^{l} y_i \alpha_i^* x_i, \qquad b^* = -\frac{\max_{y_i = -1} \langle w^* \cdot x_i \rangle + \min_{y_i = +1} \langle w^* \cdot x_i \rangle}{2}$$
The hypothesis is
$$f(x, \alpha^*, b^*) = \sum_{i \in SV} y_i \alpha_i^* \langle x_i \cdot x \rangle + b^*$$
Soft Margin Optimization
The main problem with the maximal margin classifier is that it always produces a perfectly consistent hypothesis, i.e., a hypothesis with no training error, which makes it brittle on noisy data.
The remedy is to relax the boundary, allowing some examples to violate the margin (a sketch follows below).
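A hedged sketch of soft-margin training in practice, using scikit-learn's SVC (the slides do not prescribe a library, and the toy data and C = 1.0 are illustrative choices): the penalty parameter C trades margin width against training errors.

```python
# Soft-margin linear SVM via scikit-learn; smaller C permits more margin violations.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)          # overlapping classes: not separable

clf = SVC(kernel="linear", C=1.0).fit(X, y)  # C controls the slack penalty
print("support vectors:", clf.support_vectors_.shape[0])
print("training accuracy:", clf.score(X, y))
```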
Non-linear Classification
The problem: the maximal margin classifier is an important concept, but it cannot be used in many real-world problems, since in general there is no linear separation in the input space.
The solution: map the data into another space in which it can be separated linearly.
A Learning Machine
A learning machine f takes an input x and transforms it, using weights $\alpha$, into a predicted output $y_{est} = \pm 1$, where $\alpha$ is some vector of adjustable parameters.
Some Definitions
Given some machine f, and under the assumptions that all training points $(x_k, y_k)$ were drawn i.i.d. from some distribution and that future test points will be drawn from the same distribution, define (in the official terminology):
$$R(\alpha) = \text{TESTERR}(\alpha) = E\left[\tfrac{1}{2}\,|y - f(x, \alpha)|\right] = \text{probability of misclassification}$$
$$R_{emp}(\alpha) = \text{TRAINERR}(\alpha) = \frac{1}{R}\sum_{k=1}^{R} \tfrac{1}{2}\,|y_k - f(x_k, \alpha)| = \text{fraction of the training set misclassified}$$
where R = the number of training set data points.
Vapnik-Chervonenkis Dimension
Given some machine f, let h be its VC dimension. h is a measure of f's power (h does not depend on the choice of training set). Vapnik showed that, with probability $1 - \eta$,
$$\text{TESTERR}(\alpha) \le \text{TRAINERR}(\alpha) + \sqrt{\frac{h\left(\log(2R/h) + 1\right) - \log(\eta/4)}{R}}$$
This gives us a way to estimate the error on future data based only on the training error and the VC dimension of f.
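To see how the bound behaves, here is a small sketch (the values of h, R, and eta are illustrative, not from the slides) evaluating the VC-confidence term; it grows with the capacity h.

```python
# Evaluate sqrt((h(log(2R/h) + 1) - log(eta/4)) / R) for illustrative h, R, eta.
import math

def vc_confidence(h, R, eta=0.05):
    return math.sqrt((h * (math.log(2 * R / h) + 1) - math.log(eta / 4)) / R)

R = 10_000                                   # illustrative training set size
for h in (10, 100, 1000):
    print(h, round(vc_confidence(h, R), 4))  # larger h gives a looser bound
```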
Structural Risk Minimization
Let $\phi(f)$ = the set of functions representable by f. Suppose $\phi(f_1) \subseteq \phi(f_2) \subseteq \cdots \subseteq \phi(f_n)$. Then $h(f_1) \le h(f_2) \le \cdots \le h(f_n)$.
We're trying to decide which machine to use. We train each machine $f_1, \ldots, f_6$ and make a table recording, for each $f_i$, its TRAINERR, its VC-confidence term, and the resulting probable upper bound on TESTERR from
$$\text{TESTERR}(\alpha) \le \text{TRAINERR}(\alpha) + \sqrt{\frac{h\left(\log(2R/h) + 1\right) - \log(\eta/4)}{R}}$$
and then choose the machine whose bound is lowest.
Kernel-Induced Feature Space
Mapping the data of space X into space F:
$$x = (x_1, \ldots, x_n) \longmapsto \phi(x) = (\phi_1(x), \ldots, \phi_N(x))$$
Implicit Mapping into Feature Space
For a non-linearly separable data set, we can modify the hypothesis to map the data implicitly into another feature space:
$$f(x) = \sum_{i=1}^{l} w_i \phi_i(x) + b$$
$$f(x) = \sum_{i \in SV} \alpha_i^* y_i \langle \phi(x_i) \cdot \phi(x) \rangle + b$$
Kernel Function
A kernel is a function K such that for all $x, z \in X$:
$$K(x, z) = \langle \phi(x) \cdot \phi(z) \rangle$$
The benefit: it solves the computational problem of working with many dimensions, since the inner product in feature space is computed without ever forming $\phi(x)$ explicitly.
Kernel Functions
Polynomial: $k(x_i, x_j) = (\langle x_i \cdot x_j \rangle + 1)^d$
Radial basis functions: $k(x_i, x_j) = \exp\left(-\|x_i - x_j\|^2 / (2\sigma^2)\right)$
Sigmoid: $k(x_i, x_j) = \tanh(\kappa \langle x_i \cdot x_j \rangle + \theta)$
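A small numpy sketch of the three kernels listed above (not from the slides; the default parameter values are illustrative).

```python
# The polynomial, RBF, and sigmoid kernels from the slide, in numpy.
import numpy as np

def polynomial_kernel(xi, xj, d=2):
    # k(xi, xj) = (<xi, xj> + 1)^d
    return (np.dot(xi, xj) + 1) ** d

def rbf_kernel(xi, xj, sigma=1.0):
    # k(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2))
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(xi, xj, kappa=1.0, theta=0.0):
    # k(xi, xj) = tanh(kappa <xi, xj> + theta)
    return np.tanh(kappa * np.dot(xi, xj) + theta)
```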
The Polynomial Kernel
The kernel $K(x, y) = \langle x \cdot y \rangle^d$ represents the inner product of two vectors (points) in a feature space of dimension $\binom{n + d - 1}{d}$.
For example, for $n = 2$ and $d = 2$, the corresponding feature map is $\phi(x) = (x_1^2, x_2^2, \sqrt{2}\,x_1 x_2)$, since $\langle \phi(x) \cdot \phi(y) \rangle = (x_1 y_1 + x_2 y_2)^2 = \langle x \cdot y \rangle^2$.
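A quick numerical check (illustrative, not from the slides) that the explicit map above reproduces the kernel value, which the kernel computes without ever forming the map.

```python
# Verify <phi(x), phi(y)> == (<x, y>)^2 for the degree-2 polynomial kernel, n = 2.
import numpy as np

def phi(v):
    # Explicit feature map for k(x, y) = <x, y>^2 in two dimensions.
    return np.array([v[0] ** 2, v[1] ** 2, np.sqrt(2) * v[0] * v[1]])

x, y = np.array([5.0, 4.0]), np.array([1.0, 2.0])
print(np.dot(phi(x), phi(y)))   # 169.0
print(np.dot(x, y) ** 2)        # 169.0: same value, no explicit mapping needed
```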
Text Categorization
Inductive learning. Input: $x = (x_1, x_2, \ldots, x_n)$. Output: f(x) = confidence(class).
In the case of text classification, the attributes are words in the document, and the classes are the categories.
Properties of Text-Classification Tasks
High-dimensional feature space. Sparse document vectors. High level of redundancy.
Text Representation and Feature Selection
Binary features, term frequency, and inverse document frequency:
$$IDF(w) = \log\left(\frac{n}{DF(w)}\right)$$
where n is the total number of documents and DF(w) is the number of documents the word w occurs in. A sketch of the resulting TF-IDF weighting follows below.
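A short sketch of TF-IDF weighting, $tfidf(w, d) = TF(w, d) \cdot \log(n / DF(w))$ (the toy corpus and tokenization are assumptions, not the slides' data).

```python
# TF-IDF over a toy corpus: tfidf(w, d) = TF(w, d) * log(n / DF(w)).
import math
from collections import Counter

docs = [["interest", "rate", "rises"],
        ["world", "group", "sees", "interest"],
        ["prime", "rate", "discount"]]
n = len(docs)
df = Counter(w for d in docs for w in set(d))   # DF(w): documents containing w

def tfidf(doc):
    tf = Counter(doc)                           # TF(w, d): occurrences of w in d
    return {w: tf[w] * math.log(n / df[w]) for w in tf}

print(tfidf(docs[0]))
```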
Learning SVMs
To learn the vector of feature weights w: linear SVMs, polynomial classifiers, radial basis functions.
Processing
Text files are processed to produce a vector of words.
Select the 300 words with the highest mutual information with each category (after removing stopwords); a sketch follows below.
A separate classifier is learned for each category.
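A hedged sketch of mutual-information feature selection over binary word presence (the estimator and the toy data are illustrative assumptions, not the slides' exact procedure).

```python
# Score each word by I(W; C), where W is binary presence and C is the category.
import math
from collections import Counter

def mutual_information(docs, labels, word):
    n = len(docs)
    joint = Counter((word in d, c) for d, c in zip(docs, labels))
    mi = 0.0
    for (w, c), nwc in joint.items():
        pwc = nwc / n
        pw = sum(v for (w2, _), v in joint.items() if w2 == w) / n
        pc = sum(v for (_, c2), v in joint.items() if c2 == c) / n
        mi += pwc * math.log(pwc / (pw * pc))   # zero-count cells contribute nothing
    return mi

docs = [{"rate", "rises"}, {"world", "cup"}, {"rate", "cut"}, {"world", "news"}]
labels = ["interest", "sport", "interest", "sport"]
scores = {w: mutual_information(docs, labels, w) for w in set().union(*docs)}
print(sorted(scores, key=scores.get, reverse=True)[:2])   # top-scoring words
```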
An Example - Reuters (Trends & Controversies)
Category: interest
Weight vector w:
large positive weights: prime (.70), rate (.67), interest (.63), rates (.60), and discount (.46)
large negative weights: group (−.24), year (−.25), sees (−.33), world (−.35), and dlrs (−.71)
Text Categorization Results
Dumais et al. (1998)
Applying the Kernel to the Linear Classifier
Substituting the kernel into the hypothesis:
$$f(x) = \sum_{i \in SV} \alpha_i^* y_i \langle \phi(x_i) \cdot \phi(x) \rangle + b = \sum_{i \in SV} \alpha_i^* y_i K(x_i, x) + b$$
The same substitution is made in the margin optimization.
SVMs and PAC Learning
Theorems connect PAC theory to the size of the margin.
Basically, the larger the margin, the better the expected accuracy.
See, for example, Chapter 4 of An Introduction to Support Vector Machines by Cristianini and Shawe-Taylor, Cambridge University Press, 2000.
PAC and the Number of Support Vectors
The fewer the support vectors, the better the generalization will be.
Recall that non-support vectors are correctly classified and don't change the learned model if left out of the training set. So:
$$\text{leave-one-out error rate} \le \frac{\#\,\text{support vectors}}{\#\,\text{training examples}}$$
VC-Dimension of an SVM
Very loosely speaking, there is some theory which, under some different assumptions, puts an upper bound on the VC dimension of
$$\frac{\text{Diameter}}{\text{Margin}}$$
where Diameter is the diameter of the smallest sphere that can enclose all the high-dimensional term-vectors derived from the training set, and Margin is the smallest margin we'll let the SVM use.
This can be used in SRM (Structural Risk Minimization) for choosing the polynomial degree, the RBF $\sigma$, etc. But most people just use cross-validation.
Copyright © 2001, 2003, Andrew W. Moore
Finding Non-Linear Separating Surfaces
Map inputs into a new space. Example: features $(x_1, x_2) = (5, 4)$ map to $(x_1, x_2, x_1^2, x_2^2, x_1 x_2) = (5, 4, 25, 16, 20)$.
Solve the SVM program in this new space. This is computationally complex if there are many features, but a clever trick (the kernel function) exists.
Summary
Maximize the margin between positive and negative examples (connects to PAC theory).
Non-linear classification: kernels map examples into a new, usually non-linear, space.
The support vectors contribute to the solution.
References
Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
Andrew W. Moore. cmsc726: SVMs. http://www.cs.cmu.edu/~awm/tutorials
C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998. http://citeseer.nj.nec.com/burges98tutorial.html
Vladimir Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.
Thorsten Joachims. A Statistical Learning Model of Text Classification for Support Vector Machines. SIGIR 2001.
www.intsci.ac.cn/shizz/
Questions?!