1
Image Classification and Support Vector Machine
Shao-Chuan Wang, CITI, Academia Sinica
2
Outline (1/2)
- Quick review of SVM
  - Intuition
  - Functional margin and geometric margin
  - Optimal margin classifier
  - Generalized Lagrange multiplier method
  - Lagrangian duality
  - Kernels and feature mapping
  - Soft margin (L1 regularization)
3
Outline (2/2)
- Some basics of learning theory
  - Bias/variance tradeoff (underfitting vs. overfitting)
  - Chernoff bound and VC dimension
- Model selection
  - Cross validation
- Dimensionality reduction
- Multiclass SVM
  - One against one
  - One against all
- Image classification by SVM
  - Process
  - Results
4
Intuition: Margins
Training set S = \{(x^{(i)}, y^{(i)})\}_{i=1}^{m}, with labels y^{(i)} \in \{-1, 1\} and a separating hyperplane w^T x + b = 0.
Functional margin: \hat{\gamma}^{(i)} = y^{(i)}\left(w^T x^{(i)} + b\right). We feel more confident when the functional margin is larger.
Geometric margin: \gamma^{(i)} = y^{(i)}\left( (w/\|w\|)^T x^{(i)} + b/\|w\| \right), the signed distance from x^{(i)} to the hyperplane.
Note that scaling w and b won't change the plane.
Andrew Ng. Part V Support Vector Machines. CS229 Lecture Notes (2008).
5
Maximize margins
Optimization problem: maximize the minimal geometric margin under the constraints.
Introduce a scaling of (w, b) such that the functional margin equals 1; the problem then reduces to minimizing \|w\|^2 / 2.
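Written out (following the CS229 notes the slide cites), the reduction is:

```latex
% Maximize the minimal geometric margin:
\max_{\gamma, w, b}\ \gamma
\quad \text{s.t.}\quad y^{(i)}\!\left(w^T x^{(i)} + b\right) \ge \gamma,
\ \ \|w\| = 1
% Fix the scaling so that the functional margin is 1 (\hat{\gamma} = 1);
% maximizing \gamma = \hat{\gamma}/\|w\| then becomes:
\min_{w, b}\ \tfrac{1}{2}\|w\|^2
\quad \text{s.t.}\quad y^{(i)}\!\left(w^T x^{(i)} + b\right) \ge 1,
\quad i = 1, \dots, m
```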
6
Lagrange duality
Primal optimization problem:
Generalized Lagrangian:
Primal optimization problem (equivalent form):
Dual optimization problem:
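In symbols (standard form, as in the cited notes):

```latex
% Primal problem:
\min_{w}\ f(w)
\quad \text{s.t.}\quad g_i(w) \le 0,\ \ h_i(w) = 0
% Generalized Lagrangian:
\mathcal{L}(w, \alpha, \beta)
  = f(w) + \sum_{i} \alpha_i g_i(w) + \sum_{i} \beta_i h_i(w)
% Equivalent primal form, with optimal value p^*:
p^* = \min_{w}\ \max_{\alpha, \beta:\ \alpha_i \ge 0}\ \mathcal{L}(w, \alpha, \beta)
% Dual problem, with optimal value d^*; always d^* \le p^*:
d^* = \max_{\alpha, \beta:\ \alpha_i \ge 0}\ \min_{w}\ \mathcal{L}(w, \alpha, \beta)
```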
7
Dual Problem
Conditions under which equality (d* = p*) holds: f and the g_i are convex, the h_i are affine, and the KKT conditions are satisfied at the solution.
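The KKT conditions referred to here are:

```latex
% KKT conditions at an optimal (w^*, \alpha^*, \beta^*):
\frac{\partial}{\partial w_i} \mathcal{L}(w^*, \alpha^*, \beta^*) = 0
\qquad
\frac{\partial}{\partial \beta_i} \mathcal{L}(w^*, \alpha^*, \beta^*) = 0
% Complementary slackness:
\alpha_i^* \, g_i(w^*) = 0
% Feasibility and dual feasibility:
g_i(w^*) \le 0, \qquad \alpha_i^* \ge 0
```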
8
Optimal margin classifiers
Its Lagrangian
Its dual problem
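The three pieces, written out as in the CS229 notes:

```latex
% Primal:
\min_{w, b}\ \tfrac{1}{2}\|w\|^2
\quad \text{s.t.}\quad y^{(i)}\!\left(w^T x^{(i)} + b\right) \ge 1
% Lagrangian:
\mathcal{L}(w, b, \alpha) = \tfrac{1}{2}\|w\|^2
  - \sum_{i=1}^{m} \alpha_i \left[ y^{(i)}\!\left(w^T x^{(i)} + b\right) - 1 \right]
% Dual:
\max_{\alpha}\ W(\alpha) = \sum_{i=1}^{m} \alpha_i
  - \tfrac{1}{2} \sum_{i, j = 1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j
    \left\langle x^{(i)}, x^{(j)} \right\rangle
\quad \text{s.t.}\quad \alpha_i \ge 0,\ \ \sum_{i=1}^{m} \alpha_i y^{(i)} = 0
```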
9
Kernel and feature mapping
Kernel: K(x, z) = \phi(x)^T \phi(z)
- Positive semi-definite
- Symmetric
Loose intuition: a "similarity" measure between features.
For example, K(x, z) = (x^T z)^2; for x \in R^3 the corresponding feature map is
\phi(x) = (x_1 x_1,\ x_1 x_2,\ x_1 x_3,\ x_2 x_1,\ x_2 x_2,\ x_2 x_3,\ x_3 x_1,\ x_3 x_2,\ x_3 x_3)^T
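As a quick numeric check of the example kernel (plain NumPy, not from the slides), this sketch confirms that K(x, z) = (x^T z)^2 agrees with the explicit 9-dimensional feature map φ:

```python
import numpy as np

def phi(x):
    # Explicit feature map for the kernel (x^T z)^2 in 3-D:
    # all pairwise products x_i * x_j, giving 9 features.
    return np.array([x[i] * x[j] for i in range(3) for j in range(3)])

def K(x, z):
    # Kernel evaluation: O(n) work instead of O(n^2) features.
    return float(np.dot(x, z) ** 2)

x = np.array([1.0, 2.0, 3.0])
z = np.array([4.0, 5.0, 6.0])

lhs = K(x, z)
rhs = float(np.dot(phi(x), phi(z)))
print(lhs, rhs)  # both 1024.0, since x . z = 32 and 32**2 = 1024
```

This is the usual "kernel trick" payoff: the kernel computes the same inner product without ever materializing the 9-dimensional vectors.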
10
Soft Margin (L1 regularization)
C = ∞ recovers the hard-margin SVM; Rychetsky (2001).
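The soft-margin primal problem the slide refers to:

```latex
% L1-regularized (soft-margin) primal problem:
\min_{w, b, \xi}\ \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i
\quad \text{s.t.}\quad y^{(i)}\!\left(w^T x^{(i)} + b\right) \ge 1 - \xi_i,
\ \ \xi_i \ge 0
```

Each slack variable \xi_i allows example i to violate the margin at cost C\xi_i, so larger C penalizes violations more harshly.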
11
Why doesn’t my model fit well on test data ?
12
Some basics of learning theory
Bias/variance tradeoff: underfitting (high bias) vs. overfitting (high variance).
Training error: \hat{\varepsilon}(h) = \frac{1}{m} \sum_{i=1}^{m} 1\{ h(x^{(i)}) \neq y^{(i)} \}
Generalization error: \varepsilon(h) = P_{(x, y) \sim D}\left( h(x) \neq y \right)
Andrew Ng. Part VI Learning Theory. CS229 Lecture Notes (2008).
13
Bias/variance tradeoff
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer series in statistics. Springer, New York, 2001.
14
Is training error a good estimator of generalization error?
15
Chernoff bound (|H| = finite)
Lemma: Assume Z_1, Z_2, ..., Z_m are drawn iid from Bernoulli(φ), let
\hat{\phi} = \frac{1}{m} \sum_{i=1}^{m} Z_i,
and let γ > 0 be fixed. Then
P\left( |\phi - \hat{\phi}| > \gamma \right) \le 2 \exp(-2 \gamma^2 m)
Based on this lemma, one can show that with probability 1 − δ (k = # of hypotheses),
\varepsilon(h) \le \hat{\varepsilon}(h) + \sqrt{ \frac{1}{2m} \log \frac{2k}{\delta} }
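Solving the bound for the sample size gives a handy rule of thumb: to make the gap at most γ for all k hypotheses with probability 1 − δ, it suffices that m ≥ (1 / 2γ²) log(2k/δ). A small illustrative calculator (the particular k, γ, δ below are made up):

```python
import math

def sample_complexity(k, gamma, delta):
    # Smallest m such that, by the Hoeffding/Chernoff bound above,
    # |eps(h) - eps_hat(h)| <= gamma holds for all k hypotheses
    # simultaneously with probability at least 1 - delta.
    return math.ceil((1.0 / (2.0 * gamma ** 2)) * math.log(2.0 * k / delta))

m = sample_complexity(k=10000, gamma=0.05, delta=0.01)
print(m)  # 2902
```

Note the logarithmic dependence on k: squaring the number of hypotheses only doubles one additive term.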
Andrew Ng. Part VI Learning Theory. CS229 Lecture Notes (2008).
16
Chernoff bound (|H| = infinite)
VC dimension d: the size of the largest set that H can shatter.
e.g. H = linear classifiers in 2-D → VC(H) = 3
With probability at least 1 − δ,
\varepsilon(h) \le \hat{\varepsilon}(h) + O\left( \sqrt{ \frac{d}{m} \log \frac{m}{d} + \frac{1}{m} \log \frac{1}{\delta} } \right)
17
Model Selection
Cross validation: an estimator of generalization error.
K-fold: train on k − 1 pieces, test on the remaining piece (this yields one test-error estimate). Repeating over all k folds and averaging the k estimates, say to 2%, gives the estimated generalization error for this learner.
Leave-one-out cross validation (m-fold, m = training sample size).
[Figure: one round of k-fold cross validation, with one validation fold among the training folds]
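The k-fold procedure can be sketched in a few lines; the learner below is a stand-in (the slides use an SVM), so only the splitting and averaging logic is meant literally:

```python
import numpy as np

def k_fold_error(X, y, k, train_fn, predict_fn):
    # Split indices into k folds; each fold serves once as the validation set.
    m = len(y)
    folds = np.array_split(np.arange(m), k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(X[train], y[train])
        pred = predict_fn(model, X[val])
        errors.append(np.mean(pred != y[val]))  # one test-error estimate
    return float(np.mean(errors))  # averaged estimate of generalization error

# Stand-in learner: always predict the majority class of the training folds.
train_majority = lambda X, y: int(np.round(np.mean(y)))
predict_majority = lambda model, X: np.full(len(X), model)

X = np.zeros((12, 2))
y = np.array([1] * 9 + [0] * 3)
err = k_fold_error(X, y, k=6, train_fn=train_majority, predict_fn=predict_majority)
print(err)
```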
18
Model Selection
Loop over the candidate parameters:
- Pick one parameter setting, e.g. C = 2.0
- Do cross validation to get an error estimate
Pick C_best (the setting with the minimal error estimate) as the parameter.
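The parameter loop above, as a sketch; `toy_cv_error` is a made-up stand-in for the cross-validation error estimate described on the previous slide:

```python
import math

def pick_best_C(candidates, cv_error):
    # Try each candidate C, keep the one with the smallest CV error estimate.
    best_C, best_err = None, float("inf")
    for C in candidates:
        err = cv_error(C)
        if err < best_err:
            best_C, best_err = C, err
    return best_C, best_err

# Toy error curve with its minimum at C = 2.0 (illustrative only).
toy_cv_error = lambda C: (math.log2(C) - 1.0) ** 2 + 0.02

best_C, best_err = pick_best_C([0.5, 1.0, 2.0, 4.0, 8.0], toy_cv_error)
print(best_C, best_err)
```

In practice the candidates are usually spaced on a log grid (0.5, 1, 2, 4, ...), as in the list above.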
19
Dimensionality Reduction
Which features are more "important"?
Wrapper model feature selection
- Forward/backward search: add/remove one feature at a time, then evaluate the model with the new feature set.
Filter feature selection
- Compute a score S(i) that measures how informative x_i is about the class label y.
- S(i) can be the correlation Corr(x_i, y), the mutual information MI(x_i, y), etc.
Principal Component Analysis (PCA)
Vector Quantization (VQ)
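A minimal sketch of the filter approach with a correlation score; the data here is synthetic, purely for illustration:

```python
import numpy as np

def correlation_scores(X, y):
    # Score each feature x_i by |Corr(x_i, y)|; higher = more informative.
    return np.array([abs(np.corrcoef(X[:, i], y)[0, 1])
                     for i in range(X.shape[1])])

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200).astype(float)
informative = y + 0.1 * rng.standard_normal(200)  # tracks the label
noise = rng.standard_normal(200)                  # unrelated to the label
X = np.column_stack([noise, informative])

scores = correlation_scores(X, y)
ranking = np.argsort(-scores)  # best-scoring feature first
print(ranking)  # the informative feature (index 1) ranks first
```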
20
Multiclass SVM
One against one
- There are k(k − 1)/2 binary SVMs (1v2, 1v3, ...). To predict, each SVM votes between its 2 classes, and the class with the most votes wins the poll.
- e.g. with k = 6 classes:
  class: 1 2 3 4 5 6
  votes: 1 3 5 3 2 1 → predicted class = 3
One against all
- There are k binary SVMs (1 v rest, 2 v rest, ...). To predict, evaluate w^T x + b for each classifier and pick the largest value.
Multiclass SVM by solving ONE optimization problem
Crammer, K., & Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2, 265-292.
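Both prediction schemes can be sketched with placeholder decision functions standing in for trained binary SVMs:

```python
from itertools import combinations
from collections import Counter

def one_vs_one_predict(decisions, k):
    # decisions[(a, b)] -> the winning class (a or b) from that pair's SVM;
    # the class collecting the most pairwise votes wins the poll.
    votes = Counter(decisions[(a, b)] for a, b in combinations(range(k), 2))
    return votes.most_common(1)[0][0]

def one_vs_all_predict(scores):
    # scores[c] = w_c^T x + b_c from the "class c vs rest" SVM; pick the largest.
    return max(range(len(scores)), key=lambda c: scores[c])

# Toy example with k = 3 classes: class 2 wins every pairwise contest.
decisions = {(0, 1): 1, (0, 2): 2, (1, 2): 2}
print(one_vs_one_predict(decisions, k=3))    # 2
print(one_vs_all_predict([-0.3, 0.1, 0.8]))  # 2
```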
21
Image Classification by SVM: Process
Raw images → formatted vectors, one line per image (label followed by index:value feature pairs), e.g.
1 0:49 1:25 …
1 0:49 1:25 …
2 0:49 1:25 …
Split the data: 3/4 for training, 1/4 for testing.
On the training data, run K-fold cross validation (K = 6) to select the best C, then train the SVM with that C.
Evaluate the accuracy on the test data.
22
Image Classification by SVM: Results
Run the multiclass SVM 100 times for both kernels (linear and Gaussian).
[Figure: accuracy histogram]
23
Image Classification by SVM
What happens if we present object data the machine has never seen before?
24
~ Thank You ~Shao-Chuan Wang
CITI, Academia Sinica