1
Image Classification and Support Vector Machine
Shao-Chuan Wang, CITI, Academia Sinica
2
Outline (1/2)
- Quick review of SVM
  - Intuition
  - Functional margin and geometric margin
  - Optimal margin classifier
  - Generalized Lagrange multiplier method
  - Lagrangian duality
  - Kernels and feature mapping
  - Soft margin (L1 regularization)
3
Outline (2/2)
- Some basics of learning theory
  - Bias/variance tradeoff (underfitting vs. overfitting)
  - Chernoff bound and VC dimension
- Model selection
  - Cross validation
- Dimensionality reduction
- Multiclass SVM
  - One against one
  - One against all
- Image classification by SVM
  - Process
  - Results
4
Intuition: Margins
Training set S = \{(x^{(i)}, y^{(i)})\}_{i=1}^{m}, with labels y^{(i)} \in \{-1, 1\} and a separating hyperplane w^T x + b = 0.
Functional margin: \hat{\gamma}^{(i)} = y^{(i)}\left(w^T x^{(i)} + b\right). We feel more confident when the functional margin is larger.
Geometric margin: \gamma^{(i)} = y^{(i)}\left( (w/\|w\|)^T x^{(i)} + b/\|w\| \right), the signed distance from x^{(i)} to the hyperplane.
Note that scaling w and b won't change the plane.
Andrew Ng. Part V Support Vector Machines. CS229 Lecture Notes (2008).
5
Maximize margins
Optimization problem: maximize the minimal geometric margin under the constraints.
Introduce a scaling of (w, b) such that the functional margin equals 1; the problem then reduces to minimizing \|w\|^2 / 2.
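Written out (following the CS229 notes the slide cites), the reduction is:

```latex
% Maximize the minimal geometric margin:
\max_{\gamma, w, b}\ \gamma
\quad \text{s.t.}\quad y^{(i)}\!\left(w^T x^{(i)} + b\right) \ge \gamma,
\ \ \|w\| = 1
% Fix the scaling so that the functional margin is 1 (\hat{\gamma} = 1);
% maximizing \gamma = \hat{\gamma}/\|w\| then becomes:
\min_{w, b}\ \tfrac{1}{2}\|w\|^2
\quad \text{s.t.}\quad y^{(i)}\!\left(w^T x^{(i)} + b\right) \ge 1,
\quad i = 1, \dots, m
```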
6
Lagrange duality
Primal optimization problem:
Generalized Lagrangian:
Primal optimization problem (equivalent form):
Dual optimization problem:
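In symbols (standard form, as in the cited notes):

```latex
% Primal problem:
\min_{w}\ f(w)
\quad \text{s.t.}\quad g_i(w) \le 0,\ \ h_i(w) = 0
% Generalized Lagrangian:
\mathcal{L}(w, \alpha, \beta)
  = f(w) + \sum_{i} \alpha_i g_i(w) + \sum_{i} \beta_i h_i(w)
% Equivalent primal form, with optimal value p^*:
p^* = \min_{w}\ \max_{\alpha, \beta:\ \alpha_i \ge 0}\ \mathcal{L}(w, \alpha, \beta)
% Dual problem, with optimal value d^*; always d^* \le p^*:
d^* = \max_{\alpha, \beta:\ \alpha_i \ge 0}\ \min_{w}\ \mathcal{L}(w, \alpha, \beta)
```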
7
Dual Problem
Conditions under which equality (d* = p*) holds: f and the g_i are convex, the h_i are affine, and the KKT conditions are satisfied at the solution.
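The KKT conditions referred to here are:

```latex
% KKT conditions at an optimal (w^*, \alpha^*, \beta^*):
\frac{\partial}{\partial w_i} \mathcal{L}(w^*, \alpha^*, \beta^*) = 0
\qquad
\frac{\partial}{\partial \beta_i} \mathcal{L}(w^*, \alpha^*, \beta^*) = 0
% Complementary slackness:
\alpha_i^* \, g_i(w^*) = 0
% Feasibility and dual feasibility:
g_i(w^*) \le 0, \qquad \alpha_i^* \ge 0
```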
8
Optimal margin classifiers
Its Lagrangian
Its dual problem
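The three pieces, written out as in the CS229 notes:

```latex
% Primal:
\min_{w, b}\ \tfrac{1}{2}\|w\|^2
\quad \text{s.t.}\quad y^{(i)}\!\left(w^T x^{(i)} + b\right) \ge 1
% Lagrangian:
\mathcal{L}(w, b, \alpha) = \tfrac{1}{2}\|w\|^2
  - \sum_{i=1}^{m} \alpha_i \left[ y^{(i)}\!\left(w^T x^{(i)} + b\right) - 1 \right]
% Dual:
\max_{\alpha}\ W(\alpha) = \sum_{i=1}^{m} \alpha_i
  - \tfrac{1}{2} \sum_{i, j = 1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j
    \left\langle x^{(i)}, x^{(j)} \right\rangle
\quad \text{s.t.}\quad \alpha_i \ge 0,\ \ \sum_{i=1}^{m} \alpha_i y^{(i)} = 0
```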
9
Kernel and feature mapping
Kernel: K(x, z) = \phi(x)^T \phi(z)
- Positive semi-definite
- Symmetric
Loose intuition: a "similarity" measure between features.
For example, K(x, z) = (x^T z)^2; for x \in R^3 the corresponding feature map is
\phi(x) = (x_1 x_1,\ x_1 x_2,\ x_1 x_3,\ x_2 x_1,\ x_2 x_2,\ x_2 x_3,\ x_3 x_1,\ x_3 x_2,\ x_3 x_3)^T
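As a quick numeric check of the example kernel (plain NumPy, not from the slides), this sketch confirms that K(x, z) = (x^T z)^2 agrees with the explicit 9-dimensional feature map φ:

```python
import numpy as np

def phi(x):
    # Explicit feature map for the kernel (x^T z)^2 in 3-D:
    # all pairwise products x_i * x_j, giving 9 features.
    return np.array([x[i] * x[j] for i in range(3) for j in range(3)])

def K(x, z):
    # Kernel evaluation: O(n) work instead of O(n^2) features.
    return float(np.dot(x, z) ** 2)

x = np.array([1.0, 2.0, 3.0])
z = np.array([4.0, 5.0, 6.0])

lhs = K(x, z)
rhs = float(np.dot(phi(x), phi(z)))
print(lhs, rhs)  # both 1024.0, since x . z = 32 and 32**2 = 1024
```

This is the usual "kernel trick" payoff: the kernel computes the same inner product without ever materializing the 9-dimensional vectors.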
10
Soft Margin (L1 regularization)
C = ∞ recovers the hard-margin SVM; Rychetsky (2001).
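The soft-margin primal problem the slide refers to:

```latex
% L1-regularized (soft-margin) primal problem:
\min_{w, b, \xi}\ \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i
\quad \text{s.t.}\quad y^{(i)}\!\left(w^T x^{(i)} + b\right) \ge 1 - \xi_i,
\ \ \xi_i \ge 0
```

Each slack variable \xi_i allows example i to violate the margin at cost C\xi_i, so larger C penalizes violations more harshly.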
11
Why doesn’t my model fit well on test data ?
12
Some basics of learning theory
Bias/variance tradeoff: underfitting (high bias) vs. overfitting (high variance).
Training error: \hat{\varepsilon}(h) = \frac{1}{m} \sum_{i=1}^{m} 1\{ h(x^{(i)}) \neq y^{(i)} \}
Generalization error: \varepsilon(h) = P_{(x, y) \sim D}\left( h(x) \neq y \right)
Andrew Ng. Part VI Learning Theory. CS229 Lecture Notes (2008).
13
Bias/variance tradeoff
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer series in statistics. Springer, New York, 2001.
14
Is training error a good estimator of generalization error?
15
Chernoff bound (|H| = finite)
Lemma: Assume Z_1, Z_2, ..., Z_m are drawn iid from Bernoulli(φ), let
\hat{\phi} = \frac{1}{m} \sum_{i=1}^{m} Z_i,
and let γ > 0 be fixed. Then
P\left( |\phi - \hat{\phi}| > \gamma \right) \le 2 \exp(-2 \gamma^2 m)
Based on this lemma, one can show that with probability 1 − δ (k = # of hypotheses),
\varepsilon(h) \le \hat{\varepsilon}(h) + \sqrt{ \frac{1}{2m} \log \frac{2k}{\delta} }
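Solving the bound for the sample size gives a handy rule of thumb: to make the gap at most γ for all k hypotheses with probability 1 − δ, it suffices that m ≥ (1 / 2γ²) log(2k/δ). A small illustrative calculator (the particular k, γ, δ below are made up):

```python
import math

def sample_complexity(k, gamma, delta):
    # Smallest m such that, by the Hoeffding/Chernoff bound above,
    # |eps(h) - eps_hat(h)| <= gamma holds for all k hypotheses
    # simultaneously with probability at least 1 - delta.
    return math.ceil((1.0 / (2.0 * gamma ** 2)) * math.log(2.0 * k / delta))

m = sample_complexity(k=10000, gamma=0.05, delta=0.01)
print(m)  # 2902
```

Note the logarithmic dependence on k: squaring the number of hypotheses only doubles one additive term.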
Andrew Ng. Part VI Learning Theory. CS229 Lecture Notes (2008).
16
Chernoff bound (|H| = infinite)
VC dimension d: the size of the largest set that H can shatter.
e.g. H = linear classifiers in 2-D → VC(H) = 3
With probability at least 1 − δ,
\varepsilon(h) \le \hat{\varepsilon}(h) + O\left( \sqrt{ \frac{d}{m} \log \frac{m}{d} + \frac{1}{m} \log \frac{1}{\delta} } \right)
17
Model Selection
Cross validation: an estimator of generalization error.
K-fold: train on k − 1 pieces, test on the remaining piece (this yields one test-error estimate). Repeating over all k folds and averaging the k estimates, say to 2%, gives the estimated generalization error for this learner.
Leave-one-out cross validation (m-fold, m = training sample size).
[Figure: one round of k-fold cross validation, with one validation fold among the training folds]
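The k-fold procedure can be sketched in a few lines; the learner below is a stand-in (the slides use an SVM), so only the splitting and averaging logic is meant literally:

```python
import numpy as np

def k_fold_error(X, y, k, train_fn, predict_fn):
    # Split indices into k folds; each fold serves once as the validation set.
    m = len(y)
    folds = np.array_split(np.arange(m), k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(X[train], y[train])
        pred = predict_fn(model, X[val])
        errors.append(np.mean(pred != y[val]))  # one test-error estimate
    return float(np.mean(errors))  # averaged estimate of generalization error

# Stand-in learner: always predict the majority class of the training folds.
train_majority = lambda X, y: int(np.round(np.mean(y)))
predict_majority = lambda model, X: np.full(len(X), model)

X = np.zeros((12, 2))
y = np.array([1] * 9 + [0] * 3)
err = k_fold_error(X, y, k=6, train_fn=train_majority, predict_fn=predict_majority)
print(err)
```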
18
Model Selection
Loop over the candidate parameters:
- Pick one parameter setting, e.g. C = 2.0
- Do cross validation to get an error estimate
Pick C_best (the setting with the minimal error estimate) as the parameter.
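The parameter loop above, as a sketch; `toy_cv_error` is a made-up stand-in for the cross-validation error estimate described on the previous slide:

```python
import math

def pick_best_C(candidates, cv_error):
    # Try each candidate C, keep the one with the smallest CV error estimate.
    best_C, best_err = None, float("inf")
    for C in candidates:
        err = cv_error(C)
        if err < best_err:
            best_C, best_err = C, err
    return best_C, best_err

# Toy error curve with its minimum at C = 2.0 (illustrative only).
toy_cv_error = lambda C: (math.log2(C) - 1.0) ** 2 + 0.02

best_C, best_err = pick_best_C([0.5, 1.0, 2.0, 4.0, 8.0], toy_cv_error)
print(best_C, best_err)
```

In practice the candidates are usually spaced on a log grid (0.5, 1, 2, 4, ...), as in the list above.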
19
Dimensionality Reduction
Which features are more "important"?
Wrapper model feature selection
- Forward/backward search: add/remove one feature at a time, then evaluate the model with the new feature set.
Filter feature selection
- Compute a score S(i) that measures how informative x_i is about the class label y.
- S(i) can be the correlation Corr(x_i, y), the mutual information MI(x_i, y), etc.
Principal Component Analysis (PCA)
Vector Quantization (VQ)
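A minimal sketch of the filter approach with a correlation score; the data here is synthetic, purely for illustration:

```python
import numpy as np

def correlation_scores(X, y):
    # Score each feature x_i by |Corr(x_i, y)|; higher = more informative.
    return np.array([abs(np.corrcoef(X[:, i], y)[0, 1])
                     for i in range(X.shape[1])])

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200).astype(float)
informative = y + 0.1 * rng.standard_normal(200)  # tracks the label
noise = rng.standard_normal(200)                  # unrelated to the label
X = np.column_stack([noise, informative])

scores = correlation_scores(X, y)
ranking = np.argsort(-scores)  # best-scoring feature first
print(ranking)  # the informative feature (index 1) ranks first
```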
20
Multiclass SVM
One against one
- There are k(k − 1)/2 binary SVMs (1v2, 1v3, ...). To predict, each SVM votes between its 2 classes, and the class with the most votes wins the poll.
- e.g. with k = 6 classes:
  class: 1 2 3 4 5 6
  votes: 1 3 5 3 2 1 → predicted class = 3
One against all
- There are k binary SVMs (1 v rest, 2 v rest, ...). To predict, evaluate w^T x + b for each classifier and pick the largest value.
Multiclass SVM by solving ONE optimization problem
Crammer, K., & Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2, 265-292.
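Both prediction schemes can be sketched with placeholder decision functions standing in for trained binary SVMs:

```python
from itertools import combinations
from collections import Counter

def one_vs_one_predict(decisions, k):
    # decisions[(a, b)] -> the winning class (a or b) from that pair's SVM;
    # the class collecting the most pairwise votes wins the poll.
    votes = Counter(decisions[(a, b)] for a, b in combinations(range(k), 2))
    return votes.most_common(1)[0][0]

def one_vs_all_predict(scores):
    # scores[c] = w_c^T x + b_c from the "class c vs rest" SVM; pick the largest.
    return max(range(len(scores)), key=lambda c: scores[c])

# Toy example with k = 3 classes: class 2 wins every pairwise contest.
decisions = {(0, 1): 1, (0, 2): 2, (1, 2): 2}
print(one_vs_one_predict(decisions, k=3))    # 2
print(one_vs_all_predict([-0.3, 0.1, 0.8]))  # 2
```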
21
Image Classification by SVM: Process
Raw images → formatted vectors, one line per image (label followed by index:value feature pairs), e.g.
1 0:49 1:25 …
1 0:49 1:25 …
2 0:49 1:25 …
Split the data: 3/4 for training, 1/4 for testing.
On the training data, run K-fold cross validation (K = 6) to select the best C, then train the SVM with that C.
Evaluate the accuracy on the test data.
22
Image Classification by SVM: Results
Run the multiclass SVM 100 times for both kernels (linear and Gaussian).
[Figure: accuracy histogram]
23
Image Classification by SVM
What happens if we present object data the machine has never seen before?
24
~ Thank You ~Shao-Chuan Wang
CITI, Academia Sinica