Support vector machines

Jianxin Wu

LAMDA Group

National Key Lab for Novel Software Technology

Nanjing University, China

[email protected]

February 11, 2020

Contents

1 The key SVM idea
   1.1 Simplify it, simplify it, simplify it!
   1.2 Finding a max (or large) margin classifier

2 Visualizing and calculating the margin
   2.1 Visualizing the geometry
   2.2 Calculating the margin as an optimization

3 Maximizing the margin
   3.1 The formalization
   3.2 The simplifications

4 The optimization and the solution
   4.1 The Lagrangian and the KKT conditions
   4.2 The dual SVM formulation
   4.3 The optimal b value and support vectors
   4.4 Looking at the primal and dual simultaneously

5 Nonlinear and multiclass extensions
   5.1 Linear classifiers for non-separable problems
   5.2 Multiclass SVMs

6 Kernel SVMs
   6.1 The kernel trick
   6.2 Mercer's condition and feature mapping
   6.3 Popular kernels and the hyper-parameters
   6.4 SVM complexity, tradeoff, and more

Exercises

Support Vector Machines (SVMs) are a family of widely applied classification methods that have exhibited excellent accuracy in various classification problems. Complex mathematics is involved in various aspects of SVMs—e.g., proofs of their generalization bounds, their optimization, and designing and proving the validity of various nonlinear kernels. We will, however, not focus on the mathematics. The main purpose of this chapter is to introduce how the ideas in SVMs are formulated, why they are reasonable, and how various simplifications are useful in shaping the SVM primal form.

We will not touch on any generalization bounds of SVMs, although this is an area that has attracted intensive research efforts. We will not talk about how the optimization problem in SVMs can be solved or approximately solved (for efficiency and scalability). These choices enable us to focus on the key ideas that lead to SVMs and the strategies in SVMs that may be helpful in other domains. We encourage readers to also pay attention to these aspects of SVMs while reading this chapter.

1 The key SVM idea

How shall we accomplish a classification task? In Chapter 4, we discussed the Bayes decision theory, which suggests that we can estimate the probability distribution Pr(x, y) or the density function p(x, y) using the training data. With these distributions, the Bayes decision theory will guide us on how to classify a new example. These types of probabilistic methods are called generative methods. An alternative type of method is the discriminative method, which directly estimates the probability Pr(y|x) or p(y|x).

In SVMs, we do not model either the generative or the discriminative distributions based on the training data. Instead, the SVM seeks to directly find the best classification boundary that divides the domain of x into different regions. Examples falling in one region all belong to one category, and different regions correspond to different categories or classes.

The rationale behind this choice is that estimating probability distributions or densities is a difficult task, which might be even more difficult than the classification task itself, especially when there are only a few training examples.¹

In other words, classification based on density estimation might be taking a detour. Then, why not directly estimate the classification boundaries? As illustrated by Figure 2, it might be a much easier task to estimate the boundary in some scenarios.

Then, the natural question is: which classification boundary is considered good or even the best?

¹ We will discuss density estimation in the next chapter.


1.1 Simplify it, simplify it, simplify it!

One key to answering this question is actually to simplify the problem, in order to find out in which scenarios it will be easy to determine a good classification boundary. The heading says "simplify it" three times—this is not only because it is important (so it must be repeated three times), but also because we are making three simplifying assumptions.

Given datasets (or classification problems) such as those appearing in Figure 1, it is difficult to figure out which classifier is good. In Figure 1a, the two classes (whose examples are denoted by black squares and red circles, respectively) cannot be classified by any linear boundary. Using a complex curve (i.e., a nonlinear boundary), we can separate these two classes with zero training error. However, it is not obvious either how to compare two complex nonlinear classification boundaries. Hence, we make the first two assumptions:

• linear boundary (or, linear classifier); and,

• separable (i.e., linear classifiers can achieve 100% accuracy on the training set).

These two are usually combined to require a linearly separable problem.

The problem in Figure 1b is in fact linearly separable: the four classes can be divided into non-overlapping regions by linear boundaries—e.g., the three lines that enclose the red class examples. However, it is not easy to judge whether one set of triple-lines is better than another triplet either. And, there are many other possible linear boundaries beyond triplets enclosing red examples. Hence, we make the third assumption:

• The problem is binary.

Putting all assumptions together, we start by only considering linearly separable binary classification problems.

We want to emphasize that these simplifications are reasonable—i.e., they will not change the essence of the classification problem. As we will introduce in this chapter, all these assumptions will be relaxed and taken care of in SVMs. Hence, reasonable simplifying assumptions help us find ideas and good solutions to a complex problem, and we have the option to re-consider these assumptions later such that our method has the potential to solve complex tasks, too. However, if an assumption changes some basic characteristics of our problem, it is better avoided.

1.2 Finding a max (or large) margin classifier

Figure 2a shows a linearly separable binary problem. It is clear that any line between the cluster of red circles and the cluster of black squares can separate them perfectly. In Figure 2a, three example boundaries are illustrated: the blue solid line, the red dashed line, and the black dotted line.

Figure 1: Illustration of complex classification problems. (a) Inseparable data are messy. (b) Multiclass data are complex.

Figure 2: Illustration of the large margin idea. (a) Large vs. small margin boundaries. (b) A small margin can cause problems.

Furthermore, it becomes easier to determine which classifier is better (or the best) in this linearly separable binary example. Most people will agree that the classification boundary determined by the blue solid line is better than the other two. The other two boundaries are too close to the examples, leaving very small (almost no) room for variations or noise. The examples in Figures 2a and 2b are generated using the same underlying distribution and contain the same number of training examples. However, because of the randomness, two red circles appear on the right side of the black dotted line, and one black square appears below the red dashed line in Figure 2b. These examples are errors for the two boundaries, respectively.

The blue solid line, however, correctly classifies all examples in Figure 2b. The reason for this robustness is that this boundary is far away from all training examples; hence, when variations or noise are in effect, this classifier has the margin to accommodate these changes, since the scale of such changes is usually small. As shown in Figure 2, the margin of a classifier is the distance from it to the training examples that are closest to it.

Figure 3: Illustration of projection, margin, and normal vector.

Hence, the margin of one example (with respect to a classification boundary) is the distance from that point to the boundary. Naturally, the margin of a dataset is the minimum of the margins of all its samples.

Because a large margin is beneficial for classification, some classifiers directly maximize the margin; these are called max-margin classifiers. Some classifiers seek a compromise between a large margin and other properties, and are called large margin classifiers. The SVM is a max-margin classifier.

And, of course, the next task after having had the max-margin idea is: how shall we translate the max-margin idea into operational procedures? We need to first formalize the max-margin idea.

2 Visualizing and calculating the margin

Given a linear classification boundary and a point x, how do we calculate the margin of x? We know that in a d-dimensional space, a linear classification boundary is a hyperplane. An illustration such as Figure 3 is helpful.


2.1 Visualizing the geometry

We have two ways to specify a vector x in an illustration. One way is to draw an arrow from the origin (O) to the coordinates specified by the elements of x. We use this way to draw x, x⊥, and x_0 in Figure 3 (denoted by dashed arrows, but the arrow for x_0 is omitted to preserve the illustration's clarity). Another way is to use two end points to specify a vector, such that the vector is the difference between the two points. For example, z = x − x⊥ (denoted by the black solid arrow) and two vectors involving x_0 (denoted by dotted arrows). w and −w are also specified in this way.

All points on a hyperplane in the d-dimensional space are specified by an equation

$$f(x) = w^T x + b = 0,$$

where w determines the direction of the hyperplane. The direction of w (i.e., w/‖w‖) is called the normal vector of the hyperplane. A hyperplane is perpendicular to its normal vector. In other words, if x_1 and x_2 are two different points on the hyperplane (i.e., satisfying f(x_1) = f(x_2) = 0 and x_1 ≠ x_2), we always have

$$w^T (x_1 - x_2) = 0.$$

If x is on the hyperplane, then f(x) = 0; if x is not on the hyperplane but is in the same direction as w with respect to the hyperplane, then f(x) > 0; and f(x) < 0 holds for points in the opposite direction (cf. the red solid arrows).

In Figure 3, x⊥ is the projection of x onto the hyperplane. That is, x is decomposed as

$$x = x_\perp + z.$$

Given any point x_0 (≠ x⊥) on the hyperplane, we always have

$$z \perp (x_\perp - x_0).$$

However, x⊥ is not perpendicular to z if b ≠ 0. This geometry tells us that ‖z‖ is the distance we are looking for. Hence, a visualization with proper notations helps us translate the description "distance" into precise mathematics.

2.2 Calculating the margin as an optimization

What is even better, the visualization also hints at how to calculate the distance. Because z ⊥ (x⊥ − x_0), we know ‖z‖ ≤ ‖x − x_0‖ for any x_0 on the hyperplane, because of the Pythagorean theorem, which also holds in a space R^d whose dimensionality is higher than 2 (d > 2). That is, x⊥ is the solution of the following optimization problem:

$$\arg\min_{y} \; \|x - y\|^2 \quad (1)$$
$$\text{s.t. } f(y) = 0. \quad (2)$$


This is an optimization problem with one equality constraint. The Lagrangian is

$$L(y, \lambda) = \|x - y\|^2 - \lambda (w^T y + b).$$

Setting ∂L/∂y = 0 gives 2(y − x) = λw. That is,

$$y = x + \frac{\lambda}{2} w.$$

Plugging it into f(y) = 0, we get (λ/2) w^T w + f(x) = 0, hence

$$\lambda = \frac{-2 f(x)}{w^T w}.$$

Then, the projection is

$$x_\perp = x + \frac{\lambda}{2} w = x - \frac{f(x)}{w^T w} w.$$

Now, it is easy to obtain

$$z = x - x_\perp = \frac{f(x)}{w^T w} w,$$

hence, the distance (and the margin of x) is

$$\left\| \frac{f(x)}{w^T w} w \right\| = \left| \frac{f(x)}{\|w\|^2} \right| \|w\| = \frac{|f(x)|}{\|w\|}. \quad (3)$$

There are simpler ways to find the margin than what is presented here. However, in this section we familiarized ourselves with the geometry useful for SVMs, found the actual value of x⊥, exercised the process of formalization and optimization, and applied the method of Lagrange multipliers. These are all fruitful outcomes.
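As a quick sanity check of Equation 3, the following minimal sketch (assuming NumPy; the function and variable names are my own illustrations, not from the chapter) computes the distance |f(x)|/‖w‖ and the projection x⊥ for an arbitrary hyperplane:

import numpy as np

def hyperplane_margin(w, b, x):
    """Distance from x to the hyperplane w^T x + b = 0, and the projection of x onto it."""
    f_x = np.dot(w, x) + b                    # f(x) = w^T x + b
    distance = abs(f_x) / np.linalg.norm(w)   # |f(x)| / ||w||, Equation 3
    x_perp = x - (f_x / np.dot(w, w)) * w     # x_perp = x - f(x) w / (w^T w)
    return distance, x_perp

# Example: the hyperplane x1 + x2 - 1 = 0 and the point (1, 1)
dist, x_perp = hyperplane_margin(np.array([1.0, 1.0]), -1.0, np.array([1.0, 1.0]))
print(dist, x_perp)   # about 0.707, and the projection (0.5, 0.5)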

3 Maximizing the margin

The SVM tries to maximize the margin of the dataset—i.e., the smallest margin of all training points.

3.1 The formalization

Let us denote the training set as {(x_i, y_i)}_{i=1}^{n}, where x_i ∈ R^d is an example and y_i ∈ Y. For a binary problem, Y = {1, 2}. To maximize the margin of this dataset with respect to a linear boundary f(x) = w^T x + b means to solve the following optimization problem:

$$\max_{w,b} \; \min_{1 \le i \le n} \frac{|f(x_i)|}{\|w\|}, \quad (4)$$


with the additional constraint that f(x) can classify all training examples correctly. Hence, we also need to mathematically describe "correctly classified."

This is quite simple if we change the definition of Y slightly. If we set Y = {+1, −1}, the meaning of it will not change—the two numbers refer to the labels of the two classes, and their specific values are not relevant (e.g., 2 vs. −1), so long as the two values are different. However, with this slight change, the following statement holds:

f(x) correctly classifies all training examples iff y_i f(x_i) > 0 for all 1 ≤ i ≤ n,

in which "iff" means if and only if. When y_i = 1 (i.e., a positive example), y_i f(x_i) > 0 means f(x_i) > 0, thus the prediction is also positive; when y_i = −1, f(x_i) < 0 and it predicts x_i as negative.

The SVM then optimizes the following problem for a binary linearly separable dataset:

$$\max_{w,b} \; \min_{1 \le i \le n} \frac{|f(x_i)|}{\|w\|} \quad (5)$$
$$\text{s.t. } y_i f(x_i) > 0, \; 1 \le i \le n. \quad (6)$$

3.2 The simplifications

The constraints can be handled by a Lagrangian. However, the objective contains a fraction, an absolute value, a vector norm, and the maximization of a minimum, all of which are hostile to optimization. Fortunately, by looking at this objective more carefully, we can find ways to avoid all these difficulties.

Our assumption ensures that y_i f(x_i) > 0 for all 1 ≤ i ≤ n—that is, we always have

$$y_i f(x_i) = |f(x_i)|$$

because |y_i| = 1. Hence, the objective can be rewritten as

$$\min_{1 \le i \le n} \frac{|f(x_i)|}{\|w\|} = \min_{1 \le i \le n} \frac{y_i f(x_i)}{\|w\|} \quad (7)$$
$$= \frac{1}{\|w\|} \min_{1 \le i \le n} \left( y_i (w^T x_i + b) \right). \quad (8)$$

This type of objective (as a ratio) has appeared many times (e.g., in PCA). If (w⋆, b⋆) is a maximizer of the above objective, so will (cw⋆, cb⋆) be for any nonzero constant c ∈ R. However, since this is a constrained optimization problem, c ≤ 0 will make all the constraints invalid. When c > 0, (cw⋆, cb⋆) will not change the objective, and all constraints remain satisfied. That is, we have the freedom to choose any c > 0. In the past, we have chosen a c such that ‖w‖ = 1. If we use this same assumption, the optimization becomes

$$\max_{w,b} \; \min_{1 \le i \le n} y_i f(x_i) \quad (9)$$
$$\text{s.t. } y_i f(x_i) > 0, \; 1 \le i \le n, \quad (10)$$
$$w^T w = 1. \quad (11)$$

The objective is still a maximization of a minimum. And, this optimization problem is still difficult to solve.

However, note that y_i f(x_i) appears in both the objective and the constraints; there is a clever trick to further simplify it. If (w⋆, b⋆) is a maximizer of the original objective min_{1≤i≤n} |f(x_i)|/‖w‖, we choose

$$c = \min_{1 \le i \le n} y_i \left( (w^\star)^T x_i + b^\star \right).$$

Obviously c > 0, hence (1/c)(w⋆, b⋆) is also a maximizer of the original problem. Let us consider this particular choice of c, which leads to

$$\min_{1 \le i \le n} y_i \left( \left( \tfrac{1}{c} w^\star \right)^T x_i + \tfrac{1}{c} b^\star \right) = 1 > 0.$$

Hence, we can add the following constraint to our optimization problem without changing the optimal objective value:

$$\min_{1 \le i \le n} y_i (w^T x_i + b) = 1, \quad (12)$$

or, equivalently,

$$y_i (w^T x_i + b) \ge 1 \; (> 0), \quad (13)$$

for all 1 ≤ i ≤ n—which means all constraints in the original problem are automatically satisfied!

To be precise, min_{1≤i≤n} y_i(w^T x_i + b) = 1 is not equivalent to y_i(w^T x_i + b) ≥ 1 for all i. When y_i(w^T x_i + b) ≥ 1, it is possible that min_{1≤i≤n} y_i(w^T x_i + b) > 1; for example, it is possible that min_{1≤i≤n} y_i(w^T x_i + b) = 2. However, since in an SVM the objective is to minimize (1/2) w^T w, the constraints y_i(w^T x_i + b) ≥ 1 altogether imply min_{1≤i≤n} y_i(w^T x_i + b) = 1 in the optimal solution. The proof of this fact is pretty simple, and is left to the reader.

In other words, we can convert the original problem into the following equivalent one:

$$\max_{w,b} \; \frac{\min_{1 \le i \le n} y_i (w^T x_i + b)}{\|w\|} = \frac{1}{\|w\|} \quad (14)$$
$$\text{s.t. } y_i f(x_i) \ge 1, \; 1 \le i \le n. \quad (15)$$

A final step is to change the maximization of 1/‖w‖ to the minimization of ‖w‖, and furthermore (equivalently) to the minimization of w^T w and (1/2) w^T w, which leads to

$$\min_{w,b} \; \frac{1}{2} w^T w \quad (16)$$
$$\text{s.t. } y_i f(x_i) \ge 1, \; 1 \le i \le n. \quad (17)$$

The additional factor 1/2 will make later derivations easier.

Through this series of transformations and simplifications, we obtain a series of equivalent problems, and hence the same optimal objective value as that of the original problem. The new constraints are very similar to the original ones (and are handled by the method of Lagrange multipliers in the same manner). However, the objective is now a quadratic form ((1/2) w^T w), which is very amenable to optimization.

We will work (a little bit) on the optimization of Equations 16–17, which are called the primal form of the SVM formulation, or simply the primal SVM.
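The primal form is a standard quadratic program. As a minimal sketch (my own illustration, assuming the CVXPY package and a small linearly separable toy dataset; production SVM solvers use specialized algorithms instead), Equations 16–17 could be solved as follows:

import cvxpy as cp
import numpy as np

# Toy linearly separable data with labels +1 / -1
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n, d = X.shape

w = cp.Variable(d)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))        # (1/2) w^T w, Equation 16
constraints = [cp.multiply(y, X @ w + b) >= 1]          # y_i (w^T x_i + b) >= 1, Equation 17
cp.Problem(objective, constraints).solve()
print(w.value, b.value)                                 # the max-margin hyperplane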

4 The optimization and the solution

We use the method of Lagrange multipliers to handle the n inequality constraints and state the necessary and sufficient conditions for the minimizers of the primal SVM. The proof is beyond the scope of this book and is thus omitted. With these conditions, the primal form turns into an equivalent dual form.

4.1 The Lagrangian and the KKT conditions

We still define one Lagrange multiplier α_i for each constraint, and the constraints y_i f(x_i) ≥ 1 are rewritten as y_i f(x_i) − 1 ≥ 0. The Lagrangian is

$$L(w, b, \alpha) = \frac{1}{2} w^T w - \sum_{i=1}^{n} \alpha_i \left( y_i (w^T x_i + b) - 1 \right) \quad (18)$$
$$\text{s.t. } \alpha_i \ge 0, \; 1 \le i \le n, \quad (19)$$

in which α = (α_1, α_2, …, α_n)^T is the vector of Lagrange multipliers.

For inequality constraints, however, the multipliers are not free any more—they are required to be non-negative. If one constraint is violated—i.e., if y_i f(x_i) − 1 < 0 for some i—we have −α_i (y_i (w^T x_i + b) − 1) > 0, meaning a punishment term is added to L(w, b, α). Hence, setting α_i ≥ 0 for all i somehow enforces the i-th constraint. Also, if

$$\alpha_i \left( y_i (w^T x_i + b) - 1 \right) = 0 \quad (20)$$

for all 1 ≤ i ≤ n, then L(w, b, α) = (1/2) w^T w matches the primal objective. Later, we will show that the optimality conditions indeed specify α_i (y_i (w^T x_i + b) − 1) = 0 for all 1 ≤ i ≤ n.

As usual, we compute the gradients and set them to 0 (the zero vector) or 0, respectively:

$$\frac{\partial L}{\partial w} = 0 \implies w = \sum_{i=1}^{n} \alpha_i y_i x_i, \quad (21)$$
$$\frac{\partial L}{\partial b} = 0 \implies \sum_{i=1}^{n} \alpha_i y_i = 0. \quad (22)$$

The above three equality conditions, the original inequality constraints, and the constraints on the Lagrange multipliers form the Karush–Kuhn–Tucker (KKT) conditions for the primal SVM optimization problem:

$$w = \sum_{i=1}^{n} \alpha_i y_i x_i \quad (23)$$
$$\sum_{i=1}^{n} \alpha_i y_i = 0 \quad (24)$$
$$\alpha_i \left( y_i (w^T x_i + b) - 1 \right) = 0, \quad i = 1, 2, \ldots, n \quad (25)$$
$$\alpha_i \ge 0, \quad i = 1, 2, \ldots, n \quad (26)$$
$$y_i (w^T x_i + b) \ge 1, \quad i = 1, 2, \ldots, n. \quad (27)$$

These KKT conditions are not necessary and sufficient in general. However, for the primal SVM problem, these conditions are both necessary and sufficient to specify optimal solutions.

Equation 23 states that the optimal solution w can be represented by a weighted combination of the training examples; the signs of the weights are determined by the labels of the examples, and the magnitudes of these weights are the Lagrange multipliers. In more general large margin learning, the representer theorem ensures this weighted average representation is still valid in many other situations—e.g., in kernel SVMs.

Equation 24 states that these weights are balanced between the two classes. If we sum the Lagrange multipliers for all positive training examples, this must equal the sum of weights for all negative training examples. Because these Lagrange multipliers can be considered as weights for positive and negative examples, one way to interpret Equation 24 is: an SVM has an internal mechanism to balance weights between the positive and negative examples, which might be useful in handling imbalanced datasets so long as the level of imbalance is not large.

Equation 25 is called the complementary slackness property, which is important in SVMs, and we will soon discuss this property in more depth. The other two conditions are the non-negativity constraints for the Lagrange multipliers and the original constraints.

4.2 The dual SVM formulation

Equations 23 and 24 allow us to remove the original parameters (w, b) from the Lagrangian. Note that

$$\frac{1}{2} w^T w = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j, \quad (28)$$
$$\sum_{i=1}^{n} \alpha_i y_i w^T x_i = w^T w = \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j, \quad (29)$$
$$\sum_{i=1}^{n} \alpha_i y_i b = \left( \sum_{i=1}^{n} \alpha_i y_i \right) b = 0, \quad (30)$$

when the KKT conditions hold. Hence, the Lagrangian becomes

$$-\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j + \sum_{i=1}^{n} \alpha_i, \quad (31)$$

which does not involve w or b anymore. Putting the constraints on α_i in, we get

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j \quad (32)$$
$$\text{s.t. } \alpha_i \ge 0, \; i = 1, 2, \ldots, n, \quad (33)$$
$$\sum_{i=1}^{n} \alpha_i y_i = 0, \quad (34)$$

which is called the dual SVM formulation, or simply the dual.

In more detail, the above choices for w and b lead to

$$g(\alpha) = \inf_{(w,b)} L(w, b, \alpha),$$

in which inf means the infimum (i.e., greatest lower bound). g is called the Lagrange dual function, and it is always concave. The maximum of g with respect to α is always smaller than or equal to the minimum of the original (primal) problem, and the difference between them is called the duality gap. When the duality gap is 0, we can maximize the dual instead of solving the original minimization problem.

One notable difference between the dual and the primal SVM is that in the dual, the training data never appear alone (as they do in the primal); they always appear in pairs, as dot-products x_i^T x_j.

In general, the optimal objective values of the primal and dual forms of an optimization problem are not equal (i.e., there exists a duality gap). In SVMs, the duality gap is 0—the dual and the primal forms will lead to the same optimal value. We will not talk about the optimization techniques in this chapter, which are beyond the scope of this book. However, many high-quality optimization toolboxes (including those specifically designed for SVMs) are available to solve either the primal or the dual SVM optimization.

One can solve the SVM optimization using the primal form and obtain (w⋆, b⋆) directly; one can also solve the dual form and obtain the optimal Lagrange multipliers α⋆, and Equation 23 will give us the optimal w⋆.
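For completeness, here is a minimal sketch of solving the dual (Equations 32–34) on the same toy data as before, again assuming CVXPY purely for illustration; the identity Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j = ‖Σ_i α_i y_i x_i‖² is used to express the quadratic term:

import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = X.shape[0]

alpha = cp.Variable(n)
quad = cp.sum_squares(X.T @ cp.multiply(alpha, y))      # sum_ij alpha_i alpha_j y_i y_j x_i^T x_j
objective = cp.Maximize(cp.sum(alpha) - 0.5 * quad)     # Equation 32
constraints = [alpha >= 0, y @ alpha == 0]              # Equations 33-34
cp.Problem(objective, constraints).solve()

w = X.T @ (alpha.value * y)                             # Equation 23 recovers the primal w
print(alpha.value, w)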


4.3 The optimal b value and support vectors

Obtaining the optimal value for b is a little bit more tricky in the dual form; it hinges on the complementary slackness property:

$$\alpha_i \left( y_i (w^T x_i + b) - 1 \right) = 0, \quad i = 1, 2, \ldots, n. \quad (35)$$

Based on complementary slackness, if we find one constraint whose corresponding Lagrange multiplier is positive (i.e., α_i > 0), then we must have y_i(w^T x_i + b) − 1 = 0. Hence,

$$b^\star = y_i - (w^\star)^T x_i \quad (36)$$

for this particular i. This solution is easy to derive. Because

$$y_i b = 1 - y_i w^T x_i \quad (37)$$

when α_i > 0, and noting that y_i² = 1, we can multiply both sides of Equation 37 by y_i and obtain Equation 36.

The optimal b value obtained in this manner may not be very reliable because some numerical errors can reduce its accuracy. Hence, we can also find all those examples whose corresponding α_i is non-zero, compute b from them individually, and then use their average as b⋆.
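A small sketch of this averaging step (assuming NumPy arrays X, y, and a dual solution alpha obtained from any solver; the tolerance is an illustrative choice):

import numpy as np

def recover_w_b(X, y, alpha, tol=1e-8):
    """Recover w via Equation 23 and average b over all support vectors (alpha_i > 0)."""
    w = X.T @ (alpha * y)               # w = sum_i alpha_i y_i x_i
    sv = alpha > tol                    # support vector indicator
    b = np.mean(y[sv] - X[sv] @ w)      # average of y_i - w^T x_i, cf. Equation 36
    return w, b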

Those examples that are useful for computing b⋆, namely those whose corresponding Lagrange multipliers are positive, are special in the training set. They are called the support vectors, and consequently the classification method is called the support vector machine.

When α_i > 0, we know y_i(w^T x_i + b) − 1 = 0, which means the i-th constraint has to be active—i.e., the margin of the training example x_i must be 1, as shown in Figure 4. In Figure 4, examples in the two classes are denoted by blue circles and orange squares, respectively; the black line is the classification boundary f(x) = 0, while the red lines correspond to f(x) = ±1—i.e., all examples whose margin y_i f(x_i) is 1 (remember that in the SVM formulation we assume y_i = ±1 and the smallest margin of all training examples is 1). The three examples (two filled circles and one filled square) are the support vectors, whose corresponding α_i > 0 and y_i f(x_i) = 1.

The classification boundary f(x) = w^T x + b is a line in linear SVMs. Hence, we can explicitly compute w = Σ_{i=1}^{n} α_i y_i x_i ∈ R^d, and consequently it is very efficient to compute the decision value f(x), which requires only one dot-product computation. The linear SVM is hence both accurate and efficient. However, the derivation of decision values for more complex SVM classifiers—e.g., nonlinear SVMs—requires a lot of computation.

The prediction (after the decision value f(x) is computed) is a simple task. As illustrated in Figure 4, the prediction is simply sign(f(x)), in which sign is the sign function:

$$\mathrm{sign}(x) = \begin{cases} +1 & \text{if } x > 0 \\ -1 & \text{if } x < 0 \\ 0 & \text{if } x = 0 \end{cases} \quad (38)$$

Figure 4: Illustration of support vectors.

When a test example x happens to have f(x) = 0, the above sign function returns 0, which is not suitable for our classification problem. We can output a prediction of either +1 or −1 in an ad-hoc manner; we can also randomly assign a label for these special (and rare) testing examples.

4.4 Looking at the primal and dual simultaneously

Since the primal and dual give the same answer in SVMs, we can look at them simultaneously. After the optimal primal and dual variables are obtained, they guarantee that the margin of all training examples is ≥ 1; that is, the distance between any positive and negative training example is ≥ 2. This observation provides a lower bound on the distance between examples from different classes.

The objective w^T w is equal to Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j x_i^T x_j. The primal and dual parameters in this section refer to the optimal ones; however, we omit the ⋆ superscript to make the notations simpler. Let us define some more notations:

$$\alpha_+ = \sum_{y_i = +1} \alpha_i, \quad (39)$$
$$\alpha_- = \sum_{y_i = -1} \alpha_i, \quad (40)$$
$$s_+ = \sum_{y_i = +1} \alpha_i x_i, \quad (41)$$
$$s_- = \sum_{y_i = -1} \alpha_i x_i, \quad (42)$$

which are the sums of Lagrange multipliers (weights) for all positive examples and all negative examples, and the weighted sums of all positive examples and all negative examples, respectively.

Equation 24 leads to

$$\alpha_+ = \alpha_-,$$

and Equation 23 shows

$$w = s_+ - s_-.$$

Hence, the primal objective is to minimize ‖s_+ − s_−‖², which is also equivalent to

$$\left\| \frac{s_+}{\alpha_+} - \frac{s_-}{\alpha_-} \right\|. \quad (43)$$

s_+/α_+ (s_−/α_−) is the weighted average of all positive (negative) examples. Hence, the distance between them must be ≥ 2, and the equality can only happen if α_i = 0 for all training examples whose margin is > 1.

If any training example x_i has a margin y_i f(x_i) > 1 and α_i > 0, then s_+/α_+ or s_−/α_− will not reside on the two lines whose margin is 1 (cf. the red lines in Figure 4), and the distance between them will be > 2.

Hence, in the linearly separable case, all support vectors reside on these two lines and are sparse (i.e., the number of support vectors is only a small fraction of the number of all training examples, as shown in Figure 4).

5 Extensions for linearly inseparable and multiclass problems

We have made three assumptions in the above derivation: linear classifier, separable, and binary problems. These assumptions (and restrictions) have to be relaxed now that we already have a solution for the simplified problem. In this section, we deal with the non-separable case (still with linear classifiers) and the extension to multiclass problems. Nonlinear classifiers will be discussed in the next section.

5.1 Linear classifiers for non-separable problems

In some binary problems, the training set is not linearly separable—i.e., there is no single line that can perfectly separate examples in the two classes. However, as in the example of Figure 1a, a linear classifier still seems the best model we can find using our eyes and brains. In other words, the positive and negative examples are roughly linearly separable.

In fact, the two classes in Figure 1a are generated by two Gaussians with the same covariance matrix and equal priors, and the optimal classification boundary (in the minimum cost sense) is indeed a linear one. This statement will become obvious after we discuss probabilistic methods in the next chapter.

The linear SVM classifier can be extended to handle such roughly linearly separable real-world problems, using a trick called slack variables.


The current constraint, y_i f(x_i) ≥ 1, means that not only is the i-th example correctly classified, but its distance to the linear boundary is also relatively far (margin > 1), leaving room to accommodate possible variations in test examples.

However, must we strictly maintain this constraint? Maybe we can achieve a much larger margin of the dataset² at the cost of only slightly violating this single constraint (e.g., y_i f(x_i) = 0.9 < 1). So long as y_i f(x_i) > 0 (e.g., 0.9), this particular training example x_i is still correctly classified, but larger margins are made possible for other examples.

At another extreme, there may be one outlier x_i in the training set, e.g., with y_i = +1 but located in a cluster of negative examples. If we want to strictly maintain the constraint associated with x_i, we must pay the price that the problem is not solvable or that the margins of other examples are greatly reduced. We need a mechanism to allow this outlier to be wrongly classified—i.e., to allow y_i f(x_i) ≤ 0 in a few cases.

These exceptions (either 0 < y_i f(x_i) < 1 or y_i f(x_i) ≤ 0), however, cannot happen too frequently. We also need a mechanism to specify that the total price (cost) we pay for such relatively rare or extreme cases is small. In SVMs, a slack variable ξ_i is introduced as the price we pay for x_i, and its associated constraint is changed to two constraints:

$$y_i f(x_i) \ge 1 - \xi_i, \quad (44)$$
$$\xi_i \ge 0. \quad (45)$$

There can be three possible cases for ξi:

• When ξ_i = 0 for an x_i, the original constraint is maintained, and no price is paid;

• When 0 < ξ_i ≤ 1, this example is still correctly classified, but the margin is 1 − ξ_i < 1, hence the price is ξ_i.

• When ξi > 1, this example is wrongly classified, and the price is still ξi.

Hence, the total price is $\sum_{i=1}^{n} \xi_i$.

To require the total price to be small, we just add this term to the (minimization) objective as a regularizer, and the primal linear SVM formulation becomes:

$$\min_{w,b} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \xi_i \quad (46)$$
$$\text{s.t. } y_i f(x_i) \ge 1 - \xi_i, \; 1 \le i \le n, \quad (47)$$
$$\xi_i \ge 0, \; 1 \le i \le n. \quad (48)$$

² In the SVM formulation, this is equivalent to reducing w^T w by a large amount while fixing the margin to 1.


In the above SVM formulation, a new symbol C is introduced. C > 0 is a scalar that determines the relative importance between a large margin (minimizing w^T w) and paying a small total price (Σ_{i=1}^{n} ξ_i). When C is large, the small total price part is more important; and when C → ∞, the margin requirement is completely ignored. When C < 1, the large margin requirement is emphasized; and when C → 0, no price is paid at all. Hence, in problems showing different properties, C values have to be tuned to seek the best accuracy.

Using almost the same procedure as in the linearly separable case, we obtainthe new dual form

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j \quad (49)$$
$$\text{s.t. } C \ge \alpha_i \ge 0, \; i = 1, 2, \ldots, n, \quad (50)$$
$$\sum_{i=1}^{n} \alpha_i y_i = 0. \quad (51)$$

Note that the only difference between this formulation and the one for linearly separable SVMs is in the constraints on the dual variables: α_i ≥ 0 is changed to 0 ≤ α_i ≤ C, which hardly increases the difficulty of solving the dual problem, but the capacity of our SVM formulation has been greatly extended.

We can also view the new objective from a different perspective. The total cost Σ_{i=1}^{n} ξ_i can be seen as the cost over the training set—i.e., the empirical cost (or empirical loss). The term w^T w can alternatively be viewed as a regularization term, which encourages linear classification boundaries that are less complex, because large components in w allow larger variations and are hence more complex.

Since ξ_i ≥ 0 and ξ_i ≥ 1 − y_i f(x_i) (based on Equations 44 and 45), we can combine them as

$$\xi_i = \left( 1 - y_i f(x_i) \right)_+, \quad (52)$$

in which x_+ is called the hinge loss for x: x_+ = x if x ≥ 0, and x_+ = 0 if x < 0. It is obvious that

$$x_+ = \max(0, x).$$

The hinge loss incurs no penalty if y_i f(x_i) ≥ 1—i.e., when the margin for x_i is no smaller than 1. Only when the margin is small (0 < y_i f(x_i) < 1) or x_i is wrongly classified (y_i f(x_i) < 0) is the loss 1 − y_i f(x_i) incurred. Using the hinge loss, we can rewrite the primal SVM as

$$\min_{w,b} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \xi_i, \quad \text{or} \quad (53)$$
$$\min_{w,b} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \left( 1 - y_i f(x_i) \right)_+. \quad (54)$$

In this formulation, the constraints are implicitly encoded in the hinge loss anddo not appear as constraints any more.
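As an illustration of Equation 54 (not the chapter's own algorithm), a plain batch subgradient-descent sketch in NumPy might look like the following; the learning rate and epoch count are arbitrary illustrative values:

import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Minimize (1/2) w^T w + C * sum_i max(0, 1 - y_i (w^T x_i + b)) by subgradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)                   # y_i f(x_i)
        viol = margins < 1                          # examples with nonzero hinge loss
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b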


One practical question is: how do we find the optimal value of this tradeoff parameter C? We will leave this to later sections.

5.2 Multiclass SVMs

There are two main strategies to handle multiclass problems in the SVM literature: one is to combine multiple binary SVMs to solve the multiclass one; the other is to extend the max-margin idea and its mathematics into multiclass using similar techniques, then solve the multiclass problem using a single optimization. We will introduce the first strategy. In fact, this strategy is not only valid for SVM classifiers, but also effective for extending many binary classification methods to the multiclass case.

The One-vs.-Rest (or one-versus-all, or OvA) method trains m binary SVM classifiers for an m-class problem. In training the i-th binary SVM, examples belonging to the i-th class are used as the positive class examples in the binary problem, and all other training examples are used as the negative class examples. Let us denote this binary SVM classifier as f_i(x). In the binary case, when an example is far away from the classification boundary (i.e., with a large margin), there is good reason to believe it will be more tolerant of variations in the distribution of examples. In other words, we are more confident about our prediction if it has a large margin.

In the multiclass case, if f_i(x) is large (and positive), we can also say with high confidence that x belongs to the i-th class. Hence, the m binary classifiers f_i(x) (1 ≤ i ≤ m) give us m confidence scores, and our prediction for x is simply the class having the highest confidence. That is, the predicted class for a test example x is

$$\arg\max_{1 \le i \le m} f_i(x) \quad (55)$$

in the OvA method.

The One-vs.-One (or one-versus-one, or OvO) method trains $\binom{m}{2} = \frac{m(m-1)}{2}$ binary SVM classifiers. For any pair (i, j) satisfying 1 ≤ i < j ≤ m, we can use the examples in the i-th class as positive ones and those in the j-th class as negative ones; all other training examples are ignored for this pair. The trained classifier f_{i,j}(x) (after the sign function) determines whether x should be assigned to the i-th or the j-th class, although it is possible that x is in neither the i-th nor the j-th class; we say the i-th or the j-th class receives one vote.

For any test example x, it will receive $\binom{m}{2}$ votes, distributed among the m classes. If the groundtruth class for x is k (1 ≤ k ≤ m), we expect the k-th class to receive the largest number of votes. Hence, in the OvO method, the prediction for any example x is the class that receives the largest number of votes.
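A minimal sketch of both prediction rules (assuming the trained binary decision functions are available as Python callables; the container names are illustrative, not part of the chapter):

import numpy as np

def predict_ova(ova_classifiers, x):
    """ova_classifiers[i](x) returns the decision value f_i(x); pick the most confident class."""
    scores = [f(x) for f in ova_classifiers]
    return int(np.argmax(scores))                  # Equation 55

def predict_ovo(ovo_classifiers, m, x):
    """ovo_classifiers[(i, j)](x) > 0 casts a vote for class i, otherwise for class j."""
    votes = np.zeros(m, dtype=int)
    for (i, j), f in ovo_classifiers.items():
        votes[i if f(x) > 0 else j] += 1
    return int(np.argmax(votes))                   # class receiving the most votes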

There have been empirical comparisons between the OvA and OvO methods. In terms of classification accuracy, there is not an obvious winner. Hence, both methods are applicable when m is not large.

However, for problems with a large number of classes—e.g., when m = 1000—the number of binary SVM classifiers to be trained in OvO is $\binom{m}{2}$ (499,500 if m = 1000). To train such a large number of classifiers requires an excessively long time, and to store and apply all these trained models is also prohibitively expensive. Hence, the OvA method is more popular when m is large.

6 Kernel SVMs

The final major missing block is a nonlinear SVM, in which the classification boundary is made up not of lines but of complex curves such as those in Figure 5. Figure 5a shows a problem with 200 training examples, while Figure 5b has 2000. The blue hyperbolic curves are the groundtruth decision boundary that separates the two classes, and the green curves are decision boundaries constructed by an SVM using the nonlinear RBF kernel, which we will explain in this section. As shown in these figures, the kernel (or nonlinear) SVM can approximate the groundtruth decision boundary well, especially in regions where many training examples are available.

Kernel methods are a class of methods that are popular in nonlinear learning, and SVMs are a typical example of kernel methods.

6.1 The kernel trick

First, let us reexamine the second term in the SVM dual objective,

$$\sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j.$$

Inside the summations is an inner product between α_i y_i x_i and α_j y_j x_j—i.e., an inner product between two training examples whose signs are determined by the labels and whose weights are given by the dual variables.

It is well known that the dot-product between two vectors can be treated as a measure of their similarity.³ If both vectors are unit ones, the dot-product equals the cosine of the angle between these two vectors. A small dot-product value means the angle is close to 90° and the two vectors are far away from each other; a large positive dot-product means the angle is close to 0 and the two vectors are close to each other; and a large negative dot-product means the angle is close to 180° and the two vectors are almost the farthest apart possible on a unit hypersphere.

Hence, it is natural to guess: if we replace the dot-product with some other nonlinear similarity metric, can we get a nonlinear boundary better than a line? This guess seems viable because in the dual SVM formulation, as we have observed, the training examples always appear as dot-products (or, loosely speaking, always as similarity comparisons!).

³ We will discuss similarity and dissimilarity metrics in more detail in Chapter 9.

Figure 5: Illustration of nonlinear classifiers. (a) 200 training examples. (b) 2000 training examples.

Some theories (unfortunately beyond the scope of this book) endorse the above guess, but the nonlinear similarity measure must satisfy a certain condition (called Mercer's condition). Let κ be a nonlinear function satisfying Mercer's condition; then the nonlinear dual SVM formulation is

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \kappa(x_i, x_j) \quad (56)$$
$$\text{s.t. } C \ge \alpha_i \ge 0, \; i = 1, 2, \ldots, n, \quad (57)$$
$$\sum_{i=1}^{n} \alpha_i y_i = 0. \quad (58)$$

The optimization is the same as in the linear case, just replacing the (constant) values x_i^T x_j by κ(x_i, x_j).

6.2 Mercer’s condition and feature mapping

To introduce Mercer's condition, we need to define the concept of a quadratically integrable (or square integrable) function. A function g: R^d → R is square integrable if

$$\int_{-\infty}^{\infty} g^2(x) \, dx < \infty. \quad (59)$$

A function κ(·, ·): R^d × R^d → R satisfies Mercer's condition if for any square integrable function g(x), the following inequality is always true:

$$\iint \kappa(x, y) g(x) g(y) \, dx \, dy \ge 0, \quad (60)$$

and it is symmetric, i.e., κ(x, y) = κ(y, x).

In the context of SVMs, Mercer's condition translates to another way to check whether κ is a valid kernel (i.e., whether it meets Mercer's condition). For a symmetric function κ(·, ·) and a set of examples x_1, x_2, …, x_n ∈ R^d, one can define a matrix K ∈ R^{n×n} with

$$[K]_{ij} = \kappa(x_i, x_j)$$

for 1 ≤ i, j ≤ n. If the matrix K is positive semi-definite for an arbitrary integer n > 0 and an arbitrary set of examples x_1, x_2, …, x_n, then κ is a valid kernel function. When κ is a valid kernel, K is called the kernel matrix (induced by κ).
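This characterization suggests a quick numerical sanity check; the sketch below (assuming NumPy; it can only check one sampled set of examples, not prove validity in general) builds a kernel matrix for the RBF kernel and verifies that its eigenvalues are non-negative:

import numpy as np

def rbf(x, y, gamma=0.5):
    return np.exp(-gamma * np.sum((x - y) ** 2))        # kappa(x, y) = exp(-gamma ||x - y||^2)

def kernel_matrix(kappa, X):
    n = X.shape[0]
    return np.array([[kappa(X[i], X[j]) for j in range(n)] for i in range(n)])

X = np.random.randn(50, 3)                              # 50 random examples in R^3
K = kernel_matrix(rbf, X)
print(np.linalg.eigvalsh(K).min() >= -1e-10)            # PSD up to numerical rounding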

When κ is a valid kernel, there exists a mapping φ: R^d → X, which maps the input x from the input space to a feature space X and satisfies

$$\kappa(x, y) = \phi(x)^T \phi(y) \quad (61)$$

for any x, y ∈ R^d. Although in many cases we do not know how to compute the mapping associated with κ, its existence has been rigorously proved. When we cannot explicitly spell out φ, we call it an implicit mapping.

The feature space X is usually high dimensional. Its dimensionality is often much larger than d, and for many kernels it is infinite! Although we skip the advanced concepts (e.g., reproducing kernel Hilbert spaces), we want to assure the reader that so long as an infinite dimensional space satisfies certain conditions (such as the mappings in kernel SVMs), the inner product between any two vectors in it is well-defined.


Conceptually, the kernel method has turned a nonlinear classification problem (in the input space) into an equivalent linear classification problem in the (usually much higher dimensional) feature space, as shown by the dual formulation for kernel SVMs. In the feature space, the linear boundary enjoys the same max-margin benefit. Fortunately, because of the kernel trick, we do not need to compute the dot-product explicitly: κ(x, y) replaces φ(x)^T φ(y).

The representer theorem once again gives us

$$w = \sum_{i=1}^{n} \alpha_i y_i \phi(x_i).$$

However, w can be infinite dimensional, too. Hence, the prediction is performed through the kernel trick, as

$$f(x) = w^T \phi(x) + b \quad (62)$$
$$= \sum_{i=1}^{n} \alpha_i y_i \phi(x_i)^T \phi(x) + b \quad (63)$$
$$= \sum_{i=1}^{n} \alpha_i y_i \kappa(x_i, x) + b. \quad (64)$$

This computation, however, is much more expensive than that in linear SVMs. Assuming the complexity of κ(x_i, x) is O(d), predicting one example may take O(nd) steps. When the number of training examples is large (e.g., n = 1,000,000), kernel SVM prediction is very slow, and we need to store all training examples in the SVM model, which incurs very high storage costs.

The actual SVM prediction cost, however, is lower than O(nd). If a training example x_i is not a support vector, then its Lagrange multiplier α_i is 0, and it contributes nothing to the above summation. Hence, only support vectors are used in prediction and stored in the SVM model.

The Lagrange multipliers are sparse—i.e., many α_i are zero. Let n′ denote the number of support vectors; usually n′ ≪ n, and the kernel SVM prediction complexity O(n′d) is much lower than O(nd).
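A sketch of this prediction rule over the support vectors only (assuming NumPy and that support_X, support_y, support_alpha, and b come from a trained model; any valid kernel function, such as the RBF sketch above, could serve as kappa):

import numpy as np

def kernel_decision_value(kappa, support_X, support_y, support_alpha, b, x):
    """Equation 64 restricted to support vectors: f(x) = sum_i alpha_i y_i kappa(x_i, x) + b."""
    k_values = np.array([kappa(x_i, x) for x_i in support_X])
    return float(np.dot(support_alpha * support_y, k_values) + b)

# The predicted label is then sign(f(x)), cf. Equation 38.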

6.3 Popular kernels and the hyper-parameters

Commonly used kernel functions include

$$\text{Linear: } \kappa(x, y) = x^T y, \quad (65)$$
$$\text{RBF: } \kappa(x, y) = \exp(-\gamma \|x - y\|^2), \quad (66)$$
$$\text{Polynomial: } \kappa(x, y) = (\gamma x^T y + c)^D. \quad (67)$$

The RBF (radial basis function) kernel is sometimes called the Gaussian kernel. Some of these kernels are parameterized—e.g., the RBF kernel has a parameter γ > 0, and the polynomial kernel has three parameters γ, c, and D (D is a positive integer, called the degree of the polynomial).
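These kernels are straightforward to implement; a vectorized sketch (assuming NumPy, computing whole kernel matrices between two sets of examples at once) might look like:

import numpy as np

def linear_kernel(X, Y):
    return X @ Y.T                                       # K_ij = x_i^T y_j, Equation 65

def rbf_kernel(X, Y, gamma=1.0):
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x^T y, evaluated for all pairs at once
    sq_dists = (np.sum(X ** 2, axis=1)[:, None]
                + np.sum(Y ** 2, axis=1)[None, :]
                - 2.0 * X @ Y.T)
    return np.exp(-gamma * sq_dists)                     # Equation 66

def polynomial_kernel(X, Y, gamma=1.0, c=1.0, D=2):
    return (gamma * X @ Y.T + c) ** D                    # Equation 67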


We can examine a special case of the polynomial kernel when γ = 1, c = 1, and D = 2. With these parameters, the polynomial kernel is

$$\kappa(x, y) = (1 + x^T y)^2.$$

If we assume the examples are in a two-dimensional space—i.e., x = (x_1, x_2)^T, y = (y_1, y_2)^T—then

$$\kappa(x, y) = (1 + x_1 y_1 + x_2 y_2)^2 \quad (68)$$
$$= 1 + 2 x_1 y_1 + 2 x_2 y_2 + x_1^2 y_1^2 + x_2^2 y_2^2 + 2 x_1 x_2 y_1 y_2 \quad (69)$$
$$= \begin{pmatrix} 1 \\ \sqrt{2} x_1 \\ \sqrt{2} x_2 \\ x_1^2 \\ x_2^2 \\ \sqrt{2} x_1 x_2 \end{pmatrix}^T \begin{pmatrix} 1 \\ \sqrt{2} y_1 \\ \sqrt{2} y_2 \\ y_1^2 \\ y_2^2 \\ \sqrt{2} y_1 y_2 \end{pmatrix} \quad (70)$$
$$= \phi(x)^T \phi(y). \quad (71)$$

Hence, for the degree-2 polynomial kernel, we can explicitly write down its mapping from the input space to the feature space, as

$$\phi(x) = \left( 1, \sqrt{2} x_1, \sqrt{2} x_2, x_1^2, x_2^2, \sqrt{2} x_1 x_2 \right)^T.$$

The feature space is six-dimensional in this particular example, which is threetimes that of the input space.
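A short numerical check of this identity (a sketch using NumPy; the function names phi and kappa are illustrative):

import numpy as np

def phi(v):
    """Explicit feature map of the degree-2 polynomial kernel with gamma = c = 1."""
    x1, x2 = v
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2, x1**2, x2**2, np.sqrt(2) * x1 * x2])

def kappa(x, y):
    return (1.0 + np.dot(x, y)) ** 2

x, y = np.array([0.3, -1.2]), np.array([2.0, 0.5])
print(np.isclose(np.dot(phi(x), phi(y)), kappa(x, y)))   # True: Equations 70-71 agree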

The tradeoff parameter C in the SVM formulation and the kernel parameters γ, c, and D are hyperparameters. Note that different hyperparameters lead to different kernels and different SVM solutions, even if the same training set is used!

However, there is very little theoretical work to guide us on how to choose hyperparameters. In practice, the most commonly used method is cross-validation. For example, we can try a few candidate values for C in the set

$$\{2^{-10}, 2^{-8}, 2^{-6}, 2^{-4}, 2^{-2}, 2^{0}, 2^{2}, 2^{4}, 2^{6}, 2^{8}, 2^{10}\}.$$

Using the training set, we can obtain the cross-validation error rate for each of these values. The C value that achieves the smallest cross-validation error rate is chosen.
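In practice this grid search is often delegated to a library. One possible sketch uses scikit-learn (my own choice for illustration—the chapter's exercises use LIBSVM instead); X_train and y_train stand for your training data, and the candidate grids are illustrative:

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

param_grid = {
    "C": [2.0 ** k for k in range(-10, 11, 2)],       # the candidate set above
    "gamma": [2.0 ** k for k in range(-10, 11, 2)],   # RBF kernel parameter
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X_train, y_train)                          # X_train, y_train: placeholder data
print(search.best_params_, search.best_score_)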

This strategy, however, is very time consuming. Assume an m-class problem and K-fold cross-validation; $K \times \binom{m}{2}$ binary SVM classifiers are required to obtain a single cross-validation error rate if the OvO method is used. If there are two hyperparameters (e.g., C and γ if the RBF kernel is used), and K_1 and K_2 candidate values exist for them, respectively, then the total number of binary SVMs to train is $K K_1 K_2 \binom{m}{2}$, which is very large.


6.4 SVM complexity, tradeoff, and more

In the early days, the popular SVM solver was the SMO (sequential minimal optimization) method, whose complexity is rather difficult to estimate precisely. Empirical evaluations usually show a complexity higher than O(n^{2.5} d), where n and d are the number and dimensionality of training examples, respectively. This complexity is too high for large scale problems.

For linear SVMs, however, many fast methods have been proposed, which can complete the learning in O(nd) steps, using techniques such as column generation, stochastic gradient descent, and coordinate descent. These new methods enable the training of linear SVMs for large problems—e.g., with millions of examples and/or millions of dimensions.

Although there are also acceleration techniques for nonlinear SVM learning, its training complexity is still much higher than that of linear SVMs. The same also applies to the testing complexity. Linear SVM testing only requires a single dot-product in a binary problem, while nonlinear SVM testing is much slower.

Nonlinear SVMs, if they can be trained in viable time for a problem, can usually achieve higher (and in some problems significantly higher) accuracy than linear ones. Hence, there is a tradeoff between the accuracy and complexity of SVM classifiers. Nonlinear SVMs are a good choice when the dataset is small or medium sized, and linear SVMs are a good option for large problems, particularly when there are many feature dimensions.

For some special kernels, such as the power mean kernels, algorithms specifically designed for them can obtain both fast speed and highly accurate models.

High quality SVM solvers for both general nonlinear and fast linear kernels are available. Examples include the LIBSVM package (for general nonlinear SVMs) and LIBLINEAR (for fast linear SVMs). We will discuss more details about these topics in the exercise problems.

SVMs can also be extended to handle regression problems, called support vector regression (SVR), which is also implemented in many software packages, including both LIBSVM and LIBLINEAR.


Exercises

1. LIBSVM is a widely used software package for learning kernel SVM classifiers and regressors. In this problem, we will experiment with this software. LIBSVM also has an accompanying page that collects datasets.

(a) Download the LIBSVM software package from http://www.csie.ntu.edu.tw/~cjlin/libsvm/. Read the instructions and compile the software from the source code.

(b) Download the svmguide1 dataset from the LIBSVM datasets page (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/).

For each of the following setups, train SVM model(s) using the training set, and compare the accuracy on the test set under different setups.

i. Use the default parameters (i.e., C = 1 and using the RBF kernel);

ii. Use the svm-scale utility to normalize the features. Make sure your scaling operation is proper: use the parameters from the training set to normalize the test set data. Read the instructions for this utility program carefully;

iii. Use the linear kernel instead of the RBF kernel;

iv. Use C = 1000 and the RBF kernel;

v. Use the easy.py utility to determine the hyperparameters C and γ in the RBF kernel.

What have you learned from these experiments?

(c) Play with the datasets in the LIBSVM datasets page. Can you find an imbalanced dataset? The LIBSVM parameter -wi is useful in handling imbalanced data. Try this parameter for the imbalanced dataset you have found. Is it helpful?

2. (Additive kernels) A kernel is called additive if it has the following form:

$$\kappa(x, y) = \sum_{i=1}^{d} \kappa(x_i, y_i)$$

for two d-dimensional vectors x and y. Note that we used the same symbol κ to denote the kernel for comparing two vectors and two scalars.

(a) The histogram intersection kernel is a widely used additive kernel, which is defined for vectors with only non-negative values, as

$$\kappa_{HI}(x, y) = \sum_{i=1}^{d} \min(x_i, y_i).$$


Show that if κ_HI(x, y) = min(x, y) for non-negative scalars x and y is a valid kernel, then κ_HI is a valid kernel for non-negative vectors x and y.

(b) Now we consider κ_HI for scalars. For any set of non-negative scalar values x_1, x_2, …, x_n, the kernel matrix formed by these values is

$$[X]_{ij} = \min(x_i, x_j)$$

for 1 ≤ i, j ≤ n, in which n is an arbitrary positive integer. Let y_1, y_2, …, y_n be a permutation of x_1, x_2, …, x_n such that y_1 ≤ y_2 ≤ ⋯ ≤ y_n, and similarly define an n × n matrix Y with

$$y_{ij} = \min(y_i, y_j) = y_{\min(i,j)}.$$

Prove that X is positive (semi-)definite if and only if Y is positive (semi-)definite.

(c) For any set of values 0 ≤ x_1 ≤ x_2 ≤ ⋯ ≤ x_n, prove that the kernel matrix X with

$$x_{ij} = \min(x_i, x_j) = x_{\min(i,j)}$$

is positive semi-definite. (Hint: what is the LDL factorization of X?)

Note that combining the above results proves that κ_HI is a valid kernel for non-negative vectors.

(d) The χ² kernel (chi-square kernel) is another widely used additive kernel for positive data, defined as

$$\kappa_{\chi^2}(x, y) = \sum_{i=1}^{d} \frac{2 x_i y_i}{x_i + y_i}.$$

For positive data, this kernel is positive definite (see the next problem for a proof). Prove that

$$\kappa_{HI}(x, y) \le \kappa_{\chi^2}(x, y).$$

(e) Hellinger’s kernel is defined for non-negative data, as

$$\kappa_{HE}(x, y) = \sum_{i=1}^{d} \sqrt{x_i y_i}.$$

Prove that this is a valid kernel and that

$$\kappa_{HE}(x, y) \ge \kappa_{\chi^2}(x, y)$$

for positive data.

(f) Additive kernels are particularly effective when the feature vector is a histogram. A histogram (which is not normalized) contains natural numbers (i.e., zero or positive integers). Write out an explicit mapping for the histogram intersection kernel when the features are all natural numbers.


3. (Power mean kernels) In this problem, we introduce the power mean kernels, which are a family of additive kernels and are closely related to the generalized mean in mathematics.

(a) Read the information on the page https://en.wikipedia.org/wiki/Generalized_mean. When we only consider the power mean of two positive numbers x and y, what are M_0(x, y), M_{−1}(x, y), and M_{−∞}(x, y)?

(b) The power mean kernels are a family of additive kernels, each indexed by a negative or zero real number p. For two positive vectors x and y with x_i, y_i > 0 (1 ≤ i ≤ d), the power mean kernel M_p is defined as

$$M_p(x, y) = \sum_{i=1}^{d} M_p(x_i, y_i) = \sum_{i=1}^{d} \left( \frac{x_i^p + y_i^p}{2} \right)^{1/p}.$$

Power mean kernels were proposed by J. Wu in a paper titled Power Mean SVMs for Large Scale Visual Classification, CVPR 2012. Show that the three additive kernels discussed in the previous problem (histogram intersection, χ², and Hellinger's) are all special cases of the power mean kernels.

(c) (Conditionally positive definite kernel) A function κ is called a conditionally positive definite kernel if for an arbitrary positive integer n and any c_1, c_2, …, c_n ∈ R that satisfy

$$\sum_{i=1}^{n} c_i = 0,$$

the inequality

$$\sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j \kappa(x_i, x_j) \ge 0$$

holds for arbitrary x_1, x_2, …, x_n ∈ R. Prove that

$$-\frac{x^p + y^p}{2}$$

is conditionally positive definite for positive data.

(d) Use the following theorem to prove that power mean kernels are positive definite kernels when −∞ ≤ p ≤ 0—i.e., they satisfy Mercer's condition.

If a kernel κ is conditionally positive definite and negative valued, then $\frac{1}{(-\kappa)^\delta}$ is positive definite for all δ ≥ 0.

(e) When we use the power mean kernel M_{−∞} to replace the histogram intersection kernel in a programming environment, we have to use a finite p value far below zero in place of −∞. For example, we can use p = −32—i.e., using M_{−32} to approximate M_{−∞} (i.e., κ_HI).


Write a simple Matlab or Octave program to evaluate the largest absolute error and relative error between M_{−32}(x, y) and M_{−∞}(x, y), where x and y are generated by the commands x=0.01:0.01:1 and y=0.01:0.01:1, respectively. Is M_{−32} a good approximation of κ_HI? That is, compare κ_HI(x, y) and M_{−32}(x, y) for the 10,000 pairs of x and y values.

(f) To illustrate the effectiveness of power mean kernels, visit https://sites.google.com/site/wujx2001/home/power-mean-svm and read the installation and usage instructions there. Use the resources on this page to generate training and testing data for the Caltech 101 dataset. When you generate the data, use K = 100 rather than K = 2000. Try different p values, and apply the PmSVM software to this dataset. Use the LIBSVM software and the RBF kernel on the same dataset. Which one has higher accuracy on this dataset (whose feature vectors are histograms)? Note that if you use the tool provided in LIBSVM to choose optimal C and gamma hyperparameter values, it will take a very long time to finish. Which software has higher training and testing speed?

4. (SVMs without bias) In the linear SVM formulation, we assume a classification boundary of the form w^T x + b, which includes a bias term b ∈ R. However, it is also possible to learn a linear SVM classifier without the bias term—i.e., using w^T x as the classification boundary.

(a) Without the bias term, what is the optimization problem in the primal space? Use the notations in this chapter.

(b) Without the bias term, show that the dual form is as follows:

$$\min_{\alpha} \; f(\alpha) = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j - \sum_{i=1}^{n} \alpha_i \quad (72)$$
$$\text{s.t. } 0 \le \alpha_i \le C, \; i = 1, 2, \ldots, n. \quad (73)$$

If the solution for α is α⋆, how do you find the optimal decision boundary w⋆?

(c) When a bias is preferred, this formulation (without bias) can still be useful. Given a training dataset (x_i, y_i) (1 ≤ i ≤ n), one can convert any x ∈ R^d into x̂ ∈ R^{d+1} by adding an extra dimension to x. The added dimension always has the constant value 1. Suppose α⋆ is the optimal solution of the dual formulation, and suppose the classification boundary is w^T x + b; what is the optimal b value?

5. (Dual coordinate descent) In this problem, we introduce the dual coordinate descent algorithm for solving a linear SVM without the bias term in the dual space.

(a) Using the notations in the previous problem, find ∂f/∂α_i. Find a way to calculate ∂f/∂α_i in O(d) steps. We use f′(α) to denote the partial derivatives of f with respect to α—i.e., f′(α)_i = ∂f/∂α_i.

(b) The dual coordinate descent (DCD) algorithm was proposed by Hsieh et al. in a paper titled A Dual Coordinate Descent Method for Large-scale Linear SVMs, which was published in ICML 2008. Among the n Lagrange multipliers α_i (1 ≤ i ≤ n), DCD updates one multiplier at a time. One training epoch updates α_1, α_2, …, α_n sequentially, one by one. The DCD algorithm was proved to converge, and in practice it often converges in a few epochs.

Now, suppose we fix α_1, …, α_{i−1} and α_{i+1}, …, α_n, and want to find a better value for α_i. Denote the set of Lagrange multipliers after this update as α′, in which α′_i = α_i + d and α′_j = α_j for j ≠ i. Show that

$$f(\alpha') - f(\alpha) = \frac{1}{2} \|x_i\|^2 d^2 + f'(\alpha)_i \, d.$$

(c) Remember that α_i and α′_i should both be between 0 and C. Then, what is the optimal value for α′_i?

(d) A practical implementation of the DCD algorithm needs to pay attention to a lot of implementation details—e.g., randomizing the order in which the different α_i values are updated in different epochs. Read the DCD paper and understand which details are important. This paper also provides a proof of DCD's convergence.

A C++ implementation is provided at https://www.csie.ntu.edu.tw/~cjlin/liblinear/. Download this software and learn how to use it. This software has different solvers for linear classification and regression problems, and several parameters. How do you specify parameters to perform classification as in Equations 72–73?

(e) Download the rcv1.binary training set from the LIBSVM data page(https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/).

Use 5-fold cross-validation and obtain the training time and the cross-validation accuracy. Compare LIBSVM and LIBLINEAR on this task: use the linear kernel in LIBSVM and use the problem specified by Equations 72–73 in LIBLINEAR. Set C = 4 in these experiments.
