
Lecture Topic: Support Vector Machines
Jakub Marecek and Sean McGarraghy (UCD), Numerical Analysis and Software, November 17, 2015

Support Vector Machines

In Chapter 7, we saw regression analysis, where the goal was to understand the relationship between independent variables and a continuous dependent variable.

When the dependent variable takes values out of a finite set of values, the problem is known as multi-class classification.

When the dependent variable takes one out of two values, the problem is often known just as classification.

The function from the independent variables to the discrete set is called a classifier.

In this chapter, we will be concerned with the training of such classifiers.


Spam Email

Consider, for example, spam.

There are large collections of emails available, classified as either "spam" or "genuine" email.

One would like to use such a collection to derive rules that one could apply to an incoming email to decide whether it is genuine.

Such a decision can be based on n distinct words in a dictionary, among other things.

Each email can be represented by a row vector, where each column corresponds to a distinct word.

A collection of m emails can be represented by a matrix A ∈ R^{m×n} and a column vector y ∈ R^m, which encodes spam as 1 and genuine as -1.
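
The following is a minimal sketch of building such a matrix A and label vector y with scikit-learn's CountVectorizer; the emails and labels below are made up for illustration:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

emails = [
    "claim your inheritance now",          # spam
    "enlargement offer for your account",  # spam
    "meeting notes attached",              # genuine
    "lunch tomorrow?",                     # genuine
]
y = np.array([1, 1, -1, -1])               # spam encoded as 1, genuine as -1

vectorizer = CountVectorizer()
A = vectorizer.fit_transform(emails)       # sparse m-by-n matrix, one column per word
print(A.shape)

Each row of A counts how often each dictionary word occurs in the corresponding email.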


Spam Email (Continued)

The obvious issue is that words common in spam (e.g., account, inheritance, enlargement) are also present in genuine emails.

In training, one uses loss functions to measure how good a particular classifier x ∈ R^n, which assigns a weight to each word in the dictionary, is on a particular row A_{j:}, 1 ≤ j ≤ m, of the matrix, considering the classification y^(j) of the row.

Using (1/2) max{0, 1 − y^(j) A_{j:} x}², where A_{j:} denotes the jth row of A, is called the hinge loss squared.

There are many more variants.


Genome-Wide Association Studies

Now, let us consider genomics.

There are large collections of sequenced genomes available. Usually, one has the genomes of thousands of people suffering from a (polygenic) disease and the genomes of tens of thousands of healthy people, or "controls".

One would like to use such a collection to understand which genes cause the (polygenic) disease.

A genome can be encoded by n "features", which correspond to combinations of a position (single-nucleotide polymorphism) and a base (e.g., G, A).

For m genomes available, a column vector y ∈ R^m indicates whether the person is healthy (1) or not (-1).

Just as above, one uses loss functions to measure the quality of a classifier.


Key Concepts

SVM stands for support vector machine, one of the best known methods for classification. In particular, it produces a linear classifier, which is a function f : R^n → {−1, 1}, f(x) := sign(w^T x + b), defined by w ∈ R^n and b ∈ R, where sign gives 1 for positive values, -1 for negative values, and 0 for 0.

The hinge loss squared is (1/2) max{0, 1 − y^(j) A_{j:} x}².
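
A minimal sketch of such a linear classifier in Python follows; the weights w and offset b below are made up for illustration:

import numpy as np

def linear_classifier(w, b):
    # f(x) = sign(w^T x + b): 1 or -1, and 0 exactly on the hyperplane
    return lambda x: int(np.sign(w @ x + b))

f = linear_classifier(w=np.array([1.0, -2.0]), b=0.5)
print(f(np.array([3.0, 1.0])))   # prints 1
print(f(np.array([0.0, 1.0])))   # prints -1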


A Picture to Keep in Mind

[Figure: points of Class 1 and Class 2 in a two-dimensional feature space (Feature 1 versus Feature 2), separated by the max-margin hyperplane.]


Classification

Informally, classification considers a number of examples, where each example is represented as a point in an n-dimensional space together with a value of -1 or 1.

The goal in training support vector machines is to find a hyperplane that separates the examples marked 1 from the examples marked -1 as well as possible, in some sense.

To determine the classification of a new example, one maps it into the same n-dimensional space and checks which side of the hyperplane it lies on.


Classification

There are a number of formalisations, going back to the linear discriminant analysis of Fisher.

Generally, the examples are represented by a matrix A ∈ R^{m×n} and a compatible vector y ∈ {−1, 1}^m.

The rows of the matrix A represent observations of n features each, and y holds the corresponding classifications to train the classifier on.

The input (A, y) is often referred to as the training data.

A linear classifier of x ∈ R^n is a function f(x) := sign(w^T x + b), defined by w ∈ R^n and b ∈ R, where sign gives 1 for positive values, -1 for negative values, and 0 for 0.


Classification

The equation w^T x + b = 0 defines a hyperplane in R^n.

If there exists a hyperplane f such that all points x ∈ R^n representing examples with value -1 have f(x) < 0 and all points x ∈ R^n representing examples with value 1 have f(x) > 0, we call the instance linearly separable.

Then one can scale the problem so as to obtain w^T A_{i:} + b ≥ +1 for i with y_i = +1 and w^T A_{i:} + b ≤ −1 for i with y_i = −1, which should be seen as two parallel hyperplanes.


Classification

In the linearly separable case, one would like to maximise the distance between the hyperplanes.

The distance is 2/√(w^T w) ("twice the margin") and one hence hopes to find

$$\min_{w \in \mathbb{R}^n,\, b \in \mathbb{R}} \ \frac{1}{2}\|w\|_2^2 \qquad \text{s.t. } y_i(w^T A_{i:} + b) \ge 1 \ \ \forall\, 1 \le i \le m, \tag{2.1}$$

or, by duality,

$$\min_{\alpha \in \mathbb{R}^m} \ \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j A_{i:} A_{j:}^T + b \sum_i \alpha_i y_i - \sum_i \alpha_i \qquad \text{s.t. } \alpha_i \ge 0 \ \ \forall\, 1 \le i \le m. \tag{2.2}$$

From the KKT conditions, w = ∑_i α_i y_i A_{i:}^T and

α_i ≥ 0 for all examples i on the boundary, i.e. y_i(w^T A_{i:} + b) − 1 = 0, and

α_i = 0 in the interior.

The hyperplane is determined by the examples on the boundary, which are called support vectors.
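
The following is a minimal sketch of problem (2.1) on a tiny, made-up, linearly separable data set, using the cvxpy modelling package (an assumption on tooling; any quadratic-programming solver would do):

import cvxpy as cp
import numpy as np

A = np.array([[0.0, 0.0], [1.0, 0.5], [3.0, 3.0], [4.0, 3.5]])   # rows are examples
y = np.array([-1, -1, 1, 1])
m, n = A.shape

w = cp.Variable(n)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))      # (1/2) ||w||_2^2
constraints = [cp.multiply(y, A @ w + b) >= 1]        # y_i (w^T A_i: + b) >= 1
cp.Problem(objective, constraints).solve()

# Support vectors are the rows on the boundary, where y_i (w^T A_i: + b) = 1.
margins = y * (A @ w.value + b.value)
print(np.where(np.isclose(margins, 1.0, atol=1e-4))[0])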


A Linearly Separable Example (Again)

[Figure: the same two-class example as before, with the max-margin hyperplane between Class 1 and Class 2 in the (Feature 1, Feature 2) plane.]


Beyond Linear Separability

The issue is that the data are generally not linearly separable.

When the data are not linearly separable, there exists no w ∈ R^n that would make the constrained problem feasible.

One remedy is to consider the Lagrangian relaxation

$$\min_{w \in \mathbb{R}^n,\, \lambda \in \mathbb{R}^m} \ \frac{1}{2}\|w\|_2^2 + \sum_{i=1}^{m} \lambda_i P(w, A_{i:}, y^{(i)}), \tag{3.1}$$

where P(w, A_{i:}, y^(i)) measures the violation of the ith constraint.

It makes sense to focus on the loss function P(w, A_{i:}, y^(i)) in the Lagrangian relaxation, rather than the max-margin term ‖w‖₂².


Classification

Given a matrix A ∈ R^{m×n} and a compatible vector y ∈ R^m, the goal is hence to find a vector x ∈ R^n which solves the following optimization problem:

$$\min_{x \in \mathbb{R}^n} \ \sum_{j=1}^{m} P(x, A_{j:}, y^{(j)}), \tag{3.2}$$

where P is a loss function.

Ultimately, the loss functions approximate the so-called 0/1-loss, which is 0 for nonnegative values and 1 elsewhere.


The Loss Functions

[Figure: the zero-one loss, hinge loss, logistic loss, and squared hinge loss plotted as functions of the decision value f(x).]


Classification

Common loss functions include:

$$P_{HL}(x, A_{j:}, y^{(j)}) := \frac{1}{2}\max\{0, 1 - y^{(j)} A_{j:} x\}, \quad \text{hinge loss}, \tag{HL}$$

$$P_{HSL}(x, A_{j:}, y^{(j)}) := \frac{1}{2}\max\{0, 1 - y^{(j)} A_{j:} x\}^2, \quad \text{hinge loss squared}, \tag{HLS}$$

$$P_{LL}(x, A_{j:}, y^{(j)}) := \log\left(1 + e^{-y^{(j)} A_{j:} x}\right), \quad \text{logistic loss}. \tag{LL}$$
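
A small sketch of these three losses with NumPy, evaluated on the margin y^(j) A_{j:} x for every row; the data are made up:

import numpy as np

def hinge_loss(margin):
    return 0.5 * np.maximum(0.0, 1.0 - margin)

def hinge_loss_squared(margin):
    return 0.5 * np.maximum(0.0, 1.0 - margin) ** 2

def logistic_loss(margin):
    return np.log1p(np.exp(-margin))

A = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1, -1])
x = np.array([0.5, -0.5])
margins = y * (A @ x)    # y^(j) A_j: x for every row j
print(hinge_loss(margins), hinge_loss_squared(margins), logistic_loss(margins))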


Non-Smooth Losses

Notice that:

The 0/1-loss is non-convex and non-smooth.

The square loss, P_SL(x, A_{j:}, y^(j)) := (1/2)(y^(j) − A_{j:} x)², is convex and smooth, but one obtains the linear regression of Chapter 7.

The hinge loss and the hinge loss squared are both convex, but the use of max makes them both non-smooth.


Subgradient Methods

One hence needs to develop subgradient methods.

For hinge loss squared, the dual has the form:

$$\min_{x \in \mathbb{R}^m} \ F(x) := \underbrace{\frac{1}{2\lambda m^2} x^T Q x - \frac{1}{m} x^T \mathbf{1}}_{f(x)} + \underbrace{\sum_{i=1}^{m} \Phi_{[0,1]}(x^{(i)})}_{\Psi(x)}, \tag{SVM-DUAL}$$

where Φ_[0,1] is the characteristic (or "indicator") function of the interval [0, 1] and Q ∈ R^{m×m} is the Gram matrix of the data, i.e., Q_{i,j} = y^(i) y^(j) A_{i:} A_{j:}^T.
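
For a dense A, the Gram matrix Q can be formed in one line of NumPy; a sketch with made-up data:

import numpy as np

A = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]])
y = np.array([-1, -1, 1])
Q = np.outer(y, y) * (A @ A.T)    # Q_ij = y^(i) y^(j) A_i: A_j:^T
print(Q)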


Subgradient Methods

If x* is an optimal solution of (SVM-DUAL), then w* = w*(x*) = (1/(λm)) ∑_{i=1}^{m} y^(i) (x*)^(i) A_{i:}^T is an optimal solution of the primal problem

$$\min_{w \in \mathbb{R}^n} \ P(w) := \frac{1}{m} \sum_{i=1}^{m} P(w, A_{i:}, y^{(i)}) + \frac{\lambda}{2}\|w\|^2, \tag{4.1}$$

where P(w, A_{i:}, y^(i)) = max{0, 1 − y^(i) A_{i:} w}.


Subgradient Methods

It is not hard to see that one would like to apply a subgradient method. Let us define an auxiliary vector

$$g_k := \frac{1}{\lambda m} \sum_{i=1}^{m} x_k^{(i)} y^{(i)} A_{i:}^T. \tag{4.2}$$

Then

$$\nabla_i f(x) = \frac{y^{(i)} A_{i:} g_k - 1}{m}, \qquad L_i = \frac{\|A_{i:}\|^2}{\lambda m^2}. \tag{4.3}$$


Subgradient Methods

As in the subgradient method for sparse least squares, one can use coordinate descent with a closed-form step. With α standing for the current iterate x_k and β ∈ R a step-size parameter, the optimal step length is the solution of a one-dimensional problem:

$$h^{(i)}(x_k) = \arg\min_{t \in \mathbb{R}} \ \nabla_i f(\alpha)\, t + \frac{\beta}{2} L_i t^2 + \Phi_{[0,1]}(\alpha^{(i)} + t) \tag{4.4}$$

$$= \operatorname{clip}_{[-\alpha^{(i)},\, 1-\alpha^{(i)}]}\left( \frac{\lambda m (1 - y^{(i)} A_{i:} g_k)}{\beta \|A_{i:}\|^2} \right), \tag{4.5}$$

where for a < b

$$\operatorname{clip}_{[a,b]}(\zeta) = \begin{cases} a, & \text{if } \zeta < a, \\ b, & \text{if } \zeta > b, \\ \zeta, & \text{otherwise.} \end{cases}$$
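
The following is a minimal single-machine sketch of this coordinate-descent step, equations (4.4)-(4.5), assuming β = 1 and a dense A; variable names mirror the text and the toy data are made up:

import numpy as np

def clip(zeta, a, b):
    # clip_[a,b](zeta) as defined above
    return min(max(zeta, a), b)

def dual_coordinate_descent(A, y, lam, epochs=50):
    m, n = A.shape
    x = np.zeros(m)                      # dual variables, kept in [0, 1]
    g = np.zeros(n)                      # g_k = (1/(lam*m)) sum_i x^(i) y^(i) A_i:^T
    row_norms = (A ** 2).sum(axis=1)     # ||A_i:||^2
    for _ in range(epochs):
        for i in np.random.permutation(m):
            if row_norms[i] == 0:
                continue
            t = lam * m * (1.0 - y[i] * (A[i] @ g)) / row_norms[i]
            h = clip(t, -x[i], 1.0 - x[i])      # keep x^(i) + h inside [0, 1]
            x[i] += h
            g += h * y[i] * A[i] / (lam * m)    # incremental update of g
    return g, x    # g equals w*(x) = (1/(lam*m)) sum_i y^(i) x^(i) A_i:^T

A = np.array([[-2.0, -1.0], [-1.0, -2.0], [2.0, 1.0], [1.0, 2.0]])
y = np.array([-1, -1, 1, 1])
w, x = dual_coordinate_descent(A, y, lam=0.1)
print(np.sign(A @ w))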


Subgradient Methods

The new value of the auxiliary vector, g_{k+1} = g(x_{k+1}), is given by

$$g_{k+1} = g_k + \sum_{c=1}^{C} \underbrace{\sum_{i \in Z_k^{(c)}} \frac{1}{\lambda m}\, h_i(x_k)\, y^{(i)} A_{i:}^T}_{\delta g^{(c)}}, \tag{4.6}$$

where the summation over c runs over the parts δg^(c) of the update computed on C different computers, each considering its own block of coordinates Z_k^(c).

This makes it possible to produce efficient distributed implementations.
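
A serial sketch of one such round, following equation (4.6): each block Z_k^(c) would in practice be processed on its own computer against the same g_k, and the per-block deltas δg^(c) are then summed; the partitioning into blocks and the parameter β are illustrative assumptions:

import numpy as np

def parallel_round(A, y, x, g, lam, beta, blocks):
    m, n = A.shape
    deltas = []
    for Z in blocks:                  # each block Z_k^(c) would live on its own computer
        delta_g = np.zeros(n)
        for i in Z:
            t = lam * m * (1.0 - y[i] * (A[i] @ g)) / (beta * (A[i] @ A[i]))
            h = float(np.clip(t, -x[i], 1.0 - x[i]))
            x[i] += h
            delta_g += h * y[i] * A[i] / (lam * m)
        deltas.append(delta_g)        # delta_g^(c)
    return g + sum(deltas)            # g_{k+1} = g_k + sum_c delta_g^(c)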


Primal-Dual Methods

Alternatively, one can consider the amount ζ_i of violation in the ith example:

$$\min_{w \in \mathbb{R}^n,\, b \in \mathbb{R},\, \zeta \in \mathbb{R}^m} \ \frac{1}{2}\|w\|_2^2 + \sum_{i=1}^{m} \lambda_i \zeta_i, \tag{5.1}$$

$$\text{s.t. } y_i(w^T A_{i:} + b) \ge 1 - \zeta_i \ \ \forall\, 1 \le i \le m, \tag{5.2}$$

$$\zeta_i \ge 0 \ \ \forall\, 1 \le i \le m, \tag{5.3}$$

for a fixed λ ∈ R^m.

There, one can formulate the dual problem and apply a primal-dual method.

For common instances with n ≫ m, one can exploit block structure to form a linear system in O(n(m + 1)²) operations and to invert a matrix in O((m + 1)³) operations, at least on the BSS machine.

One also has quadratic convergence, under reasonable assumptions.
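
A sketch of (5.1)-(5.3) with cvxpy, again an assumption on tooling rather than the lecture's own solver, using a single fixed penalty λ for every example (the formulation allows a different λ_i per example) and made-up, not quite separable data:

import cvxpy as cp
import numpy as np

A = np.array([[0.0, 0.0], [1.0, 1.0], [1.2, 1.1], [3.0, 3.0], [4.0, 4.0]])
y = np.array([-1, 1, -1, 1, 1])
m, n = A.shape
lam = 1.0

w, b, zeta = cp.Variable(n), cp.Variable(), cp.Variable(m)
objective = cp.Minimize(0.5 * cp.sum_squares(w) + lam * cp.sum(zeta))
constraints = [cp.multiply(y, A @ w + b) >= 1 - zeta, zeta >= 0]
cp.Problem(objective, constraints).solve()
print(w.value, b.value, zeta.value)    # zeta shows the per-example violations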


Primal-Dual Methods

Notice, however, that the problem would become non-convex if λ ∈ R^m were not fixed.

Unlike in general optimisation, where a Lagrangian provides a relaxation of the constrained problem, here the constrained problem with fixed λ is a restriction of the unconstrained problem.


Multi-Class Classification

In all of the above, we have been considering classification into two classes, -1 and 1.

When the dependent variable takes values out of a finite set V of values, e.g., V = {−1, 0, 1}, the problem is known as multi-class classification and the values are said to correspond to "classes".

Although there are principled methods for the problem, in practice one reuses solvers for classification into two classes most of the time.


Multi-Class Classification: One vs. Rest

One option is to consider |V| one-versus-rest classifiers. On the example with V = {−1, 0, 1}, one would train three classifiers: the first for the two classes {−1} and {0, 1}, the second for {0} and {−1, 1}, and the third for {1} and {−1, 0}.

Once given an example without a classification, one would run all |V| classifiers and pick the class that seems best.
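
One way to realise this scheme is scikit-learn's OneVsRestClassifier wrapper around a binary SVM; the three-class toy data below are made up:

import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [10.0, 0.0], [11.0, 0.0]])
y = np.array([-1, -1, 0, 0, 1, 1])          # V = {-1, 0, 1}: three classes

clf = OneVsRestClassifier(LinearSVC())      # trains |V| one-versus-rest classifiers
clf.fit(X, y)
print(clf.predict([[4.5, 5.5], [10.5, 0.5]]))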


Multi-Class Classification: One vs. One

The alternative is to consider |V|(|V| − 1)/2 one-versus-one classifiers.

This may be expensive, and the "reconciling" of such classifiers is tricky.

Often, the reconciliation mimics voting, where each classifier "casts a vote" for a class, and the class with the most votes wins.


Regularisations

Just as in the previous three chapters, one can consider regularisations, e.g., ℓ1:

$$\min_{x \in \mathbb{R}^n} \ F(x) := \underbrace{\sum_{j=1}^{m} P(x, A_{j:}, y^{(j)})}_{f(x)} + \underbrace{\gamma \|x\|_1}_{\Psi(x)}, \tag{7.1}$$

whereby one obtains a problem with a smooth component f and a non-smooth component Ψ, which is in some sense simple.

One can apply the same subgradient machinery as in Chapter 7.

There are many more extensions, but their impact is disputed.

It may hence make sense to “focus on the basics”.
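
A sketch of the ℓ1-regularised hinge-loss problem using scikit-learn's SGDClassifier, where the alpha parameter plays the role of γ; the data and the choice of alpha are made up:

from sklearn.linear_model import SGDClassifier

A = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.5]]
y = [-1, -1, 1, 1]
classifier = SGDClassifier(loss="hinge", penalty="l1", alpha=0.01)
classifier.fit(A, y)
print(classifier.coef_, classifier.intercept_)    # l1 tends to drive some coefficients to zero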


A Summary

Regression, classification, and multi-class classification are the best known examples of "supervised training".

Instances that fit within 100 MB are often considered of "moderate" size and can be solved well using most methods.

This may correspond to a dense 3600 × 3600 matrix A, or to up to 100,000 rows in a sparse matrix with 40 entries per row.

In the machine learning literature, common sparse matrices of these dimensions include the CCAT variant of RCV1, Astro-ph, and COV.


A SciKit Implementation

You can use the scikit-learn module in Python to train such an SVM:

from sklearn import svm

A = [[0, 0], [1, 1]]
y = [-1, 1]
classifier = svm.SVC()
classifier.fit(A, y)

which uses LIBSVM.


A SciKit Implementation

Once you train the classifier, you can try:

print(classifier.predict([[-1, -1], [2., 2.]]))
print(classifier.support_vectors_)


A SciKit Implementation

The closest to the subgradient method above is SGDClassifier:

from sklearn.linear_model import SGDClassifier

A = [[0, 0], [1, 1]]
y = [-1, 1]
classifier = SGDClassifier(loss="hinge", penalty="l2")
classifier.fit(A, y)
print(classifier.predict([[2., 2.]]))
print(classifier.coef_)
print(classifier.intercept_)


A Summary by Andreas Mueller

[Figure by Andreas Mueller, not reproduced in the transcript.]

Performance of Methods for Training SVMs

Many real-world datasets are much larger, though.

Consider the well-known instance WebSpam, which consists of 350,000 emails (rows) and 16,609,143 distinct words (columns). The size of the instance is 25 GB.

It is not hard to imagine that Google's Gmail may have collected many orders of magnitude more emails to learn from.

At that scale, the subgradient methods suggested above scale much better than the alternatives.


Performance of Subgradient Methods

The figure shows the execution time and duality gap on WebSpam, using C = 16 processes, with each process using 8 threads.

Especially when there is some additional structure, one can scale much further.

[Figure: duality gap (log scale, from 10^0 down to 10^-4) versus elapsed time in minutes on WebSpam, with one curve for each of τ = 48, 96, 224, 384, and 768.]
