
Support Vector Machine and Kernel Methods

Jiayu Zhou

Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA

February 26, 2017

Which Separator Do You Pick?

Robustness to Noisy Data

Being robust to noise (measurement error) is good (remember regularization).

Thicker Cushion Means More Robustness

We call such hyperplanes fat

Two Crucial Questions

1. Can we efficiently find the fattest separating hyperplane?

2. Is a fatter hyperplane better than a thin one?

Pulling Out the Bias

Before:

  x ∈ {1} × R^d,   w ∈ R^{d+1}
  x = [1, x_1, ..., x_d]^T,   w = [w_0, w_1, ..., w_d]^T
  signal = w^T x

After:

  x ∈ R^d,   w ∈ R^d,   bias b ∈ R
  x = [x_1, ..., x_d]^T,   w = [w_1, ..., w_d]^T
  signal = w^T x + b

Separating The Data

A hyperplane h = (b, w) separates the data means:

  y_n(w^T x_n + b) > 0 for all n

By rescaling the weights and bias, we can require

  min_{n=1,...,N} y_n(w^T x_n + b) = 1

Distance to the Hyperplane

w is normal to the hyperplane (why?): for any two points x_1, x_2 on the hyperplane,

  w^T(x_2 − x_1) = w^T x_2 − w^T x_1 = −b + b = 0

Scalar projection:

  a^T b = ∥a∥ ∥b∥ cos(a, b)  ⇒  a^T b / ∥b∥ = ∥a∥ cos(a, b)

Let x⊥ be the orthogonal projection of x onto h; the distance to the hyperplane is given by the projection of x − x⊥ onto w (why?):

  dist(x, h) = (1/∥w∥) · |w^T x − w^T x⊥| = (1/∥w∥) · |w^T x + b|

Fatness of a Separating Hyperplane

For a separating hyperplane and data point (x_n, y_n),

  dist(x_n, h) = (1/∥w∥) · |w^T x_n + b| = (1/∥w∥) · |y_n(w^T x_n + b)| = (1/∥w∥) · y_n(w^T x_n + b)

Fatness = distance to the closest point:

  Fatness = min_n dist(x_n, h) = (1/∥w∥) · min_n y_n(w^T x_n + b) = 1/∥w∥

Maximizing the Margin

Formal definition of margin:

  margin: γ(h) = 1/∥w∥

NOTE: the bias b does not appear in the margin.

Objective maximizing the margin:

  min_{b,w}  (1/2) w^T w
  subject to: min_{n=1,...,N} y_n(w^T x_n + b) = 1

An equivalent objective:

  min_{b,w}  (1/2) w^T w
  subject to: y_n(w^T x_n + b) ≥ 1 for n = 1, ..., N

Example - Our Toy Data Set

  min_{b,w}  (1/2) w^T w
  subject to: y_n(w^T x_n + b) ≥ 1 for n = 1, ..., N

Training data:

  X = [ 0 0
        2 2
        2 0
        3 0 ],   y = [ −1, −1, +1, +1 ]^T

What is the margin?

Example - Our Toy Data Set

  min_{b,w}  (1/2) w^T w
  subject to: y_n(w^T x_n + b) ≥ 1 for n = 1, ..., N

  X = [ 0 0
        2 2
        2 0
        3 0 ],   y = [ −1, −1, +1, +1 ]^T

The constraints become:

  (1): −b ≥ 1
  (2): −(2w_1 + 2w_2 + b) ≥ 1
  (3): 2w_1 + b ≥ 1
  (4): 3w_1 + b ≥ 1

  (1) + (3) → w_1 ≥ 1
  (2) + (3) → w_2 ≤ −1

  ⇒ (1/2) w^T w = (1/2)(w_1^2 + w_2^2) ≥ 1

Thus: w_1 = 1, w_2 = −1, b = −1.

Example - Our Toy Data Set

Given data X = [ 0 0
                 2 2
                 2 0
                 3 0 ]

Optimal solution:

  w* = [ w_1 = 1, w_2 = −1 ]^T,   b* = −1

Optimal hyperplane:

  g(x) = sign(x_1 − x_2 − 1)

Margin:

  1/∥w∥ = 1/√2 ≈ 0.707

For data points (1), (2) and (3), y_n(x_n^T w* + b*) = 1: these are the support vectors.

Solver: Quadratic Programming

  min_{u∈R^q}  (1/2) u^T Q u + p^T u
  subject to: A u ≥ c

  u* ← QP(Q, p, A, c)

(Q = 0 gives linear programming.)

http://cvxopt.org/examples/tutorial/qp.html
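The slide's QP uses the form A u ≥ c, while CVXOPT's solvers.qp expects inequalities as G u ≤ h, so a thin wrapper is convenient. Below is a minimal sketch in Python, assuming NumPy and the cvxopt package linked above; the helper name solve_qp is ours, not part of the slides.

import numpy as np
from cvxopt import matrix, solvers


def solve_qp(Q, p, A, c):
    """Solve min_u (1/2) u^T Q u + p^T u  subject to  A u >= c."""
    # CVXOPT wants G u <= h, so A u >= c is passed as -A u <= -c.
    G = matrix(-A.astype(float))
    h = matrix(-c.astype(float).reshape(-1, 1))
    P = matrix(Q.astype(float))
    q = matrix(p.astype(float).reshape(-1, 1))
    sol = solvers.qp(P, q, G, h)
    return np.array(sol['x']).ravel()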

Maximum Margin Hyperplane is QP

SVM primal problem:

  min_{b,w}  (1/2) w^T w
  subject to: y_n(w^T x_n + b) ≥ 1, ∀n

Generic QP:

  min_{u∈R^q}  (1/2) u^T Q u + p^T u
  subject to: A u ≥ c

Take u = [b; w] ∈ R^{d+1}. Then

  (1/2) w^T w = (1/2) [b, w^T] [ 0   0_d^T
                                 0_d I_d   ] [b; w] = (1/2) u^T [ 0   0_d^T
                                                                  0_d I_d   ] u

so

  Q = [ 0   0_d^T
        0_d I_d   ],   p = 0_{d+1}

Each constraint y_n(w^T x_n + b) ≥ 1 becomes [y_n, y_n x_n^T] u ≥ 1, so stacking all N constraints:

  A = [ y_1  y_1 x_1^T
        ...
        y_N  y_N x_N^T ],   c = [ 1, ..., 1 ]^T

Back To Our Example

Exercise:

  X = [ 0 0
        2 2
        2 0
        3 0 ],   y = [ −1, −1, +1, +1 ]^T

  (1): −b ≥ 1
  (2): −(2w_1 + 2w_2 + b) ≥ 1
  (3): 2w_1 + b ≥ 1
  (4): 3w_1 + b ≥ 1

Show the corresponding Q, p, A, c:

  Q = [ 0 0 0
        0 1 0
        0 0 1 ],   p = [ 0, 0, 0 ]^T,   A = [ −1  0  0
                                              −1 −2 −2
                                               1  2  0
                                               1  3  0 ],   c = [ 1, 1, 1, 1 ]^T

Use your QP solver to get

  u* = [b*, w_1*, w_2*]^T = [−1, 1, −1]^T
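A quick numerical check of this exercise, continuing the hypothetical solve_qp sketch from the quadratic-programming slide (so it inherits that sketch's assumptions):

import numpy as np

# Plug the Q, p, A, c above into the solve_qp sketch; the solver should
# recover u* = [b*, w1*, w2*] close to [-1, 1, -1].
Q = np.diag([0.0, 1.0, 1.0])          # no penalty on the bias b
p = np.zeros(3)
A = np.array([[-1.0,  0.0,  0.0],     # (1): -b               >= 1
              [-1.0, -2.0, -2.0],     # (2): -(2w1 + 2w2 + b) >= 1
              [ 1.0,  2.0,  0.0],     # (3):  2w1 + b         >= 1
              [ 1.0,  3.0,  0.0]])    # (4):  3w1 + b         >= 1
c = np.ones(4)

u_star = solve_qp(Q, p, A, c)
print(np.round(u_star, 3))            # expected: [-1.  1. -1.]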

Primal QP algorithm for linear-SVM

1. Let p = 0_{d+1} be the (d+1)-vector of zeros and c = 1_N the N-vector of ones. Construct matrices Q and A, where

  Q = [ 0   0_d^T
        0_d I_d   ],   A = [ y_1  y_1 x_1^T
                             ...
                             y_N  y_N x_N^T ]

2. Return [b*; w*] = u* ← QP(Q, p, A, c).

3. The final hypothesis is g(x) = sign(x^T w* + b*).
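A minimal sketch of this recipe as Python code, again reusing the hypothetical solve_qp helper; X is the N × d data matrix and y the vector of ±1 labels:

import numpy as np


def linear_svm_primal(X, y):
    """Hard-margin linear SVM via the primal QP recipe above."""
    N, d = X.shape
    Q = np.zeros((d + 1, d + 1))
    Q[1:, 1:] = np.eye(d)                                    # Q = [[0, 0^T], [0, I_d]]
    p = np.zeros(d + 1)
    A = np.hstack([y.reshape(-1, 1), y.reshape(-1, 1) * X])  # rows [y_n, y_n x_n^T]
    c = np.ones(N)
    u = solve_qp(Q, p, A, c)                                 # u* = [b*, w*]
    return u[0], u[1:]


def predict(X, b, w):
    return np.sign(X @ w + b)                                # g(x) = sign(x^T w* + b*)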

Link to Regularization

  min_w  E_in(w)
  subject to: w^T w ≤ C

                optimal hyperplane      regularization
  minimize      w^T w                   E_in
  subject to    E_in = 0                w^T w ≤ C

How to Handle Non-Separable Data?

(a) Tolerate noisy data points: soft-margin SVM.

(b) Inherent nonlinear boundary: non-linear transformation.

Non-Linear Transformation

  Φ_1(x) = (x_1, x_2)

  Φ_2(x) = (x_1, x_2, x_1^2, x_1 x_2, x_2^2)

  Φ_3(x) = (x_1, x_2, x_1^2, x_1 x_2, x_2^2, x_1^3, x_1^2 x_2, x_1 x_2^2, x_2^3)
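For concreteness, a small sketch of the 2-D feature maps Φ_2 and Φ_3 listed above (the function names are ours):

import numpy as np


def phi2(x):
    x1, x2 = x
    return np.array([x1, x2, x1**2, x1*x2, x2**2])           # d~ = 5 features


def phi3(x):
    x1, x2 = x
    return np.array([x1, x2, x1**2, x1*x2, x2**2,
                     x1**3, x1**2*x2, x1*x2**2, x2**3])      # d~ = 9 features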

Non-Linear Transformation

Use the optimal hyperplane together with a nonlinear transform Φ: R^d → R^d̃:

  z_n = Φ(x_n)

Solve the hard-margin SVM in the Z-space to obtain (b̃*, w̃*):

  min_{b̃,w̃}  (1/2) w̃^T w̃
  subject to: y_n(w̃^T z_n + b̃) ≥ 1, ∀n

Final hypothesis:

  g(x) = sign(w̃*^T Φ(x) + b̃*)

SVM and non-linear transformation

The margin is shaded in yellow, and the support vectors are boxed.

For Φ_2, d̃_2 = 5, and for Φ_3, d̃_3 = 9. d̃_3 is nearly double d̃_2, yet the resulting SVM separator is not severely overfitting with Φ_3 (regularization?).

Support Vector Machine Summary

A very powerful, easy-to-use linear model which comes with automatic regularization.

Fully exploiting the SVM: kernels.

Potential robustness to overfitting even after transforming to a much higher dimension. How about infinite-dimensional transforms? The kernel trick.

SVM Dual: Formulation

Primal and dual in optimization.

The dual view of SVM enables us to exploit the kernel trick.

In the primal SVM problem we solve for w ∈ R^d and b, while in the dual problem we solve for α ∈ R^N:

  max_{α∈R^N}  Σ_{n=1}^N α_n − (1/2) Σ_{m=1}^N Σ_{n=1}^N α_n α_m y_n y_m x_n^T x_m
  subject to: Σ_{n=1}^N y_n α_n = 0,   α_n ≥ 0, ∀n

which is also a QP problem.

SVM Dual: Prediction

We can obtain the primal solution:

  w* = Σ_{n=1}^N y_n α_n* x_n

where the support vectors are the points with α_n > 0.

The optimal hypothesis:

  g(x) = sign(w*^T x + b*)
       = sign( Σ_{n=1}^N y_n α_n* x_n^T x + b* )
       = sign( Σ_{α_n*>0} y_n α_n* x_n^T x + b* )

Dual SVM: Summary

  max_{α∈R^N}  Σ_{n=1}^N α_n − (1/2) Σ_{m=1}^N Σ_{n=1}^N α_n α_m y_n y_m x_n^T x_m
  subject to: Σ_{n=1}^N y_n α_n = 0,   α_n ≥ 0, ∀n

  w* = Σ_{n=1}^N y_n α_n* x_n
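A minimal sketch of solving this dual with CVXOPT, assuming hard-margin (linearly separable) data: the maximization is rewritten as minimizing (1/2) α^T P α − 1^T α with P_nm = y_n y_m x_n^T x_m, and b* is recovered from a support vector. The function name and tolerance are ours.

import numpy as np
from cvxopt import matrix, solvers


def dual_svm(X, y, sv_tol=1e-6):
    N = X.shape[0]
    P = matrix((np.outer(y, y) * (X @ X.T)).astype(float))
    q = matrix(-np.ones((N, 1)))
    G = matrix(-np.eye(N))                       # -alpha_n <= 0
    h = matrix(np.zeros((N, 1)))
    A = matrix(y.reshape(1, -1).astype(float))   # sum_n y_n alpha_n = 0
    b = matrix(np.zeros((1, 1)))
    alpha = np.array(solvers.qp(P, q, G, h, A, b)['x']).ravel()

    w = (alpha * y) @ X                          # w* = sum_n y_n alpha_n* x_n
    sv = alpha > sv_tol                          # support vectors: alpha_n > 0
    b_star = np.mean(y[sv] - X[sv] @ w)          # b* from the support vectors
    return alpha, w, b_star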

Common SVM Basis Functions

zk = polynomial terms of xk of degree 1 to q

zk = radial basis function of xk

z_k(j) = φ_j(x_k) = exp(−∥x_k − c_j∥^2 / σ^2)

zk = sigmoid functions of xk

Quadratic Basis Functions

  Φ(x) = ( 1,
           √2 x_1, ..., √2 x_d,
           x_1^2, ..., x_d^2,
           √2 x_1 x_2, ..., √2 x_1 x_d, √2 x_2 x_3, ..., √2 x_{d−1} x_d )^T

including the constant term, the linear terms, the pure quadratic terms, and the quadratic cross-terms.

The number of terms is approximately d^2/2.

You may be wondering what those √2's are doing. You'll find out why they're there soon.
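A sketch of this explicit quadratic feature map in Python, keeping the √2 factors exactly as above (the function name is ours):

import numpy as np


def phi_quadratic(x):
    x = np.asarray(x, dtype=float)
    d = len(x)
    features = [1.0]                                        # constant term
    features += list(np.sqrt(2.0) * x)                      # linear terms
    features += list(x ** 2)                                # pure quadratic terms
    features += [np.sqrt(2.0) * x[i] * x[j]                 # quadratic cross-terms
                 for i in range(d) for j in range(i + 1, d)]
    return np.array(features)                               # roughly d^2/2 entries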

Dual SVM: Non-linear Transformation

  max_{α∈R^N}  Σ_{n=1}^N α_n − (1/2) Σ_{m=1}^N Σ_{n=1}^N α_n α_m y_n y_m Φ(x_n)^T Φ(x_m)
  subject to: Σ_{n=1}^N y_n α_n = 0,   α_n ≥ 0, ∀n

  w* = Σ_{n=1}^N y_n α_n* Φ(x_n)

We need to prepare the matrix Q with Q_nm = y_n y_m Φ(x_n)^T Φ(x_m).

Cost? We must do N^2/2 dot products to get this matrix ready. With the quadratic transform, each dot product requires about d^2/2 additions and multiplications, so the whole thing costs about N^2 d^2/4.

Quadratic Dot Products

  Φ(a)^T Φ(b) = ( 1, √2 a_1, ..., √2 a_d, a_1^2, ..., a_d^2, √2 a_1 a_2, ..., √2 a_1 a_d, √2 a_2 a_3, ..., √2 a_{d−1} a_d )^T
                · ( 1, √2 b_1, ..., √2 b_d, b_1^2, ..., b_d^2, √2 b_1 b_2, ..., √2 b_1 b_d, √2 b_2 b_3, ..., √2 b_{d−1} b_d )

Term by term:

  Constant term:           1
  Linear terms:            Σ_{i=1}^d 2 a_i b_i
  Pure quadratic terms:    Σ_{i=1}^d a_i^2 b_i^2
  Quadratic cross-terms:   Σ_{i=1}^d Σ_{j=i+1}^d 2 a_i a_j b_i b_j

Quadratic Dot Product

Does Φ(a)^T Φ(b) look familiar?

  Φ(a)^T Φ(b) = 1 + 2 Σ_{i=1}^d a_i b_i + Σ_{i=1}^d a_i^2 b_i^2 + Σ_{i=1}^d Σ_{j=i+1}^d 2 a_i a_j b_i b_j

Try this: (a^T b + 1)^2

  (a^T b + 1)^2 = (a^T b)^2 + 2 a^T b + 1
                = ( Σ_{i=1}^d a_i b_i )^2 + 2 Σ_{i=1}^d a_i b_i + 1
                = Σ_{i=1}^d Σ_{j=1}^d a_i b_i a_j b_j + 2 Σ_{i=1}^d a_i b_i + 1
                = Σ_{i=1}^d a_i^2 b_i^2 + 2 Σ_{i=1}^d Σ_{j=i+1}^d a_i a_j b_i b_j + 2 Σ_{i=1}^d a_i b_i + 1

They’re the same! And this is only O(d) to compute!
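A quick numeric spot-check of the identity, reusing the phi_quadratic sketch from the quadratic-basis slide:

import numpy as np

# The O(d^2) explicit dot product should match the O(d) kernel evaluation.
rng = np.random.default_rng(0)
a, b = rng.normal(size=5), rng.normal(size=5)
print(np.isclose(phi_quadratic(a) @ phi_quadratic(b), (a @ b + 1) ** 2))   # True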

Dual SVM: Non-linear Transformation

  max_{α∈R^N}  Σ_{n=1}^N α_n − (1/2) Σ_{m=1}^N Σ_{n=1}^N α_n α_m y_n y_m Φ(x_n)^T Φ(x_m)
  subject to: Σ_{n=1}^N y_n α_n = 0,   α_n ≥ 0, ∀n

  w* = Σ_{n=1}^N y_n α_n* Φ(x_n)

We still need to prepare the matrix Q with Q_nm = y_n y_m Φ(x_n)^T Φ(x_m).

Cost? We must do N^2/2 dot products to get this matrix ready, but each dot product now requires only about d additions and multiplications.

Higher Order Polynomials

Explicit transform Φ(x):

  Φ(x)        Number of terms   Cost          For d = 100
  Quadratic   d^2/2             d^2 N^2/4     2.5k N^2
  Cubic       d^3/6             d^3 N^2/12    83k N^2
  Quartic     d^4/24            d^4 N^2/48    1.96m N^2

Kernel Φ(a)^T Φ(b):

  Kernel                        Cost          For d = 100
  Quadratic   (a^T b + 1)^2     d N^2/2       50 N^2
  Cubic       (a^T b + 1)^3     d N^2/2       50 N^2
  Quartic     (a^T b + 1)^4     d N^2/2       50 N^2

Dual SVM with Quintic basis functions

  max_{α∈R^N}  Σ_{n=1}^N α_n − (1/2) Σ_{m=1}^N Σ_{n=1}^N α_n α_m y_n y_m Φ(x_n)^T Φ(x_m),   where Φ(x_n)^T Φ(x_m) = (x_n^T x_m + 1)^5
  subject to: Σ_{n=1}^N y_n α_n = 0,   α_n ≥ 0, ∀n

Classification:

  g(x) = sign(w*^T Φ(x) + b*) = sign( Σ_{α_n*>0} y_n α_n* Φ(x_n)^T Φ(x) + b* )
       = sign( Σ_{α_n*>0} y_n α_n* (x_n^T x + 1)^5 + b* )

Dual SVM with general kernel functions

  max_{α∈R^N}  Σ_{n=1}^N α_n − (1/2) Σ_{m=1}^N Σ_{n=1}^N α_n α_m y_n y_m K(x_n, x_m)
  subject to: Σ_{n=1}^N y_n α_n = 0,   α_n ≥ 0, ∀n

Classification:

  g(x) = sign(w*^T Φ(x) + b*) = sign( Σ_{α_n*>0} y_n α_n* Φ(x_n)^T Φ(x) + b* )
       = sign( Σ_{α_n*>0} y_n α_n* K(x_n, x) + b* )
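A sketch of what prediction with a general kernel looks like in code; only the support vectors, their labels and multipliers, b*, and the kernel K are needed at test time (all names are ours):

import numpy as np


def svm_predict(x, sv_X, sv_y, sv_alpha, b_star, kernel):
    """g(x) = sign( sum over support vectors of y_n alpha_n K(x_n, x) + b* )."""
    s = sum(a * y * kernel(xn, x) for a, y, xn in zip(sv_alpha, sv_y, sv_X))
    return np.sign(s + b_star)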

Kernel Tricks

Replace the dot product with a kernel function.

Not all functions are kernel functions! A kernel needs to be decomposable as K(a, b) = Φ(a)^T Φ(b).
Could K(a, b) = (a − b)^3 be a kernel function?
Could K(a, b) = (a − b)^4 − (a + b)^2 be a kernel function?

Mercer's condition

To expand a kernel function K(a, b) into a dot product, i.e., K(a, b) = Φ(a)^T Φ(b), K(a, b) has to be a positive semi-definite function: the kernel matrix K is symmetric PSD for any given x_1, ..., x_N.
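A simple sanity check in the spirit of Mercer's condition: evaluate the kernel matrix on a sample and test whether it is symmetric PSD. One sample can only refute a candidate kernel, never prove it valid; the helper name and tolerance are ours.

import numpy as np


def kernel_matrix_is_psd(kernel, X, tol=1e-8):
    K = np.array([[kernel(a, b) for b in X] for a in X])
    eigvals = np.linalg.eigvalsh((K + K.T) / 2.0)    # symmetrize, then eigenvalues
    return bool(np.all(eigvals >= -tol))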

Kernel Design: expression kernel

mRNA expression data:

  Each matrix entry is an mRNA expression measurement.
  Each column is an experiment.
  Each row corresponds to a gene.

Are two expression profiles similar or dissimilar?

Kernel:

  K(x, y) = ( Σ_i x_i y_i ) / ( √(Σ_i x_i x_i) · √(Σ_i y_i y_i) )
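This expression kernel is just the cosine similarity of two expression profiles; a one-line sketch:

import numpy as np


def expression_kernel(x, y):
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))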

Kernel Design: sequence kernel

Work with non-vectorial data

Scalar product on a pair of variable-length, discrete strings?

>ICYA_MANSE
GDIFYPGYCPDVKPVNDFDLSAFAGAWHEIAKLPLENENQGKCTIAEYKYDGKKASVYNSFVSNGVKEYMEGDLEIAPDAKYTKQGKYVMTFKFGQRVVNLVPWVLATDYKNYAINYMENSHPDKKAHSIHAWILSKSKVLEGNTKEVVDNVLKTFSHLIDASKFISNDFSEAACQYSTTYSLTGPDRH

>LACB_BOVIN
MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTKIPAVFKIDALNENKVLVLDTDYKKYLLFCMENSAEPEQSLACQCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI

Commonly Used SVM Kernel Functions

K(a, b) = (α · a^T b + β)^Q is an example of an SVM kernel function.

Beyond polynomials there are other very high dimensional basis functions that can be made practical by finding the right kernel function:

Radial-basis-style kernel (RBF) / Gaussian kernel function:

  K(a, b) = exp(−γ ∥a − b∥^2)

Sigmoid functions
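Sketches of these two kernel families in Python; α, β, Q, and γ are user-chosen hyperparameters (the function names and defaults are ours):

import numpy as np


def polynomial_kernel(a, b, alpha=1.0, beta=1.0, Q=2):
    return (alpha * (a @ b) + beta) ** Q


def gaussian_kernel(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((a - b) ** 2))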

2nd Order Polynomial Kernel

K(a, b) = (α · a^T b + β)^2

Gaussian Kernels

  K(a, b) = exp(−γ ∥a − b∥^2)

When γ is large, we clearly see that even the protection of a large margin cannot suppress overfitting. However, for a reasonably small γ, the sophisticated boundary discovered by the SVM with the Gaussian-RBF kernel looks quite good.

Gaussian Kernels

For (a), a noisy data set where a linear classifier appears to work quite well, (b) using the Gaussian-RBF kernel with the hard-margin SVM leads to overfitting.

From hard-margin to soft-margin

When there are outliers, the hard-margin SVM with a Gaussian-RBF kernel results in an unnecessarily complicated decision boundary that overfits the training noise.

Remedy: a soft formulation that allows small violations of the margin or even some classification errors.

Soft margin: introduce a margin violation ε_n ≥ 0 for each data point (x_n, y_n) and require that

  y_n(w^T x_n + b) ≥ 1 − ε_n

ε_n captures by how much (x_n, y_n) fails to be separated.

Soft-Margin SVM

We modify the hard-margin SVM to the soft-margin SVM by allowing margin violations but adding a penalty term to discourage large violations:

  min_{b,w,ε}  (1/2) w^T w + C Σ_{n=1}^N ε_n
  subject to: y_n(w^T x_n + b) ≥ 1 − ε_n  for n = 1, ..., N
              ε_n ≥ 0  for n = 1, ..., N

The meaning of C?

When C is large, we care more about violating the margin, which gets us closer to the hard-margin SVM.

When C is small, on the other hand, we care less about violating the margin.

Soft Margin Example

Soft Margin and Hard Margin

  min_{b,w,ε}  (1/2) w^T w + C Σ_{n=1}^N ε_n
  subject to: y_n(w^T x_n + b) ≥ 1 − ε_n,   ε_n ≥ 0,   ∀n

Here the (1/2) w^T w term controls the margin and the C Σ_n ε_n term is the error tolerance.

The Hinge Loss

The trade-off sounds very similar, right?

We have ε_n ≥ 0 and y_n(w^T x_n + b) ≥ 1 − ε_n, hence

  ε_n ≥ 1 − y_n(w^T x_n + b)

The SVM loss (a.k.a. hinge loss) function:

  E_SVM(b, w) = (1/N) Σ_{n=1}^N max(1 − y_n(w^T x_n + b), 0)

The soft-margin SVM can be re-written as the following optimization problem:

  min_{b,w}  E_SVM(b, w) + λ w^T w
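A sketch of this hinge-loss objective as code, matching the re-written problem above (the function names are ours):

import numpy as np


def hinge_loss(b, w, X, y):
    """E_SVM(b, w) = (1/N) sum_n max(1 - y_n (w^T x_n + b), 0)."""
    margins = y * (X @ w + b)
    return np.mean(np.maximum(1.0 - margins, 0.0))


def svm_objective(b, w, X, y, lam):
    return hinge_loss(b, w, X, y) + lam * (w @ w)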

Dual Soft-Margin SVM

  max_{α∈R^N}  Σ_{n=1}^N α_n − (1/2) Σ_{m=1}^N Σ_{n=1}^N α_n α_m y_n y_m x_n^T x_m
  subject to: Σ_{n=1}^N y_n α_n = 0,   0 ≤ α_n ≤ C, ∀n

  w* = Σ_{n=1}^N y_n α_n* x_n
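A sketch of this dual as a CVXOPT QP; it is identical to the hard-margin dual sketch earlier except for the box constraint 0 ≤ α_n ≤ C (the function name is ours):

import numpy as np
from cvxopt import matrix, solvers


def dual_soft_svm(X, y, C):
    N = X.shape[0]
    P = matrix((np.outer(y, y) * (X @ X.T)).astype(float))
    q = matrix(-np.ones((N, 1)))
    G = matrix(np.vstack([-np.eye(N), np.eye(N)]))               # -alpha <= 0 and alpha <= C
    h = matrix(np.hstack([np.zeros(N), C * np.ones(N)]).reshape(-1, 1))
    A = matrix(y.reshape(1, -1).astype(float))                   # sum_n y_n alpha_n = 0
    b = matrix(np.zeros((1, 1)))
    alpha = np.array(solvers.qp(P, q, G, h, A, b)['x']).ravel()
    w = (alpha * y) @ X                                          # w* = sum_n y_n alpha_n* x_n
    return alpha, w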

Summary of Dual SVM

Deliver a large-margin hyperplane, and in so doing it can control the effective model complexity.

Deal with high- or infinite-dimensional transforms using the kernel trick.

Express the final hypothesis g(x) using only a few support vectors, their corresponding dual variables (Lagrange multipliers), and the kernel.

Control the sensitivity to outliers and regularize the solution through setting C appropriately.

Support Vector Machine

Summary diagram: a robust classifier via maximum margin, with two designs. Design: hard margin, with a primal objective (QP solver in d variables) and a dual objective (QP solver in N variables, support vectors, kernel trick). Design: soft margin (allow training error, hinge loss), again with a primal objective (QP solver in d variables) and a dual objective (QP solver in N variables, support vectors, kernel trick).
