
Support Vector Machine and Kernel Methods

Jiayu Zhou

Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA

February 26, 2017

Which Separator Do You Pick?

Robustness to Noisy Data

Being robust to noise (measurement error) is good (remember regularization).

Thicker Cushion Means More Robustness

We call such hyperplanes fat

Two Crucial Questions

1. Can we efficiently find the fattest separating hyperplane?

2. Is a fatter hyperplane better than a thin one?

Pulling Out the Bias

Before:

  x ∈ {1} × R^d,   w ∈ R^{d+1}
  x = [1, x_1, ..., x_d]^T,   w = [w_0, w_1, ..., w_d]^T
  signal = w^T x

After:

  x ∈ R^d,   w ∈ R^d,   bias b ∈ R
  x = [x_1, ..., x_d]^T,   w = [w_1, ..., w_d]^T
  signal = w^T x + b

Separating The Data

A hyperplane h = (b, w) separates the data means:

  y_n(w^T x_n + b) > 0 for all n

By rescaling the weights and bias, we can require

  min_{n=1,...,N} y_n(w^T x_n + b) = 1

Distance to the Hyperplane

w is normal to the hyperplane (why?): for any two points x_1, x_2 on the hyperplane,

  w^T(x_2 − x_1) = w^T x_2 − w^T x_1 = −b + b = 0

Scalar projection:

  a^T b = ∥a∥ ∥b∥ cos(a, b)  ⇒  a^T b / ∥b∥ = ∥a∥ cos(a, b)

Let x⊥ be the orthogonal projection of x onto h; the distance to the hyperplane is given by the projection of x − x⊥ onto w (why?):

  dist(x, h) = (1/∥w∥) · |w^T x − w^T x⊥| = (1/∥w∥) · |w^T x + b|

Fatness of a Separating Hyperplane

For a separating hyperplane and data point (x_n, y_n),

  dist(x_n, h) = (1/∥w∥) · |w^T x_n + b| = (1/∥w∥) · |y_n(w^T x_n + b)| = (1/∥w∥) · y_n(w^T x_n + b)

Fatness = distance to the closest point:

  Fatness = min_n dist(x_n, h) = (1/∥w∥) · min_n y_n(w^T x_n + b) = 1/∥w∥

Maximizing the Margin

Formal definition of margin:

  margin: γ(h) = 1/∥w∥

NOTE: the bias b does not appear in the margin.

Objective maximizing the margin:

  min_{b,w}  (1/2) w^T w
  subject to: min_{n=1,...,N} y_n(w^T x_n + b) = 1

An equivalent objective:

  min_{b,w}  (1/2) w^T w
  subject to: y_n(w^T x_n + b) ≥ 1 for n = 1, ..., N

Example - Our Toy Data Set

  min_{b,w}  (1/2) w^T w
  subject to: y_n(w^T x_n + b) ≥ 1 for n = 1, ..., N

Training data:

  X = [ 0 0
        2 2
        2 0
        3 0 ],   y = [ −1, −1, +1, +1 ]^T

What is the margin?

Example - Our Toy Data Set

  min_{b,w}  (1/2) w^T w
  subject to: y_n(w^T x_n + b) ≥ 1 for n = 1, ..., N

  X = [ 0 0
        2 2
        2 0
        3 0 ],   y = [ −1, −1, +1, +1 ]^T

The constraints become:

  (1): −b ≥ 1
  (2): −(2w_1 + 2w_2 + b) ≥ 1
  (3): 2w_1 + b ≥ 1
  (4): 3w_1 + b ≥ 1

  (1) + (3) → w_1 ≥ 1
  (2) + (3) → w_2 ≤ −1

  ⇒ (1/2) w^T w = (1/2)(w_1^2 + w_2^2) ≥ 1

Thus: w_1 = 1, w_2 = −1, b = −1.

Example - Our Toy Data Set

Given data X = [ 0 0
                 2 2
                 2 0
                 3 0 ]

Optimal solution:

  w* = [ w_1 = 1, w_2 = −1 ]^T,   b* = −1

Optimal hyperplane:

  g(x) = sign(x_1 − x_2 − 1)

Margin:

  1/∥w∥ = 1/√2 ≈ 0.707

For data points (1), (2) and (3), y_n(x_n^T w* + b*) = 1: these are the support vectors.

Solver: Quadratic Programming

  min_{u∈R^q}  (1/2) u^T Q u + p^T u
  subject to: A u ≥ c

  u* ← QP(Q, p, A, c)

(Q = 0 gives linear programming.)

http://cvxopt.org/examples/tutorial/qp.html
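The slide's QP uses the form A u ≥ c, while CVXOPT's solvers.qp expects inequalities as G u ≤ h, so a thin wrapper is convenient. Below is a minimal sketch in Python, assuming NumPy and the cvxopt package linked above; the helper name solve_qp is ours, not part of the slides.

import numpy as np
from cvxopt import matrix, solvers


def solve_qp(Q, p, A, c):
    """Solve min_u (1/2) u^T Q u + p^T u  subject to  A u >= c."""
    # CVXOPT wants G u <= h, so A u >= c is passed as -A u <= -c.
    G = matrix(-A.astype(float))
    h = matrix(-c.astype(float).reshape(-1, 1))
    P = matrix(Q.astype(float))
    q = matrix(p.astype(float).reshape(-1, 1))
    sol = solvers.qp(P, q, G, h)
    return np.array(sol['x']).ravel()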

Maximum Margin Hyperplane is QP

SVM primal problem:

  min_{b,w}  (1/2) w^T w
  subject to: y_n(w^T x_n + b) ≥ 1, ∀n

Generic QP:

  min_{u∈R^q}  (1/2) u^T Q u + p^T u
  subject to: A u ≥ c

Take u = [b; w] ∈ R^{d+1}. Then

  (1/2) w^T w = (1/2) [b, w^T] [ 0   0_d^T
                                 0_d I_d   ] [b; w] = (1/2) u^T [ 0   0_d^T
                                                                  0_d I_d   ] u

so

  Q = [ 0   0_d^T
        0_d I_d   ],   p = 0_{d+1}

Each constraint y_n(w^T x_n + b) ≥ 1 becomes [y_n, y_n x_n^T] u ≥ 1, so stacking all N constraints:

  A = [ y_1  y_1 x_1^T
        ...
        y_N  y_N x_N^T ],   c = [ 1, ..., 1 ]^T

Back To Our Example

Exercise:

  X = [ 0 0
        2 2
        2 0
        3 0 ],   y = [ −1, −1, +1, +1 ]^T

  (1): −b ≥ 1
  (2): −(2w_1 + 2w_2 + b) ≥ 1
  (3): 2w_1 + b ≥ 1
  (4): 3w_1 + b ≥ 1

Show the corresponding Q, p, A, c:

  Q = [ 0 0 0
        0 1 0
        0 0 1 ],   p = [ 0, 0, 0 ]^T,   A = [ −1  0  0
                                              −1 −2 −2
                                               1  2  0
                                               1  3  0 ],   c = [ 1, 1, 1, 1 ]^T

Use your QP solver to get

  u* = [b*, w_1*, w_2*]^T = [−1, 1, −1]^T
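A quick numerical check of this exercise, continuing the hypothetical solve_qp sketch from the quadratic-programming slide (so it inherits that sketch's assumptions):

import numpy as np

# Plug the Q, p, A, c above into the solve_qp sketch; the solver should
# recover u* = [b*, w1*, w2*] close to [-1, 1, -1].
Q = np.diag([0.0, 1.0, 1.0])          # no penalty on the bias b
p = np.zeros(3)
A = np.array([[-1.0,  0.0,  0.0],     # (1): -b               >= 1
              [-1.0, -2.0, -2.0],     # (2): -(2w1 + 2w2 + b) >= 1
              [ 1.0,  2.0,  0.0],     # (3):  2w1 + b         >= 1
              [ 1.0,  3.0,  0.0]])    # (4):  3w1 + b         >= 1
c = np.ones(4)

u_star = solve_qp(Q, p, A, c)
print(np.round(u_star, 3))            # expected: [-1.  1. -1.]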

Primal QP algorithm for linear-SVM

1. Let p = 0_{d+1} be the (d+1)-vector of zeros and c = 1_N the N-vector of ones. Construct matrices Q and A, where

  Q = [ 0   0_d^T
        0_d I_d   ],   A = [ y_1  y_1 x_1^T
                             ...
                             y_N  y_N x_N^T ]

2. Return [b*; w*] = u* ← QP(Q, p, A, c).

3. The final hypothesis is g(x) = sign(x^T w* + b*).
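A minimal sketch of this recipe as Python code, again reusing the hypothetical solve_qp helper; X is the N × d data matrix and y the vector of ±1 labels:

import numpy as np


def linear_svm_primal(X, y):
    """Hard-margin linear SVM via the primal QP recipe above."""
    N, d = X.shape
    Q = np.zeros((d + 1, d + 1))
    Q[1:, 1:] = np.eye(d)                                    # Q = [[0, 0^T], [0, I_d]]
    p = np.zeros(d + 1)
    A = np.hstack([y.reshape(-1, 1), y.reshape(-1, 1) * X])  # rows [y_n, y_n x_n^T]
    c = np.ones(N)
    u = solve_qp(Q, p, A, c)                                 # u* = [b*, w*]
    return u[0], u[1:]


def predict(X, b, w):
    return np.sign(X @ w + b)                                # g(x) = sign(x^T w* + b*)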

Link to Regularization

  min_w  E_in(w)
  subject to: w^T w ≤ C

                optimal hyperplane      regularization
  minimize      w^T w                   E_in
  subject to    E_in = 0                w^T w ≤ C

How to Handle Non-Separable Data?

(a) Tolerate noisy data points: soft-margin SVM.

(b) Inherent nonlinear boundary: non-linear transformation.

Non-Linear Transformation

  Φ_1(x) = (x_1, x_2)

  Φ_2(x) = (x_1, x_2, x_1^2, x_1 x_2, x_2^2)

  Φ_3(x) = (x_1, x_2, x_1^2, x_1 x_2, x_2^2, x_1^3, x_1^2 x_2, x_1 x_2^2, x_2^3)
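For concreteness, a small sketch of the 2-D feature maps Φ_2 and Φ_3 listed above (the function names are ours):

import numpy as np


def phi2(x):
    x1, x2 = x
    return np.array([x1, x2, x1**2, x1*x2, x2**2])           # d~ = 5 features


def phi3(x):
    x1, x2 = x
    return np.array([x1, x2, x1**2, x1*x2, x2**2,
                     x1**3, x1**2*x2, x1*x2**2, x2**3])      # d~ = 9 features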

Non-Linear Transformation

Use the optimal hyperplane together with a nonlinear transform Φ: R^d → R^d̃:

  z_n = Φ(x_n)

Solve the hard-margin SVM in the Z-space to obtain (b̃*, w̃*):

  min_{b̃,w̃}  (1/2) w̃^T w̃
  subject to: y_n(w̃^T z_n + b̃) ≥ 1, ∀n

Final hypothesis:

  g(x) = sign(w̃*^T Φ(x) + b̃*)

SVM and non-linear transformation

The margin is shaded in yellow, and the support vectors are boxed.

For Φ_2, d̃_2 = 5, and for Φ_3, d̃_3 = 9. d̃_3 is nearly double d̃_2, yet the resulting SVM separator is not severely overfitting with Φ_3 (regularization?).

Support Vector Machine Summary

A very powerful, easy-to-use linear model which comes with automatic regularization.

Fully exploiting the SVM: kernels.

Potential robustness to overfitting even after transforming to a much higher dimension. How about infinite-dimensional transforms? The kernel trick.

SVM Dual: Formulation

Primal and dual in optimization.

The dual view of SVM enables us to exploit the kernel trick.

In the primal SVM problem we solve for w ∈ R^d and b, while in the dual problem we solve for α ∈ R^N:

  max_{α∈R^N}  Σ_{n=1}^N α_n − (1/2) Σ_{m=1}^N Σ_{n=1}^N α_n α_m y_n y_m x_n^T x_m
  subject to: Σ_{n=1}^N y_n α_n = 0,   α_n ≥ 0, ∀n

which is also a QP problem.

SVM Dual: Prediction

We can obtain the primal solution:

  w* = Σ_{n=1}^N y_n α_n* x_n

where the support vectors are the points with α_n > 0.

The optimal hypothesis:

  g(x) = sign(w*^T x + b*)
       = sign( Σ_{n=1}^N y_n α_n* x_n^T x + b* )
       = sign( Σ_{α_n*>0} y_n α_n* x_n^T x + b* )

Dual SVM: Summary

  max_{α∈R^N}  Σ_{n=1}^N α_n − (1/2) Σ_{m=1}^N Σ_{n=1}^N α_n α_m y_n y_m x_n^T x_m
  subject to: Σ_{n=1}^N y_n α_n = 0,   α_n ≥ 0, ∀n

  w* = Σ_{n=1}^N y_n α_n* x_n
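A minimal sketch of solving this dual with CVXOPT, assuming hard-margin (linearly separable) data: the maximization is rewritten as minimizing (1/2) α^T P α − 1^T α with P_nm = y_n y_m x_n^T x_m, and b* is recovered from a support vector. The function name and tolerance are ours.

import numpy as np
from cvxopt import matrix, solvers


def dual_svm(X, y, sv_tol=1e-6):
    N = X.shape[0]
    P = matrix((np.outer(y, y) * (X @ X.T)).astype(float))
    q = matrix(-np.ones((N, 1)))
    G = matrix(-np.eye(N))                       # -alpha_n <= 0
    h = matrix(np.zeros((N, 1)))
    A = matrix(y.reshape(1, -1).astype(float))   # sum_n y_n alpha_n = 0
    b = matrix(np.zeros((1, 1)))
    alpha = np.array(solvers.qp(P, q, G, h, A, b)['x']).ravel()

    w = (alpha * y) @ X                          # w* = sum_n y_n alpha_n* x_n
    sv = alpha > sv_tol                          # support vectors: alpha_n > 0
    b_star = np.mean(y[sv] - X[sv] @ w)          # b* from the support vectors
    return alpha, w, b_star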

Common SVM Basis Functions

zk = polynomial terms of xk of degree 1 to q

zk = radial basis function of xk

z_k(j) = φ_j(x_k) = exp(−∥x_k − c_j∥^2 / σ^2)

zk = sigmoid functions of xk

Quadratic Basis Functions

  Φ(x) = ( 1,
           √2 x_1, ..., √2 x_d,
           x_1^2, ..., x_d^2,
           √2 x_1 x_2, ..., √2 x_1 x_d, √2 x_2 x_3, ..., √2 x_{d−1} x_d )^T

including the constant term, the linear terms, the pure quadratic terms, and the quadratic cross-terms.

The number of terms is approximately d^2/2.

You may be wondering what those √2's are doing. You'll find out why they're there soon.
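A sketch of this explicit quadratic feature map in Python, keeping the √2 factors exactly as above (the function name is ours):

import numpy as np


def phi_quadratic(x):
    x = np.asarray(x, dtype=float)
    d = len(x)
    features = [1.0]                                        # constant term
    features += list(np.sqrt(2.0) * x)                      # linear terms
    features += list(x ** 2)                                # pure quadratic terms
    features += [np.sqrt(2.0) * x[i] * x[j]                 # quadratic cross-terms
                 for i in range(d) for j in range(i + 1, d)]
    return np.array(features)                               # roughly d^2/2 entries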

Dual SVM: Non-linear Transformation

  max_{α∈R^N}  Σ_{n=1}^N α_n − (1/2) Σ_{m=1}^N Σ_{n=1}^N α_n α_m y_n y_m Φ(x_n)^T Φ(x_m)
  subject to: Σ_{n=1}^N y_n α_n = 0,   α_n ≥ 0, ∀n

  w* = Σ_{n=1}^N y_n α_n* Φ(x_n)

We need to prepare the matrix Q with Q_nm = y_n y_m Φ(x_n)^T Φ(x_m).

Cost? We must do N^2/2 dot products to get this matrix ready. With the quadratic transform, each dot product requires about d^2/2 additions and multiplications, so the whole thing costs about N^2 d^2/4.

Quadratic Dot Products

  Φ(a)^T Φ(b) = ( 1, √2 a_1, ..., √2 a_d, a_1^2, ..., a_d^2, √2 a_1 a_2, ..., √2 a_1 a_d, √2 a_2 a_3, ..., √2 a_{d−1} a_d )^T
                · ( 1, √2 b_1, ..., √2 b_d, b_1^2, ..., b_d^2, √2 b_1 b_2, ..., √2 b_1 b_d, √2 b_2 b_3, ..., √2 b_{d−1} b_d )

Term by term:

  Constant term:           1
  Linear terms:            Σ_{i=1}^d 2 a_i b_i
  Pure quadratic terms:    Σ_{i=1}^d a_i^2 b_i^2
  Quadratic cross-terms:   Σ_{i=1}^d Σ_{j=i+1}^d 2 a_i a_j b_i b_j

Quadratic Dot Product

Does Φ(a)^T Φ(b) look familiar?

  Φ(a)^T Φ(b) = 1 + 2 Σ_{i=1}^d a_i b_i + Σ_{i=1}^d a_i^2 b_i^2 + Σ_{i=1}^d Σ_{j=i+1}^d 2 a_i a_j b_i b_j

Try this: (a^T b + 1)^2

  (a^T b + 1)^2 = (a^T b)^2 + 2 a^T b + 1
                = ( Σ_{i=1}^d a_i b_i )^2 + 2 Σ_{i=1}^d a_i b_i + 1
                = Σ_{i=1}^d Σ_{j=1}^d a_i b_i a_j b_j + 2 Σ_{i=1}^d a_i b_i + 1
                = Σ_{i=1}^d a_i^2 b_i^2 + 2 Σ_{i=1}^d Σ_{j=i+1}^d a_i a_j b_i b_j + 2 Σ_{i=1}^d a_i b_i + 1

They’re the same! And this is only O(d) to compute!
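A quick numeric spot-check of the identity, reusing the phi_quadratic sketch from the quadratic-basis slide:

import numpy as np

# The O(d^2) explicit dot product should match the O(d) kernel evaluation.
rng = np.random.default_rng(0)
a, b = rng.normal(size=5), rng.normal(size=5)
print(np.isclose(phi_quadratic(a) @ phi_quadratic(b), (a @ b + 1) ** 2))   # True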

Dual SVM: Non-linear Transformation

  max_{α∈R^N}  Σ_{n=1}^N α_n − (1/2) Σ_{m=1}^N Σ_{n=1}^N α_n α_m y_n y_m Φ(x_n)^T Φ(x_m)
  subject to: Σ_{n=1}^N y_n α_n = 0,   α_n ≥ 0, ∀n

  w* = Σ_{n=1}^N y_n α_n* Φ(x_n)

We still need to prepare the matrix Q with Q_nm = y_n y_m Φ(x_n)^T Φ(x_m).

Cost? We must do N^2/2 dot products to get this matrix ready, but each dot product now requires only about d additions and multiplications.

Higher Order Polynomials

Explicit transform Φ(x):

  Φ(x)        Number of terms   Cost          For d = 100
  Quadratic   d^2/2             d^2 N^2/4     2.5k N^2
  Cubic       d^3/6             d^3 N^2/12    83k N^2
  Quartic     d^4/24            d^4 N^2/48    1.96m N^2

Kernel Φ(a)^T Φ(b):

  Kernel                        Cost          For d = 100
  Quadratic   (a^T b + 1)^2     d N^2/2       50 N^2
  Cubic       (a^T b + 1)^3     d N^2/2       50 N^2
  Quartic     (a^T b + 1)^4     d N^2/2       50 N^2

Dual SVM with Quintic basis functions

  max_{α∈R^N}  Σ_{n=1}^N α_n − (1/2) Σ_{m=1}^N Σ_{n=1}^N α_n α_m y_n y_m Φ(x_n)^T Φ(x_m),   where Φ(x_n)^T Φ(x_m) = (x_n^T x_m + 1)^5
  subject to: Σ_{n=1}^N y_n α_n = 0,   α_n ≥ 0, ∀n

Classification:

  g(x) = sign(w*^T Φ(x) + b*) = sign( Σ_{α_n*>0} y_n α_n* Φ(x_n)^T Φ(x) + b* )
       = sign( Σ_{α_n*>0} y_n α_n* (x_n^T x + 1)^5 + b* )

Dual SVM with general kernel functions

  max_{α∈R^N}  Σ_{n=1}^N α_n − (1/2) Σ_{m=1}^N Σ_{n=1}^N α_n α_m y_n y_m K(x_n, x_m)
  subject to: Σ_{n=1}^N y_n α_n = 0,   α_n ≥ 0, ∀n

Classification:

  g(x) = sign(w*^T Φ(x) + b*) = sign( Σ_{α_n*>0} y_n α_n* Φ(x_n)^T Φ(x) + b* )
       = sign( Σ_{α_n*>0} y_n α_n* K(x_n, x) + b* )
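A sketch of what prediction with a general kernel looks like in code; only the support vectors, their labels and multipliers, b*, and the kernel K are needed at test time (all names are ours):

import numpy as np


def svm_predict(x, sv_X, sv_y, sv_alpha, b_star, kernel):
    """g(x) = sign( sum over support vectors of y_n alpha_n K(x_n, x) + b* )."""
    s = sum(a * y * kernel(xn, x) for a, y, xn in zip(sv_alpha, sv_y, sv_X))
    return np.sign(s + b_star)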

Kernel Tricks

Replace the dot product with a kernel function.

Not all functions are kernel functions! A kernel needs to be decomposable as K(a, b) = Φ(a)^T Φ(b).
Could K(a, b) = (a − b)^3 be a kernel function?
Could K(a, b) = (a − b)^4 − (a + b)^2 be a kernel function?

Mercer's condition

To expand a kernel function K(a, b) into a dot product, i.e., K(a, b) = Φ(a)^T Φ(b), K(a, b) has to be a positive semi-definite function: the kernel matrix K is symmetric PSD for any given x_1, ..., x_N.
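A simple sanity check in the spirit of Mercer's condition: evaluate the kernel matrix on a sample and test whether it is symmetric PSD. One sample can only refute a candidate kernel, never prove it valid; the helper name and tolerance are ours.

import numpy as np


def kernel_matrix_is_psd(kernel, X, tol=1e-8):
    K = np.array([[kernel(a, b) for b in X] for a in X])
    eigvals = np.linalg.eigvalsh((K + K.T) / 2.0)    # symmetrize, then eigenvalues
    return bool(np.all(eigvals >= -tol))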

Kernel Design: expression kernel

mRNA expression data:

  Each matrix entry is an mRNA expression measurement.
  Each column is an experiment.
  Each row corresponds to a gene.

Are two expression profiles similar or dissimilar?

Kernel:

  K(x, y) = ( Σ_i x_i y_i ) / ( √(Σ_i x_i x_i) · √(Σ_i y_i y_i) )
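This expression kernel is just the cosine similarity of two expression profiles; a one-line sketch:

import numpy as np


def expression_kernel(x, y):
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))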

Kernel Design: sequence kernel

Work with non-vectorial data

Scalar product on a pair of variable-length, discrete strings?

>ICYA_MANSE
GDIFYPGYCPDVKPVNDFDLSAFAGAWHEIAKLPLENENQGKCTIAEYKYDGKKASVYNSFVSNGVKEYMEGDLEIAPDAKYTKQGKYVMTFKFGQRVVNLVPWVLATDYKNYAINYMENSHPDKKAHSIHAWILSKSKVLEGNTKEVVDNVLKTFSHLIDASKFISNDFSEAACQYSTTYSLTGPDRH

>LACB_BOVIN
MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTKIPAVFKIDALNENKVLVLDTDYKKYLLFCMENSAEPEQSLACQCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI

Commonly Used SVM Kernel Functions

K(a, b) = (α · a^T b + β)^Q is an example of an SVM kernel function.

Beyond polynomials there are other very high dimensional basis functions that can be made practical by finding the right kernel function:

Radial-basis-style kernel (RBF) / Gaussian kernel function:

  K(a, b) = exp(−γ ∥a − b∥^2)

Sigmoid functions
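Sketches of these two kernel families in Python; α, β, Q, and γ are user-chosen hyperparameters (the function names and defaults are ours):

import numpy as np


def polynomial_kernel(a, b, alpha=1.0, beta=1.0, Q=2):
    return (alpha * (a @ b) + beta) ** Q


def gaussian_kernel(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((a - b) ** 2))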

2nd Order Polynomial Kernel

K(a, b) = (α · a^T b + β)^2

Gaussian Kernels

  K(a, b) = exp(−γ ∥a − b∥^2)

When γ is large, we clearly see that even the protection of a large margin cannot suppress overfitting. However, for a reasonably small γ, the sophisticated boundary discovered by the SVM with the Gaussian-RBF kernel looks quite good.

Gaussian Kernels

For (a), a noisy data set where a linear classifier appears to work quite well, (b) using the Gaussian-RBF kernel with the hard-margin SVM leads to overfitting.

From hard-margin to soft-margin

When there are outliers, the hard-margin SVM with a Gaussian-RBF kernel results in an unnecessarily complicated decision boundary that overfits the training noise.

Remedy: a soft formulation that allows small violations of the margin or even some classification errors.

Soft margin: introduce a margin violation ε_n ≥ 0 for each data point (x_n, y_n) and require that

  y_n(w^T x_n + b) ≥ 1 − ε_n

ε_n captures by how much (x_n, y_n) fails to be separated.

Soft-Margin SVM

We modify the hard-margin SVM to the soft-margin SVM by allowing margin violations but adding a penalty term to discourage large violations:

  min_{b,w,ε}  (1/2) w^T w + C Σ_{n=1}^N ε_n
  subject to: y_n(w^T x_n + b) ≥ 1 − ε_n  for n = 1, ..., N
              ε_n ≥ 0  for n = 1, ..., N

The meaning of C?

When C is large, we care more about violating the margin, which gets us closer to the hard-margin SVM.

When C is small, on the other hand, we care less about violating the margin.

Soft Margin Example

Soft Margin and Hard Margin

  min_{b,w,ε}  (1/2) w^T w + C Σ_{n=1}^N ε_n
  subject to: y_n(w^T x_n + b) ≥ 1 − ε_n,   ε_n ≥ 0,   ∀n

Here the (1/2) w^T w term controls the margin and the C Σ_n ε_n term is the error tolerance.

The Hinge Loss

The trade-off sounds very similar, right?

We have ε_n ≥ 0 and y_n(w^T x_n + b) ≥ 1 − ε_n, hence

  ε_n ≥ 1 − y_n(w^T x_n + b)

The SVM loss (a.k.a. hinge loss) function:

  E_SVM(b, w) = (1/N) Σ_{n=1}^N max(1 − y_n(w^T x_n + b), 0)

The soft-margin SVM can be re-written as the following optimization problem:

  min_{b,w}  E_SVM(b, w) + λ w^T w
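A sketch of this hinge-loss objective as code, matching the re-written problem above (the function names are ours):

import numpy as np


def hinge_loss(b, w, X, y):
    """E_SVM(b, w) = (1/N) sum_n max(1 - y_n (w^T x_n + b), 0)."""
    margins = y * (X @ w + b)
    return np.mean(np.maximum(1.0 - margins, 0.0))


def svm_objective(b, w, X, y, lam):
    return hinge_loss(b, w, X, y) + lam * (w @ w)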

Dual Soft-Margin SVM

  max_{α∈R^N}  Σ_{n=1}^N α_n − (1/2) Σ_{m=1}^N Σ_{n=1}^N α_n α_m y_n y_m x_n^T x_m
  subject to: Σ_{n=1}^N y_n α_n = 0,   0 ≤ α_n ≤ C, ∀n

  w* = Σ_{n=1}^N y_n α_n* x_n
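A sketch of this dual as a CVXOPT QP; it is identical to the hard-margin dual sketch earlier except for the box constraint 0 ≤ α_n ≤ C (the function name is ours):

import numpy as np
from cvxopt import matrix, solvers


def dual_soft_svm(X, y, C):
    N = X.shape[0]
    P = matrix((np.outer(y, y) * (X @ X.T)).astype(float))
    q = matrix(-np.ones((N, 1)))
    G = matrix(np.vstack([-np.eye(N), np.eye(N)]))               # -alpha <= 0 and alpha <= C
    h = matrix(np.hstack([np.zeros(N), C * np.ones(N)]).reshape(-1, 1))
    A = matrix(y.reshape(1, -1).astype(float))                   # sum_n y_n alpha_n = 0
    b = matrix(np.zeros((1, 1)))
    alpha = np.array(solvers.qp(P, q, G, h, A, b)['x']).ravel()
    w = (alpha * y) @ X                                          # w* = sum_n y_n alpha_n* x_n
    return alpha, w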

Summary of Dual SVM

Deliver a large-margin hyperplane, and in so doing it can control the effective model complexity.

Deal with high- or infinite-dimensional transforms using the kernel trick.

Express the final hypothesis g(x) using only a few support vectors, their corresponding dual variables (Lagrange multipliers), and the kernel.

Control the sensitivity to outliers and regularize the solution through setting C appropriately.

Support Vector Machine

Summary diagram: a robust classifier via maximum margin, with two designs. Design: hard margin, with a primal objective (QP solver in d variables) and a dual objective (QP solver in N variables, support vectors, kernel trick). Design: soft margin (allow training error, hinge loss), again with a primal objective (QP solver in d variables) and a dual objective (QP solver in N variables, support vectors, kernel trick).
