CS 540 - Fall 2015 (Shavlik©), Lecture 22, Week 10: Support Vector Machines (SVMs) – Three Key Ideas: Max Margins; Allowing Misclassified Training Examples; Kernels (for non-linear models; in next lecture)
Transcript
Page 1:

Today’s Topics

Support Vector Machines (SVMs)

Three Key Ideas

– Max Margins

– Allowing Misclassified Training Examples

– Kernels (for non-linear models; in next lecture)

Page 2:

Three Key SVM Concepts

• Maximize the Margin: don’t choose just any separating plane

• Penalize Misclassified Examples: use soft constraints and ‘slack’ variables

• Use the ‘Kernel Trick’ to get Non-Linearity: roughly like ‘hardwiring’ the input-to-hidden-unit (HU) portion of ANNs (so we only need a perceptron)


Page 3:

Support Vector Machines: Maximizing the Margin between Bounding Planes

[Figure: two parallel bounding planes with the support vectors lying on them; the margin between the planes is 2 / ||w||2]


SVMs define some inequalities we want satisfied. We then use advanced optimization methods (e.g., linear programming) to find the satisfying solutions, but in CS 540 we’ll do a simpler approximation.

Page 4:

Margins and Learning Theory

Theorems exist that connect learning (‘PAC’) theory to the size of the margin

– Basically the larger the margin, the better the expected future accuracy

– See, for example, Chapter 4 of Support Vector Machines by N. Cristianini & J. Shawe-Taylor, Cambridge University Press, 2000 (not an assigned reading)


Page 5:

‘Slack’ Variables: Dealing with Data that is not Linearly Separable

[Figure: bounding planes and support vectors; some examples fall on the wrong side of their plane]


For each wrong example, we pay a penalty, which is the distance we’d have to move it to get on the right side of the decision boundary (i.e., the separating plane)

If we deleted any/all of the non-support vectors, we’d get the same answer!

Page 6:

SVMs and Non-Linear Separating Surfaces

[Figure: ‘+’ and ‘–’ examples plotted in the original (f1, f2) space, and the same examples plotted again in a new space whose axes are g(f1, f2) and h(f1, f2)]

Non-linearly map to new space

Linearly separate in new space

Result is a non-linear separator in original space
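A minimal sketch of this idea in Python (the data, the mapping functions g and h, and the separating line below are my own illustrative choices, not from the lecture):

import numpy as np

# Points inside a circle are '+', outside are '-'; not linearly separable in (f1, f2).
f = np.array([[0.1, 0.2], [-0.2, 0.1], [1.5, 1.0], [-1.2, -1.4]])
labels = np.array([+1, +1, -1, -1])

g = f[:, 0] ** 2                 # g(f1, f2) = f1^2
h = f[:, 1] ** 2                 # h(f1, f2) = f2^2
# In (g, h) space the classes are separated by the line g + h = 1,
# which maps back to a circle (a non-linear separator) in the original space.
print(np.sign(1.0 - (g + h)))    # [ 1.  1. -1. -1.], matching the labels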


Page 7:

Math Review: Dot Products

X · Y ≡ X1Y1 + X2Y2 + … + XnYn

So if X = [4, 5, -3, 7] and Y = [9, 0, -8, 2]

Then X · Y = (4)(9) + (5)(0) + (-3)(-8) + (7)(2) = 74

(weighted sums in ANNs are dot products)
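A quick check of that arithmetic with numpy (the vectors are the ones from the slide):

import numpy as np

x = np.array([4, 5, -3, 7])
y = np.array([9, 0, -8, 2])
print(np.dot(x, y))   # 4*9 + 5*0 + (-3)*(-8) + 7*2 = 74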

Page 8:

Some Equations

W · xpos ≥ θ + 1   for all positive examples

W · xneg ≤ θ – 1   for all negative examples

(W holds the weights, x the input features, and θ is the threshold)

[Figure: ‘+’ examples on one side and ‘–’ examples on the other side of the separating plane (green line), with dashed bounding planes parallel to it]

These 1’s result from dividing through by a constant for convenience (it is the distance from the dashed lines to the green line)
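A tiny numeric illustration of the two inequalities (the weights, threshold, and example vectors below are made-up values, not from the slides):

import numpy as np

w, theta = np.array([2.0, -1.0]), 0.5
x_pos = np.array([1.5, 0.0])         # a '+' example
x_neg = np.array([-1.0, 1.0])        # a '-' example
print(w @ x_pos >= theta + 1)        # True:  w.x =  3.0 >=  1.5
print(w @ x_neg <= theta - 1)        # True:  w.x = -3.0 <= -0.5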


Page 9:

Idea #1: The Margin (derivation not on final)

Pick a point xA on the ‘+1’ bounding plane and a point xB on the ‘–1’ bounding plane, chosen so that xA – xB is perpendicular to the planes (i.e., parallel to W). Then

(i)   W · xA = θ + 1     (the green line is the set of all pts that satisfy this equation; ditto for the red line)

(ii)  W · xB = θ – 1

(iii) Subtracting (ii) from (i) gives   W · (xA – xB) = 2

(iv)  W · (xA – xB) = ||W|| ||xA – xB|| cos(angle) = ||W|| ||xA – xB||, since the angle is 0 (parallel lines)

Combining (iii) and (iv) we get   margin = ||xA – xB|| = 2 / ||W||
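A small numeric sanity check of that result (illustrative numbers only):

import numpy as np

w, theta = np.array([3.0, 4.0]), 0.0
x_a = np.array([0.12, 0.16])          # lies on w.x = theta + 1
x_b = np.array([-0.12, -0.16])        # lies on w.x = theta - 1
print(np.linalg.norm(x_a - x_b))      # 0.4
print(2 / np.linalg.norm(w))          # 2 / ||w||_2 = 0.4, matching the derivation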


Page 10:

Our Initial ‘Mathematical Program’

min ||w|| (this is the ‘1-norm’ length of the weight vector, which is the sum of the absolute values of the weights;

some SVMs use quadratic programs, but 1-norms have some preferred properties)

such that

w · xpos ≥ θ + 1 // for ‘+’ ex’s

w · xneg ≤ θ – 1 // for ‘–’ ex’s

(the minimization is over the adjustable parameters w and θ)


Page 11:

The ‘p’ Norm – Generalization of the Familiar Euclidean Distance (p=2)
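For reference, the p-norm of a vector x is ||x||p = (Σi |xi|^p)^(1/p); p=1 gives the sum of absolute values (used in our objective) and p=2 gives the familiar Euclidean length. A quick numpy check with an illustrative vector:

import numpy as np

x = np.array([4.0, 5.0, -3.0, 7.0])
print(np.sum(np.abs(x)))                                   # 1-norm: 19.0
print(np.sqrt(np.sum(x ** 2)))                             # 2-norm: ~9.95
print(np.linalg.norm(x, ord=1), np.linalg.norm(x, ord=2))  # same values via numpy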


Page 12:

Our Mathematical Program (cont.)

Note: w and θ are our adjustable parameters (we could, of course, use the ANN ‘trick’ and move θ to the left side of our inequalities and treat θ as another weight)

We can now use existing math-programming optimization s/w to find a solution to our current program (covered in CS 525)


Page 13:

Idea #2: Dealing with Non-Separable Data

• We can add what is called a ‘slack’ variable to each example

• This variable can be viewed as: 0 if the example is correctly separated, else the ‘distance’ we need to move the example to get it correct (i.e., its distance from the decision boundary)

• Note: we are NOT counting #misclassified; it would be nice to do so, but that becomes [mixed] integer programming, which is much harder
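A one-example sketch of how a slack value could be computed (the weights, threshold, and example here are made up for illustration):

import numpy as np

w, theta = np.array([2.0, -1.0]), 0.5
x = np.array([0.2, 0.3])                  # a positive example on the wrong side
s = max(0.0, (theta + 1) - w @ x)         # smallest s with  w.x + s >= theta + 1
print(s)                                  # 1.4, since w.x = 0.1 here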


Page 14:

The Math Program with Slack Vars (this is the linear-programming version; there also is a quadratic-programming version - in cs540 we won’t worry about the difference)

min ||w||1 + μ ||S||1     (minimizing over w, S, and θ)

such that

w · xposi + Si ≥ θ + 1

w · xnegj – Sj ≤ θ – 1

Sk ≥ 0

Here w has dimension = # of input features, S has dimension = # of training examples, θ is a scalar, and μ is a scaling constant (use a tuning set to select its value).


The S’s are how far we would need to move an example in order for it to be on the proper side of the decision surface


Notice we are solving the perceptron task with a complexity penalty (sum of wgts) – Hinton’s wgt decay!
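As a concrete, hedged sketch, here is one way this linear program could be set up and handed to an off-the-shelf solver; scipy’s linprog stands in for the LP software mentioned on the next slides, and the variable ordering, function name, and test data are my own choices:

import numpy as np
from scipy.optimize import linprog

def svm_1norm_lp(X_pos, X_neg, mu=1.0):
    # Variables, in order: w (f), s_pos (p), s_neg (n), theta (1), z (f), where z >= |w|.
    f = X_pos.shape[1]
    p, n = X_pos.shape[0], X_neg.shape[0]
    nvar = f + p + n + 1 + f

    # Objective: ||w||_1 + mu * ||S||_1, written as sum(z) + mu * sum(slacks).
    c = np.concatenate([np.zeros(f), mu * np.ones(p + n), [0.0], np.ones(f)])

    A_ub, b_ub = [], []
    # Positives:  w.x_i + s_i >= theta + 1   ->   -w.x_i - s_i + theta <= -1
    for i in range(p):
        row = np.zeros(nvar)
        row[:f], row[f + i], row[f + p + n] = -X_pos[i], -1.0, 1.0
        A_ub.append(row)
        b_ub.append(-1.0)
    # Negatives:  w.x_j - s_j <= theta - 1   ->   w.x_j - s_j - theta <= -1
    for j in range(n):
        row = np.zeros(nvar)
        row[:f], row[f + p + j], row[f + p + n] = X_neg[j], -1.0, -1.0
        A_ub.append(row)
        b_ub.append(-1.0)
    # z >= w  and  z >= -w  (so z bounds |w| and sum(z) acts as ||w||_1)
    for k in range(f):
        row = np.zeros(nvar)
        row[k], row[f + p + n + 1 + k] = 1.0, -1.0
        A_ub.append(row)
        b_ub.append(0.0)
        row = np.zeros(nvar)
        row[k], row[f + p + n + 1 + k] = -1.0, -1.0
        A_ub.append(row)
        b_ub.append(0.0)

    # w and theta are free; slacks and z are non-negative.
    bounds = [(None, None)] * f + [(0, None)] * (p + n) + [(None, None)] + [(0, None)] * f
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=bounds)
    return res.x[:f], res.x[f + p + n]          # w, theta

# Tiny made-up data set, just to exercise the function:
w, theta = svm_1norm_lp(np.array([[2.0, 2.0], [3.0, 1.0]]),
                        np.array([[-1.0, -2.0], [-2.0, 0.0]]))
print(w, theta)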

Page 15:

Slacks and Separability

• If training data is separable, will all Si = 0 ?

• Not necessarily!

– Might get a larger margin by misclassifying a few examples (just like in d-tree pruning)

– This can also happen when using gradient-descent to minimize an ANN’s cost function


Page 16:

Brief Intro to Linear Programs (LP’s) - not on final

• We need to convert our task into A z ≥ b, which is the basic form of an LP (A is a constant matrix, b is a constant vector, z is a vector of variables)

• Note: we can convert inequalities containing ≤ into ones using ≥ by multiplying both sides by -1; e.g., 5x ≤ 15 is the same as -5x ≥ -15

• Can also handle = (i.e., equalities); we could use ≥ and ≤ together to get =, but more efficient methods exist


Page 17:

Brief Intro to Linear Programs (cont.) - not on final

In addition, we want to

min c · z under the linear A z ≥ b constraints

Vector c says how to penalize the settings of the variables in vector z. Highly optimized s/w for solving LPs exists (e.g., CPLEX, COINS [free])


In the slide’s figure, the yellow region contains the points that satisfy the constraints; the dotted lines are iso-cost lines
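A minimal sketch of solving such an LP with scipy (the objective and constraints are toy values I made up; linprog expects ≤ constraints, so the ≥ constraints are flipped with the multiply-by-minus-one trick from the previous slide):

from scipy.optimize import linprog

# minimize c.z  subject to  A z >= b  (and z >= 0, linprog's default bounds)
c = [1.0, 2.0]
A = [[1.0, 1.0],    # z1 + z2 >= 4
     [1.0, 0.0]]    # z1      >= 1
b = [4.0, 1.0]
res = linprog(c, A_ub=[[-a for a in row] for row in A], b_ub=[-v for v in b])
print(res.x)        # a corner of the feasible (yellow) region, here [4, 0]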


Page 18:

Review: Matrix Multiplication

From (code also there): http://www.cedricnugteren.nl/tutorial.php?page=2

A B = C

Matrix A is M by K

Matrix B is K by N

Matrix C is M by N (the inner dimension K must match for A B to be defined)
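A quick numpy illustration of the shapes (values are arbitrary):

import numpy as np

M, K, N = 2, 3, 4
A = np.ones((M, K))        # M x K
B = np.ones((K, N))        # K x N
C = A @ B                  # the inner dimension K must match
print(C.shape)             # (2, 4), i.e., M x N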

Page 19:

Aside: Our SVM as an LP (not on final)

Let Apos = our positive training examples and Aneg = our negative training examples (assume 50% pos and 50% neg for notational simplicity). With f = # of input features and e = # of training examples, the variables are stacked as [ W; Spos; Sneg; θ; Z ], and the constraints A z ≥ b form the block matrix below. The 1’s on the diagonal blocks are identity matrices (often written as I), the ±1 entries in the θ column are column vectors of ones, and the column-block widths are | f | e/2 | e/2 | 1 | f |.

[  Apos    I    0   -1    0 ]               [ 1 ]    (e/2 rows:  Apos W + Spos ≥ θ + 1)
[ -Aneg    0    I    1    0 ]    [  W   ]   [ 1 ]    (e/2 rows:  Aneg W – Sneg ≤ θ – 1)
[   0      I    0    0    0 ]    [ Spos ]   [ 0 ]    (e/2 rows:  Spos ≥ 0)
[   0      0    I    0    0 ]    [ Sneg ] ≥ [ 0 ]    (e/2 rows:  Sneg ≥ 0)
[  -I      0    0    0    I ]    [  θ   ]   [ 0 ]    (f rows:   Z ≥ W)
[   I      0    0    0    I ]    [  Z   ]   [ 0 ]    (f rows:   Z ≥ –W)

Page 20:

Our C Vector (determines the cost we’re minimizing, also not on final)

min  [ 0  μ  0  1 ] · [ W; S; θ; Z ]  =  min  μ ● S + 1 ● Z

= min  μ ||S||1 + ||W||1, since all S are non-negative and the Z’s ‘squeeze’ the W’s

Note here: S = Spos concatenated with Sneg

Note we min the Z’s, not the W’s, since only the Z’s are ≥ 0

Aside: we could also penalize θ (but would need to add more variables since θ can be negative)

Page 21:

Where We are so Far

• We have an ‘objective’ function that we can optimize by Linear Programming

– min ||w||1 + μ ||S||1 subject to some constraints

– Free LP solvers exist

– CS 525 teaches Linear Programming

• We could also use gradient descent

– Perceptron learning with ‘weight decay’ is quite similar, though it uses SQUARED wgts and SQUARED error (the S is this error)
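A rough sketch of that gradient-descent alternative (the exact loss here, squared ‘slack’ error plus squared weight decay, as well as the learning rate and variable names, are my own illustrative choices):

import numpy as np

def svm_gradient_descent(X, y, lam=0.01, lr=0.01, epochs=200):
    # y must be +1/-1.  err_i = max(0, 1 - y_i (w.x_i - theta)) plays the role of
    # the slack S; the loss minimized is  sum(err^2) + lam * ||w||^2.
    w, theta = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        err = np.maximum(0.0, 1.0 - y * (X @ w - theta))
        grad_w = -2.0 * (err * y) @ X + 2.0 * lam * w     # data term + weight decay
        grad_theta = 2.0 * np.sum(err * y)
        w -= lr * grad_w
        theta -= lr * grad_theta
    return w, theta

# Tiny made-up data set, just to exercise the function:
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
print(svm_gradient_descent(X, y))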


