
Learning Maximum-Margin Hyperplanes: Support Vector Machines

Piyush Rai

Machine Learning (CS771A)

Aug 24, 2016


Perceptron and (Lack of) Margins

Perceptron learns a hyperplane (one of many possible) that separates the classes

Standard Perceptron doesn’t guarantee any “margin” around the hyperplane

Note: Possible to “artificially” introduce a margin in the Perceptron

Simply change the Perceptron mistake condition to

$y_n(w^\top x_n + b) \le \gamma$

where $\gamma > 0$ is a pre-specified margin. For the standard Perceptron, $\gamma = 0$
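A minimal sketch of this modification (not from the slides), assuming a toy NumPy dataset X with labels y in {-1, +1} and a unit learning rate; the only change from the standard Perceptron is the gamma in the mistake test:

```python
import numpy as np

def margin_perceptron(X, y, gamma=1.0, epochs=100):
    """Perceptron with an 'artificial' margin: update whenever y_n (w^T x_n + b) <= gamma."""
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(epochs):
        mistakes = 0
        for n in range(N):
            if y[n] * (w @ X[n] + b) <= gamma:   # gamma = 0 recovers the standard Perceptron
                w += y[n] * X[n]
                b += y[n]
                mistakes += 1
        if mistakes == 0:                        # no margin violations left
            break
    return w, b
```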

Support Vector Machine (SVM) offers a more principled way of doing this by learning the maximum-margin hyperplane


Support Vector Machine (SVM)

Learns a hyperplane such that the positive and negative class training examples are as far away as possible from it (ensures good generalization)

SVMs can also learn nonlinear decision boundaries using kernels (though the idea of kernels is not specific to SVMs and is more generally applicable)

Reason behind the name “Support Vector Machine”? SVM finds the most important examples (called “support vectors”) in the training data

These examples also “balance” the margin boundaries (hence called “support”). Also, even if we throw away the remaining training data and re-learn the SVM classifier, we’ll get the same hyperplane
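A quick way to check this claim empirically: a hedged sketch using scikit-learn's linear SVM with a very large C (which approximates the hard-margin SVM introduced below) on a small hypothetical dataset, refitting on the support vectors alone:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.5, 3.5], [-2.0, -2.0], [-3.0, -1.0], [-1.0, -4.0]])
y = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)              # very large C ~ hard margin
sv = clf.support_                                         # indices of the support vectors
clf_sv = SVC(kernel="linear", C=1e6).fit(X[sv], y[sv])    # re-learn on support vectors only

# Both should print True (up to numerical tolerance): same hyperplane
print(np.allclose(clf.coef_, clf_sv.coef_), np.allclose(clf.intercept_, clf_sv.intercept_))
```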


Learning a Maximum Margin Hyperplane

Suppose there exists a hyperplane $w^\top x + b = 0$ such that

$w^\top x_n + b \ge 1$ for $y_n = +1$

$w^\top x_n + b \le -1$ for $y_n = -1$

Equivalently, $y_n(w^\top x_n + b) \ge 1 \;\; \forall n$ (the margin condition)

Also note that $\min_{1\le n\le N} |w^\top x_n + b| = 1$

Thus the margin on each side: $\gamma = \dfrac{\min_{1\le n\le N}|w^\top x_n + b|}{\|w\|} = \dfrac{1}{\|w\|}$

Total margin $= 2\gamma = \dfrac{2}{\|w\|}$

Want the hyperplane $(w, b)$ to have the largest possible margin
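A tiny numeric sanity check of these formulas (not from the slides), with a hypothetical weight vector and three points, two of which sit exactly on the canonical margin boundaries:

```python
import numpy as np

# Hypothetical hyperplane, already in canonical form: min_n |w^T x_n + b| = 1
w = np.array([3.0, 4.0])        # ||w|| = 5
b = -1.0
X = np.array([[0.0, 0.5],       # w^T x + b = +1  (on the positive margin boundary)
              [0.0, 0.0],       # w^T x + b = -1  (on the negative margin boundary)
              [2.0, 1.0]])      # w^T x + b = +9  (well inside the positive side)

gamma = np.min(np.abs(X @ w + b)) / np.linalg.norm(w)
print(gamma, 2 * gamma)         # 0.2 and 0.4, i.e. 1/||w|| and 2/||w||
```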


Large Margin = Good Generalization

Large margins intuitively mean good generalization

We saw that margin $\gamma \propto \dfrac{1}{\|w\|}$

Large margin $\Rightarrow$ small $\|w\|$, i.e., small $\ell_2$ norm of $w$

Small $\|w\|$ $\Rightarrow$ regularized/simple solutions (the $w_i$'s don't become too large)

Recall our discussion of regularization...

Simple solutions ⇒ good generalization on test data

Want to see an even more formal justification? :-)

Wait until we cover Learning Theory!


Hard-Margin SVM

Every training example has to fulfil the margin condition $y_n(w^\top x_n + b) \ge 1$

Also want to maximize the margin $\gamma \propto \dfrac{1}{\|w\|}$

Equivalent to minimizing $\|w\|^2$ or $\dfrac{\|w\|^2}{2}$

The objective for hard-margin SVM

$$\min_{w,b}\; f(w,b) = \frac{\|w\|^2}{2} \quad \text{subject to } y_n(w^\top x_n + b) \ge 1,\; n = 1,\dots,N$$

Thus the hard-margin SVM minimizes a convex objective function which is a Quadratic Program (QP) with N linear inequality constraints
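For illustration (this is not the dual-based solver derived later in the lecture), here is a minimal sketch that hands this primal QP to a generic constrained optimizer; the dataset and every name in it are assumptions for the example:

```python
import numpy as np
from scipy.optimize import minimize

def hard_margin_svm_primal(X, y):
    """Solve min_{w,b} ||w||^2 / 2  s.t.  y_n (w^T x_n + b) >= 1  with a generic solver."""
    N, D = X.shape
    z0 = np.zeros(D + 1)                                   # optimization variable z = [w, b]
    objective = lambda z: 0.5 * z[:D] @ z[:D]
    constraints = [{"type": "ineq",                        # SLSQP expects g(z) >= 0
                    "fun": lambda z, n=n: y[n] * (X[n] @ z[:D] + z[D]) - 1.0}
                   for n in range(N)]
    res = minimize(objective, z0, method="SLSQP", constraints=constraints)
    return res.x[:D], res.x[D]                             # (w, b)

# Hypothetical, linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([+1, +1, -1, -1])
w, b = hard_margin_svm_primal(X, y)
```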


Soft-Margin SVM (More Commonly Used)

Allow some training examples to fall within the margin region, or even be misclassified (i.e., fall on the wrong side). Preferable if the training data is noisy

Each training example $(x_n, y_n)$ is given a “slack” $\xi_n \ge 0$ (the distance by which it “violates” the margin). If $\xi_n > 1$ then $x_n$ is totally on the wrong side

Basically, we want a soft-margin condition: $y_n(w^\top x_n + b) \ge 1 - \xi_n$, $\xi_n \ge 0$


Soft-Margin SVM (More Commonly Used)

Goal: Maximize the margin, while also minimizing the sum of slacks (don't want too many training examples violating the margin condition)

The primal objective for soft-margin SVM can thus be written as

$$\min_{w,b,\xi}\; f(w,b,\xi) = \frac{\|w\|^2}{2} + C\sum_{n=1}^{N}\xi_n \quad \text{subject to } y_n(w^\top x_n + b) \ge 1 - \xi_n,\;\; \xi_n \ge 0,\;\; n = 1,\dots,N$$

Thus the soft-margin SVM also minimizes a convex objective function which is a Quadratic Program (QP) with 2N linear inequality constraints

The parameter C controls the trade-off between a large margin and a small training error
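This is the formulation behind standard linear SVM implementations; a minimal sketch with scikit-learn on a small hypothetical dataset, where the C argument is exactly this trade-off parameter and the slacks are recovered as $\xi_n = \max(0, 1 - y_n(w^\top x_n + b))$ at the learned solution:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data; the last point is a borderline +1 example
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0], [0.2, -0.2]])
y = np.array([+1, +1, -1, -1, +1])

for C in (0.1, 1.0, 100.0):        # small C: wider margin, more slack; large C: fewer violations
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w, b = clf.coef_[0], clf.intercept_[0]
    slack = np.maximum(0.0, 1.0 - y * (X @ w + b))   # xi_n at the learned (w, b)
    print(C, round(float(slack.sum()), 3), round(float(np.linalg.norm(w)), 3))
```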


Summary: Hard-Margin SVM vs Soft-Margin SVM

Objective for the hard-margin SVM (unknowns are $w$ and $b$)

$$\min_{w,b}\; f(w,b) = \frac{\|w\|^2}{2} \quad \text{subject to } y_n(w^\top x_n + b) \ge 1,\; n = 1,\dots,N$$

Objective for the soft-margin SVM (unknowns are $w$, $b$, and $\{\xi_n\}_{n=1}^{N}$)

$$\min_{w,b,\xi}\; f(w,b,\xi) = \frac{\|w\|^2}{2} + C\sum_{n=1}^{N}\xi_n \quad \text{subject to } y_n(w^\top x_n + b) \ge 1 - \xi_n,\;\; \xi_n \ge 0,\;\; n = 1,\dots,N$$

In either case, we have to solve a constrained, convex optimization problem


Brief Detour: Solving Constrained Optimization Problems


Constrained Optimization via Lagrangian

Consider optimizing the following objective, subject to some constraints

$$\min_{w}\; f(w) \quad \text{s.t. } g_n(w) \le 0,\; n = 1,\dots,N \quad \text{and} \quad h_m(w) = 0,\; m = 1,\dots,M$$

Introduce Lagrange multipliers $\alpha = \{\alpha_n\}_{n=1}^{N}$, $\alpha_n \ge 0$, and $\beta = \{\beta_m\}_{m=1}^{M}$, one for each constraint, and construct the following Lagrangian

$$L(w,\alpha,\beta) = f(w) + \sum_{n=1}^{N}\alpha_n g_n(w) + \sum_{m=1}^{M}\beta_m h_m(w)$$

Consider $L_P(w) = \max_{\alpha,\beta} L(w,\alpha,\beta)$. Note that

$L_P(w) = \infty$ if $w$ violates any of the constraints (the $g$'s or $h$'s)

$L_P(w) = f(w)$ if $w$ satisfies all the constraints (the $g$'s and $h$'s)

Thus $\min_w L_P(w) = \min_w \max_{\alpha \ge 0,\, \beta} L(w,\alpha,\beta)$ solves the same problem as the original problem and will have the same solution. For convex $f$, $g$, $h$, the order of min and max is interchangeable.

Karush-Kuhn-Tucker (KKT) Conditions: At the optimal solution, $\alpha_n g_n(w) = 0$ (note the $\max_\alpha$)
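A tiny worked example (not from the slides) to make the KKT condition concrete: minimize $f(w) = w^2$ subject to the single inequality constraint $g(w) = 1 - w \le 0$. The Lagrangian is

$$L(w,\alpha) = w^2 + \alpha(1 - w), \qquad \alpha \ge 0$$

Stationarity $\partial L/\partial w = 2w - \alpha = 0$ gives $w = \alpha/2$, and complementary slackness $\alpha(1 - w) = 0$ rules out $\alpha = 0$ (which would give the infeasible $w = 0$), so the constraint is active: $w^* = 1$, $\alpha^* = 2$, and indeed $\alpha^* g(w^*) = 0$.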


Solving Hard-Margin SVM


Solving Hard-Margin SVM

The hard-margin SVM optimization problem is:

$$\min_{w,b}\; f(w,b) = \frac{\|w\|^2}{2} \quad \text{subject to } 1 - y_n(w^\top x_n + b) \le 0,\; n = 1,\dots,N$$

A constrained optimization problem. Can solve using Lagrange's method

Introduce Lagrange multipliers $\alpha_n$ ($n = 1,\dots,N$), one for each constraint, and solve the following Lagrangian:

$$\min_{w,b}\;\max_{\alpha \ge 0}\; L(w,b,\alpha) = \frac{\|w\|^2}{2} + \sum_{n=1}^{N}\alpha_n\{1 - y_n(w^\top x_n + b)\}$$

Note: $\alpha = [\alpha_1,\dots,\alpha_N]$ is the vector of Lagrange multipliers

We will solve this Lagrangian by solving a dual problem (eliminate $w$ and $b$ and solve for the “dual variables” $\alpha$)


Solving Hard-Margin SVM

The original Lagrangian is

$$\min_{w,b}\;\max_{\alpha \ge 0}\; L(w,b,\alpha) = \frac{w^\top w}{2} + \sum_{n=1}^{N}\alpha_n\{1 - y_n(w^\top x_n + b)\}$$

Take (partial) derivatives of $L$ w.r.t. $w$, $b$ and set them to zero

$$\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{n=1}^{N}\alpha_n y_n x_n \qquad\qquad \frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{n=1}^{N}\alpha_n y_n = 0$$

Important: Note the form of the solution $w$: it is simply a weighted sum of all the training inputs $x_1,\dots,x_N$ (and $\alpha_n$ is like the “importance” of $x_n$)

Substituting $w = \sum_{n=1}^{N}\alpha_n y_n x_n$ in the Lagrangian and also using $\sum_{n=1}^{N}\alpha_n y_n = 0$

$$\max_{\alpha \ge 0}\; L_D(\alpha) = \sum_{n=1}^{N}\alpha_n - \frac{1}{2}\sum_{m,n=1}^{N}\alpha_m\alpha_n y_m y_n (x_m^\top x_n) \quad \text{s.t. } \sum_{n=1}^{N}\alpha_n y_n = 0$$
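Spelling out that substitution (a short check, not on the slide): with $w = \sum_n \alpha_n y_n x_n$,

$$\frac{w^\top w}{2} = \frac{1}{2}\sum_{m,n=1}^{N}\alpha_m\alpha_n y_m y_n (x_m^\top x_n) \qquad \text{and} \qquad \sum_{n=1}^{N}\alpha_n y_n (w^\top x_n) = \sum_{m,n=1}^{N}\alpha_m\alpha_n y_m y_n (x_m^\top x_n),$$

while the $b$ term vanishes because $b\sum_n \alpha_n y_n = 0$; so $L$ collapses to $\sum_n \alpha_n - \frac{1}{2}\sum_{m,n}\alpha_m\alpha_n y_m y_n (x_m^\top x_n)$, which is exactly the dual objective $L_D(\alpha)$ above.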


Solving Hard-Margin SVM

Can write the objective more compactly in vector/matrix form as

$$\max_{\alpha \ge 0}\; L_D(\alpha) = \alpha^\top \mathbf{1} - \frac{1}{2}\alpha^\top G\,\alpha \quad \text{s.t. } \sum_{n=1}^{N}\alpha_n y_n = 0$$

where $G$ is an $N \times N$ matrix with $G_{mn} = y_m y_n x_m^\top x_n$, and $\mathbf{1}$ is a vector of 1s

Good news: This is maximizing a concave function (equivalently, minimizing a convex function; verify that the Hessian of the minimization form is $G$, which is p.s.d.). Note that our original primal SVM objective was also convex

Important: Inputs $x$'s only appear as inner products (helps to “kernelize”)

Can solve† the above objective function for α using various methods, e.g.,

Treating the objective as a Quadratic Program (QP) and running some off-the-shelf QP solver such as quadprog (MATLAB), CVXOPT, CPLEX, etc.

Using (projected) gradient methods (projection needed because the $\alpha$'s are constrained). Gradient methods will usually be much faster than QP methods.

† If interested in more details of the solver, see: “Support Vector Machine Solvers” by Bottou and Lin
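A minimal sketch of the first option using CVXOPT (the dataset, function name, and the tiny ridge added for numerical stability are all assumptions, not from the slides); the dual is passed in its minimization form $\min_\alpha \frac{1}{2}\alpha^\top G\alpha - \mathbf{1}^\top\alpha$ with $\alpha \ge 0$ and $y^\top\alpha = 0$:

```python
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_svm_dual(X, y):
    """Solve the hard-margin SVM dual with a generic QP solver; returns alpha."""
    N = X.shape[0]
    Yx = y[:, None] * X                                   # rows y_n x_n
    G_gram = Yx @ Yx.T                                    # G_mn = y_m y_n x_m^T x_n
    P = matrix(G_gram + 1e-8 * np.eye(N))                 # tiny ridge: numerical tweak only
    q = matrix(-np.ones(N))                               # linear term -1^T alpha
    G = matrix(-np.eye(N))                                # -alpha <= 0, i.e. alpha >= 0
    h = matrix(np.zeros(N))
    A = matrix(y.reshape(1, -1).astype(float))            # equality y^T alpha = 0
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.array(sol["x"]).ravel()
```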


Hard-Margin SVM: The Solution

Once we have the $\alpha_n$'s, $w$ and $b$ can be computed as:

$$w = \sum_{n=1}^{N}\alpha_n y_n x_n \qquad\qquad b = -\frac{1}{2}\left(\min_{n: y_n = +1} w^\top x_n + \max_{n: y_n = -1} w^\top x_n\right)$$

A nice property: Most $\alpha_n$'s in the solution will be zero (sparse solution)

Reason: Karush-Kuhn-Tucker (KKT) conditions

For the optimal $\alpha_n$'s

$$\alpha_n\{1 - y_n(w^\top x_n + b)\} = 0$$

$\alpha_n$ is non-zero only if $x_n$ lies on one of the two margin boundaries, i.e., for which $y_n(w^\top x_n + b) = 1$

These examples are called support vectors

Recall the support vectors “support” the margin boundaries
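Continuing the hypothetical CVXOPT sketch above, recovering $w$, $b$, and the support vectors from the returned alpha might look as follows (the tolerance 1e-6 is an arbitrary numerical cutoff):

```python
import numpy as np

def recover_hyperplane(X, y, alpha, tol=1e-6):
    """Recover (w, b) and the support-vector indices from the dual solution alpha."""
    w = (alpha * y) @ X                                   # w = sum_n alpha_n y_n x_n
    b = -0.5 * (np.min(X[y == +1] @ w) + np.max(X[y == -1] @ w))
    support = np.where(alpha > tol)[0]                    # alpha_n > 0 only for support vectors
    return w, b, support
```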


Solving Soft-Margin SVM

Machine Learning (CS771A) Learning Maximum-Margin Hyperplanes: Support Vector Machines 17

Page 69: Learning Maximum-Margin Hyperplanes: Support Vector Machines · Machine Learning (CS771A) Learning Maximum-Margin Hyperplanes: Support Vector Machines 3 Support Vector Machine (SVM)

Solving Soft-Margin SVM

Recall the soft-margin SVM optimization problem:

min_{w,b,ξ}  f(w, b, ξ) = ||w||²/2 + C Σ_{n=1}^N ξn

subject to  yn(wT xn + b) ≥ 1 − ξn,  ξn ≥ 0,  n = 1, . . . , N

Note: ξ = [ξ1, . . . , ξN] is the vector of slack variables

Introduce Lagrange multipliers αn, βn (n = 1, . . . , N) for the two sets of constraints, and solve the Lagrangian problem:

min_{w,b,ξ}  max_{α≥0,β≥0}  L(w, b, ξ, α, β) = ||w||²/2 + C Σ_{n=1}^N ξn + Σ_{n=1}^N αn{1 − yn(wT xn + b) − ξn} − Σ_{n=1}^N βnξn

Note: The terms involving the slack variables ξn were not present in the hard-margin SVM

Two sets of dual variables, α = [α1, . . . , αN] and β = [β1, . . . , βN]. We'll eliminate the primal variables w, b, ξ to get a dual problem containing only the dual variables (just like in the hard-margin case)
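
To make the objective concrete, here is a small NumPy sketch (illustrative, not from the slides) that evaluates the soft-margin primal objective for a candidate (w, b), using the fact that at the optimum each slack equals ξn = max{0, 1 − yn(wT xn + b)}:

    import numpy as np

    def soft_margin_objective(w, b, X, y, C):
        # Slack variables: xi_n = max(0, 1 - y_n (w^T x_n + b))
        xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
        # f(w, b, xi) = ||w||^2 / 2 + C * sum_n xi_n
        return 0.5 * (w @ w) + C * xi.sum()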


Solving Soft-Margin SVM

The Lagrangian problem to solve:

min_{w,b,ξ}  max_{α≥0,β≥0}  L(w, b, ξ, α, β) = wTw/2 + C Σ_{n=1}^N ξn + Σ_{n=1}^N αn{1 − yn(wT xn + b) − ξn} − Σ_{n=1}^N βnξn

Take (partial) derivatives of L w.r.t. w, b, ξn and set them to zero:

∂L/∂w = 0 ⇒ w = Σ_{n=1}^N αnynxn,    ∂L/∂b = 0 ⇒ Σ_{n=1}^N αnyn = 0,    ∂L/∂ξn = 0 ⇒ C − αn − βn = 0

Note: The solution for w again has the same form as in the hard-margin case (a weighted sum of all the inputs, with αn being the importance of input xn)

Note: Using C − αn − βn = 0 and βn ≥ 0 ⇒ αn ≤ C (recall that, for the hard-margin case, we only had αn ≥ 0)

Substituting these back into the Lagrangian L gives the dual problem

max_{0≤α≤C, β≥0}  LD(α, β) = Σ_{n=1}^N αn − (1/2) Σ_{m,n=1}^N αmαnymyn(xmT xn)   s.t.   Σ_{n=1}^N αnyn = 0


Solving Soft-Margin SVM

Interestingly, the dual variables β don’t appear in the objective!

Just like the hard-margin case, we can write the dual more compactly as

max_{0≤α≤C}  LD(α) = αT1 − (1/2) αTGα   s.t.   Σ_{n=1}^N αnyn = 0

where G is an N × N matrix with Gmn = ymyn xmT xn, and 1 is a vector of all 1s

Like the hard-margin case, solving the dual requires concave maximization (equivalently, convex minimization)

It can be solved† the same way as the hard-margin SVM dual (except that now 0 ≤ αn ≤ C)

Can solve for α using QP solvers or (projected) gradient methods (a toy illustration with a generic constrained solver is sketched below)

Given α, the solution for w, b has the same form as in the hard-margin case

Note: α is again sparse. Nonzero αn's correspond to the support vectors

† If interested in more details of the solver, see: “Support Vector Machine Solvers” by Bottou and Lin
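
For illustration only (the slides don't prescribe a particular solver), here is a minimal sketch that feeds this dual to a general-purpose optimizer, scipy.optimize.minimize with the SLSQP method, which handles both the box constraint 0 ≤ αn ≤ C and the equality constraint Σ αnyn = 0. It is only meant to make the problem concrete; for large N the specialized solvers referenced above are far more practical:

    import numpy as np
    from scipy.optimize import minimize

    def solve_soft_margin_dual(X, y, C):
        N = X.shape[0]
        Yx = y[:, None] * X
        G = Yx @ Yx.T                     # G_mn = y_m y_n x_m^T x_n

        def neg_dual(alpha):              # negate L_D since `minimize` minimizes
            return -(alpha.sum() - 0.5 * alpha @ G @ alpha)

        constraints = [{"type": "eq", "fun": lambda a: a @ y}]   # sum_n alpha_n y_n = 0
        bounds = [(0.0, C)] * N                                  # 0 <= alpha_n <= C
        res = minimize(neg_dual, np.zeros(N), method="SLSQP",
                       bounds=bounds, constraints=constraints)
        return res.x                      # the optimal alpha (sparse: most entries near 0)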


Support Vectors in Soft-Margin SVM

The hard-margin SVM solution had only one type of support vector

.. the ones that lie on the margin boundaries wT x + b = −1 and wT x + b = +1

The soft-margin SVM solution has three types of support vectors (see the sketch after this list)

1 Lying on the margin boundaries wT x + b = −1 and wT x + b = +1 (ξn = 0)

2 Lying within the margin region (0 < ξn < 1) but still on the correct side of the hyperplane

3 Lying on the wrong side of the hyperplane (ξn ≥ 1)
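
A small NumPy sketch (illustrative; the tolerance tol and the exact bucketing rule are assumptions) that classifies training examples into the three types above, given a trained (w, b), using ξn = max{0, 1 − yn(wT xn + b)}:

    import numpy as np

    def support_vector_types(w, b, X, y, tol=1e-6):
        m = y * (X @ w + b)                      # functional margin of every example
        xi = np.maximum(0.0, 1.0 - m)            # corresponding slack xi_n
        on_margin     = np.abs(m - 1.0) <= tol   # type 1: xi_n = 0, exactly on a margin boundary
        inside_margin = (xi > tol) & (xi < 1.0)  # type 2: 0 < xi_n < 1, correct side of the hyperplane
        wrong_side    = xi >= 1.0                # type 3: xi_n >= 1, wrong side of the hyperplane
        # Examples with m > 1 (xi_n = 0, strictly outside the margin) are not support vectors
        return on_margin, inside_margin, wrong_side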


SVMs via Dual Formulation: Some Comments

Recall the final dual objectives for hard-margin and soft-margin SVM

Hard-Margin SVM:   max_{α≥0}  LD(α) = αT1 − (1/2) αTGα   s.t.   Σ_{n=1}^N αnyn = 0

Soft-Margin SVM:   max_{0≤α≤C}  LD(α) = αT1 − (1/2) αTGα   s.t.   Σ_{n=1}^N αnyn = 0

The dual formulation is nice due to two primary reasons:

Allows conveniently handling the margin-based constraints (via Lagrangians). The dual problem has only one non-trivial constraint (Σ_{n=1}^N αnyn = 0), whereas the original primal formulation of the SVM had many more (their number grows with N)

Important: Allows learning nonlinear separators by replacing the inner products (e.g., Gmn = ymyn xmT xn) by kernelized similarities (kernelized SVMs); a short sketch of this swap follows below

However, the dual formulation can be expensive if N is large: we have to solve for N variables α = [α1, . . . , αN], and also need to store an N × N matrix G

A lot of work† has gone into speeding up optimization in these settings

†See: “Support Vector Machine Solvers” by Bottou and Lin
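
Since the training inputs enter the dual only through the inner products xmT xn, a kernelized G can be built by swapping in a kernel evaluation k(xm, xn). A short sketch (the RBF kernel and the gamma parameter are illustrative choices; the slides only state the general idea):

    import numpy as np

    def rbf_kernel(X, gamma=1.0):
        # k(x_m, x_n) = exp(-gamma * ||x_m - x_n||^2)
        sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-gamma * sq_dists)

    def build_G(X, y, kernel=rbf_kernel):
        K = kernel(X)                           # replaces the plain Gram matrix X @ X.T
        return (y[:, None] * y[None, :]) * K    # G_mn = y_m y_n k(x_m, x_n)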


SVM Dual Formulation: A Geometric View

Convex Hull Interpretation†: Solving the SVM dual is equivalent to finding the shortest line connecting the convex hulls of the two classes (the SVM's hyperplane will be the perpendicular bisector of this line)

†See: “Duality and Geometry in SVM Classifiers” by Bennett and Bredensteiner


Loss Function Minimization View of SVM

Recall, we want for each training example: yn(wT xn + b) ≥ 1 − ξn

Can think of our loss as basically the sum of the slacks ξn ≥ 0, which is

ℓ(w, b) = Σ_{n=1}^N ℓn(w, b) = Σ_{n=1}^N ξn = Σ_{n=1}^N max{0, 1 − yn(wT xn + b)}

This is called the “hinge loss”. We can also learn SVMs by minimizing this loss directly via stochastic sub-gradient descent (and can also add a regularizer on w, e.g., ℓ2), as sketched below

Recall that the Perceptron also minimizes a similar loss function

ℓ(w, b) = Σ_{n=1}^N ℓn(w, b) = Σ_{n=1}^N max{0, −yn(wT xn + b)}

Perceptron, SVM, and logistic regression all minimize convex approximations of the 0-1 loss (optimizing which is NP-hard; moreover, it's non-convex/non-smooth)
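
A minimal sketch of this loss-minimization view (illustrative; the step size, number of epochs, and the ℓ2 regularization weight lam are assumptions, not values prescribed by the slides), training (w, b) by stochastic sub-gradient descent on the regularized hinge loss:

    import numpy as np

    def svm_sgd(X, y, lam=0.01, lr=0.01, epochs=100):
        N, D = X.shape
        w, b = np.zeros(D), 0.0
        for _ in range(epochs):
            for n in np.random.permutation(N):
                margin = y[n] * (X[n] @ w + b)
                # Sub-gradient of (lam/2)||w||^2 + max{0, 1 - y_n (w^T x_n + b)}
                if margin < 1:
                    w -= lr * (lam * w - y[n] * X[n])
                    b -= lr * (-y[n])
                else:
                    w -= lr * (lam * w)
        return w, b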


SVM: Some Notes

A hugely (perhaps the most!) popular classification algorithm

Reasonably mature, highly optimized SVM software packages are freely available (perhaps the reason why it is more popular than various other competing algorithms)

Some popular ones: libSVM, LIBLINEAR, SVMStruct, Vowpal Wabbit, etc.

Lots of work on scaling up SVMs† (both large N and large D)

Extensions beyond binary classification (e.g., multiclass, structured outputs)

Can even be used for regression problems (Support Vector Regression)

Nonlinear extensions possible via kernels

†See: “Support Vector Machine Solvers” by Bottou and Lin
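
As a usage illustration only (scikit-learn is not mentioned in the slides; its SVC estimator wraps the libSVM library listed above), training a kernelized soft-margin SVM on toy data looks roughly like this:

    import numpy as np
    from sklearn.svm import SVC

    # Toy 2-class data with labels in {-1, +1}, purely for illustration
    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(50, 2) + 2, rng.randn(50, 2) - 2])
    y = np.array([+1] * 50 + [-1] * 50)

    clf = SVC(C=1.0, kernel="rbf", gamma="scale")   # C is the slack penalty from the soft-margin objective
    clf.fit(X, y)
    print("support vectors per class:", clf.n_support_)
    print("training accuracy:", clf.score(X, y))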
