Support Vector Machines
Andrei Alexandrescu
June 19, 2007
Introduction
Metaphor #1
■ Imagine a single perceptron
■ Linear separation
■ Train it to draw the separation hyperplane:
  ◆ Minimize $d_+ + d_-$, where
  ◆ $d_+$ is the distance to the closest positive point
  ◆ $d_-$ is the distance to the closest negative point

You've got an SVM.
Metaphor #2
■ Imagine creating a highway:
  ◆ Straight
  ◆ Trees on the left
  ◆ Rocks on the right
  ◆ Farthest from the closest tree
  ◆ Farthest from the closest rock
You need an SVM.
Metaphor #3
■ Imagine you want to support a metal sheet with coil springs
■ The coil springs are attached to fixed points
■ They push from two different sides
■ For the metal sheet to be supported, the forces from all springs must cancel out

The sheet will be, well, supported along an SVM plane.
Background: StructuralRisk Minimization
Capacity and Generalization
■ Generalization: figure out similarities between already-seen data and new data
  ◆ Too much: "Square piece of paper? That's a $100 bill"
■ Capacity: ability to allocate new categories for data
  ◆ Too much: "#L26118670? It's a fake; all $100 bills I've seen had other serial numbers"
■ The two are in tension with one another
■ How to strike the right balance?
Empirical Risk
■ We are given $l$ observations $\langle x_i, y_i \rangle$:
  ◆ $x_i \in \mathbb{R}^n$
  ◆ $y_i \in \{-1, 1\}$
■ Learn $y = f(x, \alpha)$ by tuning $\alpha$
■ Expected test error (risk) and empirical risk:

$$R(\alpha) = \frac{1}{2} \int |y - f(x, \alpha)| \, dP(x, y) \tag{1}$$

$$R_{emp}(\alpha) = \frac{1}{2l} \sum_{i=1}^{l} |y_i - f(x_i, \alpha)| \tag{2}$$
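For concreteness, here is the empirical risk (2) computed directly in NumPy; a minimal sketch assuming labels and predictions in $\{-1, 1\}$ (the tiny dataset is made up for illustration):

```python
# Empirical risk (2) for labels/predictions in {-1, +1}: |y_i - f(x_i)|
# is 0 for a hit and 2 for a miss, so the 1/(2l) factor yields the
# training error rate.
import numpy as np

def empirical_risk(y_true, y_pred):
    return np.abs(y_true - y_pred).sum() / (2 * len(y_true))

y_true = np.array([1, -1, 1, 1])
y_pred = np.array([1, -1, -1, 1])
print(empirical_risk(y_true, y_pred))  # 0.25: one of four misclassified
```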
Risk Bound
■ For 0/1 loss and $0 < \eta < 1$, with probability $1 - \eta$:

$$R(\alpha) \le R_{emp}(\alpha) + \sqrt{\frac{h\left(\log\frac{2l}{h} + 1\right) - \log\frac{\eta}{4}}{l}} \tag{3}$$

where $h \in \mathbb{N}$ is the Vapnik-Chervonenkis (VC) dimension
■ The second term is called the "VC confidence"
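A small numeric sketch of the VC confidence term in (3); the sample size and $\eta$ below are arbitrary choices for illustration:

```python
# VC confidence from (3): grows with capacity h, shrinks with more data l.
import math

def vc_confidence(h, l, eta=0.05):
    return math.sqrt((h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l)

for h in (10, 100, 1000):
    print(h, round(vc_confidence(h, l=10_000), 3))  # loosens as h grows
```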
Importance of Risk Bound
1. The bound does not depend on $P(x, y)$
2. The left-hand side is not computable
3. The right-hand side is computable if we know $h$

■ For a given task, choose the machine that minimizes the risk bound!
■ Even when the bound is not tight, we can contrast the "tightness" of various families of machines
The VC dimension
■ For a family of functions $f(\alpha)$, suppose we can:
  ◆ choose a set of $l$ points,
  ◆ label them in any way, and
  ◆ find $\alpha$ s.t. $f(\alpha)$ recognizes ("shatters") them
■ Then $f(\alpha)$ has VC dimension at least $l$
Example: Hyperplanes in $\mathbb{R}^n$
■ Choosing 4 planar points:
  ◆ they can't be separated by one line for all of their possible labelings (one labeling will be inseparable; see the brute-force check below)
■ Similarly, $n + 2$ points in $\mathbb{R}^n$ can't be separated for all labelings
■ So the VC dimension of hyperplanes in $\mathbb{R}^n$ is $n + 1$
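This is easy to verify by brute force. The sketch below (my own illustration, not from the slides) tries every labeling of 4 planar points with a near-hard-margin linear SVM from scikit-learn and reports the labelings it cannot separate:

```python
# Try all labelings of 4 planar points; a large C approximates a hard
# margin, so a training score below 1.0 means "not linearly separable".
import itertools
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
for labels in itertools.product([-1, 1], repeat=4):
    if len(set(labels)) < 2:
        continue  # the SVM needs both classes present
    clf = SVC(kernel="linear", C=1e6).fit(X, labels)
    if clf.score(X, labels) < 1.0:
        print("inseparable labeling:", labels)  # the two XOR labelings
```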
Not a Hard and Fast Formula
■ Consider a nearest-neighbor (NN) classifier
■ Its VC dimension is infinite
■ $R_{emp} = 0$
■ The bound is uninformative, yet an NN classifier can perform well in many situations
Corollary
We'd like a machine that can zero the empirical risk (sufficient capacity) while minimizing the VC dimension (capacity not wastefully large).
The SVM Connection
Linear SVMs
■ Training data $\{x_i, y_i\}$, $i = 1, \ldots, l$, with $x_i \in \mathbb{R}^n$ and $y_i \in \{-1, 1\}$
■ On a separating hyperplane: $x \cdot w + b = 0$, where
  ◆ $w$ is normal to the hyperplane
  ◆ $\frac{|b|}{\|w\|}$ is the distance from the hyperplane to the origin
  ◆ $\|w\|$ is the Euclidean norm of $w$
Linear SVMs (cont’d)
■ $d_+$, $d_-$: the shortest distances from the positive and negative labeled points to the hyperplane
■ Define the margin $m = d_+ + d_-$
■ Task: find the separating hyperplane that maximizes $m$

Key point: maximizing the margin minimizes the VC dimension.
Computation
■ For the separating plane:

$$x_i \cdot w + b \ge +1, \quad y_i = +1 \tag{4}$$

$$x_i \cdot w + b \le -1, \quad y_i = -1 \tag{5}$$

or, combined:

$$y_i(x_i \cdot w + b) - 1 \ge 0, \quad \forall i \tag{7}$$

■ For the closest points the equalities hold, so:

$$d_+ + d_- = \frac{|1 - b|}{\|w\|} + \frac{|-1 - b|}{\|w\|} = \frac{2}{\|w\|} \tag{8}$$
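As a sanity check of (8), here is a minimal sketch using scikit-learn's linear SVC on toy data of my own choosing; the learned margin comes out as $2/\|w\|$:

```python
# Fit a (near) hard-margin linear SVM and read the margin off ||w||.
import numpy as np
from sklearn.svm import SVC

X = np.array([[-1.0, 0.0], [-2.0, 1.0], [1.0, 0.0], [2.0, 1.0]])
y = np.array([-1, -1, 1, 1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)  # large C ~ hard margin

w = clf.coef_[0]
print(2 / np.linalg.norm(w))  # ~2.0: closest points sit at x1 = -1 and +1
```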
Switching to Lagrangian
■ One coefficient per training sample
■ The constraints become easier to handle
■ Training data appears only in dot products
■ Great for applying the kernel trick later on
Lagrangian Form
■ Minimize

$$L_P = \frac{\|w\|^2}{2} - \sum_{i=1}^{l} \alpha_i y_i (x_i \cdot w + b) + \sum_{i=1}^{l} \alpha_i \tag{9}$$

■ A convex quadratic programming problem with the dual: maximize

$$L_D = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \tag{10}$$

subject to $\alpha_i \ge 0$ and $\sum_i \alpha_i y_i = 0$
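The dual can be handed to any off-the-shelf QP solver. Below is a sketch assuming the cvxopt package (not mentioned in the slides); cvxopt minimizes, so we negate $L_D$:

```python
# Solve the dual (10) as a QP: minimize (1/2) a'Pa + q'a with
# P_ij = y_i y_j (x_i . x_j) and q = -1, subject to a >= 0 and y'a = 0.
import numpy as np
from cvxopt import matrix, solvers

X = np.array([[-1.0, 0.0], [-2.0, 1.0], [1.0, 0.0], [2.0, 1.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
l = len(y)

P = matrix(np.outer(y, y) * (X @ X.T))
q = matrix(-np.ones(l))
G, h = matrix(-np.eye(l)), matrix(np.zeros(l))  # alpha_i >= 0
A, b = matrix(y.reshape(1, -1)), matrix(0.0)    # sum alpha_i y_i = 0

solvers.options["show_progress"] = False
alpha = np.ravel(solvers.qp(P, q, G, h, A, b)["x"])
w = (alpha * y) @ X                   # w = sum alpha_i y_i x_i
print(alpha.round(3), w.round(3))     # ~[0.5, 0, 0.5, 0] and ~[1, 0]
```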
The Support Vectors
■ The points with $\alpha_i > 0$ are the support vectors
■ The solution depends only on them
■ All other points have $\alpha_i = 0$ and can be moved arbitrarily far from the decision hyperplane, or removed altogether
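scikit-learn exposes exactly these quantities; a quick illustration on the same kind of toy data:

```python
# SVC keeps the x_i with alpha_i > 0 in support_vectors_, and dual_coef_
# stores the products alpha_i * y_i for them.
import numpy as np
from sklearn.svm import SVC

X = np.array([[-1.0, 0.0], [-2.0, 1.0], [1.0, 0.0], [2.0, 1.0]])
y = np.array([-1, -1, 1, 1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print(clf.support_vectors_)  # only the two closest points remain
print(clf.dual_coef_)        # ~[[-0.5, 0.5]]: alpha_i * y_i
```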
Testing
■ Once the hyperplane is found:

$$y = \mathrm{sgn}(w \cdot x + b) \tag{11}$$
Unseparable Data
■ Add slack variables $\xi_i \ge 0$:

$$x_i \cdot w + b \ge +1 - \xi_i, \quad y_i = +1 \tag{12}$$

$$x_i \cdot w + b \le -1 + \xi_i, \quad y_i = -1 \tag{13}$$

■ The Lagrangian formulation changes only in that the $\alpha_i$ acquire an upper bound $C$ (see the sketch below)
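A sketch of $C$'s effect, assuming scikit-learn's SVC (whose C parameter plays exactly this role of capping the $\alpha_i$); the noisy dataset is invented for illustration:

```python
# With an outlier inside the wrong class, slack is unavoidable; a small
# C buys a wider margin at the price of tolerating the violation.
import numpy as np
from sklearn.svm import SVC

X = np.array([[-1, 0], [-2, 1], [1, 0], [2, 1], [1.5, 0.5]], dtype=float)
y = np.array([-1, -1, 1, 1, -1])  # the last point is "noise"

for C in (0.1, 1000.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2 / np.linalg.norm(clf.coef_[0])
    print(f"C={C}: margin={margin:.2f}, train acc={clf.score(X, y):.2f}")
```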
Nonlinear SVMs
Space Transformation (RKHS)
■ Take points from $\mathbb{R}^d$ to some space $\mathcal{H}$:

$$\Phi : \mathbb{R}^d \to \mathcal{H} \tag{14}$$

■ Choose a kernel function $K$ such that

$$K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j) \tag{15}$$

■ Since the Lagrangian formulation contains the $x_i$ only in dot products (remember?), we don't even need to know $\Phi$!
Kernel Example
■ Gaussian kernel:

$$K(x_i, x_j) = e^{-\frac{\|x_i - x_j\|^2}{2\sigma^2}} \tag{16}$$

■ Polynomial:

$$K(x_i, x_j) = (x_i \cdot x_j)^p \tag{17}$$

■ Kinda sorta a neural net!

$$K(x_i, x_j) = \tanh(\kappa \, x_i \cdot x_j - \delta) \tag{18}$$
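The three kernels written out in NumPy (a direct transcription; the parameter defaults are arbitrary):

```python
import numpy as np

def gaussian(xi, xj, sigma=1.0):            # kernel (16)
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))

def polynomial(xi, xj, p=3):                # kernel (17)
    return np.dot(xi, xj) ** p

def sigmoid(xi, xj, kappa=1.0, delta=1.0):  # kernel (18); a valid kernel
    return np.tanh(kappa * np.dot(xi, xj) - delta)  # only for some kappa, delta
```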
Using Kernels
■ Just replace $x_i \cdot x_j$ with $K(x_i, x_j)$ everywhere and the magic is complete
■ Training is identical and takes similar time
■ Separation is still linear, but in a different space (possibly infinite-dimensional!)
■ Same sleight of hand for testing, where the $s_i$ are the $N_s$ support vectors:

$$y = \mathrm{sgn}\left(\sum_{i=1}^{N_s} \alpha_i y_i K(s_i, x) + b\right) \tag{19}$$
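A minimal sketch of rule (19); the function names are my own, and the Gaussian kernel is repeated to keep the snippet self-contained:

```python
# Kernelized prediction: only the support vectors s_i, their
# coefficients alpha_i * y_i, and b are needed at test time.
import numpy as np

def gaussian(u, v, sigma=1.0):
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

def predict(x, support, alpha_y, b, kernel=gaussian):
    return np.sign(sum(ay * kernel(s, x) for s, ay in zip(support, alpha_y)) + b)
```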
Loose Ends. Conclusions
SVM For Multiple Classes
■ Build $n$ "one-versus-all" classifiers
  ◆ Essentially costs $n$ times the complexity of one classifier
  ◆ During testing, choose the most confident answer
■ Build $\frac{n(n-1)}{2}$ "one-versus-one" classifiers
  ◆ Decide by voting where the data belongs
  ◆ Many classifiers, but little data for each
■ DAGSVM (Platt): trains like one-versus-one, but testing evaluates only $n - 1$ of the classifiers

Both reduction schemes are sketched below.
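A minimal illustration, assuming scikit-learn's built-in reductions and its bundled iris dataset:

```python
# One-versus-rest builds n binary SVMs; one-versus-one builds n(n-1)/2.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # 3 classes, so 3 classifiers either way
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)
print(ovr.score(X, y), ovo.score(X, y))
```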
Soft Outputs
■ $\mathrm{sgn}$ is discontinuous; we want to restore a confidence value
■ SVM outputs can be mapped to posterior probabilities (Platt); see the sketch below
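scikit-learn implements a variant of Platt's method behind SVC's probability flag; a quick sketch on its bundled iris data:

```python
# probability=True fits a sigmoid to the SVM outputs (Platt scaling),
# turning hard sgn decisions into posterior estimates.
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
clf = SVC(kernel="rbf", probability=True).fit(X, y)
print(clf.predict_proba(X[:2]))  # class posteriors instead of bare labels
```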
Conclusions
■ Powerful theoretical grounding
■ Global, unique solution
■ Performance depends on the choice of kernel and parameters; still a research topic
■ Training is memory-intensive; chunking must be used
■ Complexity depends on the number of support vectors