Support Vector Machine (SVM) Classification - Greg Grudic
Transcript
Slide 1
Support Vector Machine (SVM) Classification
Greg Grudic
Slide 2
Last Class
- Linear separating hyperplanes for binary classification
- Rosenblatt's Perceptron Algorithm
  - Based on gradient descent
  - Convergence theoretically guaranteed if the data is linearly separable
  - Infinite number of solutions
- For nonlinear data: map the data into a nonlinear space where it is linearly separable (or almost); however, convergence is still not guaranteed
Slide 3
Questions?
Slide 4
Why Classification?
[Diagram: an agent interacting with the world through sensing and actions, involving computation, state, and decisions/planning under uncertainty; signals vs. symbols (the Grounding Problem), not typically addressed in CS.]
Slide 5
The Problem Domain for Project Test 1: Identifying (and Navigating) Paths
[Figure: path vs. non-path image data used to construct a classifier; the classifier produces a path-labeled image.]
Slide 6
Today's Lecture Goals
- Support Vector Machine (SVM) Classification: another algorithm for linear separating hyperplanes
- A good text on SVMs: Bernhard Schölkopf and Alex Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
Slide 7
Support Vector Machine (SVM) Classification
- Classification as a problem of finding optimal (canonical) linear hyperplanes
- Optimal linear separating hyperplanes:
  - In input space
  - In kernel space (can be non-linear)
Slide 8
Linear Separating Hyper-Planes
How many lines can separate these points? Which line should we use?
Slide 9
Initial Assumption: Linearly Separable Data
Slide 10
Linear Separating Hyper-Planes
Slide 11
Linear Separating Hyper-Planes
Given the data, finding a separating hyperplane can be posed as a constraint satisfaction problem (CSP), written below. If the data is linearly separable, there are an infinite number of hyperplanes that satisfy this CSP.
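In standard notation (the symbols w, b, x_i, y_i are our choice, not taken from the slide), the data and the CSP can be written as:
\[ \{(x_i, y_i)\}_{i=1}^{N}, \qquad x_i \in \mathbb{R}^d, \quad y_i \in \{-1, +1\}. \]
Find w and b such that
\[ w^\top x_i + b > 0 \ \text{whenever}\ y_i = +1, \qquad w^\top x_i + b < 0 \ \text{whenever}\ y_i = -1, \]
or, equivalently,
\[ y_i \left( w^\top x_i + b \right) > 0, \qquad i = 1, \dots, N. \]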
Slide 12
The Margin of a Classifier
- Take any hyperplane (P0) that separates the data.
- Put a parallel hyperplane (P1) on the point in class +1 closest to P0.
- Put a second parallel hyperplane (P2) on the point in class -1 closest to P0.
- The margin (M) is the perpendicular distance between P1 and P2.
Slide 13
Calculating the Margin of a Classifier
[Figure] P0: any separating hyperplane. P1: parallel to P0, passing through the closest point in one class. P2: parallel to P0, passing through the closest point in the opposite class. Margin (M): distance measured along a line perpendicular to P1 and P2.
Slide 14
SVM Constraints on the Model Parameters
The model parameters must be chosen such that the constraints below hold for points on P1 and for points on P2. For any P0, these constraints are always attainable. Given the above, the linear separating boundary lies halfway between P1 and P2, and the resulting classifier is given below.
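A standard way to write these constraints (using the same assumed notation as above) is
\[ w^\top x + b = +1 \ \text{for points on } P_1, \qquad w^\top x + b = -1 \ \text{for points on } P_2, \]
so that every training point satisfies \( y_i ( w^\top x_i + b ) \ge 1 \). The separating boundary halfway between P1 and P2 is then \( w^\top x + b = 0 \), and the resulting classifier is
\[ f(x) = \operatorname{sign}\left( w^\top x + b \right). \]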
Slide 15
Remember the signed distance from a point to a hyperplane, with the hyperplane defined as follows:
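With the hyperplane written as \( \{ x : w^\top x + b = 0 \} \) (our assumed notation), the signed distance from a point x to the hyperplane is
\[ d(x) = \frac{w^\top x + b}{\|w\|}. \]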
Slide 16
Calculating the Margin (1)
Slide 17
Calculating the Margin (2)
Start from the signed distance and take its absolute value to get the unsigned margin:
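Sketching the step in the assumed notation: points on P1 satisfy \( w^\top x + b = +1 \) and points on P2 satisfy \( w^\top x + b = -1 \), so their signed distances to the boundary are \( +1/\|w\| \) and \( -1/\|w\| \). Taking absolute values and adding gives the margin
\[ M = \frac{2}{\|w\|}. \]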
Slide 18
Different P0's Have Different Margins
[Figure] P0: any separating hyperplane. P1: parallel to P0, passing through the closest point in one class. P2: parallel to P0, passing through the closest point in the opposite class. Margin (M): distance measured along a line perpendicular to P1 and P2.
Slide 19
Different P0's Have Different Margins
[Figure: the same construction with a different choice of P0, giving a different margin M.]
Slide 20
Different P0's Have Different Margins
[Figure: another choice of P0, again with a different margin M.]
Slide 21
How Do SVMs Choose the Optimal Separating Hyperplane (Boundary)?
Find the separating hyperplane that maximizes the margin! Margin (M): distance measured along a line perpendicular to P1 and P2.
Slide 22
SVM: Constrained Optimization Problem
Given the data, minimize the objective below subject to the margin constraints. The Lagrange function formulation is used to solve this minimization problem.
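In the assumed notation, maximizing the margin \( M = 2/\|w\| \) is equivalent to the constrained minimization
\[ \min_{w, b} \ \tfrac{1}{2} \|w\|^2 \qquad \text{subject to} \qquad y_i \left( w^\top x_i + b \right) \ge 1, \quad i = 1, \dots, N. \]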
Slide 23
The Lagrange Function Formulation
For every constraint we introduce a Lagrange multiplier. The Lagrangian is then defined as below, where the primal variables are (w, b) and the dual variables are the Lagrange multipliers. Goal: minimize the Lagrangian with respect to the primal variables and maximize it with respect to the dual variables.
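With one multiplier \( \alpha_i \ge 0 \) per constraint, the Lagrangian takes the standard form
\[ L(w, b, \alpha) = \tfrac{1}{2} \|w\|^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i \left( w^\top x_i + b \right) - 1 \right], \]
with primal variables \( (w, b) \) and dual variables \( \alpha = (\alpha_1, \dots, \alpha_N) \).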
Slide 24
Derivation of the Dual Problem
At the saddle point (an extremum with respect to the primal variables), this gives the conditions below. Substituting them back into the Lagrangian gives the dual problem.
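Setting the derivatives with respect to the primal variables to zero gives the standard conditions
\[ \frac{\partial L}{\partial w} = 0 \ \Rightarrow \ w = \sum_{i=1}^{N} \alpha_i y_i x_i, \qquad \frac{\partial L}{\partial b} = 0 \ \Rightarrow \ \sum_{i=1}^{N} \alpha_i y_i = 0. \]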
Slide 25
Using the Lagrange function formulation, we get the Dual Problem: maximize the dual objective subject to the dual constraints (below).
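In the same notation, the dual problem reads
\[ \max_{\alpha} \ \sum_{i=1}^{N} \alpha_i - \tfrac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j \qquad \text{subject to} \qquad \alpha_i \ge 0, \quad \sum_{i=1}^{N} \alpha_i y_i = 0. \]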
Slide 26
Properties of the Dual Problem
- Solving the dual gives a solution to the original constrained optimization problem.
- For SVMs, the dual problem is a quadratic optimization problem, which has a globally optimal solution.
- It gives insights into the non-linear formulation of SVMs.
Slide 27
Support Vector Expansion (1)
The weight vector is expressed as an expansion over the training points; b is also computed from the optimal dual variables.
Slide 28
Support Vector Expansion (2)
Substituting this expansion into the decision function gives the form below.
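Written out in the assumed notation: the optimal weight vector is \( w = \sum_i \alpha_i y_i x_i \), so the classifier can be expressed entirely through the training points,
\[ f(x) = \operatorname{sign}\left( \sum_{i=1}^{N} \alpha_i y_i \, x_i^\top x + b \right), \]
where b can be computed from any support vector \( x_j \) (a point with \( \alpha_j > 0 \)) as \( b = y_j - \sum_i \alpha_i y_i \, x_i^\top x_j \). Only the support vectors contribute, since all other points have \( \alpha_i = 0 \).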
Slide 29
What are the Support Vectors?
[Figure: the maximized margin; the support vectors are the training points that lie on the margin hyperplanes P1 and P2.]
Slide 30
Why do we want a model with only a few SVs?
Leaving out an example that does not become an SV gives the same solution!
Theorem (Vapnik and Chervonenkis, 1974): Let #SV(N) be the number of SVs obtained by training on N examples randomly drawn from P(X,Y), and let E denote expectation. Then the bound below holds.
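One common statement of the bound (our phrasing of the standard leave-one-out result) is
\[ \mathbb{E}\left[ P(\text{error}) \right] \ \le \ \frac{\mathbb{E}\left[ \#\mathrm{SV}(N) \right]}{N}, \]
where the left-hand side is the expected generalization error of the classifier trained on N - 1 examples. A model with few support vectors therefore comes with a small expected error bound.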
Slide 31
What Happens When the Data is Not Separable: Soft Margin SVM
Add a slack variable for each training point.
Slide 32
Soft Margin SVM: Constrained Optimization Problem
Given the data, minimize the penalized objective below subject to the relaxed margin constraints.
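A standard form of the soft-margin problem, with slack variables \( \xi_i \ge 0 \) and a penalty parameter C > 0 (our assumed symbols), is
\[ \min_{w, b, \xi} \ \tfrac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \xi_i \qquad \text{subject to} \qquad y_i \left( w^\top x_i + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0. \]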
Slide 33
Dual Problem (Non-separable Data)
Maximize the dual objective subject to the constraints below.
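The dual objective is the same as in the separable case; the only change is that each multiplier is boxed by the penalty parameter:
\[ \max_{\alpha} \ \sum_{i=1}^{N} \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j \qquad \text{subject to} \qquad 0 \le \alpha_i \le C, \quad \sum_{i=1}^{N} \alpha_i y_i = 0. \]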
Slide 34
Same Decision Boundary
Slide 35
Mapping into Nonlinear Space
Goal: the data is linearly separable (or almost) in the nonlinear space.
Slide 36
Nonlinear SVMs
KEY IDEA: Note that both the decision boundary and the dual optimization formulation use dot products in input space only!
Slide 37
Kernel Trick
Replace every inner product between input points with a kernel evaluation between the mapped points; we can then use the same algorithms in the nonlinear kernel space!
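Concretely, with a mapping \( \Phi \) into the nonlinear space (our assumed notation), every dot product \( x_i^\top x_j \) is replaced by the kernel \( k(x_i, x_j) = \Phi(x_i)^\top \Phi(x_j) \), giving the dual objective
\[ \max_{\alpha} \ \sum_{i} \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, k(x_i, x_j) \]
and the decision function
\[ f(x) = \operatorname{sign}\left( \sum_{i=1}^{N} \alpha_i y_i \, k(x_i, x) + b \right). \]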
Slide 40
Gram (Kernel) Matrix
Built from the training data. Properties: N by N, symmetric, positive definite, positive on the diagonal.
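As an illustration (our own sketch, not from the slides), the following Python snippet builds a Gaussian-kernel Gram matrix and checks the properties listed above; the function name gaussian_gram and the parameter sigma are hypothetical choices.

import numpy as np

def gaussian_gram(X, sigma=1.0):
    # K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)) for rows x_i, x_j of X
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma ** 2))

X = np.random.randn(5, 3)            # 5 training points in R^3
K = gaussian_gram(X, sigma=0.5)
assert K.shape == (5, 5)              # N by N
assert np.allclose(K, K.T)            # symmetric
assert np.all(np.diag(K) > 0)         # positive on the diagonal (here k(x, x) = 1)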
Slide 41
Commonly Used Mercer Kernels
- Polynomial
- Sigmoid
- Gaussian
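In one common parameterization (the parameter symbols are our choice), these kernels are
\[ k_{\text{poly}}(x, x') = \left( x^\top x' + c \right)^{p}, \qquad k_{\text{sig}}(x, x') = \tanh\left( \kappa \, x^\top x' + \theta \right), \qquad k_{\text{gauss}}(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2 \sigma^2} \right). \]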
Slide 42
Why these kernels?
- There are infinitely many kernels; the best kernel is data-set dependent.
- We can only know which kernels are good by trying them and estimating error rates on future data.
- Definition: a universal approximator is a mapping that can model any surface (i.e., a many-to-one mapping) arbitrarily well.
- Motivation for the most commonly used kernels:
  - Polynomials (given enough terms) are universal approximators; however, polynomial kernels are not universal approximators because they cannot represent all polynomial interactions.
  - Sigmoid functions (given enough training examples) are universal approximators.
  - Gaussian kernels (given enough training examples) are universal approximators.
- These kernels have been shown to produce good models in practice.
Slide 43
Picking a Model (a Kernel for SVMs)?
- How do you pick the kernel and the kernel parameters? These are called learning parameters or hyperparameters.
- Two approaches to choosing learning parameters (a minimal sketch of the second follows this list):
  - Bayesian: learning parameters must maximize the probability of correct classification on future data, based on prior biases.
  - Frequentist: use the training data to learn the model parameters, and use validation data to pick the best hyperparameters.
- More on learning parameter selection later.
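A minimal sketch of the frequentist approach using scikit-learn (a library choice of ours; the slides do not prescribe a tool, and the data here is synthetic): the model parameters are fit on training folds, while cross-validated performance picks the kernel and its hyperparameters.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic binary classification data standing in for a real problem.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Hyperparameters: kernel choice, soft-margin penalty C, kernel width gamma.
param_grid = {
    "kernel": ["rbf", "poly"],
    "C": [0.1, 1.0, 10.0],
    "gamma": [0.01, 0.1, 1.0],
}
search = GridSearchCV(SVC(), param_grid, cv=5)   # 5-fold cross-validation
search.fit(X_train, y_train)

print("best hyperparameters:", search.best_params_)
print("held-out accuracy:", search.score(X_test, y_test))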
Slide 44
Slide 45
Slide 46
Slide 47
Some SVM Software
- LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
- SVM Light: http://svmlight.joachims.org/
- TinySVM: http://chasen.org/~taku/software/TinySVM/
- WEKA: http://www.cs.waikato.ac.nz/ml/weka/ (has many ML algorithm implementations in Java)
Slide 48
MNIST: An SVM Success Story
Handwritten character benchmark: 60,000 training and 10,000 test examples; dimension d = 28 x 28 = 784.
Slide 49
Results on Test Data
The SVM used a polynomial kernel of degree 9.
Slide 50
SVM (Kernel) Model Structure