Support Vector Machine (SVM) Classification
Greg Grudic, Intro AI
Transcript
  • Slide 1
  • Support Vector Machine (SVM) Classification
  • Slide 2
  • Last Class: Linear separating hyperplanes for binary classification. Rosenblatt's Perceptron Algorithm, based on gradient descent; convergence is theoretically guaranteed if the data is linearly separable, but there are an infinite number of solutions. For nonlinear data: map the data into a nonlinear space where it is linearly separable (or almost); however, convergence is still not guaranteed. (A minimal sketch of the perceptron update follows below.)
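  • As a quick reminder of last class, a minimal sketch of the perceptron update rule in Python (the toy data and learning rate are my own, not from the deck):

        import numpy as np

        # Toy linearly separable data, labels in {-1, +1}
        X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
        y = np.array([+1, +1, -1, -1])

        w, b, eta = np.zeros(2), 0.0, 1.0   # weights, bias, learning rate
        for _ in range(100):                # converges if the data is linearly separable
            for xi, yi in zip(X, y):
                if yi * (np.dot(w, xi) + b) <= 0:   # misclassified (or on the boundary)
                    w += eta * yi * xi              # gradient-descent-style update
                    b += eta * yi

        print(np.sign(X @ w + b))  # should match y after convergence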
  • Slide 3
  • Questions?
  • Slide 4
  • Why Classification? [Diagram: an agent interacting with the world through sensing, actions, computation, state, and decisions/planning, under uncertainty.] Mapping signals to symbols (the Grounding Problem) is not typically addressed in CS.
  • Slide 5
  • The Problem Domain for Project Test 1: Identifying (and Navigating) Paths. [Diagram: path and non-path image data are used to construct a classifier that maps an image to a path-labeled image.]
  • Slide 6
  • Today's Lecture Goals: Support Vector Machine (SVM) Classification, another algorithm for linear separating hyperplanes. A good text on SVMs: Bernhard Schölkopf and Alex Smola, Learning with Kernels, MIT Press, Cambridge, MA, 2002.
  • Slide 7
  • Support Vector Machine (SVM) Classification: classification as the problem of finding optimal (canonical) linear hyperplanes. Optimal linear separating hyperplanes can be found in input space or in kernel space, where the resulting boundary can be nonlinear.
  • Slide 8
  • Linear Separating Hyperplanes: How many lines can separate these points? There is no unique answer, so which line should we use?
  • Slide 9
  • Initial Assumption: Linearly Separable Data
  • Slide 10
  • Linear Separating Hyperplanes
  • Slide 11
  • Linear Separating Hyperplanes: Given the data, finding a separating hyperplane can be posed as a constraint satisfaction problem (CSP), or, equivalently, as a single set of inequality constraints (a standard statement is sketched below). If the data is linearly separable, there are an infinite number of hyperplanes that satisfy this CSP.
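  • The inequalities on this slide were images that did not survive extraction; a standard statement of the constraint satisfaction problem it describes, in the usual SVM notation, is:

        \text{Given } (x_1, y_1), \dots, (x_N, y_N), \; y_i \in \{-1, +1\}, \text{ find } w, b \text{ such that}
        w^\top x_i + b > 0 \ \text{if } y_i = +1, \qquad w^\top x_i + b < 0 \ \text{if } y_i = -1,
        \text{or, equivalently,} \quad y_i (w^\top x_i + b) > 0 \quad \text{for all } i.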
  • Slide 12
  • The Margin of a Classifier: Take any hyperplane (P0) that separates the data. Put a parallel hyperplane (P1) through the point in class +1 closest to P0, and a second parallel hyperplane (P2) through the point in class -1 closest to P0. The margin (M) is the perpendicular distance between P1 and P2.
  • Slide 13
  • Calculating the Margin of a Classifier. [Figure: hyperplanes P0, P1, P2.] P0: any separating hyperplane. P1: parallel to P0, passing through the closest point in one class. P2: parallel to P0, passing through the closest point in the opposite class. Margin (M): the distance measured along a line perpendicular to P1 and P2.
  • Slide 14
  • SVM Constraints on the Model Parameters: The model parameters must be chosen so that the constraint is met with equality for points on P1 and for points on P2 (a standard statement is sketched below). For any P0, these constraints are always attainable. Given the above, the linear separating boundary lies halfway between P1 and P2.
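  • The constraint equations and resulting classifier on this slide were also images; the canonical-hyperplane form they describe is the standard one:

        w^\top x_i + b = +1 \ \text{for points on } P_1, \qquad w^\top x_i + b = -1 \ \text{for points on } P_2,
        \text{so that } y_i (w^\top x_i + b) \ge 1 \ \text{for all } i.
        \text{Boundary (halfway between } P_1 \text{ and } P_2\text{): } w^\top x + b = 0, \qquad \text{Resulting classifier: } f(x) = \mathrm{sign}(w^\top x + b).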
  • Slide 15
  • Remember the signed distance from a point to a hyperplane, for a hyperplane defined by its normal vector w and offset b.
  • Slide 16
  • Calculating the Margin (1)
  • Slide 17
  • Calculating the Margin (2): Start from the signed distance, then take the absolute value to get the unsigned margin (the standard derivation is sketched below).
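  • The formulas on these margin slides were images; the standard calculation, using the signed distance recalled above and the canonical constraints, is:

        d(x) = \frac{w^\top x + b}{\|w\|} \quad \text{(signed distance from } x \text{ to the hyperplane } w^\top x + b = 0\text{)}.
        \text{For } x_1 \text{ on } P_1 \ (w^\top x_1 + b = +1) \text{ and } x_2 \text{ on } P_2 \ (w^\top x_2 + b = -1):
        M = |d(x_1)| + |d(x_2)| = \frac{1}{\|w\|} + \frac{1}{\|w\|} = \frac{2}{\|w\|}.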
  • Slide 18
  • Different P0s have Different Margins. [Figure: a separating hyperplane P0 with parallel hyperplanes P1 and P2 through the closest point of each class, and margin M measured perpendicular to P1 and P2.]
  • Slide 19
  • Different P0s have Different Margins. [Figure: a second choice of P0, giving a different margin.]
  • Slide 20
  • Different P0s have Different Margins. [Figure: a third choice of P0, giving yet another margin.]
  • Slide 21
  • How Do SVMs Choose the Optimal Separating Hyperplane (Boundary)? Find the separating hyperplane that maximizes the margin M, the distance measured along a line perpendicular to P1 and P2.
  • Slide 22
  • SVM: Constrained Optimization Problem. Given the data, minimize the objective subject to the margin constraints (the standard form is sketched below). The Lagrange function formulation is used to solve this minimization problem.
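  • The objective and constraints on this slide were images; the standard hard-margin primal problem they refer to (maximizing M = 2/\|w\| is equivalent to minimizing \|w\|^2/2) is:

        \min_{w, b} \; \tfrac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i (w^\top x_i + b) \ge 1, \quad i = 1, \dots, N.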
  • Slide 23
  • The Lagrange Function Formulation: For every constraint we introduce a Lagrange multiplier. The Lagrangian is then defined in terms of the primal variables (w, b) and the dual variables (the multipliers). Goal: minimize the Lagrangian with respect to the primal variables and maximize it with respect to the dual variables (see the sketch below).
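  • The Lagrangian itself appeared only as an image; its standard form for the hard-margin problem above is:

        L(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i (w^\top x_i + b) - 1 \right], \qquad \alpha_i \ge 0,
        \text{with primal variables } (w, b) \text{ and dual variables } \alpha = (\alpha_1, \dots, \alpha_N).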
  • Slide 24
  • Derivation of the Dual Problem: At the saddle point (an extremum with respect to the primal variables), setting the derivatives to zero gives the optimality conditions; substituting these back into the Lagrangian yields the dual problem (sketched below).
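  • The saddle-point conditions on this slide are the standard ones; setting the partial derivatives of the Lagrangian to zero gives:

        \frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{N} \alpha_i y_i x_i, \qquad
        \frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \alpha_i y_i = 0.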
  • Slide 25
  • Using the Lagrange function formulation, we get the dual problem: maximize the dual objective subject to the dual constraints (standard form below).
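  • The dual objective and constraints were images; the standard hard-margin dual obtained by the substitution above is:

        \max_{\alpha} \; \sum_{i=1}^{N} \alpha_i - \tfrac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j
        \quad \text{subject to} \quad \alpha_i \ge 0, \quad \sum_{i=1}^{N} \alpha_i y_i = 0.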
  • Slide 26
  • Properties of the Dual Problem: Solving the dual gives a solution to the original constrained optimization problem. For SVMs, the dual is a quadratic optimization problem, which has a globally optimal solution. It also gives insight into the nonlinear formulation of SVMs.
  • Slide 27
  • Support Vector Expansion (1): The classifier can be written equivalently in terms of w or directly in terms of the optimal dual variables; the offset b is also computed from the optimal dual variables.
  • Slide 28
  • Support Vector Expansion (2): Substitute the dual expansion of w into the classifier (the resulting form is sketched below).
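  • The expansions displayed on these two slides are standard; substituting w = \sum_i \alpha_i y_i x_i into the decision function gives:

        f(x) = \mathrm{sign}\left( w^\top x + b \right) = \mathrm{sign}\left( \sum_{i=1}^{N} \alpha_i y_i \, x_i^\top x + b \right),
        \text{where only the support vectors (points with } \alpha_i > 0 \text{) contribute to the sum.}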
  • Slide 29
  • What are the Support Vectors? [Figure: the maximized-margin boundary; the support vectors are the training points that lie on the margin hyperplanes.]
  • Slide 30
  • Why do we want a model with only a few SVs? Leaving out an example that does not become an SV gives the same solution! Theorem (Vapnik and Chervonenkis, 1974): Let the number of SVs be obtained by training on N examples randomly drawn from P(X, Y), and let E denote expectation. Then the expected error is bounded as sketched below.
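  • The bound on this slide was an image; the leave-one-out bound usually quoted with this theorem (my reconstruction, so the exact statement on the original slide may differ slightly) is:

        \mathbb{E}\big[ P(\text{error}) \big] \;\le\; \frac{\mathbb{E}[\#\text{SVs}]}{N},
        \text{i.e. the expected test error of the machine trained on } N - 1 \text{ examples is at most the expected fraction of support vectors.}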
  • Slide 31
  • What Happens When Data Is Not Separable: Soft Margin SVM. Add a slack variable to each constraint.
  • Slide 32
  • Soft Margin SVM: Constrained Optimization Problem. Given the data, minimize the penalized objective subject to the relaxed constraints (standard form below).
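  • The soft-margin objective and constraints were images; the standard form, with slack variables \xi_i and penalty parameter C, is:

        \min_{w, b, \xi} \; \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i
        \quad \text{subject to} \quad y_i (w^\top x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, N.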
  • Slide 33
  • Dual Problem (Non-separable Data): maximize the dual objective subject to the box and equality constraints (standard form below).
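  • The soft-margin dual is the hard-margin dual with the multipliers capped at C; its standard form is:

        \max_{\alpha} \; \sum_{i=1}^{N} \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j
        \quad \text{subject to} \quad 0 \le \alpha_i \le C, \quad \sum_{i=1}^{N} \alpha_i y_i = 0.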
  • Slide 34
  • Same Decision Boundary
  • Slide 35
  • Mapping into Nonlinear Space. Goal: the data is linearly separable (or almost) in the nonlinear space.
  • Slide 36
  • Nonlinear SVMs. KEY IDEA: Both the decision boundary and the dual optimization formulation use dot products in input space only!
  • Slide 37
  • Kernel Trick: Replace every inner product in input space with a kernel evaluation; the same algorithms can then be used in the nonlinear kernel space (see the sketch below).
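  • The substitution the slide refers to is the standard kernel trick; as a concrete illustration (my own example, not from the deck), a degree-2 polynomial kernel corresponds to an explicit feature map:

        x_i^\top x_j \;\longrightarrow\; k(x_i, x_j) = \Phi(x_i)^\top \Phi(x_j).
        \text{For } x = (x_1, x_2) \text{ and } \Phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2): \quad \Phi(x)^\top \Phi(x') = (x^\top x')^2.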
  • Slide 38
  • Nonlinear SVMs: maximize the kernelized dual; the boundary is expressed through kernel evaluations (standard form below).
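  • The kernelized dual and boundary were images; their standard forms, obtained by the substitution above, are:

        \max_{\alpha} \; \sum_{i=1}^{N} \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j \, k(x_i, x_j)
        \quad \text{subject to the same constraints on } \alpha \text{ as before;}
        \qquad f(x) = \mathrm{sign}\left( \sum_{i=1}^{N} \alpha_i y_i \, k(x_i, x) + b \right).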
  • Slide 39
  • Need Mercer Kernels
  • Slide 40
  • Gram (Kernel) Matrix, built from the training data. Properties: positive definite, symmetric, positive on the diagonal, N by N.
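  • A minimal sketch of building a Gram matrix, assuming a Gaussian kernel and a small NumPy array of training inputs (both my choices, for illustration only):

        import numpy as np

        def gaussian_kernel(x, z, sigma=1.0):
            # k(x, z) = exp(-||x - z||^2 / (2 * sigma^2))
            return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

        # Toy training inputs: N = 4 points in 2 dimensions
        X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
        N = X.shape[0]

        # N-by-N Gram matrix with entries K[i, j] = k(x_i, x_j)
        K = np.array([[gaussian_kernel(X[i], X[j]) for j in range(N)] for i in range(N)])

        print(np.allclose(K, K.T))     # symmetric
        print(np.all(np.diag(K) > 0))  # positive on the diagonal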
  • Slide 41
  • Commonly Used Mercer Kernels: Polynomial, Sigmoid, Gaussian (their usual forms are sketched below).
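  • The kernel formulas on this slide were images; commonly quoted forms (the parameter names are my own) are:

        \text{Polynomial: } k(x, x') = (x^\top x' + c)^d, \qquad
        \text{Sigmoid: } k(x, x') = \tanh(\kappa\, x^\top x' + \theta), \qquad
        \text{Gaussian: } k(x, x') = \exp\!\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right).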
  • Slide 42
  • Why these kernels? There are infinitely many kernels, and the best kernel is data-set dependent; we can only know which kernels are good by trying them and estimating error rates on future data. Definition: a universal approximator is a mapping that can model any surface (i.e., any many-to-one mapping) arbitrarily well. Motivation for the most commonly used kernels: polynomials (given enough terms) are universal approximators; however, polynomial kernels are not universal approximators because they cannot represent all polynomial interactions. Sigmoid functions (given enough training examples) are universal approximators, and Gaussian kernels (given enough training examples) are universal approximators. These kernels have been shown to produce good models in practice.
  • Slide 43
  • Picking a Model (a Kernel for SVMs)? How do you pick the kernel and the kernel parameters? These are called learning parameters or hyperparameters. Two approaches to choosing learning parameters: Bayesian (learning parameters must maximize the probability of correct classification on future data, based on prior biases) and frequentist (use the training data to learn the model parameters, and use validation data to pick the best hyperparameters). More on learning parameter selection later.
  • Slide 47
  • Some SVM Software: LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/), SVM Light (http://svmlight.joachims.org/), TinySVM (http://chasen.org/~taku/software/TinySVM/), and WEKA (http://www.cs.waikato.ac.nz/ml/weka/), which has many ML algorithm implementations in Java. A small usage sketch follows below.
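  • As a usage sketch: scikit-learn is not on the slide's list, but its SVC class wraps LIBSVM, so it is one convenient way to try the kernels above (the toy data and parameter values here are my own):

        import numpy as np
        from sklearn.svm import SVC  # scikit-learn's SVC is built on LIBSVM

        # Toy 2-D binary classification data, labels in {-1, +1}
        X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
        y = np.array([-1, -1, +1, +1])

        # Soft-margin SVM with a Gaussian (RBF) kernel; C and gamma are hyperparameters
        clf = SVC(kernel="rbf", C=1.0, gamma=1.0)
        clf.fit(X, y)

        print(clf.support_vectors_)                   # support vectors found during training
        print(clf.predict([[0.1, 0.0], [1.1, 0.9]]))  # predicted labels for new points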
  • Slide 48
  • MNIST: An SVM Success Story. Handwritten character benchmark: 60,000 training and 10,000 test examples, dimension d = 28 x 28.
  • Slide 49
  • Results on Test Data: the SVM used a polynomial kernel of degree 9.
  • Slide 50
  • SVM (Kernel) Model Structure
