Mathematical Programming in Support Vector Machines
Olvi L. Mangasarian
University of Wisconsin - Madison
High Performance Computation for Engineering Systems Seminar
MIT, October 4, 2000
What is a Support Vector Machine?
An optimally defined surface
Typically nonlinear in the input space
Linear in a higher-dimensional space
Implicitly defined by a kernel function
What are Support Vector Machines Used For?
Classification
Regression & data fitting
Supervised & unsupervised learning
(Will concentrate on classification)
Example of a Nonlinear Classifier: Checkerboard Classifier
Outline of Talk
Generalized support vector machines (SVMs)
  A completely general kernel allows complex classification (no Mercer condition!)
Smooth support vector machines (SSVM)
  Smooth & solve the SVM by a fast Newton method
Lagrangian support vector machines (LSVM)
  Very fast, simple iterative scheme; one matrix inversion: no LP, no QP
Reduced support vector machines (RSVM)
  Handle large datasets with nonlinear kernels
Generalized Support Vector Machines
The 2-Category Linearly Separable Case

[Figure: the point sets A+ and A− separated by the bounding planes x′w = γ + 1 and x′w = γ − 1, with normal vector w.]
Generalized Support Vector Machines
Algebra of the 2-Category Linearly Separable Case

Given m points in n-dimensional space, represented by an m×n matrix A.
Membership of each point A_i in class +1 or −1 is specified by an m×m diagonal matrix D with +1 and −1 entries.
Separate the two classes by the bounding planes x′w = γ ± 1:

  A_i w ≥ γ + 1, for D_ii = +1;
  A_i w ≤ γ − 1, for D_ii = −1.

More succinctly:

  D(Aw − eγ) ≥ e,

where e is a vector of ones.
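A minimal sketch (with hypothetical data, not from the talk) of this matrix notation in MATLAB: the rows of A are the points, and D carries the ±1 class labels on its diagonal.

A = [1 2; 3 4; -1 -2; -3 -1];   % four points in R^2, one per row
d = [1; 1; -1; -1];             % class membership of each point
D = diag(d);                    % m-by-m diagonal matrix of labels
e = ones(4,1);
w = [1; 1]; gamma = 0;          % a candidate plane x'w = gamma
feasible = all(D*(A*w - e*gamma) >= e)   % 1 iff both bounding-plane conditions hold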
Generalized Support Vector Machines
Maximizing the Margin between Bounding Planes

[Figure: the bounding planes x′w = γ + 1 and x′w = γ − 1 around A+ and A−; the margin between the planes is 2/‖w‖₂.]
Generalized Support Vector Machines
The Linear Support Vector Machine Formulation

Solve the following mathematical program for some ν > 0:

  min_{w,γ,y}  νe′y + (1/2)‖w‖₂²
  s.t.  D(Aw − eγ) + y ≥ e,
        y ≥ 0.

The nonnegative slack variable y is zero if and only if:
  The convex hulls of A+ and A− do not intersect.
  ν is sufficiently large.
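This is a standard convex quadratic program, so any QP solver applies. A minimal sketch (assuming the MATLAB Optimization Toolbox's quadprog and an illustrative value of ν), stacking the variables as z = [w; γ; y]:

[m,n] = size(A); e = ones(m,1); nu = 1;   % nu: illustrative tuning value
H = blkdiag(eye(n), 0, zeros(m));         % quadratic term: (1/2)||w||^2
f = [zeros(n+1,1); nu*e];                 % linear term: nu*e'*y
Aineq = [-D*A, D*e, -eye(m)];             % D(Aw - e*gamma) + y >= e
bineq = -e;
lb = [-inf(n+1,1); zeros(m,1)];           % y >= 0
z = quadprog(H, f, Aineq, bineq, [], [], lb);
w = z(1:n); gamma = z(n+1); y = z(n+2:end);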
Breast Cancer Diagnosis Application
97% tenfold cross-validation correctness
780 samples: 494 benign, 286 malignant
Another Application: Disputed Federalist Papers (Bosch & Smith, 1998)
56 Hamilton, 50 Madison, 12 disputed
Generalized Support Vector Machine Motivation
(Nonlinear Kernel Without Mercer Condition)

Linear SVM, with linear separating surface x′w = γ:

  min νe′y + ‖w‖₁
  s.t. D(Aw − eγ) + y ≥ e, y ≥ 0.

Set w = A′Du. The resulting linear surface is x′A′Du = γ:

  min νe′y + ‖u‖₁
  s.t. D(AA′Du − eγ) + y ≥ e, y ≥ 0.

Replace AA′ by an arbitrary nonlinear kernel K(A, A′). The resulting nonlinear surface is K(x′, A′)Du = γ:

  min νe′y + ‖u‖₁
  s.t. D(K(A, A′)Du − eγ) + y ≥ e, y ≥ 0.
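A small sketch of what the substitution buys: given a solution u, γ of either program, the linear and kernel surfaces classify a new point x as below (K stands for any kernel function handle; the names are illustrative, not from the talk).

classify_linear = @(x) sign(x'*(A'*(D*u)) - gamma);   % surface x'A'Du = gamma
classify_kernel = @(x) sign(K(x', A')*(D*u) - gamma); % surface K(x',A')Du = gamma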
SSVM: Smooth Support Vector Machine
(SVM as Unconstrained Minimization Problem)

Changing to the 2-norm on the slack y and measuring the margin in the (w, γ) space gives:

  min_{w,γ,y} (ν/2)‖y‖₂² + (1/2)‖(w, γ)‖₂²
  s.t. D(Aw − eγ) + y ≥ e.

At a solution, y = (e − D(Aw − eγ))₊, where (·)₊ replaces negative components by zeros, so the problem is equivalent to an unconstrained minimization in (w, γ) alone, with the nonsmooth plus function in the objective.
SSVM: Smoothing the Plus Function
(Integrate the Sigmoid Function)

Integrating the sigmoid approximation to the step function,

  s(x, α) = 1/(1 + ε^(−αx)),

gives a smooth, excellent approximation to the plus function:

  p(x, α) = x + (1/α) log(1 + ε^(−αx)), α > 0,

where ε denotes the base of natural logarithms.
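As a quick numerical check (not from the talk), p(x, α) approaches the plus function (x)₊ = max(x, 0) as α grows; the largest gap is log(2)/α, attained at x = 0.

x = linspace(-2, 2, 401)';                    % grid containing x = 0
for alpha = [1 5 10]
  p = x + log(1 + exp(-alpha*x))/alpha;       % smooth approximation p(x,alpha)
  fprintf('alpha = %2d   max |p - (x)_+| = %.4f\n', ...
          alpha, max(abs(p - max(x,0))));     % prints log(2)/alpha
end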
Replacing the plus function in the nonsmooth SVM by the smooth approximation gives our SSVM:

  min Φ_α(w, γ) := min (ν/2)‖p(e − D(Aw − eγ), α)‖₂² + (1/2)‖(w, γ)‖₂².
Newton: Minimize a sequence of quadratic approximations to the strongly convex objective function, i.e., solve a sequence of linear equations in n+1 variables. (Small-dimensional input space.)

Armijo: Shorten the distance between successive iterates so as to generate sufficient decrease in the objective function. (In computational reality, not needed!)

Global quadratic convergence: Starting from any point, the iterates are guaranteed to converge to the unique solution at a quadratic rate, i.e., errors get squared. (Typically 6 to 8 iterations, without an Armijo step.)
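A minimal sketch (not the authors' code) of this Newton iteration, using the gradient and Hessian of Φ_α implied by p′(x, α) = s(x, α); the Armijo step is omitted, and all names are illustrative.

function [w, gamma] = ssvm_newton(A, D, nu, alpha, itmax)
% Newton iteration for min Phi_alpha(w,gamma); v stacks [w; gamma].
[m, n] = size(A);
v = zeros(n+1, 1);
E = [-D*A, D*ones(m,1)];                 % so that e - D(Aw - e*gamma) = e + E*v
for it = 1:itmax
  r = ones(m,1) + E*v;
  s = 1 ./ (1 + exp(-alpha*r));          % sigmoid s(r,alpha) = p'(r,alpha)
  p = r + log(1 + exp(-alpha*r))/alpha;  % smoothed plus function p(r,alpha)
  g = nu*(E'*(p.*s)) + v;                % gradient of Phi_alpha
  H = nu*(E'*diag(s.^2 + alpha*p.*s.*(1-s))*E) + eye(n+1);  % Hessian
  v = v - H\g;                           % solve n+1 linear equations
  if norm(g) < 1e-8, break; end          % converged
end
w = v(1:n); gamma = v(n+1);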
SSVM with a Nonlinear Kernel
Nonlinear Separating Surface in Input Space
Examples of Kernels That Generate Nonlinear Separating Surfaces in Input Space

Given A ∈ R^(m×n), a ∈ R^m, μ ∈ R, and an integer d:

Polynomial kernel:
  (AA′ + μaa′)^d, with the power taken componentwise.

Neural network kernel:
  (AA′ + μaa′)_∗, where ∗ is the step function ∗: R → {0, 1}, applied componentwise.

Gaussian (radial basis) kernel:
  ε^(−μ‖A_i − A_j‖²), i, j = 1, …, m.
LSVM: Lagrangian Support Vector Machine
Dual of SVM

Taking the dual of the SVM formulation gives the following simple dual problem:

  min_{0 ≤ u ∈ R^m} (1/2)u′(I/ν + D(AA′ + ee′)D)u − e′u.

The variables (w, γ, y) of SSVM are related to u by:

  w = A′Du,  y = u/ν,  γ = −e′Du.
LSVM: Lagrangian Support Vector Machine
Dual SVM as a Symmetric Linear Complementarity Problem

Defining the two matrices:

  H = D[A −e],  Q = I/ν + HH′

reduces the dual SVM to:

  min_{0 ≤ u ∈ R^m} f(u) := (1/2)u′Qu − e′u.

The optimality condition for this dual SVM is the LCP:

  0 ≤ u ⊥ Qu − e ≥ 0,

which, by Implicit Lagrangian theory, is equivalent to:

  Qu − e = ((Qu − e) − αu)₊, α > 0.
LSVM Algorithm
Simple & Linearly Convergent, with One Small Matrix Inversion

  u^{i+1} = Q^{−1}(e + ((Qu^i − e) − αu^i)₊), i = 0, 1, …,

where 0 < α < 2/ν.

Key idea: the Sherman-Morrison-Woodbury formula allows the inversion of the extremely large m×m matrix Q by merely inverting a much smaller (n+1)×(n+1) matrix:

  (I/ν + HH′)^{−1} = ν(I − H(I/ν + H′H)^{−1}H′).
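A quick numerical check of this identity on random data (illustrative only): the right-hand side inverts only the small (n+1)×(n+1) matrix.

m = 500; n = 10; nu = 0.1;
H = randn(m, n+1);
lhs = inv(eye(m)/nu + H*H');                       % large m-by-m inverse
rhs = nu*(eye(m) - H*inv(eye(n+1)/nu + H'*H)*H');  % small (n+1)-by-(n+1) inverse
disp(norm(lhs - rhs))                              % near zero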
LSVM Algorithm – Linear Kernel
11 Lines of MATLAB Code

function [it, opt, w, gamma] = svml(A,D,nu,itmax,tol)
% lsvm with SMW for min 1/2*u'*Q*u-e'*u s.t. u=>0,
% Q=I/nu+H*H', H=D[A -e]
% Input: A, D, nu, itmax, tol; Output: it, opt, w, gamma
% [it, opt, w, gamma] = svml(A,D,nu,itmax,tol);
[m,n]=size(A); alpha=1.9/nu; e=ones(m,1); H=D*[A -e]; it=0;
S=H*inv((speye(n+1)/nu+H'*H));             % SMW: invert only (n+1)-by-(n+1)
u=nu*(1-S*(H'*e)); oldu=u+1;               % u = Q^{-1}e via SMW
while it<itmax & norm(oldu-u)>tol
  z=(1+pl(((u/nu+H*(H'*u))-alpha*u)-1));   % z = e + ((Qu-e)-alpha*u)_+
  oldu=u;
  u=nu*(z-S*(H'*z));                       % u = Q^{-1}z via SMW
  it=it+1;
end;
opt=norm(u-oldu); w=A'*D*u; gamma=-e'*D*u;

function pl = pl(x); pl = (abs(x)+x)/2;    % the plus function (x)_+
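A hypothetical usage example for the code above, on random linearly separable data:

m = 1000; n = 10;
A = randn(m, n);
d = sign(A*randn(n,1)); d(d==0) = 1;   % labels from a random plane through 0
D = diag(d);
[it, opt, w, gamma] = svml(A, D, 1, 100, 1e-5);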
LSVM Algorithm – Linear Kernel
Computational Results

2 million random points in 10-dimensional space:
  Classified in 6.7 minutes, in 6 iterations, to e-5 accuracy
  250 MHz UltraSPARC II with 2 gigabytes of memory
  CPLEX ran out of memory

32,562 points in 123-dimensional space (UCI Adult dataset):
  Classified in 141 seconds & 55 iterations to 85% correctness
  400 MHz Pentium II with 2 gigabytes of memory
  SVMlight classified in 178 seconds & 4,497 iterations
LSVM – Nonlinear Kernel
Formulation

For the nonlinear kernel:

  K(A, B): R^(m×n) × R^(n×l) → R^(m×l),

the separating nonlinear surface is given by:

  K([x′ −1], G′)Du = 0, where G = [A −e].

Here u is the solution of the dual problem:

  min_{0 ≤ u ∈ R^m} f(u) := (1/2)u′Qu − e′u,

with Q redefined as:

  Q = I/ν + DK(G, G′)D.
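A minimal sketch of the corresponding iteration (A, D, e, ν, α, itmax as before; KGG stands for a precomputed kernel matrix K(G, G′), an assumption, not code from the talk). Q is now m×m, so it is inverted once directly, with no SMW.

G = [A -e];                       % augmented data matrix
Q = eye(m)/nu + D*KGG*D;          % KGG = K(G,G'), assumed precomputed
Qinv = inv(Q);                    % one m-by-m inversion
u = Qinv*e;                       % starting point
for it = 1:itmax
  u = Qinv*(e + max((Q*u - e) - alpha*u, 0));   % LSVM fixed-point step
end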
LSVM Algorithm – Nonlinear Kernel Application
100 Iterations, 58 Seconds on Pentium II, 95.9% Accuracy
Reduced Support Vector Machines (RSVM)
Large Nonlinear Kernel Classification Problems

Key idea: use a rectangular kernel K(A, Ā′), where Ā′ is a small random sample of A′.
Typically Ā has 1% to 10% of the rows of A.

Two important consequences:
  RSVM can solve very large problems.
  The nonlinear separator depends only on Ā.

  min_{ū,γ,y} (ν/2)y′y + (1/2)(ū′ū + γ²)
  s.t. D(K(A, Ā′)D̄ū − eγ) + y ≥ e, y ≥ 0.

Separating surface: K(x′, Ā′)D̄ū = γ.
(Using the small square kernel K(Ā, Ā′) alone gives lousy results.)
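A sketch of forming the rectangular Gaussian kernel from a random row sample (μ, A, D as before; the names are illustrative):

mbar = ceil(0.05*m);                       % e.g. 5% of the m rows of A
idx = randperm(m); idx = idx(1:mbar);
Abar = A(idx,:); Dbar = D(idx,idx);        % sampled rows and their labels
sqA = sum(A.^2,2); sqB = sum(Abar.^2,2);
Kbar = exp(-mu*(repmat(sqA,1,mbar) + repmat(sqB',m,1) - 2*(A*Abar')));
% Kbar is the m-by-mbar kernel K(A, Abar'); the separator
% K(x', Abar')*Dbar*ubar = gamma depends only on Abar.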
Conventional SVM Result on Checkerboard Using 50 Random Points Out of 1000
RSVM Result on Checkerboard Using SAME 50 Random Points Out of 1000
RSVM on Large Classification Problems
Standard error over 50 runs = 0.001 to 0.002
RSVM time = 1.24 * (random points time)
Conclusion
Mathematical Programming plays an essential role in SVMs
Theory
  New formulations
    Generalized SVMs
  New algorithm-generating concepts
    Smoothing (SSVM)
    Implicit Lagrangian (LSVM)
Algorithms
  Fast: SSVM
  Massive: LSVM, RSVM
Future Research
Theory
  Concave minimization
  Concurrent feature & data selection
  Multiple-instance problems
  SVMs as complementarity problems
Algorithms
  Multicategory classification algorithms
  Kernel methods in nonlinear programming
  Chunking for massive classification: 10^8 points
Talk & Papers Available on Web
www.cs.wisc.edu/~olvi