Optimization for Kernel Methods
S. Sathiya Keerthi
Yahoo! Research, Burbank, CA, USA
Kernel methods:
• Support Vector Machines (SVMs)
• Kernel Logistic Regression (KLR)
Aim: To introduce a variety of optimization problems that arise in the solution of classification problems by kernel methods, briefly review relevant optimization algorithms, and point out which optimization methods are suited for these problems.
The lectures in this topic will be divided into 6 parts:
1. Optimization problems arising in kernel methods
2. A review of optimization algorithms
3. Decomposition methods
4. Quadratic programming methods
5. Path tracking methods
6. Finite Newton method
The first two topics form an introduction, the next two topics cover dual methods, and the last two topics cover primal methods.
Part I
Optimization Problems Arising in
Kernel Methods
References:
1. B. Scholkopf and A. Smola, Learning with Kernels, MIT Press, 2002, Chapter 7, Pattern Recognition.
Kernel methods for Classification problems
• Although kernel methods are used for a range of problems such as classification (binary/multiclass), regression, ordinal regression, ranking and unsupervised learning, we will focus only on binary classification problems.
• Training data: {xi, ti}, i = 1, . . . , n
• xi ∈ Rm is the i-th input vector.
• ti ∈ {1,−1} is the target for xi, denoting the class to which the i-th example belongs; 1 denotes class 1 and −1 denotes class 2.
• Kernel methods transform x to a Reproducing Kernel Hilbert Space, H, via φ : Rm → H and then develop a linear classifier in that space:

    y(x) = w · φ(x) + b
    y(x) > 0 ⇒ x ∈ Class 1;  y(x) < 0 ⇒ x ∈ Class 2

• The dot product in H, i.e., k(xi, xj) = φ(xi) · φ(xj), is called the kernel function. All computations are done using k only.
• Example: φ(x) is the vector of all monomials up to degree d on the components of x. For this example, k(xi, xj) = (1 + xi · xj)^d. This is the polynomial kernel function. The larger the value of d, the more flexible and powerful the classifier function.
• RBF kernel: k(xi, xj) = exp(−γ‖xi − xj‖²) is another popular kernel function. The larger the value of γ, the more flexible and powerful the classifier function.
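The two kernels can be computed for whole data matrices at once. An illustrative sketch (the function names are ours, not from the lectures):

```python
import numpy as np

def polynomial_kernel(X1, X2, d=3):
    """Polynomial kernel k(x, z) = (1 + x.z)^d for all pairs of rows."""
    return (1.0 + X1 @ X2.T) ** d

def rbf_kernel(X1, X2, gamma=1.0):
    """RBF kernel k(x, z) = exp(-gamma * ||x - z||^2) for all pairs of rows."""
    sq = (np.sum(X1**2, axis=1)[:, None]
          + np.sum(X2**2, axis=1)[None, :]
          - 2.0 * X1 @ X2.T)
    return np.exp(-gamma * sq)

X = np.array([[0.0, 1.0], [1.0, 0.0]])
K_poly = polynomial_kernel(X, X, d=2)   # (1 + xi.xj)^2
K_rbf = rbf_kernel(X, X, gamma=0.5)
```

Larger d or γ makes the resulting kernel matrix (and hence the classifier) more flexible, as noted above.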
• Training problem: (w, b), which define the classifier, are obtained by solving the following optimization problem:

    min_{w,b} E = R + C L

• L is the empirical error, defined as

    L = ∑i l(y(xi), ti)

where l is a loss function that describes the discrepancy between the classifier output y(xi) and the target ti.
• Minimizing only L can lead to overfitting on the training data. The regularizer function R prefers simpler models and helps prevent overfitting.
• The parameter C establishes a trade-off between R and L. C is a hyperparameter. (Other parameters such as d in the polynomial kernel and γ in the RBF kernel are also hyperparameters.) All hyperparameters need to be tuned at a higher level.
Some commonly used loss functions
• SVM (hinge) loss: l(y, t) = 1 − ty if ty < 1; 0 otherwise.
• KLR (logistic) loss: l(y, t) = log(1 + exp(−ty))
• L2-SVM loss: l(y, t) = (1 − ty)²/2 if ty < 1; 0 otherwise.
• Modified Huber loss: l(y, t) is: 0 if ξ ≤ 0; ξ²/2 if 0 < ξ < 2; and 2(ξ − 1) if ξ ≥ 2, where ξ = 1 − ty.
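The four losses can be written down directly as functions of the classifier output y and the target t. An illustrative sketch (function names are ours):

```python
import numpy as np

def hinge_loss(y, t):
    """SVM (hinge) loss: 1 - ty if ty < 1, else 0."""
    return np.maximum(0.0, 1.0 - t * y)

def logistic_loss(y, t):
    """KLR (logistic) loss: log(1 + exp(-ty))."""
    return np.log1p(np.exp(-t * y))

def l2_svm_loss(y, t):
    """L2-SVM loss: (1 - ty)^2 / 2 if ty < 1, else 0."""
    return 0.5 * np.maximum(0.0, 1.0 - t * y) ** 2

def modified_huber_loss(y, t):
    """Modified Huber loss, written in terms of xi = 1 - ty."""
    xi = 1.0 - t * y
    return np.where(xi <= 0.0, 0.0,
                    np.where(xi < 2.0, 0.5 * xi**2, 2.0 * (xi - 1.0)))
```

All four are zero (or nearly so) for confidently correct outputs (ty large) and grow for margin violations; all but the hinge loss are differentiable everywhere, which matters for the primal methods in Parts V and VI.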
Margin based regularization
• The margin between the planes defined by y(x) = ±1 is 2/‖w‖. Making the margin big is equivalent to making the function R = (1/2)‖w‖² small.
• This function is a very effective regularizing function. It is the natural regularizer associated with the RKHS.
• Although other regularizers have been considered in the literature, in these lectures I will restrict attention to the optimization problems directly related to the above mentioned natural regularizer.

Primal problem:

    min (1/2)‖w‖² + C ∑i l(y(xi), ti)

Sometimes the term (1/2)b² is also added in order to handle w and b uniformly. (This is equivalent to ignoring b and instead adding a constant to the kernel function.)
Solution via Wolfe dual
• w and y(·) have the representation:

    w = ∑i αi ti φ(xi),    y(x) = ∑i αi ti k(x, xi)

• w could reside in an infinite dimensional space (e.g., in the case of the RBF kernel) and so we have to necessarily handle the solution via finite dimensional quantities such as the αi's. This is effectively done via the Wolfe dual (details will be covered in lectures on kernel methods by other speakers).

SVM dual: (convex quadratic program)

    min E(α) = (1/2) ∑i,j ti tj αi αj k(xi, xj) − ∑i αi
    s.t. 0 ≤ αi ≤ C,  ∑i ti αi = 0

KLR dual: (convex program)

    min E(α) = (1/2) ∑i,j ti tj αi αj k(xi, xj) + C ∑i g(αi/C)
    s.t. ∑i ti αi = 0

where g(δ) = δ log δ + (1 − δ) log(1 − δ).

L2-SVM dual: (convex quadratic program)

    min E(α) = (1/2) ∑i,j ti tj αi αj k̃(xi, xj) − ∑i αi
    s.t. αi ≥ 0,  ∑i ti αi = 0

where k̃(xi, xj) = k(xi, xj) + (1/C) δij.

Modified Huber: the dual can be written down, but it is a bit more complex.
Ordinal regression

All the ideas for binary classification can be easily extended to ordinal regression.

There are several ways of defining losses for ordinal regression. One way is to define a threshold for each successive class and include a loss term for each pair of classes.
An Alternative: Direct primal design

Primal problem:

    min (1/2)‖w‖² + C ∑i l(y(xi), ti)    (1)

Plug into (1) the representation

    w = ∑i βi ti φ(xi),    y(x) = ∑i βi ti k(x, xi)

to get the problem

    min (1/2) ∑i,j ti tj βi βj k(xi, xj) + C ∑i l(y(xi), ti)    (2)

We can attempt to directly solve (2) to get the β vector.

Such an approach can be particularly attractive when the loss function l is differentiable, such as in the cases of KLR, L2-SVM and Modified Huber loss SVM, since the optimization problem is an unconstrained one.
Sparse formulations (minimizing the number of nonzero βi)
• Approach 1: Replace the regularizer in (2) by the "sparsity-inducing regularizer" ∑i |βi| to get the optimization problem:

    min ∑i |βi| + C ∑i l(y(xi), ti)    (3)

• Approach 2: Include the sparsity regularizer ∑i |βi| in a graded fashion:

    min λ ∑i |βi| + (1/2) ∑i,j ti tj βi βj k(xi, xj) + C ∑i l(y(xi), ti)    (4)

Large λ will force sparse solutions while small λ will get us back to the original kernel solution.
Semi-supervised Learning
• In many problems a set of unlabeled examples, {xk}, is available.
• E is an edge relation on that set with weights ρkl.
• Then (1/2) ∑(k,l)∈E ρkl (y(xk) − y(xl))² can be included as an additional regularizer. (Nearby input vectors should have nearby y values.)
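This additional regularizer is a quadratic form (1/2) y′Ly, where L is the graph Laplacian of the weighted edge relation. A small sketch on a hypothetical 3-node graph (the data and names here are ours, for illustration only):

```python
import numpy as np

# Hypothetical tiny graph on 3 unlabeled points: (k, l, rho_kl) triples.
edges = [(0, 1, 1.0), (1, 2, 2.0)]
n = 3

def laplacian(n, edges):
    """Graph Laplacian L = D - W for a weighted, undirected edge list."""
    Lap = np.zeros((n, n))
    for k, l, rho in edges:
        Lap[k, k] += rho
        Lap[l, l] += rho
        Lap[k, l] -= rho
        Lap[l, k] -= rho
    return Lap

y = np.array([1.0, 0.0, -1.0])          # classifier outputs on the unlabeled points
Lap = laplacian(n, edges)

# The two ways of writing the regularizer agree:
reg_direct = 0.5 * sum(rho * (y[k] - y[l])**2 for k, l, rho in edges)
reg_quad = 0.5 * y @ Lap @ y
```

Writing it as y′Ly makes clear that the extra term is just another convex quadratic in the classifier parameters.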
Transductive design
• Solve the problem

    min (1/2)‖w‖² + C ∑i l(y(xi), ti) + C ∑k l(y(xk), tk)

where the tk ∈ {1,−1} are also variables.
• This is a combinatorial optimization problem. There exist good special techniques for solving it, but we will not go into any details in these lectures.
Part II
A Review of Optimization Algorithms
References:
1. B. Scholkopf and A. Smola, Learning with Kernels, MIT Press, 2002, Chapter 6, Optimization.
2. D. P. Bertsekas, Nonlinear Programming, Athena Scientific, 1995.
Types of optimization problems
    min_{θ∈Z} E(θ)

• E : Z → R is continuously differentiable, Z ⊂ Rn
• Z = Rn ⇒ Unconstrained
• E linear, Z polyhedral ⇒ Linear Programming
  E quadratic, Z polyhedral ⇒ Quadratic Programming (example: SVM dual)
  Else, Nonlinear Programming
• These problems have been traditionally treated separately. Their methodologies have come closer in recent years.

Unconstrained: Optimality conditions

At a minimum:
• Stationarity: ∇E = 0
• Non-negative curvature: ∇²E is positive semi-definite

E convex ⇒ a local minimum is a global minimum.
Geometry of descent

A descent direction d at θ satisfies ∇E(θ)′d < 0.

A sketch of a descent algorithm

Exact line search:

    η⋆ = arg minη φ(η),  where φ(η) = E(θ + ηd)

Inexact line search: Armijo condition
Global convergence theorem
• E is Lipschitz continuous
• Sufficient angle of descent condition:

    −∇E(θk)′dk ≥ δ‖∇E(θk)‖‖dk‖,  δ > 0

• Armijo line search condition: for some 0 < µ < 0.5,

    −(1 − µ)η∇E(θk)′dk ≥ E(θk) − E(θk + ηdk) ≥ −µη∇E(θk)′dk

Then either E → −∞ or θk converges to a stationary point θ⋆: ∇E(θ⋆) = 0.

Rate of convergence
• εk = E(θk+1) − E(θk)
• |εk+1| = ρ|εk|^r in the limit as k → ∞
• r = rate of convergence, a key factor for the speed of optimization algorithms
• Linear convergence (r = 1) is quite a bit slower than quadratic convergence (r = 2).
• Many optimization algorithms have superlinear convergence (1 < r < 2), which is pretty good.
Gradient descent method
• d = −∇E
• Linear convergence
• Very simple; locally good; but often very slow; rarely used in practice
Newton method

    d = −[∇²E]⁻¹∇E,  η = 1

Interpretations:
• θ + d minimizes the second order approximation

    E(θ) + ∇E(θ)′d + (1/2) d′∇²E(θ)d

• θ + d solves the linearized optimality condition

    ∇E(θ + d) ≈ ∇E(θ) + ∇²E(θ)d = 0

• Quadratic rate of convergence

Implementation:
• Compute H = ∇²E(θ), g = ∇E(θ), and solve Hd = −g
• Use the Cholesky factorization H = LL′
• The Newton method may not converge (or worse, if H is not positive definite, d may not even be properly defined) when started far from a minimum
• Modify the method in 2 ways: (a) change H to a "nearby" positive definite matrix whenever it is not; and (b) add line search.
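The two modifications can be sketched together: shift H by a multiple of the identity until a Cholesky factorization succeeds, then backtrack on the step length until the Armijo condition holds. This is an illustrative sketch (names and the shift schedule are ours, not a prescribed implementation):

```python
import numpy as np

def modified_newton(E, grad, hess, theta, iters=100, tol=1e-8):
    """Newton's method with (a) a diagonal shift making H positive definite
    and (b) Armijo backtracking line search."""
    for _ in range(iters):
        g = grad(theta)
        if np.linalg.norm(g) < tol:
            break
        H = hess(theta)
        tau = 0.0
        while True:  # increase the shift until H + tau*I is positive definite
            try:
                L = np.linalg.cholesky(H + tau * np.eye(len(theta)))
                break
            except np.linalg.LinAlgError:
                tau = max(2.0 * tau, 1e-6)
        # solve (H + tau*I) d = -g using the Cholesky factors
        d = np.linalg.solve(L.T, np.linalg.solve(L, -g))
        eta, mu = 1.0, 1e-4
        while E(theta + eta * d) > E(theta) + mu * eta * (g @ d):
            eta *= 0.5  # Armijo backtracking
        theta = theta + eta * d
    return theta

# Usage on a simple convex quadratic (one Newton step suffices):
E = lambda x: (x[0] - 1.0)**2 + 2.0 * x[1]**2
grad = lambda x: np.array([2.0 * (x[0] - 1.0), 4.0 * x[1]])
hess = lambda x: np.array([[2.0, 0.0], [0.0, 4.0]])
opt = modified_newton(E, grad, hess, np.array([5.0, 5.0]))
```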
Quasi-Newton methods:
• Instead of calculating the Hessian and inverting it, QN methods build an approximation to the inverse Hessian over many steps using gradient information.
• Several update formulas for the inverse Hessian exist; the BFGS method is popularly used.
• Applied to a convex quadratic function with exact line search, they find the minimizer within n steps.
• With inexact line search the QN methods can be used for minimizing nonlinear functions too. They work well even with loose line search.
Method of conjugate directions
• Originally developed for convex quadratic minimization (P positive definite):

    min E(θ) = (1/2) θ′Pθ − q′θ

  Equivalently, solve Pθ = q.
• Define a set of P-conjugate search directions {d0, d1, . . . , dn−1} such that d′i P dj = 0 ∀ i ≠ j.
• Do exact line search along each direction.
• Main result: the minimum of E will be reached in exactly n steps.

Conjugate gradient method
• Choose any initial point θ0; set d0 = −∇E(θ0)
• θk+1 = θk + ηk dk, where ηk = arg minη E(θk + ηdk)
• dk+1 = −∇E(θk+1) + βk dk
• Simply choosing βk so that d′k+1 P dk = 0 is sufficient to ensure P-conjugacy of dk+1 with all previous directions.

    βk = ‖∇E(θk+1)‖² / ‖∇E(θk)‖²    (Fletcher-Reeves formula)

    βk = ∇E(θk+1)′[∇E(θk+1) − ∇E(θk)] / ‖∇E(θk)‖²    (Polak-Ribière formula)
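For the quadratic case, the conjugate gradient iteration with the Fletcher-Reeves formula and the closed-form exact line search can be sketched as follows (an illustrative sketch; names are ours):

```python
import numpy as np

def conjugate_gradient(P, q, theta0=None, tol=1e-10):
    """CG for min (1/2) th'P th - q'th, i.e. for solving P th = q.
    P must be symmetric positive definite; the exact line search
    step has a closed form for a quadratic."""
    n = len(q)
    theta = np.zeros(n) if theta0 is None else theta0.copy()
    r = q - P @ theta          # negative gradient (residual)
    d = r.copy()
    for _ in range(n):         # reaches the minimum in at most n steps
        if np.linalg.norm(r) < tol:
            break
        eta = (r @ r) / (d @ (P @ d))      # exact line search
        theta = theta + eta * d
        r_new = r - eta * (P @ d)
        beta = (r_new @ r_new) / (r @ r)   # Fletcher-Reeves
        d = r_new + beta * d
        r = r_new
    return theta

P = np.array([[4.0, 1.0], [1.0, 3.0]])
q = np.array([1.0, 2.0])
theta = conjugate_gradient(P, q)   # converges in at most n = 2 steps
```

Only matrix-vector products with P are needed, which is why CG suits large scale problems.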
Nonlinear Conjugate gradient method
• Simply use the nonlinear gradient function ∇E for getting the directions
• Replace exact line search by inexact line search; the Armijo condition needs to be replaced by a more stringent condition called the Wolfe condition
• Convergence in n steps is no longer possible

Practicalities:
• FR and PR usually behave very differently
• PR is usually better
• The methods are very sensitive to the line search
Overall comparison of the methods
• Quasi-Newton methods are robust. But they require O(n²) memory to store the approximate Hessian inverse, and so they are not directly suited for large scale problems. Modifications of these methods, called Limited Memory Quasi-Newton methods, use O(n) memory and are suited for large scale problems.
• Conjugate gradient methods also work well and are well suited for large scale problems. However they need to be implemented carefully, with a carefully set line search.
• In some situations block coordinate descent methods (optimizing a selected subset of variables at a time) can be very much better suited than the above methods. We will say more about this later.
Linear Programming

    min E(θ) = c′θ subject to Aθ ≤ b, θ ≥ 0

The solution occurs at a vertex of the feasible polyhedral region.

Simplex algorithm:
• Starts at a vertex and moves along descending edges until an optimal vertex is found.
• In the worst case the algorithm takes a number of steps that is exponential in n; but in practice it is very efficient.
• Interior point methods are alternative methods that are provably polynomial time. They are also very efficient when implemented via certain predictor-corrector ideas.

Quadratic Programming

The (Wolfe) SVM dual is a good example. Many traditional QP methods exist. For instance, the active set method, which solves a sequence of equality constrained problems, is a good traditional QP method. We will talk about this method in detail in Part IV.
Part III
Decomposition methods
References:
1. B. Scholkopf and A. Smola, Learning with Kernels, MIT Press, 2002, Chapter 10, Implementation.
2. T. Joachims, Making large-scale SVM learning practical, In B. Scholkopf, C. J. C. Burges and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pp. 169-184, MIT Press, 1999. http://www.cs.cornell.edu/People/tj/publications/joachims_98c.pdf
3. J. C. Platt, Fast training of support vector machines using sequential minimal optimization, In B. Scholkopf, C. J. C. Burges and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pp. 185-208, MIT Press, 1999. http://research.microsoft.com/~jplatt/smo.html
4. S. S. Keerthi et al., Improvements to Platt's SMO algorithm for SVM classifier design, Neural Computation, Vol. 13, March 2001, pp. 637-649. http://guppy.mpe.nus.edu.sg/~mpessk/svm/svm.shtml
5. LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
6. SVMlight: http://svmlight.joachims.org/
SVM Dual Problem
• We will take up the details only for the SVM (hinge loss). The ideas are quite similar for the optimization problems arising from other loss functions such as L2-SVM, KLR, Modified Huber, etc.
• Recall the SVM dual convex quadratic program:

    min (1/2) ∑i,j ti tj αi αj k(xi, xj) − ∑i αi
    subject to 0 ≤ αi ≤ C,  ∑i ti αi = 0

• In matrix-vector notation:

    minα (1/2) α′Qα − e′α
    subject to 0 ≤ αi ≤ C, i = 1, . . . , n,  t′α = 0,

where Qij = ti tj k(xi, xj) and e = [1, . . . , 1]′.
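In code, Q and the dual objective can be formed directly from the kernel matrix, since Q is the elementwise product of tt′ with K. An illustrative sketch (names are ours):

```python
import numpy as np

def build_Q(K, t):
    """Q_ij = t_i t_j k(x_i, x_j): elementwise product of tt' with K."""
    return np.outer(t, t) * K

def dual_objective(alpha, Q):
    """f(alpha) = (1/2) alpha'Q alpha - e'alpha."""
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

t = np.array([1.0, -1.0, 1.0])
K = np.eye(3)                         # trivial kernel matrix, for illustration
Q = build_Q(K, t)
f0 = dual_objective(np.zeros(3), Q)   # the dual objective at alpha = 0
```

For n training points this Q is n by n and dense, which is exactly the storage difficulty discussed next.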
Large Dense Quadratic Programming
• minα f(α) = (1/2) α′Qα − e′α, subject to t′α = 0, 0 ≤ αi ≤ C
• Qij ≠ 0 in general; Q is an n by n fully dense matrix
• 30,000 training points means 30,000 variables: (30,000² × 8 / 2) bytes ≈ 3 GB of RAM to store Q: still difficult
• Traditional optimization methods: Newton and quasi-Newton cannot be directly applied since they involve O(n²) storage.
• Even methods such as CG which require O(n) storage are unsuitable, since ∇f = Qα − e requires the entire kernel matrix.
• Decomposition (block coordinate descent) methods are best suited.
Decomposition Methods
• Work on a few variables at a time
• Similar to coordinate-wise minimization; also referred to as chunking
• Working set B; N = {1, . . . , n}\B is fixed. The size of B is usually ≤ 100.
• Sub-problem in each iteration:

    minαB (1/2) [α′B (αkN)′] [QBB QBN; QNB QNN] [αB; αkN] − [e′B e′N] [αB; αkN]
    subject to 0 ≤ αl ≤ C, l ∈ B,  t′B αB = −t′N αkN

  (here [ · ; · ] stacks the B and N blocks)

Avoiding Memory Problems
• The new objective function (up to a constant):

    (1/2) α′B QBB αB + (−eB + QBN αkN)′ αB + constant

• Only the B columns of Q are needed
• Calculated as and when needed
Decomposition Method: the Algorithm
1. Find an initial feasible α1. Set k = 1.
2. If αk satisfies the optimality conditions, stop. Otherwise, find a working set B. Define N ≡ {1, . . . , n}\B.
3. Solve the sub-problem in αB:

    minαB (1/2) α′B QBB αB + (−eB + QBN αkN)′ αB
    subject to 0 ≤ αl ≤ C, l ∈ B,  t′B αB = −t′N αkN

4. Set αk+1N ≡ αkN, k ← k + 1; go to Step 2.
Do these methods Really Work?
• Compared to Newton and Quasi-Newton: slow convergence (a lot more steps to come close to the solution)
• However, there is no need to have a very accurate α:

    decision function = sgn(∑i αi ti k(xi, x) + b)

  Prediction is not affected much after a certain level of optimality is reached.
• In some situations, # support vectors ≪ # training points. With the initial α1 = 0, many elements are never used.
• An example where machine learning knowledge affects optimization.
• An example of solving 50,000 training instances by LIBSVM:

    $ ./svm-train -m 200 -c 16 -g 4 22features
    optimization finished, #iter = 24981
    Total nSV = 3370
    real 3m32.828s

• On a Pentium M 1.7 GHz laptop
• Calculating Q may have taken more than 3 minutes
• A good case where many αi remain at zero all the time
Working Set Selection
• Very important: better selection ⇒ fewer iterations
• But: better selection ⇒ higher cost per iteration
• Two issues:
  1. Size: |B| ↗ ⇒ # iterations ↘; |B| ↘ ⇒ # iterations ↗
  2. Selecting the elements

Size of the Working Set
• Keeping all nonzero αi in the working set:
  If all SVs are included ⇒ optimum
  Few iterations (i.e., few sub-problems)
  Size varies; may still have memory problems
• Existing software: small and fixed size
  Memory problems solved, though sometimes slower
Sequential Minimal Optimization (SMO)
• Consider |B| = 2; |B| ≥ 2 is necessary because of the linear constraint
  The extreme case of decomposition methods
• Sub-problem solved analytically; no need to use optimization software. With B = {i, j}:

    min_{αi,αj} (1/2) [αi αj] [Qii Qij; Qij Qjj] [αi; αj] + (QBN αkN − eB)′ [αi; αj]
    s.t. 0 ≤ αi, αj ≤ C,  ti αi + tj αj = −t′N αkN

• Optimization people may not think this a big advantage; machine learning people do: they like simple code
• A minor advantage in optimization: no need to have inner and outer stopping conditions
• Too slow convergence? With other tricks, |B| = 2 is fine in practice
Selection by KKT Violation

    minα f(α) = (1/2) α′Qα − e′α
    subject to t′α = 0, 0 ≤ αi ≤ C

• KKT optimality condition: α is optimal if and only if there exist b, λ, µ such that

    ∇f(α) + bt = λ − µ,
    λi αi = 0, µi(C − αi) = 0, λi ≥ 0, µi ≥ 0, i = 1, . . . , n

• ∇f(α) ≡ Qα − e
• Rewritten as

    ∇f(α)i + b ti ≥ 0 if αi < C
    ∇f(α)i + b ti ≤ 0 if αi > 0
• Note ti = ±1
• KKT further rewritten as

    ∇f(α)i + b ≥ 0 if αi < C, ti = 1
    ∇f(α)i − b ≥ 0 if αi < C, ti = −1
    ∇f(α)i + b ≤ 0 if αi > 0, ti = 1
    ∇f(α)i − b ≤ 0 if αi > 0, ti = −1

• A condition on the range of b:

    max{−tl∇f(α)l | αl < C, tl = 1 or αl > 0, tl = −1} ≤ b ≤ min{−tl∇f(α)l | αl < C, tl = −1 or αl > 0, tl = 1}

• Define

    Iup(α) ≡ {l | αl < C, tl = 1 or αl > 0, tl = −1}, and
    Ilow(α) ≡ {l | αl < C, tl = −1 or αl > 0, tl = 1}.

• α is optimal if and only if it is feasible and

    max_{i∈Iup(α)} −ti∇f(α)i ≤ min_{i∈Ilow(α)} −ti∇f(α)i.
Violating Pair
• KKT optimality, pictured on the −tl∇f(α)l axis: every value with l ∈ Iup(α) must lie to the left of (or coincide with) every value with l ∈ Ilow(α).
• Violating pair:

    i ∈ Iup(α), j ∈ Ilow(α), and −ti∇f(α)i > −tj∇f(α)j

• f(αk) strictly decreases if and only if B has at least one violating pair. However, simply choosing a violating pair is not enough for convergence.
Maximal Violating Pair
• If |B| = 2, it is natural to choose the indices that most violate the KKT condition:

    i ∈ arg max_{l∈Iup(αk)} −tl∇f(αk)l,
    j ∈ arg min_{l∈Ilow(αk)} −tl∇f(αk)l

• {i, j} is called the "maximal violating pair"; it is obtained in O(n) operations.
Calculating Gradient
• To find violating pairs, the gradient is maintained throughout all iterations
• Memory problems occur since ∇f(α) = Qα − e involves Q
• Solved by using the following tricks:
  1. α1 = 0 implies ∇f(α1) = Q · 0 − e = −e, so the initial gradient is easily obtained
  2. Update ∇f(α) using only the B columns of Q:

      ∇f(αk+1) = ∇f(αk) + Q(αk+1 − αk) = ∇f(αk) + Q:,B (αk+1 − αk)B

• Only |B| columns of Q are needed per iteration
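Putting the pieces together (maximal-violating-pair selection, the analytic two-variable step, and the rank-2 gradient update), a minimal SMO-style loop can be sketched as follows. This is an illustrative sketch with names of our own choosing; the box-clipping logic is simplified relative to production solvers such as LIBSVM, and it assumes both classes are present:

```python
import numpy as np

def select_pair(alpha, G, t, C):
    """Maximal violating pair over I_up and I_low (O(n) scan)."""
    n = len(t)
    up = [l for l in range(n) if (alpha[l] < C and t[l] == 1) or (alpha[l] > 0 and t[l] == -1)]
    low = [l for l in range(n) if (alpha[l] < C and t[l] == -1) or (alpha[l] > 0 and t[l] == 1)]
    i = max(up, key=lambda l: -t[l] * G[l])
    j = min(low, key=lambda l: -t[l] * G[l])
    return i, j

def smo(Q, t, C, eps=1e-3, max_iter=10000):
    n = len(t)
    alpha = np.zeros(n)
    G = -np.ones(n)                       # gradient at alpha = 0 is -e
    for _ in range(max_iter):
        i, j = select_pair(alpha, G, t, C)
        if -t[i] * G[i] <= -t[j] * G[j] + eps:
            break                         # KKT satisfied to tolerance eps
        # unconstrained minimizer of the 2-variable subproblem along the
        # feasible direction (d_i, d_j) = (t_i, -t_j) * delta
        quad = max(Q[i, i] + Q[j, j] - 2.0 * t[i] * t[j] * Q[i, j], 1e-12)
        delta = (-t[i] * G[i] + t[j] * G[j]) / quad
        di, dj = t[i] * delta, -t[j] * delta
        di = np.clip(alpha[i] + di, 0.0, C) - alpha[i]   # clip i to the box
        dj = -t[i] * t[j] * di            # keep t'alpha = 0
        dj = np.clip(alpha[j] + dj, 0.0, C) - alpha[j]   # clip j to the box
        di = -t[i] * t[j] * dj            # re-sync if j's clip was binding
        alpha[i] += di
        alpha[j] += dj
        G += Q[:, i] * di + Q[:, j] * dj  # gradient update: 2 columns of Q only
    return alpha

# Usage on a tiny two-point problem:
t = np.array([1.0, -1.0])
Q = np.outer(t, t) * np.eye(2)            # kernel matrix = identity, for illustration
alpha = smo(Q, t, C=10.0)                 # approximately [1., 1.]
```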
SVMlight
• |B| = q; the feasible direction vector dB is obtained by solving

    mindB ∇f(αk)′B dB
    subject to t′B dB = 0,
    dt ≥ 0 if αkt = 0, t ∈ B,
    dt ≤ 0 if αkt = C, t ∈ B,
    −1 ≤ dt ≤ 1, t ∈ B.

• A combinatorial problem: (l choose q) possibilities
• But the optimum is given by the q/2 most violating pairs
• From Iup(αk): the largest q/2 elements

    −ti1∇f(αk)i1 ≥ −ti2∇f(αk)i2 ≥ · · · ≥ −tiq/2∇f(αk)iq/2

  From Ilow(αk): the smallest q/2 elements

    −tj1∇f(αk)j1 ≤ · · · ≤ −tjq/2∇f(αk)jq/2

• An O(lq) procedure
• Used in popular SVM software: SVMlight, LIBSVM (before version 2.8), and others
Caching and Shrinking
• Speed up decomposition methods
• Caching: store recently used columns of Q in memory
• Example:

    $ time ./libsvm-2.81/svm-train -m 0.01 a4a    (11.463s)
    $ time ./libsvm-2.81/svm-train -m 40 a4a      (7.817s)

• Shrinking: some bounded elements remain at their bounds until the end; the problem is heuristically resized to a smaller one
• After a certain number of iterations, most bounded elements are identified and not changed anymore
Stopping Condition
• In optimization software such conditions are important. However, don't be surprised if you see no stopping conditions in the optimization code of ML software; sometimes time/iteration limits are more suitable.
• From the KKT condition,

    max_{i∈Iup(α)} −ti∇f(α)i ≤ min_{i∈Ilow(α)} −ti∇f(α)i + ε    (1)

  is a natural stopping condition.
Better Stopping Condition
• In LIBSVM, ε = 10⁻³
  Experience: OK, but sometimes too strict; often we get good results even with ε = 10⁻¹
• Large C ⇒ large ∇f(α) components
  Too strict ⇒ many iterations
  A relative condition is needed
• A very important issue not fully addressed yet
Example of Slow Convergence
• Using C = 1:

    $ ./libsvm-2.81/svm-train -c 1 australian_scale
    optimization finished, #iter = 508
    obj = -201.642538, rho = 0.044312

• Using C = 5000:

    $ ./libsvm-2.81/svm-train -c 5000 australian_scale
    optimization finished, #iter = 35241
    obj = -242509.157367, rho = -7.186733

• Optimization researchers may rush to solve the difficult cases
• It turns out that large C is used less often than small C

Finite Termination
• Given ε, finite termination can be shown for both SMO and SVMlight
Effect of hyperparameters
• If we use C = 20, γ = 400:

    $ ./svm-train -c 20 -g 400 train.1.scale
    $ ./svm-predict train.1.scale train.1.scale.model o
    Accuracy = 100% (3089/3089) (classification)

• 100% training accuracy, but:

    $ ./svm-predict test.1.scale train.1.scale.model o
    Accuracy = 82.7% (3308/4000) (classification)

• Very bad test accuracy
• Overfitting happens
Overfitting and Underfitting
• When training a model and making predictions, we should
  - Avoid underfitting: large training error
  - Avoid overfitting: too small training error
• In theory you can easily achieve 100% training accuracy
• But this is useless
Parameter Selection
• Sometimes you can get away with default choices
• Usually a good idea to tune them correctly
• The parameters here are C and the kernel parameters
• Examples:

    γ of exp(−γ‖xi − xj‖²)
    a, b, d of (x′i xj / a + b)^d

• How do we select them?
Performance Evaluation
• Training errors are not important; only test errors count
• l training data, xi ∈ Rn, yi ∈ {+1,−1}, i = 1, . . . , l, and a learning machine:

    x → f(x, α),  f(x, α) = 1 or −1.

  Different α: different machines
• The expected test error (generalization error):

    R(α) = ∫ (1/2)|y − f(x, α)| dP(x, y)

  y: the class of x (i.e., 1 or −1)
• P(x, y) is unknown; the empirical risk (training error) is:

    Remp(α) = (1/(2l)) ∑_{i=1}^{l} |yi − f(xi, α)|

• (1/2)|yi − f(xi, α)| is the loss. Choose 0 ≤ η ≤ 1; with probability at least 1 − η:

    R(α) ≤ Remp(α) + another term

  - A good pattern recognition method minimizes the combined effect of both terms
  - Driving Remp(α) → 0 can make the other term large
• In practice, the available data are split into training, validation, and (testing) sets
• Train + validation ⇒ model
• k-fold cross validation:
  - The data are randomly separated into k groups
  - Each time, k − 1 groups are used for training and one for validation
  - Select the parameters with the highest CV accuracy
  - Another optimization problem
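The k-fold procedure above can be sketched in a few lines. This is an illustrative sketch (function names are ours; `train_and_predict` stands for any training routine, e.g. an SVM solver with fixed C and γ):

```python
import numpy as np

def kfold_cv_accuracy(X, y, train_and_predict, k=5, seed=0):
    """Estimate accuracy by k-fold cross validation.
    train_and_predict(X_tr, y_tr, X_va) -> predicted labels for X_va."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))            # random separation into k groups
    folds = np.array_split(idx, k)
    accs = []
    for f in range(k):
        va = folds[f]                        # one group for validation
        tr = np.concatenate([folds[g] for g in range(k) if g != f])
        pred = train_and_predict(X[tr], y[tr], X[va])
        accs.append(np.mean(pred == y[va]))
    return float(np.mean(accs))

# Hyperparameter selection: pick the (C, gamma) pair with the highest CV accuracy,
# e.g. over a grid such as [(C, g) for C in (0.1, 1, 10) for g in (0.01, 0.1, 1)].
```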
Trying the RBF Kernel First
• Linear kernel: a special case of RBF
  The leave-one-out cross-validation accuracy of the linear kernel is the same as that of RBF under certain parameters
  Related to optimization as well
• Polynomial kernel: numerical difficulties

    (< 1)^d → 0,  (> 1)^d → ∞

  More parameters than RBF
Contour of Parameter Selection
Part IV
Quadratic Programming Methods
References:
1. L. Kaufman, Solving the quadratic programming problem arising in support vector machine classification, In B. Scholkopf, C. J. C. Burges and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pp. 147-168, MIT Press, 1999.
2. S. V. N. Vishwanathan, A. Smola and M. N. Murty, SimpleSVM, ICML 2003. http://users.rsise.anu.edu.au/~vishy/papers/VisSmoMur03.pdf
Active Set Method

Partition the αi into three groups:
• O: αi = 0
• C: αi = C
• A: only the αi in this (active) set are allowed to change

    α = (αA, αC, αO),  αC = C eC,  αO = 0

• The optimization problem over only the αA variables is:

    minαA (1/2) α′A QAA αA + (−eA + C QAC eC)′ αA
    subject to 0 ≤ αl ≤ C, l ∈ A,  t′A αA = −C t′C eC

• The problem is as messy as the original, except that the working set is smaller. So what do we do to simplify?
• Typically, 0 < αi < C for i ∈ A.
• Act as if these constraints will be satisfied and solve the following equality constrained quadratic problem:

    minαA (1/2) α′A QAA αA + (−eA + C QAC eC)′ αA
    subject to t′A αA = −C t′C eC

• The solution of this problem can be obtained by solving a linear system Hγ = g, where γ includes αA together with b, the Lagrange multiplier corresponding to the equality constraint. A factorization of H is maintained (and incrementally updated when H undergoes changes in the overall active set algorithm).
The basic iteration of the active set method consists of the following steps:
• Solve the above-mentioned equality constrained problem.
• If the solution αA violates a constraint, move the first violated index i of A into C or O.
• If the solution αA satisfies the constraints, then check if any i in C or O violates the optimality (KKT) conditions; if so, bring it into A.

The algorithm can be initialized by choosing A to be a small set, say two points.

With appropriate conditions on the incoming new variable, the algorithm can be shown to have finite convergence. (Recall that decomposition methods such as SMO have only asymptotic convergence.)

The method of bringing in a new variable from C or O into A has a big impact on the overall efficiency of the algorithm. The original active set algorithm chooses the point of C/O which maximally violates the optimality (KKT) condition. This usually ends up being expensive, especially in large scale problems where large kernel arrays cannot be stored. In the SimpleSVM algorithm of Vishwanathan, Smola and Murty, the indices are randomly traversed and the first violating point is included.
Comparison with a decomposition method
Some Overall Comments

Pros: Finite convergence, and so independence from stopping tolerances; speed usually unaffected by C; very good when the size of the final A is not more than a few thousand. Very well suited when gradient based methods (Chapelle et al.) are used for model selection.

Cons: Storage and factorization can be expensive/impossible when the size of A goes beyond a few thousand. Seeding is not as clean as in decomposition methods, since the factorization needs to be entirely computed again.

Simpler SVM: Vishwanathan et al. have modified their SimpleSVM method in 2 ways:
• Replace factorization techniques by CG methods for solving the linear systems that arise.
• Instead of choosing the incoming points randomly, use a heuristically defined priority queue on the points, so that those points which are more likely to violate the optimality conditions come first.
Part V
Path Tracking Methods
References:
1. S. Rosset and J. Zhu, Piecewise linear regularized solution paths, 2004. http://www-stat.stanford.edu/%7Esaharon/papers/piecewise-revised.pdf
2. S. S. Keerthi, Generalized LARS as an effective feature selection tool for text classification with SVMs, ICML 2005. http://www.machinelearning.org/proceedings/icml2005/papers/053_GeneralizedLARS_Keerthi.pdf
A general problem formulation
• Consider the optimization problem

    minβ f(β) = λJ(β) + L(β)

  where J(β) = ∑j |βj| and L is a differentiable piecewise convex quadratic function. (Piecewise: this means that the β-space is divided into a finite number of zones, in each of which L is a convex quadratic function and, at the boundary of the zones, the pieces of L merge properly to maintain differentiability.)
• Our aim is to track the minimizer of f as a function of λ.
• Let g = ∇L.
• At any one λ, let β(λ) be the minimizer, A = {j : βj(λ) ≠ 0}, and Ac the complement set.
• Optimality conditions:

    gj + λ sgn(βj) = 0 ∀ j ∈ A    (1)
    |gj| ≤ λ ∀ j ∈ Ac    (2)

• Within one 'quadratic zone', (1) defines a set of linear equations in βj, j ∈ A. Let γ denote the direction in β space thus defined.
• At large λ (specifically, λ > maxj |gj(0)|), β = 0 is the solution.
Rosset-Zhu path tracking algorithm
• Initialize: β = 0, A = arg maxj |gj|, get γ
• While (max |gj| > 0):
  - d1 = arg min_{d≥0} min_{j∈Ac} |gj(β + dγ)| = λ + d (an inactive variable attains equality in (2))
  - d2 = arg min_{d≥0} min_{j∈A} (β + dγ)j = 0 (an active component hits 0)
  - Find d3, the first d value at which a piecewise quadratic zone boundary is crossed
  - Set d = min(d1, d2, d3)
  - If d = d1, then add the variable attaining equality at d to A
  - If d = d2, then remove the variable attaining 0 at d from A
  - β ← β + dγ
  - Update info and compute the new direction vector γ

Implementation: The second order matrix and its factorization needed to obtain γ can be maintained and updated efficiently.
Feature selection in Linear classifiers

    L(β) = (1/2)‖β‖² + ∑i l(y(xi), ti)

where y(x) = β′x and l is a differentiable, piecewise quadratic loss function. Examples: the L2-SVM loss and the Modified Huber loss. Even the logistic loss can be approximated quite well by piecewise quadratic functions.

When the minimum of f = λJ + L is tracked with respect to λ, we get β = 0 at large λ values and we retrieve the minimizer of L as λ → 0. Intermediate λ values give approximations where feature selection is achieved.
Selecting features in Text classification

The following figure shows the application of path tracking to a dataset from the Reuters corpus. The plots show F-measure (larger is better) as a function of the number of features chosen (which grows as λ → 0). The black curve corresponds to keeping ‖β‖²/2, the blue curve corresponds to leaving out ‖β‖²/2, and the red curve corresponds to feature selection using the information gain measure. SVM corresponds to the L2-SVM loss, while RLS corresponds to regularized least squares.
Forming sparse nonlinear kernel classifiers
• Consider the nonlinear kernel primal problem

    min (1/2)‖w‖² + C ∑i l(y(xi), ti)

  where l is a differentiable, piecewise quadratic loss function. As before, l can be one of the L2-SVM loss, the Modified Huber loss, or a piecewise quadratic approximation of the logistic loss.
• Use the primal substitution w = ∑i βi ti φ(xi) to get

    min L(β) = (1/2) β′Qβ + C ∑i l(y(xi), ti)

  where y(x) = ∑i βi ti k(x, xi)
• When the minimum of f = λJ + L is tracked with respect to λ, we get β = 0 at large λ values and we retrieve the minimizer of L as λ → 0. Intermediate λ values give approximations where sparsity is achieved.
Performance on Two Datasets
Part VI
Finite Newton Method (FNM)
References:
1. O. L. Mangasarian, A finite Newton method for classification, Optimization Methods and Software, Vol. 17, pp. 913-929, 2002. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/01-11.pdf
2. S. S. Keerthi and D. W. DeCoste, A modified finite Newton method for fast solution of large scale linear SVMs, Journal of Machine Learning Research, Vol. 6, pp. 341-361, 2005. http://jmlr.csail.mit.edu/papers/volume6/keerthi05a/keerthi05a.pdf
3. J. Zhu and T. Hastie, Kernel logistic regression and the import vector machine, Journal of Computational and Graphical Statistics, Vol. 14, pp. 185-205, 2005. http://www.stat.lsa.umich.edu/~jizhu/research/05klr.pdf
4. S. S. Keerthi, O. Chapelle and D. W. DeCoste, Building support vector machines with reduced classifier complexity, submitted to JMLR. http://www.kyb.tuebingen.mpg.de/bs/people/chapelle/primal/
Introduction

FNM is much more efficient and better suited than dual decomposition methods in certain situations.

Different dimensions:
• Linear, nonlinear kernel machines
• Classification, ordinal regression, regression
• Differentiable loss functions:
  - Least squares
  - L2-SVM
  - Modified Huber
  - KLR
FNM: A General Format

    minβ f(β) = (1/2) β′Rβ + ∑i li(β)

R is positive definite. li is the loss for the i-th example.

Assume: li is convex, differentiable and piecewise quadratic. (Piecewise: same as mentioned in path tracking.)

FNM Iteration

βk = starting vector at the k-th iteration
qi = local quadratic approximation of li at βk

    β̄ = arg minβ (1/2) β′Rβ + ∑i qi(β)

Note that β̄ can be obtained by solving a linear system.

Define the direction dk = β̄ − βk.

New point by exact line search:

    βk+1 = βk + δk dk,  δk = arg minδ f(βk + δdk)
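The iteration can be sketched for the linear L2-SVM, where R = I and li(β) = (C/2) max(0, 1 − ti β′xi)². The local quadratic is then a regularized least squares problem on the currently active examples. This is only an illustrative sketch with names of our own choosing; in particular, the exact line search is done here by ternary search on the convex one-dimensional function, rather than the closed-form piecewise search of the actual method:

```python
import numpy as np

def l2svm_obj(beta, X, t, C):
    """f(beta) = (1/2)||beta||^2 + (C/2) sum max(0, 1 - t_i beta'x_i)^2."""
    xi = np.maximum(0.0, 1.0 - t * (X @ beta))
    return 0.5 * beta @ beta + 0.5 * C * np.sum(xi ** 2)

def fnm_l2svm(X, t, C, iters=50, tol=1e-8):
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(iters):
        active = 1.0 - t * (X @ beta) > 0           # examples in the quadratic piece
        Xa, ta = X[active], t[active]
        # local quadratic: min (1/2)||b||^2 + (C/2)||Xa b - ta||^2 -> linear system
        A = np.eye(d) + C * Xa.T @ Xa
        b = np.linalg.solve(A, C * Xa.T @ ta)
        dk = b - beta
        if np.linalg.norm(dk) < tol:
            break
        # exact line search on the convex 1-D function via ternary search
        lo, hi = 0.0, 2.0
        for _ in range(80):
            m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
            if l2svm_obj(beta + m1 * dk, X, t, C) < l2svm_obj(beta + m2 * dk, X, t, C):
                hi = m2
            else:
                lo = m1
        beta = beta + 0.5 * (lo + hi) * dk
    return beta
```

On well-conditioned problems the outer loop terminates after a handful of iterations, consistent with the finite convergence result below.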
Finite convergence of FNM

First, the global convergence theorem (discussed in Part II) ensures that βk → β⋆, the minimizer of f.

Since ∇f is continuous, β⋆ is also the minimizer of the local quadratic approximation of f formed at β⋆.

Thus, there is an open neighborhood around β⋆ such that, from any point in it, FNM will reach β⋆ in exactly one iteration.

Convergence of βk → β⋆ ensures that the above mentioned neighborhood will be reached in a finite number of steps. Thus FNM has finite convergence.
Comments
# iterations usually very small (5-50)
The linear system in each iteration is of the form:
Ak β = bk, where Ak = R + ∑i γi si si′
Factorization of Ak can be done incrementally and is very efficient.
In many cases, the linear system is of the RLS (Regularized Least Squares) form, so special RLS solvers can be used.
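To illustrate the incremental factorization: when an example enters the active set, Ak changes by a rank-one term γ si si′, so its Cholesky factor can be updated in O(d²) instead of refactorizing in O(d³). A minimal sketch of the classical LINPACK-style rank-one update (shown for a unit-weight term x x′; names are illustrative):

```python
import numpy as np

def chol_rank1_update(L, x):
    """Given lower-triangular L with A = L L', return the Cholesky
    factor of A + x x' in O(d^2) (classical rank-one update)."""
    L, x = L.copy(), x.copy()
    d = len(x)
    for k in range(d):
        r = np.hypot(L[k, k], x[k])          # new diagonal entry
        c, s = r / L[k, k], x[k] / L[k, k]   # Givens-style rotation
        L[k, k] = r
        if k + 1 < d:
            L[k + 1:, k] = (L[k + 1:, k] + s * x[k + 1:]) / c
            x[k + 1:] = c * x[k + 1:] - s * L[k + 1:, k]
    return L
```

A weighted term γ s s′ with γ > 0 can be absorbed by updating with x = √γ · s.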
51
Linear Kernel Machines: Small input dimension
R = I, li(β) = C · h(tiβ′xi)
d = dimension of β is small
Factorization of Ak is very efficient
O(nd²) complexity
52
Linear Kernel Machines: Large input dimension
Text classification: n ≈ 200, 000; d ≈ 250, 000
The data matrix X (whose rows are the xi's) is sparse: about 0.1% non-zeros
Linear system: use (linear) conjugate gradient (CG) methods
Theoretically, CG needs at most d iterations for exact convergence. But an exact solution is completely unnecessary; with inexact termination, CG requires a very small number of iterations.
Example: # CG iterations in various FNM iterations...
522→ 314→ 117→ 60→ 35→ 19→ 9→ 5→ 2→ 1
Each CG iteration requires a couple of calls of the form Xy or X′z; over the whole run there are about 1000 such calls.
Compare: SMO does calculations equivalent to one Xβ calculation in each of its iterations, each of which updates a pair of alphas (αi, αj). SMO needs tens of thousands of such iterations!
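A minimal matrix-free sketch of such a CG solve, assuming the L2-SVM system (I + 2C X′X)β = 2C X′t (with X restricted to the active examples); the matrix is never formed, and each CG matrix-vector product costs exactly the Xy and X′z calls mentioned above. Names and the toy dimensions are illustrative.

```python
import numpy as np
from scipy.sparse import random as sprandom
from scipy.sparse.linalg import LinearOperator, cg

def solve_fnm_system(X, t, C):
    """Solve (I + 2C X'X) beta = 2C X't by CG without forming the matrix.

    X is a (sparse) n x d data matrix; each CG iteration performs one
    product of the form X y and one of the form X' z.
    """
    n, d = X.shape
    def matvec(v):
        return v + 2.0 * C * (X.T @ (X @ v))   # X y then X' z
    A = LinearOperator((d, d), matvec=matvec, dtype=float)
    b = 2.0 * C * (X.T @ t)
    beta, info = cg(A, b, maxiter=10 * d)
    return beta
```

CG's default relative tolerance already gives the "inexact termination" behavior: it stops far earlier than the worst-case d iterations.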
Unlike SMO, where the number of iterations is very sensitive to C,the number of FNM iterations is not at all sensitive to C.
53
Heuristics
H1: Terminate the first FNM iteration after 10 CG iterations. Most non-support vectors usually get well identified in this phase, and will not get involved in the CG iterations of later FNM iterations. This heuristic is particularly powerful when the number of support vectors is a small fraction of n.
H2: First run FNM with a high tolerance; then do another run with a low tolerance.
Example: % SV = 6.8
• No H: 456.85 secs
• H1 only: 70.86 secs
• H2 only: 202.62 secs
• H1 & H2: 62.18 secs
β-seeding can be used when going from one C value to another nearby C value. β-seeding is more effective than α-seeding in SMO.
54
Comparison of FNM (L2-SVM), SVMlight and BSVM
(Figure: computing time versus training set size for the Adult and Web datasets)
55
Feature selection in Linear Kernel Classifiers
These ideas form a very good alternative to the L1 regularization path tracking ideas discussed earlier.
Start with no features; add features greedily.
Let β = optimized vector over the current set of features
βj = a feature not yet chosen
Evaluation criterion:
fj = minβj f(β, βj), with β held fixed,
where f is the training cost function.
Choose the j with smallest fj.
After choosing the best j, solve min f(β, βj) using (β, 0) as the seed.
Factorizations needed for the linear system solution can be updated very efficiently.
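A minimal sketch of this greedy scheme, assuming a regularized least-squares cost f(β) = ½‖β‖² + C‖Xβ − t‖² so that the 1-D score fj has a closed form; function names are illustrative, and for simplicity the chosen coefficients are re-solved directly rather than via factorization updates.

```python
import numpy as np

def greedy_feature_selection(X, t, C=1.0, k=3):
    """Greedy forward selection for f(beta) = 0.5*||beta||^2 + C*||X beta - t||^2.

    Each unselected feature j is scored by the exact 1-D minimum f_j of f
    over beta_j with the current coefficients held fixed; the feature with
    smallest f_j is added, then all chosen coefficients are re-optimized.
    """
    n, d = X.shape
    chosen, beta = [], np.zeros(0)
    for _ in range(k):
        r = X[:, chosen] @ beta - t              # current residual
        base = 0.5 * beta @ beta + C * r @ r     # current objective value
        best_j, best_f = None, np.inf
        for j in range(d):
            if j in chosen:
                continue
            xj = X[:, j]
            # closed-form 1-D minimizer and the resulting objective value
            bj = -2.0 * C * (xj @ r) / (1.0 + 2.0 * C * (xj @ xj))
            fj = base + 0.5 * bj**2 + C * (2.0 * bj * (xj @ r) + bj**2 * (xj @ xj))
            if fj < best_f:
                best_f, best_j = fj, j
        chosen.append(best_j)
        # re-optimize over all chosen features
        Xs = X[:, chosen]
        A = np.eye(len(chosen)) + 2.0 * C * Xs.T @ Xs
        beta = np.linalg.solve(A, 2.0 * C * Xs.T @ t)
    return chosen, beta
```

With a general differentiable loss the 1-D score is no longer closed form, but a few Newton steps on the scalar problem serve the same purpose.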
56
Sparse Nonlinear Kernel Machines
The ideas are parallel to what we discussed earlier on the same topic.
Use the primal substitution w = ∑i βi ti φ(xi) to get
min L(β) = (1/2) β′Qβ + ∑i l(y(xi), ti)
where y(x) = ∑i βi ti k(x, xi)
Note that, except for the regularizer being a more general quadratic function (β′Qβ/2), this problem is essentially in the linear classifier form.
New non-zero βj variables can be selected greedily as in the feature selection process of the previous slide.
At each step of the greedy process it is usually sufficient to restrict the evaluation to a small, randomly chosen subset (say, 50) of the βj's. No similar shortcut exists for the L1 regularization method.
The result is an effective algorithm with a clearly defined, small complexity: O(nd²) for selecting d kernel basis functions.
On many datasets, a small d gives nearly the same accuracy as thefull kernel classifier using all basis functions.
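A minimal sketch of the greedy basis selection just described, assuming an RBF kernel, least-squares loss, and a plain ‖β‖²/2 regularizer in place of β′Qβ/2; the candidate-subset size (10 here, versus the 50 suggested above) and all names are illustrative.

```python
import numpy as np

def greedy_kernel_basis(X, t, C=1.0, d_basis=5, cand=10, sigma=1.0, seed=0):
    """Greedily pick d_basis kernel basis functions, scoring only a small
    random subset of the unused candidates at each step."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # full RBF kernel matrix (fine for a small example)
    K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / (2 * sigma**2))
    chosen, beta = [], np.zeros(0)
    for _ in range(d_basis):
        r = K[:, chosen] @ beta - t
        pool = [j for j in range(n) if j not in chosen]
        subset = rng.choice(pool, size=min(cand, len(pool)), replace=False)
        # score each candidate by the 1-D decrease in the objective
        best_j, best_drop = None, 0.0
        for j in subset:
            kj = K[:, j]
            bj = -2.0 * C * (kj @ r) / (1.0 + 2.0 * C * (kj @ kj))
            drop = -(0.5 * bj**2 + C * (2 * bj * (kj @ r) + bj**2 * (kj @ kj)))
            if drop > best_drop:
                best_drop, best_j = drop, j
        if best_j is None:
            break
        chosen.append(best_j)
        # re-optimize the coefficients of all chosen basis functions
        Ks = K[:, chosen]
        A = np.eye(len(chosen)) + 2.0 * C * Ks.T @ Ks
        beta = np.linalg.solve(A, 2.0 * C * Ks.T @ t)
    return chosen, beta, K
```

The classifier is then sign(∑j∈chosen βj k(x, xj)), using only the d selected basis functions.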
57
Performance on some UCI Datasets
             SpSVM                    SVM
Dataset      TestErate   #Basis      TestErate   #SV
Banana       10.87       17.3        10.54       221.7
Breast       29.22       12.1        28.18       185.8
Diabetis     23.47       13.8        23.73       426.3
Flare        33.90        8.4        33.98       629.4
German       24.90       14.0        24.47       630.4
Heart        15.50        4.3        15.80       166.6
Ringnorm      1.97       12.9         1.68       334.9
Thyroid       5.47       10.6         4.93        57.80
Titanic      22.68        3.3        22.35       150.0
Twonorm       2.96        8.7         2.42       330.30
Waveform     10.66       14.4        10.04       246.9
58
An Example
59