Page 1

Feature Selection in Nonlinear Kernel Classification

Olvi Mangasarian & Edward Wild

University of Wisconsin-Madison

Workshop on Optimization-Based Data Mining Techniques with Applications

IEEE International Conference on Data Mining

Omaha, Nebraska, October 28, 2007

Page 2

Example

[Figure: + and − points in the (x1, x2) plane; the two classes are not linearly separable]

Data is nonlinearly separable: in general, nonlinear kernels use both x1 and x2

However, the data is nonlinearly separable using only the feature x2

The best linear classifier that uses only 1 feature selects the feature x1

Feature selection in nonlinear classification is important

Page 3

Outline

Minimize the number of input space features selected by a nonlinear kernel classifier

Start with a standard 1-norm nonlinear support vector machine (SVM)

Add a 0-1 diagonal matrix to suppress or keep features; this leads to a nonlinear mixed-integer program

Introduce algorithm to obtain a good local solution to the resulting mixed-integer program

Evaluate algorithm on two public datasets from the UCI repository and synthetic NDCC data

Page 4

Support Vector Machines

K(A+, A′)u ≥ eγ + e

K(A−, A′)u ≤ eγ − e

Nonlinear separating surface: K(x′, A′)u = γ; bounding surfaces: K(x′, A′)u = γ + 1 and K(x′, A′)u = γ − 1

[Figure: + and − points separated by the nonlinear surface K(x′, A′)u = γ, with the two bounding surfaces on either side]

Slack variable y ≥ 0 allows points to be on the wrong side of the bounding surface

x ∈ Rⁿ

SVM defined by parameters u and threshold γ of the nonlinear surface

A contains all data points: {+…+} ⊂ A+

{−…−} ⊂ A−

e is a vector of ones


Minimize e′s (= ‖u‖₁ at solution) to reduce overfitting

Minimize e′y (hinge loss, i.e. the plus function max{·, 0}) to fit the data

Linear kernel: (K(A, B))_ij = (AB)_ij = A_i B_·j = K(A_i, B_·j)

Gaussian kernel with parameter μ: (K(A, B))_ij = exp(−μ‖A_i′ − B_·j‖²)
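As a concrete companion to the two kernel formulas above, here is a minimal numpy sketch; the function names and the explicit Gaussian parameter mu are illustrative choices, not something given on the slides.

```python
import numpy as np

def linear_kernel(A, B):
    # Linear kernel: (K(A, B))_ij = (AB)_ij, a plain matrix product.
    return A @ B

def gaussian_kernel(A, B, mu):
    # Gaussian kernel: (K(A, B))_ij = exp(-mu * ||A_i' - B_.j||^2),
    # where A is m x n and B is n x k, so K(A, B) is m x k.
    rows = A                      # rows A_i of A, shape (m, n)
    cols = B.T                    # columns B_.j of B, viewed as rows, shape (k, n)
    sq_dists = ((rows[:, None, :] - cols[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-mu * sq_dists)

# Tiny usage example: the m x m kernel matrix K(A, A') used in the SVM above.
A = np.array([[0., 1.], [1., 0.], [0., 0.], [1., 1.]])
K = gaussian_kernel(A, A.T, mu=1.0)
```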

Page 5

Reduced Feature SVM

Start with the full SVM

All features are present in the kernel matrix K(A, A′)

Replace A with AE, where E is a diagonal n × n matrix with Eii ∈ {1, 0}, i = 1, …, n (sketched below)

If Eii is 0, the ith feature is removed

To suppress features, add the number of features present (e′Ee) to the objective with weight σ ≥ 0

As σ is increased, more features are removed from the classifier
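A minimal sketch of the masking idea on this slide, assuming a Gaussian kernel; the names reduced_kernel, E_diag, mu, and sigma are illustrative, not taken from the slides.

```python
import numpy as np

def reduced_kernel(A, B_rows, E_diag, mu):
    # Gaussian kernel K(AE, (BE)') where E = diag(E_diag) has 0-1 entries:
    # multiplying by E zeroes out the suppressed feature columns.
    AE = A * E_diag
    BE = B_rows * E_diag
    sq_dists = ((AE[:, None, :] - BE[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-mu * sq_dists)

E_diag = np.array([1., 0., 1.])              # keep features 1 and 3, remove feature 2
A = np.random.randn(5, 3)
K = reduced_kernel(A, A, E_diag, mu=0.5)     # kernel sees only the selected features
penalty = E_diag.sum()                       # e'Ee = number of features present,
                                             # added to the objective with weight sigma >= 0
```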

Page 6

Reduced Feature SVM (RFSVM)

1) Initialize the diagonal matrix E randomly

2) For fixed 0-1 values E, solve the SVM linear program to obtain (u, γ, y, s)

3) Fix (u, γ, s) and sweep through E repeatedly as follows: for each component of E, replace 1 by 0 (and conversely), provided the change decreases the overall objective function by more than tol (sketched below)

4) Go to (3) if a change was made in the last sweep, otherwise continue to (5)

5) Solve the SVM linear program with the new matrix E. If the objective decrease is less than tol, stop; otherwise go to (3)
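Below is a minimal sketch of the sweep in step (3), for fixed continuous variables (u, γ, s) and a Gaussian kernel. The helper names (rfsvm_objective, sweep_E) and the parameter names nu, sigma, mu are illustrative; the linear-programming step for the continuous variables is sketched separately under the Algorithm slide (Page 19).

```python
import numpy as np

def gaussian_kernel_rows(A, B_rows, mu):
    # (i, j) entry is exp(-mu * ||A_i - B_rows_j||^2).
    sq = ((A[:, None, :] - B_rows[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-mu * sq)

def rfsvm_objective(E_diag, u, gamma, s, A, d, nu, sigma, mu):
    # nu*e'y + e's + sigma*e'Ee, with y the slack implied by the fixed (u, gamma, s).
    AE = A * E_diag
    K = gaussian_kernel_rows(AE, AE, mu)
    margins = d * (K @ u - gamma)            # D(K(AE, (AE)')u - e*gamma)
    y = np.maximum(0.0, 1.0 - margins)       # plus function max{., 0}
    return nu * y.sum() + s.sum() + sigma * E_diag.sum()

def sweep_E(E_diag, u, gamma, s, A, d, nu, sigma, mu, tol=1e-6):
    # Flip each entry of E if doing so lowers the objective by more than tol;
    # repeat sweeps until a full pass makes no change.
    E = E_diag.copy()
    best = rfsvm_objective(E, u, gamma, s, A, d, nu, sigma, mu)
    changed = True
    while changed:
        changed = False
        for i in range(len(E)):
            E[i] = 1.0 - E[i]                # tentatively flip feature i
            trial = rfsvm_objective(E, u, gamma, s, A, d, nu, sigma, mu)
            if trial < best - tol:
                best = trial                 # keep the flip
                changed = True
            else:
                E[i] = 1.0 - E[i]            # undo the flip
    return E, best
```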

Page 7

RFSVM Convergence (for tol = 0)

Objective function value converges: each step decreases the objective, and the objective is bounded below by 0

The limit of the objective function value is attained at any accumulation point of the sequence of iterates

An accumulation point is a "local minimum solution": the continuous variables are optimal for the fixed integer variables, and changing any single integer variable will not decrease the objective

Page 8

Experimental Results

Classification accuracy versus number of features used

Compare our RFSVM to Relief and RFE (Recursive Feature Elimination)

Results given on two public datasets from the UCI repository

Ability of RFSVM to handle problems with up to 1000 features tested on synthetic NDCC datasets

Set feature selection parameter σ = 1

Page 9

Relief and RFE

Relief (Kira and Rendell, 1992)

Filter method: feature selection is a preprocessing procedure

Features are selected as relevant if they tend to have different feature values for points in different classes

RFE (Recursive Feature Elimination) (Guyon, Weston, Barnhill, and Vapnik, 2002)

Wrapper method: feature selection is based on classification

Features are selected as relevant if removing them causes a large change in the margin of an SVM

Page 10

Ionosphere Dataset: 351 points in R³⁴

[Figure: cross-validation accuracy versus number of features used, with reference lines for the nonlinear SVM with no feature selection and the linear 1-norm SVM]

Even for feature selection parameter σ = 0, some features may be removed when removing them decreases the hinge loss

Note that accuracy decreases slightly until about 10 features remain, and then decreases more sharply as they are removed

If an appropriate value of σ is selected, RFSVM can obtain higher accuracy using fewer features than SVM1

Page 11

Normally Distributed Clusters on Cubes Dataset (Thompson, 2006)

Points are generated from normal distributions centered at vertices of 1-norm cubes

Dataset is not linearly separable

Page 12

RFSVM vs. SVM without Feature Selection (NKSVM1) on NDCC Data with 100 True Features and 1000 Irrelevant Features

Average accuracy on 1000 test points: RFSVM 0.70, NKSVM1 0.53

RFSVM vs. SVM without Feature Selection (NKSVM1) on NDCC Data with 20 True Features and Varying Numbers of Irrelevant Features

Each point is the average test set correctness over 10 datasets with 200 training, 200 tuning, and 1000 testing points

When 480 irrelevant features are added, the accuracy of RFSVM is 45% higher than that of NKSVM1

Page 13

Conclusion

New rigorous formulation with a precise objective for feature selection in nonlinear SVM classifiers

Obtain a local solution to the resulting mixed-integer program

Alternate between a linear program to compute the continuous variables and successive sweeps to update the integer variables

Efficiently learns accurate nonlinear classifiers with reduced numbers of features

Handles problems with 1000 features, 900 of which are irrelevant

Page 14

Questions?

Websites with links to papers and talks:
http://www.cs.wisc.edu/~olvi
http://www.cs.wisc.edu/~wildt

NDCC generator: http://www.cs.wisc.edu/dmi/svm/ndcc/

Page 15

Running Time on the Ionosphere Dataset

Averages 5.7 sweeps through the integer variables

Averages 3.4 linear programs

75% of the time consumed in objective function evaluations

15% of the time consumed in solving linear programs

Complete experiment (1960 runs) took 1 hour

3 GHz Pentium 4; written in MATLAB; CPLEX 9.0 used to solve the linear programs; Gaussian kernel written in C

Page 16

Sonar Dataset: 208 points in R⁶⁰

[Figure: cross-validation accuracy versus number of features used]

Page 17

Related Work

Approaches that use specialized kernels

Weston, Mukherjee, Chapelle, Pontil, Poggio, and Vapnik, 2000: structural risk minimization

Gold, Holub, and Sollich, 2005: Bayesian interpretation

Zhang, 2006: smoothing spline ANOVA kernels

Margin-based approach

Frölich and Zell, 2004: remove features if there is little change to the margin when they are removed

Other approaches which combine feature selection with basis reduction

Bi, Bennett, Embrechts, Breneman, and Song, 2003

Avidan, 2004

Page 18

Future Work

Datasets with more features

Reduce the number of objective function evaluations

Limit the number of integer cycles

Other ways to update the integer variables

Application to regression problems

Automatic choice of the feature selection parameter σ

Page 19

Algorithm

A global solution to the nonlinear mixed-integer program cannot be found efficiently: it would require solving 2ⁿ linear programs

For fixed values of the integer diagonal matrix, the problem is reduced to an ordinary SVM linear program (sketched below)

Solution strategy: alternate optimization of continuous and integer variables

For fixed values of E, solve a linear program for (u, γ, y, s)

For fixed values of (u, γ, s), sweep through the components of E and make updates which decrease the objective function
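To make the reduction to a linear program concrete, here is a minimal sketch of the 1-norm SVM LP for a fixed kernel matrix, written with scipy.optimize.linprog. The variable stacking, the helper name solve_svm_lp, and the explicit weight nu on the error term e'y are my own choices rather than anything stated on the slides.

```python
import numpy as np
from scipy.optimize import linprog

def solve_svm_lp(K, d, nu):
    # 1-norm SVM LP for a fixed kernel matrix K = K(AE, (AE)') and label vector d:
    #   min  nu*e'y + e's
    #   s.t. D(K u - e*gamma) + y >= e,  -s <= u <= s,  y >= 0
    # Variables are stacked as x = [u (m), gamma (1), y (m), s (m)].
    m = K.shape[0]
    I = np.eye(m)
    Z = np.zeros((m, m))
    zcol = np.zeros((m, 1))

    c = np.concatenate([np.zeros(m), [0.0], nu * np.ones(m), np.ones(m)])

    # -D K u + d*gamma - y <= -e   (classification constraints)
    A1 = np.hstack([-(d[:, None] * K), d[:, None], -I, Z])
    #  u - s <= 0  and  -u - s <= 0   (so that e's = ||u||_1 at the solution)
    A2 = np.hstack([I, zcol, Z, -I])
    A3 = np.hstack([-I, zcol, Z, -I])

    A_ub = np.vstack([A1, A2, A3])
    b_ub = np.concatenate([-np.ones(m), np.zeros(2 * m)])

    bounds = ([(None, None)] * m + [(None, None)]      # u and gamma are free
              + [(0, None)] * m + [(0, None)] * m)     # y >= 0, s >= 0

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    u = res.x[:m]
    gamma = res.x[m]
    y = res.x[m + 1:2 * m + 1]
    s = res.x[2 * m + 1:]
    return u, gamma, y, s, res.fun
```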

Page 20

Notation

Data points are represented as rows of an m × n matrix A

Data labels of +1 or −1 are given as elements of an m × m diagonal matrix D

Example (XOR): 4 points in R². Points (0, 1), (1, 0) have label +1; points (0, 0), (1, 1) have label −1 (set up in the sketch below)

Kernel K(A, B): R^(m×n) × R^(n×k) → R^(m×k)

Linear kernel: (K(A, B))_ij = (AB)_ij = A_i B_·j = K(A_i, B_·j)

Gaussian kernel with parameter μ: (K(A, B))_ij = exp(−μ‖A_i′ − B_·j‖²)
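For concreteness, the XOR example can be set up in this notation as follows (variable names are illustrative):

```python
import numpy as np

# XOR example: m = 4 points in R^2, stored as rows of A
A = np.array([[0., 1.],
              [1., 0.],
              [0., 0.],
              [1., 1.]])
labels = np.array([1., 1., -1., -1.])    # +1 for (0,1), (1,0); -1 for (0,0), (1,1)
D = np.diag(labels)                      # m x m diagonal label matrix
e = np.ones(len(labels))                 # vector of ones
```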

Page 21

Methodology

UCI Datasets

To reduce running time, 1/11 of each dataset was used as a tuning set to select ν and the kernel parameter μ

The remaining 10/11 was used for 10-fold cross validation

The procedure was repeated 5 times for each dataset with a different random choice of tuning set each time (see the index sketch below)

NDCC

Generate multiple datasets with 200 training, 200 tuning, and 1000 testing points
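A small index-bookkeeping sketch of the UCI splitting procedure described above; the function name and the use of the Ionosphere size (351 points) are illustrative.

```python
import numpy as np

def tuning_and_cv_split(m, rng):
    # 1/11 of the data for tuning (to pick parameters); the remaining 10/11
    # arranged into 10 folds for cross validation.
    idx = rng.permutation(m)
    n_tune = m // 11
    tune_idx = idx[:n_tune]
    folds = np.array_split(idx[n_tune:], 10)
    return tune_idx, folds

rng = np.random.default_rng(0)
# Repeat the whole procedure 5 times with a different random tuning set each time.
for repeat in range(5):
    tune_idx, folds = tuning_and_cv_split(m=351, rng=rng)
```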

