Page 1: Linear Programming for Feature Selection via Regularization

Linear Programming for Feature Selection via Regularization

Yoonkyung Lee
Department of Statistics
The Ohio State University
(Joint work with Yonggang Yao)

July 2008

Page 2: Linear Programming for Feature Selection via Regularization

Outline

◮ Methods of regularization
◮ Solution paths
◮ Main optimization problems for feature selection
◮ Overview of linear programming
◮ Simplex algorithm for generating solution paths
◮ Implications
◮ Numerical examples
◮ Concluding remarks

Page 3: Linear Programming for Feature Selection via Regularization

Regularization

◮ Tikhonov regularization (1943): solving ill-posed integral equations numerically
◮ Process of modifying ill-posed problems by introducing additional information about the solution
◮ Modification of the maximum likelihood principle or the empirical risk minimization principle (Bickel & Li 2006)
◮ Smoothness, sparsity, small norm, large margin, ...
◮ Bayesian connection

Page 4: Linear Programming for Feature Selection via Regularization

Methods of Regularization (Penalization)

Find f ∈ F minimizing

(1/n) ∑_{i=1}^n L(y_i, f(x_i)) + λ J(f).

◮ Empirical risk + penalty
◮ F: a class of candidate functions
◮ J(f): the complexity of a model f
◮ λ > 0: a regularization parameter
◮ Without the penalty J(f), the problem is ill-posed
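To make the template concrete, here is a minimal sketch (not from the talk; the data and λ value are made up) that evaluates this penalized empirical risk for ridge regression, where L is the squared-error loss and J(f) = ‖β‖²:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 50, 5
    X = rng.standard_normal((n, p))
    beta_true = np.array([2.0, 0.0, -1.0, 0.0, 0.5])
    y = X @ beta_true + rng.standard_normal(n)

    def penalized_risk(beta, lam):
        # empirical risk (squared-error loss) plus the ridge penalty lam * ||beta||^2
        return np.mean((y - X @ beta) ** 2) + lam * np.sum(beta ** 2)

    # Ridge regression minimizes this objective; with the 1/n scaling its
    # closed form is beta_hat = (X'X + n * lam * I)^{-1} X'y.
    lam = 0.1
    beta_hat = np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)
    print(penalized_risk(beta_hat, lam))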

Page 5: Linear Programming for Feature Selection via Regularization

Examples of Regularization Methods

◮ Ridge regression (Hoerl and Kennard 1970)◮ LASSO (Tibshirani 1996)◮ Smoothing splines (Wahba 1990)◮ Support vector machines (Vapnik 1998)◮ Regularized neural network, boosting, logistic regression,

...

Page 6: Linear Programming for Feature Selection via Regularization

◮ Smoothing splines:
Find f ∈ F = W_2[0, 1] = {f : f, f′ absolutely continuous, and f″ ∈ L_2} minimizing

(1/n) ∑_{i=1}^n (y_i − f(x_i))² + λ ∫_0^1 (f″(x))² dx,

where J(f) = ∫_0^1 (f″(x))² dx.

◮ Support vector machines:
Find f ∈ F = {f(x) = w⊤x + b | w ∈ R^p and b ∈ R} minimizing

(1/n) ∑_{i=1}^n (1 − y_i f(x_i))_+ + λ‖w‖²,

where J(f) = J(w⊤x + b) = ‖w‖².

Page 7: Linear Programming for Feature Selection via Regularization

LASSO

min_β ∑_{i=1}^n (y_i − ∑_{j=1}^p β_j x_{ij})² + λ‖β‖₁   ⇔   min_β ∑_{i=1}^n (y_i − ∑_{j=1}^p β_j x_{ij})²  s.t.  ‖β‖₁ ≤ s
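As a quick illustration of the penalized form (not part of the talk; scikit-learn is used here instead of the LP/LARS machinery discussed later, and its objective carries a 1/(2n) scaling), the coefficient profiles over a grid of penalty levels, as in the figure on the next slide, can be traced with:

    import numpy as np
    from sklearn.linear_model import lasso_path

    rng = np.random.default_rng(1)
    n, p = 100, 10
    X = rng.standard_normal((n, p))
    beta_true = np.zeros(p)
    beta_true[[0, 2, 4]] = [3.0, -2.0, 1.5]
    y = X @ beta_true + rng.standard_normal(n)

    # scikit-learn minimizes (1/(2n)) * ||y - X beta||^2 + alpha * ||beta||_1
    alphas, coefs, _ = lasso_path(X, y)
    print(coefs.shape)  # (p, n_alphas): one coefficient profile per feature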

Page 8: Linear Programming for Feature Selection via Regularization

LASSO coefficient paths

Figure: LASSO coefficient paths (standardized coefficients plotted against |beta|/max|beta|).

Page 9: Linear Programming for Feature Selection via Regularization

Solution Paths

◮ Each regularization method defines a continuum of optimization problems indexed by a tuning parameter.
◮ λ determines the trade-off between the prediction error and the model complexity.
◮ The entire set of solutions f or β as a function of λ
◮ Complete exploration of the model space and computational savings
◮ Examples
  ◮ LARS (Efron et al. 2004)
  ◮ SVM path (Hastie et al. 2004)
  ◮ Multicategory SVM path (Lee and Cui 2006)
  ◮ Piecewise linear paths (Rosset and Zhu 2007)
  ◮ Generalized path seeking algorithm (Friedman 2008)

Page 10: Linear Programming for Feature Selection via Regularization

Main Problem

◮ Regularization for simultaneous fitting and feature selection
◮ Convex piecewise linear loss functions
◮ Penalties of ℓ1 nature for feature selection
  ◮ Parametric: LASSO-type
  ◮ Nonparametric: COSSO-type
    COmponent Selection and Smoothing Operator (Lin and Zhang 2003, Gunn and Kandola 2002)
◮ Non-differentiability of the loss and penalty
◮ Linear programming (LP) problems indexed by a single regularization parameter

Page 11: Linear Programming for Feature Selection via Regularization

◮ Examples
  ◮ ℓ1-norm SVM (Bradley and Mangasarian 1998, Zhu et al. 2004)
  ◮ ℓ1-norm Quantile Regression (Li and Zhu 2005)
  ◮ θ-step (kernel selection) for structured kernel methods (Lee et al. 2006)
  ◮ Dantzig selector (Candes and Tao 2005)
  ◮ ε-insensitive loss in SVM regression
  ◮ Sup norm, max_{j=1,...,p} |β_j|
◮ Computational properties of the solutions to these problems can be treated generally by tapping into the LP theory.

Page 12: Linear Programming for Feature Selection via Regularization

Linear Programming

◮ One of the cornerstones of optimization theory
◮ Applications in operations research, economics, business management, and engineering
◮ The simplex algorithm by Dantzig (1947)
◮ 'Parametric-cost LP' or 'parametric right-hand-side LP' in the optimization literature
◮ Exploit the connection to lay out general algorithms for the solution paths of the feature selection problems.

Page 13: Linear Programming for Feature Selection via Regularization

Geometry of LP

◮ Search for the minimum of a linear function over a polyhedron whose edges are defined by hyperplanes.
◮ At least one of the intersection points of the hyperplanes should attain the minimum if the minimum exists.

Page 14: Linear Programming for Feature Selection via Regularization

Linear Programming

◮ Standard form of LP

min_{z ∈ R^N} c′z  s.t.  Az = b, z ≥ 0,

where z is an N-vector of variables, c is a fixed N-vector, b is a fixed M-vector, and A is an M × N fixed matrix.
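A minimal sketch (illustrative numbers only) of solving a standard-form LP with an off-the-shelf solver, here scipy.optimize.linprog:

    import numpy as np
    from scipy.optimize import linprog

    # min c'z  s.t.  Az = b, z >= 0, with M = 2 constraints and N = 4 variables
    c = np.array([-1.0, -2.0, 0.0, 0.0])
    A = np.array([[1.0, 1.0, 1.0, 0.0],
                  [1.0, 3.0, 0.0, 1.0]])
    b = np.array([4.0, 6.0])

    res = linprog(c, A_eq=A, b_eq=b, bounds=(0, None), method="highs")
    print(res.x, res.fun)  # optimal z = (3, 1, 0, 0), objective -5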

Page 15: Linear Programming for Feature Selection via Regularization

LP terminology

◮ A set B∗ := {B∗_1, ..., B∗_M} ⊂ N = {1, ..., N} is called a basic index set if A_{B∗} is invertible.
◮ z∗ ∈ R^N is called the basic solution associated with B∗ if z∗ satisfies
  z∗_{B∗} := (z∗_{B∗_1}, ..., z∗_{B∗_M})′ = A_{B∗}^{-1} b and z∗_j = 0 for j ∈ N \ B∗.
◮ A basic index set B∗ is called a feasible basic index set if A_{B∗}^{-1} b ≥ 0.
◮ A feasible basic index set B∗ is also called an optimal basic index set if [c − A′(A_{B∗}^{-1})′ c_{B∗}] ≥ 0.
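These definitions can be checked numerically; the sketch below (reusing the small example above, with a hand-picked candidate B∗) computes the basic solution, its feasibility, and the reduced costs in the optimality condition:

    import numpy as np

    c = np.array([-1.0, -2.0, 0.0, 0.0])
    A = np.array([[1.0, 1.0, 1.0, 0.0],
                  [1.0, 3.0, 0.0, 1.0]])
    b = np.array([4.0, 6.0])

    B = [0, 1]                     # candidate basic index set (0-based column indices)
    A_B = A[:, B]
    z_B = np.linalg.solve(A_B, b)  # basic solution: z_B = A_B^{-1} b, other entries 0

    feasible = np.all(z_B >= 0)
    # reduced costs c - A'(A_B^{-1})' c_B; nonnegative everywhere means B is optimal
    reduced = c - A.T @ np.linalg.solve(A_B.T, c[B])
    optimal = feasible and np.all(reduced >= -1e-12)
    print(z_B, feasible, optimal)  # (3, 1), True, True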

Page 16: Linear Programming for Feature Selection via Regularization

Optimality Condition for LP

Theorem. Let z∗ be the basic solution associated with B∗, an optimal basic index set. Then z∗ is an optimal basic solution.

◮ The standard LP problem can be solved by finding the optimal basic index set.

Page 17: Linear Programming for Feature Selection via Regularization

Parametric Linear Programs

◮ Standard form of a parametric-cost LP:

min_{z ∈ R^N} (c + λa)′z  s.t.  Az = b, z ≥ 0

◮ Standard form of a parametric right-hand-side LP:

min_{z ∈ R^N} c′z  s.t.  Az = b + ωb∗, z ≥ 0

Page 18: Linear Programming for Feature Selection via Regularization

Example: ℓ1-norm SVM

min_{β0 ∈ R, β ∈ R^p} ∑_{i=1}^n {1 − y_i(β0 + x_iβ)}_+ + λ‖β‖₁

◮ In other words,

min_{β0 ∈ R, β ∈ R^p, ζ ∈ R^n} ∑_{i=1}^n (ζ_i)_+ + λ‖β‖₁
s.t.  y_i(β0 + x_iβ) + ζ_i = 1 for i = 1, ..., n.

z := (β0^+, β0^−, (β^+)′, (β^−)′, (ζ^+)′, (ζ^−)′)′
c := (0, 0, 0′, 0′, 1′, 0′)′
a := (0, 0, 1′, 1′, 0′, 0′)′
A := (Y  −Y  diag(Y)X  −diag(Y)X  I  −I)
b := 1.
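A minimal sketch of this setup (synthetic data; a single fixed λ solved with scipy's general-purpose linprog rather than the path algorithm of the talk):

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(2)
    n, p = 60, 5
    X = rng.standard_normal((n, p))
    y = np.where(X[:, 0] - X[:, 1] + 0.5 * rng.standard_normal(n) > 0, 1.0, -1.0)

    lam = 1.0
    YX = y[:, None] * X                       # diag(Y) X
    I = np.eye(n)
    # variable layout: z = (beta0+, beta0-, beta+, beta-, zeta+, zeta-)
    A_eq = np.hstack([y[:, None], -y[:, None], YX, -YX, I, -I])
    b_eq = np.ones(n)
    c_vec = np.concatenate([[0.0, 0.0], np.zeros(2 * p), np.ones(n), np.zeros(n)])
    a_vec = np.concatenate([[0.0, 0.0], np.ones(2 * p), np.zeros(2 * n)])

    res = linprog(c_vec + lam * a_vec, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    z = res.x
    beta0 = z[0] - z[1]
    beta = z[2:2 + p] - z[2 + p:2 + 2 * p]
    print(beta0, beta)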

Page 19: Linear Programming for Feature Selection via Regularization

Optimality Interval

Corollary. For a fixed λ∗ ≥ 0, let B∗ be an optimal basic index set of the parametric-cost LP problem at λ = λ∗. Define

λ_lower := max_{j ∈ N \ B∗ : a∗_j > 0} (−c∗_j / a∗_j)  and  λ_upper := min_{j ∈ N \ B∗ : a∗_j < 0} (−c∗_j / a∗_j),

where a∗_j := a_j − a′_{B∗} A_{B∗}^{-1} A_j and c∗_j := c_j − c′_{B∗} A_{B∗}^{-1} A_j for j ∈ N.

Then, B∗ is an optimal basic index set for λ ∈ [λ_lower, λ_upper], which includes λ∗.
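In code, the interval [λ_lower, λ_upper] follows directly from the reduced costs; a sketch (illustrative names, assuming A, c, a, and an optimal basic index set B_star are available):

    import numpy as np

    def optimality_interval(A, c, a, B_star):
        # a*_j = a_j - a_B' A_B^{-1} A_j  and  c*_j = c_j - c_B' A_B^{-1} A_j
        N = A.shape[1]
        A_B = A[:, B_star]
        a_star = a - A.T @ np.linalg.solve(A_B.T, a[B_star])
        c_star = c - A.T @ np.linalg.solve(A_B.T, c[B_star])
        nonbasic = [j for j in range(N) if j not in B_star]

        lower = [-c_star[j] / a_star[j] for j in nonbasic if a_star[j] > 1e-12]
        upper = [-c_star[j] / a_star[j] for j in nonbasic if a_star[j] < -1e-12]
        # B_star stays optimal for every lambda in [lam_lo, lam_hi]
        lam_lo = max(lower) if lower else 0.0
        lam_hi = min(upper) if upper else np.inf
        return lam_lo, lam_hi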

Page 20: Linear Programming for Feature Selection via Regularization

Simplex Algorithm

1. Initialize the optimal basic index set at λ_{−1} = ∞ with B^0.

2. Given B^l at λ = λ_{l−1}, determine the solution z^l by z^l_{B^l} = A_{B^l}^{-1} b and z^l_j = 0 for j ∈ N \ B^l.

3. Find the entry index j^l = argmax_{j : a^l_j > 0, j ∈ N \ B^l} (−c^l_j / a^l_j).

4. Find the exit index i^l = argmin_{i ∈ {j : d^l_j < 0, j ∈ B^l}} (−z^l_i / d^l_i).

5. Update the optimal basic index set to B^{l+1} = B^l ∪ {j^l} \ {i^l}.

6. Terminate the algorithm if c^l_{j^l} ≥ 0, or equivalently λ_l ≤ 0. Otherwise, repeat 2–5.
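A sketch of one pass of steps 2–5 (illustrative only; d^l is taken here to be the usual simplex direction −A_B^{-1} A_{j^l}, which the slides leave implicit, and ties/degeneracy are ignored):

    import numpy as np

    def pivot_once(A, b, c, a, B):
        N = A.shape[1]
        A_B = A[:, B]
        z_B = np.linalg.solve(A_B, b)                       # step 2: basic solution
        a_red = a - A.T @ np.linalg.solve(A_B.T, a[B])      # reduced 'a' costs
        c_red = c - A.T @ np.linalg.solve(A_B.T, c[B])      # reduced 'c' costs

        # step 3: entry index maximizing -c*_j / a*_j over non-basic j with a*_j > 0
        cands = [j for j in range(N) if j not in B and a_red[j] > 1e-12]
        j_entry = max(cands, key=lambda j: -c_red[j] / a_red[j])

        # step 4: exit index via the ratio test along d = -A_B^{-1} A_{j_entry}
        d = -np.linalg.solve(A_B, A[:, j_entry])
        rows = [k for k in range(len(B)) if d[k] < -1e-12]
        k_exit = min(rows, key=lambda k: -z_B[k] / d[k])

        # step 5: swap the exiting index for the entering one
        return [idx for idx in B if idx != B[k_exit]] + [j_entry]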

Page 21: Linear Programming for Feature Selection via Regularization

Theorem. The solution path of the parametric-cost LP is

z^0 for λ > λ_0,
z^l for λ_l < λ < λ_{l−1}, l = 1, ..., J,
τ z^l + (1 − τ) z^{l+1} for λ = λ_l and τ ∈ [0, 1], l = 0, ..., J − 1.

◮ The simplex algorithm gives a piecewise constant path.

Page 22: Linear Programming for Feature Selection via Regularization

An Illustrative Example

◮ x = (x_1, ..., x_10) ∼ N(0, I)
◮ A probit model: Y = sign(β0 + xβ + ε), where ε ∼ N(0, 50)
◮ β0 = 0, β_j = 2 for j = 1, 3, 5, 10, and 0 elsewhere
◮ The Bayes error rate: 0.336
◮ n = 400
◮ ℓ1-norm SVM
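A sketch of this data-generating mechanism (assuming N(0, 50) denotes a noise variance of 50):

    import numpy as np

    rng = np.random.default_rng(3)
    n, p = 400, 10
    beta0 = 0.0
    beta = np.zeros(p)
    beta[[0, 2, 4, 9]] = 2.0                      # beta_j = 2 for j = 1, 3, 5, 10

    X = rng.standard_normal((n, p))               # x ~ N(0, I)
    eps = rng.normal(0.0, np.sqrt(50.0), size=n)  # noise with variance 50
    y = np.sign(beta0 + X @ beta + eps)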

Page 23: Linear Programming for Feature Selection via Regularization

Figure: ℓ1-norm SVM coefficient path indexed by λ (five-fold CV with 0-1 and hinge loss); coefficients β are plotted against −log(λ), with curves labeled 1–10 by feature.

Page 24: Linear Programming for Feature Selection via Regularization

Alternative Formulation

◮ Example: ℓ1-norm SVM

min_{β0 ∈ R, β ∈ R^p} ∑_{i=1}^n {1 − y_i(β0 + x_iβ)}_+  s.t.  ‖β‖₁ ≤ s

◮ As a parametric right-hand-side LP,

min_{z ∈ R^N, δ ∈ R} c′z  s.t.  Az = b, a′z + δ = s, z ≥ 0, δ ≥ 0.

Page 25: Linear Programming for Feature Selection via Regularization

Theorem. For s ≥ 0, the solution path can be expressed as

((s_{l+1} − s)/(s_{l+1} − s_l)) z^l + ((s − s_l)/(s_{l+1} − s_l)) z^{l+1}  if s_l ≤ s < s_{l+1}, l = 0, ..., J − 1,
z^J  if s ≥ s_J,

where s_l = a′z^l.

◮ The simplex algorithm gives a piecewise linear path.
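A sketch of evaluating the path at an arbitrary s, assuming the breakpoints s_vals = (s_0, ..., s_J) and the corresponding solutions Z = (z^0, ..., z^J) have already been computed (names are illustrative):

    import numpy as np

    def solution_at(s, s_vals, Z):
        # piecewise linear interpolation between consecutive breakpoints;
        # assumes s >= s_vals[0] and that s_vals is strictly increasing
        if s >= s_vals[-1]:
            return Z[-1]
        l = np.searchsorted(s_vals, s, side="right") - 1
        w = (s - s_vals[l]) / (s_vals[l + 1] - s_vals[l])
        return (1 - w) * Z[l] + w * Z[l + 1]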

Page 26: Linear Programming for Feature Selection via Regularization

Figure: ℓ1-norm SVM coefficient path indexed by s (five-fold CV with 0-1 and hinge loss); coefficients β are plotted against s, with curves labeled 1–10 by feature.

Page 27: Linear Programming for Feature Selection via Regularization

Figure: The true error rate path for the ℓ1-norm SVM under the probit model (error rate plotted against s).

Page 28: Linear Programming for Feature Selection via Regularization

Annual Household Income Data

◮ http://www-stat.stanford.edu/∼tibs/ElemStatLearn/
◮ Predict the annual household income with 13 demographic attributes (education, age, gender, marital status, occupation, householder status, etc.).
◮ The response takes one of nine specified income brackets.
◮ Split 6,876 records into a training set of 2,000 and a test set of 4,876.

Page 29: Linear Programming for Feature Selection via Regularization

Figure: Boxplots of the annual household income with education, age, gender, marital status, occupation, and householder status out of 13 demographic attributes in the data.

Page 30: Linear Programming for Feature Selection via Regularization

Median Regression with ℓ1 Penalty

min_β ∑_{i=1}^n |y_i − ∑_{j=1}^p β_j x_{ij}|  subject to ‖β‖₁ ≤ s

◮ Main effect model with 35 variables plus a quadratic term for age
◮ Partial two-way interaction model with additional 69 two-way interactions (out of 531 potential interaction terms)
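This constrained least-absolute-deviations problem is itself an LP; a minimal sketch (again using a generic solver for a single s rather than the path algorithm, splitting β and the residuals into positive and negative parts):

    import numpy as np
    from scipy.optimize import linprog

    def l1_median_regression(X, y, s):
        # variables: (beta+, beta-, u, v) with y - X beta = u - v and u, v >= 0,
        # so sum(u + v) equals the sum of absolute residuals
        n, p = X.shape
        c = np.concatenate([np.zeros(2 * p), np.ones(2 * n)])
        A_eq = np.hstack([X, -X, np.eye(n), -np.eye(n)])
        A_ub = np.concatenate([np.ones(2 * p), np.zeros(2 * n)])[None, :]  # ||beta||_1 <= s
        res = linprog(c, A_ub=A_ub, b_ub=[s], A_eq=A_eq, b_eq=y,
                      bounds=(0, None), method="highs")
        z = res.x
        return z[:p] - z[p:2 * p]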

Page 31: Linear Programming for Feature Selection via Regularization

Main effect model

Figure: Coefficient paths (β against s) for the main effect model. Positive: home ownership (in dark blue relative to renting), education (in brown), dual income due to marriage (in purple relative to 'not married'), age (in sky blue), and male (in light green). Negative: single or divorced (in red relative to 'married') and student, clerical worker, retired or unemployed (in green relative to professionals/managers).

Page 32: Linear Programming for Feature Selection via Regularization

Two-way interaction model

Figure: Coefficient paths (β against s) for the two-way interaction model. Positive: 'dual income ∗ home ownership', 'home ownership ∗ education', and 'married but no dual income ∗ education'. Negative: 'single ∗ education' and 'home ownership ∗ age'.

Page 33: Linear Programming for Feature Selection via Regularization

Risk Path

Figure: Estimated risk path (against s) for the two-way fitted models; the risks are estimated using a test data set with 4,876 observations.

Page 34: Linear Programming for Feature Selection via Regularization

Refinement of the Simplex Algorithm

◮ The simplex algorithm assumes the non-degeneracy of solutions, i.e., z^l ≠ z^{l+1} for each l.
◮ Tableau-simplex algorithm with the anti-cycling property for more general settings
◮ Structural commonalities in the elements of the standard LP form can be utilized for efficient computation.

Page 35: Linear Programming for Feature Selection via Regularization

Concluding Remarks

◮ Establish the connection between a family of regularization problems for feature selection and the LP theory.
◮ Shed new light on solution path-finding algorithms for the optimization problems in comparison with the existing algorithms.
◮ Provide fast and efficient computational tools for screening and selection of features for regression and classification problems.
◮ Unified algorithm with modular treatment of different procedures (lpRegPath)
◮ Model selection (or averaging) and validation
◮ Optimization theory and tools are very useful to statisticians.

Page 36: Linear Programming for Feature Selection via Regularization

Reference

◮ "Another Look at Linear Programming for Feature Selection via Methods of Regularization," Yao, Y. and Lee, Y., Technical Report No. 800, Department of Statistics, The Ohio State University, 2007.
◮ For a copy of the paper and the slides of this talk, visit http://www.stat.ohio-state.edu/∼yklee
◮ E-mail: [email protected]

