Page 1: Model Selection via Bilevel Optimization

Model Selection via Bilevel Optimization

Kristin P. Bennett, Jing Hu, Xiaoyun Ji, Gautam Kunapuli and Jong-Shi Pang

Department of Mathematical Sciences, Rensselaer Polytechnic Institute, Troy, NY

Page 2: Model Selection via Bilevel Optimization

Convex Machine Learning

Convex optimization approaches to machine learning have been a major obsession of machine learning for the last ten years.

But are the problems really convex?

Page 3: Model Selection via Bilevel Optimization

Outline

The myth of convex machine learning
Bilevel Programming Model Selection
  Regression
  Classification
Extensions to other machine learning tasks
Discussion

Page 4: Model Selection via Bilevel Optimization

Modeler's Choices

Data: $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots$

Function: $\hat{f}(\mathbf{x}) = \mathbf{x}'\mathbf{w}$

Loss/Regularization:
$$\min_{\mathbf{w}} \; C \sum_i \max\big(|\mathbf{x}_i'\mathbf{w} - y_i| - \varepsilon,\, 0\big) + \tfrac{1}{2}\|\mathbf{w}\|_2^2$$

Optimization Algorithm: returns $\mathbf{w}$, hence the model $\hat{f}(\mathbf{x}) = \mathbf{x}'\mathbf{w}$.

CONVEX!
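As a concrete illustration of this convex training step, here is a minimal sketch that writes the ε-insensitive regression problem above directly in CVXPY, assuming fixed hyperparameters C and ε and made-up data (this is not the authors' code or solver):

```python
# Sketch: the convex inner training problem for fixed C and epsilon, in CVXPY.
# X, y, C, eps are illustrative example values, not from the paper.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 5))          # 30 training points, 5 features
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 0.0]) + 0.1 * rng.standard_normal(30)
C, eps = 1.0, 0.1                          # hyperparameters fixed by the modeler

w = cp.Variable(5)
residual = cp.abs(X @ w - y)               # |x_i'w - y_i|
loss = C * cp.sum(cp.pos(residual - eps))  # epsilon-insensitive loss
reg = 0.5 * cp.sum_squares(w)              # (1/2)||w||_2^2
cp.Problem(cp.Minimize(loss + reg)).solve()
print("learned w:", np.round(w.value, 3))
```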

Page 5: Model Selection via Bilevel Optimization

Many Hidden Choices

Data: variable selection, scaling, feature construction, missing data, outlier removal
Function family: linear, kernel (introduces kernel parameters)
Optimization model: loss function, regularization, parameters/constraints

Page 6: Model Selection via Bilevel Optimization

Data

Function

Loss/Regularization

Cross-Validation Strategy: the hyperparameters and data $C, \varepsilon, [\mathbf{X}, \mathbf{y}]$ are split into $T$ training/validation folds. For each fold the optimization algorithm
$$\min_{\mathbf{w}} \; C \sum_i \max\big(|\mathbf{x}_i'\mathbf{w} - y_i| - \varepsilon,\, 0\big) + \tfrac{1}{2}\|\mathbf{w}\|_2^2$$
returns $\mathbf{w}^t$ and the model $\hat{f}(\mathbf{x}) = \mathbf{x}'\mathbf{w}^t$.

Generalization Error estimate:
$$\frac{1}{T}\sum_{t=1}^{T} \frac{1}{|\Omega_t|}\sum_{i \in \Omega_t} \big|\mathbf{x}_i'\mathbf{w}^t - y_i\big|$$

NONCONVEX in the hyperparameters $C, \varepsilon$.

Page 7: Model Selection via Bilevel Optimization

How does the modeler make choices?

Best training set error
Experience/policy
Estimate of generalization error: cross-validation, bounds

How to optimize the generalization error estimate?

Fiddle around
Grid search
Gradient methods
Bilevel programming

Page 8: Model Selection via Bilevel Optimization

Splitting Data for T-fold CV

$\Omega$ is split into $T$ disjoint partitions $\Omega_1, \ldots, \Omega_T$.
$\Omega_t$: the $t$-th partition, the validation set for the $t$-th fold.
$\overline{\Omega}_t = \Omega \setminus \Omega_t$: the training set for the $t$-th fold.
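A minimal sketch of this partitioning in plain NumPy (illustrative helper, not from the paper):

```python
# Sketch: split an index set Omega into T disjoint folds; fold t uses Omega_t
# as its validation set and Omega \ Omega_t as its training set.
import numpy as np

def make_folds(n_points, T, seed=0):
    idx = np.random.default_rng(seed).permutation(n_points)
    folds = np.array_split(idx, T)                      # T disjoint partitions
    splits = []
    for t in range(T):
        val = folds[t]                                  # validation set for fold t
        train = np.concatenate([folds[s] for s in range(T) if s != t])
        splits.append((train, val))
    return splits

for t, (train, val) in enumerate(make_folds(12, 3)):
    print(f"fold {t}: train={sorted(train)}, val={sorted(val)}")
```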

Page 9: Model Selection via Bilevel Optimization

CV via Grid Search

For every $(C, \varepsilon)$:
  For every validation set $\Omega_t$: solve the model on the corresponding training set $\overline{\Omega}_t$ to obtain $\mathbf{w}^t$, and use it to estimate the loss on $\Omega_t$.
  Estimate the generalization error for $(C, \varepsilon)$.
Return the best values of $(C, \varepsilon)$.
Make the final model using $(C, \varepsilon)$.
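The grid-search baseline above can be sketched with generic scikit-learn tools; the authors ran their own grid search, so the estimator, grid values, and scoring below are illustrative assumptions only:

```python
# Sketch of grid-search CV over (C, epsilon) for a linear epsilon-SVR.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.standard_normal((90, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 0.0]) + 0.1 * rng.standard_normal(90)

param_grid = {"C": [0.1, 1.0, 10.0], "epsilon": [0.01, 0.1, 1.0]}   # a few values each
search = GridSearchCV(SVR(kernel="linear"), param_grid,
                      cv=3, scoring="neg_mean_absolute_error")
search.fit(X, y)                        # one SVR solve per (C, epsilon, fold)
print("best (C, epsilon):", search.best_params_)
```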

Page 10: Model Selection via Bilevel Optimization

Bilevel Program for T folds

CV as a continuous optimization problem:
$$
\begin{aligned}
\min_{C,\,\varepsilon,\,\mathbf{w}^t} \quad & \frac{1}{T}\sum_{t=1}^{T} \frac{1}{|\Omega_t|}\sum_{i \in \Omega_t} \big|\mathbf{x}_i'\mathbf{w}^t - y_i\big| && \text{(outer-level validation problem)}\\
\text{s.t.} \quad & \mathbf{w}^t \in \arg\min_{\mathbf{w}} \; C \sum_{j \in \overline{\Omega}_t} \max\big(|\mathbf{x}_j'\mathbf{w} - y_j| - \varepsilon,\, 0\big) + \tfrac{1}{2}\|\mathbf{w}\|_2^2, \quad t = 1, \ldots, T && \text{($T$ inner-level training problems)}
\end{aligned}
$$

Prior approaches: Golub et al., 1979, Generalized Cross-Validation for one parameter in ridge regression.

Page 11: Model Selection via Bilevel Optimization

Benefit: More Design Variables

Add a feature box constraint $-\overline{\mathbf{w}} \le \mathbf{w} \le \overline{\mathbf{w}}$ in the inner-level problems:
$$
\begin{aligned}
\min_{C,\,\varepsilon,\,\overline{\mathbf{w}},\,\mathbf{w}^t} \quad & \frac{1}{T}\sum_{t=1}^{T} \frac{1}{|\Omega_t|}\sum_{i \in \Omega_t} \big|\mathbf{x}_i'\mathbf{w}^t - y_i\big|\\
\text{s.t.} \quad & \mathbf{w}^t \in \arg\min_{-\overline{\mathbf{w}} \le \mathbf{w} \le \overline{\mathbf{w}}} \; C \sum_{j \in \overline{\Omega}_t} \max\big(|\mathbf{x}_j'\mathbf{w} - y_j| - \varepsilon,\, 0\big) + \tfrac{1}{2}\|\mathbf{w}\|_2^2, \quad t = 1, \ldots, T
\end{aligned}
$$

Page 12: Model Selection via Bilevel Optimization

The $\varepsilon$-insensitive loss function: $\max(|z| - \varepsilon,\, 0)$. [Plot of the loss: zero on $[-\varepsilon, \varepsilon]$, linear outside.]

Page 13: Model Selection via Bilevel Optimization

Inner-level Problem for the t-th Fold

$$
\min_{-\overline{\mathbf{w}} \le \mathbf{w}^t \le \overline{\mathbf{w}}} \; C \sum_{j \in \overline{\Omega}_t} \max\big(|\mathbf{x}_j'\mathbf{w}^t - y_j| - \varepsilon,\, 0\big) + \tfrac{1}{2}\|\mathbf{w}^t\|_2^2
$$

is equivalent to the quadratic program

$$
\begin{aligned}
\min_{\mathbf{w}^t,\,\boldsymbol{\xi}^t} \quad & C \sum_{j \in \overline{\Omega}_t} \xi_j^t + \tfrac{1}{2}\|\mathbf{w}^t\|_2^2\\
\text{s.t.} \quad & \mathbf{x}_j'\mathbf{w}^t - y_j \le \varepsilon + \xi_j^t,\\
& y_j - \mathbf{x}_j'\mathbf{w}^t \le \varepsilon + \xi_j^t,\\
& \xi_j^t \ge 0, \quad j \in \overline{\Omega}_t,\\
& -\overline{\mathbf{w}} \le \mathbf{w}^t \le \overline{\mathbf{w}}.
\end{aligned}
$$

Page 14: Model Selection via Bilevel Optimization

Optimality (KKT) conditions for fixed $C, \varepsilon, \overline{\mathbf{w}}$

$$
\begin{aligned}
& 0 \le \alpha_j^{+,t} \;\perp\; \varepsilon + \xi_j^t + y_j - \mathbf{x}_j'\mathbf{w}^t \ge 0\\
& 0 \le \alpha_j^{-,t} \;\perp\; \varepsilon + \xi_j^t - y_j + \mathbf{x}_j'\mathbf{w}^t \ge 0\\
& 0 \le \xi_j^t \;\perp\; C - \alpha_j^{+,t} - \alpha_j^{-,t} \ge 0, \quad j \in \overline{\Omega}_t\\
& 0 \le \boldsymbol{\gamma}^{+,t} \;\perp\; \overline{\mathbf{w}} - \mathbf{w}^t \ge 0\\
& 0 \le \boldsymbol{\gamma}^{-,t} \;\perp\; \overline{\mathbf{w}} + \mathbf{w}^t \ge 0\\
& 0 = \mathbf{w}^t + \sum_{j \in \overline{\Omega}_t} (\alpha_j^{+,t} - \alpha_j^{-,t})\,\mathbf{x}_j + \boldsymbol{\gamma}^{+,t} - \boldsymbol{\gamma}^{-,t}
\end{aligned}
$$

where $0 \le a \perp b \ge 0$ means $a \ge 0$, $b \ge 0$, $a\,b = 0$.

Page 15: Model Selection via Bilevel Optimization

Key Transformation

The KKT conditions for the inner-level training problems are necessary and sufficient, so the lower-level problems can be replaced by their KKT conditions. The problem becomes a Mathematical Program with Equilibrium Constraints (MPEC).

Page 16: Model Selection via Bilevel Optimization

Bilevel Problem as MPEC

Replace the $T$ inner-level problems with their corresponding optimality conditions:
$$
\begin{aligned}
\min_{C,\,\varepsilon,\,\overline{\mathbf{w}},\,\mathbf{w}^t,\,\boldsymbol{\xi}^t,\,\boldsymbol{\alpha}^{\pm,t},\,\boldsymbol{\gamma}^{\pm,t}} \quad & \frac{1}{T}\sum_{t=1}^{T} \frac{1}{|\Omega_t|}\sum_{i \in \Omega_t} \big|\mathbf{x}_i'\mathbf{w}^t - y_i\big|\\
\text{s.t. for } t = 1, \ldots, T: \quad
& 0 \le \alpha_j^{+,t} \;\perp\; \varepsilon + \xi_j^t + y_j - \mathbf{x}_j'\mathbf{w}^t \ge 0\\
& 0 \le \alpha_j^{-,t} \;\perp\; \varepsilon + \xi_j^t - y_j + \mathbf{x}_j'\mathbf{w}^t \ge 0\\
& 0 \le \xi_j^t \;\perp\; C - \alpha_j^{+,t} - \alpha_j^{-,t} \ge 0, \quad j \in \overline{\Omega}_t\\
& 0 \le \boldsymbol{\gamma}^{+,t} \;\perp\; \overline{\mathbf{w}} - \mathbf{w}^t \ge 0\\
& 0 \le \boldsymbol{\gamma}^{-,t} \;\perp\; \overline{\mathbf{w}} + \mathbf{w}^t \ge 0\\
& 0 = \mathbf{w}^t + \sum_{j \in \overline{\Omega}_t} (\alpha_j^{+,t} - \alpha_j^{-,t})\,\mathbf{x}_j + \boldsymbol{\gamma}^{+,t} - \boldsymbol{\gamma}^{-,t}
\end{aligned}
$$

Page 17: Model Selection via Bilevel Optimization

MPEC to NLP via Inexact Cross Validation

Relax the "hard" equilibrium constraints to "soft" inexact constraints:
$$
0 \le a \perp b \ge 0
\;\;\Longleftrightarrow\;\;
\begin{cases} a \ge 0,\; b \ge 0,\\ a^{\mathsf T} b = 0 \end{cases}
\qquad\text{is relaxed to}\qquad
\begin{cases} a \ge 0,\; b \ge 0,\\ a^{\mathsf T} b \le \mathrm{tol} \end{cases}
$$
where tol is some user-defined tolerance.
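To make the relaxation concrete, here is a minimal sketch on a toy problem (not the paper's NLP): the hard condition $0 \le a \perp b \ge 0$ is replaced by $a \ge 0$, $b \ge 0$, $ab \le \mathrm{tol}$, which a standard SQP solver (scipy's SLSQP here, not FILTER) can handle. The objective and starting point are made up for illustration:

```python
# Sketch: "inexact" complementarity on a toy problem.
import numpy as np
from scipy.optimize import minimize

tol = 1e-6

def objective(v):
    a, b = v
    return (a - 2.0) ** 2 + (b - 1.0) ** 2   # unconstrained optimum (2, 1) violates a _|_ b

constraints = [
    {"type": "ineq", "fun": lambda v: v[0]},               # a >= 0
    {"type": "ineq", "fun": lambda v: v[1]},               # b >= 0
    {"type": "ineq", "fun": lambda v: tol - v[0] * v[1]},   # a*b <= tol (soft complementarity)
]
res = minimize(objective, x0=[1.0, 1.0], method="SLSQP", constraints=constraints)
print("relaxed solution:", np.round(res.x, 4))  # one of the pair is driven to (almost) zero
```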

Page 18: Model Selection via Bilevel Optimization

Solvers

Strategy: proof of concept using general-purpose nonlinear solvers from NEOS on the NLP.
FILTER and SNOPT: sequential quadratic programming (SQP) methods; FILTER results were almost always better.
Many possible alternatives: integer programming, branch and bound, Lagrangian relaxations.

Page 19: Model Selection via Bilevel Optimization

Computational Experiments: Data

Synthetic: (5, 10, 15)-dimensional data with Gaussian and Laplacian noise and (3, 7, 10) relevant features. NLP: 3-fold CV. Results: 30 to 90 training points, 1000 test points, 10 trials.

QSAR/Drug Design: 4 datasets, 600+ dimensions reduced to the top 25 principal components. NLP: 5-fold CV. Results: 40 to 100 training points, rest used for testing, 20 trials.

Page 20: Model Selection via Bilevel Optimization

Cross-validation Methods Compared

Unconstrained Grid: try 3 values each for $C$ and $\varepsilon$.
Constrained Grid: try 3 values each for $C$ and $\varepsilon$, and {0, 1} for each component of $\overline{\mathbf{w}}$.
Bilevel/FILTER: nonlinear program solved using an off-the-shelf SQP algorithm via NEOS.

Page 21: Model Selection via Bilevel Optimization

15-D Data: Objective Value

[Bar chart comparing Unconstrained Grid, Constrained Grid, and Filter at 15, 30, 60, and 90 training points.]

Page 22: Model Selection via Bilevel Optimization

15-D Data: Computational Time

[Bar chart comparing Unconstrained Grid, Constrained Grid, and Filter at 15, 30, 60, and 90 training points.]

Page 23: Model Selection via Bilevel Optimization

15-D Data: TEST MAD

[Bar chart comparing Unconstrained Grid, Constrained Grid, and Filter at 15, 30, 60, and 90 training points.]

Page 24: Model Selection via Bilevel Optimization

QSAR Data: Objective Value

[Bar chart comparing Unconstrained Grid, Constrained Grid, and Filter on Aquasol, BBB, Cancer, and CCK.]

Page 25: Model Selection via Bilevel Optimization

QSAR Data: Computation Time

[Bar chart comparing Unconstrained Grid, Constrained Grid, and Filter on Aquasol, BBB, Cancer, and CCK.]

Page 26: Model Selection via Bilevel Optimization

QSAR Data: TEST MAD

[Bar chart comparing Unconstrained Grid, Constrained Grid, and Filter on Aquasol, BBB, Cancer, and CCK.]

Page 27: Model Selection via Bilevel Optimization

Classification Cross Validation

Given sample data $(\mathbf{x}_i, y_i)$, $i = 1, \ldots, \ell$, from two classes, with $y_i \in \{+1, -1\}$.

Find a classification function that minimizes an out-of-sample estimate of the classification error.

Page 28: Model Selection via Bilevel Optimization

Lower level - SVM

Define parallel planes $\mathbf{x}'\mathbf{w} = +1$ and $\mathbf{x}'\mathbf{w} = -1$.
Minimize points on the wrong side.
Maximize the margin of separation, $2/\|\mathbf{w}\|$.

Page 29: Model Selection via Bilevel Optimization

Lower Level Loss Function: Hinge Loss

$\max(1 - z,\, 0)$, with $z = y_i(\mathbf{x}_i'\mathbf{w} - b)$.

Measures the distance of points that violate the appropriate hyperplane constraints.

Page 30: Model Selection via Bilevel Optimization

Lower Level Problem: SVC with Box

$$
\begin{aligned}
\min_{\mathbf{w}^t,\,b_t,\,\boldsymbol{\xi}^t} \quad & C \sum_{j \in \overline{\Omega}_t} \xi_j^t + \tfrac{1}{2}\|\mathbf{w}^t\|_2^2\\
\text{s.t.} \quad & y_j(\mathbf{x}_j'\mathbf{w}^t - b_t) \ge 1 - \xi_j^t,\\
& \xi_j^t \ge 0, \quad j \in \overline{\Omega}_t,\\
& -\overline{\mathbf{w}} \le \mathbf{w}^t \le \overline{\mathbf{w}}
\end{aligned}
$$

equivalently,

$$
\min_{-\overline{\mathbf{w}} \le \mathbf{w}^t \le \overline{\mathbf{w}},\, b_t} \; C \sum_{j \in \overline{\Omega}_t} \max\big(1 - y_j(\mathbf{x}_j'\mathbf{w}^t - b_t),\, 0\big) + \tfrac{1}{2}\|\mathbf{w}^t\|_2^2
$$
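A minimal sketch of this lower-level problem in CVXPY, for fixed $C$ and $\overline{\mathbf{w}}$ (illustrative data and values, not the authors' code or solver):

```python
# Sketch: linear SVM with hinge loss plus the box constraint -w_bar <= w <= w_bar.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 4))
y = np.sign(X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.standard_normal(40))
C = 1.0
w_bar = np.array([1.0, 1.0, 0.0, 1.0])        # a zero entry removes that feature entirely

w, b = cp.Variable(4), cp.Variable()
margins = cp.multiply(y, X @ w - b)           # y_j (x_j'w - b)
hinge = cp.sum(cp.pos(1 - margins))           # sum_j max(1 - y_j(x_j'w - b), 0)
prob = cp.Problem(cp.Minimize(C * hinge + 0.5 * cp.sum_squares(w)),
                  [w <= w_bar, w >= -w_bar])
prob.solve()
print("w:", np.round(w.value, 3), " b:", round(float(b.value), 3))
```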

Page 31: Model Selection via Bilevel Optimization

Inner-level KKT Conditions

$$
\begin{aligned}
& 0 \le \alpha_j^t \;\perp\; y_j(\mathbf{x}_j'\mathbf{w}^t - b_t) - 1 + \xi_j^t \ge 0\\
& 0 \le \xi_j^t \;\perp\; C - \alpha_j^t \ge 0, \quad j \in \overline{\Omega}_t\\
& 0 \le \boldsymbol{\gamma}^{+,t} \;\perp\; \overline{\mathbf{w}} - \mathbf{w}^t \ge 0\\
& 0 \le \boldsymbol{\gamma}^{-,t} \;\perp\; \overline{\mathbf{w}} + \mathbf{w}^t \ge 0\\
& 0 = \mathbf{w}^t - \sum_{j \in \overline{\Omega}_t} \alpha_j^t y_j \mathbf{x}_j + \boldsymbol{\gamma}^{+,t} - \boldsymbol{\gamma}^{-,t}\\
& 0 = \sum_{j \in \overline{\Omega}_t} \alpha_j^t y_j
\end{aligned}
$$

Page 32: Model Selection via Bilevel Optimization

Outer-level Loss Functions

Misclassification Minimization Loss (MM): the loss function used in classical CV. Loss = 1 if the validation point is misclassified, 0 otherwise (computed using the step function, $(\cdot)_*$).

Hinge Loss (HL): both inner and outer levels use the same loss function. Loss = hinge distance based on $y_i(\mathbf{x}_i'\mathbf{w} - b)$ (computed using the max function, $\max(\cdot,\, 0)$).
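A short sketch of the two outer-level losses on a validation fold, given a trained $(\mathbf{w}, b)$ (the example data are made up):

```python
# Sketch: MM counts misclassifications with the step function; HL uses the hinge distance.
import numpy as np

def outer_losses(X_val, y_val, w, b):
    margins = y_val * (X_val @ w - b)
    mm = np.mean(margins < 0)                  # misclassification (step-function) loss
    hl = np.mean(np.maximum(1 - margins, 0))   # hinge loss
    return mm, hl

X_val = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
y_val = np.array([1.0, -1.0, -1.0])
print(outer_losses(X_val, y_val, w=np.array([1.0, -1.0]), b=0.0))
```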

Page 33: Model Selection via Bilevel Optimization

Hinge Loss is a Convex Approximation of the Misclassification Minimization Loss

[Plot comparing the hinge loss $\max(1 - z,\, 0)$ with the step-function loss.]

Page 34: Model Selection via Bilevel Optimization

Hinge Loss Bilevel Program (BilevelHL)

$$
\begin{aligned}
\min_{C,\,\overline{\mathbf{w}},\,\mathbf{w}^t,\,b_t} \quad & \frac{1}{T}\sum_{t=1}^{T} \frac{1}{|\Omega_t|}\sum_{i \in \Omega_t} \max\big(1 - y_i(\mathbf{x}_i'\mathbf{w}^t - b_t),\, 0\big)\\
\text{s.t.} \quad & (\mathbf{w}^t, b_t) \in \arg\min_{-\overline{\mathbf{w}} \le \mathbf{w} \le \overline{\mathbf{w}},\, b} \; C \sum_{j \in \overline{\Omega}_t} \max\big(1 - y_j(\mathbf{x}_j'\mathbf{w} - b),\, 0\big) + \tfrac{1}{2}\|\mathbf{w}\|_2^2, \quad t = 1, \ldots, T
\end{aligned}
$$

Replace the max in the outer-level objective with convex constraints:
$$
\max\big(1 - y_i(\mathbf{x}_i'\mathbf{w}^t - b_t),\, 0\big)
= \min_{z_i^t} \;\big\{\, z_i^t \;:\; z_i^t \ge 1 - y_i(\mathbf{x}_i'\mathbf{w}^t - b_t),\; z_i^t \ge 0 \,\big\}
$$

Replace the inner-level problems with their KKT conditions.

Page 35: Model Selection via Bilevel Optimization

Hinge Loss MPEC

$$
\begin{aligned}
\min_{C,\,\overline{\mathbf{w}},\,b_t,\,\mathbf{w}^t,\,\mathbf{z}^t,\,\boldsymbol{\xi}^t,\,\boldsymbol{\alpha}^t,\,\boldsymbol{\gamma}^{\pm,t}} \quad & \frac{1}{T}\sum_{t=1}^{T} \frac{1}{|\Omega_t|}\sum_{i \in \Omega_t} z_i^t\\
\text{s.t.} \quad & C \ge 0, \;\; \overline{\mathbf{w}} \ge 0, \;\text{and for } t = 1, \ldots, T:\\
& z_i^t \ge 1 - y_i(\mathbf{x}_i'\mathbf{w}^t - b_t), \quad z_i^t \ge 0, \quad i \in \Omega_t\\
& 0 \le \alpha_j^t \;\perp\; y_j(\mathbf{x}_j'\mathbf{w}^t - b_t) - 1 + \xi_j^t \ge 0\\
& 0 \le \xi_j^t \;\perp\; C - \alpha_j^t \ge 0, \quad j \in \overline{\Omega}_t\\
& 0 \le \boldsymbol{\gamma}^{+,t} \;\perp\; \overline{\mathbf{w}} - \mathbf{w}^t \ge 0\\
& 0 \le \boldsymbol{\gamma}^{-,t} \;\perp\; \overline{\mathbf{w}} + \mathbf{w}^t \ge 0\\
& 0 = \mathbf{w}^t - \sum_{j \in \overline{\Omega}_t} \alpha_j^t y_j \mathbf{x}_j + \boldsymbol{\gamma}^{+,t} - \boldsymbol{\gamma}^{-,t}\\
& 0 = \sum_{j \in \overline{\Omega}_t} \alpha_j^t y_j
\end{aligned}
$$

Page 36: Model Selection via Bilevel Optimization

Misclassification Min. Bilevel Program (BilevelMM)

$$
\begin{aligned}
\min_{C,\,\overline{\mathbf{w}},\,\mathbf{w}^t,\,b_t} \quad & \frac{1}{T}\sum_{t=1}^{T} \frac{1}{|\Omega_t|}\sum_{i \in \Omega_t} \big(\!-y_i(\mathbf{x}_i'\mathbf{w}^t - b_t)\big)_*\\
\text{s.t.} \quad & (\mathbf{w}^t, b_t) \in \arg\min_{-\overline{\mathbf{w}} \le \mathbf{w} \le \overline{\mathbf{w}},\, b} \; C \sum_{j \in \overline{\Omega}_t} \max\big(1 - y_j(\mathbf{x}_j'\mathbf{w} - b),\, 0\big) + \tfrac{1}{2}\|\mathbf{w}\|_2^2, \quad t = 1, \ldots, T
\end{aligned}
$$

Misclassifications are counted using the step function, defined componentwise for an $n$-vector $\mathbf{a}$ as
$$
(\mathbf{a}_*)_n = \begin{cases} 1, & \text{if } a_n > 0,\\ 0, & \text{if } a_n \le 0. \end{cases}
$$

Page 37: Model Selection via Bilevel Optimization

The Step Function

Mangasarian (1994) showed that
$$
\mathbf{a}_* \in \arg\min_{\boldsymbol{\zeta}} \;\big\{ -\mathbf{a}'\boldsymbol{\zeta} \;:\; 0 \le \boldsymbol{\zeta} \le \mathbf{1} \big\}
$$
and that any solution $\boldsymbol{\zeta}$ of this LP, with $\mathbf{z}$ the multiplier of the constraint $\boldsymbol{\zeta} \le \mathbf{1}$, satisfies
$$
0 \le \boldsymbol{\zeta} \;\perp\; \mathbf{z} - \mathbf{a} \ge 0, \qquad
0 \le \mathbf{z} \;\perp\; \mathbf{1} - \boldsymbol{\zeta} \ge 0.
$$
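A quick numerical check of the LP characterization, using scipy's generic LP solver on a made-up vector (a sketch, not part of the paper's method):

```python
# Sketch: minimizing -a'zeta over the box 0 <= zeta <= 1 recovers zeta_i = 1
# exactly where a_i > 0 (ties at a_i = 0 may go either way).
import numpy as np
from scipy.optimize import linprog

a = np.array([0.7, -1.2, 3.0, -0.1])
res = linprog(c=-a, bounds=[(0.0, 1.0)] * len(a), method="highs")
print("LP solution  :", np.round(res.x, 3))
print("step function:", (a > 0).astype(float))
```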

Page 38: Model Selection via Bilevel Optimization

Misclassifications in the Validation Set

A validation point is misclassified when the sign of $y_i(\mathbf{x}_i'\mathbf{w}^t - b_t)$ is negative, i.e.,
$$\big(\!-y_i(\mathbf{x}_i'\mathbf{w}^t - b_t)\big)_* = 1.$$

This can be recast for all validation points (within the $t$-th fold) as
$$
\boldsymbol{\zeta}^t \in \arg\min_{0 \le \boldsymbol{\zeta} \le \mathbf{1}} \; \sum_{i \in \Omega_t} \zeta_i\, y_i(\mathbf{x}_i'\mathbf{w}^t - b_t).
$$

Page 39: Model Selection via Bilevel Optimization

Misclassification Minimization Bilevel Program (revisited)

$$
\begin{aligned}
\min \quad & \frac{1}{T}\sum_{t=1}^{T} \frac{1}{|\Omega_t|}\sum_{i \in \Omega_t} \zeta_i^t && \text{(outer-level average misclassification minimization)}\\
\text{s.t.} \quad & \boldsymbol{\zeta}^t \in \arg\min_{0 \le \boldsymbol{\zeta} \le \mathbf{1}} \; \sum_{i \in \Omega_t} \zeta_i\, y_i(\mathbf{x}_i'\mathbf{w}^t - b_t) && \text{(inner-level problems determining misclassified validation points)}\\
& (\mathbf{w}^t, b_t) \in \arg\min_{-\overline{\mathbf{w}} \le \mathbf{w} \le \overline{\mathbf{w}},\, b} \; C \sum_{j \in \overline{\Omega}_t} \max\big(1 - y_j(\mathbf{x}_j'\mathbf{w} - b),\, 0\big) + \tfrac{1}{2}\|\mathbf{w}\|_2^2 && \text{(inner-level training problems)}
\end{aligned}
$$
for $t = 1, \ldots, T$.

Page 40: Model Selection via Bilevel Optimization

Misclassification Minimization MPEC

$$
\begin{aligned}
\min_{C,\,\overline{\mathbf{w}},\,b_t,\,\mathbf{w}^t,\,\boldsymbol{\zeta}^t,\,\mathbf{z}^t,\,\boldsymbol{\xi}^t,\,\boldsymbol{\alpha}^t,\,\boldsymbol{\gamma}^{\pm,t}} \quad & \frac{1}{T}\sum_{t=1}^{T} \frac{1}{|\Omega_t|}\sum_{i \in \Omega_t} \zeta_i^t\\
\text{s.t.} \quad & C \ge 0, \;\; \overline{\mathbf{w}} \ge 0, \;\text{and for } t = 1, \ldots, T:\\
& 0 \le \zeta_i^t \;\perp\; y_i(\mathbf{x}_i'\mathbf{w}^t - b_t) + z_i^t \ge 0\\
& 0 \le z_i^t \;\perp\; 1 - \zeta_i^t \ge 0, \quad i \in \Omega_t\\
& 0 \le \alpha_j^t \;\perp\; y_j(\mathbf{x}_j'\mathbf{w}^t - b_t) - 1 + \xi_j^t \ge 0\\
& 0 \le \xi_j^t \;\perp\; C - \alpha_j^t \ge 0, \quad j \in \overline{\Omega}_t\\
& 0 \le \boldsymbol{\gamma}^{+,t} \;\perp\; \overline{\mathbf{w}} - \mathbf{w}^t \ge 0\\
& 0 \le \boldsymbol{\gamma}^{-,t} \;\perp\; \overline{\mathbf{w}} + \mathbf{w}^t \ge 0\\
& 0 = \mathbf{w}^t - \sum_{j \in \overline{\Omega}_t} \alpha_j^t y_j \mathbf{x}_j + \boldsymbol{\gamma}^{+,t} - \boldsymbol{\gamma}^{-,t}\\
& 0 = \sum_{j \in \overline{\Omega}_t} \alpha_j^t y_j
\end{aligned}
$$

Page 41: Model Selection via Bilevel Optimization

Inexact Cross Validation NLP

Both the BilevelHL and BilevelMM MPECs are transformed into NLPs by relaxing the equilibrium constraints (inexact CV). They are solved using FILTER on NEOS and compared with classical cross validation: unconstrained and constrained grid search.

Page 42: Model Selection via Bilevel Optimization

Experiments: Data Sets

3-fold cross validation for model selection. Results are averaged over 20 train/test splits.

Page 43: Model Selection via Bilevel Optimization

Computational Time

[Bar chart comparing Unconstrained Grid, Constrained Grid, Bilevel-MisMin, and Bilevel-Hinge on the pima, cancer, heart, iono, bright, and dim data sets.]

Page 44: Model Selection via Bilevel Optimization

Training CV Error

[Bar chart comparing Unconstrained Grid, Constrained Grid, Bilevel-MisMin, and Bilevel-Hinge on the pima, cancer, heart, iono, bright, and dim data sets.]

Page 45: Model Selection via Bilevel Optimization

Testing Error

[Bar chart comparing Unconstrained Grid, Constrained Grid, Bilevel-MisMin, and Bilevel-Hinge on the pima, cancer, heart, iono, bright, and dim data sets.]

Page 46: Model Selection via Bilevel Optimization

Number of Variables

[Bar chart comparing Unconstrained Grid, Constrained Grid, Bilevel-MisMin, and Bilevel-Hinge on the pima, cancer, heart, iono, bright, and dim data sets.]

Page 47: Model Selection via Bilevel Optimization

Progress

Cross validation is a bilevel problem solvable by continuous optimization methods.
An off-the-shelf NLP algorithm, FILTER, solved the classification and regression problems.
Bilevel optimization is extendable to many machine learning problems.

Page 48: Model Selection via Bilevel Optimization

Extending the Bilevel Approach to Other Machine Learning Problems

Kernel classification/regression
Variable selection/scaling
Multi-task learning
Semi-supervised learning
Generative methods

Page 49: Model Selection via Bilevel Optimization

Semi-supervised Learning

Have labeled data $\Omega = \{(\mathbf{x}_i, y_i)\}_{i=1}^{\ell}$ and unlabeled data $\Psi = \{\mathbf{x}_u\}_{u=1}^{m}$.
Treat the missing labels, $z_u$, as design variables in the outer level.
The lower-level problems are still convex.

Page 50: Model Selection via Bilevel Optimization

Semi-supervised Regression

$$
\begin{aligned}
\min_{C,\,\varepsilon,\,D,\,\mathbf{z}} \quad & \sum_{i \in \Omega} \max\big(|\mathbf{x}_i'\mathbf{w} + b - y_i| - \varepsilon,\, 0\big)\\
\text{s.t.} \quad & (\mathbf{w}, b) \in \arg\min_{\mathbf{w},\,b} \;
C \sum_{j \in \Omega} \max\big(|\mathbf{x}_j'\mathbf{w} + b - y_j| - \varepsilon,\, 0\big)
+ D \sum_{u \in \Psi} \max\big(|\mathbf{x}_u'\mathbf{w} + b - z_u| - \varepsilon,\, 0\big)
+ \tfrac{1}{2}\|\mathbf{w}\|_2^2
\end{aligned}
$$

The outer level minimizes the error on the labeled data to find the optimal parameters and labels.
The inner level contains the $\varepsilon$-insensitive loss on the labeled data, the $\varepsilon$-insensitive loss on the unlabeled data (with the guessed labels $z_u$), and the regularization.
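A minimal sketch of the inner-level problem above, for fixed hyperparameters $(C, D, \varepsilon)$ and a fixed guess of the missing labels $z$ (which are outer-level design variables in the bilevel program). The formulation follows the reconstruction above and the data are made up:

```python
# Sketch: convex inner-level semi-supervised regression problem in CVXPY.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
X_lab, y = rng.standard_normal((20, 3)), rng.standard_normal(20)   # labeled data
X_unl = rng.standard_normal((10, 3))                                # unlabeled data
z = np.zeros(10)                           # guessed labels (outer-level variables)
C, D, eps = 1.0, 0.5, 0.1

w, b = cp.Variable(3), cp.Variable()
lab_loss = cp.sum(cp.pos(cp.abs(X_lab @ w + b - y) - eps))   # eps-insensitive, labeled
unl_loss = cp.sum(cp.pos(cp.abs(X_unl @ w + b - z) - eps))   # eps-insensitive, unlabeled
cp.Problem(cp.Minimize(C * lab_loss + D * unl_loss + 0.5 * cp.sum_squares(w))).solve()
print("w:", np.round(w.value, 3), " b:", round(float(b.value), 3))
```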

Page 51: Model Selection via Bilevel Optimization

Discussion

This new capacity offers new possibilities: outer-level objectives? inner-level problems? Classification, ranking, semi-supervised learning, missing values, kernel selection, variable selection, ...
Special-purpose algorithms are needed for greater efficiency, scalability, and robustness.

This work was supported by Office of Naval Research Grant N00014-06-1-0014.

Page 52: Model Selection via Bilevel Optimization

Experiments: Bilevel CV Procedure

Run BilevelMM/BilevelHL to compute the optimal parameters $(\hat{C}, \hat{\overline{\mathbf{w}}})$.
Drop descriptors with small components of $\hat{\overline{\mathbf{w}}}$.
Create the model $(\hat{\mathbf{w}}, \hat{b})$ on all training data using $(\hat{C}, \hat{\overline{\mathbf{w}}})$.
Compute the test error on the hold-out set:
$$
\mathrm{ERROR} = \frac{1}{|\mathrm{test}|}\sum_{(\mathbf{x}, y) \in \mathrm{test}} \tfrac{1}{2}\,\big| y - \operatorname{sign}(\mathbf{x}'\hat{\mathbf{w}} - \hat{b}) \big|
$$
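The hold-out error above is just the 0/1 error written with the sign function; a short sketch with made-up data:

```python
# Sketch: ERROR = (1/|test|) * sum over (x, y) in test of (1/2)|y - sign(x'w_hat - b_hat)|.
import numpy as np

def holdout_error(X_test, y_test, w_hat, b_hat):
    preds = np.sign(X_test @ w_hat - b_hat)
    return 0.5 * np.mean(np.abs(y_test - preds))   # 0/1 error since y, preds are +/-1

X_test = np.array([[1.0, 2.0], [-1.0, 0.5], [0.5, -2.0]])
y_test = np.array([1.0, -1.0, -1.0])
print(holdout_error(X_test, y_test, w_hat=np.array([1.0, 0.0]), b_hat=0.0))
```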

Page 53: Model Selection via Bilevel Optimization

Experiments: Grid Search CV Procedure

Unconstrained Grid: try 6 values for $C$ on a log10 scale.
Constrained Grid: try 6 values for $C$ and {0, 1} for each component of $\overline{\mathbf{w}}$ (perform RFE if necessary).
Create the model $(\hat{\mathbf{w}}, \hat{b})$ on all training data using the optimal grid point.
Compute the test error on the hold-out set:
$$
\mathrm{ERROR} = \frac{1}{|\mathrm{test}|}\sum_{(\mathbf{x}, y) \in \mathrm{test}} \tfrac{1}{2}\,\big| y - \operatorname{sign}(\mathbf{x}'\hat{\mathbf{w}} - \hat{b}) \big|
$$

Page 54: Model Selection via Bilevel Optimization

Extending the Bilevel Approach to Other Machine Learning Problems

Kernel classification/regression
Different regularizations (L1, elastic nets)
Enhanced feature selection
Multi-task learning
Semi-supervised learning
Generative methods

Page 55: Model Selection via Bilevel Optimization

Enhanced Feature Selection

Assume at most $n_{\max}$ descriptors are allowed.
Introduce the outer-level constraint $\|\overline{\mathbf{w}}\|_0 \le n_{\max}$ (with $\|\cdot\|_0$ counting the non-zero elements of $\overline{\mathbf{w}}$).
Rewrite the constraint, observing that $\|\overline{\mathbf{w}}\|_0 = \mathbf{1}'\,\overline{\mathbf{w}}_*$.
Get additional conditions for $\boldsymbol{\delta} = \overline{\mathbf{w}}_*$:
$$
0 \le \boldsymbol{\delta} \;\perp\; \mathbf{d} - \overline{\mathbf{w}} \ge 0, \qquad
0 \le \mathbf{d} \;\perp\; \mathbf{1} - \boldsymbol{\delta} \ge 0, \qquad
\mathbf{1}'\boldsymbol{\delta} \le n_{\max}.
$$

Page 56: Model Selection via Bilevel Optimization

Kernel Bilevel Discussion

Pros: performs model selection in feature space; performs feature selection in input space.
Cons: highly nonlinear model; difficult to solve.

Page 57: Model Selection via Bilevel Optimization

Kernel Classification (MPEC form)

$$
\begin{aligned}
\min \quad & \frac{1}{T}\sum_{t=1}^{T} \frac{1}{|\Omega_t|}\sum_{i \in \Omega_t} \zeta_i^t\\
\text{s.t.} \quad & C \ge 0, \;\; \mathbf{p} \ge 0, \;\text{and for } t = 1, \ldots, T:\\
& 0 \le \zeta_i^t \;\perp\; y_i\Big(\textstyle\sum_{k \in \overline{\Omega}_t} \alpha_k^t y_k\, \kappa(\mathbf{x}_k, \mathbf{x}_i; \mathbf{p}) - b_t\Big) + z_i^t \ge 0\\
& 0 \le z_i^t \;\perp\; 1 - \zeta_i^t \ge 0, \quad i \in \Omega_t\\
& 0 \le \alpha_j^t \;\perp\; y_j\Big(\textstyle\sum_{k \in \overline{\Omega}_t} \alpha_k^t y_k\, \kappa(\mathbf{x}_k, \mathbf{x}_j; \mathbf{p}) - b_t\Big) - 1 + \xi_j^t \ge 0\\
& 0 \le \xi_j^t \;\perp\; C - \alpha_j^t \ge 0, \quad j \in \overline{\Omega}_t\\
& 0 = \sum_{j \in \overline{\Omega}_t} \alpha_j^t y_j
\end{aligned}
$$

Page 58: Model Selection via Bilevel Optimization

Is it okay to do 3 folds?

Page 59: Model Selection via Bilevel Optimization

Applying the "kernel trick"

Drop the box constraint $-\overline{\mathbf{w}} \le \mathbf{w} \le \overline{\mathbf{w}}$.
Eliminate $\mathbf{w}$ from the optimality conditions:
$$
\begin{aligned}
& 0 \le \zeta_i^t \;\perp\; y_i\Big(\textstyle\sum_{k \in \overline{\Omega}_t} \alpha_k^t y_k\, \mathbf{x}_k'\mathbf{x}_i - b_t\Big) + z_i^t \ge 0\\
& 0 \le z_i^t \;\perp\; 1 - \zeta_i^t \ge 0, \quad i \in \Omega_t\\
& 0 \le \alpha_j^t \;\perp\; y_j\Big(\textstyle\sum_{k \in \overline{\Omega}_t} \alpha_k^t y_k\, \mathbf{x}_k'\mathbf{x}_j - b_t\Big) - 1 + \xi_j^t \ge 0\\
& 0 \le \xi_j^t \;\perp\; C - \alpha_j^t \ge 0, \quad j \in \overline{\Omega}_t\\
& 0 = \sum_{j \in \overline{\Omega}_t} \alpha_j^t y_j
\end{aligned}
$$
Replace each inner product $\mathbf{x}_k'\mathbf{x}$ with an appropriate kernel $\kappa(\mathbf{x}_k, \mathbf{x})$.

Page 60: Model Selection via Bilevel Optimization

Feature Selection with Kernels

Parameterize the kernel with $\mathbf{p} \ge 0$ such that if $p_n = 0$, the $n$-th descriptor vanishes from $\kappa(\mathbf{x}, \mathbf{x}_k; \mathbf{p})$.

Linear kernel: $\kappa(\mathbf{x}, \mathbf{x}_k; \mathbf{p}) = \mathbf{x}'\,\mathrm{diag}(\mathbf{p})\,\mathbf{x}_k$

Polynomial kernel: $\kappa(\mathbf{x}, \mathbf{x}_k; \mathbf{p}) = \big(\mathbf{x}'\,\mathrm{diag}(\mathbf{p})\,\mathbf{x}_k + c\big)^d$

Gaussian kernel: $\kappa(\mathbf{x}, \mathbf{x}_k; \mathbf{p}) = \exp\big(-(\mathbf{x} - \mathbf{x}_k)'\,\mathrm{diag}(\mathbf{p})\,(\mathbf{x} - \mathbf{x}_k)\big)$
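A minimal sketch of these parameterized kernels; setting a component of p to zero removes that descriptor from every kernel evaluation. The constants c and d and the example vectors are illustrative assumptions:

```python
# Sketch: kernels parameterized by p >= 0 for feature selection.
import numpy as np

def linear_kernel(x, xk, p):
    return x @ (p * xk)                          # x' diag(p) x_k

def poly_kernel(x, xk, p, c=1.0, d=3):
    return (x @ (p * xk) + c) ** d               # (x' diag(p) x_k + c)^d

def gaussian_kernel(x, xk, p):
    diff = x - xk
    return np.exp(-diff @ (p * diff))            # exp(-(x-x_k)' diag(p) (x-x_k))

x, xk = np.array([1.0, 2.0, 3.0]), np.array([0.5, -1.0, 2.0])
p = np.array([1.0, 0.0, 2.0])                    # second descriptor switched off
print(linear_kernel(x, xk, p), poly_kernel(x, xk, p), gaussian_kernel(x, xk, p))
```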

Page 61: Model Selection via Bilevel Optimization

Kernel Regression (Bilevel form)

$$
\begin{aligned}
\min_{C,\,\varepsilon,\,\mathbf{p},\,\boldsymbol{\alpha}^t} \quad & \frac{1}{T}\sum_{t=1}^{T} \frac{1}{|\Omega_t|}\sum_{i \in \Omega_t} \Big|\sum_{k \in \overline{\Omega}_t} \alpha_k^t\, \kappa(\mathbf{x}_k, \mathbf{x}_i; \mathbf{p}) - y_i\Big|\\
\text{s.t.} \quad & \boldsymbol{\alpha}^t \in \arg\min_{\boldsymbol{\alpha}} \;
C \sum_{j \in \overline{\Omega}_t} \max\Big(\Big|\sum_{k \in \overline{\Omega}_t} \alpha_k\, \kappa(\mathbf{x}_k, \mathbf{x}_j; \mathbf{p}) - y_j\Big| - \varepsilon,\, 0\Big)
+ \tfrac{1}{2} \sum_{k,\,l \in \overline{\Omega}_t} \alpha_k \alpha_l\, \kappa(\mathbf{x}_k, \mathbf{x}_l; \mathbf{p}),
\quad t = 1, \ldots, T
\end{aligned}
$$
