Transcript
Page 1: Margin-Sparsity Trade-off for the Set Covering Machine, ECML 2005. François Laviolette (Université Laval), Mario Marchand (Université Laval), Mohak Shah (Université d’Ottawa)

Margin-Sparsity Trade-off for the Set Covering Machine

ECML 2005

François Laviolette (Université Laval)

Mario Marchand (Université Laval)

Mohak Shah (Université d’Ottawa)

Page 2:

PLAN

Margin-sparsity trade-off for sample-compressed classifiers

The “classical” Set Covering Machine (Classical-SCM): definition; tight risk bound and model selection; the learning algorithm

The modified Set Covering Machine (SCM2): definition; a non-trivial margin-sparsity trade-off expressed by the risk bound; the learning algorithm; empirical results

Conclusions

Page 3:

The Sample Compression Framework

In the sample compression setting, each classifier is identified by two different sources of information:

The compression set: an (ordered) subset of the training set

A message string of additional information needed to identify a classifier

To be more precise: in the sample compression setting, there exists a “reconstruction” function R that gives a classifier h = R(σ, S_i) when given a compression set S_i and a message string σ.
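As a minimal sketch (the example, the names, and the 1-D threshold classifier are my own, not from the talk), a reconstruction function R(σ, S_i) can be illustrated like this:

```python
def reconstruct(message, compression_set):
    """Toy reconstruction function R(sigma, S_i): the classifier is a 1-D
    threshold, the compression set holds a single example (x, y), and the
    message string sigma is one bit giving the direction of the inequality."""
    (x0, _), = compression_set         # the single compressed training example
    if message == "0":                 # sigma = '0': positive on x <= x0
        return lambda x: int(x <= x0)
    return lambda x: int(x >= x0)      # sigma = '1': positive on x >= x0

# The same compression set yields different classifiers for different messages.
h = reconstruct("1", [(3.0, 1)])
```

The point of the sketch is that the classifier is fully determined by the pair (message string, compression set), which is exactly what the risk bounds below count.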

Page 4:

The Sample Compression Framework (2)

The examples are assumed i.i.d.

The risk (or generalization error) of a classifier h, denoted R(h), is the probability that h misclassifies a new example.

The empirical risk on a training set S, denoted R_S(h), is the frequency of errors of h on S.
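The empirical risk is straightforward to state in code; a brief sketch (the function name is mine, not from the talk):

```python
def empirical_risk(h, S):
    """Empirical risk R_S(h): the frequency of errors of the classifier h
    on the training sample S, given as a list of (x, y) pairs."""
    return sum(h(x) != y for x, y in S) / len(S)

# The true risk R(h) is the expectation of the same 0-1 loss under the
# (unknown) i.i.d. distribution that generates the examples.
```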

Page 5:

Examples of sample-compressed classifiers

Set Covering Machines (SCM) [Marchand and Shawe-Taylor JMLR 2002]

Decision List Machines (DLM) [Marchand and Sokolova JMLR 2005]

Support Vector Machines (SVM)

Page 6:

Margin-Sparsity trade-off

There is a widespread belief that, in the sample compression setting, learning algorithms should somehow try to find a non-trivial margin-sparsity trade-off.

SVMs are looking for margin. Some effort has been made to find sparser SVMs (Bennett (1999), and Bi et al. (2003)), but this seems to be a difficult task.

SCMs are looking for sparsity. Forcing a classifier that is a conjunction of “geometric” Boolean features to have no training example within a given distance of its decision surface seems a much easier task.

Moreover, we will see that in our setting, both sparsity and margin can be considered as different forms of data compression.

Page 7:

The “Classical” Set Covering Machine (Marchand and Shawe-Taylor 2002)

Construct the “smallest possible” conjunction of (Boolean-valued) features.

Each feature h is a ball identified by two training examples (the center (x_c, y_c) and the border point (x_b, y_b)) and defined for any input example x as: h(x) = y_c if d(x, x_c) ≤ d(x_c, x_b), and ¬y_c otherwise.

(Dually, one can construct the “smallest possible” disjunction of features, but we will only consider the conjunction case in this talk.)
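The ball feature and the SCM conjunction can be sketched as follows, under the standard Marchand and Shawe-Taylor definition (labels are taken as 0/1; the helper names are mine):

```python
import math

def ball_feature(x, center, border, center_label):
    """Boolean-valued ball feature identified by two training examples:
    a center (x_c, y_c) and a border point (x_b, y_b).  Outputs y_c when
    x lies within distance d(x_c, x_b) of the center, else the opposite."""
    radius = math.dist(center, border)
    return center_label if math.dist(x, center) <= radius else 1 - center_label

def scm_predict(x, balls):
    """'Classical' SCM: a conjunction of ball features.  `balls` is a list
    of (center, border, center_label) triples; the output is 1 only if
    every feature outputs 1."""
    return int(all(ball_feature(x, c, b, y) == 1 for c, b, y in balls))
```

Sparsity here means using as few triples in `balls` as possible, since each one consumes two training examples of the compression set.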

Page 8:

An Example of a “Classical”-SCM

[Figure: a scatter of “+” and “-” training examples in the plane, on which balls are drawn to cover the “-” examples.]

Pages 9 to 14: [The same “Classical”-SCM example figure, repeated as the ball cover is built up one step at a time.]

Page 15:

But the SCM is looking for sparsity!!

[Figure: the same “+”/“-” scatter, now covered by far fewer balls.]

Page 16: [The same figure, continued.]

Page 17:

A risk bound

Page 18:

For “classical” SCMs

If we choose the following prior:

Then Corollary 1 becomes:

which almost expresses a symmetry between k and d, because the term ln P_M(Z_i)(σ) is small for the “classical” SCMs compared to the other terms, and the same holds for ln(d+1).

Page 19:

Model selection by the bound

Empirical results showed that choosing the SCM that minimizes this risk bound is a slightly better model-selection strategy than the cross-validation approach.

The reasons are not totally clear: this bound is tight, and there is a symmetry between d and k (?).

Page 20:

A Learning Algorithm for the “Classical” SCM

Ideally, we would like to find an SCM that minimizes the risk bound.

Unfortunately, this is NP-hard (at least).

We will therefore use a greedy heuristic based on the following observation.

Page 21:

Adding one ball at a time, a classification error on a “+” example cannot be fixed by adding other balls; but for a “-” example it is possible.

[Figure: scatter of “+” and “-” examples illustrating this asymmetry of the conjunction.]

Page 22:

A Learning Algorithm for the “Classical” SCM (Marchand and Shawe-Taylor 2002)

Define a list p1, p2, …, pl of values of the learning parameter p, and for each such p DO STEP 1.

STEP 1: Suppose i balls (B_{p,0}, B_{p,1}, …, B_{p,i-1}) have already been constructed by the algorithm.

UNTIL every “-” example is correctly assigned by the SCM (B_{p,0}, B_{p,1}, …, B_{p,i-1}) DO: choose a new ball B_{p,i} that maximizes q_i − p · r_i, where

q_i is the number of “-” examples correctly assigned by B_{p,i} but not correctly assigned by the SCM (B_{p,0}, B_{p,1}, …, B_{p,i-1}), and

r_i is the number of “+” examples not correctly assigned by B_{p,i} but correctly assigned by the SCM (B_{p,0}, B_{p,1}, …, B_{p,i-1}).
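The greedy step above can be sketched as follows. This is a simplified rendering under my own data representation (each candidate ball is summarized by the set of “-” examples it assigns correctly and the set of “+” examples it misclassifies), not the authors' implementation:

```python
def greedy_scm(balls, negatives_covered, positives_misclassified, p):
    """Greedy set-cover heuristic: add one ball at a time until every '-'
    example is covered, each time picking the ball maximizing q_i - p * r_i,
    where q_i counts newly covered '-' examples and r_i counts newly
    misclassified '+' examples."""
    all_negatives = set().union(*negatives_covered)
    chosen, covered, lost_positives = [], set(), set()
    while covered != all_negatives:
        def score(i):
            q = len(negatives_covered[i] - covered)                  # new '-' covered
            r = len(positives_misclassified[i] - lost_positives)     # new '+' lost
            return q - p * r
        best = max(range(len(balls)), key=score)
        if score(best) <= 0:       # no remaining ball helps; stop early
            break
        chosen.append(balls[best])
        covered |= negatives_covered[best]
        lost_positives |= positives_misclassified[best]
    return chosen
```

The parameter p plays its role in the score: a large p makes the heuristic reluctant to accept balls that sacrifice “+” examples for coverage of “-” examples.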

Page 23:

A Learning Algorithm for the “Classical” SCM (continued)

Among the SCMs constructed above, OUTPUT the one that has the best risk bound.

Note: the algorithm can be adapted to a cross-validation approach.

Page 24:

SCM2: an SCM with radii coded by message strings

In the “classical” SCM, centers and radii are defined by examples of the training set.

An alternative: code each radius value by a message string (but still use examples of the training set to define the centers).

Objective: construct the “smallest possible” conjunction of balls, each of which has the “smallest possible” number of bits in the message string that defines its radius.

Page 25:

What kind of radius can be described with l bits?

Let us choose a scale R.

Then with l = 0 bits, we can define the radius R/2.

[Figure: a ball of scale R around a “+” center, with the radius R/2 marked.]

Page 26:

What kind of radius can be described with l bits?

Let us choose a scale R.

Then with l = 1 bit, we can define the radii R/4 and 3R/4.

[Figure: the same ball, with the radii R/4 and 3R/4 marked.]

Page 27:

What kind of radius can be described with l bits?

Let us choose a scale R.

Then with l = 2 bits, we can define the radii R/8, 3R/8, 5R/8 and 7R/8.

[Figure: the same ball, with the radii R/8, 3R/8, 5R/8 and 7R/8 marked.]

Page 28:

More precisely

Given some parameter R (which will be our scale), the radius of any ball of an SCM2 will be coded by a pair (l, s) such that 0 < 2s − 1 < 2^(l+1).

The code (l, s) means that the radius of the ball is (2s − 1)R / 2^(l+1).

Note that l is the number of bits of the radius.

Thus the possible radius values for l = 2 are R/8, 3R/8, 5R/8 and 7R/8.
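The coding scheme can be checked in a few lines (function names are mine; the decoding formula follows the slide's definition):

```python
def radius_from_code(l, s, R):
    """Decode a radius message (l, s) at scale R: the radius is the odd
    multiple (2s - 1) of R / 2**(l + 1), subject to 0 < 2s - 1 < 2**(l + 1)."""
    assert 0 < 2 * s - 1 < 2 ** (l + 1), "invalid (l, s) code"
    return (2 * s - 1) * R / 2 ** (l + 1)

def all_radii(l, R):
    """All radii expressible with exactly l bits at scale R (2**l values)."""
    return [radius_from_code(l, s, R) for s in range(1, 2 ** l + 1)]

all_radii(2, 1.0)  # [0.125, 0.375, 0.625, 0.875], i.e. R/8, 3R/8, 5R/8, 7R/8
```

Each extra bit doubles the number of admissible radii, so a ball with a large margin around its decision surface can get away with a short radius code: this is the sense in which margin is itself a form of data compression.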

Page 29:

Observe that if we have a large margin, then among all the “interesting” balls there will be one whose radius code (l, s) uses a small number of bits.

[Figure: a “+” cluster and a “-” cluster separated by a large margin, within which several candidate ball radii fit.]

Page 30:

For SCM2

If we choose the following priors:

Then Corollary 1 becomes:

which expresses a non-trivial margin-sparsity trade-off!

The learning algorithm is similar to the classical one, except that it needs two extra learning parameters: R, and the maximum number of bits allowed per message string (denoted l*).

Page 31:

Empirical results

SVMs and SCMs on UCI data sets. We observe:

For SCMs, model selection by the bound is almost always better than by cross-validation.

SCM2 is almost always better than SCM1.

SCM2 tends to produce more balls than SCM1; hence SCM2 sacrifices sparsity to obtain a larger margin.

Page 32:

Conclusion

We have proposed:

A new representation for the SCM that uses two distinct sources of information: a compression set to represent the centers of the balls, and a message string to encode the radius value of each ball.

A general data-compression risk bound that depends explicitly on these two information sources. It exhibits a non-trivial trade-off between sparsity (the inverse of the compression set size) and the margin (the inverse of the message length), and seems to be an effective guide for choosing the proper margin-sparsity trade-off of a classifier.

