Page 1: Recent Development in Generalization Error for Supervised ...pages.cpsc.ucalgary.ca/~rokne/stuff/MOVIE/Microsoft PowerPoint... · A New Look at GEM Training Set Q-Union Entire Input

Recent Development in Generalization Error for Supervised Learning Problems

with Applications in Model and Feature Selection

Calgary University, 22 April 2008

Daniel S. Yeung, Wing W. Y. Ng

IEEE Systems, Man & Cybernetics Society, USA; Machine Learning & Cybernetics Research Institute, Hong Kong

MiLeS Computing Lab Shenzhen Graduate School, Harbin Institute of Technology

Page 2

Presentation Outline

• Generalization Error Model (GEM)
• A New Look at the GEM
• Applications
  • Neural Network Architecture Selection
  • Feature Selection for Supervised Learning

Page 3

Pattern Classification Problem

• The artificial dataset 1

[Figure: Artificial Dataset 1 — training samples in the input space; o: training sample in Class 1, x: training sample in Class 2]

Page 4

Pattern Classification Problem

[Figure: Artificial Dataset 1 — training samples in the input space (o: training sample in Class 1, x: training sample in Class 2), together with future unknown samples shown as red points]

How good is the classifier when future unseen samples are presented to it?

Page 5

Pattern Classification Problem

• Train a classifier to approximate the unknown input-output system using the training dataset.
• For neural networks, this is usually done by minimizing the MSE between the network outputs and the desired outputs on the training dataset --- the Training Error.
• R_emp: the average error over the finite training dataset

  R_emp = (1/l) ∑_{b=1}^{l} ( f(X_b) − F(X_b) )²    (1)

  where l, f and F denote the number of training samples, the classifier output and the desired output respectively.

V. Vapnik, "Statistical Learning Theory", Wiley, 1998

R.O. Duda, P.E. Hart and D.G. Stork, "Pattern Classification", Wiley, 2001
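As a minimal sketch (not from the talk), Eq. (1) is just the mean of squared output differences over the training set. `f_out` and `F_out` are hypothetical stand-ins for the classifier outputs f(X_b) and desired outputs F(X_b):

```python
def empirical_error(f_out, F_out):
    """Eq. (1): R_emp = (1/l) * sum_b (f(X_b) - F(X_b))^2 over the l training samples."""
    l = len(f_out)
    return sum((f - F) ** 2 for f, F in zip(f_out, F_out)) / l

# Toy example: 4 training samples with desired outputs in {0, 1}.
r_emp = empirical_error([0.1, 0.9, 0.2, 0.8], [0, 1, 0, 1])
```

A perfectly fitted training set gives R_emp = 0, which, as the following slides stress, says nothing yet about unseen samples.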

Page 6

Pattern Classification Problem

• Instead of minimizing the training error R_emp only, the ultimate goal of classifier training is to correctly predict the class / category of future unseen samples.
• R_true: the expected error over all samples

  R_true = ∫ ( f(X) − F(X) )² p(X) dX    (2)

• R_gen: the generalization error for unseen samples

  R_gen = R_true − R_emp    (3)

• R_gen is not computable; it can only be estimated by
  • Empirical Methods
  • Analytical Models

V. Vapnik, "Statistical Learning Theory", Wiley, 1998

R.O. Duda, P.E. Hart and D.G. Stork, "Pattern Classification", Wiley, 2001
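Eq. (2) can be approximated by sampling from p(X). A sketch assuming a toy 1-D problem with p(X) = Uniform(0, 1) and a hypothetical threshold classifier (nothing here comes from the talk's experiments):

```python
import random

def f(x):   # toy classifier: a threshold rule fitted to the training data
    return 1.0 if x > 0.5 else 0.0

def F(x):   # the true (unknown) input-output system
    return 1.0 if x > 0.4 else 0.0

random.seed(0)
train = [random.random() for _ in range(20)]
# Eq. (1): average error over the finite training set.
r_emp = sum((f(x) - F(x)) ** 2 for x in train) / len(train)

# Eq. (2) approximated by Monte Carlo over p(X) = Uniform(0, 1);
# the classifier is wrong exactly on the interval (0.4, 0.5].
big = [random.random() for _ in range(100_000)]
r_true = sum((f(x) - F(x)) ** 2 for x in big) / len(big)

r_gen = r_true - r_emp   # Eq. (3)
```

The estimate of R_true lands near 0.1 (the mass of the misclassified interval), while R_emp depends only on which training points happened to fall there.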

Page 7

GEM

• R_gen is not computable, only estimated
  • Empirical Models
    • K-fold Cross-Validation (CV)
    • Leave-One-Out Cross-Validation (LOOCV)
  • Analytical Models
    • Information Criteria
    • VC-Dimension based

Page 8

K-fold CV and LOOCV

• Main advantage of CV -- using test points with known classification outputs
• Major drawbacks
  • Selection of K, usually K = 5 or 10
  • Very time-consuming
  • Variance of the CV errors
    • The classifier architecture yielding the lowest average CV error may not lead to the individual classifier yielding the lowest generalization error
  • Provides an estimate of the average R_true instead of an upper bound

T. Hastie et al, "The Elements of Statistical Learning", Springer, 2001
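The K-fold procedure above can be sketched as follows; the nearest-class-mean learner is a hypothetical stand-in for the slides' RBFNN:

```python
def kfold_indices(n, k):
    """Split sample indices 0..n-1 into k contiguous folds."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cv_error(X, y, k, train_fn, err_fn):
    """Average validation error over k folds (the CV estimate of R_true)."""
    errs = []
    for fold in kfold_indices(len(X), k):
        hold = set(fold)
        Xtr = [x for i, x in enumerate(X) if i not in hold]
        ytr = [t for i, t in enumerate(y) if i not in hold]
        model = train_fn(Xtr, ytr)
        errs.append(err_fn(model, [X[i] for i in fold], [y[i] for i in fold]))
    return sum(errs) / k

# Stand-in learner: store the class means; predict the class of the nearer mean.
def train_fn(X, y):
    m0 = sum(x for x, t in zip(X, y) if t == 0) / max(1, sum(1 for t in y if t == 0))
    m1 = sum(x for x, t in zip(X, y) if t == 1) / max(1, sum(1 for t in y if t == 1))
    return (m0, m1)

def err_fn(model, X, y):
    m0, m1 = model
    pred = [0 if abs(x - m0) <= abs(x - m1) else 1 for x in X]
    return sum(p != t for p, t in zip(pred, y)) / len(y)

X = [0.0, 0.1, 0.2, 0.3, 0.7, 0.8, 0.9, 1.0]
y = [0, 0, 0, 0, 1, 1, 1, 1]
err = cv_error(X, y, 4, train_fn, err_fn)
```

Note the drawback listed above is visible in the structure: `train_fn` runs k times per candidate architecture, which is what makes full CV-based model selection so time-consuming.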

Page 9

Analytical Models

• Analytical models give a bound or an estimate for R_gen
• According to the Bias and Variance Dilemma, the MSE decomposes into a squared bias term and a variance term.
  • The squared bias term describes how well the classifier approximates the real input-output mapping.
  • The variance term describes how complex the classifier is.
• The best-generalizing classifier strikes a good balance between bias and variance.

S. Geman et al, "Neural Networks and the Bias/Variance Dilemma", Neural Computation, 1992.
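The decomposition can be checked numerically. A sketch assuming a toy estimator (the sample mean of noisy observations at one fixed input point) rather than a neural network:

```python
import random

random.seed(1)
F_true = 0.7                      # desired output at a fixed input point
n_train, n_repeats = 10, 5_000

# Each simulated training set yields one fitted output: the mean of n noisy draws.
outputs = []
for _ in range(n_repeats):
    sample = [F_true + random.gauss(0, 0.5) for _ in range(n_train)]
    outputs.append(sum(sample) / n_train)

mean_out = sum(outputs) / n_repeats
mse      = sum((o - F_true) ** 2 for o in outputs) / n_repeats
bias_sq  = (mean_out - F_true) ** 2
variance = sum((o - mean_out) ** 2 for o in outputs) / n_repeats
# The identity MSE = bias^2 + variance holds exactly for these empirical averages.
```

Here the estimator is unbiased, so nearly all of the MSE is variance; a more complex model would typically shrink the bias term while inflating the variance term.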

Page 10

Analytical Models

• Analytical models give an estimate or an upper bound for R_gen
• Recall that R_true = R_gen + R_emp
• Training classifiers:
  • Minimize only R_emp (slides 12-14)
  • Minimize the complexity of the classifier (h) in the analytical models while fixing the other terms for a given training dataset (slides 15-17)
  • More desirable to minimize both R_emp and h (slides 18-19)

Page 11

Analytical Models

• The artificial dataset 1

[Figure: Artificial Dataset 1 — training samples in the input space; o: training sample in Class 1, x: training sample in Class 2]

Page 12

Analytical Models

• Classifier trained by minimizing R_emp only
  • Easily over-fits; high R_gen

[Figure: Artificial Dataset 1 — an over-fitted classifier boundary wrapped around the training samples (o: Class 1, x: Class 2), shown with unseen testing samples]

Page 13

Analytical Models

• Classifier trained by minimizing R_emp only
  • Easily over-fits; high R_gen

[Figure: the over-fitted classifier boundary on Artificial Dataset 1, with unseen testing samples falling on the wrong side]

Page 14

Analytical Models

• Classifier trained by minimizing R_emp only
  • Easily over-fits; high R_gen

[Figure: the over-fitted classifier boundary on Artificial Dataset 1, continued]

Page 15

Analytical Models

• Classifier trained by minimizing complexity only
  • Easily under-fits by over-minimizing the complexity

[Figure: Artificial Dataset 1 — an overly simple classifier boundary that under-fits the training samples]

Page 16

Analytical Models

• Classifier trained by minimizing h only
  • Easily under-fits by over-minimizing the complexity (h)

[Figure: Artificial Dataset 1 — the under-fitted classifier boundary, with unseen testing samples]

Page 17

Analytical Models

• Classifier trained by minimizing h only
  • Easily under-fits by over-minimizing the complexity (h)

[Figure: the under-fitted classifier boundary on Artificial Dataset 1, continued]

Page 18

Analytical Models

• Classifier trained by minimizing both R_emp and h

[Figure: Artificial Dataset 1 — a balanced classifier boundary obtained by minimizing both R_emp and h]

Page 19

Analytical Models

• Classifier trained by minimizing both R_emp and h

[Figure: the balanced classifier boundary on Artificial Dataset 1, shown with unseen testing samples]

Page 20

A New Look at the Generalization Error Model (GEM)

• Predicting samples far away from the training samples – the classification result is meaningless or misleading
• Question: Better to ignore them?

D. Chakraborty and N. Pal, "A Novel Training Scheme for Multilayer Perceptrons to Realize Proper Generalization and Incremental Learning", IEEE Trans. NN, 2003

B. Scholkopf et al, "Estimating the Support of a High-Dimensional Distribution", Neural Computation, 2001

Page 21

A New Look at GEM

• Many applications assume that unseen samples similar to the training samples are more relevant
  • Tumor recognition in medical images
  • Disease diagnosis
  • Fingerprint recognition
  • Speaker recognition via speech
  • Pattern-based financial time series prediction
  • Web site / page categorization
  • Etc.

Page 22

A New Look at GEM

• Recall Eq. (2):

  R_true = ∫ ( f(X) − F(X) )² p(X) dX    (2)

• Assume unseen samples very different from the training samples are to be ignored. Find a bound on the generalization error for unseen samples close to the training samples only, i.e. within a neighborhood S_Q. (Note: SM denotes Sensitivity Measure.)

• Relationship between R_true and R_SM:

  R_true = ∫_{X∈S_Q} ( f(X) − F(X) )² p(X) dX + ∫_{X∉S_Q} ( f(X) − F(X) )² p(X) dX
         = R_SM + R_res    (4)

Page 23

A New Look at GEM

• Q-neighborhood of a training sample
  • The set of unseen samples within distance Q of the training sample
• R_SM: the generalization error for unseen samples in the union of all Q-neighborhoods, i.e. the Q-Union S_Q
• When Q approaches infinity, S_Q approaches the entire input space and R_res vanishes.
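Membership in the Q-Union can be sketched as a simple distance test; Euclidean distance and the sample coordinates below are assumptions for illustration, not from the slides:

```python
def in_q_union(x, training_samples, Q):
    """True iff x lies within distance Q of at least one training sample, i.e. x is in S_Q."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    return any(dist(x, t) <= Q for t in training_samples)

train = [(0.0, 0.0), (1.0, 1.0)]
```

As Q grows, every point eventually passes the test, which is the sense in which S_Q approaches the entire input space.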

Page 24

A New Look at GEM

• Q-neighborhood

[Figure: Artificial Dataset 1 — a Q-neighborhood of radius Q drawn around a single training sample in the input space]

Page 25

A New Look at GEM

• Union of Q-neighborhoods (S_Q)

[Figure: Artificial Dataset 1 — the Q-Union S_Q covering the neighborhoods of all training samples]

Page 26

A New Look at GEM: The localized GEM ---- R*_SM

• Training Set ⊆ Q-Union ⊆ Entire Input Space
• R_emp ≤ R_SM (the error over the training samples plus the unseen samples in S_Q) ≤ R_true, and R_SM → R_true as Q → ∞
• With probability at least (1 − η),

  R_SM ≤ ( √R_emp + √(E_S((Δy)²)) + A )² + ε = R*_SM(Q)    (5)

  where A = max(F(X)) − min(F(X)) and B = max((f(X) − F(X))²) are constants, ε = B √(−ln η / (2l)), η denotes the confidence of the bound holding true, and l denotes the number of training samples.

• R*_SM is an upper bound on the average prediction MSE for unseen samples in S_Q.
• R*_SM varies with R_emp, the Sensitivity Measure (SM) and Q.

Yeung, Ng, et al., "Localized Generalization Error and Its Application to Architecture Selection for Radial Basis Function Neural Network", IEEE TNN, Oct. 2007

Ng & Yeung, "Selection of Weight Quantisation Accuracy for Radial Basis Function Neural Network Using Stochastic Sensitivity Measure", IEE Electronic Letters, 2003

Page 27

A New Look at GEM: The localized GEM ---- R*_SM

• With probability at least (1 − η),

  R_SM ≤ ( √R_emp + √(E_S((Δy)²)) + A )² + ε = R*_SM(Q)

• R_emp denotes the training error
  • Indicates how well the classifier learns from the training dataset
• E_S((Δy)²) denotes the SM, which describes the classifier output differences between samples located within the Q-Union and the training point q
  • Indicates how complex the classifier is
• A and ε are constants describing the training dataset

Page 28

A New Look at GEM: SM

• Δy = f(*) − f(q), where q is a training sample and * an unseen sample in its Q-neighborhood
• SM = E_S((Δy)²)
• The SM for samples in a Q-neighborhood is the average of the squares of the classifier output differences between the training sample q and the unseen samples in the Q-neighborhood.

[Figure: a Q-neighborhood of radius Q around training sample q, with unseen samples * at perturbations ΔX from q]

1. Yeung, Ng et al, "Localized Generalization Error and Its Application to Architecture Selection for Radial Basis Function Neural Network", IEEE TNN, Oct. 2007
2. Ng, Dorado, Yeung et al, "Image Classification with the use of Radial Basis Function Neural Networks and the Minimization of Localized Generalization Error", Pattern Recognition, 2007
3. Ng and Yeung, "Selection of Weight Quantisation Accuracy for Radial Basis Function Neural Network Using Stochastic Sensitivity Measure", IEE Electronic Letters, 2003
4. Yeung and Sun, "Using Function Approximation to Analyze the Sensitivity of MLP with Antisymmetric Squashing Activation Function", IEEE Trans. Neural Networks, 2002
5. Zeng and Yeung, "Sensitivity Analysis of Multilayer Perceptron to Input and Weight Perturbations", IEEE Trans. NN, 2001
6. Zeng and Yeung, "A Quantified Sensitivity Measure for Multilayer Perceptron to Input Perturbation", Neural Computation, 2003
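The SM can be estimated by Monte Carlo over a Q-neighborhood. A sketch assuming perturbations ΔX drawn uniformly from [−Q, Q] in each dimension and a toy linear classifier output (the cited papers derive closed-form expressions for RBFNN and MLP instead):

```python
import random

def sensitivity_measure(f, q, Q, n_dims, n_samples=20_000, seed=0):
    """Monte Carlo estimate of SM = E[(f(q + dX) - f(q))^2], dX ~ Uniform[-Q, Q]^n_dims."""
    rng = random.Random(seed)
    y_q, total = f(q), 0.0
    for _ in range(n_samples):
        dx = [rng.uniform(-Q, Q) for _ in range(n_dims)]
        total += (f([qi + di for qi, di in zip(q, dx)]) - y_q) ** 2
    return total / n_samples

# Toy classifier output: a linear unit f(x) = 2*x1 + x2.
f = lambda x: 2 * x[0] + x[1]
sm = sensitivity_measure(f, [0.5, 0.5], Q=0.1, n_dims=2)
# Analytically E[(2*d1 + d2)^2] = (4 + 1) * Q^2 / 3, so sm should be near 0.05/3.
```

A steeper (more complex) output surface around q gives a larger SM, which is exactly how the SM term in R*_SM penalizes complexity.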

Page 29

Neural Network

Architecture Selection

Using R*SM

Page 30

Neural Network Architecture Selection

Existing Selection Methods

• K-fold CV
  • The most widely used method
  • Select the RBFNN yielding the smallest average CV error
• Sequential Learning
  • Start with 1 hidden neuron
  • Add one more hidden neuron if the stopping criterion is not satisfied (e.g. training error smaller than a threshold)
  • For our experiments, two criteria are used:
    • MSE < 0.025
    • The highest training accuracy

Page 31

Neural Network Architecture Selection

Existing Selection Methods

• Two Ad-hoc Methods
  • Select the number of hidden neurons to be the square root of the number of training samples
    • Is it reasonable to use more hidden neurons to solve the same problem just because we have more training samples?
  • Select the number of hidden neurons to be the number of training samples
    • Easily overfits
    • Good for regression problems

S. Haykin, "Neural Networks", Prentice Hall, 1998

Page 32

Neural Network Architecture Selection Using R*SM

• Formulation of an optimization problem for selecting the number of hidden neurons for an RBFNN
  • The center positions and widths of the hidden neurons can be determined by an automatic clustering algorithm, e.g. k-means, fuzzy c-means, hierarchical clustering, etc.
  • Determining the connection weights for an RBFNN with a fixed number of hidden neurons is well studied.
  • Our method and the GEM do not depend on the learning algorithm.
• Concentrate on selecting the number of hidden neurons (M)
  • The range of M could be from 1 to the number of training samples

Page 33

Neural Network Architecture Selection Using R*SM

• The optimization problem:

  max_{M ∈ [1, l]} Q    s.t. R*_SM(Q) ≤ a    (6)

• One can convert problem (6) into an unconstrained optimization problem:

  max_{M ∈ [1, l]} h(M, Q*)    (7)

  where h(M, Q*) = 0 if R_emp ≥ a, and h(M, Q*) = Q* otherwise, and Q* is the minimum real solution of Eq. (8), a fourth-order polynomial equation in Q. Its coefficients involve the target bound a, the training error R_emp, the constants A and ε, the number of features N, and, for each of the M hidden neurons, kernel terms φ_j and v_j computed from the j-th center u_j = (u_j1, …, u_jN)′, together with the mean μ_xi and variance σ²_xi of the i-th input feature.

Maximal Coverage Classification problem with Selected Generalization error bound (MC2SG)

Page 34

Neural Network Architecture Selection Using R*SM

• Experimental Results
  • We compare our proposed method with 5-fold and 10-fold CV, sequential learning and two ad-hoc methods
  • The sequential learning adds hidden neurons until a pre-selected criterion is satisfied
    • Sequen_MSE – training MSE is lower than 0.025
    • Sequen_01 – highest training classification accuracy
  • 8 datasets are used and we perform 10 independent runs on each of them
  • Every dataset is split into two halves randomly
    • Training dataset
    • Testing dataset – not involved in training

Page 35

Neural Network Architecture Selection Using R*SM

• Average Testing Accuracies (%)

Datasets\Methods | MC2SG | 5-CV  | 10-CV | Squen_MSE | Squen_01 | SQRT(l) | l
Ionosphere       | 84.71 | 83.29 | 83.40 | 82.06     | 79.94    | 84.11   | 48.17
Sonar Target     | 83.20 | 80.49 | 80.87 | 81.17     | 78.25    | 76.50   | 82.62
Iris             | 97.87 | 96.53 | 96.07 | 96.27     | 96.13    | 97.77   | 83.60
Hepatitis        | 82.99 | 80.26 | 79.35 | 77.20     | 77.66    | 77.27   | 74.42
Breast Cancer    | 97.29 | 96.99 | 96.92 | 97.10     | 96.43    | 97.26   | 93.18
Wine Recognition | 93.18 | 90.06 | 91.19 | 91.14     | 91.82    | 90.57   | 93.18
Credit Approval  | 88.87 | 87.06 | 86.13 | 79.77     | 88.84    | 88.34   | 39.45
Thyroid Gland    | 86.82 | 80.65 | 77.38 | 84.30     | 85.14    | 86.54   | 85.79

Page 36

Neural Network Architecture Selection Using R*SM

• McNemar Test Values over 10 runs
  • MC2SG performs statistically significantly better than another method if the McNemar Test Value is larger than 2.71
  • The differences between most methods are insignificant on the Iris dataset because the number of samples is too small.

Datasets\Methods | 5-CV  | 10-CV | Squen_MSE | Squen_01 | SQRT(l) | l
Ionosphere       | 43.18 | 42.48 | 6.88      | 31.08    | 0.57    | 582.20
Sonar Target     | 12.67 | 12.12 | 3.80      | 15.45    | 31.39   | 1.16
Iris             | 1.39  | 2.08  | 1.65      | 1.72     | 0.05    | 87.23
Hepatitis        | 9.26  | 9.94  | 35.89     | 34.20    | 36.74   | 50.67
Breast Cancer    | 6.33  | 6.90  | 5.79      | 16.33    | 0.05    | 999.34
Wine Recognition | 9.12  | 3.14  | 3.45      | 2.79     | 4.69    | 0.01
Credit Approval  | 12.67 | 24.43 | 176.89    | 0.28     | 2.50    | 1399.40
Thyroid Gland    | 40.09 | 75.00 | 14.40     | 7.36     | 1.29    | 6.00
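A McNemar value like those above can be computed from paired predictions on the same test samples. A sketch using the common continuity-corrected statistic (assumed here; the slides do not state which variant they used):

```python
def mcnemar_value(pred_a, pred_b, truth):
    """McNemar test statistic (with continuity correction) for two classifiers
    evaluated on the same samples.
    n01: samples A gets right and B gets wrong; n10: the reverse."""
    n01 = sum(a == t and b != t for a, b, t in zip(pred_a, pred_b, truth))
    n10 = sum(a != t and b == t for a, b, t in zip(pred_a, pred_b, truth))
    if n01 + n10 == 0:
        return 0.0
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)

# Hypothetical predictions: classifier A is wrong on 2 samples, B on 12.
truth  = [1] * 20
pred_a = [1] * 18 + [0] * 2
pred_b = [1] * 8 + [0] * 12
val = mcnemar_value(pred_a, pred_b, truth)
```

A value above the slides' 2.71 threshold is read as a statistically significant difference between the two classifiers.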

Page 37

Neural Network Architecture Selection Using R*SM

• Average Number of Hidden Neurons Required

Datasets\Methods | MC2SG | 5-CV | 10-CV | Squen_MSE | Squen_01 | SQRT(l) | l
Ionosphere       | 13.6  | 33.3 | 26.6  | 77.2      | 94.9     | 13.0    | 176.0
Sonar Target     | 22.9  | 2.4  | 47.3  | 55.7      | 48.5     | 10.0    | 105.0
Iris             | 6.2   | 13.5 | 9.9   | 22.3      | 26.7     | 9.0     | 75.0
Hepatitis        | 6.3   | 7.0  | 4.8   | 42.7      | 42.9     | 9.0     | 78.0
Breast Cancer    | 2.1   | 25.1 | 27.8  | 2.0       | 42.8     | 19.0    | 350.0
Wine Recognition | 7.2   | 33.7 | 38.7  | 36.1      | 24.3     | 9.0     | 90.0
Credit Approval  | 13.9  | 11.0 | 10.8  | 36.1      | 44.0     | 19.0    | 344.0
Thyroid Gland    | 6.7   | 16.2 | 11.7  | 5.1       | 38.2     | 10.0    | 108.0

Page 38

Neural Network Architecture Selection Using R*SM

• Total Time Required (seconds)

[Table: per-dataset total training time in seconds for MC2SG, 5-CV, 10-CV, Squen_MSE, Squen_01, SQRT(l) and l; the individual values are not cleanly recoverable from the source. The CV methods required by far the most time.]

Page 39

Neural Network Architecture Selection Using R*SM

• Experiments show that the RBFNN selected using R*_SM, i.e. MC2SG, performs the best and yields the best testing accuracy for unseen samples
• The time and the number of RBFNN trainings required are both low compared to the CV methods
• The number of hidden neurons in the selected RBFNNs is the smallest among the RBFNNs yielding the best testing accuracies

Page 40

Neural Network

Feature Selection

Using R*SM

Page 41

Feature Selection Problem

• Objective of Feature Selection
  • Find a reduced feature subset for a classifier such that the generalization capability of the classifier does not decrease in comparison with using the full set of features
• Other objectives
  • Low computational complexity
  • Scalable to a large number of features
  • Scalable to a large number of samples
  • Universal to any type of classifier
  • Meaningful feature subset selection

W.W.Y. Ng, D.S. Yeung et al., "Feature Selection Using Localized Generalization Error for Supervised Classification Problems Using RBFNN", Submitted to Pattern Recognition.

Page 42

Feature Selection Methods

[Diagram: taxonomy of feature selection methods — Filter methods (Correlation, Similarity, Separability, Mutual Information: selection based indirectly on training error), Wrapper methods (Leave-One-Out, Generalization Error: selection based on generalization error estimation), plus Hybrid and Embedded methods]

Methods in the RED region do not use error as the selection criterion directly. However, they indirectly involve the training error in the selection. E.g. the separability criterion tries to keep or reduce the error on the training samples.

Page 43

Feature Selection Method: Correlation

• Correlation Between Features and Desired Output
  • Measures the simplest, linear correlation between a feature and the desired output
  • A relevant feature yields a high linear correlation with the desired output
    • E.g. the value of the feature increases when the desired output increases
    • E.g. X1 < 0.5 for class 1, X1 > 0.5 for class 2
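A correlation filter of this kind can be sketched as follows; the data are hypothetical:

```python
def pearson(xs, ys):
    """Pearson linear correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def rank_by_correlation(features, target):
    """Rank feature indices by |Pearson correlation| with the desired output, best first."""
    scored = [(abs(pearson(col, target)), i) for i, col in enumerate(features)]
    return [i for _, i in sorted(scored, reverse=True)]

# x0 tracks the output; x1 behaves like noise.
x0 = [0.1, 0.2, 0.3, 0.8, 0.9, 1.0]
x1 = [0.5, 0.1, 0.9, 0.2, 0.8, 0.3]
y  = [0, 0, 0, 1, 1, 1]
order = rank_by_correlation([x0, x1], y)
```

The filter's stated weakness shows up immediately: a feature with a strong but nonlinear relationship to the output can still score near zero.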

Page 44

Feature Selection Method: Mutual Information

• Mutual Information
  • Measures the nonlinear correlation between features and the desired output
  • Measures the difference between the uncertainty of the output with and without an input feature
  • If adding the input feature Xi reduces more of the uncertainty of the output, then Xi is considered more relevant.
  • Features are sorted in ascending order of mutual information, and the one yielding the least mutual information is removed, until a pre-selected number of features is reached.
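For discrete feature values, the MI criterion can be sketched via I(X;Y) = H(Y) − H(Y|X); the high-dimensional density estimation noted later as a drawback is sidestepped here by using toy discrete data:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy H of a label sequence, in bits."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def mutual_information(xs, ys):
    """I(X;Y) = H(Y) - H(Y|X) for discrete values, in bits."""
    n = len(xs)
    h_y_given_x = 0.0
    for v in set(xs):
        ys_v = [y for x, y in zip(xs, ys) if x == v]
        h_y_given_x += len(ys_v) / n * entropy(ys_v)
    return entropy(ys) - h_y_given_x

# x0 determines y exactly (1 bit of MI); x1 is independent of y (0 bits).
x0 = [0, 0, 1, 1, 0, 0, 1, 1]
x1 = [0, 1, 0, 1, 0, 1, 0, 1]
y  = [0, 0, 1, 1, 0, 0, 1, 1]
mi0 = mutual_information(x0, y)
mi1 = mutual_information(x1, y)
```

Under the slides' scheme, x1 would be the first feature removed, since it carries the least mutual information with the output.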

Page 45

Feature Selection Method: Similarity Measure

• Similarity Measure
  • Finds the similarity between features; only one feature is selected among a group of similar features with sim(a, b) < a given threshold
  • If similar features are collected, use only one of them instead of all
  • E.g. the volume of a fish and the weight of a fish are similar and provide no more information if we keep both of them. Thus we may select either one of them.

Page 46

Feature Selection Method: Separability Measure

• Separability Measure
  • E.g. in a two-class problem, if the samples of the two classes are separated very well in X2 while mixed together in X1, then X2 yields better separability and we say it is the more relevant feature.

Page 47

Feature Selection Method: Leave-One-Out Method

• Leave-One-Out Method
  • First, a classifier is trained using all N features (the full set of features)
  • N classifiers are trained: the i-th classifier is trained without the i-th feature
  • The i-th feature is removed if the i-th classifier yields the best accuracy on a validation dataset
  • Continue until the accuracy drops significantly

J. Weston, et al. "Feature Selection for SVMs", NIPS, Vol. 13, 2001
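One round of this scheme can be sketched as below; the nearest-class-mean learner is a hypothetical stand-in, and for brevity accuracy is measured on the training set where the slides use a held-out validation set:

```python
def leave_one_feature_out(X, y, n_features, train_fn, acc_fn):
    """One round of the leave-one-out scheme: train n_features classifiers, the i-th
    without feature i; report the feature whose removal gives the best accuracy."""
    best_acc, best_i = -1.0, None
    for i in range(n_features):
        Xi = [[v for j, v in enumerate(row) if j != i] for row in X]
        model = train_fn(Xi, y)
        acc = acc_fn(model, Xi, y)
        if acc > best_acc:
            best_acc, best_i = acc, i
    return best_i, best_acc

# Stand-in learner: class means; predict the class of the nearer mean.
def train_fn(X, y):
    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    return (mean([r for r, t in zip(X, y) if t == 0]),
            mean([r for r, t in zip(X, y) if t == 1]))

def acc_fn(model, X, y):
    m0, m1 = model
    d = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    pred = [0 if d(r, m0) <= d(r, m1) else 1 for r in X]
    return sum(p == t for p, t in zip(pred, y)) / len(y)

# Feature 0 is informative; feature 1 is pure noise that hurts the classifier.
X = [[0.0, 9.0], [0.1, 0.0], [0.2, 9.0], [0.9, 0.0], [1.0, 9.0], [1.1, 0.0]]
y = [0, 0, 0, 1, 1, 1]
removed, acc = leave_one_feature_out(X, y, 2, train_fn, acc_fn)
```

The noise feature is the one whose removal maximizes accuracy, so it is the one discarded in this round; the cost is one full classifier training per candidate feature.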

Page 48

Feature Selection Method: Leave-One-Out Method

• Leave-One-Out Method
  • To estimate the generalization error of leaving out a feature, 5-fold or 10-fold cross-validation is adopted
  • E.g., in 5-fold cross-validation, 20% of the training samples are reserved for validation
  • The average of the validation errors of these 5 trained classifiers is used as the CV error
  • The feature yielding the smallest drop in CV error is removed
  • However, the training datasets change dramatically if the total number of samples is not large enough

T. Hastie, R. Tibshirani and J. Friedman, "The Elements of Statistical Learning", Springer, 2001

Page 49

Feature Selection Method: Using the Generalization Error Model

• Using the Generalization Error Model (GEM)
  • In the previous slides, we introduced methods that select features using an empirical estimation of the generalization error of removing a feature.
  • However, they are very time consuming and infeasible for large datasets.
  • Use analytical error bounds to replace the empirical estimation of the generalization error.
  • In this talk, we demonstrate the use of the Localized Generalization Error Bound (R*_SM) in Feature Selection

Page 50

Feature Selection Methods

• Correlation
  • Time: Small
  • Considers accuracy: Indirectly
  • Major advantage: Easy to compute
  • Major disadvantage: Not suitable for problems with a nonlinear relationship
  • Large # of samples: Suitable; Large # of features: Suitable
• MI
  • Time: Medium
  • Considers accuracy: Indirectly
  • Major advantage: Captures the nonlinear relationship between the desired output and the features
  • Major disadvantage: Requires computing a high-dimensional joint probability density function
  • Large # of samples: Suitable; Large # of features: Suitable
• Separability
  • Time: Small
  • Considers accuracy: Indirectly
  • Major advantage: Easy to compute and related to the accuracy on the training samples
  • Major disadvantage: Cannot deal with problems where the two classes of samples share the same mean or overlap
  • Large # of samples: Suitable; Large # of features: Suitable
• Similarity
  • Time: Medium
  • Considers accuracy: Indirectly
  • Major advantage: Fast; also applicable to unsupervised problems
  • Major disadvantage: A relevant feature similar to other features will also be removed
  • Large # of samples: Suitable; Large # of features: Suitable
• R*_SM
  • Time: Small
  • Considers accuracy: Generalization Error
  • Major advantage: Makes use of an estimation of the generalization error and does not require a large amount of time-consuming classifier training
  • Major disadvantage: Based on the estimation of the generalization error only
  • Large # of samples: Suitable; Large # of features: Suitable
• LOO
  • Time: Very Large
  • Considers accuracy: Generalization Error
  • Major advantage: Generalization error is the selection criterion
  • Major disadvantage: Extremely time consuming; requires a huge amount of classifier training
  • Large # of samples: Not suitable; Large # of features: Not suitable


Feature Selection Using R*SM

• RSMFS Method
• We evaluate a feature based on its contribution to R*SM, i.e., the generalization error bound.
• R*SM(xi) evaluates the generalization error bound of the classifier using the same set of unseen samples as R*SM, but with the ith feature held constant
• at the mean of its values in the training set
• If the values of a feature can be replaced by a constant, we may ignore this feature and remove it.

where xi denotes the ith input feature
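The idea of scoring a feature by holding it at its training-set mean can be illustrated with a simple proxy. The real R*SM bound perturbs inputs inside the Q-neighbourhood; the mean-squared output change used in `feature_sensitivity` below is only a schematic stand-in for |R*SM − R*SM(xi)|, and the toy model is hypothetical:

```python
import numpy as np

def feature_sensitivity(predict, X, i):
    """Proxy for |R*SM - R*SM(x_i)|: mean squared change in the classifier
    output when feature i is clamped to its training-set mean.
    (Schematic stand-in -- the real bound perturbs inputs inside the
    Q-neighbourhood; here we only clamp the one feature.)"""
    X_clamped = X.copy()
    X_clamped[:, i] = X[:, i].mean()
    return float(np.mean((predict(X) - predict(X_clamped)) ** 2))

# toy model whose output depends on feature 0 only
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
predict = lambda Z: np.tanh(2.0 * Z[:, 0])

s = [feature_sensitivity(predict, X, i) for i in range(3)]
print(s)   # clamping the irrelevant features changes nothing
```

A feature whose clamping leaves the outputs unchanged is exactly the kind of feature the slide proposes to remove.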


Feature Selection Using R*SM

• RSMFS Method
• The feature selection step is formulated as

    arg min_{xi ∈ CFS} | R*SM(Q) − R*SM(Q({xi} ∪ IFS)) |        (8)

  where CFS and IFS denote the candidate feature set and the irrelevant feature set, respectively, and R*SM(Q({xi} ∪ IFS)) denotes the R*SM computed without perturbing the features in {xi} ∪ IFS

• Initially, CFS = the full set of features and IFS = the empty set

• To avoid evaluating all possible feature subsets, a heuristic forward search is adopted.


Feature Selection Using R*SM

• RSMFS Method
1. Let CFS be the Candidate Feature Set, initially equal to the full set of features
2. Build a classifier using CFS
3. Compute R*SM by perturbing all the features
4. Compute R*SM(xi) by perturbing all the features except the ith feature
5. Remove from CFS the feature yielding the smallest value of | R*SM − R*SM(xi) |
6. Go back to Step 2 if CFS ≠ ∅
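The six steps can be sketched as a generic loop. Here `train` and `rsm_bound` are hypothetical placeholders for the real RBFNN training and R*SM computation; the least-squares model and training-MSE "bound" used for the demonstration exist only to make the sketch runnable:

```python
import numpy as np

def rsmfs(X, y, rsm_bound, train):
    """Sketch of the 6-step RSMFS loop. `train` fits a model on the given
    feature columns; `rsm_bound(model, X, y, fixed)` is assumed to return
    the bound with the columns in `fixed` clamped to their means."""
    cfs = list(range(X.shape[1]))           # Step 1: candidate feature set
    removal_order = []
    while cfs:                              # Step 6: repeat until CFS is empty
        model = train(X[:, cfs], y)                          # Step 2
        base = rsm_bound(model, X[:, cfs], y, fixed=[])      # Step 3
        change = {j: abs(base - rsm_bound(model, X[:, cfs], y,
                                          fixed=[cfs.index(j)]))
                  for j in cfs}                              # Step 4
        weakest = min(change, key=change.get)                # Step 5
        cfs.remove(weakest)
        removal_order.append(weakest)
    return removal_order                    # last removed = most relevant

# toy stand-ins: least-squares "classifier", bound proxy = training MSE
# with the `fixed` feature columns clamped to their mean
def train(X, y):
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def rsm_bound(w, X, y, fixed):
    Xc = X.copy()
    for j in fixed:
        Xc[:, j] = Xc[:, j].mean()
    return float(np.mean((Xc @ w - y) ** 2))

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3))
y = 2 * X[:, 0] + 0.1 * rng.normal(size=300)   # only feature 0 matters
order = rsmfs(X, y, rsm_bound, train)
print(order[-1])   # the informative feature is removed last
```

Because only the bound is recomputed per candidate, no extra classifier training is needed inside Step 4, which is the speed advantage claimed over the CV-based method.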


• Case Study – UCI Wine Dataset

• These are the results of a chemical analysis of wines grown in the same region of Italy but derived from three different cultivars

• 13 features

• 178 samples

• 3 classes

• We use this dataset to study the differences between the feature subsets selected by the 6 feature selection methods

Experimental Comparisons of Feature Selection Methods: Case Study on UCI Wine Dataset


Case Study on UCI Wine Dataset

The 3 Most Relevant Features Selected by the Feature Selection Methods

Method | Most relevant | 2nd most relevant | 3rd most relevant
RSMFS  | X13           | X12               | X7
MI     | X13           | X12               | X1
LOO    | X13           | X9                | X6
COR    | X12           | X7                | X6
SEPA   | X13           | X2                | X1
SIM    | X10           | X2                | X1

Avg (Std Dev in brackets) Testing Accuracy and t-Test Values for the Selected Feature Subsets

Method | Full Set | 10 Features | 5 Features | 3 Features | 2 Features
RSMFS | 92.57% (2.56%), t=0.00 | 94.39% (2.52%), t=-0.51 | 94.27% (2.42%), t=-0.48 | 92.00% (2.81%), t=0.15 | 85.30% (3.44%), t=1.70
MI | 92.57% (2.56%), t=0.00 | 93.19% (2.36%), t=-0.18 | 90.07% (5.52%), t=0.41 | 89.22% (4.47%), t=0.65 | 85.30% (3.44%), t=1.70
LOO | 92.57% (2.56%), t=0.00 | 88.76% (4.88%), t=0.69 | 79.33% (5.09%), t=2.32 | 65.12% (5.91%), t=4.26 | 56.99% (5.75%), t=5.65
COR | 92.57% (2.56%), t=0.00 | 90.47% (2.62%), t=0.57 | 93.72% (2.05%), t=-0.35 | 85.60% (4.23%), t=1.41 | 84.79% (2.68%), t=2.10
SEPA | 92.57% (2.56%), t=0.00 | 94.05% (2.63%), t=-0.40 | 91.32% (4.70%), t=0.23 | 77.23% (5.97%), t=2.36 | 61.32% (3.46%), t=7.26
SIM | 92.57% (2.56%), t=0.00 | 87.57% (4.39%), t=0.98 | 77.80% (5.56%), t=2.41 | 70.18% (8.43%), t=2.54 | 56.89% (10.10%), t=3.42

(In the original slides, red denotes results better than the full set and blue denotes a statistically insignificant loss, t-Test < 1.96.)
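The t-Test values in these tables compare a subset's accuracy against the full set's, with |t| < 1.96 read as no significant difference at the 5% level. A plausible form is the two-sample t statistic sketched below; the slides do not state the run count or whether runs were paired, so n=10 is an assumption and the computed values need not match the reported ones:

```python
import math

def t_statistic(mean_full, std_full, mean_sub, std_sub, n=10):
    """Two-sample t statistic comparing full-set accuracy with a feature
    subset's accuracy. n=10 independent runs is an assumption; the slides
    do not state the run count or whether runs were paired."""
    se = math.sqrt(std_full ** 2 / n + std_sub ** 2 / n)
    return (mean_full - mean_sub) / se

# e.g. 92.57% (2.56%) for the full set vs 92.00% (2.81%) for a subset:
# a positive value means the subset did worse than the full set
t = t_statistic(92.57, 2.56, 92.00, 2.81)
print(round(t, 2))
```

The sign convention matches the tables: negative t values mark subsets that beat the full set.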


• Similarity
• (X1 Alcohol, X2 Malic Acid, X10 Color Intensity), (X1 Alcohol, X2 Malic Acid)
• The similarity selection is based only on the means and variances of each pair of features, without considering their actual distributions. Hence it may select an irrelevant feature, which leads to poor generalization.

Experimental Comparisons of Feature Selection Methods: Case Study on UCI Wine Dataset


• 5-Fold Cross Validation
• (X6 Total Phenols, X9 Proanthocyanins, X13 Proline), (X9 Proanthocyanins, X13 Proline)
• This method selects the feature subset that yields the smallest error on the validation set (reserved from the training set, so only 80% of the training samples are used for training).
• It finds the best single feature, X13, but all of its other choices are poor. This may be due to reserving 20% of the samples for validation, which makes the training sets actually used differ substantially from the original training set.
• This could be addressed by the leave-one-out method, but its computational cost makes that infeasible.
• The value of this method: it evaluates a feature using real samples with real target outputs in the validation set; however, a portion of the training samples is not used in training

Experimental Comparisons of Feature Selection Methods: Case Study on UCI Wine Dataset


• Separability
• (X1 Alcohol, X2 Malic Acid, X13 Proline), (X1 Alcohol, X2 Malic Acid)
• One may notice that the samples from class 2 (red X in the next slide) mix with samples from the other 2 classes.
• But the mean values of the samples from the different classes (indicated by black solid circles) are far apart from each other.
• This example demonstrates that the Separability method is overly dependent on the mean values of the samples.

Experimental Comparisons of Feature Selection Methods: Case Study on UCI Wine Dataset
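The over-dependence on class means can be reproduced with a deliberately mean-only score. The exact separability criterion used in the talk is not shown in these slides, so `mean_separability` below is an illustrative assumption; the point is only that any score dominated by mean distance can prefer a badly overlapping feature:

```python
import numpy as np

def mean_separability(x, y):
    """Mean-only score: distance between the two class means (the kind of
    criterion the slide criticizes; the talk's exact formula may differ)."""
    return abs(x[y == 0].mean() - x[y == 1].mean())

rng = np.random.default_rng(3)
y = np.repeat([0, 1], 1000)
# feature A: class 1 is bimodal -- half its mass sits on top of class 0,
# yet its mean is far from class 0's mean
a = np.where(y == 0, rng.normal(0, 1, 2000),
             np.where(rng.random(2000) < 0.5, rng.normal(0, 1, 2000),
                      rng.normal(20, 1, 2000)))
# feature B: fully separated classes, but with closer means
b = np.where(y == 0, rng.normal(0, 1, 2000), rng.normal(4, 1, 2000))

sep_a = mean_separability(a, y)
sep_b = mean_separability(b, y)
print(sep_a > sep_b)   # the overlapping feature wins on the mean-only score
```

Feature A can never be classified better than about 75% accuracy here, while feature B separates the classes almost perfectly, yet the mean-based score prefers A.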


[Scatter plot of the samples with the class mean values marked: class 2 samples (red X) mix with the other classes while the class means (black solid circles) lie far apart]

Experimental Comparisons of Feature Selection Methods: Case Study on UCI Wine Dataset


• Correlation Coefficient

• (X6 Total Phenols, X7 Flavanoids, X12 OD280/OD315 of Diluted Wines), (X7 Flavanoids, X12 OD280/OD315 of Diluted Wines)

• This method works best when input and output are highly linearly correlated, as with X7 and X12, but not for X12 and X13 (which form a better feature subset).

• A feature subset that is linearly correlated with the desired output must have decision hyperplanes shaped like those in the next slide.

• E.g., the lower the class ID of a sample, the higher its values in both features X7 and X12

• However, the X12 and X13 combination yields better nonlinear distinguishing power, though not linear.

Experimental Comparisons of Feature Selection Methods: Case Study on UCI Wine Dataset
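The linear-only limitation is easy to demonstrate: Pearson correlation is strong for a monotone label relationship but near zero for a purely symmetric, nonlinear one. The data below are synthetic, not the Wine features:

```python
import numpy as np

def corr(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

rng = np.random.default_rng(4)
x_linear = rng.normal(size=1000)
y_lin = (x_linear > 0).astype(float)        # monotone dependence on x
x_quadratic = rng.normal(size=1000)
y_quad = (abs(x_quadratic) > 1).astype(float)  # depends on |x| -- nonlinear

c_lin = corr(x_linear, y_lin)
c_quad = corr(x_quadratic, y_quad)
print(abs(c_lin) > 0.5, abs(c_quad) < 0.1)
```

The second feature fully determines its label, yet its correlation score is close to zero; a correlation-based ranker would discard it, which mirrors the X12/X13 case above.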


Experimental Comparisons of Feature Selection Methods: Case Study on UCI Wine Dataset

[Figure: linear separation for the samples]


• Mutual Information

• (X1 Alcohol, X12 OD280/OD315 of Diluted Wines, X13 Proline), (X12 OD280/OD315 of Diluted Wines, X13 Proline)

• This method works best when input and output are highly nonlinearly correlated, as with X13 and X12

• However, mutual information only measures the dependence between the features and the target output for samples in the training set. It ignores the generalization capability of the selected features.

• The next 3 slides present the 3 best pairs of features in terms of their power to distinguish samples from different classes

Experimental Comparisons of Feature Selection Methods: Case Study on UCI Wine Dataset
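By contrast with the correlation coefficient, a mutual-information score does detect a symmetric, nonlinear dependence. The histogram plug-in estimator below is an assumption, since the talk does not specify which MI estimator was used:

```python
import numpy as np

def mutual_information(x, y, bins=20):
    """Histogram (plug-in) estimate of I(X;Y) in nats for a continuous
    feature x and a discrete label y. A simple estimator chosen for
    illustration; the talk's estimator may differ."""
    mi = 0.0
    edges = np.histogram_bin_edges(x, bins)
    n = len(x)
    for c in np.unique(y):
        p_c = np.mean(y == c)
        joint_counts, _ = np.histogram(x[y == c], edges)
        marg_counts, _ = np.histogram(x, edges)
        p_joint = joint_counts / n          # p(x in bin, y = c)
        p_marg = marg_counts / n            # p(x in bin)
        mask = p_joint > 0
        mi += np.sum(p_joint[mask] *
                     np.log(p_joint[mask] / (p_marg[mask] * p_c)))
    return float(mi)

rng = np.random.default_rng(5)
x = rng.normal(size=2000)
y = (abs(x) > 1).astype(int)       # purely nonlinear dependence
noise = rng.normal(size=2000)      # irrelevant feature

mi_x = mutual_information(x, y)
mi_noise = mutual_information(noise, y)
print(mi_x > mi_noise)
```

The nonlinearly relevant feature scores well above the irrelevant one, but note the estimate is computed only on the training sample, which is the generalization caveat raised in the slide.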


Experimental Comparisons of Feature Selection Methods: Case Study on UCI Wine Dataset

[Figure: linear separation for the samples]


• Minimization of the Localized Generalization Error Model (RSMFS by R*SM)

• (X7 Flavanoids, X12 OD280/OD315 of Diluted Wines, X13 Proline), (X12 OD280/OD315 of Diluted Wines, X13 Proline)

• This method selects the feature subset that yields the largest change in the generalization error bound (R*SM)

• A feature is removed if it does not change the generalization error bound (R*SM)

• Both the training samples and the unseen samples in the Q-Union remain unchanged throughout the feature selection process.

• If a feature does not change R*SM, the removal of that feature does not change the classification decision (small ST-SM).

• From the 3-D plot using features X7, X12 and X13, one finds that the feature subset selected by RSMFS yields the best distinguishing power.

• The overlapping regions between samples from any two classes are the smallest, so the generalization capability should be the best.

Experimental Comparisons of Feature Selection Methods: Case Study on UCI Wine Dataset


Experimental Comparisons of Feature Selection Methods: Case Study on UCI Wine Dataset

[3-D scatter plot of the samples using features X7, X12 and X13]


Experimental Comparisons of Feature Selection Methods

6 More Experiments

Dataset                | Number of Features | Number of Samples | Number of Classes
Lung Carcinomas        | 12,600             | 203               | 2
Isolet                 | 617                | 6,238             | 26
Multiple Feature Digit | 649                | 2,000             | 10
Network Intrusion      | 41                 | 464,020           | 2
Sonar                  | 60                 | 208               | 2
Ionosphere             | 34                 | 351               | 2


Lung Carcinomas dataset

• Feature Selection Results

# Features | R*SM | MI | COR | SIM | SEPA | LOO
1 | 86.14%, t=5.24 | 82.18%, t=4.42 | 82.18%, t=6.51 | 82.18%, t=3.03 | 82.18%, t=6.09 | N/A
50 | 90.10%, t=1.94 | 91.85%, t=2.07 | 82.18%, t=6.86 | 84.61%, t=2.88 | 86.50%, t=2.19 | N/A
150 | 96.04%, t=0.20 | 90.83%, t=1.23 | 84.16%, t=6.95 | 85.85%, t=4.55 | 93.25%, t=0.97 | N/A
250 | 99.01%, t=-0.76 | 98.02%, t=-0.48 | 91.09%, t=1.93 | 96.04%, t=0.29 | 97.03%, t=0.00 | N/A
500 | 97.03%, t=0.00 | 97.03%, t=0.00 | 97.03%, t=0.00 | 97.03%, t=0.00 | 97.03%, t=0.00 | N/A
12,600 | 97.03%, t=0.00 | 97.03%, t=0.00 | 97.03%, t=0.00 | 97.03%, t=0.00 | 97.03%, t=0.00 | N/A

Experimental Comparisons of Feature Selection Methods

(In the original slides, red denotes results better than the full set and blue denotes a statistically insignificant loss, t-Test < 1.96.)


• Isolated Letter Speech Recognition Dataset

Experimental Comparisons of Feature Selection Methods

Number of Features | RSMFS | MI | COR | SIM | SEPA | LOO
62 (10%) | 62.15%, t=5.24 | 38.78%, t=11.28 | 35.08%, t=9.68 | 48.92%, t=5.28 | 30.62%, t=11.70 | 30.65%, t=20.49
123 (20%) | 68.77%, t=8.79 | 63.52%, t=11.86 | 46.77%, t=18.13 | 62.00%, t=3.88 | 38.92%, t=26.20 | 43.30%, t=12.84
247 (40%) | 80.46%, t=1.75 | 77.45%, t=3.59 | 67.69%, t=11.20 | 71.54%, t=2.54 | 46.77%, t=19.44 | 68.61%, t=12.75
370 (60%) | 83.08%, t=0.49 | 82.52%, t=1.18 | 80.00%, t=1.41 | 76.31%, t=4.12 | 73.60%, t=4.57 | 77.08%, t=4.44
494 (80%) | 85.69%, t=-0.76 | 84.52%, t=-0.46 | 82.77%, t=0.56 | 82.31%, t=0.80 | 82.00%, t=1.21 | 82.56%, t=0.73
617 (100%) | 83.85%, t=0.00 | 83.85%, t=0.00 | 83.85%, t=0.00 | 83.85%, t=0.00 | 83.85%, t=0.00 | 83.85%, t=0.00


• Multiple Feature Digits Recognition Dataset

Experimental Comparisons of Feature Selection Methods

Number of Features | RSMFS | MI | COR | SIM | SEPA | LOO
65 (10%) | 97.00%, t=-0.13 | 93.95%, t=3.10 | 88.94%, t=4.70 | 85.88%, t=3.20 | 91.74%, t=7.31 | 82.29%, t=18.61
130 (20%) | 97.42%, t=-0.65 | 94.33%, t=3.10 | 92.93%, t=2.24 | 91.18%, t=3.43 | 94.62%, t=3.42 | 86.58%, t=11.51
260 (40%) | 96.45%, t=0.62 | 95.85%, t=1.45 | 95.76%, t=1.02 | 93.54%, t=3.85 | 96.56%, t=0.46 | 92.64%, t=5.88
389 (60%) | 97.12%, t=-0.30 | 96.55%, t=0.47 | 96.85%, t=0.06 | 95.16%, t=2.66 | 96.67%, t=0.29 | 94.79%, t=2.74
519 (80%) | 96.88%, t=0.03 | 96.60%, t=0.35 | 96.97%, t=-0.08 | 96.32%, t=0.83 | 96.12%, t=1.01 | 95.80%, t=1.78
649 (100%) | 96.90%, t=0.00 | 96.90%, t=0.00 | 96.90%, t=0.00 | 96.90%, t=0.00 | 96.90%, t=0.00 | 96.90%, t=0.00


Feature Selection Problems

• Network Intrusion Detection Dataset

Number of Features | RSMFS | MI | COR | SIM | SEPA | LOO
4 (10%) | 98.50%, t=5.48 | 98.43%, t=9.52 | 98.22%, t=13.20 | 80.51%, t=4.08 | 98.83%, t=5.17 | 99.51%, t=0.77
8 (20%) | 99.30%, t=1.68 | 99.18%, t=3.60 | 99.14%, t=3.70 | 92.14%, t=3.87 | 98.92%, t=5.32 | 99.51%, t=0.19
16 (40%) | 99.46%, t=0.74 | 99.22%, t=4.44 | 99.25%, t=3.42 | 93.44%, t=3.05 | 99.05%, t=3.65 | 99.41%, t=1.30
25 (60%) | 99.50%, t=0.47 | 99.41%, t=1.41 | 99.48%, t=0.60 | 98.39%, t=15.95 | 99.13%, t=2.40 | 99.42%, t=1.30
33 (80%) | 99.54%, t=0.00 | 99.41%, t=1.53 | 99.54%, t=0.00 | 99.49%, t=0.69 | 99.33%, t=1.80 | 99.54%, t=0.00
41 (100%) | 99.54%, t=0.00 | 99.54%, t=0.00 | 99.54%, t=0.00 | 99.54%, t=0.00 | 99.54%, t=0.00 | 99.54%, t=0.00


Sonar Detection Dataset

Experimental Comparisons of Feature Selection Methods

Number of Features | RSMFS | MI | COR | SIM | SEPA | LOO
6 (10%) | 75.45%, t=1.86 | 74.06%, t=2.27 | 76.39%, t=1.81 | 73.28%, t=2.62 | 77.07%, t=1.64 | 76.68%, t=1.46
12 (20%) | 79.09%, t=0.92 | 80.89%, t=0.49 | 77.65%, t=1.44 | 77.26%, t=1.40 | 78.72%, t=0.98 | 77.17%, t=1.62
24 (40%) | 81.59%, t=0.30 | 80.18%, t=0.64 | 81.34%, t=0.42 | 80.27%, t=0.65 | 81.34%, t=0.42 | 79.88%, t=0.77
36 (60%) | 82.38%, t=0.17 | 80.95%, t=0.55 | 80.66%, t=0.59 | 81.99%, t=0.25 | 81.53%, t=0.46 | 80.67%, t=0.82
48 (80%) | 82.47%, t=0.18 | 81.05%, t=0.52 | 80.37%, t=0.73 | 82.89%, t=0.05 | 81.15%, t=0.47 | 82.12%, t=0.27
60 (100%) | 83.04%, t=0.00 | 83.04%, t=0.00 | 83.04%, t=0.00 | 83.04%, t=0.00 | 83.04%, t=0.00 | 83.04%, t=0.00


• Ionosphere Detection Dataset

Experimental Comparisons of Feature Selection Methods

Number of Features | RSMFS | MI | COR | SIM | SEPA | LOO
3 (10%) | 86.97%, t=-1.07 | 84.63%, t=-0.03 | 83.09%, t=0.68 | 74.80%, t=3.62 | 77.89%, t=4.23 | 85.57%, t=0.91
7 (20%) | 86.06%, t=-0.71 | 86.06%, t=-0.73 | 85.20%, t=-0.39 | 76.97%, t=3.40 | 75.43%, t=5.45 | 82.43%, t=0.73
14 (40%) | 85.03%, t=-0.19 | 85.03%, t=-0.25 | 84.70%, t=-0.07 | 84.11%, t=0.21 | 84.80%, t=-0.16 | 82.14%, t=1.14
20 (60%) | 85.20%, t=-0.30 | 85.03%, t=-0.28 | 84.71%, t=-0.08 | 84.23%, t=0.21 | 84.29%, t=0.14 | 81.29%, t=1.33
27 (80%) | 84.34%, t=0.16 | 84.34%, t=0.14 | 83.97%, t=0.36 | 83.54%, t=0.74 | 84.34%, t=0.22 | 82.43%, t=1.75
34 (100%) | 84.57%, t=0.00 | 84.57%, t=0.00 | 84.57%, t=0.00 | 84.57%, t=0.00 | 84.57%, t=0.00 | 84.57%, t=0.00


• The proposed feature selection method using R*SM is:

• Faster

• More accurate (as measured by RBFNNs trained with the selected feature subsets)

• Scalable to a large number of features

• Scalable to a large number of samples

• Scalable to a large number of classes

• The LOO method, which here uses 5-fold CV to estimate the generalization error, performs worse than the other methods. This indicates that it suffers from the effect of reserving 20% of the training samples for validation.

• The Similarity method may remove relevant features.

• The Separability method may not be suitable for problems where samples from several classes overlap

Experimental Comparisons of Feature Selection Methods

Overall comments on the feature selection experimental results


Conclusion

• Proposed a new look at the generalization error – the localized generalization error (R*SM)

• Demonstrated the application of R*SM in selecting the optimal number of hidden neurons for an RBFNN

• Demonstrated the application of R*SM in feature reduction for supervised pattern classification problems.


Recent Publications Related to This Talk

• Yeung, Ng, et al., "Localized Generalization Error and Its Application to Architecture Selection for Radial Basis Function Neural Network", IEEE Trans. on Neural Networks, Oct. 2007.

• Ng, Yeung, et al., "Feature Selection Using Localized Generalization Error for Supervised Classification Problems for RBFNN", submitted to Pattern Recognition, 2007.

• Ng, Dorado, Yeung, et al., "Image Classification with the use of Radial Basis Function Neural Networks and the Minimization of Localized Generalization Error", Pattern Recognition, pp. 19-32, 2007.

• Ng, Yeung and Tsang, "The Localized Generalization Error Model for Single Layer Perceptron Neural Network and Sigmoid Support Vector Machine", submitted to International Journal of Pattern Recognition and Artificial Intelligence, 2007.

• Ng, Yeung, et al., "Localized Generalization Error of Gaussian Based Classifiers and Visualization of Decision Boundaries", Soft Computing, pp. 375-381, 2007.

• Y. Wang, Zeng, Yeung and Peng, "Computation of Madalines' Sensitivity to Input and Weight Perturbations", to appear in Neural Computation.

• Shi, Yeung and Gao, "Sensitivity Analysis Applied to the Construction of Radial Basis Function Networks", Neural Networks, 2005.

• Ng and Yeung, "Selection of Weight Quantisation Accuracy for Radial Basis Function Neural Network Using Stochastic Sensitivity Measure", IEE Electronic Letters, vol. 39, pp. 787-789, 2003.

• Zeng and Yeung, "A Quantified Sensitivity Measure for Multilayer Perceptron to Input Perturbation", Neural Computation, 2003.

• Yeung and Sun, "Using Function Approximation to Analyze the Sensitivity of MLP with Antisymmetric Squashing Activation Function", IEEE Transactions on Neural Networks, vol. 13, no. 1, pp. 34-44, Jan. 2002.

• Zeng and Yeung, "Sensitivity Analysis of Multilayer Perceptron to Input and Weight Perturbations", IEEE Trans. on Neural Networks, 2001.

