Page 1: Recent Development in Generalization Error for Supervised ...pages.cpsc.ucalgary.ca/~rokne/stuff/MOVIE/Microsoft PowerPoint... · A New Look at GEM Training Set Q-Union Entire Input

Recent Development in Generalization Error for Supervised Learning Problems

with Applications in Model and Feature Selection

Calgary University, 22 April 2008

Daniel S. Yeung, Wing W. Y. Ng

IEEE Systems, Man & Cybernetics Society, USA; Machine Learning & Cybernetics Research Institute, Hong Kong

MiLeS Computing Lab Shenzhen Graduate School, Harbin Institute of Technology

Page 2

Presentation Outline

• Generalization Error Model (GEM)
• A New Look at the GEM
• Applications
  • Neural Network Architecture Selection
  • Feature Selection for Supervised Learning

Page 3

Pattern Classification Problem

• The artificial dataset 1

[Figure: Artificial Dataset 1 — training samples in the input space; o: training sample in Class 1, x: training sample in Class 2]

Page 4

Pattern Classification Problem

[Figure: Artificial Dataset 1 — training samples in the input space (o: training sample in Class 1, x: training sample in Class 2), together with future unknown samples shown as red points]

How good is the classifier when future unseen samples are presented to it?

Page 5

Pattern Classification Problem

• Train a classifier to approximate the unknown input-output system using the training dataset.
• For neural networks, this is usually done by minimizing the MSE between the network outputs and the desired outputs on the training dataset --- the Training Error.
• R_emp: the average error over the finite training dataset

  R_emp = (1/l) ∑_{b=1}^{l} ( f(X_b) − F(X_b) )²    (1)

  where l, f and F denote the number of training samples, the classifier output and the desired output respectively.

V. Vapnik, "Statistical Learning Theory", Wiley, 1998

R.O. Duda, P.E. Hart and D.G. Stork, "Pattern Classification", Wiley, 2001
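As a minimal sketch (not from the talk), Eq. (1) is just the mean of squared output differences over the training set. `f_out` and `F_out` are hypothetical stand-ins for the classifier outputs f(X_b) and desired outputs F(X_b):

```python
def empirical_error(f_out, F_out):
    """Eq. (1): R_emp = (1/l) * sum_b (f(X_b) - F(X_b))^2 over the l training samples."""
    l = len(f_out)
    return sum((f - F) ** 2 for f, F in zip(f_out, F_out)) / l

# Toy example: 4 training samples with desired outputs in {0, 1}.
r_emp = empirical_error([0.1, 0.9, 0.2, 0.8], [0, 1, 0, 1])
```

A perfectly fitted training set gives R_emp = 0, which, as the following slides stress, says nothing yet about unseen samples.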

Page 6

Pattern Classification Problem

• Instead of minimizing the training error R_emp only, the ultimate goal of classifier training is to correctly predict the class / category of future unseen samples.
• R_true: the expected error over all samples

  R_true = ∫ ( f(X) − F(X) )² p(X) dX    (2)

• R_gen: the generalization error for unseen samples

  R_gen = R_true − R_emp    (3)

• R_gen is not computable; it can only be estimated by
  • Empirical Methods
  • Analytical Models

V. Vapnik, "Statistical Learning Theory", Wiley, 1998

R.O. Duda, P.E. Hart and D.G. Stork, "Pattern Classification", Wiley, 2001
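Eq. (2) can be approximated by sampling from p(X). A sketch assuming a toy 1-D problem with p(X) = Uniform(0, 1) and a hypothetical threshold classifier (nothing here comes from the talk's experiments):

```python
import random

def f(x):   # toy classifier: a threshold rule fitted to the training data
    return 1.0 if x > 0.5 else 0.0

def F(x):   # the true (unknown) input-output system
    return 1.0 if x > 0.4 else 0.0

random.seed(0)
train = [random.random() for _ in range(20)]
# Eq. (1): average error over the finite training set.
r_emp = sum((f(x) - F(x)) ** 2 for x in train) / len(train)

# Eq. (2) approximated by Monte Carlo over p(X) = Uniform(0, 1);
# the classifier is wrong exactly on the interval (0.4, 0.5].
big = [random.random() for _ in range(100_000)]
r_true = sum((f(x) - F(x)) ** 2 for x in big) / len(big)

r_gen = r_true - r_emp   # Eq. (3)
```

The estimate of R_true lands near 0.1 (the mass of the misclassified interval), while R_emp depends only on which training points happened to fall there.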

Page 7

GEM

• R_gen is not computable, only estimated
  • Empirical Models
    • K-fold Cross-Validation (CV)
    • Leave-One-Out Cross-Validation (LOOCV)
  • Analytical Models
    • Information Criteria
    • VC-Dimension based

Page 8

K-fold CV and LOOCV

• Main advantage of CV -- using test points with known classification outputs
• Major drawbacks
  • Selection of K, usually K = 5 or 10
  • Very time-consuming
  • Variance of the CV errors
    • The classifier architecture yielding the lowest average CV error may not lead to the individual classifier yielding the lowest generalization error
  • Provides an estimate of the average R_true instead of an upper bound

T. Hastie et al, "The Elements of Statistical Learning", Springer, 2001
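The K-fold procedure above can be sketched as follows; the nearest-class-mean learner is a hypothetical stand-in for the slides' RBFNN:

```python
def kfold_indices(n, k):
    """Split sample indices 0..n-1 into k contiguous folds."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cv_error(X, y, k, train_fn, err_fn):
    """Average validation error over k folds (the CV estimate of R_true)."""
    errs = []
    for fold in kfold_indices(len(X), k):
        hold = set(fold)
        Xtr = [x for i, x in enumerate(X) if i not in hold]
        ytr = [t for i, t in enumerate(y) if i not in hold]
        model = train_fn(Xtr, ytr)
        errs.append(err_fn(model, [X[i] for i in fold], [y[i] for i in fold]))
    return sum(errs) / k

# Stand-in learner: store the class means; predict the class of the nearer mean.
def train_fn(X, y):
    m0 = sum(x for x, t in zip(X, y) if t == 0) / max(1, sum(1 for t in y if t == 0))
    m1 = sum(x for x, t in zip(X, y) if t == 1) / max(1, sum(1 for t in y if t == 1))
    return (m0, m1)

def err_fn(model, X, y):
    m0, m1 = model
    pred = [0 if abs(x - m0) <= abs(x - m1) else 1 for x in X]
    return sum(p != t for p, t in zip(pred, y)) / len(y)

X = [0.0, 0.1, 0.2, 0.3, 0.7, 0.8, 0.9, 1.0]
y = [0, 0, 0, 0, 1, 1, 1, 1]
err = cv_error(X, y, 4, train_fn, err_fn)
```

Note the drawback listed above is visible in the structure: `train_fn` runs k times per candidate architecture, which is what makes full CV-based model selection so time-consuming.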

Page 9

Analytical Models

• Analytical models give a bound or an estimate for R_gen
• According to the Bias and Variance Dilemma, the MSE decomposes into a squared bias term and a variance term.
  • The squared bias term describes how well the classifier approximates the real input-output mapping.
  • The variance term describes how complex the classifier is.
• The best-generalizing classifier strikes a good balance between bias and variance.

S. Geman et al, "Neural Networks and the Bias/Variance Dilemma", Neural Computation, 1992.
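The decomposition can be checked numerically. A sketch assuming a toy estimator (the sample mean of noisy observations at one fixed input point) rather than a neural network:

```python
import random

random.seed(1)
F_true = 0.7                      # desired output at a fixed input point
n_train, n_repeats = 10, 5_000

# Each simulated training set yields one fitted output: the mean of n noisy draws.
outputs = []
for _ in range(n_repeats):
    sample = [F_true + random.gauss(0, 0.5) for _ in range(n_train)]
    outputs.append(sum(sample) / n_train)

mean_out = sum(outputs) / n_repeats
mse      = sum((o - F_true) ** 2 for o in outputs) / n_repeats
bias_sq  = (mean_out - F_true) ** 2
variance = sum((o - mean_out) ** 2 for o in outputs) / n_repeats
# The identity MSE = bias^2 + variance holds exactly for these empirical averages.
```

Here the estimator is unbiased, so nearly all of the MSE is variance; a more complex model would typically shrink the bias term while inflating the variance term.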

Page 10

Analytical Models

• Analytical models give an estimate or an upper bound for R_gen
• Recall that R_true = R_gen + R_emp
• Training classifiers:
  • Minimize only R_emp (slides 12-14)
  • Minimize the complexity of the classifier (h) in the analytical models while fixing the other terms for a given training dataset (slides 15-17)
  • More desirable to minimize both R_emp and h (slides 18-19)

Page 11

Analytical Models

• The artificial dataset 1

[Figure: Artificial Dataset 1 — training samples in the input space; o: training sample in Class 1, x: training sample in Class 2]

Page 12

Analytical Models

• Classifier trained by minimizing R_emp only
  • Easily over-fits; high R_gen

[Figure: Artificial Dataset 1 — an over-fitted classifier boundary wrapped around the training samples (o: Class 1, x: Class 2), shown with unseen testing samples]

Page 13

Analytical Models

• Classifier trained by minimizing R_emp only
  • Easily over-fits; high R_gen

[Figure: the over-fitted classifier boundary on Artificial Dataset 1, with unseen testing samples falling on the wrong side]

Page 14

Analytical Models

• Classifier trained by minimizing R_emp only
  • Easily over-fits; high R_gen

[Figure: the over-fitted classifier boundary on Artificial Dataset 1, continued]

Page 15

Analytical Models

• Classifier trained by minimizing complexity only
  • Easily under-fits by over-minimizing the complexity

[Figure: Artificial Dataset 1 — an overly simple classifier boundary that under-fits the training samples]

Page 16

Analytical Models

• Classifier trained by minimizing h only
  • Easily under-fits by over-minimizing the complexity (h)

[Figure: Artificial Dataset 1 — the under-fitted classifier boundary, with unseen testing samples]

Page 17

Analytical Models

• Classifier trained by minimizing h only
  • Easily under-fits by over-minimizing the complexity (h)

[Figure: the under-fitted classifier boundary on Artificial Dataset 1, continued]

Page 18

Analytical Models

• Classifier trained by minimizing both R_emp and h

[Figure: Artificial Dataset 1 — a balanced classifier boundary obtained by minimizing both R_emp and h]

Page 19

Analytical Models

• Classifier trained by minimizing both R_emp and h

[Figure: the balanced classifier boundary on Artificial Dataset 1, shown with unseen testing samples]

Page 20

A New Look at the Generalization Error Model (GEM)

• Predicting samples far away from the training samples – the classification result is meaningless or misleading
• Question: Better to ignore them?

D. Chakraborty and N. Pal, "A Novel Training Scheme for Multilayer Perceptrons to Realize Proper Generalization and Incremental Learning", IEEE Trans. NN, 2003

B. Scholkopf et al, "Estimating the Support of a High-Dimensional Distribution", Neural Computation, 2001

Page 21

A New Look at GEM

• Many applications assume that unseen samples similar to the training samples are more relevant
  • Tumor recognition in medical images
  • Disease diagnosis
  • Fingerprint recognition
  • Speaker recognition via speech
  • Pattern-based financial time series prediction
  • Web site / page categorization
  • Etc.

Page 22

A New Look at GEM

• Recall Eq. (2):

  R_true = ∫ ( f(X) − F(X) )² p(X) dX    (2)

• Assume unseen samples very different from the training samples are to be ignored. Find a bound on the generalization error for unseen samples close to the training samples only, i.e. within a neighborhood S_Q. (Note: SM denotes Sensitivity Measure.)

• Relationship between R_true and R_SM:

  R_true = ∫_{X∈S_Q} ( f(X) − F(X) )² p(X) dX + ∫_{X∉S_Q} ( f(X) − F(X) )² p(X) dX
         = R_SM + R_res    (4)

Page 23

A New Look at GEM

• Q-neighborhood of a training sample
  • The set of unseen samples within distance Q of the training sample
• R_SM: the generalization error for unseen samples in the union of all Q-neighborhoods, i.e. the Q-Union S_Q
• When Q approaches infinity, S_Q approaches the entire input space and R_res vanishes.
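Membership in the Q-Union can be sketched as a simple distance test; Euclidean distance and the sample coordinates below are assumptions for illustration, not from the slides:

```python
def in_q_union(x, training_samples, Q):
    """True iff x lies within distance Q of at least one training sample, i.e. x is in S_Q."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    return any(dist(x, t) <= Q for t in training_samples)

train = [(0.0, 0.0), (1.0, 1.0)]
```

As Q grows, every point eventually passes the test, which is the sense in which S_Q approaches the entire input space.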

Page 24

A New Look at GEM

• Q-neighborhood

[Figure: Artificial Dataset 1 — a Q-neighborhood of radius Q drawn around a single training sample in the input space]

Page 25

A New Look at GEM

• Union of Q-neighborhoods (S_Q)

[Figure: Artificial Dataset 1 — the Q-Union S_Q covering the neighborhoods of all training samples]

Page 26

A New Look at GEM: The localized GEM ---- R*_SM

• Training Set ⊆ Q-Union ⊆ Entire Input Space
• R_emp ≤ R_SM (the error over the training samples plus the unseen samples in S_Q) ≤ R_true, and R_SM → R_true as Q → ∞
• With probability at least (1 − η),

  R_SM ≤ ( √R_emp + √(E_S((Δy)²)) + A )² + ε = R*_SM(Q)    (5)

  where A = max(F(X)) − min(F(X)) and B = max((f(X) − F(X))²) are constants, ε = B √(−ln η / (2l)), η denotes the confidence of the bound holding true, and l denotes the number of training samples.

• R*_SM is an upper bound on the average prediction MSE for unseen samples in S_Q.
• R*_SM varies with R_emp, the Sensitivity Measure (SM) and Q.

Yeung, Ng, et al., "Localized Generalization Error and Its Application to Architecture Selection for Radial Basis Function Neural Network", IEEE TNN, Oct. 2007

Ng & Yeung, "Selection of Weight Quantisation Accuracy for Radial Basis Function Neural Network Using Stochastic Sensitivity Measure", IEE Electronic Letters, 2003

Page 27

A New Look at GEM: The localized GEM ---- R*_SM

• With probability at least (1 − η),

  R_SM ≤ ( √R_emp + √(E_S((Δy)²)) + A )² + ε = R*_SM(Q)

• R_emp denotes the training error
  • Indicates how well the classifier learns from the training dataset
• E_S((Δy)²) denotes the SM, which describes the classifier output differences between samples located within the Q-Union and the training point q
  • Indicates how complex the classifier is
• A and ε are constants describing the training dataset

Page 28

A New Look at GEM: SM

• Δy = f(*) − f(q), where q is a training sample and * an unseen sample in its Q-neighborhood
• SM = E_S((Δy)²)
• The SM for samples in a Q-neighborhood is the average of the squares of the classifier output differences between the training sample q and the unseen samples in the Q-neighborhood.

[Figure: a Q-neighborhood of radius Q around training sample q, with unseen samples * at perturbations ΔX from q]

1. Yeung, Ng et al, "Localized Generalization Error and Its Application to Architecture Selection for Radial Basis Function Neural Network", IEEE TNN, Oct. 2007
2. Ng, Dorado, Yeung et al, "Image Classification with the use of Radial Basis Function Neural Networks and the Minimization of Localized Generalization Error", Pattern Recognition, 2007
3. Ng and Yeung, "Selection of Weight Quantisation Accuracy for Radial Basis Function Neural Network Using Stochastic Sensitivity Measure", IEE Electronic Letters, 2003
4. Yeung and Sun, "Using Function Approximation to Analyze the Sensitivity of MLP with Antisymmetric Squashing Activation Function", IEEE Trans. Neural Networks, 2002
5. Zeng and Yeung, "Sensitivity Analysis of Multilayer Perceptron to Input and Weight Perturbations", IEEE Trans. NN, 2001
6. Zeng and Yeung, "A Quantified Sensitivity Measure for Multilayer Perceptron to Input Perturbation", Neural Computation, 2003
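The SM can be estimated by Monte Carlo over a Q-neighborhood. A sketch assuming perturbations ΔX drawn uniformly from [−Q, Q] in each dimension and a toy linear classifier output (the cited papers derive closed-form expressions for RBFNN and MLP instead):

```python
import random

def sensitivity_measure(f, q, Q, n_dims, n_samples=20_000, seed=0):
    """Monte Carlo estimate of SM = E[(f(q + dX) - f(q))^2], dX ~ Uniform[-Q, Q]^n_dims."""
    rng = random.Random(seed)
    y_q, total = f(q), 0.0
    for _ in range(n_samples):
        dx = [rng.uniform(-Q, Q) for _ in range(n_dims)]
        total += (f([qi + di for qi, di in zip(q, dx)]) - y_q) ** 2
    return total / n_samples

# Toy classifier output: a linear unit f(x) = 2*x1 + x2.
f = lambda x: 2 * x[0] + x[1]
sm = sensitivity_measure(f, [0.5, 0.5], Q=0.1, n_dims=2)
# Analytically E[(2*d1 + d2)^2] = (4 + 1) * Q^2 / 3, so sm should be near 0.05/3.
```

A steeper (more complex) output surface around q gives a larger SM, which is exactly how the SM term in R*_SM penalizes complexity.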

Page 29

Neural Network

Architecture Selection

Using R*SM

Page 30

Neural Network Architecture Selection

Existing Selection Methods

• K-fold CV
  • The most widely used method
  • Select the RBFNN yielding the smallest average CV error
• Sequential Learning
  • Start with 1 hidden neuron
  • Add one more hidden neuron if the stopping criterion is not satisfied (e.g. training error smaller than a threshold)
  • For our experiments, two criteria are used:
    • MSE < 0.025
    • The highest training accuracy

Page 31

Neural Network Architecture Selection

Existing Selection Methods

• Two Ad-hoc Methods
  • Select the number of hidden neurons to be the square root of the number of training samples
    • Is it reasonable to use more hidden neurons to solve the same problem just because we have more training samples?
  • Select the number of hidden neurons to be the number of training samples
    • Easily overfits
    • Good for regression problems

S. Haykin, "Neural Networks", Prentice Hall, 1998

Page 32

Neural Network Architecture Selection Using R*SM

• Formulation of an optimization problem for selecting the number of hidden neurons for an RBFNN
  • The center positions and widths of the hidden neurons can be determined by an automatic clustering algorithm, e.g. k-means, fuzzy c-means, hierarchical clustering, etc.
  • Determining the connection weights for an RBFNN with a fixed number of hidden neurons is well studied.
  • Our method and the GEM do not depend on the learning algorithm.
• Concentrate on selecting the number of hidden neurons (M)
  • The range of M could be from 1 to the number of training samples

Page 33

Neural Network Architecture Selection Using R*SM

• The optimization problem:

  max_{M ∈ [1, l]} Q    s.t. R*_SM(Q) ≤ a    (6)

• One can convert problem (6) into an unconstrained optimization problem:

  max_{M ∈ [1, l]} h(M, Q*)    (7)

  where h(M, Q*) = 0 if R_emp ≥ a, and h(M, Q*) = Q* otherwise, and Q* is the minimum real solution of Eq. (8), a fourth-order polynomial equation in Q. Its coefficients involve the target bound a, the training error R_emp, the constants A and ε, the number of features N, and, for each of the M hidden neurons, kernel terms φ_j and v_j computed from the j-th center u_j = (u_j1, …, u_jN)′, together with the mean μ_xi and variance σ²_xi of the i-th input feature.

Maximal Coverage Classification problem with Selected Generalization error bound (MC2SG)

Page 34

Neural Network Architecture Selection Using R*SM

• Experimental Results
  • We compare our proposed method with 5-fold and 10-fold CV, sequential learning and two ad-hoc methods
  • The sequential learning adds hidden neurons until a pre-selected criterion is satisfied
    • Sequen_MSE – training MSE is lower than 0.025
    • Sequen_01 – highest training classification accuracy
  • 8 datasets are used and we perform 10 independent runs on each of them
  • Every dataset is split into two halves randomly
    • Training dataset
    • Testing dataset – not involved in training

Page 35

Neural Network Architecture Selection Using R*SM

• Average Testing Accuracies (%)

Datasets\Methods | MC2SG | 5-CV  | 10-CV | Squen_MSE | Squen_01 | SQRT(l) | l
Ionosphere       | 84.71 | 83.29 | 83.40 | 82.06     | 79.94    | 84.11   | 48.17
Sonar Target     | 83.20 | 80.49 | 80.87 | 81.17     | 78.25    | 76.50   | 82.62
Iris             | 97.87 | 96.53 | 96.07 | 96.27     | 96.13    | 97.77   | 83.60
Hepatitis        | 82.99 | 80.26 | 79.35 | 77.20     | 77.66    | 77.27   | 74.42
Breast Cancer    | 97.29 | 96.99 | 96.92 | 97.10     | 96.43    | 97.26   | 93.18
Wine Recognition | 93.18 | 90.06 | 91.19 | 91.14     | 91.82    | 90.57   | 93.18
Credit Approval  | 88.87 | 87.06 | 86.13 | 79.77     | 88.84    | 88.34   | 39.45
Thyroid Gland    | 86.82 | 80.65 | 77.38 | 84.30     | 85.14    | 86.54   | 85.79

Page 36

Neural Network Architecture Selection Using R*SM

• McNemar Test Values over 10 runs
  • MC2SG performs statistically significantly better than another method if the McNemar Test Value is larger than 2.71
  • The differences between most methods are insignificant on the Iris dataset because the number of samples is too small.

Datasets\Methods | 5-CV  | 10-CV | Squen_MSE | Squen_01 | SQRT(l) | l
Ionosphere       | 43.18 | 42.48 | 6.88      | 31.08    | 0.57    | 582.20
Sonar Target     | 12.67 | 12.12 | 3.80      | 15.45    | 31.39   | 1.16
Iris             | 1.39  | 2.08  | 1.65      | 1.72     | 0.05    | 87.23
Hepatitis        | 9.26  | 9.94  | 35.89     | 34.20    | 36.74   | 50.67
Breast Cancer    | 6.33  | 6.90  | 5.79      | 16.33    | 0.05    | 999.34
Wine Recognition | 9.12  | 3.14  | 3.45      | 2.79     | 4.69    | 0.01
Credit Approval  | 12.67 | 24.43 | 176.89    | 0.28     | 2.50    | 1399.40
Thyroid Gland    | 40.09 | 75.00 | 14.40     | 7.36     | 1.29    | 6.00
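A McNemar value like those above can be computed from paired predictions on the same test samples. A sketch using the common continuity-corrected statistic (assumed here; the slides do not state which variant they used):

```python
def mcnemar_value(pred_a, pred_b, truth):
    """McNemar test statistic (with continuity correction) for two classifiers
    evaluated on the same samples.
    n01: samples A gets right and B gets wrong; n10: the reverse."""
    n01 = sum(a == t and b != t for a, b, t in zip(pred_a, pred_b, truth))
    n10 = sum(a != t and b == t for a, b, t in zip(pred_a, pred_b, truth))
    if n01 + n10 == 0:
        return 0.0
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)

# Hypothetical predictions: classifier A is wrong on 2 samples, B on 12.
truth  = [1] * 20
pred_a = [1] * 18 + [0] * 2
pred_b = [1] * 8 + [0] * 12
val = mcnemar_value(pred_a, pred_b, truth)
```

A value above the slides' 2.71 threshold is read as a statistically significant difference between the two classifiers.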

Page 37

Neural Network Architecture Selection Using R*SM

• Average Number of Hidden Neurons Required

Datasets\Methods | MC2SG | 5-CV | 10-CV | Squen_MSE | Squen_01 | SQRT(l) | l
Ionosphere       | 13.6  | 33.3 | 26.6  | 77.2      | 94.9     | 13.0    | 176.0
Sonar Target     | 22.9  | 2.4  | 47.3  | 55.7      | 48.5     | 10.0    | 105.0
Iris             | 6.2   | 13.5 | 9.9   | 22.3      | 26.7     | 9.0     | 75.0
Hepatitis        | 6.3   | 7.0  | 4.8   | 42.7      | 42.9     | 9.0     | 78.0
Breast Cancer    | 2.1   | 25.1 | 27.8  | 2.0       | 42.8     | 19.0    | 350.0
Wine Recognition | 7.2   | 33.7 | 38.7  | 36.1      | 24.3     | 9.0     | 90.0
Credit Approval  | 13.9  | 11.0 | 10.8  | 36.1      | 44.0     | 19.0    | 344.0
Thyroid Gland    | 6.7   | 16.2 | 11.7  | 5.1       | 38.2     | 10.0    | 108.0

Page 38

Neural Network Architecture Selection Using R*SM

• Total Time Required (seconds)

[Table: per-dataset total training time in seconds for MC2SG, 5-CV, 10-CV, Squen_MSE, Squen_01, SQRT(l) and l; the individual values are not cleanly recoverable from the source. The CV methods required by far the most time.]

Page 39

Neural Network Architecture Selection Using R*SM

• Experiments show that the RBFNN selected using R*_SM, i.e. MC2SG, performs the best and yields the best testing accuracy for unseen samples
• The time and the number of RBFNN trainings required are both low compared to the CV methods
• The number of hidden neurons in the selected RBFNNs is the smallest among the RBFNNs yielding the best testing accuracies

Page 40

Neural Network

Feature Selection

Using R*SM

Page 41

Feature Selection Problem

• Objective of Feature Selection
  • Find a reduced feature subset for a classifier such that the generalization capability of the classifier does not decrease in comparison with using the full set of features
• Other objectives
  • Low computational complexity
  • Scalable to a large number of features
  • Scalable to a large number of samples
  • Universal to any type of classifier
  • Meaningful feature subset selection

W.W.Y. Ng, D.S. Yeung et al., "Feature Selection Using Localized Generalization Error for Supervised Classification Problems Using RBFNN", Submitted to Pattern Recognition.

Page 42

Feature Selection Methods

[Diagram: taxonomy of feature selection methods — Filter methods (Correlation, Similarity, Separability, Mutual Information: selection based indirectly on training error), Wrapper methods (Leave-One-Out, Generalization Error: selection based on generalization error estimation), plus Hybrid and Embedded methods]

Methods in the RED region do not use error as the selection criterion directly. However, they indirectly involve the training error in the selection. E.g. the separability criterion tries to keep or reduce the error on the training samples.

Page 43

Feature Selection Method: Correlation

• Correlation Between Features and Desired Output
  • Measures the simplest, linear correlation between a feature and the desired output
  • A relevant feature yields a high linear correlation with the desired output
    • E.g. the value of the feature increases when the desired output increases
    • E.g. X1 < 0.5 for class 1, X1 > 0.5 for class 2
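A correlation filter of this kind can be sketched as follows; the data are hypothetical:

```python
def pearson(xs, ys):
    """Pearson linear correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def rank_by_correlation(features, target):
    """Rank feature indices by |Pearson correlation| with the desired output, best first."""
    scored = [(abs(pearson(col, target)), i) for i, col in enumerate(features)]
    return [i for _, i in sorted(scored, reverse=True)]

# x0 tracks the output; x1 behaves like noise.
x0 = [0.1, 0.2, 0.3, 0.8, 0.9, 1.0]
x1 = [0.5, 0.1, 0.9, 0.2, 0.8, 0.3]
y  = [0, 0, 0, 1, 1, 1]
order = rank_by_correlation([x0, x1], y)
```

The filter's stated weakness shows up immediately: a feature with a strong but nonlinear relationship to the output can still score near zero.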

Page 44

Feature Selection Method: Mutual Information

• Mutual Information
  • Measures the nonlinear correlation between features and the desired output
  • Measures the difference between the uncertainty of the output with and without an input feature
  • If adding the input feature Xi reduces more of the uncertainty of the output, then Xi is considered more relevant.
  • Features are sorted in ascending order of mutual information, and the one yielding the least mutual information is removed, until a pre-selected number of features is reached.
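For discrete feature values, the MI criterion can be sketched via I(X;Y) = H(Y) − H(Y|X); the high-dimensional density estimation noted later as a drawback is sidestepped here by using toy discrete data:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy H of a label sequence, in bits."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def mutual_information(xs, ys):
    """I(X;Y) = H(Y) - H(Y|X) for discrete values, in bits."""
    n = len(xs)
    h_y_given_x = 0.0
    for v in set(xs):
        ys_v = [y for x, y in zip(xs, ys) if x == v]
        h_y_given_x += len(ys_v) / n * entropy(ys_v)
    return entropy(ys) - h_y_given_x

# x0 determines y exactly (1 bit of MI); x1 is independent of y (0 bits).
x0 = [0, 0, 1, 1, 0, 0, 1, 1]
x1 = [0, 1, 0, 1, 0, 1, 0, 1]
y  = [0, 0, 1, 1, 0, 0, 1, 1]
mi0 = mutual_information(x0, y)
mi1 = mutual_information(x1, y)
```

Under the slides' scheme, x1 would be the first feature removed, since it carries the least mutual information with the output.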

Page 45

Feature Selection Method: Similarity Measure

• Similarity Measure
  • Finds the similarity between features; only one feature is selected among a group of similar features with sim(a, b) < a given threshold
  • If similar features are collected, use only one of them instead of all
  • E.g. the volume of a fish and the weight of a fish are similar and provide no more information if we keep both of them. Thus we may select either one of them.

Page 46

Feature Selection Method: Separability Measure

• Separability Measure
  • E.g. in a two-class problem, if the samples of the two classes are separated very well in X2 while mixed together in X1, then X2 yields better separability and we say it is the more relevant feature.

Page 47

Feature Selection Method: Leave-One-Out Method

• Leave-One-Out Method
  • First, a classifier is trained using all N features (the full set of features)
  • N classifiers are trained: the i-th classifier is trained without the i-th feature
  • The i-th feature is removed if the i-th classifier yields the best accuracy on a validation dataset
  • Continue until the accuracy drops significantly

J. Weston, et al. "Feature Selection for SVMs", NIPS, Vol. 13, 2001
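One round of this scheme can be sketched as below; the nearest-class-mean learner is a hypothetical stand-in, and for brevity accuracy is measured on the training set where the slides use a held-out validation set:

```python
def leave_one_feature_out(X, y, n_features, train_fn, acc_fn):
    """One round of the leave-one-out scheme: train n_features classifiers, the i-th
    without feature i; report the feature whose removal gives the best accuracy."""
    best_acc, best_i = -1.0, None
    for i in range(n_features):
        Xi = [[v for j, v in enumerate(row) if j != i] for row in X]
        model = train_fn(Xi, y)
        acc = acc_fn(model, Xi, y)
        if acc > best_acc:
            best_acc, best_i = acc, i
    return best_i, best_acc

# Stand-in learner: class means; predict the class of the nearer mean.
def train_fn(X, y):
    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    return (mean([r for r, t in zip(X, y) if t == 0]),
            mean([r for r, t in zip(X, y) if t == 1]))

def acc_fn(model, X, y):
    m0, m1 = model
    d = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    pred = [0 if d(r, m0) <= d(r, m1) else 1 for r in X]
    return sum(p == t for p, t in zip(pred, y)) / len(y)

# Feature 0 is informative; feature 1 is pure noise that hurts the classifier.
X = [[0.0, 9.0], [0.1, 0.0], [0.2, 9.0], [0.9, 0.0], [1.0, 9.0], [1.1, 0.0]]
y = [0, 0, 0, 1, 1, 1]
removed, acc = leave_one_feature_out(X, y, 2, train_fn, acc_fn)
```

The noise feature is the one whose removal maximizes accuracy, so it is the one discarded in this round; the cost is one full classifier training per candidate feature.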

Page 48

Feature Selection Method: Leave-One-Out Method

• Leave-One-Out Method
  • To estimate the generalization error of leaving out a feature, 5-fold or 10-fold cross-validation is adopted
  • E.g., in 5-fold cross-validation, 20% of the training samples are reserved for validation
  • The average of the validation errors of these 5 trained classifiers is used as the CV error
  • The feature yielding the smallest drop in CV error is removed
  • However, the training datasets change dramatically if the total number of samples is not large enough

T. Hastie, R. Tibshirani and J. Friedman, "The Elements of Statistical Learning", Springer, 2001

Page 49

Feature Selection Method: Using the Generalization Error Model

• Using the Generalization Error Model (GEM)
  • In the previous slides, we introduced methods that select features using an empirical estimation of the generalization error of removing a feature.
  • However, they are very time consuming and infeasible for large datasets.
  • Use analytical error bounds to replace the empirical estimation of the generalization error.
  • In this talk, we demonstrate the use of the Localized Generalization Error Bound (R*_SM) in Feature Selection

Page 50

Feature Selection Methods

• Correlation
  • Time: Small
  • Considers accuracy: Indirectly
  • Major advantage: Easy to compute
  • Major disadvantage: Not suitable for problems with a nonlinear relationship
  • Large # of samples: Suitable; Large # of features: Suitable
• MI
  • Time: Medium
  • Considers accuracy: Indirectly
  • Major advantage: Captures the nonlinear relationship between the desired output and the features
  • Major disadvantage: Requires computing a high-dimensional joint probability density function
  • Large # of samples: Suitable; Large # of features: Suitable
• Separability
  • Time: Small
  • Considers accuracy: Indirectly
  • Major advantage: Easy to compute and related to the accuracy on the training samples
  • Major disadvantage: Cannot deal with problems where the two classes of samples share the same mean or overlap
  • Large # of samples: Suitable; Large # of features: Suitable
• Similarity
  • Time: Medium
  • Considers accuracy: Indirectly
  • Major advantage: Fast; also applicable to unsupervised problems
  • Major disadvantage: A relevant feature similar to other features will also be removed
  • Large # of samples: Suitable; Large # of features: Suitable
• R*_SM
  • Time: Small
  • Considers accuracy: Generalization Error
  • Major advantage: Makes use of an estimation of the generalization error and does not require a large amount of time-consuming classifier training
  • Major disadvantage: Based on the estimation of the generalization error only
  • Large # of samples: Suitable; Large # of features: Suitable
• LOO
  • Time: Very Large
  • Considers accuracy: Generalization Error
  • Major advantage: Generalization error is the selection criterion
  • Major disadvantage: Extremely time consuming; requires a huge amount of classifier training
  • Large # of samples: Not suitable; Large # of features: Not suitable


Feature Selection Using R*SM

• RSMFS Method
• We evaluate a feature based on its contribution to R*SM, i.e., the generalization error bound.
• R*SM(xi) evaluates the generalization error bound of the classifier using the same set of unseen samples as R*SM, but with the ith feature held constant
• at the mean of its values in the training set
• If the values of a feature can be replaced by a constant, we may ignore this feature and remove it.

where xi denotes the ith input feature
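The idea of scoring a feature by holding it at its training-set mean can be illustrated with a simple proxy. The real R*SM bound perturbs inputs inside the Q-neighbourhood; the mean-squared output change used in `feature_sensitivity` below is only a schematic stand-in for |R*SM − R*SM(xi)|, and the toy model is hypothetical:

```python
import numpy as np

def feature_sensitivity(predict, X, i):
    """Proxy for |R*SM - R*SM(x_i)|: mean squared change in the classifier
    output when feature i is clamped to its training-set mean.
    (Schematic stand-in -- the real bound perturbs inputs inside the
    Q-neighbourhood; here we only clamp the one feature.)"""
    X_clamped = X.copy()
    X_clamped[:, i] = X[:, i].mean()
    return float(np.mean((predict(X) - predict(X_clamped)) ** 2))

# toy model whose output depends on feature 0 only
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
predict = lambda Z: np.tanh(2.0 * Z[:, 0])

s = [feature_sensitivity(predict, X, i) for i in range(3)]
print(s)   # clamping the irrelevant features changes nothing
```

A feature whose clamping leaves the outputs unchanged is exactly the kind of feature the slide proposes to remove.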


Feature Selection Using R*SM

• RSMFS Method
• The feature selection step is formulated as

    arg min_{xi ∈ CFS} | R*SM(Q) − R*SM(Q({xi} ∪ IFS)) |        (8)

  where CFS and IFS denote the candidate feature set and the irrelevant feature set, respectively, and R*SM(Q({xi} ∪ IFS)) denotes the R*SM computed without perturbing the features in {xi} ∪ IFS

• Initially, CFS = the full set of features and IFS = the empty set

• To avoid evaluating all possible feature subsets, a heuristic forward search is adopted.


Feature Selection Using R*SM

• RSMFS Method
1. Let CFS be the Candidate Feature Set, initially equal to the full set of features
2. Build a classifier using CFS
3. Compute R*SM by perturbing all the features
4. Compute R*SM(xi) by perturbing all the features except the ith feature
5. Remove from CFS the feature yielding the smallest value of | R*SM − R*SM(xi) |
6. Go back to Step 2 if CFS ≠ ∅
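The six steps can be sketched as a generic loop. Here `train` and `rsm_bound` are hypothetical placeholders for the real RBFNN training and R*SM computation; the least-squares model and training-MSE "bound" used for the demonstration exist only to make the sketch runnable:

```python
import numpy as np

def rsmfs(X, y, rsm_bound, train):
    """Sketch of the 6-step RSMFS loop. `train` fits a model on the given
    feature columns; `rsm_bound(model, X, y, fixed)` is assumed to return
    the bound with the columns in `fixed` clamped to their means."""
    cfs = list(range(X.shape[1]))           # Step 1: candidate feature set
    removal_order = []
    while cfs:                              # Step 6: repeat until CFS is empty
        model = train(X[:, cfs], y)                          # Step 2
        base = rsm_bound(model, X[:, cfs], y, fixed=[])      # Step 3
        change = {j: abs(base - rsm_bound(model, X[:, cfs], y,
                                          fixed=[cfs.index(j)]))
                  for j in cfs}                              # Step 4
        weakest = min(change, key=change.get)                # Step 5
        cfs.remove(weakest)
        removal_order.append(weakest)
    return removal_order                    # last removed = most relevant

# toy stand-ins: least-squares "classifier", bound proxy = training MSE
# with the `fixed` feature columns clamped to their mean
def train(X, y):
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def rsm_bound(w, X, y, fixed):
    Xc = X.copy()
    for j in fixed:
        Xc[:, j] = Xc[:, j].mean()
    return float(np.mean((Xc @ w - y) ** 2))

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3))
y = 2 * X[:, 0] + 0.1 * rng.normal(size=300)   # only feature 0 matters
order = rsmfs(X, y, rsm_bound, train)
print(order[-1])   # the informative feature is removed last
```

Because only the bound is recomputed per candidate, no extra classifier training is needed inside Step 4, which is the speed advantage claimed over the CV-based method.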


• Case Study – UCI Wine Dataset

• These are the results of a chemical analysis of wines grown in the same region of Italy but derived from three different cultivars

• 13 features

• 178 samples

• 3 classes

• We use this dataset to study the differences between the feature subsets selected by the 6 feature selection methods

Experimental Comparisons of Feature Selection Methods: Case Study on UCI Wine Dataset


Case Study on UCI Wine Dataset

The 3 Most Relevant Features Selected by the Feature Selection Methods

Method | Most relevant | 2nd most relevant | 3rd most relevant
RSMFS  | X13           | X12               | X7
MI     | X13           | X12               | X1
LOO    | X13           | X9                | X6
COR    | X12           | X7                | X6
SEPA   | X13           | X2                | X1
SIM    | X10           | X2                | X1

Avg (Std Dev in brackets) Testing Accuracy and t-Test Values for the Selected Feature Subsets

Method | Full Set | 10 Features | 5 Features | 3 Features | 2 Features
RSMFS | 92.57% (2.56%), t=0.00 | 94.39% (2.52%), t=-0.51 | 94.27% (2.42%), t=-0.48 | 92.00% (2.81%), t=0.15 | 85.30% (3.44%), t=1.70
MI | 92.57% (2.56%), t=0.00 | 93.19% (2.36%), t=-0.18 | 90.07% (5.52%), t=0.41 | 89.22% (4.47%), t=0.65 | 85.30% (3.44%), t=1.70
LOO | 92.57% (2.56%), t=0.00 | 88.76% (4.88%), t=0.69 | 79.33% (5.09%), t=2.32 | 65.12% (5.91%), t=4.26 | 56.99% (5.75%), t=5.65
COR | 92.57% (2.56%), t=0.00 | 90.47% (2.62%), t=0.57 | 93.72% (2.05%), t=-0.35 | 85.60% (4.23%), t=1.41 | 84.79% (2.68%), t=2.10
SEPA | 92.57% (2.56%), t=0.00 | 94.05% (2.63%), t=-0.40 | 91.32% (4.70%), t=0.23 | 77.23% (5.97%), t=2.36 | 61.32% (3.46%), t=7.26
SIM | 92.57% (2.56%), t=0.00 | 87.57% (4.39%), t=0.98 | 77.80% (5.56%), t=2.41 | 70.18% (8.43%), t=2.54 | 56.89% (10.10%), t=3.42

(In the original slides, red denotes results better than the full set and blue denotes a statistically insignificant loss, t-Test < 1.96.)
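The t-Test values in these tables compare a subset's accuracy against the full set's, with |t| < 1.96 read as no significant difference at the 5% level. A plausible form is the two-sample t statistic sketched below; the slides do not state the run count or whether runs were paired, so n=10 is an assumption and the computed values need not match the reported ones:

```python
import math

def t_statistic(mean_full, std_full, mean_sub, std_sub, n=10):
    """Two-sample t statistic comparing full-set accuracy with a feature
    subset's accuracy. n=10 independent runs is an assumption; the slides
    do not state the run count or whether runs were paired."""
    se = math.sqrt(std_full ** 2 / n + std_sub ** 2 / n)
    return (mean_full - mean_sub) / se

# e.g. 92.57% (2.56%) for the full set vs 92.00% (2.81%) for a subset:
# a positive value means the subset did worse than the full set
t = t_statistic(92.57, 2.56, 92.00, 2.81)
print(round(t, 2))
```

The sign convention matches the tables: negative t values mark subsets that beat the full set.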


• Similarity
• (X1 Alcohol, X2 Malic Acid, X10 Color Intensity), (X1 Alcohol, X2 Malic Acid)
• The similarity selection is based only on the means and variances of each pair of features, without considering their actual distributions. Hence it may select an irrelevant feature, which leads to poor generalization.

Experimental Comparisons of Feature Selection Methods: Case Study on UCI Wine Dataset


• 5-Fold Cross Validation
• (X6 Total Phenols, X9 Proanthocyanins, X13 Proline), (X9 Proanthocyanins, X13 Proline)
• This method selects the feature subset that yields the smallest error on the validation set (reserved from the training set, so only 80% of the training samples are used for training).
• It finds the best single feature, X13, but all of its other choices are poor. This may be due to reserving 20% of the samples for validation, which makes the training sets actually used differ substantially from the original training set.
• This could be addressed by the leave-one-out method, but its computational cost makes that infeasible.
• The value of this method: it evaluates a feature using real samples with real target outputs in the validation set; however, a portion of the training samples is not used in training

Experimental Comparisons of Feature Selection Methods: Case Study on UCI Wine Dataset


• Separability
• (X1 Alcohol, X2 Malic Acid, X13 Proline), (X1 Alcohol, X2 Malic Acid)
• One may notice that the samples from class 2 (red X in the next slide) mix with samples from the other 2 classes.
• But the mean values of the samples from the different classes (indicated by black solid circles) are far apart from each other.
• This example demonstrates that the Separability method is overly dependent on the mean values of the samples.

Experimental Comparisons of Feature Selection Methods: Case Study on UCI Wine Dataset
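The over-dependence on class means can be reproduced with a deliberately mean-only score. The exact separability criterion used in the talk is not shown in these slides, so `mean_separability` below is an illustrative assumption; the point is only that any score dominated by mean distance can prefer a badly overlapping feature:

```python
import numpy as np

def mean_separability(x, y):
    """Mean-only score: distance between the two class means (the kind of
    criterion the slide criticizes; the talk's exact formula may differ)."""
    return abs(x[y == 0].mean() - x[y == 1].mean())

rng = np.random.default_rng(3)
y = np.repeat([0, 1], 1000)
# feature A: class 1 is bimodal -- half its mass sits on top of class 0,
# yet its mean is far from class 0's mean
a = np.where(y == 0, rng.normal(0, 1, 2000),
             np.where(rng.random(2000) < 0.5, rng.normal(0, 1, 2000),
                      rng.normal(20, 1, 2000)))
# feature B: fully separated classes, but with closer means
b = np.where(y == 0, rng.normal(0, 1, 2000), rng.normal(4, 1, 2000))

sep_a = mean_separability(a, y)
sep_b = mean_separability(b, y)
print(sep_a > sep_b)   # the overlapping feature wins on the mean-only score
```

Feature A can never be classified better than about 75% accuracy here, while feature B separates the classes almost perfectly, yet the mean-based score prefers A.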


[Scatter plot of the samples with the class mean values marked: class 2 samples (red X) mix with the other classes while the class means (black solid circles) lie far apart]

Experimental Comparisons of Feature Selection Methods: Case Study on UCI Wine Dataset


• Correlation Coefficient

• (X6 Total Phenols, X7 Flavanoids, X12 OD280/OD315 of Diluted Wines), (X7 Flavanoids, X12 OD280/OD315 of Diluted Wines)

• This method works best when input and output are highly linearly correlated, as with X7 and X12, but not for X12 and X13 (which form a better feature subset).

• A feature subset that is linearly correlated with the desired output must have decision hyperplanes shaped like those in the next slide.

• E.g., the lower the class ID of a sample, the higher its values in both features X7 and X12

• However, the X12 and X13 combination yields better nonlinear distinguishing power, though not linear.

Experimental Comparisons of Feature Selection Methods: Case Study on UCI Wine Dataset
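The linear-only limitation is easy to demonstrate: Pearson correlation is strong for a monotone label relationship but near zero for a purely symmetric, nonlinear one. The data below are synthetic, not the Wine features:

```python
import numpy as np

def corr(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

rng = np.random.default_rng(4)
x_linear = rng.normal(size=1000)
y_lin = (x_linear > 0).astype(float)        # monotone dependence on x
x_quadratic = rng.normal(size=1000)
y_quad = (abs(x_quadratic) > 1).astype(float)  # depends on |x| -- nonlinear

c_lin = corr(x_linear, y_lin)
c_quad = corr(x_quadratic, y_quad)
print(abs(c_lin) > 0.5, abs(c_quad) < 0.1)
```

The second feature fully determines its label, yet its correlation score is close to zero; a correlation-based ranker would discard it, which mirrors the X12/X13 case above.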


Experimental Comparisons of Feature Selection Methods: Case Study on UCI Wine Dataset

[Figure: linear separation for the samples]


• Mutual Information

• (X1 Alcohol, X12 OD280/OD315 of Diluted Wines, X13 Proline), (X12 OD280/OD315 of Diluted Wines, X13 Proline)

• This method works best when input and output are highly nonlinearly correlated, as with X13 and X12

• However, mutual information only measures the dependence between the features and the target output for samples in the training set. It ignores the generalization capability of the selected features.

• The next 3 slides present the 3 best pairs of features in terms of their power to distinguish samples from different classes

Experimental Comparisons of Feature Selection Methods: Case Study on UCI Wine Dataset
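By contrast with the correlation coefficient, a mutual-information score does detect a symmetric, nonlinear dependence. The histogram plug-in estimator below is an assumption, since the talk does not specify which MI estimator was used:

```python
import numpy as np

def mutual_information(x, y, bins=20):
    """Histogram (plug-in) estimate of I(X;Y) in nats for a continuous
    feature x and a discrete label y. A simple estimator chosen for
    illustration; the talk's estimator may differ."""
    mi = 0.0
    edges = np.histogram_bin_edges(x, bins)
    n = len(x)
    for c in np.unique(y):
        p_c = np.mean(y == c)
        joint_counts, _ = np.histogram(x[y == c], edges)
        marg_counts, _ = np.histogram(x, edges)
        p_joint = joint_counts / n          # p(x in bin, y = c)
        p_marg = marg_counts / n            # p(x in bin)
        mask = p_joint > 0
        mi += np.sum(p_joint[mask] *
                     np.log(p_joint[mask] / (p_marg[mask] * p_c)))
    return float(mi)

rng = np.random.default_rng(5)
x = rng.normal(size=2000)
y = (abs(x) > 1).astype(int)       # purely nonlinear dependence
noise = rng.normal(size=2000)      # irrelevant feature

mi_x = mutual_information(x, y)
mi_noise = mutual_information(noise, y)
print(mi_x > mi_noise)
```

The nonlinearly relevant feature scores well above the irrelevant one, but note the estimate is computed only on the training sample, which is the generalization caveat raised in the slide.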


Experimental Comparisons of Feature Selection Methods: Case Study on UCI Wine Dataset

[Figure: linear separation for the samples]


• Minimization of the Localized Generalization Error Model (RSMFS by R*SM)

• (X7 Flavanoids, X12 OD280/OD315 of Diluted Wines, X13 Proline), (X12 OD280/OD315 of Diluted Wines, X13 Proline)

• This method selects the feature subset that yields the largest change in the generalization error bound (R*SM)

• A feature is removed if it does not change the generalization error bound (R*SM)

• Both the training samples and the unseen samples in the Q-Union remain unchanged throughout the feature selection process.

• If a feature does not change R*SM, the removal of that feature does not change the classification decision (small ST-SM).

• From the 3-D plot using features X7, X12 and X13, one finds that the feature subset selected by RSMFS yields the best distinguishing power.

• The overlapping regions between samples from any two classes are the smallest, so the generalization capability should be the best.

Experimental Comparisons of Feature Selection Methods: Case Study on UCI Wine Dataset


Experimental Comparisons of Feature Selection Methods: Case Study on UCI Wine Dataset

[3-D scatter plot of the samples using features X7, X12 and X13]


Experimental Comparisons of Feature Selection Methods

6 More Experiments

Dataset                | Number of Features | Number of Samples | Number of Classes
Lung Carcinomas        | 12,600             | 203               | 2
Isolet                 | 617                | 6,238             | 26
Multiple Feature Digit | 649                | 2,000             | 10
Network Intrusion      | 41                 | 464,020           | 2
Sonar                  | 60                 | 208               | 2
Ionosphere             | 34                 | 351               | 2


Lung Carcinomas dataset

• Feature Selection Results

# Features | R*SM | MI | COR | SIM | SEPA | LOO
1 | 86.14%, t=5.24 | 82.18%, t=4.42 | 82.18%, t=6.51 | 82.18%, t=3.03 | 82.18%, t=6.09 | N/A
50 | 90.10%, t=1.94 | 91.85%, t=2.07 | 82.18%, t=6.86 | 84.61%, t=2.88 | 86.50%, t=2.19 | N/A
150 | 96.04%, t=0.20 | 90.83%, t=1.23 | 84.16%, t=6.95 | 85.85%, t=4.55 | 93.25%, t=0.97 | N/A
250 | 99.01%, t=-0.76 | 98.02%, t=-0.48 | 91.09%, t=1.93 | 96.04%, t=0.29 | 97.03%, t=0.00 | N/A
500 | 97.03%, t=0.00 | 97.03%, t=0.00 | 97.03%, t=0.00 | 97.03%, t=0.00 | 97.03%, t=0.00 | N/A
12,600 | 97.03%, t=0.00 | 97.03%, t=0.00 | 97.03%, t=0.00 | 97.03%, t=0.00 | 97.03%, t=0.00 | N/A

Experimental Comparisons of Feature Selection Methods

(In the original slides, red denotes results better than the full set and blue denotes a statistically insignificant loss, t-Test < 1.96.)


• Isolated Letter Speech Recognition Dataset

Experimental Comparisons of Feature Selection Methods

Number of Features | RSMFS | MI | COR | SIM | SEPA | LOO
62 (10%) | 62.15%, t=5.24 | 38.78%, t=11.28 | 35.08%, t=9.68 | 48.92%, t=5.28 | 30.62%, t=11.70 | 30.65%, t=20.49
123 (20%) | 68.77%, t=8.79 | 63.52%, t=11.86 | 46.77%, t=18.13 | 62.00%, t=3.88 | 38.92%, t=26.20 | 43.30%, t=12.84
247 (40%) | 80.46%, t=1.75 | 77.45%, t=3.59 | 67.69%, t=11.20 | 71.54%, t=2.54 | 46.77%, t=19.44 | 68.61%, t=12.75
370 (60%) | 83.08%, t=0.49 | 82.52%, t=1.18 | 80.00%, t=1.41 | 76.31%, t=4.12 | 73.60%, t=4.57 | 77.08%, t=4.44
494 (80%) | 85.69%, t=-0.76 | 84.52%, t=-0.46 | 82.77%, t=0.56 | 82.31%, t=0.80 | 82.00%, t=1.21 | 82.56%, t=0.73
617 (100%) | 83.85%, t=0.00 | 83.85%, t=0.00 | 83.85%, t=0.00 | 83.85%, t=0.00 | 83.85%, t=0.00 | 83.85%, t=0.00


• Multiple Feature Digits Recognition Dataset

Experimental Comparisons of Feature Selection Methods

Number of Features | RSMFS | MI | COR | SIM | SEPA | LOO
65 (10%) | 97.00%, t=-0.13 | 93.95%, t=3.10 | 88.94%, t=4.70 | 85.88%, t=3.20 | 91.74%, t=7.31 | 82.29%, t=18.61
130 (20%) | 97.42%, t=-0.65 | 94.33%, t=3.10 | 92.93%, t=2.24 | 91.18%, t=3.43 | 94.62%, t=3.42 | 86.58%, t=11.51
260 (40%) | 96.45%, t=0.62 | 95.85%, t=1.45 | 95.76%, t=1.02 | 93.54%, t=3.85 | 96.56%, t=0.46 | 92.64%, t=5.88
389 (60%) | 97.12%, t=-0.30 | 96.55%, t=0.47 | 96.85%, t=0.06 | 95.16%, t=2.66 | 96.67%, t=0.29 | 94.79%, t=2.74
519 (80%) | 96.88%, t=0.03 | 96.60%, t=0.35 | 96.97%, t=-0.08 | 96.32%, t=0.83 | 96.12%, t=1.01 | 95.80%, t=1.78
649 (100%) | 96.90%, t=0.00 | 96.90%, t=0.00 | 96.90%, t=0.00 | 96.90%, t=0.00 | 96.90%, t=0.00 | 96.90%, t=0.00


Feature Selection Problems

• Network Intrusion Detection Dataset

Number of Features | RSMFS | MI | COR | SIM | SEPA | LOO
4 (10%) | 98.50%, t=5.48 | 98.43%, t=9.52 | 98.22%, t=13.20 | 80.51%, t=4.08 | 98.83%, t=5.17 | 99.51%, t=0.77
8 (20%) | 99.30%, t=1.68 | 99.18%, t=3.60 | 99.14%, t=3.70 | 92.14%, t=3.87 | 98.92%, t=5.32 | 99.51%, t=0.19
16 (40%) | 99.46%, t=0.74 | 99.22%, t=4.44 | 99.25%, t=3.42 | 93.44%, t=3.05 | 99.05%, t=3.65 | 99.41%, t=1.30
25 (60%) | 99.50%, t=0.47 | 99.41%, t=1.41 | 99.48%, t=0.60 | 98.39%, t=15.95 | 99.13%, t=2.40 | 99.42%, t=1.30
33 (80%) | 99.54%, t=0.00 | 99.41%, t=1.53 | 99.54%, t=0.00 | 99.49%, t=0.69 | 99.33%, t=1.80 | 99.54%, t=0.00
41 (100%) | 99.54%, t=0.00 | 99.54%, t=0.00 | 99.54%, t=0.00 | 99.54%, t=0.00 | 99.54%, t=0.00 | 99.54%, t=0.00


Sonar Detection Dataset

Experimental Comparisons of Feature Selection Methods

Number of Features | RSMFS | MI | COR | SIM | SEPA | LOO
6 (10%) | 75.45%, t=1.86 | 74.06%, t=2.27 | 76.39%, t=1.81 | 73.28%, t=2.62 | 77.07%, t=1.64 | 76.68%, t=1.46
12 (20%) | 79.09%, t=0.92 | 80.89%, t=0.49 | 77.65%, t=1.44 | 77.26%, t=1.40 | 78.72%, t=0.98 | 77.17%, t=1.62
24 (40%) | 81.59%, t=0.30 | 80.18%, t=0.64 | 81.34%, t=0.42 | 80.27%, t=0.65 | 81.34%, t=0.42 | 79.88%, t=0.77
36 (60%) | 82.38%, t=0.17 | 80.95%, t=0.55 | 80.66%, t=0.59 | 81.99%, t=0.25 | 81.53%, t=0.46 | 80.67%, t=0.82
48 (80%) | 82.47%, t=0.18 | 81.05%, t=0.52 | 80.37%, t=0.73 | 82.89%, t=0.05 | 81.15%, t=0.47 | 82.12%, t=0.27
60 (100%) | 83.04%, t=0.00 | 83.04%, t=0.00 | 83.04%, t=0.00 | 83.04%, t=0.00 | 83.04%, t=0.00 | 83.04%, t=0.00


• Ionosphere Detection Dataset

Experimental Comparisons of Feature Selection Methods

Number of Features | RSMFS | MI | COR | SIM | SEPA | LOO
3 (10%) | 86.97%, t=-1.07 | 84.63%, t=-0.03 | 83.09%, t=0.68 | 74.80%, t=3.62 | 77.89%, t=4.23 | 85.57%, t=0.91
7 (20%) | 86.06%, t=-0.71 | 86.06%, t=-0.73 | 85.20%, t=-0.39 | 76.97%, t=3.40 | 75.43%, t=5.45 | 82.43%, t=0.73
14 (40%) | 85.03%, t=-0.19 | 85.03%, t=-0.25 | 84.70%, t=-0.07 | 84.11%, t=0.21 | 84.80%, t=-0.16 | 82.14%, t=1.14
20 (60%) | 85.20%, t=-0.30 | 85.03%, t=-0.28 | 84.71%, t=-0.08 | 84.23%, t=0.21 | 84.29%, t=0.14 | 81.29%, t=1.33
27 (80%) | 84.34%, t=0.16 | 84.34%, t=0.14 | 83.97%, t=0.36 | 83.54%, t=0.74 | 84.34%, t=0.22 | 82.43%, t=1.75
34 (100%) | 84.57%, t=0.00 | 84.57%, t=0.00 | 84.57%, t=0.00 | 84.57%, t=0.00 | 84.57%, t=0.00 | 84.57%, t=0.00


• The proposed feature selection method using R*SM is:

• Faster

• More accurate (as measured by RBFNNs trained with the selected feature subsets)

• Scalable to a large number of features

• Scalable to a large number of samples

• Scalable to a large number of classes

• The LOO method, which here uses 5-fold CV to estimate the generalization error, performs worse than the other methods. This indicates that it suffers from the effect of reserving 20% of the training samples for validation.

• The Similarity method may remove relevant features.

• The Separability method may not be suitable for problems where samples from several classes overlap

Experimental Comparisons of Feature Selection Methods

Overall comments on the feature selection experimental results


Conclusion

• Proposed a new look at the generalization error – the localized generalization error (R*SM)

• Demonstrated the application of R*SM in selecting the optimal number of hidden neurons for an RBFNN

• Demonstrated the application of R*SM in feature reduction for supervised pattern classification problems.


Recent Publications Related to This Talk

• Yeung, Ng, et al., "Localized Generalization Error and Its Application to Architecture Selection for Radial Basis Function Neural Network", IEEE Trans. on Neural Networks, Oct. 2007.

• Ng, Yeung, et al., "Feature Selection Using Localized Generalization Error for Supervised Classification Problems for RBFNN", submitted to Pattern Recognition, 2007.

• Ng, Dorado, Yeung, et al., "Image Classification with the use of Radial Basis Function Neural Networks and the Minimization of Localized Generalization Error", Pattern Recognition, pp. 19-32, 2007.

• Ng, Yeung and Tsang, "The Localized Generalization Error Model for Single Layer Perceptron Neural Network and Sigmoid Support Vector Machine", submitted to International Journal of Pattern Recognition and Artificial Intelligence, 2007.

• Ng, Yeung, et al., "Localized Generalization Error of Gaussian Based Classifiers and Visualization of Decision Boundaries", Soft Computing, pp. 375-381, 2007.

• Y. Wang, Zeng, Yeung and Peng, "Computation of Madalines' Sensitivity to Input and Weight Perturbations", to appear in Neural Computation.

• Shi, Yeung and Gao, "Sensitivity Analysis Applied to the Construction of Radial Basis Function Networks", Neural Networks, 2005.

• Ng and Yeung, "Selection of Weight Quantisation Accuracy for Radial Basis Function Neural Network Using Stochastic Sensitivity Measure", IEE Electronic Letters, vol. 39, pp. 787-789, 2003.

• Zeng and Yeung, "A Quantified Sensitivity Measure for Multilayer Perceptron to Input Perturbation", Neural Computation, 2003.

• Yeung and Sun, "Using Function Approximation to Analyze the Sensitivity of MLP with Antisymmetric Squashing Activation Function", IEEE Transactions on Neural Networks, vol. 13, no. 1, pp. 34-44, Jan. 2002.

• Zeng and Yeung, "Sensitivity Analysis of Multilayer Perceptron to Input and Weight Perturbations", IEEE Trans. on Neural Networks, 2001.

