
Bias and Stability of Single Variable Classifiers for Feature Ranking and Selection

Shobeir Fakhraei a,b, Hamid Soltanian-Zadeh a,c, Farshad Fotouhi d

a Medical Image Analysis Laboratory, Department of Radiology, Henry Ford Health System, Detroit, MI 48202, USA
b Department of Computer Science, University of Maryland, College Park, MD 20740, USA

c Control and Intelligent Processing Center of Excellence (CIPCE), School of Electrical and Computer Engineering, University of Tehran, Tehran 14395, Iran

d College of Engineering, Wayne State University, Detroit, MI 48202, USA

Abstract

Feature rankings are often used for supervised dimension reduction, especially when the discriminating power of each feature is of interest, the dimensionality of the dataset is extremely high, or computational power is limited to perform more complicated methods. In practice, it is recommended to start dimension reduction via simple methods such as feature rankings before applying more complex approaches. Single Variable Classifier (SVC) ranking is a feature ranking based on the predictive performance of a classifier built using only a single feature. While benefiting from the capabilities of classifiers, this ranking method is not as computationally intensive as wrappers. In this paper, we report the results of an extensive study on the bias and stability of such a feature ranking method. We study whether the classifiers influence the SVC rankings or the discriminative power of the features themselves has a dominant impact on the final rankings. We show that the common intuition of using the same classifier for feature ranking and final classification does not always result in the best prediction performance. We then study whether heterogeneous classifier ensemble approaches provide more unbiased rankings and whether they improve final classification performance. Furthermore, we calculate an empirical prediction performance loss for using the same classifier in SVC feature ranking and final classification instead of the optimal choices.

Keywords: Feature Ranking, Feature Selection, Bias, Stability, Single Variable Classifier, Dimension Reduction, Support Vector Machines, Naïve Bayes, Multilayer Perceptron, K-Nearest Neighbors, Logistic Regression, AdaBoost, Random Forests

1. Introduction and Motivation

Due to the use of cross validation, classifier-based feature selection/ranking methods often demonstrate superior performance compared to methods that only use training error (Guyon, 2008). Wrapper feature selection (Kohavi and John, 1997) and Single Variable Classifier (SVC) feature ranking methods (Guyon and Elisseeff, 2003) are two common methods of utilizing classifiers in this context. While multiple aspects of wrappers have been studied, SVCs have not received much attention in the scientific literature. In this paper, we study the stability and bias of SVC feature ranking methods and report problems that affect both SVC and wrapper methods in practice.

When using a classifier to rank features, it is interesting to find out whether the result is relatively universal or the classifier's influence makes the result specific to that classifier. We have conducted multiple experiments to study such questions. Furthermore, it is interesting to find out, if the rankings are highly affected by the classifiers, which classifiers generate more similar results. This information provides insight into which classifiers could use rankings interchangeably, and also provides useful information for generating ensemble rankings using methods with less correlated results. We empirically study and report this similarity for multiple classifiers.

It is also a common intuition that a better optimized feature selection/ranking is achieved when the same classifier is used in the selection/ranking process as the evaluation criterion. We empirically study this and report interesting results that somewhat contradict this intuition. In other words, we study how much better the final classification performance would be if we use the same classifier for feature ranking and final classification, compared to situations in which different classifiers are used in the two steps.

The superiority of ensemble methods in multiple domains raises the question of the possibility of finding a universally superior ranking method based on an ensemble of multiple rankings. This question would be of high interest especially if the optimal ranking for a classification task could not be achieved by using the same classifier's performance as the ranking criterion. We study the performance of an ensemble ranking and compare it to the other SVCs.

Finally, we investigate, if using the same classifier for feature ranking and final classification is not optimal, how far it is from the best empirically found approach. In other words, is the extra effort of finding a better ranking worth it? Or is using the same classifier for ranking and final classification good enough?

In the rest of the paper, first an overall introduction to dimension reduction and feature selection and ranking is provided. Then, some background on SVC feature rankings is presented, and the bias and stability of such feature ranking methods are formally defined. The five questions we address in the study are defined, and the framework and study method are presented. Next, empirical results based on multiple datasets are reported. Finally, insights into the reasons behind the observed phenomena are provided and some practical guidelines to gain superior results are described.

1.1. Dimension Reduction

To fully benefit from the power of data mining algorithms and learning models, data have to be prepared to fit the requirements of the desired learning model. Such preparation, commonly referred to as data pre-processing, might contain tasks such as aggregation, sampling, discretization, data cleansing, normalization, and dimension reduction. Dimension reduction, which is the process of reducing the number of variables¹ under consideration, is a significantly important step that has received considerable attention from the research community (Huan and Lei, 2005). It is well known that the prediction performance of learning models degrades when faced with many variables that are not necessary for predicting the desired output (Kohavi and John, 1997). The curse of dimensionality is perhaps the most well-known phenomenon that happens when the number of variables increases dramatically (Bishop, 2006, Hastie et al., 2009).

¹ We use the terms variable, feature, and attribute interchangeably.

There are many reasons why collected data might contain unwanted variables, thus making dimension reduction a necessity. Often, due to convenience and low cost, data collection is overdone. Most of the time, data is not even collected for the purpose of data mining, and thus includes many unrelated variables. On occasion, it is not known beforehand which variable is most appropriate for learning; therefore, all presumably related variables are recorded.

For such reasons, dimension reduction has become an essential step in the application of data mining and machine learning in many domains like text mining, image retrieval, microarray data analysis, protein classification, face recognition, handwriting recognition, intrusion detection, and biomedical data analysis. While dimension reduction is as old as machine learning itself, exponential increases in the number of variables in recent years have made dimension reduction significantly important in several domains (Guyon and Elisseeff, 2003, Zhao and Liu, 2011).

In some of these domains, the number of variables even exceeds the number of samples. Such scenarios would result in poor performance due to over-fitting and make dimension reduction inevitable (Hastie et al., 2009). For example, according to Probably Approximately Correct (PAC) learning theory, a dataset with a binary class and n binary variables has a hypothesis space of size 2^(2^n), and it would require O(2^n) samples to learn a PAC hypothesis without any inductive bias (Loscalzo et al., 2009). Therefore, with an increase in the number of variables (n), collection of more samples is necessary to avoid over-fitting, which is not always possible.
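As a rough illustration of this growth (our arithmetic, not taken from the cited sources): with n binary variables there are 2^n distinct input vectors, and a binary hypothesis assigns one of two labels to each of them, so the hypothesis space has 2^(2^n) members. For instance,

    n = 10:  2^(2^10) = 2^1024 ≈ 1.8 × 10^308 hypotheses,
    n = 20:  2^(2^20) = 2^1048576 hypotheses,

so even a modest increase in n makes unbiased learning from realistic sample sizes hopeless.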

Dimension reduction methods may be applied to a dataset with several objectives. Dimension reduction may be used to eliminate unrelated features and improve the performance of the model. It may also decrease learning time, measurement efforts, and storage space. Another important aspect of dimension reduction is the acquisition of knowledge about the data and the features themselves. Finding which gene is more discriminative of a condition, or which medical test is more informative with regard to a diagnosis, is a valuable discovery. Finally, dimension reduction may be used to achieve the best reconstruction of the data with a minimum number of variables; such an objective, however, is less relevant to machine learning and is more useful for data compression.

In a typical machine learning and data mining classification problem, a dataset (DS) is provided as a collection of m-dimensional vectors (x^j), or data points, where each element of the vector corresponds to a value of a feature (f_i). A class label (y^j) is also provided for each vector that maps each data point to a particular class. The dataset is formally defined in the R^n * R^(m+1) space as (X, F, Y), where X contains n instances x^j such that x^j = (f^j_1, ..., f^j_m) ∈ R^m and j = 1, ..., n. Each feature (f_i) comes from the collection of features F, and f^j_i indicates the value of that feature for a particular instance. A classifier (CL) optimizes a criterion (λ) to predict the class labels of unobserved instances as correctly as possible. Examples of such criteria are mean square error, accuracy, or the area under the ROC curve, which is discussed in later sections. To facilitate the argument and without loss of generality, we assume that λ should always be maximized. Dimension reduction aims to maximally preserve the useful information in the original data according to some criterion and discard the unnecessary information. In other words, dimension reduction maps DS from (X, F, Y) to (X', F', Y), where X' contains vectors x'^j = (f'^j_1, ..., f'^j_m') ∈ R^(m') with m' < m and λ(X', F', Y) ≥ λ(X, F, Y) − ε for some criterion λ and some estimation parameter ε, which we will not include in the rest of our discussions for simplicity.

1.2. Feature Selection and Ranking

Dimension reduction may be broadly categorized into two main groups: feature extraction/construction and feature selection. In feature extraction, new features are constructed based on the original features in the datasets. Principal Component Analysis (PCA), Isometric Feature Mapping (IsoMap), and Locally Linear Embedding (LLE) are examples of this approach (Lee and Verleysen, 2007). In feature extraction, the following characteristics hold for a newly generated feature:

(f'_i ∈ F') ∧ (f'_i ∉ F) ∧ (f'_i = φ(f_j, ..., f_k); f_j, ..., f_k ∈ F)

Although feature extraction methods are used in many domains to improve the performance of learning models, they suffer from limitations with respect to the other objectives of dimension reduction. Since the original features are not the output of these processes, knowledge about the discriminative value of each feature is not acquired easily. On the other hand, most of these methods construct new features based on all original features and do not reduce the measurement requirements of the data gathering.

Feature selection, on the other hand, is the process of choosing or prioritizing a subset of the original features based on a criterion. In general, the following characteristics hold for the features in the new feature space:

(f'_i ∈ F') ∧ (f'_i ∈ F) ∧ (|F'| < |F|)

where |F*| indicates the number of features in F*.

Feature selection in general can itself be divided into several subgroups from different perspectives (Guyon and Elisseeff, 2003, Huan and Lei, 2005, Kira and Rendell, 1992), one of which is based on the output of the feature selection process, which can be categorized as follows:

• Feature subset selection: Here, the output is an optimal feature subset that aims to maximize the performance of the model. The features in the selected subset are not prioritized against each other. Should the subset size be reduced, there is no information regarding which features to eliminate first from the group. In this group, a feature subset F' is provided such that:

F' = {f'_1, ..., f'_i}  and  λ(X', F', Y) ≥ λ(X, F, Y)

• Nested subsets of features: As the name suggests, the output of this group of algorithms is a list of features that form a nested structure of feature subsets. There can be two types of nested feature subsets depending on the search strategy: backward and forward selection. As an example, when forward selected, a nested list of feature subsets R = ⟨f'_1, f'_2, f'_3, ...⟩ is returned by an algorithm. This means that the best subset containing only a single feature is {f'_1}, and the best subset of two features, with the condition of already having f'_1 in the subset, is {f'_1, f'_2}. However, this list does not indicate an individual preference of f'_2 over f'_3; it states that the former demonstrates superior performance in combination with f'_1 compared to the latter. A backward elimination nested feature subset conveys the same logic in reverse order. In this group, a list of feature subsets R is provided such that:

R = ⟨f'_1, ..., f'_i⟩ ;  ∀(f'_i, f'_j ∈ R) : i < j → λ(X', R_{i−1} ∪ {f'_i}, Y) ≥ λ(X', R_{i−1} ∪ {f'_j}, Y)   where R_k = {f'_1, ..., f'_k}

• Feature ranking: This group of algorithms sorts the features based on a quality index that reflects the individual relevance, information, or discriminative capacity of a feature. Therefore, when a list such as R = ⟨f'_1, f'_2, f'_3, ...⟩ is produced in the feature ranking fashion, it suggests that f'_2 is superior to f'_3 individually with respect to the quality index. In this group, a list of features R is provided such that:

R = ⟨f'_1, ..., f'_i⟩ ;  ∀(f'_i, f'_j ∈ R) : i < j → λ(X', {f'_i}, Y) ≥ λ(X', {f'_j}, Y)

A portion of the top features in the latter two approaches may be selected as a feature subset to build a classifier in several ways (Ruiz et al., 2006). User-mandated constraints, such as the number of desired features or a performance threshold, and the use of randomly generated features as a probe to discriminate between related and unrelated features, such as the method summarized by Stoppiglia et al. (2003), are common approaches for this purpose. Plotting a learning or an error curve, constructed by adding a feature at a time to the subset and evaluating the subset with a classifier, is another way of finding the best portion of the highly ranked features for subset construction (Slavkov et al., 2010). With respect to the statistical methods summarized by Demsar (2006), different cut-off criteria have been studied by Arauzo-Azofra et al. (2011).

The final objective of feature selection determines which one of the above categories is preferable. When maximizing the performance of a learning model is the only goal, feature subset selection algorithms are sufficient. Nested subsets of features are useful when information about the features' values, in consideration of their interactions, is desirable and the user demands more insight and supervision over the final selected subset. However, since feature selection maps to a search problem in a lattice (Kohavi and John, 1997), most greedy hill-climbing algorithms generate a nested subset of features where the latest generated subset is the outcome.

Feature ranking is mostly desirable when knowledge about the discriminative value of individual features is of interest. For example, in medicine or bioinformatics, when each feature corresponds to a medical test, a biometric, or a gene, the result of a feature ranking algorithm itself is of great value.

The feature selection algorithms that evaluate subsets of features are called multivariate feature selection methods. Feature ranking, on the other hand, usually evaluates the performance of each feature individually and is categorized as a univariate method. When interactions between features are important, univariate methods fail to capture such phenomena of interest and multivariate methods must be used (Guyon and Elisseeff, 2003). It should also be noted that although multivariate methods lead to more universal predictors than univariate methods, they may result in lower predictive performance due to over-fitting (Guyon et al., 2005). Furthermore, due to the time-consuming nature of the search algorithms used in feature selection, feature ranking methods are widely used in practice.

Based on the use of a learning model, feature selection algorithms are divided into wrapper, filter, embedded, and hybrid methods. The filter methods evaluate feature subsets based on statistical measures, while the wrapper methods use a learning model such as a classifier for that matter. Embedded approaches also use a learning model, but instead of the features' predictive or discriminative performance, internal factors of the learning model are considered as the evaluation criteria; SVM-RFE is an example in this category (Guyon et al., 2002). Hybrid methods generate feature subsets based on statistical measures and evaluate the best selected subset using a learning model, to benefit from both filter and wrapper approaches (Peng et al., 2010, Bacauskiene et al., 2009, Gheyas and Smith, 2010). Due to the use of training error in statistical tests, cross-validation-based feature selection/ranking methods often demonstrate superior results compared to statistical tests.

Another group of algorithms, categorized as penalty-based methods, tends to address this problem by augmenting the training error with a penalty term (Guyon, 2008). A successful algorithm in this category which has generated major interest is the least absolute shrinkage and selection operator (LASSO) (Hastie et al., 2009, Tsanas et al., 2010, Tibshirani, 1996). This is a penalized least squares method imposing an ℓ1-norm on the regression coefficients. It does both continuous shrinkage and automatic variable selection simultaneously. The ℓ1-norm promotes sparsity (some coefficients become zero), and therefore the LASSO can be used as a feature selection method. The use of elastic nets is another method in this group, which reportedly outperforms LASSO when the number of features is much greater than the number of samples (Zou and Hastie, 2005).
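As a minimal illustration of this sparsity effect (a scikit-learn sketch of ours, not part of this paper's experimental setup; the synthetic data and the regularization strength alpha are arbitrary), the selected features can simply be read off as the nonzero coefficients of a fitted LASSO model:

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
n, m = 200, 50                                       # samples, features
X = rng.randn(n, m)
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.5 * rng.randn(n)   # only two informative features

lasso = Lasso(alpha=0.1)                             # illustrative regularization strength
lasso.fit(StandardScaler().fit_transform(X), y)

# The l1 penalty drives most coefficients to exactly zero; the survivors are the selected features.
print("selected feature indices:", np.flatnonzero(lasso.coef_))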

While filter methods are faster, they have been shown to be less accurate than wrapper approaches, which, on the other hand, have better accuracy but are computationally expensive and are likely to over-fit the selected feature subset to a specific learning model (Chrysostomou et al., 2008). The classifier bias in the wrapper methods makes the filter methods stronger candidates in cases where the features' independent discriminative powers are of interest, such as in feature ranking. In such cases, regardless of the learning model, the user is interested in knowing which feature is superior to the others with respect to a criterion such as class discrimination.


Nevertheless, classifiers are being used in feature ranking. Single Variable Classifier (SVC) is a feature ranking method where variables are ranked according to their individual predictive power. In other words, features are ranked based on the performance of a classifier (CR) built with just that single variable (Guyon and Elisseeff, 2003). While benefiting from the accuracy of classifiers, this method of ranking is not as computationally intensive as a full wrapper. In terms of classification performance, this method has been shown to be superior to filters (Fakhraei et al., 2010a). For clarity, we denote the classifiers that are used for ranking the features with CR and the classifiers that are used for the final classification (learning) with CL.
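The idea can be sketched in a few lines (our illustration with scikit-learn, not the authors' WEKA-based implementation; the default ranking classifier, fold count, and AUC scoring below are placeholder choices): each feature is scored by the cross-validated performance of a classifier CR trained on that feature alone, and the features are then ranked by that score.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def svc_feature_ranking(X, y, ranking_clf=None, cv=5):
    """Score every feature by the cross-validated AUC of a single-variable classifier (CR)."""
    ranking_clf = ranking_clf if ranking_clf is not None else GaussianNB()
    scores = np.array([
        cross_val_score(ranking_clf, X[:, [i]], y, cv=cv, scoring="roc_auc").mean()
        for i in range(X.shape[1])
    ])
    order = np.argsort(scores)[::-1]               # higher AUC -> better rank
    ranks = np.empty(X.shape[1], dtype=int)
    ranks[order] = np.arange(1, X.shape[1] + 1)    # rank 1 is the best feature
    return scores, ranks

Repeating this with a different ranking_clf (an SVM, KNN, and so on) yields the per-classifier rankings whose agreement is studied in the following sections.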

2. Bias and Stability of SVCs

Intuitively, the classifier (CR) used in SVCs for feature ranking affects the ranking results. It is reasonable to think SVC feature ranking is biased towards a classifier (CR), because features that satisfy the characteristics of such a classifier would be ranked higher than their real discriminative power warrants. As an example, of two features that completely separate the instances in each class, the feature that separates the instances of the classes in a linear fashion receives a higher score when evaluated with a linear classifier.

A visual example of such bias is demonstrated in Figure 1 for one feature. While both features have the same capability to completely discriminate the classes, when evaluated with cross validation the feature in Figure 1(a) should be favored when a linear classifier (e.g., a linear Support Vector Machine (SVM)) is used. On the other hand, the feature in Figure 1(b) should be given a higher rank when a nonlinear classifier, such as an instance-based classifier like K-Nearest Neighbors (KNN), is used.


Figure 1: Histograms of features with linear and nonlinear class discrimination. The feature shown in (a), which linearly separates the classes, should be favored as more discriminative by a linear classifier, while the feature in (b), which separates the classes nonlinearly, should be favored by an instance-based classifier such as KNN.

To formally define such bias, let us assume that a ranking list R^CR is provided using SVC feature ranking with classifier CR, that R^CR_α indicates a feature subset that includes the α best features of that list, and that λ_CL indicates the classification performance criterion (e.g., accuracy or AUC) evaluated using the classifier CL. A feature ranking demonstrates bias towards a certain classifier when:

R^CR = ⟨f'_1, ..., f'_i⟩ ;   ∀ CR ≠ CL : λ_CL(X', R^CL_α, Y) ≥ λ_CL(X', R^CR_α, Y)

where R^CR_α = {f'_1, ..., f'_α}.

This definition of bias depends on the value of α, which can be defined in various ways that are explained in the next sections. However, as will be clarified in more detail in later sections, to reach a more general conclusion about such bias, we use multiple values of α in our experiments and report the average result.
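The comparison in this definition can be checked empirically along the following lines (a sketch of ours; α, the learner, and the AUC scoring are placeholder choices, and the learner is assumed to expose predict_proba or decision_function): evaluate CL on the top-α features of its own ranking and on the top-α features of a ranking produced by a different CR, and inspect the gap.

import numpy as np
from sklearn.model_selection import cross_val_score

def top_alpha(ranks, alpha):
    """Indices of the alpha best-ranked features (rank 1 = best)."""
    return np.argsort(ranks)[:alpha]

def subset_score(X, y, feature_idx, learner, cv=5):
    """lambda_CL: cross-validated AUC of the learning classifier CL on a feature subset."""
    return cross_val_score(learner, X[:, feature_idx], y, cv=cv, scoring="roc_auc").mean()

def bias_gap(X, y, ranks_cl, ranks_cr, learner, alpha=30):
    """Positive gap: CL's own ranking (ranks_cl) beats the ranking produced by another CR (ranks_cr)."""
    own = subset_score(X, y, top_alpha(ranks_cl, alpha), learner)
    other = subset_score(X, y, top_alpha(ranks_cr, alpha), learner)
    return own - other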

There is also the issue of the stability of feature selection and ranking methods, especially in high dimensional datasets with small sample sizes. A feature selection algorithm may select largely different subsets of features under variations to the training data. These different feature subsets could even generate relatively close classification performance. In such cases, obtaining knowledge about the features themselves is not easy. The stability or robustness of feature selection methods with respect to variations in the training data is a topic of recent interest among researchers (Gulgezen et al., 2009, Kalousis et al., 2007, Yu et al., 2008, Saeys et al., 2008). In this paper, we have studied the stability of SVC feature rankings with respect to a change of classifiers.

In general, the stability of a feature selection method is measured as the effect of variation in the method's internal conditions on the variation of the final selection results. In other words, the stability of feature selection methods is measured in terms of the similarity of their results under variations of conditions. In our scenario, if ω indicates the similarity (e.g., correlation) of two sets, and two different rankings R^CR1 and R^CR2 were generated using different classifiers in SVC, then the similarity of the two rankings is defined as:

ω(R^CR1, R^CR2)

where this similarity can be reported as a number in the [0, 1] range or as a percentage. Higher values of this number indicate a higher stability of the feature selection method under variations in the ranking classifier (CR).


In this paper, we have shown how unstable the SVC feature ranking results can be, using seven classifiers and ten datasets, and have studied the bias of classifiers in SVC ranking. We also measured the similarity and correlation of the results obtained using different classifiers and studied whether using a heterogeneous ensemble of the results from several classifiers reduces the bias.

3. Related Work

In recent years, the stability of feature selection methods has been studied with respect to variations in the training set. Stability of a feature selection algorithm is usually defined as the robustness of the feature preferences produced relative to the differences in the training sets drawn from the same generating distribution (Kalousis et al., 2007). Zou (2006) has studied the stability of LASSO and the necessary conditions for its consistency and has proposed adaptive LASSO, where adaptive weights are used for penalizing different coefficients in the ℓ1-norm.

Several stability measures have been studied and proposed to measure the similarity of feature selection under variation in the datasets. Novovicova et al. (2009) proposed a new stability measure based on the Shannon entropy to evaluate the overall occurrence of individual features in selected subsets of possibly varying cardinality. Kalousis et al. (2005) studied several similarity measures to quantify the stability of feature preferences by performing a series of experiments with several feature selection algorithms on a set of proteomics datasets. Yu et al. (2008) and Loscalzo et al. (2009) have proposed stable feature selection methods that cluster features and choose representatives of each cluster for the final feature subset.

Using this definition, stability is measured with respect to how similar the feature selection results are, using multiple instance samples drawn from a dataset. Our study, however, instead of changing the sample population, changes the classifier used in the SVC feature ranking and studies the effect of this change on the stability of the final selected features.

Feature selection bias has also been studied in recent years. Choosing features based on their correlation with the class variable may introduce an optimistic bias, in which the response variable appears to be more predictable than it actually is. This is because the high correlation of the selected features with the response may be due to chance (Li et al., 2008). To address this issue, a Bayesian method for making well-calibrated predictions has been proposed by Li et al. (2008).

In another study, Singhi and Liu (2006) have stated that using the same training dataset in both feature selection and learning can result in so-called feature subset selection bias. Motivated by the research done on selection bias in regression, statistical properties of feature selection bias in classification are analyzed. They have shown how this bias impacts classification learning via various experiments, and have shown that the selection bias has a less negative effect in classification than in regression due to the disparate functions of the two.

In this paper, however, we study the bias of each classifier with respect to the final classification performance. In other words, we study how much better the classification performance would be if we use the same classifier for feature ranking and final classification, compared to situations in which different classifiers are used in each step.

There are also multiple experimental studies that compare the performance of feature ranking and feature selection methods. As an example, Hall and Holmes (2003) provided a benchmark for the comparison of feature selection methods that work based on a ranked list of features. Naïve Bayes and decision tree classification algorithms have been used for the evaluation of the feature selection methods. In another study, nine common performance metrics are compared in a wrapper-based feature ranking method described by Altidor et al. (2009) and their correlations have been reported. They have ranked the features based on cross validation with SVC and the calculation of Feature Risk Impact. This study identifies metrics that provide relatively similar feature rankings. It is also demonstrated that the area under the Precision-Recall curve (APRC) and the area under the Receiver Operating Characteristic curve (AUC) are somewhat correlated in most of the domains and classifiers. In our studies, we report the similarities and correlations of the rankings when using different classifiers in the SVC rankings.

Feature ranking and feature selection ensembles have also been studied from different perspectives in recent years, and their benefits are reported in several publications. In an effort to address the non-robust behavior of feature rankings across different datasets and different classifiers, Makrehchi and Kamel (2007) combine eight different ranking metrics with different combination functions. Similarly, Yan (2007) has studied the fusion of multiple criteria for feature ranking and combined several ranking measures into a united rank.

In a credit scoring application, Chen and Li (2010) report the results of a study on the relative effectiveness of feature selection methods based on Linear Discriminant Analysis (LDA), Rough Sets Theory (RST), Decision Tree (DT), F-score, and their combination when the classification is performed via SVM classifiers. Bryll et al. (2003) have proposed an ensemble method based on attribute bagging and have studied the stability of the classifiers participating in the ensemble by considering the standard deviation of their accuracies. Fakhraei et al. (2010a) have shown that a consensus feature ranking based on SVCs outperforms Chi-Square and Information Gain on a dataset with missing values, and have studied the positive and negative effects of classifiers in the consensus ranking (Fakhraei et al., 2010b).

Santana and Canuto (2014) identified that wrapper methods are strongly coupled to the classification algorithm, having to be run again when switching classification methods. They proposed using different hyperparameter settings, training datasets, and classifier types to enhance the diversity of an ensemble method. In an ensemble setting, using particle swarm, ant colony, and genetic algorithm optimization techniques, they chose subsets of features for the individual components of ensembles to enhance performance. Cruz et al. (2013) proposed the use of ensemble methods based on multiple feature representations. They used a dissimilarity measure and the intersection of errors to analyze the relationships among feature representations, which they use to train classifiers. They showed the efficiency of this approach on handwritten character and digit recognition.

Cho (2014) used genetic algorithms to search the space of single variable classifiers. They combined the results of several selected SVCs to generate the final classification and showed enhanced performance on handwritten digit recognition. You et al. (2014) proposed a feature ranking method based on partial least squares and feature elimination and applied it to multi-class classification problems. Alt (2013) proposed single variable classifiers for feature extraction in binary text categorization, by using the outputs of the SVCs to form the document vectors. Yoon et al. (2013) proposed a method to integrate feature selection and classification for neural networks. Zhao et al. (2011) proposed a method based on a genetic algorithm to simultaneously optimize the feature subset and the parameters of an SVM.

4. Proposed Study

We have investigated the following five questions related to the application of single variable classifiers in feature ranking:

1. Does the classifier used in SVC ranking have a large impact on the final feature ranking results, or is the discriminative power of the features themselves more important than the classifier?

2. Which classifiers produce more similar results when used in SVC feature ranking?

3. If the choice of classifier influences the final rankings, is there a classifier bias in the rankings? In other words, when using the final feature ranking result to build a model for classification, is it always best to use the same classifier for feature ranking and final classification, or do other combinations work better?

4. Does taking an ensemble approach and combining the results from SVC feature rankings via multiple classifiers help achieve a universally superior result that improves classification performance?

5. If using the same classifier for feature ranking and final classification is not optimal, what is the performance loss of taking such an approach compared to the optimal choice of classifiers?

We have carried out multiple experiments to answer the above questions. To answer the first question, we have compared the results from SVC rankings using different classifiers. We report the correlation and similarity of the rankings to address the second question. For the third, we have evaluated a top portion of the ranked features using different classifiers and report their comparative performance results. With respect to the fourth question, we have calculated ensemble scores and rankings and compared their performance with single-classifier SVC rankings. Finally, to answer the fifth question, we have calculated the classification performance difference between the best combination of ranking classifiers (CR) and learning classifiers (CL) and the case where the same classifier is used for both.

In the following section, the overall framework of the experiments is discussed, and the ranking and similarity measures that we have used in the experiments are explained. Calculations of the ensemble ranking and the methods for classification performance evaluation are also included in this section.

4.1. Framework

The overall framework of the experiments is shown in Figure 2. In the first step, single variable classifiers are built using each of the classifiers (CR) included in the study and each of the features in the datasets. Then, the features are ranked based on the classification performance of the SVC (CR) built with that feature. The correlation of the rankings with different classifiers (CR) and the intersection of their results are measured in this step as their similarity (ω) for each dataset, and the overall averages are reported. An ensemble performance score is also assigned to each feature by combining the results from the different classifiers (CR). This ensemble rank will be used to answer the fourth question asked in the previous section. In other stages, a portion of the top ranking features from each ranking list is selected into a feature subset, and several classifiers (CL) are built using these features. Via 5-fold cross validation, the performance of the classifiers (CL) has been recorded. These experiments have been repeated for different values of α, and the mean and the standard deviation of the results are reported.

Figure 2: Experimental schema of the study and evaluation methods. For each ranking classifier (CR) and each dataset, an SVC is built for every feature and the features are ranked by the cross-validated SVC performance; the top α features are selected, an ensemble ranking is formed over all CR, and each subset is evaluated with every learning classifier (CL). In the experiments, for each dataset, the correlations and similarities of the SVCs built using different classifiers (CR) are calculated, and the final correlation between the SVCs is reported as the average correlation over the datasets. In other parts of the study, feature subsets are selected from the top of the feature ranking lists, and these subsets are used for classification to measure which subset works best with which classifier (CL) and answer the related question. The reported results are averages over all datasets.

4.2. Ranking Measure

Several criteria may be used to measure the classification performance of an SVC, such as accuracy, precision or positive predictive value (ppv), negative predictive value (npv), recall or sensitivity or true positive rate (tpr), specificity or true negative rate (tnr), false positive rate (fpr), false negative rate (fnr), and F1-measure. Definitions of these measures can be found in (Forman, 2003) as well as in most classic data mining and machine learning books and resources. However, most of these measures calculate the model performance based on a fixed decision boundary, which causes problems with class-imbalanced datasets.

Since datasets with imbalanced class distributions are common in real-world applications, the performance of many classifiers is evaluated by the receiver operating characteristic (ROC) curve, which is constructed by plotting the true positive rate versus the false positive rate while changing the decision threshold or boundary (Fawcett, 2004). The area under the ROC curve (AUC) is a performance measure that can be used to evaluate the SVC despite an imbalanced distribution of the classes. The methods of Chen and Wasikowski (2008) and Wang and Tang (2009) that use AUC for feature selection are examples in this category. We have used AUC in our studies to evaluate the classification performance.
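A small toy example (ours, not part of the paper's experiments) of why a threshold-free measure is used: on an imbalanced sample, accuracy at a fixed decision boundary can look high for a useless predictor, while the AUC of an uninformative score stays near 0.5.

import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.RandomState(0)
y = np.r_[np.ones(10), np.zeros(90)]      # 10% positives, 90% negatives
scores = rng.rand(100)                    # an uninformative "feature"

# Always predicting the majority class yields 90% accuracy but carries no information.
print("accuracy, always negative:", accuracy_score(y, np.zeros(100)))
# The AUC of the random score stays near 0.5, exposing the lack of discrimination.
print("AUC, random score:", roc_auc_score(y, scores))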

4.3. Feature Subset Selection

Different thresholds may be used to select feature subsets from the top of the feature ranking lists. Selecting a fixed number of features, selecting a fixed percentage of the features, using a performance threshold, and using the amount of performance improvement when adding a new feature to the subset can all be used for this purpose (Arauzo-Azofra et al., 2011). Since we mostly want to use datasets with very high dimensionality, using a method based on measuring the performance improvement of adding features is not feasible. On the other hand, since the numbers of features in the datasets are very different, using a fixed percentage of the features is not practical either. For example, using a 10% threshold in datasets with 20,000 features and with 44 features produces incomparable results. Hence, for the sake of consistency across datasets, we have chosen a fixed number of features in the subsets, selected to be computationally feasible and relevant to the datasets' average dimensionality.

4.4. Similarity Measure

In the study of the stability of feature selection methods, different similarity measures have been proposed and utilized (Kalousis et al., 2007, Yu et al., 2008, Saeys et al., 2008). In our studies, we have considered two perspectives. One is how similar the whole ranking list generated by one classifier is to the ranking generated by another. This similarity (ω) is measured by Spearman's rank correlation coefficient, which for rankings generated using classifiers CR1 and CR2 is defined as:

ω_ρ(R^CR1, R^CR2) = 1 − [ 6 Σ_{i=1..m} (R^CR1(f'_i) − R^CR2(f'_i))^2 ] / [ m(m^2 − 1) ]

where m is the total number of features in a dataset, and R^CR1(f'_i) and R^CR2(f'_i) are the ranks of feature f'_i with SVCs built with classifiers CR1 and CR2.
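This coefficient can be computed directly from two rank vectors; a minimal sketch (ours), mirroring the formula above for rankings without ties:

import numpy as np

def spearman_similarity(ranks_a, ranks_b):
    """omega_rho for two rankings given as arrays of ranks 1..m (no ties)."""
    ranks_a = np.asarray(ranks_a, dtype=float)
    ranks_b = np.asarray(ranks_b, dtype=float)
    m = ranks_a.size
    d2 = np.sum((ranks_a - ranks_b) ** 2)
    return 1.0 - (6.0 * d2) / (m * (m ** 2 - 1))

# e.g. spearman_similarity([1, 2, 3, 4], [2, 1, 3, 4]) -> 0.8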

Besides SVCs, classifiers are often used in wrapper feature selection algorithms, where a feature or a group of features with the best or the worst performance is selected for the next step of the iteration, to be either added to or eliminated from the final subset. With this practice in mind, we have also reported the intersections of the features in the top and bottom portions of the ranked lists obtained using different classifiers (CR). The intersection of the results generated using classifiers CR1 and CR2 is reported as:

ω_∩(R^CR1, R^CR2) = ( |R^CR1_α ∩ R^CR2_α| / α ) × 100

where R^CR1_α and R^CR2_α are the feature subsets selected from the top or bottom α portion of the feature rankings R^CR1 and R^CR2. The number is multiplied by 100 to get a percentage value.
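A sketch of this intersection measure (ours; alpha = 30 below simply matches the subset size used later in the experiments): take the α top-ranked (or bottom-ranked) features from each ranking and report the overlap as a percentage.

import numpy as np

def intersection_similarity(ranks_a, ranks_b, alpha=30, bottom=False):
    """omega_intersection: percentage overlap of the top (or bottom) alpha features of two rankings."""
    ranks_a, ranks_b = np.asarray(ranks_a), np.asarray(ranks_b)
    if bottom:
        subset_a = set(np.argsort(ranks_a)[-alpha:])   # worst-ranked features
        subset_b = set(np.argsort(ranks_b)[-alpha:])
    else:
        subset_a = set(np.argsort(ranks_a)[:alpha])    # best-ranked features
        subset_b = set(np.argsort(ranks_b)[:alpha])
    return 100.0 * len(subset_a & subset_b) / alpha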

4.5. Ensemble Feature Ranking

Ensemble learning methods have been widely used to address multiple issues in data mining and machine learning, such as improving the accuracy, generalization, and robustness of the learning model. Ensemble classification methods are learning algorithms that construct a set of classifiers and classify new data points by taking votes of their predictions. Dietterich (2000) demonstrated that an ensemble of classifiers that have independent errors improves the overall accuracy. He identified three main reasons to explain the positive effect of ensemble methods on classification: reducing the risk of choosing the wrong classifier, lowering the chance of getting stuck in local optima, and expanding the space of representable functions. Intuitively, the above reasoning should also hold in feature ranking with SVCs.

Based on the nature of the classifiers that participate in the voting, there can be two types of classifier ensembles:

• Homogeneous ensemble of classifiers, where all of the classifiers involved are of the same type and the training samples of each classifier are different, e.g., decision tree bagging.

• Heterogeneous ensemble of classifiers, where the classifiers in the ensemble are not the same, for example when the final prediction result is produced from a combination of SVM, KNN, and DT predictions. This method is also referred to as consensus learning.

Since we want to study the effect of the ensemble approach on the SVC feature ranking bias, we have used a heterogeneous ensemble method. Formally, if R^CR(f'_i), the ranking of f'_i using SVC with classifier CR, is calculated based on the classification performance of the feature f'_i with the classifier CR, i.e., λ_CR(X', {f'_i}, Y), then the ensemble ranking score of the feature f'_i (λ_ens) can be calculated as follows:

λ_ens(X', {f'_i}, Y) = τ( λ_CR1(X', {f'_i}, Y), ..., λ_CRn(X', {f'_i}, Y) )

where CR1, ..., CRn indicate the classifiers used in the study and τ is the ensemble function.

There are multiple ensemble functions to combine the feature ranking results. Maximum, sum, mean, median, and different voting schemes have been used and studied extensively (Chrysostomou et al., 2008, Makrehchi and Kamel, 2007, Yan, 2007). However, our study is not focused on comparing ensemble functions. Since the mean and the median are the more commonly used ensemble functions, and an outlier can seriously distort the result of the mean (Kittler et al., 1998), the median has been adopted in our approach as the ensemble function. Furthermore, preliminary results of our experiments did not show significant outcome changes when substituting the median with the mean.
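A sketch of the ensemble score with the median as τ (our illustration; the score-matrix layout is an assumption): given the per-classifier SVC scores λ_CR(X', {f'_i}, Y) arranged as a classifiers-by-features matrix, the ensemble score of each feature is the median over the classifiers, and the ensemble ranking orders the features by that score.

import numpy as np

def ensemble_ranking(score_matrix):
    """score_matrix[k, i] = lambda of feature i under ranking classifier CR_k.
    Returns (ensemble scores, ranks), with the median as the ensemble function tau."""
    scores = np.median(score_matrix, axis=0)     # robust to a single outlying classifier
    order = np.argsort(scores)[::-1]             # higher score -> better rank
    ranks = np.empty(scores.size, dtype=int)
    ranks[order] = np.arange(1, scores.size + 1)
    return scores, ranks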

4.6. Bias Evaluation

To address the third question, we have selected feature subsets from the top of the feature ranking lists and built classifiers with them. The performance of these classifiers has been evaluated via cross validation. Intuitively, the feature subsets selected from the SVC rankings generated by the same classifier (CR) as the final learning classifier (CL) (i.e., CR = CL) should perform better than the SVC rankings generated with the other classifiers (i.e., CR ≠ CL).

We should also consider that the threshold for choosing the feature subset changes the final results. To mitigate this effect, we have considered two scenarios. In the first scenario, we have chosen feature subsets containing different numbers of features. Feature subsets with the same number of features are then sorted based on the performance of the classifiers built with them and receive placements from 1, indicating the best performance, to n, the worst performance. This method is similar to the approach for comparing classifier performance explained by Demsar (2006).


In other words, we have built subsets ranging from 1 feature up to α features, and for every subset size we have recorded which subset comparatively works better with a specific classifier. Then, we have averaged the placements over the subsets with different sizes for all of the datasets. Since the mean is sensitive to outliers, they are removed prior to calculating the average. To remove the outliers, only 90% of the placements have been used, and the lowest 5% and the highest 5% of the values have been discarded. The pseudo-code that calculates the average placement values is shown in Algorithm 1. In this algorithm, the number of SVC rankings equals the number of classifiers (CR) plus one for the ensemble ranking.

Algorithm 1 Average position of each CR ranking for each CL.

  int c = number of classifiers in the study;
  int M.AUC[c, c+1, α], M.POSITION[c, c+1, α], M.MEAN.POSITION[c, c+1];
  for i := 1 to α do
    for every classifier CL do
      for every SVC ranking list R^CR do
        Feature.Subset = R^CR_i;
        Build CL using Feature.Subset;
        M.AUC[CL, CR, i] = AUC of CL;
      end for
      M.POSITION[CL, · , i] = order(M.AUC[CL, · , i]);
    end for
  end for
  for every classifier CL do
    for every SVC ranking list R^CR do
      M.MEAN.POSITION[CL, CR] = mean(M.POSITION[CL, CR, · ], "remove 10% outliers");
    end for
  end for
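A compact Python rendering of Algorithm 1 (a sketch under our reading of the pseudo-code; the AUC array layout and the 5% trimming at each end are our assumptions): for every subset size from 1 to α, the rankings are placed by the AUC they yield with a given CL, and the placements are then averaged with the extreme values discarded.

import numpy as np
from scipy.stats import rankdata, trim_mean

def average_placements(auc, trim=0.05):
    """auc[l, r, i] = AUC of learning classifier CL_l built on the top-(i+1) features of
    SVC ranking r (the last r may be the ensemble ranking).
    Returns mean_position[l, r]: the average placement of ranking r under CL_l."""
    n_cl, n_rank, alpha = auc.shape
    placements = np.empty_like(auc)
    for l in range(n_cl):
        for i in range(alpha):
            # Placement 1 = best AUC among all rankings for this CL and subset size.
            placements[l, :, i] = rankdata(-auc[l, :, i], method="ordinal")
    # Average over subset sizes, discarding the lowest and highest 5% of the placements.
    return np.array([[trim_mean(placements[l, r, :], trim) for r in range(n_rank)]
                     for l in range(n_cl)])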

Although in practice the feature subset selection threshold is mostly chosen arbitrarily, it is possible to choose the best-performing number of features in a subset using a learning curve and to build a classifier with that number of features (Arauzo-Azofra et al., 2011). For this reason, we have also repeated the above experiments by considering the number of features that results in the highest performance of all, without averaging over multiple thresholds.

5. Experimental Results

Seven classifiers and ten datasets have been used to conduct this study. Single variable classifiers have been built with each classifier on each dataset. Any samples with missing values have been eliminated prior to conducting the study. The classifiers used in the experiments are Support Vector Machines (SVM) with a polynomial kernel, Naïve Bayes (NB), Multilayer Perceptron (MLP), K-Nearest Neighbors (KNN) (K=5), Logistic Regression (LR), AdaBoost (AB) with a decision stump base classifier, and Random Forests (RF). The datasets used in the experiments are listed in Table 1 and have been obtained from the University of California at Irvine machine learning data repository (UCI) (Frank and Asuncion, 2010), the Arizona State University feature selection repository (ASU) (Liu et al., 2011), and the Causality Workbench Repository (CWR) (CW-Team, 2011).

Table 1: Datasets used in the experiments.

Name          Samples   Features   Domain       Source
InternetAds     3279      1558     Internet     UCI
BASEHOCK        4862      1993     Newsgroup    ASU
CINA           16033       133     Census       CWR
GLI-85            85     22283     Microarray   ASU
MADELON         4400       500     Synthetic    UCI
MARTI            500      1025     Microarray   CWR
MUSK (v.1)       476       166     Physics      UCI
REGED            500      1000     Microarray   CWR
SMK              187     19993     Microarray   ASU
SPECTF           267        44     Clinical     UCI

The experiments explained in the previous sections have been implemented using the Waikato Environment for Knowledge Analysis (WEKA), Java, and R. For each feature in each dataset, an SVC has been built and evaluated via cross validation. The features have been ranked based on the AUC of the SVCs.

For example, Figure 3 demonstrates the ranking of the 166 features of the MUSK dataset generated with SVCs built with Naïve Bayes and KNN. Figure 3(a) shows the AUC of the features' SVCs, and Figure 3(b) shows the rankings based on the SVC AUCs. Lines demonstrate the threshold for the 30 top- and bottom-ranked features; features marked with a triangle pointing up are top-ranked features, and those marked with triangles pointing down are features in the bottom section. For example, the top 30 features ranked with the SVC built with Naïve Bayes are shown in the boxes marked 3, 6, and 9, and the bottom 30 are shown in the boxes marked 1, 4, and 7.

Figure 3: SVC feature ranking on MUSK using KNN and Naïve Bayes: (a) AUC of the SVC for each feature and (b) rankings based on the AUCs. Lines demonstrate the threshold for the 30 top- and bottom-ranked features. Features marked with a triangle pointing up are top-ranked features, and those marked with triangles pointing down are features in the bottom section.

The plot demonstrates how different the rankings from one classifier can be with respect to the other. The top 30 features ranked with the KNN-SVC are in boxes 1, 2, and 3, whereas the NB-SVC's top 30 are in boxes 3, 6, and 9. The only intersection between the two is box 3, containing 11 features out of 30, which is only 37% agreement between the two. It is also interesting that they even disagree on the most discriminative feature. Boxes 1 and 9 contain features that are in one ranking method's selection list and among the other's elimination nominees. As shown in the plot, there are 3 features in box 1 that NB sees as totally worthless while KNN ranks them as highly valuable.

Note that the order of the features has not changed between the two scatter plots, although their distribution has. The distribution in Figure 3(b) does not depend on the values of the AUCs, which differ from one classifier to the other. The AUC plots are more skewed towards the top, which has no particular meaning in our context. It is only the variability of the rankings that is of interest in this study. Therefore, the rankings, which are in a more normalized form, were used for the correlation calculations.

This ranking was performed using all seven classifiers in SVCs. Figure 4 demonstrates the feature rankings from the MUSK dataset generated with SVCs built using all seven classifiers in the study. It is seen that the ranking disagreement shown in Figure 3 for SVCs with NB and KNN is present with most of the other classifiers as well. However, there are also some classifiers that generate more correlated ranking results.

Figure 4: Feature rankings from the MUSK dataset generated with SVCs built using all seven classifiers in the study. The plot from Figure 3 is highlighted using double borders.

Figures 3 and 4 demonstrate that SVC feature rankings generated using different classifiers are extremely different on the MUSK dataset. To quantify the results, we have calculated the Spearman rank correlation measure for each pair of classifiers on all datasets. Table 2 summarizes the average correlation between the SVC rankings built with multiple classifiers over the ten datasets. It is seen that on average the similarity between the results is only 0.60, and it is highly dependent on the dataset, considering the overall standard deviation of 0.24. However, it is interesting to note that the results from LR and MLP are highly correlated, and NB demonstrates relatively high correlation with these two. KNN and RF, on the other hand, produce relatively similar results, and AB tends more towards the LR group. SVC rankings using SVM do not show strong correlation with any other classifier.

To see how similar the feature subsets selected by these SVCs are to each other, we selected feature subsets from the top and the bottom of the ranking lists. A fixed number of 30 features, which is close to the average of 10% of the number of features in each dataset, was chosen to form the subsets. We then calculated the similarity of the feature subsets formed by the 30 top and 30 bottom features of each list using the intersection percentage described earlier.
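A minimal sketch of this subset-similarity measure, assuming the intersection percentage is simply the size of the overlap divided by the subset size (the formal definition appears earlier in the paper and is not repeated here):

```python
import numpy as np

def subset_similarity(ranks_a, ranks_b, k=30, top=True):
    """Intersection percentage of the top-k (or bottom-k) features of two rankings."""
    order_a = np.argsort(ranks_a)          # ascending rank: best feature first
    order_b = np.argsort(ranks_b)
    subset_a = set(order_a[:k]) if top else set(order_a[-k:])
    subset_b = set(order_b[:k]) if top else set(order_b[-k:])
    return 100.0 * len(subset_a & subset_b) / k
```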

Table 3 shows the similarities of the feature subsets.


Table 2: Spearman correlation of the SVC rankings using different classifiers (values are µ ± σ).

        SVM          NB           MLP          KNN          LR           AB
NB      0.54±0.22
MLP     0.63±0.25    0.83±0.10
KNN     0.35±0.28    0.53±0.35    0.51±0.34
LR      0.67±0.27    0.83±0.13    0.94±0.03    0.52±0.36
AB      0.48±0.20    0.74±0.20    0.73±0.19    0.66±0.27    0.71±0.20
RF      0.29±0.27    0.44±0.35    0.44±0.35    0.83±0.12    0.43±0.37    0.57±0.28

Average Mean = 0.60, Average STD = 0.24

It is interesting to note that the subsets selected from the top of the lists are more similar to each other than those from the bottom. This might suggest that SVCs are more consistent in finding good-quality features than in identifying useless ones, so classifiers might work better in forward feature selection than in backward elimination. On the other hand, this phenomenon might simply reflect the abundance of useless features in the datasets, none of which has any superiority over the others.

The results have a high standard deviation, again suggesting a strong influence of the datasets on the outcome. It is also interesting that the two clusters of classifiers based on the overall ranking correlations reported above do not exist here; instead, KNN, AB, and NB seem more similar in the top section, and SVM and AB also show similar results. In the feature subsets from the bottom of the ranking lists, the same similarities exist but with less strength.

From the above experimental results, we can answer the first two questions. The overall average correlation of 0.60, together with the overall average similarity of 61.5% for the top and 36.2% for the bottom ranked feature subsets, suggests that rankings are highly different when different classifiers are used to build the SVCs. As shown on the MUSK dataset, in many cases the rankings do not even agree on the most discriminative feature. This implies that even wrappers that select only the single top ranking feature for the next step of their algorithm might produce highly different feature subsets.

With regard to the second question, although some classifiers form groups based on their correlations,

Table 3: Similarity (intersection) of the feature subsets selected using the top and bottom features of the SVC rankings (values are µ(%) ± σ(%)).

(a) Similarity of the top 30 features using different classifiers.

        SVM          NB           MLP          KNN          LR           AB
NB      62.4±24.9
MLP     59.0±33.7    56.2±26.5
KNN     64.1±24.0    74.5±23.6    56.9±28.7
LR      46.6±32.0    56.2±36.5    54.5±31.2    56.5±35.4
AB      75.2±29.8    77.6±22.5    57.2±31.1    82.4±14.4    57.2±37.3
RF      52.8±27.8    63.8±32.3    63.4±28.8    63.4±34.1    63.8±33.9    65.2±33.0

Average Mean = 61.5%, Average STD = 30.7%

(b) Similarity of the bottom 30 features using different classifiers.

        SVM          NB           MLP          KNN          LR           AB
NB      25.7±25.3
MLP     35.3±39.5    19.7±25.5
KNN     27.0±29.4    47.3±28.1    18.7±27.5
LR      19.3±25.7    34.7±39.1    23.7±28.9    32.0±36.9
AB      41.3±37.7    58.0±23.9    24.0±27.8    61.7±25.5    37.3±41.2
RF      20.7±26.1    40.3±35.4    21.3±29.3    38.0±36.6    39.0±36.5    41.3±36.7

Average Mean = 36.2%, Average STD = 31.3%

these groups do not persist for the top and bottom ranked feature subsets. It can therefore be inferred that, although there are some correlations between the overall rankings as reported in Table 2, classifiers select the top ranking features differently. It should be noted, however, that changing the thresholds for selecting the feature subsets changes the similarity results to some extent.

To address the third question and determine whether there is any classifier bias in the SVC feature rankings, we selected feature subsets containing from 1 to 30 features using the SVC rankings from all classifiers (CR). These subsets were then used to build the final classifiers (CL), whose performance was evaluated via 5-fold cross validation in terms of AUC. For example, Figure 5 shows the performance of the top 30 features of the MUSK dataset according to the different SVC rankings, evaluated with NB, MLP, and LR as the final classifiers (CL). Each point indicates the number of top ranked features from a given SVC ranking that the final classifier (CL) was built with. In this dataset, NB works best with its own SVC ranking at most points, KNN and RF share the best performing points when MLP is the final classifier (CL), and the SVM ranking works very well with LR.
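A hedged sketch of this evaluation loop (the classifier objects and the AUC scorer are assumptions; the paper does not list its exact implementation): for each ranking CR and each subset size k, the final classifier CL is scored on the top-k features with 5-fold cross-validated AUC.

```python
import numpy as np
from sklearn.model_selection import cross_val_score

def evaluate_ranking(X, y, ranks, final_clf, max_k=30, cv=5):
    """AUC of the final classifier (CL) trained on the top-k features of one SVC ranking (CR)."""
    order = np.argsort(ranks)            # best-ranked feature first
    aucs = []
    for k in range(1, max_k + 1):
        top_k = order[:k]
        score = cross_val_score(final_clf, X[:, top_k], y,
                                cv=cv, scoring="roc_auc").mean()
        aucs.append(score)
    return np.array(aucs)                # aucs[k-1] = AUC with the top-k features
```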

Figure 5: NB, MLP, and LR classification performance on the top 30 ranked features of the MUSK dataset. The different feature rankings were obtained using all classifiers in the study.

To obtain more general results, the SVC rankings were sorted by the resulting AUCs and assigned a placement from 1 for the best performance to 8 for the worst. The overall average placements for each classifier over all datasets are shown in Figure 6(a). For the second scenario, where only the best performing number of features between 1 and 30 is chosen, the findings are shown in Figure 6(b).
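A minimal sketch of how such placements could be computed, assuming a matrix of AUCs with one row per dataset/subset-size setting and one column per SVC ranker (the aggregation details are our assumption):

```python
import numpy as np
from scipy.stats import rankdata

def average_placements(auc_matrix):
    """auc_matrix: rows = (dataset, subset size) settings, columns = SVC rankers.
    Returns the average placement of each ranker (1 = best AUC, 8 = worst)."""
    # Rank within each row; negate so the highest AUC receives placement 1.
    placements = np.apply_along_axis(rankdata, 1, -auc_matrix)
    return placements.mean(axis=0)
```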

Figure 6(a) shows that four out of seven classifiers do not achieve their best classification performance with the SVC feature ranking built from the same classifier. NB, LR, and AB perform best when ranking and classification are done with the same classifier; however, when the best number of features is considered, the LR SVC ranking drops to the second choice for the LR classifier. SVM, MLP, KNN, and RF do not achieve their best performance when paired with an SVC ranking of the same type.

This experiment shows that some classifiers strongly affect the SVC feature ranking while others do not. Depending on the classifier used, the SVC ranking therefore does not reflect the pure discriminative value of the features, and other measures should be considered. On the other hand, when the classifier to be used for the final model (CL) is known, such bias is desirable, since the goal is a ranking that yields the best performance with that classifier.

Figure 6: Average classification performance placement of the different classifiers used in the SVC ranking (CR) over all datasets. Each row corresponds to one classifier being the final learning classifier (CL), and the comparative performance placements are shown on the right side of the row. (a) Several numbers of features are selected for the feature subsets and the results are averaged. (b) Only the best number of features, which generated the highest classification performance, is considered.

The above experiment, however, suggests that it is not always the best choice to use the same classifier in the SVC ranking (CR) and final classification (CL).

Even in the cases where using the same classifier in both stages results in the comparatively best performance, the average placement is not close to first. This suggests that, although such classifiers are on average better than the other classifiers used in the SVCs, they are not always the best choice, and other combinations produce better results depending on the dataset. For the other four classifiers, using the same classifier for the SVC ranking and the final classification does not produce the best performance even on average.

To address the fourth question about ensemble ranking, we included the ensemble SVC ranking in the Figure 6 charts, shown with an E symbol. In Figure 6(a), which averages over the number of features selected, the ensemble SVC ranking is mostly placed as the second or third best SVC ranker. It performs relatively worse in Figure 6(b), where only the best number of features is considered. In these experiments, RF seems to be an exception for which the ensemble SVC ranking performs near the worst. Although the ensemble ranking never produces the best performance with any classifier, on average it performs reasonably well when the best number of features to choose is unknown, making it a good candidate for ranking. It should also be noted that only the 30 top ranked features were considered in these experiments, while the true best number of features might be beyond 30.
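As an illustration only (the paper's exact aggregation scheme is described in an earlier section and may differ), a heterogeneous ensemble SVC ranking can be obtained by averaging the per-classifier ranks of each feature:

```python
import numpy as np

def ensemble_ranking(rank_lists):
    """rank_lists: list of arrays, each holding one classifier's SVC ranks (1 = best).
    Returns ensemble ranks obtained by averaging ranks across classifiers."""
    mean_ranks = np.mean(np.vstack(rank_lists), axis=0)
    order = np.argsort(mean_ranks)                 # smallest mean rank = best
    ensemble = np.empty_like(order)
    ensemble[order] = np.arange(1, len(order) + 1)
    return ensemble
```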


Knowing that the de facto standard of using the same classifier for the SVC ranking and the final classification does not always produce the best results, we would like to understand, for each classifier, how far the final classification performance falls from the optimal choice of SVC ranker when this approach is taken. The fifth question in this paper addresses this concern.

In these experiments, for each number of features between 1 and 30, we calculated the distance between the best performance with that number of features (the best combination of CR and CL) and the performance obtained when the same classifier is used for the SVC ranking (i.e., CR = CL). After removing the 10% outliers as in Algorithm 1, the average performance losses are reported in Table 4(a) in terms of AUC percentage. The smallest losses occur with NB and AB, which also showed the best performance in Figure 6 when the same classifier was used for the SVC ranking (CR) and the final classification (CL).
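A hedged sketch of this loss computation (the outlier-trimming step follows the spirit of the paper's Algorithm 1, which is not reproduced here; the trimming of the largest 10% of values and the data layout are our assumptions):

```python
import numpy as np

def average_loss(auc, cl, trim=0.10):
    """auc: array of shape (n_subset_sizes, n_rankers, n_classifiers) of AUC values in [0, 1];
    cl: index of the final classifier, assumed to also index its own SVC ranking (CR = CL)."""
    best = auc[:, :, cl].max(axis=1)          # best CR for this CL at each subset size
    same = auc[:, cl, cl]                     # CR = CL at each subset size
    losses = np.sort(best - same)             # ascending: largest losses last
    keep = int(np.ceil(len(losses) * (1 - trim)))
    return 100.0 * losses[:keep].mean()       # loss in AUC percentage points
```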

The largest losses occur with KNN and RF, which also did not perform well in Figure 6. This suggests that, by using different classifiers for the SVC ranking and the final classification, a gain of 0.24% to 5.56% in AUC might be achieved depending on the final classifier. The high variation of the results demonstrates how strongly this conclusion depends on the datasets. Table 4(b) summarizes the results of the same experiment when the best number of features is chosen for each subset; similar but more extreme distances are reported there.

6. Discussion

To understand the phenomenon in Figure 3 where NB places a feature in its bottom list while KNN ranks the same feature in the top performing category, we studied such features from the MUSK dataset in more detail. Feature F121, which according to the description in the UCI repository (Frank and Asuncion, 2010) is a kind of distance measurement, is one of the three features with this characteristic in Figure 3. It is ranked as the ninth best performing feature by KNN, while NB ranks it as one of the worst. The effect of this feature on the final classification task is shown in Table 5: when KNN is the final classifier (CL), F121 increases the AUC by 1%, and when NB is the final classifier, F121 reduces the AUC by 0.3%. Judging by the changes F121 brings to the performance of the other classifiers, the patterns in this feature appear to be more local than global.

Table 4: Average performance loss from the optimal combinations when using the same classifier in the SVC feature ranking (CR) and the final classification task (CL). The numbers are AUC percentages.

(a) Several numbers of features are selected for the feature subsets and the results are averaged.

              SVM    NB     MLP    KNN    LR     AB     RF
AD            11.85  1.00   0.64   0.07   0.26   0.15   0.18
BASEHOOK      0.68   0.01   0.38   0.51   0.34   0.10   0.34
CINA          0.71   0.04   0.09   0.09   0.02   0.08   0.14
GLI           2.97   0.22   1.61   2.07   4.14   1.01   1.75
MADELON       0.13   0.25   2.39   27.59  0.34   1.64   16.84
MARTI         2.39   0.00   6.45   17.41  10.60  2.58   18.58
MUSK          0.09   0.21   4.59   0.29   1.43   3.15   1.21
REGED         0.34   0.39   0.15   0.17   1.10   0.02   0.28
SMK           0.11   0.32   6.87   3.66   0.91   0.29   8.20
SPECTF        1.52   0.00   3.03   3.73   4.98   1.18   1.69
Average
Distance(%)   2.08±3.39  0.24±0.29  2.62±2.43  5.56±8.87  2.41±3.17  1.02±1.07  4.92±6.79

(b) Only the best number of features, which generated the highest classification performance, is considered.

              SVM    NB     MLP    KNN    LR     AB     RF
AD            10.30  2.02   1.03   0.00   0.41   0.00   0.03
BASEHOOK      0.54   0.10   0.10   0.21   0.19   1.01   0.89
CINA          1.54   0.70   0.23   0.00   0.40   0.53   0.00
GLI           1.12   0.00   5.78   6.78   2.13   2.91   0.00
MADELON       0.00   0.29   3.49   28.22  0.00   0.27   18.22
MARTI         0.00   0.00   12.05  24.10  25.90  0.00   20.86
MUSK          0.00   0.92   4.26   0.96   3.14   3.39   0.00
REGED         0.38   0.43   0.00   0.00   0.88   0.02   0.05
SMK           0.00   0.00   13.41  4.94   3.14   1.55   7.45
SPECTF        4.97   0.00   5.61   9.35   3.72   0.00   0.00
Average
Distance(%)   1.88±3.15  0.45±0.61  4.60±4.58  7.46±9.90  3.99±7.72  0.97±1.20  4.75±7.73

Table 5: AUC of the different classifiers (CL) when trained and tested with the first eight and nine features from the KNN (CR) feature ranking. F121 improves the performance of the KNN classifier (CL) while degrading the NB performance.

Rank  Feature  SVM    NB     MLP    KNN    LR     AB     RF
8     F98      0.774  0.723  0.849  0.858  0.779  0.796  0.909
9     F121     0.772  0.720  0.866  0.868  0.777  0.796  0.924

Histograms of the instance values of feature F121 are shown in Figures 7(a) and 7(b) for each class. The figure also shows how these distributions can be estimated via a kernel density estimator, both with a high smoothing level that is close to a Gaussian and with a less smooth hyper-parameter. In Figure 7(c) there is little difference between the near-Gaussian distribution estimates of the two classes, while the less smooth kernel density estimates provide some distinction around the values of 0 and 100. This variation explains why KNN, which captures local structure, performs better than NB; in our setup NB uses a fixed smoothing hyper-parameter for the distributions. After changing the density estimation method in NB, F121 no longer performs among the worst features and provides average performance.
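A minimal sketch of the kind of comparison shown in Figure 7, assuming Gaussian kernel density estimates with two illustrative bandwidths (the actual bandwidths used by the authors are not reported here):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def class_densities(feature_values, labels, bandwidth, grid=None):
    """Per-class Gaussian KDE of one feature; a smaller bandwidth means less smoothing."""
    if grid is None:
        grid = np.linspace(feature_values.min(), feature_values.max(), 200)[:, None]
    densities = {}
    for c in np.unique(labels):
        kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth)
        kde.fit(feature_values[labels == c][:, None])
        densities[c] = np.exp(kde.score_samples(grid))
    return grid.ravel(), densities

# e.g., compare a smooth (large-bandwidth) and a less smooth (small-bandwidth) estimate:
# grid, smooth = class_densities(x_f121, y, bandwidth=50.0)   # near-Gaussian
# grid, rough  = class_densities(x_f121, y, bandwidth=5.0)    # captures local structure
```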

Figure 7: Histograms of the values of F121 for each class are shown in (a) and (b). A comparison of kernel density estimates with a smoother and a less smooth hyper-parameter is shown in (c) for both classes, where blue indicates class = 1 and red indicates class = 0.

While the smoother distribution estimate works well for other features, this performance change implies that the fixed hyper-parameter chosen for NB is not suitable for F121. However, it is virtually impossible to tune the hyper-parameters for each feature when a classifier is used as a means to measure feature quality. Since a separate classifier is built for every feature, the model selection problem would have to be solved to build the ideal classifier for that feature. Depending on the classifier (CR), running a grid search or another method to find the best hyper-parameters for each feature is simply infeasible because of the extreme computation time. The common approach is to set the hyper-parameters to values that work well on average, or to reuse the hyper-parameters intended for the final classifier (CL); the latter introduces the challenge of choosing the best hyper-parameters of the final classifier before the final features are known. Hence, we argue that when a classifier is used to find the best features, the search takes place not only in the feature space but also in the hyper-parameter space at each step. Wrappers and single variable classifiers both suffer from this issue.
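To make the cost argument concrete, the following hedged sketch (the classifier, parameter names, and grid are illustrative, not from the paper) shows what per-feature hyper-parameter tuning inside an SVC ranking would look like; the inner grid search multiplies the number of models to train by the size of the grid, which is why it is rarely practical:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

def svc_ranking_with_tuning(X, y, cv=5):
    """SVC ranking where each single-feature classifier is tuned separately.
    Trains len(grid) * cv models per feature, hence the prohibitive cost."""
    param_grid = {"n_neighbors": [1, 3, 5, 9, 15]}   # illustrative grid
    aucs = []
    for j in range(X.shape[1]):
        search = GridSearchCV(KNeighborsClassifier(), param_grid,
                              scoring="roc_auc", cv=cv)
        search.fit(X[:, [j]], y)
        aucs.append(search.best_score_)
    return np.argsort(-np.array(aucs))      # feature indices, best-tuned SVC first
```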

7. Conclusion and Future Directions

In this paper we investigated the set of questions we highlighted in Section 4 regarding the bias and stability of single variable classifier feature ranking, using multiple classifiers and datasets. The most important findings include:

1. We showed that SVC feature ranking is highly sensitive to the choice of classifier. The average correlation of the overall feature rankings generated with different classifiers is only 0.60; more importantly, the SVC rankings agree on only 61.5% of the 30 most predictive features and on 36.2% of the 30 least predictive features. This finding strongly suggests that SVC feature rankings are not good candidates for reporting the predictive power of features, e.g., in medical and biological domains. It also suggests that SVC rankings are even less robust in backward feature elimination than in forward feature selection, due to their high disagreement rate for elimination (i.e., only 36.2% agreement).

2. While some effect of the classifier on the SVC ranking is expected, we also empirically challenged the de facto standard of using the same classifier for ranking and final classification (CR = CL) and its ability to produce the best prediction performance. For four out of the seven classifiers in our study, the best classification performance was not achieved when the same classifier was used for the ranking and the final classification (CR = CL).

3. Furthermore, we quantified the loss of performance incurred by following the de facto standard of using the same classifier for the SVC ranking and the final classification, relative to the optimal classifier combination for ranking and classification. NB and AB show insignificant performance loss compared to other combinations and are good candidates for both roles (CR = CL). In contrast, KNN and RF suffer the highest loss of performance when used for both ranking and classification compared to better combinations. Besides providing insight into the stability of each classifier with respect to feature rankings, this study suggests that NB and AB may be better candidates for SVC ranking and classification than KNN and RF.

4. Several methods referenced in Section 3 take an ensemble approach to feature selection and ranking, either by combining different classifiers or by using the same classifier with different settings. We demonstrated that ensemble feature ranking with different classifiers does not produce the overall best feature ranking, but it does achieve above-average performance with most classifiers. This ensemble approach can provide a more classifier-independent view when investigating the predictive value of each feature, e.g., in medical and biological domains.

Although we have performed multiple studies of single variable classifier feature rankings, several other settings could affect the performance of SVCs. It would be valuable to study how the data distribution of each feature affects the overall SVC feature rankings. Furthermore, the effect of ensemble settings generated by different samplings of the data is interesting to study and could provide better insight into ensemble feature rankings. Our studies show that there is no single best performing feature subset across classifiers, which suggests using different feature rankings and selections for each weak learner (classifier) in heterogeneous ensemble methods. An ensemble method that exploits this diversity may achieve high performance and is worth studying.

In Section 6 we provided an example showing the significant effect of not only the classifier type but also its hyper-parameters on the SVC feature ranking and the overall performance. This issue affects both SVC feature rankings and the more widely used wrapper methods. Our studies suggest that, when a classifier is used to find the best features, both the feature space and the hyper-parameter space should be searched at each step. Finding the best feature subset depends on a specific hyper-parameter setting for a classifier, whereas the common approach finds the best hyper-parameter values via cross-validation on a fixed set of features. Therefore, separating the feature ranking/selection and hyper-parameter tuning steps fails to capture the best combinations of the two. Zhao et al. (2011) and Yoon et al. (2013) show improved performance by combining feature selection and parameter optimization for SVMs and neural networks. More systematic and general approaches could be highly valuable in finding better combinations and achieving higher performance.

Acknowledgments

This work was supported in part by NIH grants R01-EB002450 and R01-EB013227.

References

Isabelle Guyon. Practical feature selection: from correlation to causality. Mining massive data sets for security: advances in data mining, search, social networks and text mining, and their applications to security, pages 27–43, 2008.

R. Kohavi and G. H. John. Wrappers for feature subset selection. Artificial Intelligence, 97:273–324, 1997.

I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–82, 2003.

Liu Huan and Yu Lei. Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering, 17:491–502, 2005.

Christopher M. Bishop. Pattern recognition and machine learning. Springer, New York, 2006.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2nd edition, February 2009.

Zheng Alan Zhao and Huan Liu. Spectral feature selection for data mining. Chapman & Hall/CRC, 2011.

Steven Loscalzo, Lei Yu, and Chris Ding. Consensus group stable feature selection. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 567–575. ACM, 2009.

J. A. Lee and M. Verleysen. Nonlinear dimensionality reduction. Springer Verlag, 2007.

K. Kira and L. A. Rendell. The feature selection problem: traditional methods and a new algorithm. In AAAI-92. Proceedings Tenth National Conference on Artificial Intelligence, pages 129–34. AAAI Press, 1992.

Roberto Ruiz, Jose C. Riquelme, and Jesus S. Aguilar-Ruiz. Incremental wrapper-based gene selection from microarray data for cancer classification. Pattern Recognition, 39:2383–2392, 2006.

Herve Stoppiglia, Gerard Dreyfus, Remi Dubois, and Yacine Oussar. Ranking a random feature for variable and feature selection. Journal of Machine Learning Research, 3:1399–1414, 2003.

I. Slavkov, B. Zenko, and S. Dzeroski. Evaluation method for feature rankings and their aggregations for biomarker discovery. Journal of Machine Learning Research, 2010.

J. Demsar. Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research, 7:1–30, 2006.

Antonio Arauzo-Azofra, Jose Luis Aznarte, and Jose M. Benitez. Empirical study of feature selection methods based on individual feature evaluation for classification problems. Expert Systems with Applications, 38:8170–8177, 2011.

I. Guyon, H. M. Bitter, Z. Ahmed, M. Brown, and J. Heller. Multivariate non-linear feature selection with kernel methods. Soft Computing for Information Processing and Analysis, pages 313–326, 2005.

Isabelle Guyon, Jason Weston, Stephen Barnhill, and Vladimir Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 46:389–422, 2002.

Yonghong Peng, Zhiqing Wu, and Jianmin Jiang. A novel feature selection approach for biomedical data classification. Journal of Biomedical Informatics, 43:15–23, 2010.

M. Bacauskiene, A. Verikas, A. Gelzinis, and D. Valincius. A feature selection technique for generation of classification committees and its application to categorization of laryngeal images. Pattern Recognition, 42:645–54, 2009.

Iffat A. Gheyas and Leslie S. Smith. Feature subset selection in large dimensionality domains. Pattern Recognition, 43:5–13, 2010.

A. Tsanas, M. A. Little, and P. E. McSharry. A simple filter benchmark for feature selection. Journal of Machine Learning Research, 2010.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.

Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67:301–320, 2005.

K. Chrysostomou, S. Y. Chen, and Liu Xiaohui. Combining multiple classifiers for wrapper feature selection. International Journal of Data Mining, Modelling and Management, 1:91–102, 2008.

Shobeir Fakhraei, Hamid Soltanian-Zadeh, Farshad Fotouhi, and Kost Elisevich. Consensus feature ranking in datasets with missing values. In Machine Learning and Applications (ICMLA), 2010 Ninth International Conference on, pages 771–775. IEEE, 2010a.

Gokhan Gulgezen, Zehra Cataltepe, and Lei Yu. Stable and accurate feature selection. In Lecture Notes in Computer Science, volume 5781 LNAI, pages 455–468. Springer Verlag, 2009.

A. Kalousis, J. Prados, and M. Hilario. Stability of feature selection algorithms: a study on high-dimensional spaces. Knowledge and Information Systems, 12:95–116, 2007.

Lei Yu, Chris Ding, and Steven Loscalzo. Stable feature selection via dense feature groups. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 803–811. ACM, 2008.

Yvan Saeys, Thomas Abeel, and Yves Van de Peer. Robust feature selection using ensemble feature selection techniques. In Walter Daelemans, Bart Goethals, and Katharina Morik, editors, Lecture Notes in Computer Science, volume 5212, pages 313–325. Springer Berlin / Heidelberg, 2008.

H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101:1418–1429, 2006.

Jana Novovicova, Petr Somol, and Pavel Pudil. A new measure of feature selection algorithms' stability. In Data Mining Workshops, 2009. ICDMW'09. IEEE International Conference on, pages 382–387. IEEE, 2009.

Alexandros Kalousis, Julien Prados, and Melanie Hilario. Stability of feature selection algorithms. In Data Mining, Fifth IEEE International Conference on, pages 8–pp. IEEE, 2005.

L. Li, J. Zhang, and R. M. Neal. A method for avoiding bias from feature selection with application to naive bayes classification models. Bayesian Analysis, 3:171–196, 2008.

Surendra K. Singhi and Huan Liu. Feature subset selection bias for classification learning. In ACM International Conference Proceeding Series, volume 148, pages 849–856. ACM, 2006.

Mark A. Hall and Geoffrey Holmes. Benchmarking attribute selection techniques for discrete class data mining. IEEE Transactions on Knowledge and Data Engineering, 15:1437–1447, 2003.

W. Altidor, T. M. Khoshgoftaar, and J. Van Hulse. An empirical study on wrapper-based feature ranking. In IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2009), pages 75–82. IEEE, 2009.

Masoud Makrehchi and Mohamed S. Kamel. Combining feature ranking for text classification. In Conference Proceedings - IEEE International Conference on Systems, Man and Cybernetics, pages 510–515. IEEE, 2007.

Weizhong Yan. Fusion in multi-criterion feature ranking. In Information Fusion, 2007 10th International Conference on, pages 1–6. IEEE, 2007.

Fei-Long Chen and Feng-Chia Li. Combination of feature selection approaches with SVM in credit scoring. Expert Systems with Applications, 37:4902–4909, 2010.

R. Bryll, R. Gutierrez-Osuna, and F. Quek. Attribute bagging: Improving accuracy of classifier ensembles by using random feature subsets. Pattern Recognition, 36:1291–302, 2003.

Shobeir Fakhraei, Hamid Soltanian-Zadeh, Farshad Fotouhi, and Kost Elisevich. Effect of classifiers in consensus feature ranking for biomedical datasets. In Proceedings of the ACM fourth international workshop on Data and text mining in biomedical informatics, pages 67–68. ACM, 2010b.

Laura Emmanuella A. dos S. Santana and Anne M. Canuto. Filter-based optimization techniques for selection of feature subsets in ensemble systems. Expert Systems with Applications, 41(4, Part 2):1622–1631, 2014.

Rafael M. O. Cruz, George D. C. Cavalcanti, Ing Ren Tsang, and Robert Sabourin. Feature representation selection based on classifier projection space and oracle analysis. Expert Systems with Applications, 40(9):3813–3827, 2013.

Combination of single feature classifiers for fast feature selection. In Advances in Knowledge Discovery and Management, volume 527, pages 113–131. Springer International Publishing, 2014.

Wenjie You, Zijiang Yang, and Guoli Ji. Feature selection for high-dimensional multi-category data using PLS-based local recursive feature elimination. Expert Systems with Applications, 41(4, Part 1):1463–1475, 2014.

Feature extraction using single variable classifiers for binary text classification. In Recent Trends in Applied Artificial Intelligence, volume 7906, pages 332–340. Springer Berlin Heidelberg, 2013.

Hyunsoo Yoon, Cheong-Sool Park, Jun Seok Kim, and Jun-Geol Baek. Algorithm learning based neural network integrating feature selection and classification. Expert Systems with Applications, 40(1):231–241, 2013.

Mingyuan Zhao, Chong Fu, Luping Ji, Ke Tang, and Mingtian Zhou. Feature selection and parameter optimization for support vector machines: A new approach based on genetic algorithm with feature chromosomes. Expert Systems with Applications, 38(5):5197–5204, 2011.

G. Forman. An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3:1289–305, 2003.

T. Fawcett. ROC graphs: Notes and practical considerations for researchers. Machine Learning, 31:138, 2004.

Xue-Wen Chen and Michael Wasikowski. FAST: a ROC-based feature selection metric for small samples and imbalanced data classification problems. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 124–132. ACM, 2008.

Rui Wang and Ke Tang. Feature selection for maximizing the area under the ROC curve. In Data Mining Workshops, 2009. ICDMW'09. IEEE International Conference on, pages 400–405. IEEE, 2009.

T. G. Dietterich. Ensemble methods in machine learning. In Multiple Classifier Systems. First International Workshop, MCS 2000. Proceedings (Lecture Notes in Computer Science Vol. 1857), pages 1–15. Springer-Verlag, 2000.

J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas. On combining classifiers. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 20:226–239, 1998.

A. Frank and A. Asuncion. UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences {http://archive.ics.uci.edu/ml}, 2010.

Huan Liu et al. ASU feature selection data repository. Arizona State University {http://featureselection.asu.edu/}, 2011.

CW-Team. Causality workbench repository {http://www.causality.inf.ethz.ch/repository.php}. 2011.
