
Unsupervised Feature Selection with Controlled Redundancy (UFeSCoR)

Monami Banerjee and Nikhil R. Pal, Fellow, IEEE

Abstract—Features selected by a supervised/unsupervised technique often include redundant or correlated features. While use of correlated features may result in an increase in the design and decision making cost, removing redundancy completely can make the system vulnerable to measurement errors. Most feature selection schemes do not account for redundancy at all, while a few supervised methods try to discard correlated features. We propose a novel unsupervised feature selection scheme (UFeSCoR), which not only discards irrelevant features, but also selects features with controlled redundancy. Here, the number of selected features can also be directed. Our algorithm optimizes an objective function, which tries to select a specified number of features, with a controlled level of redundancy, such that the topology of the original data set can be maintained in the reduced dimension. Here, we have used Sammon's error as a measure of preservation of topology. We demonstrate the effectiveness of the algorithm in terms of choosing relevant features, controlling redundancy, and selecting a given number of features using several data sets. We make a comparative study with five unsupervised feature selection methods. Our results reveal that the proposed method can select useful features with controlled redundancy.

Index Terms—Unsupervised feature selection, dimensionality reduction, redundancy control, Sammon’s error, gradient descent technique


1 INTRODUCTION

An important problem related to mining large data sets is to project the data into a lower dimensional space, i.e., Dimensionality Reduction. Among the features present in a data set, there might be some derogatory, indifferent, redundant or correlated/dependent features, besides relevant ones. We can project the data by either selecting the relevant features (feature selection) or extracting new features as combinations of the relevant ones (feature extraction). Apart from resulting in a computationally efficient predictor and providing a better understanding of the underlying structure of the data, dimensionality reduction can also improve the performance of the predictor by discarding derogatory and indifferent features.

Many applications such as gene expression analysis involve a large number of features and a comparatively smaller number of samples [1], and these lead to the problem of the "curse of dimensionality". Therefore, selecting a small number of discriminative genes from thousands of genes becomes essential for designing successful diagnostic classification systems [2], [3], [4], [5], [6], [7].

Feature selection schemes can be broadly classified into two groups, Wrapper Models and Filter Models [8]. Wrapper methods have a well-specified objective function, which should be optimized through the selection of a subset of features, whereas filter models use some underlying properties of the features. Filter methods are usually computationally less intensive, but wrapper methods produce a feature set which is tuned to a specific type of predictive model. Apart from these, there is another class of methods, namely Embedded Methods. Embedded methods select features as a part of the model construction process. As an example, in SVM-RFE [9] the feature with the least weight, according to the model constructed by a support vector machine (SVM), is removed iteratively. The final remaining feature set forms the reduced dimensional space. Other examples would be sparsity induced feature selection algorithms [10], [11], [12]. In Lasso (least absolute shrinkage and selection operator) [13], a linear regression model is built by forcing some coefficients of the model to zero.

Dimensionality reduction methods can also be broadly categorized into three families, Supervised, Unsupervised and Semi-Supervised. For supervised approaches a target value for each data point (e.g., the class label) is provided, while in the case of unsupervised approaches no such target values are available. Feature selection using supervised approaches is much easier because the goal is to select a small subset of features so that the problem of prediction/classification can be solved satisfactorily. For unsupervised methods, since there is no target application, some task-independent criterion should be used to select features. For small labeled datasets none of the above methods can be used efficiently. Supervised methods fail to generalize due to the small number of annotated samples, and unsupervised schemes completely ignore the label information, which may result in poor performance. Semi-supervised schemes exploit the advantages of the two schemes by using information from both labeled and unlabeled data sets. Many methods have been proposed under the supervised family [9], [14], [15], [16], [17], [18], [19], [20], [21], [22], but fewer methods have been suggested for unsupervised [10], [12], [23], [24], [25], [26], [27], [28] and semi-supervised approaches [29], [30], [31], [32], [33], [34].

The authors are with the Electronics and Communication Sciences Unit, Indian Statistical Institute, Calcutta 700 108, India. E-mail: {monamie.b, nrpal59}@gmail.com.

Manuscript received 8 Jan. 2014; revised 2 July 2015; accepted 7 July 2015. Date of publication 12 July 2015; date of current version 3 Nov. 2015. Recommended for acceptance by J. Bailey. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TKDE.2015.2455509


Many data sets, such as gene expression data, involve correlated features. Suppose, in a dataset, there are two highly correlated relevant features. Then selection of only one of the two is sufficient, though both have high relevance. Such features are redundant features. Redundant features are those useful features which are dependent on each other. Unless explicitly controlled, a feature selection technique may select both of these features. But to solve a particular problem, not all of the redundant features are necessary. Although selection of correlated features increases the computational cost of the system that uses those features, removing redundancy completely is also not desirable for two reasons. First, unless the absolute correlation is almost one, the discriminating power of two correlated features may be slightly more than the discriminating power of each feature alone. Moreover, even if the absolute correlation between a pair is nearly 1, use of just one of the two will make the system vulnerable to measurement errors. Hence some amount of redundancy is desirable, as then the decision making system using those features may be able to deal efficiently with measurement errors. Most of the present feature selection schemes do not account for redundancy at all, while a few try to exclude redundancy among the selected features [21], [24].

So the goal of unsupervised feature selection should be to select a set of "useful" features in a task-independent manner with a control on the level of redundancy in the selected features. Since there is no target task at hand, one of the most desirable objectives is to preserve the geometry of the original dimension in the lower dimension.

Here we propose a novel unsupervised feature selection scheme (UFeSCoR). Besides discarding irrelevant features, it selects a desired number of features with controlled redundancy. Here, we optimize an objective function through a gradient descent technique. This objective function is designed in such a way that, by optimizing it, the algorithm UFeSCoR tries to select the specified number of features, with a desired level of redundancy, such that the topology of the original dataset can be maintained in the reduced dimension. Sammon's error (SE) [35] has been used to measure the preservation of topology. For controlling redundancy a linear measure of redundancy, the Pearson correlation, has been considered. The use of any nonlinear measure of dependency, such as mutual information, is straightforward. UFeSCoR does not require any explicit evaluation of different feature subsets. Besides, as this approach can look at all the features at a time, any possible nonlinear subtle interaction between features can be accounted for.

To assess the quality of the selected features, we use three indices: (i) Sammon's error [35] to measure the extent of topology preservation; (ii) the Cluster Preserving Index (CPI) [27] to measure the extent of cluster structure preservation; and (iii) the Misclassification Error (MCE) for labeled data to assess how comparable the performance of the 1-NN (Nearest Neighbor) classifier is in the original and reduced dimensions. To further demonstrate the effectiveness of our method, a comparative study of UFeSCoR with five unsupervised feature selection methods, FS1 [24], a modified version of FS1 (mFS1) [28], Unsupervised Discriminative Feature Selection [12], Nonnegative Discriminative Feature Selection [10], and Clustering-Guided Sparse Structural Learning [11], is carried out.

2 RELATED WORK

For any given data set, let F be the feature set, G be some subset of F and C be the target vector (such as class labels). In general, the goal of any feature selection technique can be formalized as selecting a minimum cardinality subset G such that P(C|G) is equal or as close as possible to P(C|F), where P(C|G) and P(C|F) are the respective probability distributions of the different classes (C) given the feature values in G and F [36]. For the sake of clarity, next we divide our discussion on related work based on the philosophy used.

2.1 Unsupervised Approaches

There are several unsupervised feature selection methods [11], [28], [37], [38], [39]. Mitra et al. [37] proposed an unsupervised feature selection scheme, which partitions the feature set into a number of clusters. Then from each cluster, a representative feature is retained to form the reduced feature set. In order to measure feature similarity, they proposed a measure called the Maximal Information Compression Index. He et al. [40] proposed a Laplacian Score based feature selection scheme applicable to both supervised and unsupervised frameworks. For multi-cluster data sets, Cai et al. [41] solved the generalized eigenvalue problem involving the graph Laplacian for feature selection, while principal component analysis has been used in [42] for selection of useful features. An unsupervised feature selection scheme for text data sets has been proposed by Wiratunga et al. [43], where each document denotes a sample and words are the features. To select a set of representative words, they used a feature utility score (FUS) in the spirit of the term contribution score [44]. In [45] Hong et al. first use a cluster ensemble method to combine different clustering solutions obtained by different clustering algorithms on the data set. Then they try to select a subset of features for which the resultant clustering in the reduced dimension has the highest similarity with the ensembled clustering. On the other hand, a randomized unsupervised feature selection algorithm has been proposed for the k-means clustering problem in [26].

Saxena et al. [27] proposed a Genetic Algorithm based feature selection scheme that uses Sammon's stress/error as the fitness function. In [46] a greedy method for unsupervised feature selection has been introduced. Let G be the set of indices of the selected r features and X be the data matrix of size n \times p. The authors have used a novel feature selection criterion, F(G) = \|X - P(G)X\|_F^2, where \|\cdot\|_F is the Frobenius norm and P(G) is an n \times n projection matrix which projects the columns of X onto the span of the columns corresponding to the selected features. The goal here is to select the subset G of dimension r for which the F value is minimized.
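This criterion is easy to evaluate directly. The following minimal NumPy sketch (our illustration with hypothetical names, not the code of [46]) computes F(G) for a candidate index set G via the pseudoinverse-based projector.

```python
import numpy as np

def greedy_criterion(X, G):
    """F(G) = ||X - P(G) X||_F^2, where P(G) projects onto the span of the
    columns of X indexed by the selected feature set G."""
    XG = X[:, list(G)]                      # n x |G| matrix of selected columns
    P = XG @ np.linalg.pinv(XG)             # n x n projector onto span(XG)
    residual = X - P @ X
    return np.linalg.norm(residual, 'fro') ** 2

# toy usage: pick the single feature whose span best reconstructs X
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 6))
best = min(range(X.shape[1]), key=lambda j: greedy_criterion(X, [j]))
print("best single feature:", best)
```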

In [47] a kernel based unsupervised feature selection method is proposed, which selects a subset of features maximizing the dependence between the full data set and the data set in the reduced dimension using kernel matrices. The authors use the variance kernel. Since adding a feature that is correlated with an already selected feature will not help to maximize their objective function, this method selects features which are not correlated and thus reduces redundancy among the selected features.

2.2 Consideration of Dependency

Guyon et al. [9] proposed an SVM-based backward elimination technique (SVM-RFE) which recursively removes one feature at a time, starting with the complete feature set. The decision function of an SVM is f(x) = w^T x + b, where w = [w_1, w_2, ..., w_p]^T is the weight vector, b is a scalar and p is the number of features. They rank features based on the square of the weight (w_i) associated with each feature. The feature with the lowest rank is rejected. This process is repeated until the set contains the desired number of features. This scheme was extended by Zhou and Tuck [19] to deal with k classes, where k > 2. It is known as MultiClass SVM-RFE (MSVM-RFE).
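The elimination loop described above can be sketched as follows; this is our illustration using scikit-learn's LinearSVC as the linear SVM, which is a stand-in and not necessarily the exact solver or settings used in [9].

```python
import numpy as np
from sklearn.svm import LinearSVC

def svm_rfe(X, y, n_keep):
    """Recursive feature elimination with a linear SVM: repeatedly drop the
    surviving feature whose squared weight w_i^2 is smallest."""
    surviving = list(range(X.shape[1]))
    while len(surviving) > n_keep:
        clf = LinearSVC(C=1.0, max_iter=10000).fit(X[:, surviving], y)
        w2 = np.sum(clf.coef_ ** 2, axis=0)      # squared weight per surviving feature
        surviving.pop(int(np.argmin(w2)))        # remove the weakest feature
    return surviving
```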

In [48] the SVD-entropy of a data set X \in R^{n \times p} is defined as

E(X_{[n \times p]}) = -\frac{1}{\log(N)} \sum_{j=1}^{N} V_j \log(V_j),   (1)

where N = \min\{n, p\} and V_j is the jth singular value of X. Varshavsky et al. [24] used this SVD based data set entropy to estimate the relevance of each feature. They defined the contribution of the ith feature to the entropy, CE_i, by a leave-one-out comparison according to

CE_i = E(X_{[n \times p]}) - E(X_{[n \times \bar{p}]}),   (2)

where \bar{p} = p \setminus \{i\text{th feature}\}. Let the average and the standard deviation of all CE_i values be c and s, respectively. Then features are partitioned into three groups:

- CE_i > c + s, i.e., the ith feature has a high contribution.
- c - s < CE_i < c + s, i.e., the ith feature has an average contribution.
- CE_i < c - s, i.e., the ith feature has a low (usually negative) contribution.

In [24] only the features in the first group (i.e., the features with high contribution) are considered relevant, and the number of features in this group is taken as the optimum number of features to be selected. Let this optimum number of features be r.
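For concreteness, a small Python sketch of Eqs. (1) and (2). Note that the singular values are normalized to sum to one before taking logarithms, which is our assumption to make the entropy term well behaved; the toy data and names are ours.

```python
import numpy as np

def svd_entropy(X):
    """SVD-entropy E(X) per Eq. (1); the singular values are normalized to
    sum to one (our assumption) so that V_j log V_j acts like an entropy term."""
    s = np.linalg.svd(X, compute_uv=False)
    N = min(X.shape)
    V = s / s.sum()
    V = V[V > 0]                                   # avoid log(0)
    return -np.sum(V * np.log(V)) / np.log(N)

def ce_scores(X):
    """Leave-one-out contribution CE_i = E(X) - E(X without feature i), Eq. (2)."""
    full = svd_entropy(X)
    return np.array([full - svd_entropy(np.delete(X, i, axis=1))
                     for i in range(X.shape[1])])

X = np.random.default_rng(1).standard_normal((40, 8))
ce = ce_scores(X)
c, s = ce.mean(), ce.std()
high = np.where(ce > c + s)[0]                     # "high contribution" group
print("relevant features:", high, " suggested r =", len(high))
```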

The selection of features based on CE values can be done in three different ways.

Simple Ranking (SR) Method: Select the top r features according to the highest CE values.

Forward Selection (FS) Method: Two implementations have been considered as follows:

FS1: Choose the first feature according to the highest CE value. Choose among all the remaining features the one which, together with the first feature, produces a two-feature set with the highest entropy. Iterate this process until r features are selected.

FS2: Choose the first feature as before. Recalculate the CE values of the remaining set of size p - 1 and select the second feature according to the highest CE value. Continue in the same way until r features are selected.

Backward Elimination (BE) Method: Eliminate the feature with the lowest CE value. Recalculate the CE values and iteratively eliminate the lowest one until r features remain.
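As a concrete reading of the BE procedure, the sketch below reuses the ce_scores helper from the earlier block; the per-round recomputation follows the description above, while the tie-breaking is ours.

```python
import numpy as np

def backward_elimination(X, r):
    """Iteratively drop the feature with the lowest CE value until r remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > r:
        ce = ce_scores(X[:, remaining])          # CE values recomputed each round
        remaining.pop(int(np.argmin(ce)))        # eliminate the lowest-CE feature
    return remaining
```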

A modified definition of the contribution to entropy (mCE_i) is introduced by Banerjee and Pal [28]. In [28], the contribution of the ith feature to the entropy, mCE_i, is defined as

mCE_i = E(X_{[n \times \bar{p}]}) - E(X_{[n \times p]}).   (3)

Using this revised definition mCE_i, the modified versions of SR, FS1, FS2 and BE have been shown to yield better features than their counterparts in [24]. Similar to FS1 [24], its modified version, mFS1 [28], can also reduce redundancy.

2.3 Sparsity Based Approaches

In the recent past many methods have been proposed based on a sparsity requirement [10], [11], [12]. In [10] Li et al. proposed an unsupervised \ell_{2,1}-norm based feature selection scheme. Their method simultaneously learns the cluster indicator matrix and the feature selection matrix. The cluster indicator matrix F assigns each data point to one of the clusters, and the feature selection matrix W is the same as the one introduced in [12]. The objective function is chosen in such a way that the spectral clustering criterion [49] on F is minimized and X^T W is close to F. The \ell_{2,1}-norm of W also needs to be minimized to ensure sparsity. They simultaneously optimize for both F and W using an efficient iterative algorithm.

Li et al. have extended their works UDFS [12] and NDFS [10] into a generalized framework called Clustering-Guided Sparse Structural Learning in [11]. Besides the three criteria in the objective function of NDFS discussed above, they added a quadratic regularization term. Then they minimize \|W - QP\|_F^2 with an orthogonality constraint Q^T Q = I. Like the previous cases, here also the authors developed an optimization scheme to simultaneously optimize with respect to all four unknowns, F, W, P and Q. Recently Xingzhong et al. [50] proposed a multiple graph based unsupervised feature selection scheme to exploit information from both local and global graphs. To control the sparsity of the feature selection matrix, the authors use the \ell_{2,p} norm, where 0 < p < 2.

To evaluate the effectiveness of our proposed algorithm, we have compared its performance with that of FS1 [24] and mFS1 [28] in Section 6.1, as both of these methods can reduce redundancy. In Section 6.2, the performance of UFeSCoR without any redundancy control has also been compared with the \ell_{2,1}-norm Regularized Discriminative Feature Selection [12], the Nonnegative Discriminative Feature Selection [10] and the Clustering-Guided Sparse Structural Learning [11].

3 UNSUPERVISED FEATURE SELECTION TECHNIQUE WITH CONTROLLED REDUNDANCY (UFESCOR)

The proposed algorithm, UFeSCoR, aims to select a subset of features so that the underlying "properties" of the dataset can be preserved, as much as possible, in the reduced dimension. In this algorithm, initially no feature is selected. Conceptually, there is a gate associated with each feature, which is initially closed. The openings of all gates are learned by a gradient descent technique, while optimizing an objective function. The objective function is so designed that its optimization reflects the three primary intents of the algorithm: (i) selection of relevant features, which can maintain the cluster structure or topology of the data; (ii) control of redundancy among the selected features; and (iii) selection of a specified number of features. The construction of UFeSCoR comprises two stages: formulation of the objective function (Section 3.1) and learning of the feature modulators (Section 3.2).

3.1 Formulation of the Objective Function

In the absence of a target application, any criterion that can preserve the "inherent structure" (say, the neighborhood relation) of the original data in a lower dimension would be a good objective function. Sammon's error [35] is one such criterion, which was formulated for feature extraction. We shall use it for feature selection.

Sammon's error [35] measures the extent of preservation of interpoint distances in the reduced dataset. The lower the value, the better the preservation. Let X \in R^{n \times p} be the dataset, i.e., X = \{x_1, x_2, ..., x_n\}, where for all i = 1, ..., n, x_i = (x_{i1}, x_{i2}, ..., x_{ip})^T, and let the Euclidean distance between x_i and x_j be d^*_{ij}. Then,

d^*_{ij}(x_i, x_j) = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2}.   (4)

Let the dataset in the reduced dimension be Y = \{y_1, y_2, ..., y_n\} and d_{ij} be the Euclidean distance between y_i and y_j. Then Sammon's error [35] can be formulated as follows:

SE = \frac{1}{\sum_{i<j} d^*_{ij}} \sum_{i<j} \frac{(d^*_{ij} - d_{ij})^2}{d^*_{ij}}.   (5)
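A minimal NumPy/SciPy sketch of Eq. (5) follows (our illustration; zero interpoint distances are skipped, in line with the note on identical samples later in this subsection).

```python
import numpy as np
from scipy.spatial.distance import pdist

def sammon_error(X, Y):
    """Sammon's error (Eq. (5)) between the original data X (n x p) and its
    lower-dimensional representation Y (n x q), using condensed pairwise
    Euclidean distances over all i < j."""
    d_star = pdist(X)                         # d*_ij in the original space
    d = pdist(Y)                              # d_ij in the reduced space
    mask = d_star > 0                         # identical samples contribute 0
    return np.sum((d_star[mask] - d[mask]) ** 2 / d_star[mask]) / d_star.sum()
```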

In Eq. (5) the first term, \sum_{i<j} d^*_{ij}, does not affect the optimization process because, for a given data set, it is constant. However, we show it as we are using exactly the same objective function as in Sammon's original work [35]. The term d^*_{ij} in the denominator of the second term of (5) reduces the effect of outliers on the objective function. Sammon's error tries to maintain the inter-point distances of the original dimension in the lower dimensional representation of the data and hence it can preserve the "structures" of the original data in its lower dimensional representation. This is an important advantage of SE over other projection methods. For example, suppose in a three dimensional data set we have a sphere and a surrounding shell. For this data set, no linear projection would be able to separate the shell from the sphere, while Sammon's projection would be able to do so. The other important advantage of SE is that it reduces the impact of outliers on the objective function. However, because of the use of the Euclidean distance, SE may be significantly affected if different features are scaled differently, and this may influence the list of selected features. Note that, for identical data samples (when d^*_{ij} = 0), the contribution to SE is taken as 0.

In order to select a set of features, we associate a weight (modulator) w_i, i = 1, ..., p, with the ith feature, where w_i \in [0, 1]. Ideally we want w_i \in \{0, 1\}, where w_i = 1 indicates that the ith feature is selected and w_i = 0 implies that it is not. We want to learn the w_i's by gradient descent. Since gradient descent cannot guarantee that w_i lies in [0, 1], we model w_i as w_i = e^{-\beta_i^2}, where \beta_i is unrestricted. In fact, in place of e^{-\beta_i^2}, we can use the sigmoidal function or any other monotonic differentiable function with range [0, 1]. This is conceptually equivalent to associating each feature with a feature modulating gate, where the extent of gate opening is controlled by \beta_i. Thus for feature selection we define y_j = (w_1 x_{j1}, w_2 x_{j2}, ..., w_p x_{jp})^T = (e^{-\beta_1^2} x_{j1}, e^{-\beta_2^2} x_{j2}, ..., e^{-\beta_p^2} x_{jp})^T. Hence, the Euclidean distance between y_i and y_j, i.e., d_{ij}, is

d_{ij}(y_i, y_j) = \sqrt{\sum_{k=1}^{p} e^{-2\beta_k^2} (x_{ik} - x_{jk})^2}.   (6)

We note here that there are other ways of realizing the constraints on w_i [51]. If we minimize Sammon's error alone, the global optimum is obtained when all features are selected. To control the number of selected features, we add a penalty term that prevents the selection of too many features. Suppose we want to select r features; then we can add the following penalty:

PF_1 = \frac{1}{(p - r)^2} \left( \sum_{i=1}^{p} e^{-\beta_i^2} - r \right)^2.   (7)

The term PF_1 tries to make the total gate opening over all features equal to the number of features to be selected. To avoid having too many partially open gates, a second penalty factor, PF_2, is included (Eq. (8)). PF_2 tries to push the value of e^{-\beta_i^2} close to either 0 or 1, for all i = 1, ..., p:

PF_2 = \frac{4}{p} \sum_{i=1}^{p} e^{-\beta_i^2} \left( 1 - e^{-\beta_i^2} \right).   (8)

Now, to control redundancy, we add a third penalty factor which can avoid the selection of dependent features. For example, if the ith feature is selected and it is highly correlated with the jth feature, then we do not need to select both of them. If the correlation coefficient between the ith and the jth features is \rho_{ij}, then the penalty factor assessing the redundancy among the selected features, PF_3, is taken as

PF_3 = \frac{1}{p(p - 1)} \sum_{i=1}^{p} e^{-\beta_i^2} \sum_{j \neq i} \rho_{ij}^2 \, e^{-\beta_j^2}.   (9)

Here, we have considered the Pearson correlation coefficient to control linear dependency among features. Nonlinear dependency can also be regulated, simply by taking \rho_{ij} as the mutual information or any other nonlinear measure of dependency between feature i and feature j.


Now we want to form an objective function which depends on Sammon's error to preserve the topology of the dataset, on the correlation between selected features to control redundancy, and on the approximate number of features to be selected. A natural choice of the objective function is

E = SE + B \times PF_1 + C \times PF_2 + D \times PF_3.   (10)

In Eq. (10), all three penalty terms lie in [0, 1] and B, C and D are non-negative weights. The value of Sammon's error is minimum when all features are selected. But due to the constraints on the number of selected features and on the redundancy among them, optimization of Eq. (10) results in the selection of a subset of features depending on the parameters B, C and D.
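A compact sketch of the objective in Eq. (10), combining Eqs. (5)-(9), is given below (our illustration with our variable names; the Pearson correlations \rho_{ij} are computed once from X).

```python
import numpy as np
from scipy.spatial.distance import pdist

def objective(beta, X, r, B, C, D):
    """E(beta) = SE + B*PF1 + C*PF2 + D*PF3 (Eqs. (5)-(10)) for gate
    parameters beta (length p). Pairwise terms run over all i < j."""
    n, p = X.shape
    w = np.exp(-beta ** 2)                          # gate openings w_k
    d_star = pdist(X)                               # d*_ij, original space
    d = pdist(X * w)                                # d_ij with modulated features, Eq. (6)
    mask = d_star > 0
    se = np.sum((d_star[mask] - d[mask]) ** 2 / d_star[mask]) / d_star.sum()
    pf1 = (w.sum() - r) ** 2 / (p - r) ** 2
    pf2 = 4.0 / p * np.sum(w * (1.0 - w))
    rho = np.corrcoef(X, rowvar=False)              # Pearson correlations rho_ij
    pf3 = (w @ (rho ** 2 - np.eye(p)) @ w) / (p * (p - 1))
    return se + B * pf1 + C * pf2 + D * pf3
```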

3.2 Learning of Feature Modulators

To begin the learning, we need to initialize the modulators. A question arises: can we initialize the \beta's keeping all gates open, or can we use arbitrary random initialization of the \beta's? The answer is no to both. If we initialize with all gates open, then Sammon's error is zero. But because of the other terms in the objective function, the system will close some gates. However, starting with all gates open may cause a serious problem if there is a feature with a constant, or almost constant, value for all samples (we call such features indifferent features). Such a feature does not contribute anything to Sammon's error. Therefore, this feature will remain selected from the very beginning of learning. Now if we set B to a high value, the algorithm will select r features, but one of the selected features will be a useless (indifferent) feature. Such features do exist in real life data sets, e.g., Ionosphere [52]. More importantly, our feature selection should primarily be guided by Sammon's error. Hence, if we keep all gates open at the beginning, Sammon's error attains its global minimum; although the other factors will close some of the gates, the closing of those gates will mainly be controlled by the other factors, which is not desirable at all here. Now consider the situation when we use random initialization of \beta in [-3, +3], i.e., initializing the gate openings with random values in [0, 1]. This will lead to some local minimum depending on the initial \beta values, but this minimum usually will not correspond to one of our desired solutions. To demonstrate this we have conducted an experiment on the Sonar data with two kinds of initialization: (i) initializing the \beta values using our recommended strategy, i.e., making the gates almost closed; (ii) randomly initializing the \beta values in [-3, +3], i.e., setting the initial openings of the gates to random values in [0, 1]. We have used r = 9, B = 40, C = 1, D = 2. For each of the two cases, we repeated the experiment 10 times. The average value of E in (10) and the average Sammon's error, when the gates were almost closed initially, are 0.3232 and 0.2808, respectively, and the same for the random initialization are 0.4397 and 0.4245, respectively. Thus, for the Sonar data, with our recommended initialization, on average E is improved by more than 26 percent and Sammon's error is improved by more than 33 percent. We have conducted similar experiments with a few other data sets and obtained similar results. This clearly reveals the advantage of the proposed initialization strategy.

Thus, for our problem, we start the learning assuming all features to be unimportant, i.e., with all gates almost closed. If we make all gates completely closed, the learning may not proceed. This almost closed initial state of the gates helps us to discard indifferent features. Hence the initial values of the \beta's should be such that e^{-\beta^2} is nearly zero. If we use a fixed value of \beta, the algorithm will work, but we keep a small randomness just to deal with cases having strongly correlated features. In an extreme case, suppose we have two copies of the same feature. Both are important and only one is needed. If the initial gate opening is the same for both features, we shall not be able to discard one of them. Hence, we need some randomness in the initial values of the \beta's and at the same time we want e^{-\beta^2} near zero. With \beta = 3.0, e^{-\beta^2} = 1.2341e-4, which is small enough to initialize the feature modulators. In order to maintain some randomness among the initial \beta values, the \beta's are randomly initialized in [2.995, 3.005]. With this choice, the initial values of e^{-\beta_i^2} lie between 1.27e-4 and 1.20e-4. The selection of feature i, which can be viewed as the opening of the gate associated with the ith feature, is determined by the function e^{-\beta_i^2}. So, initially no feature is selected, as e^{-\beta_i^2} \approx 0 for all i = 1, ..., p. Learning of the values of \beta is done following a gradient descent technique, using the following equation:

\beta_k^{new} = \beta_k^{old} - \eta \, \frac{\partial E / \partial \beta_k^{old}}{\sqrt{\sum_{i=1}^{p} \left( \partial E / \partial \beta_i^{old} \right)^2}}, \quad \forall k = 1, ..., p.   (11)

In Eq. (11), \beta_k^{new} and \beta_k^{old} are the new and old \beta values of the kth feature, respectively, and \eta > 0 is the learning rate. The partial derivative of E with respect to \beta_k, \partial E / \partial \beta_k, is given as follows:

\frac{\partial E}{\partial \beta_k} = \frac{\partial SE}{\partial \beta_k} + B \frac{\partial PF_1}{\partial \beta_k} + C \frac{\partial PF_2}{\partial \beta_k} + D \frac{\partial PF_3}{\partial \beta_k},   (12)

where

\frac{\partial SE}{\partial \beta_k} = \frac{4 \beta_k e^{-2\beta_k^2}}{\sum_{i=1}^{n} \sum_{j>i} d^*_{ij}} \sum_{i=1}^{n} \sum_{j>i} \frac{(d^*_{ij} - d_{ij})(x_{ik} - x_{jk})^2}{d^*_{ij} \, d_{ij}},   (13)

\frac{\partial PF_1}{\partial \beta_k} = -\frac{4 \beta_k e^{-\beta_k^2}}{(p - r)^2} \left( \sum_{i=1}^{p} e^{-\beta_i^2} - r \right),   (14)

\frac{\partial PF_2}{\partial \beta_k} = -\frac{8 \beta_k e^{-\beta_k^2}}{p} \left( 1 - 2 e^{-\beta_k^2} \right),   (15)

\frac{\partial PF_3}{\partial \beta_k} = -\frac{4 \beta_k e^{-\beta_k^2}}{p(p - 1)} \sum_{\substack{i=1 \\ i \neq k}}^{p} e^{-\beta_i^2} \rho_{ik}^2.   (16)

Features with a stronger influence on the objective function have steeper gradients and their associated gates open faster. Features are selected iteratively by adjustment of the \beta values, and the iterations stop when there is no significant change in any \beta value over two consecutive iterations, i.e., the iterations stop if \|\beta^{old} - \beta^{new}\| < \epsilon, where \epsilon > 0 is a predetermined threshold. After the iterations stop, a feature is selected if the gate associated with the feature is open more than a predefined value, say \delta, i.e., if e^{-\beta_k^2} > \delta.

In this study, we have experimented with \delta values of 0.3 and 0.8. Because of the second penalty factor, PF_2, it has been noted that usually the gate associated with a feature is either open more than 99 percent or less than 1 percent. This happens possibly because of the penalty PF_2, which tries to force the gate opening values either close to 1 or close to 0. However, there is no guarantee that this will happen for other choices of gate function, parameters and data sets. In our present investigation, any \delta value between 0.05 and 0.95 can be chosen. A schematic description of the algorithm is given in Fig. 1. The computational complexity of the algorithm per iteration is O(n^2 p).
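To make the learning procedure concrete, the sketch below wires the pieces together around the normalized update of Eq. (11). It is our illustration, not the authors' implementation: the analytic derivatives of Eqs. (13)-(16) are replaced by central finite differences for brevity, and the stopping proxy is our choice.

```python
import numpy as np

def ufescor_fit(obj, p, eta=0.5, eps=1e-4, delta=0.3, max_iter=1000, seed=0):
    """Learning loop built around the normalized update of Eq. (11).
    `obj` is a callable E(beta), e.g. the objective() sketch above;
    dE/dbeta is approximated numerically (our simplification of Eqs. (13)-(16))."""
    rng = np.random.default_rng(seed)
    beta = rng.uniform(2.995, 3.005, size=p)      # gates almost closed initially
    h = 1e-5
    for _ in range(max_iter):
        grad = np.empty(p)
        for k in range(p):                         # central-difference dE/dbeta_k
            e = np.zeros(p); e[k] = h
            grad[k] = (obj(beta + e) - obj(beta - e)) / (2 * h)
        gnorm = np.linalg.norm(grad)
        # The paper stops when no beta changes significantly; with the
        # normalized step below, a small raw gradient is a practical proxy.
        if gnorm < eps:
            break
        beta = beta - eta * grad / gnorm           # Eq. (11)
    gates = np.exp(-beta ** 2)
    return np.where(gates > delta)[0], gates       # selected indices, gate openings
```

With the objective() sketch given earlier, obj = lambda b: objective(b, X, r, B, C, D) reproduces the objective of Eq. (10).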

4 ASSESSMENT OF THE SELECTED FEATURES

There are several indices to measure the goodness of the selected features. An important quality attribute of the set of selected features is its ability to preserve the topology of the original data; for this, Sammon's error can be a useful index. The selected features should also preserve the cluster structure, if any, of the original data. To assess this, several cluster preserving indices, such as the Adjusted Rand Index [53] and Normalized Mutual Information [54], can be used. In this investigation we have used the Cluster Preserving Index proposed in [27]. When the data set is labeled, irrespective of whether the selection method is supervised or unsupervised, we can use 1-NN (Nearest Neighbor) or any other classifier to evaluate the selected features. However, if the features selected using an unsupervised method cannot do a good job of classification, it does not necessarily mean that the feature selection method is a poor one, because the class structure may be quite different from the cluster structure. If the cluster structure or the neighborhood relation between data points is preserved in the reduced space, then the performance of 1-NN or any other distance based classifier using the selected features should be comparable with that using all features. Here, in this work, the 1-NN classifier with 10-fold cross validation has been used. Next we discuss all three indices used here.

4.1 Sammon’s Error (SE)

We have already explained Sammon's error in Section 3.1. If d^*_{ij} and d_{ij} are the Euclidean distances between the ith and the jth data samples in the original and reduced dimensions, respectively, then SE can be computed using Eq. (5) and can be used as a measure of structure preservation.

4.2 Cluster Preserving Index

To measure the preservation of cluster structure in the reduced dimension, we use the Cluster Preserving Index discussed in [27]. If the selected features are good in the sense of preserving cluster structure, then the cluster structures in the original and reduced feature spaces should be similar. To cluster the data in the original dimension, we apply the Fuzzy C-Means (FCM) algorithm with the number of clusters equal to k. The fuzzy partition matrix for the original data set is initialized randomly. Let the final fuzzy partition matrix be U. In the reduced dimension, we initialize the fuzzy partition matrix with U and then apply FCM to get the final partition in the reduced dimension. This is likely to reduce the effect of initialization on the final clustering results. As a result, the final partition in the reduced dimension is likely to be similar to that in the original dimension if the cluster structure remains more or less the same in the reduced dimension. After this, we create a confusion matrix C, where the ijth entry of C denotes the number of data points in the ith cluster in the original dimension which are placed in the jth cluster in the reduced dimension. Then the clusters are re-labeled and the confusion matrix is realigned using the realignment algorithm described in Supplementary Fig. S1. All supplementary materials can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2015.2455509. The sum (T) of the off-diagonal entries of the realigned confusion matrix is the number of points misplaced between the two partitions. The percentage of points misplaced between the two partitions is M = 100 T/n. The CPI is defined as (100 - M). The algorithm for computing the Cluster Preserving Index is given in Supplementary Fig. S2, available online. Since all our data sets are labeled, we use k = number of classes, but this may not necessarily be the best choice.
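The index can be sketched as follows for two hard cluster labelings (one obtained in the original space, one in the reduced space). The paper uses FCM with a warm start and the realignment algorithm of Supplementary Fig. S1; this illustration aligns clusters with the Hungarian method, which is our substitution.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cluster_preserving_index(labels_orig, labels_red, k):
    """CPI = 100 - M, where M is the percentage of points that land in a
    different (optimally realigned) cluster in the reduced dimension."""
    n = len(labels_orig)
    conf = np.zeros((k, k), dtype=int)             # confusion matrix C
    for a, b in zip(labels_orig, labels_red):
        conf[a, b] += 1
    rows, cols = linear_sum_assignment(-conf)      # realign to maximize agreement
    T = n - conf[rows, cols].sum()                 # misplaced points
    M = 100.0 * T / n
    return 100.0 - M
```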

4.3 Misclassification Error

For a labeled data set, the goodness of the selected features can also be measured by calculating the Misclassification Error in the reduced dimension. As mentioned earlier, here the 1-NN classifier with 10-fold cross validation is used to compute the MCE.

Fig. 1. Algorithm UFeSCoR.

1-NN Classifier: Let the training data set be denoted by X_{TR} and the testing data set by X_{TS}. A data sample v \in X_{TS} is assigned to the same class as its closest neighbour u \in X_{TR}, i.e., the u for which d(u, v) \le d(w, v) for all w \in X_{TR}, where d(u, v) is a dissimilarity measure between u and v.

Ten-fold Cross Validation: The data set X is randomly partitioned into 10 nearly equal subsets S_1, S_2, ..., S_{10}, where \cup_i S_i = X, S_i \ne \emptyset for all i, and S_i \cap S_j = \emptyset for all i \ne j. The union of nine subsets is used for training and the remaining one for testing. In this way each of the 10 subsets is used for testing. The entire process is repeated 10 times and the average misclassification over these 10 iterations is reported.
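For instance, the MCE of a candidate feature subset could be estimated as in the sketch below (our illustration using scikit-learn; the exact fold generation of the original experiments is not specified here).

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

def misclassification_error(X, y, selected, repeats=10, seed=0):
    """Average 1-NN error over `repeats` runs of 10-fold cross validation,
    using only the columns listed in `selected`."""
    Xs = X[:, selected]
    errs = []
    for rep in range(repeats):
        skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed + rep)
        for tr, ts in skf.split(Xs, y):
            clf = KNeighborsClassifier(n_neighbors=1).fit(Xs[tr], y[tr])
            errs.append(1.0 - clf.score(Xs[ts], y[ts]))
    return float(np.mean(errs))
```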

5 EXPERIMENTAL RESULTS

We provide here experimental results on 17 data sets, including 15 real data sets, one augmented version of the Iris data and one synthetic data set. Of these 17 data sets, 12 have dimensions between 4 and 60, while the remaining five have dimensions between 2,000 and 12,582. A brief description of the data sets is given in Supplementary Table S1, available online.

The Synthetic data set has 300 points in a seven dimensional space. These points are distributed in three spherical clusters. The clusters are first formed in 3D and then four additional features are added to each point as follows. The centers of the three clusters in 3D are (0, 0, 0), (0, 15, 0) and (0, 0, 15). These three dimensions correspond to the first, second and third features. Note that feature 1 does not add any discriminating power; it is an indifferent feature. Now we add two features, the fourth and fifth, where the fourth is highly correlated with the second, while the fifth is highly correlated with the third (we have generated these two features by adding small random noise to features 2 and 3, respectively). Two random features, with values ranging from -5 to 5, are added as the sixth and seventh features. Hence, (second or fourth) and (third or fifth) are the important features. The first feature is an indifferent one, and the sixth and seventh features are derogatory.

We have also added two additional features (fifth and sixth) to the Iris data set to form the (Augmented) Iris data. The fifth feature is generated as 2 x (3rd feature) + N(0, 0.5). Similarly, the sixth feature is generated as 2 x (4th feature) + N(0, 0.5). Thus, the fifth and sixth features have higher variances than their correlated counterparts, the third and fourth features. Detailed descriptions of the real data sets can be found in [52], [55] and [56].
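The two constructed data sets can be reproduced along these lines (a sketch; the cluster spread, the noise scale of features 4 and 5, and the random seed are our assumptions, since the text specifies only the centers and the noise added to the augmented Iris features).

```python
import numpy as np
from sklearn.datasets import load_iris

rng = np.random.default_rng(42)

# Synthetic data: three spherical clusters in 3D plus four extra features.
centers = np.array([[0, 0, 0], [0, 15, 0], [0, 0, 15]], dtype=float)
base = np.vstack([c + rng.normal(0, 1.0, size=(100, 3)) for c in centers])
f4 = base[:, 1] + rng.normal(0, 0.5, 300)          # highly correlated with feature 2
f5 = base[:, 2] + rng.normal(0, 0.5, 300)          # highly correlated with feature 3
f67 = rng.uniform(-5, 5, size=(300, 2))            # two derogatory random features
synthetic = np.column_stack([base, f4, f5, f67])   # 300 x 7

# (Augmented) Iris: two noisy, rescaled copies of features 3 and 4.
iris = load_iris().data
aug_iris = np.column_stack([
    iris,
    2 * iris[:, 2] + rng.normal(0, 0.5, len(iris)),
    2 * iris[:, 3] + rng.normal(0, 0.5, len(iris)),
])
```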

In the following Sections 5.1, 5.2 and 5.3, we discuss the effects of the weights B, C and D, respectively, on the selected features.

5.1 Effect of B on Feature Selection

The penalty factor PF_1, as given in Eq. (7), is minimized when \sum_{i=1}^{p} e^{-\beta_i^2} = r, i.e., when the total gate opening over all features equals the number of features to be selected (r). If there is no partially opened gate, i.e., e^{-\beta_i^2} \in \{0, 1\} for all i, then PF_1 in Eq. (7) can attain its minimum only when the specified number of features (r) are selected. With higher values of B, the contribution of the penalty factor PF_1 to the total error E increases, and hence it is more likely that the required number of features is selected with a non-zero C.

The penalty factor PF_2 in E tries to avoid any partially opened gate, as will be discussed in Section 5.2. Hence, in the following experimental results, while presenting the effect of B on selecting a given number of features, non-zero C values are considered.

The effect of increasing B on feature selection for the Wine data is shown in Fig. 2. For each triplet (B, C, D), the algorithm UFeSCoR is run 30 times. The table in Fig. 2 lists the average number of features selected over 30 runs for three values of B, i.e., B = 0, B = 10 and B = 20, keeping C = 1 and D = 0. Here, the number of features to be selected is r = 3. The bar graph in Fig. 2 shows the frequencies of selecting the 13 features of the Wine data for these three parameter sets.

In the Wine data the 13th feature (which has very high values compared to the other features) can control the preservation of the distance topology of the data set to a great extent. Hence, by selecting only the 13th feature, Sammon's error can be made quite low. At B = 0, as there is no pressure to select the required number (r) of features, UFeSCoR selected only the 13th feature in all of the 30 runs, as shown in Fig. 2. From the figure it is also clear that with increasing B, the average number of selected features tends to become 3, the desired number of features.

The effect of B in selecting features for the Synthetic data with C = 2 and no redundancy control is shown in Fig. 3. Let the number of features to be selected be 3.

Fig. 2. Effect of B in feature selection for Wine data with C = 1 and D = 0 when r = 3.

Fig. 3. Effect of B in feature selection for Synthetic data with C = 2 and D = 0 when r = 3.

As discussed earlier, in the Synthetic data the second and the third features are the most important ones. Features 4 and 5 are highly correlated with the second and the third features, respectively. When B = 0 and D = 0, a feature gets selected only if the reduction in Sammon's error that it achieves exceeds the increase in E caused by C x PF_2. Note that when an almost closed gate slowly opens, it passes through partially opened states, which increases the value of PF_2. To confirm that this is indeed the case, we ran the same experiment with B = 0, C = 1 and D = 0; in this case, on average 1.17 features are selected.

With a lower B value, there is a preference for those features which can reduce Sammon's error the most. But as B increases, with a non-zero C, the need to select exactly r features dominates. That is why, for B = 2, UFeSCoR never selected the indifferent feature 1 or the derogatory features 6 and 7, but for B = 10 it selected these features with high frequencies (as r = 3). From Fig. 3 we can also notice that, in the absence of redundancy control, for B = 6 features 2 and 4 have been selected simultaneously, even though they are highly correlated.

For the Colon data the effect of B on feature selection is shown in Table 1. Here the number of features to be selected is 81 and the values of the parameters C and D are 1 and 0, respectively.

5.2 Effect of C on Feature Selection

The penalty factor PF_2 in Eq. (8) attains its minimum when e^{-\beta_i^2} \in \{0, 1\} for all i, i.e., when all feature modulating gates are either totally open or totally closed. Hence, while selecting a given number of features, besides regulating the parameter B, a non-zero C value is desired.

For the Wine data, the average number of features selected with C = 0 and C = 1, for different choices of B, is given in Table 2. For C = 0, with increasing value of B the algorithm could not select r = 3 features because of multiple partially opened gates.

Consider a case with D = 0. As initially all gates are closed (PF_2 is nearly zero and plays no role), a feature will be selected only if its contribution to Sammon's error, together with the algorithm's need to select r features (B x PF_1), can exceed the effect of C x PF_2, which tries to keep the corresponding gate closed. When D = 0, with B and r unchanged, by setting the value of C we implicitly create a "threshold": only those features whose contribution to preserving the topology of the dataset exceeds this threshold will be selected. If this threshold is too high, i.e., if C is large compared to B, no feature may get selected, as happened for the Synthetic data set in Fig. 3 with B = 0 and C = 2. But if the same experiment is run with C = 1, on average 1.17 features are selected. The effect of C with D \ne 0 can be explained in a similar way.

5.3 Effect of D on Feature Selection

The penalty factor PF_3 in Eq. (9) measures the total correlation among the selected features. Because of the contribution of D x PF_3 to E, the algorithm tries to select less correlated features. With an increase in the value of D, this constraint becomes more stringent. A large D value will result in a large D x PF_3 even with small correlations among features. Hence, if the D value is sufficiently high, no feature may get selected. The effects of the parameter D on feature selection for a few data sets are discussed next.

TABLE 1
Effect of B for Colon Data with C = 1 and D = 0 when r = 81

B      Avg. No. of Features
0      3.20
50     22.63
100    63.90
150    68.83

TABLE 2
Effect of C for Wine Data with D = 0 when r = 3 (average number of features selected)

B      C = 0     C = 1
0      0.83      1.00
10     2.00      2.27
20     2.00      3.00
40     2.00      3.00

Fig. 4. Opening of gates for Iris with B = 0, C = 0, and D = 5.

Fig. 5. Error value for Iris with B = 0, C = 0, and D = 5.

Fig. 4 depicts the changes in the \beta values up to the 150th iteration for all four features of the Iris data with B = 0, C = 0 and D = 5. This figure and the other figures depicting gate opening and error as a function of iterations are generated for arbitrarily chosen runs of the algorithm. Fig. 5 depicts how these changes in \beta values are reflected in the error (E). The extent of opening of the gate corresponding to a feature can be computed as e^{-\beta^2}; hence, as |\beta| approaches zero, the gate becomes more and more open. We can see in Figs. 4 and 5 that, in the first iteration, the gates for the first, third and fourth features tend to open, resulting in a reduction of the error value. The reduction in \beta_3 is larger than that of the other two. From the next iteration, only feature three tends to get selected and, due to the high correlation of the third feature with the fourth and first features, the \beta values corresponding to these two gates increase, thus deselecting them. The correlation values between different feature pairs of the Iris data can be found in Table 3. In the next few iterations, while trying to lower the value of \beta_3, it becomes negative (\approx -1) due to overshoot. This results in an increase of the error (E) from \approx 0.1 to 0.5. After this, up to iteration 50, |\beta_3| decreases while the other \beta values remain fixed, resulting in a declining E value. After the 50th iteration, there is no significant change in the \beta values and hence E also remains nearly the same.

The effect of D can also be shown when the B and C values are non-zero. The change in the \beta values for all six features of the (Augmented) Iris data with B = 15, C = 1 and D = 15 is shown in Fig. 6. Here, r = 2. The change in error with iterations is shown in Fig. 7. In the first few iterations the first feature gets selected. After that, up to around the 60th iteration, there is no significant change in the \beta values. Hence, in Fig. 7 the error value (E) at first drops from \approx 0.9 to \approx 0.3 and then remains nearly the same until around the 60th iteration. As the number of features to be selected (r) is 2, after the first 60 iterations the gate corresponding to the sixth feature starts to open. The correlation between the first and the sixth features is high, as shown in Table 4. Hence, due to the high D value, UFeSCoR should not select both of them together. For the (Augmented) Iris data set, if only feature 1 or only feature 6 is selected, then Sammon's error is 0.84 or 0.39, respectively. From this it is evident that feature 6 can preserve the topology of the data set better than feature 1. Hence, UFeSCoR deselects the first feature in the following iterations. Our method can exploit this kind of nonlinear interaction between features, because UFeSCoR looks at all features together. After about the 170th iteration the sixth and the second features are selected, as the correlation between these two features is small. As shown in Fig. 7, this eventually decreases the error (E) to a very small value, as the penalty due to correlation, i.e., PF_3, and the penalties due to the number of selected features, i.e., PF_1 and PF_2, are reduced.

TABLE 3
Correlation Matrix for Iris Data

Features    1       2       3       4
1           1.00    -0.11   0.88    0.82
2           -0.11   1.00    -0.42   -0.36
3           0.88    -0.42   1.00    0.97
4           0.82    -0.36   0.97    1.00

TABLE 4
Correlation Matrix for (Augmented) Iris Data

Features    1       2       3       4       5       6
1           1.00    -0.11   0.88    0.82    0.88    0.82
2           -0.11   1.00    -0.42   -0.36   -0.42   -0.37
3           0.88    -0.42   1.00    0.97    1.00    0.97
4           0.82    -0.36   0.97    1.00    0.97    1.00
5           0.88    -0.42   1.00    0.97    1.00    0.97
6           0.82    -0.37   0.97    1.00    0.97    1.00

Fig. 6. Opening of gates for (Augmented) Iris with B = 15 (r = 2), C = 1, and D = 15.

Fig. 7. Error value for (Augmented) Iris with B = 15 (r = 2), C = 1, and D = 15.

Fig. 8. Opening of gates for WBC data with B = 100 (r = 4), C = 10, and D = 55.

In the case of the WBC data, the ranks of the features according to their individual contributions to SE are as follows: 6, 3, 2, 1, 8, 4, 7, 5 and 9. The plots of the opening of the gates and of the error value over the first 100 iterations for the WBC data are shown in Figs. 8 and 9, respectively. Here we have used r = 4, B = 100, C = 10 and D = 55. Initially, all gates start to open and within about 20 iterations features 6 and 2 are selected due to their high influence on SE. But around iteration 30, after features 6 and 2 have been selected, the gate of feature 3 starts to close and it gets deselected. This is due to the fact that feature 3 is highly correlated with features 2 and 6 (\rho_{2,3} = 0.91 and \rho_{6,3} = 0.71). Feature 1 is selected around iteration 32 because of its good contribution to SE as well as its low correlation with the already selected features (\rho_{1,6} = 0.59 and \rho_{1,2} = 0.64). Due to the high B value, the algorithm is forced to select four features. Since features 8, 4, 7 and 5 have high correlations with feature 2 (the minimum correlation of these four features with feature 2 is 0.71), and there is a high penalty for redundancy (D = 55), feature 9 is selected as the fourth feature around iteration 35. From Fig. 9, we can see that, after four features are selected around iteration 35, the error value becomes nearly stable; there is no significant change in the error (E) after the 35th iteration.

To see the effect of D for a fixed set of the other parameters, we propose an index called the RMS correlation, \gamma. For a given triplet (B, C, D), we make N runs of the algorithm. Let the number of features to be selected be r. Then in each run there are \binom{r}{2} feature pairs to be considered. The root mean square (RMS) correlation \gamma for a given triplet is then obtained as follows:

\gamma = \sqrt{ \frac{ \sum_{l=1}^{N} \sum_{\{i,j\} \in \binom{r}{2}} \rho_{ij}^2 }{ N \binom{r}{2} } }.   (17)

In our experiments we use N = 30, and \rho_{ij} is the Pearson correlation coefficient between a pair of features i and j.
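Given the subsets selected in the N runs, Eq. (17) can be computed as in the following sketch (our code; it assumes each run yields the list of r selected column indices).

```python
import numpy as np
from itertools import combinations

def rms_correlation(X, selected_per_run):
    """gamma of Eq. (17): RMS Pearson correlation over all feature pairs of
    the subsets selected in each of the N runs."""
    rho = np.corrcoef(X, rowvar=False)
    sq = [rho[i, j] ** 2
          for sel in selected_per_run
          for i, j in combinations(sel, 2)]
    return float(np.sqrt(np.mean(sq)))
```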

For the Colon data, the effect of D, while keeping B and C unchanged, is shown in Table 5. All the results are generated with B = 150 and C = 1. Here the number of features to be selected, i.e., r, is 81. If in any run the number of selected features is not 81, then the first 81 features with the highest e^{-\beta^2} values are taken. We can see that the RMS correlation \gamma decreases with increasing D.

A similar result for the Synthetic data with B = 15 and r = 3 is given in Table 6. The changes in \gamma with changing D are observed for five different C values, i.e., C = 0.00, C = 0.50, C = 1.00, C = 1.75 and C = 3.50. As expected, the value of \gamma decreases as the D value increases. Thus, with proper tuning of the weight D, we can control the redundancy in the selected features.

Note that a large value of D, which tries to remove redundancy to a great extent, might have a "bad effect" on feature selection. This is demonstrated by the following experiment on the Synthetic data with B = 80, C = 1, D = 80 and r = 3. For this setting, our method selected features 2, 6 and 7. Recall that feature 2 is an important feature while features 6 and 7 are derogatory ones. But this result is explainable because D and B are assigned high values. The correlations between features (2,3), (2,5) and (3,5) are -0.45, -0.46 and 0.998, respectively, while the correlations between (6,2), (7,2) and (6,7) are -0.05, 0.04 and -0.08, respectively. Consequently, the high value of D forces the selection of two poor but uncorrelated features. This is a "bad effect" of over-penalizing redundancy.

6 COMPARISONS

The four singular value decomposition (SVD) based unsupervised feature selection approaches (SR, FS1, FS2, and BE) were introduced by Varshavsky et al. in [24]. Among these four methods, FS1 can reduce redundancy among the selected features. A modified version of these algorithms has been proposed in [28]. In Section 6.1, the performance of UFeSCoR is compared with that of FS1 and the modified version of FS1 (mFS1). In Section 6.2, the performance of UFeSCoR without any redundancy control is also compared with that of the ℓ2,1-norm Regularized Discriminative Feature Selection algorithm (UDFS) [12], the Nonnegative Discriminative Feature Selection algorithm (NDFS) [10], and the Clustering-Guided Sparse Structural Learning algorithm (CGSSL) [11]. In this case, for UFeSCoR, we use D = 0, as UDFS, NDFS, and CGSSL do not explicitly control redundancy in the selected features.

The optimum number of features (r) to be selected for any data set is taken as the number of features with high

Fig. 9. Error value for WBC Data with B = 100 (r = 4), C = 10, and D = 55.

TABLE 5
Effect of D on RMS Correlation for Colon Data with B = 150 (r = 81) and C = 1

 D    RMS Correlation
 0         0.36
10         0.34
20         0.33
40         0.32
80         0.30

TABLE 6
Effect of D on RMS Correlation for Synthetic Data with B = 15 (r = 3)

      RMS Correlation
 D    C = 0.00   C = 0.50   C = 1.00   C = 1.75   C = 3.50
 0      0.47       0.40       0.40       0.42       0.47
10      0.29       0.35       0.38       0.38       0.46
20      0.21       0.29       0.35       0.35       0.36
40      0.17       0.28       0.29       0.34       0.31
80      0.14       0.16       0.21       0.24       0.27


contribution to entropy (CE), as proposed by Varshavsky et al. [24]. Let this number be r0 if the CE_i definition given in Eq. (3) of [24] is followed, and r1 if we consider the modified mCE_i definition from Eq. (2) of [28]. The values of r0 and r1 for all 17 data sets are shown in Table 7.

For comparing the performances of any two unsupervised approaches, topology preservation measured by Sammon's error (SE) [35] and cluster preservation measured by CPI [27] are used. We have also presented a comparison based on the misclassification error (MCE), computed with the features selected by the algorithms. For each triplet (B, C, D), we made 30 independent runs of UFeSCoR. If for any run the number of selected features is not the same as the given optimum number r, then the first r features with the highest e^{-b^2} values have been chosen. In each of these 30 runs, the SE, CPI, and MCE values are computed for the r selected features, and the average index values are reported.
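The ranking step used when a run does not open exactly r gates can be sketched as follows. The helper name top_r_features and the example gate parameters are hypothetical; only the ranking by e^{-b^2} comes from the text above.

import numpy as np

def top_r_features(b, r):
    """Return the indices of the r features with the highest gate values
    exp(-b^2), given the gate parameters b learned in one run."""
    gate = np.exp(-np.asarray(b, dtype=float) ** 2)
    return np.argsort(-gate)[:r]

# Example with hypothetical gate parameters for six features and r = 3:
# top_r_features([0.1, 2.0, 0.05, 1.5, 0.3, 2.5], 3) -> array([2, 0, 4])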

6.1 Comparison with FS1 and mFS1

While comparing the performance of UFeSCoR with that of FS1 and mFS1, the optimum numbers of features are taken as r0 and r1, respectively. For each data set, the parameter B has been varied from 20 to 100 with an increment of 20 and D from 0 to 10 with an increment of 2.5, while keeping the value of C at 1.

Since r0 is zero for five data sets, as shown in Table 7, we have compared UFeSCoR with FS1 for the remaining 12 data sets. Similarly, for the comparison with mFS1, the 16 data sets with non-zero r1 values have been considered. The comparative results of UFeSCoR with respect to FS1 and mFS1 in terms of SE, CPI, and MCE are summarized in Tables 8 and 9. These tables have been generated as follows.

The comparison table for each of the three metrics is a 5 × 5 matrix, with all entries initially zero. Each entry of these tables corresponds to a distinct choice of the pair (B, D), keeping C fixed. For a data set, if UFeSCoR outperforms the corresponding unsupervised method for a given triplet (B, C, D) in terms of a goodness measure, then the corresponding entry of the respective table is increased by 1. Thus, for the data sets used, any entry in Tables 8 and 9 can take an integer value up to 12 and 16, respectively. The higher the value, the better UFeSCoR has performed.
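A minimal sketch of how such a win-count table could be filled is given below. The function name, the layout of the results dictionary, and the orientation of the metric (lower is better for SE and MCE, higher for CPI) are our assumptions and not code from the paper; the D values follow Tables 8 and 9 as printed.

import numpy as np

def win_count_table(results, B_values=(20, 40, 60, 80, 100),
                    D_values=(0, 2.5, 5.0, 7.0, 10.0), lower_is_better=True):
    """`results` maps (dataset, B, D) to a pair (metric_ufescor, metric_other)
    for one goodness measure; each data set contributes at most 1 per cell."""
    table = np.zeros((len(D_values), len(B_values)), dtype=int)  # all-zero 5x5 matrix
    for (dataset, B, D), (ours, other) in results.items():
        i, j = D_values.index(D), B_values.index(B)
        wins = ours < other if lower_is_better else ours > other  # does UFeSCoR outperform?
        table[i, j] += int(wins)
    return table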

For almost all data sets, UFeSCoR has been able to preserve both the topology and the cluster structure better than FS1 and mFS1. For more than 50 percent of the data sets used, the features selected by UFeSCoR have also resulted in better classification accuracy than the features selected by those

TABLE 8
Comparative Results of UFeSCoR and FS1 for 12 Data Sets with C = 1

SE
D\B     20   40   60   80   100
0       12   11   12   11   12
2.5     12   12   12   11   12
5.0     12   11   12   11   11
7.0     12   12   12   12   11
10.0    12   12   12   12   11

CPI
D\B     20   40   60   80   100
0       12   11   12   11   12
2.5     12   12   12   11   12
5.0     12   11   12   11   11
7.0     12   12   12   12   11
10.0    12   12   12   12   11

MCE
D\B     20   40   60   80   100
0        9   10   10   10   10
2.5      9   10   10   10    8
5.0      9   10   10   10   10
7.0     10   10   10   10    9
10.0    10    9   10   10    9

TABLE 7
Optimum Number of Features to be Selected

Data Sets          No. of Features    r0    r1
Synthetic                   7          0     2
Iris                        4          0     1
(Augmented) Iris            6          0     1
WBC                         9          1     2
Glass                       9          0     1
Wine                       13          1     1
Ionosphere                 34          8     9
Sonar                      60          6     9
Vehicle                    18          1     0
Liver                       6          1     1
Thyroid                     5          0     1
SPECT Heart                22          3     4
CNS                     7,129        113   101
Colon                   2,000        124    81
Leukemia                7,129        311   240
MLL                    12,582        254   193
SRBCT                   2,308         88    81

TABLE 9
Comparative Results of UFeSCoR and mFS1 for 16 Data Sets with C = 1

SE
D\B     20   40   60   80   100
0       15   15   15   15   15
2.5     15   15   14   14   15
5.0     15   15   15   15   15
7.0     15   15   16   16   15
10.0    15   14   15   15   15

CPI
D\B     20   40   60   80   100
0       14   14   14   14   15
2.5     15   15   13   13   15
5.0     14   14   15   15   14
7.0     14   14   16   15   14
10.0    14   13   14   14   14

MCE
D\B     20   40   60   80   100
0       10   10   10   10    9
2.5     10   10    9    9   10
5.0     10   10   10   10   10
7.0     10   10   11   11   10
10.0    10   10   10   10   10


two algorithms. Moreover, in the case of UFeSCoR, the level of dependency among the selected features can be regulated with appropriate tuning of the parameter D, whereas the amount of redundancy among the features selected by FS1 and mFS1 cannot be controlled.

6.2 Comparison with UDFS, NDFS and CGSSL

Here we compare the performance of UFeSCoR with three well-known unsupervised feature selection algorithms: UDFS [12], NDFS [10], and CGSSL [11].

As these three methods do not explicitly control redundancy among the selected features, the performance of UFeSCoR without any redundancy control (D = 0) has been considered while comparing with them. We keep the value of C fixed at 1, as before, and choose the parameter B to be 40, arbitrarily.

To apply the three algorithms, UDFS, NDFS, and CGSSL, we need to identify the parameters α, β, γ, λ, and k, as they appear in [12], [10], and [11]. We set β, γ, λ, and k to 1, 10^2, 10^8, and 5, respectively, which are the default values mentioned in [12], [10], and [11]. In [12], the value of α was chosen by a "grid-search" strategy from {10^-6, 10^-4, ..., 10^6}. It has also been mentioned in [11] that the parameter α has a lesser effect on the performance of the algorithm. Hence, we chose α = 10^-6, arbitrarily. None of these algorithms specifies r, the optimum number of features to be selected. So, to compare them on a common platform, we take the optimum number of features to be r1, as given in Table 7. The comparative results of these four methods are given in Table 10. The best performance is depicted in bold face in the original table.

From Table 10, we observe that UFeSCoR has outperformed UDFS, NDFS, and CGSSL in terms of SE and CPI for almost all data sets. For more than 50 percent of the data sets, it has also resulted in better classification accuracy.

6.3 Computational Complexity Analysis

The computational costs of all the methods used for comparison are given below.

- The cost for each of FS1 [24] and mFS1 [28] is O(p · min(p, r)^3 + p · r^4), where r is r0 for FS1 and r1 for mFS1.
- The complexity of UDFS [12] is O(np^3 + n^2 + tp^3), where t is the number of iterations needed for convergence.
- The computational cost of NDFS [10] is O(p^3) per iteration.
- The complexity of CGSSL [11] is O(p^3 + np^2 + pn^2) per iteration.
- As mentioned in Section 3.2, UFeSCoR needs O(n^2 p) computations per iteration.

So, from the above analysis, we can say that for data sets with p ≫ n, such as gene expression data sets, UFeSCoR is computationally more efficient than the other methods used for comparison.

7 CONCLUSION

In this paper, a novel unsupervised feature selection scheme, UFeSCoR, has been proposed. In this method, with proper choices of the three parameters B, C, and D, not only can the amount of redundancy among the selected features be controlled, but a desired number of features can also be selected. Depending on the target application, the extent of structure/topology preservation, the desired level of redundancy in the selected features, and the number of features to be selected can all be controlled using the parameters B, C, and D.

Note that here our goal is to preserve the original topology of the data in the reduced dimension. From this point of view, our method is quite robust against irrelevant features. To demonstrate this, we have generated five well-separated Gaussian clusters in a three-dimensional space with mean vectors (0, 0, 0)^t, (-20, 0, 0)^t, (20, 0, 0)^t, (0, -20, 0)^t, and (0, 0, -20)^t, respectively. For each cluster, the covariance matrix is a randomly generated symmetric positive semidefinite matrix with a dominant diagonal and off-diagonal entries in (0, 1). Then we have added 25, 50, 75, and 100 independent uniformly distributed random features in (0, 1). In all

TABLE 10
Comparative Results of UDFS, NDFS, CGSSL, and UFeSCoR with B = 40 (r = r1), C = 1, and D = 0

                          UDFS                     NDFS                     CGSSL                   UFeSCoR
Data Sets           SE    CPI    MCE        SE    CPI    MCE        SE    CPI    MCE        SE    CPI    MCE
Synthetic          0.28  70.33  31.53      0.36  63.14  33.57      0.33  64.23  38.60      0.18  85.50  13.80
Iris               0.69  45.47  52.67      0.08  94.67   7.76      0.69  45.47  52.67      0.08  94.67   7.76
(Augmented) Iris   0.85  95.33   7.07      0.92  51.64  51.93      0.84  74.67  39.73      0.02  97.33   6.94
WBC                0.45  89.60  13.21      0.40  91.51   7.79      0.35  93.56   7.47      0.26  94.16   6.82
Glass              0.69  32.90  54.21      0.82  21.85  66.36      0.56  29.38  70.09      0.34  41.06  65.07
Wine               1.00  40.43  40.67      0.99  59.55  30.96      0.99  56.74  41.29      0.00 100.00  32.98
Ionosphere         0.25  96.58  10.51      0.23  98.58  11.91      0.25  94.87  11.88      0.21  96.34  12.61
Sonar              0.30  50.37  25.63      0.23  70.19  25.48      0.41  47.82  25.24      0.21  86.70  24.97
Liver              0.69  88.12  47.83      0.88  72.17  48.32      0.84  49.97  46.84      0.28  93.09  46.14
Thyroid            0.59  35.57  19.63      0.88  33.16  24.98      0.16  80.57  28.56      0.16  80.57  28.56
SPECT Heart        0.44  70.14  37.08      0.35  69.29  27.98      0.36  69.86  27.83      0.35  75.74  26.93
CNS                0.16  73.17  34.52      0.17  68.10  40.71      0.70  39.68  45.95      0.08  75.32  33.33
Colon              0.52  77.42  21.45      0.50  85.48  17.10      0.56  98.39  25.65      0.39  73.92  21.14
Leukemia           0.38  85.74  26.47      0.41  87.50   2.94      0.52  92.08   5.88      0.25  88.86   0.00
MLL                0.25  88.89  13.33      0.27  94.22   0.00      0.78  65.28   6.67      0.16  95.67  13.33
SRBCT              0.29  71.69  30.00      0.29  75.66   5.00      0.75  74.46  20.00      0.19  76.38  20.00


of these four cases, we found that our method could select the three important features. However, if the number of irrelevant features is increased further, say to 500, our method may fail to find the useful features. The success of the algorithm with 100 random features does indicate a good tolerance of the algorithm against irrelevant features.
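A sketch of how such data can be generated is given below. The cluster sizes, the random seed, and the particular construction of the dominant-diagonal covariance are our own assumptions; only the means, the noise ranges, and the overall design come from the experiment described above.

import numpy as np

rng = np.random.default_rng(0)

def make_robustness_data(n_per_cluster=100, n_noise=25):
    """Five well-separated 3-D Gaussian clusters plus uniform noise features."""
    means = np.array([[0, 0, 0], [-20, 0, 0], [20, 0, 0], [0, -20, 0], [0, 0, -20]])
    blocks = []
    for m in means:
        A = rng.uniform(0, 1, size=(3, 3))
        cov = (A + A.T) / 2 + 3 * np.eye(3)   # symmetric, diagonally dominant, hence PSD
        blocks.append(rng.multivariate_normal(m, cov, size=n_per_cluster))
    X = np.vstack(blocks)                     # the three informative features
    noise = rng.uniform(0, 1, size=(X.shape[0], n_noise))
    return np.hstack([X, noise])              # append the irrelevant random features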

In this work, we have used the Pearson correlation as a measure of redundancy. Non-linear dependencies among features can also be controlled by using mutual information in place of the Pearson coefficient, for example as sketched below.
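As one illustration of this alternative, the following sketch estimates the mutual information between two features with a simple histogram estimator; the estimator and the bin count are our own choices, not part of the paper.

import numpy as np

def mutual_information(x, y, bins=16):
    """Histogram estimate of I(x; y) in nats between two feature vectors,
    a possible drop-in replacement for the Pearson coefficient r_ij."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)     # marginal of x
    p_y = p_xy.sum(axis=0, keepdims=True)     # marginal of y
    nz = p_xy > 0                             # skip empty cells to avoid log(0)
    return float(np.sum(p_xy[nz] * np.log(p_xy[nz] / (p_x @ p_y)[nz])))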

We have analyzed the effect of the choices of B, C, and D, but we have not provided any guideline for choosing them for a given data set. The most important parameter appears to be D, which controls the level of redundancy, while the least important is C, because it simply controls the crispness (closeness to 0 or to 1) of the modulators. On the other hand, the choice of B is relatively easy, as a higher value of B brings the number of selected features closer to the number of features that the user wants to select. If we have a target application at hand, we may use a cross-validation mechanism to choose appropriate values of B, C, and D.

REFERENCES

[1] E. R. Dougherty, "Small sample issue for Microarray-based classification," Comparative Functional Genomics, vol. 2, pp. 28–34, 2001.
[2] C. Ding and H. Peng, "Minimum redundancy feature selection from microarray gene expression data," in Proc. Comput. Syst. Bioinformatics Conf., 2003, pp. 523–529.
[3] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander, "Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring," Science, vol. 286, pp. 531–537, Oct. 1999.
[4] N. R. Pal, K. Aguan, A. Sharma, and S. Amari, "Discovering biomarkers from gene expression data for predicting cancer subgroups using neural networks and relational fuzzy clustering," BMC Bioinformatics, vol. 8, p. 5, 2007.
[5] N. R. Pal, "A fuzzy rule based approach to identify biomarkers for diagnostic classification of cancers," in Proc. IEEE Int. Fuzzy Syst. Conf., 2007, pp. 1–6.
[6] Y.-S. Tsai, C.-T. Lin, G. C. Tseng, I.-F. Chung, and N. R. Pal, "Discovery of dominant and dormant genes from expression data using a novel generalization of SNR for multi-class problems," BMC Bioinformatics, vol. 9, p. 425, 2008.
[7] Y.-S. Tsai, K. Aguan, N. R. Pal, and I.-F. Chung, "Identification of single- and multiple-class specific signature genes from gene expression profiles by group marker index," PLoS ONE, vol. 6, p. e24259, 2011.
[8] R. Kohavi and G. H. John, "Wrappers for feature subset selection," Artif. Intell., vol. 97, pp. 273–324, 1997.
[9] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Mach. Learn., vol. 46, pp. 389–422, Mar. 2002.
[10] Z. Li, Y. Yang, J. Liu, X. Zhou, and H. Lu, "Unsupervised feature selection using nonnegative spectral analysis," in Proc. AAAI Conf. Artif. Intell., 2012, pp. 1026–1032.
[11] Z. Li, J. Liu, Y. Yang, X. Zhou, and H. Lu, "Clustering-guided sparse structural learning for unsupervised feature selection," IEEE Trans. Knowl. Data Eng., vol. 26, no. 9, pp. 2138–2150, Sep. 2013.
[12] Y. Yang, H. T. Shen, Z. Ma, Z. Huang, and X. Zhou, "ℓ2,1-norm regularized discriminative feature selection for unsupervised learning," in Proc. 22nd Int. Joint Conf. Artif. Intell., 2011, pp. 1589–1594.
[13] R. Tibshirani, "Regression shrinkage and selection via the lasso," J. Roy. Statist. Soc., Series B, vol. 58, pp. 267–288, 1994.
[14] N. R. Pal and K. Chintalapaudi, "A connectionist system for feature selection," Neural, Parallel Sci. Comput., vol. 5, pp. 359–382, 1997.
[15] H. Lee, C. Chen, J. Chen, and Y. Jou, "An efficient fuzzy classifier with feature selection based on fuzzy entropy," IEEE Trans. Syst., Man Cybern.-Part B: Cybern., vol. 31, no. 3, pp. 426–432, Jun. 2001.
[16] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," J. Mach. Learn. Res., vol. 3, pp. 1157–1182, 2003.
[17] G. Qu, S. Hariri, and M. Yousif, "A new dependency and correlation analysis for features," IEEE Trans. Knowl. Data Eng., vol. 17, no. 9, pp. 1199–1207, Sep. 2005.
[18] L. Song, A. Smola, A. Gretton, K. M. Borgwardt, and J. Bedo, "Supervised feature selection via dependence estimation," in Proc. 24th Int. Conf. Mach. Learn., 2007, pp. 823–830.

[19] X. Zhou and D. P. Tuck, "MSVM-RFE: Extensions of SVM-RFE for multiclass gene selection on DNA microarray data," Bioinformatics, vol. 23, no. 9, pp. 1106–1114, 2007.
[20] J. M. Sotoca and F. Pla, "Supervised feature selection by clustering using conditional mutual information-based distances," Pattern Recog., vol. 43, no. 6, pp. 2068–2071, 2010.
[21] L. Zhou, L. Wang, and C. Shen, "Feature selection with redundancy-constrained class separability," IEEE Trans. Neural Netw., vol. 21, no. 5, pp. 853–858, May 2010.
[22] Q. Song, J. Ni, and G. Wang, "A fast clustering-based feature subset selection algorithm for high-dimensional data," IEEE Trans. Knowl. Data Eng., vol. 25, no. 1, pp. 1–14, Jan. 2013.
[23] M. Dash, H. Liu, and J. Yao, "Dimensionality reduction for unsupervised data," in Proc. 19th IEEE Int. Conf. Tools with Artif. Intell., 1997, pp. 532–539.
[24] R. Varshavsky, A. Gottlieb, M. Linial, and D. Horn, "Novel unsupervised feature filtering of biological data," Bioinformatics, vol. 22, no. 14, pp. e507–e513, 2006.
[25] R. Liu, N. Yang, X. Ding, and L. Ma, "Feature selection algorithm: Laplacian score combined with distance-based entropy measure," in Proc. 3rd Int. Symp. Intell. Inf. Technol. Appl., 2009, pp. 65–68.
[26] C. Boutsidis, M. W. Mahoney, and P. Drineas, "Unsupervised feature selection for the k-means clustering problem," in Proc. 23rd Annu. Conf. Neural Inf. Process. Syst., 2009, pp. 153–161.
[27] A. Saxena, N. R. Pal, and M. Vora, "Evolutionary methods for unsupervised feature selection using Sammon's stress function," Fuzzy Inf. Eng., vol. 2, no. 3, pp. 229–247, 2010.
[28] M. Banerjee and N. R. Pal, "Feature selection with SVD entropy: Some modification and extension," Inf. Sci., vol. 264, pp. 118–134, 2014.
[29] J. Handl and J. D. Knowles, "Semi-supervised feature selection via multiobjective optimization," in Proc. Int. Joint Conf. Neural Netw., 2006, pp. 3319–3326.
[30] Z. Zhao and H. Liu, "Semi-supervised feature selection via spectral analysis," in Proc. 7th SIAM Int. Conf. Data Mining, 2007, pp. 641–646.
[31] J. Ren, Z. Qiu, W. Fan, H. Cheng, and P. S. Yu, "Forward semi-supervised feature selection," in Proc. Pacific-Asia Conf. Knowl. Discovery Data Mining, 2008, pp. 970–976.
[32] J. Zhao, K. Lu, and X. He, "Locality sensitive semi-supervised feature selection," Neurocomputing, vol. 71, no. 10-12, pp. 1842–1849, 2008.
[33] X. Kong and P. S. Yu, "Semi-supervised feature selection for graph classification," in Proc. 16th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2010, pp. 793–802.
[34] Z. Xu, I. King, M. R. Lyu, and R. Jin, "Discriminative semi-supervised feature selection via manifold regularization," IEEE Trans. Neural Netw., vol. 21, no. 7, pp. 1033–1047, Jul. 2010.
[35] J. W. Sammon, Jr., "A nonlinear mapping for data structure analysis," IEEE Trans. Comput., vol. C-18, no. 5, pp. 401–409, May 1969.

[36] D. Koller and M. Sahami, "Toward optimal feature selection," in Proc. 13th Int. Conf. Mach. Learn., 1996, pp. 284–292.
[37] P. Mitra, C. A. Murthy, and S. K. Pal, "Unsupervised feature selection using feature similarity," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 3, pp. 301–312, Mar. 2002.
[38] N. Søndberg-Madsen, C. Thomsen, and J. M. Peña, "Unsupervised feature subset selection," in Proc. Workshop Probabilistic Graph. Models Classification, 2003, pp. 71–82.
[39] J. Tang and H. Liu, "Unsupervised feature selection for linked social media data," in Proc. 18th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2012, pp. 904–912.
[40] X. He, D. Cai, and P. Niyogi, "Laplacian score for feature selection," in Proc. Adv. Neural Inf. Process. Syst., 2005, pp. 507–514.
[41] D. Cai, C. Zhang, and X. He, "Unsupervised feature selection for multi-cluster data," in Proc. 16th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2010, pp. 333–342.
[42] K. Z. Mao, "Identifying critical variables of principal components for unsupervised feature selection," IEEE Trans. Syst., Man Cybern. Part B, vol. 35, no. 2, pp. 339–344, Apr. 2005.
[43] N. Wiratunga, R. Lothian, and S. Massie, "Unsupervised feature selection for text data," in Proc. 8th Eur. Conf. Case-Based Reasoning, 2006, pp. 340–354.
[44] T. Liu, S. Liu, Z. Chen, and W.-Y. Ma, "An evaluation on feature selection for text clustering," in Proc. 20th Int. Conf. Mach. Learn., 2003, pp. 488–495.
[45] Y. Hong, S. Kwong, Y. Chang, and Q. Ren, "Unsupervised feature selection using clustering ensembles and population based incremental learning algorithm," Pattern Recog., vol. 41, no. 9, pp. 2742–2756, 2008.
[46] A. K. Farahat, A. Ghodsi, and M. S. Kamel, "An efficient greedy method for unsupervised feature selection," in Proc. IEEE 11th Int. Conf. Data Mining, 2011, pp. 161–170.
[47] J. Bedo, "Microarray design using the Hilbert-Schmidt independence criterion," in Pattern Recognition in Bioinformatics. Berlin, Germany: Springer, 2008, pp. 288–298.
[48] O. Alter, P. O. Brown, and D. Botstein, "Singular value decomposition for genome-wide expression data processing and modeling," Proc. Nat. Acad. Sci. USA, vol. 97, no. 18, pp. 10101–10106, 2000.
[49] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 888–905, Aug. 2000.
[50] D. Xingzhong, Y. Yan, P. Pan, G. Long, and L. Zhao, "Multiple graph unsupervised feature selection," Signal Process., doi: 10.1016/j.sigpro.2014.12.027, 2014.
[51] P. Lu, C. Zhu, R. H. Byrd, and J. Nocedal, "Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization," ACM Trans. Math. Softw., vol. 23, no. 4, pp. 550–560, 1997.

[52] [Online]. Available: http://archive.ics.uci.edu/ml/datasets.html
[53] L. Hubert and P. Arabie, "Comparing partitions," J. Classification, vol. 2, pp. 193–218, 1985.
[54] C. Studholme, D. L. G. Hill, and D. J. Hawkes, "An overlap invariant entropy measure of 3D medical image alignment," Pattern Recog., vol. 32, pp. 71–86, 1999.
[55] [Online]. Available: http://www.ntu.edu.sg/home/elhchen/data.htm
[56] [Online]. Available: http://www.biolab.si/supp/bi-cancer/projections/info/SRBCT.htm

Monami Banerjee received the M Tech degree in computer science from the Indian Statistical Institute, Calcutta. Currently, she is working toward the PhD degree. Her research interests include pattern recognition and feature selection.

Nikhil R. Pal is an INAE chair professor in the Electronics and Communication Sciences Unit of the Indian Statistical Institute. He was the editor-in-chief of the IEEE Transactions on Fuzzy Systems for six years and serves on the editorial board/advisory board of many other journals. He received the IEEE Computational Intelligence Society 2015 Fuzzy Systems Pioneer Award. He is a fellow of the National Academy of Sciences, India, the Indian National Academy of Engineering, the Indian National Science Academy, the International Fuzzy Systems Association (IFSA), and the IEEE.