2496 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 59, NO. 6, JUNE 2011

Linear Dimensionality Reduction for Margin-Based Classification: High-Dimensional Data and Sensor Networks

Kush R. Varshney, Member, IEEE, and Alan S. Willsky, Fellow, IEEE

Abstract—Low-dimensional statistics of measurements play an important role in detection problems, including those encountered in sensor networks. In this work, we focus on learning low-dimensional linear statistics of high-dimensional measurement data along with decision rules defined in the low-dimensional space in the case when the probability density of the measurements and class labels is not given, but a training set of samples from this distribution is given. We pose a joint optimization problem for linear dimensionality reduction and margin-based classification, and develop a coordinate descent algorithm on the Stiefel manifold for its solution. Although the coordinate descent is not guaranteed to find the globally optimal solution, crucially, its alternating structure enables us to extend it for sensor networks with a message-passing approach requiring little communication. Linear dimensionality reduction prevents overfitting when learning from finite training data. In the sensor network setting, dimensionality reduction not only prevents overfitting, but also reduces power consumption due to communication. The learned reduced-dimensional space and decision rule is shown to be consistent and its Rademacher complexity is characterized. Experimental results are presented for a variety of datasets, including those from existing sensor networks, demonstrating the potential of our methodology in comparison with other dimensionality reduction approaches.

Index Terms—Linear dimensionality reduction, sensor networks, Stiefel manifold, supervised classification.

I. INTRODUCTION

SENSOR networks are systems used for distributed detection and data fusion that operate with severe resource limitations; consequently, minimizing complexity in terms of communication and computation is critical [3].

Manuscript received July 11, 2010; revised December 30, 2010; accepted February 18, 2011. Date of publication April 19, 2011; date of current version May 18, 2011. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Anna Scaglione. This work was supported in part by a National Science Foundation Graduate Research Fellowship, by a MURI funded through ARO Grant W911NF-06-1-0076, by the Air Force Office of Scientific Research under Award FA9550-06-1-0324 and by Shell International Exploration and Production, Inc. Any opinions, findings and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the Air Force. The material in this paper was presented in part at the International Conference on Information Fusion, Seattle, WA, July 2009, and in the Ph.D. dissertation “Frugal Hypothesis Testing and Classification,” Massachusetts Institute of Technology, Cambridge, 2010, of K. R. Varshney.

K. R. Varshney was with the Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA 02139 USA (e-mail: [email protected]).

A. S. Willsky is with the Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA 02139 USA (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TSP.2011.2123891

A current interest is in deploying wireless sensor networks with nodes that take measurements using many heterogeneous modalities such as acoustic, infrared and seismic to monitor volcanoes [4], detect intruders [5], [6] and perform many other classification tasks. Sensor measurements may contain much redundancy, both within the measurement dimensions of a single sensor and between measurement dimensions of different sensors due to spatial correlation.

Resources can be conserved if sensors do not transmit irrelevant or redundant data, but it is usually not known in advance which measurement dimensions or combination of dimensions are most useful for the detection or classification task. The transmission of irrelevant and redundant data can be avoided through dimensionality reduction; specifically, a low-dimensional representative form of measurements may be transmitted by sensors to a fusion center, which then detects or classifies based on those low-dimensional measurement representations. As measurements or low-dimensional measurement representations are transmitted from sensor to sensor, eventually reaching the fusion center, dimensionality reduction at the parent node eliminates redundancy between parent and child node measurements. Even a reduction from two-dimensional measurements to one-dimensional features is significant in many hostile-environment monitoring and surveillance applications.

Decision rules in detection problems, both in the sensor network setting and not, are often simplified through sufficient statistics such as the likelihood ratio [7]. Calculation of a sufficient statistic losslessly reduces the dimensionality of high-dimensional measurements before applying a decision rule defined in the reduced-dimensional space, but requires knowledge of the probability distribution of the measurements. The statistical learning problem of supervised classification deals with the case when this distribution is unknown, but a set of labeled samples from it, known as the training dataset, is available. For the most part, however, supervised classification methods (not adorned with feature selection) produce decision rules defined in the full high-dimensional measurement space rather than in a reduced-dimensional space, motivating feature selection or dimensionality reduction for classification.

In this paper, we propose a method for simultaneously learning both a dimensionality reduction mapping and a classifier defined in the reduced-dimensional space. Not only does dimensionality reduction simplify decision rules, but it also decreases the probability of classification error by preventing overfitting when learning from a finite training dataset [8]–[11]. We focus on linear dimensionality reduction mappings represented by matrices on the Stiefel manifold [12] and on margin-based classifiers, a popular and effective class of classifiers that includes logistic regression, the support vector machine (SVM) and the geometric level set (GLS) classifier [13]–[15]. The importance of the Stiefel manifold is its role as the set of all linear subspaces with basis specified and hence it provides precisely the right object for exploring different subspaces on which to project measurements.

Many methods for linear dimensionality reduction, including the popular principal component analysis (PCA) and Fisher discriminant analysis (FDA), can be posed as optimization problems on the Stiefel or Grassmann manifold with different objectives [12]. In this paper, we propose an optimization problem on the Stiefel manifold whose objective is that of margin-based classification and develop an iterative coordinate descent algorithm for its solution. PCA, FDA and other methods do not have margin-based classification as their objective and are consequently suboptimal with respect to that objective. Coordinate descent is not guaranteed to find the global optimum; however, as seen later in the paper, an advantage of coordinate descent is that it is readily implemented in distributed settings and tends to find good solutions in practice. We successfully demonstrate the learning procedure on several real datasets from different applications.

The idea of learning linear dimensionality reduction mappings from labeled training data specifically for the purpose of classification is not new. For example, the goal of FDA is classification, but it assumes that the class-conditional distributions generating the data are Gaussian with identical covariances; it is also not well suited to datasets of small cardinality [16]. We reserve discussion of several such methods until Section I-A.1

Our work fits into the general category of learning data representations that have traditionally been learned in an unsupervised manner, appended with known class labels and consequently supervision. Examples from this category include learning undirected graphical models [20], sparse signal representations [21], [22], directed topic models [23], [24], quantizer codebooks [25] and linear dimensionality reduction matrices, which is the topic of this paper and others described in Section I-A.

Statistical learning theory characterizes the phenomenon of overfitting when there is finite training data. The generalization error of a classifier—the probability of misclassification on new unseen measurements (the quantity we would ideally like to minimize)—can be bounded by the sum of two terms [8]: the classification error on the training set and a complexity term, e.g., the Rademacher complexity [26], [27]. We analytically characterize the Rademacher complexity as a function of the dimension of the reduced-dimensional space in this work. Finding it to be an increasing function of the dimension, we can conclude that dimensionality reduction does in fact prevent overfitting and that there exists some optimal reduced dimension.

As the cardinality of the training dataset grows, the generalization error of a consistent classifier converges to the Bayes optimal probability of error, i.e., the error probability had the joint probability distribution been known. We show that our proposed joint linear dimensionality reduction and margin-based classification method is consistent.

1Our paper focuses on general linear dimensionality reduction and not on feature subset selection, which is a separate topic in its own right, e.g., see [17]–[19].

The problem of distributed detection has been an object of study during the last 30 years [28]–[31], but the majority of the work has focused on the situation when either the joint probability distribution of the measurements and labels or the likelihood functions of the measurements given the labels are assumed known. Recently, there has been some work on supervised classification for distributed settings [32]–[34], but in that work sensors take scalar-valued measurements and dimensionality reduction is not involved. Previous work on the linear dimensionality reduction of sensor measurements in distributed settings, including [35]–[37] and references therein, has estimation rather than detection or classification as the objective.

In this paper, we show how the linear dimensionality reduction of heterogeneous data specifically for margin-based classification may be distributed in a tree-structured multisensor data fusion network with a fusion center via individual Stiefel manifold matrices at each sensor. The proposed coordinate descent learning algorithm is amenable to distributed implementation. In particular, we extend the coordinate descent procedure so that it can be implemented in tree-structured sensor networks through a message-passing approach with the amount of communication related to the reduced dimension rather than the full measurement dimension. The ability to be distributed is a key strength of the coordinate descent optimization approach.

Multisensor networks lead to issues that do not typically arise in statistical learning, where generalization error is the only criterion. In sensor networks, resource usage presents an additional criterion to be considered and the architecture of the network presents additional design freedom. In wireless sensor networks, the distance between nodes affects energy usage in communication and must therefore be considered in selecting network architecture. We give classification results on real datasets for different network architectures and touch on these issues empirically.

A. Relationship to Prior Work

The most popular method of linear dimensionality reduction for data analysis is PCA. PCA and several other methods only make use of the measurement vectors, not the class labels, in finding a dimensionality reduction mapping. If the dimensionality reduction is to be done in the context of supervised classification, the class labels should also be used. Several supervised linear dimensionality reduction methods exist in the literature. We can group these methods into three broad categories: those that separate likelihood functions according to some distance or divergence [38]–[44], those that try to make the probability of the labels given the measurements and the probability of the labels given the dimensionality-reduced measurements equal [45]–[50] and those that attempt to minimize a specific classification or regression objective [12], [51]–[54].

As mentioned previously in the section, FDA assumes that the likelihood functions are Gaussian with the same covariance and different means. It returns a dimensionality reduction matrix on the Stiefel manifold that maximally separates (in Euclidean distance) the clusters of the different labels [12]. The method of [39] also assumes Gaussian likelihoods with the same covariance and different means, but with an even stronger assumption that the covariance matrix is a scalar multiple of the identity. The probability of error is explicitly minimized using gradient descent; the gradient updates to the dimensionality reduction matrix do not enforce the Stiefel manifold constraint, but the Gram–Schmidt orthonormalization procedure is performed after every step to obtain a matrix that does meet the constraint. With a weaker assumption only that the likelihood functions are Gaussian, but without restriction on the covariances, other methods maximize Bhattacharyya divergence or Chernoff divergence, which are surrogates for minimizing the probability of error [43].

The method of [38], like FDA, maximally separates the clusters of the different labels but does not make the strong Gaussian assumption. Instead, it performs kernel density estimation of the likelihoods and separates those estimates. The optimization is gradient ascent and orthonormalization is performed after every step. Similarly, information preserving component analysis also performs kernel density estimation and maximizes Hellinger distance, another surrogate for minimizing the probability of error, with optimization through gradient ascent and the Stiefel manifold constraint maintained in the gradient steps [44]. Other approaches with information-theoretic criteria include [40]–[42].

Like [38] and [44], the method of [49] also estimates probability density functions for use in the criterion for linear dimensionality reduction. The particular criterion, however, is based on the idea that the dimensionality reduction mapping should be such that the probability of the class labels conditioned on the unreduced measurements equal the probability conditioned on the reduced measurements. The same criterion appears in [45], [46], [48], [50], and many references given in [47]. These papers describe various methods of finding dimensionality reduction mappings to optimize the criterion with different assumptions.

Some supervised dimensionality reduction methods explicitly optimize a classification or regression objective. A linear regression objective and a regression parameter/Stiefel manifold coordinate descent algorithm are developed in [53]. The support vector singular value decomposition machine of [52] has a joint objective for dimensionality reduction and classification with the hinge loss function. However, the matrix it produces is not guaranteed to be on the Stiefel manifold and the space in which the classifier is defined is not exactly the dimensionality-reduced image of the high-dimensional space. It also changes the regularization term from what is standardly used for the SVM. Maximum margin discriminant analysis is another method based on the SVM; it finds the reduced-dimensional features one by one instead of giving a complete matrix at once and it does not simultaneously give a classifier [54]. The method of [12] and [51] is based on the nearest neighbor classifier.

The objective function and optimization procedure we propose in Section II have some similarities to many of the methods discussed, but also some key differences. First of all, we do not make, and indeed do not explicitly make use of, any assumptions on the statistics of the likelihood functions (e.g., no assumption of Gaussianity is employed). Moreover, our method does not require nor involve estimation of the probability density functions under the two hypotheses nor of the likelihood ratio. Indeed, we are directly interested only in learning decision boundaries and using margin-based loss functions to guide both this learning and the optimization over the Stiefel manifold to determine the reduced-dimensional space in which decision making is to be performed. Density estimation is a harder problem than finding classifier decision boundaries and it is well known that when learning from finite data, it is best to only solve the problem of interest and nothing more. Similarly, the desire that the conditional distributions of the class label given the high-dimensional and reduced-dimensional measurements be equal is more involved than wanting good classification performance in the reduced-dimensional space.

Rather than nearest neighbor classification or linear regression, the objective in the method we propose is margin-based classification. Our method finds all reduced-dimensional features in a joint manner and gives both the dimensionality reduction mapping and the classifier as output. Unlike in [52], the classifier is defined exactly without approximation in the reduced-dimensional subspace resulting from applying the dimensionality reduction matrix that is found. Additionally, the regularization term and consequently the inductive bias of the classifier are left unchanged.

The preceding represent the major conceptual differences between our framework and that considered in previous work. We use coordinate descent optimization procedures in Section II, which are also employed in other works, e.g., [52] and [53], but the setting in which we use them is new. Our framework also allows us to develop some new theoretical results on consistency and Rademacher complexity. Moreover, as developed in Section III, our framework allows a natural generalization to distributed dimensionality reduction for classification in sensor networks, a problem that has not been considered previously.

Ji and Ye presented an approach to linear dimensionality reduction for classification with linear decision boundaries [55] after the initial presentation of this work [1]; their approach is similar to our formulation as well as to the formulation of [53]. Ji and Ye restrict themselves to the regularization term of the SVM and either a regression objective like [53] or the hinge loss. In our formulation, any regularization term and any margin-based loss function may be used and the decision boundaries are generally nonlinear. With the hinge loss, the optimization in [55] is through coordinate descent similar to ours, but the dimensionality reduction matrix optimization step is carried out via a convex-concave relaxation (which is not guaranteed to find the optimum of the true unrelaxed problem) rather than the gradient descent along Stiefel manifold geodesics that we use. The work of Ji and Ye also considers the learning problem in which training samples may have either zero, one, or more than one assigned class label, which is known as multilabel classification [56] and is not the focus of our work.

B. Organization of Paper

The paper is organized as follows. Section II combines the ideas of margin-based classification and optimization on the Stiefel manifold to give a joint linear dimensionality reduction and classification objective as well as an iterative algorithm. An analysis of Rademacher complexity and consistency is also presented in the section. Section III shows how the basic method of Section II extends to multisensor data fusion networks, including wireless sensor networks. In Section IV, an illustrative example and results on several real datasets are given. Also given are experimental results of classification performance as a function of transmission power in wireless sensor networks. Section V concludes.

II. LINEAR DIMENSIONALITY REDUCTION FOR MARGIN-BASED CLASSIFICATION

In this section, we formulate a problem for composite dimensionality reduction and margin-based classification. We develop a coordinate descent minimization procedure for this formulation, characterize the complexity of the formulation from a statistical learning theory perspective and show the consistency of the formulation.

A. Formulation

Consider the binary detection or classification problem with measurement vectors $\mathbf{x} \in \mathbb{R}^D$ and class labels $y \in \{-1, +1\}$ drawn according to the joint probability density function $p_{\mathbf{x},y}(\mathbf{x}, y)$. We would like to find the classifier $\hat{y}(\cdot)$ that minimizes the error probability. We do not have access to $p_{\mathbf{x},y}$, but instead are given training data $\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$. The true objective we would like to minimize in learning is the generalization error $\Pr[\hat{y}(\mathbf{x}) \neq y]$, but a direct minimization is not possible since the joint distribution of $\mathbf{x}$ and $y$ is not known. In practice, the classifier is selected from a function class $\mathcal{F}$ to minimize a loss function of the training data.

Margin-based classifiers take the form $\hat{y}(\mathbf{x}) = \operatorname{sign}(f(\mathbf{x}))$, where $f(\cdot)$ is a decision function whose specifics are tied to the specific margin-based classifier. The decision function is chosen to minimize the functional

$$ L(f) = \sum_{j=1}^{n} \ell\left(y_j f(\mathbf{x}_j)\right) + \lambda J(f) \qquad (1) $$

where the value $y f(\mathbf{x})$ is known as the margin; it is related to the distance between $\mathbf{x}$ and the classifier decision boundary $f(\mathbf{x}) = 0$. The function $\ell(\cdot)$ is known as a margin-based loss function. Examples of such functions are the logistic loss function

$$ \ell(z) = \log\left(1 + e^{-z}\right) $$

and the hinge loss function

$$ \ell(z) = \max\{0, 1 - z\}. $$

The second term on the right side of (1), with non-negative weight $\lambda$, represents a regularization term $J(f)$ that penalizes the complexity of the decision function [13], [14]. In the kernel SVM, $\ell$ is the hinge loss, the decision functions are in a reproducing kernel Hilbert space and $J(f)$ is the squared norm in that space [13], [14]. In the GLS classifier, any margin-based loss function may be used and the decision functions are in the space of signed distance functions [2], [15]. The magnitude of $f(\mathbf{x})$ equals the Euclidean distance of $\mathbf{x}$ to the decision boundary. The regularization term is the surface area of the zero level set of $f$, i.e., $J(f) = \oint_{f(\mathbf{x}) = 0} ds$, where $ds$ is an infinitesimal surface area element on the decision boundary.
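To make the margin-based losses concrete, the following NumPy sketch implements the logistic and hinge losses and the regularized empirical objective (1) for a generic decision function. The interface (a callable `f` scoring a batch of inputs and a user-supplied regularizer `J`) is an illustrative assumption, not the paper's implementation.

```python
import numpy as np

def logistic_loss(z):
    """Logistic margin-based loss, log(1 + exp(-z)), computed stably."""
    return np.logaddexp(0.0, -z)

def hinge_loss(z):
    """Hinge margin-based loss, max(0, 1 - z)."""
    return np.maximum(0.0, 1.0 - z)

def margin_objective(f, X, y, loss=logistic_loss, lam=1.0, J=lambda f: 0.0):
    """Empirical version of (1): sum of losses on the margins plus lam * J(f).

    f : callable mapping an (n, d) array of inputs to an (n,) array of scores
    X : (n, d) training inputs; y : (n,) labels in {-1, +1}
    J : regularization functional of the decision function (problem specific)
    """
    margins = y * f(X)                 # y_j * f(x_j), the margins
    return loss(margins).sum() + lam * J(f)
```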

The new contribution of this section is the formulation of a joint linear dimensionality reduction and classification minimization problem by extension of the margin-based functional (1). The decision function is defined in the reduced $d$-dimensional space and a linear dimensionality reduction mapping appears in its argument, but otherwise, the classification objective is left unchanged. In particular, the regularization term is not altered, thereby allowing any regularized margin-based classifier to be extended for dimensionality reduction.

The margin-based classification objective is extended to include a matrix $\mathbf{A} \in \mathbb{R}^{D \times d}$ with elements $a_{ik}$ as follows:

$$ L(f, \mathbf{A}) = \sum_{j=1}^{n} \ell\left(y_j f(\mathbf{A}^T \mathbf{x}_j)\right) + \lambda J(f) \qquad (2) $$

with the constraint that $\mathbf{A}$ lie on the Stiefel manifold of $D \times d$ matrices, i.e., $\mathbf{A} \in \mathcal{V}(D, d)$, where

$$ \mathcal{V}(D, d) = \left\{\mathbf{A} \in \mathbb{R}^{D \times d} \mid \mathbf{A}^T \mathbf{A} = \mathbf{I}_d\right\}. \qquad (3) $$

With a data vector $\mathbf{x} \in \mathbb{R}^D$, $\mathbf{A}^T \mathbf{x}$ is in $d$ dimensions. Typically—and especially in our framework—we are uninterested in scalings of the reduced-dimensional data $\mathbf{A}^T \mathbf{x}$, so we limit the set of possible matrices to those which involve orthogonal projection, i.e., to the Stiefel manifold.
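A point on the Stiefel manifold and the corresponding projection of a data matrix can be produced in a few lines; this sketch draws a random orthonormal-column matrix via a thin QR factorization of a Gaussian matrix, an assumption made only for illustration.

```python
import numpy as np

def random_stiefel(D, d, rng=np.random.default_rng(0)):
    """Sample a D x d matrix with orthonormal columns, i.e., a point on V(D, d)."""
    G = rng.standard_normal((D, d))
    Q, _ = np.linalg.qr(G)            # thin QR gives orthonormal columns
    return Q

A = random_stiefel(D=8, d=2)
assert np.allclose(A.T @ A, np.eye(2), atol=1e-10)      # Stiefel constraint A^T A = I

X = np.random.default_rng(1).standard_normal((500, 8))  # n x D data matrix
Z = X @ A                                               # each row is A^T x_j, shape n x d
```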

The formulation as presented is for a fixed value of $d$. If generalization error is the only criterion, then any popular model selection method from the machine learning literature, including those based on cross-validation, bootstrapping and information criteria, can be used to find a good value for the reduced dimension $d$. However, other criteria besides generalization error become important in various settings, including sensor networks. System resource usage is one such criterion; it is not typically statistical in nature and is often a deterministic increasing function of $d$. As such, it may be used as an additional cost with information criteria or as a component in modified cross-validation and bootstrapping. If different types of errors such as false alarms and missed detections incur different costs, then the criterion is not strictly generalization error, but cross-validation and bootstrapping may be modified accordingly.
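As a concrete illustration of selecting $d$ by cross-validation, the sketch below scores candidate reduced dimensions on held-out folds; `fit_reduction_and_classifier` is a hypothetical stand-in for the joint training procedure of this section.

```python
import numpy as np

def select_reduced_dimension(X, y, candidate_dims, fit_reduction_and_classifier,
                             n_folds=5, rng=np.random.default_rng(0)):
    """Pick the reduced dimension d with the lowest cross-validated error rate."""
    n = len(y)
    folds = np.array_split(rng.permutation(n), n_folds)
    errors = {}
    for d in candidate_dims:
        fold_err = []
        for k in range(n_folds):
            test = folds[k]
            train = np.concatenate([folds[i] for i in range(n_folds) if i != k])
            A, f = fit_reduction_and_classifier(X[train], y[train], d)
            y_hat = np.sign(f(X[test] @ A))
            fold_err.append(np.mean(y_hat != y[test]))
        errors[d] = np.mean(fold_err)
    return min(errors, key=errors.get), errors
```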

B. Coordinate Descent Minimization

An option for performing the minimization of $L(f, \mathbf{A})$ given in (2) is coordinate descent: alternating minimizations over $f$ with $\mathbf{A}$ fixed and over $\mathbf{A}$ with $f$ fixed. The problem is conceptually similar to level set image segmentation along with pose estimation for a shape prior [57]. With $\mathbf{A}$ fixed, we are left with a standard margin-based classification problem in the reduced-dimensional space. The optimization step may be performed using standard methods for margin-based classifiers.
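The alternation can be summarized as a short driver; `fit_classifier`, `stiefel_descent_step` and `initialize_A` are hypothetical placeholders for the classifier training step, the manifold update developed below and the initialization discussed at the end of this subsection.

```python
def coordinate_descent(X, y, d, fit_classifier, stiefel_descent_step, initialize_A, n_iters=10):
    """Alternate a classifier step (A fixed) with a Stiefel manifold step (f fixed).

    fit_classifier(Z, y)             -> decision function f trained on reduced data Z = X @ A
    stiefel_descent_step(A, f, X, y) -> updated A satisfying A^T A = I (see below)
    initialize_A(X, y, d)            -> starting point, e.g., the mutual-information start below
    """
    A = initialize_A(X, y, d)
    f = None
    for _ in range(n_iters):
        f = fit_classifier(X @ A, y)              # standard margin-based training in d dims
        A = stiefel_descent_step(A, f, X, y)      # gradient step along the Stiefel manifold
    return A, f
```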

With $f$ fixed, we have a problem of minimizing a function of $\mathbf{A}$ lying on the Stiefel manifold. For differentiable functions, several iterative minimization algorithms exist [58]–[60]. The function $L(f, \mathbf{A})$ is differentiable with respect to $\mathbf{A}$ for differentiable loss functions. Using $\partial L / \partial \mathbf{A}$ to denote the matrix with elements $\partial L / \partial a_{ik}$, the first derivative is

$$ \frac{\partial L}{\partial \mathbf{A}} = \sum_{j=1}^{n} y_j\, \ell'\!\left(y_j f(\mathbf{A}^T \mathbf{x}_j)\right) \mathbf{x}_j \left[\nabla f(\mathbf{A}^T \mathbf{x}_j)\right]^T \qquad (4) $$

Note that $\mathbf{x}_j$ is a $D \times 1$ vector and that $\nabla f(\mathbf{A}^T \mathbf{x}_j)$ is a $d \times 1$ vector, whose $k$th entry is the partial derivative of the decision function with respect to dimension $k$ of its argument. For the logistic loss function

$$ \ell'(z) = \frac{-e^{-z}}{1 + e^{-z}} $$

and for the hinge loss function

$$ \ell'(z) = -u(1 - z) $$

where $u(\cdot)$ is the Heaviside step function.
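As a sanity check of (4), the sketch below accumulates the matrix of partial derivatives for the logistic loss, using a finite-difference gradient of the decision function; the interface of `f` (a callable on a single d-vector) is an illustrative assumption.

```python
import numpy as np

def dloss_logistic(z):
    """Derivative of the logistic loss log(1 + exp(-z))."""
    return -1.0 / (1.0 + np.exp(z))

def grad_f_numeric(f, z, eps=1e-6):
    """Finite-difference gradient of the decision function at a d-vector z."""
    g = np.zeros_like(z)
    for k in range(len(z)):
        e = np.zeros_like(z); e[k] = eps
        g[k] = (f(z + e) - f(z - e)) / (2 * eps)
    return g

def dL_dA(A, f, X, y):
    """Matrix of partials in (4): sum_j y_j l'(y_j f(A^T x_j)) x_j [grad f(A^T x_j)]^T."""
    G = np.zeros_like(A)
    for xj, yj in zip(X, y):
        zj = A.T @ xj
        G += yj * dloss_logistic(yj * f(zj)) * np.outer(xj, grad_f_numeric(f, zj))
    return G
```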

We perform gradient descent along geodesics of the Stiefel manifold [58]. The gradient is

$$ \nabla L = \frac{\partial L}{\partial \mathbf{A}} - \mathbf{A} \left(\frac{\partial L}{\partial \mathbf{A}}\right)^{T} \mathbf{A}. \qquad (5) $$

Starting at an initial $\mathbf{A}(0)$, a step of length $t$ in the direction $\mathbf{H} = -\nabla L$ to $\mathbf{A}(t)$ is

$$ \mathbf{A}(t) = \begin{bmatrix} \mathbf{A}(0) & \mathbf{Q} \end{bmatrix} \exp\left( t \begin{bmatrix} \mathbf{A}(0)^{T}\mathbf{H} & -\mathbf{R}^{T} \\ \mathbf{R} & \mathbf{0} \end{bmatrix} \right) \begin{bmatrix} \mathbf{I}_{d} \\ \mathbf{0} \end{bmatrix} \qquad (6) $$

where $\mathbf{Q}\mathbf{R}$ is the compact QR decomposition of $(\mathbf{I} - \mathbf{A}(0)\mathbf{A}(0)^{T})\mathbf{H}$ and $\exp(\cdot)$ denotes the matrix exponential. The step size $t$ may be optimized by a line search.
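Equations (5) and (6) translate directly into NumPy and SciPy; the following sketch follows the geodesic formula of Edelman et al. [58] with a fixed step size rather than a line search, and adds a final polar retraction purely as a numerical safeguard.

```python
import numpy as np
from scipy.linalg import expm

def stiefel_gradient(A, dL_dA):
    """Canonical-metric gradient on the Stiefel manifold, as in (5)."""
    return dL_dA - A @ dL_dA.T @ A

def geodesic_step(A, dL_dA, t):
    """Move a step of length t along the descent direction, following (6)."""
    D, d = A.shape
    H = -stiefel_gradient(A, dL_dA)                       # descent direction
    Q, R = np.linalg.qr((np.eye(D) - A @ A.T) @ H)        # compact QR, Q is D x d
    block = np.block([[A.T @ H, -R.T],
                      [R, np.zeros((d, d))]])             # 2d x 2d matrix
    MN = expm(t * block) @ np.vstack([np.eye(d), np.zeros((d, d))])
    A_new = np.hstack([A, Q]) @ MN                        # D x d result
    U, _, Vt = np.linalg.svd(A_new, full_matrices=False)  # polar retraction guards
    return U @ Vt                                         # against numerical drift
```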

The coordinate descent is not guaranteed to find the global optimum, only a local optimum; however, as seen in the illustrative example in Section IV-A, even poor initializations lead to the globally optimal solution in practice. For the results given in Section IV-B, $\mathbf{A}$ is initialized by making use of estimates of the mutual informations between the label $y$ and individual data dimensions $x_i$. Mutual information provides an indication of whether a measurement dimension is individually relevant for classification and thus projection onto dimensions with high mutual information is a good starting point. Of course, these dimensions may be correlated and that is precisely what the Stiefel manifold optimization iterations uncover. The first column of $\mathbf{A}$ is taken to be the canonical unit vector corresponding to the dimension with the largest mutual information. The second column of $\mathbf{A}$ is taken to be the canonical unit vector corresponding to the dimension with the second largest mutual information and so on. The last, i.e., $d$th, column of $\mathbf{A}$ is zero in the rows already containing ones in the first $d - 1$ columns and nonzero in the remaining rows with values proportional to the mutual informations of the remaining dimensions. Kernel density estimation is used in estimating mutual information.
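The mutual-information initialization can be sketched as follows; scikit-learn's `mutual_info_classif` stands in for the kernel-density estimates used in the paper, and the last column is normalized so that the Stiefel constraint holds.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def mutual_info_init(X, y, d):
    """Initialize A from per-dimension mutual information estimates."""
    D = X.shape[1]
    mi = mutual_info_classif(X, y)               # stand-in for KDE-based MI estimates
    order = np.argsort(mi)[::-1]                 # dimensions sorted by relevance
    A = np.zeros((D, d))
    for k in range(d - 1):                       # first d-1 columns: canonical unit vectors
        A[order[k], k] = 1.0
    rest = order[d - 1:]                         # remaining dims fill the last column
    A[rest, d - 1] = mi[rest]
    A[:, d - 1] /= np.linalg.norm(A[:, d - 1])   # unit norm; already orthogonal to the others
    return A
```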

C. Rademacher Complexity

The generalization error can be bounded by the sum of the error of $f$ on the training set and a penalty that is larger for more complex $f$. One such penalty is the Rademacher complexity [26], [27]. A classifier with good generalizability balances training error and complexity; this is known as the structural risk minimization principle [8].

With probability greater than or equal to $1 - \delta$, Bartlett and Mendelson give the following bound on the generalization error for a specified decision rule [27]:

(7)

where $I(\cdot)$ is an indicator function. The first term on the right-hand side is the training error and the second term is the complexity. As discussed in [9]–[11], dimensionality reduction reduces classifier complexity and thus prevents overfitting. Here, we analytically characterize the Rademacher complexity term $R_n(\mathcal{F})$ for the joint linear dimensionality reduction and margin-based classification method proposed in this paper. It is shown in [61] that the Rademacher average of a function class $\mathcal{F}$ satisfies

(8)

where $H(\mathcal{F}, \epsilon)$ is the $\epsilon$-entropy of $\mathcal{F}$ with respect to the metric defined in footnote 2.

In classification, it is always possible to scale and shift the data and this is often done in practice. Forgoing some bookkeeping and without losing much generality, we consider the domain of the unreduced measurement vectors to be the unit hypercube, that is $\mathbf{x} \in [0,1]^D$. The reduced-dimensional domain is then the zonotope3 $Z(\mathbf{A}) = \{\mathbf{A}^T \mathbf{x} \mid \mathbf{x} \in [0,1]^D\}$, where $\mathbf{A}$ is on the Stiefel manifold. We denote the set of decision functions defined on $[0,1]^D$ as $\mathcal{F}_{[0,1]^D}$ and those defined on $Z(\mathbf{A})$ as $\mathcal{F}_{Z(\mathbf{A})}$.

Given the generalization bound based on Rademacher complexity (7) and the Rademacher complexity term (8), we must find an expression for the $\epsilon$-entropy of $\mathcal{F}_{Z(\mathbf{A})}$ to characterize the prevention of overfitting by linear dimensionality reduction. The function class is tied to the specific margin-based classification method employed. In order to make concrete statements, we select the GLS classifier; similar analysis may be performed for other margin-based classifiers such as the kernel SVM. Such analysis would also be similar to [11]. As mentioned in Section II-A, the decision function in the GLS classifier is a signed distance function and $\mathcal{F}_{Z(\mathbf{A})}$ is the set of all signed distance functions whose domain is the zonotope $Z(\mathbf{A})$.

2The $\epsilon$-covering number of a metric space is the minimal number of sets with radius not exceeding $\epsilon$ required to cover that space; the $\epsilon$-entropy is the base-two logarithm of the $\epsilon$-covering number [62]. The metric used is the supremum metric $\rho(f_1, f_2) = \sup_{\mathbf{x}} |f_1(\mathbf{x}) - f_2(\mathbf{x})|$.

3The set $Z(\mathbf{A}) = \{\mathbf{A}^T \mathbf{x} \mid \mathbf{x} \in [0,1]^D\}$, the orthogonal shadow cast by $[0,1]^D$ due to the projection $\mathbf{A} \in \mathcal{V}(D, d)$, is a zonotope, a particular type of polytope that is convex, centrally symmetric and whose faces are also centrally symmetric in all lower dimensions [63], [64]. For reference, Fig. 1 shows several zonotopes for $D = 3$ and $d = 2$. The matrix $\mathbf{A}$ is known as the generator of the zonotope; we use the notation $Z(\mathbf{A})$ to denote the zonotope generated by $\mathbf{A}$. Also, let

$$ \mathcal{Z}(D, d) = \{Z(\mathbf{A}) \mid \mathbf{A} \in \mathcal{V}(D, d)\}. \qquad (9) $$

Although the relationship between zonotopes and their generators is not bijective, zonotopes provide a good means of visualizing Stiefel manifold matrices, especially when $d = 2$.
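For intuition about zonotopes, the sketch below computes the outline of $Z(\mathbf{A})$ for $d = 2$ by mapping every corner of the hypercube through $\mathbf{A}^T$ and taking the convex hull; this brute-force construction is only sensible for small $D$ and is purely illustrative.

```python
import itertools
import numpy as np
from scipy.spatial import ConvexHull

def zonotope_vertices(A):
    """Vertices of Z(A) = {A^T x : x in [0,1]^D} for small D and d = 2."""
    D, d = A.shape
    corners = np.array(list(itertools.product([0.0, 1.0], repeat=D)))  # 2^D hypercube corners
    image = corners @ A                   # each row is A^T x for one corner x
    hull = ConvexHull(image)              # the zonotope is the convex hull of the images
    return image[hull.vertices]

# Example: a random 3 x 2 Stiefel matrix and its zonotope outline
rng = np.random.default_rng(0)
A, _ = np.linalg.qr(rng.standard_normal((3, 2)))
print(zonotope_vertices(A))
```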

Fig. 1. Several zonotopes in $\mathcal{Z}(3, 2)$.

For classification without dimensionality reduction, it is shown in [15] that

(10)

This result follows from the fact that $D$-dimensional hypercubes with side of length $\epsilon$ fit as a Cartesian grid into $[0,1]^D$. To find an expression for the $\epsilon$-entropy of the dimensionality-reduced GLS classifier, the same analysis applies and consequently, we need to determine how many $d$-dimensional hypercubes with side of length $\epsilon$ fit into $Z(\mathbf{A})$. The number of small hypercubes that fit inside $Z(\mathbf{A})$ is related to its content.

An upper bound for the content of $Z(\mathbf{A})$ is developed in [63] that is asymptotically of the correct order of magnitude for fixed $d$ as $D$ goes to infinity. Specifically,

(11)

where the content of the $d$-dimensional unit hypersphere and Legendre's gamma function appear in the bound. Based on (11), we find that

(12)

For fixed reduced dimension $d$, the $\epsilon$-entropy (12) increases as a function of the measurement dimension $D$, i.e., the classifier function class is richer for larger measurement dimension with the same reduced dimension. Importantly, it increases as a function of $d$ for fixed $D$.

Substituting the expression (12) into (8), we find that for a fixed measurement dimension $D$, the more the dimensionality is reduced, that is the smaller the value of $d$, the smaller the Rademacher complexity. This is shown in Fig. 2, a plot of the complexity value as a function of $d$ for different values of $D$. Although larger measurement dimension does result in larger complexity, the effect is minor in comparison to the effect of $d$.

Fig. 2. Rademacher average as a function of the reduced dimension $d$ for several values of the measurement dimension $D$.

The more we reduce the dimension, the more we prevent overfitting. However, training error increases as $d$ decreases and the generalization error is related to the sum of the Rademacher complexity and the training error; if we reduce the dimension too much, we end up underfitting the data and the training error component of the generalization error becomes large. There is an optimal reduced dimension that balances the training error and the complexity components of the generalization error.4

D. Consistency

With a training dataset of cardinality $n$ drawn from $p_{\mathbf{x},y}(\mathbf{x}, y)$, a consistent classifier is one whose probability of error converges in the limit as $n$ goes to infinity to the probability of error of the Bayes risk optimal decision rule when both types of classification errors have equal cost.5 For consistency to be at all meaningful, we assume in this analysis that there is a reduced-dimensional statistic $\mathbf{A}^T \mathbf{x}$ so that the optimal Bayes decision rule based on this statistic achieves the same performance as the optimal decision rule based on the complete data $\mathbf{x}$; that is, we assume that there exists at least one $\mathbf{A} \in \mathcal{V}(D, d)$ for which the Bayes decision rule applied to $\mathbf{A}^T \mathbf{x}$ attains the Bayes optimal probability of error, where the Bayes decision rule takes the appropriate dimensional argument and is known. We also assume that the optimization method used in training finds the global optimum. The question is whether

4Note the purpose of generalization bounds in statistical learning theory as stated by Bousquet [65]: “one should not be concerned about the quantitative value of the bound or even about its fundamental form but rather about the terms that appear in the bound. In that respect a useful bound is one which allows to understand which quantities are involved in the learning process. As a result, performance bounds should be used for what they are good for. They should not be used to actually predict the value of the expected error. Indeed, they usually contain prohibitive constants or extra terms that are mostly mathematical artifacts. They should not be used directly as a criterion to optimize since their precise functional form may also be a mathematical artifact. However, they should be used to modify the design of the learning algorithms or to build new algorithms.”

5The Bayes optimal decision rule is a likelihood ratio test involving $p_{\mathbf{x}|y}(\mathbf{x} \mid y = +1)$ and $p_{\mathbf{x}|y}(\mathbf{x} \mid y = -1)$ with threshold equal to the ratio of the class prior probabilities.

for a sequence of classifiers learned from training data, the excess probability of error over the Bayes optimal probability of error does converge in probability to zero. Note that this excess error is a random variable that depends on the data.

The properties of this excess error are affected both by the margin-based loss function and by the classifier function space. Conditions on the loss function necessary for a margin-based classifier to be consistent are given in [13], [14], [66]. A loss function that meets the necessary conditions is termed Fisher consistent in [13]. Common margin-based loss functions including the logistic loss and hinge loss are Fisher consistent.6 Fisher consistency of the loss function is not enough, however, to imply consistency of the classifier overall; the function class must also be analyzed.

We apply [13, Theorem 4.1], which is, in turn, an application of [67, Theorem 1], to show consistency. The theorem is based on the complexity of the classifier function class. In order to apply this theorem, we need to note three things. First, that $\ell$ is a Fisher consistent loss function. Second, that signed distance functions on the zonotope $Z(\mathbf{A})$ are bounded in the supremum norm. Third, that there exists a constant bounding the growth of the $\epsilon$-entropy, which follows from (12). Then, from [13] we have that7

(13)

The dimensionality reduction and classification method is consistent: the excess probability of error goes to zero as $n$ goes to infinity because the right-hand side of (13) goes to zero.

III. DIMENSIONALITY REDUCTION IN TREE-STRUCTURED NETWORKS

As discussed in Section I, a classification paradigm that intelligently reduces the dimensionality of measurements locally at sensors before transmitting them is critical in sensor network settings. In this section, we make use of and appropriately extend the formulation of joint linear dimensionality reduction and classification presented in Section II for this task. For ease of exposition, we begin the discussion by first considering a setup

6The conditions on $\ell$ for it to be Fisher consistent are mainly related to it being such that incorrect classifications incur more loss than correct classifications.

7The notation $X_n = O_P(r_n)$ means that the random variable $X_n = r_n Z_n$, where $Z_n$ is a random variable bounded in probability [68]. Thus, if $r_n$ converges to zero, then $X_n$ converges to zero in probability.

with a single sensor and then come to the general setting with sensors networked according to a tree graph with a fusion center at the root of the tree. Also for simplicity of exposition, we assume that the fusion center does not take measurements, that it is not also a sensor; this assumption is by no means necessary. We make the assumption, as in [32]–[34], that the class labels of the training set are available at the fusion center.

A. Network With Fusion Center and Single Sensor

Consider a network with a single sensor and a fusion center. The sensor measures data vector $\mathbf{x} \in \mathbb{R}^D$ and reduces its dimensionality using $\mathbf{A} \in \mathcal{V}(D, d)$. The sensor transmits the reduced-dimensional $\mathbf{A}^T \mathbf{x}$ to the fusion center, which applies decision rule $\hat{y}(\cdot) = \operatorname{sign}(f(\cdot))$ to obtain a classification for $\mathbf{x}$. Clearly in its operational phase, the linear dimensionality reduction reduces the amount of transmission required from the sensor to the fusion center.

Moreover, the communication required in training depends on the reduced dimension $d$ rather than the dimension of the measurements $D$. The coordinate descent procedure described in Section II-B is naturally implemented in this distributed setting. With $\mathbf{A}$ fixed, the optimization for $f$ occurs at the fusion center. The information needed by the fusion center to perform the optimization for $f$ are the $\mathbf{A}^T \mathbf{x}_j$, the dimensionality-reduced training examples. With $f$ fixed, the optimization for $\mathbf{A}$ occurs at the sensor. Looking at (4), we see that the information required by the sensor from the fusion center to optimize $\mathbf{A}$ includes only the scalar value $y_j \ell'(y_j f(\mathbf{A}^T \mathbf{x}_j))$ and the column vector $\nabla f(\mathbf{A}^T \mathbf{x}_j)$, for $j = 1, \ldots, n$.

Thus the alternating minimizations of the coordinate descent are accompanied by the alternating communication of these messages in the two directions. The more computationally demanding optimization for $f$ (the application of a margin-based classification algorithm) takes place at the fusion center. A computationally simple Stiefel manifold gradient update occurs at the sensor.8 One may ask whether it is more efficient to perform training by just transmitting the full-dimensional measurements to the fusion center. The total communication involved in that case is $nD$ scalar values, whereas with the distributed implementation, the per-iteration total is on the order of $nd$ scalar values, multiplied by the number of coordinate descent iterations. Frequently $D$ is much larger than $d$ (as in an example in Section IV-B) and the number of iterations is typically small (usually less than ten or twelve). In such cases, the distributed implementation provides quite a bit of savings. This scheme extends to the more interesting case of multisensor networks, as we describe next. The transmission savings of training with distributed implementation are further magnified in the multisensor network case.

8The Stiefel manifold constraint requires QR factorization or other orthonormalization which may be prohibitive on certain existing sensor nodes, but as is demonstrated in [69] and references therein, efficient FPGA implementations of QR factorization have been developed and could be integrated into existing or new sensor nodes.
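To make the communication comparison concrete, the following back-of-the-envelope sketch counts scalar values exchanged during training under the centralized and distributed schemes; the per-iteration message sizes follow the description above and the specific sizes are illustrative assumptions, not figures from the paper.

```python
def centralized_training_comm(n, D):
    """Scalars sent if the sensor ships all raw training vectors to the fusion center."""
    return n * D

def distributed_training_comm(n, d, n_iters):
    """Scalars sent under the message-passing scheme: reduced examples up (n*d) and,
    per example, one loss-derivative scalar plus a d-vector of decision-function
    gradients back down (n*(d+1)), repeated each coordinate descent iteration."""
    return n_iters * (n * d + n * (d + 1))

n, D, d, n_iters = 1000, 100, 2, 10                  # illustrative sizes, not from the paper
print(centralized_training_comm(n, D))               # 100000
print(distributed_training_comm(n, d, n_iters))      # 10 * (2000 + 3000) = 50000
```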


B. Multisensor Networks

We now consider networks with $m$ sensors connected in a tree topology with the fusion center at the root. We denote the children of the fusion center as $\mathcal{C}_0$; we also denote the children of sensor $i$ as $\mathcal{C}_i$ and we denote the parent of sensor $i$ as $\pi(i)$. Training data vector $\mathbf{x}_{j,i}$ is measured by sensor $i$.9 The sensor receives dimensionality-reduced measurements from its children, combines them with its own measurements and transmits a dimensionality-reduced version of this combination to its parent. Mathematically, the transmission from sensor $i$ to its parent is $\mathbf{A}_i^T$ applied to the vector formed by stacking the sensor's own measurement $\mathbf{x}_{j,i}$ on top of the messages received from its children:

(14)

where $\mathbf{A}_i$ lies on the Stiefel manifold of the appropriate dimensions.

As an extension to the margin-based classification and linear dimensionality reduction objective (2), we propose the following objective for sensor networks:

(15)

Just as in the single sensor network in which the fusion center needed to receive the dimensionality-reduced message from its child in order to optimize $f$, in the multisensor network the fusion center needs to receive the messages from all of its children in order to optimize $f$. The messages coming from the children of the fusion center are themselves simple linear functions of the messages coming from their children, as given in (14). The same holds down the tree to the leaf sensors. Thus, to gather the information required by the fusion center to optimize $f$, a message-passing sweep occurs from the leaf nodes in the tree up to the root.

For fixed $f$ and optimization of the $\mathbf{A}_i$, we also see message-passing, this time sweeping back from the fusion center toward the leaves, which generalizes what occurs in the single sensor network. Before finding the partial derivative of the objective (15) with respect to $\mathbf{A}_i$, let us first introduce further notation. We slice the vector received by the fusion center into blocks, one block for each child of the fusion center.

9In real-world situations, there is no reason to expect the underlying likelihood functions for different sensors $i = 1, \ldots, m$ to be identical. Different sensors will certainly be in different locations and may even be measuring different modalities of different dimensions with different amounts of noise.

We slice the decision function gradient conformably, so that each slice corresponds to the dimensions transmitted by a particular child of the fusion center. Additionally, let

(16)

Then, the matrix partial derivative of the objective function (15) with respect to $\mathbf{A}_i$ is

(17)

Like in the single sensor network, the information required at sensor $i$ to optimize $\mathbf{A}_i$ that it does not already have consists of a scalar and a vector. The scalar value is common throughout the network. The vector message has length equal to the number of dimensions that the sensor transmits and is received from the sensor's parent. As seen in (16), the message a sensor passes on to its child is a simple linear function of the message received from its parent. To optimize all of the $\mathbf{A}_i$, a message-passing sweep starting from the fusion center and going down to the leaves is required. Simple gradient descent along Stiefel manifold geodesics is then performed locally at each sensor. Overall, the coordinate descent training proceeds along with the passing of upward and downward messages, which are functions of incoming messages as seen in (14) and (16).
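The upward sweep of (14) can be written as a recursion over the tree; the node class and field names below are illustrative assumptions rather than the paper's notation.

```python
import numpy as np

class SensorNode:
    """A sensor in a tree-structured fusion network (illustrative data structure)."""
    def __init__(self, x, A, children=()):
        self.x = x                     # this sensor's own measurement vector
        self.A = A                     # Stiefel matrix: (len(x) + sum of child output dims) x d_i
        self.children = list(children)

    def upward_message(self):
        """The transmission of (14): stack own data with child messages, then project."""
        stacked = np.concatenate([self.x] + [c.upward_message() for c in self.children])
        return self.A.T @ stacked      # d_i-dimensional message sent to the parent

def stiefel(D, d, seed):
    Q, _ = np.linalg.qr(np.random.default_rng(seed).standard_normal((D, d)))
    return Q

# Two leaf sensors feeding one parent sensor, which feeds the fusion center.
leaf1 = SensorNode(np.ones(4), stiefel(4, 1, 0))
leaf2 = SensorNode(np.ones(3), stiefel(3, 1, 1))
parent = SensorNode(np.ones(5), stiefel(5 + 1 + 1, 2, 2), children=[leaf1, leaf2])
z = parent.upward_message()            # what the fusion center receives from this branch
print(z.shape)                          # (2,)
```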

C. Consistency and Complexity

The data vector that is received by the fusion center is reduced from the total measurement dimension of all the sensors to the total of the reduced dimensions transmitted by the children of the fusion center. The fact that the composition of linear dimensionality reduction by two matrices on the Stiefel manifold can be represented by a single matrix on the Stiefel manifold leads to the observation that the dimensionality reduction performed by the sensor network has an equivalent overall matrix. However, this equivalent matrix has further constraints than just the Stiefel manifold constraint due to the topology of the network. For example, the equivalent matrix of the network in which the fusion center has two child sensors must be block-diagonal with two blocks.

Thus in the tree-structured sensor network, there is an equivalent dimensionality reduction matrix constrained to lie in a subset of the Stiefel manifold determined by the tree topology. The consistency analysis of Section II-D holds under the assumption that there exists a matrix in this constrained subset for which the Bayes decision rule applied to the reduced statistic achieves the Bayes optimal probability of error.

The constrained set of dimensionality reduction matrices may have a smaller maximum zonotope content than the full Stiefel manifold, which would in turn mean a smaller Rademacher complexity. The fusion center receives the Cartesian product of the dimensionality-reduced data from its children. The content of the Cartesian product is the product of the individual contents and is thus less than or equal to the bound (11) for the equivalent overall dimensions. A more refined upper bound may be developed based on the specifics of the tree topology.

The tree-structured network has smaller Rademacher complexity than a dimensionality-reduced margin-based classifier of the same overall dimensions due to further constraints on the classifier function space resulting from the network structure. However, similar to $D$ having a minor effect on complexity as seen in Fig. 2, this smaller complexity for the network-constrained system is not much less than the complexity for the system without network constraints. The network constraints, however, may increase the training error. The generalization error expression (7), being composed of both the training error and the complexity, increases with network constraints due to increases in training error that are not offset by decreases in complexity, resulting in worse classification performance. However, for sensor networks, the performance criterion of interest is generally a combination of generalization error and power expenditure in communication.

D. Wireless Sensor Network Physical Model

Thus far in the section, we describe linear dimensionality reduction for margin-based classification in sensor networks abstractly, without considering the physical implementation or specific tree topologies. Here we set forth a specific physical model for wireless sensor networks that is used in Section IV-C. Consider $m$ sensors and a fusion center in the plane that communicate wirelessly. The distance between sensor $i$ and its parent is $r_i$ and the power required for communication from sensor $i$ to its parent is $d_i r_i^2$, where as before, $d_i$ is the reduced dimension output by the sensor. The model arises from the common assumption of signal attenuation according to the square of the distance [70].10 The total transmission power used by the network is then

$$ \text{transmission power} = \sum_{i=1}^{m} d_i r_i^2. \qquad (18) $$

We consider three network structures: parallel architecture, serial or tandem architecture and binary tree architecture. In the parallel architecture, all sensors are direct children of the fusion center. In the serial architecture, the fusion center has a single child, which in turn has a single child and so on. In the binary tree architecture, the fusion center has two children, each of whom has two children on down the tree. When the number of sensors is such that a perfect binary tree is not produced, i.e., $m$ is not a power of two, the bottom level of the tree remains partially filled.

The sensor and fusion center locations are modeled as follows. The fusion center is fixed at the center of a circle with unit area and the sensor locations are uniformly distributed over that circle. Given the sensor node locations and desired network topology, we assume that parent-child links and the corresponding $r_i$ are chosen to minimize (18). In a parallel network, the links are fixed with the fusion center as the parent of all sensors and thus there is no parent-child link optimization to be performed. Exact minimization of (18) for the other architectures may not be tractable in deployed ad hoc wireless sensor networks because it involves solving a version of the traveling salesman problem for the serial architecture and a version of the minimum spanning tree problem for the binary tree architecture. Nevertheless, we assume that the minimization has been performed; we comment on this assumption later in the paper. For the parallel architecture, the distances are [71]

(19)

where sensor $i$ is the $i$th closest sensor to the fusion center. There is no closed form expression for the corresponding distances in the serial or binary tree architectures, but we estimate them through Monte Carlo simulation.

To fully specify the network, we must also set the reduced di-mensions of the sensors . The choice we make is to set pro-portional to the number of descendants of sensor plus one foritself. This choice implies that all are equal in the parallel net-work and that is proportional to in the serial networkso that the number of dimensions passed up the chain to the fu-sion center increases the closer one gets to the fusion center. Wewill see that with this choice of , all three topologies have es-sentially the same classification performance. This is not, how-ever, generally true for different assignments; for example, ifwe take all to be equal in the serial network, the classificationperformance is quite poor. The imbalance in values amongdifferent nodes is a shortcoming of our approach because nodes

10The model d_i^α for values of α other than two could also be considered.


closer to the fusion center consume energy more quickly; future work may consider adapting aggregation services with balanced loads [72], which have been used for distributed PCA, to our problem formulation.
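The descendant-proportional assignment of reduced dimensions can be computed from the parent representation used in the earlier sketch; the code below is our own illustration (the helper name and the base dimension per subtree node are assumptions).

```python
def reduced_dimensions(parent, base_dim=1):
    """Set r_i proportional to (number of descendants of sensor i) + 1,
    i.e., base_dim dimensions per node in sensor i's subtree.
    parent[i] == -1 marks children of the fusion center."""
    n = len(parent)
    subtree = [1] * n                       # each node counts itself
    # Accumulate child counts; a reverse sweep suffices because the
    # constructions above always satisfy parent[i] < i.
    for i in reversed(range(n)):
        if parent[i] >= 0:
            subtree[parent[i]] += subtree[i]
    return [base_dim * s for s in subtree]

print(reduced_dimensions([-1, -1, 0, 0, 1, 1, 2]))  # [4, 3, 2, 1, 1, 1, 1]
```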

IV. EXAMPLES AND RESULTS

With high-dimensional data, dimensionality reduction aids in visualization and human interpretation, allows the identification of important data components, and reduces the computational and memory requirements of further analysis. This section presents an illustrative example of the proposed joint dimensionality reduction and margin-based classification method. The key motivation for dimensionality reduction here is that it prevents overfitting, which we demonstrate in this section on several datasets.

Also in this section, we consider wireless sensor networks and look at classification performance as a function of transmission power expended. The phenomenon of overfitting seen in the centralized case has an important counterpart and implication for wireless sensor networks: increasing the total allowed transmission power—manifested either by increases in the number of sensors or increases in the number of transmitted dimensions per sensor—does not necessarily result in improved classification performance. The examples in this section illustrate several tradeoffs and suggest further lines of research.

A. Illustrative Example

We now present an illustrative example showing the operation of the classification-linear dimensionality reduction coordinate descent for training from a synthetic dataset. The dataset contains 1000 measurement vectors, of which 502 carry one class label and 498 the other. The dimensionality of the measurements is eight. The first two dimensions of the data are informative for classification and the remaining six are completely uninformative. In particular, an ellipse in the plane of the first two dimensions separates the two classes as shown in Fig. 3(a). The values in the other six dimensions are independent samples from an identical Gaussian distribution without regard for class label. Linear dimensionality reduction to two dimensions is sought. Note that the two class-conditional distributions have the same mean and are not Gaussian and thus not very amenable to FDA. Fig. 4 shows the matrices obtained using PCA and FDA, visualized using the corresponding zonotopes. Neither PCA nor FDA is successful at recovering the informative subspace: the plane of the first two dimensions.
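A synthetic dataset with this structure can be generated along the following lines; this is our own sketch, and the ellipse semi-axes, sampling ranges, and resulting class proportions are illustrative choices rather than the values used to produce Fig. 3.

```python
import numpy as np

def make_elliptical_dataset(n=1000, noise_dims=6, a=2.0, b=1.0, seed=0):
    """Label is +1 inside an ellipse in the plane of the first two
    dimensions and -1 outside; the remaining dimensions are
    label-independent Gaussian noise."""
    rng = np.random.default_rng(seed)
    x12 = rng.uniform(-3.0, 3.0, size=(n, 2))
    inside = (x12[:, 0] / a) ** 2 + (x12[:, 1] / b) ** 2 <= 1.0
    labels = np.where(inside, 1, -1)
    noise = rng.standard_normal(size=(n, noise_dims))
    return np.hstack([x12, noise]), labels

X, y = make_elliptical_dataset()
print(X.shape, np.bincount((y + 1) // 2))   # (1000, 8) and the two class counts
```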

We run our coordinate descent minimization of (2) to find both a projection matrix and a decision boundary using two different margin-based classifiers: the SVM with radial basis function kernel and the geometric level set (GLS) classifier with the logistic loss function. The projection matrix is randomly initialized. At convergence, the optimization procedure ought to give a matrix with all zeroes in the bottom six rows, corresponding to a zonotope that is a possibly rotated square, and an elliptical decision boundary. Fig. 3(b) shows the decision boundary resulting from the first optimization of the decision function using the GLS classifier with the random initialization of the projection matrix, before the first gradient descent step on the Stiefel manifold. Fig. 3(c)–(e) shows intermediate iterations, and Fig. 3(f) shows the final learned classifier and linear

Fig. 3. Illustrative example. Magenta markers indicate one class label and black markers the other. The blue line is the classifier decision boundary. The green line outlines the zonotope generated by the projection matrix. (a) The first two measurement dimensions. (b) Random initialization of the projection matrix and first decision boundary from the GLS classifier. (c)–(e) Intermediate iterations. (f) Final projection matrix and decision boundary from the GLS classifier.

Fig. 4. Illustrative example. Magenta markers indicate one class label and black markers the other. The green line outlines the zonotope generated by the projection matrix obtained from (a) PCA and (b) FDA.

dimensionality reduction matrix. As the coordinate descent progresses, the zonotope becomes more like a square, i.e., it aligns with the plane of the first two dimensions, and the decision boundary becomes more like an ellipse. Fig. 5 shows the operation of the coordinate descent with the SVM. Here also, the zonotope becomes more like a square and the decision boundary more like an ellipse throughout the minimization.

The random initial matrix and the final matrix solutions for the GLS classifier and the SVM are given in Table I. What we would want for this example is that the correct two-dimensional projection is identified and, assuming that it is, that the decision boundary is essentially elliptical. First, note that if the correct projection is identified, we expect the last six rows of the final matrix to be small compared to the first two rows


Fig. 5. Illustrative example. Magenta markers indicate one class label and black markers the other. The blue line is the classifier decision boundary. The green line outlines the zonotope generated by the projection matrix. (a) Random initialization of the projection matrix and first decision boundary from the SVM. (b)–(c) Intermediate iterations. (d) Final projection matrix and decision boundary from the SVM.

TABLE I. INITIAL AND FINAL PROJECTION MATRICES IN ILLUSTRATIVE EXAMPLE

and the corresponding zonotopes to be nearly square. Since rotations and reflections of the space onto which we project are inconsequential, we do not necessarily expect the first two rows of the final matrix to form the identity matrix, nor do we expect the orientation of the nearly square zonotopes in Figs. 3(f) and 5(d) to line up with the coordinate axes. The results shown in Figs. 3(f), 5(d) and Table I reflect these desired characteristics. Given these final projections, we see that the resulting decision boundaries are indeed nearly elliptical.11 As this example indicates, the procedure is capable of making large changes to the projection matrix.
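One way to quantify how well the informative plane is recovered, given a learned matrix with orthonormal columns, is to examine the norm of its bottom rows together with the principal angles between its column space and the plane of the first two coordinates. The sketch below is ours, not part of the original experiments; the matrix shape and the small test case are assumptions.

```python
import numpy as np

def subspace_recovery_error(A):
    """Diagnostics for a learned D x 2 projection matrix A with
    orthonormal columns: the Frobenius norm of its last D-2 rows (near
    zero if the informative plane is recovered) and the largest principal
    angle, in degrees, between range(A) and that plane."""
    bottom_norm = np.linalg.norm(A[2:, :])
    E = np.zeros_like(A); E[0, 0] = 1.0; E[1, 1] = 1.0   # basis of the true plane
    # Singular values of E^T A are the cosines of the principal angles.
    cosines = np.clip(np.linalg.svd(E.T @ A, compute_uv=False), -1.0, 1.0)
    return bottom_norm, float(np.degrees(np.arccos(cosines).max()))

# A projection exactly onto the informative plane, up to an in-plane rotation:
theta = 0.3
R = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
A = np.vstack([R, np.zeros((6, 2))])
print(subspace_recovery_error(A))   # (0.0, 0.0)
```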

B. Classification Error for Different Reduced Dimensions

We present experimental classification results in this section on several datasets from the UCI machine learning repository [73]. The joint linear dimensionality reduction and margin-based classification method proposed in Section II is run for different values of the reduced dimension, showing that performing dimensionality reduction does in fact improve classification performance in comparison to not performing dimensionality reduction. The margin-based classifier that is

11The curved piece of the decision boundary in the top right corner of the domain in Fig. 3(f) is an artifact of geometric level sets and does not affect classification performance.

used is the SVM with radial basis function kernel and default parameter settings from the Matlab bioinformatics toolbox.

First, we look at training error and test error12 as a function of the reduced dimension on five different datasets from varied application domains: Wisconsin diagnostic breast cancer

, ionosphere, sonar, arrhythmia (after preprocessing to remove dimensions containing missing values) and arcene. On the first four datasets, we look at the tenfold cross-validation training and test errors. The arcene dataset has separate training and validation sets, which we employ for these purposes.

The tenfold cross-validation training error is shown with blue triangle markers and the tenfold cross-validation test error is shown with red circle markers for the ionosphere dataset in Fig. 6(a). The plot also contains error bars showing one standard deviation above and below the average error over the ten folds. In Fig. 6(b), the test error for the joint minimization is compared to the test error if the linear dimensionality reduction is first performed using PCA, FDA, information preserving component analysis [44], or sufficient dimension reduction (structured principal fitted components [74]), followed by classification with the kernel SVM. Fig. 7 shows tenfold cross-validation training and test error for other datasets. Fig. 8 gives the training and test performance for the arcene dataset. For the Wisconsin diagnostic breast cancer, ionosphere and sonar datasets, we show classification performance for all possible reduced dimensions. For the arrhythmia and arcene datasets, we show results only up to a limited reduced dimension.
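Readers who wish to reproduce the flavor of the separate-dimensionality-reduction baselines can sweep the reduced dimension for a PCA-then-RBF-SVM pipeline under tenfold cross-validation, as in the sketch below. This is only the separate-PCA comparison curve, not the paper's joint minimization; scikit-learn and its default hyperparameters are assumptions standing in for the toolbox defaults.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def pca_svm_error_curve(X, y, dims, folds=10):
    """Cross-validated test error of a PCA -> RBF-SVM baseline as a
    function of the reduced dimension."""
    errors = []
    for d in dims:
        clf = make_pipeline(StandardScaler(), PCA(n_components=d), SVC(kernel="rbf"))
        acc = cross_val_score(clf, X, y, cv=folds, scoring="accuracy")
        errors.append(1.0 - acc.mean())
    return np.array(errors)

# Usage with hypothetical arrays X, y loaded from a UCI dataset:
# errs = pca_svm_error_curve(X, y, dims=range(1, X.shape[1] + 1))
```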

The first thing to notice in the plots is that the training error quickly converges to zero with an increase in the reduced dimension. The margin-based classifier with linear dimensionality reduction perfectly separates the training set when the reduced dimension is sufficiently large. However, this perfect separation does not carry over to the test error—the error in which we are most interested. In all of the datasets, the test error first decreases as we increase the reduced dimension, but then starts increasing. There is an intermediate optimal value of the reduced dimension for each dataset.

This test error behavior is evidence of overfitting when the reduced dimension is too large. Dimensionality reduction improves classification performance on unseen samples by preventing overfitting. Remarkably, even the ten thousand-dimensional measurements in the arcene dataset can be linearly reduced to twenty dimensions. In the ionosphere dataset test error comparison plot, it can be seen that the minimum test error is smaller with the joint minimization than when doing dimensionality reduction separately with PCA, FDA, information preserving component analysis, or sufficient dimension reduction. Moreover, this minimum test error occurs at a smaller reduced dimensionality than the minima for PCA, FDA and sufficient dimension reduction. Comparisons on other datasets are similar.

The classification error as a function of the reduced dimension using our new joint linear dimensionality reduction and margin-based classification

12Training error is the misclassification associated with the data used to learn the Stiefel manifold matrix and decision function. Test error is the misclassification associated with data samples that were not used in training and is a surrogate for generalization error.


Fig. 6. (a) Tenfold cross-validation training error (blue triangle markers) and test error (red circle markers) on ionosphere dataset. Error bars indicate standard deviation over the ten folds. (b) Tenfold cross-validation test error on ionosphere dataset using PCA (dashed and dotted cyan line), FDA (dashed magenta line), information preserving component analysis (dotted blue line), sufficient dimension reduction (green line with markers) and joint minimization (solid red line). Error bars are not included because they would make the plot unreadable, but note that standard deviations for all five methods are approximately the same.

method matches the structural risk minimization principle. Rademacher complexity analysis supporting these empirical findings is presented in Section II-C.

C. Classification Error for Different Networks

Given the sensor network model of Section III-D, we look at classification performance for the three different network architectures with different amounts of transmission power. Different transmission powers are obtained by varying the number of sensors and scaling the reduced dimensions. We emulate data coming from a sensor network by slicing the dimensions of the ionosphere, sonar and arcene datasets and assigning the different dimensions to different sensors. With a fixed number of measured dimensions per sensor for each dataset, we assign the dimensions in the order given in the UCI machine learning repository, so the first sensor "measures" the first block of dimensions listed, the second sensor "measures" the next block, and so on. The dimensions are not ordered according to relevance for classification in any way.
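The slicing of dimensions into per-sensor blocks can be emulated as follows; this is our own sketch, and the block size is an illustrative parameter rather than the value used in the experiments.

```python
import numpy as np

def slice_dimensions(X, dims_per_sensor):
    """Emulate a sensor network from a flat dataset: sensor 0 'measures'
    the first dims_per_sensor columns of X (in the order given in the
    repository), sensor 1 the next block, and so on.  A short final
    block is kept as a smaller last sensor."""
    return [X[:, i:i + dims_per_sensor]
            for i in range(0, X.shape[1], dims_per_sensor)]

X = np.arange(20).reshape(2, 10)                   # toy data: 2 samples, 10 dimensions
print([b.shape for b in slice_dimensions(X, 3)])   # [(2, 3), (2, 3), (2, 3), (2, 1)]
```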

We plot results for the ionosphere dataset in Fig. 9. In Fig. 9(a), we plot tenfold cross-validation training and test error obtained from the algorithm described in Section III-B with the parallel network as a function of transmission power. Each training and test error pair corresponds to a different number of sensors

and a different scaling of the reduced dimensions. In Section IV-B, we plotted classification performance as a function of the reduced

Fig. 7. Tenfold cross-validation training error (blue triangle markers) and test error (red circle markers) on (a) Wisconsin diagnostic breast cancer, (b) sonar, and (c) arrhythmia datasets. Error bars indicate standard deviation over the ten folds.

Fig. 8. Training error (blue triangle markers) and test error (red circle markers) on arcene dataset.

dimension, but here the horizontal axis is transmission power, taking the distance between sensor nodes into account. As in Section IV-B, the phenomenon of overfitting is quite apparent.


Fig. 9. Tenfold cross-validation training error (blue triangle markers) and test error (red circle markers) on ionosphere dataset for (a) parallel, (b) serial, and (c) binary tree network architectures.

In Fig. 9(b), classification error is plotted as a function of transmission power for the serial architecture. The points in the plot are for different numbers of sensors and different scalings of the reduced dimensions.

The classification error values in Fig. 9(b) are quite similar to the ones for the parallel case.13

The plot for the parallel architecture appearing to be a horizontally compressed version of the serial architecture plot indicates that, to achieve those similar classification performances, more transmission power is required by the serial architecture. Although the distances between parents and children tend to be smaller in the serial architecture, the chosen reduced dimensions are larger closer to the fusion center, leading to higher transmission power.

The binary tree architecture’s classification error plot is given in Fig. 9(c). The training and test error values are similar to

13In fact, they are the same for the five pairs of points with a single sensor, because the parallel and serial networks are the same when there is a single sensor.

the other two architectures.14 The transmission power needed to achieve the given classification errors is similar to that of the parallel architecture and less than that of the serial architecture. Among the three architectures with the reduced dimensions assigned as described in Section III-D, all have approximately the same classification performance, but the serial network uses more power.

The same experiments are repeated for the sonar and arcene datasets with plots given in Figs. 10 and 11. For the sonar dataset, the number of sensors varies from one to eleven and the reduced dimension of leaf nodes from one to five. For the arcene dataset, the number of sensors varies from one to ten and the reduced dimension of leaf nodes from one to fifteen. The same trends can be observed as in the ionosphere dataset; similar plots are produced for other datasets such as Wisconsin diagnostic breast cancer and arrhythmia. All three network topologies produce similar classification errors, but the serial network uses more power.

Some overall observations for wireless sensor networks are the following. There exist optimal parameters of the network with a finite number of sensors and some dimensionality reduction. One may be tempted to think that deploying more sensors always helps classification performance since the total number of measured dimensions increases, but we find that this is not generally true. For a fixed number of samples, once there are enough sensors to fit the data, adding more sensors leads to overfitting and a degradation of test performance. That a small number of sensors, which perform dimensionality reduction, yields optimal classification performance is good from the perspective of resource usage. Among different possible choices of network architectures, we have compared three particular choices. Others are certainly possible, including the investigated topologies but with different reduced-dimension proportions. For the chosen proportions, all three network topologies have essentially the same classification performance, but this is not true for other choices.

In this empirical investigation of classification performance versus resource usage, the main observation is that the two are not at odds. The decrease of resource usage is coincident with the prevention of overfitting, which leads to improved classification performance. Oftentimes there is a tradeoff between resource usage and performance, but that is not the case in the overfitting regime. Additionally, among the network architectures compared, the parallel and binary tree architectures use less power in communication than the serial architecture for equivalent classification performance. The plotted transmission power values, however, are based on choosing the parent-child links to exactly minimize (18); in practice, this minimization will only be approximate for the binary tree architecture and will require a certain amount of communication overhead. Therefore, the parallel architecture, which requires no optimization, is recommended for this application. This new distributed dimensionality reduction formulation and empirical study suggest a direction for future research, namely the problem of finding the number of sensors, the network structure, and the set of reduced dimensions that optimize generalization error in classification for a given transmission power budget and a given number of training samples.

14The binary tree is the same as the parallel network for one or two sensors and the same as the serial network for a single sensor.


Fig. 10. Tenfold cross-validation training error (blue triangle markers) and test error (red circle markers) on sonar dataset for (a) parallel, (b) serial, and (c) binary tree network architectures.

D. Spatially Distributed Sensor Node Data

As a confirmation of the results given for emulated sensor network data in Section IV-C, here we present results on two datasets arising from spatially distributed sensor nodes. The first dataset is based on sensor measurements collected at the Intel Berkeley Research Laboratory in 2004. The second dataset is based on sensor measurements collected at the Army Research Laboratory in 2007 [75].

The Intel Berkeley dataset as available contains temperature, relative humidity and light measurements for 54 sensors over more than a month. A classification task is required for the methodology developed in this paper and thus we define two classes based on the light measurements, dark and bright. The dark class corresponds to the average light being less than 125 lx and the bright class to greater than 125 lx. Our formulation requires a correspondence among measurements from different

Fig. 11. Training error (blue triangle markers) and test error (red circle markers) on arcene dataset for (a) parallel, (b) serial, and (c) binary tree network architectures.

sensors in order to define a single sample; the sensor measurements are time-stamped with an epoch number such that measurements from different sensors with the same epoch number correspond to the same time. However, each epoch number corresponds to far fewer than 54 sensors. Thus, we take length-60 blocks of epoch numbers and consider all measurements within a block to correspond to the same time. We take the first reading if a block contains more than one reading from the same sensor. Even with this blocking, if we insist that a sample needs data from all 54 sensors, we obtain very few samples. Thus, we only consider 12 sensors, numbered 1, 2, 3, 4, 6, 31, 32, 33, 34, 35, 36 and 37 in the dataset. With such processing, we obtain the set of samples used below.
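A sketch of this preprocessing follows; the column names (epoch, moteid, temperature, humidity, light), the pandas usage, and the decision to drop blocks missing any retained sensor are our assumptions, not a specification of the original processing.

```python
import pandas as pd

SENSORS = [1, 2, 3, 4, 6, 31, 32, 33, 34, 35, 36, 37]

def build_samples(readings, block_len=60, light_threshold=125.0):
    """Form one sample per length-60 block of epoch numbers: the first
    temperature/humidity reading of each retained sensor in the block,
    labeled bright (+1) or dark (-1) by the block's mean light level."""
    df = readings[readings["moteid"].isin(SENSORS)].copy()
    df["block"] = df["epoch"] // block_len
    first = df.groupby(["block", "moteid"], as_index=False).first()
    wide = first.pivot(index="block", columns="moteid",
                       values=["temperature", "humidity"]).dropna()
    light = df.groupby("block")["light"].mean().reindex(wide.index)
    labels = (light > light_threshold).astype(int) * 2 - 1
    return wide, labels
```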

Spatial locations of the sensors are given. For the network structure, we consider a fusion center located in the center of the sensors and links between nodes according to the Euclidean minimum spanning tree with the fusion center at the root. We


Fig. 12. Training error (blue triangle markers) and test error (red circle markers) on (a) Intel Berkeley dataset and (b) Army Research Laboratory dataset.

train on the first quarter of the samples containing temperature and relative humidity measurements and test on the latter three quarters of the samples, varying the number of sensors and the

reduced-dimension scaling. The training and test errors as a function of total transmission power in the network are given in Fig. 12(a). As in previous results, we see the effects of overfitting. An intermediate transmission power level is optimal for classification performance even with spatially distributed sensor node data.
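The rooted minimum spanning tree construction can be sketched as follows; this is our own code using Prim's algorithm, and placing the fusion center at the centroid of the sensors is an assumption standing in for "the center of the sensors."

```python
import numpy as np

def mst_parents(sensor_xy):
    """Euclidean minimum spanning tree over the sensors and a fusion
    center placed at their centroid, returned as parent indices with
    the fusion center (index -1) as the root."""
    pts = np.vstack([np.mean(sensor_xy, axis=0, keepdims=True), sensor_xy])
    n = len(pts)
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    best_link = {i: (dist[0, i], 0) for i in range(1, n)}  # grow from node 0
    parent = [-2] * n                                      # -2 = not yet attached
    while best_link:
        i = min(best_link, key=lambda j: best_link[j][0])  # cheapest attachment
        parent[i] = best_link.pop(i)[1]
        for j in best_link:
            if dist[i, j] < best_link[j][0]:
                best_link[j] = (dist[i, j], i)
    # Re-index: node 0 is the fusion center; sensors are nodes 1..n-1.
    return [p - 1 for p in parent[1:]]  # parent == -1 means the fusion center

xy = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 0.1]])
print(mst_parents(xy))  # [-1, -1, 1]: the third sensor attaches to the second
```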

The Army Research Laboratory data consist of sensor nodes that take four acoustic, three seismic, one electric field and four passive infrared measurements. Measurements are taken during the dropping of a 14-pound steel cylinder from nine inches above the ground and during no significant human activity. The cylinder dropping happens at various spatial locations in relation to the sensors. In this dataset, we have 200 samples of cylinder dropping and 200 samples of no activity. We train on the first half of the samples and test on the remaining samples. The fusion center is again placed in the center of the sensors and a minimum spanning tree network is used. Training error and test error are plotted in Fig. 12(b) for different numbers of sensors and different scalings. Again, we see that an intermediate level of transmission power is optimal for classification test error, with overfitting for large transmission powers.

V. CONCLUSION

In this paper, we have formulated linear dimensionality reduction driven by the objective of margin-based classification. We have developed an optimization approach that involves

alternation between two minimizations: one to update a classifier decision function and the other to update a matrix on the Stiefel manifold. We have both analytically and empirically looked at the phenomenon of overfitting: analytically through the Rademacher complexity and empirically through experiments on several real datasets, illustrating that dimensionality reduction is an important component in improving classification accuracy. We have also analytically characterized the consistency of the dimensionality-reduced classifier. We have described how our proposed optimization scheme can be distributed in a network containing a single sensor through a message-passing approach, with the classifier decision function updated at the fusion center and the dimensionality reduction matrix updated at the sensor. Additionally, we have extended the formulation to tree-structured fusion networks.

Papers such as [32] and [34] have advocated nonparametric learning, of which margin-based classification is a subset, for inference in distributed settings such as wireless sensor networks. Reducing the amount of communication is an important consideration in these settings, which we have addressed in this paper through a joint linear dimensionality reduction and margin-based classification method applicable to networks in which sensors measure more than one variable. Reducing communication is often associated with a degradation in performance, but in this application that is not the case in the regime where dimensionality reduction prevents overfitting. Thus, dimensionality reduction is important for two distinct reasons: reducing the amount of resources consumed and obtaining good generalization.

ACKNOWLEDGMENT

The authors would like to thank J. H. G. Dauwels, J. W. Fisher, III, and S. R. Sanghavi for valuable discussions; P. Bodik, W. Hong, C. Guestrin, S. Madden, M. Paskin, and R. Thibaux for collecting the Intel Berkeley data; T. Damarla, S. G. Iyengar, and A. Subramanian for furnishing the Army Research Laboratory data; K. M. Carter, R. Raich, and A. O. Hero, III, for information preserving component analysis software; and R. D. Cook, L. Forzani, and D. Tomassi for sufficient dimension reduction software.

REFERENCES

[1] K. R. Varshney and A. S. Willsky, "Learning dimensionality-reduced classifiers for information fusion," in Proc. Int. Conf. Inf. Fusion, Seattle, WA, Jul. 2009, pp. 1881–1888.
[2] K. R. Varshney, "Frugal hypothesis testing and classification," Ph.D. thesis, Mass. Inst. Technol., Cambridge, MA, 2010.
[3] M. Çetin, L. Chen, J. W. Fisher, III, A. T. Ihler, R. L. Moses, M. J. Wainwright, and A. S. Willsky, "Distributed fusion in sensor networks," IEEE Signal Process. Mag., vol. 23, no. 4, pp. 42–55, Jul. 2006.
[4] G. Werner-Allen, K. Lorincz, J. Johnson, J. Lees, and M. Welsh, "Fidelity and yield in a volcano monitoring sensor network," in Proc. USENIX Symp. Operat. Syst. Des. Implement., Seattle, WA, Nov. 2006, pp. 381–396.
[5] L. Zong, J. Houser, and T. R. Damarla, "Multi-modal unattended ground sensor (MMUGS)," Proc. SPIE, vol. 6231, p. 623118, Apr. 2006.
[6] Z. Zhu and T. S. Huang, Multimodal Surveillance: Sensors, Algorithms, and Systems. Boston, MA: Artech House, 2007.
[7] H. L. Van Trees, Detection, Estimation, and Modulation Theory. New York: Wiley, 1968.
[8] V. N. Vapnik, The Nature of Statistical Learning Theory. New York: Springer, 1995.


[9] L. Zwald, R. Vert, G. Blanchard, and P. Massart, "Kernel projection machine: A new tool for pattern recognition," in Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2005, vol. 17, pp. 1649–1656.
[10] S. Mosci, L. Rosasco, and A. Verri, "Dimensionality reduction and generalization," in Proc. Int. Conf. Mach. Learn., Corvallis, OR, Jun. 2007, pp. 657–664.
[11] G. Blanchard and L. Zwald, "Finite-dimensional projection for classification and statistical learning," IEEE Trans. Inf. Theory, vol. 54, no. 9, pp. 4169–4182, Sep. 2008.
[12] A. Srivastava and X. Liu, "Tools for application-driven linear dimension reduction," Neurocomput., vol. 67, pp. 136–160, Aug. 2005.
[13] Y. Lin, "A note on margin-based loss functions in classification," Stat. Probabil. Lett., vol. 68, no. 1, pp. 73–82, Jun. 2004.
[14] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe, "Convexity, classification, and risk bounds," J. Amer. Statist. Assoc., vol. 101, no. 473, pp. 138–156, Mar. 2006.
[15] K. R. Varshney and A. S. Willsky, "Classification using geometric level sets," J. Mach. Learn. Res., vol. 11, pp. 491–516, Feb. 2010.
[16] A. M. Martínez and A. C. Kak, "PCA versus LDA," IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 2, pp. 228–233, Feb. 2001.
[17] J. Bi, K. P. Bennett, M. Embrechts, C. M. Breneman, and M. Song, "Dimensionality reduction via sparse support vector machines," J. Mach. Learn. Res., vol. 3, pp. 1229–1243, Mar. 2003.
[18] B. Krishnapuram, A. J. Hartemink, L. Carin, and M. A. T. Figueiredo, "A Bayesian approach to joint feature selection and classifier design," IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 9, pp. 1105–1111, Sep. 2004.
[19] L. Wolf and A. Shashua, "Feature selection for unsupervised and supervised inference: The emergence of sparsity in a weight-based approach," J. Mach. Learn. Res., vol. 6, pp. 1855–1887, Nov. 2005.
[20] V. Y. F. Tan, S. Sanghavi, J. W. Fisher, III, and A. S. Willsky, "Learning graphical models for hypothesis testing and classification," IEEE Trans. Signal Process., vol. 58, no. 11, pp. 5481–5495, Nov. 2010.
[21] K. Huang and S. Aviyente, "Sparse representation for signal classification," in Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2007, vol. 19, pp. 609–616.
[22] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, "Discriminative learned dictionaries for local image analysis," in Proc. IEEE CS Conf. Comput. Vis. Pattern Recogn., Anchorage, AK, Jun. 2008.
[23] D. M. Blei and J. D. McAuliffe, "Supervised topic models," in Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2008, vol. 20, pp. 121–128.
[24] S. Lacoste-Julien, F. Sha, and M. I. Jordan, "DiscLDA: Discriminative learning for dimensionality reduction and classification," in Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2009, vol. 21, pp. 897–904.
[25] S. Lazebnik and M. Raginsky, "Supervised learning of quantizer codebooks by information loss minimization," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 7, pp. 1294–1309, Jul. 2009.
[26] V. Koltchinskii, "Rademacher penalties and structural risk minimization," IEEE Trans. Inf. Theory, vol. 47, no. 5, pp. 1902–1914, Jul. 2001.
[27] P. L. Bartlett and S. Mendelson, "Rademacher and Gaussian complexities: Risk bounds and structural results," J. Mach. Learn. Res., vol. 3, pp. 463–482, Nov. 2002.
[28] R. R. Tenney and N. R. Sandell, Jr., "Detection with distributed sensors," IEEE Trans. Aerosp. Electron. Syst., vol. AES-17, no. 4, pp. 501–510, Jul. 1981.
[29] J. N. Tsitsiklis, "Decentralized detection," Lab. Inf. Decision Syst., Mass. Inst. Technol., Cambridge, MA, Tech. Rep. P-1913, Sep. 1989.
[30] P. K. Varshney, Distributed Detection and Data Fusion. New York: Springer-Verlag, 1996.
[31] J.-F. Chamberland and V. V. Veeravalli, "Decentralized detection in sensor networks," IEEE Trans. Signal Process., vol. 51, no. 2, pp. 407–416, Feb. 2003.
[32] X. Nguyen, M. J. Wainwright, and M. I. Jordan, "Nonparametric decentralized detection using kernel methods," IEEE Trans. Signal Process., vol. 53, no. 11, pp. 4053–4066, Nov. 2005.
[33] J. B. Predd, S. R. Kulkarni, and H. V. Poor, "Consistency in models for distributed learning under communication constraints," IEEE Trans. Inf. Theory, vol. 52, no. 1, pp. 52–63, Jan. 2006.
[34] J. B. Predd, S. R. Kulkarni, and H. V. Poor, "Distributed learning in wireless sensor networks," IEEE Signal Process. Mag., vol. 23, no. 4, pp. 56–69, Jul. 2006.
[35] M. Gastpar, P. L. Dragotti, and M. Vetterli, "The distributed Karhunen–Loève transform," IEEE Trans. Inf. Theory, vol. 52, no. 12, pp. 5177–5196, Dec. 2006.
[36] I. D. Schizas, G. B. Giannakis, and Z.-Q. Luo, "Distributed estimation using reduced-dimensionality sensor observations," IEEE Trans. Signal Process., vol. 55, no. 8, pp. 4284–4299, Aug. 2007.
[37] O. Roy and M. Vetterli, "Dimensionality reduction for distributed estimation in the infinite dimensional regime," IEEE Trans. Inf. Theory, vol. 54, no. 4, pp. 1655–1669, Apr. 2008.
[38] E. A. Patrick and F. P. Fischer, II, "Nonparametric feature selection," IEEE Trans. Inf. Theory, vol. IT-15, no. 5, pp. 577–584, Sep. 1969.
[39] R. Lotlikar and R. Kothari, "Adaptive linear dimensionality reduction for classification," Pattern Recognit., vol. 33, no. 2, pp. 185–194, Feb. 2000.
[40] J. C. Principe, D. Xu, and J. W. Fisher, III, "Information-theoretic learning," in Unsupervised Adaptive Filtering, S. Haykin, Ed. New York: Wiley, 2000, vol. 1, pp. 265–320.
[41] K. Torkkola, "Feature extraction by non-parametric mutual information maximization," J. Mach. Learn. Res., vol. 3, pp. 1415–1438, Mar. 2003.
[42] Z. Nenadic, "Information discriminant analysis: Feature extraction with an information-theoretic objective," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 8, pp. 1394–1407, Aug. 2007.
[43] M. Thangavelu and R. Raich, "Multiclass linear dimension reduction via a generalized Chernoff bound," in Proc. IEEE Workshop Mach. Learn. Signal Process., Cancún, Mexico, Oct. 2008, pp. 350–355.
[44] K. M. Carter, R. Raich, and A. O. Hero, III, "An information geometric approach to supervised dimensionality reduction," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Taipei, Taiwan, R.O.C., Apr. 2009, pp. 1829–1832.
[45] K.-C. Li, "Sliced inverse regression for dimension reduction," J. Amer. Statist. Assoc., vol. 86, no. 414, pp. 316–327, Jun. 1991.
[46] K.-C. Li, "On principal Hessian directions for data visualization and dimension reduction: Another application of Stein's lemma," J. Amer. Statist. Assoc., vol. 87, no. 420, pp. 1025–1039, Dec. 1992.
[47] F. Chiaromonte and R. D. Cook, "Sufficient dimension reduction and graphics in regression," Ann. Inst. Statist. Math., vol. 54, no. 4, pp. 768–795, Dec. 2002.
[48] K. Fukumizu, F. R. Bach, and M. I. Jordan, "Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces," J. Mach. Learn. Res., vol. 5, pp. 73–99, Jan. 2004.
[49] Sajama and A. Orlitsky, "Supervised dimensionality reduction using mixture models," in Proc. Int. Conf. Mach. Learn., Bonn, Germany, Aug. 2005, pp. 768–775.
[50] K. Fukumizu, F. R. Bach, and M. I. Jordan, "Kernel dimension reduction in regression," Ann. Stat., vol. 37, no. 4, pp. 1871–1905, Aug. 2009.
[51] X. Liu, A. Srivastava, and K. Gallivan, "Optimal linear representations of images for object recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 5, pp. 662–666, May 2004.
[52] F. Pereira and G. Gordon, "The support vector decomposition machine," in Proc. Int. Conf. Mach. Learn., Pittsburgh, PA, Jun. 2006, pp. 689–696.
[53] D.-S. Pham and S. Venkatesh, "Robust learning of discriminative projection for multicategory classification on the Stiefel manifold," in Proc. IEEE CS Conf. Comput. Vis. Pattern Recognit., Anchorage, AK, Jun. 2008.
[54] I. W.-H. Tsang, A. Kocsor, and J. T.-Y. Kwok, "Large-scale maximum margin discriminant analysis using core vector machines," IEEE Trans. Neural Netw., vol. 19, no. 4, pp. 610–624, Apr. 2008.
[55] S. Ji and J. Ye, "Linear dimensionality reduction for multi-label classification," in Proc. Int. Joint Conf. Artificial Intell., Pasadena, CA, Jul. 2009, pp. 1077–1082.
[56] S. Ji, L. Tang, S. Yu, and J. Ye, "Extracting shared subspace for multi-label classification," in Proc. ACM SIGKDD Int. Conf. Knowledge Discovery Data Mining, Las Vegas, NV, Aug. 2008, pp. 381–389.
[57] M. Rousson and N. Paragios, "Prior knowledge, level set representations & visual grouping," Int. J. Comput. Vis., vol. 76, no. 3, pp. 231–243, 2008.
[58] A. Edelman, T. A. Arias, and S. T. Smith, "The geometry of algorithms with orthogonality constraints," SIAM J. Matrix Anal. A., vol. 20, no. 2, pp. 303–353, Jan. 1998.
[59] J. H. Manton, "Optimization algorithms exploiting unitary constraints," IEEE Trans. Signal Process., vol. 50, no. 3, pp. 635–650, Mar. 2002.
[60] Y. Nishimori and S. Akaho, "Learning algorithms utilizing quasi-geodesic flows on the Stiefel manifold," Neurocomput., vol. 67, pp. 106–135, Aug. 2005.


[61] U. von Luxburg and O. Bousquet, "Distance-based classification with Lipschitz functions," J. Mach. Learn. Res., vol. 5, pp. 669–695, Jun. 2004.
[62] A. N. Kolmogorov and V. M. Tihomirov, "ε-entropy and ε-capacity of sets in functional spaces," Am. Math. Soc. Translations Series 2, vol. 17, pp. 277–364, 1961.
[63] G. D. Chakerian and P. Filliman, "The measures of the projections of a cube," Studia Scientiarum Mathematicarum Hungarica, vol. 21, no. 1–2, pp. 103–110, 1986.
[64] P. Filliman, "Extremum problems for zonotopes," Geometriae Dedicata, vol. 27, no. 3, pp. 251–262, Sep. 1988.
[65] O. Bousquet, "New approaches to statistical learning theory," Ann. Inst. Statist. Math., vol. 55, no. 2, pp. 371–389, Jun. 2003.
[66] I. Steinwart, "Consistency of support vector machines and other regularized kernel classifiers," IEEE Trans. Inf. Theory, vol. 51, no. 1, pp. 128–142, Jan. 2005.
[67] X. Shen and W. H. Wong, "Convergence rate of sieve estimates," Ann. Stat., vol. 22, no. 2, pp. 580–615, Jun. 1994.
[68] A. W. van der Vaart, Asymptotic Statistics. Cambridge, U.K.: Cambridge Univ. Press, 1998.
[69] X. Wang and M. Leeser, "A truly two-dimensional systolic array FPGA implementation of QR decomposition," ACM Trans. Embed. Comput. Syst., vol. 9, no. 1, Oct. 2009.
[70] J. C. Maxwell, A Treatise on Electricity and Magnetism. Oxford, U.K.: Clarendon Press, 1873.
[71] P. Bhattacharyya and B. K. Chakrabarti, "The mean distance to the nth neighbour in a uniform distribution of random points: An application of probability theory," Eur. J. Phys., vol. 29, no. 3, pp. 639–645, May 2008.
[72] Y.-A. L. Borgne, S. Raybaud, and G. Bontempi, "Distributed principal component analysis for wireless sensor networks," Sensors, vol. 8, no. 8, pp. 4821–4850, Aug. 2008.
[73] A. Asuncion and D. J. Newman, UCI Machine Learning Repository, 2007 [Online]. Available: http://archive.ics.uci.edu/ml
[74] R. D. Cook and L. Forzani, "Principal fitted components for dimension reduction in regression," Statist. Sci., vol. 23, no. 4, pp. 485–501, Nov. 2008.
[75] R. Damarla, M. Beigi, and A. Subramanian, "Human activity experiments performed at ARL," Army Res. Lab., Adelphi, MD, Tech. Rep., Apr. 2007.

Kush R. Varshney (S'00–M'10) was born in Syracuse, NY, in 1982. He received the B.S. degree (magna cum laude) in electrical and computer engineering with honors from Cornell University, Ithaca, NY, in 2004 and the S.M. and Ph.D. degrees, both in electrical engineering and computer science, from the Massachusetts Institute of Technology (MIT), Cambridge, in 2006 and 2010, respectively.

He is a research staff member in the Business Analytics and Mathematical Sciences Department at the IBM Thomas J. Watson Research Center, Yorktown Heights, NY. While at MIT, he was a Research Assistant with the Stochastic Systems Group in the Laboratory for Information and Decision Systems and a National Science Foundation Graduate Research Fellow. He has been a visiting student at École Centrale, Paris, and an intern at Lawrence Livermore National Laboratory, Sun Microsystems, and Sensis Corporation. His research interests include statistical signal processing, statistical learning, and image processing.

Dr. Varshney is a member of Eta Kappa Nu, Tau Beta Pi, and ISIF. He received a Best Student Paper Travel Award at the 2009 International Conference on Information Fusion.

Alan S. Willsky (S'70–M'73–SM'82–F'86) joined the Massachusetts Institute of Technology, Cambridge, in 1973 and is the Edwin Sibley Webster Professor of Electrical Engineering and Director of the Laboratory for Information and Decision Systems.

He was a founder of Alphatech, Inc., and Chief Scientific Consultant, a role in which he continues at BAE Systems Advanced Information Technologies. From 1998 to 2002, he served on the U.S. Air Force Scientific Advisory Board. His research interests are in the development and application of advanced methods of estimation, machine learning, and statistical signal and image processing.

Dr. Willsky has received several awards, including the 1975 American Automatic Control Council Donald P. Eckman Award, the 1979 ASCE Alfred Noble Prize, the 1980 IEEE Browder J. Thompson Memorial Award, the IEEE Control Systems Society Distinguished Member Award in 1988, the 2004 IEEE Donald G. Fink Prize Paper Award, the Doctorat Honoris Causa from Université de Rennes in 2005, and the 2009 Technical Achievement Award from the IEEE Signal Processing Society. In 2010, he was elected to the National Academy of Engineering. He has delivered numerous keynote addresses and is coauthor of the text Signals and Systems.

