
A Framework for Local Supervised Dimensionality Reduction of High

Dimensional Data

Charu C. Aggarwal

IBM T. J. Watson Research Center

[email protected]

Abstract

High dimensional data presents a challenge to the classification problem because of the difficulty in modeling the precise relationship between the large number of feature variables and the class variable. In such cases, it may be desirable to reduce the information to a small number of dimensions in order to improve the accuracy and effectiveness of the classification process. While data reduction has been a well studied problem for the unsupervised domain, the technique has not been explored quite as extensively for the supervised case. Existing techniques which try to perform dimensionality reduction are too slow for practical use in the high dimensional case. These techniques try to find global discriminants in the data. However, the behavior of the data often varies considerably with data locality, and different subspaces may show better discrimination in different localities. This is an even more challenging task than the global discrimination problem because of the additional issue of data localization. In this paper, we propose the novel idea of supervised subspace sampling in order to create a reduced representation of the data for classification applications in an efficient and effective way. The method exploits the natural distribution of the different classes in order to sample the best subspaces for class discrimination. Because of its sampling approach, the procedure is extremely fast and scales almost linearly both with data set size and dimensionality.

Keywords: classification, dimensionality reduction

1 Introduction

The classification problem is defined as follows: We have a set of records D containing the training data, and each record is associated with a class label. The classification problem constructs a model which connects the training data to the class variables. The classification problem is widely studied by the data mining, statistics, and machine learning communities [6, 8, 14]. In this paper, we will explore the dimensionality reduction problem in the context of classification.

Dimensionality reduction methods have been widely studied in the unsupervised domain [9, 13, 15]. The idea in dimensionality reduction methods is to transform the data into a new orthonormal coordinate system in which the second order correlations are eliminated. In typical applications, the resulting axis-system has the property that the variance of the data along many of the new dimensions is very small [13]. These dimensions can then be eliminated, a process resulting in a compact representation of the data with some loss of representational accuracy.

Dimensionality reduction has been studied somewhat sparingly in the supervised domain. This is partially because the presence of class labels significantly complicates the reduction process. It is more important to find a new set of dimensions in which the new axis-system retains the discriminatory behavior of the data, whereas the maintenance of representational accuracy becomes secondary. Aside from the advantages of a compact representation after reduction, dimensionality reduction also serves the dual purpose of removing the irrelevant subspaces in the data. This improves the accuracy of the classifiers on the reduced data. Some techniques such as those discussed in [8] achieve this by repeated discriminant computation. This is extremely expensive for high dimensional databases.

Most data reduction methods in the supervised and unsupervised domains use global data reduction. In these techniques, a single axis system is constructed on which the entire data is projected. Such techniques assume uniformity of class behavior throughout the data set while computing the new representation. Recent research in the unsupervised domain has shown that different parts of the data show considerably different behavior. As a result, while global dimensionality reduction often fails to capture the important characteristics of the data in a small number of dimensions, local dimensionality reduction methods [2, 5] can often provide a more effective solution. In these techniques, a different axis system is constructed for each data locality for more effective reduction.

For the supervised domain, the analogous intuition is that different parts of the data may show different patterns of discrimination. However, since even global reduction methods such as the Fisher method are computationally intensive, the task of effective local reduction becomes even more intractable. In this paper, we will show that this task can actually be accomplished quite efficiently by using a sampling process in which the random process of subspace selection is biased by the underlying class distribution. We will also show that the reduction process is very useful for the classification problem itself, since it facilitates the development of some interesting decomposable classification algorithms. The overall result is a greatly improved classification process.

The technique of subspace sampling [1, 11] has recently been used to perform data reduction in the unsupervised version of the problem. In this paper, we propose an effective subspace sampling approach for supervised problems. The aim is to exploit the data distribution in such a way that the axis system of representation in each data locality is able to expose the class discrimination. Since the class discrimination can be modeled with a small number of dimensions in many parts of the data, the resulting representation is often more concise and effective for the classification process.

We will also show how existing classification algorithms can be enhanced by the local reduction approach described in this paper. Since our reduction approach decomposes the data set by optimizing the discrimination behavior in each segment, different classification techniques may vary in effectiveness on different parts of the data. This fact can be exploited in order to ensure that the particular classification model being used is best suited to its data locality. To the best of our knowledge, the decomposable method discussed in this paper, which picks an optimal classifier depending upon locality specific properties of the compressed data, is unique in its approach.

In order to facilitate further development of the ideas, we will introduce additional notations and definitions. We assume that the data set is denoted by D. The number of points in the data set is denoted by N and the dimensionality by d. The full dimensional data space is denoted by U. We define the l-dimensional hyperplane H(y, E) by an anchor y and a mutually orthogonal set of vectors E = {e1 . . . el}. The hyperplane passes through y, and the vectors in E form the basis system for its subspace. The projection of a point x onto this hyperplane is denoted by P(x, y, E) and is the closest approximation of x which lies on this hyperplane. In order to find the value of P(x, y, E), we use y as the reference point for the computation. Specifically, we determine the projection of x − y onto the hyperplane defined by {e1 . . . el}. Then, we translate the resulting point by the reference point y. Therefore, we have:

P(x, y, E) = y + ∑_{i=1}^{l} [(x − y) · ei] ei      (1.1)

A pictorial representation of x′ = P(x, y, E) is illustrated in Figure 1(a). The value of x′ can be represented in the orthonormal axis system for E with the use of only l coordinates ((x − y) · e1 . . . (x − y) · el). This results in an additional overhead of storing y and E. This storage overhead is however not significant if it can be averaged over a large number of points stored on this hyperplane. While the error of approximating x with P(x, y, E) is given by the Euclidean distance between x and P(x, y, E), this measure is secondary to the classification accuracy of the reduced data.
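For concreteness, the following minimal numpy sketch (our own illustration; the function name and the example values are not from the paper) computes the reduced coordinates and the projection of Equation 1.1 for an orthonormal basis E:

```python
import numpy as np

def project(x, y, E):
    """Project x onto the hyperplane anchored at y with orthonormal basis E.

    x, y : (d,) arrays; E : (l, d) array whose rows are orthonormal vectors.
    Returns the l reduced coordinates and the projected point P(x, y, E).
    """
    coords = E @ (x - y)            # (x - y) . e_i for each basis vector e_i
    projection = y + coords @ E     # y + sum_i [(x - y) . e_i] e_i   (Equation 1.1)
    return coords, projection

# Example: a 1-dimensional hyperplane (line) in 3-dimensional space.
y = np.array([1.0, 0.0, 0.0])
E = np.array([[0.0, 1.0, 0.0]])          # single unit basis vector
coords, x_proj = project(np.array([2.0, 3.0, 4.0]), y, E)
# coords = [3.0], x_proj = [1.0, 3.0, 0.0]; the approximation error is ||x - x_proj||.
```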

This paper is organized as follows. In the next section, we will introduce the supervised subspace sampling technique and discuss some of its properties. Section 3 will discuss the application of the method to the classification problem. We will also discuss how the performance of the classification system can be considerably optimized by using the decomposition created by the supervised subspace sampling technique. The empirical results are discussed in Section 4. Finally, we present the conclusions and summary in Section 5.

1.1 Contributions of this paper The paper discusses a highly effective and scalable approach to the problem of supervised data reduction. While unsupervised data reduction has been well studied in the literature, the supervised problem is extremely difficult in practice because of the need to use the class distributions in the reduction process. The available discriminant methods are highly computationally intensive even on memory resident databases. In contrast, the technique discussed in this paper provides a significantly more effective reduction process, while exhibiting linear scalability with data set size and dimensionality because of its sampling approach. The process of sampling subspaces which are optimized to particular data localities results in a technique in which each segment of the data is more suited to the classification task. Furthermore, the unique decomposition created by the reduction process facilitates the creation of optimized decomposable approaches to the classification problem. Thus, the overall approach not only provides savings in terms of data compression, but also a greatly improved classification process.


[Figure 1: Illustration of Localized Sampling. (a) Approximation of a reduced data point: the projection x′ of x onto the hyperplane H(y, E), with coordinates (x − y) · e(1) and (x − y) · e(2). (b) Space sampling vs. point sampling: pure space sampling compared with space sampling with supervision. (c) Effects of data locality: point sampled supervised global random projection compared with point sampled supervised local random projection.]

[Figure 2: Subspace Tree (Example). The 1-dimensional representations A and B have representative sets {i1, i2} and {i3, i4}; their 2-dimensional extensions C, D, E and F have representative sets {i1, i2, i5}, {i1, i2, i6}, {i3, i4, i7} and {i3, i4, i8} respectively.]

2 Supervised Subspace Sampling

An interesting approach for dimensionality reduction in the unsupervised version of the problem is that of random projections [11, 12]. In this class of techniques, we repeatedly sample spherically symmetric random directions in order to determine an optimum hyperplane on which the data is projected. These methods can also be extended to the classification problem by projecting the data onto random subspaces and measuring the discrimination of such spaces. However, such a direct extension of the random projection technique [11, 12] may often turn out to be ineffective in practice, since it is blind both to the data and class distributions of the points. We will try to explain this point by using 1-dimensional projections of 2-dimensional data. Consider the data set illustrated in Figure 1(b), in which we have illustrated two kinds of projections. In the left figure, the data space is sampled in order to find a 1-dimensional line along which the projection is performed. In data space sampling, random projections are chosen in a spherically symmetric fashion irrespective of the data distribution. The reduced data in this 1-dimensional representation is simply the projection of the data points onto the line. We note that such a projection neither follows the basic pattern in the data, nor does it provide a direction along which the class distributions are well discriminated. For high dimensional cases, such a projection may be poor at distinguishing the different classes even after repeated subspace sampling. In the other case of Figure 1(b), we have sampled the points in order to create a random projection. The sampled subspace is defined as the (l − 1)-dimensional hyperplane containing l (locally proximate) points (of different classes) from the data.1 The reason for picking points from different classes is that we would like the resulting subspace to represent the discrimination behavior of different classes more effectively. At the same time, these points should be picked carefully only from a local segment of the data in order to ensure that the class discrimination is determined by data locality. For example, in Figure 1(b), the 1-dimensional line obtained by sampling two points of different classes picks the direction of greater discrimination more effectively than the space sampled projection in the same figure.
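To make the contrast concrete, the following small sketch (our own illustration; the function names and the two-class 2-dimensional example are not from the paper) draws a spherically symmetric space sampled direction and a point sampled line through two nearby points of different classes:

```python
import numpy as np

rng = np.random.default_rng(0)

def space_sampled_direction(d):
    """Spherically symmetric random direction, blind to data and class distribution."""
    v = rng.normal(size=d)
    return v / np.linalg.norm(v)

def point_sampled_line(X, labels):
    """1-dimensional line through two locally proximate points of different classes."""
    i = rng.integers(len(X))
    mask = labels != labels[i]
    # nearest point to X[i] that carries a different class label
    j = np.argmin(np.linalg.norm(X[mask] - X[i], axis=1))
    direction = X[mask][j] - X[i]
    return X[i], direction / np.linalg.norm(direction)

# Two overlapping 2-d classes: the point-sampled line is biased toward the
# direction along which the classes actually differ.
X = np.vstack([rng.normal([0, 0], 1.0, (50, 2)), rng.normal([4, 0], 1.0, (50, 2))])
labels = np.array([0] * 50 + [1] * 50)
anchor, e1 = point_sampled_line(X, labels)
```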

While it is intuitively clear that point sampling is more effective than space sampling for variance preservation, the advantages are limited when the data distribution varies considerably with locality. For example, in Figure 1(c), even the optimal 1-dimensional random projection cannot represent all points without losing a substantial amount of class discrimination. In fact, there is no 1-dimensional line along which a projection can effectively separate out the classes. In Figure 1(c), we have used the random projection technique locally in conjunction with data partitioning. In this technique, each data point is projected onto the closest of a number of point sampled hyperplanes. In this case, it is evident that the data points are often well discriminated along each of the sampled hyperplanes, while they may be poorly discriminated at a global level. This is because the nature of the class distribution varies considerably with data locality, and a global dimensionality reduction method cannot reduce the representation below the original set of two dimensions. On the other hand, the 1-dimensional representation of the data created by local projection of the data points along the sampled lines represents the class discrimination very well.

1 The actual methodology of choosing the points is discussed at a later stage. At this point, we are concerned only with choosing the points in such a way that the chosen subspace is naturally biased by the original data distribution.

It should be noted that the improvements of the localized subspace sampling technique come at the additional storage costs of the different hyperplanes. This limits the number of hyperplanes which can be retained from the sampling process, and requires us to make judicious choices in picking these hyperplanes. A second important issue is that even the implicit dimensionalities of the different data localities may be different. Therefore, we need a mechanism by which the sampling process is able to effectively choose hyperplanes of the lowest possible dimensionality for each data locality. This is an issue which we will discuss after developing some additional notational machinery:

Definition 1. Let P = (x1 . . . xl+1) be a set of (l + 1) linearly independent points. The representative hyperplane R(P) of P is defined as the l-dimensional hyperplane which passes through each of these (l + 1) points.

The hyperplane R(P) can also be represented with the use of any point y on the hyperplane, and an orthonormal set of vectors E = {e1 . . . el} which lie on the hyperplane. We shall call (y, E) the axis representation of the hyperplane, whereas the set P is referred to as the point representation. Thus, R(P) (point representation) is the same as H(y, E) (axis representation). We note that there can be infinitely many point or axis representations of the same hyperplane. The axis representation is more useful for performing distance computations of the hyperplane from individual points in the database, whereas the point representation has advantages in storage efficiency in the context of a hierarchical arrangement of subspaces. We will discuss this issue in a later section.

2.1 The Supervised Subspace Tree The Supervised Subspace Tree is a conceptual organization of subspaces used in the data reduction process. This conceptual organization imposes a hierarchical arrangement of the subspaces of different dimensionalities. Each such subspace provides effective data discrimination which is specific to a particular locality of the data. Since the subspaces have different data dimensionalities, this results in a variable dimensionality decomposition of the data. The nodes at level-m in the subspace tree correspond to m-dimensional subspaces. The root node corresponds to the null subspace. Thus, the dimensionality of the hyperplane for any node in the tree is determined by its depth. The subspace at a node is hierarchically related to that of its immediate parent. Each subspace other than the null subspace at the root is a 1-dimensional extension of its parent hyperplane. This 1-dimensional extension is obtained by adding a sampled data point to the representative set of the parent hyperplane. In order to elucidate the concept of a subspace tree, we will use an example. In Figure 2, we have illustrated a hierarchically arranged set of subspaces. The figure contains a two-level tree structure which corresponds to 1- and 2-dimensional subspaces. For each level-1 node in the tree, we store two points which correspond to the 1-dimensional line for that node. For each lower level node, we store an additional data point which increases the dimensionality of its parent subspace by 1. Therefore, a level-m node has a representative set of cardinality (m + 1). For example, in the case of Figure 2, the node A in the subspace tree (with representative set {i1, i2}) corresponds to the 1-dimensional line defined by {i1, i2}. This node is extended to a 2-dimensional hyperplane in two possible ways, corresponding to the nodes C and D. In each case, an extra point needs to be added to the representative set for creating the 1-dimensional extension. In order to extend to the 2-dimensional hyperplane for node C, we use the point i5, whereas in order to extend to the hyperplane for node D, we use the point i6. Note from Figure 2 that the intersection of the 2-dimensional hyperplanes C and D is the 1-dimensional line A. Thus, each node in the subspace tree corresponds to a hyperplane which is defined by its representative set drawn from the database D. The representative set for a given hyperplane is obtained by adding one point to the representative set of its immediate parent. The subspace tree is formally defined as follows:

Definition 2. The subspace tree is a hierarchical arrangement of subspaces with the following properties: (1) Nodes at level-m correspond to m-dimensional hyperplanes. (2) Nodes at level-(m + 1) correspond to 1-dimensional extensions of their parent hyperplanes at level-m. (3) The point representative set of a level-(m + 1) node is obtained by adding a sampled data point to the representative set of its m-dimensional parent subspace.
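A minimal node structure capturing this definition might look as follows (our own sketch; the class and field names are illustrative and not part of the paper):

```python
from dataclasses import dataclass, field
from typing import List, Optional
import numpy as np

@dataclass
class SubspaceNode:
    """One node of the subspace tree (Definition 2)."""
    parent: Optional["SubspaceNode"]       # None for the root (null subspace)
    new_points: List[np.ndarray]           # two sampled points at level 1, one point below that
    children: List["SubspaceNode"] = field(default_factory=list)
    status: int = 0                        # 0 = default, 1 = forbidden, 2 = discriminative
    beta: float = 0.0                      # discrimination index of the node

    def representative_set(self) -> List[np.ndarray]:
        """Point representation: sampled points along the path from the root to this node."""
        inherited = [] if self.parent is None else self.parent.representative_set()
        return inherited + list(self.new_points)

    def level(self) -> int:
        """Depth in the tree; a level-m node defines an m-dimensional hyperplane."""
        return 0 if self.parent is None else self.parent.level() + 1
```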

The data points in D are partitioned among the different nodes of the subspace tree. We note that since the hyperplane is a subspace of the full dimensional space, it has a lower dimensional axis system in terms of which the coordinates of x are represented. Since the dimensionality of a hyperplane depends directly on the distance of the node to the root, higher levels of the tree provide greater advantages in the reduction process.

2.2 Subspace Tree Construction Each node of the subspace tree corresponds to a hyperplane defined by the sequence of representative points sampled, starting from the root up to that node. The terms hyperplane and node are therefore used interchangeably throughout this paper.

In order to measure the quality of a given node N for the classification process, a discrimination index β(N) is maintained along with each node. This discrimination index β(N) always lies between 0 and 1, and is a measure of how well the different classes are separated out in the data set for node N. A value of 1 indicates perfect discrimination among the classes, whereas a value of 0 indicates very poor discrimination. We will discuss the methodology for computation of the discriminant in a later section.

The input to the subspace sampling algorithm for tree construction is the compression tolerance parameter ε, the data set D, the maximum number of nodes L, a discrimination tolerance γ1, and a discrimination target γ2. The value of γ2 is always larger than γ1. Each of these discrimination thresholds lies between 0 and 1. Intuitively, these discrimination thresholds impose a minimum and maximum threshold on the quality of class separation in the individual nodes. Correspondingly, each node N is classified into one of three types, which is recorded by a variable called the Status(·) vector: (1) Default Node: In this case, the discrimination index β(N) lies between γ1 and γ2. The value of Status(N) is set to 0. (2) Discriminative Node: Such a node is good for the classification process. The discrimination index β(N) is larger than γ2 in this case. The variable Status(N) is set to 2. (3) Forbidden Node: Such a node is bad for the classification process. In this case, the discrimination index β(N) is smaller than γ1. The value of Status(N) is set to 1.
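The three node types can be summarized by a small helper (our own sketch, following the thresholds used above):

```python
# Status codes used above: 0 = default, 1 = forbidden, 2 = discriminative.
def node_status(beta: float, gamma1: float, gamma2: float) -> int:
    """Classify a node by its discrimination index beta, with 0 <= gamma1 < gamma2 <= 1."""
    if beta >= gamma2:
        return 2          # discriminative node: classes are well separated
    if beta < gamma1:
        return 1          # forbidden node: poor class separation
    return 0              # default node
```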

A top-down algorithm is used to construct the nodes of the subspace tree, and the data set D is partitioned along this hierarchy in order to maximize the localized discrimination during the dimensionality reduction process. A discrimination index β(N) is maintained with each node N in the tree. At each stage of the algorithm, every node N in the subspace tree has a set of descendent assignments T(N) ⊆ D from the database D. These are the data points which will be assigned to one of the descendants of node N during the tree construction process, but not to node N itself. In addition, each node also has a set of direct assignments Q(N), which are data points that are reduced onto node N. A data point becomes a direct assignment of node N when one of the following two properties is satisfied: (1) The data point is at most a distance of ε from the hyperplane corresponding to node N and the discrimination factor β(N) for the node is larger than γ1. (2) The discrimination factor β(N) for the node is larger than γ2. All assignments of the node which are not direct assignments automatically become descendent assignments. In each iteration, the descendent assignments T(N) of the nodes at a given level of the tree are partitioned further among at most kmax children of node N. This partitioning is based on the distance of the data points to the hyperplanes corresponding to the kmax children of N. Specifically, each data point is assigned to the hyperplane from which it has the least distance. The assigned points are then classified either as descendent or direct assignments depending upon the distance from the hyperplane and the corresponding discrimination index. As noted earlier, the latter value determines whether a node is a default node, a discriminative node, or a forbidden node. Forbidden nodes do not have any direct assignments, and therefore all data points at forbidden nodes automatically become descendent assignments. On the other hand, in the case of a discriminative node, the reverse is true and all points become direct assignments. For the case of default nodes, a point becomes a direct assignment only if it is at a distance of at most ε from the corresponding hyperplane. This process continues until each data point becomes the direct assignment of some node, or is identified in the anomaly set. The overall algorithm for subspace tree construction is illustrated in Figure 3.
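A compact statement of the direct-assignment rule (our own sketch; the distance helper is illustrative):

```python
import numpy as np

def is_direct_assignment(dist_to_hyperplane: float, beta: float,
                         eps: float, gamma1: float, gamma2: float) -> bool:
    """Direct-assignment rule used during tree construction.

    A point is reduced onto the node if (1) it lies within eps of the node's hyperplane
    and the node is not forbidden, or (2) the node is discriminative.
    """
    if beta >= gamma2:                    # discriminative node: all points are direct
        return True
    if beta < gamma1:                     # forbidden node: no direct assignments
        return False
    return dist_to_hyperplane <= eps      # default node: distance test against eps

def distance_to_hyperplane(x, y, E):
    """Euclidean distance from x to the hyperplane H(y, E) with orthonormal basis rows E."""
    residual = (x - y) - (E @ (x - y)) @ E
    return np.linalg.norm(residual)
```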

A levelwise algorithm is used during the tree construction phase. The reason for this levelwise approach is that the database operations during the construction of a given level of nodes can be consolidated into a single database pass. The actual construction of the mth level is achieved by sampling one representative point for each of the kmax children of the level-(m − 1) nodes. This representative point is added in order to create the corresponding 1-dimensional extension. These kmax representative points are sampled from the local segment T(N) of the database. Picking these representative points at a node N is tricky, since we would like to ensure that they satisfy the following two properties: (1) The points represent the behavior of localized regions in the data. (2) The points are sufficiently representative of the different classes in that data locality. In order to achieve this, we would like to ensure that the set R(N) represents as many different classes as possible. Since the representative extensions at a node N are sampled from the local segment T(N) only, the first property is satisfied. In order to satisfy the second property, we choose the class from which the data points are sampled as described below.


Algorithm SampleSubspaceTree(Compression Tolerance: ε, Maximum Tree Degree: kmax, Database: D,
    Node Limit: L, Discrimination Tolerance: γ1, Discrimination Target: γ2)
begin
  Sample 2 ∗ kmax ∗ sampfactor points from D and pair up points randomly to create
    kmax ∗ sampfactor 1-dim. point representative hyperplanes (lines) denoted by S;
  (S, β(S1) . . . β(Skmax)) = SelectSubspaces(S, kmax);
  (T(S1) . . . T(Skmax), Q(S1) . . . Q(Skmax)) = PartitionData(D, S);
  for i = 1 to kmax do
    if β(Si) ≥ γ2 then { Discriminative Node } Q(Si) = Q(Si) ∪ T(Si); T(Si) = φ; Status(Si) = 2;
    else if β(Si) < γ1 then { Forbidden Node } T(Si) = T(Si) ∪ Q(Si); Q(Si) = φ; Status(Si) = 1;
    else Status(Si) = 0;
  S = DeleteNodes(S1 . . . Skmax, min-thresh);
  { Lm is the set of level-m nodes }
  m = 1; L1 = S; { Each hyperplane (line) in S is the child of Root }
  while (Lm ≠ {}) and (fewer than L nodes have been generated) do
  begin
    for each non-null level-m node R ∈ Lm do
    begin
      Sample kmax ∗ sampfactor points from T(R);
      Extend the node R by each of these kmax ∗ sampfactor points (in turn) to create the
        kmax ∗ sampfactor corresponding (m + 1)-dimensional hyperplanes denoted by S;
      (S, β(S1) . . . β(Skmax)) = SelectSubspaces(S, kmax);
      (T(S1) . . . T(Skmax), Q(S1) . . . Q(Skmax)) = PartitionData(T(R), S);
      for i = 1 to kmax do
        if β(Si) ≥ γ2 then { Discriminative Node } Q(Si) = Q(Si) ∪ T(Si); T(Si) = φ; Status(Si) = 2;
        else if β(Si) < γ1 then { Forbidden Node } T(Si) = T(Si) ∪ Q(Si); Q(Si) = φ; Status(Si) = 1;
        else Status(Si) = 0;
      S = DeleteNodes(S1 . . . Skmax, min-thresh); { Thus S contains at most kmax children of R }
      Lm+1 = Lm+1 ∪ S;
    end;
    m = m + 1;
  end;
end

Figure 3: Supervised Subspace Tree Construction

The class from which the data points are sampled at node N is chosen as follows: Let f^R_1 . . . f^R_k be the fractional class distributions in R(N) and f^T_1 . . . f^T_k be the fractional class distributions in T(N). We sample a point in T(N) belonging to the class i with the least value of f^R_i / f^T_i. Thus, this process picks the point from the class which is most under-represented in R(N) relative to T(N).
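In code, this class-biased choice could be sketched as follows (our own illustration; the helper name is not from the paper):

```python
from collections import Counter

def pick_sampling_class(rep_labels, local_labels):
    """Choose the class most under-represented in the representative set R(N)
    relative to the local segment T(N), i.e. the class minimizing f^R_i / f^T_i."""
    classes = sorted(set(local_labels))
    f_R = Counter(rep_labels)
    f_T = Counter(local_labels)
    n_R, n_T = max(len(rep_labels), 1), len(local_labels)
    ratios = {c: (f_R.get(c, 0) / n_R) / (f_T[c] / n_T) for c in classes}
    return min(ratios, key=ratios.get)

# Example: R(N) contains only class 'x', T(N) is an even mixture, so 'o' is picked.
print(pick_sampling_class(['x', 'x'], ['x'] * 10 + ['o'] * 10))   # -> 'o'
```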

A total of kmax ∗ sampfactor points (belonging to the selected class) are picked for extension of the nodes from level-(m − 1) to level-m. Thus, a total of kmax ∗ sampfactor m-dimensional hyperplanes can be generated by combining the representative set R(N) of node N with each of these sampled points. The purpose of oversampling by a factor of sampfactor is to increase the effectiveness of the final children subspaces which are picked. The larger the value of sampfactor, the better the sampled subspaces, but the greater the computational requirement. Next, the procedure SelectSubspaces picks kmax hyperplanes out of these kmax ∗ sampfactor possibilities so that the different classes are as well separated as possible. The first task is to partition the kmax ∗ sampfactor hyperplanes into sampfactor sets of kmax hyperplanes. We will pick one of these partitions depending upon the quality of the assignment. In order to achieve this, the distance of each data point x to each of the kmax ∗ sampfactor hyperplanes is determined. For each of the sampfactor sets of hyperplanes, we assign the data point x to the closest hyperplane from that partition. This results in a total of sampfactor possible assignments of the data points. The quality of the assignment depends upon how well the different classes are discriminated from one another in the resulting localized data sets. The SelectSubspaces procedure quantifies this separation in terms of the discrimination index β(·) for each of these nodes. The average discrimination index for each of the sampfactor sets of nodes is calculated. The set of kmax hyperplanes with the largest average discrimination index is chosen for the purpose of reduction. These hyperplanes are returned as the set S = (S1 . . . Skmax). In addition, the discrimination index of each of these nodes is returned as (β1 . . . βkmax).

Once the hyperplanes which form the optimal extensions of node N have been determined, each point in T(N) is re-assigned to one of the children of N by the use of the procedure PartitionData. Specifically, each point in T(N) is assigned to the child node to which it has the least distance. Furthermore, the assigned points are classified into direct assignments Q(Si) and descendent assignments T(Si). Initially, the PartitionData procedure classifies the points using only the distance of the data points from the corresponding hyperplanes. Specifically, the PartitionData procedure returns a point as a direct assignment if it is at a distance of at most ε from the corresponding hyperplane. Otherwise, it returns the data point as a descendent assignment. After application of the PartitionData procedure, we further re-adjust the direct and descendent assignments using the discrimination levels β(Si) of each child node Si. If the node Si is a forbidden node, then the direct assignment set Q(Si) is reset to null, and all points become descendent assignments. Therefore Q(Si) is added to T(Si). The reverse is true when the node is a discriminative node. In that case, all points become direct assignments and T(Si) is set to null.
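A condensed sketch of the choice among the sampfactor candidate groups might look as follows (our own paraphrase; the dist and discrimination_index helpers are assumed to be supplied, and the group with the highest average discrimination index is retained):

```python
import numpy as np

def select_subspaces(candidates, X, labels, k_max, dist, discrimination_index):
    """Pick k_max hyperplanes from k_max * sampfactor candidates.

    The candidates are split into sampfactor groups of k_max; for each group, every
    point in X is assigned to its closest hyperplane, and the group whose local data
    sets have the best average discrimination index is retained.
    """
    sampfactor = len(candidates) // k_max
    best_group, best_avg_beta, best_result = None, -1.0, None
    for g in range(sampfactor):
        group = candidates[g * k_max:(g + 1) * k_max]
        # assign each point to the closest hyperplane of this group
        assign = np.argmin([[dist(x, H) for H in group] for x in X], axis=1)
        betas = [discrimination_index(X[assign == i], labels[assign == i])
                 for i in range(k_max)]
        avg_beta = float(np.mean(betas))
        if avg_beta > best_avg_beta:
            best_group, best_avg_beta, best_result = group, avg_beta, (assign, betas)
    return best_group, best_result
```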

Nodes which have too few points assigned to them are not useful for the dimensionality reduction process. Such nodes are deleted by the procedure DeleteNodes. The corresponding data points are considered exceptions, which are stored separately by the algorithm.

The first iteration of the algorithm (m = 1) is special in that we sample 2 ∗ kmax ∗ sampfactor points in order to create the initial set of kmax ∗ sampfactor lines. Thus, the only difference is that twice the number of points need to be sampled in order to create the 1-dimensional hyperplanes used by the algorithm. The procedure for selection of these points is the same as in the general case. The other procedures, such as selection and deletion of subspaces and data partitioning, are also the same as in the general case.

In order to ease conceptual abstraction, we have presented the PartitionData and SelectSubspaces procedures separately for each node. In the actual implementation, these procedures are executed simultaneously for all nodes at a given level in one scan. Similarly, the process of picking the best hyperplanes for all nodes at a given level is executed simultaneously in a single scan of the data.

The process of levelwise tree construction continues until no node in the current level can be extended any further, or the maximum limit L on the number of nodes has been reached. We would like this limit L to be determined by main memory limitations, since the tree will need to be applied effectively during the classification process. For our implementation, we used a conservative limit of only L = 10,000 nodes, which was well within current main memory limitations for even 1000-dimensional data sets.

Each of the procedures SelectSubspaces and PartitionData requires the computation of distances of data points x to the representative hyperplanes. In order to perform these distance computations, the axis representations of the hyperplanes need to be determined. A hyperplane node N at level-m is only implicitly defined by the (m + 1) data points {z1 . . . zm+1} stored at the nodes along the path from the root to N. The next tricky issue is to compute the axis representation (y, E = {e1 . . . em}) of the points {z1 . . . zm+1} efficiently in a way that can be replicated exactly at the time of data reconstruction. This is especially important, since there can be an infinite number of axis representations of the same hyperplane, but the projection coordinates are computed only with respect to a particular axis representation. The corresponding representation (y, E = {e1 . . . em}) is computed as follows:

We first set y = z1 and e1 = (z2 − z1)/||z2 − z1||. Next, we iteratively compute ei from e1 . . . ei−1 as follows:

ei = (zi+1 − z1 − ∑_{j=1}^{i−1} [(zi+1 − z1) · ej] ej) / ||zi+1 − z1 − ∑_{j=1}^{i−1} [(zi+1 − z1) · ej] ej||      (2.2)

Equation 2.2 is essentially the iteration of the Gram-Schmidt orthogonalization process [10]. The following observation is a direct consequence of this fact:

Observation 2.1. The set (z1, E) generated by Equation 2.2 is an axis representation of the hyperplane R(z1 . . . zm+1).
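In numpy, the conversion from a point representation to an axis representation could be sketched as follows (our own rendering of Equation 2.2; the function name is illustrative):

```python
import numpy as np

def axis_representation(points):
    """Gram-Schmidt conversion of a point representation {z1 ... z_{m+1}} into an
    axis representation (y, E), following Equation 2.2.

    points : (m + 1, d) array, ordered by the path-ordered convention.
    Returns the anchor y = z1 and an (m, d) array of orthonormal basis vectors.
    """
    z = np.asarray(points, dtype=float)
    y = z[0]
    E = []
    for i in range(1, len(z)):
        v = z[i] - y
        for e in E:
            v = v - np.dot(v, e) * e      # subtract components along earlier axes
        E.append(v / np.linalg.norm(v))   # normalize (points are linearly independent)
    return y, np.array(E)

# Example: three points spanning a 2-dimensional hyperplane in 3-dimensional space.
y, E = axis_representation([[0, 0, 0], [1, 0, 0], [1, 1, 0]])
# E == [[1, 0, 0], [0, 1, 0]]
```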

Many axis representations can be generated using Equation 2.2 for the same hyperplane R({z1 . . . zm+1}), depending upon the ordering of {z1 . . . zm+1}. Since we need to convert from point representations to axis representations in a consistent way for both data reduction and reconstruction, this ordering needs to be fixed in advance. For the purpose of this paper, we will assume that the point ordering is always the same as the one in which the points were sampled during the top-down tree construction process. As a result, representative points sampled at higher levels of the tree are ordered first, and points at lower levels are ordered last. The only ambiguity is for the level-1 nodes, at which two points are stored instead of one. In that case, the record which is lexicographically smaller is ordered earlier. We shall refer to this particular convention for axis representation as the path-ordered axis representation.

2.3 Computation of Discrimination Index Several methods can be used in order to calculate the discrimination index of the reduced data at a given node. The most popularly used method is Fisher's linear discriminant [7] applied to the data points in the node. This discriminant minimizes the ratio of the intra-class distance to the inter-class distance. Our approach however uses a nearest neighbor discriminant. In this technique, we find the nearest neighbor of each data point in the dimensionality reduced database, and calculate the fraction of data points which share the same class as their nearest neighbor. This fraction is reported as the discrimination index.
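A brute-force numpy version of this index (our own sketch) is straightforward:

```python
import numpy as np

def discrimination_index(X, labels):
    """Nearest-neighbor discrimination index beta(N): the fraction of points in the
    reduced data set whose nearest neighbor (excluding themselves) shares their class."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = len(X)
    if n < 2:
        return 0.0
    # pairwise squared distances; the diagonal (self-distances) is masked out
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)
    nn = d2.argmin(axis=1)
    return float(np.mean(labels[nn] == labels))
```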

2.4 Storage of Compressed Representation The storage algorithm needs to store the tree structure and the set of points associated with each node of the tree. Each of these components is stored as follows:

(1) The Subspace Tree: If the axis-systems were explicitly stored at each node, the storage requirements for the tree structure could be considerable. This is because the axis representation (y, E) for an m-dimensional node requires m + 1 vectors (the origin y and the m orthonormal basis vectors of the corresponding hyperplane). However, it turns out that we do not need to store the axis representations explicitly. For each level-m node, we only maintain the additional data point which increases the dimensionality of the corresponding subspace by one. In addition, we need to maintain the identity of the node, its immediate parent, and the status of the node (corresponding to whether it is forbidden, discriminative, or a default node). Thus, a total of (d + 3) values are required for each node. For the (at most kmax) level-1 nodes, the storage requires (2 · d + 3) values, since we need to maintain the two points which define the sampled line in lexicographic ordering.

The subspaces in the tree structure are thus only implicitly defined by the sequence of points from the root to that node. Thus, by using this method, we require only one vector for an m-dimensional node rather than O(m) vectors. Since most nodes in the tree are at lower levels, such savings add up considerably over the entire tree structure. The reason for this storage efficiency is the implicit representation of the subspace tree. This results in the reuse of the vector stored at a given node for all descendents of that node.

(2) The Reduced Database: Each data point is either an exception or is associated with one node in the tree. We store the identity of the node for which it is a direct assignment. In addition, we maintain the coordinates of the data point for the axis representation (y, E) of this hyperplane in accordance with Equation 1.1. The projection coordinates of x on (y, E) are given by (c1 . . . cm) = {e1 · (x − y) . . . em · (x − y)}. The class labels are stored separately.

2.5 Reconstruction Algorithm Since the subspace tree is represented in an implicit format, the axis representations of the nodes need to be reconstructed. It is possible to do this efficiently because of the use of the path-ordered axis convention for representation. Since there are a large number of nodes, it would seem that this initial phase could be quite expensive to perform for each node.

Algorithm SelectClassifierModels(Subspace Tree: ST)
begin
  for each node N in the subspace tree ST do
  begin
    Divide the data set T(N) into two parts T1(N) and T2(N) with ratio r : 1;
    Use classification training algorithms A1 . . . Am on T1(N) to create models M1(N) . . . Mm(N);
    Compute accuracy of models M1(N) . . . Mm(N) on T2(N);
    Pick the model with the highest classification accuracy on T2(N) and denote it as CL(N);
  end;
end

Figure 4: Node Specific Classification

However, because of the use of the path-ordered convention for axis representations, this can be achieved in a time complexity which requires the computation of only one axis per node. The trick is to construct the axis representations of the nodes in the tree in a top-down fashion. This is because Equation 2.2 computes the axis representation {e1 . . . ei} of a node by using the axis representation {e1 . . . ei−1} of its parent and the point z′ stored at that node. (For the nodes at level-1, lexicographic ordering of the representative points is assumed.) It is easy to verify that this order of processing results in the path-ordered axis representation.

Once the axis representations of the nodes have been constructed, it is simple to perform the necessary axis transformations which represent the reconstructed database in terms of the original attributes. Recall that for each database point x, the identity of the corresponding node is also stored along with it. Let (y, E) be the corresponding hyperplane and (c1 . . . cm) = {e1 · (x − y) . . . em · (x − y)} be the coordinates of x along this m-dimensional axis representation. Then, as evident from Equation 1.1, the reconstructed point x′ is given by:

x′ = y + ∑_{i=1}^{m} ci ei
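As a round trip with the earlier projection sketch (again our own illustration), the reconstruction step is a single line:

```python
import numpy as np

def reconstruct(coords, y, E):
    """Approximate reconstruction of a reduced point: x' = y + sum_i c_i * e_i.

    coords : (m,) reduced coordinates stored with the point;
    y, E   : path-ordered axis representation of the point's node.
    """
    return y + np.asarray(coords) @ np.asarray(E)

# Round trip with the earlier example (hypothetical values):
y = np.array([1.0, 0.0, 0.0]); E = np.array([[0.0, 1.0, 0.0]])
x_reconstructed = reconstruct([3.0], y, E)     # -> [1.0, 3.0, 0.0]
```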

3 Effective Classification by Locality Specific Decomposition

It turns out that the locality specific decomposition approach of the dimensionality reduction problem not only provides a reduced representation of the data, but can also be leveraged for effective classification. This is because the dimensionality reduction process decomposes the data into a number of different parts, each of which has a unique and distinctive class distribution. The use of different training phases on different nodes may be quite effective in these cases. In fact, the fundamental variation in the dimensionalities and data characteristics of the different nodes makes each segment amenable to different kinds of classifiers.


Algorithm DecomposableClassify(Test Instance: xT, Subspace Tree: ST, Classifier Models: CL(·))
begin
  Perform a hierarchical traversal of the tree ST so as to find the highest level node NT
    which is not a forbidden node and is either a default node within a distance ε from xT
    or is a discriminative node;
  if no such node NT exists (outlier node)
    then report the majority class of the outlier points found during subspace tree generation;
    else use the classifier model CL(NT) in order to classify the test instance xT;
  return class label of xT;
end

Figure 5: Locally Decomposable Classification

For each data set at a given node N of the tree, we apply a classifier whose identity is dependent only upon the results of a training process applied to the data at that particular node. We denote the classifier model at a given node N by CL(N). Thus, for one node a nearest neighbor classifier might be used, whereas for another node an association rule classifier may be used. We will discuss the details of the process of determination of the classifier model CL(N) slightly later.

Since the reduced data is contained only in nodes which are not forbidden, only those nodes are relevant to the classification process. For each such node N in the subspace tree ST, we use the corresponding data T(N) in order to decide on the classification algorithm which best suits that particular set of records. In the first step, we divide the data T(N) into two parts T1(N) and T2(N) in the ratio r : 1. The set T1(N) is used to train the different classification algorithms A1 . . . Am on the particular locality of the data reduced to hyperplane N. The corresponding models are denoted by M1 . . . Mm. The reduced representation of the data at node N is used for training purposes. Once the different algorithms have been trained using T1(N), the best algorithm is determined by testing on T2(N). The classification accuracy of each of the models M1 . . . Mm on T2(N) is computed. The model Ms with maximum classification accuracy on T2(N) is determined. This is the classification model CL(N) used at node N. The overall algorithm is described in the procedure SelectClassifierModels of Figure 4.
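A minimal per-node model selection sketch (our own illustration; the candidate models are assumed to expose scikit-learn style fit/score methods, which is not part of the paper):

```python
import numpy as np

def select_classifier_model(X_node, y_node, candidate_models, r=2, seed=0):
    """Pick the best classifier for one node: split T(N) into T1 and T2 with ratio r:1,
    train every candidate on T1, and keep the model with the highest accuracy on T2."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X_node))
    split = int(len(idx) * r / (r + 1))
    train, test = idx[:split], idx[split:]
    best_model, best_acc = None, -1.0
    for model in candidate_models:
        model.fit(X_node[train], y_node[train])
        acc = model.score(X_node[test], y_node[test])
        if acc > best_acc:
            best_model, best_acc = model, acc
    return best_model
```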

Once the models for each of the nodes have been constructed, they can be used for classifying individual test instances. For a given test instance, we first decide the identity of the node that it belongs to. In order to find which node a test instance belongs to, we use the same rules that are used to assign points to nodes in the construction of the subspace tree from the original database D.

Data Set              Records   Attributes
Forest Cover          581012    54
Keyword1              67121     239
Keyword2              68134     239
C35.I6.D100K.P100     100000    100
C40.I6.D100K.P100     100000    100

Table 1: Characteristics of the Data Sets

Therefore, a hierarchical traversal is performed on the tree structure using the same rules as utilized by the tree construction algorithm in defining direct assignments. Thus, for each data point xT which is to be classified, we perform a hierarchical traversal of the tree starting at the root. The traversal always picks that branch of the tree to which xT is closest, until it reaches a node L at which one of the following conditions is satisfied:

(1) The node L is a discriminative node.
(2) The node L is neither discriminative nor forbidden, and the corresponding hyperplane is within the specified error tolerance of ε.

We note that this choice of tree traversal and node selection directly mirrors the process of direct and descendent assignment of points to nodes during the tree construction process. In some cases, such a node may not be found if the point xT is an outlier. In that case, the majority class from the set of outlier points found during tree construction is reported as the class label.

In the event that a node is indeed found which satisfies one of the above conditions, we denote it by NT. The classifier model CL(NT) is then used for the classification of the test instance xT, and the corresponding class label is reported as the final result of the classification process. The overall algorithm is described in the procedure DecomposableClassify of Figure 5.
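The traversal can be sketched as follows (our own illustration; dist is an assumed hyperplane-distance helper, classifiers maps node identities to the trained models CL(·), and predict follows scikit-learn conventions):

```python
def classify_test_instance(x_T, root, eps, classifiers, outlier_class, dist):
    """Descend the subspace tree along the closest child at each level until a
    discriminative node, or a default node within eps of x_T, is reached; then apply
    that node's classifier. Falls back to the majority outlier class otherwise."""
    node = root
    while node.children:
        node = min(node.children, key=lambda child: dist(x_T, child))
        if node.status == 2:                                   # discriminative node
            return classifiers[id(node)].predict([x_T])[0]
        if node.status == 0 and dist(x_T, node) <= eps:        # default node within eps
            return classifiers[id(node)].predict([x_T])[0]
    return outlier_class            # no suitable node: report the majority outlier class
```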

4 Empirical Results

The system was tested on an AIX 4.1.4 system with a 300 MHz processor and 100 MB of main memory. The data was stored on a 2 GB SCSI drive. The supervised subspace sampling algorithm was tested with respect to the following measures: (1) Effectiveness of data reduction with respect to linear discriminant analysis. (2) Efficiency of the data reduction process. (3) Effectiveness of the supervised subspace sampling method as an approach for decomposable classification.


A number of synthetic and real data sets were utilized in order to test the effectiveness of the reduction and classification process. The characteristics of the data sets are illustrated in Table 1. The Forest Cover data set was obtained from the UCI KDD archive. All attributes and records of the forest cover data set were used. In this case, binary attributes were treated in a similar way to numerical attributes for the classification and dimensionality reduction process. The keyword data sets were derived from web pages in the Yahoo! taxonomy. The records were generated by finding the frequencies of the 239 most discriminative keywords in web pages drawn from the Yahoo! taxonomy. The classes correspond to the highest level categories in the Yahoo! taxonomy. These keywords were determined using the gini index value of each feature. The 239 features with the highest gini index were used. Two data sets were generated corresponding to the commercial and non-commercial sections of the Yahoo! taxonomy. We denote these data sets by Keyword1 and Keyword2 respectively.

In order to test the algorithmic effectiveness further, we used synthetic data sets. These are also useful for scalability testing, since it is possible to show clear trends with varying data size and dimensionality. For this purpose, we used the market basket data sets proposed in [3]. In order to create different classes, we generated two different instantiations of the data set Tx.Iy.Dz (according to the notations of [3]) and created a two class data set which was a mixture of these two data sets in equal proportions. The only difference from the generation methodology in [3] is that a subset of w items was used instead of the standard 1000 items used in [3]. We refer to this data set as Cx.Iy.D(2z).Pw. Since the data set Tx.Iy.Dz contains z records, the data set Cx.Iy.D(2z).Pw contains 2 · z records. Two data sets, C35.I6.D100K.P100 and C40.I6.D100K.P100, were created using this methodology.

In order to test the effectiveness of the data reduction process, we used Fisher's discriminant method as a comparative baseline. This method was implemented as follows (a brief sketch of the baseline is given after the list):

(1) The first dimension was found by computing the Fisher direction [7] which maximized the discrimination between the class variables.
(2) The data was then projected onto the remaining d − 1 dimensions defined by the hyperplane orthogonal to the first direction.
(3) The next dimension was again determined by finding the Fisher direction which maximized the total discrimination in the remaining d − 1 dimensions.
(4) The process was repeated iteratively until a new orthonormal axis system was determined.
(5) The most discriminative dimensions in the data were retained by using a threshold on the Fisher discriminant value. This threshold was determined by finding the mean µ and variance σ2 of the Fisher index for the newly determined dimensions. The threshold was then set at µ + 2 · σ.
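Under the assumption of a two-class labeling, this baseline could be sketched as follows (our own numpy illustration; the regularization term and function names are not from the paper):

```python
import numpy as np

def fisher_direction(X, y):
    """Two-class Fisher direction w ~ Sw^{-1} (mu1 - mu0) and its discriminant value."""
    X0, X1 = X[y == 0], X[y == 1]
    Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)
    w = np.linalg.solve(Sw + 1e-6 * np.eye(X.shape[1]), X1.mean(0) - X0.mean(0))
    w /= np.linalg.norm(w)
    between = (w @ (X1.mean(0) - X0.mean(0))) ** 2
    within = w @ Sw @ w
    return w, between / within

def iterative_fisher_reduction(X, y):
    """Repeatedly extract a Fisher direction, deflate the data onto the orthogonal
    complement of that direction, and keep directions whose Fisher index exceeds
    mean + 2 * std (the threshold used by the baseline above)."""
    Xr = X.astype(float).copy()
    directions, scores = [], []
    for _ in range(X.shape[1]):
        w, score = fisher_direction(Xr, y)
        directions.append(w)
        scores.append(score)
        Xr = Xr - np.outer(Xr @ w, w)          # remove the component along w
    scores = np.array(scores)
    keep = scores > scores.mean() + 2 * scores.std()
    return [d for d, k in zip(directions, keep) if k]
```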

In order to test the effectiveness, we applied two different classification algorithms on the data sets. The specific classifiers tested were the C4.5 and the nearest neighbor algorithms. The following different approaches were tested:

(1) Utilizing individual classifiers on the reduced data for global dimensionality reduction.
(2) Utilizing individual classifiers for local dimensionality reduction. In this case, the same classifier was used for every local segment. Thus, the approach benefits from the use of different training models on the different segments, which are quite heterogeneous because of the nature of the supervised partitioning process.
(3) Utilizing the decomposable classification process on the data. This approach benefits not only from the use of different training models, but also from the use of different classifiers on the different nodes.

In Table 2, we have illustrated the effectiveness of the different classifiers on the data sets. The first two columns report the accuracy of the nearest neighbor and C4.5 classifiers on the full dimensional data sets. The next two columns report the accuracy when the data was reduced with Fisher's discriminant. The following two columns report the accuracy obtained by using local training on each of the nodes with a particular classifier. The final column reports the accuracy of the decomposable classification process. We make the following observations from the results:

(1) Neither of the two classifiers performed as effectively on the full dimensional data as it did on the reduced data sets. This is quite natural, since the reduction procedure was able to remove the noise which was irrelevant to the classification process.
(2) Both classifiers showed superior classification accuracy when the data was reduced with the subspace sampling approach as compared to the Fisher method.
(3) The decomposable classifier always performed more effectively than any combination of classifier and reduction technique.

An important factor to be kept in mind is that the nearest neighbor and C4.5 classifiers did not show consistent performance across different data sets.


Data Set              NN (Full)  C4.5 (Full)  NN (Fish.)  C4.5 (Fish.)  NN (Subsp.)  C4.5 (Subsp.)  Decomposable Classifier
Forest Cover          63.3%      60.7%        63.1%       61.3%         68.5%        67.3%          71.2%
Keyword1              53.2%      47.1%        56.7%       54.5%         65.3%        63.4%          69.3%
Keyword2              50.1%      43.4%        52.1%       51.3%         61.5%        60.7%          64.5%
C35.I6.D100K.P100     83.4%      84.3%        82.5%       84.7%         86.8%        86.5%          89.5%
C40.I6.D100K.P100     74.2%      75.1%        73.3%       76.7%         80.3%        80.9%          84.4%

Table 2: Effectiveness of Classifiers on Different Data Sets

[Figure 6: Efficiency results. (a) Relative running time vs. data dimensionality for C35.I6.D100K.Px, (b) relative running time vs. data dimensionality for C40.I6.D100K.Px, and (c) relative running time vs. data size for C35.I6.Dx.P100, comparing hierarchical subspace sampling with the Fisher method.]

In some data sets the C4.5 classifier was better, whereas in others the nearest neighbor technique was better. Furthermore, the application of a particular kind of data reduction method affected the relative behavior of the two classifiers. This is undesirable from the perspective of a classification task, since it makes it more difficult to determine the best possible classification algorithm for each particular data set. On the other hand, the decomposable classification approach consistently outperformed all combinations of classifier and data reduction methods.

A second area of examination was the reduction factor provided by each of the methods on the data. The reduction factor was defined as the ratio of the reduced data size to the original data size. For the subspace sampling method, this reduced data contained both the subspace tree and the database records. In Table 3, we have illustrated the corresponding reduction factors for each of the methods. It is clear that the supervised subspace sampling approach provides considerably smaller reduction factors, and therefore more compact representations, than the Fisher discriminant. We have already shown that the classification accuracy is much better when the reduced data representation is generated using supervised subspace sampling.

Data Set              Reduc. Factor (Fisher)   Reduc. Factor (Subsp.)
Forest Cover          0.204                    0.11
Keyword1              0.167                    0.091
Keyword2              0.163                    0.087
C35.I6.D100K.P100     0.19                     0.081
C40.I6.D100K.P100     0.22                     0.094

Table 3: Reduction Ratios of Different Methods

The fact that the more compact representation of the subspace sampling technique provides better accuracy indicates that it is more effective at picking those subspaces which are most discriminatory for the classification process.

We used synthetic data sets for illustrating scalability trends with varying data size and dimensionality. Both the Fisher method and the subspace sampling method were applied to generations of the data of varying dimensionality corresponding to the synthetic data sets C35.I6.D100K.Px and C40.I6.D100K.Px. This provided an idea of the scalability of the algorithms with increasing data dimensionality. The results are illustrated in Figures 6(a) and 6(b). It is clear that the subspace sampling method was always more efficient than the Fisher data reduction method, and the performance gap increased with data dimensionality. This is because the subspace sampling method requires simple sampling computations which scale almost linearly with dimensionality.


[Figure 7: Efficiency vs. Data Size (C40.I6.Dx.P100): relative running time of hierarchical subspace sampling and the Fisher method.]

On the other hand, the running times of the Fisher discriminant method rapidly increase with dimensionality because of the costly computation of finding optimal axis directions. We also tested the two methods for scalability with increasing data size. In this case, we used samples of varying sizes of the data sets C35.I6.D100K.P100 and C40.I6.D100K.P100. The results are illustrated in Figures 6(c) and 7 respectively. It is clear that both methods scaled linearly with data size, though the subspace sampling technique consistently outperformed the Fisher method.

5 Conclusions and Summary

In this paper we have proposed an effective local dimensionality reduction method for the supervised domain. Most current dimensionality reduction methods such as SVD are designed only for the unsupervised domain. The aim of dimensionality reduction in the supervised domain is to create a new axis-system so that the discriminatory characteristics of the data are retained, and the classification accuracy is improved. Methods such as linear discriminant analysis turn out to be computationally intensive in practice, and often cannot be efficiently applied to large data sets. Furthermore, the global approach of these techniques often makes the methods ineffective in practice. The supervised subspace sampling approach uses local data reduction in which the reduction of a data point depends upon the class distributions in its locality. As a result, the supervised subspace sampling technique allows a natural decomposition of the data so that the implicit dimensionality of each data locality is minimized. This improves the effectiveness of the reduction process, while retaining the efficiency of a sampling based technique. The reduction process improves the accuracy of the classification process because of the removal of the irrelevant axis directions in the data. In addition, the supervised reduction process naturally facilitates the construction of a decomposable classifier which is able to provide much better classification accuracy than a global data reduction process. Thus, the improved efficiency, compression quality and classification accuracy of this reduction method make it an attractive approach for a number of real data domains.

References

[1] C. C. Aggarwal, Hierarchical Subspace Sampling: A Unified Framework for High Dimensional Reduction, Selectivity Estimation, and Nearest Neighbor Search, ACM SIGMOD Conference, (2002), pp. 452–463.
[2] C. C. Aggarwal, and P. S. Yu, Finding Generalized Projected Clusters in High Dimensional Spaces, ACM SIGMOD Conference, (2000), pp. 70–81.
[3] R. Agrawal, and R. Srikant, Fast Algorithms for Mining Association Rules in Large Databases, VLDB Conference, (1994), pp. 487–499.
[4] L. Breiman, Bagging Predictors, Machine Learning, 24 (1996), pp. 123–140.
[5] K. Chakrabarti, and S. Mehrotra, Local Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces, VLDB Conference, (2000), pp. 89–100.
[6] S. Chakrabarti, S. Roy, and M. V. Soundalgekar, Fast and Accurate Text Classification via Multiple Linear Discriminant Projections, VLDB Conference, (2002), pp. 658–669.
[7] T. Cooke, Two variations on Fisher's linear discriminant for pattern recognition, PAMI, 24(2) (2002), pp. 268–273.
[8] R. Duda, and P. Hart, Pattern Classification and Scene Analysis, Wiley, NY, (1973).
[9] C. Faloutsos, and K.-I. Lin, FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets, ACM SIGMOD Conference, (1995), pp. 163–174.
[10] K. Hoffman, and R. Kunze, Linear Algebra, Prentice Hall, NJ, (1998).
[11] P. Indyk, and R. Motwani, Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality, ACM STOC Proceedings, (1998), pp. 604–613.
[12] W. Johnson, and J. Lindenstrauss, Extensions of Lipschitz mapping into a Hilbert space, Conference in modern analysis and probability, (1984), pp. 189–206.
[13] I. T. Jolliffe, Principal Component Analysis, Springer-Verlag, New York, (1986).
[14] J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, (1993).
[15] K. V. Ravi Kanth, D. Agrawal, and A. Singh, Dimensionality Reduction for Similarity Searching in Dynamic Databases, ACM SIGMOD Conference, (1998), pp. 166–176.
