
Nonlinear Dimension Reduction and Visualization of Labeled Data

Kerstin Bunte¹, Barbara Hammer², and Michael Biehl¹

¹ University of Groningen, Mathematics and Computing Science, 9700 AK Groningen, The Netherlands

² Clausthal University of Technology, Institute of Computer Science, D-38678 Clausthal-Zellerfeld, Germany

Abstract. The amount of electronic information as well as the size and dimensionality of data sets have increased tremendously. Consequently, dimension reduction and visualization techniques have become increasingly popular in recent years. Dimension reduction is typically connected with a loss of information. In supervised classification problems, class labels can be used to minimize the loss of information concerning the specific task. The aim is to preserve and potentially enhance the discrimination of classes in lower dimensions. Here we propose a prototype-based local relevance learning scheme that results in an efficient nonlinear discriminative dimension reduction of labeled data sets. The method is introduced and discussed in terms of artificial and real world data sets.

1 Introduction

Dimension reduction techniques aim at finding a smaller set of features by reducing or eliminating redundancies. From a theoretical point of view the “curse of dimensionality” causes many difficulties in high-dimensional spaces, such that dimension reduction constitutes a valuable tool to deal with these problems [1].

In the last decades an enormous number of unsupervised dimension reduction methods has been proposed. In general, unsupervised dimension reduction is an ill-posed problem, since a clear specification of which properties of the data should be preserved is missing. Standard criteria, for instance the distance measure employed for neighborhood assignment, may turn out to be unsuitable for a given data set, and the relevant information often depends on the situation at hand.

If data labeling is available, the aim of dimension reduction can be defined clearly: the preservation of the classification accuracy in a reduced feature space. Examples of supervised linear dimension reducers are Generalized Matrix Learning Vector Quantization (GMLVQ) [2] and Linear Discriminant Analysis (LDA) [3]. Often, however, the classes cannot be separated by a linear classifier, while a nonlinear data projection better preserves the relevant information. Examples of nonlinear discriminative visualization techniques include an extension of the Self-Organizing Map (SOM) incorporating class labels [4]. Further supervised dimension reduction techniques are described in [5,6].

In this contribution we propose a discriminative visualization scheme which is based on an extension of Learning Vector Quantization and relevance learning.

X. Jiang and N. Petkov (Eds.): CAIP 2009, LNCS 5702, pp. 1162–1170, 2009. © Springer-Verlag Berlin Heidelberg 2009


2 Supervised Nonlinear Dimension Reduction

For general data sets a global linear reduction to lower dimensions may not be powerful enough to preserve the information relevant for classification. In [1] it is argued that the combination of several local linear projections into a nonlinear mapping can yield promising results. We use this concept and learn local linear low-dimensional projections from labeled data. As an alternative to the direct use of the local linear patches, it is also possible to merge them into a global nonlinear embedding with a charting technique to obtain a smoother nonlinear projection. The following subsections give a short overview of the algorithms.

Localized LiRaM LVQ. Learning vector quantization (LVQ) [7] constitutes a successful class of heuristic, prototype-based classification algorithms. LVQ is intuitive, interpretable, fast, and easy to implement. It is distance based, and a key issue is the selection of a suitable dissimilarity measure. However, the most frequent choice, i.e. the standard Euclidean distance, is not necessarily suitable. Therefore, relevance learning schemes have been suggested which adapt more general metrics in the training process [8,9]. Recent extensions parameterize the distance measure in terms of a relevance matrix, the rank of which may be controlled explicitly. The algorithm suggested in [2] can be employed for linear dimension reduction and visualization of labeled data. The local linear version presented here provides the ability to learn local low-dimensional projections and to combine them into a nonlinear global embedding. We consider training data $\mathbf{x}_i \in \mathbb{R}^N$, $i = 1, \ldots, S$, with labels $y_i$ corresponding to one of $C$ classes, respectively. A data point $\mathbf{x}_i$ is assigned to the class of the closest prototype $\mathbf{w}_j$ with $d^{\Lambda_j}(\mathbf{w}_j, \mathbf{x}_i) \le d^{\Lambda_k}(\mathbf{w}_k, \mathbf{x}_i)$ for all $k \neq j$. During the training process LVQ adapts $l$ prototypes $\mathbf{w}_j \in \mathbb{R}^N$ with class labels $c(\mathbf{w}_j) \in \{1, \ldots, C\}$ to represent the classification as accurately as possible. Generalized LVQ (GLVQ) [10] adapts the prototypes by minimizing the cost function

$$E = \sum_{i=1}^{S} \Phi\!\left( \frac{d^{\Lambda_J}(\mathbf{w}_J, \mathbf{x}_i) - d^{\Lambda_K}(\mathbf{w}_K, \mathbf{x}_i)}{d^{\Lambda_J}(\mathbf{w}_J, \mathbf{x}_i) + d^{\Lambda_K}(\mathbf{w}_K, \mathbf{x}_i)} \right), \qquad (1)$$

where $\mathbf{w}_J$ ($\mathbf{w}_K$) denotes the closest prototype with the same (a different) class label as $\mathbf{x}_i$, and $\Phi$ refers to a monotonic function, e.g. the logistic function or the identity; the latter is used in our experiments. Learning can take place by means of a stochastic gradient descent on the cost function $E$ of Eq. (1); for details see [2].
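
To make Eq. (1) concrete, the following minimal NumPy sketch evaluates the cost for a fixed set of prototypes (function and variable names are ours, not from [2]; the identity is used for Φ, and the plain squared Euclidean distance stands in for the adaptive metric introduced below):

    import numpy as np

    def glvq_cost(X, y, W, w_labels, phi=lambda m: m):
        """Eq. (1): sum of Phi((d_J - d_K) / (d_J + d_K)) over all samples,
        where d_J (d_K) is the distance to the closest prototype carrying
        the same (a different) label as x_i."""
        X, W = np.asarray(X, float), np.asarray(W, float)
        y, w_labels = np.asarray(y), np.asarray(w_labels)
        E = 0.0
        for x, label in zip(X, y):
            d = np.sum((W - x) ** 2, axis=1)   # squared Euclidean distances
            d_J = d[w_labels == label].min()
            d_K = d[w_labels != label].min()
            E += phi((d_J - d_K) / (d_J + d_K))
        return float(E)

The actual training adapts the prototypes (and, in LGMLVQ, the metric) by stochastic gradient descent on this quantity.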

The localized generalized matrix LVQ (LGMLVQ) substitutes the squared Euclidean distance by a more complex dissimilarity measure which can take into account arbitrary pairwise correlations of features. This metric

$$d^{\Lambda_j}(\mathbf{w}_j, \mathbf{x}_i) = (\mathbf{x}_i - \mathbf{w}_j)^\top \Lambda_j (\mathbf{x}_i - \mathbf{w}_j) \qquad (2)$$

is defined through an adaptive, symmetric, and positive semi-definite matrix $\Lambda_j \in \mathbb{R}^{N \times N}$ locally attached to each prototype $\mathbf{w}_j$. By setting $\Lambda_j = \Omega_j^\top \Omega_j$, semi-definiteness and symmetry are guaranteed. $\Omega_j \in \mathbb{R}^{M \times N}$ with arbitrary $M \le N$ transforms the data locally to an $M$-dimensional feature space. It can be shown that the adaptive distance $d^{\Lambda_j}(\mathbf{w}_j, \mathbf{x}_i)$ of Eq. (2) equals the squared Euclidean distance in the transformed space, $d^{\Lambda_j}(\mathbf{w}_j, \mathbf{x}_i) = [\Omega_j(\mathbf{x}_i - \mathbf{w}_j)]^2$. The target dimension $M$ must be chosen in advance, either by intrinsic dimension estimation or as suitable for the given task. For visualization purposes, a value of two or three is usually appropriate. We will refer to this algorithm as Limited Rank Matrix LVQ (LiRaM LVQ). After each training epoch (sweep through the training set) the matrices are normalized to $\sum_i [\Lambda_j]_{ii} = 1$ in order to prevent degeneration. An additional regularization term in the cost function, proportional to $-\ln(\det(\Omega_j \Omega_j^\top))$, can be used to enforce full rank $M$ of the relevance matrices and to prevent oversimplification effects, see [11]. At the end of the learning process the algorithm provides a set of prototypes $\mathbf{w}_j$, their labels $c(\mathbf{w}_j)$, and corresponding projections $\Omega_j$. A low-dimensional embedding of each data point $\mathbf{x}_i$ can then be defined by $P_j(\mathbf{x}_i) = \Omega_j \mathbf{x}_i$, using the projection $\Omega_j$ of its closest prototype $\mathbf{w}_j$, i.e. with $d^{\Lambda_j}(\mathbf{w}_j, \mathbf{x}_i) = \min_k d^{\Lambda_k}(\mathbf{w}_k, \mathbf{x}_i)$. For smoother visualizations the outcome of the classifier can also be mapped with a charting step.
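
A minimal sketch of this embedding rule, assuming the prototypes $\mathbf{w}_j$ and projections $\Omega_j$ have already been learned (helper names are ours):

    import numpy as np

    def local_embedding(X, prototypes, Omegas):
        """Assign each sample to its closest prototype under the local metric
        d_Lambda_j(w_j, x) = ||Omega_j (x - w_j)||^2 and embed it with the
        winner's projection, P_j(x) = Omega_j x."""
        X = np.asarray(X, float)
        embedded = np.empty((len(X), Omegas[0].shape[0]))
        winner = np.empty(len(X), dtype=int)
        for i, x in enumerate(X):
            d = [np.sum((Om @ (x - w)) ** 2) for w, Om in zip(prototypes, Omegas)]
            j = int(np.argmin(d))
            winner[i] = j
            embedded[i] = Omegas[j] @ x
        return embedded, winner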

Charting. The charting technique introduced in [12] provides a framework for unsupervised dimension reduction by decomposing the sample data into locally linear patches and combining them into a single low-dimensional coordinate system. For nonlinear dimension reduction we use the low-dimensional local linear projections $P_j(\mathbf{x}_i) \in \mathbb{R}^M$ for every data point $\mathbf{x}_i$ provided by localized LiRaM LVQ and apply only the second step of the charting method to combine them. The local projections $P_j(\mathbf{x}_i)$ are weighted by their responsibilities $r_{ji}$ for data point $\mathbf{x}_i$. Here we choose the responsibilities

$$r_{ji} \propto \exp\!\left( -(\mathbf{x}_i - \mathbf{w}_j)^\top \Lambda_j (\mathbf{x}_i - \mathbf{w}_j) / \sigma_j \right), \qquad (3)$$

with normalization $\sum_j r_{ji} = 1$ and an appropriate bandwidth $\sigma_j > 0$. We set $\sigma_j$ to a fraction of the Euclidean distance to the nearest projected prototype,

$$\sigma_j = a \cdot \min_{k \neq j} [\Omega_j \mathbf{w}_j - \Omega_k \mathbf{w}_k]^2 \quad \text{with} \quad 0 < a \le 0.5. \qquad (4)$$

The charting technique finds affine transformations $B_j: \mathbb{R}^M \to \mathbb{R}^M$ of the local coordinates $P_j$, such that the resulting points coincide on overlapping parts as much as possible in a least-squares sense. An analytical solution can be found in terms of a generalized eigenvalue problem, which leads to a global embedding in $\mathbb{R}^M$. We refer to [12] for further details.
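
A possible NumPy sketch of the responsibilities of Eqs. (3) and (4) (naming is ours; the subsequent affine alignment step of [12] is not shown):

    import numpy as np

    def responsibilities(X, prototypes, Lambdas, Omegas, a=0.4):
        """r_ji of Eq. (3), normalized over j, with bandwidths from Eq. (4)."""
        X = np.asarray(X, float)
        proj = [Om @ w for w, Om in zip(prototypes, Omegas)]
        sigma = np.array([
            a * min(np.sum((proj[j] - proj[k]) ** 2)
                    for k in range(len(proj)) if k != j)
            for j in range(len(proj))
        ])
        R = np.empty((len(prototypes), len(X)))
        for j, (w, Lam) in enumerate(zip(prototypes, Lambdas)):
            diff = X - w
            d = np.einsum('id,de,ie->i', diff, Lam, diff)   # (x-w)^T Lambda (x-w)
            R[j] = np.exp(-d / sigma[j])
        return R / R.sum(axis=0, keepdims=True)             # sum_j r_ji = 1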

3 Unsupervised Nonlinear Dimension Reduction

We will compare this locally linear discriminative projection technique with some well-known unsupervised projection techniques which are based on different projection criteria.

Isomap. [13] is an extension of metric Multi-Dimensional Scaling (MDS) and uses distance preservation as the criterion for dimension reduction. Whereas metric MDS frequently employs the Euclidean metric to compute the pairwise distances, Isomap incorporates the so-called graph distances as an approximation of the geodesic distances. The weighted neighborhood graph is constructed by connecting points $i$ and $j$ if their distance is smaller than $\varepsilon$ ($\varepsilon$-Isomap), or if $i$ is one of the $K$ nearest neighbors of $j$ ($K$-Isomap). Isomap is guaranteed to find the global optimum of its error function in closed form. The approximation of the geodesic distances may be very rough, and its quality depends on the number of data points, the noise, and the parameters ($\varepsilon$ or $K$). For details see [13]. For the quantitative analysis we additionally compare the results of L-Isomap [14], which focuses on a small subset of the data, called the landmark points.
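
For reproduction purposes, a K-Isomap embedding can be computed, for instance, with scikit-learn (a tooling assumption on our part; K = 35 is the neighborhood size used for the 3 Tip Star data in Sec. 4):

    from sklearn.manifold import Isomap

    isomap = Isomap(n_neighbors=35, n_components=2)   # K-Isomap with K = 35
    X_iso = isomap.fit_transform(X)                   # X: (n_samples, n_features)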

Locally Linear Embedding. (LLE) [15] uses topology preservation as the criterion for dimension reduction; LLE aims at the preservation of local angles. The first step of the LLE algorithm is the determination of a number of neighbors for each data point, either by choosing the $K$ nearest neighbors or all neighbors inside an $\varepsilon$-ball around the point. The idea is to reconstruct each point by a linear combination of its neighbors and to project the data points such that this local representation of the data is preserved as much as possible. An advantage of this method is its elegant theoretical foundation, which allows an analytical solution. From a computational point of view, LLE requires the solution of an $S$-by-$S$ eigenproblem, with $S$ being the number of data points. As reported in [16], the parameters must be tuned carefully; see [15] for further details.
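
A corresponding scikit-learn call (again a tooling assumption, not part of the original experiments):

    from sklearn.manifold import LocallyLinearEmbedding

    lle = LocallyLinearEmbedding(n_neighbors=35, n_components=2)
    X_lle = lle.fit_transform(X)   # solves the S-by-S eigenproblem internally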

Stochastic Neighbor Embedding. (SNE) [17] is closely related to Isotop [18], which overcomes some limitations of the Self-Organizing Map (SOM) by separating the vector quantization and the dimensionality reduction into two steps. SNE follows a probabilistic approach to map high-dimensional data vectors into a low-dimensional space while preserving the neighbor identities. Like Isotop, it centers a Gaussian kernel on each data point to be embedded. The algorithm optimizes how well the probability distribution over all potential neighbors is approximated when the same operation is performed on the low-dimensional representation of the data points. The minimization of the objective function is difficult and may get stuck in local minima. Details can be found in [17]. In the quantitative analysis we additionally compare the results of t-Distributed Stochastic Neighbor Embedding (t-SNE) [19], which uses a Student-t distribution rather than a Gaussian, and a different cost function.
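
The t-SNE variant used in the comparison is also available in scikit-learn (tooling assumption; perplexity 30 is the setting reported in Table 1):

    from sklearn.manifold import TSNE

    tsne = TSNE(n_components=2, perplexity=30.0)
    X_tsne = tsne.fit_transform(X)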

4 Experiments

In this section we will compare the described dimension reduction techniques on two different data sets: an artificial data set and the segmentation data set from the UCI repository [20]. For visual comparison we reduce the dimension in both cases to two.

3 Tip Star. This data set consists of 3000 samples in $\mathbb{R}^{10}$ with two classes (C1 and C2), each arranged in three clusters (see Fig. 1, top left). The first two dimensions contain the class information, whereas the remaining eight dimensions contribute high-variance noise.

Fig. 1. Upper left: the two informative dimensions of the original 3 Tip Star data set. Upper right: projection with LiRaM LVQ. Bottom: nonlinear projection based on the same LiRaM LVQ projections from the upper right panel, combined with charting. (Clusters 1–6, classes C1 and C2, and the cluster centers are marked.)

Localized LiRaM LVQ was trained for $t = 500$ epochs with three prototypes per class. Each of the prototypes was initialized close to one of the cluster centers. The learning rate of the prototypes is set to $\alpha_1(t) = 0.01/(1 + (t-1) \cdot 0.001)$, and the metric learning starts at epoch $t = 50$ with a learning rate of $\alpha_2(t) = 0.001/(1 + (t-50) \cdot 0.0001)$. We run the localized LiRaM LVQ 10 times; one result of the locally projected data is shown in Fig. 1, top right. Note that the aim of the LiRaM LVQ algorithm is not to preserve any topology or distances, but to find projections which separate the classes as much as possible. Hence clusters four and six merge, because they carry the same class label. Nevertheless, the different orientations and appearances of all six clusters are still visible. The bottom panel in Fig. 1 shows the combination of the local projections from the upper right panel after the charting step. Here the parameter $a$ for $\sigma_j$ (Eq. (4)), which fixes the responsibilities for the local projections $P_j$, is set to 0.4 (found by cross-validation over values in $[0.1, 0.5]$). The invariances inherited from the local linear projections of the LiRaM LVQ algorithm and from the eigenvalue problem in the charting step lead to a flipped version of the original data, in which clusters six and three are separated vertically but not horizontally. Fig. 2 shows the results of other dimension reduction methods on this data set. Principal Component Analysis (PCA) leads to results very similar to MDS for this problem; the classes are not well separated in two of the three modes.
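
For clarity, the two learning rate schedules can be written down directly (simple helpers; the guard for $t < 50$ reflects that metric adaptation only starts at epoch 50):

    def alpha1(t):
        """Prototype learning rate; epochs are counted from t = 1."""
        return 0.01 / (1 + (t - 1) * 0.001)

    def alpha2(t):
        """Metric learning rate; metric adaptation starts at epoch t = 50."""
        return 0.001 / (1 + (t - 50) * 0.0001) if t >= 50 else 0.0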

Fig. 2. Unsupervised projections of the 3 Tip Star data set obtained with various methods (panels: PCA, Isomap, SNE, LLE; classes C1 and C2 are marked).

The other three panels show the results for SNE, and for Isomap and LLE with $K = 35$ neighbors each. Obviously, hardly any class structure is preserved in these projections. Note that, due to the presence of only two classes, standard linear discriminant analysis (LDA) yields a projection to one dimension only. Table 1 shows the Nearest Neighbor (NN) error on the projected data for the unsupervised methods and the mean NN error of the LVQ-based projections averaged over all 10 runs. The NN error of the LiRaM LVQ mapping shown in Fig. 1 is 0.06, and 0.09 with the charting step. We also tried kernel PCA with a Gaussian kernel and 9 equidistant variances $\sigma$ from the interval $[1, 5]$, as well as L-Isomap and t-SNE. The best results are included in Table 1.
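
The kernel PCA scan could, for example, be reproduced as follows (tooling assumption; translating the kernel width $\sigma$ into scikit-learn's gamma parameter as $\gamma = 1/(2\sigma^2)$ is our reading of the Gaussian kernel):

    import numpy as np
    from sklearn.decomposition import KernelPCA

    for sigma in np.linspace(1.0, 5.0, 9):        # 9 equidistant widths in [1, 5]
        kpca = KernelPCA(n_components=2, kernel='rbf',
                         gamma=1.0 / (2.0 * sigma ** 2))
        X_2d = kpca.fit_transform(X)
        # ... evaluate the nearest-neighbor error of X_2d here ...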

Segmentation. The segmentation data set (available at the UCI repository [20]) consists of 19 features which have been constructed from regions of 3×3 pixels, randomly drawn from a set of 7 manually segmented outdoor images. Every sample is assigned to one of seven classes: brickface, sky, foliage, cement, window, path, and grass (referred to as C1, ..., C7). The set consists of 210 training points with 30 instances per class, and the test set comprises 300 instances per class, resulting in 2310 samples in total. We did not use features 3, 4, and 5, because they display zero variance over the data set.
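
Dropping the constant features can be done with a simple variance filter, e.g. (tooling assumption, with X holding the raw 19-dimensional samples):

    from sklearn.feature_selection import VarianceThreshold

    selector = VarianceThreshold(threshold=0.0)   # removes zero-variance features
    X_reduced = selector.fit_transform(X)         # here: drops features 3, 4 and 5
    kept = selector.get_support()                 # boolean mask of retained columns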

Table 1. Nearest neighbor errors on the mapped 3 Tip Star data set

  LiRaM LVQ                              0.12
  PCA                                    0.29
  SNE                                    0.46
  kernel PCA (Gaussian kernel, σ = 4.5)  0.41
  Isomap                                 0.41
  LLE                                    0.50
  L-Isomap (20% landmarks, K = 35)       0.39
  t-SNE (perplexity 30)                  0.41


Fig. 3. Left: nonlinear supervised two-dimensional projection of the segmentation data set with LiRaM LVQ (prototypes marked). Right: supervised two-dimensional projection of the same data with LDA. Classes C1–C7 are marked.

We use the same parameter settings as specified in the previous section. We set the number of neighbors K for Isomap and LLE to 108, according to the connectivity of the neighborhood graph. One example result of the localized LiRaM LVQ, with a Nearest Neighbor error of 0.07, is shown in the left panel of Fig. 3. For this seven-class problem a supervised dimension reduction with LDA is also possible; the right panel shows the corresponding result. In particular, the classes C4 and C6 appear to be well separated in the LVQ-based approach, whereas they fall together in LDA. We observed that Generalized Discriminant Analysis (GDA) [21], using Gaussian kernels with 19 equidistant variances in the interval [1, 10] or polynomial kernels with powers three to 10 and offset values between zero and 10, performed considerably worse than LDA (see Table 2). Fig. 4 shows the results of the other dimension reduction techniques. Again, PCA and MDS lead to nearly identical results, isolating only one class: C2. For Isomap and LLE, even with the large number of neighbors K = 108, unsatisfactory results are obtained. SNE yields the best result among the unsupervised techniques, but some classes scatter in a circle around the zoomed area shown here.

For a quantitative analysis of the results obtained by the different methods, we compute the leave-one-out estimate of the Nearest Neighbor (NN) classification error on the mapped segmentation data. The NN error of the localized LiRaM LVQ mapping is averaged over 10 random initializations of the algorithm. Additionally, we evaluate GDA, t-SNE, and L-Isomap and list their best results together with the NN errors of all methods in Table 2. t-SNE, GDA, PCA, and LLE show the worst results, with errors between 84% and 33%, followed by Isomap and L-Isomap with 27% and 25%. The supervised method LDA also does not perform satisfactorily, with an error of about 20%. SNE and localized LiRaM LVQ achieve the best mean errors, with 11% and 9%, respectively.
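
The leave-one-out NN error can be computed directly from the embedded coordinates; a small sketch of our own (not the authors' evaluation code):

    import numpy as np

    def loo_nn_error(X_emb, y):
        """Leave-one-out nearest-neighbor error: each sample is classified by
        the label of its nearest neighbor among all other embedded samples."""
        X_emb, y = np.asarray(X_emb, float), np.asarray(y)
        D = np.sum((X_emb[:, None, :] - X_emb[None, :, :]) ** 2, axis=-1)
        np.fill_diagonal(D, np.inf)        # a sample may not be its own neighbor
        nearest = np.argmin(D, axis=1)
        return float(np.mean(y[nearest] != y))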


Fig. 4. Unsupervised projections of the segmentation data set obtained with various methods (panels: PCA, Isomap, LLE, SNE; classes C1–C7 are marked). To show the structure of most samples, the Isomap and SNE panels are zoomed; some samples spread widely.

Table 2. Nearest neighbor errors on the mapped segmentation data set

  LiRaM LVQ                              0.09
  Isomap                                 0.27
  SNE                                    0.11
  LDA                                    0.20
  PCA                                    0.31
  LLE                                    0.33
  GDA (polynomial, power 3, offset 6)    0.70
  t-SNE (perplexity 40)                  0.84
  L-Isomap (20% landmarks, K = 108)      0.25

5 Conclusion

We proposed a supervised discriminative nonlinear dimension reduction technique based on a prototype-based classifier with adaptive distances and charting. Compared to other state-of-the-art methods, it shows promising results in two examples. Unlike LDA, this method provides a nonlinear embedding of the data. Its complexity is linear in the number of examples, which is an advantage especially in comparison with methods based on the construction of a neighborhood graph. The combination with a prototype-based learning scheme additionally offers the possibility of data compression by embedding the prototypes. This is especially interesting for the processing of huge data sets. For the localized LiRaM LVQ combined with the charting we observe a small but non-negligible loss of classification accuracy, which is due to the charting step. We will address the optimization of the latter in a forthcoming project.

Acknowledgment. This work was supported by the “Nederlandse organisatie voor Wetenschappelijke Onderzoek (NWO)” under project code 612.066.620.


References

1. Van der Maaten, L.J.P., Postma, E.O., Van den Herik, H.J.: Dimensionality Reduction: A Comparative Review (2007), http://ticc.uvt.nl/~lvdrmaaten/Laurens_van_der_Maaten/Matlab_Toolbox_for_Dimensionality_Reduction_files/Paper.pdf

2. Bunte, K., Schneider, P., Hammer, B., Schleif, F.-M., Villmann, T., Biehl, M.: Discriminative Visualization by Limited Rank Matrix Learning. Machine Learning Reports 2, 37–51 (2008), http://www.uni-leipzig.de/~compint/mlr/mlr_03_2008.pdf

3. Fukunaga, K.: Introduction to Statistical Pattern Recognition. Academic Press, New York (1990)

4. Villmann, T., Hammer, B., Schleif, F.M., Geweniger, T., Hermann, W.: Fuzzy classification by fuzzy labeled neural gas. Neural Networks 19(6-7), 772–779 (2006)

5. Kontkanen, P., Lahtinen, J., Myllymaki, P., Silander, T., Tirri, H.: Supervised model-based visualization of high-dimensional data. Intell. Data Anal. 4(3,4), 213–227 (2000)

6. Iwata, T., Saito, K., Ueda, N., Stromsten, S., Griffiths, T.L., Tenenbaum, J.B.: Parametric Embedding for Class Visualization. Neural Comp. 19(9), 2536–2556 (2007)

7. Kohonen, T.: Self-Organizing Maps, 2nd edn. Springer, Heidelberg (1997)

8. Hammer, B., Villmann, T.: Generalized relevance learning vector quantization. Neural Networks 15(8-9), 1059–1068 (2002)

9. Schneider, P., Biehl, M., Hammer, B.: Relevance Matrices in LVQ. In: Proc. of the European Symposium on Artificial Neural Networks (ESANN), pp. 37–42 (2007)

10. Sato, A.S., Yamada, K.: Generalized learning vector quantization. In: NIPS, vol. 8, pp. 423–429 (1996)

11. Schneider, P., Bunte, K., Hammer, B., Villmann, T., Biehl, M.: Regularization in matrix relevance learning. Machine Learning Reports 2, 19–36 (2008), http://www.uni-leipzig.de/~compint/mlr/mlr_02_2008.pdf

12. Brand, M.: Charting a manifold. In: NIPS, vol. 15, pp. 961–968 (2003)

13. Tenenbaum, J.B., de Silva, V., Langford, J.C.: A global geometric framework for nonlinear dimensionality reduction. Science 290(5500), 2319–2323 (2000)

14. De Silva, V., Tenenbaum, J.B.: Global versus local methods in nonlinear dimensionality reduction. In: Advances in Neural Information Processing Systems, pp. 705–712. MIT Press, Cambridge (2002)

15. Roweis, S.T., Saul, L.K.: Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science 290(5500), 2323–2326 (2000)

16. Saul, L.K., Roweis, S.T.: Think globally, fit locally: unsupervised learning of nonlinear manifolds. Journal of Machine Learning Research 4, 119–155 (2003)

17. Hinton, G., Roweis, S.T.: Stochastic neighbor embedding. In: Advances in Neural Information Processing Systems, vol. 15, pp. 833–840 (2003)

18. Lee, J.A., Archambeau, C., Verleysen, M.: Locally linear embedding versus Isotop. In: 11th European Symposium on Artificial Neural Networks, pp. 527–534 (2003)

19. Van der Maaten, L.J.P., Hinton, G.E.: Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9, 2579–2605 (2008)

20. Newman, D.J., Hettich, S., Blake, C.L., Merz, C.J.: UCI Repository of machine learning databases. University of California, Department of Information and Computer Science (1998), http://archive.ics.uci.edu/ml/ (last visited 19.04.2008)

21. Baudat, G., Anouar, F.: Generalized Discriminant Analysis Using a Kernel Approach. Neural Computation 12(10), 2385–2404 (2000)

