
IEEE SIGNAL PROCESSING MAGAZINE, VOLUME 30, ISSUE 4, 2013

Kernel Multivariate Analysis Framework for Supervised Subspace Learning: A Tutorial on Linear and Kernel Multivariate Methods

Jerónimo Arenas-García, Senior Member, IEEE, Kaare Brandt Petersen, Gustavo Camps-Valls, Senior Member, IEEE, and Lars Kai Hansen

Abstract: Feature extraction and dimensionality reduction are important tasks in many fields of science dealing with signal processing and analysis. The relevance of these techniques is increasing as current sensory devices are developed with ever higher resolution, and problems involving multimodal data sources become more common. A plethora of feature extraction methods are available in the literature, collectively grouped under the field of Multivariate Analysis (MVA). This paper provides a uniform treatment of several methods: Principal Component Analysis (PCA), Partial Least Squares (PLS), Canonical Correlation Analysis (CCA) and Orthonormalized PLS (OPLS), as well as their non-linear extensions derived by means of the theory of reproducing kernel Hilbert spaces. We also review their connections to other methods for classification and statistical dependence estimation, and introduce some recent developments to deal with the extreme cases of large-scale and low-sized problems. To illustrate the wide applicability of these methods in both classification and regression problems, we analyze their performance in a benchmark of publicly available data sets, and pay special attention to specific real applications involving audio processing for music genre prediction and hyperspectral satellite images for Earth and climate monitoring.

I. INTRODUCTION

As sensory devices develop with ever higher resolution and the combination of diverse data sources is more common, feature extraction and dimensionality reduction become increasingly important in automatic learning. This is especially true in fields dealing with intrinsically high-dimensional signals, such as those acquired for image analysis, spectroscopy, neuroimaging, and remote sensing, but also for situations when many heterogeneous features are computed from a signal and stacked together for classification, clustering, or prediction.

Multivariate analysis (MVA) constitutes a family of methods for dimensionality reduction successfully used in several scientific areas [1]. The goal of MVA algorithms is to exploit correlations among the variables to find a reduced set of features that are relevant for the learning task. Among the most well-known MVA methods are Principal Component Analysis (PCA), Partial Least Squares (PLS), and Canonical Correlation Analysis (CCA). PCA disregards the target data and exploits correlations between the input variables to maximize the variance of the projections, while PLS and CCA look for projections that maximize, respectively, the covariance and the correlation between the features and the target data. Therefore, they should in principle be preferred to PCA for regression or classification problems. In this paper, we also consider a fourth MVA method known as Orthonormalized PLS (OPLS) that is also well-suited to supervised problems, with certain optimality in least-squares (LS) multiregression. A common advantage of these methods is that they can be formulated using standard linear algebra, and can be implemented as standard or generalized eigenvalue problems. Furthermore, implementations and variants of these methods have been proposed that operate either in a blockwise or iterative manner to improve speed or numerical stability.

No matter how refined the various methods of MVA are, they are still constrained to account for linear input-output relations. Hence, they can be severely challenged when features exhibit non-linear relations. In order to address these problems, non-linear versions of MVA methods have been developed, and these can be classified into two fundamentally different approaches [2]: 1) the modified methods in which the linear relations among the latent variables are substituted by non-linear relations [3], [4]; and 2) variants in which the algorithms are reformulated to fit a kernel-based approach [5]–[7]. In this paper, we will review the second approach, in which the input data is mapped by a non-linear function into a high-dimensional space where the ordinary MVA problems are stated. A central property of this kernel approach is the exploitation of the so-called kernel trick, by which the inner products in the transformed space are replaced with a kernel function working solely with input space data, so that the non-linear mapping never needs to be computed explicitly.

An appealing property of the resulting kernel algorithms is that they obtain the flexibility of non-linear expressions using straightforward methods from linear algebra. However, supervised kernel MVA methods are hampered in applications involving large datasets or a small number of labeled samples. Sparse and incremental versions have been presented to deal with the former problem, while the field of semisupervised learning has recently emerged for the latter. Remedies to these problems involve a particular kind of regularization, guided either by selection of a reduced number of basis functions or by considering the information about the manifold conveyed by the unlabeled samples. Both approaches will also be reviewed in this paper. Concretely, we aim:

1) To review linear and kernel MVA algorithms, providing their theoretical characterization and comparing their main properties under a common framework.

2) To present relations between kernel MVA and other feature extraction methods based on Fisher's discriminant analysis and nonparametric kernel dependence estimates.

3) To review sparse and semisupervised approaches that make the kernel variants practical for large-scale or undersampled labeled datasets. We will illustrate how these approaches overcome some of the difficulties that may have limited a more widespread use of the most powerful kernel MVA methods.

4) To illustrate the wide applicability of these methods, for which we consider several publicly available data sets, and two real scenarios involving audio processing for music genre prediction and hyperspectral satellite images for Earth and climate monitoring. Methods will be assessed in terms of accuracy and robustness to the number of extracted features.

TABLE I
ACRONYMS AND NOTATION USED IN THE PAPER.

| Acronym | Meaning | Symbol | Meaning |
|---|---|---|---|
| AR | Autoregressive | $l$ | Size of labelled dataset |
| ECMWF | European Centre for Medium-Range Weather Forecasts | $u$ | Number of unlabeled data |
| GMM | Gaussian Mixture Model | $n = l + u$ | Total number of training samples |
| HSIC | Hilbert-Schmidt Independence Criterion | $d$ | Dimension of input space |
| IASI | Infrared Atmospheric Sounding Interferometer | $m$ | Dimension of output space |
| (k)CCA | (Kernel) Canonical Correlation Analysis | $\mathbf{X}$ | Centered input data (size $l \times d$) |
| kFD | Kernel Fisher Discriminant | $\mathbf{Y}$ | Centered output data (size $l \times m$) |
| (k)MVA | (Kernel) Multivariate Analysis | $\mathbf{C}_x, \mathbf{C}_y$ | Input, output data sample covariance matrices |
| (k)OPLS | (Kernel) Orthonormalized Partial Least Squares | $\mathbf{C}_{xy}$ | Input-output sample cross-covariance matrix |
| (k)PCA | (Kernel) Principal Component Analysis | $\mathbf{X}^\dagger$ | Moore-Penrose pseudoinverse of matrix $\mathbf{X}$ |
| (k)PLS | (Kernel) Partial Least Squares | $\mathbf{W}$ | Regression coefficients matrix |
| MSE | Mean-square error | $\mathbf{I}$ | Identity matrix |
| LDA | Linear Discriminant Analysis | $\lVert\mathbf{A}\rVert_F$ | Frobenius norm of matrix $\mathbf{A}$ |
| LS | Least-squares | $n_f$ | Number of extracted features |
| OA | Overall Accuracy | $\mathbf{u}_i, \mathbf{v}_i$ | $i$th projection vector for the input, output data |
| RBF | Radial Basis Function | $\mathbf{U}, \mathbf{V}$ | $[\mathbf{u}_1, \ldots, \mathbf{u}_{n_f}]$, $[\mathbf{v}_1, \ldots, \mathbf{v}_{n_f}]$. Projection matrices |
| rkCCA | reduced complexity kCCA | $\mathbf{X}', \mathbf{Y}'$ | Extracted features for the input, output data |
| rkOPLS | reduced complexity kOPLS | $\mathcal{F}$ | Reproducing kernel Hilbert space |
| rkPCA | reduced complexity kPCA | $\boldsymbol{\phi}(\mathbf{x})$ | Mapping of $\mathbf{x}$ in feature space |
| RMSE | Root-mean-square error | $k(\mathbf{x}_i, \mathbf{x}_j)$ | $\langle \boldsymbol{\phi}(\mathbf{x}_i), \boldsymbol{\phi}(\mathbf{x}_j) \rangle_{\mathcal{F}}$. Kernel function |
| RTM | Radiative transfer models | $\boldsymbol{\Phi}$ | $[\boldsymbol{\phi}(\mathbf{x}_1), \ldots, \boldsymbol{\phi}(\mathbf{x}_l)]^\top$. Input data in feature space |
| skPLS | Sparse kPLS | $\mathbf{K}_x = \boldsymbol{\Phi}\boldsymbol{\Phi}^\top$ | Gram matrix |
| UCI | University of California, Irvine | $\mathbf{A}$ | $[\boldsymbol{\alpha}_1, \ldots, \boldsymbol{\alpha}_{n_f}]$. Coefficients for $\mathbf{U} = \boldsymbol{\Phi}^\top\mathbf{A}$ |

arXiv:1310.5089v1 [stat.ML] 18 Oct 2013

We continue the paper with a brief review of linear and kernel MVA algorithms, as well as connections to other methods. Then, Section III introduces some extensions that increase the applicability of kernel MVA methods in real applications. Section IV provides illustrative evidence of the methods' performance. Finally, we conclude the paper in Section V with some discussion and future lines of research.

II. MULTIVARIATE ANALYSIS IN REPRODUCING KERNEL HILBERT SPACES

This section reviews the framework of MVA both in the linear case and with kernel methods. Interesting connections are also pointed out between particular classification and dependence estimation kernel methods. Table I provides a list of acronyms, and basic notation and variables that will be used throughout the paper.

A. Problem statement and notation

Let us consider a supervised regression or classification problem, and let $\mathbf{X}$ and $\mathbf{Y}$ be columnwise-centered input and target data matrices of sizes $l \times d$ and $l \times m$, respectively. Here, $l$ is the number of training data points in the problem, and $d$ and $m$ are the dimensions of the input and output spaces, respectively. The target data can be either a set of variables that need to be approximated, or a matrix that encodes the class membership information. The sample covariance matrices are given by $\mathbf{C}_x = \frac{1}{l}\mathbf{X}^\top\mathbf{X}$ and $\mathbf{C}_y = \frac{1}{l}\mathbf{Y}^\top\mathbf{Y}$, whereas the covariance between the input and output data is $\mathbf{C}_{xy} = \frac{1}{l}\mathbf{X}^\top\mathbf{Y}$.

The objective of standard linear multiregression is to adjust a linear model for predicting the output variables from the input features, $\mathbf{Y} = \mathbf{X}\mathbf{W}$, where $\mathbf{W}$ contains the regression model coefficients. The ordinary least-squares (LS) regression solution is $\mathbf{W} = \mathbf{X}^\dagger\mathbf{Y}$, where $\mathbf{X}^\dagger = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top$ is the Moore-Penrose pseudoinverse of $\mathbf{X}$. Highly correlated input variables can result in a rank-deficient $\mathbf{C}_x$, making the inversion of this matrix infeasible. The same situation is encountered in the small-sample-size case (i.e., when $l < d$, which is usually the case when using kernel extensions). Including a Tikhonov regularization term leads to better conditioned solutions: for instance, by also minimizing the squared Frobenius norm of the weights matrix, $\|\mathbf{W}\|_F^2$, one obtains the regularized LS solution $\mathbf{W} = (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{Y}$, where parameter $\lambda$ controls the amount of regularization.
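As a minimal illustration of the two closed-form solutions above, the following NumPy sketch computes the ordinary and the Tikhonov-regularized LS weights. The function name `ls_weights`, the synthetic data, and the value of `lam` are illustrative assumptions, not part of the original paper.

```python
import numpy as np

def ls_weights(X, Y, lam=0.0):
    """Regularized LS weights W = (X^T X + lam I)^{-1} X^T Y (hypothetical helper)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
X -= X.mean(axis=0)                       # column-wise centering, as assumed in the text
W_true = np.array([[1.0, -2.0], [0.5, 0.0], [0.0, 3.0]])
Y = X @ W_true                            # noiseless targets, centered because X is

W_ls = ls_weights(X, Y)                   # lam = 0: ordinary LS, equals pinv(X) @ Y
W_ridge = ls_weights(X, Y, lam=10.0)      # lam > 0: Tikhonov regularization
```

With `lam > 0` the linear system stays well conditioned even when $\mathbf{X}^\top\mathbf{X}$ is rank deficient, at the price of shrinking the weights.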

The solution suggested by MVA to the above problem consists in projecting the input data into a subspace that preserves the most relevant information for the learning problem. Therefore, MVA methods obtain a transformed set of features via a linear transformation of the original data, $\mathbf{X}' = \mathbf{X}\mathbf{U}$, where $\mathbf{U} = [\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_{n_f}]$ will be referred to hereafter as the projection matrix, $\mathbf{u}_i$ being the $i$th projection vector and $n_f$ the number of extracted features¹. Some MVA methods also consider a feature extraction in the output space, $\mathbf{Y}' = \mathbf{Y}\mathbf{V}$, with $\mathbf{V} = [\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_{n_f}]$.

Generally speaking, MVA methods look for projections of the input data that are "maximally aligned" with the targets, and the different methods are characterized by the particular objectives they maximize. Table II provides a summary of the MVA methods that we will discuss in the rest of the section. An interesting property of linear MVA methods is that they are based on first and second order moments, and that their solutions can be formulated in terms of (generalized) eigenvalue problems. Thus, standard linear algebra methods can be readily applied.

B. Linear Multivariate Analysis

Principal Component Analysis, which is also known as the Hotelling transform or the Karhunen-Loève transform [9], is probably the most widely used MVA method and the oldest, dating back to 1901 [10]. PCA selects the maximum variance projections of the input data, imposing an orthonormality constraint for the projection vectors (see Table II). PCA works under the hypothesis that high variance projections contain the relevant information for the learning task at hand. PCA is based on the input data alone, and is therefore an unsupervised feature extraction method. Methods that explicitly look for the projections that better explain the target data should in principle be preferred in a supervised setup. Nevertheless, PCA and its kernel version, kPCA, are used as a preprocessing stage in many supervised problems, likely because of their simplicity and ability to discard irrelevant directions [11], [12].
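The maximum-variance property can be checked numerically. The sketch below, which assumes synthetic data and a hypothetical helper `pca`, extracts the leading eigenvectors of $\mathbf{C}_x$ and projects the data onto them.

```python
import numpy as np

def pca(X, n_f):
    """Top-n_f PCA projection vectors of a column-centered X (l x d); illustrative helper."""
    C_x = X.T @ X / X.shape[0]                 # sample covariance matrix
    evals, evecs = np.linalg.eigh(C_x)         # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:n_f]      # keep the n_f largest
    return evecs[:, order], evals[order]

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4)) * np.array([3.0, 1.0, 0.5, 0.1])
X -= X.mean(axis=0)

U, variances = pca(X, n_f=2)
X_proj = X @ U                                 # extracted features X' = XU
```

The variance of each extracted feature equals the corresponding eigenvalue of $\mathbf{C}_x$, and the columns of `U` are orthonormal, in agreement with the PCA constraint $\mathbf{U}^\top\mathbf{U} = \mathbf{I}$.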

The PLS algorithm [13] is based on latent variables that account for the information in $\mathbf{C}_{xy}$. In order to do so, PLS extracts the projections that maximize the covariance between the projected input and output data, again under orthonormality constraints for the projection vectors. This is done either as an iterative procedure or by solving an eigenvalue problem. In the iterative schemes, the data sets $\mathbf{X}$ and $\mathbf{Y}$ are recursively transformed in a process which subtracts the information contained in the already estimated latent variables. This process, which is often referred to as deflation, can be done in a number of ways that define the many variants of PLS. Perhaps the most popular PLS method was presented in [14]. The algorithm, hereafter referred to as PLS2, assumes a linear relation between $\mathbf{X}$ and $\mathbf{Y}$ that implies a certain deflation scheme, where the latent variable of $\mathbf{X}$ is also used to deflate $\mathbf{Y}$ [7, p. 182]. Several other variants of PLS exist, such as 'PLS Mode A' and PLS-SB; see [15] for a discussion of the early history of PLS and [16] for a well-written contemporary overview. Among its many advantages, PLS does not involve matrix inversion and deals efficiently with highly correlated data. This has justified its very extensive use in fields such as chemometrics and remote sensing, where signals typically are acquired in a range of highly correlated spectral wavelengths.

¹ Note that, strictly speaking, $\mathbf{U}$ is not a projection operator, since it implies a transformation from $\mathbb{R}^d$ to $\mathbb{R}^{n_f}$ and does not satisfy the idempotent property of projection operators. Nevertheless, if the columns of $\mathbf{U}$ are linearly independent, vectors $\mathbf{u}_i$ constitute a basis of the subspace of $\mathbb{R}^d$ where the data is projected, and it is in this sense that we refer to $\mathbf{u}_i$ and $\mathbf{U}$ as projection vectors and matrix, and to $\mathbf{X}' = \mathbf{X}\mathbf{U}$ as projected data. This nomenclature has been widely adopted in the machine learning field [8].

Rather than maximizing covariance, CCA maximizes the correlation between projected input and output data [17]. In this way, CCA can more conveniently deal with directions of the input or output spaces that present very high variance, and that would therefore be over-emphasized by PLS, even if the correlation between the projected input and output data is not very significant.

A final method we will pay attention to is Orthonormalized PLS (OPLS), also known as multilinear regression [18] or semi-penalized CCA [19]. OPLS is optimal for performing multilinear LS regression on the features extracted from the training data, i.e.,

$$\mathbf{U} = \arg\min_{\mathbf{U}} \|\mathbf{Y} - \mathbf{X}'\mathbf{W}\|_F^2, \tag{1}$$

with $\mathbf{W} = \mathbf{X}'^\dagger\mathbf{Y}$ being the matrix containing the optimal regression coefficients. It can be shown that this optimization problem is equivalent to the one stated in Table II. Alternatively, this problem can also be associated with the maximization of a Rayleigh coefficient involving projections of both input and output data, $\frac{(\mathbf{u}^\top\mathbf{C}_{xy}\mathbf{v})^2}{(\mathbf{u}^\top\mathbf{C}_x\mathbf{u})(\mathbf{v}^\top\mathbf{v})}$. It is in this sense that this method is called semi-penalized CCA, since it disregards variance of the projected input data, but rewards those input features that better predict large variance projections of the target data. This asymmetry makes sense in supervised subspace learning where matrix $\mathbf{Y}$ contains target values to be approximated from the extracted input features. In fact, it has been shown that for classification problems OPLS is equivalent to Linear Discriminant Analysis (LDA), provided an appropriate labeling scheme is used for $\mathbf{Y}$ [19]. However, in "two-view learning" problems in which $\mathbf{X}$ and $\mathbf{Y}$ represent different views of the data [7, Sec. 6.5], one would like to extract features that can predict both data representations simultaneously, and CCA could be preferred to OPLS.
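The equivalence between the LS criterion in (1) and the OPLS objective of Table II can be sketched in a few steps. The derivation below is an outline under the assumption that $\mathbf{X}'^\top\mathbf{X}'$ is invertible; it is not taken verbatim from the paper. Substituting $\mathbf{W} = \mathbf{X}'^\dagger\mathbf{Y} = (\mathbf{X}'^\top\mathbf{X}')^{-1}\mathbf{X}'^\top\mathbf{Y}$ into (1) gives

$$\|\mathbf{Y} - \mathbf{X}'\mathbf{W}\|_F^2 = \mathrm{Tr}\{\mathbf{Y}^\top\mathbf{Y}\} - \mathrm{Tr}\{\mathbf{Y}^\top\mathbf{X}'(\mathbf{X}'^\top\mathbf{X}')^{-1}\mathbf{X}'^\top\mathbf{Y}\}.$$

The first term does not depend on $\mathbf{U}$. Using $\mathbf{X}' = \mathbf{X}\mathbf{U}$, $\mathbf{X}^\top\mathbf{X} = l\,\mathbf{C}_x$ and $\mathbf{X}^\top\mathbf{Y} = l\,\mathbf{C}_{xy}$, the second term becomes

$$l\,\mathrm{Tr}\{(\mathbf{U}^\top\mathbf{C}_x\mathbf{U})^{-1}\mathbf{U}^\top\mathbf{C}_{xy}\mathbf{C}_{xy}^\top\mathbf{U}\},$$

so minimizing (1) amounts to maximizing this trace. The trace is invariant to any invertible reparametrization $\mathbf{U} \to \mathbf{U}\mathbf{R}$, so the constraint $\mathbf{U}^\top\mathbf{C}_x\mathbf{U} = \mathbf{I}$ can be imposed without loss of generality, which leaves $\sum_i \mathbf{u}_i^\top\mathbf{C}_{xy}\mathbf{C}_{xy}^\top\mathbf{u}_i$, i.e., the OPLS objective and constraint listed in Table II.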

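A common framework for PCA, PLS, CCA and OPLS was proposed in [18], where it was shown that these methods can be reformulated as (generalized) eigenvalue problems, so that linear algebra packages can be used to solve them. Concretely:

$$
\begin{aligned}
\text{PCA}: &\quad \mathbf{C}_x\mathbf{u} = \lambda\mathbf{u} \\
\text{PLS}: &\quad \begin{pmatrix} \mathbf{0} & \mathbf{C}_{xy} \\ \mathbf{C}_{xy}^\top & \mathbf{0} \end{pmatrix} \begin{pmatrix} \mathbf{u} \\ \mathbf{v} \end{pmatrix} = \lambda \begin{pmatrix} \mathbf{u} \\ \mathbf{v} \end{pmatrix} \\
\text{OPLS}: &\quad \mathbf{C}_{xy}\mathbf{C}_{xy}^\top\mathbf{u} = \lambda\mathbf{C}_x\mathbf{u} \\
\text{CCA}: &\quad \begin{pmatrix} \mathbf{0} & \mathbf{C}_{xy} \\ \mathbf{C}_{xy}^\top & \mathbf{0} \end{pmatrix} \begin{pmatrix} \mathbf{u} \\ \mathbf{v} \end{pmatrix} = \lambda \begin{pmatrix} \mathbf{C}_x & \mathbf{0} \\ \mathbf{0} & \mathbf{C}_y \end{pmatrix} \begin{pmatrix} \mathbf{u} \\ \mathbf{v} \end{pmatrix}
\end{aligned}
\tag{2}
$$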
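To make the eigenvalue formulations concrete, the following sketch solves the OPLS and PLS problems of (2) with NumPy only. It assumes a full-rank $\mathbf{C}_x$ and reduces the generalized OPLS problem to a standard symmetric one through a Cholesky factor; the PLS block problem is solved through the SVD of $\mathbf{C}_{xy}$, whose singular vectors and values are its solutions (no deflation is applied). The helper names `opls` and `pls_svd` and the synthetic data are illustrative assumptions.

```python
import numpy as np

def opls(C_x, C_xy, n_f):
    """Solve C_xy C_xy^T u = lam C_x u, assuming C_x is positive definite."""
    L = np.linalg.cholesky(C_x)                 # C_x = L L^T
    B = np.linalg.solve(L, C_xy)                # L^{-1} C_xy
    evals, W = np.linalg.eigh(B @ B.T)          # standard symmetric eigenproblem
    order = np.argsort(evals)[::-1][:n_f]
    U = np.linalg.solve(L.T, W[:, order])       # back-substitution: u = L^{-T} w
    return U, evals[order]

def pls_svd(C_xy, n_f):
    """Solutions of the PLS block eigenproblem are the singular vectors of C_xy."""
    U, s, Vt = np.linalg.svd(C_xy, full_matrices=False)
    return U[:, :n_f], Vt[:n_f].T, s[:n_f]

rng = np.random.default_rng(0)
l, d, m = 100, 5, 2
X = rng.standard_normal((l, d))
Y = X @ rng.standard_normal((d, m)) + 0.1 * rng.standard_normal((l, m))
X -= X.mean(axis=0)
Y -= Y.mean(axis=0)
C_x, C_xy = X.T @ X / l, X.T @ Y / l

U_opls, lam_opls = opls(C_x, C_xy, n_f=2)
U_pls, V_pls, s_pls = pls_svd(C_xy, n_f=2)
```

The OPLS vectors returned this way satisfy $\mathbf{U}^\top\mathbf{C}_x\mathbf{U} = \mathbf{I}$ by construction. A dedicated generalized symmetric eigensolver could be used instead of the explicit Cholesky reduction.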

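TABLE II
SUMMARY OF LINEAR AND KERNEL MVA METHODS. FOR EACH METHOD, THE TABLE STATES THE OBJECTIVE TO MAXIMIZE (1ST ROW), THE CONSTRAINTS FOR THE OPTIMIZATION (2ND ROW), AND THE MAXIMUM NUMBER OF FEATURES (LAST ROW).

| | PCA | PLS | CCA | OPLS | kPCA | kPLS | kCCA | kOPLS |
|---|---|---|---|---|---|---|---|---|
| Objective | $\mathbf{u}^\top\mathbf{C}_x\mathbf{u}$ | $\mathbf{u}^\top\mathbf{C}_{xy}\mathbf{v}$ | $\mathbf{u}^\top\mathbf{C}_{xy}\mathbf{v}$ | $\mathbf{u}^\top\mathbf{C}_{xy}\mathbf{C}_{xy}^\top\mathbf{u}$ | $\boldsymbol{\alpha}^\top\mathbf{K}_x^2\boldsymbol{\alpha}$ | $\boldsymbol{\alpha}^\top\mathbf{K}_x\mathbf{Y}\mathbf{v}$ | $\boldsymbol{\alpha}^\top\mathbf{K}_x\mathbf{Y}\mathbf{v}$ | $\boldsymbol{\alpha}^\top\mathbf{K}_x\mathbf{Y}\mathbf{Y}^\top\mathbf{K}_x\boldsymbol{\alpha}$ |
| Constraints | $\mathbf{U}^\top\mathbf{U} = \mathbf{I}$ | $\mathbf{U}^\top\mathbf{U} = \mathbf{I}$, $\mathbf{V}^\top\mathbf{V} = \mathbf{I}$ | $\mathbf{U}^\top\mathbf{C}_x\mathbf{U} = \mathbf{I}$, $\mathbf{V}^\top\mathbf{C}_y\mathbf{V} = \mathbf{I}$ | $\mathbf{U}^\top\mathbf{C}_x\mathbf{U} = \mathbf{I}$ | $\mathbf{A}^\top\mathbf{K}_x\mathbf{A} = \mathbf{I}$ | $\mathbf{A}^\top\mathbf{K}_x\mathbf{A} = \mathbf{I}$, $\mathbf{V}^\top\mathbf{V} = \mathbf{I}$ | $\mathbf{A}^\top\mathbf{K}_x^2\mathbf{A} = \mathbf{I}$, $\mathbf{V}^\top\mathbf{C}_y\mathbf{V} = \mathbf{I}$ | $\mathbf{A}^\top\mathbf{K}_x^2\mathbf{A} = \mathbf{I}$ |
| Max. features | $r(\mathbf{X})$ | $r(\mathbf{X})$ | $r(\mathbf{C}_{xy})$ | $r(\mathbf{C}_{xy})$ | $r(\mathbf{K}_x)$ | $r(\mathbf{K}_x)$ | $r(\mathbf{K}_x\mathbf{Y})$ | $r(\mathbf{K}_x\mathbf{Y})$ |

Vectors $\mathbf{u}$ and $\boldsymbol{\alpha}$ are column vectors in matrices $\mathbf{U}$ and $\mathbf{A}$, respectively. $r(\cdot)$ denotes the rank of a matrix.

We can see that CCA and OPLS require the inversion of matrices $\mathbf{C}_x$ and $\mathbf{C}_y$. If these are rank-deficient, then it becomes necessary to first extract the dimensions with non-zero variance using PCA, and then solve the CCA or OPLS problems. A very common approach is to solve the above problems using a two-step iterative procedure. In the first step, the projection vectors corresponding to the largest (generalized) eigenvalue are chosen, for which there exist efficient methods such as the power method. The second step is known as deflation, and consists in removing from the data or covariance matrices the variance that can already be explained with the features extracted in the first step. Alternative solutions for these methods can be obtained by reformulating them as regularized least squares minimization problems. For instance, the work in [20]–[22] introduced sparse versions of PCA, CCA and OPLS by adding sparsity promotion terms, such as the LASSO or $\ell_1$-norm penalty on the projection vectors, to the LS functional.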
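The two-step procedure (leading eigenvector by the power method, followed by deflation) can be sketched as follows for the PCA case. The function name `power_deflation`, the fixed iteration count, and the test matrix are assumptions made for illustration; a practical implementation would use a convergence criterion instead of a fixed number of iterations.

```python
import numpy as np

def power_deflation(C, n_f, n_iter=500):
    """Leading eigenpairs of a symmetric PSD matrix via power iteration and deflation."""
    C = C.copy()
    rng = np.random.default_rng(0)
    vecs, vals = [], []
    for _ in range(n_f):
        u = rng.standard_normal(C.shape[0])
        for _ in range(n_iter):               # step 1: power method
            u = C @ u
            u /= np.linalg.norm(u)
        lam = u @ C @ u                       # Rayleigh quotient of the unit vector u
        C -= lam * np.outer(u, u)             # step 2: deflation removes explained variance
        vecs.append(u)
        vals.append(lam)
    return np.column_stack(vecs), np.array(vals)

# Covariance-like matrix with known, well-separated eigenvalues (illustrative).
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
C_x = Q @ np.diag([5.0, 2.0, 1.0, 0.1]) @ Q.T

U, lams = power_deflation(C_x, n_f=2)
```

The convergence speed of each power iteration depends on the gap between the two largest remaining eigenvalues, which is why well-separated eigenvalues were chosen for this example.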

Figure 1 illustrates the features extracted by the methods for a toy classification problem with three classes. The data was generated from three noisy sinusoid fragments, so that a certain overlap exists between classes. For the application of supervised methods, class membership is defined in matrix $\mathbf{Y}$ using 1-of-$c$ encoding [23]. Above each scatter plot we provide, for the first extracted feature, its sample variance, the largest covariance and correlation that can be achieved with any linear transformation of the output data, and the optimum mean-square-error (MSE) when that feature is used to approximate the target data, i.e., $\frac{1}{lm}\|\mathbf{Y} - \mathbf{X}\mathbf{u}_1(\mathbf{X}\mathbf{u}_1)^\dagger\mathbf{Y}\|_F^2$. As expected, the first projections of PCA, PLS and CCA maximize, respectively, the variance, covariance and correlation, whereas OPLS finds the projection that minimizes the MSE. However, since these methods can only perform linear transformations of the data, they are not able to capture any non-linear relations between the input variables.

C. Kernel Multivariate Analysis

The framework of kernel MVA (kMVA) algorithms is aimed at extracting nonlinear projections while actually working with linear algebra. Let us first consider a function $\boldsymbol{\phi} : \mathbb{R}^d \to \mathcal{F}$ that maps input data into a Hilbert feature space $\mathcal{F}$. The new mapped data set is defined as $\boldsymbol{\Phi} = [\boldsymbol{\phi}(\mathbf{x}_1), \cdots, \boldsymbol{\phi}(\mathbf{x}_l)]^\top$, and the features extracted from the input data will now be given by $\boldsymbol{\Phi}' = \boldsymbol{\Phi}\mathbf{U}$, where matrix $\mathbf{U}$ is of size $\dim(\mathcal{F}) \times n_f$. The direct application of this idea suffers from serious practical limitations when the dimension of $\mathcal{F}$ is very large, which is typically the case.

To implement practical kernel MVA algorithms we need to rewrite the equations in the first half of Table II in terms of inner products in $\mathcal{F}$ only. For doing so, we rely on the availability of a kernel matrix $\mathbf{K}_x = \boldsymbol{\Phi}\boldsymbol{\Phi}^\top$ of dimension $l \times l$, and on the Representer Theorem [7], which states that the projection vectors can be written as a linear combination of the training samples, i.e., $\mathbf{U} = \boldsymbol{\Phi}^\top\mathbf{A}$, matrix $\mathbf{A} = [\boldsymbol{\alpha}_1, \ldots, \boldsymbol{\alpha}_{n_f}]$ being the new argument for the optimization². This is typically referred to as the kernel trick and has been used to develop kernel versions of the previous linear MVA methods, as indicated in the last four columns of Table II.
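The identity behind the kernel trick, $\boldsymbol{\Phi}\mathbf{U} = \boldsymbol{\Phi}\boldsymbol{\Phi}^\top\mathbf{A} = \mathbf{K}_x\mathbf{A}$, can be verified on a kernel whose feature map is small enough to write down. The sketch below uses the homogeneous quadratic kernel in $\mathbb{R}^2$ as an illustrative assumption (the paper's experiments use an RBF kernel, whose feature space is infinite-dimensional); centering in feature space is omitted for brevity.

```python
import numpy as np

def phi(X):
    """Explicit feature map of k(x, y) = (x . y)^2 for 2-D inputs."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1**2, np.sqrt(2) * x1 * x2, x2**2])

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 2))

Phi = phi(X)                       # l x dim(F), only available for simple kernels
K_x = (X @ X.T) ** 2               # l x l kernel matrix from input-space data only

A = rng.standard_normal((6, 2))    # arbitrary coefficient matrix, for illustration
U = Phi.T @ A                      # projection vectors in feature space, U = Phi^T A

feat_explicit = Phi @ U            # features computed in feature space
feat_kernel = K_x @ A              # same features, computed through the kernel only
```

Both routes give the same features, but the second never forms $\boldsymbol{\Phi}$ or $\mathbf{U}$, which is what makes the approach usable when $\dim(\mathcal{F})$ is very large.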

For PCA, it was Schölkopf, Smola and Müller who in 1998 introduced a kernel version denoted kPCA [6]. Lai and Fyfe in 2000 first introduced the kernel version of CCA denoted kCCA [24] (see also [7]). Later, Rosipal and Trejo presented a non-linear kernel variant of PLS in [25]. In that paper, $\mathbf{K}_x$ and the $\mathbf{Y}$ matrix are deflated the same way, which is more in line with the PLS2 variant than with the traditional algorithm 'PLS Mode A', and therefore we will denote it as kPLS2. A kernel variant of Orthonormalized PLS was presented in [26] and is here referred to as kOPLS. Specific versions of kernel methods to deal with signal processing applications have also been proposed, such as the temporal kCCA of [27], which is designed to exploit temporal structure in the data.

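As for the linear case, kernel MVA methods can be implemented as (generalized) eigenvalue problems:

$$
\begin{aligned}
\text{kPCA}: &\quad \mathbf{K}_x\boldsymbol{\alpha} = \lambda\boldsymbol{\alpha} \\
\text{kPLS}: &\quad \begin{pmatrix} \mathbf{0} & \mathbf{K}_x\mathbf{Y} \\ \mathbf{Y}^\top\mathbf{K}_x & \mathbf{0} \end{pmatrix} \begin{pmatrix} \boldsymbol{\alpha} \\ \mathbf{v} \end{pmatrix} = \lambda \begin{pmatrix} \boldsymbol{\alpha} \\ \mathbf{v} \end{pmatrix} \\
\text{kOPLS}: &\quad \mathbf{K}_x\mathbf{Y}\mathbf{Y}^\top\mathbf{K}_x\boldsymbol{\alpha} = \lambda\mathbf{K}_x\mathbf{K}_x\boldsymbol{\alpha} \\
\text{kCCA}: &\quad \begin{pmatrix} \mathbf{0} & \mathbf{K}_x\mathbf{Y} \\ \mathbf{Y}^\top\mathbf{K}_x & \mathbf{0} \end{pmatrix} \begin{pmatrix} \boldsymbol{\alpha} \\ \mathbf{v} \end{pmatrix} = \lambda \begin{pmatrix} \mathbf{K}_x\mathbf{K}_x & \mathbf{0} \\ \mathbf{0} & \mathbf{C}_y \end{pmatrix} \begin{pmatrix} \boldsymbol{\alpha} \\ \mathbf{v} \end{pmatrix}
\end{aligned}
\tag{3}
$$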
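As an illustration of the simplest case in (3), the sketch below solves the kPCA eigenproblem on a centered kernel matrix and rescales the eigenvectors so that the constraint $\mathbf{A}^\top\mathbf{K}_x\mathbf{A} = \mathbf{I}$ of Table II holds. The helper names, the RBF kernel with $\sigma = 1$, and the synthetic data are assumptions for the example; the centering step is the usual double-centering of the kernel matrix.

```python
import numpy as np

def center_kernel(K):
    """Center the data in feature space by double-centering the kernel matrix."""
    l = K.shape[0]
    H = np.eye(l) - np.ones((l, l)) / l
    return H @ K @ H

def kpca(K_c, n_f):
    """Leading solutions of K_x alpha = lam alpha, scaled so that A^T K_x A = I."""
    evals, evecs = np.linalg.eigh(K_c)
    order = np.argsort(evals)[::-1][:n_f]
    lam, V = evals[order], evecs[:, order]
    A = V / np.sqrt(lam)                      # assumes the kept eigenvalues are positive
    return A, lam

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 2))
sq = np.sum(X**2, axis=1)
D2 = sq[:, None] + sq[None, :] - 2 * X @ X.T  # squared pairwise distances
K_x = center_kernel(np.exp(-D2 / 2.0))        # RBF kernel with sigma = 1 (illustrative)

A, lam = kpca(K_x, n_f=3)
features = K_x @ A                            # extracted features for the training data
```

The extracted features have zero mean and are mutually uncorrelated, with energies given by the retained eigenvalues.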

It should be noted that the output data could also be mapped to some feature space $\mathcal{H}$, as was considered for kCCA in [24] for a multi-view learning case. Here, we consider that it is the actual labels in $\mathbf{Y}$ which need to be well-represented by the extracted input features, so we will deal with the original representation of the output data.

² In this paper, we assume that data is centered in feature space, which can easily be done through a simple modification of the kernel matrix [7].

[Figure 1: scatter plots of the original data and of the features extracted by PCA, PLS, OPLS, CCA, kPCA, kPLS, kOPLS and kCCA. The statistics printed above each panel are listed below.]

| Method | var | MSE | cov | corr |
|---|---|---|---|---|
| PCA | 1.272 | 0.153 | 0.353 | 0.299 |
| PLS | 1.134 | 0.139 | 0.404 | 0.401 |
| OPLS | 1.001 | 0.135 | 0.391 | 0.431 |
| CCA | 0.957 | 0.136 | 0.381 | 0.433 |
| kPCA | 0.161 | 0.136 | 0.156 | 0.421 |
| kPLS | 0.161 | 0.135 | 0.157 | 0.43 |
| kOPLS | $3.41 \times 10^{-6}$ | 0.084 | 0.001 | 0.856 |
| kCCA | $2.45 \times 10^{-5}$ | 0.124 | 0.002 | 0.915 |

Fig. 1. Features extracted by different MVA methods in a three-class problem. For the first feature extracted by each method we show its variance (var), the mean-square-error when the projected data is used to approximate $\mathbf{Y}$ (MSE), and the largest covariance (cov) and correlation (corr) that can be achieved with any linear projection of the target data.

For illustrative purposes, we have incorporated into Fig. 1 the projections obtained in the toy problem by kMVA methods using the radial basis function (RBF) kernel, $k(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\|\mathbf{x}_i - \mathbf{x}_j\|^2/(2\sigma^2)\right)$. Input data was normalized to zero mean and unit variance, and the kernel width $\sigma$ was selected as the median of all pairwise distances between samples [28]. The same $\sigma$ has been used for all methods, so that features are extracted from the same mapping of the input data. We can see that the non-linear mapping improves class separability. As expected, kPCA, kPLS and kCCA maximize in $\mathcal{F}$ the same variance, covariance and correlation objectives, respectively, as their linear counterparts. kOPLS looks for the directions of data in $\mathcal{F}$ that can provide the best approximation of $\mathbf{Y}$ in the MSE sense. This example also illustrates that maximizing the variance or even the covariance may not be the best choice for supervised learning.
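The RBF kernel with the median heuristic for the kernel width can be written compactly. The sketch below is a minimal version assuming synthetic input data and a hypothetical helper `rbf_kernel_median`; it normalizes the inputs, takes $\sigma$ as the median of all pairwise distances, and builds the kernel matrix.

```python
import numpy as np

def rbf_kernel_median(X):
    """RBF kernel matrix with sigma set to the median pairwise distance."""
    sq = np.sum(X**2, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0)  # clip rounding noise
    iu = np.triu_indices(X.shape[0], k=1)      # each pair of distinct samples once
    sigma = np.median(np.sqrt(D2[iu]))
    return np.exp(-D2 / (2 * sigma**2)), sigma

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 3)) * np.array([1.0, 5.0, 0.2])
X = (X - X.mean(axis=0)) / X.std(axis=0)       # zero mean, unit variance per variable

K_x, sigma = rbf_kernel_median(X)
```

The resulting matrix is symmetric and positive semidefinite with a unit diagonal; before using it in the eigenproblems above it would still need to be centered in feature space.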

Although kernel MVA methods can still be described in terms of linear equations, their direct solution faces several problems. In particular, it is well-known that kOPLS and kCCA can easily overfit the training data, so regularization is normally required to alleviate numerical instabilities [7], [26]. A second important problem is related to the computational cost. Since $\mathbf{K}_x$ is of size $l \times l$, the complexity of the methods scales quadratically with $l$ in terms of memory, and cubically with respect to the computation time. Further, the solution of the maximization problem (matrix $\mathbf{A}$) is not sparse, so that feature extraction for new data requires the evaluation of $l$ kernel functions per pattern, becoming computationally expensive for large $l$. Finally, it is worth mentioning the opposite situation: when $l$ is small, the extracted features may be useless, especially for high dimensional $\mathcal{F}$ [12]. Actually, the information content of the features is elusive and has not been characterized so far. These issues limit the applicability of supervised kMVA in real-life scenarios with either very large or very small labeled data sets. In Section III, we describe sparse and semisupervised approaches for kMVA that tackle both difficulties.

D. Relations with other methods

As already stated, close connections have been established among Fisher's LDA, CCA, PLS, and OPLS for classification [19]; such links extend to their kernel counterparts as well. Under the framework of Rayleigh coefficients, Mika et al. [29], [30] extended LDA to its kernel version for binary problems, and Baudat and Anouar [31] proposed the generalized discriminant analysis (GDA) for multiclass problems. A great many kernel discriminants have appeared since then, focused on alleviating problems such as those induced by high-dimensional small-sized datasets, the presence of noise, high levels of collinearity, or unbalanced classes. The number and heterogeneity of these methods makes their unified treatment difficult.

Besides, in recent years interesting connections between kMVA and statistical dependence estimates have been established. For instance, the Hilbert-Schmidt Independence Criterion (HSIC) [32] is a simple yet very effective method to estimate statistical dependence between random variables. HSIC corresponds to estimating the norm of the cross-covariance in $\mathcal{F}$, whose empirical (biased) estimator is $\text{HSIC} := \frac{1}{(l-1)^2}\mathrm{Tr}(\mathbf{K}_x\mathbf{K}_y)$, where $\mathbf{K}_x$ works with samples in the source domain and $\mathbf{K}_y = \mathbf{Y}\mathbf{Y}^\top$. It can be shown that, if the RKHS kernel is universal, HSIC asymptotically tends to zero when the input and output data are independent. The so-called Hilbert-Schmidt Component Analysis (HSCA) method iteratively seeks projections that maximize dependence with the target variables and simultaneously minimize the dependence with previously extracted features, both in HSIC terms. This objective translates into the iterative resolution of the generalized eigen-decomposition problem $\mathbf{K}_x\mathbf{K}_y\mathbf{K}_x\boldsymbol{\alpha} = \lambda\mathbf{K}_x\mathbf{K}_f\mathbf{K}_x\boldsymbol{\alpha}$, where $\mathbf{K}_f$ is a kernel matrix of already extracted features in the previous iteration. If one is only interested in maximizing source-target dependence in HSIC terms, the problem boils down to solving kOPLS.
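The empirical HSIC estimator can be computed in a few lines. The sketch below assumes scalar synthetic data, an RBF kernel with $\sigma = 1$ on the inputs and the linear kernel $\mathbf{K}_y = \mathbf{Y}\mathbf{Y}^\top$ on the outputs; the centering matrix `H` is applied inside the trace, which is equivalent to working with centered kernel matrices. Function names and data are illustrative.

```python
import numpy as np

def hsic(K_x, K_y):
    """Biased empirical HSIC, Tr(K_x K_y) / (l - 1)^2, with centering applied via H."""
    l = K_x.shape[0]
    H = np.eye(l) - np.ones((l, l)) / l
    return np.trace(H @ K_x @ H @ K_y) / (l - 1) ** 2

def rbf(x, sigma=1.0):
    """RBF kernel matrix for a 1-D sample vector x."""
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(0)
x = rng.standard_normal(200)
y_dep = x + 0.1 * rng.standard_normal(200)    # strongly dependent on x
y_ind = rng.standard_normal(200)              # drawn independently of x

K_x = rbf(x)
hsic_dep = hsic(K_x, np.outer(y_dep, y_dep))  # K_y = Y Y^T
hsic_ind = hsic(K_x, np.outer(y_ind, y_ind))
```

For this example the estimate for the dependent pair is clearly larger than for the independent pair, whose value is only a small finite-sample bias.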

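Similarly, the connection between kCCA and other kernel measures of dependence, such as the kernel Generalized Variance (kGV) or the kernel Mutual Information (kMI), was introduced in [33]. The empirical kGV estimates dependence between input-output data with a function that depends on the entire spectrum of the associated correlation operator in RKHS, $\text{kGV}(\theta) = -\frac{1}{2}\log\left(\prod_i (1 - \lambda_i^2)\right)$, where $\lambda_i$ are the solutions to the generalized eigenvalue problem

$$
\begin{pmatrix} \mathbf{0} & \mathbf{K}_x\mathbf{K}_y \\ \mathbf{K}_y\mathbf{K}_x & \mathbf{0} \end{pmatrix} \begin{pmatrix} \boldsymbol{\alpha} \\ \mathbf{v} \end{pmatrix} = \lambda \begin{pmatrix} \theta\mathbf{K}_x\mathbf{K}_x + \eta(1-\theta)\mathbf{K}_x & \mathbf{0} \\ \mathbf{0} & \theta\mathbf{K}_y\mathbf{K}_y + \eta(1-\theta)\mathbf{K}_y \end{pmatrix} \begin{pmatrix} \boldsymbol{\alpha} \\ \mathbf{v} \end{pmatrix},
$$

where $\mathbf{K}_x$ and $\mathbf{K}_y$ are defined using RKHS kernels obtained via convolution of the associated Parzen windows, $\eta$ is a scaling factor, and $\theta$ is a parameter in the range $[0, 1]$. Gretton et al. [33] showed that, under certain conditions, kGV reduces to kMI for $\theta = 0$ and to kCCA for $\theta = 1$ (cf. Eq. 3).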

It is worth noting that the previous kernel measures of statistical dependence hold connections with Information Theoretic Learning concepts as well. For instance, it can be shown that HSIC is intimately related to the concept of correntropy [34]. In the future, all these connections could shed light, in a principled way, on the informative content of the extracted features.

III. EXTENSIONS FOR LARGE SCALE AND SEMISUPERVISED PROBLEMS

Supervised kernel MVA methods are hampered either by the wealth or the scarcity of labeled samples, which can make these methods impractical for many applications. We next summarize some extensions to deal with large scale problems and semisupervised situations in which few labeled data are available.

A. Sparse Kernel Feature Extraction

A critical bottleneck of kernel methods is that for a dataset of $l$ samples, the kernel matrices are $l \times l$, which, even for a moderate number of samples, quickly becomes a problem with respect to both memory and computation time. In kernel MVA this is also an issue during the extraction of features for test data, since kernel MVA solutions will in general depend on all training data (i.e., matrix $\mathbf{A}$ will generally be dense): evaluating thousands of kernels for every new input vector is, in most applications, simply not acceptable. Furthermore, these so-called dense solutions may result in severe problems of overfitting, which is particularly true for kOPLS and kCCA [7], [26]. To address these problems, several solutions have been proposed to obtain sparse solutions that can be expressed as a combination of a subset of the training data, and therefore require only $r$ kernel evaluations per sample (with $r \ll l$) for feature extraction. Note that, in contrast to the many linear MVA algorithms that induce sparsity with respect to the original variables, in this subsection we review methods that obtain sparse solutions in terms of the samples (i.e., sparsity in the $\boldsymbol{\alpha}_i$ vectors).

The approaches to obtain sparse solutions can be broadly divided into low rank approximation methods, which aim at working with reduced $r \times r$ matrices ($r \ll l$), and reduced set methods, which work with $l \times r$ matrices. Following the first approach, the Nyström low-rank approximation of an $l \times l$ kernel matrix $\mathbf{K}_{ll}$ is expressed as $\mathbf{K}_{ll} = \mathbf{K}_{lr}\mathbf{K}_{rr}^{-1}\mathbf{K}_{rl}$, where subscripts indicate row and column dimensions. The method was originally exploited in the context of Gaussian processes, and was later used in [35] to directly approximate the feature mapping itself rather than the kernel, thus giving rise to sparse versions of kPLS and kCCA.
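A minimal sketch of the Nyström approximation is given below. It assumes an RBF kernel with $\sigma = 1$, landmarks taken as the first $r$ training samples, and a small diagonal jitter added to $\mathbf{K}_{rr}$ for numerical stability; all three are illustrative choices rather than prescriptions from the original references. The full matrix $\mathbf{K}_{ll}$ is formed here only to measure the approximation error.

```python
import numpy as np

def rbf(X1, X2, sigma=1.0):
    """RBF kernel between the rows of X1 and the rows of X2."""
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-np.maximum(d2, 0.0) / (2 * sigma**2))

def nystrom(K_lr, K_rr, jitter=1e-8):
    """Nystrom approximation K_lr K_rr^{-1} K_rl (jitter is an added safeguard)."""
    r = K_rr.shape[0]
    return K_lr @ np.linalg.solve(K_rr + jitter * np.eye(r), K_lr.T)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
K_ll = rbf(X, X)                               # full matrix, for comparison only

errors = {}
for r in (3, 30):
    X_r = X[:r]                                # first r samples as landmarks (arbitrary)
    K_hat = nystrom(rbf(X, X_r), rbf(X_r, X_r))
    errors[r] = np.linalg.norm(K_ll - K_hat) / np.linalg.norm(K_ll)
```

In this example the relative error drops sharply when going from 3 to 30 landmarks, while only an $l \times r$ and an $r \times r$ block of the kernel matrix are needed to build the approximation.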

Among the reduced set methods, a sparse kPCA (skPCA) was proposed by Tipping in [36], where the sparsity in the representation is obtained by assuming a generative model for the data in $\mathcal{F}$ that follows a normal distribution and includes a noise term with variance $v_n$. The maximum likelihood estimation of the covariance matrix is shown to depend on just a subset of the training data, and so does the resulting solution. A sparse kPLS (skPLS) was introduced in [37]. The method uses a fraction of the training samples for computing the projections. Each projection vector is found through a sort of $\varepsilon$-insensitive loss similar to the one used in the support vector regression method. The sparsification is induced via a multi-step adaptation with high computational burden. In spite of obtaining sparse solutions, the algorithms from [36] and [37] still require the computation of the full kernel matrix during the training.

A reduced complexity kOPLS (rkOPLS) was proposed in [26] by imposing sparsity in the representation of the projection vectors a priori, $U = \Phi_r^\top \beta$, where $\Phi_r$ is a subset of the training data containing $r$ samples ($r \ll l$) and $\beta$ is the new argument for the maximization problem, which now becomes:

$$
\begin{aligned}
&\max_{\beta}\ \beta^\top K_{rl} Y Y^\top K_{rl}^\top \beta \\
&\text{subject to: } \beta^\top K_{rl} K_{rl}^\top \beta = 1.
\end{aligned}
\tag{4}
$$

Since the kernel matrix $K_{rl} = \Phi_r \Phi^\top$ involves the inner products in $\mathcal{F}$ of all training points with the patterns in the reduced set, this method still takes into account all data during the training phase, and is therefore different from simple subsampling. This sparsification procedure avoids the computation of the full kernel matrix at any step of the algorithm. An additional advantage of this method is that matrices $K_{rl} Y Y^\top K_{rl}^\top$ and $K_{rl} K_{rl}^\top$ are both of size $r \times r$, and can be expressed as sums over the training data, so the storage requirements become just quadratic with $r$. Furthermore, the sparsity constraint acts as a regularizer that can significantly improve the generalization ability of the method. In the experiments section, we will apply the same sparsification procedure to kPCA and kCCA, obtaining reduced complexity versions of these methods, which we will refer to as rkPCA and rkCCA. Interestingly, the extension to kPLS2 is not straightforward, since the deflation step would still require the full kernel matrix $K_{ll}$.
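The rkOPLS problem in (4) can be illustrated with a short NumPy sketch. It is not the implementation of [26]: the random choice of the reduced set, the RBF kernel, the small ridge added for numerical stability, the omission of kernel centering, and all function names are assumptions made for illustration. The sketch accumulates $K_{rl} Y$ and $K_{rl} K_{rl}^\top$ over batches of training samples, which is one way to keep the storage quadratic in $r$.

```python
import numpy as np


def rbf_kernel(A, B, sigma):
    """RBF kernel matrix between the rows of A (a x d) and the rows of B (b x d)."""
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))


def rkopls_fit(X, Y, r, n_feat, sigma=1.0, ridge=1e-6, batch=64, seed=0):
    """Sketch of rkOPLS training; returns the reduced set and beta (r x n_feat)."""
    rng = np.random.default_rng(seed)
    # Assumption: the reduced set is a random subset of the training data.
    Xr = X[rng.choice(X.shape[0], size=r, replace=False)]
    Yc = Y - Y.mean(axis=0)

    # K_rl Y and K_rl K_rl^T are accumulated as sums over batches of training
    # samples, so the kernel-related arrays are only r x batch, r x m and r x r.
    KY = np.zeros((r, Y.shape[1]))
    B = ridge * np.eye(r)  # small ridge (assumption) keeps B positive definite
    for start in range(0, X.shape[0], batch):
        Kb = rbf_kernel(Xr, X[start:start + batch], sigma)
        KY += Kb @ Yc[start:start + batch]
        B += Kb @ Kb.T
    A = KY @ KY.T  # K_rl Y Y^T K_rl^T

    # Generalized eigenproblem A b = lambda B b, solved by Cholesky whitening.
    L = np.linalg.cholesky(B)
    L_inv = np.linalg.inv(L)
    eigvals, V = np.linalg.eigh(L_inv @ A @ L_inv.T)
    top = np.argsort(eigvals)[::-1][:n_feat]
    beta = L_inv.T @ V[:, top]
    return Xr, beta


def rkopls_transform(X_new, Xr, beta, sigma=1.0):
    """Extract features for new data with only r kernel evaluations per sample."""
    return rbf_kernel(X_new, Xr, sigma) @ beta
```

With this whitening, the returned coefficients satisfy the normalization constraint of (4) with respect to the ridge-regularized matrix, and feature extraction for a new sample needs only $r$ kernel evaluations.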

Alternatively, two sparse kPLS schemes were presented in [38] under the names of Sparse Maximal Alignment (SMA) and Sparse Maximal Covariance (SMC). Here kPLS iteratively estimates projections that either maximize the kernel alignment (c.1) or the covariance (c.2) of the projected data and the true labels:

$$
\begin{aligned}
&\max_{\beta}\ \beta^\top K_j Y \\
&\text{subject to (c.1): } \beta^\top K_j^2 \beta = 1, \\
&\text{subject to (c.2): } \beta^\top K_j \beta = 1,
\end{aligned}
\tag{5}
$$

where $K_1 = K_x$ and $K_j$ denotes the deflated kernel matrix at iteration $j$, according to [38, Eq. (3)]. The method imposes the additional constraint that the cardinality of $\beta$ is 1. This restriction explicitly enforces sparsity through an $\ell_0$-norm in the weights space. At each iteration, $\beta$ is obtained by performing an exhaustive search over all training patterns. However, the complexity of the algorithm can be significantly reduced by constraining the search to just $p$ randomly chosen samples.
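A single selection step of this kind can be sketched as follows. This is a simplified illustration rather than the algorithm of [38]: it assumes a single target vector `y` instead of the matrix $Y$, a symmetric kernel matrix with positive diagonal, and it leaves out the deflation of $K_j$ between iterations. The function name and its arguments are hypothetical. With a cardinality-1 vector $\beta = c\, e_i$, the objective reduces to $c\,(K_j y)_i$ and each constraint fixes $|c|$, so the search amounts to scoring every candidate pattern.

```python
import numpy as np


def sparse_direction(K, y, criterion="alignment", p=None, seed=None):
    """Pick the single training pattern (cardinality-1 beta) maximizing beta^T K y."""
    l = K.shape[0]
    if p is None:
        candidates = np.arange(l)                          # exhaustive search
    else:
        rng = np.random.default_rng(seed)
        candidates = rng.choice(l, size=p, replace=False)  # random subset of size p

    # For beta = c * e_i the objective is c * (K y)_i; the constraint fixes |c|.
    numerators = K[candidates] @ y
    if criterion == "alignment":      # (c.1): c^2 (K^2)_ii = 1, (K^2)_ii = ||K[i]||^2
        norms = np.linalg.norm(K[candidates], axis=1)
    elif criterion == "covariance":   # (c.2): c^2 K_ii = 1
        norms = np.sqrt(K[candidates, candidates])
    else:
        raise ValueError("criterion must be 'alignment' or 'covariance'")

    scores = np.abs(numerators) / norms
    best = int(np.argmax(scores))
    i = int(candidates[best])
    beta = np.zeros(l)
    beta[i] = (1.0 if numerators[best] >= 0 else -1.0) / norms[best]
    return i, beta, float(scores[best])
```

Passing `p` restricts the search to a random subset of candidates, which trades some objective value for a lower cost per iteration.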

In Table III, we summarize some computational and implementation issues of the aforementioned sparse kMVA methods, and of standard non-sparse kMVA and linear methods. An analysis of the properties of each algorithm provides some hints that can help us choose the algorithm for a particular application. Firstly, a critical step when using kernel methods is the selection of an appropriate kernel function and the tuning of its parameters. To avoid overfitting, kMVA methods can be adjusted using cross-validation, at the expense of a higher computational cost. Sparse methods can help in this respect by regularizing the solution. Secondly, most methods can be implemented as either eigenvalue or generalized eigenvalue problems, whose complexity typically scales cubically with the size of the analyzed matrices. Therefore, both for memory and computational reasons, only linear MVA and the sparse approaches from [26] and [38] are affordable when dealing with large data sets. A final advantage of sparse kMVA is the reduced number of kernel evaluations needed to extract features for new data.

B. Semisupervised Kernel Feature Extraction

When few labeled samples are available, the extracted features do not capture the structure of the data manifold well, and hence using them for classification or regression may lead to very poor results. Recently, semisupervised learning approaches have been introduced to alleviate these problems. Two approaches are encountered: the information conveyed by the unlabeled samples is either modeled with graphs or via kernel functions derived from generative clustering models.

Notationally, we are given $l$ labeled and $u$ unlabeled samples, a total of $n = l + u$. The semisupervised kCCA (ss-kCCA) has been recently introduced in [28] by using the graph Laplacian. The method essentially solves the standard kCCA using kernel matrices computed with both labeled and unlabeled data, which are further regularized with the graph Laplacian:

$$
\begin{pmatrix} 0 & K^x_{nl} K^y_{ln} \\ K^x_{nl} K^y_{ln} & 0 \end{pmatrix}
\begin{pmatrix} \alpha \\ v \end{pmatrix}
= \lambda
\begin{pmatrix} K^x_{nl} K^x_{ln} + R^x_{nn} & 0 \\ 0 & K^y_{nl} K^y_{ln} + R^y_{nn} \end{pmatrix}
\begin{pmatrix} \alpha \\ v \end{pmatrix},
\tag{6}
$$

where $R^x_{nn} = \alpha_x K^x_{nn} + \gamma_x K^x_{nn} L^x_{nn} K^x_{nn}$ and $R^y_{nn} = \alpha_y K^y_{nn} + \gamma_y K^y_{nn} L^y_{nn} K^y_{nn}$. For notation compactness, subindexes here indicate the size of the corresponding matrices, while superscripts denote whether they involve input or output data. Parameters $\alpha_x$, $\alpha_y$, $\gamma_x$ and $\gamma_y$ trade off the contribution of labeled and unlabeled samples, and $L = D^{-1/2}(D - M)D^{-1/2}$ represents the (normalized) graph Laplacian for the input and target domains, where $D$ is the degree matrix whose entries are the sums of the rows of the corresponding similarity matrix $M$, i.e. $D_{ii} = \sum_j M_{ij}$. It should be noted that for $n = l$ and null regularization, one obtains the standard kCCA (cf. Eq. 3). Note also that this form of regularization through the graph Laplacian can be applied to any kMVA method. A drawback of this approach is that it involves tuning several parameters and working with larger matrices of size $2n \times 2n$, which can make its application difficult in practice.
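As a small illustration of the regularization terms, the sketch below builds a normalized graph Laplacian from a similarity matrix and the corresponding term $R_{nn} = \alpha K_{nn} + \gamma K_{nn} L_{nn} K_{nn}$. It assumes the usual symmetric normalization $D^{-1/2}(D - M)D^{-1/2}$ and a symmetric similarity matrix with strictly positive row sums. The function names are hypothetical, and the choice of similarity matrix is left to the user.

```python
import numpy as np


def normalized_laplacian(M):
    """L = D^{-1/2} (D - M) D^{-1/2} for a symmetric similarity matrix M."""
    degrees = M.sum(axis=1)                      # D_ii = sum_j M_ij
    d_inv_sqrt = 1.0 / np.sqrt(degrees)
    return d_inv_sqrt[:, None] * (np.diag(degrees) - M) * d_inv_sqrt[None, :]


def laplacian_regularizer(K, L, alpha, gamma):
    """R_nn = alpha * K_nn + gamma * K_nn L_nn K_nn."""
    return alpha * K + gamma * K @ L @ K
```

The same two functions would be applied separately to the input and output domains, each with its own kernel matrix, similarity matrix and pair of trade-off parameters.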

Alternatively, cluster kernels have been used to develop semisupervised versions of kernel methods in general, and of kMVA methods in particular. The approach was used for kPLS and kOPLS in [39]. Essentially, the method relies on combining a kernel function based on labeled information only, $k_s(x_i, x_j)$, and a generative kernel directly learned by clustering all (labeled and unlabeled) data, $k_c(x_i, x_j)$. Building $k_c$ requires first running a clustering algorithm, such as Expectation-Maximization assuming a Gaussian mixture model (GMM), with different initializations, $q = 1, \ldots, Q$, and with different numbers of clusters, $g = 2, \ldots, G + 1$. This results in $Q \cdot G$ cluster assignments, where each sample $x_i$ has its corresponding posterior probability vector $\pi_i(q, g) \in \mathbb{R}^g$. The probabilistic cluster kernel $k_c$ is computed by averaging all the dot products between posterior probability vectors,

$$
k_c(x_i, x_j) = \frac{1}{Z} \sum_{q=1}^{Q} \sum_{g=2}^{G+1} \pi_i(q, g)^\top \pi_j(q, g),
$$

where $Z$ is a normalization factor. The final kernel function is defined as the weighted sum of both kernels, $k(x_i, x_j) = \beta k_s(x_i, x_j) + (1 - \beta) k_c(x_i, x_j)$, where $\beta \in [0, 1]$ is a scalar parameter to be adjusted. Intuitively, the cluster kernel accounts for probabilistic similarities at small and large scales (number of clusters) between all samples along the data manifold. The method does not require computationally demanding procedures (e.g. current GMM clustering algorithms scale linearly with $n$), and the kMVA still relies just on the labeled data, and thus requires an $l \times l$ kernel matrix. All these properties are quite appealing from the practitioner's point of view.
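The construction of the cluster kernel can be sketched in a few lines. To keep the example dependency-free, the GMM/EM step is replaced here by a crude stand-in (a softmax over squared distances to randomly chosen samples), which is only a placeholder for real posterior probabilities. The choice $Z = Q \cdot G$ and all function names are likewise assumptions made for illustration.

```python
import numpy as np


def soft_assignments(X, g, rng, temperature=1.0):
    """Stand-in for GMM posteriors: softmax over squared distances to g random samples."""
    centers = X[rng.choice(X.shape[0], size=g, replace=False)]
    sq = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    logits = -sq / temperature
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)       # rows sum to one


def probabilistic_cluster_kernel(X, Q, G, seed=0):
    """Average pi_i(q, g)^T pi_j(q, g) over Q initializations and g = 2, ..., G + 1."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    Kc = np.zeros((n, n))
    for q in range(Q):
        for g in range(2, G + 2):
            P = soft_assignments(X, g, rng)       # n x g posterior matrix
            Kc += P @ P.T
    return Kc / (Q * G)                           # assumption: Z = Q * G


def combined_kernel(Ks, Kc, beta):
    """k = beta * k_s + (1 - beta) * k_c with beta in [0, 1]."""
    return beta * Ks + (1.0 - beta) * Kc
```

Because each posterior vector sums to one, every entry of the resulting matrix lies in $[0, 1]$ under this normalization, and the matrix is positive semidefinite as a sum of Gram matrices.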

TABLE III
MAIN PROPERTIES OF (K)MVA METHODS. COMPUTATIONAL COMPLEXITY AND IMPLEMENTATION ISSUES ARE CATEGORIZED FOR THE CONSIDERED DENSE AND SPARSE METHODS, INDICATING FROM LEFT TO RIGHT: THE FREE PARAMETERS, NUMBER OF KERNEL EVALUATIONS (KE) DURING TRAINING, STORAGE REQUIREMENTS, WHETHER AN EIGENPROBLEM (EIG) OR GENERALIZED EIGENPROBLEM (GEV) NEEDS TO BE SOLVED, AND THE NUMBER OF KERNELS THAT NEED TO BE EVALUATED TO EXTRACT PROJECTIONS FOR NEW DATA.

| Method | Parameters | KE tr. | Storage Req. | EIG / GEV / Other | KE test |
|---|---|---|---|---|---|
| PCA | none | none | $O(d^2)$ | EIG | none |
| PLS | none | none | $O((d+m)^2)$ | EIG | none |
| CCA | none | none | $O((d+m)^2)$ | GEV | none |
| OPLS | none | none | $O(d^2)$ | GEV | none |
| kPCA | kernel | $l^2$ | $O(l^2)$ | EIG | $l$ |
| kPLS | kernel | $l^2$ | $O((l+m)^2)$ | EIG | $l$ |
| kCCA | kernel | $l^2$ | $O((l+m)^2)$ | GEV | $l$ |
| kOPLS | kernel | $l^2$ | $O(l^2)$ | GEV | $l$ |
| skPCA [36] | kernel, $v_n$ | $l^2$ | $O(l^2)$ | ML + EIG† | $v_n$ dependent |
| skPLS [37] | kernel, $\nu$, $\varepsilon$ | $l^2$ | $O(l^2)$ | $\nu$-SVR | $\nu$, $\varepsilon$ dependent |
| rkPCA | kernel, $r$ | $rl$ | $O(r^2)$ | EIG | $r$ |
| rkCCA | kernel, $r$ | $rl$ | $O((r+m)^2)$ | GEV | $r$ |
| rkOPLS [26] | kernel, $r$ | $rl$ | $O(r^2)$ | GEV | $r$ |
| SMA / SMC [38]‡ | kernel, $r$ | $l^2$ | $O(l^2)$ | Ex. search | $r$ |

† A maximum likelihood estimation step is required.
‡ By constraining the search to $p$ random samples at each step of the algorithm, kernel evaluations and storage requirements during training can be reduced to $rlp$ and $O(lp)$, respectively.

IV. EXPERIMENTAL RESULTS

In this section, we illustrate through different application examples the use and capabilities of the supervised kernel multivariate feature extraction framework. We start by comparing the performance of the linear and kernel MVA methods in a benchmark of classification problems from the publicly available Machine Learning Repository at University of California, Irvine (UCI)³. We then consider two real applications to show the potential of these algorithms: satellite image processing [40] and audio processing for music genre prediction [41]. The size of the data set used in this second scenario is sufficiently large to make standard kMVA methods impractical, a situation that we will use to illustrate the benefits of the sparse extensions.

³ http://archive.ics.uci.edu/ml/

A. UCI repository benchmark

Our first battery of experiments deals with standard benchmark data sets taken from the UCI repository, and is oriented to discussing some important properties of the standard linear and kernel MVA methods. The main properties of the selected problems are given in Table IV, namely the number of training and test patterns ($l$, $l_{\text{test}}$), the dimensionality of the input space ($d$), the number of classes ($c$), the ratio of training patterns per dimension, and the Kullback-Leibler divergence (KL) between the sample probabilities of each class and a uniform distribution, which can be seen as an indicator of balance among classes. The train/test partition assigns 60% of the total data (or, alternatively, a maximum of 500 samples) to the training set, so that standard kMVA complexity is kept under control. Since all selected problems involve a classification task, matrix $Y$ was used to store the class information using 1-of-$c$ coding. To obtain the overall accuracies (OA) displayed in the table, we have trained an LS model to approximate $Y$, followed by a “winner-takes-all” activation function.
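The evaluation protocol described above (1-of-$c$ coding, an LS model, and a winner-takes-all decision), together with the class-balance indicator, can be sketched as follows. This is a minimal illustration under stated assumptions: the bias term in the LS model, the use of the natural logarithm in the KL divergence, and all function names are not specified in the text.

```python
import numpy as np


def one_hot(labels, c):
    """1-of-c coding of integer class labels in {0, ..., c-1}."""
    Y = np.zeros((len(labels), c))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y


def fit_ls(F, Y):
    """Least-squares weights (with a bias term) mapping features F to targets Y."""
    Fb = np.hstack([F, np.ones((F.shape[0], 1))])
    W, *_ = np.linalg.lstsq(Fb, Y, rcond=None)
    return W


def predict_wta(F, W):
    """Winner-takes-all: the class whose LS output is largest."""
    Fb = np.hstack([F, np.ones((F.shape[0], 1))])
    return np.argmax(Fb @ W, axis=1)


def kl_to_uniform(labels, c):
    """KL divergence between empirical class frequencies and the uniform distribution."""
    p = np.bincount(labels, minlength=c) / len(labels)
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p[nz] * c)))  # natural log is an assumption
```

Here `F` stands for the features extracted by any of the (k)MVA methods; a perfectly balanced problem gives a KL value of zero.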

We start the discussion by comparing the performance of linear methods. For OPLS and CCA we have used the maximum number of features ($c - 1$), whereas for PCA $n_f$ has been fixed through a 10-fold cross-validation scheme on the training set, and is indicated next to the results of the method. When the maximum number of features is extracted, linear OPLS and CCA become identical. We can see that in most cases they perform similarly to PCA, while requiring significantly fewer features. It is important to note the very poor performance of OPLS and CCA in two particular problems: semeion and sonar. The ratio $l/d$ is very small for these problems, so the input covariance matrix is likely to be ill-conditioned. As a result, very low variance directions of the input space are used by CCA and OPLS to overfit the data. To avoid this problem, it becomes necessary to regularize these algorithms, e.g., by loading the main diagonal of the covariance matrix $C_x$ or by executing the method on the projections already extracted by PCA [11]. The results of regularized OPLS and CCA following the latter approach, and using the maximum of ($c - 1$) features, are given in the last two columns of Table IV, where we can see that the regularization does indeed help to overcome the “small-sample-size” problem. For completeness, we also present the results of the PLS2 approach. To get a fair comparison with the other supervised schemes, PLS2 is also trained to extract ($c - 1$) features. Thus, we can conclude that OPLS and CCA yield more discriminative features than the other methods, but one needs to be aware of the likely need to regularize the solution.
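The first regularization option mentioned above, loading the main diagonal of $C_x$, can be sketched as follows. Note that this is not the variant used for the regularized results in Table IV, which relies on a previous PCA projection. The sketch assumes the usual generalized eigenproblem form of linear OPLS, $C_{xy} C_{xy}^\top u = \lambda C_x u$; the loading value `eps` and the function name are hypothetical.

```python
import numpy as np


def opls_fit(X, Y, n_feat, eps=1e-2):
    """Linear OPLS projections with diagonal loading C_x + eps * I."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    l, d = Xc.shape
    Cx = Xc.T @ Xc / l
    Cxy = Xc.T @ Yc / l
    B = Cx + eps * np.eye(d)             # loading the main diagonal

    # Generalized eigenproblem Cxy Cxy^T u = lambda B u via Cholesky whitening.
    L = np.linalg.cholesky(B)
    L_inv = np.linalg.inv(L)
    eigvals, V = np.linalg.eigh(L_inv @ (Cxy @ Cxy.T) @ L_inv.T)
    top = np.argsort(eigvals)[::-1][:n_feat]
    return L_inv.T @ V[:, top]           # d x n_feat projection matrix U
```

When $l < d$ the sample covariance is rank deficient, so without the loading term the Cholesky factorization would not be well defined; the loading also limits how much the low-variance directions can be amplified.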

Next, we turn our attention to non-linear versions, whose results are displayed in the bottom half of the table. An RBF kernel has been used in all cases, selecting the kernel width with 10-fold cross-validation. A first consideration is that kernel approaches considerably outperform the linear schemes. Since the RBF kernel implies a mapping to a very high dimensional space, it is not surprising that standard kOPLS and kCCA are even more prone to overfitting than before, this being a well-known problem of these methods. As before, regularized solutions allow kOPLS and kCCA to achieve comparable performance to kPCA, while retaining a much smaller number of features ($c - 1$), which demonstrates the superior discriminative capabilities of the features extracted by these methods. For completeness, kPLS2 was used to extract the same number of features as kOPLS and kCCA, achieving considerably smaller OAs in all problems. Nevertheless, with kPLS2 it is possible to extract a larger number of projections, and this method is known to be more robust to overfitting than kOPLS and kCCA.
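Table IV reports each overall accuracy together with a binomial standard deviation. A plausible way to obtain such a figure, stated here as an assumption rather than as the authors' exact procedure, is $\sqrt{p(1-p)/l_{\text{test}}}$ with $p$ the accuracy as a fraction. The hypothetical helper below implements it; with this formula, an accuracy of 57% on 86 test patterns gives about 5.3 points, and 79% on 1228 test patterns gives about 1.2 points, in line with the glass and car entries for PCA.

```python
import math


def binomial_std_percent(oa_percent, n_test):
    """Binomial standard deviation (in percentage points) of an accuracy estimate."""
    p = oa_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_test)
```

The value shrinks with the square root of the test set size, which explains why the small glass and sonar test sets carry much wider uncertainty than optdigits.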

TABLE IV
MAIN CHARACTERISTICS OF THE DATA SETS THAT COMPOSE THE UCI BENCHMARK AND PERFORMANCE OF THE DIFFERENT (K)MVA FEATURE EXTRACTION METHODS. AS A FIGURE OF MERIT WE USE THE OVERALL ACCURACY (OA, [%]) ± THE BINOMIAL STANDARD DEVIATION. BEST RESULTS FOR EACH PROBLEM ARE HIGHLIGHTED IN BOLDFACE. THE NUMBER OF EXTRACTED FEATURES IS INDICATED FOR PCA AND KPCA, WHEREAS ALL OTHER METHODS USE $c - 1$ FEATURES.

| data set | $l$ | $d$ | $l/d$ | $n_f$ (PCA) | PCA | PLS2 | OPLS | CCA | reg. OPLS | reg. CCA |
|---|---|---|---|---|---|---|---|---|---|---|
| car | 500 | 6 | 83.3 | 5 | 79 ± 1.2 | 79.3 ± 1.2 | 79.6 ± 1.1 | 79.6 ± 1.1 | 79 ± 1.2 | 79 ± 1.2 |
| glass | 128 | 9 | 14.2 | 9 | 57 ± 5.3 | 50 ± 5.4 | 57 ± 5.3 | 57 ± 5.3 | 57 ± 5.3 | 57 ± 5.3 |
| optdigits | 500 | 62 | 8.06 | 27 | 88.8 ± 0.4 | 86.6 ± 0.5 | 90.3 ± 0.4 | 90.3 ± 0.4 | 88.8 ± 0.4 | 88.8 ± 0.4 |
| semeion | 500 | 256 | 1.95 | 78 | 83.8 ± 1.1 | 82.4 ± 1.2 | 69.1 ± 1.4 | 69.1 ± 1.4 | 83.8 ± 1.1 | 83.8 ± 1.1 |
| sonar | 125 | 60 | 2.08 | 7 | 74.7 ± 4.8 | 67.5 ± 5.1 | 65.1 ± 5.2 | 65.1 ± 5.2 | 74.7 ± 4.8 | 74.7 ± 4.8 |
| vehicle | 500 | 18 | 27.8 | 18 | 78 ± 2.2 | 63.9 ± 2.6 | 78 ± 2.2 | 78 ± 2.2 | 78 ± 2.2 | 78 ± 2.2 |
| vowel | 500 | 13 | 38.5 | 13 | 48.8 ± 2.3 | 46.9 ± 2.3 | 48.8 ± 2.3 | 48.8 ± 2.3 | 48.8 ± 2.3 | 48.8 ± 2.3 |
| yeast | 500 | 8 | 62.5 | 7 | 55.9 ± 1.6 | 55 ± 1.6 | 55 ± 1.6 | 55 ± 1.6 | 55.9 ± 1.6 | 55.9 ± 1.6 |

| data set | $l_{\text{test}}$ | $c$ | KL | $n_f$ (kPCA) | kPCA | kPLS2 | kOPLS | kCCA | reg. kOPLS | reg. kCCA |
|---|---|---|---|---|---|---|---|---|---|---|
| car | 1228 | 4 | 0.55 | 197 | 93 ± 0.7 | 80.8 ± 1.1 | 92.1 ± 0.8 | 71.8 ± 1.3 | 93 ± 0.7 | 91.6 ± 0.8 |
| glass | 86 | 6 | 0.28 | 17 | 60.5 ± 5.3 | 60.5 ± 5.3 | 41.9 ± 5.3 | 29.1 ± 4.9 | 62.8 ± 5.2 | 60.5 ± 5.3 |
| optdigits | 5120 | 10 | 0 | 330 | 95.4 ± 0.3 | 82.2 ± 0.5 | 95.2 ± 0.3 | 93.3 ± 0.4 | 95.3 ± 0.3 | 88.8 ± 0.4 |
| semeion | 1093 | 10 | 0 | 321 | 89.5 ± 0.9 | 79 ± 1.2 | 89.3 ± 0.9 | 89.4 ± 0.9 | 89 ± 0.9 | 83.9 ± 1.1 |
| sonar | 83 | 2 | 0 | 97 | 84.3 ± 4 | 67.5 ± 5.1 | 80.7 ± 4.3 | 80.7 ± 4.3 | 84.3 ± 4 | 49.4 ± 5.5 |
| vehicle | 346 | 4 | 0 | 229 | 81.5 ± 2.1 | 49.7 ± 2.7 | 76.6 ± 2.3 | 75.1 ± 2.3 | 82.1 ± 2.1 | 72.8 ± 2.4 |
| vowel | 490 | 11 | 0 | 310 | 92.7 ± 1.2 | 53.1 ± 2.3 | 92.4 ± 1.2 | 92 ± 1.2 | 93.1 ± 1.1 | 88.4 ± 1.4 |
| yeast | 984 | 10 | 0.58 | 56 | 58.8 ± 1.6 | 56.8 ± 1.6 | 48.1 ± 1.6 | 33.8 ± 1.5 | 58.7 ± 1.6 | 58.9 ± 1.6 |

B. Remote sensing image analysis

Over the last few hundred years, human activities have precipitated an environmental crisis on Earth. In the last decade, advanced statistical methods have been introduced to quantify our impact on the land/vegetation and atmosphere, and to better understand their interactions. Nowadays, multi- and hyperspectral sensors mounted on satellite or airborne platforms can acquire the energy reflected by the Earth with high spatial detail and in several wavelengths. Recent infrared sounders also allow us to estimate the profiles of atmospheric parameters with unprecedented accuracy and vertical resolution. Here, we pay attention to the performance of several kMVA methods for both image segmentation of hyperspectral images, and estimation of climate parameters from infrared sounders.

a) Hyperspectral image classification: The first case study deals with image segmentation of hyperspectral images [42]. We have used the standard AVIRIS image taken over NW Indiana's Indian Pine test site in June 1992⁴. We removed 20 noisy bands covering the region of water absorption, and finally worked with 200 spectral bands. The high number of narrow spectral bands induces a high collinearity among features. Discriminating among the major crop classes in the area can be very difficult (in particular, given the moderate spatial resolution of 20 meters), which has made the scene a challenging benchmark to validate the classification accuracy of hyperspectral imaging algorithms. The image is 145 × 145 pixels and contains 16 quite unbalanced classes (ranging from 20 to 2468 pixels). Among the available 10366 labeled pixels, 20% were used for training the feature extractors, and the remaining 80% for testing. The discriminative power of all extracted features was tested using a simple classifier consisting of a linear least squares model followed by a “winner-takes-all” activation function.

Figure 2 shows the test classification accuracy for a varying number of extracted features, $n_f$. For linear models, OPLS performs better than all other methods for any number of extracted features. Even though CCA provides similar results for $n_f = 10$, it involves a slightly more complex generalized eigenproblem. When the maximum number of projections is used, all methods result in the same error. Nevertheless, while PCA and PLS2 require 200 features (i.e., the dimensionality of the input space), CCA and OPLS only need 15 features to achieve virtually the same performance.

We also considered non-linear kPCA, kPLS2, kOPLS and kCCA, using an RBF kernel whose width was adjusted using 5-fold cross-validation in the training set. The same conclusions obtained for the linear case apply also to MVA methods in kernel feature space. The features extracted by kOPLS yield a slightly better Overall Accuracy (OA) than those of kCCA, and both methods perform significantly better than kPLS2 and kPCA. As $n_f$ approaches its maximum, all methods achieve similar accuracy. The classification maps obtained for $n_f = 10$ confirm these conclusions: higher accuracies lead to smoother maps and smaller error in large spatially homogeneous vegetation covers.

⁴ The calibrated data is available online (along with detailed ground-truth information) from http://dynamo.ecn.purdue.edu/~biehl/MultiSpec.

b) Temperature estimation from infrared sounding data: The second case study focuses on the estimation of temperature from spaceborne very high spectral resolution infrared sounding instruments. Despite the constant advances in sensor design and retrieval techniques, it is not trivial to invert the full information of the atmospheric state contained by such hyperspectral measurements. Statistical regression and feature extraction methods have overcome the numerical difficulties of radiative transfer models (RTMs), and enable fast retrievals from high volumes of data [43].

We concentrate here on data from the Infrared Atmospheric Sounding Interferometer (IASI) onboard the MetOp-A satellite. IASI spectra consist of 8461 spectral channels (input features) with a spatial resolution of 25 km at nadir. Due to its large spatial coverage and low radiometric noise, IASI provides twice daily global measurements of key atmospheric species such as ozone, carbon monoxide, methane and methanol. Since real radiosonde data cannot be obtained for the whole atmospheric column, we resorted to the standard hybrid approach for developing the prediction models: we used synthetic data for training the models, and then applied them to a full IASI orbit (91800 ‘pixels’ on March 4th, 2008). A total of 67475 synthetic samples were simulated with an infrared RTM according to input profiles of temperature at 90 pressure levels.

We are confronted here with a challenging multi-output regression problem ($x_i \in \mathbb{R}^{8461}$ and $y_i \in \mathbb{R}^{90}$), which needs fast methods for retrieval (prediction). Note that the IASI mission delivers approximately $1.3 \times 10^6$ spectra per day, which gives a rate of about 29 Gbytes/day to be processed. We compared several MVA methods followed by LS in terms of the root-MSE (RMSE) computed as the discrepancy to the official European Centre for Medium-Range Weather Forecasts (ECMWF) estimations. We obtain the RMSE for all spatial ‘pixels’ in the orbit and for all layers of the atmosphere. We fixed a maximum $n_f = 100$, and used only $l = 2000$ samples for learning the transformation and the LS regression weights. For the kernel approaches, an RBF kernel is used, selecting the width using 5-fold cross-validation.

[Figure 2: overall accuracy (%) versus number of features (logarithmic axis) for PCA, PLS2, CCA, OPLS, kPCA, kPLS2, kCCA and kOPLS; RGB composite and classification maps for PLS2 (61.5%), CCA (67.5%) and kOPLS (80.4%).]

Fig. 2. Average classification accuracy (%) for linear and kernel MVA methods as a function of the number of extracted features, along with some classification maps for the case of 10 extracted features.

[Figure 3: RMSE [K] versus pressure p [hPa] (logarithmic axis) for PCA, PLS2, CCA, OPLS, kPCA, kPLS2, kCCA and kOPLS; ECMWF world temperature map [K]; regional maps for ECMWF, PCA (2.54), kPCA (1.72) and kOPLS (1.55).]

Fig. 3. RMSE atmospheric temperature profiles (left); surface temperature [K] world map provided by the official ECMWF model, http://www.ecmwf.int/, on March 4, 2008 (middle); and estimated surface temperature maps in California/Mexico area for several methods along with the averaged RMSE across the whole atmospheric column given in brackets.

The left panel of Fig. 3 illustrates the RMSE obtained for different pressure levels, using a spatial average over all the pixels. Results show that kernel methods outperform linear approaches in RMSE terms, especially in the lower parts of the atmosphere, probably due to the presence of clouds and haze. Estimated surface temperature maps for a particular area are also given in the right panel of Fig. 3 for PCA, kPCA, and kOPLS. These plots reveal that kernel methods yield maps closer to those provided by the ECMWF in averaged RMSE terms. These results can be of high value because the kernel-based estimations are obtained with a drastic reduction in computational time compared to the physical-inversion methods used by the ECMWF.
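The RMSE figures discussed above can be computed with a couple of NumPy reductions. The sketch below assumes that predictions and reference temperatures are stored as arrays of shape (pixels, pressure levels), and that the column-averaged value is the mean of the per-level RMSEs; both the layout and the function names are assumptions made for illustration.

```python
import numpy as np


def rmse_per_level(T_pred, T_ref):
    """RMSE [K] at each pressure level, averaging squared errors over all pixels."""
    return np.sqrt(np.mean((T_pred - T_ref) ** 2, axis=0))


def column_rmse(T_pred, T_ref):
    """Single figure of merit: per-level RMSE averaged over the atmospheric column."""
    return float(rmse_per_level(T_pred, T_ref).mean())
```

Plotting the output of `rmse_per_level` against the pressure grid gives a profile of the kind shown in the left panel of Fig. 3.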

C. Analysis of integrated short-time music features for genre prediction

In this subsection, we consider the problem of predicting the genre of a song using the audio data only, a task which has recently been a subject of much interest. The data set we analyze has been previously used in [26], [41], and consists of 1317 snippets of 30 seconds distributed evenly over 11 music genres: alternative, country, easy listening, electronica, jazz, latin, pop&dance, rap&hip-hop, r&b, reggae and rock. This is a rather complex data set with an average of 1.83 songs per artist. An estimate of human performance on this data set has been carried out via subjective tests, providing an average accuracy rate around 55%.

The music snippets are MP3 (MPEG1-layer3) encoded music with a bitrate of 128 kbps or higher, downsampled to 22050 Hz, and they are processed following the method in [41]: Mel Frequency Cepstral Coefficients (MFCC) features are extracted from overlapping frames of the song, using a window size of 20 ms. Then, a multivariate autoregressive (AR) model is adjusted for every 1.2 seconds of the song to capture temporal correlation, and finally the parameters of the AR model are stacked into a 135-length feature vector for every such frame.

For training and testing the system we have split the data set into two subsets with 817 and 500 songs, respectively. After processing the audio data, we have 57388 and 36556 135-dimensional vectors in the training and test partitions, an amount which is prohibitively large for standard kernel MVA methods. For this reason, in this subsection we study the performance of linear MVA methods, as well as of sparse kernel methods that promote sparsity following the approach of [26]: rkPCA, rkOPLS, and rkCCA. For completeness, we have also considered the kPLS2 method of [7]; in this case the deflation scheme does not allow the use of reduced set methods, so we applied mere subsampling. As in the previous subsections, we use an RBF kernel, with a 10-fold cross-validation scheme over the training data to adjust the kernel width. We also resorted to an LS scheme followed by “winner-takes-all” to carry out the classification.

Figure 4 illustrates the performance of all methods for different numbers of features. For the sparse methods, the influence of the machine size, measured as the number of kernel evaluations required to extract features for new data, is also analyzed. In rkPCA, rkOPLS and rkCCA this is given by the cardinality of the reduced set, whereas in kPLS2 it coincides with the number of samples used to train the feature extractor. Since every song consists of about 70 AR vectors, we can measure the classification accuracy in two different ways: 1) on the level of individual AR vectors, or 2) by majority voting across the AR vectors of a given song. The left panel of Fig. 4 illustrates the discriminative power of the features extracted by all considered methods. Overall, the best performance is obtained by OPLS- and CCA-type methods, with the kernel schemes outperforming the linear ones for $n_f > 5$. The poor performance of PCA and rkPCA makes evident the need to exploit label information during the feature extraction step. Finally, it is evident that mere subsampling does not provide kPLS2 with enough data to extract relevant features, and this method performs worse than its linear counterpart.
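The song-level decision by majority voting can be sketched as follows. The array layout (one predicted class and one song identifier per AR vector) and the function name are assumptions; ties are resolved in favor of the lowest class index, which is simply the behavior of `argmax` and not something stated in the text.

```python
import numpy as np


def song_level_predictions(frame_pred, song_id, n_classes):
    """Majority vote over the per-frame (AR vector) predictions of each song."""
    songs = np.unique(song_id)                     # sorted song identifiers
    rows = np.searchsorted(songs, song_id)         # row of each frame's song
    votes = np.zeros((len(songs), n_classes), dtype=int)
    np.add.at(votes, (rows, frame_pred), 1)        # count votes per song and class
    return songs, votes.argmax(axis=1)
```

Accuracy at the AR level compares `frame_pred` with the frame labels directly, whereas accuracy at the song level compares the voted class with the genre of each song.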

In the right panel of Fig. 4 we analyze the accuracy of kernel methods as a function of machine size. As before, it is clear that rkOPLS and rkCCA are the best performing methods both at the AR and song levels. Increasing the size of the machine results in better performance, although the improvement is not very noticeable beyond 250 nodes. Altogether, these results allow us to conclude that sparse methods can be used to enhance the applicability of kMVA to large data sets. In this subsection, we have focused on the sparsity-promoting technique of [26], but similar advantages can be expected from other sparse approaches.

V. CONCLUSIONS

We reviewed the field of supervised feature extraction from the unified framework of multivariate analysis. The use of these techniques in real world applications is becoming increasingly popular. Beyond standard PCA, there is a plethora of linear and kernel methods that are generally better suited for supervised applications since they seek projections that maximize the alignment with the target variables. We analyzed the commonalities and basic differences of the most representative MVA approaches in the literature, as well as the relationships to existing kernel discriminative feature extraction and statistical dependence estimation approaches. We also studied recent methods to make kernel MVA more suitable to real life applications, both for large scale data sets and for problems with few labeled data. In such approaches, sparse and semi-supervised learning extensions have been successfully introduced for most of the models. Actually, seeking the appropriate features that facilitate classification or regression cuts to the heart of manifold learning. We have illustrated MVA methods in challenging real problems that typically exhibit complex manifolds. A representative subset of the UCI repository has been used to illustrate the general applicability of the standard MVA implementations in moderate data sizes. We have completed the panorama with challenging real-life applications: the prediction of music genre from recorded signals, and the classification of land-cover classes and estimation of climate variables to monitor our Planet. This tutorial presented the framework of kernel multivariate methods and outlined relations and possible extensions. The adoption of the methods in many disciplines of science and engineering is nowadays a fact. New exciting advances in the theory and applications are yet to come.

ACKNOWLEDGMENTS

The authors would like to thank the reviewers and the special issue editors for useful comments that improved the presentation of the present manuscript significantly.

This work was partially supported by Banco Santander and Universidad Carlos III de Madrid's Excellence Chair programme, and by the Spanish Ministry of Economy and Competitiveness (MINECO) under projects TIN2012-38102-C03-01, TEC2011-22480 and PRI-PIBIN-2011-1266.

REFERENCES

[1] H. Wold, “Estimation of principal components and related models by iterative least squares,” in P. R. Krishnaiah (ed.) Multivariate Analysis, pp. 391–420, Academic Press, 1966.

[2] R. Rosipal, “Nonlinear Partial Least Squares: An Overview,” in Chemoinformatics and Advanced Machine Learning Perspectives: Complex Computational Methods and Collaborative Techniques, H. Lodhi and Y. Yamanishi (eds.), ACCM, IGI Global, pp. 169–189, 2011.

[3] S. Wold, N. Kettaneh-Wold, and B. Skagerberg, “Nonlinear PLS Modeling,” Chemometrics and Intelligent Laboratory Systems, vol. 7, pp. 53–65, 1989.

[4] S. Qin and T. McAvoy, “Non-linear PLS modelling using neural networks,” Computers & Chemical Engineering, vol. 16, pp. 379–391, 1992.

[5] B. E. Boser, I. Guyon, and V. N. Vapnik, “A training algorithm for optimal margin classifiers,” in Proc. COLT'92, Pittsburgh, PA, 1992, pp. 144–152.

[6] B. Schölkopf, A. Smola, and K.-R. Müller, “Nonlinear Component Analysis as a Kernel Eigenvalue Problem,” Neural Computation, vol. 10, pp. 1299–1319, 1998.

[7] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, 2004.


[Figure 4: left panel, accuracy rates (%) versus number of features (logarithmic axis) for PCA, PLS2, CCA, OPLS, rkPCA, kPLS2, rkCCA, rkOPLS and a random-guess baseline; right panel, accuracy rates (%) versus machine size (100, 250, 500, 750) for rkPCA, kPLS2, rkCCA and rkOPLS at the AR and song levels.]

Fig. 4. Genre classification accuracy of linear and sparse kernel MVA methods. Left: Accuracy at the song level as a function of the number of extracted features. Right: Accuracy of the sparse methods measured as percentage of correctly classified AR patterns and songs for different machine sizes.

[8] IEEE Signal Process. Mag., special issue on Dimensionality Reductionvia Subspace and Submanifold Learning, vol. 28, Mar. 2011.

[9] I. T. Jollife, Principal Component Analysis, Springer-Verlag, 1986.[10] K. Pearson, “On lines and planes of closest fit to systems of points in

space,” Philosophical Mag., vol. 2 pp. 559–572, 1901.[11] M. L. Braun, J. M. Buhmann, and K.-R. Muller, “On Relevant

Dimensions in Kernel Feature Spaces,” Journal of Machine LearningResearch, vol. 9, 1875–1908, 2008.

[12] T. J. Abrahamsen and L. K. Hansen, “A Cure for Variance Inflation inHigh Dimensional Kernel Principal Component Analysis,” Journal ofMachine Learning Research, vol. 12, 2027–2044, 2011.

[13] H. Wold, “Non-linear estimation by iterative least squares procedures,”in F. David (ed.) Research papers in Statistics, pp. 411–444, New York,NY: Wiley, 1966.

[14] S. Wold, et al., “Multivariate Data Analysis in Chemistry,” in B. R.Kowalski (ed.) Chemometrics, Mathematics and Statistics in Chemistry,pp. 17–95. Reidel Publishing Company, Holland, 1984.

[15] P. Geladi, “Notes on the history and nature of partial least squares (PLS)modelling”, Journal of Chemometrics, vol. 2, pp. 231–246, 1988.

[16] N. Kramer and R. Rosipal, “Overview and recent advances in partialleast squares,” in C. Saunders et al. (eds.) Subspace, Latent Structureand Feature Selection Techniques, pp. 34–51, Springer-Verlag, 2006.

[17] H. Hotelling, “Relations between two sets of variates,” Biometrika, 28,321–377, 1936.

[18] M. Borga, T. Landelius, and H. Knutsson, “A unified approach toPCA, PLS, MLR and CCA,” Tech. Report LiTH-ISY-R-1992, Linkoping,Sweden, 1997.

[19] M. Barker and W. Rayens, “Partial least squares for discrimination,”Journal of Chemometrics, vol. 17, pp. 166–173, 2003.

[20] H. Zou, T. Hastie, and R. Tibshirani, “Sparse Principal ComponentAnalysis,” Journal of Computational and Graphical Statistics, vol. 15,pp. 265–286, 2006.

[21] D. R. Hardoon and J. Shawe-Taylor, “Sparse Canonical CorrelationAnalysis,” Machine Learning, vol. 83, pp. 331–353, 2011.

[22] M. van Gerven, Z. Chao, and T. Heskes “On the decoding of intracranialdata using sparse orthonormalized partial least squares,” Journal ofNeural Engineering, vol. 9, 2012.

[23] C. M. Bishop, Neural networks for pattern recognition, Oxford Univer-sity Press, 1995.

[24] P. L. Lai and C. Fyfe, “Kernel and non-linear Canonical CorrelationAnalysis,” Intl. Journal of Neural Systems, vol. 10, pp. 365–377, 2000.

[25] R. Rosipal and L. J. Trejo, “Kernel partial least squares regression inreproducing Hilbert spaces,” Journal of Machine Learning Research, 2,97–123, 2001.

[26] J. Arenas-Garcıa, K. B. Petersen, and L. K. Hansen, “Sparse kernelorthonormalized PLS for feature extraction in large data sets,” in NIPS,19, MIT Press, 2007.

[27] F. Biessmann, et.al., “Temporal kernel CCA and its application inmultimodal neuronal data analysis,” Machine Learning, vol. 79, pp.5–27, 2010.

[28] M. Blaschko, J. Shelton, A. Bartels, C. Lampert, and A. Gretton, “Semi-supervised Kernel Canonical Correlation Analysis with Application toHuman fMRI”, Patt. Recogn. Lett. vol. 32, pp.1572–1583, 2011.

[29] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller, “Fisher discriminant analysis with kernels,” in Proc. IEEE Neural Networks for Signal Processing Workshop, Madison, WI, Aug. 1999, pp. 41–48.

[30] S. Mika et al., “Constructing Descriptive and Discriminative Nonlinear Features: Rayleigh Coefficients in Kernel Feature Spaces,” IEEE Trans. Patt. Anal. Mach. Intell., vol. 25, pp. 623–628, 2003.

[31] G. Baudat and F. Anouar, “Generalized Discriminant Analysis Using a Kernel Approach,” Neural Computation, vol. 12, pp. 2385–2404, 2000.

[32] A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf, “Measuring statistical dependence with Hilbert-Schmidt norms,” in Proc. 16th Intl. Conf. Algorithmic Learning Theory, Springer, 2005, pp. 63–77.

[33] A. Gretton, R. Herbrich, and A. Hyvärinen, “Kernel methods for measuring independence,” Journal of Machine Learning Research, vol. 6, pp. 2075–2129, 2005.

[34] J. C. Principe, Information Theoretic Learning: Rényi’s Entropy and Kernel Perspectives, Springer, 2010.

[35] L. Hoegaerts, J. A. K. Suykens, J. Vandewalle, and B. De Moor, “Subset based least squares subspace regression in RKHS,” Neurocomputing, vol. 63, pp. 293–323, 2005.

[36] M. E. Tipping, “Sparse Kernel Principal Component Analysis,” in NIPS, vol. 13, MIT Press, 2001.

[37] M. Momma and K. Bennett, “Sparse kernel partial least squares regression,” in Proc. Conf. on Learning Theory, 2003.

[38] C. Dhanjal, S. R. Gunn, and J. Shawe-Taylor, “Efficient sparse kernel feature extraction based on partial least squares,” IEEE Trans. Patt. Anal. and Mach. Intell., vol. 31, pp. 1347–1360, 2009.

[39] E. Izquierdo-Verdiguier, J. Arenas-García, S. Muñoz-Romero, L. Gómez-Chova, and G. Camps-Valls, “Semisupervised Kernel Orthonormalized Partial Least Squares,” in Proc. IEEE Mach. Learn. Sign. Proc. Workshop, Santander, Spain, 2012.

[40] J. Arenas-García and G. Camps-Valls, “Efficient Kernel OPLS for remote sensing applications,” IEEE Trans. Geosc. Rem. Sens., vol. 44, pp. 2872–2881, 2008.

[41] A. Meng, P. Ahrendt, J. Larsen, and L. K. Hansen, “Temporal Feature Integration for Music Genre Classification,” IEEE Trans. Audio, Speech and Lang. Process., vol. 15, pp. 1654–1664, 2007.

[42] J. Arenas-García and K. B. Petersen, “Kernel multivariate analysis in remote sensing feature extraction,” in G. Camps-Valls and L. Bruzzone (eds.) Kernel methods for Remote Sensing Data Analysis, Wiley, 2009.

[43] G. Camps-Valls, J. Muñoz-Marí, L. Gómez-Chova, L. Guanter, and X. Calbet, “Nonlinear Statistical Retrieval of Atmospheric Profiles from MetOp-IASI and MTG-IRS Infrared Sounding Data,” IEEE Trans. Geoscience and Remote Sensing, 2012.
