Kernel Component Analysis using an Epsilon Insensitive Robust Loss Function

Carlos Alzate, Johan A. K. Suykens
Katholieke Universiteit Leuven, ESAT-SCD-SISTA
Kasteelpark Arenberg 10
B-3001 Leuven (Heverlee), Belgium
carlos.alzate, [email protected]
Abstract
Kernel Principal Component Analysis is a technique to perform feature extraction in a high-dimensional
feature space which is nonlinearly related to the original input space. The kernel PCA formulation
corresponds to an eigendecomposition of the kernel matrix: eigenvectors with large eigenvalues correspond
to the principal components in the feature space. Starting from the Least Squares Support Vector Machine
(LS-SVM) formulation to kernel PCA, we extend it to a generalized form with a general underlying loss
function made explicit. For classical kernel PCA, the underlying loss function is $L_2$. In this generalized
form, one can also plug in other loss functions. In the context of robust statistics it is known that the
$L_2$ loss function is not robust because its influence function is not bounded. Therefore, outliers can
skew the solution from the desired one. Another issue with kernel PCA is the lack of sparseness: the
principal components are dense expansions in terms of kernel functions. In this paper, we introduce
robustness and sparseness into kernel component analysis by using an epsilon insensitive robust loss
function. We propose two different algorithms. The first method solves a set of nonlinear equations using
a sequential quadratic programming method with kernel PCA solutions as starting points. The second
method uses a simplified iterative weighting procedure that leads to solving a sequence of generalized
eigenvalue problems. Simulations with toy and real-life data show improvements in terms of robustness
together with a sparse representation.

Keywords: Kernel principal component analysis, Least Squares Support Vector Machines, loss function,
robustness, sparseness, epsilon insensitive loss function.
DRAFT
I. INTRODUCTION
Principal Component Analysis (PCA) is a powerful unsupervised learning technique widely used for
feature extraction, denoising and compression [1]. It is an orthogonal basis transformation where the
new basis is found by an eigendecomposition of the covariance matrix $C$ of a normalized dataset. PCA
constructs the rank-$k$ subspace approximation to the training data that is optimal in a least squares sense,
where $k < d$ and $d$ is the input dimensionality. Although PCA is a powerful technique, it cannot detect
nonlinear structures in the input data [2]. Kernel PCA, as a natural nonlinear generalization of PCA,
first maps the input data into some feature space $\mathcal{F}$ via a nonlinear feature map $\varphi$ induced by a kernel
and then performs linear PCA on the mapped data. The projections onto the subspace spanned by the
eigenvectors now lie in $\mathcal{F}$, but in many applications we need the corresponding patterns in the input
space. In order to recover the extracted features or the denoised patterns we have to map from $\mathcal{F}$ back
to the input space (also called the pre-image problem [3]–[6]). In [7] it was shown that kernel PCA can
be derived as the dual problem to a primal optimization problem formulated in a kernel induced feature
space. With this formulation, the underlying loss function associated to classical kernel PCA was shown
to be $L_2$. In the context of robust statistics it is known that the $L_2$ loss function is not robust because
its influence function is not bounded [8], [9], hence outliers can arbitrarily deviate the solution from the
desired one. Even though the approach based on influence functions was developed mainly for regression,
an explicit formulation for the PCA influence function was derived in [10], showing that it is unbounded
and confirming its sensitivity to outliers and therefore its non-robust nature. In [11] a statistical physics
approach to PCA was studied in order to obtain robust principal components. The resulting formulation
led to a bounded influence function, as was demonstrated in [12]. Another issue with kernel PCA is
the lack of sparseness because the principal components are expressed in terms of a dense expansion of
kernels associated with every training data point. Several approaches have been developed to introduce
sparseness into kernel PCA, see e.g. [4], [13].
In this paper we derive a generalized formulation where the loss function associated to kernel PCA is
explicit and can easily be changed. By incorporating an epsilon insensitive zone into a robust loss function
it is possible to impose robustness and sparseness into kernel PCA. Two different algorithms are proposed.
The first one solves a set of nonlinear equations using e.g. a sequential quadratic programming (SQP)
method with classical kernel PCA eigenvalues/vectors as starting points. The second proposed algorithm
is a simplified weighting scheme that results in solving a sequence of generalized eigenvalue problems.
With this scheme, every data point in the training set has an associated weight which determines its
DRAFT
3
importance in the component calculation. We also propose a model selection method on validation data
to find the kernel parameters and the number of components. This tuning method looks for parameters
that maximize a criterion based on the proposed loss function. A heuristic to obtain good choices of the
loss function parameters is also given.
This paper is organized as follows: Section II describes the links between Least Squares Support Vector
Machines (LS-SVMs) and kernel PCA. In Section III we propose two approaches to extend the kernel
PCA formulation to other loss functions. Section IV discusses robustness and sparseness in relation to the
chosen loss function. In Section V we review three pre-image algorithms that can be combined with the
new method. In Section VI we propose two algorithms to find components with a general loss function.
Section VII describes a model selection method. In Section VIII we report the empirical results and in
Section IX we give conclusions.
II. LS-SVMS AND KERNEL PCA
Least squares support vector machines (LS-SVM) formulations to different problems were discussed
by Suykens et al. in [14]. That work emphasizes primal-dual interpretations for a class of kernel machines in
terms of constrained optimization problems. LS-SVMs have been studied with respect to kernel Fisher
discriminant analysis, kernel ridge regression, kernel canonical correlation analysis, kernel partial least-
squares, recurrent networks, control, kernel PCA and others [7], [14]. In this Section we discuss LS-SVM
formulations to kernel PCA which are relevant for the sequel of the paper.
A. Classical PCA formulation
In the classical PCA formulation, one considers a given set of zero-mean, normalized data $\{x_i\}_{i=1}^N$
with $x_i \in \mathbb{R}^d$. The objective is to find projected variables $w^T x$ with maximal variance:

$$\max_w \; \mathrm{Var}(w^T x) \simeq \frac{1}{N} \sum_{i=1}^N (w^T x_i)^2 = w^T C w \qquad (1)$$

such that $w^T w = 1$, where $C = (1/N)\sum_{i=1}^N x_i x_i^T$ is the sample covariance matrix. The Lagrangian for
this constrained optimization problem is given by:

$$\mathcal{L}(w; \lambda) = \frac{1}{2} w^T C w - \lambda (w^T w - 1) \qquad (2)$$
where $\lambda$ is a Lagrange multiplier. From the conditions for optimality $\partial \mathcal{L}/\partial w = 0$, $\partial \mathcal{L}/\partial \lambda = 0$ one
obtains the eigenvalue problem:

$$C w = \lambda w. \qquad (3)$$

The eigenvector $w$ corresponding to the largest eigenvalue defines the direction in which the projected
variables have maximal variance. Subsequent eigenvectors and eigenvalues are usually found by deflating
the data matrix or by adding orthogonality constraints with respect to the previously found eigenvectors
[1].
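As a minimal numerical sketch of (1)–(3) (the function and variable names below are ours, not from the paper), the principal directions follow from an eigendecomposition of the sample covariance matrix:

```python
import numpy as np

def pca_components(X, k):
    """Classical PCA: top-k eigenvectors/eigenvalues of the sample covariance C."""
    Xc = X - X.mean(axis=0)               # zero-mean the data
    C = (Xc.T @ Xc) / Xc.shape[0]         # C = (1/N) sum_i x_i x_i^T
    lam, W = np.linalg.eigh(C)            # eigenvalue problem C w = lambda w, eq. (3)
    order = np.argsort(lam)[::-1][:k]     # sort by decreasing variance
    return W[:, order], lam[order]

# The projected variables w^T x have maximal variance along the first eigenvector.
rng = np.random.RandomState(0)
X = rng.randn(200, 3) @ np.diag([3.0, 1.0, 0.3])
W, lam = pca_components(X, 2)
```

Sorting by decreasing eigenvalue mirrors the deflation/ordering convention described above.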
B. LS-SVM Approach to Kernel PCA
1) Unconstrained Formulation: Consider the following unconstrained optimization problem:

$$\min_w J(w) = \frac{\gamma}{2N} \sum_{i=1}^N \left( w^T (\varphi(x_i) - \mu_\varphi) \right)^2 - \frac{1}{2} w^T w \qquad (4)$$

where $\gamma$ is a positive regularization constant, $\mu_\varphi = (1/N)\sum_{j=1}^N \varphi(x_j)$ and $\varphi(\cdot): \mathbb{R}^d \to \mathbb{R}^{n_h}$ is the
mapping to a high dimensional feature space $\mathcal{F}$ of dimension $n_h$ (which can be infinite dimensional).

Proposition 1: Each eigenvector $w^\star$ with corresponding eigenvalue $\rho$ of the sample covariance matrix
in the feature space $C_\varphi = (1/N)\sum_{j=1}^N (\varphi(x_j) - \mu_\varphi)(\varphi(x_j) - \mu_\varphi)^T$ is a stationary point of $J(w)$ given
that $\gamma = 1/\rho$. The objective function $J(w)$ evaluated at any eigenvector $w^\star$ equals zero.

Proof: The optimization problem in (4) can be written as:

$$\min_w J(w) = \frac{\gamma}{2} w^T C_\varphi w - \frac{1}{2} w^T w = \frac{1}{2} w^T (\gamma C_\varphi - I_{n_h}) w.$$

Using the first-order optimality condition $\partial J(w)/\partial w = 0$ leads to

$$C_\varphi w^\star = \rho w^\star, \qquad (5)$$

which is an eigenvalue problem of $C_\varphi$ with $\rho = 1/\gamma$ and corresponding eigenvector $w^\star$ such that
$(w^\star)^T w^\star = 1$. The objective function becomes

$$J(w^\star) = \frac{\gamma}{2} (w^\star)^T \rho w^\star - \frac{1}{2} (w^\star)^T w^\star = \left( \frac{\gamma \rho}{2} - \frac{1}{2} \right) (w^\star)^T w^\star = 0.$$
Remark 1: Note that solving (5) becomes intractable for high dimensional feature spaces, as the size
of the sample covariance matrix $C_\varphi$ may become too large.
2) Constrained Formulation: Now consider the following constrained optimization problem obtained
by introducing additional variables $e_i$ into (4):

$$\min_{w,e} J_p(w, e) = \frac{\gamma}{2N} \sum_{i=1}^N e_i^2 - \frac{1}{2} w^T w \qquad (6)$$

$$\text{such that} \quad e_i = w^T (\varphi(x_i) - \mu_\varphi), \quad i = 1, \ldots, N.$$

Lemma 1: Given a positive definite kernel function $K: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ with $K(x, z) = \varphi(x)^T \varphi(z)$ and
a regularization constant $\gamma \in \mathbb{R}^+$, the dual problem to (6) is given by the following eigenvalue problem:

$$\Omega_c \alpha = \lambda \alpha \qquad (7)$$

where $\Omega_c$ denotes the centered kernel matrix with $ij$-th entry $\Omega_{c,ij} = K(x_i, x_j) - \frac{1}{N}\sum_{r=1}^N K(x_i, x_r) - \frac{1}{N}\sum_{r=1}^N K(x_j, x_r) + \frac{1}{N^2}\sum_{r=1}^N \sum_{s=1}^N K(x_r, x_s)$ and $\lambda = N/\gamma$. Assume that $\Omega_c - (N/\gamma) I_N \preceq 0$ and
$\gamma C_\varphi - I \succeq 0$; then the strong duality property holds.
Proof: Consider the Lagrangian of the problem (6):

$$\mathcal{L}(w, e_i; \alpha) = \frac{\gamma}{2N} \sum_{i=1}^N e_i^2 - \frac{1}{2} w^T w - \sum_{i=1}^N \alpha_i (e_i - w^T (\varphi(x_i) - \mu_\varphi)).$$

In general the min-max inequality holds [15]:

$$\max_\alpha \min_{w, e_i} \mathcal{L}(w, e_i; \alpha) \leq \min_{w, e_i} \max_\alpha \mathcal{L}(w, e_i; \alpha) \qquad (8)$$

which means that the optimal value of the dual problem is a lower bound on the optimal value of the
primal problem. When (8) is fulfilled with strict equality, the duality gap is zero and strong duality
holds. Maximizing $\mathcal{L}(w, e_i; \alpha)$ with respect to $\alpha$ leads to the unconstrained optimization problem (4).
Thus, the right hand side of (8) is zero at the stationary points (see Proposition 1). Minimizing the
Lagrangian with respect to $e_i$ leads to the optimality condition

$$\frac{\partial \mathcal{L}}{\partial e_i} = 0 \rightarrow e_i = \frac{N}{\gamma} \alpha_i, \quad i = 1, \ldots, N. \qquad (9)$$

The Lagrangian is a concave quadratic function of $w$ and is therefore unbounded below; however, its
stationary point is given by the optimality condition

$$\frac{\partial \mathcal{L}}{\partial w} = 0 \rightarrow w = \sum_{i=1}^N \alpha_i (\varphi(x_i) - \mu_\varphi). \qquad (10)$$
Eliminating $w$ and $e_i$ leads to:

$$\max_\alpha \mathcal{L}(\alpha) = \frac{1}{2} \sum_{i,j=1}^N \alpha_i \alpha_j (\varphi(x_j) - \mu_\varphi)^T (\varphi(x_i) - \mu_\varphi) - \frac{N}{2\gamma} \sum_{i=1}^N \alpha_i^2.$$

By making use of the positive definite kernel function

$$(\varphi(x_i) - \mu_\varphi)^T (\varphi(x_j) - \mu_\varphi) = K(x_i, x_j) - \frac{1}{N} \sum_{r=1}^N K(x_i, x_r) - \frac{1}{N} \sum_{r=1}^N K(x_j, x_r) + \frac{1}{N^2} \sum_{r=1}^N \sum_{s=1}^N K(x_r, x_s) = \Omega_c(x_i, x_j) \qquad (11)$$

one obtains the dual problem:

$$\max_\alpha \mathcal{L}(\alpha) = \frac{1}{2} \alpha^T \left( \Omega_c - \frac{N}{\gamma} I_N \right) \alpha \qquad (12)$$

where $\alpha = [\alpha_1, \ldots, \alpha_N]$, $\Omega_c = M_c \Omega M_c$, $\Omega$ is the kernel matrix with $ij$-th entry $\Omega_{ij} = K(x_i, x_j)$ and
$M_c = I_N - (1/N) 1_N 1_N^T$ is the centering matrix. Using $\partial \mathcal{L}(\alpha)/\partial \alpha = 0$ and defining $\lambda = N/\gamma$ leads
to the eigenvalue problem (7). Given that $\alpha^\star$ is an eigenvector of $\Omega_c$ with corresponding eigenvalue
$\lambda = N/\gamma$, the dual objective function value becomes $\mathcal{L}(\alpha^\star) = \frac{1}{2}(N/\gamma - N/\gamma) = 0$, therefore strong
duality holds.
Remark 2: Note that every eigenvalue/eigenvector pair of $\Omega_c$ is a stationary point of (6) with objective
function value equal to zero. Therefore, the principal components can be seen as a pool of possible
solutions. The problem of component selection can be solved by sorting the components in decreasing
order with respect to $\lambda$, which corresponds to the variance in the $L_2$ case. This differs from classical PCA,
where the objective function value at the optima is the eigenvalue and reflects the amount of variance
captured by the projection onto the corresponding eigenvector.

Remark 3: Note that changing the minimization operator to maximization in (6) leads to the same set
of stationary points (7), and Remark 2 also holds.

Remark 4: Note that (7) corresponds to kernel PCA as proposed by Schölkopf et al. in [2], where
[7] showed that a similar form of (6) is the underlying optimization problem to which kernel PCA is
the dual. It is also remarkable that (7) is regularized by restricting the eigenvectors to the unit ball,
while regularization is explicitly present in the primal problem (6) through the regularization term.
Remark 5: Considering a bias term in (6) leads to centering the kernel matrix $\Omega$ in the feature space
such that the mapped data have zero mean, see [7]. The bias term becomes

$$b = -\frac{1}{N} 1_N^T \Omega \alpha. \qquad (13)$$
Remark 6: The projections of a datapoint $x$ onto an eigenvector (also called the score variables)
become:

$$z(x) = w^T (\varphi(x) - \mu_\varphi) = \sum_{l=1}^N \alpha_l \left( K(x_l, x) - \frac{1}{N} \sum_{r=1}^N K(x_r, x) - \frac{1}{N} \sum_{r=1}^N K(x_r, x_l) + \frac{1}{N^2} \sum_{r=1}^N \sum_{s=1}^N K(x_r, x_s) \right). \qquad (14)$$
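The eigenvalue problem (7) and the score variables (14) can be sketched numerically as follows (an illustrative implementation; the RBF kernel choice and all names are ours):

```python
import numpy as np

def rbf_kernel(X, Z, sigma2):
    # K(x, z) = exp(-||x - z||^2 / sigma2), a positive definite kernel
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma2)

def kernel_pca(X, sigma2, n_comp):
    """Solve Omega_c alpha = lambda alpha, eq. (7), with Omega_c = Mc Omega Mc."""
    N = X.shape[0]
    Omega = rbf_kernel(X, X, sigma2)
    Mc = np.eye(N) - np.ones((N, N)) / N      # centering matrix
    Omega_c = Mc @ Omega @ Mc                 # centered kernel matrix, eq. (11)
    lam, A = np.linalg.eigh(Omega_c)
    order = np.argsort(lam)[::-1][:n_comp]
    return A[:, order], lam[order]

def score_variables(Xtrain, x, alpha, sigma2):
    """Score variables z(x) of eq. (14) for a single point x."""
    k = rbf_kernel(Xtrain, x[None, :], sigma2).ravel()   # K(x_l, x)
    K = rbf_kernel(Xtrain, Xtrain, sigma2)
    kc = k - k.mean() - K.mean(axis=1) + K.mean()        # centered kernel evaluations
    return alpha.T @ kc
```

For a training point $x_i$, (14) reduces to $z(x_i) = (\Omega_c \alpha)_i = \lambda \alpha_i$, which gives a quick consistency check.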
III. EXTENSION TO OTHER LOSS FUNCTIONS
A. Set of Nonlinear Equations
In (6) kernel PCA uses the $L_2$ loss function. By extending the formulation now to a general loss
function $L: \mathbb{R} \to \mathbb{R}$ and introducing a bias term $b$, the following problem can be formulated:

$$\min_{w, e_i, b} J_p(w, e_i, b) = \frac{\gamma}{2} \sum_{i=1}^N L(e_i) - \frac{1}{2} w^T w \qquad (15)$$

$$\text{such that} \quad e_i = w^T \varphi(x_i) + b, \quad i = 1, \ldots, N.$$

Lemma 2: Given a positive definite kernel function $K: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ with $K(x, z) = \varphi(x)^T \varphi(z)$,
a differentiable loss function $L: \mathbb{R} \to \mathbb{R}$ and a positive regularization constant $\gamma$, the local
optimal solutions to the problem (15) can be obtained as the solution to the following set of nonlinear
equations with unknowns $\alpha_i$, $i = 1, \ldots, N$ (dual problem) and $b$:

$$\begin{cases} \alpha_i - \dfrac{\gamma}{2} L'\left( \sum_{l=1}^N \alpha_l K(x_l, x_i) + b \right) = 0, & i = 1, \ldots, N \\[6pt] \sum_{i=1}^N L'\left( \sum_{l=1}^N \alpha_l K(x_l, x_i) + b \right) = 0. \end{cases} \qquad (16)$$
Proof: Consider the Lagrangian of the problem (15):

$$\mathcal{L}(w, e_i, b; \alpha) = \frac{\gamma}{2} \sum_{i=1}^N L(e_i) - \frac{1}{2} w^T w - \sum_{i=1}^N \alpha_i (e_i - w^T \varphi(x_i) - b) \qquad (17)$$

with Karush-Kuhn-Tucker (KKT) optimality conditions:

$$\begin{cases} \partial \mathcal{L}/\partial w = 0 \rightarrow w = \sum_{i=1}^N \alpha_i \varphi(x_i) \\ \partial \mathcal{L}/\partial e_i = 0 \rightarrow \alpha_i = \frac{\gamma}{2} L'(e_i), \quad i = 1, \ldots, N \\ \partial \mathcal{L}/\partial b = 0 \rightarrow \sum_{i=1}^N \alpha_i = 0 \\ \partial \mathcal{L}/\partial \alpha_i = 0 \rightarrow e_i = w^T \varphi(x_i) + b, \quad i = 1, \ldots, N. \end{cases}$$

Eliminating $w$, $e_i$ leads to:

$$\frac{\gamma}{2} L'\left( \sum_{l=1}^N \alpha_l \varphi(x_l)^T \varphi(x_i) + b \right) = \alpha_i, \quad i = 1, \ldots, N,$$

$$\sum_{i=1}^N L'\left( \sum_{l=1}^N \alpha_l \varphi(x_l)^T \varphi(x_i) + b \right) = 0$$

and applying the definition of a positive definite kernel function gives the set of nonlinear equations (16).
Remark 7: Note that in general (15) is a non-convex problem, therefore the KKT optimality conditions
are necessary but not sufficient.

Remark 8: Note that (16) leads to a different centering in the feature space, because the bias term is
also affected by the loss function. This is important for robustness because the bias term is related to the
estimation of the mean in the feature space, as shown in the $L_2$ case.

Remark 9: Note that the concept of eigenvalue and eigenvector does not exist in (16).

Remark 10: The score variables become

$$z(x) = w^T \varphi(x) + b = \sum_{l=1}^N \alpha_l K(x_l, x) + b. \qquad (18)$$
Remark 11: Consider the minimization of the $L_2$ loss of the residuals:

$$U^\star = \arg\min_U \sum_{i=1}^N r_i^2 \qquad (19)$$

where $r_i = \|\varphi(x_i) - P_U \varphi(x_i)\|_2$, $U$ denotes an orthogonal subspace and $P_U \varphi(x_i)$ denotes the projection
of the $i$-th mapped datapoint onto $U$. Since $P_U \varphi(x_i)$ is an orthogonal projection, one can write

$$\|\varphi(x_i) - P_U \varphi(x_i)\|_2^2 = \|\varphi(x_i)\|_2^2 - \|P_U \varphi(x_i)\|_2^2$$

which leads to

$$U^\star = \arg\max_U \sum_{i=1}^N \|P_U \varphi(x_i)\|_2^2.$$

This shows that minimizing the sum of the $L_2$ loss of the residuals is equivalent to maximizing the
sum of the $L_2$ loss of the projections. However, in the case of a general loss function $L$, this equivalence
no longer holds. Hence, a general loss function can be applied to the projections $e$, but the loss function
applied to the primal vector $w$ should be $L_2$ in order to exploit the equivalence between a dot product
in the feature space and a kernel function evaluation.
B. A Simplified Weighting Scheme

Another approach to extend kernel PCA to general loss functions is to consider a loss function of the
form $L(e_i) = v_i^{(k)} e_i^2$, where $v_i^{(k)}$, $i = 1, \ldots, N$ are user-defined weights and $k$ is the iteration step. A
two-step weighting scheme was proposed in [16] to impose robustness in LS-SVM regression. In the first
step, an unweighted LS-SVM is calculated. The second stage consists of associating weighting values
to the error variables obtained in the first step. This scheme can also be employed in kernel PCA as
a simplified technique to obtain robust and sparse solutions. Introducing weighting factors $v_i^{(k)}$ in the
formulation leads to the following primal optimization problem:

$$\min_{w, e_i} J_p(w, e_i) = \frac{\gamma}{2} \sum_{i=1}^N v_i^{(k)} e_i^2 - \frac{1}{2} w^T w \qquad (20)$$

$$\text{such that} \quad e_i = w^T (\varphi(x_i) - \mu_\varphi), \quad i = 1, \ldots, N.$$

A simple method is a two-step approach; in general more steps are possible, although a precise
convergence analysis would then be needed. Here, however, we are interested in a simple way to cheaply
incorporate robustness and sparseness into kernel PCA. In the first step, an unweighted kernel PCA solution
is obtained ($v_i^{(1)} = 1$, $i = 1, \ldots, N$). Then the weights are calculated in the second step in such a way
that $v_i^{(2)} e_i^2 = L(e_i)$. The advantage of this approach is that one extends kernel PCA to general loss
functions by solving a sequence of generalized eigenvalue problems, which can be solved efficiently,
while in (16) one needs to solve a set of nonlinear equations which may lead to local optima and slow
convergence.
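A minimal sketch of this two-step scheme for the first component (our own illustrative code; it assumes a precomputed centered kernel matrix and uses the epsilon-insensitive loss of Section IV to set the weights):

```python
import numpy as np

def eps_insensitive_huber(e, eps, h):
    # loss of eq. (32): zero zone, quadratic zone, linear zone
    a = np.abs(e)
    return np.where(a <= eps, 0.0,
           np.where(a < h, (a - eps) ** 2,
                    2 * (h - eps) * a - h ** 2 + eps ** 2))

def weighted_kpca_first(Omega_c, eps, h):
    # step 1: unweighted kernel PCA, v_i^(1) = 1
    lam, A = np.linalg.eigh(Omega_c)
    alpha = A[:, -1]                         # leading component
    e = lam[-1] * alpha                      # error variables e_i = lambda * alpha_i
    # step 2: weights chosen so that v_i^(2) e_i^2 = L(e_i); v_i = 0 gives sparseness
    v = eps_insensitive_huber(e, eps, h) / np.maximum(e ** 2, 1e-12)
    vals, vecs = np.linalg.eig(np.diag(v) @ Omega_c)   # V Omega_c alpha = lambda alpha
    k = np.argmax(vals.real)
    return vecs[:, k].real, vals[k].real, v
```

Note that $V^{(k)}\Omega_c$ is not symmetric in general, hence the nonsymmetric eigensolver; data points whose errors fall inside the epsilon zone receive weight zero and drop out of the expansion.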
The first weighted principal component is the solution to (20). The second weighted component is
the solution to (20) with an additional constraint. This constraint imposes orthogonality with respect
to the score variables associated to the previously found weighted component. For the third weighted
component, two constraints are added to ensure orthogonality with respect to the score variables for
the second and first weighted principal components. This iterative process is repeated until all required
weighted components are obtained. Hence, the $m$-th weighted principal component is the solution to the
following primal problem with $m - 1$ orthogonality constraints:

$$\min_{w, e_i} J_p(w, e_i) = \frac{\gamma}{2} \sum_{i=1}^N v_i^{(k)} e_i^2 - \frac{1}{2} w^T w$$

$$\text{such that} \quad \begin{cases} e_i = w^T (\varphi(x_i) - \mu_\varphi), & i = 1, \ldots, N \\ \sum_{i=1}^N e_i e_i^{(1)} = 0 \\ \sum_{i=1}^N e_i e_i^{(2)} = 0 \\ \quad \vdots \\ \sum_{i=1}^N e_i e_i^{(m-1)} = 0 \end{cases} \qquad (21)$$

where $k = 0, \ldots, N_{\mathrm{iter}}$ is the iteration step and $e_i^{(l)}$, $l = 1, \ldots, m-1$ are the score variables corresponding
to the $l$-th previously found weighted principal component, which are assumed to be given.
Lemma 3: Given a positive definite kernel function $K: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ with $K(x, z) = \varphi(x)^T \varphi(z)$,
score variables $e_i^{(1)}, e_i^{(2)}, \ldots, e_i^{(m-1)} \in \mathbb{R}$, $i = 1, \ldots, N$ and positive weighting factors $v_i^{(k)} \in \mathbb{R}^+$,
$i = 1, \ldots, N$, the solution to the problem (21) is given by the following generalized eigenvalue problem:

$$\begin{bmatrix} \Omega_c^V & 0_{N \times (m-1)} \\ 0_{N \times (m-1)}^T & 0_{(m-1) \times (m-1)} \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \lambda \begin{bmatrix} I_N & E \\ E^T & D \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} \qquad (22)$$

where $V^{(k)} = \mathrm{diag}(v_1^{(k)}, \ldots, v_N^{(k)})$, $\Omega_c^V = V^{(k)} \Omega_c$, $0_{p \times q}$ is the $p \times q$ zero matrix, $\alpha = [\alpha_1; \ldots; \alpha_N]$,
$\beta = [\beta_1; \beta_2; \ldots; \beta_{m-1}]$, $\lambda = 1/\gamma$, $I_N$ is the $N \times N$ identity matrix and $E \in \mathbb{R}^{N \times (m-1)}$,
$D \in \mathbb{R}^{(m-1) \times (m-1)}$ are defined entrywise as:

$$E_{il} = \frac{e_i^{(l)}}{v_i^{(k)}}, \quad i = 1, \ldots, N, \; l = 1, \ldots, m-1, \qquad D_{jl} = \sum_{i=1}^N \frac{e_i^{(j)} e_i^{(l)}}{v_i^{(k)}}, \quad j, l = 1, \ldots, m-1.$$
Proof: Consider the Lagrangian of the problem (21):

$$\mathcal{L}(w, e_i; \alpha_i, \beta_j) = \frac{\gamma}{2} \sum_{i=1}^N v_i^{(k)} e_i^2 - \frac{1}{2} w^T w - \sum_{i=1}^N \alpha_i (e_i - w^T (\varphi(x_i) - \mu_\varphi)) - \beta_1 \sum_{i=1}^N e_i e_i^{(1)} - \beta_2 \sum_{i=1}^N e_i e_i^{(2)} - \ldots - \beta_{m-1} \sum_{i=1}^N e_i e_i^{(m-1)} \qquad (23)$$

with conditions for optimality:

$$\frac{\partial \mathcal{L}}{\partial w} = 0 \rightarrow w = \sum_{i=1}^N \alpha_i (\varphi(x_i) - \mu_\varphi) \qquad (24)$$

$$\frac{\partial \mathcal{L}}{\partial e_i} = 0 \rightarrow e_i = \frac{1}{\gamma v_i^{(k)}} \left( \alpha_i + \beta_1 e_i^{(1)} + \beta_2 e_i^{(2)} + \ldots + \beta_{m-1} e_i^{(m-1)} \right), \quad i = 1, \ldots, N \qquad (25)$$

$$\frac{\partial \mathcal{L}}{\partial \alpha_i} = 0 \rightarrow e_i = w^T (\varphi(x_i) - \mu_\varphi), \quad i = 1, \ldots, N \qquad (26)$$

$$\frac{\partial \mathcal{L}}{\partial \beta_l} = 0 \rightarrow \sum_{i=1}^N e_i e_i^{(l)} = 0, \quad l = 1, \ldots, m-1.$$

Using (24), (25) and (26) leads to:

$$\frac{1}{\gamma} \left( \alpha_i + \beta_1 e_i^{(1)} + \ldots + \beta_{m-1} e_i^{(m-1)} \right) = v_i^{(k)} \sum_{j=1}^N \alpha_j (\varphi(x_j) - \mu_\varphi)^T (\varphi(x_i) - \mu_\varphi), \quad i = 1, \ldots, N.$$

Applying the property of Mercer kernels:

$$\frac{1}{\gamma} \left( \alpha_i + \beta_1 e_i^{(1)} + \ldots + \beta_{m-1} e_i^{(m-1)} \right) = v_i^{(k)} \sum_{j=1}^N \alpha_j \Omega_c(x_j, x_i), \quad i = 1, \ldots, N. \qquad (27)$$
Substituting (25) into the orthogonality constraints leads to the following set of equations:

$$\sum_{j=1}^N \frac{\alpha_j e_j^{(1)}}{v_j^{(k)}} + \beta_1 \sum_{j=1}^N \frac{(e_j^{(1)})^2}{v_j^{(k)}} + \beta_2 \sum_{j=1}^N \frac{e_j^{(1)} e_j^{(2)}}{v_j^{(k)}} + \ldots + \beta_{m-1} \sum_{j=1}^N \frac{e_j^{(1)} e_j^{(m-1)}}{v_j^{(k)}} = 0$$

$$\sum_{j=1}^N \frac{\alpha_j e_j^{(2)}}{v_j^{(k)}} + \beta_1 \sum_{j=1}^N \frac{e_j^{(1)} e_j^{(2)}}{v_j^{(k)}} + \beta_2 \sum_{j=1}^N \frac{(e_j^{(2)})^2}{v_j^{(k)}} + \ldots + \beta_{m-1} \sum_{j=1}^N \frac{e_j^{(2)} e_j^{(m-1)}}{v_j^{(k)}} = 0$$

$$\vdots \qquad (28)$$

$$\sum_{j=1}^N \frac{\alpha_j e_j^{(m-1)}}{v_j^{(k)}} + \beta_1 \sum_{j=1}^N \frac{e_j^{(1)} e_j^{(m-1)}}{v_j^{(k)}} + \beta_2 \sum_{j=1}^N \frac{e_j^{(2)} e_j^{(m-1)}}{v_j^{(k)}} + \ldots + \beta_{m-1} \sum_{j=1}^N \frac{(e_j^{(m-1)})^2}{v_j^{(k)}} = 0.$$

The sets of equations (27) and (28) give the generalized eigenvalue problem (22).
Lemma 4: The matrix $M = \begin{bmatrix} \Omega_c^V & 0_{N \times (m-1)} \\ 0_{N \times (m-1)}^T & 0_{(m-1) \times (m-1)} \end{bmatrix}$ is positive semidefinite.

Proof: The matrix $\Omega_c^V$ is the product of two positive definite matrices $V^{(k)}$ and $\Omega_c$. Thus, by Theorem
7.6.3 of [17], $\Omega_c^V$ is also positive definite.

The eigendecomposition of $M$:

$$\begin{bmatrix} \Omega_c^V & 0_{N \times (m-1)} \\ 0_{N \times (m-1)}^T & 0_{(m-1) \times (m-1)} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \lambda \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$

is equal to

$$\begin{bmatrix} \Omega_c^V x_1 \\ 0_{(m-1) \times 1} \end{bmatrix} = \lambda \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}.$$

Hence, the non-zero eigenvalues of $M$ are identical to the eigenvalues of $\Omega_c^V$.
Corollary 1: The generalized eigenvalue problem in (22) has real eigenvalues.

Proof: Let (22) be written as $M p = \lambda B p$:

$$B = \begin{bmatrix} I_N & E \\ E^T & D \end{bmatrix}, \quad p = \begin{bmatrix} \alpha \\ \beta \end{bmatrix}$$

with $p \in \mathbb{C}^{N+m-1}$. Then, premultiplying by the complex conjugate $p^*$ leads to

$$p^* M p = \lambda p^* B p$$

and the generalized eigenvalues become:

$$\lambda = \frac{p^* M p}{p^* B p}.$$

Since $M$ is positive semidefinite (Lemma 4) and $B$ is Hermitian, the generalized eigenvalues become
the ratio of two real numbers.

Remark 12: Note that $\lambda$ in (22) can be negative. However, only positive $\lambda$ values are considered.
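A sketch of how (22) can be solved numerically for the $m$-th weighted component, assuming the previous score variables and the weights are given (our own illustrative code; `scipy.linalg.eig` handles the generalized eigenvalue problem):

```python
import numpy as np
from scipy.linalg import eig

def next_weighted_component(Omega_c, v, E_prev):
    """Solve the generalized eigenvalue problem (22).
    v: weights v_i^(k) (length N, strictly positive here);
    E_prev: (N, m-1) matrix whose columns are the previous score variables e^(l)."""
    N, m1 = E_prev.shape
    E = E_prev / v[:, None]                    # E_{il} = e_i^(l) / v_i^(k)
    D = E_prev.T @ E                           # D_{jl} = sum_i e_i^(j) e_i^(l) / v_i^(k)
    M = np.zeros((N + m1, N + m1))
    M[:N, :N] = np.diag(v) @ Omega_c           # Omega_c^V = V^(k) Omega_c
    B = np.block([[np.eye(N), E], [E.T, D]])
    vals, vecs = eig(M, B)
    # keep finite, (numerically) real, positive eigenvalues only (Remark 12)
    ok = np.isfinite(vals.real) & (np.abs(vals.imag) < 1e-8) & (vals.real > 1e-10)
    idx = np.flatnonzero(ok)
    k = idx[np.argmax(vals.real[idx])]
    alpha, beta = vecs[:N, k].real, vecs[N:, k].real
    return alpha, beta, vals.real[k]
```

From (25) with $\lambda = 1/\gamma$, the new score variables are $e_i = \lambda(\alpha_i + \sum_l \beta_l e_i^{(l)})/v_i$; the bottom block of (22) then enforces the orthogonality constraints (28) whenever $\lambda > 0$.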
IV. ROBUSTNESS AND SPARSENESS
A. Robustness in kernel PCA
It is known that classical PCA is heavily influenced by outlying observations. A single outlier can
by itself determine an entire component. The subspace approximation to the training data that PCA
constructs is optimal only in a least-squares sense [1]. A small deviation from the assumptions produces
a gross error in the principal component estimation. Although several approaches to robustify PCA have
been developed [11], [18], [19], [20], [21], that is not the case for kernel PCA.

In this Section, we tackle the robustness problem in kernel PCA using the approach based on influence
functions introduced by Hampel et al. in [22]. However, an explicit form of the influence function is not
given.
Let $T_N$ be a target statistic based on the training data $\{x_i\}_{i=1}^N$, assuming that $T_N$ can be expressed as a
functional, say $T(F_N)$, where $F_N$ indicates the empirical distribution function of the data. The influence
function of $T(F_N)$ against an outlier $\xi$ is defined as:

$$IF(\xi; F_N, T) = \lim_{t \to 0} \frac{T[(1-t) F_N + t \delta_\xi] - T(F_N)}{t} \qquad (29)$$

where $\delta_\xi$ is the point-mass 1 distribution at $\xi$.

If the influence function is bounded, then the effect of any outlier $\xi$ on the functional $T(F_N)$ remains
limited.

The $i$-th principal component $\alpha^{(i)}$ can be expressed as a functional $\alpha^{(i)}(F_N)$ and according to [10],
[1], [12] its influence function becomes:

$$IF(\xi; F_N, \alpha^{(i)}) = -\beta_i(\xi) \sum_{j \neq i}^d \beta_j(\xi) (\lambda_j - \lambda_i)^{-1} \alpha^{(j)} \qquad (30)$$

where $\beta_j(\xi) = (\alpha^{(j)})^T (\xi - \mu)$, $\lambda_j$ is the $j$-th eigenvalue, $\mu$ is the sample mean of the data and $\alpha^{(j)}$
is the $j$-th principal component. Note that (30) depends on all the components and is unbounded, which
confirms the non-robust nature of PCA.
For M-estimators [8], the influence function of the functional $T(F_N)$ is defined as:

$$IF(\xi; F_N, T) = \frac{L'(\xi)}{\int L'(\xi)\, dF_N}. \qquad (31)$$

In other words, the influence function of an M-estimator is proportional to the first derivative of the loss
function (also called the score function). Several loss functions with bounded score functions have been
proposed. Well-known examples are the Huber, Hampel, Welsch and Tukey loss functions. Kernel PCA
has an underlying $L_2$ loss function with an unbounded score function (Fig. 1). Hence, kernel PCA is not
robust against outlying data points. Note also that kernel PCA is linear PCA in a kernel-induced feature
space, therefore outliers in the feature space have a large influence. Christmann et al. showed in [23]
that the kernel and the score function should be bounded in convex risk minimization methods to achieve
robustness.
B. Sparseness

As can be seen in (14), the score variables are expressed in terms of kernel expansions in which
every training point contributes. These expansions are typically dense (non-sparse). Different methods
have been proposed to overcome the lack of sparseness in kernel PCA. Smola et al. in [13] proposed a
sparse version employing an $L_1$ penalty on the expansion coefficients. Schölkopf et al. in [4] introduced
a compact representation for the expansions.

Because the loss function is explicit in our formulation, we can introduce an epsilon insensitive zone
into the robust loss function and achieve sparseness and robustness at the same time due to the conditions
for optimality in (18). This epsilon insensitive zone was introduced by Vapnik to impose sparseness in
support vector regression. We propose a robust loss function with bounded influence function and tunable
parameters $\epsilon$ and $h$ for sparseness and robustness respectively. This loss function is a combination of the
Huber loss and the epsilon insensitive loss function for support vector regression introduced by Vapnik
in [24]. The proposed loss function (Fig. 1) is defined as:

$$L(e) = \begin{cases} 0 & \text{if } |e| \leq \epsilon \\ e^2 - 2\epsilon |e| + \epsilon^2 & \text{if } \epsilon < |e| < h \\ 2(h - \epsilon)|e| - h^2 + \epsilon^2 & \text{if } |e| \geq h \end{cases} \qquad (32)$$
with $h > \epsilon$. Its score function becomes:

$$\psi(e) = L'(e) = \begin{cases} 0 & \text{if } |e| \leq \epsilon \\ 2\,\mathrm{sign}(e)(|e| - \epsilon) & \text{if } \epsilon < |e| < h \\ 2\,\mathrm{sign}(e)(h - \epsilon) & \text{if } |e| \geq h. \end{cases} \qquad (33)$$

Taking an epsilon zone in the loss function leads to sparseness because of the conditions for optimality
(18). Note that setting $\epsilon$ to zero leads to the Huber loss function, while for $\epsilon > 0$, letting $h \to \epsilon$ approaches
the Vapnik epsilon insensitive loss function.
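The loss (32) and its score function (33) can be implemented directly (a vectorized sketch; the function names are ours):

```python
import numpy as np

def eps_huber_loss(e, eps, h):
    """Epsilon-insensitive robust loss, eq. (32): zero inside |e| <= eps,
    quadratic for eps < |e| < h, linear growth for |e| >= h."""
    a = np.abs(e)
    return np.where(a <= eps, 0.0,
           np.where(a < h, a ** 2 - 2 * eps * a + eps ** 2,
                    2 * (h - eps) * a - h ** 2 + eps ** 2))

def eps_huber_score(e, eps, h):
    """Score function psi(e) = L'(e), eq. (33); bounded by 2(h - eps)."""
    a = np.abs(e)
    return np.where(a <= eps, 0.0,
           np.where(a < h, 2 * np.sign(e) * (a - eps),
                    2 * np.sign(e) * (h - eps)))
```

Both pieces join continuously at $|e| = \epsilon$ and $|e| = h$, and the score is bounded, which is exactly the robustness requirement of Section IV-A.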
V. THE PRE-IMAGE PROBLEM
The projections onto the subspace spanned by the principal components lie in some feature space $\mathcal{F}$,
but in most applications of kernel PCA we are interested in a reconstruction in the input space rather
than in $\mathcal{F}$. This mapping from the feature space back to the input space is called the pre-image problem.
In this Section we summarize three pre-image algorithms that will be used in the sequel.
• Pre-image algorithm 1:
An iterative nonlinear optimization method is proposed in [3] for Gaussian kernels, but it can suffer
from numerical instabilities and local minima. Given a test point $x$ we want to recover $\hat{x}$ for which
$\varphi(\hat{x}) \approx P_{N_c}\varphi(x)$, where $P_{N_c}\varphi(x)$ is the projection of $\varphi(x)$ onto the subspace spanned by the first
$N_c$ principal components. For Gaussian kernels, the approximate pre-image $\hat{x}$ can be computed using
the following iteration [3]:

$$\hat{x}^{(t)} = \frac{\sum_{i=1}^N \eta_i \exp(-\|\hat{x}^{(t-1)} - x_i\|^2/\sigma^2)\, x_i}{\sum_{i=1}^N \eta_i \exp(-\|\hat{x}^{(t-1)} - x_i\|^2/\sigma^2)} \qquad (34)$$

where $\eta_i = \sum_{j=1}^{N_c} \beta_j u_i^{(j)}$, $\beta_j$ is the projection of $\varphi(x)$ onto the $j$-th component $u^{(j)}$ and $t$ is the
iteration step.
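The fixed-point iteration (34) is a few lines of code (an illustrative sketch; the coefficients $\eta_i$ are assumed precomputed from the chosen components, and the starting point is typically the test point itself):

```python
import numpy as np

def preimage_fixed_point(Xtrain, eta, x0, sigma2, n_iter=200, tol=1e-9):
    """Pre-image iteration of eq. (34) for the Gaussian kernel.
    Can suffer from local minima; the denominator may also become small,
    which is one source of the numerical instabilities mentioned in the text."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_iter):
        w = eta * np.exp(-((Xtrain - x) ** 2).sum(axis=1) / sigma2)
        x_new = (w[:, None] * Xtrain).sum(axis=0) / w.sum()
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x
```

When all weights are positive, each iterate is a convex combination of training points, so the pre-image stays inside their convex hull.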
• Pre-image algorithm 2:
Another approach is proposed in [6]. It is a modified version of (34) that iteratively removes the
influence of outliers. Let $l_j^{(t)}$ be the certainty of $x_{(j)}$, where $x_{(j)}$ denotes the $j$-th coordinate of the
test point $x$. This certainty is calculated using the difference between $x_{(j)}$ and the corresponding
reconstruction $\hat{x}_{(j)}^{(t-1)}$ as:

$$l_j^{(t)} = \exp\left(-(x_{(j)} - \hat{x}_{(j)}^{(t-1)})^2/(2\sigma_j^2)\right), \quad j = 1, \ldots, d. \qquad (35)$$

The parameter $\sigma_j$ is the standard deviation of the differences and is estimated using robust estimation
[9]:

$$\sigma_j = 1.4826\,(1 + 5/(N - 1))\, \mathrm{median}_i \sqrt{\varepsilon_{ij}^2} \qquad (36)$$

where $\varepsilon_{ij}^2$ is the squared error between the $j$-th dimension of the $i$-th data point and its reconstruction.
The approximate pre-image $\hat{x}$ can be computed using the following iteration [6]:

$$\hat{x}^{(t)} = \frac{\sum_{i=1}^N \eta_i \exp(-\|g^{(t)} - x_i\|^2/\sigma^2)\, x_i}{\sum_{i=1}^N \eta_i \exp(-\|g^{(t)} - x_i\|^2/\sigma^2)} \qquad (37)$$

where $g^{(t)}$ is given by:

$$g^{(t)} = L^{(t)} x + (I_d - L^{(t)}) \hat{x}^{(t-1)}, \qquad (38)$$

for $t > 0$. The matrix $I_d$ is the $d \times d$ identity matrix, $d$ is the dimensionality of the input vectors, and
$L^{(t)}$ is a $d \times d$ matrix defined as $L^{(t)} = \mathrm{diag}(l_1^{(t)}, \ldots, l_d^{(t)})$.
• Pre-image algorithm 3:
A newer approach is proposed in [5]. This method is non-iterative, involves only linear algebra
and uses the distances between the projection $P_{N_c}\varphi(x)$ and its nearest neighbors. The method can
be summarized as follows: given the $N_n$ neighbors $x_1, \ldots, x_{N_n}$ and $q = [q_1, \ldots, q_{N_n}]^T$, where
$q_l$ is the input space distance between $P_{N_c}\varphi(x)$ and $x_l$, $l = 1, \ldots, N_n$, first construct the matrix
$X = [x_1, x_2, \ldots, x_{N_n}]$ and compute its singular value decomposition (SVD):

$$X M_c = G R \qquad (39)$$

where $M_c$ is the centering matrix, $G$ is the matrix of left singular vectors, $R = \Lambda H^T = [r_1, \ldots, r_{N_n}]$,
$\Lambda$ is the diagonal matrix of singular values, $H$ is the matrix of right singular vectors and $r_j$ is the
projection of $x_j$ onto $G$, $j = 1, \ldots, N_n$. The approximate pre-image can be obtained as:

$$\hat{r} = -\frac{1}{2} \Lambda^{-1} H^T (q - q_0) \qquad (40)$$

where $q_0 = [\|r_1\|^2, \ldots, \|r_{N_n}\|^2]^T$. Transforming back to the original coordinate system in the input
space leads to:

$$\hat{x} = G \hat{r} + \bar{x} \qquad (41)$$

where $\bar{x} = \frac{1}{N_n} \sum_{j=1}^{N_n} x_j$.
VI. ALGORITHMS
We propose two algorithms to obtain the components with the Huber epsilon-insensitive loss function.
The first algorithm solves (16) using a sequential quadratic programming (SQP) approach [25] and a smooth
approximation of $L'(e)$. The second algorithm corresponds to an implementation of the weighting scheme
(22).
A. Sequential Quadratic Programming Algorithm
The Huber epsilon-insensitive influence function $L'(e)$ is not differentiable at $|e| = \epsilon$ and $|e| = h$;
therefore, to use a gradient-based approach such as SQP, a smooth approximation $L'_s(e)$ to $L'(e)$ is taken:

$$L'_s(e) = \begin{cases} \frac{2}{s} \log\big(1 + \exp(s(e + h))\big) - 2(h - \epsilon) & \text{if } e \leq \frac{\epsilon - h}{2} - \epsilon \\[4pt] \frac{2}{s} \log\left( \dfrac{1 + \exp(s(e - \epsilon))}{1 + \exp(-s(e + \epsilon))} \right) & \text{if } \frac{\epsilon - h}{2} - \epsilon < e < \frac{h - \epsilon}{2} + \epsilon \\[4pt] -\frac{2}{s} \log\big(1 + \exp(-s(e - h))\big) + 2(h - \epsilon) & \text{if } e \geq \frac{h - \epsilon}{2} + \epsilon \end{cases} \qquad (42)$$

where $s$ is a constant that controls the smoothness of the approximation. Figure 2 shows $L'_s(e)$ and $L'(e)$
for $\epsilon = 2$, $h = 5$, $s = 20$.
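A sketch of (42) using a numerically stable softplus, checked against the exact score (33) (our own code; the reading of the prefactor as $2/s$ is consistent with $L'_s \to L'$ as $s \to \infty$):

```python
import numpy as np

def smooth_score(e, eps, h, s):
    """Smooth approximation L'_s(e) of eq. (42) to the score function (33)."""
    softplus = lambda t: np.logaddexp(0.0, t)      # stable log(1 + exp(t))
    lo = (eps - h) / 2 - eps                       # = -(h + eps)/2
    hi = (h - eps) / 2 + eps                       # = (h + eps)/2
    left = (2 / s) * softplus(s * (e + h)) - 2 * (h - eps)
    mid = (2 / s) * (softplus(s * (e - eps)) - softplus(-s * (e + eps)))
    right = -(2 / s) * softplus(-s * (e - h)) + 2 * (h - eps)
    return np.where(e <= lo, left, np.where(e < hi, mid, right))

def exact_score(e, eps, h):
    # psi(e) of eq. (33), for comparison
    a = np.abs(e)
    return np.where(a <= eps, 0.0,
           np.where(a < h, 2 * np.sign(e) * (a - eps), 2 * np.sign(e) * (h - eps)))
```

The approximation error is of order $(2/s)\log 2$ near the kinks and vanishes elsewhere as $s$ grows.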
Replacing $L'(\cdot)$ by $L'_s(\cdot)$ in (16) leads to:

$$f(\lambda, \alpha, b) = \begin{bmatrix} \lambda \alpha - \frac{1}{2} L'_s(z(\alpha, b)) \\ 1_N^T L'_s(z(\alpha, b)) \end{bmatrix} = 0 \qquad (43)$$

where $f: \mathbb{R}^+ \times \mathbb{R}^N \times \mathbb{R} \to \mathbb{R}^{N+1}$, $\lambda = 1/\gamma$, $z(\alpha, b) = [z_1(\alpha, b); z_2(\alpha, b); \ldots; z_N(\alpha, b)]$,
$\alpha = [\alpha_1; \alpha_2; \ldots; \alpha_N]$, $z_i(\alpha, b) = \Omega_{(i)} \alpha + b$, $\Omega_{(i)}$ denotes the $i$-th row of the kernel matrix $\Omega$ and

$$L'_s(z(\alpha, b)) = \left[ L'_s(z_1(\alpha, b)); \ldots; L'_s(z_N(\alpha, b)) \right].$$

A set of nonlinear equations such as (43) is usually solved via minimization of the sum of squares of the
residuals. We impose two additional constraints, $\alpha^T \alpha \geq 1$ and $\lambda \geq c$, where $c$ is a small positive
constant (e.g. $c = 10^{-6}$). The first constraint prevents the trivial solution $\alpha = 0$ and non-interesting
solutions with small norm causing the score variables to lie in the range $-h < z_i(\alpha, b) < h$,
$i = 1, \ldots, N$, therefore avoiding the bounded zone $|z_i(\alpha, b)| \geq h$ of the score function. The proposed
nonlinear optimization problem is the following:

$$\min_{\lambda, \alpha, b} g(\lambda, \alpha, b) = f(\lambda, \alpha, b)^T f(\lambda, \alpha, b) \qquad (44)$$

$$\text{such that} \quad \alpha^T \alpha \geq 1, \quad \lambda \geq c$$

where

$$g(\lambda, \alpha, b) = \lambda^2 \alpha^T \alpha - \lambda \alpha^T L'_s(z(\alpha, b)) + \frac{1}{4} \left( L'_s(z(\alpha, b)) \right)^T L'_s(z(\alpha, b)) + \left( 1_N^T L'_s(z(\alpha, b)) \right)^2. \qquad (45)$$

All $N + 1$ equations have to be satisfied at the local solution $(\lambda^\star, \alpha^\star, b^\star)$ rather than just minimizing the
sum of the squares of the equations. This means that the value of $g(\lambda^\star, \alpha^\star, b^\star)$ should be zero in order
for $(\lambda^\star, \alpha^\star, b^\star)$ to be a solution to the problem (43). In the same fashion as in the weighting scheme,
additional constraints can be added to impose orthogonality with previously found components. Note that
sparseness is achieved component-wise due to $\alpha_i = \frac{\gamma}{2} L'(e_i)$, which is one of the optimality conditions
leading to (18). Therefore, each component vector may have a different number of zero $\alpha_i$ values.

An SQP method approximates in every iteration the constrained optimization problem (44) as a
quadratic programming problem of the form:

$$\min_d \frac{1}{2} d^T H^{(k)} d + \nabla g(\theta^{(k)})^T d \qquad (46)$$

$$\text{such that} \quad \begin{cases} \nabla c_1(\theta^{(k)})^T d + c_1(\theta^{(k)}) \geq 0 \\ \nabla c_2(\theta^{(k)})^T d + c_2(\theta^{(k)}) \geq 0 \end{cases} \qquad (47)$$

where $k$ denotes the iteration step, $H$ is the Hessian matrix (or an approximation to it) of the Lagrangian
of (44), $\theta = [\lambda; \alpha; b]$ is the vector of parameters, $c_1(\theta) = \alpha^T \alpha - 1$ is the first constraint, $c_2(\theta) = \lambda - c$
is the second constraint, $g(\theta) = g(\lambda, \alpha, b)$ and the gradient $\nabla g(\theta) \in \mathbb{R}^{N+2}$ is defined as:

$$\nabla g(\theta) = \nabla g(\lambda, \alpha, b) = \begin{bmatrix} \nabla_\lambda g(\lambda, \alpha, b) \\ \nabla_\alpha g(\lambda, \alpha, b) \\ \nabla_b g(\lambda, \alpha, b) \end{bmatrix} \qquad (48)$$
where

$$\nabla_\lambda g(\lambda, \alpha, b) = 2\lambda \alpha^T \alpha - \alpha^T L'_s(z(\alpha, b))$$

$$\nabla_\alpha g(\lambda, \alpha, b) = 2\lambda^2 \alpha - \lambda \left( \Omega\, \mathrm{diag}\big(L''_s(z(\alpha, b))\big) \alpha + L'_s(z(\alpha, b)) \right) + \frac{1}{2} \Omega\, \mathrm{diag}\big(L''_s(z(\alpha, b))\big) L'_s(z(\alpha, b)) + 2 r_1 \Omega L''_s(z(\alpha, b))$$

$$\nabla_b g(\lambda, \alpha, b) = -\lambda \alpha^T L''_s(z(\alpha, b)) + \frac{1}{2} \big(L''_s(z(\alpha, b))\big)^T L'_s(z(\alpha, b)) + 2 r_1 r_2 \qquad (49)$$

with $r_1 = 1_N^T L'_s(z(\alpha, b))$ and $r_2 = 1_N^T L''_s(z(\alpha, b))$. The solution vector $d^{(k)}$ to the QP subproblem (46) can then
be used to form a new iterate:

$$\theta^{(k+1)} = \theta^{(k)} + \mu^{(k)} d^{(k)} \qquad (50)$$

where $\mu^{(k)}$ is the step length parameter of the line search and can be obtained using a merit function.
A critical issue in nonlinear optimization is the selection of good starting points. In the proposed SQP
approach, kernel PCA eigenvectors and eigenvalues are used as initial values for the nonlinear solver.
Therefore, we consider only the first local solution obtained through the proposed initialization.
Given a set of N d-dimensional data points X = {x_1, ..., x_N}, the algorithm gives a set of Nc
component vectors A = {α^(1)⋆, ..., α^(Nc)⋆} as follows:
Algorithm 1 SQP Approach
1: Set the kernel parameters and the loss function parameters ǫ, h.
2: Perform an eigendecomposition of the non-centered kernel matrix to obtain initial eigenvalues λ_1, ..., λ_N and initial eigenvectors α^(1), ..., α^(N).
3: for l = 1 to Nc do
4:   Obtain the initial value for the bias term b_l = (1/N) 1_N^T Ω α^(l)
5:   Solve the SQP problem (46), (50) with initial values λ_l, α^(l) and b_l to obtain the solutions λ_l⋆, α^(l)⋆ and b_l⋆
6: end for
7: Sort the solutions according to the value of λ_l⋆ in decreasing order.
8: Normalize the components α^(l)⋆ ← α^(l)⋆ / ||α^(l)⋆||_2, l = 1, ..., Nc
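The paper uses MATLAB's fmincon; as a rough illustration, the same construction can be sketched with SciPy's SLSQP solver. The smooth derivative L′_s below is an assumed epsilon-insensitive Huber-type form, and the residual vector f follows (44)-(45); this is a sketch under those assumptions, not the authors' implementation:

```python
import numpy as np
from scipy.optimize import minimize

def rbf_kernel(X, sigma2):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / sigma2)

def dLs(z, eps, h):
    # assumed smooth epsilon-insensitive Huber-type derivative L'_s:
    # zero inside the eps-zone, linear up to h, constant beyond
    return 2.0 * np.sign(z) * np.clip(np.abs(z) - eps, 0.0, h - eps)

def sqp_component(Omega, lam0, alpha0, b0, eps, h, c=1e-6):
    """Minimize g = f^T f subject to alpha^T alpha >= 1 and lam >= c, as in (44)."""
    N = Omega.shape[0]

    def g(theta):
        lam, alpha, b = theta[0], theta[1:N + 1], theta[N + 1]
        Lp = dLs(Omega @ alpha + b, eps, h)
        # N residuals lam*alpha_i - (1/2) L'_s(z_i) plus the bias equation 1^T L'_s(z) = 0
        f = np.concatenate([lam * alpha - 0.5 * Lp, [Lp.sum()]])
        return f @ f

    cons = [{'type': 'ineq', 'fun': lambda t: t[1:N + 1] @ t[1:N + 1] - 1.0},
            {'type': 'ineq', 'fun': lambda t: t[0] - c}]
    theta0 = np.concatenate([[lam0], alpha0, [b0]])
    res = minimize(g, theta0, method='SLSQP', constraints=cons,
                   options={'maxiter': 200})
    lam, alpha, b = res.x[0], res.x[1:N + 1], res.x[N + 1]
    return lam, alpha / np.linalg.norm(alpha), b  # step 8: normalize
```

Initial values come from an eigendecomposition of the non-centered kernel matrix, with b_l = (1/N) 1_N^T Ω α^(l), as in steps 2-4 of Algorithm 1.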
B. Weighted Algorithm
The second algorithm is an implementation of the iterative weighting scheme (21). Given X and the
number of iterations Niter, Algorithm 2 gives a set of Nc component vectors α^(1), ..., α^(Nc) ∈ R^(Nr),
where Nr ≤ N. In general, empirical results suggest that only one iteration (Niter = 1) is sufficient to
obtain good results. Sparseness is achieved in step 4 by removing data points from the training set for
which the corresponding weight is equal to zero across all components. The computational complexity
of this algorithm is O(Niter Nc (ξN + Nc)), compared with O(ξN) for classical kernel PCA using the
Lanczos method for the largest Nc components. The factor ξ is the maximum number of matrix-vector
computations required and depends on the eigengap λ_{Nc+1} − λ_{Nc} [26], [27].

Algorithm 2 Weighted Kernel PCA
1: Set the kernel parameters, the loss function parameters ǫ, h and the number of principal components Nc to be extracted.
2: Compute the unweighted score variables e_i^(l), i = 1, ..., N, l = 1, ..., Nc, associated with the first Nc principal components using classical kernel PCA.
3: Compute the weights v_i^(l) such that v_i^(l) (e_i^(l))^2 = L(e_i^(l)).
4: Remove from the training dataset the data points whose corresponding weight v_i^(l) equals zero in every component. The new dataset has Nr data points.
5: Compute the unweighted score variables e_i^(l), l = 1, ..., Nc, of the reduced dataset using classical kernel PCA.
6: for k = 1 to Niter do
7:   Recompute the weights v_i^(1) such that v_i^(1) (e_i^(1))^2 = L(e_i^(1)) with ǫ = 0.
8:   Find the first weighted principal component α^(1), which is the solution associated with the largest eigenvalue in V Ωc α = λα, with V = diag(v_1, ..., v_Nr).
9:   Recompute e_i^(1) using α^(1).
10:  for l = 2 to Nc do
11:    Recompute the weights v_i^(l) such that v_i^(l) (e_i^(l))^2 = L(e_i^(l)) with ǫ = 0.
12:    Find the l-th weighted principal component α^(l), which is the solution associated with the largest eigenvalue in (22).
13:    Recompute e_i^(l) using α^(l).
14:  end for
15: end for
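Assuming a Huber loss with ǫ = 0 for the weights (so that v_i e_i^2 = L(e_i) gives v_i = 1 for |e_i| ≤ h and v_i = (2h|e_i| − h^2)/e_i^2 beyond), the core of the weighting scheme for the first component can be sketched as follows; a dense nonsymmetric eigensolver is used for V Ωc α = λα for simplicity:

```python
import numpy as np

def huber_weights(e, h):
    # v_i e_i^2 = L(e_i) for an assumed Huber loss with eps = 0:
    # L(e) = e^2 if |e| <= h, else 2h|e| - h^2
    a = np.abs(e)
    v = np.ones_like(a)
    m = a > h
    v[m] = (2.0 * h * a[m] - h ** 2) / a[m] ** 2
    return v

def weighted_first_component(Kc, h, n_iter=1):
    """First weighted principal component: largest eigenpair of V @ Kc (step 8)."""
    _, U = np.linalg.eigh(Kc)         # classical kernel PCA initialization
    alpha = U[:, -1]
    for _ in range(n_iter):
        e = Kc @ alpha                # score variables
        V = np.diag(huber_weights(e, h))
        w, W = np.linalg.eig(V @ Kc)  # nonsymmetric eigenproblem V Omega_c alpha = lam alpha
        alpha = np.real(W[:, np.argmax(np.real(w))])
    return alpha / np.linalg.norm(alpha)
```

In practice a Lanczos-type solver for the largest Nc eigenpairs gives the O(Niter Nc (ξN + Nc)) complexity quoted above; the dense eig call here is only for clarity.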
VII. MODEL SELECTION
Model selection is a central issue in unsupervised learning techniques such as kernel PCA. In supervised
learning one can rely on many methods (such as a validation set, cross-validation, generalization bounds,
etc.) for model selection. However, such criteria often cannot be applied in unsupervised learning due to
the lack of an underlying objective function for the method. In kernel PCA, the parameters that define
the model are the number of principal components Nc to be extracted and the kernel parameters (e.g. σ2
for the RBF kernel). These parameters are typically chosen in a trial-and-error fashion. In the proposed
approaches, model selection can be done naturally in a learning framework on the basis of the explicit
constrained optimization formulation. We propose the following model selection criterion:
    max_{Nc,σ2}  Σ_{l=1}^{Nc} Σ_{j=1}^{Nval} L(e_j^(l))                      (51)

where e_j^(l) = w^T φ(x_j^val) + b_l = Σ_{i=1}^{N} α_i^(l) K(x_i, x_j^val) + b_l, x_j^val denotes the j-th validation data point, α^(l)
is the l-th component and b_l is the bias term for the l-th component, l = 1, ..., Nc.
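Criterion (51) is straightforward to evaluate once the components and bias terms are available. The sketch below assumes an epsilon-insensitive Huber loss L (quadratic between the ǫ-zone and h, linear beyond) and a precomputed cross-kernel matrix between training and validation points; the names are illustrative:

```python
import numpy as np

def eps_insensitive_huber(e, eps, h):
    # assumed loss: 0 inside the eps-zone, quadratic up to h, linear beyond
    a = np.maximum(np.abs(e) - eps, 0.0)
    return np.where(a <= h - eps, a ** 2, 2.0 * (h - eps) * a - (h - eps) ** 2)

def selection_criterion(K_val, alphas, biases, eps, h):
    """Sum of L(e_j^(l)) over components l and validation points j, as in (51).

    K_val[i, j] = K(x_i, x_j^val); alphas stores one component per column."""
    E = alphas.T @ K_val + biases[:, None]   # score e_j^(l) of validation point j
    return float(eps_insensitive_huber(E, eps, h).sum())
```

The criterion is then maximized over a grid of (Nc, σ2) values.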
The values of ǫ (sparseness) and h (robustness) are problem-dependent and can be obtained heuristically
from the empirical distribution of the score variables: ǫ can be set such that a given fraction of the score
values lie within the epsilon zone of the loss function, and, in the same way, h such that a given percentage
of the score values lie within the range −h < e < h. Figure 3 shows the proposed heuristic.
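The percentile heuristic of Figure 3 can be sketched directly from the empirical score distribution. The helper below is hypothetical; the default 10%/90% quantiles match the settings used in the experiments of Section VIII:

```python
import numpy as np

def choose_eps_h(scores, eps_fraction=0.10, h_fraction=0.90):
    # eps: the given fraction of |score| values fall inside the epsilon zone;
    # h: the given fraction of score values fall inside the range -h < e < h
    a = np.abs(np.ravel(scores))
    return float(np.quantile(a, eps_fraction)), float(np.quantile(a, h_fraction))
```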
VIII. EMPIRICAL RESULTS
In this section, some experimental results are presented to illustrate the effectiveness of the proposed
approaches. We used the MATLAB function fmincon to solve (46). This function is an SQP solver with
a quasi-Newton approximation of the Hessian of the Lagrangian using the Broyden-Fletcher-Goldfarb-Shanno
(BFGS) method. The step length is calculated with a line search and the merit function described
in [28]. We performed denoising experiments and feature extraction simulations, which correspond to
the most common applications of kernel PCA.
All experiments reported were carried out using the RBF kernel. It is important to mention that the
kernel should be bounded for these robust methods to work. The machine used is a Pentium 4, 2.8 GHz,
1 GB RAM. The values of ǫ and h were obtained using the heuristic proposed in Figure 3, setting ǫ such
that 10% of the score values lie within the epsilon insensitive range and setting h such that 90% of the
score values lie within the range −h < e < h.
A. Experiment 1 - Denoising
The first experiment consists of a denoising application. Simulations were done with a curve-shaped
dataset in two dimensions and with the UCI multiple features handwritten digit dataset. The curve dataset
consists of 360 data points for training, 420 for validation and 320 for testing. The loss function parameters
ǫ = 0.34, h = 3.7 were set according to Figure 3. Model selection surfaces for the curve dataset can be
seen in Figure 4, showing that in the weighted kernel PCA method the criterion saturates with respect
to the number of components Nc. On the other hand, in kernel PCA the variance of the projected
variables is always increasing with respect to Nc. This means that fewer components have to be computed
in the weighted kernel PCA algorithm. Figure 5 shows a slice of the surface plots. The values of the
proposed criterion and the kernel PCA criterion (variance) were rescaled for graphical purposes. The obtained
parameters are σ2 = 0.45, Nc = 40 for kernel PCA and σ2 = 2.0, Nc = 8 for the SQP algorithm.
Sparseness of the proposed method can be seen in Figure 6 and Table I. Note that sparseness varies
from component to component. However, most of the outliers are not support vectors; therefore they can
be removed from the dataset. Table II shows the percentage of outliers present in the support vector set.
The reduced set selection method [4] is more sparse than the proposed methods, but its support vector
set contains almost all outliers. The components found by the SQP algorithm are more sparse than the
components obtained by the simplified weighting scheme, but at a higher computational cost.
The handwritten digit training set is composed of 195 digit images of 15×16 pixels. The validation set
consists of 40 digits. We added outliers and Gaussian noise to the dataset and set ǫ = 0.1, h = 4 according
to Figure 3. Figure 8 shows the slices of the model selection surfaces, with similar results compared to
Figure 4. The obtained parameters are σ2 = 350, Nc = 190 for kernel PCA and σ2 = 300, Nc = 120 for the
SQP algorithm. The denoising results can be seen in Figures 9 and 10, showing improved results of the
SQP algorithm compared to kernel PCA in terms of robustness, together with visually appealing digits.
Table III shows the mean squared error (MSE) and standard deviation for kernel PCA and the proposed
methods with respect to the three pre-image algorithms. The MSE was calculated between the denoised
digits and the test set without outliers and Gaussian noise. Figure 11 shows sparseness results in terms
of the accumulated number of support vectors with respect to the number of components.
# of Components   SQP Algorithm   Reduced Set [4]
1                 190             160
2                 222             221
3                 286             257
4                 290             272

TABLE I
CURVE EXPERIMENT - ACCUMULATED NUMBER OF SUPPORT VECTORS FOR THE SQP ALGORITHM AND THE REDUCED SET
SELECTION METHOD [4] WITH RESPECT TO THE NUMBER OF COMPONENTS REQUIRED.
                     % of outliers   Computation times
Reduced Set [4]      98%             5.4 secs
Weighted Algorithm   3%              8.6 secs
SQP Algorithm        0%              24.7 secs

TABLE II
CURVE EXPERIMENT - FRACTION OF THE TOTAL NUMBER OF OUTLIERS PRESENT IN THE SUPPORT VECTOR SET. NOTE
THAT THE REDUCED SET METHOD IS MORE SPARSE THAN THE PROPOSED METHODS, BUT THE MAJORITY OF THE OUTLIERS
ARE IN THE SUPPORT VECTOR SET.
                        Kernel PCA    SQP algorithm   Weighted Algorithm
Pre-image algorithm 1   0.60 ± 0.40   0.10 ± 0.15     0.28 ± 0.30
Pre-image algorithm 2   0.70 ± 0.30   0.21 ± 0.20     0.45 ± 0.32
Pre-image algorithm 3   0.54 ± 0.40   0.30 ± 0.23     0.32 ± 0.28

TABLE III
HANDWRITTEN DIGITS - MSE AND STANDARD DEVIATION ON TEST DATA OF THE DENOISED DIGITS COMPARED TO THE
CLEAN TEST DATA.
B. Experiment 2 - Feature Extraction
The feature extraction experiment was carried out with a square-shaped dataset in two dimensions
consisting of 1000 data points. The weighted kernel PCA algorithm was used. The training scenario
consisted of 300 data points for training, 300 data points for validation and the remaining 400 data
points for testing. We included outliers and performed 10 randomizations of the training and validation
sets. The parameters of the loss function were set to ǫ = 0.5, h = 3. Model selection surfaces and
slices were obtained in a similar way as in the previous experiments. The selected parameters are
σ2 = 0.45, Nc = 6. Figures 12 and 13 show the feature extraction results for kernel PCA and weighted kernel
PCA, respectively. The contours of constant score value are shown in different levels of gray. We can see
that in kernel PCA some feature values are completely determined by the outliers, which is not the case
in weighted kernel PCA.
IX. CONCLUSION
A new kernel PCA formulation for general loss functions was derived. This formulation is based on the
LS-SVM framework and can be used to impose desirable properties such as sparseness and robustness on the
analysis. These two properties are missing in classical kernel PCA because it has an associated L2 loss
function, and the solutions are expressed in terms of dense kernel expansions. To overcome these issues,
we proposed a loss function with an epsilon insensitive zone for sparseness and a bounded score function
for robustness. We have presented two different algorithms to obtain components using this loss function.
The first algorithm solves a set of nonlinear equations using an SQP method. This nonlinear solver was
initialized using classical kernel PCA eigenvectors and eigenvalues, which were shown to be good starting
points, leading to convergence in most cases. However, we only consider the first local solution obtained
through this initialization, while other local optima may be interesting as well. The second algorithm is a
simplified iterative weighting scheme that leads to a sequence of generalized eigenvalue problems which
can be solved efficiently. We proposed a model selection criterion based on our constrained optimization
framework to obtain the kernel parameters and the number of components. A heuristic to find good choices
of the loss function parameters was also discussed. While mutual orthogonality of the score variables in
classical kernel PCA implies mutual orthogonality of the eigenvectors, that is not the case with a general
loss function. However, orthogonality constraints can be added to the primal optimization problem. In
simulations, the proposed methods showed good performance in terms of robustness together with
a sparse representation. The components obtained by the SQP algorithm were shown to be more sparse
than the components found by the weighted approach. However, the latter algorithm was shown to be
faster in terms of computation time.
ACKNOWLEDGMENT
This work was supported by grants and projects for the Research Council K.U.Leuven (GOA-Mefisto 666,
GOA-Ambiorics, several PhD/Postdoc & fellow grants), the Flemish Government (FWO: PhD/Postdoc grants,
projects G.0240.99, G.0211.05, G.0407.02, G.0197.02, G.0080.01, G.0141.03, G.0491.03, G.0120.03, G.0452.04,
G.0499.04, G.0226.06, G.0302.07, ICCoS, ANMMM; AWI; IWT: PhD grants, GBOU (McKnow), Soft4s), the Belgian
Federal Government (Belgian Federal Science Policy Office: IUAP V-22; PODO-II (CP/01/40)), the EU (FP5-Quprodis;
ERNSI; Eureka 2063-Impact; Eureka 2419-FLiTE) and Contract Research/Agreements (ISMC/IPCOS,
Data4s, TML, Elia, LMS, IPCOS, Mastercard). Johan Suykens is a professor at the K.U.Leuven, Belgium. The
scientific responsibility is assumed by its authors.
REFERENCES
[1] I. T. Jolliffe, Principal Component Analysis. Springer Verlag, 1986.
[2] B. Scholkopf, A. J. Smola, and K. R. Muller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Computation, vol. 10, pp. 1299–1319, 1998.
[3] S. Mika, B. Scholkopf, A. J. Smola, K. R. Muller, M. Scholz, and G. Ratsch, "Kernel PCA and de-noising in feature spaces," in Advances in Neural Information Processing Systems 11, M. S. Kearns, S. A. Solla, and D. A. Cohn, Eds. MIT Press, 1999.
[4] B. Scholkopf, S. Mika, C. J. C. Burges, P. Knirsch, K. R. Muller, M. Scholz, G. Ratsch, and A. J. Smola, "Input space versus feature space in kernel-based methods," IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 1000–1017, September 1999.
[5] J. T. Kwok and I. W. Tsang, "The pre-image problem in kernel methods," IEEE Transactions on Neural Networks, vol. 15, no. 6, pp. 1517–1525, November 2004.
[6] T. Takahashi and T. Kurita, "Robust de-noising by kernel PCA," in Artificial Neural Networks - ICANN 2002, Springer, pp. 739–744, 2002.
[7] J. A. K. Suykens, T. Van Gestel, J. Vandewalle, and B. De Moor, "A support vector machine formulation to PCA analysis and its kernel version," IEEE Transactions on Neural Networks, vol. 14, no. 2, pp. 447–450, March 2003.
[8] P. Huber, Robust Statistics. John Wiley & Sons, 1981.
[9] P. J. Rousseeuw and A. M. Leroy, Robust Regression and Outlier Detection. Wiley, New York, 1987.
[10] F. Critchley, "Influence in principal component analysis," Biometrika, vol. 72, pp. 627–636, 1985.
[11] L. Xu and A. L. Yuille, "Robust principal component analysis by self-organizing rules based on statistical physics approach," IEEE Transactions on Neural Networks, vol. 6, no. 1, pp. 131–143, January 1995.
[12] I. Higuchi and S. Eguchi, "The influence function of principal component analysis by self-organizing rule," Neural Computation, vol. 10, pp. 1435–1444, 1998.
[13] A. J. Smola, O. Mangasarian, and B. Scholkopf, "Sparse kernel feature analysis," University of Wisconsin, Data Mining Institute, Madison, Technical Report 99-03, 1999.
[14] J. A. K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, and J. Vandewalle, Least Squares Support Vector Machines. World Scientific, Singapore, 2002.
[15] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[16] J. A. K. Suykens, J. De Brabanter, L. Lukas, and J. Vandewalle, "Weighted least squares support vector machines: Robustness and sparse approximation," Neurocomputing, vol. 48, pp. 85–105, 2002.
[17] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge University Press, 1990.
[18] F. De la Torre and M. Black, "Robust principal component analysis for computer vision," in Proc. International Conference on Computer Vision (ICCV'2001), 2001, pp. 362–369.
[19] I. Higuchi and S. Eguchi, "Robust principal component analysis with adaptive selection for tuning parameters," Journal of Machine Learning Research, vol. 5, pp. 453–471, 2004.
[20] C. Croux and G. Haesbroeck, "Principal component analysis based on robust estimators of the covariance or correlation matrix: Influence functions and efficiencies," Biometrika, vol. 87, no. 3, 2000.
[21] M. Hubert, P. J. Rousseeuw, and K. Vanden Branden, "ROBPCA: a new approach to robust principal component analysis," Technometrics, vol. 47, pp. 64–79, 2005.
[22] F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw, and W. A. Stahel, Robust Statistics: The Approach Based on Influence Functions. John Wiley, New York, 1986.
[23] A. Christmann and I. Steinwart, "On robustness properties of convex risk minimization methods for pattern recognition," Journal of Machine Learning Research, vol. 5, pp. 1007–1034, August 2004.
[24] V. Vapnik, Statistical Learning Theory. Wiley, New York, 1998.
[25] J. Nocedal and S. J. Wright, Numerical Optimization. Springer, 2000.
[26] G. H. Golub and C. F. Van Loan, Matrix Computations. The Johns Hopkins University Press, 1996.
[27] J. H. Wilkinson, The Algebraic Eigenvalue Problem. Clarendon Press, Oxford, 1978.
[28] M. J. D. Powell, "A fast algorithm for nonlinearly constrained optimization calculations," in Proceedings of the 1977 Dundee Biennial Conference on Numerical Analysis, G. A. Watson, Ed. Springer Verlag, 1978, pp. 144–157.
CAPTIONS
Figure 1. Left: L2 loss function and its influence function. Right: Proposed loss function with ǫ = 2, h = 5 and its influence function. Note that L′(e) is bounded, which is important for robustness.
Figure 2. Smooth approximation of the Huber epsilon-insensitive influence function. The dashed line corresponds to L′(e) and the solid line to L′_s(e).
Figure 3. Empirical distribution of the score variables associated with the kernel principal components. The degree of sparseness and robustness can be determined heuristically from this plot.
Figure 4. Curve Experiment - Model selection surfaces on validation data. Left: Kernel PCA. Right: SQP algorithm. Note that the kernel PCA criterion is always increasing with respect to the number of components.
Figure 5. Curve Experiment - Slice plot of the model selection surfaces for the best RBF kernel parameter σ2. Dashed line: Average of the maximal variance on validation data (kernel PCA). Solid line: Average of the maximal L(e_i^val) on validation data (SQP algorithm). The arrows show the selected number of components. The value of the proposed criterion saturates with fewer components compared to kernel PCA. Obtained parameters are σ2 = 0.45, Nc = 40 for kernel PCA and σ2 = 2, Nc = 8 for the SQP algorithm.
Figure 6. Curve Experiment - Sparseness per component. The figure shows the support vectors (black points) for the first 4 components. Most of the outliers are not support vectors in any component; therefore they can be removed from the dataset.
Figure 7. Curve Experiment - Denoising results using three pre-image algorithms. Original data are depicted in gray and denoised data points in black. Left column: Kernel PCA with σ2 = 0.45, Nc = 40. Right column: SQP algorithm with σ2 = 2.0, Nc = 8, ǫ = 0.34, h = 3.7. Top: Pre-image algorithm 1. Center: Pre-image algorithm 2. Bottom: Pre-image algorithm 3.
Figure 8. Handwritten Digits Experiment - Slice plot of the model selection surfaces for the best RBF kernel parameter σ2. Dashed line: Average of the maximal variance on validation data (kernel PCA). Solid line: Average of the maximal L(e_i^val) on validation data (SQP algorithm). The arrows show the selected number of components. The value of the proposed criterion saturates with fewer components compared to kernel PCA. Obtained parameters are σ2 = 350, Nc = 190 for kernel PCA and σ2 = 300, Nc = 120 for the SQP algorithm.
Figure 9. Handwritten Digits Experiment - Denoising results using kernel PCA with σ2 = 350, Nc = 190. First row: Test set corrupted with Gaussian noise and outliers. Second row: Denoising with pre-image algorithm 1. Third row: Denoising with pre-image algorithm 2. Fourth row: Denoising with pre-image algorithm 3.
Figure 10. Handwritten Digits Experiment - Denoising results using the SQP algorithm with σ2 = 300, Nc = 120. First row: Test set corrupted with Gaussian noise and outliers. Second row: Denoising with pre-image algorithm 1. Third row: Denoising with pre-image algorithm 2. Fourth row: Denoising with pre-image algorithm 3. The SQP algorithm shows improved results in terms of robustness compared to Figure 9.
Figure 11. Handwritten Digits Experiment. Number of accumulated support vectors with respect to the number of components. Solid line: SQP algorithm. Dashed line: Reduced set selection method [4]. The reduced set method contains fewer support vectors but outliers are not removed.
Figure 12. Feature Extraction Experiment - Contours of constant projection value for the first 6 principal components with the RBF kernel and tuned parameters σ2 = 0.3, Nc = 40. The data points are shown as circles. Note that the outliers have influence in most of the components.
Figure 13. Feature Extraction Experiment - Contours of constant projection value for the first 6 weighted principal components with the RBF kernel and tuned parameters σ2 = 0.45, Nc = 6, ǫ = 0.5, h = 3. The data points are shown as circles. Note that in this case, the projection value is not affected by the outliers.
Figure 14. Feature Extraction Experiment - Contours of constant projection value for the two components with the normalized linear kernel. Top: Kernel PCA. Bottom: Weighted kernel PCA with ǫ = 0.5, h = 3. Note that the nonlinear structure of the data is lost. In kernel PCA, the outlier cluster entirely defines the direction of the components, which is not the case in weighted kernel PCA.
[Fig. 1: plots of L2(e), L′2(e) (left) and L(e), L′(e) (right) versus e]
[Fig. 2: plot of L′(e) and L′_s(e) versus e]
[Fig. 3: histogram of the score values ei (frequency versus ei) with −h, h and the 2ǫ zone marked]
[Fig. 4: model selection surfaces, ΣVar(e^(j)) and ΣΣL(e^(j)_l), versus σ2 and Nc]
[Fig. 5: rescaled criteria versus number of components Nc]
[Fig. 6: four scatter plots in the (x, y) plane]
[Fig. 7]
[Fig. 8: rescaled criteria versus number of components Nc]
[Fig. 9]
[Fig. 10]
[Fig. 11: number of accumulated support vectors versus number of components Nc]
[Fig. 12]
[Fig. 13]
[Fig. 14]