Kernel Component Analysis using an Epsilon Insensitive Robust Loss Function

Carlos Alzate, Johan A. K. Suykens
Katholieke Universiteit Leuven, ESAT-SCD-SISTA
Kasteelpark Arenberg 10
B-3001 Leuven (Heverlee), Belgium
carlos.alzate, [email protected]
Abstract
Kernel Principal Component Analysis is a technique to perform feature extraction in a high-dimensional
feature space which is nonlinearly related to the original input space. The kernel PCA formulation
corresponds to an eigendecomposition of the kernel matrix: eigenvectors with large eigenvalues correspond
to the principal components in the feature space. Starting from the Least Squares Support Vector Machine
(LS-SVM) formulation to kernel PCA, we extend it to a generalized form with a general underlying loss
function made explicit. For classical kernel PCA, the underlying loss function is $L_2$. In this generalized
form, one can also plug in other loss functions. In the context of robust statistics it is known that the
$L_2$ loss function is not robust because its influence function is not bounded. Therefore, outliers can
skew the solution from the desired one. Another issue with kernel PCA is the lack of sparseness: the
principal components are dense expansions in terms of kernel functions. In this paper, we introduce
robustness and sparseness into kernel component analysis by using an epsilon insensitive robust loss
function. We propose two different algorithms. The first method solves a set of nonlinear equations using
a sequential quadratic programming method with kernel PCA solutions as starting points. The second
method uses a simplified iterative weighting procedure that leads to solving a sequence of generalized
eigenvalue problems. Simulations with toy and real-life data show improvements in terms of robustness
together with a sparse representation.

Keywords: Kernel principal component analysis, Least Squares Support Vector Machines, loss function,
robustness, sparseness, epsilon insensitive loss function.
DRAFT
I. INTRODUCTION
Principal Component Analysis (PCA) is a powerful unsupervised learning technique widely used for
feature extraction, denoising and compression [1]. It is an orthogonal basis transformation where the
new basis is found by an eigendecomposition of the covariance matrix $C$ of a normalized dataset. PCA
constructs the rank-$k$ subspace approximation to the training data that is optimal in a least squares sense,
where $k < d$ and $d$ is the input dimensionality. Although PCA is a powerful technique, it cannot detect
nonlinear structures in the input data [2]. Kernel PCA, as a natural nonlinear generalization of PCA,
first maps the input data into some feature space $\mathcal{F}$ via a nonlinear feature map $\varphi$ induced by a kernel
and then performs linear PCA on the mapped data. The projections onto the subspace spanned by the
eigenvectors now lie in $\mathcal{F}$, but in many applications we need the corresponding patterns in the input
space. In order to recover the extracted features or the denoised patterns we have to map from $\mathcal{F}$ back
to the input space (also called the pre-image problem [3]–[6]). In [7] it was shown that kernel PCA can
be derived as the dual problem to a primal optimization problem formulated in a kernel induced feature
space. With this formulation, the underlying loss function associated to classical kernel PCA was shown
to be $L_2$. In the context of robust statistics it is known that the $L_2$ loss function is not robust because
its influence function is not bounded [8], [9], hence outliers can arbitrarily deviate the solution from the
desired one. Even though the approach based on influence functions was developed mainly for regression,
an explicit formulation for the PCA influence function was derived in [10], showing that it is unbounded
and confirming its sensitivity to outliers and therefore its non-robust nature. In [11] a statistical physics
approach to PCA was studied in order to obtain robust principal components. The resulting formulation
led to a bounded influence function, as was demonstrated in [12]. Another issue with kernel PCA is
the lack of sparseness because the principal components are expressed in terms of a dense expansion of
kernels associated with every training data point. Several approaches have been developed to introduce
sparseness into kernel PCA, see e.g. [4], [13].
In this paper we derive a generalized formulation where the loss function associated to kernel PCA is
explicit and can easily be changed. By incorporating an epsilon insensitive zone into a robust loss function
it is possible to impose robustness and sparseness into kernel PCA. Two different algorithms are proposed.
The first one solves a set of nonlinear equations using e.g. a sequential quadratic programming (SQP)
method with classical kernel PCA eigenvalues/vectors as starting points. The second proposed algorithm
is a simplified weighting scheme that results in solving a sequence of generalized eigenvalue problems.
With this scheme, every data point in the training set has an associated weight which determines its
DRAFT
3
importance in the component calculation. We also propose a model selection method on validation data
to find the kernel parameters and the number of components. This tuning method looks for parameters
that maximize a criterion based on the proposed loss function. A heuristic to obtain good choices of the
loss function parameters is also given.
This paper is organized as follows: Section II describes the links between Least Squares Support Vector
Machines (LS-SVMs) and kernel PCA. In Section III we propose two approaches to extend the kernel
PCA formulation to other loss functions. Section IV discusses robustness and sparseness in relation to the
chosen loss function. In Section V we review three pre-image algorithms that can be combined with the
new method. In Section VI we propose two algorithms to find components with a general loss function.
Section VII describes a model selection method. In Section VIII we report the empirical results and in
Section IX we give conclusions.
II. LS-SVMS AND KERNEL PCA
Least squares support vector machines (LS-SVM) formulations to different problems were discussed
by Suykens et al. in [14]. That work emphasizes primal-dual interpretations for a class of kernel machines in
terms of constrained optimization problems. LS-SVMs have been studied with respect to kernel Fisher
discriminant analysis, kernel ridge regression, kernel canonical correlation analysis, kernel partial least-
squares, recurrent networks, control, kernel PCA and others [7], [14]. In this Section we discuss LS-SVM
formulations to kernel PCA which are relevant for the sequel of the paper.
A. Classical PCA formulation
In the classical PCA formulation, one considers a given set of zero-mean, normalized data $\{x_i\}_{i=1}^N$
with $x_i \in \mathbb{R}^d$. The objective is to find projected variables $w^T x$ with maximal variance:

$$\max_w \; \mathrm{Var}(w^T x) \simeq \frac{1}{N} \sum_{i=1}^N (w^T x_i)^2 = w^T C w \qquad (1)$$

such that $w^T w = 1$, where $C = (1/N)\sum_{i=1}^N x_i x_i^T$ is the sample covariance matrix. The Lagrangian for
this constrained optimization problem is given by:

$$\mathcal{L}(w; \lambda) = \frac{1}{2} w^T C w - \lambda (w^T w - 1) \qquad (2)$$
where $\lambda$ is a Lagrange multiplier. From the conditions for optimality $\partial \mathcal{L}/\partial w = 0$, $\partial \mathcal{L}/\partial \lambda = 0$ one
obtains the eigenvalue problem:

$$C w = \lambda w. \qquad (3)$$

The eigenvector $w$ corresponding to the largest eigenvalue defines the direction in which the projected
variables have maximal variance. Subsequent eigenvectors and eigenvalues are usually found by deflating
the data matrix or by adding orthogonality constraints with respect to the previously found eigenvectors
[1].
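As a minimal numerical sketch of (1)–(3) (the function and variable names below are ours, not from the paper), the principal directions follow from an eigendecomposition of the sample covariance matrix:

```python
import numpy as np

def pca_components(X, k):
    """Classical PCA: top-k eigenvectors/eigenvalues of the sample covariance C."""
    Xc = X - X.mean(axis=0)               # zero-mean the data
    C = (Xc.T @ Xc) / Xc.shape[0]         # C = (1/N) sum_i x_i x_i^T
    lam, W = np.linalg.eigh(C)            # eigenvalue problem C w = lambda w, eq. (3)
    order = np.argsort(lam)[::-1][:k]     # sort by decreasing variance
    return W[:, order], lam[order]

# The projected variables w^T x have maximal variance along the first eigenvector.
rng = np.random.RandomState(0)
X = rng.randn(200, 3) @ np.diag([3.0, 1.0, 0.3])
W, lam = pca_components(X, 2)
```

Sorting by decreasing eigenvalue mirrors the deflation/ordering convention described above.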
B. LS-SVM Approach to Kernel PCA
1) Unconstrained Formulation: Consider the following unconstrained optimization problem:

$$\min_w J(w) = \frac{\gamma}{2N} \sum_{i=1}^N \left( w^T (\varphi(x_i) - \mu_\varphi) \right)^2 - \frac{1}{2} w^T w \qquad (4)$$

where $\gamma$ is a positive regularization constant, $\mu_\varphi = (1/N)\sum_{j=1}^N \varphi(x_j)$ and $\varphi(\cdot): \mathbb{R}^d \to \mathbb{R}^{n_h}$ is the
mapping to a high dimensional feature space $\mathcal{F}$ of dimension $n_h$ (which can be infinite dimensional).

Proposition 1: Each eigenvector $w^\star$ with corresponding eigenvalue $\rho$ of the sample covariance matrix
in the feature space $C_\varphi = (1/N)\sum_{j=1}^N (\varphi(x_j) - \mu_\varphi)(\varphi(x_j) - \mu_\varphi)^T$ is a stationary point of $J(w)$ given
that $\gamma = 1/\rho$. The objective function $J(w)$ evaluated at any eigenvector $w^\star$ equals zero.

Proof: The optimization problem in (4) can be written as:

$$\min_w J(w) = \frac{\gamma}{2} w^T C_\varphi w - \frac{1}{2} w^T w = \frac{1}{2} w^T (\gamma C_\varphi - I_{n_h}) w.$$

Using the first-order optimality condition $\partial J(w)/\partial w = 0$ leads to

$$C_\varphi w^\star = \rho w^\star, \qquad (5)$$

which is an eigenvalue problem of $C_\varphi$ with $\rho = 1/\gamma$ and corresponding eigenvector $w^\star$ such that
$(w^\star)^T w^\star = 1$. The objective function becomes

$$J(w^\star) = \frac{\gamma}{2} (w^\star)^T \rho w^\star - \frac{1}{2} (w^\star)^T w^\star = \left( \frac{\gamma \rho}{2} - \frac{1}{2} \right) (w^\star)^T w^\star = 0.$$
Remark 1: Note that solving (5) becomes intractable for high dimensional feature spaces, as the size
of the sample covariance matrix $C_\varphi$ may become too large.
2) Constrained Formulation: Now consider the following constrained optimization problem obtained
by introducing additional variables $e_i$ into (4):

$$\min_{w,e} J_p(w, e) = \frac{\gamma}{2N} \sum_{i=1}^N e_i^2 - \frac{1}{2} w^T w \qquad (6)$$

$$\text{such that} \quad e_i = w^T (\varphi(x_i) - \mu_\varphi), \quad i = 1, \ldots, N.$$

Lemma 1: Given a positive definite kernel function $K: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ with $K(x, z) = \varphi(x)^T \varphi(z)$ and
a regularization constant $\gamma \in \mathbb{R}^+$, the dual problem to (6) is given by the following eigenvalue problem:

$$\Omega_c \alpha = \lambda \alpha \qquad (7)$$

where $\Omega_c$ denotes the centered kernel matrix with $ij$-th entry $\Omega_{c,ij} = K(x_i, x_j) - \frac{1}{N}\sum_{r=1}^N K(x_i, x_r) - \frac{1}{N}\sum_{r=1}^N K(x_j, x_r) + \frac{1}{N^2}\sum_{r=1}^N \sum_{s=1}^N K(x_r, x_s)$ and $\lambda = N/\gamma$. Assume that $\Omega_c - (N/\gamma) I_N \preceq 0$ and
$\gamma C_\varphi - I \succeq 0$; then the strong duality property holds.
Proof: Consider the Lagrangian of the problem (6):

$$\mathcal{L}(w, e_i; \alpha) = \frac{\gamma}{2N} \sum_{i=1}^N e_i^2 - \frac{1}{2} w^T w - \sum_{i=1}^N \alpha_i (e_i - w^T (\varphi(x_i) - \mu_\varphi)).$$

In general the min-max inequality holds [15]:

$$\max_\alpha \min_{w, e_i} \mathcal{L}(w, e_i; \alpha) \leq \min_{w, e_i} \max_\alpha \mathcal{L}(w, e_i; \alpha) \qquad (8)$$

which means that the optimal value of the dual problem is a lower bound on the optimal value of the
primal problem. When (8) is fulfilled with strict equality, the duality gap is zero and strong duality
holds. Maximizing $\mathcal{L}(w, e_i; \alpha)$ with respect to $\alpha$ leads to the unconstrained optimization problem (4).
Thus, the right hand side of (8) is zero at the stationary points (see Proposition 1). Minimizing the
Lagrangian with respect to $e_i$ leads to the optimality condition

$$\frac{\partial \mathcal{L}}{\partial e_i} = 0 \rightarrow e_i = \frac{N}{\gamma} \alpha_i, \quad i = 1, \ldots, N. \qquad (9)$$

The Lagrangian is a concave quadratic function of $w$ and is therefore unbounded below; however, its
stationary point is given by the optimality condition

$$\frac{\partial \mathcal{L}}{\partial w} = 0 \rightarrow w = \sum_{i=1}^N \alpha_i (\varphi(x_i) - \mu_\varphi). \qquad (10)$$
Eliminating $w$ and $e_i$ leads to:

$$\max_\alpha \mathcal{L}(\alpha) = \frac{1}{2} \sum_{i,j=1}^N \alpha_i \alpha_j (\varphi(x_j) - \mu_\varphi)^T (\varphi(x_i) - \mu_\varphi) - \frac{N}{2\gamma} \sum_{i=1}^N \alpha_i^2.$$

By making use of the positive definite kernel function

$$(\varphi(x_i) - \mu_\varphi)^T (\varphi(x_j) - \mu_\varphi) = K(x_i, x_j) - \frac{1}{N} \sum_{r=1}^N K(x_i, x_r) - \frac{1}{N} \sum_{r=1}^N K(x_j, x_r) + \frac{1}{N^2} \sum_{r=1}^N \sum_{s=1}^N K(x_r, x_s) = \Omega_c(x_i, x_j) \qquad (11)$$

one obtains the dual problem:

$$\max_\alpha \mathcal{L}(\alpha) = \frac{1}{2} \alpha^T \left( \Omega_c - \frac{N}{\gamma} I_N \right) \alpha \qquad (12)$$

where $\alpha = [\alpha_1, \ldots, \alpha_N]$, $\Omega_c = M_c \Omega M_c$, $\Omega$ is the kernel matrix with $ij$-th entry $\Omega_{ij} = K(x_i, x_j)$ and
$M_c = I_N - (1/N) 1_N 1_N^T$ is the centering matrix. Using $\partial \mathcal{L}(\alpha)/\partial \alpha = 0$ and defining $\lambda = N/\gamma$ leads
to the eigenvalue problem (7). Given that $\alpha^\star$ is an eigenvector of $\Omega_c$ with corresponding eigenvalue
$\lambda = N/\gamma$, the dual objective function value becomes $\mathcal{L}(\alpha^\star) = \frac{1}{2}(N/\gamma - N/\gamma) = 0$, therefore strong
duality holds.
Remark 2: Note that every eigenvalue/eigenvector pair of $\Omega_c$ is a stationary point of (6) with objective
function value equal to zero. Therefore, the principal components can be seen as a pool of possible
solutions. The problem of component selection can be solved by sorting the components in decreasing
order with respect to $\lambda$, which corresponds to the variance in the $L_2$ case. This differs from classical PCA,
where the objective function value at the optima is the eigenvalue and reflects the amount of variance
captured by the projection onto the corresponding eigenvector.

Remark 3: Note that changing the minimization operator to maximization in (6) leads to the same set
of stationary points (7), and Remark 2 also holds.

Remark 4: Note that (7) corresponds to kernel PCA as proposed by Schölkopf et al. in [2], where
[7] showed that a similar form of (6) is the underlying optimization problem to which kernel PCA is
the dual. It is also remarkable that (7) is regularized by restricting the eigenvectors to the unit ball,
while regularization is explicitly present in the primal problem (6) through the regularization term.
Remark 5: Considering a bias term in (6) leads to centering the kernel matrix $\Omega$ in the feature space
such that the mapped data have zero mean, see [7]. The bias term becomes

$$b = -\frac{1}{N} 1_N^T \Omega \alpha. \qquad (13)$$
Remark 6: The projections of a datapoint $x$ onto an eigenvector (also called the score variables)
become:

$$z(x) = w^T (\varphi(x) - \mu_\varphi) = \sum_{l=1}^N \alpha_l \left( K(x_l, x) - \frac{1}{N} \sum_{r=1}^N K(x_r, x) - \frac{1}{N} \sum_{r=1}^N K(x_r, x_l) + \frac{1}{N^2} \sum_{r=1}^N \sum_{s=1}^N K(x_r, x_s) \right). \qquad (14)$$
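The eigenvalue problem (7) and the score variables (14) can be sketched numerically as follows (an illustrative implementation; the RBF kernel choice and all names are ours):

```python
import numpy as np

def rbf_kernel(X, Z, sigma2):
    # K(x, z) = exp(-||x - z||^2 / sigma2), a positive definite kernel
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma2)

def kernel_pca(X, sigma2, n_comp):
    """Solve Omega_c alpha = lambda alpha, eq. (7), with Omega_c = Mc Omega Mc."""
    N = X.shape[0]
    Omega = rbf_kernel(X, X, sigma2)
    Mc = np.eye(N) - np.ones((N, N)) / N      # centering matrix
    Omega_c = Mc @ Omega @ Mc                 # centered kernel matrix, eq. (11)
    lam, A = np.linalg.eigh(Omega_c)
    order = np.argsort(lam)[::-1][:n_comp]
    return A[:, order], lam[order]

def score_variables(Xtrain, x, alpha, sigma2):
    """Score variables z(x) of eq. (14) for a single point x."""
    k = rbf_kernel(Xtrain, x[None, :], sigma2).ravel()   # K(x_l, x)
    K = rbf_kernel(Xtrain, Xtrain, sigma2)
    kc = k - k.mean() - K.mean(axis=1) + K.mean()        # centered kernel evaluations
    return alpha.T @ kc
```

For a training point $x_i$, (14) reduces to $z(x_i) = (\Omega_c \alpha)_i = \lambda \alpha_i$, which gives a quick consistency check.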
III. EXTENSION TO OTHER LOSS FUNCTIONS
A. Set of Nonlinear Equations
In (6) kernel PCA uses the $L_2$ loss function. By extending the formulation now to a general loss
function $L: \mathbb{R} \to \mathbb{R}$ and introducing a bias term $b$, the following problem can be formulated:

$$\min_{w, e_i, b} J_p(w, e_i, b) = \frac{\gamma}{2} \sum_{i=1}^N L(e_i) - \frac{1}{2} w^T w \qquad (15)$$

$$\text{such that} \quad e_i = w^T \varphi(x_i) + b, \quad i = 1, \ldots, N.$$

Lemma 2: Given a positive definite kernel function $K: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ with $K(x, z) = \varphi(x)^T \varphi(z)$,
a differentiable loss function $L: \mathbb{R} \to \mathbb{R}$ and a positive regularization constant $\gamma$, the local
optimal solutions to the problem (15) can be obtained as the solution to the following set of nonlinear
equations with unknowns $\alpha_i$, $i = 1, \ldots, N$ (dual problem) and $b$:

$$\begin{cases} \alpha_i - \dfrac{\gamma}{2} L'\left( \sum_{l=1}^N \alpha_l K(x_l, x_i) + b \right) = 0, & i = 1, \ldots, N \\[6pt] \sum_{i=1}^N L'\left( \sum_{l=1}^N \alpha_l K(x_l, x_i) + b \right) = 0. \end{cases} \qquad (16)$$
Proof: Consider the Lagrangian of the problem (15):

$$\mathcal{L}(w, e_i, b; \alpha) = \frac{\gamma}{2} \sum_{i=1}^N L(e_i) - \frac{1}{2} w^T w - \sum_{i=1}^N \alpha_i (e_i - w^T \varphi(x_i) - b) \qquad (17)$$

with Karush-Kuhn-Tucker (KKT) optimality conditions:

$$\begin{cases} \partial \mathcal{L}/\partial w = 0 \rightarrow w = \sum_{i=1}^N \alpha_i \varphi(x_i) \\ \partial \mathcal{L}/\partial e_i = 0 \rightarrow \alpha_i = \frac{\gamma}{2} L'(e_i), \quad i = 1, \ldots, N \\ \partial \mathcal{L}/\partial b = 0 \rightarrow \sum_{i=1}^N \alpha_i = 0 \\ \partial \mathcal{L}/\partial \alpha_i = 0 \rightarrow e_i = w^T \varphi(x_i) + b, \quad i = 1, \ldots, N. \end{cases}$$

Eliminating $w$, $e_i$ leads to:

$$\frac{\gamma}{2} L'\left( \sum_{l=1}^N \alpha_l \varphi(x_l)^T \varphi(x_i) + b \right) = \alpha_i, \quad i = 1, \ldots, N,$$

$$\sum_{i=1}^N L'\left( \sum_{l=1}^N \alpha_l \varphi(x_l)^T \varphi(x_i) + b \right) = 0$$

and applying the definition of a positive definite kernel function gives the set of nonlinear equations (16).
Remark 7: Note that in general (15) is a non-convex problem, therefore the KKT optimality conditions
are necessary but not sufficient.

Remark 8: Note that (16) leads to a different centering in the feature space, because the bias term is
also affected by the loss function. This is important for robustness because the bias term is related to the
estimation of the mean in the feature space, as shown in the $L_2$ case.

Remark 9: Note that the concept of eigenvalue and eigenvector does not exist in (16).

Remark 10: The score variables become

$$z(x) = w^T \varphi(x) + b = \sum_{l=1}^N \alpha_l K(x_l, x) + b. \qquad (18)$$
Remark 11: Consider the minimization of the $L_2$ loss of the residuals:

$$U^\star = \arg\min_U \sum_{i=1}^N r_i^2 \qquad (19)$$

where $r_i = \|\varphi(x_i) - P_U \varphi(x_i)\|_2$, $U$ denotes an orthogonal subspace and $P_U \varphi(x_i)$ denotes the projection
of the $i$-th mapped datapoint onto $U$. Since $P_U \varphi(x_i)$ is an orthogonal projection, one can write

$$\|\varphi(x_i) - P_U \varphi(x_i)\|_2^2 = \|\varphi(x_i)\|_2^2 - \|P_U \varphi(x_i)\|_2^2$$

which leads to

$$U^\star = \arg\max_U \sum_{i=1}^N \|P_U \varphi(x_i)\|_2^2.$$

This shows that minimizing the sum of the $L_2$ loss of the residuals is equivalent to maximizing the
sum of the $L_2$ loss of the projections. However, in the case of a general loss function $L$, this equivalence
no longer holds. Hence, a general loss function can be applied to the projections $e$, but the loss function
applied to the primal vector $w$ should be $L_2$ in order to exploit the equivalence between a dot product
in the feature space and a kernel function evaluation.
B. A Simplified Weighting Scheme

Another approach to extend kernel PCA to general loss functions is to consider a loss function of the
form $L(e_i) = v_i^{(k)} e_i^2$, where $v_i^{(k)}$, $i = 1, \ldots, N$ are user-defined weights and $k$ is the iteration step. A
two-step weighting scheme was proposed in [16] to impose robustness in LS-SVM regression. In the first
step, an unweighted LS-SVM is calculated. The second stage consists of associating weighting values
to the error variables obtained in the first step. This scheme can also be employed in kernel PCA as
a simplified technique to obtain robust and sparse solutions. Introducing weighting factors $v_i^{(k)}$ in the
formulation leads to the following primal optimization problem:

$$\min_{w, e_i} J_p(w, e_i) = \frac{\gamma}{2} \sum_{i=1}^N v_i^{(k)} e_i^2 - \frac{1}{2} w^T w \qquad (20)$$

$$\text{such that} \quad e_i = w^T (\varphi(x_i) - \mu_\varphi), \quad i = 1, \ldots, N.$$

A simple method is a two-step approach; in general more steps are possible, although a precise
convergence analysis would then be needed. Here, however, we are interested in a simple way to cheaply
incorporate robustness and sparseness into kernel PCA. In the first step, an unweighted kernel PCA solution
is obtained ($v_i^{(1)} = 1$, $i = 1, \ldots, N$). Then the weights are calculated in the second step in such a way
that $v_i^{(2)} e_i^2 = L(e_i)$. The advantage of this approach is that one extends kernel PCA to general loss
functions by solving a sequence of generalized eigenvalue problems, which can be solved efficiently,
while in (16) one needs to solve a set of nonlinear equations which may lead to local optima and slow
convergence.
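A minimal sketch of this two-step scheme for the first component (our own illustrative code; it assumes a precomputed centered kernel matrix and uses the epsilon-insensitive loss of Section IV to set the weights):

```python
import numpy as np

def eps_insensitive_huber(e, eps, h):
    # loss of eq. (32): zero zone, quadratic zone, linear zone
    a = np.abs(e)
    return np.where(a <= eps, 0.0,
           np.where(a < h, (a - eps) ** 2,
                    2 * (h - eps) * a - h ** 2 + eps ** 2))

def weighted_kpca_first(Omega_c, eps, h):
    # step 1: unweighted kernel PCA, v_i^(1) = 1
    lam, A = np.linalg.eigh(Omega_c)
    alpha = A[:, -1]                         # leading component
    e = lam[-1] * alpha                      # error variables e_i = lambda * alpha_i
    # step 2: weights chosen so that v_i^(2) e_i^2 = L(e_i); v_i = 0 gives sparseness
    v = eps_insensitive_huber(e, eps, h) / np.maximum(e ** 2, 1e-12)
    vals, vecs = np.linalg.eig(np.diag(v) @ Omega_c)   # V Omega_c alpha = lambda alpha
    k = np.argmax(vals.real)
    return vecs[:, k].real, vals[k].real, v
```

Note that $V^{(k)}\Omega_c$ is not symmetric in general, hence the nonsymmetric eigensolver; data points whose errors fall inside the epsilon zone receive weight zero and drop out of the expansion.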
The first weighted principal component is the solution to (20). The second weighted component is
the solution to (20) with an additional constraint. This constraint imposes orthogonality with respect
to the score variables associated to the previously found weighted component. For the third weighted
component, two constraints are added to ensure orthogonality with respect to the score variables for
the second and first weighted principal components. This iterative process is repeated until all required
weighted components are obtained. Hence, the $m$-th weighted principal component is the solution to the
following primal problem with $m - 1$ orthogonality constraints:

$$\min_{w, e_i} J_p(w, e_i) = \frac{\gamma}{2} \sum_{i=1}^N v_i^{(k)} e_i^2 - \frac{1}{2} w^T w$$

$$\text{such that} \quad \begin{cases} e_i = w^T (\varphi(x_i) - \mu_\varphi), & i = 1, \ldots, N \\ \sum_{i=1}^N e_i e_i^{(1)} = 0 \\ \sum_{i=1}^N e_i e_i^{(2)} = 0 \\ \quad \vdots \\ \sum_{i=1}^N e_i e_i^{(m-1)} = 0 \end{cases} \qquad (21)$$

where $k = 0, \ldots, N_{\mathrm{iter}}$ is the iteration step and $e_i^{(l)}$, $l = 1, \ldots, m-1$ are the score variables corresponding
to the $l$-th previously found weighted principal component, which are assumed to be given.
Lemma 3: Given a positive definite kernel function $K: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ with $K(x, z) = \varphi(x)^T \varphi(z)$,
score variables $e_i^{(1)}, e_i^{(2)}, \ldots, e_i^{(m-1)} \in \mathbb{R}$, $i = 1, \ldots, N$ and positive weighting factors $v_i^{(k)} \in \mathbb{R}^+$,
$i = 1, \ldots, N$, the solution to the problem (21) is given by the following generalized eigenvalue problem:

$$\begin{bmatrix} \Omega_c^V & 0_{N \times (m-1)} \\ 0_{N \times (m-1)}^T & 0_{(m-1) \times (m-1)} \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \lambda \begin{bmatrix} I_N & E \\ E^T & D \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} \qquad (22)$$

where $V^{(k)} = \mathrm{diag}(v_1^{(k)}, \ldots, v_N^{(k)})$, $\Omega_c^V = V^{(k)} \Omega_c$, $0_{p \times q}$ is the $p \times q$ zero matrix, $\alpha = [\alpha_1; \ldots; \alpha_N]$,
$\beta = [\beta_1; \beta_2; \ldots; \beta_{m-1}]$, $\lambda = 1/\gamma$, $I_N$ is the $N \times N$ identity matrix and $E \in \mathbb{R}^{N \times (m-1)}$,
$D \in \mathbb{R}^{(m-1) \times (m-1)}$ are defined entrywise as:

$$E_{il} = \frac{e_i^{(l)}}{v_i^{(k)}}, \quad i = 1, \ldots, N, \; l = 1, \ldots, m-1, \qquad D_{jl} = \sum_{i=1}^N \frac{e_i^{(j)} e_i^{(l)}}{v_i^{(k)}}, \quad j, l = 1, \ldots, m-1.$$
Proof: Consider the Lagrangian of the problem (21):

$$\mathcal{L}(w, e_i; \alpha_i, \beta_j) = \frac{\gamma}{2} \sum_{i=1}^N v_i^{(k)} e_i^2 - \frac{1}{2} w^T w - \sum_{i=1}^N \alpha_i (e_i - w^T (\varphi(x_i) - \mu_\varphi)) - \beta_1 \sum_{i=1}^N e_i e_i^{(1)} - \beta_2 \sum_{i=1}^N e_i e_i^{(2)} - \ldots - \beta_{m-1} \sum_{i=1}^N e_i e_i^{(m-1)} \qquad (23)$$

with conditions for optimality:

$$\frac{\partial \mathcal{L}}{\partial w} = 0 \rightarrow w = \sum_{i=1}^N \alpha_i (\varphi(x_i) - \mu_\varphi) \qquad (24)$$

$$\frac{\partial \mathcal{L}}{\partial e_i} = 0 \rightarrow e_i = \frac{1}{\gamma v_i^{(k)}} \left( \alpha_i + \beta_1 e_i^{(1)} + \beta_2 e_i^{(2)} + \ldots + \beta_{m-1} e_i^{(m-1)} \right), \quad i = 1, \ldots, N \qquad (25)$$

$$\frac{\partial \mathcal{L}}{\partial \alpha_i} = 0 \rightarrow e_i = w^T (\varphi(x_i) - \mu_\varphi), \quad i = 1, \ldots, N \qquad (26)$$

$$\frac{\partial \mathcal{L}}{\partial \beta_l} = 0 \rightarrow \sum_{i=1}^N e_i e_i^{(l)} = 0, \quad l = 1, \ldots, m-1.$$

Using (24), (25) and (26) leads to:

$$\frac{1}{\gamma} \left( \alpha_i + \beta_1 e_i^{(1)} + \ldots + \beta_{m-1} e_i^{(m-1)} \right) = v_i^{(k)} \sum_{j=1}^N \alpha_j (\varphi(x_j) - \mu_\varphi)^T (\varphi(x_i) - \mu_\varphi), \quad i = 1, \ldots, N.$$

Applying the property of Mercer kernels:

$$\frac{1}{\gamma} \left( \alpha_i + \beta_1 e_i^{(1)} + \ldots + \beta_{m-1} e_i^{(m-1)} \right) = v_i^{(k)} \sum_{j=1}^N \alpha_j \Omega_c(x_j, x_i), \quad i = 1, \ldots, N. \qquad (27)$$
Substituting (25) into the orthogonality constraints leads to the following set of equations:

$$\sum_{j=1}^N \frac{\alpha_j e_j^{(1)}}{v_j^{(k)}} + \beta_1 \sum_{j=1}^N \frac{(e_j^{(1)})^2}{v_j^{(k)}} + \beta_2 \sum_{j=1}^N \frac{e_j^{(1)} e_j^{(2)}}{v_j^{(k)}} + \ldots + \beta_{m-1} \sum_{j=1}^N \frac{e_j^{(1)} e_j^{(m-1)}}{v_j^{(k)}} = 0$$

$$\sum_{j=1}^N \frac{\alpha_j e_j^{(2)}}{v_j^{(k)}} + \beta_1 \sum_{j=1}^N \frac{e_j^{(1)} e_j^{(2)}}{v_j^{(k)}} + \beta_2 \sum_{j=1}^N \frac{(e_j^{(2)})^2}{v_j^{(k)}} + \ldots + \beta_{m-1} \sum_{j=1}^N \frac{e_j^{(2)} e_j^{(m-1)}}{v_j^{(k)}} = 0$$

$$\vdots \qquad (28)$$

$$\sum_{j=1}^N \frac{\alpha_j e_j^{(m-1)}}{v_j^{(k)}} + \beta_1 \sum_{j=1}^N \frac{e_j^{(1)} e_j^{(m-1)}}{v_j^{(k)}} + \beta_2 \sum_{j=1}^N \frac{e_j^{(2)} e_j^{(m-1)}}{v_j^{(k)}} + \ldots + \beta_{m-1} \sum_{j=1}^N \frac{(e_j^{(m-1)})^2}{v_j^{(k)}} = 0.$$

The sets of equations (27) and (28) give the generalized eigenvalue problem (22).
Lemma 4: The matrix $M = \begin{bmatrix} \Omega_c^V & 0_{N \times (m-1)} \\ 0_{N \times (m-1)}^T & 0_{(m-1) \times (m-1)} \end{bmatrix}$ is positive semidefinite.

Proof: The matrix $\Omega_c^V$ is the product of two positive definite matrices $V^{(k)}$ and $\Omega_c$. Thus, by Theorem
7.6.3 of [17], $\Omega_c^V$ is also positive definite.

The eigendecomposition of $M$:

$$\begin{bmatrix} \Omega_c^V & 0_{N \times (m-1)} \\ 0_{N \times (m-1)}^T & 0_{(m-1) \times (m-1)} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \lambda \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$

is equal to

$$\begin{bmatrix} \Omega_c^V x_1 \\ 0_{(m-1) \times 1} \end{bmatrix} = \lambda \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}.$$

Hence, the non-zero eigenvalues of $M$ are identical to the eigenvalues of $\Omega_c^V$.
Corollary 1: The generalized eigenvalue problem in (22) has real eigenvalues.

Proof: Let (22) be written as $M p = \lambda B p$:

$$B = \begin{bmatrix} I_N & E \\ E^T & D \end{bmatrix}, \quad p = \begin{bmatrix} \alpha \\ \beta \end{bmatrix}$$

with $p \in \mathbb{C}^{N+m-1}$. Then, premultiplying by the complex conjugate $p^*$ leads to

$$p^* M p = \lambda p^* B p$$

and the generalized eigenvalues become:

$$\lambda = \frac{p^* M p}{p^* B p}.$$

Since $M$ is positive semidefinite (Lemma 4) and $B$ is Hermitian, the generalized eigenvalues become
the ratio of two real numbers.

Remark 12: Note that $\lambda$ in (22) can be negative. However, only positive $\lambda$ values are considered.
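A sketch of how (22) can be solved numerically for the $m$-th weighted component, assuming the previous score variables and the weights are given (our own illustrative code; `scipy.linalg.eig` handles the generalized eigenvalue problem):

```python
import numpy as np
from scipy.linalg import eig

def next_weighted_component(Omega_c, v, E_prev):
    """Solve the generalized eigenvalue problem (22).
    v: weights v_i^(k) (length N, strictly positive here);
    E_prev: (N, m-1) matrix whose columns are the previous score variables e^(l)."""
    N, m1 = E_prev.shape
    E = E_prev / v[:, None]                    # E_{il} = e_i^(l) / v_i^(k)
    D = E_prev.T @ E                           # D_{jl} = sum_i e_i^(j) e_i^(l) / v_i^(k)
    M = np.zeros((N + m1, N + m1))
    M[:N, :N] = np.diag(v) @ Omega_c           # Omega_c^V = V^(k) Omega_c
    B = np.block([[np.eye(N), E], [E.T, D]])
    vals, vecs = eig(M, B)
    # keep finite, (numerically) real, positive eigenvalues only (Remark 12)
    ok = np.isfinite(vals.real) & (np.abs(vals.imag) < 1e-8) & (vals.real > 1e-10)
    idx = np.flatnonzero(ok)
    k = idx[np.argmax(vals.real[idx])]
    alpha, beta = vecs[:N, k].real, vecs[N:, k].real
    return alpha, beta, vals.real[k]
```

From (25) with $\lambda = 1/\gamma$, the new score variables are $e_i = \lambda(\alpha_i + \sum_l \beta_l e_i^{(l)})/v_i$; the bottom block of (22) then enforces the orthogonality constraints (28) whenever $\lambda > 0$.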
IV. ROBUSTNESS AND SPARSENESS
A. Robustness in kernel PCA
It is known that classical PCA is heavily influenced by outlying observations. A single outlier can
by itself determine an entire component. The subspace approximation to the training data that PCA
constructs is optimal only in a least-squares sense [1]. A small deviation from the assumptions produces
a gross error in the principal component estimation. Although several approaches to robustify PCA have
been developed [11], [18], [19], [20], [21], that is not the case for kernel PCA.

In this Section, we tackle the robustness problem in kernel PCA using the approach based on influence
functions introduced by Hampel et al. in [22]. However, an explicit form of the influence function is not
given.
Let $T_N$ be a target statistic based on the training data $\{x_i\}_{i=1}^N$, assuming that $T_N$ can be expressed as a
functional, say $T(F_N)$, where $F_N$ indicates the empirical distribution function of the data. The influence
function of $T(F_N)$ against an outlier $\xi$ is defined as:

$$IF(\xi; F_N, T) = \lim_{t \to 0} \frac{T[(1-t) F_N + t \delta_\xi] - T(F_N)}{t} \qquad (29)$$

where $\delta_\xi$ is the point-mass 1 distribution at $\xi$.

If the influence function is bounded, then the effect of any outlier $\xi$ on the functional $T(F_N)$ remains
limited.

The $i$-th principal component $\alpha^{(i)}$ can be expressed as a functional $\alpha^{(i)}(F_N)$ and according to [10],
[1], [12] its influence function becomes:

$$IF(\xi; F_N, \alpha^{(i)}) = -\beta_i(\xi) \sum_{j \neq i}^d \beta_j(\xi) (\lambda_j - \lambda_i)^{-1} \alpha^{(j)} \qquad (30)$$

where $\beta_j(\xi) = (\alpha^{(j)})^T (\xi - \mu)$, $\lambda_j$ is the $j$-th eigenvalue, $\mu$ is the sample mean of the data and $\alpha^{(j)}$
is the $j$-th principal component. Note that (30) depends on all the components and is unbounded, which
confirms the non-robust nature of PCA.
For M-estimators [8], the influence function of the functional $T(F_N)$ is defined as:

$$IF(\xi; F_N, T) = \frac{L'(\xi)}{\int L'(\xi)\, dF_N}. \qquad (31)$$

In other words, the influence function of an M-estimator is proportional to the first derivative of the loss
function (also called the score function). Several loss functions with bounded score functions have been
proposed. Well-known examples are the Huber, Hampel, Welsch and Tukey loss functions. Kernel PCA
has an underlying $L_2$ loss function with an unbounded score function (Fig. 1). Hence, kernel PCA is not
robust against outlying data points. Note also that kernel PCA is linear PCA in a kernel-induced feature
space, therefore outliers in the feature space have a large influence. Christmann et al. showed in [23]
that the kernel and the score function should be bounded in convex risk minimization methods to achieve
robustness.
B. Sparseness

As can be seen in (14), the score variables are expressed in terms of kernel expansions in which
every training point contributes. These expansions are typically dense (non-sparse). Different methods
have been proposed to overcome the lack of sparseness in kernel PCA. Smola et al. in [13] proposed a
sparse version employing an $L_1$ penalty on the expansion coefficients. Schölkopf et al. in [4] introduced
a compact representation for the expansions.

Because the loss function is explicit in our formulation, we can introduce an epsilon insensitive zone
into the robust loss function and achieve sparseness and robustness at the same time due to the conditions
for optimality in (18). This epsilon insensitive zone was introduced by Vapnik to impose sparseness in
support vector regression. We propose a robust loss function with bounded influence function and tunable
parameters $\epsilon$ and $h$ for sparseness and robustness respectively. This loss function is a combination of the
Huber loss and the epsilon insensitive loss function for support vector regression introduced by Vapnik
in [24]. The proposed loss function (Fig. 1) is defined as:

$$L(e) = \begin{cases} 0 & \text{if } |e| \leq \epsilon \\ e^2 - 2\epsilon |e| + \epsilon^2 & \text{if } \epsilon < |e| < h \\ 2(h - \epsilon)|e| - h^2 + \epsilon^2 & \text{if } |e| \geq h \end{cases} \qquad (32)$$
with $h > \epsilon$. Its score function becomes:

$$\psi(e) = L'(e) = \begin{cases} 0 & \text{if } |e| \leq \epsilon \\ 2\,\mathrm{sign}(e)(|e| - \epsilon) & \text{if } \epsilon < |e| < h \\ 2\,\mathrm{sign}(e)(h - \epsilon) & \text{if } |e| \geq h. \end{cases} \qquad (33)$$

Taking an epsilon zone in the loss function leads to sparseness because of the conditions for optimality
(18). Note that setting $\epsilon$ to zero leads to the Huber loss function, while for $\epsilon > 0$, letting $h \to \epsilon$ approaches
the Vapnik epsilon insensitive loss function.
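The loss (32) and its score function (33) can be implemented directly (a vectorized sketch; the function names are ours):

```python
import numpy as np

def eps_huber_loss(e, eps, h):
    """Epsilon-insensitive robust loss, eq. (32): zero inside |e| <= eps,
    quadratic for eps < |e| < h, linear growth for |e| >= h."""
    a = np.abs(e)
    return np.where(a <= eps, 0.0,
           np.where(a < h, a ** 2 - 2 * eps * a + eps ** 2,
                    2 * (h - eps) * a - h ** 2 + eps ** 2))

def eps_huber_score(e, eps, h):
    """Score function psi(e) = L'(e), eq. (33); bounded by 2(h - eps)."""
    a = np.abs(e)
    return np.where(a <= eps, 0.0,
           np.where(a < h, 2 * np.sign(e) * (a - eps),
                    2 * np.sign(e) * (h - eps)))
```

Both pieces join continuously at $|e| = \epsilon$ and $|e| = h$, and the score is bounded, which is exactly the robustness requirement of Section IV-A.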
V. THE PRE-IMAGE PROBLEM
The projections onto the subspace spanned by the principal components lie in some feature space $\mathcal{F}$,
but in most applications of kernel PCA we are interested in a reconstruction in the input space rather
than in $\mathcal{F}$. This mapping from the feature space back to the input space is called the pre-image problem.
In this Section we summarize three pre-image algorithms that will be used in the sequel.
• Pre-image algorithm 1:
An iterative nonlinear optimization method is proposed in [3] for Gaussian kernels, but it can suffer
from numerical instabilities and local minima. Given a test point $x$ we want to recover $\hat{x}$ for which
$\varphi(\hat{x}) \approx P_{N_c}\varphi(x)$, where $P_{N_c}\varphi(x)$ is the projection of $\varphi(x)$ onto the subspace spanned by the first
$N_c$ principal components. For Gaussian kernels, the approximate pre-image $\hat{x}$ can be computed using
the following iteration [3]:

$$\hat{x}^{(t)} = \frac{\sum_{i=1}^N \eta_i \exp(-\|\hat{x}^{(t-1)} - x_i\|^2/\sigma^2)\, x_i}{\sum_{i=1}^N \eta_i \exp(-\|\hat{x}^{(t-1)} - x_i\|^2/\sigma^2)} \qquad (34)$$

where $\eta_i = \sum_{j=1}^{N_c} \beta_j u_i^{(j)}$, $\beta_j$ is the projection of $\varphi(x)$ onto the $j$-th component $u^{(j)}$ and $t$ is the
iteration step.
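The fixed-point iteration (34) is a few lines of code (an illustrative sketch; the coefficients $\eta_i$ are assumed precomputed from the chosen components, and the starting point is typically the test point itself):

```python
import numpy as np

def preimage_fixed_point(Xtrain, eta, x0, sigma2, n_iter=200, tol=1e-9):
    """Pre-image iteration of eq. (34) for the Gaussian kernel.
    Can suffer from local minima; the denominator may also become small,
    which is one source of the numerical instabilities mentioned in the text."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_iter):
        w = eta * np.exp(-((Xtrain - x) ** 2).sum(axis=1) / sigma2)
        x_new = (w[:, None] * Xtrain).sum(axis=0) / w.sum()
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x
```

When all weights are positive, each iterate is a convex combination of training points, so the pre-image stays inside their convex hull.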
• Pre-image algorithm 2:
Another approach is proposed in [6]. It is a modified version of (34) that iteratively removes the
influence of outliers. Let $l_j^{(t)}$ be the certainty of $x_{(j)}$, where $x_{(j)}$ denotes the $j$-th coordinate of the
test point $x$. This certainty is calculated using the difference between $x_{(j)}$ and the corresponding
reconstruction $\hat{x}_{(j)}^{(t-1)}$ as:

$$l_j^{(t)} = \exp\left(-(x_{(j)} - \hat{x}_{(j)}^{(t-1)})^2/(2\sigma_j^2)\right), \quad j = 1, \ldots, d. \qquad (35)$$

The parameter $\sigma_j$ is the standard deviation of the differences and is estimated using robust estimation
[9]:

$$\sigma_j = 1.4826\,(1 + 5/(N - 1))\, \mathrm{median}_i \sqrt{\varepsilon_{ij}^2} \qquad (36)$$

where $\varepsilon_{ij}^2$ is the squared error between the $j$-th dimension of the $i$-th data point and its reconstruction.
The approximate pre-image $\hat{x}$ can be computed using the following iteration [6]:

$$\hat{x}^{(t)} = \frac{\sum_{i=1}^N \eta_i \exp(-\|g^{(t)} - x_i\|^2/\sigma^2)\, x_i}{\sum_{i=1}^N \eta_i \exp(-\|g^{(t)} - x_i\|^2/\sigma^2)} \qquad (37)$$

where $g^{(t)}$ is given by:

$$g^{(t)} = L^{(t)} x + (I_d - L^{(t)}) \hat{x}^{(t-1)}, \qquad (38)$$

for $t > 0$. The matrix $I_d$ is the $d \times d$ identity matrix, $d$ is the dimensionality of the input vectors, and
$L^{(t)}$ is a $d \times d$ matrix defined as $L^{(t)} = \mathrm{diag}(l_1^{(t)}, \ldots, l_d^{(t)})$.
• Pre-image algorithm 3:
A newer approach is proposed in [5]. This method is non-iterative, involves only linear algebra
and uses the distances between the projection $P_{N_c}\varphi(x)$ and its nearest neighbors. The method can
be summarized as follows: given the $N_n$ neighbors $x_1, \ldots, x_{N_n}$ and $q = [q_1, \ldots, q_{N_n}]^T$, where
$q_l$ is the input space distance between $P_{N_c}\varphi(x)$ and $x_l$, $l = 1, \ldots, N_n$, first construct the matrix
$X = [x_1, x_2, \ldots, x_{N_n}]$ and compute its singular value decomposition (SVD):

$$X M_c = G R \qquad (39)$$

where $M_c$ is the centering matrix, $G$ is the matrix of left singular vectors, $R = \Lambda H^T = [r_1, \ldots, r_{N_n}]$,
$\Lambda$ is the diagonal matrix of singular values, $H$ is the matrix of right singular vectors and $r_j$ is the
projection of $x_j$ onto $G$, $j = 1, \ldots, N_n$. The approximate pre-image can be obtained as:

$$\hat{r} = -\frac{1}{2} \Lambda^{-1} H^T (q - q_0) \qquad (40)$$

where $q_0 = [\|r_1\|^2, \ldots, \|r_{N_n}\|^2]^T$. Transforming back to the original coordinate system in the input
space leads to:

$$\hat{x} = G \hat{r} + \bar{x} \qquad (41)$$

where $\bar{x} = \frac{1}{N_n} \sum_{j=1}^{N_n} x_j$.
VI. ALGORITHMS
We propose two algorithms to obtain the components with the Huber epsilon-insensitive loss function.
The first algorithm solves (16) using a sequential quadratic programming (SQP) approach [25] and a smooth
approximation of $L'(e)$. The second algorithm corresponds to an implementation of the weighting scheme
(22).
A. Sequential Quadratic Programming Algorithm
The Huber epsilon-insensitive influence function $L'(e)$ is not differentiable at $|e| = \epsilon$ and $|e| = h$;
therefore, to use a gradient-based approach such as SQP, a smooth approximation $L'_s(e)$ to $L'(e)$ is taken:

$$L'_s(e) = \begin{cases} \frac{2}{s} \log\big(1 + \exp(s(e + h))\big) - 2(h - \epsilon) & \text{if } e \leq \frac{\epsilon - h}{2} - \epsilon \\[4pt] \frac{2}{s} \log\left( \dfrac{1 + \exp(s(e - \epsilon))}{1 + \exp(-s(e + \epsilon))} \right) & \text{if } \frac{\epsilon - h}{2} - \epsilon < e < \frac{h - \epsilon}{2} + \epsilon \\[4pt] -\frac{2}{s} \log\big(1 + \exp(-s(e - h))\big) + 2(h - \epsilon) & \text{if } e \geq \frac{h - \epsilon}{2} + \epsilon \end{cases} \qquad (42)$$

where $s$ is a constant that controls the smoothness of the approximation. Figure 2 shows $L'_s(e)$ and $L'(e)$
for $\epsilon = 2$, $h = 5$, $s = 20$.
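A sketch of (42) using a numerically stable softplus, checked against the exact score (33) (our own code; the reading of the prefactor as $2/s$ is consistent with $L'_s \to L'$ as $s \to \infty$):

```python
import numpy as np

def smooth_score(e, eps, h, s):
    """Smooth approximation L'_s(e) of eq. (42) to the score function (33)."""
    softplus = lambda t: np.logaddexp(0.0, t)      # stable log(1 + exp(t))
    lo = (eps - h) / 2 - eps                       # = -(h + eps)/2
    hi = (h - eps) / 2 + eps                       # = (h + eps)/2
    left = (2 / s) * softplus(s * (e + h)) - 2 * (h - eps)
    mid = (2 / s) * (softplus(s * (e - eps)) - softplus(-s * (e + eps)))
    right = -(2 / s) * softplus(-s * (e - h)) + 2 * (h - eps)
    return np.where(e <= lo, left, np.where(e < hi, mid, right))

def exact_score(e, eps, h):
    # psi(e) of eq. (33), for comparison
    a = np.abs(e)
    return np.where(a <= eps, 0.0,
           np.where(a < h, 2 * np.sign(e) * (a - eps), 2 * np.sign(e) * (h - eps)))
```

The approximation error is of order $(2/s)\log 2$ near the kinks and vanishes elsewhere as $s$ grows.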
Replacing $L'(\cdot)$ by $L'_s(\cdot)$ in (16) leads to:

$$f(\lambda, \alpha, b) = \begin{bmatrix} \lambda \alpha - \frac{1}{2} L'_s(z(\alpha, b)) \\ 1_N^T L'_s(z(\alpha, b)) \end{bmatrix} = 0 \qquad (43)$$

where $f: \mathbb{R}^+ \times \mathbb{R}^N \times \mathbb{R} \to \mathbb{R}^{N+1}$, $\lambda = 1/\gamma$, $z(\alpha, b) = [z_1(\alpha, b); z_2(\alpha, b); \ldots; z_N(\alpha, b)]$,
$\alpha = [\alpha_1; \alpha_2; \ldots; \alpha_N]$, $z_i(\alpha, b) = \Omega_{(i)} \alpha + b$, $\Omega_{(i)}$ denotes the $i$-th row of the kernel matrix $\Omega$ and

$$L'_s(z(\alpha, b)) = \left[ L'_s(z_1(\alpha, b)); \ldots; L'_s(z_N(\alpha, b)) \right].$$

A set of nonlinear equations such as (43) is usually solved via minimization of the sum of squares of the
residuals. We impose two additional constraints, $\alpha^T \alpha \geq 1$ and $\lambda \geq c$, where $c$ is a small positive
constant (e.g. $c = 10^{-6}$). The first constraint prevents the trivial solution $\alpha = 0$ and non-interesting
solutions with small norm causing the score variables to lie in the range $-h < z_i(\alpha, b) < h$,
$i = 1, \ldots, N$, therefore avoiding the bounded zone $|z_i(\alpha, b)| \geq h$ of the score function. The proposed
nonlinear optimization problem is the following:

$$\min_{\lambda, \alpha, b} g(\lambda, \alpha, b) = f(\lambda, \alpha, b)^T f(\lambda, \alpha, b) \qquad (44)$$

$$\text{such that} \quad \alpha^T \alpha \geq 1, \quad \lambda \geq c$$

where

$$g(\lambda, \alpha, b) = \lambda^2 \alpha^T \alpha - \lambda \alpha^T L'_s(z(\alpha, b)) + \frac{1}{4} \left( L'_s(z(\alpha, b)) \right)^T L'_s(z(\alpha, b)) + \left( 1_N^T L'_s(z(\alpha, b)) \right)^2. \qquad (45)$$

All $N + 1$ equations have to be satisfied at the local solution $(\lambda^\star, \alpha^\star, b^\star)$ rather than just minimizing the
sum of the squares of the equations. This means that the value of $g(\lambda^\star, \alpha^\star, b^\star)$ should be zero in order
for $(\lambda^\star, \alpha^\star, b^\star)$ to be a solution to the problem (43). In the same fashion as in the weighting scheme,
additional constraints can be added to impose orthogonality with previously found components. Note that
sparseness is achieved component-wise due to $\alpha_i = \frac{\gamma}{2} L'(e_i)$, which is one of the optimality conditions
leading to (18). Therefore, each component vector may have a different number of zero $\alpha_i$ values.

An SQP method approximates in every iteration the constrained optimization problem (44) as a
quadratic programming problem of the form:

$$\min_d \frac{1}{2} d^T H^{(k)} d + \nabla g(\theta^{(k)})^T d \qquad (46)$$

$$\text{such that} \quad \begin{cases} \nabla c_1(\theta^{(k)})^T d + c_1(\theta^{(k)}) \geq 0 \\ \nabla c_2(\theta^{(k)})^T d + c_2(\theta^{(k)}) \geq 0 \end{cases} \qquad (47)$$

where $k$ denotes the iteration step, $H$ is the Hessian matrix (or an approximation to it) of the Lagrangian
of (44), $\theta = [\lambda; \alpha; b]$ is the vector of parameters, $c_1(\theta) = \alpha^T \alpha - 1$ is the first constraint, $c_2(\theta) = \lambda - c$
is the second constraint, $g(\theta) = g(\lambda, \alpha, b)$ and the gradient $\nabla g(\theta) \in \mathbb{R}^{N+2}$ is defined as:

$$\nabla g(\theta) = \nabla g(\lambda, \alpha, b) = \begin{bmatrix} \nabla_\lambda g(\lambda, \alpha, b) \\ \nabla_\alpha g(\lambda, \alpha, b) \\ \nabla_b g(\lambda, \alpha, b) \end{bmatrix} \qquad (48)$$
where

$$\nabla_\lambda g(\lambda, \alpha, b) = 2\lambda \alpha^T \alpha - \alpha^T L'_s(z(\alpha, b))$$

$$\nabla_\alpha g(\lambda, \alpha, b) = 2\lambda^2 \alpha - \lambda \left( \Omega\, \mathrm{diag}\big(L''_s(z(\alpha, b))\big) \alpha + L'_s(z(\alpha, b)) \right) + \frac{1}{2} \Omega\, \mathrm{diag}\big(L''_s(z(\alpha, b))\big) L'_s(z(\alpha, b)) + 2 r_1 \Omega L''_s(z(\alpha, b))$$

$$\nabla_b g(\lambda, \alpha, b) = -\lambda \alpha^T L''_s(z(\alpha, b)) + \frac{1}{2} \big(L''_s(z(\alpha, b))\big)^T L'_s(z(\alpha, b)) + 2 r_1 r_2 \qquad (49)$$

with $r_1 = 1_N^T L'_s(z(\alpha, b))$ and $r_2 = 1_N^T L''_s(z(\alpha, b))$. The solution vector $d^{(k)}$ to the QP subproblem (46) can then
be used to form a new iterate:

$$\theta^{(k+1)} = \theta^{(k)} + \mu^{(k)} d^{(k)} \qquad (50)$$

where $\mu^{(k)}$ is the step length parameter of the line search and can be obtained using a merit function.
A critical issue in nonlinear optimization is the selection of good starting points. In the proposed SQP
approach, kernel PCA eigenvectors and eigenvalues are used as initial values for the nonlinear solver.
Therefore, we consider only the first local solution obtained through the proposed initialization.
Given a set of N d-dimensional data points X = {x_1, ..., x_N}, the algorithm gives a set of Nc
component vectors A = {α^(1)⋆, ..., α^(Nc)⋆} as follows:
Algorithm 1 SQP Approach
1: Set the kernel parameters and the loss function parameters ǫ, h.
2: Perform an eigendecomposition of the non-centered kernel matrix to obtain initial eigenvalues λ_1, ..., λ_N and initial eigenvectors α^(1), ..., α^(N).
3: for l = 1 to Nc do
4:   Obtain the initial value for the bias term b_l = (1/N) 1_N^T Ω α^(l)
5:   Solve the SQP problem (46), (50) with initial values λ_l, α^(l) and b_l to obtain the solutions λ_l⋆, α^(l)⋆ and b_l⋆
6: end for
7: Sort the solutions according to the value of λ_l⋆ in decreasing order.
8: Normalize the components α^(l)⋆ ← α^(l)⋆ / ||α^(l)⋆||_2, l = 1, ..., Nc
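The paper uses MATLAB's fmincon; as a rough illustration, the same construction can be sketched with SciPy's SLSQP solver. The smooth derivative L′_s below is an assumed epsilon-insensitive Huber-type form, and the residual vector f follows (44)-(45); this is a sketch under those assumptions, not the authors' implementation:

```python
import numpy as np
from scipy.optimize import minimize

def rbf_kernel(X, sigma2):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / sigma2)

def dLs(z, eps, h):
    # assumed smooth epsilon-insensitive Huber-type derivative L'_s:
    # zero inside the eps-zone, linear up to h, constant beyond
    return 2.0 * np.sign(z) * np.clip(np.abs(z) - eps, 0.0, h - eps)

def sqp_component(Omega, lam0, alpha0, b0, eps, h, c=1e-6):
    """Minimize g = f^T f subject to alpha^T alpha >= 1 and lam >= c, as in (44)."""
    N = Omega.shape[0]

    def g(theta):
        lam, alpha, b = theta[0], theta[1:N + 1], theta[N + 1]
        Lp = dLs(Omega @ alpha + b, eps, h)
        # N residuals lam*alpha_i - (1/2) L'_s(z_i) plus the bias equation 1^T L'_s(z) = 0
        f = np.concatenate([lam * alpha - 0.5 * Lp, [Lp.sum()]])
        return f @ f

    cons = [{'type': 'ineq', 'fun': lambda t: t[1:N + 1] @ t[1:N + 1] - 1.0},
            {'type': 'ineq', 'fun': lambda t: t[0] - c}]
    theta0 = np.concatenate([[lam0], alpha0, [b0]])
    res = minimize(g, theta0, method='SLSQP', constraints=cons,
                   options={'maxiter': 200})
    lam, alpha, b = res.x[0], res.x[1:N + 1], res.x[N + 1]
    return lam, alpha / np.linalg.norm(alpha), b  # step 8: normalize
```

Initial values come from an eigendecomposition of the non-centered kernel matrix, with b_l = (1/N) 1_N^T Ω α^(l), as in steps 2-4 of Algorithm 1.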
B. Weighted Algorithm
The second algorithm is an implementation of the iterative weighting scheme (21). Given X and the
number of iterations Niter, Algorithm 2 gives a set of Nc component vectors α^(1), ..., α^(Nc) ∈ R^(Nr),
where Nr ≤ N. In general, empirical results suggest that only one iteration (Niter = 1) is sufficient to
obtain good results. Sparseness is achieved in step 4 by removing data points from the training set for
which the corresponding weight is equal to zero across all components. The computational complexity
of this algorithm is O(Niter Nc (ξN + Nc)), compared with O(ξN) for classical kernel PCA using the
Lanczos method for the largest Nc components. The factor ξ is the maximum number of matrix-vector
computations required and depends on the eigengap λ_{Nc+1} − λ_{Nc} [26], [27].

Algorithm 2 Weighted Kernel PCA
1: Set the kernel parameters, the loss function parameters ǫ, h and the number of principal components Nc to be extracted.
2: Compute the unweighted score variables e_i^(l), i = 1, ..., N, l = 1, ..., Nc, associated with the first Nc principal components using classical kernel PCA.
3: Compute the weights v_i^(l) such that v_i^(l) (e_i^(l))^2 = L(e_i^(l)).
4: Remove from the training dataset the data points whose corresponding weight v_i^(l) equals zero in every component. The new dataset has Nr data points.
5: Compute the unweighted score variables e_i^(l), l = 1, ..., Nc, of the reduced dataset using classical kernel PCA.
6: for k = 1 to Niter do
7:   Recompute the weights v_i^(1) such that v_i^(1) (e_i^(1))^2 = L(e_i^(1)) with ǫ = 0.
8:   Find the first weighted principal component α^(1), which is the solution associated with the largest eigenvalue in V Ωc α = λα, with V = diag(v_1, ..., v_Nr).
9:   Recompute e_i^(1) using α^(1).
10:  for l = 2 to Nc do
11:    Recompute the weights v_i^(l) such that v_i^(l) (e_i^(l))^2 = L(e_i^(l)) with ǫ = 0.
12:    Find the l-th weighted principal component α^(l), which is the solution associated with the largest eigenvalue in (22).
13:    Recompute e_i^(l) using α^(l).
14:  end for
15: end for
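Assuming a Huber loss with ǫ = 0 for the weights (so that v_i e_i^2 = L(e_i) gives v_i = 1 for |e_i| ≤ h and v_i = (2h|e_i| − h^2)/e_i^2 beyond), the core of the weighting scheme for the first component can be sketched as follows; a dense nonsymmetric eigensolver is used for V Ωc α = λα for simplicity:

```python
import numpy as np

def huber_weights(e, h):
    # v_i e_i^2 = L(e_i) for an assumed Huber loss with eps = 0:
    # L(e) = e^2 if |e| <= h, else 2h|e| - h^2
    a = np.abs(e)
    v = np.ones_like(a)
    m = a > h
    v[m] = (2.0 * h * a[m] - h ** 2) / a[m] ** 2
    return v

def weighted_first_component(Kc, h, n_iter=1):
    """First weighted principal component: largest eigenpair of V @ Kc (step 8)."""
    _, U = np.linalg.eigh(Kc)         # classical kernel PCA initialization
    alpha = U[:, -1]
    for _ in range(n_iter):
        e = Kc @ alpha                # score variables
        V = np.diag(huber_weights(e, h))
        w, W = np.linalg.eig(V @ Kc)  # nonsymmetric eigenproblem V Omega_c alpha = lam alpha
        alpha = np.real(W[:, np.argmax(np.real(w))])
    return alpha / np.linalg.norm(alpha)
```

In practice a Lanczos-type solver for the largest Nc eigenpairs gives the O(Niter Nc (ξN + Nc)) complexity quoted above; the dense eig call here is only for clarity.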
VII. MODEL SELECTION
Model selection is a central issue in unsupervised learning techniques such as kernel PCA. In supervised
learning one can rely on many methods (such as a validation set, cross-validation, generalization bounds,
etc.) for model selection. However, such criteria often cannot be applied in unsupervised learning due to
the lack of an underlying objective function for the method. In kernel PCA, the parameters that define
the model are the number of principal components Nc to be extracted and the kernel parameters (e.g. σ2
for the RBF kernel). These parameters are typically chosen in a trial-and-error fashion. In the proposed
approaches, model selection can be done naturally in a learning framework on the basis of the explicit
constrained optimization formulation. We propose the following model selection criterion:
    max_{Nc,σ2}  Σ_{l=1}^{Nc} Σ_{j=1}^{Nval} L(e_j^(l))                      (51)

where e_j^(l) = w^T φ(x_j^val) + b_l = Σ_{i=1}^{N} α_i^(l) K(x_i, x_j^val) + b_l, x_j^val denotes the j-th validation data point, α^(l)
is the l-th component and b_l is the bias term for the l-th component, l = 1, ..., Nc.
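Criterion (51) is straightforward to evaluate once the components and bias terms are available. The sketch below assumes an epsilon-insensitive Huber loss L (quadratic between the ǫ-zone and h, linear beyond) and a precomputed cross-kernel matrix between training and validation points; the names are illustrative:

```python
import numpy as np

def eps_insensitive_huber(e, eps, h):
    # assumed loss: 0 inside the eps-zone, quadratic up to h, linear beyond
    a = np.maximum(np.abs(e) - eps, 0.0)
    return np.where(a <= h - eps, a ** 2, 2.0 * (h - eps) * a - (h - eps) ** 2)

def selection_criterion(K_val, alphas, biases, eps, h):
    """Sum of L(e_j^(l)) over components l and validation points j, as in (51).

    K_val[i, j] = K(x_i, x_j^val); alphas stores one component per column."""
    E = alphas.T @ K_val + biases[:, None]   # score e_j^(l) of validation point j
    return float(eps_insensitive_huber(E, eps, h).sum())
```

The criterion is then maximized over a grid of (Nc, σ2) values.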
The values of ǫ (sparseness) and h (robustness) are problem-dependent and can be obtained heuristically
from the empirical distribution of the score variables: ǫ can be set such that a given fraction of the score
values lie within the epsilon zone of the loss function, and, in the same way, h such that a given percentage
of the score values lie within the range −h < e < h. Figure 3 shows the proposed heuristic.
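The percentile heuristic of Figure 3 can be sketched directly from the empirical score distribution. The helper below is hypothetical; the default 10%/90% quantiles match the settings used in the experiments of Section VIII:

```python
import numpy as np

def choose_eps_h(scores, eps_fraction=0.10, h_fraction=0.90):
    # eps: the given fraction of |score| values fall inside the epsilon zone;
    # h: the given fraction of score values fall inside the range -h < e < h
    a = np.abs(np.ravel(scores))
    return float(np.quantile(a, eps_fraction)), float(np.quantile(a, h_fraction))
```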
VIII. EMPIRICAL RESULTS
In this section, some experimental results are presented to illustrate the effectiveness of the proposed
approaches. We used the MATLAB function fmincon to solve (46). This function is an SQP solver with
a quasi-Newton approximation of the Hessian of the Lagrangian using the Broyden-Fletcher-Goldfarb-Shanno
(BFGS) method. The step length is calculated with a line search and the merit function described
in [28]. We performed denoising experiments and feature extraction simulations, which correspond to
the most common applications of kernel PCA.
All experiments reported were carried out using the RBF kernel. It is important to mention that the
kernel should be bounded for these robust methods to work. The machine used is a Pentium 4, 2.8 GHz,
1 GB RAM. The values of ǫ and h were obtained using the heuristic proposed in Figure 3, setting ǫ such
that 10% of the score values lie within the epsilon insensitive range and setting h such that 90% of the
score values lie within the range −h < e < h.
A. Experiment 1 - Denoising
The first experiment consists of a denoising application. Simulations were done with a curve-shaped
dataset in two dimensions and with the UCI multiple features handwritten digit dataset. The curve dataset
consists of 360 data points for training, 420 for validation and 320 for testing. The loss function parameters
ǫ = 0.34, h = 3.7 were set according to Figure 3. Model selection surfaces for the curve dataset can be
seen in Figure 4, showing that in the weighted kernel PCA method the criterion saturates with respect
to the number of components Nc. On the other hand, in kernel PCA the variance of the projected
variables is always increasing with respect to Nc. This means that fewer components have to be computed
in the weighted kernel PCA algorithm. Figure 5 shows a slice of the surface plots. The values of the
proposed criterion and the kernel PCA criterion (variance) were rescaled for graphical purposes. The obtained
parameters are σ2 = 0.45, Nc = 40 for kernel PCA and σ2 = 2.0, Nc = 8 for the SQP algorithm.
Sparseness of the proposed method can be seen in Figure 6 and Table I. Note that sparseness varies
from component to component. However, most of the outliers are not support vectors; therefore they can
be removed from the dataset. Table II shows the percentage of outliers present in the support vector set.
The reduced set selection method [4] is more sparse than the proposed methods, but its support vector
set contains almost all outliers. The components found by the SQP algorithm are more sparse than the
components obtained by the simplified weighting scheme, but at a higher computational cost.
The handwritten digit training set is composed of 195 digit images of 15×16 pixels. The validation set
consists of 40 digits. We added outliers and Gaussian noise to the dataset and set ǫ = 0.1, h = 4 according
to Figure 3. Figure 8 shows the slices of the model selection surfaces, with similar results compared to
Figure 4. The obtained parameters are σ2 = 350, Nc = 190 for kernel PCA and σ2 = 300, Nc = 120 for the
SQP algorithm. The denoising results can be seen in Figures 9 and 10, showing improved results of the
SQP algorithm compared to kernel PCA in terms of robustness, together with visually appealing digits.
Table III shows the mean squared error (MSE) and standard deviation for kernel PCA and the proposed
methods with respect to the three pre-image algorithms. The MSE was calculated between the denoised
digits and the test set without outliers and Gaussian noise. Figure 11 shows sparseness results in terms
of the accumulated number of support vectors with respect to the number of components.
# of Components   SQP Algorithm   Reduced Set [4]
1                 190             160
2                 222             221
3                 286             257
4                 290             272

TABLE I
CURVE EXPERIMENT - ACCUMULATED NUMBER OF SUPPORT VECTORS FOR THE SQP ALGORITHM AND THE REDUCED SET
SELECTION METHOD [4] WITH RESPECT TO THE NUMBER OF COMPONENTS REQUIRED.
                     % of outliers   Computation times
Reduced Set [4]      98%             5.4 secs
Weighted Algorithm   3%              8.6 secs
SQP Algorithm        0%              24.7 secs

TABLE II
CURVE EXPERIMENT - FRACTION OF THE TOTAL NUMBER OF OUTLIERS PRESENT IN THE SUPPORT VECTOR SET. NOTE
THAT THE REDUCED SET METHOD IS MORE SPARSE THAN THE PROPOSED METHODS, BUT THE MAJORITY OF THE OUTLIERS
ARE IN THE SUPPORT VECTOR SET.
                        Kernel PCA    SQP algorithm   Weighted Algorithm
Pre-image algorithm 1   0.60 ± 0.40   0.10 ± 0.15     0.28 ± 0.30
Pre-image algorithm 2   0.70 ± 0.30   0.21 ± 0.20     0.45 ± 0.32
Pre-image algorithm 3   0.54 ± 0.40   0.30 ± 0.23     0.32 ± 0.28

TABLE III
HANDWRITTEN DIGITS - MSE AND STANDARD DEVIATION ON TEST DATA OF THE DENOISED DIGITS COMPARED TO THE
CLEAN TEST DATA.
B. Experiment 2 - Feature Extraction
The feature extraction experiment was carried out with a square-shaped dataset in two dimensions
consisting of 1000 data points. The weighted kernel PCA algorithm was used. The training scenario
consisted of 300 data points for training, 300 data points for validation and the remaining 400 data
points for testing. We included outliers and performed 10 randomizations of the training and validation
sets. The parameters of the loss function were set to ǫ = 0.5, h = 3. Model selection surfaces and
slices were obtained in a similar way as in the previous experiments. The selected parameters are
σ2 = 0.45, Nc = 6. Figures 12 and 13 show the feature extraction results for kernel PCA and weighted kernel
PCA, respectively. The contours of constant score value are shown in different levels of gray. We can see
that in kernel PCA some feature values are completely determined by the outliers, which is not the case
in weighted kernel PCA.
IX. CONCLUSION
A new kernel PCA formulation for general loss functions was derived. This formulation is based on the
LS-SVM framework and can be used to impose desirable properties such as sparseness and robustness on the
analysis. These two properties are missing in classical kernel PCA because it has an associated L2 loss
function, and the solutions are expressed in terms of dense kernel expansions. To overcome these issues,
we proposed a loss function with an epsilon insensitive zone for sparseness and a bounded score function
for robustness. We have presented two different algorithms to obtain components using this loss function.
The first algorithm solves a set of nonlinear equations using an SQP method. This nonlinear solver was
initialized using classical kernel PCA eigenvectors and eigenvalues, which were shown to be good starting
points, leading to convergence in most cases. However, we only consider the first local solution obtained
through this initialization, while other local optima may be interesting as well. The second algorithm is a
simplified iterative weighting scheme that leads to a sequence of generalized eigenvalue problems which
can be solved efficiently. We proposed a model selection criterion based on our constrained optimization
framework to obtain the kernel parameters and the number of components. A heuristic to find good choices
of the loss function parameters was also discussed. While mutual orthogonality of the score variables in
classical kernel PCA implies mutual orthogonality of the eigenvectors, that is not the case with a general
loss function. However, orthogonality constraints can be added to the primal optimization problem. In
simulations, the proposed methods showed good performance in terms of robustness together with
a sparse representation. The components obtained by the SQP algorithm were shown to be more sparse
than the components found by the weighted approach. However, the latter algorithm was shown to be
faster in terms of computation time.
ACKNOWLEDGMENT
This work was supported by grants and projects for the Research Council K.U.Leuven (GOA-Mefisto 666,
GOA-Ambiorics, several PhD/Postdoc & fellow grants), the Flemish Government (FWO: PhD/Postdoc grants,
projects G.0240.99, G.0211.05, G.0407.02, G.0197.02, G.0080.01, G.0141.03, G.0491.03, G.0120.03, G.0452.04,
G.0499.04, G.0226.06, G.0302.07, ICCoS, ANMMM; AWI; IWT: PhD grants, GBOU (McKnow), Soft4s), the Belgian
Federal Government (Belgian Federal Science Policy Office: IUAP V-22; PODO-II (CP/01/40)), the EU (FP5-Quprodis;
ERNSI; Eureka 2063-Impact; Eureka 2419-FLiTE) and Contract Research/Agreements (ISMC/IPCOS,
Data4s, TML, Elia, LMS, IPCOS, Mastercard). Johan Suykens is a professor at the K.U.Leuven, Belgium. The
scientific responsibility is assumed by its authors.
REFERENCES
[1] I. T. Jolliffe, Principal Component Analysis. Springer Verlag, 1986.
[2] B. Scholkopf, A. J. Smola, and K. R. Muller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Computation, vol. 10, pp. 1299–1319, 1998.
[3] S. Mika, B. Scholkopf, A. J. Smola, K. R. Muller, M. Scholz, and G. Ratsch, "Kernel PCA and de-noising in feature spaces," in Advances in Neural Information Processing Systems 11, M. S. Kearns, S. A. Solla, and D. A. Cohn, Eds. MIT Press, 1999.
[4] B. Scholkopf, S. Mika, C. J. C. Burges, P. Knirsch, K. R. Muller, M. Scholz, G. Ratsch, and A. J. Smola, "Input space versus feature space in kernel-based methods," IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 1000–1017, September 1999.
[5] J. T. Kwok and I. W. Tsang, "The pre-image problem in kernel methods," IEEE Transactions on Neural Networks, vol. 15, no. 6, pp. 1517–1525, November 2004.
[6] T. Takahashi and T. Kurita, "Robust de-noising by kernel PCA," in Artificial Neural Networks - ICANN 2002, Springer, pp. 739–744, 2002.
[7] J. A. K. Suykens, T. Van Gestel, J. Vandewalle, and B. De Moor, "A support vector machine formulation to PCA analysis and its kernel version," IEEE Transactions on Neural Networks, vol. 14, no. 2, pp. 447–450, March 2003.
[8] P. Huber, Robust Statistics. John Wiley & Sons, 1981.
[9] P. J. Rousseeuw and A. M. Leroy, Robust Regression and Outlier Detection. Wiley, New York, 1987.
[10] F. Critchley, "Influence in principal component analysis," Biometrika, vol. 72, pp. 627–636, 1985.
[11] L. Xu and A. L. Yuille, "Robust principal component analysis by self-organizing rules based on statistical physics approach," IEEE Transactions on Neural Networks, vol. 6, no. 1, pp. 131–143, January 1995.
[12] I. Higuchi and S. Eguchi, "The influence function of principal component analysis by self-organizing rule," Neural Computation, vol. 10, pp. 1435–1444, 1998.
[13] A. J. Smola, O. Mangasarian, and B. Scholkopf, "Sparse kernel feature analysis," University of Wisconsin, Data Mining Institute, Madison, Technical Report 99-03, 1999.
[14] J. A. K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, and J. Vandewalle, Least Squares Support Vector Machines. World Scientific, Singapore, 2002.
[15] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[16] J. A. K. Suykens, J. De Brabanter, L. Lukas, and J. Vandewalle, "Weighted least squares support vector machines: Robustness and sparse approximation," Neurocomputing, vol. 48, pp. 85–105, 2002.
[17] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge University Press, 1990.
[18] F. De la Torre and M. Black, "Robust principal component analysis for computer vision," in Proc. International Conference on Computer Vision (ICCV'2001), 2001, pp. 362–369.
[19] I. Higuchi and S. Eguchi, "Robust principal component analysis with adaptive selection for tuning parameters," Journal of Machine Learning Research, vol. 5, pp. 453–471, 2004.
[20] C. Croux and G. Haesbroeck, "Principal component analysis based on robust estimators of the covariance or correlation matrix: Influence functions and efficiencies," Biometrika, vol. 87, no. 3, 2000.
[21] M. Hubert, P. J. Rousseeuw, and K. Vanden Branden, "ROBPCA: a new approach to robust principal component analysis," Technometrics, vol. 47, pp. 64–79, 2005.
[22] F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw, and W. A. Stahel, Robust Statistics: The Approach Based on Influence Functions. John Wiley, New York, 1986.
[23] A. Christmann and I. Steinwart, "On robustness properties of convex risk minimization methods for pattern recognition," Journal of Machine Learning Research, vol. 5, pp. 1007–1034, August 2004.
[24] V. Vapnik, Statistical Learning Theory. Wiley, New York, 1998.
[25] J. Nocedal and S. J. Wright, Numerical Optimization. Springer, 2000.
[26] G. H. Golub and C. F. Van Loan, Matrix Computations. The Johns Hopkins University Press, 1996.
[27] J. H. Wilkinson, The Algebraic Eigenvalue Problem. Clarendon Press, Oxford, 1978.
[28] M. J. D. Powell, "A fast algorithm for nonlinearly constrained optimization calculations," in Proceedings of the 1977 Dundee Biennial Conference on Numerical Analysis, G. A. Watson, Ed. Springer Verlag, 1978, pp. 144–157.
CAPTIONS
Figure 1. Left: L2 loss function and its influence function. Right: Proposed loss function with ǫ = 2, h = 5 and its influence function. Note that L′(e) is bounded, which is important for robustness.
Figure 2. Smooth approximation of the Huber epsilon-insensitive influence function. The dashed line corresponds to L′(e) and the solid line to L′_s(e).
Figure 3. Empirical distribution of the score variables associated with the kernel principal components. The degree of sparseness and robustness can be determined heuristically from this plot.
Figure 4. Curve Experiment - Model selection surfaces on validation data. Left: Kernel PCA. Right: SQP algorithm. Note that the kernel PCA criterion is always increasing with respect to the number of components.
Figure 5. Curve Experiment - Slice plot of the model selection surfaces for the best RBF kernel parameter σ2. Dashed line: Average of the maximal variance on validation data (kernel PCA). Solid line: Average of the maximal L(e_i^val) on validation data (SQP algorithm). The arrows show the selected number of components. The value of the proposed criterion saturates with fewer components compared to kernel PCA. Obtained parameters are σ2 = 0.45, Nc = 40 for kernel PCA and σ2 = 2, Nc = 8 for the SQP algorithm.
Figure 6. Curve Experiment - Sparseness per component. The figure shows the support vectors (black points) for the first 4 components. Most of the outliers are not support vectors in any component; therefore they can be removed from the dataset.
Figure 7. Curve Experiment - Denoising results using three pre-image algorithms. Original data are depicted in gray and denoised data points in black. Left column: Kernel PCA with σ2 = 0.45, Nc = 40. Right column: SQP algorithm with σ2 = 2.0, Nc = 8, ǫ = 0.34, h = 3.7. Top: Pre-image algorithm 1. Center: Pre-image algorithm 2. Bottom: Pre-image algorithm 3.
Figure 8. Handwritten Digits Experiment - Slice plot of the model selection surfaces for the best RBF kernel parameter σ2. Dashed line: Average of the maximal variance on validation data (kernel PCA). Solid line: Average of the maximal L(e_i^val) on validation data (SQP algorithm). The arrows show the selected number of components. The value of the proposed criterion saturates with fewer components compared to kernel PCA. Obtained parameters are σ2 = 350, Nc = 190 for kernel PCA and σ2 = 300, Nc = 120 for the SQP algorithm.
Figure 9. Handwritten Digits Experiment - Denoising results using kernel PCA with σ2 = 350, Nc = 190. First row: Test set corrupted with Gaussian noise and outliers. Second row: Denoising with pre-image algorithm 1. Third row: Denoising with pre-image algorithm 2. Fourth row: Denoising with pre-image algorithm 3.
Figure 10. Handwritten Digits Experiment - Denoising results using the SQP algorithm with σ2 = 300, Nc = 120. First row: Test set corrupted with Gaussian noise and outliers. Second row: Denoising with pre-image algorithm 1. Third row: Denoising with pre-image algorithm 2. Fourth row: Denoising with pre-image algorithm 3. The SQP algorithm shows improved results in terms of robustness compared to Figure 9.
Figure 11. Handwritten Digits Experiment. Number of accumulated support vectors with respect to the number of components. Solid line: SQP algorithm. Dashed line: Reduced set selection method [4]. The reduced set method contains fewer support vectors but outliers are not removed.
Figure 12. Feature Extraction Experiment - Contours of constant projection value for the first 6 principal components with the RBF kernel and tuned parameters σ2 = 0.3, Nc = 40. The data points are shown as circles. Note that the outliers have influence in most of the components.
Figure 13. Feature Extraction Experiment - Contours of constant projection value for the first 6 weighted principal components with the RBF kernel and tuned parameters σ2 = 0.45, Nc = 6, ǫ = 0.5, h = 3. The data points are shown as circles. Note that in this case, the projection value is not affected by the outliers.
Figure 14. Feature Extraction Experiment - Contours of constant projection value for the two components with the normalized linear kernel. Top: Kernel PCA. Bottom: Weighted kernel PCA with ǫ = 0.5, h = 3. Note that the nonlinear structure of the data is lost. In kernel PCA, the outlier cluster entirely defines the direction of the components, which is not the case in weighted kernel PCA.
[Fig. 1: plots of L2(e), L′2(e) (left) and L(e), L′(e) (right) versus e]
[Fig. 2: plot of L′(e) and L′_s(e) versus e]
[Fig. 3: histogram of the score values ei (frequency versus ei) with −h, h and the 2ǫ zone marked]
[Fig. 4: model selection surfaces, ΣVar(e^(j)) and ΣΣL(e^(j)_l), versus σ2 and Nc]
[Fig. 5: rescaled criteria versus number of components Nc]
[Fig. 6: four scatter plots in the (x, y) plane]
[Fig. 7]
[Fig. 8: rescaled criteria versus number of components Nc]
[Fig. 9]
[Fig. 10]
[Fig. 11: number of accumulated support vectors versus number of components Nc]
[Fig. 12]
[Fig. 13]
[Fig. 14]