
Neural Implementation of the JADE-Algorithm

Christian Ziegaus and Elmar W. Lang

Institute of Biophysics, University of Regensburg, D-93040 Regensburg, Germany [email protected]

Abstract. The Joint Approximative Diagonalization of Eigenmatrices (JADE)-algorithm [6] is an algebraic approach for Independent Component Analysis (ICA), a recent data analysis technique. The basic assumption of ICA is a linear superposition model where unknown source signals are mixed together by a mixing matrix. The aim is to recover the sources and the mixing matrix, respectively, based upon the mixtures with minimal or no knowledge about the sources. We will present a neural extension of the JADE-algorithm, discuss the properties of this new extension and apply it to an arbitrary mixture of real-world images.

1 Introduction

Principal Component Analysis (PCA) is a well known tool for multivariate data analysis and signal processing. PCA finds the orthogonal set of eigenvectors of the covariance matrix and therefore responds to second-order information of the input data. One often used application of PCA is dimensionality reduction. But second-order information is only sufficient to describe data that are gaussian or close to gaussian. In all other cases higher-order statistical properties must be considered to describe the data appropriately. A recent technique that also includes PCA and that uses higher-order statistics of the input is Independent Component Analysis (ICA).

The basic assumption to perform an ICA is a linear mixture model representing an n-dimensional real vector x = [x_0, ..., x_{n-1}]^T as a superposition of m linearly independent but otherwise arbitrary n-dimensional signatures a^{(p)}, 0 \le p < m, forming the columns of an n x m-dimensional mixing matrix A = [a^{(0)} ... a^{(m-1)}]. The coefficients of the superposition, interpreted as an m-dimensional vector s = [s_0, ..., s_{m-1}]^T, lead to the following basic equation of linear ICA:

x = A s. (1)

The influence of an additional noise term is assumed to be negligible and will not be considered here. The components of s are often called source signals, those of x mixtures. This reflects the basic assumption that x is given as a mixture of the source signals s, where x is the quantity that can be measured. It is often assumed that the number of mixtures equals the number of sources (n = m).
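To make the mixture model concrete, the following minimal NumPy sketch generates zero-mean, independent, non-gaussian sources and mixes them according to (1); the Laplacian source distribution, the sample count and the random mixing matrix are illustrative choices, not prescribed by the model.

```python
import numpy as np

rng = np.random.default_rng(0)
m = n = 4                     # number of sources = number of mixtures
T = 10_000                    # number of samples
s = rng.laplace(size=(T, m))  # zero-mean, independent, non-gaussian sources
A = rng.normal(size=(n, m))   # unknown mixing matrix
x = s @ A.T                   # observed mixtures, eq. (1): x = A s
```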

A few requirements about the statistical properties of the sources have to be met for ICA to be possible [4]. The source signals are assumed to be statistically independent and stationary processes with at most one of the sources following a normal distribution, i.e. having zero kurtosis. Additionally, for the sake of simplicity it can be taken for granted that all source signals are zero mean, E{s_i} = 0, 0 \le i < m.

The implementation of an ICA can be seen in principle as the search for an m x n-dimensional linear filter matrix W = [w^{(0)} ... w^{(m-1)}]^T whose output

y = W x,   y_i = \sum_{j=0}^{n-1} w^{(i)}_j x_j

reconstructs the source signals s. Ideally the problem could be solved by choosing W according to

W A = I_m,

where I_m represents the m-dimensional unit matrix. But it is clear that the source signals can only be recovered arbitrarily permuted and with a scaling factor, possibly leading to a change of sign, because there is a priori no predetermination which filter leads to which source signal. This means it is impossible to distinguish A s from \tilde{A} \tilde{s} with \tilde{A} = A (P S) and \tilde{s} = (P S)^{-1} s, where P represents an arbitrary orthogonal permutation matrix and S a scaling matrix with nonzero diagonal elements [2].

The determination of an arbitrary mixing matrix A can be reduced to the problem of finding an orthogonal matrix U by using second-order information of the input data [3][9][12]. This can be done by whitening or sphering the data via an m x n-dimensional whitening matrix W_s obtained from the correlation matrix R_x ([R_x]_{ij} = E{x_i x_j}) of x, leading to

z = W_s x,   R_z = E{z z^T} = I_m,

with I_m the m-dimensional unit matrix. The m x m-dimensional orthogonal matrix U that has to be determined after this preprocessing is then given by

z = W_s x = W_s A s = U s. (2)
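As an illustration of this preprocessing step, the following NumPy sketch estimates the correlation matrix from samples and spheres the mixtures; the function name and the symmetric (eigendecomposition-based) form of W_s are our choices and are not prescribed by the paper.

```python
import numpy as np

def whiten(x):
    """Sphere zero-mean mixtures x of shape (T, n): returns z with R_z close to I."""
    x = x - x.mean(axis=0)                   # enforce zero mean
    R_x = (x.T @ x) / x.shape[0]             # correlation matrix E{x x^T}
    eigval, eigvec = np.linalg.eigh(R_x)
    W_s = eigvec @ np.diag(1.0 / np.sqrt(eigval)) @ eigvec.T   # W_s = R_x^{-1/2}
    z = x @ W_s.T                            # whitened data, eq. (2): z = W_s x
    return z, W_s

z, W_s = whiten(x)                           # x from the mixing sketch above
```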

2 Determination of the mixing matrix

Several algorithms for ICA have been proposed recently (see e.g. [11] and references therein). We focus here on an algebraic approach called JADE (Joint Approximative Diagonalization of Eigenmatrices) [6].

2.1 Basic Definitions

Consider z as an m-dimensional zero-mean, real-valued random vector. The second- and fourth-order moment and correlation tensors of z are given by Corr(z_i, z_j) = E{z_i z_j} and Corr(z_i, z_j, z_k, z_l) = E{z_i z_j z_k z_l}, respectively, where E denotes


the expectation. The corresponding cumulant tensors of z are defined as the coefficients of the Taylor expansion of the cumulant generating function [10] \Psi(k) = \log(\Phi(k)), where \Phi(k) is the Fourier transform of the probability density function (pdf) p(z). Relations exist between moment and cumulant tensors, which for the fourth-order cumulant read

Cum(z_i, z_j, z_k, z_l) = E{z_i z_j z_k z_l} - E{z_i z_j} E{z_k z_l} - E{z_i z_k} E{z_l z_j} - E{z_i z_l} E{z_j z_k}, (3)

whereby E{z_i} = 0, 0 \le i < m is assumed. Often autocumulants play a decisive role and are given by

\sigma_i = Cum(z_i, z_i)   (variance),    \kappa_i = Cum(z_i, z_i, z_i, z_i)   (kurtosis).

The Fourth Order Signal Subspace (FOSS) is defined as the range of the linear mapping

M \mapsto Q_z(M),   [Q_z(M)]_{ij} = \sum_{k,l=0}^{m-1} Cum(z_i, z_j, z_k, z_l) M_{kl},

where M denotes an arbitrary m x m-dimensional real matrix. The matrices Q_z(M) will be called cumulant matrices in the following.
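For concreteness, a cumulant matrix can be estimated directly from whitened, zero-mean samples by combining the empirical fourth-moment contraction with the second-order corrections of eq. (3); the following sketch (function name ours) is a straightforward, unoptimized estimator.

```python
import numpy as np

def cumulant_matrix(z, M):
    """Estimate [Q_z(M)]_ij = sum_kl Cum(z_i, z_j, z_k, z_l) M_kl from samples.

    z: zero-mean samples of shape (T, m); M: arbitrary m x m real matrix.
    """
    T, m = z.shape
    R = (z.T @ z) / T                        # second moments E{z_i z_j}
    w = np.einsum('tk,kl,tl->t', z, M, z)    # w_t = z_t^T M z_t
    Q = (z.T * w) @ z / T                    # sum_kl E{z_i z_j z_k z_l} M_kl
    Q -= R * np.sum(R * M)                   # minus E{z_i z_j} E{z_k z_l} M_kl
    Q -= R @ M @ R                           # minus E{z_i z_k} E{z_l z_j} M_kl
    Q -= R @ M.T @ R                         # minus E{z_i z_l} E{z_j z_k} M_kl
    return Q
```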

2.2 Representations of the cumulant matrices

For the determination of the orthogonal mixing matrix U according to the whitened model (2) it will be necessary to represent the cumulant matrices first by the orthogonal mixing matrix and second by an eigendecomposition of the fourth-order cumulant of z.

Mixing matrix A. Using the multilinearity property of the fourth-order cumulant [5] leads to

Cum(z_i, z_j, z_k, z_l) = \sum_{\alpha,\beta,\gamma,\delta=0}^{m-1} u^{(\alpha)}_i u^{(\beta)}_j u^{(\gamma)}_k u^{(\delta)}_l \kappa_{\alpha\beta\gamma\delta}, (4)

with \kappa_{\alpha\beta\gamma\delta} = Cum(s_\alpha, s_\beta, s_\gamma, s_\delta). This yields a representation of the cumulant matrices by the orthogonal mixing matrix U

Q_z(M) = U \Lambda(M) U^T = \sum_{\alpha,\beta=0}^{m-1} [\Lambda(M)]_{\alpha\beta} u^{(\alpha)} u^{(\beta)T}, (5)

[\Lambda(M)]_{\alpha\beta} := \sum_{\gamma,\delta=0}^{m-1} \kappa_{\alpha\beta\gamma\delta} \sum_{k,l=0}^{m-1} u^{(\gamma)}_k M_{kl} u^{(\delta)}_l.


At this point nothing has been assumed about the statistical structure of the sources. From the statistical independence of the components of s it follows that cumulants of all orders of s are diagonal, leading to [\Lambda(M)]_{ij} = \delta_{ij} \kappa_i u^{(i)T} M u^{(j)}. The FOSS is thus given as

Range(Q_z) = Span(u^{(0)} u^{(0)T}, ..., u^{(m-1)} u^{(m-1)T})
           = { M | M = \sum_{p=0}^{m-1} c_p u^{(p)} u^{(p)T} }
           = { M | M = U \Lambda U^T, \Lambda diagonal }. (6)

This means that the dimensionality of the FOSS equals m, the number of sources.

Eigenmatrices of the fourth-order cumulant. The fourth-order cumulant tensor of z is (super)symmetric, which means that Cum(z_i, z_j, z_k, z_l) is invariant under every permutation of z_i, z_j, z_k, z_l. By resorting the m^4 elements of the fourth-order cumulant tensor into an m^2 x m^2-dimensional symmetric matrix (stacking-unstacking, see [6]), an eigenvector decomposition can be performed, leading to an eigenmatrix decomposition of Cum(z_i, z_j, z_k, z_l) (after rearranging the resulting eigenvectors) such that

Cum(z_i, z_j, z_k, z_l) = \sum_{p=0}^{m^2-1} \lambda^{(p)} M^{(p)}_{ij} M^{(p)}_{kl} (7)

holds with symmetric m x m-dimensional eigenmatrices M^{(p)}, 0 \le p < m^2. We will assume that there exists an m x m-dimensional orthogonal matrix D = [d^{(0)} ... d^{(m-1)}] diagonalizing jointly all eigenmatrices of Cum(z_i, z_j, z_k, z_l):

D^T M^{(p)} D = \Lambda^{(p)} = Diag(\mu^{(p)}_0, ..., \mu^{(p)}_{m-1}),   0 \le p < m^2, (8)

where Diag(.) denotes the m x m-dimensional diagonal matrix with the m arguments as diagonal elements. The joint diagonalizer D can be found by a maximization of the joint diagonality criterion [7]

c(V) = \sum_{p=0}^{m^2-1} |Diag(V^T M^{(p)} V)|^2, (9)

where |Diag(.)| is the norm of the vector of diagonal matrix elements, which is equivalent to a minimization of

\sum_{p=0}^{m^2-1} off(V^T M^{(p)} V), (10)


where off(W) for an arbitrary m x m-dimensional matrix W = (w_{ij})_{0 \le i,j < m} is defined as

off(W) := \sum_{i,j=0, i \ne j}^{m-1} |w_{ij}|^2.

It can be shown [5] that (9) is equivalent to

c(V) = \sum_{i,k,l=0}^{m-1} |Cum(h_i, h_i, h_k, h_l)|^2, (11)

where h = V^T z. Consequently the maximization of c(V) is the same as a minimization of the squared (cross-)cumulants with distinct first and second indices. For V = U, h is equivalent to the source signals s.
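The following sketch evaluates the diagonality measures of (9) and (10) for a candidate orthogonal matrix V and a set of eigenmatrices; it illustrates the criterion itself, not the Jacobi-type optimization used to maximize it, and the function names are ours.

```python
import numpy as np

def off(W):
    """off(W): sum of squared off-diagonal elements of a square matrix."""
    return np.sum(np.abs(W) ** 2) - np.sum(np.abs(np.diag(W)) ** 2)

def joint_diagonality(V, eigenmatrices):
    """Joint diagonality criterion c(V) of eq. (9) for a candidate orthogonal V.

    Maximizing c(V) is equivalent to minimizing sum_p off(V^T M^(p) V), eq. (10).
    """
    c = 0.0
    for M in eigenmatrices:
        D = V.T @ M @ V
        c += np.sum(np.diag(D) ** 2)   # |Diag(V^T M^(p) V)|^2
    return c
```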

According to (8), for each M^{(p)} there now exists an eigenvector decomposition with

M^{(p)} = \sum_{i=0}^{m-1} \mu^{(p)}_i d^{(i)} d^{(i)T}. (12)

Using (7) and (12) leads to a new representation of the cumulant matrices

Q_z(M) = D \Gamma(M) D^T,   \Gamma(M) = Diag(\gamma_0(M), ..., \gamma_{m-1}(M)), (13)

with \gamma_r(M), 0 \le r < m, defined as

\gamma_r(M) = \sum_{p=0}^{m^2-1} \sum_{q=0}^{m-1} \lambda^{(p)} \mu^{(p)}_r \mu^{(p)}_q \left( \sum_{k,l=0}^{m-1} d^{(q)}_k M_{kl} d^{(q)}_l \right). (14)

The eigenmatrix decomposition of the fourth-order cumulant leads to m^2 symmetric matrices M^{(p)}. On the other hand, the set of cumulants of order d of a real m-dimensional random vector z forms a real vector space with dimension

D(m, d) = \binom{m + d - 1}{d}.

In the generic case, as defined in [8], the dimension or generic width G(m, d) is even smaller than D(m, d). A few values for G(m, d) are given in Table 1, also compared to D(m, d) and m^2. Additionally, it follows from (6) that only m out of the m^2 possible eigenvalues \lambda^{(p)} are nonzero, which means that only m eigenmatrices should really be important.
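As a quick check, the dimension formula reproduces the values listed in Table 1 below; the helper name is ours.

```python
from math import comb

def dim_cumulant_space(m, d=4):
    """D(m, d) = C(m + d - 1, d): dimension of the space of order-d cumulants."""
    return comb(m + d - 1, d)

print([dim_cumulant_space(m) for m in (4, 5, 6)])   # -> [35, 70, 126], cf. Table 1
```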

Determination of the mixing matrix. From (5) and (13) we can see that the cumulant matrices are diagonalized by the orthogonal mixing matrix U and by the joint diagonalizer D of the set of eigenmatrices of the fourth-order cumulant. Because an eigenvector decomposition is only unique up to an arbitrary orthogonal matrix, the question arises whether the orthogonal mixing matrix U can be identified by the joint diagonalizer of the eigenmatrices M^{(p)}, 0 \le p < m. The answer is yes and the reason is given by Theorem 2 in [12], which states that D is equal to the transpose (= inverse) of U up to a sign permutation and a rescaling if two conditions are fulfilled:

1. Cum(h_i, h_j) = \delta_{ij}
2. Cum(h_i, h_j, h_k, h_l) = 0 for at least two nonidentical indices,

with h = D^T z. While the first condition is fulfilled by our orthogonal model (2), condition two is given by the way the joint diagonalizer D is determined through (11).

Table 1. Comparison of the possible dimension of the real vector space given by the set of cumulants Cum(z_i, z_j, z_k, z_l) (d = 4) for various dimensionalities m of the real random vector z.

m   m^2   G(m, 4)   D(m, 4)
4    16        10        35
5    25        15        70
6    36        22       126

2.3 Neural learning of eigenmatrices

Recently [13] an extension of Oja's learning algorithm for PCA has been devised to account for higher-order correlations within any given input data. The main idea is that capturing higher-order statistics of the input space implies the use of a neuron which is capable of accepting input from two or more channels at once. In the single neuron case, the learning rule is given as

\Delta w_{ij}(t) = \eta(t) y(t) { z_i(t) z_j(t) - y(t) w_{ij}(t) } (15)

with

y(t) = \sum_{i,j=0}^{m-1} w_{ij}(t) z_i(t) z_j(t). (16)

Averaging over the ensemble of input data used for training the network leads to an eigenequation of the fourth-order correlation tensor

\sum_{k,l=0}^{m-1} Corr(z_i, z_j, z_k, z_l) M_{kl} = \mu M_{ij},

\mu = \sum_{i,j,k,l=0}^{m-1} M_{ij} Corr(z_i, z_j, z_k, z_l) M_{kl}.

From equation (3) it can be seen that the main difference between the fourth-order cumulant and the corresponding correlation tensor is an explicit suppression of two-point correlations. Taking the latter into account, we propose a new weight update rule

\Delta w_{ij}(t) = \eta(t) y(t) { (z_i(t) z_j(t) - \delta_{ij}) - w_{ij}(t) (y(t) - Tr(W)) },

where the weights M after convergence accomplish an eigenequation of the fourth-order cumulant

\sum_{k,l=0}^{m-1} Cum(z_i, z_j, z_k, z_l) M_{kl} = \lambda M_{ij},

\lambda = \sum_{i,j,k,l=0}^{m-1} M_{ij} Corr(z_i, z_j, z_k, z_l) M_{kl} - 2 - Tr^2(M). (17)

Tr(.) denotes the trace of the matrix argument. The corresponding weight update rule (18) in the case of m output neurons can be found straightforwardly, where w^{(p)}_{ij} denotes the weight matrix of output neuron p connecting to input neurons i and j. The upper bound of the sum over q representing the decay term is intentionally unspecified. The learning rule can be implemented in two different ways, namely with 0 \le q < m (Oja-type) and 0 \le q \le p (Sanger-type). While in the first case the resulting weight matrices belong to approximately equal eigenvalues, the second case leads to weights whose corresponding eigenvalues are obtained in decreasing order. The latter thus can give information about the number of eigenmatrices necessary to span the FOSS.
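Since equation (18) is not reproduced above, the following sketch shows one plausible multi-neuron generalization of the single-neuron rule preceding (17); the exact form of the decay term is an assumption on our part, with the range of the sum over q switching between the Oja-type and Sanger-type variants described in the text.

```python
import numpy as np

def update_weights(W, z, eta, sanger=True):
    """One plausible multi-neuron update (a sketch, not eq. (18) verbatim).

    W: array of shape (m_out, m, m) holding the symmetric weight matrices W^(p).
    z: one whitened input sample of shape (m,).
    """
    m_out, m, _ = W.shape
    outer = np.outer(z, z)                        # z_i(t) z_j(t)
    y = np.einsum('pij,ij->p', W, outer)          # y^(p) = sum_ij w_ij^(p) z_i z_j
    dW = np.zeros_like(W)
    for p in range(m_out):
        q_max = p + 1 if sanger else m_out        # Sanger-type: q <= p, Oja-type: q < m
        decay = sum(W[q] * (y[q] - np.trace(W[q])) for q in range(q_max))
        dW[p] = eta * y[p] * ((outer - np.eye(m)) - decay)
    return W + dW
```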

Finding an (approximative) orthogonal joint diagonalizer D by minimizing (10) can be interpreted as determining something like an average eigenstructure [7]. Since the criterion can only be minimized but cannot generally be driven to zero, this notion corresponds only to an approximate simultaneous diagonalization, though the average eigenstructure nevertheless is well defined.

3 Experimental Results

To investigate the properties of the neural implementation of the JADE-algorithm, we applied it to the problem of separating arbitrary mixtures of real-world grayscale images, also known as Blind Source Separation (BSS). The image ensemble can be found in figure 1.

For simplicity we took the same number of sources, mixtures and signals to be recovered (n = m). The source signals s_i(x, y), where i denotes the image number and 0 \le x, y < 256 the position within the image, are given as the pixel values of the images. The components of the mixing matrix have been chosen normally distributed from the interval [-1.0, 1.0], but the special choice of the interval proved not to be too important. For the results obtained using the Oja- respectively Sanger-type learning rule, the same mixing matrix has been used for each m under consideration.

Fig. 1. Image ensemble used to evaluate the algorithm developed within this paper. It consists of 1. the three letters ICA, 2. a painting by Franz Marc titled 'Der Tiger', 3. an urban image, 4. normally distributed noise, 5. the Lena image and 6. a natural image gathered from the natural image ensemble used in [2]. They are all 256 x 256 pixels in size with pixel values in the interval [0, ..., 255]. The images have been normalized to yield unit variance and the mean pixel value has been subtracted from each picture.

Since statistical independence of the source signals is an important condition to separate the source signals from the mixtures, we calculated the source correlation matrix (0 \le i, j < m)

S_{ij} := \frac{1}{256^2} \sum_{x,y=0}^{255} s_i(x, y) s_j(x, y).

We found many components to be (slightly) different from zero, indicating inter-image correlations violating the statistical independence assumption of the sources. ICA has been realized with m = 4, 5, 6 of the images using the learning rule with the Oja- and the Sanger-type decay term. After convergence of the neural network the joint diagonalizer for the set of weight matrices has been calculated using an extension of the Jacobi-technique [7]. The resulting value of the cross-talking error E of the product P = D^T U with

E := \sum_{i=0}^{m-1} \left( \sum_{j=0}^{m-1} \frac{|p_{ij}|}{\max_k |p_{ik}|} - 1 \right) + \sum_{j=0}^{m-1} \left( \sum_{i=0}^{m-1} \frac{|p_{ij}|}{\max_k |p_{kj}|} - 1 \right) (19)

has been calculated to get a measure of how well the demixing or separation has been performed. The closer E is to zero the better the separation, but a value E \approx 1-3 usually indicates good demixing.
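A compact sketch of this performance measure is given below (function name ours); it computes E for a given product matrix P = D^T U and vanishes exactly when P is a scaled permutation matrix, i.e. for perfect demixing.

```python
import numpy as np

def cross_talking_error(P):
    """Cross-talking error of eq. (19) for the product P = D^T U."""
    P = np.abs(P)
    rows = np.sum(P / P.max(axis=1, keepdims=True), axis=1) - 1.0  # row-wise terms
    cols = np.sum(P / P.max(axis=0, keepdims=True), axis=0) - 1.0  # column-wise terms
    return rows.sum() + cols.sum()
```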

Table 2 summarizes the experimental results. It can be seen that the average eigenstructure can be determined better with the Oja-type learning rule, leading to much better separation results (see also [15]). Figure 2 shows the eigenvalues obtained using the Sanger-type learning rule. For the determination of the joint diagonalizer D here only the first 10, 15, 22 (m = 4, 5, 6) eigenmatrices have been used. After convergence the weights with numbers greater than 10, 15, 22 (m = 4, 5, 6) have died away, which means that their norm converged to zero (see table 1).


Table 2. Summary of the simulation results with m = 4, 5, 6. The table shows the cross-talking error E as defined in (19) obtained using the Oja-type and the Sanger-type learning rule (18).

m   E (Oja)   E (Sanger)
4      2.05         7.95
5      1.92        16.69
6      3.98        25.25

Fig. 2. Eigenvalues \lambda + 2 according to (17), plotted against the weight number, with m = 4, 5, 6 using the Sanger-type learning rule. The whole set of possible eigenmatrices has been calculated but only a subset has been used for the determination of the joint diagonalizer.

4 Discussion

Use of nonlinearities. Many ICA algorithms incorporate non-linearities in some more or less specified way. The main idea behind this is that non-linearities provide a way to introduce higher-order statistical properties of the input data into the calculations. Details about how they influence the determination of the mixing matrix A are rarely given. With this deficiency in mind, the choice of the non-linearity has been called a 'black art' in [1]. To overcome this unsatisfactory situation we tried to avoid any kind of 'arbitrary' parameter within our model. The kind of higher-order information that is used in the ICA algorithm we propose is well known as fourth-order correlations.

Computational effort. The original implementation of the JADE-algorithm as can be found in [6] incorporates the calculation of the fourth-order cumulant from the data samples together with an eigenvector decomposition. But severe problems arise with high input dimensionalities. The original JADE-algorithm could never be applied to a problem like the one in [2], where the input dimensionality is 144. Experiments showed that it is often not necessary to calculate all possible eigenmatrices of the fourth-order cumulant but it is sufficient to determine the average eigenstructure from a subset of all eigenmatrices.

Statistical properties of the input data. Higher-order statistical properties of high-dimensional data are hard to investigate because of the difficulty of visualizing the results, which often remain unimaginable. In the case of fourth-order information (correlation and cumulant tensors) the eigenmatrix decomposition leads to two-dimensional structures that can easily be visualized. One example where this could be useful are natural images and the influence of their statistical properties on the development of our visual system [2][14].

Acknowledgement. This work has been supported by a grant from the Claussen Stiftung, Stifterverband für die deutsche Wissenschaft, Essen, Germany.

References

1. Anthony J. Bell and Terrence J. Sejnowski. An information-maximisation approach to blind separation and blind deconvolution. Neural Computation, 7:1129-1159, 1995.

2. Anthony J. Bell and Terrence J. Sejnowski. The 'independent components' of natural scenes are edge filters. Vision Research, 37(23):3327-3338, 1997.

3. Jean-Francois Cardoso. Source separation using higher order moments. In Proceedings of the ICASSP, pages 2109-2112, Glasgow, 1989.

4. Jean-Francois Cardoso. Fourth-order cumulant structure forcing, application to blind array processing. In Proceedings of the 6th workshop on statistical signal and array processing (SSAP 1992), pages 136-139, Victoria, Canada, 1992.

5. Jean-Francois Cardoso and Pierre Comon. Independent component analysis, a survey of some algebraic methods. In Proceedings ISCAS 1996, pages 93-96, 1996.

6. Jean-Francois Cardoso and Antoine Souloumiac. Blind beamforming for non-gaussian signals. IEE Proceedings - Part F, 140(6):362-370, 1993.

7. Jean-Francois Cardoso and Antoine Souloumiac. Jacobi angles for simultaneous diagonalization. SIAM Journal on Matrix Analysis and Applications, 17(1):161-164, 1996.

8. P. Comon and B. Mourrain. Decomposition of quantics in sums of powers of linear forms. Signal Processing, 53(2):96-107, 1996.

9. Pierre Comon. Independent component analysis, a new concept? Signal Processing, 36:287-314, April 1994.

10. Gustavo Deco and Dragan Obradovic. An Information-Theoretic Approach to Neural Computing. Perspectives in Neural Computing. Springer, New York, Berlin, Heidelberg, 1996.

11. Te-Won Lee, Mark Girolami, Anthony J. Bell, and Terrence J. Sejnowski. A unifying information-theoretic framework for independent component analysis. International Journal on Mathematical and Computer Modeling, in press, 1998.

12. Jean-Pierre Nadal and Nestor Parga. Redundancy reduction and independent component analysis: Conditions on cumulants and adaptive approaches. Neural Computation, 9:1421-1456, 1997.

13. J. G. Taylor and S. Coombes. Learning higher order correlations. Neural Networks, 6:423-427, 1993.

14. Christian Ziegaus and Elmar W. Lang. Statistics of natural and urban images. Lecture Notes in Computer Science (Proceedings ICANN 1997, Lausanne), 1327:219-224, 1997.

15. Christian Ziegaus and Elmar W. Lang. Independent component extraction of natural images based on fourth-order cumulants. In Proceedings of the ICA (Independent Component Analysis) 1999, in press.

