Variable Selection for Discriminant Analysis of Fish Sounds Using Matrix Correlations

Mark WOOD, Ian T. JOLLIFFE, and Graham W. HORGAN

Discriminant analysis is a widely used multivariate technique. In some applications the number of variables available is very large and, as with other multivariate techniques, it is desirable to simplify matters by selecting a subset of the variables in such a way that little useful information is lost in doing so. Many methods have been suggested for variable selection in discriminant analysis; this article introduces a new one, based on matrix correlation, an idea that has proved useful in the context of principal component analysis. The method is illustrated on an example involving fish sounds. It is important to discriminate between the sounds made by different species of fish, and even by individual fish, but the nature of the data is such that many potential variables are available.

Key Words: Canonical variate analysis; Feature selection; Genetic algorithm; Generalized coefficient of determination; Wavelets.

1. INTRODUCTION

Discriminant analysis is a well-known family of techniques which examine the relationship between membership of one of several groups or populations and a set of interrelated variables—see McLachlan (1992) for an extensive treatment of the subject. Examples include deciding the species of an organism from measurements on that organism, determining land use from remotely sensed satellite data, and diagnosing to which of a number of disease categories a patient belongs on the basis of symptoms.

In some cases the number of variables available for discriminant analysis is very large. For example, in determining to which of a number of varieties a carrot belongs, there may be hundreds of variables describing both the shape (outline) and the texture of the carrot—see, for example, Davey, Horgan, and Talbot (1997) and Horgan (2001). Another example concerns sounds made by fish, where again hundreds of potential variables are available to distinguish between different individual fish or different species of fish. In such cases it is desirable to reduce the number of variables included in a discriminant analysis, for

Mark Wood completed this work while a Research Student, and Ian T. Jolliffe is Professor (Emeritus) of Statistics, the University of Aberdeen. Graham W. Horgan is head of the Biomathematics and Statistics Scotland Group, Rowett Research Institute, Greenburn Road, Aberdeen, AB21 9SB, UK (E-mail: [email protected]).

©2005 American Statistical Association and the International Biometric Society
Journal of Agricultural, Biological, and Environmental Statistics, Volume 10, Number 3, Pages 321–336
DOI: 10.1198/108571105X58540

several reasons: to make interpretation easier, to decrease the computational burden, and to improve statistical stability of the results. Often a considerable reduction can be made without degrading the performance of the discriminant analysis in terms of its probabilities of misclassification. There are many methods of selecting variables, or feature selection as it is known in the pattern recognition and data mining literatures (Gose, Johnsonbaugh, and Jost 1996; Hand, Mannila, and Smyth 2001). Some of these methods were described by McKay and Campbell (1982a, b), McLachlan (1992, chap. 12), and Gose et al. (1996). In this article we concentrate on methods which are related to those described by Cadima and Jolliffe (2001) in the context of principal component analysis.

The motivating example for our work consists of data on sounds made by fish, specifically three haddock. These sounds vary from individual to individual and depend also on the time of year and behavior of the fish. It is of interest to be able to distinguish between fish using the sounds (Hawkins, Wood, and Casaretto 2001). Further information and background for these data will be given in Section 2. The data for the fish take the form of segments of time series during the period when the sounds are made. Identifying the segments and constructing variables that describe the structure of the segments are nontrivial problems. Section 3 summarizes these preprocessing steps, which were largely carried out using wavelets (Daubechies 1992; Bruce and Gao 1996; Abramovich, Bailey, and Sapatinas 2000). For a chosen wavelet basis, the wavelet coefficients are used to provide a set of features for discrimination.

Section 4 outlines some basic ideas of discriminant analysis, and goes on to describe the criteria that are used for variable/feature selection in our example. These criteria are applied and results presented in Section 5. Concluding remarks and discussion are given in Section 6.

2. THE DATA

Marine biologists have been studying the sounds produced by certain species of fish for some time—see, for example, Hawkins and Rasmussen (1978). They have examined the biological mechanisms by which fish generate and also hear sounds. Researchers are equally interested in understanding why some species of fish emit sounds, how and why their calls vary in different situations, how the sounds relate to individual behavior, and the impact of these sounds on nearby fish.

The data used in this analysis are sound recordings of the haddock (Melanogrammus aeglefinus) provided by the Fisheries Research Services' Marine Laboratory in Aberdeen, UK. It is known that male haddock are most vocal during the spawning season (February–May), and Hawkins, Casaretto, and Picciulin (2002) used the distinctive characteristics of male haddock calls to locate spawning concentrations of this species in the sea.

The behavior and sound produced by three male haddock were recorded in a semi-annular tank (90 m³) over the year 2000 spawning season. The fish were kept under a simulated natural day/night cycle, in terms of lighting and water temperature. A video camera situated above the tank recorded the behavior of the fish while a broad-band hydrophone placed in the tank detected the sounds. The sounds were amplified by a preamplifier and sampled at a frequency of 8 kHz.

Figure 1. Section of recorded sound.

A short section of a recording known to have been produced by a single fish is shown in Figure 1. As this example shows, the haddock emits low frequency sounds, often composed of nearly identical transient units. These units, referred to as "knocks," are produced at varying rates depending on the fish's behavior. The waveform of the individual knocks produced by a single male tends to remain constant, and to differ from those of the other two males at any particular time during the spawning season; however, the shapes of the individual knocks do change with time. Examples of the different waveforms for the three male haddock (A, B, and C) are shown in Figure 2.

Each of the waveforms shown in Figure 2 is composed of 256 amplitudes, standardized to have unit variance. These were obtained from the original recording using a wavelet threshold technique described in Section 3.

3. FEATURE EXTRACTION

We begin this section with a brief overview of the wavelet transform. Further details will be reported elsewhere.

Suppose that the amplitude of the sound produced by a haddock is given by some unknown function of time, f(t). The wavelet decomposition of f is given by

f(t) = \sum_{k=-\infty}^{\infty} c_{J,k} \phi_{J,k}(t) + \sum_{j=-\infty}^{J} \sum_{k=-\infty}^{\infty} d_{j,k} \psi_{j,k}(t),    (3.1)

where the \phi_{J,k} are predefined scaling functions, and the \psi_{j,k} are related wavelet functions.

Figure 2. Standardized haddock knocks.

These two families of functions are parameterized by scale (or resolution) j and location k. The first sum in (3.1) gives a coarse approximation to f, and adding each of the sums in the second part (decreasing j from J down to 1) improves this approximation. If f has been sampled (uniformly) at 256 = 2^8 points, we obtain an approximation to (3.1) given by

f(x) \approx \sum_{k} c_{J,k} \phi_{J,k}(x) + \sum_{j=1}^{J} \sum_{k} d_{j,k} \psi_{j,k}(x)    (3.2)

for a chosen J ∈ {1, 2, . . . , 8}, and for a range of k ∈ Z, which both depend on the choice of wavelet basis. For such a sampled signal, there will be 256 wavelet coefficients c_{J,k} and d_{j,k}. If the wavelet basis chosen is a good one, many of the wavelet coefficients will be negligible, and the signal may be represented by considerably fewer than 256 points. It was found that a member of the Coiflet family of wavelet bases (c12) produced the best results, both in terms of its ability to isolate the individual knocks and the features it produced for discrimination. This family, which is included in the S-Plus (Insightful Corp., Washington, USA) add-on module S+Wavelets, is an orthogonal basis with compact support, designed to be as nearly symmetric as possible. c12 is smoother than the c6 basis.

In order to capture the individual knocks (Figure 2) from the recorded sounds (Figure 1), we made use of the multiresolution analysis provided by the wavelet transform. The small scale coefficients d_{1,k}, d_{2,k}, . . . correspond to local detail in the signal, and the large-scale coefficients . . . , d_{J,k}, c_{J,k} correspond to large features in the signal. Using the large scale coefficients of a section of background noise, thresholds were obtained whereby if the large scale coefficients of a section of recorded sound exceeded the thresholds then the sound was considered to be a large feature (i.e., a knock) rather than background noise. The individual knocks are about 20 ms long, and the inter-knock length was such that each knock could be captured in a 32-millisecond (ms) time interval, corresponding to 256 amplitude measurements, without overlap. This automated method of extracting knocks from the sound recordings was quick, but had the disadvantage that "spurious" signals caused by splashes and haddock hitting the hydrophone were also captured.

Figure 3. Plot of the d4 coefficients in descending order of absolute value.

Features for discriminating between the three male haddock as well as the group of spurious signals were calculated using the stationary wavelet transform (Nason and Silverman 1997). The knocks in Figure 2 are not in the same position within the 32-ms time interval. As a result, the standard wavelet transforms of knocks from the same fish are different. The stationary wavelet transform, however, is location invariant and avoids the need for knock registration before applying the standard wavelet transform.

Using the c12 wavelet basis, the individual knocks were decomposed into four resolution levels d1, d2, d3, d4, and s4 using the stationary wavelet transform. The vectors d1, d2, . . . , s4 contain the coefficients d_{1,k}, d_{2,k}, . . . , c_{4,k} respectively from the series expansion (3.2). This produces 256 × 5 = 1280 stationary wavelet coefficients (the stationary wavelet transform gives an over-determined, or redundant, transformation). The goal is to use these coefficients, or some simple function of them, to discriminate between the knocks. Plots of these coefficients in descending order of absolute value for d1–s4 showed clear differences between the three haddock (e.g., Figure 3 shows plots of d4 coefficients for two different sounds from each fish), and on the basis of these plots, features were extracted from individual knocks in the following way.

1. Order the coefficients in d1 in descending order of magnitude. Call these d1(1), d1(2), . . . , d1(256).
2. Divide the first 150 into 15 sets of 10: {d1(1), . . . , d1(10)}, . . . , {d1(141), . . . , d1(150)}.
3. The mean of the 10 coefficients in a set forms one feature. This gives 15 features for d1.
4. Repeat with the coefficients in d2, d3, d4, and s4, giving a total of 75 features. (A small sketch of this construction is given below.)

Although this is a great reduction in dimensionality from the original 256 amplitude measurements needed to describe each knock, and it is believed that these features have captured important differences between the waveforms of the three fish, 75 features is still a large number of variables for a discriminant analysis. In the following section we discuss this problem and introduce a new measure for selecting a subset of the variables which we show can discriminate as well as the full set of variables.
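As an illustration only, the R sketch below carries out steps 1–4 for a single knock, assuming that its 256 × 5 matrix of stationary wavelet coefficients (columns d1–d4 and s4) is already available. The object coef_mat, the random numbers used to fill it, and the choice to average the signed coefficients rather than their absolute values are assumptions made for the sketch, not the authors' code.

# Feature construction of Section 3, steps 1-4, for one knock.
# 'coef_mat' stands in for the 256 x 5 matrix of stationary wavelet
# coefficients of a single knock; random values are used so the sketch runs.
set.seed(1)
coef_mat <- matrix(rnorm(256 * 5), nrow = 256,
                   dimnames = list(NULL, c("d1", "d2", "d3", "d4", "s4")))

knock_features <- function(coef_mat) {
  one_level <- function(v) {
    v_sorted <- v[order(abs(v), decreasing = TRUE)]           # step 1: order by magnitude
    sets     <- split(v_sorted[1:150], rep(1:15, each = 10))  # step 2: 15 sets of 10
    sapply(sets, mean)                                        # step 3: mean of each set
  }
  as.vector(apply(coef_mat, 2, one_level))                    # step 4: 5 levels x 15 = 75 features
}

features <- knock_features(coef_mat)
length(features)  # 75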

4. FEATURE SELECTION

Discriminant analysis is a statistical technique that assigns observations to one of several distinct populations, based on measurements made on the observations, or features derived from them. The results shown in the following section were achieved using a linear discriminant analysis (LDA) rule, although quadratic and regularized discriminant analysis were also tried, and other forms of discrimination are available; for example, the nearest neighbor classifier. Discriminant analysis is covered in many textbooks, for example, McLachlan (1992), Krzanowski (1996), and Gnanadesikan (1997), and may be formulated as follows.

Suppose that we have an n × p data matrix X of observations on p variables from n individuals which may be categorized into g groups. We suppose that there are n_i, i = 1, 2, . . . , g individuals in each group, so that n = \sum_{i=1}^{g} n_i. The jth individual from group i (one of the rows of X) can be represented by the vector

x_{ij} = (x_{i1}^{[j]}, x_{i2}^{[j]}, \ldots, x_{ip}^{[j]})'.

The sample mean vector for the ith group is given by \bar{x}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} x_{ij}, and the overall mean vector is given by \bar{x} = \frac{1}{n} \sum_{i=1}^{g} n_i \bar{x}_i.

If we assume that the covariance matrix in each group is the same, then a pooled within-groups covariance matrix may be defined as

W = \frac{1}{n - g} \sum_{i=1}^{g} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)'.    (4.1)

The sample linear discriminant rule assigns an observation x to the group i for which

(\bar{x}_i - \bar{x}_j)' W^{-1} \{ x - \tfrac{1}{2}(\bar{x}_i + \bar{x}_j) \} > 0 \quad \forall\ j \neq i.    (4.2)
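For readers who want the rule in computational form, the following R sketch evaluates (4.1) and (4.2) directly; it is a minimal illustration, and the feature matrix X and group labels grp are placeholder data rather than the haddock features. Maximizing the score below is algebraically equivalent to checking (4.2) for every j ≠ i, because the cross terms in (4.2) cancel.

# Pooled within-groups covariance (4.1) and the linear discriminant rule (4.2).
# 'X' (n x p features) and 'grp' (group labels) are placeholders so the sketch runs.
set.seed(1)
X   <- matrix(rnorm(60 * 4), nrow = 60)
grp <- factor(rep(c("A", "B", "C"), each = 20))

lda_rule <- function(X, grp) {
  lev   <- levels(grp)
  means <- lapply(lev, function(lv) colMeans(X[grp == lv, , drop = FALSE]))
  n <- nrow(X); g <- length(lev)
  W <- Reduce(`+`, lapply(seq_along(lev), function(i) {        # Equation (4.1)
    Xi <- X[grp == lev[i], , drop = FALSE]
    crossprod(sweep(Xi, 2, means[[i]]))
  })) / (n - g)
  Winv <- solve(W)
  classify <- function(x) {
    # score_i = xbar_i' W^{-1} x - (1/2) xbar_i' W^{-1} xbar_i; rule (4.2) picks the largest
    scores <- sapply(means, function(m)
      drop(t(m) %*% Winv %*% x - 0.5 * t(m) %*% Winv %*% m))
    lev[which.max(scores)]
  }
  apply(X, 1, classify)
}

table(lda_rule(X, grp), grp)  # resubstitution confusion table on the placeholder data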

The above rule requires the inversion of the p × p covariance matrix W. If the number of features, p, is large, then this inversion can be difficult or even impossible. Also, when p is large, large sample sizes are required to obtain stable estimates of the population covariance matrix. Brown, Fearn, and Haque (1999) stated that there are basically two approaches to dealing with these problems. Either reduce the dimensionality to obtain accurate estimates of the covariance matrix and apply standard discriminant analysis techniques, or introduce some sort of regularization or smoothing in the estimated full-dimensional covariance matrices, such as in regularized discriminant analysis as proposed by Friedman (1989). The latter, however, retains features which may be irrelevant to the analysis and may mask genuine effects which exist in other features. Such techniques are also computationally intensive, and we have taken the dimension-reduction approach.

McKay and Campbell (1982a, b) described and compared several feature-selection techniques. They proposed a measure of "additional information" which may be gained by adding features to a previously selected subset. This measure is used in a stepwise selection procedure, but they declared that exact significance levels for the tests which admit or reject features are not known. A common, simple approach that they also described is to calculate F ratios for testing whether the population mean of a feature is the same in each group. This is equivalent to performing a standard ANOVA on each feature—the F ratio is essentially the ratio of a feature's variation between groups to its variation within groups. If the F ratio is large, then we may believe that the feature is good for discrimination. The results using this method are compared with the results obtained using the new measure introduced in the following. The F ratio approach, however, like several others, ignores the fact that sets of features may collectively contribute to group discrimination, even if they do not appear to do so individually.
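A minimal sketch of this F ratio screening in R, assuming a placeholder feature matrix X and grouping factor grp, is:

# One-way ANOVA F ratio for each feature: between-group relative to
# within-group variation. 'X' and 'grp' are placeholder data.
set.seed(1)
X   <- matrix(rnorm(60 * 10), nrow = 60)
grp <- factor(rep(c("A", "B", "C"), each = 20))

f_ratios <- apply(X, 2, function(feature) {
  anova(lm(feature ~ grp))[["F value"]][1]
})

order(f_ratios, decreasing = TRUE)  # features ranked by F ratio, largest first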

Canonical variate analysis (CVA) provides more possibilities for feature selection. In CVA we seek linear combinations of the p features, y = a_i'x, known as canonical variates (CVs), which project the data onto an r-dimensional subspace (r ≤ min(p, g − 1)) in order to maximize the between-group to within-group variation, subject to the CVs being uncorrelated within groups and between groups.

Using the notation above, define

B = \frac{1}{g - 1} \sum_{i=1}^{g} n_i (\bar{x}_i - \bar{x})(\bar{x}_i - \bar{x})'    (4.3)

to be the between-groups "covariance" matrix. For the first CV we need to find a_1 which maximizes

l = \frac{a_1' B a_1}{a_1' W a_1}.    (4.4)

It can be shown (Krzanowski 1996, sec. 11.1) that this reduces to solving

[B - l W] a_1 = 0    (4.5)

or [W^{-1} B - l I] a_1 = 0.

Hence the maximum value of l is the largest eigenvalue of W^{-1}B, so we take a_1 to be the eigenvector corresponding to the largest eigenvalue. Krzanowski (1996) also showed that the eigenvector a_2 corresponding to the second largest eigenvalue gives the direction along which (4.4) is second largest subject to "uncorrelatedness" conditions—a_1'W a_2 = 0 and a_1'B a_2 = 0. Similarly the eigenvectors a_3, a_4, . . . , a_r lead to the remaining possible CVs.

Because the eigenvectors in (4.5) are not unique (e.g., c a_1 satisfies (4.5) for any constant c ∈ R), the CVs are standardized to have unit variance within groups so that if A is the matrix whose columns are the standardized CV coefficients, then A'WA = I. Putting all CVs together, we can write (4.5) as

BA = WAL    (4.6)

where L = diag(l_1, l_2, . . . , l_r).

The use of CVA in feature selection was documented by McKay and Campbell (1982a).

Very often these procedures are based on the elements of the vectors a_i. For example, if the first element of a_1 is large and the second element is close to 0, then we would favor the first feature over the second, as the second adds nothing to the value of the first CV. However, it is important to remember that this situation could well be reversed when considering the second CV. This suggests that it would be better to select features which are highly correlated with the first few CVs; however, McKay and Campbell (1982a) noted that important separating variables may have low correlations, particularly with a CV which represents a contrast among the features.

A common dimension reduction technique is principal component analysis (PCA). Jolliffe (2002, sec. 9.1) deals with discrimination in the context of PCA. PCA finds linear combinations of the data, known as principal components (PCs), that successively explain the maximum variation among the variables. Very often a small number of these PCs explain most of the variation, so p-dimensional data can be reduced to q (< p) dimensions by evaluating the value of each observation on each of the first q PCs. The first q PCs are usually chosen as they explain the greatest percentage of total variation, but it does not automatically follow that they should provide good features for discrimination.

Jolliffe, Morgan, and Young (1996) investigated the selection of a subset of PCs which provide good discrimination. They showed that in the two-group case, the decision of whether or not to include a particular PC in the discriminant function can be based on a test whose test statistic has a t-distribution. Corbitt and Ganesalingam (2001) compared Jolliffe et al.'s (1996) approach to selecting PCs for linear discriminant analysis with that proposed by Dillon, Mulani, and Frederick (1989). Dillon et al. calculated measures of the proportion of the variation in the kth PC which is due to between-group variation. These measures are used to rank the PCs so that the best subset of them can be selected. Corbitt and Ganesalingam showed that both approaches lead to similar discrimination success rates, but that Jolliffe et al.'s (1996) method is to be preferred as it bases its selection on the value of a test statistic. Corbitt and Ganesalingam (2001) also showed that the two approaches are not necessarily optimal; they found that, using the same number of selected original variables as PCs, the selected variables achieved the highest success rates.

The issue of which of the original variables best approximate the full set of variables or a preselected set of PCs was addressed by Cadima and Jolliffe (2001). Our new measure is derived in a similar way to the measures that Cadima and Jolliffe developed, but we look for the subset of original features which can discriminate between the groups as successfully as the full set of features.

Above we saw that the canonical variates could be formed by maximizing the ratio

\frac{a' B a}{a' W a},    (4.7)

where B and W are the between- and within-group covariance matrices. It is convenient for the derivation of the new measure to use the matrices of sums of squares and cross products, B_0 = (g - 1)B and W_0 = (n - g)W. These can be written in terms of the n × p data matrix X by introducing a new matrix C which defines the group structure of the data, and 1_n—a vector of 1's. C is an n × g matrix with a 1 in position (i, j) if the ith row of the data matrix X belongs to group j. If the rows of X occur in group order, then C takes the following form:

C = \begin{pmatrix} 1_{n_1} & 0 & \cdots & 0 \\ 0 & 1_{n_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1_{n_g} \end{pmatrix} \qquad \text{and} \qquad 1_n = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix},

where n = \sum_i n_i, 1_{n_i} denotes a column of n_i ones, and 1_n is an n-vector.

If we define

P_C = C (C'C)^{-1} C'

to be the projection matrix associated with C, and

P_{1_n} = 1_n (1_n' 1_n)^{-1} 1_n'

to be the projection matrix associated with 1_n, then B_0 = X'(P_C - P_{1_n})X and W_0 = X'(I - P_C)X. The total sums of squares and cross-products matrix can then be defined as T_0 = B_0 + W_0 = X'(I - P_{1_n})X.

To keep things tidy, we first center X by premultiplying it by (I - P_{1_n}). Because P_C and P_{1_n} are idempotent symmetric matrices, and P_C P_{1_n} = P_{1_n} P_C = P_{1_n}, we can show that if Y = (I - P_{1_n})X then

T_0 = Y'Y \quad \text{and} \quad B_0 = Y'(P_C - P_{1_n})Y.

Maximizing (4.7) is therefore equivalent to maximizing

\frac{a' B_0 a}{a' T_0 a} = \frac{a' Y'(P_C - P_{1_n}) Y a}{a' Y' Y a}.

Using this criterion, the CVs formed from the full dataset are given by YA, where A is the p × s matrix whose columns are the eigenvectors of T_0^{-1} B_0 in descending order of eigenvalues, and s = min(p, g - 1).
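The projection-matrix construction of B_0 and T_0, and the canonical variates YA, translate almost line for line into R; the sketch below uses placeholder data and is an illustration of the formulas above rather than the authors' implementation.

# B_0, T_0 and canonical variates via the projection matrices P_C and P_1n.
# 'X' (n x p) and 'grp' are placeholder data so the sketch runs.
set.seed(1)
n <- 60; p <- 5
X   <- matrix(rnorm(n * p), nrow = n)
grp <- factor(rep(c("A", "B", "C"), each = 20))

C    <- model.matrix(~ grp - 1)               # n x g indicator matrix of group membership
ones <- matrix(1, n, 1)
P_C  <- C %*% solve(crossprod(C)) %*% t(C)    # projects each point onto its group mean
P_1  <- ones %*% t(ones) / n                  # projects each point onto the overall mean

Y  <- (diag(n) - P_1) %*% X                   # column-centered data
T0 <- t(Y) %*% Y                              # total SSCP matrix, T_0 = Y'Y
B0 <- t(Y) %*% (P_C - P_1) %*% Y              # between-groups SSCP matrix

s <- min(p, nlevels(grp) - 1)
# Eigenvectors of T_0^{-1} B_0; Re() is taken because eigen() on a
# non-symmetric matrix may return numerically complex values.
A  <- Re(eigen(solve(T0) %*% B0)$vectors[, 1:s, drop = FALSE])
CV <- Y %*% A                                 # canonical variates for the full set of variables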

Following the notation in Cadima and Jolliffe (2001), let K be a set of k integers chosen from the set {1, 2, . . . , p} to identify a subset of the variables. Define I_K to be the p × k matrix formed from the p × p identity matrix by removing those columns not in K. Then Y I_K is the n × k submatrix of Y containing only those variables in K.

The CVs formed from those variables in K are given by Y I_K A_K, where A_K is the k × t matrix whose columns are the eigenvectors of T_K^{-1} B_K in descending order of eigenvalues and where t = min(k, g - 1). T_K = I_K' T_0 I_K is formed from T_0 by removing those rows and columns not in K. Similarly B_K = I_K' B_0 I_K.

The measure that we propose is the matrix correlation between the matrix of projections onto the space spanned by the t CVs formed by the k-variable subset, and the projection matrix P_C - P_{1_n}. This is a direct application of Yanai's generalized coefficient of determination (Ramsay, ten Berge, and Styan 1984), and is an indicator of the similarity between two subspaces. The projection matrix P_C projects onto the subspace defined by the g groups (points belonging to the same group are sent to the group sample mean), and P_{1_n} projects all the points onto a single overall sample mean, ignoring any group structure. We therefore want to project onto that space which maximizes group separation (P_C), and orthogonal to the space which provides no group separation (P_{1_n}). Thus, our measure is

\mathrm{GCD} = \mathrm{corr}\left( Y I_K A_K (A_K' I_K' Y' Y I_K A_K)^{-1} A_K' I_K' Y',\; P_C - P_{1_n} \right)

= \frac{\mathrm{tr}\left[ Y I_K A_K (A_K' I_K' Y' Y I_K A_K)^{-1} A_K' I_K' Y' (P_C - P_{1_n}) \right]}{\sqrt{\mathrm{tr}\left[ Y I_K A_K (A_K' I_K' Y' Y I_K A_K)^{-1} A_K' I_K' Y' \right] \, \mathrm{tr}\left[ P_C - P_{1_n} \right]}}

= \frac{\mathrm{tr}\left[ (A_K' I_K' Y' Y I_K A_K)^{-1} A_K' I_K' Y' (P_C - P_{1_n}) Y I_K A_K \right]}{\sqrt{t(g - 1)}}

= \frac{\mathrm{tr}\left[ (A_K' I_K' T_0 I_K A_K)^{-1} A_K' I_K' B_0 I_K A_K \right]}{\sqrt{t(g - 1)}}

= \frac{\mathrm{tr}\left[ (A_K' T_K A_K)^{-1} A_K' B_K A_K \right]}{\sqrt{t(g - 1)}}.

From an equation similar to (4.6), we have B_K A_K = T_K A_K L_K, from which it follows that

\mathrm{GCD} = \frac{\mathrm{tr}\left[ (A_K' T_K A_K)^{-1} A_K' T_K A_K L_K \right]}{\sqrt{t(g - 1)}} = \frac{\mathrm{tr}[L_K]}{\sqrt{t(g - 1)}}.

But L_K is the diagonal matrix whose entries are the eigenvalues of T_K^{-1} B_K, thus

\mathrm{GCD} = \frac{\mathrm{tr}\left[ T_K^{-1} B_K \right]}{\sqrt{t(g - 1)}}.    (4.8)
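Equation (4.8) makes the criterion cheap to evaluate for any candidate subset K: only the k × k blocks T_K and B_K are needed. A hedged R sketch, again on placeholder data standing in for the 847 × 75 haddock feature matrix, is:

# GCD of Equation (4.8): tr(T_K^{-1} B_K) / sqrt(t (g - 1)) for a subset K.
# 'X' and 'grp' are placeholder data, not the haddock features.
set.seed(1)
X   <- matrix(rnorm(200 * 20), nrow = 200)
grp <- factor(sample(c("A", "B", "C", "spurious"), 200, replace = TRUE))

gcd <- function(K, X, grp) {
  n   <- nrow(X)
  C   <- model.matrix(~ grp - 1)
  P_C <- C %*% solve(crossprod(C)) %*% t(C)
  P_1 <- matrix(1 / n, n, n)
  Y   <- (diag(n) - P_1) %*% X
  T0  <- t(Y) %*% Y
  B0  <- t(Y) %*% (P_C - P_1) %*% Y
  TK  <- T0[K, K, drop = FALSE]               # T_K = I_K' T_0 I_K
  BK  <- B0[K, K, drop = FALSE]               # B_K = I_K' B_0 I_K
  t_k <- min(length(K), nlevels(grp) - 1)
  sum(diag(solve(TK) %*% BK)) / sqrt(t_k * (nlevels(grp) - 1))
}

gcd(c(1, 4, 7, 12, 19), X, grp)  # GCD for one candidate 5-feature subset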

5. RESULTS

We have searched the feature space for subsets of size 5–12, for the May haddock data described in Sections 2 and 3, which maximize the new criterion (4.8). These subsets have been used in a discriminant analysis, and the success rates compared with those obtained if the largest F ratios were used to select the features, as well as those obtained using randomly selected feature subsets.

A genetic algorithm was used to determine the subsets which maximized the GCD (4.8); however, other optimization techniques are available. The genetic algorithm used and more details about its parameters were given by Cadima, Cerdeira, and Minhoto (2004). An initial population of 195 observations was selected at random from the available 847 observations, as the genetic algorithm used limited the initial population size to 200. The breakdown into separate groups was: 31 A haddock, 86 B haddock, 40 C haddock, and 35 spurious sounds.

The other parameters of the genetic algorithm were set so that five generations were evolved (to reduce computational time), and the number of clones allowed in each generation was restricted to three to maintain sufficient genetic diversity and to avoid convergence on a local optimum. For each subset, the genetic algorithm was run five times to allow for the fact that not every feature is guaranteed to be in the initial population.

The features selected by the genetic algorithm were used in a discriminant analysis which was run on all 847 observations. To make comparisons easier, linear discriminant analysis—Equation (4.2)—was used, with leave-one-out cross-validation (Lachenbruch and Mickey 1968) to obtain the classification success rates.
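Success rates of the kind reported in Table 2 can be reproduced in outline with the lda function in the R package MASS, whose CV = TRUE argument returns leave-one-out predictions; the feature matrix, labels, and subset below are placeholders, not the haddock data.

# Leave-one-out cross-validated LDA success rate for a chosen feature subset K.
library(MASS)
set.seed(1)
X   <- matrix(rnorm(200 * 20), nrow = 200)
grp <- factor(sample(c("A", "B", "C", "spurious"), 200, replace = TRUE))
K   <- c(1, 4, 7, 12, 19)

fit <- lda(X[, K, drop = FALSE], grouping = grp, CV = TRUE)  # leave-one-out predictions
success_rate <- 100 * mean(fit$class == grp)
success_rate  # percentage correctly classified, as reported in Table 2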

Table 1 shows, for subsets of size 5–12, the features selected by the genetic algorithm which optimize GCD. It is clear that selecting more features tends to produce (though not always) larger values for GCD.

It is interesting to note which features are selected by the criterion. The most commonly picked feature is 46, which is the mean of the 10 coefficients in level d4 with the largest absolute values. Features 18, 35, 47, and 49 also occur frequently. The majority of selected features are from levels d3 and d4, but it is surprising to see a substantial minority from levels d1 and d2, which are thought to contain only coefficients corresponding to background noise and would not normally be considered to be useful for discrimination. Table 1 also gives the 12 features with the largest F ratios, and we observe two interesting properties.

Table 1. May Haddock Features Selected Using GCD via a Genetic Algorithm

No. of features   Features                                        GCD
 5                19, 37, 41, 46, 50                              .653
 6                18, 33, 43, 46, 49, 69                          .671
 7                12, 20, 35, 43, 46, 48, 70                      .670
 8                4, 9, 10, 19, 35, 43, 46, 50                    .678
 9                18, 22, 34, 35, 44, 47, 48, 69, 71              .688
10                1, 15, 16, 34, 44, 47, 48, 54, 68, 73           .691
11                6, 9, 14, 20, 24, 27, 34, 42, 47, 49, 66        .701
12                2, 6, 18, 21, 35, 42, 43, 47, 49, 50, 53, 66    .707

12 features with largest F ratios (in descending order):
35, 34, 33, 36, 32, 37, 46, 31, 47, 48, 38, 63

First, the features selected are all very close to each other in terms of how they were derived. This is a good example of why one should not select features one at a time, because in this case they all have high F ratios simply because they all take similar values. The second interesting point is that, among the first five features with the largest F ratios, only one is among the features more frequently selected using GCD.

Table 2 compares the LDA success rates (using leave-one-out cross-validation) achieved using different criteria for selecting the best subset. For the larger subsets, the success rates for the subsets selected by their F ratios are only slightly worse than for those selected by GCD. However, selection based on the highest F ratios does much worse than the features selected by the GCD until feature 46 enters the "highest F ratio features." Recall that feature 46 was the most commonly selected feature by the GCD criterion for the smaller subsets. Comparing the values of GCD in Table 1 with the corresponding success rates in the final column of Table 2, we see that there is not an obvious correspondence between the values of GCD and success rate. After including seven features, there is no improvement in success rate when additional features are included, although GCD continues to rise.

To investigate the relationship between GCD and the corresponding discrimination success rate, 99 subsets each of size 5, 9, and 12 were generated at random from the 75 stationary wavelet features, without replacement. For each subset, the LDA success rate and values of the GCD were calculated. Scatterplots of success rate against GCD are shown in Figure 4.

Figure 4 shows positive relationships between the measures and success rate.

Table 2. Comparison of Success Rates With Different Feature Selection Methods

                      LDA success rates (%) using
No. of features   Largest F ratios   GCD
 5                80.28              89.49
 6                80.64              91.15
 7                88.31              92.44
 8                89.26              91.38
 9                88.90              91.97
10                89.49              91.15
11                89.85              91.15
12                90.08              90.91

Figure 4. Plots of success rate against GCD for randomly generated subsets of size 5 (top), 9 (bottom left), and 12 (bottom right).

The correlations between success rate and GCD are .934, .884, and .907, respectively, for subsets of size 5, 9, and 12. These results show that choosing the subset with the highest value of GCD does not necessarily guarantee the best discrimination, although the high positive correlations show that it is better to choose a subset with a large value for the GCD than one with a low value.

6. DISCUSSION

There are a number of aspects of the research described earlier that could be expanded or extended. Some of these extensions have been investigated; others await further research. Many methods have been suggested for variable selection in discriminant analysis, in a number of different literatures. It has not been the objective of this article to review or compare them. Instead we have concentrated on a new measure, which has theoretical appeal, and have shown that it does well compared to a standard use of F ratios, in discriminating

between sounds made by fish.

As well as different methods of variable selection, there are also different techniques for discrimination. Our results are based on linear discriminant analysis, but quadratic discriminant analysis and regularized discriminant analysis have also been tried on the haddock data. Regularized discriminant analysis (Friedman 1989) is a family of techniques indexed by two parameters. One parameter, λ, determines to what extent the covariance matrices are shrunk towards a pooled value, and the second parameter, γ, shrinks the covariance matrices towards multiples of the identity matrix. γ = 0, λ = 1 corresponds to linear discriminant analysis, and γ = 0, λ = 0 to quadratic discriminant analysis. For most of the haddock data (except those recorded in April) little is gained, when using subsets of variables chosen by F tests, by moving far away from linear discriminant analysis towards a quadratic analysis, and there is no advantage in shrinking towards multiples of the identity matrix.

The data come in the form of time series, and before variable (feature) selection can take place, our variables (features) first need to be extracted. This is done using wavelets. Choice of wavelet basis and the way that features are constructed from the wavelets leads to many different possibilities. The choices used in our example are not completely arbitrary; they are based on comparisons made for this dataset and others. These comparisons rule out some options, but there still remain a number of possibilities with similar properties. Our choice is among the best. Wavelets can also be used in feature extraction for images, either in one dimension when using outlines of images, or two-dimensionally when textures of images are of interest.

The example discussed has its own peculiarities which need further research to tackle them comprehensively. One problem is that, as seen in Figure 1, the sounds (knocks) made by the fish are separated by periods of background noise. Identifying the knocks, and distinguishing them from spurious sounds, caused by splashes for example, is difficult. In our approach, "spurious sounds" form an extra group in the discriminant analysis, but it would be good to eliminate them altogether. There is also the question of whether the fish producing a knock was correctly identified. This can be tricky if two fish are more-or-less equidistant from the hydrophone. In identifying the correct fish, it should be noted that consecutive knocks recorded at the hydrophone are more likely to originate from the same fish than from different fish. The knocks produced by the fish change at different times of year. An analysis which models this aspect would be useful. The example discriminates between different fish of the same species in a tank. Similar techniques can be used in the commercially more important application of distinguishing between fish of different species in the open sea. Here the problem of correctly identifying the species making the sound is more difficult than in the confines of a tank.

For the purposes of producing software which may allow scientists to analyse sounds in real-time at sea, it would be preferable for all the analyses to be performed in a single environment. Wavelet analyses, matrix calculations, and genetic algorithms are all freely available in R (The R Project, www.r-project.org). The study did not involve any investigation into the performance of the genetic algorithm itself, and more carefully chosen parameter values may lead to improved classification.

ACKNOWLEDGMENTS

The fish sounds data were collected by Licia Casaretto and Tony Hawkins of the Fisheries Research Services' Marine Laboratory, Aberdeen. We are grateful to Jorge Cadima who supplied the code for the genetic algorithm. Much of the work was done while Mark Wood was a research student based at the University of Aberdeen and at BioSS, funded by BBSRC and BioSS. We are also grateful to a referee for helping us to clarify some aspects of our work.

[Received March 2004. Revised March 2005.]

REFERENCES

Abramovich, F., Bailey, T. C., and Sapatinas, T. (2000), "Wavelet Analysis and its Statistical Applications," Journal of the Royal Statistical Society, Ser. D, 49, 1–29.

Brown, P. J., Fearn, T., and Haque, M. S. (1999), "Discrimination With Many Variables," Journal of the American Statistical Association, 94, 1320–1329.

Bruce, A., and Gao, H.-Y. (1996), Applied Wavelet Analysis with S-PLUS, New York: Springer-Verlag.

Cadima, J., Cerdeira, J. O., and Minhoto, M. A. (2004), "Computational Aspects of Algorithms for Variable Selection in the Context of Principal Components," Computational Statistics and Data Analysis, 47, 225–236.

Cadima, J., and Jolliffe, I. T. (2001), "Variable Selection and the Interpretation of Principal Subspaces," Journal of Agricultural, Biological, and Environmental Statistics, 6, 62–79.

Corbitt, B., and Ganesalingam, S. (2001), "Comparison of Two Leading Multivariate Techniques in Terms of Variable Selection for Linear Discriminant Analysis," Journal of Statistics and Management Systems, 4, 93–108.

Daubechies, I. (1992), Ten Lectures on Wavelets, Philadelphia: Society for Industrial and Applied Mathematics.

Davey, J. C., Horgan, G. W., and Talbot, M. (1997), "Image Analysis: A Tool for Assessing Plant Uniformity and Variety Matching," Journal of Applied Genetics, 38, 120–135.

Dillon, W. R., Mulani, N., and Frederick, D. G. (1989), "On the Use of Component Scores in the Presence of Group Structure," Journal of Consumer Research, 16, 106–112.

Friedman, J. H. (1989), "Regularized Discriminant Analysis," Journal of the American Statistical Association, 84, 165–175.

Gnanadesikan, R. (1997), Methods for Statistical Data Analysis of Multivariate Observations (2nd ed.), New York: Wiley.

Gose, E., Johnsonbaugh, R., and Jost, S. (1996), Pattern Recognition and Image Analysis, New Jersey: Prentice Hall.

Hand, D., Mannila, H., and Smyth, P. (2001), Principles of Data Mining, Cambridge, MA: MIT Press.

Hawkins, A. D., Casaretto, L., and Picciulin, M. (2002), "Locating Spawning Haddock by Means of Sound," Bioacoustics—Special Issue on Fish Bioacoustics, 12, 284–286.

Hawkins, A. D., and Rasmussen, K. J. (1978), "The Calls of Gadoid Fish," Journal of the Marine Biology Association U.K., 58, 881–911.

Hawkins, A. D., Wood, M., and Casaretto, L. (2001), "Detection, Analysis and Discrimination of Underwater Sounds Produced by Marine Fish," in Proceedings of the Institute of Acoustics, 23, pp. 13(i)–13(ix).

Horgan, G. W. (2001), "The Statistical Analysis of Plant Part Appearance—A Review," Computers and Electronics in Agriculture, 31, 169–190.

Jolliffe, I. T. (2002), Principal Component Analysis (2nd ed.), New York: Springer-Verlag.

Jolliffe, I. T., Morgan, B. J. T., and Young, P. J. (1996), "A Simulation Study of the Use of Principal Components in Linear Discriminant Analysis," Journal of Statistical Computation and Simulation, 55, 353–366.

Krzanowski, W. J. (1996), Principles of Multivariate Analysis—A User's Perspective, Oxford, UK: Oxford University Press.

Lachenbruch, P. A., and Mickey, M. R. (1968), "Estimation of Error Rates in Discriminant Analysis," Technometrics, 10, 1–11.

McKay, R. J., and Campbell, N. A. (1982a), "Variable Selection Techniques in Discriminant Analysis I. Description," British Journal of Mathematical and Statistical Psychology, 35, 1–29.

McKay, R. J., and Campbell, N. A. (1982b), "Variable Selection Techniques in Discriminant Analysis II. Allocation," British Journal of Mathematical and Statistical Psychology, 35, 30–41.

McLachlan, G. J. (1992), Discriminant Analysis and Statistical Pattern Recognition, New York: Wiley.

Nason, G. P., and Silverman, B. W. (1997), "The Stationary Wavelet Transform and Some Statistical Applications," in Wavelets and Statistics, eds. A. Antoniadis and G. Oppenheim, New York: Springer-Verlag, pp. 281–300.

Ramsay, J. O., ten Berge, J., and Styan, G. P. H. (1984), "Matrix Correlation," Psychometrika, 49, 403–423.

