A maximum linear separation criterion for the analysis of neurophysiological data


Journal of Neuroscience Methods 214 (2013) 233– 245


Computational Neuroscience

Jose L. Marroquin a,∗, Omar Mendoza-Montoya b, Rolando J. Biscay c, Salvador Ruiz-Correa a, Thalia Harmony d, Thalia Fernandez d

a Center for Research in Mathematics (CIMAT), Apartado Postal 402, Guanajuato, Gto. 36000, Mexico
b Institut für Informatik, Freie Universität Berlin, 14195 Berlin, Germany
c CIMFAV, Universidad de Valparaiso, Chile
d Departamento de Neurobiologia Conductual y Cognitiva, Instituto de Neurobiologia, UNAM Campus Juriquilla, Queretaro 76230, Mexico

Highlights

A method for extracting features that differentiate two populations in a neurophysiological experiment is presented.
These features are efficiently computed and amenable to clear interpretation.
The features provide maximal separation of the subjects and may be used for classification of new subjects.
Examples using simulated and real data are presented.

Article info

Article history:
Received 17 October 2012
Received in revised form 28 January 2013
Accepted 31 January 2013

Abstract

In this paper we propose an approach for the extraction of features that differentiate two populations or two experimental conditions in a neurophysiological experiment. These features consist of summarizing variables defined as total activity (e.g., total normalized log-power), computed over sets of sites in a discrete domain, such as the time–frequency–topography space. These sets are obtained as those that maximize the linear separation between the two populations, and the corresponding maps provide information that may complement that obtained by standard procedures, such as statistical parametric mapping. It is shown experimentally, using both simulated and real data, that the proposed approach may provide useful information even when the standard procedures fail, due to the conservative nature of the multiple comparison correction that must be applied in the latter case.

© 2013 Elsevier B.V. All rights reserved.

Keywords: Linear separation; Classification; Electrophysiology; Hypothesis testing; Neurophysiology; Time–frequency
1. Introduction

In neuroscientific research, extracting informative and discriminative features from neurophysiological signals such as EEG, MEG, or fMRI is of critical importance for revealing distinctive brain activity features that differentiate two populations or two experimental conditions during the execution of a certain task.

∗ Corresponding author. Tel.: +52 473 7327155x49534; fax: +52 473 7325749. E-mail address: [email protected] (J.L. Marroquin).

0165-0270/$ – see front matter © 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.jneumeth.2013.01.027

In a typical experiment, the extraction of features involves three main steps. First, for each subject i in a group consisting of Ns subjects, one computes normalized variables Zij(u), where i = 1, ..., Ns and j = 1, ..., Ji are indices that refer to the jth trial for the ith subject, and u ∈ U indexes a discrete domain U. For example, in a typical electrophysiological experiment, recorded EEG signals can be represented in the time–frequency–topography (TFT) domain. More specifically, denote by Vi,j(t, e) a raw EEG signal recorded for the jth trial of the ith subject, where t represents the time course (frequently divided into the basal and post-basal periods), and e indexes an electrode site. The raw signal is processed by means of a bank of complex-valued quadrature filters tuned at different frequencies f = 1, ..., Nf. The output signals from the bank are further processed to obtain a log-amplitude signal for each frequency. The normalized variable Zij(t, f, e) is obtained by (a) computing the difference between the log-amplitude signal measured at time t (with fixed f and e) and the average of the log-amplitude signal measured over the basal period; and (b) dividing the obtained difference by the standard deviation measured over the basal period. This means that U (the TFT domain in this case) consists of elements u = (t, f, e) (Marroquin et al., 2004).
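The normalization in steps (a) and (b) can be sketched as follows. This is a minimal illustration, assuming the log-amplitude time course for one trial, frequency and electrode is already available as a list, and that its first `n_basal` samples form the basal period. The function name, its arguments and the use of the sample (n − 1) standard deviation are assumptions made here, not details given in the text.

```python
import statistics

def normalize_to_baseline(log_amp, n_basal):
    """Z-score a log-amplitude time course against its basal period.

    log_amp : list of floats, log-amplitude samples for a fixed (f, e)
    n_basal : number of leading samples forming the basal period (assumed layout)
    """
    basal = log_amp[:n_basal]
    mean_b = statistics.fmean(basal)
    # Sample standard deviation over the basal period (assumed non-zero).
    std_b = statistics.stdev(basal)
    # Step (a): subtract the basal mean; step (b): divide by the basal std.
    return [(x - mean_b) / std_b for x in log_amp]

# Toy time course: three basal samples followed by two post-basal samples.
z = normalize_to_baseline([1.0, 2.0, 3.0, 5.0, 8.0], n_basal=3)
```

With this convention the basal samples have zero mean and unit variance, so post-basal values are expressed in units of basal standard deviations.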

In the second step, for each element u ∈ U, the estimated average difference between groups d(u) with respect to the normalized variables is computed. Averages are taken with respect to trials and subjects. Finally, suitable procedures are applied to deal with statistical questions of interest concerning the sets A+ = {u ∈ U : E[d(u)] > 0} and A− = {u ∈ U : E[d(u)] < 0}, where E[·] denotes the mean value according to the sampling distribution. We call these sets A-Regions.

In this context, it is useful to distinguish between several related problems:

1. The problem of determining if there is a statistically significant difference between the responses of the subjects belonging to one group and those of the subjects belonging to the other group, regardless of the localization of the elements responsible for these differences.

2. Estimation of the A-Regions A± by means of data-dependent sets $\hat{A}_\pm$.

3. Finding a small set of summarizing variables that have a clear neurophysiological interpretation, and that can be used as features for the construction of a classifier that can predict the group membership of new subjects based on their responses.

The standard approach for statistical inference concerning the A-Regions is to detect (with controlled probability of false detection) whether there are some elements in U displaying statistical dissimilarities, which leads to estimated sets $\hat{A}_\pm$ consisting of those elements in U where significant differences are detected (Friston et al., 2007; Nichols, 2001). This often entails a univariate approach in which differences for each particular element u ∈ U are considered relevant if a test defined over a suitable statistic (e.g., t-test, F-test, etc.) exceeds a given statistical threshold. Because each element is considered separately, a very large number of statistical comparisons are needed. Therefore it is necessary to correct the statistical threshold of significance for the multiple-comparisons problem in order to properly control the number of false positives. This often makes the corresponding estimates for the sets A± too conservative, especially for those data sets in which the signal-to-noise ratio is relatively low.

In many cases, however, finding those elements in U that best characterize the membership of subjects in each group may be more important than achieving a strict false positive control. Specifically, the proposal we present here is to find a set of elements such that the total inter-group difference computed over this set separates the groups as much as possible. In this way, one may construct two related maps that are complementary, in the sense that they may lead to a better understanding of the data set being analyzed: one, constructed using the standard approach, which maps the elements in U where the difference between groups is significantly high (with strict false positive control), and the one we propose here, which maps the set of elements that best separate the two groups in terms of the total difference computed over this set.

In the approach presented here, problem 1 is solved by means of a single global statistical test. Problems 2 and 3 are tackled by computing a small set of summarizing variables, which are defined as total responses over the sets $\hat{A}_\pm$, and measuring the group separation in terms of them. These variables have a clear neurophysiological interpretation in many cases. For example, in the case of the analysis of electrophysiological experiments, they correspond to total normalized log-power differences with respect to the basal period for particular subsets of electrodes, over given time and frequency intervals. As we show in the next sections, classifiers for group membership of subjects based on these variables have very good performance, under some reasonable assumptions. As a measure of inter-group separation in the space spanned by these variables our proposal is to use the Fisher criterion (FC) (McLachlan, 1992; Duda et al., 2000), which may be computed as the ratio of inter-group to intra-group variances for these variables. Define

$$D_+ = \{u : d(u) > 0\}; \quad D_- = \{u : d(u) < 0\} \tag{1}$$

If one considers sets S+ ⊆ D+ and S− ⊆ D−, denoting by F(Y(S+, S−)) the theoretical FC based on the summarizing variables Y(S+, S−) defined as total responses over these sets, the estimators $\hat{A}_+$, $\hat{A}_-$ are defined as the maximizers of F(Y(S+, S−)) over all possible sets S+, S−, and in this sense, they may be interpreted as sets of elements that best characterize group membership.

The paper is organized as follows. A review of related work is presented in Section 1.1. The proposed method and some theoretical properties, as well as practical algorithms for implementing it, are presented in Section 2. Section 3 presents some experiments, performed with simulated and real data, to illustrate its performance. The corresponding results are presented in Section 4, and finally, a discussion and conclusions are presented in Section 5.

1.1. Related work

As mentioned above, the standard approach for the estimation of the sets A± by local hypothesis testing (LHT) entails the application of suitable corrections for keeping the overall false positive rate under control. Several solutions have been proposed to address this problem, such as the Bonferroni correction (Abdi, 2007) and approaches based on the family-wise error rate (FWER) (Nichols, 2002; Nichols and Hayasaka, 2003), or the false discovery rate (FDR) (Genovese et al., 2002). Although more permissive than the Bonferroni correction, these latter methods are in general conservative, with the FDR method being less conservative than FWER.

Closely related to the approach presented here are methods that try to exploit the fact that the regions A± usually have a higher degree of spatial contiguity than the fields generated under H0, i.e., A± usually consist of clusters of contiguous elements in U. Thus, the size of connected clusters of elements u where d(u) is above a given threshold may be used as a statistic for testing local hypotheses, as in the supra-threshold cluster size (STCS) method (Friston et al., 1994). Other methods based on similar ideas are: the supra-threshold cluster mass (Bullmore et al., 1999; Zhang et al., 2009), in which the local statistic is computed as the sum of the d values of all the connected elements that belong to the cluster; the threshold-free cluster enhancement technique (Smith and Nichols, 2009), in which the statistic of interest combines cluster sizes found using a set of thresholds; and the morphology-based hypothesis testing method (Marroquin et al., 2012), in which the corresponding statistic combines minimum values of d(u) (i.e., morphological erosions) in nested neighborhoods of different sizes.

In the approach presented here, the idea of considering connected clusters of elements with supra-threshold values of d(u) for a set of thresholds is considered as well. Unlike the above methods, however, the criterion for accepting or rejecting the inclusion of a connected cluster in the estimates $\hat{A}_\pm$ is not a p-value, but the fact that the proposed measure of inter-group separation increases. This is explained in detail in the next section.

Since our proposal includes the definition of summarizing variables and a separation criterion which is closely related to classification, it is related to the topics of variable selection, as well as to dimensionality reduction. In the context of variable selection, our approach is related to the wrapper methodology, in which the prediction performance of a classifier is used to assess the relative usefulness of subsets of features. It may also be considered as an embedded method (Kohavi and John, 1997; Guyon and Elisseeff, 2003), in which feature selection is performed in the process of training a linear classifier. Unlike the classic wrapper and embedded methods, however, we propose the use of a domain-specific structure, which includes the use of summarizing variables to construct the classifier, which is then based on a fixed, small number of features. This is important, since the number of variables is in our case very large (it equals the number of elements in U) and classifiers based on large numbers of variables with small sample sizes usually produce strongly biased results (Klemet et al., 2008).


Furthermore, the proposed summarizing variables have the advantage of allowing for a straightforward interpretation by neurophysiologists, in terms of total activity over specific regions, e.g., in TFT space.

Our method is also related to partial least squares (PLS) (Wold, 1966), which is a multivariate technique that has been used extensively in neuroimaging research (see Krishnan et al., 2011 and references contained therein). This technique was originally developed for multivariate linear regression, and it is especially suited for cases where the available data is limited and the number of predictors is high. It has also often been used for classification problems (Barker and Rayens, 2003; Boulesteix, 2004; Ding and Gentleman, 2004), by using a dummy matrix that records group membership. Relations and peculiarities of our approach in comparison with this PLS-based classification technique and with other regression-based classification methods that also involve dimensionality reduction and sparse representations are further elaborated in the discussion section.

2. Theory and methods

This section develops our approach for estimating the true A-Regions A+, A− ⊂ U exhibiting dissimilar activity between groups. We first introduce some notation.

The average normalized variable Zi for each subject i is defined as:

$$Z_i(u) = \frac{1}{J_i} \sum_{j=1}^{J_i} Z_{ij}(u) \tag{2}$$

where Zij and Ji are defined in the introduction. The average difference field d is defined as:

$$d(u) = \frac{1}{\sum_{i \in G_1} J_i} \sum_{i \in G_1} J_i Z_i(u) - \frac{1}{\sum_{i \in G_2} J_i} \sum_{i \in G_2} J_i Z_i(u) \tag{3}$$

where G1 and G2 are sets of subject indices that correspond to membership in groups 1 and 2, respectively. Given d, the sets D+, D− are defined by Eq. (1).

Given a set of elements S ⊂ U, the summarizing variable Yi(S) for subject i over set S is:

$$Y_i(S) = \sum_{u \in S} Z_i(u)$$
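The three definitions above (subject averages, the difference field of Eq. (3), and the summarizing variable) translate almost directly into code. The following is a minimal sketch, assuming the domain U is flattened to indices 0, ..., |U| − 1 and that fields are stored as plain lists; all function names are illustrative choices and not part of the original method description.

```python
def subject_average(trials):
    """Eq. (2): average Z_ij(u) over the trials j of one subject.
    trials: list of trials, each a list of values indexed by u."""
    n = len(trials)
    return [sum(t[u] for t in trials) / n for u in range(len(trials[0]))]

def difference_field(Z, J, G1, G2):
    """Eq. (3): trial-weighted group means, group 1 minus group 2.
    Z[i] is the averaged field of subject i, J[i] its number of trials."""
    def weighted_mean(G, u):
        return sum(J[i] * Z[i][u] for i in G) / sum(J[i] for i in G)
    return [weighted_mean(G1, u) - weighted_mean(G2, u)
            for u in range(len(Z[0]))]

def summarizing_variable(Zi, S):
    """Y_i(S): total response of one subject over the set of sites S."""
    return sum(Zi[u] for u in S)

# Toy data: three subjects, two sites; subjects 0 and 1 form group 1.
Z = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
J = [1, 3, 2]
d = difference_field(Z, J, G1=[0, 1], G2=[2])
D_plus = [u for u, du in enumerate(d) if du > 0]   # Eq. (1)
D_minus = [u for u, du in enumerate(d) if du < 0]
```

In this toy case the first site falls in D+ and the second in D−, and `summarizing_variable(Z[1], D_plus)` returns the total response of subject 1 over D+.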

Given two fixed sets S1, S2, the Fisher criterion, as used here, measures the performance of a linear discriminant function for classifying the subjects according to their membership in group 1 or 2, based on the vector $Y(S_1, S_2) = (Y(S_1), Y(S_2))^T$, where T denotes transpose. If this linear discriminant function is $f(Y; \beta) = \beta \cdot Y$, where β is a vector of coefficients, the theoretical FC F is given by:

$$F(Y(S_1, S_2)) = \frac{\left(\beta^T (E_1[Y(S_1, S_2)] - E_2[Y(S_1, S_2)])\right)^2}{\beta^T (V_1[Y(S_1, S_2)] + V_2[Y(S_1, S_2)]) \beta} \tag{4}$$

where E1, E2 denote expectations for groups 1 and 2, respectively, and V1 and V2 are the corresponding covariance matrices. Maximization with respect to β leads to the optimal linear discriminant function with

$$\beta = (V_1[Y] + V_2[Y])^{-1} (E_1[Y] - E_2[Y])$$

where the explicit dependence of Y on S1, S2 is omitted for brevity. Substituting this value of β into (4), one obtains F as the Mahalanobis distance:

$$F(Y) = (E_1[Y] - E_2[Y])^T (V_1[Y] + V_2[Y])^{-1} (E_1[Y] - E_2[Y]) \tag{5}$$


This expression corresponds to the theoretical FC, since it involves the true (population-based) expectations and covariances for the Y variables. These may be replaced by simple estimators obtained from the available data Z to obtain the (empirical) FC, $\hat{F}$:

$$\hat{F}(Y) = (\hat{E}_1[Y] - \hat{E}_2[Y])^T (\hat{V}_1[Y] + \hat{V}_2[Y])^{-1} (\hat{E}_1[Y] - \hat{E}_2[Y]) \tag{6}$$

with

$$\hat{E}_g[Y] = \frac{1}{|G_g|} \sum_{i \in G_g} Y_i \tag{7}$$

$$\hat{V}_g[Y] = \frac{1}{|G_g|} \sum_{i \in G_g} (Y_i - \hat{E}_g[Y])(Y_i - \hat{E}_g[Y])^T \tag{8}$$

for g ∈ {1, 2}, where | · | denotes the cardinality of a set.

Now assume that one has available two independent data sets $Z^{(1)}$, $Z^{(2)}$, each with N subjects. One may use $Z^{(1)}$ to compute the average difference field d, and from this, the sets D+, D−. Then, using $Z^{(2)}$, one may compute $\hat{F}(Y(S_+, S_-))$ for any sets S+ ⊆ D+, S− ⊆ D− using (6) (conventionally, if the sets S+, S− are empty, $\hat{F}$ is set to zero). The criterion that we propose to find the sets $\hat{A}_+$, $\hat{A}_-$ that estimate the true A-Regions is the maximization of $\hat{F}$ with respect to the sets S+, S−, i.e.,

$$(S_+^*, S_-^*) = (\hat{A}_+, \hat{A}_-) = \arg\max_{S_+, S_-} \hat{F}(Y(S_+, S_-)) \tag{9}$$
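As an illustration of Eqs. (6)–(8), the empirical Fisher criterion for the two-dimensional feature vector (Y(S+), Y(S−)) can be computed with a closed-form 2 × 2 inverse. This is only a sketch: the function name and data layout are assumptions, and it presumes that the pooled covariance is non-singular (for instance, that neither set is empty).

```python
def fisher_criterion(Y, G1, G2):
    """Empirical FC of Eq. (6) for 2-D features Y[i] = (Y_i(S+), Y_i(S-))."""
    def mean(G):
        # Eq. (7): group mean of each feature.
        return [sum(Y[i][k] for i in G) / len(G) for k in range(2)]

    def cov(G, m):
        # Eq. (8): covariance normalized by |G_g| (not |G_g| - 1).
        return [[sum((Y[i][a] - m[a]) * (Y[i][b] - m[b]) for i in G) / len(G)
                 for b in range(2)] for a in range(2)]

    m1, m2 = mean(G1), mean(G2)
    V1, V2 = cov(G1, m1), cov(G2, m2)
    W = [[V1[a][b] + V2[a][b] for b in range(2)] for a in range(2)]
    det = W[0][0] * W[1][1] - W[0][1] * W[1][0]
    if det == 0:
        raise ValueError("summed covariance matrix is singular")
    inv = [[W[1][1] / det, -W[0][1] / det],
           [-W[1][0] / det, W[0][0] / det]]
    delta = [m1[0] - m2[0], m1[1] - m2[1]]
    # Quadratic form delta^T W^{-1} delta.
    return sum(delta[a] * inv[a][b] * delta[b]
               for a in range(2) for b in range(2))

# Toy features for four subjects; subjects 0, 1 are group 1.
Y = [(1.0, 1.0), (3.0, -1.0), (-1.0, -1.0), (1.0, 1.0)]
F_hat = fisher_criterion(Y, G1=[0, 1], G2=[2, 3])
```

For these toy values the summed covariance is diagonal, so the result can be checked by hand: the mean difference is (2, 0) and each diagonal entry equals 2, giving a criterion of 2.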

The quantities $Y_i(S_+^*)/|S_+^*|$, $Y_i(S_-^*)/|S_-^*|$ have a clear neurophysiological interpretation, since they represent the average values of the normalized variable Zi(u) taken over sets of elements that best separate the two groups, and where the average difference between groups is respectively positive or negative. Therefore, this estimation procedure may also be understood as a method for reducing the dimensionality of the data, producing interpretable variables which best characterize subjects with respect to their membership in each group.

The value of the FC F is closely related to the performance of a linear classifier constructed using the summarizing variables. Let H denote the area under the ROC curve (Swets, 1996) for the linear classifier based on Y(S+, S−). One can then show that the following inequality is satisfied:

$$H \geq 1 - \frac{1}{1 + F} \tag{10}$$

(see Appendix A for a proof). This means that large values for the optimal F are associated with highly efficient classifiers. This is confirmed by our experiments (see Section 4).
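A small numerical check of inequality (10) may help build intuition. The sketch below computes the area under the ROC curve as the fraction of correctly ordered (group 1, group 2) pairs and compares it with the bound, using the scalar form of the Fisher criterion with variances normalized by group size. The scores are made-up values, oriented so that group 1 has the larger mean, and the function names are hypothetical.

```python
def auc(scores1, scores2):
    """Area under the ROC curve as the Mann-Whitney statistic:
    fraction of (group 1, group 2) pairs ranked correctly, ties counting 1/2."""
    wins = sum((a > b) + 0.5 * (a == b) for a in scores1 for b in scores2)
    return wins / (len(scores1) * len(scores2))

def fisher_1d(scores1, scores2):
    """Scalar Fisher criterion: squared mean difference over summed variances."""
    m1 = sum(scores1) / len(scores1)
    m2 = sum(scores2) / len(scores2)
    v1 = sum((s - m1) ** 2 for s in scores1) / len(scores1)
    v2 = sum((s - m2) ** 2 for s in scores2) / len(scores2)
    return (m1 - m2) ** 2 / (v1 + v2)

g1 = [2.0, 3.0, 5.0]   # discriminant scores, group 1 (illustrative values)
g2 = [1.0, 2.5, 0.0]   # discriminant scores, group 2 (illustrative values)
F = fisher_1d(g1, g2)
H = auc(g1, g2)
bound = 1.0 - 1.0 / (1.0 + F)   # right-hand side of inequality (10)
```

Here eight of the nine pairs are ordered correctly, so H = 8/9, which lies above the bound of roughly 0.64 obtained from F ≈ 1.8.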

As will be shown in Section 3 through simulated and real data, the sets $S_+^*$ and $S_-^*$ not only yield good performance of the classifier based on the features $Y(S_+^*)$ and $Y(S_-^*)$ (as theoretically expected from inequality (10)), but also represent good estimators for the true A-Regions A+ and A−. To get further insight on this issue, consider the asymptotic behavior of the following simple model:

Let the average normalized variables Zi be modeled as a random field:

$$Z_i(u) = \mu_i(u) + e_i(u) \tag{11}$$

where

$$\mu_i(u) = \begin{cases} a_{i+}(u), & \text{for } u \in A_+ \\ -a_{i-}(u), & \text{for } u \in A_- \\ 0, & \text{elsewhere} \end{cases} \tag{12}$$

where $a_{i+}(u)$, $a_{i-}(u)$ are positive (deterministic) functions. A common assumption in the analysis of neurophysiological data is that $a_{i+}(u) = a_+$, $a_{i-}(u) = a_-$ for all i, u and for some positive constants $a_+$, $a_-$.

The $e_i$ random variables are assumed to have zero mean (w.l.o.g.), and covariance:

$$\mathrm{Cov}(e_i(u), e_i(v)) = \sigma_e^2 \rho(u, v)$$

where the correlation coefficient ρ(u, v) is assumed to decrease with the distance between u and v at a sufficiently large rate, so that for any site u ∈ U,

$$\sum_{v \neq u} |\rho(u, v)| \ll \min(|A_+|, |A_-|)$$

This mixing assumption usually holds for neuroimages; in particular it holds if e is white noise, or follows a Markov random field model.

Suppose that the complete available sample of subjects is split into two sub-samples with index sets $G_1, G_2$ and $G'_1, G'_2$, respectively, and the sub-sample with indices $G'_1, G'_2$ is used to compute the field d using (3), and from this, the sets D+ = {u : d(u) > 0}, D− = {u : d(u) < 0} are determined.

Consider now two sets: S+ ⊂ D+, S− ⊂ D−. Computing the Y variables from the sub-sample with indices G1, G2, one gets that for the proposed model and for large cardinalities |G1|, |G2|, the Fisher criterion may be approximated by:

$$F(Y(S_+, S_-)) \approx \frac{(a_+ |S_+ \cap A_+|)^2}{\sigma_e^2 |S_+|} + \frac{(a_- |S_- \cap A_-|)^2}{\sigma_e^2 |S_-|}$$

which is maximized for S+ = A+, S− = A−.

Note that in this asymptotic case, F itself has false positive control built in, since if, for instance, S+ ⊃ A+, then F(Y(S+, S−)) < F(Y(A+, S−)), and similarly for the case where S− ⊃ A−. However, the requirement that different sub-samples are used to compute the field d (and hence, the sets D+, D−) and the Y variables is necessary for this criterion to work properly: if the same sample is used for both purposes, one is in fact sampling from the conditional distributions Pg(Z(u) | d(u) > 0) and Pg(Z(u) | d(u) < 0), for g = 1, 2, instead of sampling from Pg(Z(u)) for the computation of Eg[Y(S+, S−)], so that, even under the null hypothesis (i.e., for a+ = a− = 0), one may obtain high spurious values for F(Y(S+, S−)). The use of different sub-samples, on the other hand, implies that the maximum separability criterion is valid for unseen data, so that the Y variables obtained with the optimal sets $S_+^*$, $S_-^*$ may be effectively used for the classification of new subjects.
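A short worked case shows why the asymptotic approximation peaks at the true region. It considers the positive term only, and the numbers used ($a_+ = \sigma_e = 1$, $|A_+| = 25$) are illustrative assumptions rather than values taken from the experiments:

```latex
\begin{align*}
F(S_+) &\approx \frac{a_+^2\,|S_+ \cap A_+|^2}{\sigma_e^2\,|S_+|} \\[4pt]
S_+ = A_+ &: \quad F \approx \frac{25^2}{25} = 25 \\
S_+ \supset A_+,\ |S_+| = 30 &: \quad F \approx \frac{25^2}{30} \approx 20.8 \\
S_+ \subset A_+,\ |S_+| = 20 &: \quad F \approx \frac{20^2}{20} = 20
\end{align*}
```

Adding elements outside A+ enlarges only the denominator, while dropping elements of A+ shrinks the numerator quadratically and the denominator only linearly, so both kinds of error lower the criterion.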

This result may be generalized to the case where $a_{i+}(u) = a_+(u)$, $a_{i-}(u) = a_-(u)$ for all i, u and for some positive deterministic functions $a_+(u)$, $a_-(u)$. In this case, one may find by direct substitution a sufficient condition for the sets $S_\pm^*$ which maximize F to coincide with A±. This condition is that the minimum and maximum absolute values $a_{\min}$, $a_{\max}$ of $a_w(u)$ inside each region $A_w$, w ∈ {+, −}, satisfy:

$$\frac{a_{\min}}{a_{\max}} > 1 - |A_w| + \sqrt{(|A_w| - 1)^2 + |A_w| - 1}$$

This bound tends to 0.5 for large values of $|A_w|$, e.g., for $|A_w| = 100$ it equals 0.498. Note that even if this condition does not hold, if for any set S one has that $S \cap A_\pm^c \neq \emptyset$, where $A_\pm^c$ denotes the complement of A± in U, one will always have that, asymptotically, $F(S \cap A_\pm) > F(S)$, so that asymptotically, one always has false positive error control.
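The behavior of the sufficient condition as the region grows can be checked numerically. The helper below simply evaluates the right-hand side of the inequality; its name is an illustrative choice.

```python
import math

def ratio_bound(n):
    """Right-hand side of the sufficient condition on a_min / a_max
    for a region with n = |A_w| elements."""
    return 1 - n + math.sqrt((n - 1) ** 2 + n - 1)

# Evaluate the bound for a few region sizes.
bounds = {n: ratio_bound(n) for n in (1, 10, 100, 10000)}
```

For a single-element region the bound is 0, and it approaches 0.5 from below as the region grows, in line with the value quoted for $|A_w| = 100$.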

Note that from the definition of Zi given by Eq. (2), one may assume that the distribution of Zi is Gaussian. This assumption, however, is not necessary to justify the good performance of the method; since the features Yi are defined as sums of the Zi(u) variables over sets S+, S−, the method will still work if the distribution of Zi(u) is non-Gaussian, or even not linearly separable. This is shown in the simulations below.

2.1. Computationally efficient algorithms

The application of a direct search strategy for the maximization of (6) is not computationally feasible, since it implies a search over all subsets (i.e., the power set) of D+ and D−. For this reason, a computationally efficient heuristic is necessary. The basic idea on which this heuristic is based is to replace the search over the power sets by a search over the positive and negative excursion sets of the field d, i.e., Q+(C+) = {u : d(u) ≥ C+} and Q−(C−) = {u : d(u) ≤ −C−}, where C+, C− are positive thresholds. The rationale for this is that the way in which the sets A+, A− are usually estimated is also based on a search over the space spanned by these thresholds (with a suitable normalization of the field d), except that in that case one looks for values that keep the false positive rate below a prescribed significance value, while in this case one has an additional performance criterion, namely, the FC $\hat{F}$.

The proposed method is implemented using a two-phase algorithm: in phase 1, the idea is to find the thresholds $C_\pm^*$ for which the aggregated variables computed over the corresponding excursion sets maximize $\hat{F}$. These maximizing sets are called "seed regions". In phase 2, one allows these seed regions to grow by adding adjacent elements to them if this addition increases the value of $\hat{F}$. The final grown regions will be the estimates $\hat{A}_\pm$.

In detail, the goal of the phase 1 algorithm (which we call Maximum Linear Separation Phase 1, MLSP1) is to find thresholds $C_+^*$, $C_-^*$ such that

$$\hat{F}(Y(Q_+(C_+^*), Q_-(C_-^*))) = \max_{C_+, C_-} \hat{F}(Y(Q_+(C_+), Q_-(C_-))).$$

For this purpose, the discrete search space DSS1 for C+, C− is defined as:

$$\mathrm{DSS1} = \left\{ C_+, C_- : C_w = C_w^{\min} + j\, \frac{C_w^{\max} - C_w^{\min}}{N_w^{(1)} - 1},\ w \in \{+, -\},\ j = 0, \ldots, N_w^{(1)} - 1 \right\}$$

i.e., DSS1 is a discrete grid with spacing $(C_w^{\max} - C_w^{\min})/(N_w^{(1)} - 1)$, w ∈ {+, −} (see the implementation details below for the values of $C_w^{\max}$, $C_w^{\min}$ and $N_w^{(1)}$). The seed regions estimated by MLSP1 are: $S_+^* = Q_+(C_+^*)$, $S_-^* = Q_-(C_-^*)$.

In the case of phase 2, recomputing $\hat{F}$ each time an element is added to the seed regions is not computationally efficient. A more practical scheme also makes use of the excursion sets Q± by searching over thresholds (C+, C−) smaller than $(C_+^*, C_-^*)$, and allowing the seed regions to grow as much as possible inside the corresponding excursion sets before evaluating $\hat{F}$. Specifically, one grows $S_+^*$ and $S_-^*$ separately. For w ∈ {+, −}, and for each threshold $C_w$ in the search space, one finds the region $R_w(C_w)$ as the set of elements $u \in Q_w(C_w)$ that are arc-connected to $S_w^*$ via elements v also belonging to $Q_w(C_w)$. This may be efficiently computed using a standard region-growing algorithm (Pratt, 2007), which we call Grow-Region($S_w^*$, $Q_w(C_w)$), and which returns the corresponding region $R_w(C_w)$. If the thresholds $C_w$ decrease from $C_w^*$ at a sufficiently small rate, only a few elements will be added at a time before $\hat{F}$ is re-evaluated on the basis of the aggregated variables corresponding to the regions $R_w(C_w)$, so that the regions $R_w(C_w^{**})$ that maximize $\hat{F}$ will approximate the result of the exact phase 2 procedure. We call this approximation "Algorithm MLSP2".

The search space for phase 2, DSS2, is defined as:

$$\mathrm{DSS2} = \left\{ C_+, C_- : C_w = \frac{j\, C_w^*}{N_w^{(2)} - 1},\ w \in \{+, -\},\ j = 0, \ldots, N_w^{(2)} - 1 \right\}$$

where $C_\pm^*$ are the values obtained by MLSP1.


In precise terms, the complete algorithms are as follows.

Algorithm MLSP1

1. For each threshold pair (C+, C−) in the search space DSS1:
   (a) Compute the excursion sets Q+(C+), Q−(C−) and the summarizing variables Y(Q+(C+), Q−(C−));
   (b) Compute $\hat{F}(Y(Q_+(C_+), Q_-(C_-)))$.

2. Set $S_+^* = Q_+(C_+^*)$, $S_-^* = Q_-(C_-^*)$, where $(C_+^*, C_-^*)$ are the maximizers of $\hat{F}(Y(Q_+(C_+), Q_-(C_-)))$ over the search space.
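The grid search of MLSP1 can be sketched in a few lines. To keep the example short it is restricted to the positive threshold only (a one-sided search, so the Fisher criterion reduces to a scalar ratio); this restriction, the function names and the flattened list layout of the fields are simplifying assumptions and not the authors' implementation.

```python
def fisher_1d(y1, y2):
    """Scalar Fisher criterion; returns 0 when both variances vanish."""
    m1, m2 = sum(y1) / len(y1), sum(y2) / len(y2)
    v1 = sum((y - m1) ** 2 for y in y1) / len(y1)
    v2 = sum((y - m2) ** 2 for y in y2) / len(y2)
    return (m1 - m2) ** 2 / (v1 + v2) if v1 + v2 > 0 else 0.0

def mlsp1_positive(d, Z, G1, G2, c_min, c_max, n_steps):
    """One-sided phase 1 search.

    d  : difference field computed from the first data set
    Z  : per-subject fields from the second, independent data set
    Returns (best F, best threshold C+*, seed region S+*).
    """
    best = (0.0, None, [])
    for j in range(n_steps):
        c = c_min + j * (c_max - c_min) / (n_steps - 1)
        Q = [u for u, du in enumerate(d) if du >= c]   # excursion set Q+(C+)
        if not Q:
            continue  # F is conventionally zero for empty sets
        y1 = [sum(Z[i][u] for u in Q) for i in G1]     # Y_i(Q+) for group 1
        y2 = [sum(Z[i][u] for u in Q) for i in G2]     # Y_i(Q+) for group 2
        F = fisher_1d(y1, y2)
        if F > best[0]:
            best = (F, c, Q)
    return best

# Toy example: four sites, two subjects per group.
d = [0.1, 0.9, 1.0, 0.2]
Z = [[0.0, 1.0, 1.2, -0.5], [0.3, 1.4, 1.0, 0.5],
     [0.2, 0.1, 0.1, 0.4], [-0.2, -0.1, 0.1, -0.4]]
F_best, c_star, seed = mlsp1_positive(d, Z, [0, 1], [2, 3], 0.0, 1.0, 5)
```

In the toy example the search settles on the two central sites, where the groups differ most consistently. The full algorithm would loop over pairs (C+, C−) and use the two-dimensional criterion of Eq. (6).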

Algorithm MLSP2

1. Set $S_+^*$, $S_-^*$ to the values obtained by MLSP1.

2. For all C+, C− ∈ DSS2 do the following:
   (a) Compute the excursion sets Q+(C+), Q−(C−);
   (b) Compute R+(C+), R−(C−) using R+(C+) = Grow-Region($S_+^*$, Q+(C+)) and R−(C−) = Grow-Region($S_-^*$, Q−(C−));
   (c) Set $\hat{F}(C_+, C_-) = \hat{F}(Y(R_+(C_+), R_-(C_-)))$.

3. Set $C^{**} = \arg\max_{C_+, C_- \in \mathrm{DSS2}} \hat{F}(C_+, C_-)$.

4. Set $\hat{A}_+ = R_+(C_+^{**})$, $\hat{A}_- = R_-(C_-^{**})$.
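The Grow-Region step used in MLSP2 can be illustrated with a breadth-first search. The sketch assumes elements are (row, column) pairs on a two-dimensional lattice with 4-connectivity; the neighborhood system is an assumption made for the example, since the appropriate adjacency depends on the domain U (for instance, the TFT space).

```python
from collections import deque

def grow_region(seed, allowed):
    """Elements of `allowed` connected to `seed` through elements of `allowed`.

    seed, allowed : iterables of (row, col) pairs; 4-connectivity is assumed.
    """
    allowed = set(allowed)
    region = set(seed) & allowed
    queue = deque(region)
    while queue:
        r, c = queue.popleft()
        for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if nb in allowed and nb not in region:
                region.add(nb)
                queue.append(nb)
    return region

# The excursion set has a connected part touching the seed and an isolated site.
excursion = {(0, 0), (0, 1), (1, 1), (3, 3)}
grown = grow_region({(0, 0)}, excursion)
```

The isolated site (3, 3) is left out of the grown region even though it exceeds the threshold, which is how the procedure discards scattered supra-threshold elements.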

As discussed above, these algorithms require two independent data sets: one for computing the field d, and the other for computing the Y variables and $\hat{F}$ at the corresponding steps. If one has a reduced number of subjects, however, this approach may not be adequate. Therefore, we propose the following modification of the algorithms, based on a leave-one-out (LOO) procedure (Geisser, 1993; Efron, 1982): the idea is to compute the Y variables for each subject i at each step based on a field $d^{(-i)}$ obtained by leaving this subject out of the computation of field d, i.e.,

$$d^{(-i)}(u) = \frac{1}{\sum_{j \in G_1, j \neq i} J_j} \sum_{j \in G_1, j \neq i} J_j Z_j(u) - \frac{1}{\sum_{j \in G_2} J_j} \sum_{j \in G_2} J_j Z_j(u) \tag{13}$$

if i ∈ G1, with the corresponding modification for i ∈ G2.

Note that this implies that in algorithm MLSP1, for each threshold pair C+, C−, one needs to compute different LOO excursion sets $Q_+^{(-i)}(C_+)$, $Q_-^{(-i)}(C_-)$ and summarizing variables $Y_i^{(-i)}(Q_+^{(-i)}(C_+))$, $Y_i^{(-i)}(Q_-^{(-i)}(C_-))$ for each subject, but from these a single $\hat{F}(\tilde{Y}^{(-)}(C_+, C_-))$ is obtained, where the ith element of the vector $\tilde{Y}^{(-)}(C_+, C_-)$ is:

$$\tilde{Y}_i^{(-)}(C_+, C_-) = \left(Y_i^{(-i)}(Q_+^{(-i)}(C_+)),\ Y_i^{(-i)}(Q_-^{(-i)}(C_-))\right)$$

The resulting algorithm is as follows.

Algorithm MLSP1-LOO

1. For each threshold pair (C+, C−) in the search space DSS1:
   (a) Compute for each subject i the excursion sets $Q_+^{(-i)}(C_+)$, $Q_-^{(-i)}(C_-)$;
   (b) Compute the summarizing variables $\tilde{Y}_i^{(-)}(C_+, C_-) = (Y_i^{(-i)}(Q_+^{(-i)}(C_+)), Y_i^{(-i)}(Q_-^{(-i)}(C_-)))$;
   (c) Compute $\hat{F}(\tilde{Y}^{(-)}(C_+, C_-))$.

2. Set $S_+^* = Q_+(C_+^*)$, $S_-^* = Q_-(C_-^*)$, where $(C_+^*, C_-^*)$ are the maximizers of $\hat{F}(\tilde{Y}^{(-)}(C_+, C_-))$ over the search space.
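The leave-one-out construction can be sketched as follows for a single positive threshold. The code assumes flattened fields stored as lists and uses hypothetical function names; it mirrors Eq. (13) by dropping subject i from whichever group it belongs to before thresholding.

```python
def loo_difference_field(Z, J, G1, G2, i):
    """Eq. (13): trial-weighted difference field computed without subject i."""
    g1 = [j for j in G1 if j != i]
    g2 = [j for j in G2 if j != i]

    def weighted_mean(G, u):
        return sum(J[j] * Z[j][u] for j in G) / sum(J[j] for j in G)

    return [weighted_mean(g1, u) - weighted_mean(g2, u)
            for u in range(len(Z[0]))]

def loo_summaries(Z, J, G1, G2, c_plus):
    """Summarizing variable of every subject over its own LOO excursion set."""
    Y = {}
    for i in list(G1) + list(G2):
        d_i = loo_difference_field(Z, J, G1, G2, i)
        Q_i = [u for u, du in enumerate(d_i) if du >= c_plus]  # Q+^(-i)(C+)
        Y[i] = sum(Z[i][u] for u in Q_i)
    return Y

# Toy data: two sites, two subjects per group, one trial each.
Z = [[2.0, 0.0], [4.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
J = [1, 1, 1, 1]
Y_loo = loo_summaries(Z, J, G1=[0, 1], G2=[2, 3], c_plus=1.0)
```

Because each subject's excursion set is built without that subject's own data, the resulting Y values behave like values computed on unseen data, which is the property the two-sample version obtains by splitting the sample.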

In the case of algorithm MLSP2, note that one has available, as a result of MLSP1-LOO, not only the sets $S_\pm^*$, but also the LOO excursion sets $Q_\pm^{(-i)}(C_\pm^*)$. Our proposed modification of MLSP2 to construct MLSP2-LOO is to consider a set of thresholds $C_+ < C_+^*$, $C_- < C_-^*$ as before, but now compute sets $R_+^{(-i)}$, $R_-^{(-i)}$ using the sets $Q_\pm^{(-i)}(C_\pm^*)$ as seed regions, through calls to Grow-Region($Q_+^{(-i)}(C_+^*)$, $Q_+^{(-i)}(C_+)$) and Grow-Region($Q_-^{(-i)}(C_-^*)$, $Q_-^{(-i)}(C_-)$), respectively. From these sets, one then computes the summarizing variables $\tilde{Y}^{(-)}(C_+, C_-)$ and the corresponding FC $\hat{F}(\tilde{Y}^{(-)}(C_+, C_-))$. The final estimator is then obtained as $\hat{A}_\pm$ = Grow-Region($S_\pm^*$, $Q_\pm^{(-i)}(C_\pm^{**})$), where $C_+^{**}$, $C_-^{**}$ are the maximizers of $\hat{F}(\tilde{Y}^{(-)}(C_+, C_-))$ over the search space. In precise terms, the algorithm is as follows.

Algorithm MLSP2-LOO

1. Set $S_+^*$, $S_-^*$ to the values obtained by MLSP1-LOO.

2. For all C+, C− ∈ DSS2 do the following:
   (a) Compute the excursion sets $Q_+^{(-i)}(C_+)$, $Q_-^{(-i)}(C_-)$ for all subjects i;
   (b) Compute $R_+^{(-i)}(C_+)$, $R_-^{(-i)}(C_-)$ using $R_+^{(-i)}(C_+)$ = Grow-Region($Q_+^{(-i)}(C_+^*)$, $Q_+^{(-i)}(C_+)$) and $R_-^{(-i)}(C_-)$ = Grow-Region($Q_-^{(-i)}(C_-^*)$, $Q_-^{(-i)}(C_-)$) for all subjects i;
   (c) Compute the summarizing variables $\tilde{Y}_i^{(-)}(C_+, C_-) = (Y_i^{(-i)}(R_+^{(-i)}(C_+)), Y_i^{(-i)}(R_-^{(-i)}(C_-)))$;
   (d) Set $\hat{F}(C_+, C_-) = \hat{F}(\tilde{Y}^{(-)}(C_+, C_-))$.

3. Set $C^{**} = \arg\max_{C_+, C_- \in \mathrm{DSS2}} \hat{F}(C_+, C_-)$.

4. Compute the excursion sets $Q_+(C_+^{**}) = \{u : d(u) \geq C_+^{**}\}$ and $Q_-(C_-^{**}) = \{u : d(u) \leq -C_-^{**}\}$.

5. Set $\hat{A}_+$ = Grow-Region($S_+^*$, $Q_+(C_+^{**})$), $\hat{A}_-$ = Grow-Region($S_-^*$, $Q_-(C_-^{**})$).

2.1.1. Implementation details

Global significance test

For data sets generated under H0, the FC $\hat{F}(Y(C_+, C_-))$ becomes almost constant as a function of C+, C−, with random oscillations around a small value. As a consequence, one may get maxima for arbitrarily small thresholds, which may generate spurious estimators for A+, A− with relatively large cardinalities, although usually consisting of isolated elements. To avoid misinterpreting these results, it is necessary to apply a global significance test to the field d, to reject the null hypothesis H0 : d(u) = 0 for all u ∈ U. There are several tests that may be applied. Here, we propose an adaptation of the K-FWER test (Lehmann and Romano, 2005), which in this case may be implemented by finding the null distribution (e.g., using a permutation procedure) for the Kth largest value of |d(u)|. Note that here one applies the threshold corresponding to the given significance level (α = 0.05 in this case) only once to the Kth largest element of the test field, and not to each element u, as is usually done. A related test, which we call the K Largest Average (KLA) and which we also implement, uses as statistic of interest the average of the K largest values of |d(u)|. Note that in both cases it is not necessary to correct for multiple comparisons, since each test is applied only once to the complete field d.
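A permutation version of the global test on the Kth largest value of |d(u)| might look as follows. This is a sketch under simplifying assumptions: all subjects are taken to have the same number of trials (so the difference field is an unweighted mean difference), group labels are permuted across subjects, and the function names, the number of permutations and the p-value convention (count + 1)/(n_perm + 1) are choices made here rather than details from the text.

```python
import random

def kth_largest_abs(d, K):
    """Kth largest value of |d(u)| over the field."""
    return sorted((abs(x) for x in d), reverse=True)[K - 1]

def mean_diff(Z, G1, G2):
    """Unweighted difference of group means (equal trial counts assumed)."""
    return [sum(Z[i][u] for i in G1) / len(G1)
            - sum(Z[i][u] for i in G2) / len(G2)
            for u in range(len(Z[0]))]

def global_test(Z, G1, G2, K, n_perm=999, seed=0):
    """Permutation p-value for the Kth largest |d(u)| under label shuffling."""
    rng = random.Random(seed)
    observed = kth_largest_abs(mean_diff(Z, G1, G2), K)
    subjects = list(G1) + list(G2)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(subjects)
        p1, p2 = subjects[:len(G1)], subjects[len(G1):]
        if kth_largest_abs(mean_diff(Z, p1, p2), K) >= observed:
            count += 1
    return observed, (count + 1) / (n_perm + 1)

# Toy data: 6 subjects per group, 3 sites, strong effect on the first two sites.
Z = ([[10.0 + 0.1 * i, 10.0, 0.0] for i in range(6)]
     + [[0.1 * i, 0.0, 0.0] for i in range(6)])
stat, p_value = global_test(Z, list(range(6)), list(range(6, 12)), K=2)
```

The KLA variant mentioned above would only change the statistic, replacing `kth_largest_abs` with the mean of the K largest absolute values.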

Search space for C+, C−

For small values of the true inter-group differences, one may have the same problem mentioned above, namely, the appearance of spurious maxima of $\hat{F}(Y(C_+, C_-))$ for small values of C+, C−. This problem may be avoided by limiting the search space DSS1 for the optimal thresholds $C_+^*$, $C_-^*$, so that the minimal allowable values $C_+^{\min}$, $C_-^{\min}$ correspond to the threshold of a local (element-wise) significance test, say with a significance level α = 0.05, without correcting for multiple comparisons. The maximum values $C_+^{\max}$, $C_-^{\max}$ are limited by the maximum and minimum values of d(u), for u ∈ U. We use a sampling resolution of 0.01 for the search spaces. Therefore, the values for $(N_+^{(1)}, N_-^{(1)})$ are set to (100, 0) for the experiments with simulated data, and to (100, 100) for the experiments with real data (see next section). For the search space DSS2, we use (100, 0) and (100, 100), respectively, for $(N_+^{(2)}, N_-^{(2)})$.

Global maximization of $\hat{F}(Y(C_+, C_-))$

The maximization of $\hat{F}(Y(C_+, C_-))$ effected in MLSP1 or MLSP1-LOO is accomplished in the 2-dimensional space spanned by C+, C−.


Although there are computationally efficient algorithms that may be applied for this purpose (Ashlock, 2006), it is more convenient to simply discretize the search space in a two-dimensional grid, and perform an exhaustive search on this grid. The reasons for this choice are not only its simplicity and the fact that the computational cost using modern hardware is reasonable (see Section 4), but more importantly, the fact that the function F(Y(C+, C−)) usually presents high frequency random oscillations due to noise, which may generate spurious maxima. If one has available the values of F in a grid, these oscillations may be attenuated simply by applying a smoothing operator to this grid before searching for the global maximum, whereas in the case of a more sophisticated maximization scheme it is necessary to compute sets of values of F in the neighborhood of each visited point (C+, C−) to perform the smoothing, which further complicates the method. In any case, since one wants to eliminate “narrow” maxima, a good choice for the smoothing operator is a morphological erosion (Serra, 1984), which in this case is equivalent to taking the minimum value for F inside a moving window. We have used a square window with side equal to 9 times the grid spacing, but the precise value is not critical for the performance of the algorithm (window sizes from 5 through 19 give practically the same results).
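The erosion-then-search step can be sketched as follows. The helper names `erode` and `argmax_grid` are hypothetical, and the handling of the grid borders (the window is simply truncated there) is an assumption, since the text does not specify it.

```python
def erode(grid, window=9):
    """Morphological erosion with a flat square window (a moving minimum).

    `grid` is a list of rows holding the values of F on the (C+, C-) grid.
    Near the borders the window is truncated to the part inside the grid.
    """
    half = window // 2
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            out[i][j] = min(
                grid[r][c]
                for r in range(max(0, i - half), min(rows, i + half + 1))
                for c in range(max(0, j - half), min(cols, j + half + 1))
            )
    return out

def argmax_grid(grid):
    """Exhaustive search: return the (row, column) index of the largest value."""
    best = max((value, i, j) for i, row in enumerate(grid) for j, value in enumerate(row))
    return best[1], best[2]
```

Applying `argmax_grid(erode(F_grid, 9))` instead of `argmax_grid(F_grid)` discards isolated one-cell peaks, because a narrow peak always has low values somewhere inside its window, while a broad maximum survives the moving minimum.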

3. Experiments

In this section we describe the experiments performed to illustrate the performance of the proposed method. These experiments are of two types: first, we use the simple model described by (11) and (12), with the additional simplification of considering a− = 0, to analyze in detail the properties of the proposed approach. Then, we present two examples of its application to the analysis of data sets obtained from real EEG experiments.

3.1. Experiments with simulated data

The first set of experiments simulates a field of 50 × 50 elements in a square lattice, with an active region A with a constant activation level a+ consisting of a square of 5 × 5 elements, with the background noise modeled as a field of independent random variables with zero mean and unit variance. We considered 15 subjects per group and 500 independent realizations to compute the average performances. These experiments were conducted to compare the behavior of the theoretical FC given by (5), the empirical FC given by (6) and the corresponding estimation using the proposed LOO approximation, as the threshold C+ is allowed to vary.
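A minimal sketch of how one group of such simulated subjects might be generated is given below. The function name, the position of the 5 × 5 active square inside the lattice and the use of Gaussian noise are assumptions for illustration; the text only fixes the lattice size, the region size, and the zero mean and unit variance of the noise.

```python
import random

def simulate_group(n_subjects, activation, size=50, region=(20, 25), rng=None):
    """Simulate one group of subjects on a size x size lattice.

    Every site carries independent zero-mean, unit-variance Gaussian noise.
    Sites whose row and column both lie in [region[0], region[1]) form the
    active square and receive the constant level `activation` in addition.
    """
    rng = rng or random.Random()
    lo, hi = region
    subjects = []
    for _ in range(n_subjects):
        field = [
            [
                rng.gauss(0.0, 1.0)
                + (activation if lo <= r < hi and lo <= c < hi else 0.0)
                for c in range(size)
            ]
            for r in range(size)
        ]
        subjects.append(field)
    return subjects
```

Under these assumptions, `simulate_group(15, 0.8)` would play the role of the activated group and `simulate_group(15, 0.0)` that of the reference group.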

On a second set of experiments with the same data, we investigated the quality of the estimation results obtained with algorithms MLSP1-LOO and MLSP2-LOO for different conditions: constant versus variable activation levels and the effect of non-Gaussian and spatially correlated noise. For these comparisons, the Jaccard index (Jaccard, 1912), also known as the Tanimoto coefficient (TC), was used. This coefficient is defined as:

$$\mathrm{TC} = \frac{|A \cap \hat{A}|}{|A \cup \hat{A}|} \qquad (14)$$

where Â denotes the estimate of the true region A. Note that TC ∈ [0, 1]: if Â = A, TC = 1, and if A and Â are disjoint, TC = 0. When comparing the performance of estimation methods for A based on the excursion sets of the field d (i.e., on the application of a threshold to each element d(u)), it is interesting to use as a reference the maximum performance that it is possible to obtain using this type of method. This may be obtained by finding the threshold that maximizes the performance measure itself, i.e., TC in this case. Note that this is only a theoretical performance limit, since the determination of this optimal threshold implies that one knows exactly the true region A. This theoretical limit is used in the performance evaluations that result from our experiments.
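The coefficient in (14) and the threshold-based reference limit can be sketched in a few lines. The function names and the dictionary layout of the field d are hypothetical, and returning 1 when both sets are empty is a convention chosen here, not one stated in the text.

```python
def tanimoto(true_set, est_set):
    """Tanimoto (Jaccard) coefficient between the true and estimated regions."""
    union = true_set | est_set
    if not union:
        return 1.0  # convention assumed here: two empty sets agree perfectly
    return len(true_set & est_set) / len(union)

def excursion_set(d, threshold):
    """Sites whose value of d exceeds the threshold; d maps site -> value."""
    return {u for u, value in d.items() if value > threshold}

def oracle_threshold(d, true_set, thresholds):
    """Threshold that maximizes TC; it needs the true region, so it is only a reference."""
    return max(thresholds, key=lambda c: tanimoto(true_set, excursion_set(d, c)))
```

For example, with `d = {0: 0.1, 1: 0.9, 2: 1.2, 3: 0.4, 4: 1.1}` and true region `{1, 2, 4}`, the candidate thresholds 0.0, 0.5 and 1.0 give TC values of 3/5, 1 and 2/3, so the oracle selects 0.5.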

Finally, the empirical detection probability obtained with the proposed methods, and the classification efficiency of a linear classifier based on the summarizing variables, quantified using the area under the ROC curve (Fawcett, 2006), were also studied.
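For reference, the area under the ROC curve of a scalar classifier score can be computed directly from its rank interpretation: it equals the probability that a subject from one group scores higher than a subject from the other. The sketch below uses a hypothetical function name and counts ties as one half; it is a generic illustration, not the evaluation code used in the paper.

```python
def roc_auc(scores_pos, scores_neg):
    """Area under the ROC curve via pairwise comparisons (ties count one half)."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in scores_pos
        for n in scores_neg
    )
    return wins / (len(scores_pos) * len(scores_neg))
```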

3.2. Experiments with real data

These experiments were performed on data sets where the standard approach (LHT with correction for multiple comparisons) does not yield any interpretable results. The description of the corresponding electrophysiological experiments is as follows.

3.2.1. Experiment R1: relation between electrophysiological auditory responses and language development in premature infants with diffuse white matter injury

This experiment is related to a study, presented in Avecilla-Ramirez et al. (2011), in which the electrophysiological responses to a set of auditory stimuli consisting of series of syllables were recorded from a population of premature infants with a diffuse white matter injury known as periventricular leukomalacia (PVL) at 46 weeks of post-conceptional age. A communicative development inventory, the Spanish version of the MacArthur Communicative Development Inventories standardized for the Mexican population (Jackson-Maldonado et al., 2003), was applied to this population during a follow-up study performed at 14 months of age. The results of this later test were analyzed with a statistical clustering procedure, which resulted in two well-defined groups identified as the high-score (HS) and low-score (LS) groups. The goal of this study was to determine if membership in each group could be characterized by early electrophysiological measurements, and if this was the case, to localize these characterizing responses in time, frequency band and topographic location of the corresponding electrodes.

A total of 26 infants (14 females and 12 males with a mean gestational age of 33.6 weeks and a standard deviation of 2.4 weeks) with MRI results indicating PVL were selected for this study from the Neurodevelopment Research Unit of the Neurobiology Institute, Mexico. Informed written parental consent for participation in this study was obtained for all subjects. Infant EEGs were recorded in a soundproof room. The mothers held the infants during the session, and two loudspeakers were located 50 cm away from each of the infant’s ears. The experiment was conducted while the infants were asleep, and it lasted for approximately one hour. The EEG was recorded using leads according to the International 10-20 System with earlobes linked together. Differential amplifiers with a band pass between 0.5 and 50 Hz were used with a sampling rate of 200 Hz. The data were edited off-line. The phonetic stimuli consisted of consonant–vowel syllables (i.e., /pa/) spoken by a female adult whose mother tongue was Spanish. The duration of each syllable was 255 ms, and the intensity at the infant’s ear was 71 dB (SPL). The stimuli were presented in blocks composed of 50 series of three repeated consonant–vowel syllables. Within each train, the stimuli were presented at 700 ms intervals from onset to onset. The three-stimuli series were presented with 2-s intervals. There was a 10-s interval between blocks.

The analysis of Event Related Power (ERP) was conducted on the time interval of 0–2800 ms (i.e., from 700 ms preceding the presentation of the first stimulus of each train until 2100 ms following it) for each subject, for frequencies from 1 through 30 Hz, and for each one of the 19 electrodes at which recordings were performed.

The normalized variables Zij(t, f, e) were obtained by computing the difference between the log-amplitude signal measured at time t (with fixed f and e) and the average of the log-amplitude signal measured over the basal period (pre-stimulus) divided by the standard deviation as indicated in the introduction, and from these, the subject averages Zi and the inter-group differences (field d), which are required as input to the MLS procedure, were computed.

3.2.2. Experiment R2: analysis of the electrophysiological response to pairs of semantically unrelated words in normal and learning disabled children

In this experiment, presented in Fernandez et al. (2012), event-related EEG oscillations to semantically related and unrelated pairs of words were studied in a group of 19 children with learning disabilities and in another group of 16 children with normal academic achievement. The goal of the study was to determine if the EEG oscillations could be used to characterize the membership of subjects to each group. The EEGs were recorded using the 10-20 system. EEG segments were edited by visual inspection 1000 ms before and after the stimulus, and only correct responses were considered in the analysis. The amplifier bandwidth was set between 0.5 and 100 Hz. The EEG was sampled every 5 ms using a MEDICID IV System and edited offline. The stimuli were presented in white over a black background on a PC monitor, and the subjects were seated 60 cm away from the stimuli. A Mind Tracer system (Neuronic, Inc.) synchronized with a Trackwalker data system (Neuronic, Inc.) presented the task. The stimuli were presented in 4 blocks of 30 pairs of words that lasted 4 min each, and the children rested after each block. There were a total of 120 pairs of words, including 60 related and 60 unrelated pairs. The second words did not begin or end with the same phoneme as the previous word. All of the words were Spanish nouns with no more than three syllables. Additionally, each word had a unique meaning according to the Spanish Dictionary of the Royal Academy of Spanish Language and was selected from children's reading texts (Ahumada and Montenegro, 2007). The children were asked to respond by pressing the right or left mouse button, depending on whether the second word was related or unrelated. The related and unrelated pairs of words were counterbalanced between subjects (see Fernandez et al., 2012 for details). The ERP analysis and the computation of the normalized variables Z and the difference field d were conducted as in the previous experiment, except that now the frequency range was from 1 through 50 Hz.

In both experiments R1 and R2, the growth of regions in the MLSP2-LOO procedure was effected only with respect to time and frequency, to avoid the introduction of spurious correlations between electrodes.

4. Results

4.1. Experiment 1: simulated data

Fig. 1 shows the average curves of F versus threshold for 500 Monte Carlo repetitions of experiment 1. The activation level was set to a = 0.8. We show:

1. The value of the theoretical F given by (5), where the corresponding expectations were approximated by averages over 500 independent samples.

2. The empirical F obtained from two independent data sets, i.e., F(Y(Q+(C+))) as used in algorithm MLSP1.

3. The empirical $F(Y^{(-)}(Q_+(C_+)))$ obtained using the LOO procedure of algorithm MLSP1-LOO.

4. As a reference, we also show the variation of TC with respect to the threshold (theoretical TC curve).

As one can see, although the values of the empirical F are somewhat biased with respect to the theoretical F, the location of the maximizing threshold is practically the same for all the curves, and coincides with the threshold that maximizes the theoretical TC. The filled square in the plot corresponds to the average value of TC obtained after applying the MLSP2-LOO procedure. Note that it is greater than the maximum TC obtainable from any procedure that is based on the application of a threshold.

Fig. 1. Three upper curves: variation of the theoretical FC (*), F obtained using MLSP1 (×) and F obtained using MLSP1-LOO (triangles) versus threshold. Lower curve: theoretical TC (squares). The filled square indicates the TC obtained by MLSP2-LOO.

Note that in this plot, the procedure MLSP1 has in fact more information available than MLSP1-LOO, because it uses two independent data sets, each one with 30 subjects (15 per group). It is therefore interesting to see which one of the two strategies, i.e., separating the total number of available subjects into two sub-samples or using the total number of subjects in the MLSP1-LOO scheme, is more efficient. In Fig. 2 we show the relative performance of both schemes as the activation level is varied: in both cases one assumes that one has information from 60 subjects, but in the MLSP1 case this data set is divided into two sub-samples with 30 subjects each. As one can see, MLSP1-LOO represents a more efficient way of using the available data, so this is the variant that we consider in the next experiments.

Fig. 2. Comparison between MLSP1 (×) and MLSP1-LOO (circles) when both procedures have the same amount of information (60 subjects).


In Fig. 3a, we show a plot of TC versus the activation level a for the MLSP1-LOO and MLSP2-LOO procedures. As a reference, we also plot the theoretical TC and the TC obtained when the estimator Â is obtained by applying the standard LHT with a multiple comparison correction (in this case, FDR with significance level α = 0.05). The vertical dashed line marked “S” indicates the average value of a for which the global null hypothesis H0 : P1(Z) = P2(Z) is rejected using the KLA test with K = 10 and α = 0.05. As one can see from these curves, the performance of MLSP1-LOO for small values of a (i.e., for low SNR) is close to the theoretical upper bound for the TC (theoretical TC curve), while MLSP2-LOO exceeds this bound and is significantly better than the standard FDR approach. For high SNR, all methods approach the limit TC = 1. Error bars are not plotted, since the standard error is below 0.02 in all cases.

Fig. 3. (Panel a) TC versus activation level for: the theoretical TC (*); MLSP1-LOO (circles); MLSP2-LOO (×) and FDR with α = 0.05. The vertical dashed line marked “S” indicates the average activation value for which the global significance test rejected the null hypothesis (α = 0.05). The standard error was below 0.02 for all curves and all activation levels. Panel (b) illustrates the performance of the method when a non-Gaussian distribution for the noise is used (see text).

Fig. 3b shows the effect of introducing a non-Gaussian noise distribution. Specifically, we use a mixture of two Gaussians with equal weights, centered at ±1 and with standard deviation equal to 0.25. Note that in this case a linear classifier based on an individual variable Zi(u) would exhibit poor performance for an activation level a < 2, since the groups are not linearly separable. Using the summarizing variables Yi, however, the groups are well separated and the performance, as illustrated in this figure, is similar to that of the Gaussian case of Fig. 3a.

In Fig. 4 we show the effect of introducing a spatial gradient in the activation level: instead of a constant level, now a(u) is shaped like a plane with a slope of 0.1 per pixel in the x direction, so that at the right border of the 5 × 5 square that constitutes A, the level is half the nominal level. As one can see, the performance of MLSP1-LOO is degraded, while MLSP2-LOO is less affected, as one would expect.

Fig. 4. Same as Fig. 3 for a non-constant activation level (see text).

In Fig. 5, we explore the effect of introducing spatial correlation in the noise process. This is done by convolving the noise field with a Gaussian kernel with σ = 0.5 pixels. As one can see, the qualitative behavior is similar to the one observed in Fig. 3, although the performance of all methods is degraded.

Fig. 5. Same as Fig. 3 for spatially correlated noise (see text).

It is interesting to investigate the performance of a linear classifier based on the summarizing variables Y obtained with MLSP1-LOO and MLSP2-LOO. This performance may be characterized by the area under the ROC curve for these optimal classifiers (note that to construct these curves one has to know the true active region A). These results are shown in Fig. 6. As one can see, the classifier that uses Â obtained with MLSP2-LOO outperforms not only the one based on MLSP1-LOO, but also the one based on the best possible Â obtainable by applying a threshold. One may conclude, therefore, that MLSP2-LOO is the algorithm of choice, at least for this type of problem.

Fig. 6. Area under the ROC curve versus activation level for: MLSP1-LOO (circles); MLSP2-LOO (×) and the theoretical TC (*).

Finally, Fig. 7 shows the probability of declaring each element u as belonging to Â for the procedures MLSP1-LOO and MLSP2-LOO. These probability maps are estimated by counting, for each u, the number of times it belongs to the corresponding Â in each Monte Carlo repetition, divided by the total number of repetitions. The corresponding maps for Â estimated by maximizing TC (theoretical upper bound for threshold-based procedures) and by FDR with α = 0.05 are also shown as a reference. Details of the maps for MLSP2-LOO are shown in Fig. 8. Note that the false positive probability decreases as the activation level becomes larger.

Fig. 7. Probability of detection for the experiment with simulated data. The square in the upper left shows the true A-Region in red. The two sets of 4 squares in the bottom row show the color-coded probabilities, for activation levels equal to 0.6 (left set) and 0.9 (right set), for: MLSP1-LOO (upper left in each set); MLSP2-LOO (upper right); Theoretical TC (bottom left) and FDR with α = 0.05 (bottom right).

Fig. 8. Detail of Fig. 7 showing the estimated A-Regions for MLSP2-LOO for activation levels of 0.6 (left) and 0.9 (right).

4.2. Experiment 2: real data

4.2.1. Difference between HS and LS groups of infants with PVL

This experiment illustrates the case where the standard approach of LHT with correction for multiple comparisons (e.g., using FDR with α = 0.05) produces maps where there are almost no active elements, although the global tests indicate a significant difference between groups (p-values equal to 0.006 and 0.001 for the K-FWER and KLA tests, respectively, both with K = 10).

Fig. 9. (a) TFT map for A+ (red pixels) and A− (green pixels) estimated by MLSP2-LOO for the data of experiment R1. The circles at each (time, frequency) position represent head diagrams facing upwards. (b) TFT map for the uncorrected LHT with α = 0.05. Red pixels correspond to elements where the normalized log-power for the HS group was significantly higher than that of the LS group, and green pixels indicate the opposite situation.


Fig. 9a shows the TFT maps (Marroquin et al., 2004) for the regions A+ (red pixels) and A− (green pixels) obtained by MLSP2-LOO. The circles at each (time, frequency) position represent head diagrams facing upwards, where the location of each lead is colored appropriately according to its activation status. The vertical blue lines indicate the time samples where the second and third syllables for each series were uttered. As a reference, Fig. 9b shows the map that corresponds to the results of a non-parametric (permutation-based) LHT, with significance level α = 0.05 and without multiple comparison correction. Here the red pixels correspond to elements where the normalized log-power for the HS group was significantly higher than that of the LS group, and green pixels indicate the opposite situation (LS group greater than HS group). The map corresponding to the case where this correction is applied (FDR with α = 0.05) is not shown, since no significant elements are found in this case. Fig. 11a shows a scatter plot for the subjects of groups HS and LS in the space spanned by the summarizing variables Y, which, as one can see, provide an excellent inter-group separation.

The map of Fig. 9a shows a larger response for the subjects in the LS group (i.e., green pixels) in the time intervals (500–700 ms), (900–1000 ms) and (1400–1500 ms). This is amenable to a clear interpretation: for example, the larger response of the LS group for the second (900–1000 ms) and third (1400–1500 ms) stimulus of the train, located mostly in the frontal and prefrontal electrodes, in the beta (15–20 Hz) range, is consistent with the fact that in normal infants the responses (amplitudes of the event related potentials) to a series of syllables decrease gradually after the second stimulus in the series (Dehaene-Lambertz and Dehaene, 1994). This has been considered a result of habituation (Hernandez-Peon et al., 1956; Sable et al., 2004), which is considered a basic form of non-associative learning (Rankin et al., 2009), and which may be diminished in subjects from the LS group. This is also consistent with the work of Avecilla-Ramirez et al. (2011), where a classifier for group membership was constructed using summarizing variables based on regions detected by the uncorrected LHT procedure.

4.2.2. Difference between the electrophysiological responses to semantically unrelated words of normal and learning disabled children

As in the previous case, in this experiment the standard LHT approach corrected for multiple comparisons produces maps with too few activations to be useful. The global tests indicate a significant difference between groups: p-values of 0.001 and 0.001 for the K-FWER and KLA tests, respectively, both with K = 10.

Fig. 10 shows the TFT maps obtained with the MLSP2-LOO and uncorrected LHT procedures, with the same color coding as in the previous case. In this case, the A− region is practically empty, meaning that the elements that best characterize the inter-group differences are those where the control group exhibited the largest response. These elements are located in most electrodes in the delta and theta ranges (1–8 Hz) from 200 through 700 ms and in the gamma range (40–50 Hz) from 300 through 450 ms. Theta power increase has been reported in several situations: during encoding and memory retrieval (Burgess and Gruzelier, 1997; Klimesch, 1999), activation of working memory (Deiber et al., 2007; Gevins et al., 1997; Krause et al., 2000), semantic violations (Hald et al., 2006) and allocation of attention related to target stimuli (Missonier et al., 2006). Increases in delta power have been related to attention to internal processes (Harmony et al., 1996). Therefore, these results may be indicative of deficits in attention, in the activation of working memory and in encoding and memory retrieval in learning disabled children. These findings are consistent with the results reported in Fernandez et al. (2012), where the data analysis was conducted using the LHT method combined with the supra-threshold cluster size statistic (Friston et al., 1994) to correct for multiple comparisons. The scatter plot for the subjects in the control and LD group in the space spanned by the summarizing variables is shown in Fig. 11b.

Fig. 10. (a) TFT map for A+ (red pixels) and A− (green pixels) estimated by MLSP2-LOO for the data of experiment R2. (b) TFT map for the uncorrected LHT with α = 0.05. Red pixels correspond to elements where the normalized log-power for the HS group was significantly higher than that of the LS group, and green pixels indicate the opposite situation.

Fig. 11. Scatter plots in the space of the summarizing variables for the subjects in each group in experiments R1 (panel a) and R2 (panel b).

5. Discussion and conclusions

In this work, a method for mapping the set of elements in a certain space (e.g., the time–frequency–topography space in electrophysiological experiments) that best separate two groups of observations was presented. This approach may be considered complementary to the usual LHT procedure, and may be particularly useful when the latter method fails to produce estimates that are sufficiently populated to be amenable to interpretation, due to the conservative nature of the multiple comparisons corrections that must be applied. The proposed approach offers other advantages:

1. It produces interpretable features (summarizing variables) that may be used, in principle, for the classification of future data.

2. It produces accurate estimators for the active regions A, particularly for low values of SNR. These estimates remain reasonable for non-constant activation levels, and for non-Gaussian and spatially correlated noise.

3. There is no need to specify any free parameter for the estimation of A. In particular, there is no need to specify a significance level α, since the optimal thresholds $C^*_\pm$ and $C^{**}_\pm$ are obtained by the global maximization of F(Y(C+, C−)) in the MLSP1-LOO and MLSP2-LOO procedures.

Since the direct maximization of the proposed criterion, namely, the FC, is computationally unfeasible, we proposed a method that consists of two phases: in phase 1, the maximization is effected by searching over the space spanned by two thresholds, estimating A± as the corresponding excursion sets of the field d. In phase 2, the search is extended to elements that are arc-connected to the sets found in phase 1, even if the absolute values of d in these elements are smaller than the optimal thresholds, provided that their inclusion in the estimates increases the value of the FC. This means that in phase 2, the regions found in phase 1 may grow, but no new regions will appear.
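The connectivity constraint of phase 2 can be illustrated with a small region-growing sketch. It only shows how a phase-1 set may be extended to connected elements that exceed a lower threshold, so that existing regions grow and no new ones appear; the choice of that lower threshold by maximizing the FC is not shown. The function name, the 4-neighbour connectivity on a 2-D lattice and the use of a single lower threshold are assumptions, not the MLSP2-LOO procedure itself.

```python
from collections import deque

def grow_connected(d, seed, lower):
    """Extend `seed` with sites whose d exceeds `lower` and that are 4-connected to it.

    d    : 2-D list of field values
    seed : set of (row, col) sites found in phase 1
    Sites above `lower` that are not connected to the seed are never added.
    """
    rows, cols = len(d), len(d[0])
    region = set(seed)
    queue = deque(seed)
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and (nr, nc) not in region and d[nr][nc] > lower:
                region.add((nr, nc))
                queue.append((nr, nc))
    return region
```

In a full procedure, one would evaluate the FC for the grown regions over a range of lower thresholds and keep the one with the largest value.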

Although the false positive error is not explicitly controlled by this method, due to the asymptotic properties of the FC discussed in Section 2, it remains under reasonable limits. Thus, for the experiments with simulated data whose results appear in Fig. 3, for all values of a, the false positive error rates for the phase 1 algorithm MLSP1-LOO are under 0.003, while those for phase 2 are below 0.015. The additional false positives found by MLSP2-LOO, however, are mostly located in the neighborhood of the true active region, as is apparent in Figs. 7 and 8. Since the Tanimoto coefficient is significantly better when phase 2 is used, this means that using MLSP2-LOO one obtains in general a significant increment in sensitivity, at the expense of a possible overestimation of the extent of the active regions, but without detecting new spurious regions that may not be connected with the true ones.

Another important contribution of this work is the use of a modified algorithm for estimating the FC when data from only a limited number of subjects is available. In this scheme, each summarizing variable $Y_i^{(-i)}$ is computed based on an estimate $d^{(-i)}$ of the field d, which is obtained by leaving out subject i from the computation. In this way, one optimizes the use of available information (see Fig. 2), while avoiding the introduction of bias in the estimation of the maximum value of the estimated FC.
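The leave-one-out scheme can be sketched for the simplified case of a single threshold C+ (as in the simulated experiments, where a− = 0). The code below is an illustration under stated assumptions: the field d is taken as a plain difference of group means, the summarizing variable is the sum of a subject's values over the excursion set, and the Fisher criterion is written in its one-dimensional form (squared difference of group means over the pooled within-group variance). The function names are hypothetical, and the criterion actually used in the paper is the one given by its Eq. (6).

```python
def loo_summaries(group1, group2, c_plus):
    """Leave-one-out summarizing variables for every subject of both groups.

    For subject i, the region is the excursion set {u : d_minus_i(u) > c_plus},
    where d_minus_i is computed without subject i. Groups are lists of flat lists.
    """
    def mean_field(subjects):
        n = len(subjects)
        return [sum(s[u] for s in subjects) / n for u in range(len(subjects[0]))]

    def summary(subject, rest1, rest2):
        m1, m2 = mean_field(rest1), mean_field(rest2)
        region = [u for u in range(len(subject)) if m1[u] - m2[u] > c_plus]
        return sum(subject[u] for u in region)

    y1 = [summary(s, group1[:i] + group1[i + 1:], group2) for i, s in enumerate(group1)]
    y2 = [summary(s, group1, group2[:i] + group2[i + 1:]) for i, s in enumerate(group2)]
    return y1, y2

def fisher_criterion_1d(y1, y2):
    """One-dimensional Fisher criterion with pooled within-group variance."""
    m1, m2 = sum(y1) / len(y1), sum(y2) / len(y2)
    ss = sum((y - m1) ** 2 for y in y1) + sum((y - m2) ** 2 for y in y2)
    pooled_var = ss / (len(y1) + len(y2) - 2)
    return (m1 - m2) ** 2 / pooled_var if pooled_var > 0 else 0.0
```

Evaluating `fisher_criterion_1d(*loo_summaries(g1, g2, c))` over a grid of thresholds c gives the curve whose maximum defines the estimated threshold; because subject i never contributes to its own region, the maximum is not inflated by reusing the same data for selection and evaluation.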

In this presentation, we have focused on the case where there are two groups and only one experiment, so that we ended up with two summarizing variables, which correspond to positive and negative differences in the average responses of the two groups. In many cases, however, the membership of each subject to a group is related to his or her electrophysiological response to more than one experiment or experimental condition. In this case, although it is possible, in principle, to simply increase the number of summarizing variables and use the general definition for F given by (6), the increased dimensionality of the space spanned by the thresholds makes the search for the optimal ones computationally very expensive. A simple solution may be obtained under the assumption that the random variables that correspond to the response variables Z(u) for each experiment are uncorrelated with each other, i.e., that the intra-group covariance matrices S are block-diagonal. In this case, the maximization of F may be effected by finding the estimates for A± and the corresponding summarizing variables for each experiment in a decoupled way, so that the computational complexity grows only linearly with respect to the number of experiments, while the classification efficiency is likely to increase, since the classifier operates on a space with more dimensions.
The approach presented here leads to a discriminant function $f(X) = \beta^T Y(A_+, A_-) = b^T X$ between the two groups. This function depends linearly on the vector of random field values X = Z = (Z(u) : u ∈ U), for a new subject. In this sense, it is related to methods that attempt to find optimal directions b for high dimensional observation vectors X, so that the discriminant function is constructed as $f(X) = b^T X$, where b maximizes (or minimizes) a suitable criterion that involves the available data and the response variable W: (Xi, Wi), i ∈ G1, G2. If the variables Wi are binary, i.e., if they indicate the membership of the subjects in one of the two groups, it is known that minimizing the least squares criterion corresponding to the regression $W_i = b^T X_i$ leads to the standard Fisher discriminant function (McLachlan, 1992). For high dimensional data, it may be more suitable to adopt a normalized covariance criterion with the response variables, which results in a partial least squares (PLS)-based discriminant function (Barker and Rayens, 2003), which is essentially a regularized Fisher discriminant function. More generally, a number of modern methods for dimension reduction and sparse representation are based on the minimization of a function of the form:

$$Q(b) = L(b) + \lambda P(b)$$

where the term L(b) quantifies the fitness to the data in a suitable sense, such as: the sum of residual squares of $W_i \approx b^T X_i$, in the standard Fisher discriminant function; a normalized covariance criterion in PLS-based discrimination as discussed above, or a margin loss function in Support Vector Machines (Vapnik, 1998). The term P(b) is a specified penalization for the coefficient vector b, such as the L1 norm in the LASSO approach (Tibshirani, 1996), or a measure of model complexity (Burnham and Anderson, 2002; Massart, 2007; Devroye et al., 1996). The relative weight of this term is specified by the non-negative hyperparameter λ, and the feature vector X may be obtained from the original data Z through a non-linear transformation, as in the kernel methods (Scholkopf and Smola, 2002).

Although all these methods are intended to obtain a sparse representation where many coefficients b(u) are equal to zero, they in fact admit solutions with arbitrary real values b(u) ∈ R. In contrast, in our approach, we are constraining the discriminant function f(X) to be a linear function of Y(A+, A−) = (Y(A+), Y(A−)), with just a two-dimensional coefficient vector β = (β+, β−), which implies the constraint b(u) = β+ for all u ∈ A+, b(u) = β− for all u ∈ A−, and b(u) = 0 elsewhere. This particular structure marks a key difference between our procedure and other methods; as explained above, the motivation for imposing this constraint is to achieve an effective dimensionality reduction while keeping a simple interpretation for the constructed variables Y, namely, as average activities over regions of the U space. Also, in this way, the regions where b(u) is non-zero may be interpreted directly as estimators of the true active regions.
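The constraint can be made explicit by expanding the discriminant function. The expansion below assumes that each summarizing variable is the total activity over its region, $Y(A) = \sum_{u \in A} Z(u)$; if averages are used instead, the factor $1/|A|$ is simply absorbed into the corresponding coefficient.

```latex
\begin{aligned}
f(X) = b^T X &= \sum_{u \in U} b(u)\, Z(u) \\
             &= \beta_+ \sum_{u \in A_+} Z(u) \;+\; \beta_- \sum_{u \in A_-} Z(u) \\
             &= \beta_+\, Y(A_+) + \beta_-\, Y(A_-) = \beta^T Y(A_+, A_-).
\end{aligned}
```

Written this way, the high dimensional vector b has only two free values, and its support is exactly the pair of estimated regions.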

From another perspective, the present approach may be consid-red as an instance of feature selection for a classifier, constrainedo a particular simple structure, through the maximization of anbjective function F . This function represents a resampling estimatef a theoretical criterion F for prediction performance (Efron, 1982;fron and Tibshirani, 1993). Specifically, we adopt as a performanceriterion the theoretical Fisher criterion F of separability betweenroups achieved by the two-dimensional feature Y , and constructhe estimate F of F by means of leave-one-out (LOO) methods. These of this resampling method of estimation is essential in ordero avoid biased estimates of prediction performance that wouldead to the overestimation of the number of non-null coefficients(u), which means large false positive error rates in the estimation
f the active regions A+, A−, with the consequent poor predictiveerformance of the classifier.
The Fisher criterion has been used before for feature selection (Duda et al., 2000; Wang et al., 2007). What distinguishes the method presented here is the domain-specific structure described above, which permits an efficient search in the space spanned by the thresholds C+, C−. Since one may efficiently look for the global maximum of the criterion F in this space, the method combines the benefits of both “forward” and “backward” variable selection methods (Guyon and Elisseeff, 2003). Since the features consist of sums of random variables, one may expect an approximate Gaussian behavior, and hence linear separability, under the usual assumptions that are commonly used in neuroimaging, namely, single constant activation levels a± for all the subjects of a group. If, within the same group, the subjects have different activation levels ai±, it is possible to obtain configurations that are not linearly separable in the space spanned by the Y variables. In this case, it is still possible to use the domain-specific structure proposed here, simply by replacing, in the computation of F, the Fisher criterion by a more general one, such as mutual information (see Guyon and Elisseeff, 2003) or the Hilbert–Schmidt independence criterion (HSIC) (Song et al., 2012), at the expense of increased computational cost and, possibly, a less intuitive interpretation. A classifier of a more general type, such as a support vector machine (Vapnik, 1998), may then be constructed based on the resulting Y variables.
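A two-threshold search of this kind can be sketched as an exhaustive scan over a small grid. Several things here are assumptions made only for illustration: the per-site score `t` and the rule that maps thresholds to regions (A+ where `t > C+`, A− where `t < −C−`), the toy data, and all function names. The sketch also uses a plain resubstitution value of the criterion for brevity, which is precisely the optimistic choice that the leave-one-out estimate discussed above is meant to avoid.

```python
import numpy as np

def fisher_2d(y1, y2):
    """Fisher criterion of a vector feature: d^T (S1 + S2)^+ d."""
    d = y1.mean(axis=0) - y2.mean(axis=0)
    s = np.cov(y1, rowvar=False) + np.cov(y2, rowvar=False)
    return float(d @ np.linalg.pinv(s) @ d)

def features(z, a_pos, a_neg):
    """Per-subject total activity over each candidate region."""
    return np.stack([z[:, a_pos].sum(axis=1), z[:, a_neg].sum(axis=1)], axis=1)

def search_thresholds(z1, z2, t, grid):
    """Exhaustive search over (C+, C-) for the largest criterion value."""
    best = (-np.inf, None, None)
    for c_pos in grid:
        for c_neg in grid:
            a_pos, a_neg = t > c_pos, t < -c_neg
            if not (a_pos.any() and a_neg.any()):
                continue  # skip threshold pairs that leave a region empty
            value = fisher_2d(features(z1, a_pos, a_neg), features(z2, a_pos, a_neg))
            if value > best[0]:
                best = (value, c_pos, c_neg)
    return best

rng = np.random.default_rng(0)
n_sites = 60
z1 = rng.normal(size=(20, n_sites))   # toy normalized maps, group 1
z2 = rng.normal(size=(20, n_sites))   # toy normalized maps, group 2
z1[:, :10] += 1.0                     # sites more active in group 1
z1[:, 50:] -= 1.0                     # sites less active in group 1

# Hypothetical per-site score: a two-sample t-like statistic.
se = np.sqrt(z1.var(axis=0, ddof=1) / 20 + z2.var(axis=0, ddof=1) / 20)
t = (z1.mean(axis=0) - z2.mean(axis=0)) / se

grid = np.linspace(0.5, 4.0, 8)
best_value, best_c_pos, best_c_neg = search_thresholds(z1, z2, t, grid)
print(best_value, best_c_pos, best_c_neg)
```

Because only two scalars are searched, the whole grid can be scanned and the best pair found directly, rather than adding or removing sites one at a time.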

In summary, we have presented a methodology, complementary to the standard LHT approach, for the estimation of activation regions in random fields. The interpretation of the resulting maps differs between the two approaches. Instead of showing elements where the local inter-group difference in the response is significantly large, the MLS approach shows the elements where the responses provide the maximum separation of the subjects with respect to their group membership. The computational complexity of the proposed approach is reasonable, in particular if the method is implemented on modern (multi-core) hardware: for the data of experiment R1, for example, the total processing time on a 4-core workstation running at 3.12 GHz was 824 s.

In this work, we have focused on electrophysiological data in the space spanned by time, frequency and topography, but the presented approach may be applied to other situations. One example is the correlation analysis of the signals associated with pairs of electrodes measured at a frequency f and a given time interval. In this case the normalized variable Zij could be the normalized cross-spectrum, so that the elements of U have the form u = (e, e′, f). Another example is the analysis of fMRI data to compare the functional response of two groups of subjects to a given stimulus train. In this case, U corresponds to the physical space occupied by the brain, so that u = (x, y, z) (the usual rectangular coordinates). The normalized variables Z may correspond, for example, to the F statistic used to test the null hypothesis β = 0, where β is the coefficient vector in a general linear model formulation.

Acknowledgement

The authors were supported in part by grant 131771 from the Consejo Nacional de Ciencia y Tecnologia (Conacyt), Mexico.

Appendix A.

Here we present a technical proof for Eq. (10) of Section 3. Consider the following proposition.

Proposition 1. Consider a classifier (not necessarily linear), based on the discriminant function f(Z) between two populations P1, P2, i.e., large values of f(Z) indicate that the data Z was extracted from P1. Let

$$\Delta^2 = \frac{\left(E_1[f(Z)] - E_2[f(Z)]\right)^2}{V_1[f(Z)] + V_2[f(Z)]},$$

where $E_g$, $V_g$ denote expectation and variance for population $P_g$, $g \in \{1, 2\}$, and assume that $E_1[f(Z)] > E_2[f(Z)]$.

Let H denote the area under the ROC curve for the classifier. Then,

$$H \geq 1 - \frac{1}{1 + \Delta^2}.$$

Proof. By definition, H = P(f(Z1) > f(Z2)), where Z1, Z2 are independent observations obtained from populations P1 and P2, respectively. Following Biscay and Pascual (1993), one may write H as:

$$H = 1 - P\left(W - E[W] < -E[W]\right),$$

where W = f(Z1) − f(Z2). From Cantelli’s theorem (Rao, 1973), one has that for any $\lambda < 0$,

$$P\left(W - E[W] < \lambda\right) \leq \frac{V[W]}{V[W] + \lambda^2}.$$

In particular, by setting $\lambda = -E[W]$, and noting that the independence of Z1 and Z2 gives $V[W] = V_1[f(Z)] + V_2[f(Z)]$, so that $E[W]^2 / V[W] = \Delta^2$, one obtains:

$$P\left(W - E[W] < -E[W]\right) \leq \frac{1}{1 + \Delta^2},$$

from which the proposition follows.

Inequality (10) then follows from Proposition 1 by putting $f(Z) = \beta^T Y(S_+, S_-)$, so that $\Delta^2 = F(Y(S_+, S_-))$. Note that (10) is also valid if one considers only positive activations, so that f(Z) = Y+(S+).
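The bound of Proposition 1 can be checked numerically on simulated scores. The following is only an illustrative sketch: the exponential score distributions, the sample sizes and the function names are arbitrary assumptions, chosen to show that the inequality does not rely on Gaussianity. The area under the ROC curve is computed as the fraction of pairs in which the score from the first population exceeds the score from the second.

```python
import numpy as np

def separation(s1, s2):
    """Squared mean difference divided by the sum of the group variances."""
    return float((s1.mean() - s2.mean()) ** 2 / (s1.var() + s2.var()))

def auc(s1, s2):
    """Fraction of pairs (i, j) with s1[i] > s2[j]; ties count one half."""
    diff = s1[:, None] - s2[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

rng = np.random.default_rng(1)
s1 = rng.exponential(scale=2.0, size=1000) + 1.0   # scores f(Z) under P1
s2 = rng.exponential(scale=1.0, size=1000)         # scores f(Z) under P2

delta2 = separation(s1, s2)
bound = 1.0 - 1.0 / (1.0 + delta2)
h = auc(s1, s2)
print(f"separation = {delta2:.3f}, lower bound = {bound:.3f}, AUC = {h:.3f}")
```

In this setting the empirical AUC lies well above the lower bound, which is consistent with the bound being a distribution-free guarantee rather than a tight approximation.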

References

Abdi H. Bonferroni and Sidak corrections for multiple comparisons. In: Salkind NJ, editor. Encyclopedia of measurement and statistics. Thousand Oaks, CA: Sage; 2007.

Ahumada R, Montenegro A. Juguemos a leer. Desarrollo de competencias del lenguaje. Libro de lectura. Editorial Trillas Infantil; 2007.

Ashlock D. Evolutionary computation for modelling and optimization. Springer Verlag; 2006.

Avecilla-Ramirez GN, Ruiz-Correa S, Marroquin JL, Harmony T, Alba A, Mendoza-Montoya O. Electrophysiological auditory responses and language development in infants with periventricular leukomalacia. Brain Language 2011;119:175–83.

Barker M, Rayens W. Partial least squares for discrimination. J Chemometr 2003;17:166–73.

Biscay RJ, Pascual R. Distance between probability measures based on the ROC curves. Statistics 1993;24:371–6.

Boulesteix A-L. PLS dimension reduction for classification with microarray data. Stat Appl Genet Mol Biol 2004;3, Article 33.

Bullmore ET, Suckling J, Overmeyer SR, Taylor E, Brammer MJ. Global, voxel and cluster tests, by theory and permutation, for a difference between two groups of structural MR images of the brain. IEEE Trans Med Imaging 1999;18(1):32–42.

Burgess AP, Gruzelier JH. Short duration synchronization of human theta rhythm during recognition memory. NeuroReport 1997;8:1039–42.

Burnham KP, Anderson DR. Model selection and multimodel inference. A practical information-theoretic approach. New York: Springer-Verlag; 2002.

Dehaene-Lambertz G, Dehaene S. Speed and cerebral correlates of syllable discrimination in infants. Nature 1994;370:292–5.

Deiber MP, Missonier P, Bertrand O, Gold G, Fazio-Costa L, Ibanez V, et al. Distinction between perceptual and attentional processing in working memory tasks: a study of phase-locked and induced oscillatory brain dynamics. J Cogn Neurosci 2007;19:158–72.

Devroye L, Gyorfi L, Lugosi G. A probabilistic theory of pattern recognition. New York: Springer-Verlag; 1996.

Ding B, Gentleman R. Classification using generalized partial least squares. J Comput Graph Stat 2004;14:280–98.

Duda RO, Hart PE, Stork DG. Pattern recognition. 2nd ed. California: John Wiley and Sons; 2000.

Efron B. The jackknife, the bootstrap and other resampling plans. USA: Society for Industrial and Applied Mathematics; 1982.

Efron B, Tibshirani RJ. An introduction to the bootstrap. Chapman and Hall; 1993.

Fawcett T. An introduction to ROC analysis. Pattern Recogn Lett 2006;27:861–74.

Fernandez T, Harmony T, Mendoza-Montoya O, Lopez-Alanis P, Marroquin J, Otero G, Ricardo-Garcell J. Event-related EEG oscillations to semantically unrelated words in normal and disabled children. Brain Cogn 2012;80:74–82.

Friston KJ, Worsley KJ, Frackowiak RSJ, Mazziotta JC, Evans AC. Assessing the significance of focal activations using their spatial extent. Hum Brain Mapp 1994;1:214–20.

Friston KJ, Ashburner JT, Kiebel SJ, Nichols TE, Penny WD, editors. Statistical parametric mapping: the analysis of functional brain images. Academic Press; 2007.

Geisser S. Predictive inference. New York: Chapman and Hall; 1993.

Genovese CR, Lazar N, Nichols TE. Thresholding of statistical maps in functional neuroimaging using the false discovery rate. Neuroimage 2002;15:870–8.

Gevins A, Smith M, McEvoy L, Yu D. High-resolution EEG mapping of cortical activation related to working memory: effects of task difficulty, type of processing, and practice. Cereb Cortex 1997;7:374–85.

Guyon I, Elisseeff A. An introduction to variable and feature selection. J Mach Learn Res 2003;3:1157–82.

Hald LA, Bastiaansen MHC, Hagoort P. EEG theta and gamma responses to semantic violations in online sentence processing. Brain Language 2006;96:90–105.

Harmony T, Fernandez T, Silva J, Bernal J, Diaz-Comas L, Reyes A, Marosi E, Rodriguez-Camacho M, Rodriguez ME. EEG delta activity: an indicator of attention to internal processing during the performance of mental tasks. Int J Psychophysiol 1996;24:161–71.

Hernandez-Peon R, Scherrer RH, Jouvet M. Modification of electrical activity in cochlear nucleus during “attention” in unanesthetized cats. Science 1956;123:331–2.

Jaccard P. The distribution of flora in the alpine zone. New Phytol 1912;11:37–50.

Jackson-Maldonado D, Thal D, Marchman V, Newton T, Fenson L, Conboy B. User’s guide and technical manual contents. MacArthur inventories. Paul Brookes Publishing Co; 2003.

Klemet S, Mamlouk AM, Martinez T. Reliability of cross-validation for SVMs in high-dimensional, low sample size scenarios. In: Proceedings ICANN’08: proceedings of the 18th international conference on artificial neural networks, Part I. Springer-Verlag; 2008.

Klimesch W. EEG alpha and theta oscillations reflect cognitive and memory performance: a review and analysis. Brain Res Rev 1999;29:169–95.

Kohavi R, John GH. Wrappers for feature subset selection. Artif Intell 1997;97:273–324.

Krause CM, Sillanmaki L, Koivisto M, Saarela C, Haggovist A, Laine M, et al. The effects of memory load on event-related EEG desynchronization and synchronization. Clin Neurophysiol 2000;111:2071–8.

Krishnan AW, Williams LJ, McIntosh AR, Abdi H. Partial least squares (PLS) methods for neuroimaging: a tutorial and review. Neuroimage 2011;56(2):455–75.

Lehmann EL, Romano JP. Generalizations of the family-wise error rate. Ann Stat 2005;33:1138–54.

Marroquin JL, Harmony T, Rodriguez V, Valdes-Sosa P. Exploratory EEG data analysis for physiological experiments. Neuroimage 2004;21:991–9.

Marroquin JL, Biscay RJ, Ruiz-Correa S, Alba A, Ramirez R, Armony JL. Morphology-based hypothesis testing in discrete random fields: a non-parametric method to address the multiple-comparison problem in neuroimaging. Neuroimage 2012;59:3061–74.

Massart P. Concentration inequalities and model selection. Berlin: Springer-Verlag; 2007.

McLachlan GJ. Discriminant analysis and statistical pattern recognition. John Wiley and Sons; 1992.

Missonier P, Deiber MP, Gold G, Millet P, Gex-Fabry Pun M, Fazio-Costa L, et al. Frontal theta event-related synchronization: comparison of directed attention and working memory load effects. J Neural Trans 2006;113:1477–86.

Nichols TE. Nonparametric permutation tests for functional neuroimaging: a primer with examples. Hum Brain Mapp 2001;15:1–25.

Nichols TE. FWE-corrected inference: parametric conservativeness and nonparametric alternatives. In: Proceedings of the joint statistical meetings; 2002.

Nichols TE, Hayasaka S. Controlling the familywise error rate in functional neuroimaging: a comparative review. Stat Meth Med Res 2003;12:419–46.

Pratt WK. Digital image processing. Los Altos, CA: John Wiley and Sons; 2007.

Rankin CH, Abrams T, Barry RJ, Bhatnagar S, Clayton DF, Colombo J. Habituation revisited: an updated and revised description of the behavioral characteristics of habituation. Neurobiol Learn Memory 2009;92:135–8.

Rao CR. Linear statistical inference and its applications. New York: John Wiley and Sons; 1973.

Sable JJ, Low KA, Maclin EL, Fabiani M, Gratton G. Latent inhibition mediates N1 attenuation to repeating sounds. Psychophysiology 2004;41:636–42.

Scholkopf B, Smola AJ. Learning with kernels. Support vector machines regularization optimization and beyond. Cambridge, MA, USA: The MIT Press; 2002.

Serra J. Image analysis and mathematical morphology. Ac. Press; 1984.

Smith SM, Nichols TE. Threshold-free cluster enhancement: addressing problems of smoothing, threshold dependence and localisation in cluster inference. Neuroimage 2009;44:83–98.

Song L, Smola A, Gretton A, Bedo J, Borgwardt K. Feature selection via dependence maximization. J Mach Learn Res 2012;13:1393–434.

Swets JA. Signal detection theory and ROC analysis in psychology and diagnostics: collected papers. Mahwah, NJ: Lawrence Erlbaum Associates; 1996.

Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Ser B 1996;58:267–88.

Vapnik VN. Statistical learning theory. John Wiley and Sons; 1998.

Wang SA, Liu C, Zheng L. Feature selection by combining Fisher criterion and principal feature analysis. In: Proceedings of the sixth international conference on machine learning and cybernetics; 2007.

Wold H. Estimation of principal components and related models by iterative least squares and related models. In: Krishnaia PR, editor. Multivariate analysis. New York: Academic Press; 1966.

Zhang H, Nichols TE, Johnson TD. Cluster mass inference via random field theory. Neuroimage 2009;44:51–61.
