EXACT AND ASYMPTOTICALLY ROBUST PERMUTATION TESTS
By
Eun Yi Chung Joseph P. Romano
Technical Report No. 2011-05 May 2011
Department of Statistics STANFORD UNIVERSITY
Stanford, California 94305-4065
http://statistics.stanford.edu
Exact and Asymptotically Robust Permutation Tests
EunYi Chung
Department of Economics
Stanford University
Joseph P. Romano∗
Departments of Statistics and Economics
Stanford University
May 5, 2011
Abstract
Given independent samples from P and Q, two-sample permutation tests allow
one to construct exact level tests when the null hypothesis is P = Q. On the other
hand, when comparing or testing particular parameters θ of P and Q, such as their
means or medians, permutation tests need not be level α, or even approximately
level α in large samples. Under very weak assumptions for comparing estima-
tors, we provide a general test procedure whereby the asymptotic validity of the
permutation test holds while retaining the exact rejection probability α in finite
samples when the underlying distributions are identical. A quite general theory
is possible based on a coupling construction, as well as a key contiguity argument
for the binomial and hypergeometric distributions. The ideas are broadly appli-
cable and special attention is given to a nonparametric k-sample Behrens-Fisher
problem, whereby a permutation test is constructed which is exact level α under
the hypothesis of identical distributions, but has asymptotic rejection probability
α under the more general null hypothesis of equality of means. A Monte Carlo
simulation study is performed.
2010 MSC subject classifications. Primary 62E20, Secondary 62G10
KEY WORDS: Behrens-Fisher problem; Coupling; Permutation test
∗Research has been supported by NSF Grant DMS-0707085.
1 Introduction
In this article, we consider the behavior of two-sample (and later also k-sample) permu-
tation tests for testing problems when the fundamental assumption of identical distribu-
tions need not hold. Assume X1, . . . , Xm are i.i.d. according to a probability distribution
P , and independently, Y1, . . . , Yn are i.i.d. Q. The underlying model specifies a family of
pairs of distributions (P,Q) in some space Ω. For the problems considered here, Ω spec-
ifies a nonparametric model, such as the set of all pairs of distributions. Let N = m+n,
and write
Z = (Z1, . . . , ZN) = (X1, . . . , Xm, Y1, . . . , Yn) . (1)
Let Ω̄ = {(P,Q) : P = Q}. Under the assumption (P,Q) ∈ Ω̄, the joint distribution of
(Z1, . . . , ZN) is the same as (Zπ(1), . . . , Zπ(N)), where (π(1), . . . , π(N)) is any permutation
of {1, . . . , N}. It follows that, when testing any null hypothesis H0 : (P,Q) ∈ Ω0, where Ω0 ⊂ Ω̄, an exact level α test can be constructed by a permutation test. To
review how, let G_N denote the set of all permutations π of {1, . . . , N}. Then, given any
test statistic Tm,n = Tm,n(Z1, . . . , ZN), recompute Tm,n for all permutations π; that is,
compute Tm,n(Zπ(1), . . . , Zπ(N)) for all π ∈ GN, and let their ordered values be
T^{(1)}_{m,n} ≤ T^{(2)}_{m,n} ≤ · · · ≤ T^{(N!)}_{m,n} .
Fix a nominal level α, 0 < α < 1, and let k be defined by
k = N! − [αN!] ,
where [αN!] denotes the largest integer less than or equal to αN!. Let M^+(z) and M^0(z) be the number of values T^{(j)}_{m,n}(z) (j = 1, . . . , N!) which are greater than T^{(k)}_{m,n}(z) and equal to T^{(k)}_{m,n}(z), respectively. Set

a(z) = (αN! − M^+(z)) / M^0(z) .
Define the randomization test function φ(Z) to be equal to 1, a(Z), or 0 according to whether T_{m,n}(Z) > T^{(k)}_{m,n}(Z), T_{m,n}(Z) = T^{(k)}_{m,n}(Z), or T_{m,n}(Z) < T^{(k)}_{m,n}(Z), respectively.
Then, under any (P,Q) ∈ Ω,
EP,Q[φ(X1, . . . , Xm, Y1, . . . , Yn)] = α .
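The construction above can be sketched in code. A minimal illustration with a hypothetical toy data set (real applications with larger N would sample permutations at random instead of enumerating all N! of them; integer data keep the tie counts exact):

```python
# Sketch of the exact permutation test described above. For small N we can
# enumerate all N! orderings of the pooled sample.
from itertools import permutations

def perm_test(x, y, stat, alpha=0.05):
    """Return the randomized test function phi(Z) in {0, a(Z), 1}."""
    z = tuple(x) + tuple(y)
    m = len(x)
    # Recompute the statistic at every permutation of the pooled sample.
    vals = sorted(stat(p[:m], p[m:]) for p in permutations(z))
    nfact = len(vals)                       # N!
    k = nfact - int(alpha * nfact)          # k = N! - [alpha N!]
    t_k = vals[k - 1]                       # k-th smallest value, T^(k)
    m_plus = sum(v > t_k for v in vals)     # M+(z): values above T^(k)
    m_zero = sum(v == t_k for v in vals)    # M0(z): values equal to T^(k)
    a = (alpha * nfact - m_plus) / m_zero
    t_obs = stat(x, y)
    if t_obs > t_k:
        return 1.0
    return a if t_obs == t_k else 0.0

# For equal sample sizes, sum(x) - sum(y) orders the permutations exactly as
# the difference of means does, while staying in exact integer arithmetic.
stat = lambda x, y: sum(x) - sum(y)
phi = perm_test((3, 1, 4), (1, 5, 9), stat)
```

Averaging φ over all N! rearrangements of a fixed pooled data set recovers α exactly, which is the identity behind the exactness statement E_{P,Q}[φ] = α when P = Q.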
Also, define the permutation distribution as

R^T_{m,n}(t) = (1/N!) Σ_{π ∈ G_N} I{T_{m,n}(Z_{π(1)}, . . . , Z_{π(N)}) ≤ t} , (2)

where G_N denotes the N! permutations of {1, 2, . . . , N}. Roughly speaking (after accounting for discreteness), the permutation test rejects H0 if the test statistic T_{m,n} (evaluated at the original data set) exceeds T^{(k)}_{m,n}, that is, a 1 − α quantile of this permutation distribution.
However, problems arise if Ω0 is strictly bigger than Ω̄. Since a permuted
data set no longer has the same distribution as the original data set, the argument leading
to the exact construction of a level α test fails, and faulty inferences can occur.
To be concrete, consider constructing a permutation test based on the difference of
sample means
T_{m,n} = m^{1/2}(X̄_m − Ȳ_n) .
Note that we are not taking the absolute difference, so that the test is one-sided, as
we are rejecting for large positive values of the difference. First of all, one needs to be
very careful in deciding what family of distributions Ω0 is being tested under the null
hypothesis. If the null specifies P = Q, then without further assumptions, a test based
on Xm − Yn is not appropriate. First of all, even if P = Q so that the permutation
construction will result in probability of rejection equal to α, the test clearly will not
have any power against distributions P and Q whose means are identical but P ≠ Q.
The test is only warranted if it can be assumed that lack of equality of distributions is
accompanied by a corresponding change in population means. Such an assumption may
be inappropriate. Consider the case where one group receives a treatment and the other
a placebo. Then, no treatment effect may arguably be considered equivalent to both
groups receiving a placebo, in which case the distributions would be the same. However,
even in this case, if there is an effect due to treatment, P and Q may differ not only in
location but also in other aspects of the distribution such as scale and shape. Moreover,
if the two groups being compared are distinct in a way other than the assignment of
treatment or placebo, as in comparing educational achievement between boys and girls,
then it is especially crucial to clarify what is being tested and the implicit underlying
assumptions.
In such cases, the permutation test based on the difference of sample means is only
appropriate as a test of equality of population means. However, the permutation test no
longer controls the level of the test, even in large samples. As is well-known (Romano,
1990), the permutation test possesses a certain asymptotic robustness as a test of difference in means if m/n → 1 as n → ∞, or the underlying variances of P and Q are equal,
in the sense that the rejection probability under the null hypothesis of equal means tends
to the nominal level. Without equal variances and comparable sample sizes, the rejec-
tion probability can be much larger than the nominal level, which is a concern. Because
of the lack of robustness and the increased probability of a Type 1 error, rejection of the
null may incorrectly be interpreted as rejection of equal means, when in fact it is caused
by unequal variances and unequal sample sizes. Even more alarming is the possibility of
rejecting a one-sided null hypothesis in favor of a positive mean difference when in fact
the difference in means is negative. Further note that there is also the possibility that
the rejection probability can be much less than the nominal level, which by continuity
implies the test is biased and has little power to detect a true difference in means.
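This size distortion is easy to reproduce by simulation. A minimal sketch (the sample sizes, variances, permutation count B, and replication count are arbitrary illustrative choices, and the permutation distribution is approximated by randomly sampled permutations rather than all N!):

```python
# Monte Carlo illustration: the one-sided permutation test based on the raw
# difference of means over-rejects when the smaller sample has the larger
# variance, even though both population means are equal.
import random

random.seed(0)

def perm_pvalue(x, y, B=300):
    """One-sided p-value for T = mean(x) - mean(y) from B random permutations."""
    z = list(x) + list(y)
    m, n = len(x), len(y)
    t_obs = sum(x) / m - sum(y) / n
    hits = 0
    for _ in range(B):
        random.shuffle(z)
        if sum(z[:m]) / m - sum(z[m:]) / n >= t_obs:
            hits += 1
    return (hits + 1) / (B + 1)

m, n, alpha, nsim = 10, 40, 0.05, 400
rejections = 0
for _ in range(nsim):
    x = [random.gauss(0, 4) for _ in range(m)]   # P = N(0, 16), small sample
    y = [random.gauss(0, 1) for _ in range(n)]   # Q = N(0, 1), large sample
    rejections += perm_pvalue(x, y) <= alpha
rate = rejections / nsim   # well above the nominal 0.05 in this configuration
```

Here the asymptotic rejection probability works out to roughly 0.18 rather than 0.05, and the simulated rate reflects that gap.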
The situation is even worse when basing a test on a difference in sample medians,
in the sense that regardless of sample sizes, the asymptotic rejection probability of the
permutation test will be α only under very stringent conditions, essentially only when the underlying distributions are the same.
However, in a very insightful paper in the context of random censoring models,
Neuhaus (1993) first realized that by proper studentization of a test statistic, the per-
mutation test can result in asymptotically valid inference even when the underlying
distributions are not the same. Later, Janssen (1997) showed that, in the case of the
difference of sample means, by proper studentization of a test statistic, the permutation
test is a valid asymptotic approach. In particular, his results imply that, if the underly-
ing population means are identical (and population variances are finite and may differ),
then the asymptotic rejection probability of the permutation test is α. Furthermore, the
use of the permutation test retains the property that the exact rejection probability is
α if the underlying distributions are identical. This result has been extended to other
specific problems, such as comparing variances by Pauly (2010) and the two-sample
Wilcoxon test by Neubert and Brunner (2007). Other results on permutation tests are
presented in Janssen (2005), Janssen and Pauls (2003), and Janssen and Pauls (2005).
The goal of this paper is to obtain a quite general result of the same phenomenon.
That is, when basing a permutation test using some test statistic as a test of a parameter
(usually a difference of parameters associated with marginal distributions), we would like
to retain the exactness property when P = Q, and also have the rejection probability be
α for the more general null hypothesis specifying the parameter (such as the difference
being zero). Of course, there are many alternatives to getting asymptotic tests, such as
the bootstrap or subsampling. However, we do not wish to give up the exactness property
under P = Q, and resampling methods do not have such finite sample properties. The
main problem becomes: what is the asymptotic behavior of R^T_{m,n}(·) defined in (2) for
general test statistic sequences Tm,n when the underlying distributions differ. Only
for suitable test statistics is it possible to achieve both finite sample exactness when the
underlying distributions are equal, but also maintain a large sample rejection probability
near the nominal level when the underlying distributions need not be equal. In this sense,
our results are both exact and asymptotically robust for heterogenous populations.
This paper provides a framework for testing a parameter that depends on P and Q.
We construct a general test procedure where the asymptotic validity of the permutation
test holds in a general setting. Assuming that estimators are asymptotically linear and
consistent estimators are available for their asymptotic variance, we provide a test that
has asymptotic rejection probability equal to the nominal level α, but still retains the
exact rejection probability of α in finite samples if P = Q. It is not even required
that the estimators are based on differentiable functionals, and some methods like the
bootstrap would not necessarily be even asymptotically valid under such conditions,
let alone retain the finite sample exactness property when P = Q. The arguments of the paper are quite different from those of Janssen and previous authors, and hold under great
generality. For example, they immediately apply to comparing means, variances, or
medians. The key idea is to show that the permutation distribution behaves like the
unconditional distribution of the test statistic when all N observations are i.i.d. from
the mixture distribution pP + (1 − p)Q, where p is such that m/N → p. This
seems intuitive because the permutation distribution permutes the observations so that a
permuted sample is almost like a sample from the mixture distribution. In order to make
this idea precise, a coupling argument is given in Section 3.3. Of course, the permutation
distribution depends on all permuted samples (for a given original data set). But even
for one permuted data set, it cannot exactly be viewed as a sample from pP + (1− p)Q.
Indeed, the first m observations from the mixture would include Bm observations from
P and the rest from Q, where Bm has the binomial distribution based on m trials and
success probability p. On the other hand, for a permuted sample, if Hm denotes the
number of observations from P , then Hm has the hypergeometric distribution with mean
mp. The key argument that allows for such a general result concerns the contiguity of
the distributions of Bm and Hm. Section 3 highlights the main technical ideas required
for the proofs. Section 4 applies these ideas to the k-sample Behrens-Fisher problem,
though no assumption of normality is required. Once again, exact level is achieved when
all k distributions are equal, but the asymptotic rejection probability equals the nominal
level under the null hypothesis of mean equality (under a finite variance assumption).
Lastly, Monte Carlo simulation studies illustrating our results are presented in Section
5. All proofs are reserved for the appendix.
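The binomial/hypergeometric distinction behind the contiguity argument can be seen numerically. A small sketch (the sample sizes and the number of draws are arbitrary): both counts have mean mp, while the hypergeometric count is less dispersed by the finite-population factor (N − m)/(N − 1).

```python
# Compare H_m (hypergeometric: P-observations among the first m slots of a
# permuted pooled sample) with B_m (binomial: P-observations among m i.i.d.
# draws from the mixture pP + (1-p)Q). Means agree; variances differ.
import random

random.seed(1)
m, n = 60, 40
N = m + n
p = m / N
pool = [1] * m + [0] * n          # 1 marks an observation from P
draws = 20000

H = [sum(random.sample(pool, m)) for _ in range(draws)]                  # H_m
B = [sum(random.random() < p for _ in range(m)) for _ in range(draws)]   # B_m

mean_H = sum(H) / draws            # both means are near m*p = 36
mean_B = sum(B) / draws
var_H = sum((h - mean_H) ** 2 for h in H) / draws
var_B = sum((b - mean_B) ** 2 for b in B) / draws   # var_H ≈ var_B * (N-m)/(N-1)
```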
2 Robust Studentized Two-sample Test
In this section, we consider the general problem of inference from the permutation
distribution when comparing parameters from two populations. Specifically, assume
X1, . . . , Xm are i.i.d. P and, independently, Y1, . . . , Yn are i.i.d. Q. Let θ(·) be a real-
valued parameter, defined on some space of distributions P . The problem is to test the
null hypothesis
H0 : θ(P ) = θ(Q) . (3)
Of course, when P = Q, one can construct permutation tests with exact level α. Unfortunately, if P ≠ Q, the test need not be valid in the sense that the probability of a Type
1 error need not be α even asymptotically. Thus, our goal is to construct a procedure
that has asymptotic rejection probability equal to α quite generally, but also retains the
exactness property in finite samples when P = Q.
We will assume that estimators are available that are asymptotically linear. Specif-
ically, assume that, under P , there exists an estimator θm = θm(X1, . . . , Xm) which
satisfies
m^{1/2}[θ_m − θ(P)] = (1/√m) Σ_{i=1}^{m} f_P(X_i) + o_P(1) . (4)
Similarly, we assume that, based on the Yj (under Q),
n^{1/2}[θ_n − θ(Q)] = (1/√n) Σ_{j=1}^{n} f_Q(Y_j) + o_Q(1) . (5)
The functions determining the linear approximation fP and fQ can of course depend on
the underlying distributions. Different forms of differentiability guarantee such linear
expansions in the special case when θ_m takes the form of an empirical estimate θ(P̂_m), where P̂_m is the empirical measure constructed from X_1, . . . , X_m, but we will not need
to assume such stronger conditions. We will argue that our assumptions of asymptotic
linearity already imply a result about the permutation distribution corresponding to the
statistic m^{1/2}[θ_m(X_1, . . . , X_m) − θ_n(Y_1, . . . , Y_n)], without having to impose any differentiability assumptions. However, we will assume the expansion (4) holds not just for i.i.d. samples under P and under Q, but also when sampling i.i.d. observations from the mixture distribution P̄ = pP + qQ, where q = 1 − p. This is a weak assumption and replaces having
to study the permutation distribution based on variables that are no longer indepen-
dent nor identically distributed with a simple assumption about the behavior under an
i.i.d. sequence. Indeed, we will argue that in all cases, the permutation distribution be-
haves asymptotically like the unconditional limiting sampling distribution of the studied
statistic sequence when sampling i.i.d. observations from P̄.
Theorem 2.1. Assume X1, . . . , Xm are i.i.d. P and, independently, Y1, . . . , Yn are i.i.d.
Q. Consider testing the null hypothesis (3) based on a test statistic of the form
Tm,n = m1/2[θm(X1, . . . , Xm)− θn(Y1, . . . , Yn)] ,
where the estimators satisfy (4) and (5). Further assume E_P f_P(X_i) = 0 and

0 < E_P f_P^2(X_i) ≡ σ^2(P) < ∞ ,
and the same with P replaced by Q. Let m → ∞, n → ∞, with N = m + n, p_m = m/N, q_m = n/N, and p_m → p ∈ (0, 1) with

p_m − p = O(m^{−1/2}) . (6)

Assume the estimator sequence also satisfies (4) with P replaced by P̄ = pP + qQ, with σ^2(P̄) < ∞.
Then, the permutation distribution of T_{m,n} given by (2) satisfies

sup_t |R^T_{m,n}(t) − Φ(t/τ(P̄))| → 0 in probability,

where

τ^2(P̄) = σ^2(P̄) + (p/(1 − p)) σ^2(P̄) = (1/(1 − p)) σ^2(P̄) . (7)
Remark 2.1. Under H0, the true unconditional sampling distribution of T_{m,n} is asymptotically normal with mean 0 and variance

σ^2(P) + (p/(1 − p)) σ^2(Q) , (8)

which does not equal τ^2(P̄) defined by (7) in general.
Example 2.1. (Difference of Means) As is well-known, even for the case of comparing population means by sample means, equality of (7) and (8) holds if and only if p = 1/2 or σ^2(P) = σ^2(Q).
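For the difference of means, f_P(x) = x − E_P(X), so under the null of equal means σ^2(P̄) is just the variance of the mixture, p σ^2(P) + (1 − p) σ^2(Q). The comparison of (7) with (8) can then be checked directly; the particular values below are arbitrary illustrative choices:

```python
# tau^2(Pbar) from (7) versus the true sampling variance (8), for the mean
# functional under the null of equal means.
def tau2(p, var_p, var_q):
    var_mix = p * var_p + (1 - p) * var_q   # sigma^2(Pbar) for the mean
    return var_mix / (1 - p)                # (7)

def true_var(p, var_p, var_q):
    return var_p + p / (1 - p) * var_q      # (8)

# Nonzero when p != 1/2 and the variances differ:
gap = tau2(0.3, 4.0, 1.0) - true_var(0.3, 4.0, 1.0)
```

A little algebra confirms the claim of the example: the two expressions differ by a multiple of (2p − 1)(σ^2(P) − σ^2(Q)).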
Example 2.2. (Difference of Medians) Let F and G denote the c.d.f.s corresponding
to P and Q. Let θ(F) denote the median of F, i.e., θ(F) = inf{x : F(x) ≥ 1/2}. Then,
it is well known (Serfling, 1980) that, if F is continuously differentiable at θ(P ) with
derivative F ′ (and the same with F replaced by G), then
m^{1/2}[θ(P̂_m) − θ(P)] = (1/√m) Σ_{i=1}^{m} [1/2 − I{X_i ≤ θ(P)}] / F′(θ(P)) + o_P(1)
and similarly,
n^{1/2}[θ(Q̂_n) − θ(Q)] = (1/√n) Σ_{j=1}^{n} [1/2 − I{Y_j ≤ θ(Q)}] / G′(θ(Q)) + o_Q(1) .
Thus, we can apply Theorem 2.1 and conclude that, when θ(P) = θ(Q) = θ, the
permutation distribution of Tm,n is approximately a normal distribution with mean 0
and variance

1 / (4(1 − p)[pF′(θ) + (1 − p)G′(θ)]^2)
in large samples. On the other hand, the true sampling distribution is approximately a
normal distribution with mean 0 and variance
v^2(P,Q) ≡ 1/(4[F′(θ)]^2) + (p/(1 − p)) · 1/(4[G′(θ)]^2) . (9)
Thus, the permutation distribution and the true unconditional sampling distribution
behave differently asymptotically unless F ′(θ) = G′(θ) is satisfied. Since we do not
assume P = Q, this condition is a strong assumption. Hence, the permutation test
for testing equality of medians is generally not valid in the sense that the rejection
probability tends to a value that is far from the nominal level α.
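Plugging concrete distributions into the two formulas makes the mismatch visible. A small numeric sketch using F = N(0,1) and G = N(0,4) (chosen for illustration), which share median θ = 0 but have unequal densities there:

```python
# Permutation-limit variance vs. the true sampling variance (9) for the
# difference of medians, with F = N(0,1), G = N(0,4), p = 1/2, theta = 0.
import math

phi0 = 1 / math.sqrt(2 * math.pi)   # standard normal density at 0
p = 0.5
f_theta = phi0                      # F'(0) for N(0,1)
g_theta = phi0 / 2                  # G'(0) for N(0,4): density scales by 1/sd

perm_var = 1 / (4 * (1 - p) * (p * f_theta + (1 - p) * g_theta) ** 2)
true_var = 1 / (4 * f_theta ** 2) + (p / (1 - p)) * 1 / (4 * g_theta ** 2)
# perm_var = 16*pi/9 (about 5.59) while true_var = pi/2 + 2*pi (about 7.85),
# so the unstudentized permutation test for medians uses the wrong variance.
```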
Remark 2.2. The assumption (6) is of course a little stronger than the more basic
assumption m/N → p, where no rate is required on the difference between m/N and p. Of
course, we are free to choose p as m/N in any situation, and the assumption is rather
innocuous. (Indeed, for any m0 and N0 with m0/N0 = p, we can always let m and N
tend to infinity with m = km0 and N = kN0 and let k → ∞.) Alternatively, we can
replace (6) with the more basic assumption m/N → p as long as we slightly strengthen
the basic assumption that the statistic has a linear expansion under P̄ = pP + qQ to
also have a linear expansion under sequences
P̄_{m,n} = (m/N) P + (n/N) Q ,
which is a rather weak form of local uniform triangular array type of convergence. We
prefer to assume the convergence hypothesis based on an i.i.d. sequence from a fixed P̄,
though it is really a matter of choice. Usually, we can appeal to some basic convergence in distribution results with ease, but if linear expansions are available (or can be derived)
which are “uniform” in the underlying probability distribution near P , then such results
can be used instead with the weaker hypothesis pm → p.
The main goal now is to show how studentizing the test statistic leads to a general
correction.
Theorem 2.2. Assume the setup and conditions of Theorem 2.1. Further assume that σ̂_m(X_1, . . . , X_m) is a consistent estimator of σ(P) when X_1, . . . , X_m are i.i.d. P. Assume consistency also under Q and P̄, so that σ̂_n(V_1, . . . , V_n) → σ(P̄) in probability as n → ∞ when the V_i are i.i.d. P̄. Define the studentized test statistic
S_{m,n} = T_{m,n} / V_{m,n} , (10)
where
V_{m,n} = [σ̂_m^2(X_1, . . . , X_m) + (m/n) σ̂_n^2(Y_1, . . . , Y_n)]^{1/2} ,
and consider the permutation distribution defined in (2) with T replaced by S. Then,
sup_t |R^S_{m,n}(t) − Φ(t)| → 0 in probability . (11)
Thus, the permutation distribution is asymptotically standard normal, as is the true
unconditional limiting distribution of the test statistics Sm,n. Indeed, as mentioned in
Remark 2.1, the true unconditional limiting distribution of Tm,n is normal with mean 0
and variance given by (8). But, when sampling m observations from P and n from Q, V_{m,n}^2 tends in probability to (8), and hence the limiting distribution of S_{m,n} is standard normal, the same as that of the permutation distribution.
Example 2.1. (continued) As proved by Janssen (1997), even when the underlying
distributions may have different variances and different sample sizes, permutation tests
based on studentized statistics
T_{m,n} = m^{1/2}(X̄_m − Ȳ_n) / [S_X^2 + (m/n) S_Y^2]^{1/2} ,

where S_X^2 = (1/(m − 1)) Σ_{i=1}^{m} (X_i − X̄_m)^2 and S_Y^2 = (1/(n − 1)) Σ_{j=1}^{n} (Y_j − Ȳ_n)^2, can allow one to
construct a test that attains asymptotic rejection probability α when P 6= Q while
providing an additional advantage of maintaining exact level α when P = Q.
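A sketch of this studentized permutation test (random sampling of permutations, and the specific sample sizes and seed, are illustrative choices, not part of the theory):

```python
# Studentized two-sample permutation test of Example 2.1: the statistic is
# recomputed, including its studentization, at each permutation.
import random
from statistics import mean, variance

def studentized(x, y):
    m, n = len(x), len(y)
    # variance() is the unbiased 1/(m-1) form, matching S_X^2 and S_Y^2.
    return m ** 0.5 * (mean(x) - mean(y)) / (variance(x) + m / n * variance(y)) ** 0.5

def studentized_perm_pvalue(x, y, B=999):
    z = list(x) + list(y)
    m = len(x)
    s_obs = studentized(x, y)
    hits = 0
    for _ in range(B):
        random.shuffle(z)
        if studentized(z[:m], z[m:]) >= s_obs:
            hits += 1
    return (hits + 1) / (B + 1)

random.seed(2)
x = [random.gauss(0, 3) for _ in range(15)]   # unequal variances ...
y = [random.gauss(0, 1) for _ in range(45)]   # ... and unequal sample sizes
pval = studentized_perm_pvalue(x, y)          # asymptotically valid despite P != Q
```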
Example 2.2. (continued) Define the studentized median statistic
M_{m,n} = m^{1/2}[θ(P̂_m) − θ(Q̂_n)] / v_{m,n} ,
where vm,n is a consistent estimator of v(P,Q) defined in (9). There are several choices for
a consistent estimator of v(P,Q). Examples include the usual kernel estimator (Devroye
and Wagner, 1980), bootstrap estimator (Efron, 1979), and the smoothed bootstrap
(Hall, DiCiccio, and Romano, 1989).
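One concrete choice, sketched below, is the bootstrap variance estimator applied separately to each sample; the scaling matches (9) in that each term estimates n·Var of that sample's median. (The bootstrap size B and the sample sizes are arbitrary tuning choices for illustration.)

```python
# Bootstrap estimate of the scaled variance of a sample median, and the
# resulting studentized difference-of-medians statistic M_{m,n}.
import random
from statistics import median

def boot_var_median(sample, B=500, rng=random):
    """Estimate n * Var(median) by resampling with replacement."""
    n = len(sample)
    meds = [median(rng.choices(sample, k=n)) for _ in range(B)]
    mbar = sum(meds) / B
    return n * sum((v - mbar) ** 2 for v in meds) / (B - 1)

def studentized_median(x, y, rng=random):
    m, n = len(x), len(y)
    v2 = boot_var_median(x, rng=rng) + (m / n) * boot_var_median(y, rng=rng)
    return m ** 0.5 * (median(x) - median(y)) / v2 ** 0.5

random.seed(3)
x = [random.gauss(0, 1) for _ in range(200)]
# For N(0,1), n * Var(median) should be near 1/(4*phi(0)^2) = pi/2, about 1.57.
v_hat = boot_var_median(x)
```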
Remark 2.3. Suppose that the true unconditional distribution of a test statistic T_{m,n} is, under
the null hypothesis, asymptotically given by a distribution R(·). Typically a test rejects
when Tm,n > rm,n, where rm,n is nonrandom, as happens in many classical settings.
Then, we typically have r_{m,n} → r(1 − α) ≡ R^{−1}(1 − α). Assume that T_{m,n} converges
to some limit law R′(·) under some sequence of alternatives which are contiguous to
some distribution satisfying the null. Then, the power of the test against such a se-
quence would tend to 1 − R′(r(1 − α)). The point here is that, under the conditions
of Theorem 2.2, the permutation test based on a random critical value r_{m,n} obtained from the permutation distribution satisfies, under the null, r_{m,n} → r(1 − α) in probability. But then,
contiguity implies the same behavior under a sequence of contiguous alternatives. Thus,
the permutation test has the same limiting local power as the “classical” test which uses
the nonrandom critical value. So, to first order, there is no loss in power in using a
permutation critical value. Of course, there are big gains because the permutation test
applies much more broadly than for usual parametric models, in that it retains the level
exactly across a broad class of distributions and is at least asymptotically justified for a
large nonparametric family.
3 Four Technical Ingredients
In this section, we discuss four separate ingredients, from which the main results flow.
These results are separated out so they can easily be applied to other problems and so
that the main technical arguments are highlighted. The first two apply more generally
to randomization tests, not just permutation tests, and are stated as such.
3.1 Hoeffding’s Condition
Suppose data X^n has distribution P_n in a sample space X_n, and G_n is a finite group of transformations g of X_n onto itself. For a given statistic T_n = T_n(X^n), let R^T_n(·) denote the randomization distribution of T_n, defined by

R^T_n(t) = (1/|G_n|) Σ_{g ∈ G_n} I{T_n(gX^n) ≤ t} . (12)
(In the case of permutation tests, Xn corresponds to Z = (X1, . . . , Xm, Y1, . . . , Yn) and g
varies over the permutations of 1, . . . , N.) Hoeffding (1952) gave a sufficient condition
to derive the limiting behavior of RTn (·). This condition is verified repeatedly in the
proofs, but we add the result that the condition is also necessary.
Theorem 3.1. Let G_n and G′_n be independent and uniformly distributed over G_n (and independent of X^n). Suppose, under P_n,

(T_n(G_n X^n), T_n(G′_n X^n)) →d (T, T′) , (13)

where T and T′ are independent, each with common c.d.f. R^T(·). Then, for all continuity points t of R^T(·),

R^T_n(t) → R^T(t) in probability . (14)

Conversely, if (14) holds for some limiting c.d.f. R^T(·) whenever t is a continuity point, then (13) holds.
The reason we think it is important to add the necessity part of the result is that
our methodology is somewhat different than that of other authors mentioned in the
introduction, who take a more conditional approach to proving limit theorems. After all,
the permutation distribution is indeed a distribution conditional on the observed set of
observations (without regard to ordering). However, the theorem shows that a sufficient
condition is obtained by verifying an unconditional weak convergence property, which
may look surprising at first in that it includes additional auxiliary randomization G′ in
its statement. Nevertheless, simple arguments (see the appendix) show the condition is
indeed necessary and so taking such an approach is not fanciful.
3.2 Slutsky’s Theorem for Randomization Distributions
Consider the general setup of Subsection 3.1. The result below describes Slutsky’s the-
orem in the context of randomization distributions. In this context, the randomization
distributions are random themselves, and therefore the usual Slutsky’s theorem does not
quite apply. Because of its utility in the proofs of our main results, we highlight the
statement. Given sequences of statistics T_n, A_n and B_n, let R^{AT+B}_n(·) denote the randomization distribution corresponding to the statistic sequence A_n T_n + B_n, i.e. replace T_n in (12) by A_n T_n + B_n, so

R^{AT+B}_n(t) ≡ (1/|G_n|) Σ_{g ∈ G_n} I{A_n(gX^n) T_n(gX^n) + B_n(gX^n) ≤ t} . (15)
Theorem 3.2. Let G_n and G′_n be independent and uniformly distributed over G_n (and independent of X^n). Assume T_n satisfies (13). Also, assume

A_n(G_n X^n) → a in probability (16)

and

B_n(G_n X^n) → b in probability , (17)

for constants a and b. Let R^{aT+b}(·) denote the distribution of aT + b, where T is the limiting random variable assumed in (13). Then,

R^{AT+B}_n(t) → R^{aT+b}(t) in probability ,

if the distribution R^{aT+b}(·) of aT + b is continuous at t. (Of course, R^{aT+b}(t) = R^T((t − b)/a) if a > 0.)
Remark 3.1. Under the randomization hypothesis that the distribution of X^n is the same as that of gX^n for any g ∈ G_n, the conditions (16) and (17) are equivalent to the assumptions that A_n(X^n) → a and B_n(X^n) → b in probability, i.e. convergence in probability based on the original sample X^n without first transforming by a random G_n. For more on the randomization hypothesis, see Section 15.2 of Lehmann and Romano (2005).
3.3 A Coupling Construction
Consider the general situation where k samples are observed from possibly different
distributions. Specifically, assume for i = 1, . . . , k that X_{i,1}, . . . , X_{i,n_i} is a sample of n_i i.i.d. observations from P_i. All N ≡ Σ_i n_i observations are mutually independent. Put
all the observations together in one vector
Z = (X1,1, . . . , X1,n1 , X2,1, . . . , X2,n2 , . . . , Xk,1, . . . , Xk,nk) .
The basic intuition driving the results concerning the behavior of the permutation
distribution stems from the following. Since the permutation distribution considers the
empirical distribution of a statistic evaluated at all permutations of the data, it clearly
does not depend on the ordering of the observations. Let n_i/N denote the proportion of observations in the ith sample, and assume that n_i → ∞ in such a way that

p_i − n_i/N = O(N^{−1/2}) . (18)
Then the behavior of the permutation distribution based on Z should behave approximately like the behavior of the permutation distribution based on a sample of N i.i.d. observations

Z̄ = (Z̄_1, . . . , Z̄_N)

from the mixture distribution

P̄ ≡ p_1 P_1 + · · · + p_k P_k .
Of course, we can think of the N observations generated from P̄ as arising out of a two-stage process: for i = 1, . . . , N, first draw an index j at random with probability p_j; then, conditional on the outcome being j, sample Z̄_i from P_j. However, aside from the fact that the ordering of the observations in Z is clearly that of n_1 observations from P_1, followed by n_2 observations from P_2, etc., the original sampling scheme is still only approximately like that of sampling from P̄. For example, the number of observations Z̄_i out of the N which are from P_1 is binomial with parameters N and p_1 (and so has mean equal to p_1 N ≈ n_1), while the number of observations from P_1 in the original sample Z is exactly n_1.
Along the same lines, let π = (π(1), . . . , π(N)) denote a random permutation of {1, . . . , N}. Then, if we consider a random permutation of both Z and Z̄, the number of observations in the first n_1 coordinates of Z which were Xs has the hypergeometric distribution, while the number of observations in the first n_1 coordinates of Z̄ which were Xs is still binomial.
We can make a more precise statement by constructing a certain coupling of Z and Z̄. That is, except for ordering, we can construct Z̄ to include almost the same set of observations as in Z. The simple idea goes as follows. Given Z, we will construct observations Z̄_1, . . . , Z̄_N via the two-stage process as above, using the observations in Z to make up the Z̄_i as much as possible. First, draw an index j among {1, . . . , k} at random with probability p_j; then, conditionally on the outcome being j, set Z̄_1 = X_{j,1}. Next, if the next index i drawn among {1, . . . , k} at random with probability p_i is different from the j from which Z̄_1 was sampled, then Z̄_2 = X_{i,1}; otherwise, if i = j as in the first step, set Z̄_2 = X_{j,2}. In other words, we are going to continue to use the observations in Z to fill in the observations Z̄_i. However, after a certain point, we will get stuck because we
will have already exhausted all the n_j observations from the jth population governed by P_j. If this happens and an index j is drawn again, then just sample a new observation X_{j,n_j+1} from P_j. Continue in this manner so that as many as possible of the original observations in Z are used in the construction of Z̄. Now, we have both Z and Z̄, and at this point they have many of the same observations in common. The number of observations which differ, say D, is the (random) number of added observations required to fill up Z̄. (Note that we are obviously using the word “differ” here to mean the observations are generated from different mechanisms, though in fact there may be a positive probability that the observations still are equal if the underlying distributions have atoms. Still, we count such observations as differing.)
Moreover, we can reorder the observations in Z̄ by a permutation π_0 so that Z_i and Z̄_{π_0(i)} agree for all i except for some hopefully small (random) number D. To do this, recall that Z has the observations in order, i.e., the first n_1 observations arose from P_1 and the next set of n_2 observations came from P_2, etc. Thus, to couple Z and Z̄, simply put all the observations in Z̄ which came from P_1 first, up to n_1 of them. That is, if the number of observations in Z̄ from P_1 is greater than or equal to n_1, then Z̄_{π_0(i)} for i = 1, . . . , n_1 are filled with the observations in Z̄ which came from P_1, and if the number was strictly greater than n_1, the extras are put aside for now. On the other hand, if the number of observations in Z̄ which came from P_1 is less than n_1, fill up as many of the first n_1 spots as possible and leave the remaining spots blank for now. Next, move on to the observations in Z̄ which came from P_2 and repeat the above procedure for spots n_1 + 1, . . . , n_1 + n_2; i.e., starting from spot n_1 + 1, fill in as many of the observations in Z̄ which came from P_2 as possible, up to n_2 of them. After going through all the distributions P_i from which the observations in Z̄ came, one must then complete Z̄_{π_0}; simply “fill up” the empty spots with the remaining observations that have been put aside. (At this point, it does not matter where each of the remaining observations gets inserted; but, to be concrete, fill the empty slots by inserting the observations which came from each P_i in chronological order from when they were constructed.) This permuting of observations in Z̄ corresponds to a permutation π_0 and satisfies Z_i = Z̄_{π_0(i)} for all indices i except for D of them.
For example, suppose there are k = 2 populations. Suppose that N_1 of the Z̄ observations came from P_1, and so N − N_1 from P_2. Of course, N_1 is random and has the binomial distribution with parameters N and p_1. If N_1 ≥ n_1, then the above construction makes the first n_1 observations in Z and Z̄_{π_0} agree completely. Furthermore, if N_1 > n_1, then the number of observations in Z̄ from P_2 is N − N_1 < N − n_1 = n_2, and N − N_1 of the last n_2 indices in Z match those of Z̄_{π_0}, with the rest differing. In this situation, we have

Z = (X_1, . . . , X_{n_1}, Y_1, . . . , Y_{n_2})

and

Z̄_{π_0} = (X_1, . . . , X_{n_1}, Y_1, . . . , Y_{N−N_1}, X_{n_1+1}, . . . , X_{N_1}) ,

so that Z and Z̄_{π_0} differ only in the last N_1 − n_1 places. In the opposite situation, where N_1 < n_1, Z and Z̄_{π_0} are equal in the first N_1 and last n_2 places, differing only in spots N_1 + 1, . . . , n_1.
The number of observations D where Z and Z̄π0 differ is random, and we now analyze how large it is. Let Nj denote the number of observations in Z̄ which are generated from Pj. Then, (N1, . . . , Nk) has the multinomial distribution based on N trials and success probabilities (p1, . . . , pk). In terms of the Nj, the number of differing observations in the above coupling construction is

D = Σ_{j=1}^k max(nj − Nj, 0) .
If we assume pj > 0 for all j, then by the usual central limit theorem,

Nj − Npj = OP(N^{1/2}) ,

which together with (18) yields

Nj − nj = (Nj − Npj) + (Npj − nj) = OP(N^{1/2}) .
It follows that D = OP(N^{1/2}), and so D/N converges to 0 in probability. It also follows that

E(D) ≤ Σ_{j=1}^k E|Nj − nj| ≤ Σ_{j=1}^k [ E|Nj − pjN| + |pjN − nj| ]

≤ Σ_{j=1}^k {E[(Nj − Npj)²]}^{1/2} + O(N^{1/2}) = Σ_{j=1}^k [Npj(1 − pj)]^{1/2} + O(N^{1/2}) = O(N^{1/2}) .
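The order of D is easy to check by simulation. The following sketch (illustrative Python, not from the paper; all names are ours) draws the multinomial counts Nj directly and evaluates D = Σj max(nj − Nj, 0):

```python
import random

def coupling_mismatch_count(sizes, rng):
    """Draw (N_1, ..., N_k) ~ Multinomial(N, p) with p_j = n_j / N and
    return D = sum_j max(n_j - N_j, 0), the number of coordinates at
    which the coupling construction may force Z and Zbar_pi0 to differ."""
    N = sum(sizes)
    probs = [nj / N for nj in sizes]
    counts = [0] * len(sizes)
    for _ in range(N):                       # one mixture label per observation
        u, acc = rng.random(), 0.0
        for j, p in enumerate(probs):
            acc += p
            if u < acc:
                counts[j] += 1
                break
        else:                                # guard against round-off in acc
            counts[-1] += 1
    return sum(max(nj - c, 0) for nj, c in zip(sizes, counts))

rng = random.Random(0)
for N in (400, 10_000):
    sizes = [N // 2, N // 4, N - N // 2 - N // 4]   # k = 3 groups
    avg_D = sum(coupling_mismatch_count(sizes, rng) for _ in range(50)) / 50
    print(N, avg_D / N ** 0.5)   # ratio stays roughly constant: D = O_P(N^{1/2})
```

For the sizes above the printed ratio hovers around a constant as N grows, while avg_D/N shrinks, matching D/N → 0 in probability.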
In summary, the coupling construction shows that, with high probability, only a small fraction of the N observations in Z and Z̄π0 differ. Therefore, if the randomization distribution is based on a statistic TN(Z) such that the difference TN(Z) − TN(Z̄π0) is small in some sense whenever Z and Z̄π0 mostly agree, then one should be able to deduce the behavior of the permutation distribution under samples from P1, . . . , Pk from the behavior of the permutation distribution when all N observations come from the same distribution P̄. Whether or not this can be done requires some knowledge of the form of
the statistic, but intuitively it should hold if the statistic cannot strongly be affected by a
change in a small proportion of the observations; its validity though must be established
on a case by case basis. The point is that it is a worthwhile and beneficial route to pur-
sue because the behavior of the permutation distribution under N i.i.d. observations is
typically much easier to analyze than under the more general setting when observations
have possibly different distributions. Furthermore, the behavior under i.i.d. observa-
tions seems fundamental as this is the requirement for the “randomization hypothesis”
to hold, i.e. the requirement to yield exact finite sample inference.
To be more specific, suppose π and π′ are independent random permutations, independent of the Zi and the Z̄i. Suppose we can show that

(TN(Z̄π), TN(Z̄π′)) →d (T, T′) , (19)

where T and T′ are independent with common c.d.f. R(·). Then, by Theorem 3.1, the randomization distribution based on TN converges in probability to R(·) when all observations are i.i.d. according to P̄. But since ππ0 (meaning π composed with π0, so that π0 is applied first) and π′π0 are also independent random permutations, (19) also implies

(TN(Z̄ππ0), TN(Z̄π′π0)) →d (T, T′) .

Using the coupling construction of Z and Z̄, suppose it can be shown that

TN(Z̄ππ0) − TN(Zπ) →P 0 . (20)
Then, it also follows that

TN(Z̄π′π0) − TN(Zπ′) →P 0 ,

and so, by Slutsky’s Theorem,

(TN(Zπ), TN(Zπ′)) →d (T, T′) . (21)

Therefore, again by Theorem 3.1, the randomization distribution also converges in probability to R(·) under the original model of k samples from possibly different distributions. In summary, the coupling construction of Z, Z̄ and π0, together with the one added requirement (20), allows us to reduce the study of the permutation distribution under possibly k different distributions to the case when all N observations are i.i.d. according to P̄.
We summarize this as follows.
Lemma 3.1. Assume (19) and (20). Then, (21) holds, and so the permutation distribution based on k samples from possibly different distributions behaves asymptotically as if all observations are i.i.d. from the mixture distribution P̄, and satisfies

R_{m,n}^T(t) →P R(t) ,

if t is a continuity point of the distribution R of T in (19).
Example 3.1 (Difference of Sample Means). To appreciate what is involved in the verification of (20), consider the two-sample problem considered in Theorem 2.1, in the special case of testing equality of means. The unknown variances may differ and are assumed finite. Consider the test statistic Tm,n = m^{1/2}[X̄m − Ȳn]. By the coupling construction, Z̄ππ0 and Zπ have the same components except in at most D places. Now,

Tm,n(Z̄ππ0) − Tm,n(Zπ) = m^{1/2}[ (1/m) Σ_{i=1}^m (Z̄ππ0(i) − Zπ(i)) ] − m^{1/2}[ (1/n) Σ_{j=m+1}^N (Z̄ππ0(j) − Zπ(j)) ] .
All of the terms in the above two sums are zero except for at most D of them. But any nonzero term like Z̄ππ0(i) − Zπ(i) has variance bounded above by

2 max(Var(X1), Var(Y1)) < ∞ .

Note that the above random variable has mean zero under the null hypothesis that E(Xi) = E(Yj). To bound its variance, condition on D and π, and note that it has conditional mean 0 and conditional variance bounded above by

[m/min(m, n)²] · 2 max(Var(X1), Var(Y1)) · D ,

and hence unconditional variance bounded above by

[m/min(m, n)²] · 2 max(Var(X1), Var(Y1)) · O(N^{1/2}) = O(N^{−1/2}) = o(1) ,

implying (20). In words, we have shown that the behavior of the permutation distribution can be deduced from its behavior when all observations are i.i.d. with the mixture distribution P̄.
Two final points are relevant. First, the limiting distribution R is typically the same as the limiting distribution of the true unconditional distribution of TN under P̄. The true limiting distribution under (P1, . . . , Pk) need not be the same as under P̄. However, suppose the choice of test statistic TN is such that it is an asymptotic pivot, in the sense that its limiting distribution does not depend on the underlying probability distributions. Then, typically the randomization or permutation distribution under (P1, . . . , Pk) will asymptotically reflect the true unconditional distribution of TN, resulting in asymptotically valid inference. Indeed, the general results in Section 2 yield many examples of this phenomenon. However, that these statements need qualification is made clear by the following two (somewhat contrived) examples.
Example 3.2. Here, we illustrate a situation where the coupling works, but the true sampling distribution does not behave like the permutation distribution under the mixture model P̄. In the two-sample setup with m = n, suppose X1, . . . , Xn are i.i.d. uniform on the set of x where |x| < 1, and Y1, . . . , Yn are i.i.d. uniform on the set of y with 2 < |y| < 3. So, E(Xi) = E(Yj) = 0. Consider the test statistic Tn,n defined as

Tn,n(X1, . . . , Xn, Y1, . . . , Yn) = n^{−1/2} Σ_{i=1}^n [ I{|Yi| > 2} − I{|Xi| < 2} ] .
Under the true sampling scheme, Tn,n is zero with probability one. However, if all 2n observations are sampled from the mixture model P̄, it is easy to see that Tn,n is asymptotically normal N(0, 2), which is the same limit as that of the permutation distribution (in probability). So here, the permutation distribution under the given distributions is the same as under P̄, though it does not reflect the actual true unconditional sampling distribution.
Example 3.3. Here, we consider a situation where both populations are indeed iden-
tical, so there is no need for a coupling argument. However, the point is that the
permutation distribution does not behave like the true unconditional sampling distribution. Assume X1, . . . , Xn and Y1, . . . , Yn are all i.i.d. N(0, 1) and consider the test statistic

Tn,n(X1, . . . , Xn, Y1, . . . , Yn) = n^{−1/2} Σ_{i=1}^n (Xi + Yi) .

Unconditionally, Tn,n converges in distribution to N(0, 2). However, the permutation distribution places mass one at n^{1/2}(X̄n + Ȳn) because the statistic Tn,n is permutation invariant.
Certainly the moral of the examples is that the statistic needs to reflect an actual
comparison between P and Q, such as a difference between the same functional evaluated
at P and Q.
3.4 An Auxiliary Contiguity Result
Fix m and n with N = m+ n. Eventually, m = m(n)→∞ as n→∞. Set pm = m/N .
Let Pm be the binomial distribution based on m trials and success probability pm. Also,
let Qm be the hypergeometric distribution of the number of objects labeled X when m objects are sampled without replacement from N objects, of which m are labeled X and n are labeled Y.
Lemma 3.2. Assume the above setup with pm → p ∈ (0, 1) as m → ∞. Let Bm be a random variable having distribution Pm. Consider the likelihood ratio Lm(x) = dQm(x)/dPm(x).

(i) The limiting distribution of Lm(Bm) satisfies

Lm(Bm) →d (1/√q) exp( −(p/(2q)) Z² ) , (22)

where Z ∼ N(0, 1) denotes a standard normal random variable and q = 1 − p.
(ii) Qm and Pm are mutually contiguous.
Remark 3.2. With Bm having the binomial distribution with parameters m and pm, as in Lemma 3.2, also let B̄m have the binomial distribution with parameters m and p. Then, the distributions of Bm and B̄m are contiguous if and only if |pm − p| = O(m^{−1/2}), not just pm → p.
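Since Lm(x) = dQm(x)/dPm(x) is computable exactly from the two probability mass functions, the statements of Lemma 3.2 are easy to sanity-check numerically. The sketch below is illustrative stdlib-only Python (function names ours); it verifies that E_Pm[Lm(Bm)] = 1 exactly (the support of Qm is contained in that of Pm), consistent with the unit mean of the limit in (22), and that at x ≈ m·pm, i.e. Z ≈ 0, Lm is close to 1/√qm.

```python
from math import comb, sqrt

def binom_pmf(m, p, x):
    return comb(m, x) * p ** x * (1 - p) ** (m - x)

def hypergeom_pmf(m, n, x):
    # P(x of the m sampled objects are labeled X), sampling m without
    # replacement from N = m + n objects: m labeled X and n labeled Y
    return comb(m, x) * comb(n, m - x) / comb(m + n, m)

def likelihood_ratio(m, n, x):
    pm = m / (m + n)
    return hypergeom_pmf(m, n, x) / binom_pmf(m, pm, x)

m, n = 400, 600
pm, qm = m / (m + n), n / (m + n)
# E_{Pm}[L_m(B_m)] = sum_x Qm(x) = 1 exactly, since Qm's support lies in Pm's
mean_L = sum(likelihood_ratio(m, n, x) * binom_pmf(m, pm, x)
             for x in range(max(0, m - n), m + 1))
print(mean_L)                                 # 1 up to rounding
print(likelihood_ratio(m, n, round(m * pm)))  # close to 1/sqrt(qm) for large m
```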
Lemma 3.3. Suppose V1, . . . , Vm are i.i.d. according to the mixture distribution

P̄ ≡ pP + qQ ,

where p ∈ (0, 1) and P and Q are two probabilities (on some general space). Assume, for some sequence Wm of statistics,

Wm(V1, . . . , Vm) →P t , (23)

for some constant t (which can depend on P, Q and p). Let m → ∞, n → ∞, with N = m + n, pm = m/N, qm = n/N and pm → p ∈ (0, 1) with

pm − p = O(m^{−1/2}) . (24)
Further, let X1, . . . , Xm be i.i.d. P and Y1, . . . , Yn be i.i.d. Q. Let

(Z1, . . . , ZN) = (X1, . . . , Xm, Y1, . . . , Yn) .

Let (π(1), . . . , π(N)) denote a random permutation of {1, . . . , N} (independent of all other variables). Then,

Wm(Zπ(1), . . . , Zπ(m)) →P t . (25)
Remark 3.3. The importance of Lemma 3.3 is that it allows us to deduce the behavior of the statistic Wm under the randomization or permutation distribution from the basic assumption of how Wm behaves under i.i.d. observations from the mixture distribution P̄. Note that in (23), the convergence in probability assumption is required when the Vi are i.i.d. according to P̄ (the P attached to the arrow is just a generic symbol for convergence in probability).
Remark 3.4. As mentioned in Remark 2.2, the assumption (24) is stronger than the more basic assumption m/N → p, where no rate is required on the difference between m/N and p. Alternatively, we can replace (24) with the more basic assumption m/N → p as long as we slightly strengthen the requirement (23) to

Wm(Zm,1, . . . , Zm,m) →P t

when Zm,1, . . . , Zm,m are i.i.d. according to the mixture distribution pmP + qmQ (rather than pP + qQ), so that the data distribution at time m depends on m. We prefer to assume the convergence hypothesis based on an i.i.d. sequence, though it is really a matter of choice. Usually, we can appeal to some basic convergence in probability results with ease; but if convergence in probability results are available (or can be derived) which are “uniform” in the underlying probability distribution, then such results can be used instead with the weaker hypothesis pm → p.
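A quick simulation makes Lemma 3.3 concrete. Taking Wm to be the sample mean (our choice purely for illustration) with t the mixture mean, the mean of the first m coordinates of a randomly permuted pooled sample settles near t; the distributions and parameter values below are arbitrary.

```python
import random

rng = random.Random(1)
m, n = 5_000, 7_500                   # pm = m / N = 0.4
X = [rng.gauss(0.0, 1.0) for _ in range(m)]   # i.i.d. from P = N(0, 1)
Y = [rng.gauss(3.0, 1.0) for _ in range(n)]   # i.i.d. from Q = N(3, 1)
Z = X + Y                             # pooled vector (Z_1, ..., Z_N)
rng.shuffle(Z)                        # apply a uniformly random permutation pi
W = sum(Z[:m]) / m                    # W_m evaluated at (Z_pi(1), ..., Z_pi(m))
p = m / (m + n)
t = p * 0.0 + (1 - p) * 3.0           # mean of the mixture pP + qQ = 1.8
print(abs(W - t))                     # small: W_m -> t in probability
```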
4 Nonparametric k-sample Behrens-Fisher Problem
From our general considerations, we are now guided by the principle that the large sample
distribution of the test statistic should not depend on the underlying distributions; that
is, it should be asymptotically pivotal under the null. Of course, it can be something
other than normal, and we next consider the important problem of testing equality of
means of k-samples (where a limiting Chi-squared distribution is obtained).
The problem studied is the nonparametric one-way layout in the analysis of variance. Assume we observe k independent samples of i.i.d. observations. Specifically, assume Xi,1, . . . , Xi,ni are i.i.d. Pi. Some of our results will hold for fixed n1, . . . , nk, but we also have asymptotic results as N ≡ Σi ni → ∞. Let n = (n1, . . . , nk); the notation n → ∞ will mean mini ni → ∞.
The Pi are unknown probability distributions on the real line, assumed to have finite
variance. Let µ(P ) and σ2(P ) denote the mean and variance of P , respectively. The
problem of interest is to test the null hypothesis
H0 : µ(P1) = · · · = µ(Pk)
against the alternative
H1 : µ(Pi) ≠ µ(Pj) for some i, j .
The classical approach is to assume Pi is normal N(µi, σ²) with a common variance σ². Here, we will not impose normality, nor the assumption of a common variance.
One approach used to robustify the usual F -test is to apply a permutation test.
The underlying distributions need not be normal for the permutation approach to yield
exact level α tests, but what is needed is that Pi is just Pj shifted, for all i and j. To put it another way, it must be the case that the c.d.f. Fi corresponding to Pi satisfies Fi(x) = F(x − µi) for some unknown F and constants µi (which can then be taken to be the mean of Fi, assuming the mean exists). In other words, under H0, the observations must be mutually independent and identically distributed. Of course, this is much
weaker than the usual normal theory assumptions. Unfortunately, a permutation test
applied to the usual F -statistic will fail to control the probability of a Type 1 error, even
asymptotically.
The goal here is to construct a method that retains the exact control of the probability
of a Type 1 error when the observations are i.i.d., but also asymptotically controls the
probability of a Type 1 error under very weak assumptions, specifically finite variances
of the underlying distributions.
The first step is a choice of test statistic. In order to preserve the good power
properties of the classical test under normality, consider the generalized likelihood ratio
for testing H0 against H1 under the normal model where it is assumed Pi ∼ N(µi, σ2i ).
If, for now, we further assume that the σi are known, then it is easily checked that the
generalized likelihood ratio test rejects for large values of

Tn,0 = Σ_{i=1}^k (ni/σ²i) [ X̄n,i − ( Σ_{i=1}^k ni X̄n,i/σ²i ) / ( Σ_{i=1}^k ni/σ²i ) ]² , (26)

where X̄n,i = Σ_{j=1}^{ni} Xi,j/ni. Since the σi will not be assumed known, we replace σi in (26) with Sn,i, where

S²n,i = [1/(ni − 1)] Σ_{j=1}^{ni} (Xi,j − X̄n,i)² ,

yielding

Tn,1 = Σ_{i=1}^k (ni/S²n,i) [ X̄n,i − ( Σ_{i=1}^k ni X̄n,i/S²n,i ) / ( Σ_{i=1}^k ni/S²n,i ) ]² . (27)
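Computationally, (27) is simple. The following illustrative Python sketch (function name ours) evaluates Tn,1 for a list of k samples:

```python
def t_stat(samples):
    """T_{n,1} of (27): between-group sum of squares of the group means
    about the weighted grand mean, with weights n_i / S^2_{n,i}."""
    means = [sum(s) / len(s) for s in samples]
    variances = [sum((x - mu) ** 2 for x in s) / (len(s) - 1)
                 for s, mu in zip(samples, means)]
    weights = [len(s) / v for s, v in zip(samples, variances)]  # n_i / S^2_{n,i}
    grand = sum(w * mu for w, mu in zip(weights, means)) / sum(weights)
    return sum(w * (mu - grand) ** 2 for w, mu in zip(weights, means))

print(t_stat([[-1.0, 1.0], [-2.0, 2.0]]))   # -> 0.0: the group means agree
print(t_stat([[0.0, 1.0], [5.0, 6.0]]))     # positive: the group means differ
```

The statistic vanishes exactly when all sample means coincide and grows with the studentized between-group spread.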
We need the limiting behavior of Tn,1, not just under normality or equal distributions. (Some relatively recent large sample approaches to this specific problem, which do not retain our finite sample exactness property, are given in Rice and Gaines (1989) and Krishnamoorthy, Lu and Mathew (2007).)
Lemma 4.1. Consider the above set-up with 0 < σ²i = σ²(Pi) < ∞. Assume ni → ∞ with ni/N → pi > 0. Then, under H0, both Tn,0 and Tn,1 converge in distribution to the Chi-squared distribution with k − 1 degrees of freedom.
Let Rn,1(·) denote the permutation distribution corresponding to Tn,1. In words, Tn,1 is recomputed over all permutations of the data. Specifically, if we let

(Z1, . . . , ZN) = (X1,1, . . . , X1,n1 , X2,1, . . . , X2,n2 , . . . , Xk,1, . . . , Xk,nk) ,

then Rn,1(t) is formally equal to the right side of (2), with Tm,n replaced by Tn,1.
Theorem 4.1. Consider the above set-up with 0 < σ²(Pi) < ∞. Assume ni → ∞ with ni/N → pi > 0. Then, under H0,

Rn,1(t) →P Gk−1(t) ,

where Gd denotes the Chi-squared distribution with d degrees of freedom. Moreover, if P1, . . . , Pk satisfy H0, then the probability that the permutation test rejects H0 tends to the nominal level α.
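The resulting procedure can be sketched as follows (illustrative Python; names ours). Random permutations are sampled instead of enumerating all of them which, as discussed in Section 5, changes neither the exactness nor the asymptotic properties:

```python
import random

def t_stat(samples):
    # the studentized statistic T_{n,1} of (27)
    means = [sum(s) / len(s) for s in samples]
    variances = [sum((x - mu) ** 2 for x in s) / (len(s) - 1)
                 for s, mu in zip(samples, means)]
    weights = [len(s) / v for s, v in zip(samples, variances)]
    grand = sum(w * mu for w, mu in zip(weights, means)) / sum(weights)
    return sum(w * (mu - grand) ** 2 for w, mu in zip(weights, means))

def permutation_test(samples, B=999, rng=None):
    """Monte Carlo permutation p-value for H0: equal means, based on T_{n,1};
    reject at level alpha when the returned p-value is <= alpha."""
    rng = rng or random.Random(0)
    sizes = [len(s) for s in samples]
    pooled = [x for s in samples for x in s]
    observed = t_stat(samples)
    exceed = 0
    for _ in range(B):
        rng.shuffle(pooled)                   # a random permutation of the data
        groups, start = [], 0
        for ni in sizes:                      # re-split into the original sizes
            groups.append(pooled[start:start + ni])
            start += ni
        exceed += t_stat(groups) >= observed
    return (1 + exceed) / (B + 1)             # (1 + #{T_perm >= T_obs}) / (B + 1)

rng = random.Random(0)
same = [[rng.gauss(0, 1) for _ in range(30)] for _ in range(3)]
shifted = [[rng.gauss(mu, 1) for _ in range(30)] for mu in (0, 0, 5)]
p_same, p_shift = permutation_test(same), permutation_test(shifted)
print(p_same, p_shift)    # p_shift is tiny; p_same is not
```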
5 Simulation Results
Monte Carlo simulation studies illustrating our results are presented in this section. Table 1 tabulates the rejection probabilities of one-sided tests for the studentized permutation median test, where the nominal level considered is α = 0.05. The simulation results confirm that the studentized permutation median test is valid in the sense that it approximately attains level α in large samples.
Odd sample sizes are selected in the Monte Carlo simulation for simplicity. We consider several pairs of distinct sampling distributions that share the same median, as listed in the first column of Table 1. For each situation, 10,000 simulations were performed. Within a given simulation, the permutation test was calculated
by randomly sampling 999 permutations. Note that neither the exactness properties nor
the asymptotic properties are changed at all (as long as the number of permutations
sampled tends to infinity). For a discussion on stochastic approximations to the permu-
tation distribution, see the end of Section 15.2.1 in Lehmann and Romano (2005) and
Section 4 in Romano (1989). As is well-known, when the underlying distributions of two
distinct independent samples are not identical, the permutation median test is not valid
in the sense that the rejection probability is far from the nominal level α = 0.05. For
example, although a logistic distribution with location parameter 0 and scale parameter
1 and a continuous uniform distribution with the support ranging from -10 to 10 have
the same median of 0, the rejection probability for the sample sizes examined is between
0.0991 and 0.2261 and moves further away from the nominal level α = 0.05 as sample
sizes increase.
In contrast, the studentized permutation test results in a rejection probability that tends to the nominal level α asymptotically. We apply the bootstrap method (Efron, 1979) to estimate the variance parameter 1/(4fP²(θ)) for the median in the simulation, given by

m Σ_{l=1}^m [ X(l) − θ(P̂m) ]² · P( θ(P̂*m) = X(l) ) ,

where for an odd number m,

P( θ(P̂*m) = X(l) ) = P( Binomial(m, (l − 1)/m) ≤ (m − 1)/2 ) − P( Binomial(m, l/m) ≤ (m − 1)/2 ) .
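Since P(θ(P̂*m) = X(l)) is given in closed form by two binomial tail probabilities, the exact bootstrap variance estimate requires no resampling at all. The following stdlib-only Python sketch (function names ours) implements the display above for an odd sample size m:

```python
from math import comb

def binom_cdf(m, p, k):
    """P(Binomial(m, p) <= k)."""
    return sum(comb(m, j) * p ** j * (1 - p) ** (m - j) for j in range(k + 1))

def bootstrap_median_variance(sample):
    """Exact bootstrap estimate of m * Var(sample median) for odd m:
    m * sum_l [X_(l) - median]^2 * P(bootstrap median = X_(l))."""
    m = len(sample)
    assert m % 2 == 1, "odd sample size assumed, as in the simulation"
    xs = sorted(sample)              # order statistics X_(1) <= ... <= X_(m)
    med = xs[(m - 1) // 2]
    k = (m - 1) // 2
    probs = [binom_cdf(m, (l - 1) / m, k) - binom_cdf(m, l / m, k)
             for l in range(1, m + 1)]          # P(bootstrap median = X_(l))
    return m * sum((x - med) ** 2 * p for x, p in zip(xs, probs))

sample = [0.3, -1.2, 0.8, 2.1, -0.5, 0.1, 1.4]  # m = 7
print(bootstrap_median_variance(sample))
```

The probabilities telescope to 1 over l = 1, . . . , m, so the estimator is a proper weighted average of squared deviations from the sample median.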
As noted earlier, there exist other choices, such as the kernel estimator and the smoothed bootstrap estimator. We emphasize, however, that using the bootstrap to obtain an estimate of standard error does not destroy the exactness of permutation tests under identical distributions.
                                  m        5      13      51     101     101     201     401
                                  n        5      21     101     101     201     201     401
N(0,1) vs. N(0,5)    Not Studentized   0.1079  0.1524  0.1324  0.2309  0.2266  0.2266  0.2249
                     Studentized       0.0802  0.1458  0.0950  0.0615  0.0517  0.0517  0.0531
N(0,1) vs. T(5)      Not Studentized   0.0646  0.1871  0.2411  0.1769  0.1849  0.1849  0.1853
                     Studentized       0.0707  0.1556  0.0904  0.0776  0.0661  0.0661  0.0611
Logistic(0,1) vs.    Not Studentized   0.0991  0.1413  0.1237  0.2258  0.2233  0.2233  0.2261
U(−10,10)            Studentized       0.0771  0.1249  0.0923  0.0686  0.0574  0.0574  0.0574
Laplace(ln 2,1) vs.  Not Studentized   0.0420  0.0462  0.0477  0.0480  0.0493  0.0461  0.0501
exp(1)               Studentized       0.0386  0.0422  0.0444  0.0502  0.0485  0.0505  0.0531

Table 1: Monte Carlo Simulation Results for the Studentized Permutation Median Test (One-sided, α = 0.05)
6 Conclusion
When the fundamental assumption of identical distributions need not hold, two-sample
permutation tests are invalid unless quite stringent conditions are satisfied depending
on the precise nature of the problem. For example, the two-sample permutation test based on the difference of sample means is asymptotically valid only when either the distributions have the same variance or the sample sizes are comparable. Thus, a careful interpretation of rejecting the null is necessary; rejecting the null based on such a permutation test does not necessarily imply the rejection of the null that some real-valued parameter θ(F, G) equals some specified value θ0. We provide a framework that allows
one to obtain asymptotic rejection probability α in two-sample permutation tests. One
great advantage of utilizing the proposed test is that it retains the exactness property
in finite samples when P = Q, a desirable property that bootstrap and subsampling
methods fail to possess.
To summarize, if the true goal is to test whether the parameter of interest θ is some specified value θ0, a permutation test based on a correctly studentized statistic is an
attractive choice. When testing the equality of means, for example, the permutation
t-test based on a studentized statistic obtains asymptotic rejection probability α in
general while attaining exact rejection probability equal to α when P = Q. In the case
of testing the equality of medians, the studentized permutation median test yields the
same desirable property. Moreover, the results extend to quite general settings based
on asymptotically linear estimators. The results extend to k-sample problems as well,
and analogous results hold in the nonparametric k-sample Behrens-Fisher problem. The
guiding principle is to use a test statistic that is asymptotically distribution-free or pivotal. Then, by the technical arguments developed in this paper, the permutation test can be shown to behave asymptotically the same as when all observations share a common distribution. Consequently, if the permutation distribution reflects the true underlying sampling distribution, asymptotic justification is achieved.
A Proofs
Proofs in Section 3.
Proof of Theorem 3.1. The sufficiency part, due to Hoeffding (1952), is proved in Theorem 15.2.3 of Lehmann and Romano (2005). To prove the necessity part, suppose s and t are continuity points of R^T(·). Then,

P{ Tn(GnXn) ≤ s, Tn(G′nXn) ≤ t } = E[ P{ Tn(GnXn) ≤ s, Tn(G′nXn) ≤ t | Xn } ]

= E[ R_n^T(s) R_n^T(t) ] → R^T(s) R^T(t) ,

since convergence in probability of a bounded sequence of random variables entails convergence of moments. Convergence for a dense set of rectangles in the plane entails weak convergence.
Before proving Slutsky’s Theorem for Randomization Distributions (Theorem 3.2),
we need three lemmas.
Lemma A.1. Suppose Xn has distribution Pn in the sample space 𝒳n, and 𝒢n is a finite group of transformations g of 𝒳n onto itself. Also, let Gn be a random variable that is uniform on 𝒢n. Assume Xn and Gn are mutually independent. Let R_n^A denote the randomization distribution of An, defined by

R_n^A(t) = (1/|𝒢n|) Σ_{g∈𝒢n} I{An(gXn) ≤ t} . (28)

Suppose, under Pn,

An(GnXn) →P a .

Then, under Pn,

R_n^A(t) = (1/|𝒢n|) Σ_{g∈𝒢n} I{An(gXn) ≤ t} →P δa(t) if t ≠ a , (29)

where δc(·) denotes the distribution function corresponding to the point mass at c.
Proof of Lemma A.1: Let G′n have the same distribution as Gn, and be independent of Gn and Xn. Since An(GnXn) converges in probability to the constant a, An(G′nXn) →P a as well, and the independence of the limiting distributions is satisfied. Thus, the result follows from Theorem 3.1.
Lemma A.2. Let Bn and Tn be sequences of random variables satisfying the conditions above, i.e.,

Bn(GnXn) →P b ,

and

(Tn(GnXn), Tn(G′nXn)) →d (T, T′) , (30)

where T and T′ are independent, each with common c.d.f. R^T(·). Let R_n^{T+B}(t) denote the randomization distribution of Tn + Bn, defined in (28) with A replaced by T + B. Then, R_n^{T+B} converges in probability to the distribution of T + b. In other words,

R_n^{T+B}(t) ≡ (1/|𝒢n|) Σ_{g∈𝒢n} I{Tn(gXn) + Bn(gXn) ≤ t} →P R^{T+b}(t) if R^{T+b} is continuous at t,

where R^{T+b}(·) denotes the c.d.f. of T + b. (Of course, R^{T+b}(t) = R^T(t − b).)
Proof of Lemma A.2: Without loss of generality, assume b = 0. For any ε > 0,

(1/|𝒢n|) Σ_{g∈𝒢n} I{Tn(gXn) ≤ t − ε} − (1/|𝒢n|) Σ_{g∈𝒢n} I{|Bn(gXn)| > ε}

≤ (1/|𝒢n|) Σ_{g∈𝒢n} I{Tn(gXn) + Bn(gXn) ≤ t}

≤ (1/|𝒢n|) Σ_{g∈𝒢n} I{Tn(gXn) ≤ t + ε} + (1/|𝒢n|) Σ_{g∈𝒢n} I{|Bn(gXn)| > ε} .
Note that the term (1/|𝒢n|) Σ_{g∈𝒢n} I{|Bn(gXn)| > ε} appearing in the first and in the third lines converges in probability to 0 by Lemma A.1. Also, by Theorem 3.1, (30) implies

R_n^T(t) = (1/|𝒢n|) Σ_{g∈𝒢n} I{Tn(gXn) ≤ t} →P R^T(t) (31)

if R^T(·) is continuous at t. Thus, if both t − ε and t + ε are continuity points of R^T(·), the first term of the first line and the first term of the third line converge in probability to R^T(t − ε) and R^T(t + ε), respectively. Therefore,

R^T(t − ε) ≤ R_n^{T+B}(t) ≤ R^T(t + ε)

with probability tending to one, for continuity points t − ε and t + ε of R^T(·). Now, let ε ↓ 0 through continuity points to deduce that

R_n^{T+B}(t) →P R^T(t) .
Lemma A.3. Let An and Tn be sequences of random variables satisfying the conditions above, i.e.,

An(GnXn) →P a ,

where a is nonzero, and

(Tn(GnXn), Tn(G′nXn)) →d (T, T′) ,

where T and T′ are independent, each with common c.d.f. R^T(·). Then, the randomization distribution of AnTn converges in probability to the distribution of aT. In other words,

R_n^{AT}(t) ≡ (1/|𝒢n|) Σ_{g∈𝒢n} I{An(gXn)Tn(gXn) ≤ t} →P R^{aT}(t) ,

if R^{aT} is continuous at t, where R^{aT}(·) denotes the c.d.f. of aT.
Proof of Lemma A.3: Write

AnTn = aTn + (An − a)Tn .

Then, we can apply Lemma A.2 with Bn = (An − a)Tn, if we can verify the condition Bn(GnXn) →P 0. But,

Bn(GnXn) = [An(GnXn) − a] Tn(GnXn) →P 0 · T = 0 ,

by the usual Slutsky’s Theorem. Finally, the behavior of aTn follows trivially from that of Tn.
Proof of Theorem 3.2: The proof follows from Lemma A.2 and Lemma A.3.
Proof of Lemma 3.2: First, writing C(a, b) for the binomial coefficient “a choose b”,

Lm(x) = [ C(m, x) C(n, m − x) / C(N, m) ] / [ C(m, x) pm^x (1 − pm)^{m−x} ] (32)

= n! n! m! / [ (n + m)! (m − x)! (n − m + x)! pm^x (1 − pm)^{m−x} ] .

Applying Stirling’s approximation

n! = √(2πn) (n/e)^n (1 + O(1/n)) as n → ∞

yields Lm(x) ∼ L′m(x), where

L′m(x) = n^{2n+1} m^{m+1/2} / [ (n + m)^{n+m+1/2} (m − x)^{m−x+1/2} (n − m + x)^{n−m+x+1/2} pm^x (1 − pm)^{m−x} ] ;

the approximation holds as long as min(m, n, m − x, n − m + x) → ∞. Of course, Bm = mpm + OP(m^{1/2}), and so

min(m, n, m − Bm, n − m + Bm) →P ∞ .

Therefore, Lm(Bm) has the same limiting distribution as L′m(Bm) (assuming it has one, which we show below). Write L′m = a · b · c and qm = 1 − pm, where

a = n^{2n+1} m^{m+1/2} / (n + m)^{n+m+1/2} ,

b = 1 / [ (m − x)^{m−x+1/2} (n − m + x)^{n−m+x+1/2} ] ,

and

c = 1 / ( pm^x qm^{m−x} ) .
Then,

a = qm^{2n+1} pm^{m+1/2} (n + m)^{n+1} ,

and so

a · c = pm^{A+1/2} qm^{2n+1−A} (n + m)^{n+1} ,

where A = m − x. Also,

b = 1 / [ A^{A+1/2} (n − A)^{n−A+1/2} ] = (npm/A)^{A+1/2} (nqm/(n − A))^{n−A+1/2} / [ (npm)^{A+1/2} (nqm)^{n−A+1/2} ] .
Therefore, L′m = a · b · c equals

L′m = [ pm^{A+1/2} qm^{2n+1−A} (n + m)^{n+1} / ( (npm)^{A+1/2} (nqm)^{n−A+1/2} ) ] · 1 / [ (A/(npm))^{A+1/2} ((n − A)/(nqm))^{n−A+1/2} ]

= [ qm^{n+1/2} (n + m)^{n+1} / n^{n+1} ] · 1 / [ (A/(npm))^{A+1/2} ((n − A)/(nqm))^{n−A+1/2} ]

= (1/√qm) · 1 / [ (A/(npm))^{A+1/2} ((n − A)/(nqm))^{n−A+1/2} ] .
We will evaluate Lm and L′m not at a generic x, but at the binomial variable Bm, which satisfies

Bm = mpm + OP(m^{1/2}) ,

in which case Am = A(Bm) = m − Bm satisfies

Am/(npm) = mqm/(npm) + OP(m^{1/2})/(npm) = 1 + OP(m^{1/2})/(npm) ,

since npm = mqm. Also,

Am = m − Bm = mqm + OP(m^{1/2}) .

Since Bm/m →P p, we also have

Am/(npm) →P 1 and (n − Am)/(nqm) →P 1 ,

or

Am/n →P p and (n − Am)/n →P q . (33)
Therefore, we can expand the logarithm in L′m as long as we keep both the linear and quadratic terms,

log(t) = (t − 1) − (1/2)(t − 1)² + o(|t − 1|²) as t → 1 .
Hence,

− log[ √qm L′m(Bm) ] = (Am + 1/2) log( Am/(npm) ) + (n − Am + 1/2) log( (n − Am)/(nqm) )

= Am log( Am/(npm) ) + (n − Am) log( (n − Am)/(nqm) ) + oP(1)

= Am ( Am/(npm) − 1 ) + (n − Am) ( (n − Am)/(nqm) − 1 )
− (1/2) Am ( Am/(npm) − 1 )² − (1/2) (n − Am) ( (n − Am)/(nqm) − 1 )² + oP(1)

≡ d + e + f + g + oP(1) ,
where we have just identified the four terms in the last expression. Noting that
n− Amnqm
− 1 = −pmqm
(Amnpm
− 1
), (34)
we have that
d+ e =
(Amnpm
− 1
)·[Am +
pmqm
(Am − n)
]=
(Amnpm
− 1
)· 1
qm· (Am − npm)
=(Am − npm)2
npmqm= Z2
m ·pmqm
,
where
Zm =Am −mqm√mpmqm
= −Bm −mpm√mpmqm
L→ Z ∼ N(0, 1) .
Again using (34), we find that

−2(f + g) = ( Am/(npm) − 1 )² · ( Am + (n − Am) p²m/q²m )

= Z²m ( pm + p²m/qm ) + oP(1) = Z² (p/q) + oP(1) ,

using (33). Therefore,

d + e + f + g = (p/(2q)) Z² + oP(1) .
Hence, we conclude that

Lm(Bm) →d (1/√q) exp( −(p/(2q)) Z² ) ,

and (i) is shown. To prove (ii), note that

E[ (1/√q) exp( −(p/(2q)) Z² ) ] = 1 ,

since Z² has the Chi-squared distribution with one degree of freedom and moment generating function ψ(t) = (1 − 2t)^{−1/2}. Since the limiting distribution has mean 1, by Theorem 12.3.2 (iii) of Lehmann and Romano (2005), Qm is contiguous with respect to Pm. Since the limiting distribution has no mass at 0, by Problem 12.23 it also follows that Pm is contiguous to Qm.
Proof of Lemma 3.3: We must show, for any ε > 0,

P{ |Wm(Zπ(1), . . . , Zπ(m)) − t| > ε } → 0 as m → ∞ . (35)

We compare the left side of (35) with

P{ |Wm(V1, . . . , Vm) − t| > ε } .
Imagine V1, . . . , Vm are sampled in a two-stage process where first Bm is drawn from the binomial distribution with parameters m and p, and then V1, . . . , Vm are obtained by drawing Bm i.i.d. observations from P and m − Bm i.i.d. observations from Q. Similarly, let Hm denote the number of observations among Zπ(1), . . . , Zπ(m) which were among the Xi's, so that Hm has the hypergeometric distribution (m, m, N) based on sampling m objects from N, m of which are “special”. By Lemma 3.2, Remark 3.2 and (24), the distributions of Bm and Hm are contiguous. Importantly, conditional on Bm = Hm = b, the conditional probabilities

P{ |Wm(V1, . . . , Vm) − t| > ε | Bm = b } (36)

and

P{ |Wm(Zπ(1), . . . , Zπ(m)) − t| > ε | Hm = b } (37)

are the same, because Wm is evaluated at a random sample of b observations from P and m − b observations from Q in both cases. Let fm(Bm) be defined by

fm(Bm) ≡ P{ |Wm(V1, . . . , Vm) − t| > ε | Bm } . (38)
By assumption (23),

E[fm(Bm)] → 0 ,

and hence

fm(Bm) →P 0 ,

by Markov’s inequality. But then, by contiguity,

fm(Hm) →P 0 ,

and so

E[fm(Hm)] → 0 , (39)

since fm is uniformly bounded. But the left hand side of (39) is exactly the left hand side of (35).
Proofs of Theorems in Section 2.
Proof of Theorem 2.1: First, argue the case θ(P) = ∫ x dP(x), so that fP(x) = x for all x and P, and

Tm,n = m^{1/2}(X̄m − Ȳn) = m^{−1/2} ( Σ_{i=1}^m Xi − (m/n) Σ_{j=1}^n Yj ) .

Independent of the Z's, let (π(1), . . . , π(N)) and (π′(1), . . . , π′(N)) be independent random permutations of {1, . . . , N}. Then, by Example 15.2.6 of Lehmann and Romano (2005),

( Tm,n(Zπ(1), . . . , Zπ(N)), Tm,n(Zπ′(1), . . . , Zπ′(N)) )

converges in distribution to a bivariate normal distribution with independent, identically distributed marginals having mean 0 and variance

τ²(P̄) = (p/(1 − p)) σ²(P̄) + σ²(P̄) = (1/(1 − p)) σ²(P̄) ,

where σ²(P̄) denotes the variance of P̄. Thus, Theorem 3.1 can be applied and the result follows.
Next, consider the case θ(P) = ∫ f(x) dP(x). However, this problem is the same as the mean case. Instead of observing (Z1, . . . , ZN) = (X1, . . . , Xm, Y1, . . . , Yn), we now observe (Z1, . . . , ZN) = (f(X1), . . . , f(Xm), f(Y1), . . . , f(Yn)), and we are interested in means of the Z's. Thus, the proof for this case is the same as above, except we replace σ²(P̄) = EP̄(X²i) with EP̄(f²(Xi)).
Finally, we consider the general case. Let π be a random permutation of {1, . . . , N}, so that

Tm,n(Zπ(1), . . . , Zπ(N)) = m^{1/2}[ θ̂m(Zπ(1), . . . , Zπ(m)) − θ̂n(Zπ(m+1), . . . , Zπ(N)) ] .

Let V1, V2, . . . be i.i.d. P̄. By assumption,

m^{1/2}[ θ̂m(V1, . . . , Vm) − θ(P̄) ] − m^{−1/2} Σ_{i=1}^m fP̄(Vi) →P 0 . (40)

By Lemma 3.3 and (40),

εm(Zπ(1), . . . , Zπ(m)) ≡ m^{1/2}[ θ̂m(Zπ(1), . . . , Zπ(m)) − θ(P̄) ] − m^{−1/2} Σ_{i=1}^m fP̄(Zπ(i)) →P 0 .

Similarly,

εn(Zπ(m+1), . . . , Zπ(N)) ≡ n^{1/2}[ θ̂n(Zπ(m+1), . . . , Zπ(N)) − θ(P̄) ] − n^{−1/2} Σ_{j=m+1}^N fP̄(Zπ(j)) →P 0 ,

which implies

m^{1/2}[ θ̂n(Zπ(m+1), . . . , Zπ(N)) − θ(P̄) ] − (m/n)^{1/2} n^{−1/2} Σ_{j=m+1}^N fP̄(Zπ(j)) →P 0 .
Hence, we can write

Tm,n(Zπ(1), . . . , Zπ(N)) = m^{1/2}[ (1/m) Σ_{i=1}^m fP̄(Zπ(i)) − (1/n) Σ_{j=m+1}^N fP̄(Zπ(j)) ]
+ εm(Zπ(1), . . . , Zπ(m)) − (m/n)^{1/2} εn(Zπ(m+1), . . . , Zπ(N)) ,

and each of the last two terms goes to zero in probability. Therefore, we can apply Slutsky’s Theorem for randomization distributions; that is, it suffices to determine the limit behavior of just

m^{1/2}[ (1/m) Σ_{i=1}^m fP̄(Zπ(i)) − (1/n) Σ_{j=m+1}^N fP̄(Zπ(j)) ] ,

which reduces the problem to the previous case considered.
Proof of Theorem 2.2: Write Vm,n = Vm,n(Z1, . . . , ZN), where the Zi are defined in (1). Let (π(1), . . . , π(N)) denote a random permutation of {1, . . . , N} (independent of all other variables). We first will show that

V²m,n(Zπ(1), . . . , Zπ(N)) →P τ²(P̄) . (41)

To do this, it suffices to show that

σ̂²m(Zπ(1), . . . , Zπ(m)) →P σ²(P̄) (42)

and

σ̂²n(Zπ(m+1), . . . , Zπ(N)) →P σ²(P̄) . (43)

But (42) and (43) both follow from Lemma 3.3. Now let R^V_{m,n}(·) denote the permutation distribution corresponding to the statistic Vm,n, as defined in (2) with T replaced by V. By Lemma A.1, R^V_{m,n}(t) converges in probability to δτ²(P̄)(t) for all t ≠ τ²(P̄), where δc(·) denotes the c.d.f. of the distribution placing mass one at the constant c. Using this fact together with Theorem 2.1, we can apply Lemma A.3 to conclude that the permutation distribution of the ratio of statistics Sm,n satisfies (11).
Proofs in Section 4.
Proof of Lemma 4.1: First, we consider Tn,0. Without loss of generality, assume
µ(Pi) = 0 for all i. Let Zn be the column vector with ith component n1/2i Xn,i/σi. Also,
let I denote the k× k identity matrix, let 1 denote the k× 1 vector of ones, and let Dn
denote the diagonal matrix with (i, i) entry Nσ2i /ni. Then, we can write
Tn,0 = Z ′nPnZn .
where
Pn ≡ (I − D−1/2n 11′D
−1/2n
1′D−1n 1
) .
Of course, Z_n converges in distribution to Z, where Z has the multivariate normal
distribution with mean 0 and covariance matrix I. If we let D denote the diagonal
matrix with (i,i) entry \sigma_i^2/p_i, then the entrywise convergence of D_n to D, as well
as of D_n^{-1} to D^{-1}, implies (using the continuous mapping theorem) that
$$T_{n,0}\ \overset{d}{\to}\ Z'PZ\;,$$
where P is the matrix
$$P \equiv I-\frac{D^{-1/2}\mathbf{1}\mathbf{1}' D^{-1/2}}{\mathbf{1}' D^{-1}\mathbf{1}}\;. \qquad (44)$$
The matrix P is a symmetric idempotent (projection) matrix, and its rank is therefore
its trace, which is easily checked to be k-1. Indeed, P represents the projection
orthogonal to the unit vector D^{-1/2}\mathbf{1}/(\mathbf{1}' D^{-1}\mathbf{1})^{1/2}. It follows that Z'PZ \sim \chi^2_{k-1}, as required.
To handle T_{n,1}, let t_n be the column vector with ith component n_i^{1/2}\bar X_{n,i}/S_{n,i}, let
\hat D_n be the diagonal matrix with (i,i) entry N S^2_{n,i}/n_i, and let \hat P_n be the projection
matrix obtained by replacing D with \hat D_n in the definition (44) of P. Of course, by Slutsky's
Theorem, Z_n - t_n converges in probability to 0. Also, \hat D_n converges in probability to D
(as does its inverse), \hat P_n converges in probability to P, and so \hat P_n - P_n converges in
probability to 0. Since
$$T_{n,0}-T_{n,1}=Z_n' P_n Z_n - t_n'\hat P_n t_n=(Z_n-t_n)' P_n (Z_n-t_n)+2(Z_n-t_n)' P_n t_n + t_n'(P_n-\hat P_n) t_n\ \overset{P}{\to}\ 0\;,$$
T_{n,1} must have the same limiting distribution as T_{n,0}.
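The algebraic facts used in this proof (P symmetric, idempotent, with trace and hence rank k-1) are easy to confirm numerically. A minimal Python check, with purely illustrative values of the \sigma_i^2 and p_i (these numbers are assumptions, not from the paper):

```python
import numpy as np

# Illustrative (assumed) values of sigma_i^2 and p_i for k = 3
sigma2 = np.array([1.0, 4.0, 9.0])
p = np.array([0.2, 0.3, 0.5])

D = np.diag(sigma2 / p)                      # diagonal matrix D
Dm = np.diag(1.0 / np.sqrt(np.diag(D)))      # D^{-1/2}
one = np.ones((3, 1))
c = (one.T @ np.linalg.inv(D) @ one).item()  # 1' D^{-1} 1
P = np.eye(3) - (Dm @ one @ one.T @ Dm) / c  # the matrix in (44)

assert np.allclose(P, P.T)         # symmetric
assert np.allclose(P @ P, P)       # idempotent
assert np.isclose(np.trace(P), 2)  # trace = rank = k - 1
```

Since P is a rank k-1 projection, Z'PZ for standard normal Z is Chi-squared with k-1 degrees of freedom, exactly as the proof concludes.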
Proof of Theorem 4.1: Put all N = \sum_i n_i observations in one vector
$$(Z_1,\dots,Z_N)=(X_{1,1},\dots,X_{1,n_1},X_{2,1},\dots,X_{2,n_2},\dots,X_{k,1},\dots,X_{k,n_k})\;.$$
For now, we consider the case where all the N observations are i.i.d., so that Pi = P
for i = 1, . . . , k. Without loss of generality, we can assume µ(P ) = 0 and we write
\sigma^2=\sigma^2(P). In this case, T_{n,0} simplifies to
$$T_{n,0}=\frac{1}{\sigma^2}\sum_{i=1}^{k}n_i(\bar X_{n,i}-\bar Z_N)^2\;,$$
where
$$\bar Z_N=\frac{1}{N}\sum_{l=1}^{N}Z_l\;.$$
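This simplified statistic is straightforward to compute. The sketch below (Python, with illustrative sample sizes; not part of the paper) evaluates T_{n,0} over repeated i.i.d. data sets, where the Chi-squared limit with k-1 degrees of freedom suggests a mean near k-1 = 2:

```python
import numpy as np

rng = np.random.default_rng(1)

def T_n0(samples, sigma2):
    """Simplified i.i.d.-case statistic:
    (1/sigma^2) * sum_i n_i * (Xbar_{n,i} - Zbar_N)^2."""
    z = np.concatenate(samples)
    zbar = z.mean()  # grand mean Zbar_N
    return sum(len(s) * (s.mean() - zbar) ** 2 for s in samples) / sigma2

# k = 3 i.i.d. N(0,1) samples; sizes (40, 60, 100) are illustrative
vals = np.array([
    T_n0([rng.normal(size=n) for n in (40, 60, 100)], 1.0)
    for _ in range(2000)
])
```

Empirically, `vals` behaves like a Chi-squared sample with 2 degrees of freedom.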
First, we show that the randomization distribution based on T_{n,0}, say R_{n,0}(\cdot), behaves
the same way as that based on T_{n,1}. (Of course, we cannot use T_{n,0} in practice, since \sigma
is unknown, but for now we treat \sigma as if it were known.) Let \pi=(\pi(1),\dots,\pi(N)) denote
a random permutation of 1,\dots,N (independent of the observations). By Theorem 3.1, we must verify
that the joint limiting distribution of (T_{n,0}(Z),T_{n,0}(Z_\pi)) is that of two independent Chi-
squared variables with k-1 degrees of freedom. (Note that we do not need to consider
the joint behavior of T_{n,0} at Z_\pi and at Z_{\pi'}, where \pi' is another independent random
permutation, because, since the Z_l are i.i.d., Z_{\pi'} and Z have the same distribution.) To
do this, define
$$V_{n,i}=n_i^{1/2}\bar X_{n,i}=n_i^{-1/2}\sum_{l=1}^{N}Z_l I\{l\in I_i\}$$
and
$$V'_{n,j}=n_j^{-1/2}\sum_{l=1}^{N}Z_l I\{\pi(l)\in I_j\}\;,$$
where I_i is the set of indices corresponding to the ith sample; that is, I_1=\{1,\dots,n_1\},
I_2=\{n_1+1,\dots,n_1+n_2\}, and ultimately I_k=\{N-n_k+1,\dots,N\}. We claim the
joint asymptotic normality of
$$(V_{n,1},\dots,V_{n,k},V'_{n,1},\dots,V'_{n,k})\;.$$
To do this, we use the Cramér-Wold device; i.e., we must show that
$$V_n\equiv V_N(a,b)\equiv\sum_{i=1}^{k}(a_i V_{n,i}+b_i V'_{n,i})$$
is asymptotically normal for any choice of constants a_i and b_i. We can write
$$V_n=\sum_{l=1}^{N}C_{n,l}Z_l\;,$$
where
$$C_{n,l}=\sum_{i=1}^{k}\Bigl[\frac{a_i I\{l\in I_i\}}{n_i^{1/2}}+\frac{b_i I\{\pi(l)\in I_i\}}{n_i^{1/2}}\Bigr]\;.$$
Note that the Cn,l are random (as they depend on the random permutation π), but are
independent of the Z_l. By Lemma 11.3.3 in Lehmann and Romano (2005), a sufficient
condition for
$$\sum_{l=1}^{N}C_{n,l}Z_l\Big/\Bigl(\sum_{l=1}^{N}C^2_{n,l}\Bigr)^{1/2}\ \overset{d}{\to}\ N(0,\sigma^2) \qquad (45)$$
is
$$\frac{\max_{l=1,\dots,N}C^2_{n,l}}{\sum_{l=1}^{N}C^2_{n,l}}\ \overset{P}{\to}\ 0 \qquad (46)$$
as N\to\infty. Note that
$$C^2_{n,l}=\sum_{i=1}^{k}\Bigl[\frac{a_i I\{l\in I_i\}}{n_i^{1/2}}+\frac{b_i I\{\pi(l)\in I_i\}}{n_i^{1/2}}\Bigr]\cdot\sum_{j=1}^{k}\Bigl[\frac{a_j I\{l\in I_j\}}{n_j^{1/2}}+\frac{b_j I\{\pi(l)\in I_j\}}{n_j^{1/2}}\Bigr]$$
$$=\sum_{i=1}^{k}\frac{a_i^2}{n_i}I\{l\in I_i\}+\sum_{i=1}^{k}\sum_{j=1}^{k}\frac{a_i}{n_i^{1/2}}\frac{b_j}{n_j^{1/2}}I\{l\in I_i,\pi(l)\in I_j\}+\sum_{i=1}^{k}\sum_{j=1}^{k}\frac{b_i a_j}{n_i^{1/2}n_j^{1/2}}I\{\pi(l)\in I_i,l\in I_j\}+\sum_{i=1}^{k}\frac{b_i^2}{n_i}I\{\pi(l)\in I_i\}\;.$$
Certainly,
$$\max_{l=1,\dots,N}C^2_{n,l}=O_P(1/N)\to 0\;.$$
Furthermore,
$$\sum_{l=1}^{N}C^2_{n,l}=\sum_{i=1}^{k}\frac{a_i^2}{n_i}\sum_{l=1}^{N}I\{l\in I_i\}+\sum_{i=1}^{k}\sum_{j=1}^{k}\frac{a_i}{n_i^{1/2}}\frac{b_j}{n_j^{1/2}}\sum_{l=1}^{N}I\{l\in I_i,\pi(l)\in I_j\}$$
$$+\sum_{i=1}^{k}\sum_{j=1}^{k}\frac{b_i a_j}{n_i^{1/2}n_j^{1/2}}\sum_{l=1}^{N}I\{\pi(l)\in I_i,l\in I_j\}+\sum_{i=1}^{k}\frac{b_i^2}{n_i}\sum_{l=1}^{N}I\{\pi(l)\in I_i\}$$
$$=\sum_{i=1}^{k}(a_i^2+b_i^2)+\sum_{i=1}^{k}\sum_{j=1}^{k}\frac{a_i}{n_i^{1/2}}\frac{b_j}{n_j^{1/2}}\sum_{l=1}^{N}\bigl[I\{l\in I_i,\pi(l)\in I_j\}+I\{\pi(l)\in I_i,l\in I_j\}\bigr]\;.$$
Now, the term
$$A_n(i,j)\equiv\sum_{l=1}^{N}I\{l\in I_i,\pi(l)\in I_j\} \qquad (47)$$
counts the indices in I_i that the permutation \pi sends into I_j; hence, its
distribution is hypergeometric, corresponding to sampling n_i observations
from N, of which n_j are "special". The expectation of (47) is then n_i n_j/N. Hence,
$$E[A_n(i,j)/n_i]\to p_j$$
and Var[A_n(i,j)/n_i]=O(1/n_i), implying
$$A_n(i,j)/n_i\ \overset{P}{\to}\ p_j\;.$$
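The hypergeometric behavior of A_n(i,j) is easy to verify by simulation. The Python sketch below (sample sizes n = (50, 150, 300) are hypothetical, not from the paper) compares the empirical mean of A_n(1,3) under random permutations to n_1 n_3/N:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical sample sizes n_1, n_2, n_3
n = np.array([50, 150, 300])
N = int(n.sum())
blocks = np.repeat(np.arange(3), n)  # block label of each position
i, j = 0, 2                          # study A_n(1, 3); i = 0 is the first block

draws = []
for _ in range(3000):
    pi = rng.permutation(N)          # pi[l] is the image of index l
    # count indices l in I_i whose image pi[l] lands in I_j
    draws.append(int(np.sum(blocks[pi[:n[i]]] == j)))
draws = np.array(draws)
# hypergeometric mean n_i n_j / N = 30, so A_n(i,j)/n_i is near p_j = 0.6
```

The slice `pi[:n[i]]` works here because i = 0 indexes the first block; the empirical mean of `draws` matches n_i n_j/N closely.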
It follows that
$$\sum_{l=1}^{N}C^2_{n,l}\ \overset{P}{\to}\ \sum_{i=1}^{k}(a_i^2+b_i^2)+2\sum_{i=1}^{k}\sum_{j=1}^{k}a_i b_j p_i^{1/2}p_j^{1/2}\;. \qquad (48)$$
Of course, the right side of (48) is nonnegative. By the Cauchy-Schwarz inequality,
$$\Bigl|\sum_{i=1}^{k}a_i p_i^{1/2}\Bigr|\le\Bigl[\sum_{i=1}^{k}a_i^2\Bigr]^{1/2}\;,$$
with equality if and only if a_i=cp_i^{1/2} for some constant c. It follows that the right side
of (48) is greater than or equal to (A^{1/2}-B^{1/2})^2, where A=\sum_i a_i^2 and B=\sum_i b_i^2, and
is equal to 0 if and only if A=B, i.e., a_i=cp_i^{1/2} and b_i=-cp_i^{1/2}.
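This lower bound can be checked numerically. The following sketch (illustrative, not part of the paper) draws random probability vectors p and coefficients a, b, verifies the inequality, and confirms the equality case a_i = c p_i^{1/2}, b_i = -c p_i^{1/2}:

```python
import numpy as np

rng = np.random.default_rng(3)

ok = True
for _ in range(10000):
    p = rng.dirichlet(np.ones(4))            # p_i > 0 summing to 1
    a, b = rng.normal(size=4), rng.normal(size=4)
    A, B = float(a @ a), float(b @ b)
    # right side of (48): A + B + 2 (sum a_i sqrt(p_i)) (sum b_j sqrt(p_j))
    rhs = A + B + 2.0 * (a @ np.sqrt(p)) * (b @ np.sqrt(p))
    ok = ok and rhs >= (np.sqrt(A) - np.sqrt(B)) ** 2 - 1e-9

# equality case: a_i = c sqrt(p_i), b_i = -c sqrt(p_i) makes the limit exactly 0
p0 = np.full(4, 0.25)
a0 = 2.0 * np.sqrt(p0)
rhs0 = 2.0 * float(a0 @ a0) + 2.0 * (a0 @ np.sqrt(p0)) * (-a0 @ np.sqrt(p0))
```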
When the right side of (48) is positive, condition (46) holds, and so
$$\sum_{l=1}^{N}C_{n,l}Z_l\ \overset{d}{\to}\ N\Bigl(0,\ \sigma^2\Bigl[\sum_{i=1}^{k}(a_i^2+b_i^2)+2\sum_{i=1}^{k}\sum_{j=1}^{k}a_i b_j p_i^{1/2}p_j^{1/2}\Bigr]\Bigr)\;. \qquad (49)$$
But, even if the right side of (48) is zero, we can still claim that \sum_l C_{n,l}Z_l converges in
distribution to N(0,0), i.e., it converges in probability to 0. To see why, note that \sum_l C_{n,l}Z_l has
mean 0 and variance \sigma^2\sum_l E(C^2_{n,l}). But the argument above showing that \sum_l C^2_{n,l} converges
to 0 in probability (in this case only) shows that its expectation does as well.
In general, we can now conclude that
$$(V_{n,1},\dots,V_{n,k},V'_{n,1},\dots,V'_{n,k})\ \overset{d}{\to}\ (V,V')\;,$$
which is multivariate normal with mean 0 (each of V and V' being a k-vector).
Moreover, by appropriate choices of the constants a_i and b_i, we can read off the covariance
matrix from the limiting variance in (49). In particular, taking a_i=1, a_j=0 for j\neq i,
and b_j=0 for all j yields Var(V_i)=\sigma^2. Also, Cov(V_i,V_j)=0 if
i\neq j. Similarly, Var(V'_j)=\sigma^2, and for i\neq j (by taking a_i=1=b_j and the rest of
the constants 0),
$$Cov(V_i,V'_j)=\sigma^2(p_ip_j)^{1/2}\;. \qquad (50)$$
consider a simple transformation of the V_{n,i} and V'_{n,j}. For i=1,\dots,k, define
$$W_{n,i}\equiv n_i^{1/2}(\bar X_{n,i}-\bar Z_N)=V_{n,i}-n_i^{1/2}\bar Z_N=V_{n,i}-(n_i/N)^{1/2}\sum_{m=1}^{k}(n_m/N)^{1/2}V_{n,m}\;.$$
Similarly,
$$W'_{n,j}=V'_{n,j}-(n_j/N)^{1/2}\sum_{m=1}^{k}(n_m/N)^{1/2}V'_{n,m}\;.$$
The joint asymptotic multivariate normality of the V_{n,i} together with the V'_{n,j} implies
the joint asymptotic multivariate normality of the W_{n,i} together with the W'_{n,j}. Indeed,
$$(W_{n,1},\dots,W_{n,k},W'_{n,1},\dots,W'_{n,k})\ \overset{d}{\to}\ (W_1,\dots,W_k,W'_1,\dots,W'_k)\;,$$
where
$$W_i=V_i-p_i^{1/2}\sum_{m=1}^{k}p_m^{1/2}V_m$$
and
$$W'_j=V'_j-p_j^{1/2}\sum_{m=1}^{k}p_m^{1/2}V'_m\;.$$
Importantly,
$$Cov(W_i,W'_j)=Cov\Bigl(V_i-p_i^{1/2}\sum_{m=1}^{k}p_m^{1/2}V_m,\ V'_j-p_j^{1/2}\sum_{m=1}^{k}p_m^{1/2}V'_m\Bigr)$$
$$=Cov(V_i,V'_j)-p_j^{1/2}\sum_{m=1}^{k}p_m^{1/2}Cov(V_i,V'_m)-p_i^{1/2}\sum_{m=1}^{k}p_m^{1/2}Cov(V_m,V'_j)+(p_ip_j)^{1/2}\sum_{l=1}^{k}\sum_{m=1}^{k}(p_lp_m)^{1/2}Cov(V_l,V'_m)$$
$$=\sigma^2\Bigl[(p_ip_j)^{1/2}-p_j^{1/2}\sum_{m=1}^{k}p_i^{1/2}p_m-p_i^{1/2}\sum_{m=1}^{k}p_j^{1/2}p_m+(p_ip_j)^{1/2}\sum_{l=1}^{k}\sum_{m=1}^{k}p_lp_m\Bigr]$$
$$=\sigma^2\bigl[(p_ip_j)^{1/2}-(p_ip_j)^{1/2}-(p_ip_j)^{1/2}+(p_ip_j)^{1/2}\bigr]=0\;.$$
It follows that (W_1,\dots,W_k) and (W'_1,\dots,W'_k) are independent. But since
$$T_{n,0}(Z)=\frac{1}{\sigma^2}\sum_{i=1}^{k}W^2_{n,i}\ \overset{d}{\to}\ \frac{1}{\sigma^2}\sum_{i=1}^{k}W_i^2$$
and
$$T_{n,0}(Z_\pi)=\frac{1}{\sigma^2}\sum_{i=1}^{k}(W'_{n,i})^2\ \overset{d}{\to}\ \frac{1}{\sigma^2}\sum_{i=1}^{k}(W'_i)^2\;,$$
it now follows that T_{n,0}(Z) and T_{n,0}(Z_\pi) are asymptotically independent. Moreover, by
Lemma 4.1, T_{n,0}(Z) is asymptotically Chi-squared with k-1 degrees of freedom. Since
T_{n,0}(Z_\pi) has the same distribution as T_{n,0}(Z), it has the same limiting distribution as
well.
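The asymptotic independence just established can be illustrated by simulation. The Python sketch below (sample sizes are illustrative; not from the paper) computes the pair (T_{n,0}(Z), T_{n,0}(Z_\pi)) across repeated i.i.d. data sets and checks that their empirical correlation is near zero:

```python
import numpy as np

rng = np.random.default_rng(4)

def T_n0(z, n, sigma2=1.0):
    """T_{n,0} from a pooled vector z whose first n_1 entries are sample 1, etc."""
    zbar = z.mean()
    out, start = 0.0, 0
    for ni in n:
        out += ni * (z[start:start + ni].mean() - zbar) ** 2
        start += ni
    return out / sigma2

# pairs (T_{n,0}(Z), T_{n,0}(Z_pi)) across independent data sets
n = (80, 120, 200)
pairs = np.array([
    (T_n0(z, n), T_n0(rng.permutation(z), n))
    for z in (rng.normal(size=sum(n)) for _ in range(1500))
])
corr = float(np.corrcoef(pairs.T)[0, 1])  # near zero: asymptotic independence
```

Both coordinates also have empirical means near k - 1 = 2, matching the Chi-squared limit.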
Next, we show the same result with T_{n,0} replaced by T_{n,1}. By the fact that
Z and Z_\pi have the same distribution,
$$T_{n,1}(Z_\pi)-T_{n,0}(Z_\pi)\ \overset{d}{=}\ T_{n,1}(Z)-T_{n,0}(Z)\;,$$
and so, by the proof of Lemma 4.1,
$$T_{n,1}(Z_\pi)-T_{n,0}(Z_\pi)\ \overset{P}{\to}\ 0\;.$$
Writing T_{n,1}=T_{n,0}+[T_{n,1}-T_{n,0}], we can then apply Slutsky's Theorem for randomiza-
tion distributions (Theorem 3.2) to conclude that R_{n,1}(\cdot) has the same limiting behavior
as R_{n,0}(\cdot).
The proof is now complete under the assumption that all N observations are i.i.d. We
now argue, using the coupling argument in Section 3.3, that the behavior of the permu-
tation distribution under general P_1,\dots,P_k (satisfying the finite variance assumption)
is the same as when all observations are i.i.d. with distribution given by the mixture
\bar P. So, construct Z, \bar Z and \bar Z_{\pi_0} as in the coupling construction. It suffices to
show that, for a random permutation \pi,
$$T_{n,1}(Z_\pi)-T_{n,1}(\bar Z_{\pi\pi_0})\ \overset{P}{\to}\ 0\;. \qquad (51)$$
Write
$$T_{n,1}(Z)=\sum_{i=1}^{k}\frac{1}{S^2_{n,i}}\Biggl[n_i^{1/2}\bar X_{n,i}-\frac{\sum_{j=1}^{k}n_j^{1/2}\bar X_{n,j}\,(n_i^{1/2}n_j^{1/2}/N)/S^2_{n,j}}{\sum_{j=1}^{k}(n_j/N)/S^2_{n,j}}\Biggr]^2\;. \qquad (52)$$
Then, T_{n,1}(Z_\pi) is computed by replacing
$$\bar X_{n,i}=\bar X_{n,i}(Z)=\frac{1}{n_i}\sum_{l=1}^{N}Z_l I\{l\in I_i\}$$
with
$$\bar X_{n,i}(Z_\pi)=\frac{1}{n_i}\sum_{l=1}^{N}Z_l I\{\pi(l)\in I_i\}\;,$$
and S^2_{n,i}=S^2_{n,i}(Z) gets replaced by
$$S^2_{n,i}(Z_\pi)\equiv\frac{1}{n_i-1}\Bigl[\sum_{l=1}^{N}Z_l^2 I\{\pi(l)\in I_i\}-n_i\bar X^2_{n,i}(Z_\pi)\Bigr]\;.$$
From (52), it now suffices to show that, for each i,
$$n_i^{1/2}\bar X_{n,i}(Z_\pi)-n_i^{1/2}\bar X_{n,i}(\bar Z_{\pi\pi_0})\ \overset{P}{\to}\ 0 \qquad (53)$$
and
$$S^2_{n,i}(Z_\pi)-S^2_{n,i}(\bar Z_{\pi\pi_0})\ \overset{P}{\to}\ 0\;. \qquad (54)$$
To show (53), first note that the left side has mean 0; so, it suffices to show that its variance
tends to 0. Recall that Z_\pi and \bar Z_{\pi\pi_0} differ in at most D = O_P(N^{1/2}) entries.
But, conditional on \pi, \pi_0 and the multinomial variables (N_1,\dots,N_k) in the coupling
construction, for indices l where Z_l\neq\bar Z_{\pi_0(l)},
$$Var(Z_l-\bar Z_{\pi_0(l)}\mid\pi,\pi_0,N_1,\dots,N_k)\le 2V\;,$$
where V=\max(\sigma_1^2,\dots,\sigma_k^2). But the left side of (53) is
$$n_i^{-1/2}\sum_{l=1}^{N}[Z_l-\bar Z_{\pi_0(l)}]I\{\pi(l)\in I_i\}\;,$$
and the sum here is conditionally a sum of at most D independent variables, each with variance
at most 2V. Hence, the variance of the left side of (53) is conditionally at most 2VD/n_i, and
hence the unconditional variance is at most 2V E(D)/n_i\to 0.
To show (54), note that
$$\frac{1}{n_i}\sum_{l=1}^{N}[Z_l^2-\bar Z^2_{\pi_0(l)}]I\{\pi(l)\in I_i\}$$
has mean 0 conditional on \pi, \pi_0 and N_1,\dots,N_k, and its absolute value is bounded above
by
$$\frac{1}{n_i}\sum_{l\in J}[Z_l^2+\bar Z^2_{\pi_0(l)}]\;.$$
Here, the sum is over the set J of indices l where Z_l\neq\bar Z_{\pi_0(l)}. But, conditionally, there
are at most D nonzero terms in J, each having expectation bounded by 2V, and
so the whole expression has expectation bounded above by 2V E(D)/n_i\to 0. The result (54)
now follows easily.
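The order D = O_P(N^{1/2}) of the number of mismatched coordinates in the coupling can be illustrated numerically. The Python sketch below (mixture weights are illustrative; not from the paper) simulates the multinomial counts (N_1,\dots,N_k) and checks that the mean of \sum_i |N_i - n_i|, a bound on the mismatch count, grows like N^{1/2}:

```python
import numpy as np

rng = np.random.default_rng(5)

# Component probabilities of the mixture (illustrative)
p = np.array([0.25, 0.35, 0.40])

ratios = {}
for N in (400, 1600, 6400):
    n = np.rint(N * p).astype(int)   # target sample sizes n_i
    # D bounds the number of coordinates where the coupled samples disagree:
    # sum_i |N_i - n_i|, with (N_1,...,N_k) multinomial(N, p)
    D = [int(np.abs(rng.multinomial(N, p) - n).sum()) for _ in range(1000)]
    ratios[N] = float(np.mean(D)) / np.sqrt(N)
# ratios[N] is roughly constant in N, consistent with D = O_P(N^{1/2})
```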
References
Devroye, L., and Wagner, T.J. (1980). The strong uniform consistency of kernel density
estimates. Multivariate Analysis V (P.R. Krishnaiah, ed.). North Holland, 59–77.
Efron, B. (1979). Bootstrap Methods: Another Look at the Jackknife. Annals of Statis-
tics 7, 1–26.
Hall, P., DiCiccio, T., and Romano, J. (1989). On Smoothing and the Bootstrap. Annals
of Statistics 17, 692–704.
Hoeffding, W. (1952). The large-sample power of tests based on permutations of obser-
vations. The Annals of Mathematical Statistics 23, 169–192.
Janssen, A. (1997). Studentized permutation tests for non-i.i.d. hypotheses and the
generalized Behrens-Fisher problem. Statistics and Probability Letters 36, 9–21.
Janssen, A. (2005). Resampling student’s t-type statistics. Annals of the Institute of
Statistical Mathematics 57, 507–529.
Janssen, A. and Pauls, T. (2003). How do bootstrap and permutation tests work? Annals
of Statistics 31, 768–806.
Janssen, A. and Pauls, T. (2005). A Monte Carlo comparison of studentized bootstrap
and permutation tests for heteroscedastic two-sample problems. Computational Statis-
tics 20, 369–383.
Krishnamoorthy, K., Lu, F. and Mathew, T. (2007). A parametric bootstrap approach
for ANOVA with unequal variances: Fixed and random models. Computational Statis-
tics & Data Analysis 51, 5731–5742.
Lehmann, E. L. (1998). Nonparametrics: Statistical Methods Based on Ranks. Revised
first edition, Prentice Hall, New Jersey.
Lehmann, E. L. (1999). Elements of Large-Sample Theory. Springer-Verlag, New York.
Lehmann, E. L. (2009). Parametric versus nonparametrics: two alternative methodolo-
gies. Journal of Nonparametric Statistics 21, 397–405.
Lehmann, E. L. and Romano, J. (2005). Testing Statistical Hypotheses. 3rd edition,
Springer-Verlag, New York.
Neubert, K. and Brunner, E. (2007). A Studentized permutation test for the non-
parametric Behrens-Fisher problem. Computational Statistics & Data Analysis 51,
5192–5204.
Neuhaus, G. (1993). Conditional rank tests for the two-sample problem under random
censorship. Annals of Statistics 21, 1760–1779.
Pauly, M. (2010). Discussion about the quality of F-ratio resampling tests for comparing
variances. TEST, 1–17.
Politis, D., Romano, J. and Wolf, M. (1999). Subsampling. Springer-Verlag, New York.
Rice, W. and Gaines, S. (1989). One-way analysis of variance with unequal variances.
Proc. Nat. Acad. Sci. 86, 8183–8184.
Romano, J. (1989). Bootstrap and randomization tests of some nonparametric hypothe-
ses. Annals of Statistics 17, 141–159.
Romano, J. (1990). On the behavior of randomization tests without a group invariance
assumption. Journal of the American Statistical Association 85, 686–692.
Romano, J. (2009). Discussion of “parametric versus nonparametrics: Two alternative
methodologies”.
Serfling, R. J. (1980). Approximation Theorems of Mathematical Statistics. Wiley, New
York.
van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press, New
York.
ADDRESS:
EunYi Chung: Department of Economics, Stanford University, Stanford, CA 94305-
6072; [email protected]
Joseph P. Romano: Departments of Statistics and Economics, Stanford University, Stan-
ford, CA 94305-4065; [email protected]