EXACT AND ASYMPTOTICALLY ROBUST PERMUTATION TESTS
By
Eun Yi Chung Joseph P. Romano
Technical Report No. 2011-05 May 2011
Department of Statistics STANFORD UNIVERSITY
Stanford, California 94305-4065
http://statistics.stanford.edu
Exact and Asymptotically Robust Permutation Tests
EunYi Chung
Department of Economics
Stanford University
Joseph P. Romano∗
Departments of Statistics and Economics
Stanford University
May 5, 2011
Abstract
Given independent samples from P and Q, two-sample permutation tests allow
one to construct exact level tests when the null hypothesis is P = Q. On the other
hand, when comparing or testing particular parameters θ of P and Q, such as their
means or medians, permutation tests need not be level α, or even approximately
level α in large samples. Under very weak assumptions for comparing estima-
tors, we provide a general test procedure whereby the asymptotic validity of the
permutation test holds while retaining the exact rejection probability α in finite
samples when the underlying distributions are identical. A quite general theory
is possible based on a coupling construction, as well as a key contiguity argument
for the binomial and hypergeometric distributions. The ideas are broadly appli-
cable and special attention is given to a nonparametric k-sample Behrens-Fisher
problem, whereby a permutation test is constructed which is exact level α under
the hypothesis of identical distributions, but has asymptotic rejection probability
α under the more general null hypothesis of equality of means. A Monte Carlo
simulation study is performed.
2010 MSC subject classifications. Primary 62E20, Secondary 62G10
KEY WORDS: Behrens-Fisher problem; Coupling; Permutation test
∗Research has been supported by NSF Grant DMS-0707085.
1 Introduction
In this article, we consider the behavior of two-sample (and later also k-sample) permu-
tation tests for testing problems when the fundamental assumption of identical distribu-
tions need not hold. Assume X1, . . . , Xm are i.i.d. according to a probability distribution
P , and independently, Y1, . . . , Yn are i.i.d. Q. The underlying model specifies a family of
pairs of distributions (P,Q) in some space Ω. For the problems considered here, Ω spec-
ifies a nonparametric model, such as the set of all pairs of distributions. Let N = m+n,
and write
Z = (Z1, . . . , ZN) = (X1, . . . , Xm, Y1, . . . , Yn) . (1)
Let Ω̄ = {(P,Q) : P = Q}. Under the assumption (P,Q) ∈ Ω̄, the joint distribution of
(Z1, . . . , ZN) is the same as (Zπ(1), . . . , Zπ(N)), where (π(1), . . . , π(N)) is any permutation
of {1, . . . , N}. It follows that, when testing any null hypothesis H0 : (P,Q) ∈ Ω0, where Ω0 ⊂ Ω̄, an exact level α test can be constructed by a permutation test. To
review how, let G_N denote the set of all permutations π of {1, . . . , N}. Then, given any
test statistic Tm,n = Tm,n(Z1, . . . , ZN), recompute Tm,n for all permutations π; that is,
compute Tm,n(Zπ(1), . . . , Zπ(N)) for all π ∈ GN, and let their ordered values be
T^{(1)}_{m,n} ≤ T^{(2)}_{m,n} ≤ · · · ≤ T^{(N!)}_{m,n} .
Fix a nominal level α, 0 < α < 1, and let k be defined by
k = N! − [αN!] ,
where [αN!] denotes the largest integer less than or equal to αN!. Let M^+(z) and M^0(z) be the number of values T^{(j)}_{m,n}(z) (j = 1, . . . , N!) which are greater than T^{(k)}_{m,n}(z) and equal to T^{(k)}_{m,n}(z), respectively. Set

a(z) = (αN! − M^+(z)) / M^0(z) .
Define the randomization test function φ(Z) to be equal to 1, a(Z), or 0 according to whether T_{m,n}(Z) > T^{(k)}_{m,n}(Z), T_{m,n}(Z) = T^{(k)}_{m,n}(Z), or T_{m,n}(Z) < T^{(k)}_{m,n}(Z), respectively.
Then, under any (P,Q) ∈ Ω,
EP,Q[φ(X1, . . . , Xm, Y1, . . . , Yn)] = α .
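The construction above can be sketched in code. A minimal illustration with a hypothetical toy data set (real applications with larger N would sample permutations at random instead of enumerating all N! of them; integer data keep the tie counts exact):

```python
# Sketch of the exact permutation test described above. For small N we can
# enumerate all N! orderings of the pooled sample.
from itertools import permutations

def perm_test(x, y, stat, alpha=0.05):
    """Return the randomized test function phi(Z) in {0, a(Z), 1}."""
    z = tuple(x) + tuple(y)
    m = len(x)
    # Recompute the statistic at every permutation of the pooled sample.
    vals = sorted(stat(p[:m], p[m:]) for p in permutations(z))
    nfact = len(vals)                       # N!
    k = nfact - int(alpha * nfact)          # k = N! - [alpha N!]
    t_k = vals[k - 1]                       # k-th smallest value, T^(k)
    m_plus = sum(v > t_k for v in vals)     # M+(z): values above T^(k)
    m_zero = sum(v == t_k for v in vals)    # M0(z): values equal to T^(k)
    a = (alpha * nfact - m_plus) / m_zero
    t_obs = stat(x, y)
    if t_obs > t_k:
        return 1.0
    return a if t_obs == t_k else 0.0

# For equal sample sizes, sum(x) - sum(y) orders the permutations exactly as
# the difference of means does, while staying in exact integer arithmetic.
stat = lambda x, y: sum(x) - sum(y)
phi = perm_test((3, 1, 4), (1, 5, 9), stat)
```

Averaging φ over all N! rearrangements of a fixed pooled data set recovers α exactly, which is the identity behind the exactness statement E_{P,Q}[φ] = α when P = Q.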
Also, define the permutation distribution as

R^T_{m,n}(t) = (1/N!) Σ_{π ∈ G_N} I{T_{m,n}(Z_{π(1)}, . . . , Z_{π(N)}) ≤ t} , (2)

where G_N denotes the N! permutations of {1, 2, . . . , N}. Roughly speaking (after accounting for discreteness), the permutation test rejects H0 if the test statistic T_{m,n} (evaluated at the original data set) exceeds T^{(k)}_{m,n}, that is, a 1 − α quantile of this permutation distribution.
However, problems arise if Ω0 is strictly bigger than Ω̄. Since a permuted
data set no longer has the same distribution as the original data set, the argument leading
to the exact construction of a level α test fails, and faulty inferences can occur.
To be concrete, consider constructing a permutation test based on the difference of
sample means
T_{m,n} = m^{1/2}(X̄_m − Ȳ_n) .
Note that we are not taking the absolute difference, so that the test is one-sided, as
we are rejecting for large positive values of the difference. First of all, one needs to be
very careful in deciding what family of distributions Ω0 is being tested under the null
hypothesis. If the null specifies P = Q, then without further assumptions, a test based
on Xm − Yn is not appropriate. First of all, even if P = Q so that the permutation
construction will result in probability of rejection equal to α, the test clearly will not
have any power against distributions P and Q whose means are identical but P ≠ Q.
The test is only warranted if it can be assumed that lack of equality of distributions is
accompanied by a corresponding change in population means. Such an assumption may
be inappropriate. Consider the case where one group receives a treatment and the other
a placebo. Then, no treatment effect may arguably be considered equivalent to both
groups receiving a placebo, in which case the distributions would be the same. However,
even in this case, if there is an effect due to treatment, P and Q may differ not only in
location but also in other aspects of the distribution such as scale and shape. Moreover,
if the two groups being compared are distinct in a way other than the assignment of
treatment or placebo, as in comparing educational achievement between boys and girls,
then it is especially crucial to clarify what is being tested and the implicit underlying
assumptions.
In such cases, the permutation test based on the difference of sample means is only
appropriate as a test of equality of population means. However, the permutation test no
longer controls the level of the test, even in large samples. As is well-known (Romano,
1990), the permutation test possesses a certain asymptotic robustness as a test of difference in means if m/n → 1 as n → ∞, or the underlying variances of P and Q are equal,
in the sense that the rejection probability under the null hypothesis of equal means tends
to the nominal level. Without equal variances and comparable sample sizes, the rejec-
tion probability can be much larger than the nominal level, which is a concern. Because
of the lack of robustness and the increased probability of a Type 1 error, rejection of the
null may incorrectly be interpreted as rejection of equal means, when in fact it is caused
by unequal variances and unequal sample sizes. Even more alarming is the possibility of
rejecting a one-sided null hypothesis in favor of a positive mean difference when in fact
the difference in means is negative. Further note that there is also the possibility that
the rejection probability can be much less than the nominal level, which by continuity
implies the test is biased and has little power to detect a true difference in means.
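This size distortion is easy to reproduce by simulation. A minimal sketch (the sample sizes, variances, permutation count B, and replication count are arbitrary illustrative choices, and the permutation distribution is approximated by randomly sampled permutations rather than all N!):

```python
# Monte Carlo illustration: the one-sided permutation test based on the raw
# difference of means over-rejects when the smaller sample has the larger
# variance, even though both population means are equal.
import random

random.seed(0)

def perm_pvalue(x, y, B=300):
    """One-sided p-value for T = mean(x) - mean(y) from B random permutations."""
    z = list(x) + list(y)
    m, n = len(x), len(y)
    t_obs = sum(x) / m - sum(y) / n
    hits = 0
    for _ in range(B):
        random.shuffle(z)
        if sum(z[:m]) / m - sum(z[m:]) / n >= t_obs:
            hits += 1
    return (hits + 1) / (B + 1)

m, n, alpha, nsim = 10, 40, 0.05, 400
rejections = 0
for _ in range(nsim):
    x = [random.gauss(0, 4) for _ in range(m)]   # P = N(0, 16), small sample
    y = [random.gauss(0, 1) for _ in range(n)]   # Q = N(0, 1), large sample
    rejections += perm_pvalue(x, y) <= alpha
rate = rejections / nsim   # well above the nominal 0.05 in this configuration
```

Here the asymptotic rejection probability works out to roughly 0.18 rather than 0.05, and the simulated rate reflects that gap.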
The situation is even worse when basing a test on a difference in sample medians,
in the sense that regardless of sample sizes, the asymptotic rejection probability of the
permutation test will be α only under very stringent conditions, essentially only when the underlying distributions are the same.
However, in a very insightful paper in the context of random censoring models,
Neuhaus (1993) first realized that by proper studentization of a test statistic, the per-
mutation test can result in asymptotically valid inference even when the underlying
distributions are not the same. Later, Janssen (1997) showed that, in the case of the
difference of sample means, by proper studentization of a test statistic, the permutation
test is a valid asymptotic approach. In particular, his results imply that, if the underly-
ing population means are identical (and population variances are finite and may differ),
then the asymptotic rejection probability of the permutation test is α. Furthermore, the
use of the permutation test retains the property that the exact rejection probability is
α if the underlying distributions are identical. This result has been extended to other
specific problems, such as comparing variances by Pauly (2010) and the two-sample
Wilcoxon test by Neubert and Brunner (2007). Other results on permutation tests are
presented in Janssen (2005), Janssen and Pauls (2003), and Janssen and Pauls (2005).
The goal of this paper is to obtain a quite general result of the same phenomenon.
That is, when basing a permutation test using some test statistic as a test of a parameter
(usually a difference of parameters associated with marginal distributions), we would like
to retain the exactness property when P = Q, and also have the rejection probability be
α for the more general null hypothesis specifying the parameter (such as the difference
being zero). Of course, there are many alternatives to getting asymptotic tests, such as
the bootstrap or subsampling. However, we do not wish to give up the exactness property
under P = Q, and resampling methods do not have such finite sample properties. The
main problem becomes: what is the asymptotic behavior of R^T_{m,n}(·) defined in (2) for
general test statistic sequences Tm,n when the underlying distributions differ. Only
for suitable test statistics is it possible to achieve both finite sample exactness when the
underlying distributions are equal, but also maintain a large sample rejection probability
near the nominal level when the underlying distributions need not be equal. In this sense,
our results are both exact and asymptotically robust for heterogenous populations.
This paper provides a framework for testing a parameter that depends on P and Q.
We construct a general test procedure where the asymptotic validity of the permutation
test holds in a general setting. Assuming that estimators are asymptotically linear and
consistent estimators are available for their asymptotic variance, we provide a test that
has asymptotic rejection probability equal to the nominal level α, but still retains the
exact rejection probability of α in finite samples if P = Q. It is not even required
that the estimators are based on differentiable functionals, and some methods like the
bootstrap would not necessarily be even asymptotically valid under such conditions,
let alone retain the finite sample exactness property when P = Q. The arguments of the paper are quite different from those of Janssen and previous authors, and hold under great
generality. For example, they immediately apply to comparing means, variances, or
medians. The key idea is to show that the permutation distribution behaves like the
unconditional distribution of the test statistic when all N observations are i.i.d. from
the mixture distribution pP + (1 − p)Q, where p is such that m/N → p. This
seems intuitive because the permutation distribution permutes the observations so that a
permuted sample is almost like a sample from the mixture distribution. In order to make
this idea precise, a coupling argument is given in Section 3.3. Of course, the permutation
distribution depends on all permuted samples (for a given original data set). But even
for one permuted data set, it cannot exactly be viewed as a sample from pP + (1− p)Q.
Indeed, the first m observations from the mixture would include Bm observations from
P and the rest from Q, where Bm has the binomial distribution based on m trials and
success probability p. On the other hand, for a permuted sample, if Hm denotes the
number of observations from P , then Hm has the hypergeometric distribution with mean
mp. The key argument that allows for such a general result concerns the contiguity of
the distributions of Bm and Hm. Section 3 highlights the main technical ideas required
for the proofs. Section 4 applies these ideas to the k-sample Behrens-Fisher problem,
though no assumption of normality is required. Once again, exact level is achieved when
all k distributions are equal, but the asymptotic rejection probability equals the nominal
level under the null hypothesis of mean equality (under a finite variance assumption).
Lastly, Monte Carlo simulation studies illustrating our results are presented in Section
5. All proofs are reserved for the appendix.
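The binomial/hypergeometric distinction behind the contiguity argument can be seen numerically. A small sketch (the sample sizes and the number of draws are arbitrary): both counts have mean mp, while the hypergeometric count is less dispersed by the finite-population factor (N − m)/(N − 1).

```python
# Compare H_m (hypergeometric: P-observations among the first m slots of a
# permuted pooled sample) with B_m (binomial: P-observations among m i.i.d.
# draws from the mixture pP + (1-p)Q). Means agree; variances differ.
import random

random.seed(1)
m, n = 60, 40
N = m + n
p = m / N
pool = [1] * m + [0] * n          # 1 marks an observation from P
draws = 20000

H = [sum(random.sample(pool, m)) for _ in range(draws)]                  # H_m
B = [sum(random.random() < p for _ in range(m)) for _ in range(draws)]   # B_m

mean_H = sum(H) / draws            # both means are near m*p = 36
mean_B = sum(B) / draws
var_H = sum((h - mean_H) ** 2 for h in H) / draws
var_B = sum((b - mean_B) ** 2 for b in B) / draws   # var_H ≈ var_B * (N-m)/(N-1)
```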
2 Robust Studentized Two-sample Test
In this section, we consider the general problem of inference from the permutation
distribution when comparing parameters from two populations. Specifically, assume
X1, . . . , Xm are i.i.d. P and, independently, Y1, . . . , Yn are i.i.d. Q. Let θ(·) be a real-
valued parameter, defined on some space of distributions P . The problem is to test the
null hypothesis
H0 : θ(P ) = θ(Q) . (3)
Of course, when P = Q, one can construct permutation tests with exact level α. Unfortunately, if P ≠ Q, the test need not be valid in the sense that the probability of a Type
1 error need not be α even asymptotically. Thus, our goal is to construct a procedure
that has asymptotic rejection probability equal to α quite generally, but also retains the
exactness property in finite samples when P = Q.
We will assume that estimators are available that are asymptotically linear. Specif-
ically, assume that, under P , there exists an estimator θm = θm(X1, . . . , Xm) which
satisfies
m^{1/2}[θ_m − θ(P)] = (1/√m) Σ_{i=1}^{m} f_P(X_i) + o_P(1) . (4)
Similarly, we assume that, based on the Yj (under Q),
n^{1/2}[θ_n − θ(Q)] = (1/√n) Σ_{j=1}^{n} f_Q(Y_j) + o_Q(1) . (5)
The functions determining the linear approximation fP and fQ can of course depend on
the underlying distributions. Different forms of differentiability guarantee such linear
expansions in the special case when θ_m takes the form of an empirical estimate θ(P̂_m), where P̂_m is the empirical measure constructed from X_1, . . . , X_m, but we will not need
to assume such stronger conditions. We will argue that our assumptions of asymptotic
linearity already imply a result about the permutation distribution corresponding to the
statistic m^{1/2}[θ_m(X_1, . . . , X_m) − θ_n(Y_1, . . . , Y_n)], without having to impose any differentiability assumptions. However, we will assume the expansion (4) holds not just for i.i.d. samples under P and under Q, but also when sampling i.i.d. observations from the mixture distribution P̄ = pP + qQ, where q = 1 − p. This is a weak assumption and replaces having
to study the permutation distribution based on variables that are no longer indepen-
dent nor identically distributed with a simple assumption about the behavior under an
i.i.d. sequence. Indeed, we will argue that in all cases, the permutation distribution be-
haves asymptotically like the unconditional limiting sampling distribution of the studied
statistic sequence when sampling i.i.d. observations from P̄.
Theorem 2.1. Assume X1, . . . , Xm are i.i.d. P and, independently, Y1, . . . , Yn are i.i.d.
Q. Consider testing the null hypothesis (3) based on a test statistic of the form
Tm,n = m1/2[θm(X1, . . . , Xm)− θn(Y1, . . . , Yn)] ,
where the estimators satisfy (4) and (5). Further assume E_P f_P(X_i) = 0 and

0 < E_P f_P^2(X_i) ≡ σ^2(P) < ∞ ,
and the same with P replaced by Q. Let m → ∞, n → ∞, with N = m + n, p_m = m/N, q_m = n/N, and p_m → p ∈ (0, 1) with

p_m − p = O(m^{−1/2}) . (6)

Assume the estimator sequence also satisfies (4) with P replaced by P̄ = pP + qQ, with σ^2(P̄) < ∞.
Then, the permutation distribution of T_{m,n} given by (2) satisfies

sup_t |R^T_{m,n}(t) − Φ(t/τ(P̄))| → 0 in probability,

where

τ^2(P̄) = σ^2(P̄) + (p/(1 − p)) σ^2(P̄) = (1/(1 − p)) σ^2(P̄) . (7)
Remark 2.1. Under H0, the true unconditional sampling distribution of T_{m,n} is asymptotically normal with mean 0 and variance

σ^2(P) + (p/(1 − p)) σ^2(Q) , (8)

which does not equal τ^2(P̄) defined by (7) in general.
Example 2.1. (Difference of Means) As is well-known, even for the case of comparing population means by sample means, equality of (7) and (8) holds if and only if p = 1/2 or σ^2(P) = σ^2(Q).
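For the difference of means, f_P(x) = x − E_P(X), so under the null of equal means σ^2(P̄) is just the variance of the mixture, p σ^2(P) + (1 − p) σ^2(Q). The comparison of (7) with (8) can then be checked directly; the particular values below are arbitrary illustrative choices:

```python
# tau^2(Pbar) from (7) versus the true sampling variance (8), for the mean
# functional under the null of equal means.
def tau2(p, var_p, var_q):
    var_mix = p * var_p + (1 - p) * var_q   # sigma^2(Pbar) for the mean
    return var_mix / (1 - p)                # (7)

def true_var(p, var_p, var_q):
    return var_p + p / (1 - p) * var_q      # (8)

# Nonzero when p != 1/2 and the variances differ:
gap = tau2(0.3, 4.0, 1.0) - true_var(0.3, 4.0, 1.0)
```

A little algebra confirms the claim of the example: the two expressions differ by a multiple of (2p − 1)(σ^2(P) − σ^2(Q)).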
Example 2.2. (Difference of Medians) Let F and G denote the c.d.f.s corresponding
to P and Q. Let θ(F) denote the median of F, i.e., θ(F) = inf{x : F(x) ≥ 1/2}. Then,
it is well known (Serfling, 1980) that, if F is continuously differentiable at θ(P ) with
derivative F ′ (and the same with F replaced by G), then
m^{1/2}[θ(P̂_m) − θ(P)] = (1/√m) Σ_{i=1}^{m} [1/2 − I{X_i ≤ θ(P)}] / F′(θ(P)) + o_P(1)
and similarly,
n^{1/2}[θ(Q̂_n) − θ(Q)] = (1/√n) Σ_{j=1}^{n} [1/2 − I{Y_j ≤ θ(Q)}] / G′(θ(Q)) + o_Q(1) .
Thus, we can apply Theorem 2.1 and conclude that, when θ(P) = θ(Q) = θ, the
permutation distribution of Tm,n is approximately a normal distribution with mean 0
and variance

1 / (4(1 − p)[pF′(θ) + (1 − p)G′(θ)]^2)
in large samples. On the other hand, the true sampling distribution is approximately a
normal distribution with mean 0 and variance
v^2(P,Q) ≡ 1/(4[F′(θ)]^2) + (p/(1 − p)) · 1/(4[G′(θ)]^2) . (9)
Thus, the permutation distribution and the true unconditional sampling distribution
behave differently asymptotically unless F ′(θ) = G′(θ) is satisfied. Since we do not
assume P = Q, this condition is a strong assumption. Hence, the permutation test
for testing equality of medians is generally not valid in the sense that the rejection
probability tends to a value that is far from the nominal level α.
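Plugging concrete distributions into the two formulas makes the mismatch visible. A small numeric sketch using F = N(0,1) and G = N(0,4) (chosen for illustration), which share median θ = 0 but have unequal densities there:

```python
# Permutation-limit variance vs. the true sampling variance (9) for the
# difference of medians, with F = N(0,1), G = N(0,4), p = 1/2, theta = 0.
import math

phi0 = 1 / math.sqrt(2 * math.pi)   # standard normal density at 0
p = 0.5
f_theta = phi0                      # F'(0) for N(0,1)
g_theta = phi0 / 2                  # G'(0) for N(0,4): density scales by 1/sd

perm_var = 1 / (4 * (1 - p) * (p * f_theta + (1 - p) * g_theta) ** 2)
true_var = 1 / (4 * f_theta ** 2) + (p / (1 - p)) * 1 / (4 * g_theta ** 2)
# perm_var = 16*pi/9 (about 5.59) while true_var = pi/2 + 2*pi (about 7.85),
# so the unstudentized permutation test for medians uses the wrong variance.
```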
Remark 2.2. The assumption (6) is of course a little stronger than the more basic
assumption m/N → p, where no rate is required on the difference between m/N and p. Of
course, we are free to choose p as m/N in any situation, and the assumption is rather
innocuous. (Indeed, for any m0 and N0 with m0/N0 = p, we can always let m and N
tend to infinity with m = km0 and N = kN0 and let k → ∞.) Alternatively, we can
replace (6) with the more basic assumption m/N → p as long as we slightly strengthen
the basic assumption that the statistic has a linear expansion under P̄ = pP + qQ to
also have a linear expansion under sequences
P̄_{m,n} = (m/N) P + (n/N) Q ,
which is a rather weak form of local uniform triangular array type of convergence. We
prefer to assume the convergence hypothesis based on an i.i.d. sequence from a fixed P̄,
though it is really a matter of choice. Usually, we can appeal to some basic convergence in distribution results with ease, but if linear expansions are available (or can be derived)
which are “uniform” in the underlying probability distribution near P , then such results
can be used instead with the weaker hypothesis pm → p.
The main goal now is to show how studentizing the test statistic leads to a general
correction.
Theorem 2.2. Assume the setup and conditions of Theorem 2.1. Further assume that σ̂_m(X_1, . . . , X_m) is a consistent estimator of σ(P) when X_1, . . . , X_m are i.i.d. P. Assume consistency also under Q and P̄, so that σ̂_n(V_1, . . . , V_n) → σ(P̄) in probability as n → ∞ when the V_i are i.i.d. P̄. Define the studentized test statistic
S_{m,n} = T_{m,n} / V_{m,n} , (10)
where
V_{m,n} = [σ̂_m^2(X_1, . . . , X_m) + (m/n) σ̂_n^2(Y_1, . . . , Y_n)]^{1/2} ,
and consider the permutation distribution defined in (2) with T replaced by S. Then,
sup_t |R^S_{m,n}(t) − Φ(t)| → 0 in probability . (11)
Thus, the permutation distribution is asymptotically standard normal, as is the true
unconditional limiting distribution of the test statistics Sm,n. Indeed, as mentioned in
Remark 2.1, the true unconditional limiting distribution of Tm,n is normal with mean 0
and variance given by (8). But, when sampling m observations from P and n from Q, V_{m,n}^2 tends in probability to (8), and hence the limiting distribution of S_{m,n} is standard normal, the same as that of the permutation distribution.
Example 2.1. (continued) As proved by Janssen (1997), even when the underlying
distributions may have different variances and different sample sizes, permutation tests
based on studentized statistics
T_{m,n} = m^{1/2}(X̄_m − Ȳ_n) / [S_X^2 + (m/n) S_Y^2]^{1/2} ,

where S_X^2 = (1/(m − 1)) Σ_{i=1}^{m} (X_i − X̄_m)^2 and S_Y^2 = (1/(n − 1)) Σ_{j=1}^{n} (Y_j − Ȳ_n)^2, can allow one to
construct a test that attains asymptotic rejection probability α when P 6= Q while
providing an additional advantage of maintaining exact level α when P = Q.
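A sketch of this studentized permutation test (random sampling of permutations, and the specific sample sizes and seed, are illustrative choices, not part of the theory):

```python
# Studentized two-sample permutation test of Example 2.1: the statistic is
# recomputed, including its studentization, at each permutation.
import random
from statistics import mean, variance

def studentized(x, y):
    m, n = len(x), len(y)
    # variance() is the unbiased 1/(m-1) form, matching S_X^2 and S_Y^2.
    return m ** 0.5 * (mean(x) - mean(y)) / (variance(x) + m / n * variance(y)) ** 0.5

def studentized_perm_pvalue(x, y, B=999):
    z = list(x) + list(y)
    m = len(x)
    s_obs = studentized(x, y)
    hits = 0
    for _ in range(B):
        random.shuffle(z)
        if studentized(z[:m], z[m:]) >= s_obs:
            hits += 1
    return (hits + 1) / (B + 1)

random.seed(2)
x = [random.gauss(0, 3) for _ in range(15)]   # unequal variances ...
y = [random.gauss(0, 1) for _ in range(45)]   # ... and unequal sample sizes
pval = studentized_perm_pvalue(x, y)          # asymptotically valid despite P != Q
```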
Example 2.2. (continued) Define the studentized median statistic
M_{m,n} = m^{1/2}[θ(P̂_m) − θ(Q̂_n)] / v_{m,n} ,
where vm,n is a consistent estimator of v(P,Q) defined in (9). There are several choices for
a consistent estimator of v(P,Q). Examples include the usual kernel estimator (Devroye
and Wagner, 1980), bootstrap estimator (Efron, 1979), and the smoothed bootstrap
(Hall, DiCiccio, and Romano, 1989).
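One concrete choice, sketched below, is the bootstrap variance estimator applied separately to each sample; the scaling matches (9) in that each term estimates n·Var of that sample's median. (The bootstrap size B and the sample sizes are arbitrary tuning choices for illustration.)

```python
# Bootstrap estimate of the scaled variance of a sample median, and the
# resulting studentized difference-of-medians statistic M_{m,n}.
import random
from statistics import median

def boot_var_median(sample, B=500, rng=random):
    """Estimate n * Var(median) by resampling with replacement."""
    n = len(sample)
    meds = [median(rng.choices(sample, k=n)) for _ in range(B)]
    mbar = sum(meds) / B
    return n * sum((v - mbar) ** 2 for v in meds) / (B - 1)

def studentized_median(x, y, rng=random):
    m, n = len(x), len(y)
    v2 = boot_var_median(x, rng=rng) + (m / n) * boot_var_median(y, rng=rng)
    return m ** 0.5 * (median(x) - median(y)) / v2 ** 0.5

random.seed(3)
x = [random.gauss(0, 1) for _ in range(200)]
# For N(0,1), n * Var(median) should be near 1/(4*phi(0)^2) = pi/2, about 1.57.
v_hat = boot_var_median(x)
```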
Remark 2.3. Suppose that the true unconditional distribution of a test statistic T_{m,n} is, under
the null hypothesis, asymptotically given by a distribution R(·). Typically a test rejects
when Tm,n > rm,n, where rm,n is nonrandom, as happens in many classical settings.
Then, we typically have r_{m,n} → r(1 − α) ≡ R^{−1}(1 − α). Assume that T_{m,n} converges
to some limit law R′(·) under some sequence of alternatives which are contiguous to
some distribution satisfying the null. Then, the power of the test against such a se-
quence would tend to 1 − R′(r(1 − α)). The point here is that, under the conditions
of Theorem 2.2, the permutation test based on a random critical value r_{m,n} obtained from the permutation distribution satisfies, under the null, r_{m,n} → r(1 − α) in probability. But then,
contiguity implies the same behavior under a sequence of contiguous alternatives. Thus,
the permutation test has the same limiting local power as the “classical” test which uses
the nonrandom critical value. So, to first order, there is no loss in power in using a
permutation critical value. Of course, there are big gains because the permutation test
applies much more broadly than for usual parametric models, in that it retains the level
exactly across a broad class of distributions and is at least asymptotically justified for a
large nonparametric family.
3 Four Technical Ingredients
In this section, we discuss four separate ingredients, from which the main results flow.
These results are separated out so they can easily be applied to other problems and so
that the main technical arguments are highlighted. The first two apply more generally
to randomization tests, not just permutation tests, and are stated as such.
3.1 Hoeffding’s Condition
Suppose data X^n has distribution P_n in a sample space X_n, and G_n is a finite group of transformations g of X_n onto itself. For a given statistic T_n = T_n(X^n), let R^T_n(·) denote the randomization distribution of T_n, defined by

R^T_n(t) = (1/|G_n|) Σ_{g ∈ G_n} I{T_n(gX^n) ≤ t} . (12)
(In the case of permutation tests, Xn corresponds to Z = (X1, . . . , Xm, Y1, . . . , Yn) and g
varies over the permutations of 1, . . . , N.) Hoeffding (1952) gave a sufficient condition
to derive the limiting behavior of RTn (·). This condition is verified repeatedly in the
proofs, but we add the result that the condition is also necessary.
Theorem 3.1. Let G_n and G′_n be independent and uniformly distributed over G_n (and independent of X^n). Suppose, under P_n,

(T_n(G_n X^n), T_n(G′_n X^n)) →d (T, T′) , (13)

where T and T′ are independent, each with common c.d.f. R^T(·). Then, for all continuity points t of R^T(·),

R^T_n(t) → R^T(t) in probability . (14)

Conversely, if (14) holds for some limiting c.d.f. R^T(·) whenever t is a continuity point, then (13) holds.
The reason we think it is important to add the necessity part of the result is that
our methodology is somewhat different than that of other authors mentioned in the
introduction, who take a more conditional approach to proving limit theorems. After all,
the permutation distribution is indeed a distribution conditional on the observed set of
observations (without regard to ordering). However, the theorem shows that a sufficient
condition is obtained by verifying an unconditional weak convergence property, which
may look surprising at first in that it includes additional auxiliary randomization G′ in
its statement. Nevertheless, simple arguments (see the appendix) show the condition is
indeed necessary and so taking such an approach is not fanciful.
3.2 Slutsky’s Theorem for Randomization Distributions
Consider the general setup of Subsection 3.1. The result below describes Slutsky’s the-
orem in the context of randomization distributions. In this context, the randomization
distributions are random themselves, and therefore the usual Slutsky’s theorem does not
quite apply. Because of its utility in the proofs of our main results, we highlight the
statement. Given sequences of statistics T_n, A_n and B_n, let R^{AT+B}_n(·) denote the randomization distribution corresponding to the statistic sequence A_n T_n + B_n, i.e. replace T_n in (12) by A_n T_n + B_n, so

R^{AT+B}_n(t) ≡ (1/|G_n|) Σ_{g ∈ G_n} I{A_n(gX^n) T_n(gX^n) + B_n(gX^n) ≤ t} . (15)
Theorem 3.2. Let G_n and G′_n be independent and uniformly distributed over G_n (and independent of X^n). Assume T_n satisfies (13). Also, assume

A_n(G_n X^n) → a in probability (16)

and

B_n(G_n X^n) → b in probability , (17)

for constants a and b. Let R^{aT+b}(·) denote the distribution of aT + b, where T is the limiting random variable assumed in (13). Then,

R^{AT+B}_n(t) → R^{aT+b}(t) in probability ,

if the distribution R^{aT+b}(·) of aT + b is continuous at t. (Of course, R^{aT+b}(t) = R^T((t − b)/a) if a > 0.)
Remark 3.1. Under the randomization hypothesis that the distribution of X^n is the same as that of gX^n for any g ∈ G_n, the conditions (16) and (17) are equivalent to the assumptions that A_n(X^n) → a and B_n(X^n) → b in probability, i.e. convergence in probability based on the original sample X^n without first transforming by a random G_n. For more on the randomization hypothesis, see Section 15.2 of Lehmann and Romano (2005).
3.3 A Coupling Construction
Consider the general situation where k samples are observed from possibly different
distributions. Specifically, assume for i = 1, . . . , k that X_{i,1}, . . . , X_{i,n_i} is a sample of n_i i.i.d. observations from P_i. All N ≡ Σ_i n_i observations are mutually independent. Put
all the observations together in one vector
Z = (X1,1, . . . , X1,n1 , X2,1, . . . , X2,n2 , . . . , Xk,1, . . . , Xk,nk) .
The basic intuition driving the results concerning the behavior of the permutation
distribution stems from the following. Since the permutation distribution considers the
empirical distribution of a statistic evaluated at all permutations of the data, it clearly
does not depend on the ordering of the observations. Let n_i/N denote the proportion of observations in the ith sample, and assume that n_i → ∞ in such a way that

p_i − n_i/N = O(N^{−1/2}) . (18)
Then the behavior of the permutation distribution based on Z should behave approximately like the behavior of the permutation distribution based on a sample of N i.i.d. observations

Z̄ = (Z̄_1, . . . , Z̄_N)

from the mixture distribution

P̄ ≡ p_1 P_1 + · · · + p_k P_k .
Of course, we can think of the N observations generated from P̄ as arising out of a two-stage process: for i = 1, . . . , N, first draw an index j at random with probability p_j; then, conditional on the outcome being j, sample Z̄_i from P_j. However, aside from the fact that the ordering of the observations in Z is clearly that of n_1 observations from P_1, followed by n_2 observations from P_2, etc., the original sampling scheme is still only approximately like that of sampling from P̄. For example, the number of observations Z̄_i out of the N which are from P_1 is binomial with parameters N and p_1 (and so has mean equal to p_1 N ≈ n_1), while the number of observations from P_1 in the original sample Z is exactly n_1.
Along the same lines, let π = (π(1), . . . , π(N)) denote a random permutation of {1, . . . , N}. Then, if we consider a random permutation of both Z and Z̄, the number of observations in the first n_1 coordinates of Z which were Xs has the hypergeometric distribution, while the number of observations in the first n_1 coordinates of Z̄ which were Xs is still binomial.
We can make a more precise statement by constructing a certain coupling of Z and Z̄. That is, except for ordering, we can construct Z̄ to include almost the same set of observations as in Z. The simple idea goes as follows. Given Z, we will construct observations Z̄_1, . . . , Z̄_N via the two-stage process as above, using the observations in Z to make up the Z̄_i as much as possible. First, draw an index j among {1, . . . , k} at random with probability p_j; then, conditionally on the outcome being j, set Z̄_1 = X_{j,1}. Next, if the next index i drawn among {1, . . . , k} at random with probability p_i is different from the j from which Z̄_1 was sampled, then Z̄_2 = X_{i,1}; otherwise, if i = j as in the first step, set Z̄_2 = X_{j,2}. In other words, we are going to continue to use the observations in Z to fill in the observations Z̄_i. However, after a certain point, we will get stuck because we
will have already exhausted all the n_j observations from the jth population governed by P_j. If this happens and an index j is drawn again, then just sample a new observation X_{j,n_j+1} from P_j. Continue in this manner so that as many as possible of the original observations in Z are used in the construction of Z̄. Now, we have both Z and Z̄, and at this point they have many of the same observations in common. The number of observations which differ, say D, is the (random) number of added observations required to fill up Z̄. (Note that we are obviously using the word “differ” here to mean the observations are generated from different mechanisms, though in fact there may be a positive probability that the observations still are equal if the underlying distributions have atoms. Still, we count such observations as differing.)
Moreover, we can reorder the observations in Z̄ by a permutation π_0 so that Z_i and Z̄_{π_0(i)} agree for all i except for some hopefully small (random) number D. To do this, recall that Z has the observations in order, i.e., the first n_1 observations arose from P_1 and the next set of n_2 observations came from P_2, etc. Thus, to couple Z and Z̄, simply put all the observations in Z̄ which came from P_1 first, up to n_1 of them. That is, if the number of observations in Z̄ from P_1 is greater than or equal to n_1, then Z̄_{π_0(i)} for i = 1, . . . , n_1 are filled with the observations in Z̄ which came from P_1, and if the number was strictly greater than n_1, the extras are put aside for now. On the other hand, if the number of observations in Z̄ which came from P_1 is less than n_1, fill up as many of the first n_1 spots as possible and leave the remaining spots blank for now. Next, move on to the observations in Z̄ which came from P_2 and repeat the above procedure for spots n_1 + 1, . . . , n_1 + n_2; i.e., starting from spot n_1 + 1, fill in as many of the observations in Z̄ which came from P_2 as possible, up to n_2 of them. After going through all the distributions P_i from which the observations in Z̄ came, one must then complete Z̄_{π_0}; simply “fill up” the empty spots with the remaining observations that have been put aside. (At this point, it does not matter where each of the remaining observations gets inserted; but, to be concrete, fill the empty slots by inserting the observations which came from each P_i in chronological order from when they were constructed.) This permuting of observations in Z̄ corresponds to a permutation π_0 and satisfies Z_i = Z̄_{π_0(i)} for all indices i except for D of them.
For example, suppose there are k = 2 populations. Suppose that N_1 of the Z̄ observations came from P_1, and so N − N_1 from P_2. Of course, N_1 is random and has the binomial distribution with parameters N and p_1. If N_1 ≥ n_1, then the above construction makes the first n_1 observations in Z and Z̄_{π_0} agree completely. Furthermore, if N_1 > n_1, then the number of observations in Z̄ from P_2 is N − N_1 < N − n_1 = n_2, and N − N_1 of the last n_2 indices in Z match those of Z̄_{π_0}, with the rest differing. In this situation, we have

Z = (X_1, . . . , X_{n_1}, Y_1, . . . , Y_{n_2})

and

Z̄_{π_0} = (X_1, . . . , X_{n_1}, Y_1, . . . , Y_{N−N_1}, X_{n_1+1}, . . . , X_{N_1}) ,

so that Z and Z̄_{π_0} differ only in the last N_1 − n_1 places. In the opposite situation, where N_1 < n_1, Z and Z̄_{π_0} are equal in the first N_1 and last n_2 places, differing only in spots N_1 + 1, . . . , n_1.
The number of observations D where Z and Z̄π0 differ is random, and we now analyze how large it is. Let Nj denote the number of observations in Z̄ which are generated from Pj. Then, (N1, . . . , Nk) has the multinomial distribution based on N trials and success probabilities (p1, . . . , pk). In terms of the Nj, the number of differing observations in the above coupling construction is

D = Σ_{j=1}^k max(nj − Nj, 0) .
If we assume pj > 0 for all j, then by the usual central limit theorem,

Nj − Npj = OP(N^{1/2}) ,

which together with (18) yields

Nj − nj = (Nj − Npj) + (Npj − nj) = OP(N^{1/2}) .
It follows that D = OP(N^{1/2}), and so D/N converges to 0 in probability. It also follows that

E(D) ≤ Σ_{j=1}^k E|Nj − nj| ≤ Σ_{j=1}^k [ E|Nj − pjN| + |pjN − nj| ]

≤ Σ_{j=1}^k {E[(Nj − Npj)²]}^{1/2} + O(N^{1/2}) = Σ_{j=1}^k [Npj(1 − pj)]^{1/2} + O(N^{1/2}) = O(N^{1/2}) .
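The order of D is easy to check by simulation. The following sketch (illustrative Python, not from the paper; all names are ours) draws the multinomial counts Nj directly and evaluates D = Σj max(nj − Nj, 0):

```python
import random

def coupling_mismatch_count(sizes, rng):
    """Draw (N_1, ..., N_k) ~ Multinomial(N, p) with p_j = n_j / N and
    return D = sum_j max(n_j - N_j, 0), the number of coordinates at
    which the coupling construction may force Z and Zbar_pi0 to differ."""
    N = sum(sizes)
    probs = [nj / N for nj in sizes]
    counts = [0] * len(sizes)
    for _ in range(N):                       # one mixture label per observation
        u, acc = rng.random(), 0.0
        for j, p in enumerate(probs):
            acc += p
            if u < acc:
                counts[j] += 1
                break
        else:                                # guard against round-off in acc
            counts[-1] += 1
    return sum(max(nj - c, 0) for nj, c in zip(sizes, counts))

rng = random.Random(0)
for N in (400, 10_000):
    sizes = [N // 2, N // 4, N - N // 2 - N // 4]   # k = 3 groups
    avg_D = sum(coupling_mismatch_count(sizes, rng) for _ in range(50)) / 50
    print(N, avg_D / N ** 0.5)   # ratio stays roughly constant: D = O_P(N^{1/2})
```

For the sizes above the printed ratio hovers around a constant as N grows, while avg_D/N shrinks, matching D/N → 0 in probability.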
In summary, the coupling construction shows that, with high probability, only a small fraction of the N observations in Z and Z̄π0 differ. Therefore, if the randomization distribution is based on a statistic TN(Z) such that the difference TN(Z) − TN(Z̄π0) is small in some sense whenever Z and Z̄π0 mostly agree, then one should be able to deduce the behavior of the permutation distribution under samples from P1, . . . , Pk from the behavior of the permutation distribution when all N observations come from the same distribution P̄. Whether or not this can be done requires some knowledge of the form of
the statistic, but intuitively it should hold if the statistic cannot strongly be affected by a
change in a small proportion of the observations; its validity though must be established
on a case by case basis. The point is that it is a worthwhile and beneficial route to pur-
sue because the behavior of the permutation distribution under N i.i.d. observations is
typically much easier to analyze than under the more general setting when observations
have possibly different distributions. Furthermore, the behavior under i.i.d. observa-
tions seems fundamental as this is the requirement for the “randomization hypothesis”
to hold, i.e. the requirement to yield exact finite sample inference.
To be more specific, suppose π and π′ are independent random permutations, independent of the Zi and the Z̄i. Suppose we can show that

(TN(Z̄π), TN(Z̄π′)) →d (T, T′) , (19)

where T and T′ are independent with common c.d.f. R(·). Then, by Theorem 3.1, the randomization distribution based on TN converges in probability to R(·) when all observations are i.i.d. according to P̄. But since ππ0 (meaning π composed with π0, so that π0 is applied first) and π′π0 are also independent random permutations, (19) also implies

(TN(Z̄ππ0), TN(Z̄π′π0)) →d (T, T′) .

Using the coupling construction of Z and Z̄, suppose it can be shown that

TN(Z̄ππ0) − TN(Zπ) →P 0 . (20)
Then, it also follows that

TN(Z̄π′π0) − TN(Zπ′) →P 0 ,

and so, by Slutsky’s Theorem,

(TN(Zπ), TN(Zπ′)) →d (T, T′) . (21)

Therefore, again by Theorem 3.1, the randomization distribution also converges in probability to R(·) under the original model of k samples from possibly different distributions. In summary, the coupling construction of Z, Z̄ and π0, together with the one added requirement (20), allows us to reduce the study of the permutation distribution under possibly k different distributions to the case when all N observations are i.i.d. according to P̄.
We summarize this as follows.
Lemma 3.1. Assume (19) and (20). Then, (21) holds, and so the permutation distribution based on k samples from possibly different distributions behaves asymptotically as if all observations are i.i.d. from the mixture distribution P̄, and satisfies

R_{m,n}^T(t) →P R(t) ,

if t is a continuity point of the distribution R of T in (19).
Example 3.1 (Difference of Sample Means). To appreciate what is involved in the verification of (20), consider the two-sample problem considered in Theorem 2.1, in the special case of testing equality of means. The unknown variances may differ and are assumed finite. Consider the test statistic Tm,n = m^{1/2}[X̄m − Ȳn]. By the coupling construction, Z̄ππ0 and Zπ have the same components except in at most D places. Now,

Tm,n(Z̄ππ0) − Tm,n(Zπ) = m^{1/2}[ (1/m) Σ_{i=1}^m (Z̄ππ0(i) − Zπ(i)) ] − m^{1/2}[ (1/n) Σ_{j=m+1}^N (Z̄ππ0(j) − Zπ(j)) ] .
All of the terms in the above two sums are zero except for at most D of them. But any nonzero term like Z̄ππ0(i) − Zπ(i) has variance bounded above by

2 max(Var(X1), Var(Y1)) < ∞ .

Note that the above random variable has mean zero under the null hypothesis that E(Xi) = E(Yj). To bound its variance, condition on D and π, and note that it has conditional mean 0 and conditional variance bounded above by

[m/min(m, n)²] · 2 max(Var(X1), Var(Y1)) · D ,

and hence unconditional variance bounded above by

[m/min(m, n)²] · 2 max(Var(X1), Var(Y1)) · O(N^{1/2}) = O(N^{−1/2}) = o(1) ,

implying (20). In words, we have shown that the behavior of the permutation distribution can be deduced from its behavior when all observations are i.i.d. with the mixture distribution P̄.
Two final points are relevant. First, the limiting distribution R is typically the same as the limiting distribution of the true unconditional distribution of TN under P̄. The true limiting distribution under (P1, . . . , Pk) need not be the same as under P̄. However, suppose the choice of test statistic TN is such that it is an asymptotic pivot, in the sense that its limiting distribution does not depend on the underlying probability distributions. Then, typically the randomization or permutation distribution under (P1, . . . , Pk) will asymptotically reflect the true unconditional distribution of TN, resulting in asymptotically valid inference. Indeed, the general results in Section 2 yield many examples of this phenomenon. However, that these statements need qualification is made clear by the following two (somewhat contrived) examples.
Example 3.2. Here, we illustrate a situation where the coupling works, but the true sampling distribution does not behave like the permutation distribution under the mixture model P̄. In the two-sample setup with m = n, suppose X1, . . . , Xn are i.i.d. uniform on the set of x where |x| < 1, and Y1, . . . , Yn are i.i.d. uniform on the set of y with 2 < |y| < 3. So, E(Xi) = E(Yj) = 0. Consider the test statistic Tn,n defined as

Tn,n(X1, . . . , Xn, Y1, . . . , Yn) = n^{−1/2} Σ_{i=1}^n [ I{|Yi| > 2} − I{|Xi| < 2} ] .
Under the true sampling scheme, Tn,n is zero with probability one. However, if all 2n observations are sampled from the mixture model P̄, it is easy to see that Tn,n is asymptotically normal N(0, 2), which is the same limit as that of the permutation distribution (in probability). So here, the permutation distribution under the given distributions is the same as under P̄, though it does not reflect the actual true unconditional sampling distribution.
Example 3.3. Here, we consider a situation where both populations are indeed iden-
tical, so there is no need for a coupling argument. However, the point is that the
permutation distribution does not behave like the true unconditional sampling distribution. Assume X1, . . . , Xn and Y1, . . . , Yn are all i.i.d. N(0, 1) and consider the test statistic

Tn,n(X1, . . . , Xn, Y1, . . . , Yn) = n^{−1/2} Σ_{i=1}^n (Xi + Yi) .

Unconditionally, Tn,n converges in distribution to N(0, 2). However, the permutation distribution places mass one at n^{1/2}(X̄n + Ȳn) because the statistic Tn,n is permutation invariant.
Certainly the moral of the examples is that the statistic needs to reflect an actual
comparison between P and Q, such as a difference between the same functional evaluated
at P and Q.
3.4 An Auxiliary Contiguity Result
Fix m and n with N = m+ n. Eventually, m = m(n)→∞ as n→∞. Set pm = m/N .
Let Pm be the binomial distribution based on m trials and success probability pm. Also,
let Qm be the hypergeometric distribution of the number of objects labeled X when m objects are sampled without replacement from N objects, of which m are labeled X and n are labeled Y.
Lemma 3.2. Assume the above setup with pm → p ∈ (0, 1) as m → ∞. Let Bm be a random variable having distribution Pm. Consider the likelihood ratio Lm(x) = dQm(x)/dPm(x).

(i) The limiting distribution of Lm(Bm) satisfies

Lm(Bm) →d (1/√q) exp( −(p/(2q)) Z² ) , (22)

where Z ∼ N(0, 1) denotes a standard normal random variable and q = 1 − p.
(ii) Qm and Pm are mutually contiguous.
Remark 3.2. With Bm having the binomial distribution with parameters m and pm, as in Lemma 3.2, also let B̄m have the binomial distribution with parameters m and p. Then, the distributions of Bm and B̄m are contiguous if and only if |pm − p| = O(m^{−1/2}), not just pm → p.
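Since Lm(x) = dQm(x)/dPm(x) is computable exactly from the two probability mass functions, the statements of Lemma 3.2 are easy to sanity-check numerically. The sketch below is illustrative stdlib-only Python (function names ours); it verifies that E_Pm[Lm(Bm)] = 1 exactly (the support of Qm is contained in that of Pm), consistent with the unit mean of the limit in (22), and that at x ≈ m·pm, i.e. Z ≈ 0, Lm is close to 1/√qm.

```python
from math import comb, sqrt

def binom_pmf(m, p, x):
    return comb(m, x) * p ** x * (1 - p) ** (m - x)

def hypergeom_pmf(m, n, x):
    # P(x of the m sampled objects are labeled X), sampling m without
    # replacement from N = m + n objects: m labeled X and n labeled Y
    return comb(m, x) * comb(n, m - x) / comb(m + n, m)

def likelihood_ratio(m, n, x):
    pm = m / (m + n)
    return hypergeom_pmf(m, n, x) / binom_pmf(m, pm, x)

m, n = 400, 600
pm, qm = m / (m + n), n / (m + n)
# E_{Pm}[L_m(B_m)] = sum_x Qm(x) = 1 exactly, since Qm's support lies in Pm's
mean_L = sum(likelihood_ratio(m, n, x) * binom_pmf(m, pm, x)
             for x in range(max(0, m - n), m + 1))
print(mean_L)                                 # 1 up to rounding
print(likelihood_ratio(m, n, round(m * pm)))  # close to 1/sqrt(qm) for large m
```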
Lemma 3.3. Suppose V1, . . . , Vm are i.i.d. according to the mixture distribution

P̄ ≡ pP + qQ ,

where p ∈ (0, 1) and P and Q are two probabilities (on some general space). Assume, for some sequence Wm of statistics,

Wm(V1, . . . , Vm) →P t , (23)

for some constant t (which can depend on P, Q and p). Let m → ∞, n → ∞, with N = m + n, pm = m/N, qm = n/N and pm → p ∈ (0, 1) with

pm − p = O(m^{−1/2}) . (24)
Further, let X1, . . . , Xm be i.i.d. P and Y1, . . . , Yn be i.i.d. Q. Let

(Z1, . . . , ZN) = (X1, . . . , Xm, Y1, . . . , Yn) .

Let (π(1), . . . , π(N)) denote a random permutation of {1, . . . , N} (independent of all other variables). Then,

Wm(Zπ(1), . . . , Zπ(m)) →P t . (25)
Remark 3.3. The importance of Lemma 3.3 is that it allows us to deduce the behavior of the statistic Wm under the randomization or permutation distribution from the basic assumption of how Wm behaves under i.i.d. observations from the mixture distribution P̄. Note that in (23), the convergence in probability assumption is required when the Vi are i.i.d. according to P̄ (the P attached to the arrow is just a generic symbol for convergence in probability).
Remark 3.4. As mentioned in Remark 2.2, the assumption (24) is stronger than the more basic assumption m/N → p, where no rate is required on the difference between m/N and p. Alternatively, we can replace (24) with the more basic assumption m/N → p as long as we slightly strengthen the requirement (23) to

Wm(Zm,1, . . . , Zm,m) →P t

when Zm,1, . . . , Zm,m are i.i.d. according to the mixture distribution pmP + qmQ (rather than pP + qQ), so that the data distribution at time m depends on m. We prefer to assume the convergence hypothesis based on an i.i.d. sequence, though it is really a matter of choice. Usually, we can appeal to some basic convergence in probability results with ease; but if convergence in probability results are available (or can be derived) which are “uniform” in the underlying probability distribution, then such results can be used instead with the weaker hypothesis pm → p.
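A quick simulation makes Lemma 3.3 concrete. Taking Wm to be the sample mean (our choice purely for illustration) with t the mixture mean, the mean of the first m coordinates of a randomly permuted pooled sample settles near t; the distributions and parameter values below are arbitrary.

```python
import random

rng = random.Random(1)
m, n = 5_000, 7_500                   # pm = m / N = 0.4
X = [rng.gauss(0.0, 1.0) for _ in range(m)]   # i.i.d. from P = N(0, 1)
Y = [rng.gauss(3.0, 1.0) for _ in range(n)]   # i.i.d. from Q = N(3, 1)
Z = X + Y                             # pooled vector (Z_1, ..., Z_N)
rng.shuffle(Z)                        # apply a uniformly random permutation pi
W = sum(Z[:m]) / m                    # W_m evaluated at (Z_pi(1), ..., Z_pi(m))
p = m / (m + n)
t = p * 0.0 + (1 - p) * 3.0           # mean of the mixture pP + qQ = 1.8
print(abs(W - t))                     # small: W_m -> t in probability
```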
4 Nonparametric k-sample Behrens-Fisher Problem
From our general considerations, we are now guided by the principle that the large sample
distribution of the test statistic should not depend on the underlying distributions; that
is, it should be asymptotically pivotal under the null. Of course, it can be something
other than normal, and we next consider the important problem of testing equality of
means of k-samples (where a limiting Chi-squared distribution is obtained).
The problem studied is the nonparametric one-way layout in the analysis of variance. Assume we observe k independent samples of i.i.d. observations. Specifically, assume Xi,1, . . . , Xi,ni are i.i.d. Pi. Some of our results will hold for fixed n1, . . . , nk, but we also have asymptotic results as N ≡ Σi ni → ∞. Let n = (n1, . . . , nk); the notation n → ∞ will mean mini ni → ∞.
The Pi are unknown probability distributions on the real line, assumed to have finite
variance. Let µ(P ) and σ2(P ) denote the mean and variance of P , respectively. The
problem of interest is to test the null hypothesis
H0 : µ(P1) = · · · = µ(Pk)
against the alternative
H1 : µ(Pi) ≠ µ(Pj) for some i, j .
The classical approach is to assume Pi is normal N(µi, σ²) with a common variance σ². Here, we will not impose normality, nor the assumption of a common variance.
One approach used to robustify the usual F -test is to apply a permutation test.
The underlying distributions need not be normal for the permutation approach to yield
exact level α tests, but what is needed is that Pi is just Pj shifted, for all i and j. To put it another way, it must be the case that the c.d.f. Fi corresponding to Pi satisfies Fi(x) = F(x − µi) for some unknown F and constants µi (which can then be taken to be the mean of Fi, assuming the mean exists). In other words, under H0, the observations must be mutually independent and identically distributed. Of course, this is much
weaker than the usual normal theory assumptions. Unfortunately, a permutation test
applied to the usual F -statistic will fail to control the probability of a Type 1 error, even
asymptotically.
The goal here is to construct a method that retains the exact control of the probability
of a Type 1 error when the observations are i.i.d., but also asymptotically controls the
probability of a Type 1 error under very weak assumptions, specifically finite variances
of the underlying distributions.
The first step is a choice of test statistic. In order to preserve the good power
properties of the classical test under normality, consider the generalized likelihood ratio
for testing H0 against H1 under the normal model where it is assumed Pi ∼ N(µi, σ2i ).
If, for now, we further assume that the σi are known, then it is easily checked that the
generalized likelihood ratio test rejects for large values of

Tn,0 = Σ_{i=1}^k (ni/σ²i) [ X̄n,i − ( Σ_{i=1}^k ni X̄n,i/σ²i ) / ( Σ_{i=1}^k ni/σ²i ) ]² , (26)

where X̄n,i = Σ_{j=1}^{ni} Xi,j/ni. Since the σi will not be assumed known, we replace σi in (26) with Sn,i, where

S²n,i = [1/(ni − 1)] Σ_{j=1}^{ni} (Xi,j − X̄n,i)² ,

yielding

Tn,1 = Σ_{i=1}^k (ni/S²n,i) [ X̄n,i − ( Σ_{i=1}^k ni X̄n,i/S²n,i ) / ( Σ_{i=1}^k ni/S²n,i ) ]² . (27)
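Computationally, (27) is simple. The following illustrative Python sketch (function name ours) evaluates Tn,1 for a list of k samples:

```python
def t_stat(samples):
    """T_{n,1} of (27): between-group sum of squares of the group means
    about the weighted grand mean, with weights n_i / S^2_{n,i}."""
    means = [sum(s) / len(s) for s in samples]
    variances = [sum((x - mu) ** 2 for x in s) / (len(s) - 1)
                 for s, mu in zip(samples, means)]
    weights = [len(s) / v for s, v in zip(samples, variances)]  # n_i / S^2_{n,i}
    grand = sum(w * mu for w, mu in zip(weights, means)) / sum(weights)
    return sum(w * (mu - grand) ** 2 for w, mu in zip(weights, means))

print(t_stat([[-1.0, 1.0], [-2.0, 2.0]]))   # -> 0.0: the group means agree
print(t_stat([[0.0, 1.0], [5.0, 6.0]]))     # positive: the group means differ
```

The statistic vanishes exactly when all sample means coincide and grows with the studentized between-group spread.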
We need the limiting behavior of Tn,1, not just under normality or equal distributions. (Some relatively recent large sample approaches to this specific problem, which do not retain our finite sample exactness property, are given in Rice and Gaines (1989) and Krishnamoorthy, Lu and Mathew (2007).)
Lemma 4.1. Consider the above set-up with 0 < σ²i = σ²(Pi) < ∞. Assume ni → ∞ with ni/N → pi > 0. Then, under H0, both Tn,0 and Tn,1 converge in distribution to the Chi-squared distribution with k − 1 degrees of freedom.
Let Rn,1(·) denote the permutation distribution corresponding to Tn,1. In words, Tn,1 is recomputed over all permutations of the data. Specifically, if we let

(Z1, . . . , ZN) = (X1,1, . . . , X1,n1 , X2,1, . . . , X2,n2 , . . . , Xk,1, . . . , Xk,nk) ,

then Rn,1(t) is formally equal to the right side of (2), with Tm,n replaced by Tn,1.
Theorem 4.1. Consider the above set-up with 0 < σ²(Pi) < ∞. Assume ni → ∞ with ni/N → pi > 0. Then, under H0,

Rn,1(t) →P Gk−1(t) ,

where Gd denotes the Chi-squared distribution with d degrees of freedom. Moreover, if P1, . . . , Pk satisfy H0, then the probability that the permutation test rejects H0 tends to the nominal level α.
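The resulting procedure can be sketched as follows (illustrative Python; names ours). Random permutations are sampled instead of enumerating all of them which, as discussed in Section 5, changes neither the exactness nor the asymptotic properties:

```python
import random

def t_stat(samples):
    # the studentized statistic T_{n,1} of (27)
    means = [sum(s) / len(s) for s in samples]
    variances = [sum((x - mu) ** 2 for x in s) / (len(s) - 1)
                 for s, mu in zip(samples, means)]
    weights = [len(s) / v for s, v in zip(samples, variances)]
    grand = sum(w * mu for w, mu in zip(weights, means)) / sum(weights)
    return sum(w * (mu - grand) ** 2 for w, mu in zip(weights, means))

def permutation_test(samples, B=999, rng=None):
    """Monte Carlo permutation p-value for H0: equal means, based on T_{n,1};
    reject at level alpha when the returned p-value is <= alpha."""
    rng = rng or random.Random(0)
    sizes = [len(s) for s in samples]
    pooled = [x for s in samples for x in s]
    observed = t_stat(samples)
    exceed = 0
    for _ in range(B):
        rng.shuffle(pooled)                   # a random permutation of the data
        groups, start = [], 0
        for ni in sizes:                      # re-split into the original sizes
            groups.append(pooled[start:start + ni])
            start += ni
        exceed += t_stat(groups) >= observed
    return (1 + exceed) / (B + 1)             # (1 + #{T_perm >= T_obs}) / (B + 1)

rng = random.Random(0)
same = [[rng.gauss(0, 1) for _ in range(30)] for _ in range(3)]
shifted = [[rng.gauss(mu, 1) for _ in range(30)] for mu in (0, 0, 5)]
p_same, p_shift = permutation_test(same), permutation_test(shifted)
print(p_same, p_shift)    # p_shift is tiny; p_same is not
```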
5 Simulation Results
Monte Carlo simulation studies illustrating our results are presented in this section. Table 1 tabulates the rejection probabilities of one-sided tests for the studentized permutation median test, where the nominal level considered is α = 0.05. The simulation results confirm that the studentized permutation median test is valid in the sense that it approximately attains level α in large samples.
Odd sample sizes are selected in the Monte Carlo simulation for simplicity. We consider several pairs of distinct sampling distributions that share the same median, as listed in the first column of Table 1. For each situation, 10,000 simulations were performed. Within a given simulation, the permutation test was calculated
by randomly sampling 999 permutations. Note that neither the exactness properties nor
the asymptotic properties are changed at all (as long as the number of permutations
sampled tends to infinity). For a discussion on stochastic approximations to the permu-
tation distribution, see the end of Section 15.2.1 in Lehmann and Romano (2005) and
Section 4 in Romano (1989). As is well-known, when the underlying distributions of two
distinct independent samples are not identical, the permutation median test is not valid
in the sense that the rejection probability is far from the nominal level α = 0.05. For
example, although a logistic distribution with location parameter 0 and scale parameter
1 and a continuous uniform distribution with the support ranging from -10 to 10 have
the same median of 0, the rejection probability for the sample sizes examined is between
0.0991 and 0.2261 and moves further away from the nominal level α = 0.05 as sample
sizes increase.
In contrast, the studentized permutation test results in a rejection probability that tends to the nominal level α asymptotically. We apply the bootstrap method (Efron, 1979) to estimate the variance parameter 1/(4fP²(θ)) for the median in the simulation, given by

m Σ_{l=1}^m [ X(l) − θ(P̂m) ]² · P( θ(P̂*m) = X(l) ) ,

where for an odd number m,

P( θ(P̂*m) = X(l) ) = P( Binomial(m, (l − 1)/m) ≤ (m − 1)/2 ) − P( Binomial(m, l/m) ≤ (m − 1)/2 ) .
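Since P(θ(P̂*m) = X(l)) is given in closed form by two binomial tail probabilities, the exact bootstrap variance estimate requires no resampling at all. The following stdlib-only Python sketch (function names ours) implements the display above for an odd sample size m:

```python
from math import comb

def binom_cdf(m, p, k):
    """P(Binomial(m, p) <= k)."""
    return sum(comb(m, j) * p ** j * (1 - p) ** (m - j) for j in range(k + 1))

def bootstrap_median_variance(sample):
    """Exact bootstrap estimate of m * Var(sample median) for odd m:
    m * sum_l [X_(l) - median]^2 * P(bootstrap median = X_(l))."""
    m = len(sample)
    assert m % 2 == 1, "odd sample size assumed, as in the simulation"
    xs = sorted(sample)              # order statistics X_(1) <= ... <= X_(m)
    med = xs[(m - 1) // 2]
    k = (m - 1) // 2
    probs = [binom_cdf(m, (l - 1) / m, k) - binom_cdf(m, l / m, k)
             for l in range(1, m + 1)]          # P(bootstrap median = X_(l))
    return m * sum((x - med) ** 2 * p for x, p in zip(xs, probs))

sample = [0.3, -1.2, 0.8, 2.1, -0.5, 0.1, 1.4]  # m = 7
print(bootstrap_median_variance(sample))
```

The probabilities telescope to 1 over l = 1, . . . , m, so the estimator is a proper weighted average of squared deviations from the sample median.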
As noted earlier, there exist other choices, such as the kernel estimator and the smoothed bootstrap estimator. We emphasize, however, that using the bootstrap to obtain an estimate of standard error does not destroy the exactness of permutation tests under identical distributions.
                                  m        5      13      51     101     101     201     401
                                  n        5      21     101     101     201     201     401
N(0,1) vs. N(0,5)    Not Studentized   0.1079  0.1524  0.1324  0.2309  0.2266  0.2266  0.2249
                     Studentized       0.0802  0.1458  0.0950  0.0615  0.0517  0.0517  0.0531
N(0,1) vs. T(5)      Not Studentized   0.0646  0.1871  0.2411  0.1769  0.1849  0.1849  0.1853
                     Studentized       0.0707  0.1556  0.0904  0.0776  0.0661  0.0661  0.0611
Logistic(0,1) vs.    Not Studentized   0.0991  0.1413  0.1237  0.2258  0.2233  0.2233  0.2261
U(−10,10)            Studentized       0.0771  0.1249  0.0923  0.0686  0.0574  0.0574  0.0574
Laplace(ln 2,1) vs.  Not Studentized   0.0420  0.0462  0.0477  0.0480  0.0493  0.0461  0.0501
exp(1)               Studentized       0.0386  0.0422  0.0444  0.0502  0.0485  0.0505  0.0531

Table 1: Monte Carlo Simulation Results for the Studentized Permutation Median Test (One-sided, α = 0.05)
6 Conclusion
When the fundamental assumption of identical distributions need not hold, two-sample
permutation tests are invalid unless quite stringent conditions are satisfied depending
on the precise nature of the problem. For example, the two-sample permutation test based on the difference of sample means is asymptotically valid only when either the distributions have the same variance or the sample sizes are comparable. Thus, a careful interpretation of rejecting the null is necessary; rejecting the null based on such a permutation test does not necessarily imply the rejection of the null that some real-valued parameter θ(F, G) equals some specified value θ0. We provide a framework that allows
one to obtain asymptotic rejection probability α in two-sample permutation tests. One
great advantage of utilizing the proposed test is that it retains the exactness property
in finite samples when P = Q, a desirable property that bootstrap and subsampling
methods fail to possess.
To summarize, if the true goal is to test whether the parameter of interest θ is some specified value θ0, a permutation test based on a correctly studentized statistic is an
attractive choice. When testing the equality of means, for example, the permutation
t-test based on a studentized statistic obtains asymptotic rejection probability α in
general while attaining exact rejection probability equal to α when P = Q. In the case
of testing the equality of medians, the studentized permutation median test yields the
same desirable property. Moreover, the results extend to quite general settings based
on asymptotically linear estimators. The results extend to k-sample problems as well,
and analogous results hold in the nonparametric k-sample Behrens-Fisher problem. The
guiding principle is to use a test statistic that is asymptotically distribution-free or pivotal. Then, by the technical arguments developed in this paper, the permutation test can be shown to behave asymptotically the same as when all observations share a common distribution. Consequently, if the permutation distribution reflects the true underlying sampling distribution, asymptotic justification is achieved.
A Proofs
Proofs in Section 3.
Proof of Theorem 3.1. The sufficiency part, due to Hoeffding (1952), is proved in Theorem 15.2.3 of Lehmann and Romano (2005). To prove the necessity part, suppose s and t are continuity points of R^T(·). Then,

P{ Tn(GnXn) ≤ s, Tn(G′nXn) ≤ t } = E[ P{ Tn(GnXn) ≤ s, Tn(G′nXn) ≤ t | Xn } ]

= E[ R_n^T(s) R_n^T(t) ] → R^T(s) R^T(t) ,

since convergence in probability of a bounded sequence of random variables entails convergence of moments. Convergence for a dense set of rectangles in the plane entails weak convergence.
Before proving Slutsky’s Theorem for Randomization Distributions (Theorem 3.2),
we need three lemmas.
Lemma A.1. Suppose Xn has distribution Pn in the sample space 𝒳n, and 𝒢n is a finite group of transformations g of 𝒳n onto itself. Also, let Gn be a random variable that is uniform on 𝒢n. Assume Xn and Gn are mutually independent. Let R_n^A denote the randomization distribution of An, defined by

R_n^A(t) = (1/|𝒢n|) Σ_{g∈𝒢n} I{An(gXn) ≤ t} . (28)

Suppose, under Pn,

An(GnXn) →P a .

Then, under Pn,

R_n^A(t) = (1/|𝒢n|) Σ_{g∈𝒢n} I{An(gXn) ≤ t} →P δa(t) if t ≠ a , (29)

where δc(·) denotes the distribution function corresponding to the point mass at c.
Proof of Lemma A.1: Let G′n have the same distribution as Gn, and be independent of Gn and Xn. Since An(GnXn) converges in probability to the constant a, An(G′nXn) →P a as well, and the independence of the limiting distributions is satisfied. Thus, the result follows from Theorem 3.1.
Lemma A.2. Let Bn and Tn be sequences of random variables satisfying the conditions above, i.e.,

Bn(GnXn) →P b ,

and

(Tn(GnXn), Tn(G′nXn)) →d (T, T′) , (30)

where T and T′ are independent, each with common c.d.f. R^T(·). Let R_n^{T+B}(t) denote the randomization distribution of Tn + Bn, defined in (28) with A replaced by T + B. Then, R_n^{T+B} converges in probability to the distribution of T + b. In other words,

R_n^{T+B}(t) ≡ (1/|𝒢n|) Σ_{g∈𝒢n} I{Tn(gXn) + Bn(gXn) ≤ t} →P R^{T+b}(t) if R^{T+b} is continuous at t,

where R^{T+b}(·) denotes the c.d.f. of T + b. (Of course, R^{T+b}(t) = R^T(t − b).)
Proof of Lemma A.2: Without loss of generality, assume b = 0. For any ε > 0,

(1/|𝒢n|) Σ_{g∈𝒢n} I{Tn(gXn) ≤ t − ε} − (1/|𝒢n|) Σ_{g∈𝒢n} I{|Bn(gXn)| > ε}

≤ (1/|𝒢n|) Σ_{g∈𝒢n} I{Tn(gXn) + Bn(gXn) ≤ t}

≤ (1/|𝒢n|) Σ_{g∈𝒢n} I{Tn(gXn) ≤ t + ε} + (1/|𝒢n|) Σ_{g∈𝒢n} I{|Bn(gXn)| > ε} .
Note that the term (1/|𝒢n|) Σ_{g∈𝒢n} I{|Bn(gXn)| > ε} appearing in the first and in the third lines converges in probability to 0 by Lemma A.1. Also, by Theorem 3.1, (30) implies

R_n^T(t) = (1/|𝒢n|) Σ_{g∈𝒢n} I{Tn(gXn) ≤ t} →P R^T(t) (31)

if R^T(·) is continuous at t. Thus, if both t − ε and t + ε are continuity points of R^T(·), the first term of the first line and the first term of the third line converge in probability to R^T(t − ε) and R^T(t + ε), respectively. Therefore,

R^T(t − ε) ≤ R_n^{T+B}(t) ≤ R^T(t + ε)

with probability tending to one, for continuity points t − ε and t + ε of R^T(·). Now, let ε ↓ 0 through continuity points to deduce that

R_n^{T+B}(t) →P R^T(t) .
Lemma A.3. Let An and Tn be sequences of random variables satisfying the conditions above, i.e.,

An(GnXn) →P a ,

where a is nonzero, and

(Tn(GnXn), Tn(G′nXn)) →d (T, T′) ,

where T and T′ are independent, each with common c.d.f. R^T(·). Then, the randomization distribution of AnTn converges in probability to the distribution of aT. In other words,

R_n^{AT}(t) ≡ (1/|𝒢n|) Σ_{g∈𝒢n} I{An(gXn)Tn(gXn) ≤ t} →P R^{aT}(t) ,

if R^{aT} is continuous at t, where R^{aT}(·) denotes the c.d.f. of aT.
Proof of Lemma A.3: Write

AnTn = aTn + (An − a)Tn .

Then, we can apply Lemma A.2 with Bn = (An − a)Tn, if we can verify the condition Bn(GnXn) →P 0. But,

Bn(GnXn) = [An(GnXn) − a] Tn(GnXn) →P 0 · T = 0 ,

by the usual Slutsky’s Theorem. Finally, the behavior of aTn follows trivially from that of Tn.
Proof of Theorem 3.2: The proof follows from Lemma A.2 and Lemma A.3.
Proof of Lemma 3.2: First, writing C(a, b) for the binomial coefficient “a choose b”,

Lm(x) = [ C(m, x) C(n, m − x) / C(N, m) ] / [ C(m, x) pm^x (1 − pm)^{m−x} ] (32)

= n! n! m! / [ (n + m)! (m − x)! (n − m + x)! pm^x (1 − pm)^{m−x} ] .

Applying Stirling’s approximation

n! = √(2πn) (n/e)^n (1 + O(1/n)) as n → ∞

yields Lm(x) ∼ L′m(x), where

L′m(x) = n^{2n+1} m^{m+1/2} / [ (n + m)^{n+m+1/2} (m − x)^{m−x+1/2} (n − m + x)^{n−m+x+1/2} pm^x (1 − pm)^{m−x} ] ;

the approximation holds as long as min(m, n, m − x, n − m + x) → ∞. Of course, Bm = mpm + OP(m^{1/2}), and so

min(m, n, m − Bm, n − m + Bm) →P ∞ .

Therefore, Lm(Bm) has the same limiting distribution as L′m(Bm) (assuming it has one, which we show below). Write L′m = a · b · c and qm = 1 − pm, where

a = n^{2n+1} m^{m+1/2} / (n + m)^{n+m+1/2} ,

b = 1 / [ (m − x)^{m−x+1/2} (n − m + x)^{n−m+x+1/2} ] ,

and

c = 1 / ( pm^x qm^{m−x} ) .
Then,

a = qm^{2n+1} pm^{m+1/2} (n + m)^{n+1} ,

and so

a · c = pm^{A+1/2} qm^{2n+1−A} (n + m)^{n+1} ,

where A = m − x. Also,

b = 1 / [ A^{A+1/2} (n − A)^{n−A+1/2} ] = (npm/A)^{A+1/2} (nqm/(n − A))^{n−A+1/2} / [ (npm)^{A+1/2} (nqm)^{n−A+1/2} ] .
Therefore, L′m = a · b · c equals

L′m = [ pm^{A+1/2} qm^{2n+1−A} (n + m)^{n+1} / ( (npm)^{A+1/2} (nqm)^{n−A+1/2} ) ] · 1 / [ (A/(npm))^{A+1/2} ((n − A)/(nqm))^{n−A+1/2} ]

= [ qm^{n+1/2} (n + m)^{n+1} / n^{n+1} ] · 1 / [ (A/(npm))^{A+1/2} ((n − A)/(nqm))^{n−A+1/2} ]

= (1/√qm) · 1 / [ (A/(npm))^{A+1/2} ((n − A)/(nqm))^{n−A+1/2} ] .
We will evaluate Lm and L′m not at a generic x, but at the binomial variable Bm, which satisfies

Bm = mpm + OP(m^{1/2}) ,

in which case Am = A(Bm) = m − Bm satisfies

Am/(npm) = mqm/(npm) + OP(m^{1/2})/(npm) = 1 + OP(m^{1/2})/(npm) ,

since npm = mqm. Also,

Am = m − Bm = mqm + OP(m^{1/2}) .

Since Bm/m →P p, we also have

Am/(npm) →P 1 and (n − Am)/(nqm) →P 1 ,

or

Am/n →P p and (n − Am)/n →P q . (33)
Therefore, we can expand the logarithm in L′m as long as we keep both the linear and quadratic terms,

log(t) = (t − 1) − (1/2)(t − 1)² + o(|t − 1|²) as t → 1 .
Hence,

− log[ √qm L′m(Bm) ] = (Am + 1/2) log( Am/(npm) ) + (n − Am + 1/2) log( (n − Am)/(nqm) )

= Am log( Am/(npm) ) + (n − Am) log( (n − Am)/(nqm) ) + oP(1)

= Am ( Am/(npm) − 1 ) + (n − Am) ( (n − Am)/(nqm) − 1 )
− (1/2) Am ( Am/(npm) − 1 )² − (1/2) (n − Am) ( (n − Am)/(nqm) − 1 )² + oP(1)

≡ d + e + f + g + oP(1) ,
where we have just identified the four terms in the last expression. Noting that
n− Amnqm
− 1 = −pmqm
(Amnpm
− 1
), (34)
we have that
d+ e =
(Amnpm
− 1
)·[Am +
pmqm
(Am − n)
]=
(Amnpm
− 1
)· 1
qm· (Am − npm)
=(Am − npm)2
npmqm= Z2
m ·pmqm
,
where
Zm =Am −mqm√mpmqm
= −Bm −mpm√mpmqm
L→ Z ∼ N(0, 1) .
Again using (34), we find that

−2(f + g) = ( Am/(npm) − 1 )² · ( Am + (n − Am) p²m/q²m )

= Z²m ( pm + p²m/qm ) + oP(1) = Z² (p/q) + oP(1) ,

using (33). Therefore,

d + e + f + g = (p/(2q)) Z² + oP(1) .
Hence, we conclude that

Lm(Bm) →d (1/√q) exp( −(p/(2q)) Z² ) ,

and (i) is shown. To prove (ii), note that

E[ (1/√q) exp( −(p/(2q)) Z² ) ] = 1 ,

since Z² has the Chi-squared distribution with one degree of freedom and moment generating function ψ(t) = (1 − 2t)^{−1/2}. Since the limiting distribution has mean 1, by Theorem 12.3.2 (iii) of Lehmann and Romano (2005), Qm is contiguous with respect to Pm. Since the limiting distribution has no mass at 0, by Problem 12.23 it also follows that Pm is contiguous to Qm.
Proof of Lemma 3.3: We must show, for any ε > 0,

P{ |Wm(Zπ(1), . . . , Zπ(m)) − t| > ε } → 0 as m → ∞ . (35)

We compare the left side of (35) with

P{ |Wm(V1, . . . , Vm) − t| > ε } .
Imagine V1, . . . , Vm are sampled in a two-stage process where first Bm is drawn from the binomial distribution with parameters m and p, and then V1, . . . , Vm are obtained by drawing Bm i.i.d. observations from P and m − Bm i.i.d. observations from Q. Similarly, let Hm denote the number of observations among Zπ(1), . . . , Zπ(m) which were among the Xi's, so that Hm has the hypergeometric distribution (m, m, N) based on sampling m objects from N, m of which are “special”. By Lemma 3.2, Remark 3.2 and (24), the distributions of Bm and Hm are contiguous. Importantly, conditional on Bm = Hm = b, the conditional probabilities

P{ |Wm(V1, . . . , Vm) − t| > ε | Bm = b } (36)

and

P{ |Wm(Zπ(1), . . . , Zπ(m)) − t| > ε | Hm = b } (37)

are the same, because Wm is evaluated at a random sample of b observations from P and m − b observations from Q in both cases. Let fm(Bm) be defined by

fm(Bm) ≡ P{ |Wm(V1, . . . , Vm) − t| > ε | Bm } . (38)
By assumption (23),

E[fm(Bm)] → 0 ,

and hence

fm(Bm) →P 0 ,

by Markov’s inequality. But then, by contiguity,

fm(Hm) →P 0 ,

and so

E[fm(Hm)] → 0 , (39)

since fm is uniformly bounded. But the left hand side of (39) is exactly the left hand side of (35).
Proofs of Theorems in Section 2.
Proof of Theorem 2.1: First, argue the case θ(P) = ∫ x dP(x), so that fP(x) = x for all x and P, and

Tm,n = m^{1/2}(X̄m − Ȳn) = m^{−1/2} ( Σ_{i=1}^m Xi − (m/n) Σ_{j=1}^n Yj ) .

Independent of the Z's, let (π(1), . . . , π(N)) and (π′(1), . . . , π′(N)) be independent random permutations of {1, . . . , N}. Then, by Example 15.2.6 of Lehmann and Romano (2005),

( Tm,n(Zπ(1), . . . , Zπ(N)), Tm,n(Zπ′(1), . . . , Zπ′(N)) )

converges in distribution to a bivariate normal distribution with independent, identically distributed marginals having mean 0 and variance

τ²(P̄) = (p/(1 − p)) σ²(P̄) + σ²(P̄) = (1/(1 − p)) σ²(P̄) ,

where σ²(P̄) denotes the variance of P̄. Thus, Theorem 3.1 can be applied and the result follows.
Next, consider the case θ(P) = ∫ f(x) dP(x). However, this problem is the same as the mean case. Instead of observing (Z1, . . . , ZN) = (X1, . . . , Xm, Y1, . . . , Yn), we now observe (Z1, . . . , ZN) = (f(X1), . . . , f(Xm), f(Y1), . . . , f(Yn)), and we are interested in means of the Z's. Thus, the proof for this case is the same as above, except we replace σ²(P̄) = EP̄(X²i) with EP̄(f²(Xi)).
Finally, we consider the general case. Let π be a random permutation of {1, . . . , N}, so that

Tm,n(Zπ(1), . . . , Zπ(N)) = m^{1/2}[ θ̂m(Zπ(1), . . . , Zπ(m)) − θ̂n(Zπ(m+1), . . . , Zπ(N)) ] .

Let V1, V2, . . . be i.i.d. P̄. By assumption,

m^{1/2}[ θ̂m(V1, . . . , Vm) − θ(P̄) ] − m^{−1/2} Σ_{i=1}^m fP̄(Vi) →P 0 . (40)

By Lemma 3.3 and (40),

εm(Zπ(1), . . . , Zπ(m)) ≡ m^{1/2}[ θ̂m(Zπ(1), . . . , Zπ(m)) − θ(P̄) ] − m^{−1/2} Σ_{i=1}^m fP̄(Zπ(i)) →P 0 .

Similarly,

εn(Zπ(m+1), . . . , Zπ(N)) ≡ n^{1/2}[ θ̂n(Zπ(m+1), . . . , Zπ(N)) − θ(P̄) ] − n^{−1/2} Σ_{j=m+1}^N fP̄(Zπ(j)) →P 0 ,

which implies

m^{1/2}[ θ̂n(Zπ(m+1), . . . , Zπ(N)) − θ(P̄) ] − (m/n)^{1/2} n^{−1/2} Σ_{j=m+1}^N fP̄(Zπ(j)) →P 0 .
Hence, we can write

Tm,n(Zπ(1), . . . , Zπ(N)) = m^{1/2}[ (1/m) Σ_{i=1}^m fP̄(Zπ(i)) − (1/n) Σ_{j=m+1}^N fP̄(Zπ(j)) ]
+ εm(Zπ(1), . . . , Zπ(m)) − (m/n)^{1/2} εn(Zπ(m+1), . . . , Zπ(N)) ,

and each of the last two terms goes to zero in probability. Therefore, we can apply Slutsky’s Theorem for randomization distributions; that is, it suffices to determine the limit behavior of just

m^{1/2}[ (1/m) Σ_{i=1}^m fP̄(Zπ(i)) − (1/n) Σ_{j=m+1}^N fP̄(Zπ(j)) ] ,

which reduces the problem to the previous case considered.
Proof of Theorem 2.2: Write Vm,n = Vm,n(Z1, . . . , ZN), where the Zi are defined in (1). Let (π(1), . . . , π(N)) denote a random permutation of {1, . . . , N} (independent of all other variables). We first will show that

V²m,n(Zπ(1), . . . , Zπ(N)) →P τ²(P̄) . (41)

To do this, it suffices to show that

σ̂²m(Zπ(1), . . . , Zπ(m)) →P σ²(P̄) (42)

and

σ̂²n(Zπ(m+1), . . . , Zπ(N)) →P σ²(P̄) . (43)

But (42) and (43) both follow from Lemma 3.3. Now let R^V_{m,n}(·) denote the permutation distribution corresponding to the statistic Vm,n, as defined in (2) with T replaced by V. By Lemma A.1, R^V_{m,n}(t) converges in probability to δτ²(P̄)(t) for all t ≠ τ²(P̄), where δc(·) denotes the c.d.f. of the distribution placing mass one at the constant c. Using this fact together with Theorem 2.1, we can apply Lemma A.3 to conclude that the permutation distribution of the ratio of statistics Sm,n satisfies (11).
Proofs in Section 4.
Proof of Lemma 4.1: First, we consider Tn,0. Without loss of generality, assume
µ(Pi) = 0 for all i. Let Zn be the column vector with ith component n1/2i Xn,i/σi. Also,
let I denote the k× k identity matrix, let 1 denote the k× 1 vector of ones, and let Dn
denote the diagonal matrix with (i, i) entry Nσ2i /ni. Then, we can write
Tn,0 = Z ′nPnZn .
where
Pn ≡ (I − D−1/2n 11′D
−1/2n
1′D−1n 1
) .
Of course, Z_n converges in distribution to Z, where Z has the multivariate normal
distribution with mean 0 and covariance matrix I. If we let D denote the diagonal
matrix with (i,i) entry \sigma_i^2/p_i, then the entrywise convergence of D_n to D, as well
as of D_n^{-1} to D^{-1}, implies (using the continuous mapping theorem) that
$$T_{n,0}\ \overset{d}{\to}\ Z'PZ\;,$$
where P is the matrix
$$P \equiv I-\frac{D^{-1/2}\mathbf{1}\mathbf{1}' D^{-1/2}}{\mathbf{1}' D^{-1}\mathbf{1}}\;. \qquad (44)$$
The matrix P is a symmetric idempotent (projection) matrix, and its rank is therefore
its trace, which is easily checked to be k-1. Indeed, P represents the projection
orthogonal to the unit vector D^{-1/2}\mathbf{1}/(\mathbf{1}' D^{-1}\mathbf{1})^{1/2}. It follows that Z'PZ \sim \chi^2_{k-1}, as required.
To handle T_{n,1}, let t_n be the column vector with ith component n_i^{1/2}\bar X_{n,i}/S_{n,i}, let
\hat D_n be the diagonal matrix with (i,i) entry N S^2_{n,i}/n_i, and let \hat P_n be the projection
matrix obtained by replacing D with \hat D_n in the definition (44) of P. Of course, by Slutsky's
Theorem, Z_n - t_n converges in probability to 0. Also, \hat D_n converges in probability to D
(as does its inverse), \hat P_n converges in probability to P, and so \hat P_n - P_n converges in
probability to 0. Since
$$T_{n,0}-T_{n,1}=Z_n' P_n Z_n - t_n'\hat P_n t_n=(Z_n-t_n)' P_n (Z_n-t_n)+2(Z_n-t_n)' P_n t_n + t_n'(P_n-\hat P_n) t_n\ \overset{P}{\to}\ 0\;,$$
T_{n,1} must have the same limiting distribution as T_{n,0}.
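The algebraic facts used in this proof (P symmetric, idempotent, with trace and hence rank k-1) are easy to confirm numerically. A minimal Python check, with purely illustrative values of the \sigma_i^2 and p_i (these numbers are assumptions, not from the paper):

```python
import numpy as np

# Illustrative (assumed) values of sigma_i^2 and p_i for k = 3
sigma2 = np.array([1.0, 4.0, 9.0])
p = np.array([0.2, 0.3, 0.5])

D = np.diag(sigma2 / p)                      # diagonal matrix D
Dm = np.diag(1.0 / np.sqrt(np.diag(D)))      # D^{-1/2}
one = np.ones((3, 1))
c = (one.T @ np.linalg.inv(D) @ one).item()  # 1' D^{-1} 1
P = np.eye(3) - (Dm @ one @ one.T @ Dm) / c  # the matrix in (44)

assert np.allclose(P, P.T)         # symmetric
assert np.allclose(P @ P, P)       # idempotent
assert np.isclose(np.trace(P), 2)  # trace = rank = k - 1
```

Since P is a rank k-1 projection, Z'PZ for standard normal Z is Chi-squared with k-1 degrees of freedom, exactly as the proof concludes.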
Proof of Theorem 4.1: Put all N = \sum_i n_i observations in one vector
$$(Z_1,\dots,Z_N)=(X_{1,1},\dots,X_{1,n_1},X_{2,1},\dots,X_{2,n_2},\dots,X_{k,1},\dots,X_{k,n_k})\;.$$
For now, we consider the case where all the N observations are i.i.d., so that Pi = P
for i = 1, . . . , k. Without loss of generality, we can assume µ(P ) = 0 and we write
\sigma^2=\sigma^2(P). In this case, T_{n,0} simplifies to
$$T_{n,0}=\frac{1}{\sigma^2}\sum_{i=1}^{k}n_i(\bar X_{n,i}-\bar Z_N)^2\;,$$
where
$$\bar Z_N=\frac{1}{N}\sum_{l=1}^{N}Z_l\;.$$
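This simplified statistic is straightforward to compute. The sketch below (Python, with illustrative sample sizes; not part of the paper) evaluates T_{n,0} over repeated i.i.d. data sets, where the Chi-squared limit with k-1 degrees of freedom suggests a mean near k-1 = 2:

```python
import numpy as np

rng = np.random.default_rng(1)

def T_n0(samples, sigma2):
    """Simplified i.i.d.-case statistic:
    (1/sigma^2) * sum_i n_i * (Xbar_{n,i} - Zbar_N)^2."""
    z = np.concatenate(samples)
    zbar = z.mean()  # grand mean Zbar_N
    return sum(len(s) * (s.mean() - zbar) ** 2 for s in samples) / sigma2

# k = 3 i.i.d. N(0,1) samples; sizes (40, 60, 100) are illustrative
vals = np.array([
    T_n0([rng.normal(size=n) for n in (40, 60, 100)], 1.0)
    for _ in range(2000)
])
```

Empirically, `vals` behaves like a Chi-squared sample with 2 degrees of freedom.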
First, we show that the randomization distribution based on T_{n,0}, say R_{n,0}(\cdot), behaves
the same way as that based on T_{n,1}. (Of course, we cannot use T_{n,0} in practice, since \sigma
is unknown, but for now we treat \sigma as if it were known.) Let \pi=(\pi(1),\dots,\pi(N)) denote
a random permutation of 1,\dots,N (independent of the observations). By Theorem 3.1, we must verify
that the joint limiting distribution of (T_{n,0}(Z),T_{n,0}(Z_\pi)) is that of two independent Chi-
squared variables with k-1 degrees of freedom. (Note that we do not need to consider
the joint behavior of T_{n,0} at Z_\pi and at Z_{\pi'}, where \pi' is another independent random
permutation, because, since the Z_l are i.i.d., Z_{\pi'} and Z have the same distribution.) To
do this, define
$$V_{n,i}=n_i^{1/2}\bar X_{n,i}=n_i^{-1/2}\sum_{l=1}^{N}Z_l I\{l\in I_i\}$$
and
$$V'_{n,j}=n_j^{-1/2}\sum_{l=1}^{N}Z_l I\{\pi(l)\in I_j\}\;,$$
where I_i is the set of indices corresponding to the ith sample; that is, I_1=\{1,\dots,n_1\},
I_2=\{n_1+1,\dots,n_1+n_2\}, and ultimately I_k=\{N-n_k+1,\dots,N\}. We claim the
joint asymptotic normality of
$$(V_{n,1},\dots,V_{n,k},V'_{n,1},\dots,V'_{n,k})\;.$$
To do this, we use the Cramér-Wold device; i.e., we must show that
$$V_n\equiv V_N(a,b)\equiv\sum_{i=1}^{k}(a_i V_{n,i}+b_i V'_{n,i})$$
is asymptotically normal for any choice of constants a_i and b_i. We can write
$$V_n=\sum_{l=1}^{N}C_{n,l}Z_l\;,$$
where
$$C_{n,l}=\sum_{i=1}^{k}\Bigl[\frac{a_i I\{l\in I_i\}}{n_i^{1/2}}+\frac{b_i I\{\pi(l)\in I_i\}}{n_i^{1/2}}\Bigr]\;.$$
Note that the Cn,l are random (as they depend on the random permutation π), but are
independent of the Z_l. By Lemma 11.3.3 in Lehmann and Romano (2005), a sufficient
condition for
$$\sum_{l=1}^{N}C_{n,l}Z_l\Big/\Bigl(\sum_{l=1}^{N}C^2_{n,l}\Bigr)^{1/2}\ \overset{d}{\to}\ N(0,\sigma^2) \qquad (45)$$
is
$$\frac{\max_{l=1,\dots,N}C^2_{n,l}}{\sum_{l=1}^{N}C^2_{n,l}}\ \overset{P}{\to}\ 0 \qquad (46)$$
as N\to\infty. Note that
$$C^2_{n,l}=\sum_{i=1}^{k}\Bigl[\frac{a_i I\{l\in I_i\}}{n_i^{1/2}}+\frac{b_i I\{\pi(l)\in I_i\}}{n_i^{1/2}}\Bigr]\cdot\sum_{j=1}^{k}\Bigl[\frac{a_j I\{l\in I_j\}}{n_j^{1/2}}+\frac{b_j I\{\pi(l)\in I_j\}}{n_j^{1/2}}\Bigr]$$
$$=\sum_{i=1}^{k}\frac{a_i^2}{n_i}I\{l\in I_i\}+\sum_{i=1}^{k}\sum_{j=1}^{k}\frac{a_i}{n_i^{1/2}}\frac{b_j}{n_j^{1/2}}I\{l\in I_i,\pi(l)\in I_j\}+\sum_{i=1}^{k}\sum_{j=1}^{k}\frac{b_i a_j}{n_i^{1/2}n_j^{1/2}}I\{\pi(l)\in I_i,l\in I_j\}+\sum_{i=1}^{k}\frac{b_i^2}{n_i}I\{\pi(l)\in I_i\}\;.$$
Certainly,
$$\max_{l=1,\dots,N}C^2_{n,l}=O_P(1/N)\to 0\;.$$
Furthermore,
$$\sum_{l=1}^{N}C^2_{n,l}=\sum_{i=1}^{k}\frac{a_i^2}{n_i}\sum_{l=1}^{N}I\{l\in I_i\}+\sum_{i=1}^{k}\sum_{j=1}^{k}\frac{a_i}{n_i^{1/2}}\frac{b_j}{n_j^{1/2}}\sum_{l=1}^{N}I\{l\in I_i,\pi(l)\in I_j\}$$
$$+\sum_{i=1}^{k}\sum_{j=1}^{k}\frac{b_i a_j}{n_i^{1/2}n_j^{1/2}}\sum_{l=1}^{N}I\{\pi(l)\in I_i,l\in I_j\}+\sum_{i=1}^{k}\frac{b_i^2}{n_i}\sum_{l=1}^{N}I\{\pi(l)\in I_i\}$$
$$=\sum_{i=1}^{k}(a_i^2+b_i^2)+\sum_{i=1}^{k}\sum_{j=1}^{k}\frac{a_i}{n_i^{1/2}}\frac{b_j}{n_j^{1/2}}\sum_{l=1}^{N}\bigl[I\{l\in I_i,\pi(l)\in I_j\}+I\{\pi(l)\in I_i,l\in I_j\}\bigr]\;.$$
Now, the term
$$A_n(i,j)\equiv\sum_{l=1}^{N}I\{l\in I_i,\pi(l)\in I_j\} \qquad (47)$$
counts the indices in I_i that the permutation \pi sends into I_j; hence, its
distribution is hypergeometric, corresponding to sampling n_i observations
from N, of which n_j are "special". The expectation of (47) is then n_i n_j/N. Hence,
$$E[A_n(i,j)/n_i]\to p_j$$
and Var[A_n(i,j)/n_i]=O(1/n_i), implying
$$A_n(i,j)/n_i\ \overset{P}{\to}\ p_j\;.$$
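The hypergeometric behavior of A_n(i,j) is easy to verify by simulation. The Python sketch below (sample sizes n = (50, 150, 300) are hypothetical, not from the paper) compares the empirical mean of A_n(1,3) under random permutations to n_1 n_3/N:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical sample sizes n_1, n_2, n_3
n = np.array([50, 150, 300])
N = int(n.sum())
blocks = np.repeat(np.arange(3), n)  # block label of each position
i, j = 0, 2                          # study A_n(1, 3); i = 0 is the first block

draws = []
for _ in range(3000):
    pi = rng.permutation(N)          # pi[l] is the image of index l
    # count indices l in I_i whose image pi[l] lands in I_j
    draws.append(int(np.sum(blocks[pi[:n[i]]] == j)))
draws = np.array(draws)
# hypergeometric mean n_i n_j / N = 30, so A_n(i,j)/n_i is near p_j = 0.6
```

The slice `pi[:n[i]]` works here because i = 0 indexes the first block; the empirical mean of `draws` matches n_i n_j/N closely.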
It follows that
$$\sum_{l=1}^{N}C^2_{n,l}\ \overset{P}{\to}\ \sum_{i=1}^{k}(a_i^2+b_i^2)+2\sum_{i=1}^{k}\sum_{j=1}^{k}a_i b_j p_i^{1/2}p_j^{1/2}\;. \qquad (48)$$
Of course, the right side of (48) is nonnegative. By the Cauchy-Schwarz inequality,
$$\Bigl|\sum_{i=1}^{k}a_i p_i^{1/2}\Bigr|\le\Bigl[\sum_{i=1}^{k}a_i^2\Bigr]^{1/2}\;,$$
with equality if and only if a_i=cp_i^{1/2} for some constant c. It follows that the right side
of (48) is greater than or equal to (A^{1/2}-B^{1/2})^2, where A=\sum_i a_i^2 and B=\sum_i b_i^2, and
is equal to 0 if and only if A=B, i.e., a_i=cp_i^{1/2} and b_i=-cp_i^{1/2}.
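This lower bound can be checked numerically. The following sketch (illustrative, not part of the paper) draws random probability vectors p and coefficients a, b, verifies the inequality, and confirms the equality case a_i = c p_i^{1/2}, b_i = -c p_i^{1/2}:

```python
import numpy as np

rng = np.random.default_rng(3)

ok = True
for _ in range(10000):
    p = rng.dirichlet(np.ones(4))            # p_i > 0 summing to 1
    a, b = rng.normal(size=4), rng.normal(size=4)
    A, B = float(a @ a), float(b @ b)
    # right side of (48): A + B + 2 (sum a_i sqrt(p_i)) (sum b_j sqrt(p_j))
    rhs = A + B + 2.0 * (a @ np.sqrt(p)) * (b @ np.sqrt(p))
    ok = ok and rhs >= (np.sqrt(A) - np.sqrt(B)) ** 2 - 1e-9

# equality case: a_i = c sqrt(p_i), b_i = -c sqrt(p_i) makes the limit exactly 0
p0 = np.full(4, 0.25)
a0 = 2.0 * np.sqrt(p0)
rhs0 = 2.0 * float(a0 @ a0) + 2.0 * (a0 @ np.sqrt(p0)) * (-a0 @ np.sqrt(p0))
```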
When the right side of (48) is positive, condition (46) holds, and so
$$\sum_{l=1}^{N}C_{n,l}Z_l\ \overset{d}{\to}\ N\Bigl(0,\ \sigma^2\Bigl[\sum_{i=1}^{k}(a_i^2+b_i^2)+2\sum_{i=1}^{k}\sum_{j=1}^{k}a_i b_j p_i^{1/2}p_j^{1/2}\Bigr]\Bigr)\;. \qquad (49)$$
But, even if the right side of (48) is zero, we can still claim that \sum_l C_{n,l}Z_l converges in
distribution to N(0,0), i.e., it converges in probability to 0. To see why, note that \sum_l C_{n,l}Z_l has
mean 0 and variance \sigma^2\sum_l E(C^2_{n,l}). But the argument above showing that \sum_l C^2_{n,l} converges
to 0 in probability (in this case only) shows that its expectation does as well.
In general, we can now conclude that
$$(V_{n,1},\dots,V_{n,k},V'_{n,1},\dots,V'_{n,k})\ \overset{d}{\to}\ (V,V')\;,$$
which is multivariate normal with mean 0 (each of V and V' being a k-vector).
Moreover, by appropriate choices of the constants a_i and b_i, we can read off the covariance
matrix from the limiting variance in (49). In particular, taking a_i=1, a_j=0 for j\neq i,
and b_j=0 for all j yields Var(V_i)=\sigma^2. Also, Cov(V_i,V_j)=0 if
i\neq j. Similarly, Var(V'_j)=\sigma^2, and for i\neq j (by taking a_i=1=b_j and the rest of
the constants 0),
$$Cov(V_i,V'_j)=\sigma^2(p_ip_j)^{1/2}\;. \qquad (50)$$
consider a simple transformation of the V_{n,i} and V'_{n,j}. For i=1,\dots,k, define
$$W_{n,i}\equiv n_i^{1/2}(\bar X_{n,i}-\bar Z_N)=V_{n,i}-n_i^{1/2}\bar Z_N=V_{n,i}-(n_i/N)^{1/2}\sum_{m=1}^{k}(n_m/N)^{1/2}V_{n,m}\;.$$
Similarly,
$$W'_{n,j}=V'_{n,j}-(n_j/N)^{1/2}\sum_{m=1}^{k}(n_m/N)^{1/2}V'_{n,m}\;.$$
The joint asymptotic multivariate normality of the V_{n,i} together with the V'_{n,j} implies
the joint asymptotic multivariate normality of the W_{n,i} together with the W'_{n,j}. Indeed,
$$(W_{n,1},\dots,W_{n,k},W'_{n,1},\dots,W'_{n,k})\ \overset{d}{\to}\ (W_1,\dots,W_k,W'_1,\dots,W'_k)\;,$$
where
$$W_i=V_i-p_i^{1/2}\sum_{m=1}^{k}p_m^{1/2}V_m$$
and
$$W'_j=V'_j-p_j^{1/2}\sum_{m=1}^{k}p_m^{1/2}V'_m\;.$$
Importantly,
$$Cov(W_i,W'_j)=Cov\Bigl(V_i-p_i^{1/2}\sum_{m=1}^{k}p_m^{1/2}V_m,\ V'_j-p_j^{1/2}\sum_{m=1}^{k}p_m^{1/2}V'_m\Bigr)$$
$$=Cov(V_i,V'_j)-p_j^{1/2}\sum_{m=1}^{k}p_m^{1/2}Cov(V_i,V'_m)-p_i^{1/2}\sum_{m=1}^{k}p_m^{1/2}Cov(V_m,V'_j)+(p_ip_j)^{1/2}\sum_{l=1}^{k}\sum_{m=1}^{k}(p_lp_m)^{1/2}Cov(V_l,V'_m)$$
$$=\sigma^2\Bigl[(p_ip_j)^{1/2}-p_j^{1/2}\sum_{m=1}^{k}p_i^{1/2}p_m-p_i^{1/2}\sum_{m=1}^{k}p_j^{1/2}p_m+(p_ip_j)^{1/2}\sum_{l=1}^{k}\sum_{m=1}^{k}p_lp_m\Bigr]$$
$$=\sigma^2\bigl[(p_ip_j)^{1/2}-(p_ip_j)^{1/2}-(p_ip_j)^{1/2}+(p_ip_j)^{1/2}\bigr]=0\;.$$
It follows that (W_1,\dots,W_k) and (W'_1,\dots,W'_k) are independent. But since
$$T_{n,0}(Z)=\frac{1}{\sigma^2}\sum_{i=1}^{k}W^2_{n,i}\ \overset{d}{\to}\ \frac{1}{\sigma^2}\sum_{i=1}^{k}W_i^2$$
and
$$T_{n,0}(Z_\pi)=\frac{1}{\sigma^2}\sum_{i=1}^{k}(W'_{n,i})^2\ \overset{d}{\to}\ \frac{1}{\sigma^2}\sum_{i=1}^{k}(W'_i)^2\;,$$
it now follows that T_{n,0}(Z) and T_{n,0}(Z_\pi) are asymptotically independent. Moreover, by
Lemma 4.1, T_{n,0}(Z) is asymptotically Chi-squared with k-1 degrees of freedom. Since
T_{n,0}(Z_\pi) has the same distribution as T_{n,0}(Z), it has the same limiting distribution as
well.
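The asymptotic independence just established can be illustrated by simulation. The Python sketch below (sample sizes are illustrative; not from the paper) computes the pair (T_{n,0}(Z), T_{n,0}(Z_\pi)) across repeated i.i.d. data sets and checks that their empirical correlation is near zero:

```python
import numpy as np

rng = np.random.default_rng(4)

def T_n0(z, n, sigma2=1.0):
    """T_{n,0} from a pooled vector z whose first n_1 entries are sample 1, etc."""
    zbar = z.mean()
    out, start = 0.0, 0
    for ni in n:
        out += ni * (z[start:start + ni].mean() - zbar) ** 2
        start += ni
    return out / sigma2

# pairs (T_{n,0}(Z), T_{n,0}(Z_pi)) across independent data sets
n = (80, 120, 200)
pairs = np.array([
    (T_n0(z, n), T_n0(rng.permutation(z), n))
    for z in (rng.normal(size=sum(n)) for _ in range(1500))
])
corr = float(np.corrcoef(pairs.T)[0, 1])  # near zero: asymptotic independence
```

Both coordinates also have empirical means near k - 1 = 2, matching the Chi-squared limit.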
Next, we show the same result with T_{n,0} replaced by T_{n,1}. By the fact that
Z and Z_\pi have the same distribution,
$$T_{n,1}(Z_\pi)-T_{n,0}(Z_\pi)\ \overset{d}{=}\ T_{n,1}(Z)-T_{n,0}(Z)\;,$$
and so, by the proof of Lemma 4.1,
$$T_{n,1}(Z_\pi)-T_{n,0}(Z_\pi)\ \overset{P}{\to}\ 0\;.$$
Writing T_{n,1}=T_{n,0}+[T_{n,1}-T_{n,0}], we can then apply Slutsky's Theorem for randomiza-
tion distributions (Theorem 3.2) to conclude that R_{n,1}(\cdot) has the same limiting behavior
as R_{n,0}(\cdot).
The proof is now complete under the assumption that all N observations are i.i.d. We
now argue, using the coupling argument in Section 3.3, that the behavior of the permu-
tation distribution under general P_1,\dots,P_k (satisfying the finite variance assumption)
is the same as when all observations are i.i.d. with distribution given by the mixture
\bar P. So, construct Z, \bar Z and \bar Z_{\pi_0} as in the coupling construction. It suffices to
show that, for a random permutation \pi,
$$T_{n,1}(Z_\pi)-T_{n,1}(\bar Z_{\pi\pi_0})\ \overset{P}{\to}\ 0\;. \qquad (51)$$
Write
$$T_{n,1}(Z)=\sum_{i=1}^{k}\frac{1}{S^2_{n,i}}\Biggl[n_i^{1/2}\bar X_{n,i}-\frac{\sum_{j=1}^{k}n_j^{1/2}\bar X_{n,j}\,(n_i^{1/2}n_j^{1/2}/N)/S^2_{n,j}}{\sum_{j=1}^{k}(n_j/N)/S^2_{n,j}}\Biggr]^2\;. \qquad (52)$$
Then, T_{n,1}(Z_\pi) is computed by replacing
$$\bar X_{n,i}=\bar X_{n,i}(Z)=\frac{1}{n_i}\sum_{l=1}^{N}Z_l I\{l\in I_i\}$$
with
$$\bar X_{n,i}(Z_\pi)=\frac{1}{n_i}\sum_{l=1}^{N}Z_l I\{\pi(l)\in I_i\}\;,$$
and S^2_{n,i}=S^2_{n,i}(Z) gets replaced by
$$S^2_{n,i}(Z_\pi)\equiv\frac{1}{n_i-1}\Bigl[\sum_{l=1}^{N}Z_l^2 I\{\pi(l)\in I_i\}-n_i\bar X^2_{n,i}(Z_\pi)\Bigr]\;.$$
From (52), it now suffices to show that, for each i,
$$n_i^{1/2}\bar X_{n,i}(Z_\pi)-n_i^{1/2}\bar X_{n,i}(\bar Z_{\pi\pi_0})\ \overset{P}{\to}\ 0 \qquad (53)$$
and
$$S^2_{n,i}(Z_\pi)-S^2_{n,i}(\bar Z_{\pi\pi_0})\ \overset{P}{\to}\ 0\;. \qquad (54)$$
To show (53), first note that the left side has mean 0; so, it suffices to show that its variance
tends to 0. Recall that Z_\pi and \bar Z_{\pi\pi_0} differ in at most D = O_P(N^{1/2}) entries.
But, conditional on \pi, \pi_0 and the multinomial variables (N_1,\dots,N_k) in the coupling
construction, for indices l where Z_l\neq\bar Z_{\pi_0(l)},
$$Var(Z_l-\bar Z_{\pi_0(l)}\mid\pi,\pi_0,N_1,\dots,N_k)\le 2V\;,$$
where V=\max(\sigma_1^2,\dots,\sigma_k^2). But the left side of (53) is
$$n_i^{-1/2}\sum_{l=1}^{N}[Z_l-\bar Z_{\pi_0(l)}]I\{\pi(l)\in I_i\}\;,$$
and the sum here is conditionally a sum of at most D independent variables, each with variance
at most 2V. Hence, the variance of the left side of (53) is conditionally at most 2VD/n_i, and
hence the unconditional variance is at most 2V E(D)/n_i\to 0.
To show (54), note that
$$\frac{1}{n_i}\sum_{l=1}^{N}[Z_l^2-\bar Z^2_{\pi_0(l)}]I\{\pi(l)\in I_i\}$$
has mean 0 conditional on \pi, \pi_0 and N_1,\dots,N_k, and its absolute value is bounded above
by
$$\frac{1}{n_i}\sum_{l\in J}[Z_l^2+\bar Z^2_{\pi_0(l)}]\;.$$
Here, the sum is over the set J of indices l where Z_l\neq\bar Z_{\pi_0(l)}. But, conditionally, there
are at most D nonzero terms in J, each having expectation bounded by 2V, and
so the whole expression has expectation bounded above by 2V E(D)/n_i\to 0. The result (54)
now follows easily.
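The order D = O_P(N^{1/2}) of the number of mismatched coordinates in the coupling can be illustrated numerically. The Python sketch below (mixture weights are illustrative; not from the paper) simulates the multinomial counts (N_1,\dots,N_k) and checks that the mean of \sum_i |N_i - n_i|, a bound on the mismatch count, grows like N^{1/2}:

```python
import numpy as np

rng = np.random.default_rng(5)

# Component probabilities of the mixture (illustrative)
p = np.array([0.25, 0.35, 0.40])

ratios = {}
for N in (400, 1600, 6400):
    n = np.rint(N * p).astype(int)   # target sample sizes n_i
    # D bounds the number of coordinates where the coupled samples disagree:
    # sum_i |N_i - n_i|, with (N_1,...,N_k) multinomial(N, p)
    D = [int(np.abs(rng.multinomial(N, p) - n).sum()) for _ in range(1000)]
    ratios[N] = float(np.mean(D)) / np.sqrt(N)
# ratios[N] is roughly constant in N, consistent with D = O_P(N^{1/2})
```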
References
Devroye, L., and Wagner, T.J. (1980). The strong uniform consistency of kernel density
estimates. Multivariate Analysis V (P.R. Krishnaiah, ed.). North Holland, 59–77.
Efron, B. (1979). Bootstrap Methods: Another Look at the Jackknife. Annals of Statis-
tics 7, 1–26.
Hall, P., DiCiccio, T., and Romano, J. (1989). On Smoothing and the Bootstrap. Annals
of Statistics 17, 692–704.
Hoeffding, W. (1952). The large-sample power of tests based on permutations of obser-
vations. The Annals of Mathematical Statistics 23, 169–192.
Janssen, A. (1997). Studentized permutation tests for non-i.i.d. hypotheses and the
generalized Behrens-Fisher problem. Statistics and Probability Letters 36, 9–21.
Janssen, A. (2005). Resampling student’s t-type statistics. Annals of the Institute of
Statistical Mathematics 57, 507–529.
Janssen, A. and Pauls, T. (2003). How do bootstrap and permutation tests work? Annals
of Statistics 31, 768–806.
Janssen, A. and Pauls, T. (2005). A Monte Carlo comparison of studentized bootstrap
and permutation tests for heteroscedastic two-sample problems. Computational Statis-
tics 20, 369–383.
Krishnamoorthy, K., Lu, F. and Mathew, T. (2007). A parametric bootstrap approach
for ANOVA with unequal variances: Fixed and random models. Computational Statis-
tics & Data Analysis 51, 5731–5742.
Lehmann, E. L. (1998). Nonparametrics: Statistical Methods Based on Ranks. Revised
first edition, Prentice Hall, New Jersey.
Lehmann, E. L. (1999). Elements of Large-Sample Theory. Springer-Verlag, New York.
Lehmann, E. L. (2009). Parametric versus nonparametrics: two alternative methodolo-
gies. Journal of Nonparametric Statistics 21, 397–405.
Lehmann, E. L. and Romano, J. (2005). Testing Statistical Hypotheses. 3rd edition,
Springer-Verlag, New York.
Neubert, K. and Brunner, E. (2007). A Studentized permutation test for the non-
parametric Behrens-Fisher problem. Computational Statistics & Data Analysis 51,
5192–5204.
Neuhaus, G. (1993). Conditional rank tests for the two-sample problem under random
censorship. Annals of Statistics 21, 1760–1779.
Pauly, M. (2010). Discussion about the quality of F-ratio resampling tests for comparing
variances. TEST, 1–17.
Politis, D., Romano, J. and Wolf, M. (1999). Subsampling. Springer-Verlag, New York.
Rice, W. and Gaines, S. (1989). One-way analysis of variance with unequal variances.
Proc. Nat. Acad. Sci. 86, 8183–8184.
Romano, J. (1989). Bootstrap and randomization tests of some nonparametric hypothe-
ses. Annals of Statistics 17, 141–159.
Romano, J. (1990). On the behavior of randomization tests without a group invariance
assumption. Journal of the American Statistical Association 85, 686–692.
Romano, J. (2009). Discussion of “parametric versus nonparametrics: Two alternative
methodologies”.
Serfling, R. J. (1980). Approximation Theorems of Mathematical Statistics. Wiley, New
York.
van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press, New
York.
ADDRESS:
EunYi Chung: Department of Economics, Stanford University, Stanford, CA 94305-
6072; [email protected]
Joseph P. Romano: Departments of Statistics and Economics, Stanford University, Stan-
ford, CA 94305-4065; [email protected]