Page 1: j.1467-842X.1990.tb01011.x

Austral. J. Statist., 32(2), 1990, 177-190

ON BOOTSTRAP HYPOTHESIS TESTING

NICHOLAS I. FISHER¹ AND PETER HALL²

CSIRO and The Australian National University

Summary

We describe methods for constructing bootstrap hypothesis tests, illustrating our approach using analysis of variance. The importance of pivotalness is discussed. Pivotal statistics usually result in improved accuracy of level. We note that hypothesis tests and confidence intervals call for different methods of resampling, so as to ensure that accurate critical point estimates are obtained in the former case even when data fail to comply with the null hypothesis. Our main points are illustrated by a simulation study and application to three real data sets.

Key words: Analysis of variance; Behrens-Fisher problem; bootstrap; hypothesis test; level error; Monte Carlo test; pivotal; resample.

1. Introduction

Our aim in this paper is to describe the construction of bootstrap hypothesis tests, illustrated using the example of analysis of variance. The notion of a general bootstrap hypothesis test dates back to Efron (1979), and appears also in Beran (1988) and Hinkley (1988), but has not been developed to anywhere near the extent of bootstrap ideas for confidence intervals. While there are close links between bootstrap methods for testing and for interval estimation, there are important, explicit differences which call for a specialised treatment of the bootstrap testing problem.

We contend that, as in the case of confidence intervals, an important feature which bootstrap tests should have is that they be based on asymptotically pivotal statistics.

Received February 1989; revised April 1989.

¹ CSIRO, Division of Mathematics and Statistics, Sydney.
² Statistics Research Section, School of Mathematical Sciences, The Australian National University, Canberra.

Acknowledgements. The hospitality of SFAL is noted with gratitude.


178 NICHOLAS I. FISHER AND PETER HALL

This leads us to rule out certain test statistics which feature prominently in some accounts of analysis of variance (e.g., Dijkstra & Werter, 1981). Pivotalness is not universally accepted as a desirable feature of confidence intervals - see for example the discussion of Hall (1988). However, its acceptance is increasing. Its virtue for tests is similar to that for confidence intervals: it results in tests with more accurate levels. With a sample of size n, tests based on pivotal statistics often result in level errors of O(n⁻²), compared with only O(n⁻¹) for tests based on non-pivotal statistics. Our insistence on using pivotal statistics results in our advocating different test statistics in homoscedastic and heteroscedastic problems. The analysis of variance example is an ideal vehicle for bringing out this fundamental point.

One major feature distinguishing hypothesis tests from confidence intervals is that when testing, it is important to have accurate estimates of critical points even when the data fail to satisfy the null hypothesis. This means that the data must be transformed appropriately, so that any failure to comply with the null hypothesis is not reflected in reduced accuracy of level. Put another way, we want the bootstrap distribution of our test statistic to be invariant under a change from the null hypothesis to any of the members of the set of alternative hypotheses which the test is designed to detect. Therefore, we argue, bootstrap analysis proceeds rather differently in testing and confidence interval problems.

Our approach to bootstrap hypothesis testing for one-way analysis of variance is outlined in Section 2. In the case of homoscedastic problems, our test statistic is the classic F-ratio introduced by R.A. Fisher, although of course it does not have Fisher's F distribution in the nonparametric context which we treat. For heteroscedastic problems we use a statistic proposed by James (1951). An alternative statistic suggested by Brown & Forsythe (1974a, b) was not investigated because it fails to be asymptotically pivotal.

Section 3 extends our results to two-way analysis of variance. Section 4 summarizes the results of a simulation study and gives applications to real data, and Section 5 discusses further generalizations and extensions.

2. Methodology for One-way Analysis of Variance

This section motivates the general approach to bootstrap hypothesis testing by making explicit its application to the comparison of means of several samples.


Subsection 2.1 introduces two test statistics, one classical and the other more suited to heteroscedastic problems. Subsection 2.2 develops a resampling scheme tailored to the heteroscedastic case, and Subsection 2.3 presents an Edgeworth expansion argument showing that the more classical of the two statistics from Subsection 2.1 will tend to perform poorly in heteroscedastic problems. Finally, Subsection 2.4 presents new resampling schemes, and outlines of asymptotic theory, for homoscedastic problems.

2.1 Test Statistics

Assume that a random sample {Xᵢⱼ, 1 ≤ j ≤ nᵢ} is gathered from population Πᵢ, for 1 ≤ i ≤ r. The ith population has mean μᵢ and variance σᵢ². Without making distributional assumptions about the Πᵢ's (indeed, often without supposing that the Πᵢ's have common variance) we wish to test the null hypothesis

H₀ : μ₁ = μ₂ = ⋯ = μᵣ.

Put n = Σᵢ nᵢ, X̄ᵢ. = nᵢ⁻¹ Σⱼ Xᵢⱼ and X̄.. = n⁻¹ Σᵢ Σⱼ Xᵢⱼ. There are several possible test statistics, of which just two are

T₁ = Σᵢ nᵢ (X̄ᵢ. − X̄..)² / {(n − r)⁻¹ Σᵢ Σⱼ (Xᵢⱼ − X̄ᵢ.)²}

and T₂, a weighted statistic in which the contribution of each sample mean is studentised by that sample's own variance estimate σ̂ᵢ² = nᵢ⁻¹ Σⱼ (Xᵢⱼ − X̄ᵢ.)².

Under the model that each Πᵢ is Normal N(μ, σ²), where neither μ nor σ² depends on i, (r − 1)⁻¹ T₁ is distributed as F with r − 1 and n − r degrees of freedom. The statistic T₁ is standard in this context, indeed in most circumstances where homoscedasticity is a realistic assumption. On the other hand, if each Πᵢ is Normal N(μ, σᵢ²) where the σᵢ's are unequal, T₁ has a distribution depending on the σᵢ's in a complex manner. In this situation the distribution of T₂ is that of a function of Z₁, …, Zᵣ and W₁, …, Wᵣ alone, where Zᵢ is N(0, 1), Wᵢ is χ² with nᵢ − 1 degrees of freedom, and these variables are stochastically independent. The statistic T₂ therefore has the advantage that its null distribution does not depend on the unknown σᵢ's. However, in cases of homoscedasticity it gives rise to a less powerful test. It dates back to


James (1951), and has been considered by Beran (1988) in the bootstrap case with r = 2.

Now drop the assumption that Πᵢ is Normal, supposing only that it has mean μ and variance σᵢ². As the sample sizes nᵢ increase, the distribution of T₂ converges to a limit (χ² with r − 1 degrees of freedom) which of course does not depend on the σᵢ's. On the other hand, the limit distribution of T₁ does depend on the σᵢ's. Therefore T₂, unlike T₁, is an asymptotically pivotal statistic, and is well suited to bootstrap methods; see Subsection 2.3 for details. For discussions of the importance of pivotalness to bootstrap resampling, see Hartigan (1986), Beran (1987) and Hall (1988).

2.2 Bootstrap Resampling in Heteroscedastic Problems

The idea is to approximate the distribution of the test statistic under the null hypothesis, and is a little different from the approach in confidence interval construction. Our bootstrap method is based on a very simple idea, which we now describe.

Put Yᵢⱼ ≡ Xᵢⱼ − μᵢ, define Ȳᵢ. and Ȳ.. in the obvious manner, and let T₀₁ and T₀₂ denote the versions of T₁ and T₂ computed from the Yᵢⱼ's rather than the Xᵢⱼ's.

The distributions of T₀₁ and T₀₂ are invariant under choices of the μᵢ's. When H₀ is true, these distributions are identical to those of T₁ and T₂, respectively. This fact suggests the following procedure. Draw a resample Xᵢ* = {Xᵢⱼ*, 1 ≤ j ≤ nᵢ}, with replacement, from the sample Xᵢ = {Xᵢⱼ, 1 ≤ j ≤ nᵢ}. Naturally, each sample is resampled independently of the others. Put Yᵢⱼ* ≡ Xᵢⱼ* − X̄ᵢ., define Ȳᵢ.* ≡ nᵢ⁻¹ Σⱼ Yᵢⱼ* and Ȳ..* ≡ n⁻¹ Σᵢ Σⱼ Yᵢⱼ*, and let T₀₁* and T₀₂* denote the versions of T₀₁ and T₀₂ computed from the Yᵢⱼ*'s.

Our object is to approximate the distributions of T₀₁ and T₀₂, under H₀, by those of T₀₁* and T₀₂* respectively.

Let X = ∪ᵢ Xᵢ denote the entire sample. By repeated resampling we may approximate as closely as desired the distributions of T₀₁* and T₀₂*,


given X. Thus, we may compute the bootstrap critical point t̂₀ᵢ, defined by

P(T₀ᵢ* ≤ t̂₀ᵢ | X) = 1 − α, (1)

for i = 1 and 2. An approximate α-level test of H₀ is to reject H₀ if Tᵢ > t̂₀ᵢ. We argue in the next subsection that in heteroscedastic problems, the level error of this test is smaller when i = 2 than it is when i = 1.
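The whole procedure of this subsection (resample each sample independently, centre at its own sample mean, then compare the observed statistic with the bootstrap critical point) can be sketched in a few lines of Python. This is only an illustrative sketch: the display defining T₂ is not legible in this transcript, so the code assumes a James-type weighted form, T₂ = Σᵢ (nᵢ/σ̂ᵢ²)(X̄ᵢ. − X̃)² with X̃ a weighted grand mean, and all function names are ours.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def t2(samples):
    # Assumed James-type statistic: sum_i w_i * (xbar_i - grand)^2,
    # with weights w_i = n_i / sigma_hat_i^2 and grand the w-weighted mean.
    w, m = [], []
    for s in samples:
        mu = mean(s)
        var = sum((x - mu) ** 2 for x in s) / len(s)  # sigma_hat_i^2
        w.append(len(s) / var)
        m.append(mu)
    grand = sum(wi * mi for wi, mi in zip(w, m)) / sum(w)
    return sum(wi * (mi - grand) ** 2 for wi, mi in zip(w, m))

def bootstrap_pvalue(samples, B=500, seed=0):
    # Resample each sample independently with replacement, and centre at its
    # own sample mean (Y*_ij = X*_ij - Xbar_i), so that the bootstrap
    # populations satisfy H0 whether or not the data do.
    rng = random.Random(seed)
    stat = t2(samples)
    exceed = 0
    for _ in range(B):
        star = [[rng.choice(s) - mean(s) for _ in s] for s in samples]
        exceed += t2(star) >= stat
    return exceed / B  # reject H0 at level alpha when this is below alpha
```

Because of the centring step, two samples whose means are far apart still yield a sensible bootstrap critical point; the observed statistic is simply far out in the bootstrap tail.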

2.3 Choice of T₂ over T₁ in Heteroscedastic Problems

We claim that the bootstrap approximation of the null distribution of T₀₁ is accurate only to order n^(-1/2), whereas that of T₀₂ is accurate to order n^(-3/2). To see why, let F(·; σ₁², …, σᵣ²) denote the asymptotic distribution function of T₀₁. It may be proved by Edgeworth expansion that

P(T₀₁ ≤ x) = F(x; σ₁², …, σᵣ²) + n⁻¹ p₁(x) + n⁻² p₂(x) + ⋯, (2)

where the functions p₁, p₂, … depend on σ₁, …, σᵣ and on other cumulants of the populations Πᵢ, but not on n. Terms of order n^(-j/2) for odd j do not enter the expansion, because the hypothesis test is two-sided; Barndorff-Nielsen & Hall (1988) show how to prove this type of result. Let p̂ⱼ denote the version of pⱼ when all population cumulants are replaced by their sample counterparts, and put σ̂ᵢ² = nᵢ⁻¹ Σⱼ (Xᵢⱼ − X̄ᵢ.)². Then

P(T₀₁* ≤ x | X) = F(x; σ̂₁², …, σ̂ᵣ²) + n⁻¹ p̂₁(x) + n⁻² p̂₂(x) + ⋯
                = F(x; σ̂₁², …, σ̂ᵣ²) + n⁻¹ p₁(x) + Oₚ(n^(-3/2)), (3)

the last relation holding since p̂ⱼ = pⱼ + Oₚ(n^(-1/2)). However, the fact that σ̂ᵢ² is distant Oₚ(n^(-1/2)) from σᵢ² means that F(x; σ̂₁², …, σ̂ᵣ²) is Oₚ(n^(-1/2)) from F(x; σ₁², …, σᵣ²). Consequently, comparing (2) and (3), we see that P(T₀₁* ≤ x | X) is Oₚ(n^(-1/2)) from P(T₀₁ ≤ x), assuming H₀, as claimed.

If we repeat this argument for T₀₂ instead of T₀₁, we find that the first term on the right-hand side of (2) does not depend on any of σ₁², …, σᵣ². Comparing versions of (2) and (3) in this case we conclude that

P(T₀₂* ≤ x | X) = P(T₀₂ ≤ x) + Oₚ(n^(-3/2)).

Therefore the error in the bootstrap approximation to the null distribution of T₀₂ is of order n^(-3/2), not n^(-1/2).

In fact, the level error of a bootstrap hypothesis test based on T₀₂ is O(n⁻²), not O(n^(-3/2)), due to inherent symmetry of general two-sided


testing problems; see for example Barndorff-Nielsen & Hall (1988). On the other hand, the level error of a test based on T₀₁ is only O(n⁻¹). It could be reduced to O(n⁻²) by bootstrap iteration (e.g., Hall, 1986; Beran, 1987; Hall & Martin, 1988b), but the numerical expense of that procedure can be considerable.

2.4 Homoscedastic Problems

Treatment of homoscedastic problems should be different from that described above. Firstly, it is clear that when the σᵢ's are identical, T₁ is a more appropriate test statistic than T₂ because it uses all the data to estimate scale in each population. Secondly, the method of resampling should take account of the assumption of homoscedasticity. We shall confine attention throughout to tests based on T₁, and discuss two different resampling schemes. The first scheme is appropriate when the σᵢ's are identical but the populations Πᵢ may have differing shapes, and the second scheme when the Πᵢ's are the same except possibly for their means μᵢ. The object here is to highlight how the method of resampling should be modified according to assumptions; of course, it is rare in practice to encounter the first of these models, and there are standard, non-bootstrap methods for handling the second model.

Put Zᵢⱼ ≡ (Xᵢⱼ − μᵢ)/σᵢ, define Z̄ᵢ. and Z̄.. in the obvious manner, and let T₁₁ denote the version of T₁ computed from the Zᵢⱼ's.

The distribution of T₁₁ is invariant under translations and re-scalings of the data, and is identical to that of T₁ when H₀ is true and σ₁ = σ₂ = ⋯ = σᵣ. This observation motivates the following resampling scheme. Let Xᵢ* be as in Subsection 2.2; put Zᵢⱼ* = (Xᵢⱼ* − X̄ᵢ.)/σ̂ᵢ, where σ̂ᵢ² ≡ nᵢ⁻¹ Σⱼ (Xᵢⱼ − X̄ᵢ.)²; define Z̄ᵢ.* = nᵢ⁻¹ Σⱼ Zᵢⱼ* and Z̄..* = n⁻¹ Σᵢ Σⱼ Zᵢⱼ*; and let T₁₁* denote the version of T₁₁ computed from the Zᵢⱼ*'s.

If σ₁ = σ₂ = ⋯ = σᵣ then an approximate α-level test of H₀ is to reject H₀ if T₁ > t̂₁₁, where t̂₁₁ is defined by

P(T₁₁* ≤ t̂₁₁ | X) = 1 − α.

Arguing as in Subsection 2.3 we may show that the error in the bootstrap approximation to the null distribution of T₁ is Oₚ(n^(-3/2)), and that


the level error of the test is O(n⁻²). To appreciate the basis for the resampling schemes, observe that conditional on X, the resample Zᵢ* ≡ {Zᵢⱼ*, 1 ≤ j ≤ nᵢ} is drawn from a population whose mean and variance do not depend on i. The population from which Yᵢⱼ* comes does depend on i.

To treat the case where the populations are assumed distributed identically about their means, let W denote the set of all n values of (Xᵢⱼ − X̄ᵢ.)/σ̂ᵢ, and let W* = {Wᵢⱼ*, 1 ≤ j ≤ nᵢ and 1 ≤ i ≤ r} be an n-sample drawn with replacement from W. Define W̄ᵢ.* and W̄..* in the obvious manner, and put T₂₁* equal to the version of T₁ computed from the Wᵢⱼ*'s.

An approximate α-level test of H₀ is to reject H₀ if T₁ > t̂₂₁, where t̂₂₁ is defined by

P(T₂₁* ≤ t̂₂₁ | X) = 1 − α.

Assuming that the populations are identically distributed about their means, this bootstrap approximation to the null distribution of T₁ is in error by Oₚ(n^(-3/2)), and the level error of the test is O(n⁻²).
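The second scheme pools the studentised residuals across samples before resampling. A sketch, again with T₁ taken to be the classic one-way F-ratio and with our own (illustrative) function names:

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def t1(samples):
    # Classic one-way F-ratio form, as in Subsection 2.1.
    n = sum(len(s) for s in samples)
    r = len(samples)
    grand = sum(x for s in samples for x in s) / n
    between = sum(len(s) * (mean(s) - grand) ** 2 for s in samples)
    within = sum((x - mean(s)) ** 2 for s in samples for x in s) / (n - r)
    return between / within

def t21_star(samples, rng):
    # Scheme 2: pool ALL studentised residuals (X_ij - Xbar_i)/sigma_i into
    # one set W, draw a single n-sample from W with replacement, split it
    # back into groups of sizes n_1, ..., n_r, and evaluate the T1 formula.
    pool = []
    for s in samples:
        m = mean(s)
        sd = (sum((x - m) ** 2 for x in s) / len(s)) ** 0.5
        pool.extend((x - m) / sd for x in s)
    star = [[rng.choice(pool) for _ in s] for s in samples]
    return t1(star)

def critical_point(samples, alpha=0.05, B=500, seed=2):
    # Approximate (1 - alpha) quantile of the bootstrap null distribution.
    rng = random.Random(seed)
    boot = sorted(t21_star(samples, rng) for _ in range(B))
    return boot[int((1 - alpha) * B) - 1]
```

Pooling is what encodes the stronger assumption of this model: every bootstrap observation, whichever group it lands in, is drawn from the one common residual population.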

3. Methodology for Two-way Analysis of Variance

In the presence of replication it is feasible to conduct two-way analysis of variance without assuming homoscedasticity. Then, an analogue of the procedure developed for one-way analysis in Subsections 2.2 and 2.3 may be employed. However, for simplicity of exposition we shall content ourselves here with a resampling scheme tailored to the non-replicated, homoscedastic case; of course, standard ANOVA procedures are usually adequate for this situation.

Our model is

Xᵢⱼ = μ + αᵢ + βⱼ + εᵢⱼ, 1 ≤ i ≤ r and 1 ≤ j ≤ s,

where Σᵢ αᵢ = Σⱼ βⱼ = 0 and the εᵢⱼ's are independent and identically distributed.

The hypothesis under test is H₀ : α₁ = α₂ = ⋯ = αᵣ = 0.

Put n ≡ rs, X̄ᵢ. ≡ s⁻¹ Σⱼ Xᵢⱼ, X̄.ⱼ ≡ r⁻¹ Σᵢ Xᵢⱼ and X̄.. ≡ n⁻¹ Σᵢ Σⱼ Xᵢⱼ. Then our test statistic is

T = s Σᵢ (X̄ᵢ. − X̄..)² / {[(r − 1)(s − 1)]⁻¹ Σᵢ Σⱼ (Xᵢⱼ − X̄ᵢ. − X̄.ⱼ + X̄..)²}.


TABLE 1
Level accuracy when r = 2

                     Test based on T₁                Test based on T₂
Sample sizes         Std devs σ₁, σ₂                 Std devs σ₁, σ₂
n₁   n₂      1,1   1,3   3,1   1,6   6,1     1,1   1,3   3,1   1,6   6,1
15   15      2.2   7.3   7.3   9.2   9.2     1.6   5.8   5.8   7.4   7.4
20   20      3.5   7.8   7.8   7.4   7.4     2.1   5.6   5.6   6.7   6.7
15   25      3.8   6.4   9.6   6.6   9.4     3.1   3.9   6.6   6.5   8.6

A resampling scheme for approximating the null distribution of T may be developed as follows. Put W ≡ {Xᵢⱼ − X̄ᵢ. − X̄.ⱼ + X̄.., 1 ≤ i ≤ r and 1 ≤ j ≤ s}. Let W* ≡ {Wᵢⱼ*, 1 ≤ i ≤ r and 1 ≤ j ≤ s} denote an n-sample drawn with replacement from W. Define W̄ᵢ.* ≡ s⁻¹ Σⱼ Wᵢⱼ*, W̄.ⱼ* ≡ r⁻¹ Σᵢ Wᵢⱼ*, W̄..* ≡ n⁻¹ Σᵢ Σⱼ Wᵢⱼ*, and let T* denote the version of T computed from the Wᵢⱼ*'s.

An approximate α-level test of H₀ is to reject H₀ if T > t̂, where t̂ is defined by

P(T* ≤ t̂ | X) = 1 − α.

It may be shown that the approximation error P(T ≤ x) − P(T* ≤ x | X) equals Oₚ(n^(-3/2)), and that the level error of the test is O(n⁻²).
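The two-way scheme can also be sketched in Python. The display defining T is not legible in this transcript, so the sketch assumes the standard F-type statistic for row effects (between-rows mean square over the interaction mean square); the function names are ours.

```python
import random

def twoway_stat(x):
    # Assumed F-type statistic for row effects: s * sum_i (rowmean_i - grand)^2
    # divided by the interaction mean square (divisor (r-1)*(s-1)).
    r, s = len(x), len(x[0])
    grand = sum(sum(row) for row in x) / (r * s)
    rowm = [sum(row) / s for row in x]
    colm = [sum(x[i][j] for i in range(r)) / r for j in range(s)]
    num = s * sum((rm - grand) ** 2 for rm in rowm)
    resid = sum((x[i][j] - rowm[i] - colm[j] + grand) ** 2
                for i in range(r) for j in range(s))
    return num / (resid / ((r - 1) * (s - 1)))

def twoway_boot(x, B=400, seed=3):
    # Resampling scheme of Section 3: pool the residuals
    # W_ij = X_ij - rowmean_i - colmean_j + grand, draw an n-sample W* with
    # replacement, arrange it as an r-by-s table, and recompute the statistic.
    rng = random.Random(seed)
    r, s = len(x), len(x[0])
    grand = sum(sum(row) for row in x) / (r * s)
    rowm = [sum(row) / s for row in x]
    colm = [sum(x[i][j] for i in range(r)) / r for j in range(s)]
    pool = [x[i][j] - rowm[i] - colm[j] + grand
            for i in range(r) for j in range(s)]
    stat = twoway_stat(x)
    exceed = 0
    for _ in range(B):
        star = [[rng.choice(pool) for _ in range(s)] for _ in range(r)]
        exceed += twoway_stat(star) >= stat
    return exceed / B  # bootstrap P-value; reject H0 when below alpha
```

Because the residuals already have row and column effects removed, the bootstrap tables satisfy H₀ whether or not the data do, mirroring the centring step of the one-way schemes.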

4. Monte Carlo Results, and Data Applications

4.1 Simulation Study

To illustrate our point numerically we simulated application of our test to skew, heteroscedastic data in the context of one-way ANOVA. Assume that the ith population follows the model

Xᵢⱼ = μᵢ + σᵢ (εᵢⱼ − 1) (5)

for 1 ≤ i ≤ r. Here the εᵢⱼ's were simulated as independent exponential variables with unit mean, so that Xᵢⱼ has mean μᵢ and standard deviation σᵢ. Under H₀, μᵢ does not depend on i and may be taken as zero. Tables 1 and 2 depict actual levels, in percentages, each


TABLE 2
Level accuracy when r = 3

                         Test based on T₁                          Test based on T₂
Sample sizes             Std devs σ₁, σ₂, σ₃                       Std devs σ₁, σ₂, σ₃
n₁   n₂   n₃    1,1,1  1,3,6  6,3,1  1,5,10  10,5,1     1,1,1  1,3,6  6,3,1  1,5,10  10,5,1
15   15   15     2.3    8.0    8.0    7.1     7.1        3.5    5.0    8.0    4.0     4.0
20   20   20     1.4    6.7    6.7    7.2     7.2        2.6    4.6    6.7    5.0     5.0
15   25   25     2.9    7.4    8.6    6.2     9.9        3.3    3.6    3.9    4.0     5.3

estimated from 1,000 trials with B = 200 bootstrap resamples, of nominal 5% bootstrap tests based on either T₁ or T₂. The tests were made at three sample sizes and various standard deviations, and for r = 2 (Table 1) and r = 3 (Table 2). The data were exponential, distributed according to model (5) with the μᵢ's all zero.

Recall that under conditions of heteroscedasticity, only T₂ provides a test which is asymptotically pivotal. Our argument in Section 2 predicted that for this reason, and when heteroscedasticity is present, tests based on T₂ should have greater level accuracy than tests based on T₁. Our simulation study confirms that this tends to be the case. Only in the case of equal σᵢ's does the test based on T₁ offer greater level accuracy than the test based on T₂, and then only for the case r = 2. For symmetric errors this phenomenon is not so clearly evident, because of the higher level accuracy of both tests.

These results are part of a larger study, exploring a range of issues in bootstrap resampling. Also included in the study was an investigation of the power of the bootstrap procedures, which generally indicated satisfactory performance. Table 3 depicts actual powers, each estimated from 1,000 trials, of bootstrap tests based on either T₁ or T₂. Data were exponential, distributed according to model (5) with r = 3 samples and μ₁ = 0, μ₂ = 1, μ₃ = 2, σ₁ = 5, σ₂ = 3, σ₃ = 1; the nominal level of each test was 5%.

4.2 Application to Data Sets

Some comparison between bootstrap testing procedures and more clas- sical procedures, in situations where the latter are appropriate, can be made by looking at their relative performances on particular data sets.


TABLE 3
Power when r = 3

Sample sizes         Power of test      Power of test
n₁   n₂   n₃         based on T₁        based on T₂
10   10   10            24.8%              20.1%
20   20   20            35.9%              42.2%
30   30   30            45.6%              61.0%

We have investigated a range of published examples in which the means of two or more samples were compared, in the presence of heterogeneous variances. The conclusion drawn in each case was the same for the bootstrap tests (based on either T₁ or T₂) as it was for the parametric test. A few of these are reported below.

Example 1. Tippett (1952, Table 5.3) analysed the following data (% muscle glycogen) on the effect of insulin on rabbits:

Control: 0.19, 0.18, 0.21, 0.30, 0.66, 0.42, 0.08, 0.12, 0.30, 0.27
Treated: 0.15, 0.13, 0.00*, 0.07, 0.27, 0.24, 0.19, 0.04, 0.08, 0.20, 0.12
(* below detectable level)

n₁ = 10, n₂ = 11; X̄₁ = 0.273, X̄₂ = 0.135; s₁ = 0.168, s₂ = 0.085.

The usual 2-sample t-test for equality of means yields a P-value of about 0.027. Tippett noted the possibility that the variances may be different, but concluded nevertheless that the means were probably different, because of Welch's (1938) observation that the t-test was not unduly affected when the sample sizes were equal (here they are almost equal). The two bootstrap tests, using 200 resamples, yield significance probabilities of 0.056 and 0.044 for T₁ and T₂ respectively, in agreement with Tippett's assessment.

Example 2. Goulden (1952, Example 4.2) analysed the following data on protein determinations in wheat samples from two different provinces in Canada, abstracted from a large survey.

Sample 1: 15.1, 14.3, 11.5, 14.5, 15.4, 12.5, 14.6, 16.6
Sample 2: 12.2, 12.5, 11.2, 12.6, 11.0, 11.6, 12.0, 12.5, 11.8, 12.4, 11.5, 12.0, 11.6, 12.7

n₁ = 8, n₂ = 14; X̄₁ = 14.31, X̄₂ = 11.97; s₁ = 0.57, s₂ = 0.14 (here s₁, s₂ denote standard errors of the means).


Using an approximation to the significance level of the statistic T₁ due to Cochran & Cox (1957, p.101), Goulden found that the significance probability associated with a test of identical means was between 0.01 and 0.05. The standard Welch (1938) procedure gives a two-sided P-value of 0.00425. Using 1,000 resamples, the significance probabilities for bootstrap tests based on T₁ and T₂ were 0.0055 and 0.0025 respectively.

Example 3. Snedecor & Cochran (1976, Example 10.12.1) reported the following data comprising the number of days survived by mice inoculated with three strains of typhoid organism (numbers in parentheses indicate multiplicities):

Sample 1: 2(6), 3(4), 4(9), 5(8), 6(3), 7 (n₁ = 31)
Sample 2: 2, 3(3), 4(3), 5(6), 6(6), 7(14), 8(11), 9(4), 10(6), 11(2), 12(3), 13 (n₂ = 60)
Sample 3: 2(3), 3(5), 4(5), 5(8), 6(19), 7(23), 8(22), 9(14), 10(14), 11(7), 12(8), 13(4), 14(1) (n₃ = 133)

X̄₁ = 4.03, X̄₂ = 7.37, X̄₃ = 7.80; s₁ = 1.38, s₂ = 2.42, s₃ = 2.58.

The standard F-test for equal means (given merely as an exercise in Example 10.12.2) yields an F-value of 31.0, for 2 and 221 degrees of freedom. Snedecor & Cochran noted that the unequal sample sizes and possible heterogeneity of variances should be borne in mind in analysing these data. For bootstrap resampling, there is a different potential problem, namely the large number of replicated values in each sample. Were the sample sizes smaller, one might wish to resample from smoothed data (e.g., the data have been rounded to the nearest day, so a resampled random uniform U[−1/2, 1/2] variable could be added to each resample value Xᵢⱼ*: see Fisher & Hall, 1990). However, no difficulties were encountered in this instance. The bootstrap tests based on T₁ and T₂ resulted in extremely small P-values, with the actual value of each statistic exceeding the largest of the 1,000 bootstrap values. (We have analysed the data directly, here, for comparative purposes; it would certainly be beneficial to consider a preliminary transformation of the data to stabilize the variance.)

5. Discussion

We have described bootstrap analysis of variance tests which are non-parametric, in the sense that they involve no assumptions about any of


the parent populations. This has been achieved by resampling from data values. Should the underlying populations be specified up to some vector λ of parameters, we would estimate λ by λ̂, say, and resample from the population with λ set equal to λ̂.

For example, suppose we assume that the ith population is Normal N(μᵢ, σᵢ²). We estimate μᵢ by X̄ᵢ., estimate σᵢ² by sᵢ² ≡ (nᵢ − 1)⁻¹ Σⱼ (Xᵢⱼ − X̄ᵢ.)², draw {Xᵢⱼ*, 1 ≤ j ≤ nᵢ} at random from an N(X̄ᵢ., sᵢ²) population, and put Yᵢⱼ* ≡ Xᵢⱼ* − X̄ᵢ. exactly as before. Of course, this is equivalent to drawing Yᵢⱼ* from an N(0, sᵢ²) population. In this circumstance T₀₂ and T₀₂* have identical distributions, and so the bootstrap produces a test which is exact, except for inaccuracies arising from doing only a finite amount of resampling. The bootstrap is equivalent to constructing exact statistical tables by Monte Carlo methods. Likewise, the parametric bootstrap applied to homoscedastic Normal data and using the statistics T₁ or T also produces an exact test. Since the distributions of T₁ and T are known (they are Fisher's F), on this occasion the bootstrap is doing no more than reconstruct existing tables by simulation. For T₂ the exact distribution is not tabulated, although it is known in principle. James (1951) devised clever and practical approximations to this distribution.
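The Normal case can be sketched directly: instead of resampling data values, the parametric bootstrap draws the centred values Yᵢⱼ* from N(0, sᵢ²). The helper below is a minimal illustration, with names of our choosing:

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def parametric_resamples(samples, B=5, seed=4):
    # Parametric bootstrap under assumed Normal populations: estimate mu_i by
    # Xbar_i and sigma_i^2 by s_i^2 (divisor n_i - 1), then draw the centred
    # values Y*_ij directly from N(0, s_i^2); as noted in the text, this is
    # equivalent to drawing X*_ij from N(Xbar_i, s_i^2) and centring.
    rng = random.Random(seed)
    out = []
    for _ in range(B):
        star = []
        for s in samples:
            m = mean(s)
            s2 = sum((x - m) ** 2 for x in s) / (len(s) - 1)
            star.append([rng.gauss(0.0, s2 ** 0.5) for _ in s])
        out.append(star)
    return out
```

Feeding these resamples to a statistic in place of the nonparametric Y* sets is exactly the construction of statistical tables by Monte Carlo methods described above.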

In the case of one-way analysis of variance with r = 2, there is no difficulty in testing one-sided alternative hypotheses such as H₁ : μ₁ < μ₂. Appropriate test statistics are

U₁ = (X̄₂. − X̄₁.) / {σ̂ (n₁⁻¹ + n₂⁻¹)^(1/2)}, where σ̂² = (n − 2)⁻¹ Σᵢ Σⱼ (Xᵢⱼ − X̄ᵢ.)²,

and

U₂ = (X̄₂. − X̄₁.) / (σ̂₁²/n₁ + σ̂₂²/n₂)^(1/2),

in homoscedastic and heteroscedastic cases respectively. For a test in nonparametric circumstances, follow the resampling procedures recommended in Section 2, obtaining resampled data values {Yᵢⱼ*} for heteroscedastic problems, {Zᵢⱼ*} for homoscedastic problems with different populations, and {Wᵢⱼ*} for homoscedastic problems with identical populations. Use the Yᵢⱼ*'s in place of the Xᵢⱼ's to construct U₂* using the formula for U₂, and either the Zᵢⱼ*'s or the Wᵢⱼ*'s in place of the Xᵢⱼ's to construct U₁* using the formula for U₁. Approximate the distribution of Uᵢ by that of Uᵢ* conditional on X, and estimate critical points accordingly. In the heteroscedastic case, this solution of the two-sample testing problem is related


to the nonparametric bootstrap solution of the Behrens-Fisher problem (Hall & Martin, 1988a), but with the important difference that the resampling scheme now centres the difference between means under both null and alternative hypotheses.

These arguments and those in earlier sections may be adapted in an obvious way to test hypotheses involving contrasts, for example to test the goodness of fit of a set of contrasts or to make multiple comparisons. The philosophy is exactly that in the previous paragraph: resample as described in earlier sections, and use the resampled data values in place of the Xᵢⱼ's to construct a bootstrap statistic whose distribution, conditional on X, approximates the distribution of the test statistic under both null and alternative hypotheses.

All our statistics use sample standard deviations to scale differences between means. This is more out of a desire to keep reasonably close to traditional methods, than out of necessity. Alternative robust scale estimates, such as the interquartile range, could be employed. Likewise, our bootstrap-based testing philosophy applies to statistics based on sums of absolute differences rather than sums of squares.

A test for homoscedasticity, such as Levene's test (Levene, 1960), may be used as a precursor to any analysis of variance. It, too, has a bootstrap version.

Finally, three points should be made. Firstly, bootstrapping should not proceed independently of other aspects of good data analysis. For example, attention to outliers and use of appropriate transformations may improve both the validity of the model and the efficacy of the bootstrap method. Secondly, we have employed the analysis of variance example merely to illustrate the main issues which arise in bootstrap hypothesis testing. Thirdly, the duality between testing and interval estimation opens up the possibility of constructing confidence intervals indirectly from bootstrap hypothesis tests. This aspect is particularly interesting in cases where standard bootstrap interval methods do not perform well, and will be discussed elsewhere.

References

BARNDORFF-NIELSEN, O.E. & HALL, P. (1988). On the level-error after Bartlett adjustment of the likelihood ratio statistic. Biometrika 75, 374-378.

BERAN, R. (1987). Prepivoting to reduce level error of confidence sets. Biometrika 74, 457-468.

BERAN, R. (1988). Prepivoting test statistics: a bootstrap view of asymptotic refinements. J. Amer. Statist. Assoc. 83, 687-697.

BROWN, M.B. & FORSYTHE, A.B. (1974a). The small sample behaviour of some statistics which test the equality of several means. Technometrics 16, 129-132.

BROWN, M.B. & FORSYTHE, A.B. (1974b). The ANOVA and multiple comparisons for data with heterogeneous variances. Biometrics 30, 719-724.

COCHRAN, W.G. & COX, G.M. (1957). Experimental Designs. Second edition. New York: John Wiley.

DIJKSTRA, J.B. & WERTER, P.S.P.J. (1981). Testing the equality of several means when the population variances are unequal. Commun. Statist. - Simula. Computa. B10, 557-569.

EFRON, B. (1979). Bootstrap methods: another look at the jackknife. Ann. Statist. 7, 1-26.

FISHER, N.I. & HALL, P. (1990). Bootstrap algorithm for small samples. J. Statist. Comput. Simul., to appear.

GOULDEN, C.H. (1952). Methods of Statistical Analysis. Second edition. New York: John Wiley.

HALL, P. (1986). On the bootstrap and confidence intervals. Ann. Statist. 14, 1431-1452.

HALL, P. (1988). Theoretical comparison of bootstrap confidence intervals. (With discussion.) Ann. Statist. 16, 927-953.

HALL, P. & MARTIN, M.A. (1988a). On the bootstrap and two-sample problems. Austral. J. Statist. 30A, 179-192.

HALL, P. & MARTIN, M.A. (1988b). On bootstrap resampling and iteration. Biometrika 75, 661-671.

HARTIGAN, J.A. (1986). Contribution to discussion. Statist. Sci. 1, 75-77.

HINKLEY, D.V. (1988). Bootstrap methods. J. Roy. Statist. Soc. Ser. B 50, 321-337.

JAMES, G.S. (1951). The comparison of several groups of observations when the ratios of the population variances are unknown. Biometrika 38, 324-329.

LEVENE, H. (1960). Robust tests for equality of variances. In Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling, eds. I. Olkin et al., pp. 278-292. Stanford, Calif.: Stanford University Press.

SNEDECOR, G.W. & COCHRAN, W.G. (1976). Statistical Methods. Sixth edition. Ames, Iowa: The Iowa State University Press.

TIPPETT, L.H.C. (1952). The Methods of Statistics. Fourth edition. New York: John Wiley.

WELCH, B.L. (1938). The significance of the difference between two means when the population variances are unequal. Biometrika 29, 350-362.

