
International Journal of Epidemiology © Oxford University Press 1982

Vol. 11, No. 3. Printed in Great Britain

Simultaneous Inference in Epidemiological Studies

DAVID R JONES* and LESLEY RUSHTON†

Jones D R (Department of Community Medicine, Westminster Medical School, London SW1P 2AR, England) and Rushton L. Simultaneous inference in epidemiological studies. International Journal of Epidemiology 1982, 11: 276-282.
Some difficulties encountered in using and interpreting significance tests in both exploratory and hypothesis testing epidemiological studies are discussed. Special consideration is given to the problems of simultaneous statistical inference—how are inferences to be modified when many significance tests are performed on the same set of data? Although some partial solutions are available, greater emphasis on estimation methods and less use of and reliance on significance testing in epidemiological studies is more appropriate.

In most epidemiological studies extensive use is made of significance tests. Study reports are often full of the resultant 'p-values'; in the eyes of many investigators and some editors a survey report looks naked without 'p-values'.1 In fact, significance tests occupy a central place in most studies involving exploration of data in the medical and social fields. The way in which significance tests have gained their important role and the problems involved in their use are considered by, amongst others, Atkins and Jarrett2 who comment:

'Significance tests perform a vital function in the social sciences because they appear to supply an objective method of drawing conclusions from quantitative data. Sometimes they are used mechanically, with little comment, and with even less regard for whether or not the required assumptions are satisfied.'

Schwartz et al3 make a clear distinction (in the context of clinical trials) between those studies in which it is essential to reach a conclusion or decision and a significance test is inappropriate (the 'pragmatic' approach) and those in which further understanding of the mechanism of the phenomenon studied is central and a significance test may be appropriate (the 'explanatory' approach). A more general review is provided by Barnett,4 amongst others.

* Department of Community Medicine, Westminster Medical School, 17 Horseferry Road, London SW1P 2AR, England.
† Division of Epidemiology, Institute of Cancer Research, Sutton, Surrey, England. Present address: School of Mathematics, Statistics & Computing, Thames Polytechnic, London SE18.

Even in an exploratory study a significance level provides a very poor summary of the results obtained, because it combines two characteristics which would be clearer if kept separate, namely, the size of the difference tested and the precision with which it is measured.5

Thus non-significance of a result can arise either because the effect being studied is small (or even totally absent) or because the sample size of the study is too small to estimate the size of the effect precisely—the study is not sufficiently powerful.

Determination of the power of a proposed epidemiological study—the probability that it will detect an effect, for example a relative risk, of a specified size if it exists—is relatively rarely performed at the design stage of such a study. Sometimes this is justified on practical grounds if the size of the sample available for study, and thus the power of the study, is dictated by the choice of population to be studied; for example, only a very limited number of men may have been exposed to a suspected hazard in a certain factory. However, in such cases, knowledge of the power of the study is essential if any negative results obtained are to be assessed appropriately. When the choice of sample is not so strongly circumscribed, power calculations should be an important part of the design stage of the study. Power calculations relating to the basic methods of analysis used in most epidemiological studies can be made6 although of course the circumstances in which they are possible are limited.7 These calculations are supplemented by investigations of the properties of specific methods8 and can be extended by simulation methods if necessary.9,10

A relatively small shift of emphasis from significance testing (for example—Is the number of deaths observed in a cohort study significantly different from the number expected if the null hypothesis that there is no special risk in the group studied is true?) to interval estimation (On the basis of the number of deaths observed, in what range can we be fairly sure that the relative risk will lie?) could lead to a marked improvement in the way in which the results of epidemiological studies are assessed. Atkins and Jarrett2 argue the case for estimation rather than significance testing in a wide range of applications. The calculations required to obtain an interval estimate are usually no more complex than those required to obtain a p-value; indeed often the same quantities are calculated, but used to construct a confidence interval rather than a test statistic and hence a p-value. One main advantage of the estimative approach over that of significance testing is that the precision of the estimate of the size of the effect being studied is made explicit. Armitage11 provides both an introduction to estimation and details of estimation in circumstances relevant to epidemiological studies; Breslow and Day12 review methods of estimation of confidence limits for relative risks in case-control studies. A comparable review for cohort study methods is as yet more difficult to find, but, for example, Bailar and Ederer13 give tables for constructing confidence limits for the ratio of observed and expected deaths suitable for use in the analysis of cohort studies.
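In modern terms, the interval-estimation approach for the ratio of observed to expected deaths can be sketched in a few lines of Python. This is a rough log-scale normal approximation, not the exact tabulated limits of Bailar and Ederer; the function name, the 1.96 multiplier (95% level) and the example counts are illustrative choices, not from the paper:

```python
import math

def smr_confidence_interval(observed, expected, z=1.96):
    """Approximate confidence interval for the ratio of observed to
    expected deaths (the standardized mortality ratio), via a normal
    approximation to log(O/E) with Poisson variance 1/O.
    Rough for small observed counts."""
    smr = observed / expected
    se_log = 1.0 / math.sqrt(observed)      # SE of log(SMR) when O is Poisson
    lower = smr * math.exp(-z * se_log)
    upper = smr * math.exp(z * se_log)
    return smr, lower, upper

# 30 deaths observed where 20 were expected: SMR = 1.5
smr, lo, hi = smr_confidence_interval(30, 20)
```

The interval (roughly 1.05 to 2.15 here) conveys both the size of the excess and the precision with which it is estimated, which a bare p-value does not.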

Of course, not all the problems associated with significance testing disappear when estimation is given greater emphasis; for example, we shall see later that simultaneous inference problems remain. However, the advantages are important—estimates from different studies can be compared in assessing the overall picture; in contrast, significance testing forces the summary of the results of each study into a form which is difficult to collate with corresponding results from other studies except in the most superficial of ways.

In view of the key position of significance tests it is hardly surprising that when doubts about their interpretation are raised controversy and confusion result. One such problem which often arises in epidemiological studies is that of simultaneous statistical inference.

SIMULTANEOUS STATISTICAL INFERENCE
If many significance tests are performed in the analysis of a study the probability of finding at least one statistically significant result (using one of the conventional significance levels, such as the 5% level) can be large even if there are no real 'discrepancies' to be found in the data tested—in other words, if all of the null hypotheses14,11 are true—so that all of the significant results occur by chance alone.

This is well illustrated by a simple example given by Tukey.15 Suppose that we are analysing the results of an epidemiological study of a single suspected hazard, but that the hazard is in fact entirely non-existent, so that all the null hypotheses in the analysis are true. If we perform a significance test within each of just 10 subgroups of the study population (defined perhaps by age and sex, exposure level or area of residence, etc) then the chance of finding a result significant at the 5% level in at least one of the 10 subgroups by chance alone (all of the null hypotheses being true) is more than 40%. The details of the calculation of this result (probability of at least one significant result = 1.0 − (0.95)^10 = 0.401) depend on the assumption that the results for the various subgroups are statistically independent. In practice, in the analysis of most epidemiological studies this assumption is unlikely to be true—for example, subgroups may be defined by subdividing the same population firstly by age and secondly by exposure level, and these two variables may be correlated—but the gist of the message will be similar.
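Tukey's calculation generalizes directly: under independence, the chance of at least one spurious 'significant' result among n tests at level α is 1 − (1 − α)^n. A minimal sketch (the function name is ours):

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one nominally significant result among
    n_tests independent tests when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** n_tests

# 10 subgroups tested at the 5% level, all null hypotheses true
rate = familywise_error_rate(10)   # about 0.401, as in Tukey's example
```

With 20 subgroups the figure rises to about 64%, which underlines how quickly the problem grows with the number of tests.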

Thus it may be demonstrated that statistically significant results are likely to be found by chance in many epidemiological studies even if there are no real correlations between the exposures (or other 'risk' factors) and the diseases studied.

Identification of those significant results in which the null hypothesis is not true is made more difficult by the possible occurrence of many spurious significant results. The seriousness of this problem varies from study to study. It is dependent on the size and type of the study, and in particular on the number of significance tests performed in analysis of the study results. For example, as Gardner16 points out, among his set of 5000 correlations between pairs of variables describing sociodemographic, environmental and mortality characteristics of English towns about 50 correlations could be expected to differ significantly from zero at the 1% level by chance alone.

The nature of the investigation is also important. A distinction must be made17,18 between a priori and a posteriori investigations. In an a priori investigation data are collected specifically to test hypotheses formulated before the study begins. This is in accord with the 'searchlight theory' of scientific method19,20 and classical methods of statistical inference involving the testing of the plausibility of a preconceived hypothesis in the light of the newly collected observations may be appropriate.

In an a posteriori investigation the hypotheses are to be formulated after the data have been searched or 'dredged' for interesting patterns. The hypotheses are generated on the basis of the patterns so found, following the 'bucket theory'.19,20 Mixing metaphors even further, these studies are also known as 'fishing' studies. Formal hypothesis testing is not in our view appropriate in a posteriori investigations; at very least the method of analysis needs to take account explicitly of the knowledge that hypotheses have been selected for study because of the patterns found in the data set.

On the whole the problem of simultaneous inference seems more severe in a posteriori than in a priori investigations, setting aside any issue of appropriateness of the inferential method in the former, because the former generally involve relatively large numbers of statistical 'tests' in the course of their dredging or fishing procedures. In practice a single study, whether of the case-control or cohort type, is likely to involve elements of both types of investigation (see, for example, Rushton and Alderson21) with some preconceived hypotheses to be tested with the new data and others to be generated by looking at other aspects of the data. This is well illustrated by Hjalmarson et al,22 who are careful to distinguish analyses of hypotheses relating to subgroups of their sample defined before the study from those relating to subgroups formed retrospectively, after inspection of the data.

Comparable and perhaps more severe problems are encountered in the use of screening, surveillance and monitoring schemes. Screening23 involves routine examination of members of a population in a search for cases of previously unidentified disease among them, for example, use of the Guthrie test to detect phenylketonuria in young babies. A screening test is characterized by its specificity—the proportion of those examined who do not have the disease who are correctly so identified by the test—and its sensitivity—the proportion of those with the disease who are correctly identified by the test. If the specificity of the test is inadequate, too many false positives will occur. These correspond directly to 'spurious' significant results among a large set of significance test results obtained from a conventional study.

Monitoring and surveillance systems again involve routine examinations of data describing some aspects of the health of members of a population but it may be unknown beforehand which diseases or hazards are being sought. The quantities of data examined in a monitoring study are likely to be larger than in most traditional epidemiological studies, and so the problems of multiple significance testing are likely to be more severe. The feasibility of such a system for monitoring the health of industrial employees is discussed by Bell and Coleman.10 A relatively simple example, namely the regular scrutiny of notified cases of congenital abnormalities in each local area, is described by Hill et al,24 although they do not acknowledge the existence of a simultaneous inference problem.

WHAT IS SPECIAL ABOUT EPIDEMIOLOGICAL STUDIES?
The problems resulting from simultaneous inference or multiple or repeated comparisons may occur in application of statistical methodology to any field if many examinations of (large) sets of data are made through the medium of significance testing (or equivalently, calculation of confidence intervals).

Comparable problems to those outlined above are to be found in applications of statistics other than epidemiology; in fact they occur in most survey analyses. Similarly, interpretation of batteries of biochemical tests on patients is clouded by the relatively high probability of at least one of the test results lying outside its 'normal range' if a large number of tests are performed on an individual.25 The same kind of problem can arise in the conduct of clinical trials.26 However, the related problem of repeated significance testing has received more attention.27 Here, data on the same set of patients may be examined repeatedly as they accumulate in the course of a clinical trial. Repeated examination of the data increases the probability of finding a nominally significant result attributable to chance alone during one of the examinations. Several resolutions of this problem of sequential analysis are possible,26 one of which involves use of the overall probability for the whole set of examinations (parallel to the 40.1% in the example above) rather than that for each test.28 Monitoring systems suffer from problems both of multiple and of repeated significance testing.

What is special about the epidemiological examples of multiple and/or repeated significance testing? The technical problems are comparable with those found in examples from other fields. These problems are compounded with weight of concern about the results of epidemiological studies on the part of various interested groups. For example, both management and trades unions will be concerned about the potential hazards which may be highlighted by study of an industrial population, but the implications of any study findings will be quite different for these two groups. This degree of external interest is not unique to epidemiology, but the level of such interest is undoubtedly high there. Peto29 has deprecated the polarization of viewpoints on these results. Whether or not a disinterested middle ground exists is arguable; there is, however, a clear temptation to regard the results of statistical studies, and in particular concise but deficient summaries of results provided by significance levels, as objective measures of the importance of the potential hazards studied.2 Since considerable costs, both financial and social, to institutions and to individuals, may depend on the findings of epidemiological studies, it is only natural that straightforward and concise summaries of their results are sought. What is unfortunate is that such a central place has been afforded to the statistical significance of individual studies. Despite repetition of the warning that statistical significance is not equivalent to significance in a clinical or epidemiological sense, the statistical significance or non-significance of a result (at the conventional 5% level!) may seem attractively clear cut as the crucial criterion to use in deciding whether evidence of a hazard should be taken seriously. Thus, for example, in response to Doll and Peto's suggestion30,31 that not all of their statistically significant results were important and that not all of their non-significant results should be disregarded, it was suggested32 that on the contrary 'the logic of practical statistical inference' demanded that all statistically significant associations be accepted as true.

When significance tests or 'p-values' are regarded as the crucial criteria for the interpretation of epidemiological studies the importance of the problems of simultaneous inference, and their implications for interpretation of the 'p-values', are correspondingly severe. The problem is often ignored and the significance levels of individual tests taken at face value. At the other extreme, undue emphasis is given to significance levels adjusted for the effects of the many tests performed. We now review briefly methods of adjustment appropriate for use in epidemiological studies.

SOME TECHNICAL 'SOLUTIONS' OF THE PROBLEM OF SIMULTANEOUS INFERENCE
Extensive reviews of statistical aspects of the 'multiple comparison' or simultaneous inference problem and of methods of adjustment to compensate for it are available in the statistical literature.33-35 However, relatively little of this work is directly applicable to analysis of most epidemiological studies, for example, to case-control or cohort studies. Most of the methods of adjustment assume that the variables of interest are quantitative and normally distributed, and many methods are appropriate for application to analysis of variance.33

In contrast, in many epidemiological studies the variables to be analysed are qualitative (such as category of job in occupational studies) and most methods of analysis are based on contingency table analysis. Appropriate methods of adjusting for the effects of simultaneous inference in such analyses are less readily available.

For example, in the analysis of case-control study data, epidemiologists frequently work in terms of the relative risk (or odds ratio or cross product ratio) and may wish to estimate confidence limits for such relative risks in addition to or instead of calculating their significance. Further, the contingency table involved may be of larger dimensions than just 2 × 2 (exposed, not exposed; diseased, not diseased); there may be more than two exposure levels, for example, and more than one relative risk may need to be estimated. If several relative risks are being estimated, calculation of the statistical significance of their difference from unity may need to be adjusted, and similarly the estimation of confidence limits for each relative risk may need to take account of the fact that several sets of such confidence limits are being constructed simultaneously. Since the relative risks are not independent of each other, the calculation of the required adjustment to the significance levels is not straightforward, although some results for simultaneous confidence intervals for the cross-product ratios are available.36-38 In practice only the simplest methods are likely to be understood and appropriately used in epidemiological studies. Simultaneous estimation—calculation of simultaneous confidence intervals for example—can be 'a puzzling subject' even for eminent statisticians.37
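For several odds ratios estimated from the same study, one crude route to simultaneous intervals is to widen each interval to level α/k, a Bonferroni-style adjustment. The sketch below uses the usual large-sample standard error of the log odds ratio from the 2 × 2 cell counts; since it ignores the dependence between the ratios, the joint coverage is conservative at best. The function name and example counts are illustrative:

```python
import math
from statistics import NormalDist

def simultaneous_odds_ratio_cis(tables, family_alpha=0.05):
    """Bonferroni-adjusted confidence intervals for k odds ratios.
    Each table is (a, b, c, d): exposed cases, exposed controls,
    unexposed cases, unexposed controls."""
    k = len(tables)
    z = NormalDist().inv_cdf(1.0 - family_alpha / (2.0 * k))  # adjusted quantile
    results = []
    for a, b, c, d in tables:
        log_or = math.log((a * d) / (b * c))
        se = math.sqrt(1/a + 1/b + 1/c + 1/d)   # large-sample SE of log OR
        results.append((math.exp(log_or),
                        math.exp(log_or - z * se),
                        math.exp(log_or + z * se)))
    return results

# two exposure levels compared with the same referent (illustrative counts)
cis = simultaneous_odds_ratio_cis([(20, 80, 10, 90), (30, 70, 10, 90)])
```

Each interval is wider than its unadjusted counterpart, which is the price paid for the simultaneous coverage statement.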

One approach to the problems of simultaneous inference or multiple comparisons in analysing study results is to specify an error rate for the whole of a family, or set of tests, and to 'budget' or 'parcel out' that rate to the various statements in the family.38,39 This ensures that the occurrence of spurious significant results will be limited. However, the price paid may be high since some (usually all) of the statement error rates will need to be small. In other words, at least some of the individual tests will have to be carried out at high significance levels and low power, especially if the number of statements in the family is large. The possibility thus arises that real effects will be ignored because they do not reach such extreme levels of significance.

In fact, the error rates or significance levels at which tests are to be carried out are specified before the analysis begins in relatively few epidemiological studies, although there are exceptions among large a posteriori studies of mortality data.17,40 Instead, the p-values ('exact significance levels') of each of the differences examined (eg excesses of observed over expected deaths in a cohort study) are calculated.

The question of what constitutes a family of tests is addressed by several authors.33,39,41 The more tests included in a family, the greater the problem of disregarding as non-significant individually significant results because of their part in the larger family of tests performed.

Cole42 expresses some related doubts clearly:

'The fact that many comparisons have been made and, thus, that some may be expected to be significant by chance alone, is supposed to detract from each of the p-values obtained. It is the same as saying that an association is penalized because it emerged in a large, rather than in a small study. This is bothersome because, under a null state, the p-value has a 5% chance of taking on the value of 0.05 by chance alone whether it relates to the only variable evaluated in a study or one of hundreds.'

He suggests that the view that in such circumstances the p-values should be adjusted upwards stems from a mistaken analogy with a different sort of 'multiple comparison' problem. Such problems have received a great deal of attention;33,34 their essential feature is that the various comparisons made are not independent of one another. Thus, for example, if the regional patterns of mortality rates were investigated by comparing the rates in each pair of regions separately (North with South, South with West, West with North, etc) a series of non-independent comparisons would be made, and the p-values obtained in such comparisons would need to be adjusted upwards accordingly. However, comparisons of this kind are rarely made in the analysis of epidemiological studies, although the analyses often do include sets of non-independent comparisons, for example, those obtained by analysing data relating to two sets of subgroups of population, the first defined by age and the second by, say, exposure level, which may be correlated with age. Whilst some adjustment of the p-values so obtained may, in principle, be appropriate, few of the methods of adjustment described33 are directly applicable. In many analyses, however, this will not be the major problem of interpretation, and its importance declines if less emphasis is placed on (the exact value of) the significance levels obtained.

Many of the available adjustment methods are based on specific distributional or independence assumptions.33 We can, however, fall back on rather more crude methods in a wide variety of circumstances. One of the best known of such methods is use of the Bonferroni inequality: if the error rate (significance level) of each statement is α, the family error rate α′ satisfies α′ ≤ nα, where n is the number of statements in the family. The inequality α′ ≤ nα is in fact remarkably close to equality provided both α and n are small. There is, of course, no need for all the statement error rates to be equal to α—some could be smaller if some others were larger—but this is the usual choice by default.

It follows that the significance level suggested for each statement if the family level is to be the conventional 0.05 is 0.05/n. Implicit use of the Bonferroni inequality can be found in epidemiological and related studies and informal use of such ideas and methods in assessing results of such studies is even more widespread. Its use is now almost standard in studies testing multiple associations between HLAs and disease.43 A decision always needs to be made about the balance between the desirability of restricting the number (or, in a variation of the method,44 the proportion) of false significances found overall in a study and that of setting the significance levels for individual tests at a reasonable (not too high) level. A related method is described by Sidak.45
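Concretely, the default Bonferroni recipe divides the family level evenly among the tests; under independence the resulting family error rate sits just under the target, illustrating the near-equality of the inequality for small α and n. A minimal sketch:

```python
def bonferroni_per_test_level(family_alpha, n_tests):
    """Per-test significance level giving family error rate <= family_alpha."""
    return family_alpha / n_tests

def exact_family_rate(per_test_alpha, n_tests):
    """Exact family error rate when the n tests are independent."""
    return 1.0 - (1.0 - per_test_alpha) ** n_tests

a = bonferroni_per_test_level(0.05, 10)   # 0.005 per test
achieved = exact_family_rate(a, 10)       # about 0.0489, just under 0.05
```

The cost is visible in the per-test threshold: each of the 10 tests must now reach the 0.5% level, with a corresponding loss of power against real effects.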

The significance levels for individual tests may be extreme if n is at all large. The danger of rejecting 'real' differences is thus again seen to be the price paid for limiting the overall error rate. The uncertainty about the definition of 'family' and hence the choice of n arises again. Further, erroneous results may be obtained if the Bonferroni approach is used where the definition of the family is made ad hoc, after the data have been inspected for 'interesting' results.46 Tukey suggests15 that a compromise be made between counting all possible tests and counting only those interesting ones actually investigated. Similar points are made by Mantel and Haenszel,47 and Mantel48 offers further guidance in the use of the Bonferroni method.

Other techniques which may help in dealing with the problem of multiple inferences in epidemiological studies do exist. For example, it may be helpful to test the significance of the most extreme among a set of values of a particular measure of interest, such as a relative risk, occurring in a particular study rather than testing each of the measures separately. This is illustrated by the a posteriori analysis carried out in a large scale review of standardized mortality ratios in a set of occupational groups.17 Nonetheless, this method still yields a significance test, with the associated problems already noted.

A more useful set of techniques for scanning large sets of results in the hope of detecting real effects represented therein is provided by graphical methods of probability plotting. These methods are reviewed by Gerson49 and Barnett50 among others, and a good illustration of the methods is given by Hills51 who applies half-normal plotting techniques to the determination of which coefficients in a large correlation matrix are too large to have been obtained from an underlying population in which the correlation is zero. In the same way these techniques could help to identify, say, those standardized mortality ratios, among those in a large set resulting from an epidemiological study or review, which deviate enough from their values under the null hypothesis to be considered unlikely to have been drawn from a population in which that null hypothesis is true. It should not be necessary to emphasize that whilst this technique is a very useful way of looking at a large set of results, it is not a definitive way of detecting real effects.
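A half-normal plot of this kind needs only the expected half-normal order statistics against which to plot the sorted absolute test statistics; points that rise well above the straight line through the bulk are the candidates for real effects. The sketch below computes approximate plotting positions; the 0.375/0.25 continuity constants are one common convention, not the only one:

```python
from statistics import NormalDist

def half_normal_plotting_positions(n):
    """Approximate expected order statistics of |Z|, Z standard normal,
    for plotting n sorted absolute statistics against."""
    nd = NormalDist()
    return [nd.inv_cdf(0.5 + 0.5 * (i - 0.375) / (n + 0.25))
            for i in range(1, n + 1)]

positions = half_normal_plotting_positions(100)
# plot the sorted |z|-statistics against these positions; points far
# above the line through the origin suggest non-null effects
```

Under the overall null the plotted points should lie close to a straight line of unit slope, so the display makes the whole family of results visible at once instead of forcing a test-by-test verdict.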

IS THERE REALLY A PROBLEM?

The problems of simultaneous inference in epidemiological studies appear severe when strong emphasis is placed on the significance level of results in a particular study. Unfortunately, such an emphasis is not uncommon, in part because, as we have indicated above, significance tests and 'p-values' offer a simple summary of the results of a study, and an apparently mechanical or objective way of evaluating these results. The pressures to draw clear cut conclusions from epidemiological studies are often substantial. In such circumstances the issue of whether and how to adjust p-values for the effects of simultaneous inference naturally seems important; whilst some people may be happy to regard any nominally significant result as evidence of a real effect (a real excess of mortality, for example), others will be happy to pay attention only to adjusted values, since these are less likely to be 'significant'.

A simplistic and mechanistic emphasis on the use of significance testing and 'p-values' in epidemiological studies is to be deprecated on grounds much more general than that it adds to the severity of a 'simultaneous inference' problem. If such an emphasis is accepted, definitive answers to questions of association between exposure and disease, or, worse, causality, are sought on the basis of the results of a single study, and significance levels are relied upon as an adequate summary of its results.

On the contrary, several authors (see for example42,47) have emphasized the need to base such inferences on a collation of evidence from several studies. A single study provides only leads, or at most only partial evidence, and inference will be based on the degree to which the results of different studies are in agreement, reproducible and consistent. Susser52 discusses the logic of and procedure for such inferences in detail, with special emphasis on criteria of judgement to be used for the inference of a causal relationship between a factor and a disease, including the time sequence of factor and disease, the consistency of the associations, their strength and specificity, and the coherence of the explanation offered.

When evidence from several studies is to be weighed in reaching a conclusion, the results of an individual study are of comparatively less importance, and as a result so also is the 'simultaneous inference' problem. The adequacy of p-values as a summary of results may be questioned, and the preferability of an estimative approach has already been discussed. A more radical approach to the analysis of epidemiological studies is offered by the use of Bayesian4,53 methods. In principle such methods would seem ideal for the analysis of epidemiological study data. Relevant prior knowledge of the variables in question (for example, estimates obtained from previous studies of death rates in a high risk group) can be modified, by means of well-specified

rules, in the light of data from the current study, to yield posterior estimates of the variables. If necessary these estimates may be used as prior estimates for a further study, and so on. This seems a good model of how overall judgements of associations between exposure and disease are formed, at least if a sequence of related studies is carried out. Although there are limitations on the types of data for which the Bayesian approach has been developed, techniques relevant to many types of epidemiological study data are known. For example, Bayesian methods for the analysis of data in the form of a contingency table are presented by Lindley.53 The limited range of such methods appropriate for and readily available to the epidemiologist remains a disadvantage of the approach. It should, however, be clear that the alternative, conventional frequentist inference, can also pose severe problems of interpretation, for example when undue reliance is placed on significance testing and many tests are to be made. The solution lies in the realization, by all those who need to interpret the results of epidemiological studies, that statistical methods are essential tools, but cannot alone provide everything required to that end.
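
For the death-rate example just mentioned, the flavour of such well-specified updating rules can be sketched with a conjugate gamma-Poisson model (the figures and the function are illustrative assumptions, not taken from any study cited here):

```python
def update_gamma_poisson(a, b, deaths, person_years):
    """Bayesian conjugate update: a Gamma(a, b) prior on a death rate,
    combined with Poisson data, yields a Gamma(a + deaths, b + person_years)
    posterior."""
    return a + deaths, b + person_years

# hypothetical prior from an earlier study: 20 deaths in 10 000 person-years
a, b = 20.0, 10_000.0
# the current study observes 35 deaths in 12 000 person-years
a, b = update_gamma_poisson(a, b, deaths=35, person_years=12_000)
posterior_mean = a / b   # pooled rate of 2.5 per 1000 person-years
# the posterior may in turn serve as the prior for the next study
```

The posterior mean lies between the prior rate and the current study's rate, weighted by the information each contributes, which is exactly the sequential accumulation of evidence across related studies that the text describes.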

ACKNOWLEDGEMENTS

We gratefully acknowledge the helpful comments and encouragement we have received from several colleagues, notably Professors M J R Healy and John Fox, and from a referee in the preparation of this paper.

REFERENCES

1 Selvin H C and Stuart A. Data-dredging procedures in survey analysis. American Statistician 1966; 20: 20-3.
2 Atkins E and Jarrett D. The significance of 'Significance Tests'. In: Demystifying Social Statistics. Irvine J, Miles I and Evans J (eds). London, Pluto Press, 1979.
3 Schwartz D, Flamant R and Lellouch J. Clinical Trials. (Translated by M J R Healy.) London, Academic Press, 1980.
4 Barnett V. Comparative Statistical Inference. London, J Wiley, 1973.
5 Healy M J R. Does medical statistics exist? BIAS 1979; 6: 137-82.
6 Cohen J. Statistical Power Calculations for the Behavioural Sciences. New York, Academic Press, 1969.
7 Mosteller F. Problems of omission in communications. Clin Pharmacol Ther 1979; 25: 761-4.
8 Ury H K. Efficiency of case control studies with multiple controls per case: continuous or dichotomous data. Biometrics 1975; 31: 643-9.
9 Jones D R. Computer simulation as a tool for clinical trial design. Int J Biomed Comput 1979; 10: 145-50.
10 Bell C M J and Coleman D. A simulation study of occupational health monitoring systems. Division of Epidemiology, Institute of Cancer Research, Sutton, 1981.
11 Armitage P. Statistical Methods in Medical Research. Oxford, Blackwell, 1971.
12 Breslow N E and Day N E. Statistical Methods in Cancer Research. Volume 1: The Analysis of Case Control Studies. Lyon, International Agency for Research on Cancer, 1980.
13 Bailar J C and Ederer F. Significance factors for the ratio of a Poisson variable to its expectation. Biometrics 1964; 20: 639-43.
14 Colton T. Statistics in Medicine. Boston, Little Brown and Co., 1974.
15 Tukey J W. Some thoughts on clinical trials, especially problems of multiplicity. Science 1977; 198: 679-84.
16 Gardner M J. Using the environment to explain and predict mortality. J Roy Statist Soc Series A 1973; 136: 421-40.
17 Office of Population Censuses and Surveys. Decennial Supplement England and Wales 1970-72 Occupational Mortality. Series DS no 1. London, Her Majesty's Stationery Office, 1978.
18 Registrar General. Decennial Supplement England and Wales 1961 Occupational Mortality Tables. London, Her Majesty's Stationery Office, 1971.
19 Popper K R. Objective Knowledge: an Evolutionary Approach. London, Oxford University Press, 1972.
20 Alderson M R. An Introduction to Epidemiology. London, Macmillan, 1976.
21 Rushton L and Alderson M R. An epidemiological survey of eight oil refineries in the UK: Final Report. London, Institute of Petroleum, 1980.
22 Hjalmarson A, Elmfeldt D, Herlitz J, et al. Effect on mortality of Metoprolol in acute myocardial infarction. Lancet 1981; 2: 823-7.
23 D'Souza M F. Early diagnosis and multiphasic screening. In: Recent Advances in Community Medicine No. 1. Bennett A E (ed). Edinburgh, Churchill Livingstone, 1978.
24 Hill G B, Spicer C C and Weatherall J A C. The computer surveillance of congenital malformations. Br Med Bull 1968; 24: 215-18.
25 Healy M J R. Normal values from a statistical viewpoint. Bull Acad Roy Med Belg 1969; 9: 703-18.
26 Armitage P. The analysis of data from clinical trials. The Statistician 1979; 28: 171-83.
27 Armitage P. Sequential Medical Trials. Oxford, Blackwell, 1975.
28 McPherson C K. The problem of examining accumulating data more than once. N Engl J Med 1974; 290: 501-2.
29 Peto R. Distorting the epidemiology of cancer: the need for a more balanced overview. Nature 1980; 284: 297-300.
30 Doll R and Peto R. Mortality among doctors in different occupations. Br Med J 1977; 1: 1433-6.
31 Peto R and Doll R. When is significant not significant? Br Med J 1977; (letter): 259.
32 Dudley H. When is significant not significant? Br Med J 1977; (letter): 47.
33 Miller R G. Simultaneous Statistical Inference. New York, McGraw Hill, 1966.
34 Miller R G. Developments in multiple comparisons 1966-1976. J Amer Statist Assoc 1977; 72: 779-88.
35 O'Neill R and Wetherill G B. The present state of multiple comparison methods. J Roy Statist Soc Series B 1971; 33: 218-41.
36 Goodman L A. Simultaneous confidence limits for cross-product ratios in contingency tables. J Roy Statist Soc Series B 1964; 26: 86-102.
37 Cox D R. Discussion of O'Neill R and Wetherill G B's paper: The present state of multiple comparison methods. (See Reference 35.)
38 Kurtz T E, Link R F, Tukey J W and Wallace D L. Short cut multiple comparisons for balanced single and double classifications. Part 1, Results. Technometrics 1965; 7: 95-161.
39 Cox D R. A remark on multiple comparison methods. Technometrics 1965; 7: 223-4.
40 Office of Population Censuses and Surveys. Registrar General's Decennial Supplement for England and Wales 1969-1973. Area Mortality Tables. Series DS No. 3. London, Her Majesty's Stationery Office, 1980.
41 Steel R G D. Error rates in multiple comparisons. Biometrics 1961; 17: 326-8.
42 Cole P. The evolving case-control study. J Chron Dis 1979; 32: 15-27.
43 Evans C, Lewinsohn H C and Evans J M. Frequency of HLA antigens in asbestos workers with and without pulmonary fibrosis. Br Med J 1977; 1: 603-5.
44 Seeger P. A note on a method for the analysis of significances en masse. Technometrics 1968; 10: 586-93.
45 Sidak Z. Rectangular confidence regions for the means of multivariate normal distributions. J Amer Statist Assoc 1967; 62: 626-33.
46 Rodger R S. Confidence intervals for multiple comparisons and the misuse of the Bonferroni inequality. Br J Math Statist Psychol 1973; 26: 58-60.
47 Mantel N and Haenszel W. Statistical aspects of the analysis of data of retrospective studies of disease. J Natl Cancer Inst 1959.
48 Mantel N. Assessing laboratory evidence for neoplastic activity. Biometrics 1980; 36: 381-99.
49 Gerson M. The techniques and uses of probability plotting. The Statistician 24: 234-57.
50 Barnett V. Probability plotting methods and order statistics. Applied Statistics 1975; 24: 95-108.
51 Hills M. On looking at large correlation matrices. Biometrika 1969; 56: 249-53.
52 Susser M W. Causal Thinking in the Health Sciences. New York, Oxford University Press, 1973.
53 Lindley D V. Introduction to Probability and Statistics from a Bayesian Viewpoint. Part 2: Inference. Cambridge, Cambridge University Press, 1965.
