Size matters: Standard errors in the application
of null hypothesis significance testing in criminology
and criminal justice
SHAWN D. BUSHWAY* and GARY SWEETEN
Department of Criminology and Criminal Justice, University of Maryland, 2220 LeFrak Hall,
College Park, MD 20742, USA
*Corresponding author. E-mail: [email protected]

DAVID B. WILSON
Administration of Justice Program, George Mason University, Manassas, VA, USA
Abstract. Null Hypothesis Significance Testing (NHST) has been a mainstay of the social sciences for
empirically examining hypothesized relationships, and the main approach for establishing the
importance of empirical results. NHST is the foundation of classical or frequentist statistics. The
approach is designed to test the probability of generating the observed data if no relationship exists
between the dependent and independent variables of interest, recognizing that the results will vary from
sample to sample. This paper is intended to evaluate the state of the criminological and criminal justice
literature with respect to the correct application of NHST. We apply a modified version of the
instrument used in two reviews of the economics literature by McCloskey and Ziliak to code 82 articles
in criminology and criminal justice. We have selected three sources of papers: Criminology, Justice
Quarterly, and a recent review of experiments in criminal justice by Farrington and Welsh. We find that
most researchers provide the basic information necessary to understand effect sizes and analytical
significance in tables which include descriptive statistics and some standardized measure of size (e.g.,
betas, odds ratios). On the other hand, few of the articles mention statistical power and even fewer
discuss the standards by which a finding would be considered large or small. Moreover, fewer than half of
the articles distinguish between analytical significance and statistical significance, and most articles used
the term 'significance' in ambiguous ways.
Key words: criminal justice, criminology, Justice Quarterly, regression, review, significance, standard
error, testing
Introduction
Null Hypothesis Significance Testing (NHST) has been a mainstay of the social
sciences for empirically examining hypothesized relationships, and the main
approach for establishing the importance of empirical results. NHST is the
foundation of classical or frequentist statistics founded by Fisher (the clearest
statement is in Fisher 1935), and it has three key steps. First, a null hypothesis of
"no difference" or "no relationship" is established with no specific alternative
hypothesis specified. In the second step, a test statistic is calculated under a number
of distributional assumptions. Finally, if the probability of obtaining the calculated
test statistic is below a certain threshold (typically 0.05 in the social sciences), the
null hypothesis is rejected.

Journal of Experimental Criminology (2006) 2: 1–22. © Springer 2006.
DOI: 10.1007/s11292-005-5129-7

The approach is designed to test the probability of
generating the observed data if no relationship exists between the dependent and
independent variables of interest, recognizing that the results will vary from
sample to sample. A rejection of the null implies that the observed data were not
generated by simple sampling variation alone, and therefore a true difference or
relationship between the key variables exists, with only some small chance of Type
I error (i.e., concluding that there is a relationship when in fact there is none).
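These three steps can be illustrated with a short, self-contained sketch. The group means, standard deviations, and sample sizes below are invented for illustration, and a normal approximation stands in for the exact t-distribution:

```python
import math

def normal_cdf(z):
    # Standard normal CDF via the error function (stdlib only).
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def two_sample_z_test(mean1, sd1, n1, mean2, sd2, n2, alpha=0.05):
    """NHST for H0: no difference between group means (normal approximation)."""
    # Step 1: the null hypothesis is 'no difference'; no specific
    # alternative hypothesis is stated.
    # Step 2: compute the test statistic under distributional assumptions.
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    z = (mean1 - mean2) / se
    # Step 3: reject H0 if the two-tailed p-value falls below the threshold.
    p = 2.0 * (1.0 - normal_cdf(abs(z)))
    return z, p, p < alpha

# Invented example: a treatment group re-offends less often than a control group.
z, p, reject = two_sample_z_test(0.30, 0.46, 200, 0.40, 0.49, 200)
print(round(z, 2), round(p, 3), reject)
```

Note that the procedure reports only whether the null is rejected at the chosen threshold; nothing in the three steps themselves speaks to whether a difference of 0.10 is substantively large or small.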
NHST has attracted criticism from its inception (e.g., Berkson 1938; Boring
1919) and critics of the approach can be found in psychology (Gigerenzer 1987;
Harlow et al. 1997; Rozeboom 1960), medicine (Marks 1997), economics (Arrow
1959; McCloskey and Ziliak 1996), ecology (Anderson et al. 2000; Johnson 1999)
and criminology (Maltz 1994; Weisburd et al. 2003). The critics can be prone to
flowery hyperbole. Our favorite is from Rozeboom (1997), who stated that "(n)ull
hypothesis testing is surely the most bone-headedly misguided procedure ever
institutionalized in the rote training of science students" (p. 335). The critics can be
placed into three basic categories.
The first group of critics believes that the statistical properties of this approach
are flawed. For example, the approach ignores theoretical effect sizes and Type II
error. From this perspective, alternative approaches such as Bayesian statistics,
Neyman–Pearson decision theory, non-parametric tests, and Tukey exploratory
techniques should at least supplement if not replace NHST (Maltz 1994; Zellner
2004). A second group of critics takes a more philosophical approach, focusing on
the limitations of experimental and correlational designs in social science (Lunt
2004). These critics would prefer more qualitative evidence and theoretical
development and less "rank empiricism." In contrast to the first two groups, the
third group of critics is not interested in replacing NHST, but rather hopes to
correct mistakes in the application of NHST.1
This last group focuses on the difference between analytical significance and
statistical significance. To these critics, the size of the effect should drive the
evaluation of the analysis, not its statistical significance. As such, effects can be
analytically uninteresting (i.e., trivially small given the topic at hand) and
statistically significant. Effects can also be analytically interesting and statistically
non-significant. That is, tests can have low power, such that the researcher cannot
reject the null hypothesis for analytically interesting effect sizes (Cook et al. 1979;
Lipsey 1990; Weisburd et al. 2003). The goal of these reformers is to persuade
researchers to focus on analytical or substantive significance in addition to
statistical significance. In economics, this effort has been led by McCloskey and
Ziliak (1996; Ziliak and McCloskey 2004) who have conducted high profile
reviews of the lead journal in the field (American Economic Review) to document
proper and improper usage of the NHST. The movement is more advanced in
medicine and psychology (Fidler 2002; Thompson 2004) where revised editorial
standards strongly recommend that researchers move away from over-reliance on
P-values and become more focused on effect sizes and confidence intervals (APA
2001). A recent issue of the Journal of Socio-Economics devoted to this topic is the
first attempt to bring together researchers from multiple disciplines (sociology,
psychology, economics, and ecology) to discuss the problem and possible
solutions.2 In this issue, there is marked frustration at the slow adoption of good
practice despite the clear consensus about the standards of good practice.
One possible reason for the lack of real change is that misuse causes no real
harm. Elliott and Granger (2004) and Wooldridge (2004), for example, do not deny
that people do a poor job of reporting their results, but nonetheless argue that
NHST is useful for theory testing and the intelligent reader usually can decipher
the effect sizes for herself. On the other hand, Weisburd et al. (2003) provide a
compelling case that the poor use of the NHST can in fact have profound negative
consequences. Specifically, they are concerned about researchers who accept a null
hypothesis in program evaluation and act as if the effect was actually zero,
concluding that the program does not work. They sampled all program evaluations
used in the Maryland Crime Prevention Report in which null findings were
reported (Sherman et al. 1997). They then investigated whether there was enough
statistical power to reject the null hypothesis of a small but substantively
meaningful effect size of 0.2 or greater. In slightly less than half of the cases
this reasonable effect size could not be rejected, suggesting that the null finding
was not substantively the same as an effect size of zero. Or to put it another way,
tests with low power to identify reasonable effect sizes were being used to
conclude that programs did not work (see also Lipsey et al. 1985, for a similar
conclusion). It is hard to argue that such practice is not both bad statistics and
harmful for the accumulation of knowledge.
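The Weisburd et al. argument is easy to verify with a back-of-the-envelope power calculation. The sketch below uses a normal approximation and an invented sample size of 100 cases per group; it shows that such a study detects a standardized effect of 0.2 less than a third of the time, so a non-significant result is far from evidence of no effect:

```python
import math

def normal_cdf(z):
    # Standard normal CDF via the error function (stdlib only).
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def power_two_sample(d, n_per_group, z_crit=1.96):
    """Approximate power of a two-tailed two-sample test to detect a
    standardized effect size d (normal approximation, equal group sizes)."""
    se = math.sqrt(2.0 / n_per_group)   # SE of the standardized difference
    # Probability the observed effect clears the critical value when the
    # true effect is d (the far tail is negligible and ignored).
    return 1.0 - normal_cdf(z_crit - d / se)

# With 100 cases per group, a 'small' effect of d = 0.2 is detected
# less than a third of the time:
print(round(power_two_sample(0.2, 100), 2))
```

Raising the sample to several hundred cases per group pushes power above the conventional 0.8 benchmark, which is why a power analysis (or a confidence interval) is needed before a null finding can be read as "no effect."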
It is not surprising that the first formal review of the use of NHST in
criminology and criminal justice was done in the context of program evaluation.
Thompson (2004) argues that the shift to "meta-analytical" thinking (Cumming
and Finch 2001), with its focus on effect sizes and replicability, is facilitating a real
change in psychology and education research. Replicability is judged by evaluating
the stability of effects across a related literature, a comparison that requires the
use of standardized effect sizes and does not depend on statistical significance.
Meta-analytic thinking has become increasingly common in criminology and
criminal justice (Petrosino 2005; Wilson 2001). We believe, however, that the poor
practice of NHST goes beyond program evaluation and has negative consequences
for knowledge building in all areas of the field. While the change can start in
the program evaluation literature, it must ultimately pervade criminology more
broadly.
This paper is intended to evaluate the state of the criminological and criminal
justice literature with respect to the correct application of NHST. Our assessment
conceptualizes the application of NHST around four basic issues. The first two
concern the size of the coefficients of interest. This broader issue is conceptualized
as having two components: reporting the size of an effect (or presenting
information necessary to determine the size of an effect), and interpreting the size
of an effect, not merely its statistical significance. The third issue concerns the
correct interpretation of non-significant effects, such as reporting confidence
intervals or power analysis. The fourth issue focuses on basic errors in the
application of NHST, such as errors in the specification of the null hypothesis or
relying on statistical significance for the selection of variables in a multivariate
model.
To address these four issues, we adapt the instrument used in economics by
McCloskey and Ziliak (1996) and Ziliak and McCloskey (2004). We used this
instrument to code 82 articles in criminology and criminal justice selected from
three sources: Criminology, the flagship journal of the American Society of
Criminology, Justice Quarterly, the flagship journal of the Academy of Criminal
Justice Science, and a review piece by Farrington and Welsh (2005) on
experiments in criminal justice. In each case our goal is to focus on outlets
representing the best practice in a particular area of the field.
We find very similar results across the outlets. In general, we find both good and
bad practices in the field. On the one hand, most researchers provide the basic
information necessary to understand effect sizes and analytical significance in
tables which include descriptive statistics and some standardized measure of size
(e.g., betas, odds ratios). In fact, most researchers describe the size of their
coefficients in some way. On the other hand, only 31% of the articles mention
power and fewer than 10% of the articles discuss the standards by which a finding
would be considered large or small. None of the articles explicitly test statistical
power with a specific alternative hypothesis. It is not surprising, therefore, that
only 40% of the articles distinguish between analytical significance and statistical
significance, and only about 30% of the articles avoid using the term Fsignificance_in ambiguous ways. In large part, research in this field equates statistical
significance with substantive significance. Researchers need to take the next step
and start to compare effect sizes across studies rather than simply conclude that
they have similar effects solely on the basis of a statistically significant finding in
the same direction as previous work. The paper proceeds in the next section by
discussing the sampling frame, followed by a discussion of the instrument, and
finally, the results.
Materials and methods
Sampling frame
We focus on the prominent articles in three discrete subfields: criminology,
criminal justice, and experimental criminology. As such, we selected articles in the
journals Criminology and Justice Quarterly in the years 2001 and 2002. We
recognize that this strategy presents more of a snapshot than an overview;
however, we would be surprised to find that practice in this two-year window was
substantially different from practice in the surrounding years. We selected all
articles that used bivariate analysis, ordinary least squares (OLS) methods (e.g.,
OLS regression, analysis-of-variance), or logistic regression. Although NHST is
also appropriate for the broader class of non-linear models, we followed
McCloskey and Ziliak (1996) by focusing on the simplest models in common
usage.
There were 66 published articles in 2001–2002 in Criminology, and of those, 32
met our eligibility criteria. There were 62 articles in 2001–2002 in Justice
Quarterly and of those, 32 met our eligibility criteria. The most commonly
excluded analyses were hierarchical, structural equation, tobit and count models.
Because we also wanted to include program evaluations in our analysis, we used
the recent Farrington and Welsh (2005) review of experimental studies as our
sampling frame. We would have preferred to pick a journal like the Journal of
Experimental Criminology rather than a review article. However, this journal is
only in its second year, and there was no other unified source of experiments in
criminology. The Farrington and Welsh review, published in this journal,
represented what we believe is the definitive list of experiments in criminology.
We focused on articles that had been published in journals since 1995. Of these 27
articles, we were able to acquire 18 through the University of Maryland library.
Because these articles met the criteria for inclusion in Farrington and Welsh’s
(2005) review, we assume that they represent the state of experimental
criminology, and are not significantly different from the other nine articles.
Coding protocol
The original instrument by McCloskey and Ziliak (1996) had 19 questions. We
eliminated seven questions for one of three reasons: (1) we felt that McCloskey
and Ziliak were taking an extreme position, such as when they requested
simulations to determine if regression coefficients were reasonable, (2) we felt
the question was redundant, or (3) we judged the question to be ambiguous,
making it difficult to code consistently across studies. We also added three
questions, one to explicitly test the issue raised by Weisburd et al. (2003) with
respect to accepting the null hypothesis, and two others to address the use of
confidence intervals to aid in the interpretation of effect sizes, particularly for null
findings. All questions were coded as one for good practice and zero for bad
practice. The questions and coding conventions, grouped by the four main issues,
were:
A. Reporting effect size, or the information necessary to determine effect size
1) Were the units and descriptive statistics for all variables used in bivariate
and multivariate analyses reported? Adequate reporting of descriptive statistics is
necessary to fully assess the magnitude of effects. McCloskey and Ziliak insisted
only on the display of means, but differences between means can only be properly
interpreted in the context of sample variability. As such, to be considered good
practice, we required that means and standard deviations be reported for
continuous variables. Furthermore, we insisted that studies include descriptive
statistics for all variables included in the analyses. A number of studies provided only
partial information and therefore were not given credit for good practice on this
item. In this and other items, one could argue that this 'mistake' is quite minor.
Nonetheless, this type of omission makes understanding the substantive significance
of findings difficult, and complicates the comparison of effect sizes across
studies. We did not determine if the sample sizes in the descriptive statistics table
matched the sample sizes in the analysis although technically this should be the
case.
2) Were coefficients reported in elasticity (% change Y/% change X) form or in
some interpretable form relevant for the problem at hand so that readers can
discern the substantive impact of the regressors? While we did find a few studies
that used elasticities, this practice is uncommon in criminology and criminal
justice. It is much more common to see standardized betas, odds ratios, or other
effect size indices reported. In order to get credit for good practice, the study
author had to provide betas/odds ratios/elasticities in all of the tables in which
multivariate models were reported. This question was not applicable for papers that
did not use multivariate regressions.
3) Did the paper eschew "asterisk econometrics," defined as ranking the
coefficients according to the absolute size of the t-statistics? Reporting coefficients
without a t-statistic, P-value, or standard error counts as 'asterisk econometrics.'
4) Did the paper present confidence intervals to aid in the interpretation of the
size of coefficients? Any presentation, either in tables or the text was coded as good
practice. Just as a standard deviation provides a context for interpreting a mean by
providing information about the variability in a distribution, confidence intervals
provide a context for interpreting coefficients by providing information on
precision or the plausible range within which the population parameter is likely
to fall. The use of confidence intervals as an adjunct to NHST has been widely
recommended as a method to address weaknesses in NHST (e.g., APA Task Force
on Statistical Inference, 1996).
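As a sketch of the practice items 2 and 4 call for, a 95% confidence interval can be computed directly from a reported coefficient and its standard error, and a logistic coefficient can be re-expressed as an odds ratio. The coefficient and standard error below are invented for illustration:

```python
import math

# A minimal sketch of items 2 and 4: a 95% confidence interval from a
# reported coefficient and standard error (values invented for illustration).
def confidence_interval(b, se, z=1.96):
    return (b - z * se, b + z * se)

b, se = 0.15, 0.06                 # hypothetical logistic regression coefficient
lo, hi = confidence_interval(b, se)
print(f"b = {b}, 95% CI = ({lo:.3f}, {hi:.3f})")

# The same quantities on the odds-ratio scale are often easier to interpret:
print(f"OR = {math.exp(b):.2f}, 95% CI = ({math.exp(lo):.2f}, {math.exp(hi):.2f})")
```

Here a coefficient of 0.15 corresponds to an odds ratio of about 1.16, with plausible values from roughly 1.03 to 1.31; that range, not the asterisk next to the coefficient, is what a reader needs in order to judge substantive size.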
B. Interpreting effect size, not just statistical significance
5) Did the paper discuss the size of the coefficients? Any mention of the size of
a coefficient in substantive terms in the text of the paper was coded as good
practice. This coding decision was more lenient than that applied by McCloskey
and Ziliak, but nonetheless we still found articles which failed this question.
Simply restating the betas from the table in the text was not sufficient.
6) Did the paper discuss the scientific conversation within which a coefficient
would be judged 'large' or 'small'? In other words, did the authors explicitly
consider what other authors had found in terms of effect size or what standards
other authors had used to determine importance? Time after time, authors claimed
that their results were similar to prior results in the literature solely on the basis of
a statistically significant finding in the same direction as previous findings. This
question required that some attempt was made to compare effect size across studies
in an attempt to build knowledge. One could argue that this exercise is somewhat
futile in a field like criminology where many of the variables are unique scales or
otherwise constructed variables without inherent meaning, and where treatments
are applied to unique populations. But a comparison of betas or odds ratios is
informative and might encourage a more standardized approach to measurement.
Moreover, we find it hard to understand how criminological understanding can be
advanced simply by knowing the sign and significance of an effect with no
substantive understanding. This is particularly problematic in theory testing, where
every theory has found support in the form of a significant coefficient in a reduced
form model on the variable or variables thought to best represent the theory. Some
index of the size of the observed effect is critical to understanding the importance
and theoretical implications of a finding.
7) After the first use, did the paper avoid using statistical significance as the
only criterion of importance? The other most common reference was to measures
of fit such as R2.
8) In the conclusion section, did the authors avoid making statistical
significance the primary means for evaluating the importance of key variables in
the model? In other words, did the authors keep statistical significance separate
from substantive meaning? This was coded as 'not applicable' if the paper did not
center on key independent variables (e.g., exploratory analyses).
9) Did the paper avoid using the word "significance" in ambiguous ways,
meaning "statistically significant" in one sentence and "large enough to matter for
policy or science" in another? We conducted an electronic text search for the word
fragment 'signific' and evaluated each usage to determine if the usage was clear.
This item was coded as not applicable if no use of the word fragment 'signific' was
found in the article. We did not automatically code a study as using 'significance'
ambiguously if 'significant' was used without qualification (statistical v. substantive);
rather, use of the term had to be consistent (if statistically significant was the
default use, then a qualifier must be used if the authors meant 'substantively
significant'). One ambiguous usage was enough to get a 'bad practice' score on this
question.
C. Interpreting statistically non-significant effects
10) Did the paper mention the power of a test? We coded two types of power
discussions. The first type involved some discussion of power (usually sample size)
and its impact on parameter estimates. The second type of power discussion was an
actual test of power in the study. We had some concerns that this question puts
undue emphasis on post-hoc power analysis. While appropriate in some cases,
post-hoc power analysis can lead to mistakes in interpretation because it requires
the assumption of a 'known' or hypothetical effect size. A more general approach
advocated in psychology and statistics involves the presentation of confidence
intervals around the coefficients so readers can observe the range of values
consistent with the analysis (APA 2001; Hoenig and Heisey 2001). This practice has
yet to see widespread use in criminology and criminal justice, but we believe that
confidence intervals could easily be presented in most papers.
11) Did the paper make use of confidence intervals to aid in the interpretation
of null findings? A statistically non-significant effect does not, in and of itself,
provide a basis for accepting the null. Recall that NHST assumes the truthfulness
of the null and then determines the probability of the observed data given that
assumption. A statistically non-significant effect merely means that the data could
reasonably occur by chance if the null hypothesis were true. This
establishes the plausibility of the null but little more. It is fundamentally a weak
conclusion. By providing a range of plausible values for the population effect, a
confidence interval greatly facilitates the interpretation of a null finding. A
confidence interval that includes the null and does not include a substantively
meaningful value provides evidence that the effect is functionally null. A large
confidence interval that includes substantively meaningful values despite being
statistically non-significant, however, leaves open the possibility that the null is not
only false but that a genuine effect of substantive size exists. See Rosenthal and
Rubin (1994) for an interesting discussion of this issue.
12) Did the paper eschew 'sign econometrics', meaning remarking on the sign
but not the size or significance of the coefficients? The most common form of "sign
econometrics" is a discussion of the sign of a non-significant coefficient without a
larger justification. A larger justification would include a statement that the effect
sizes were large but not statistically significant because sample sizes were small.
This question is explicitly focused on researchers who place a premium on
statistical significance but report sign as if it is independent of statistical sig-
nificance. The only case where researchers are justified in reporting the sign of a
non-significant coefficient is when it is substantively meaningful, and there are
sample limitations (Greene 2003, Ch. 8). In general this question applies only to
non-significant coefficients, but we also coded cases where researchers reported a
comparison of two (significant) coefficients without an explicit hypothesis test. In
this case, saying that coefficient A is bigger than coefficient B without considering
statistical significance is focusing on the sign of the difference.
13) In the conclusions, did the authors avoid interpreting a statistically non-
significant effect with no power analysis or confidence interval as evidence of no
relationship? We developed this question based on Weisburd et al. (2003). Con-
cluding there is no effect was justified if a confidence interval does not contain
a meaningful effect. This was only coded for papers that were dealing with ex-
plicit treatments (the Farrington sample), which was directly comparable to the
Weisburd et al. sample.
D. Avoiding basic errors in the application of NHST
14) Were the proper null hypotheses specified? The most common null
hypothesis is that the coefficient is zero. In fact, this is usually a default position
for researchers who estimate a simple reduced form regression without very much
structure. Yet, this may not be the relevant null hypothesis. In economics, theory
often makes an explicit prediction about the size of a coefficient. We can think of
only one case in criminology where this might be true: in sentencing research,
researchers have begun including the presumptive sentence from sentence
guidelines (Engen and Gainey 2000). If judges follow the guidelines with random
variation, the coefficient should be 1. However, criminologists do sometimes want
to compare coefficients across groups, which implies a non-zero null.
15) Did the paper avoid choosing variables for inclusion solely on the basis of
statistical significance? The standard logic, fairly common in the social sciences, is
that if variables are statistically significant, they should be included even without
theoretical justification. But this approach essentially equates statistical signif-
icance with substantive significance. Variables should be included because theory
suggests that this is the process that generates the data. We coded as bad practice
papers for which the only justification for the inclusion of variables was the finding
of significance in prior studies. We admit to some ambivalence on this point
because it seems like something of a semantic point, but it is nonetheless true that
this approach is logically flawed. A much more problematic practice is the use of
stepwise regression. Simulations have shown that stepwise regression can lead to
statistically significant and strong results for variables that are unrelated to the
dependent variable (Freedman 1983). We coded as bad practice any article which
excluded control variables because of a lack of statistical significance.
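Freedman's (1983) result is easy to reproduce. The simulation below is a simplified stand-in for the screening step of stepwise selection (sample sizes and thresholds invented for illustration): it bivariately tests 50 pure-noise predictors against a pure-noise outcome and counts how many clear |t| > 1.96 by chance alone.

```python
import math
import random

def sig_noise_predictors(n=100, k=50, seed=0):
    """Test k pure-noise predictors against a pure-noise outcome, one at a
    time, and count how many are 'statistically significant' by chance."""
    rng = random.Random(seed)
    y = [rng.gauss(0, 1) for _ in range(n)]
    count = 0
    for _ in range(k):
        x = [rng.gauss(0, 1) for _ in range(n)]
        # Pearson correlation and its t-statistic.
        mx, my = sum(x) / n, sum(y) / n
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sxx = sum((a - mx) ** 2 for a in x)
        syy = sum((b - my) ** 2 for b in y)
        r = sxy / math.sqrt(sxx * syy)
        t = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
        if abs(t) > 1.96:
            count += 1
    return count

# At alpha = 0.05 we expect roughly 50 * 0.05 = 2.5 'significant' noise
# predictors per screen; averaging over replications confirms it.
avg = sum(sig_noise_predictors(seed=s) for s in range(100)) / 100
print(avg)
```

A model built by retaining whichever of these predictors crossed the threshold would report real-looking coefficients for variables that are, by construction, unrelated to the outcome.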
Each article was coded by two coders.3 Two coders were responsible for each set
of articles, so while the identity of coders switched between outlets, they remained
constant within an outlet. The concordance rate was 76% in Criminology, 73% in
Justice Quarterly, and 72% in the Farrington articles. After coding independently,
the coders met to reconcile their decisions. The reconciled coding is reported in
this paper.
Results
The results of the survey suggest that researchers in the field of criminology and
criminal justice are not applying NHST blindly, with a majority of studies pro-
viding information about the size of effects. We found many examples of good
practice and we also found that some authors were quite explicit in their un-
derstanding of the limitations of NHST. However, we also found many examples of
bad practice. For example, most researchers failed to discuss the size of coefficients
in substantive terms or to clearly distinguish statistically significant effects from
substantively significant ones.
We provide the distribution of scores across the items in Table 1. The scores are
very similar across outlets. The average paper received a score of 7. Several papers
received a high score of 10. The lowest was a paper with a score of 2.
Table 1. Descriptive statistics for the number of correct scores by article source.

Source                  Mean  Median  Minimum  Maximum  SD   Percent  N
Justice Quarterly       7.0   7       2        10       1.8  54.5%    32
Criminology             6.8   7       2        10       2.0  52.9%    32
Farrington experiments  7.1   7       2        10       1.8  52.0%    18
Total                   6.9   7       2        10       1.8  53.3%    82

Percent reflects the mean percent correct for applicable items.

Table 2 provides the percentage of studies correctly addressing each item. The
(relatively) good news is that most researchers in criminology presented statistical
information necessary for an assessment of size. More specifically, roughly three-
quarters of the authors reported some standardized version of their coefficient in
order to facilitate interpretation,4 and over two-thirds provided descriptive statistics
for all of the variables used in their analysis. The sole exception was confidence
intervals, which were reported in only three of the studies in our sample. It would clearly
be better if 100% of authors provided these basic facts; indeed, over a third of these
studies might need to be excluded from a meta-analysis because of the lack of
basic descriptive statistics. Nonetheless, our results did show that the majority of
articles provide the basic building blocks for a comparison of effect size. These
building blocks provide a starting point for the conversation about size as well as
statistical significance. It is also encouraging to note that these results are very
Table 2. Survey results, full sample (N = 82).

Presented statistical information necessary for a determination of the size of an effect
1. Were the units and descriptive statistics reported for all variables used in bivariate and multivariate analyses? 64.6%
2. Were coefficients reported in elasticity form or some other interpretable form relevant for the problem at hand, so that readers could discern the substantive impact of the regressors? 76.1%
3. Did the paper eschew 'asterisk econometrics', defined as ranking the coefficients according to the absolute size of the t-statistics? 57.3%
4. Did the paper present confidence intervals to aid in the interpretation of the size of coefficients? 3.7%

Interpretation informed by the size of a coefficient, not merely its statistical significance
5. Did the paper discuss the size of the coefficients? 75.0%
6. Did the paper discuss the scientific conversation within which a coefficient would be judged 'large' or 'small'? 9.8%
7. After the first use, did the paper avoid using statistical significance as the only criterion of importance? 75.6%
8. In the conclusion section, did the authors avoid making statistical significance the primary means for evaluating the importance of key variables in the model? 44.2%
9. Did the paper avoid using the word "significance" in ambiguous ways, meaning "statistically significant" in one sentence and "large enough to matter for policy or science" in another? 31.7%

Correctly handled non-significant effects
10. Did the paper mention the power of a test? 30.5%
11. Did the paper make use of confidence intervals to aid in the interpretation of null findings? 0.0%
12. Did the paper eschew 'sign econometrics', meaning remarking on the sign but not the size or significance of the coefficients? 58.0%
13. In the conclusions, did the authors avoid interpreting a statistically non-significant effect with no power analysis as evidence of no relationship? 55.6%

Avoided basic errors in the application of statistical significance testing
14. Were the proper null hypotheses specified? 95.1%
15. Did the paper avoid choosing variables for inclusion solely on the basis of statistical significance? 78.0%
similar to the results from the most recent Ziliak and McCloskey (2004) review in
economics, which suggests that the problems in criminology are shared with at
least one other social science.
One area where authors in criminology struggled was in interpreting their
results with respect to the size of effects. Despite a fairly common discussion of
size in the results section, many authors ultimately fell back on statistical
significance as the ultimate arbiter of importance. Perhaps then it is not surprising
that two-thirds of the authors used some form of the word 'significance' in an
ambiguous way. It was common to find statements that a variable 'achieved
significance' or became 'more/less significant' in the presence of additional
variables. Variables were described as being 'modestly significant' or 'highly
significant.' In general, the most common error was implying that the strength of the
relationship was determined by the size of the p-value. This is simply not true, and
can be quite misleading if large samples lead researchers to stress small effects
with very large t-values. In keeping with this error, nearly half of the authors
practiced 'asterisk econometrics,' rank ordering the coefficients according to the
size of the test statistics. There is simply no scientific justification for equating
statistical significance with substantive significance, either in relative or absolute
terms. Statistical significance should be the starting point in any discussion of
effect size, not the end point.
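The arithmetic behind this warning is simple: with the raw effect and standard deviation held fixed, the t-statistic grows with the square root of n, so a large sample can make a trivial effect look overwhelming. A minimal sketch with hypothetical numbers:

```python
import math

def t_statistic(effect, sd, n):
    """t for a one-sample mean: raw effect divided by its standard error."""
    return effect / (sd / math.sqrt(n))

# The same small raw effect (0.05 standard-deviation units) at two sample sizes:
t_small = t_statistic(effect=0.05, sd=1.0, n=100)     # about 0.5: far from significant
t_large = t_statistic(effect=0.05, sd=1.0, n=68000)   # about 13: overwhelmingly "significant"
print(round(t_small, 2), round(t_large, 2))
```

The substantive effect is identical in both cases; only the sample size, and hence the p-value, differs.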
Very few of the authors (9.8%) presented a discussion of the standard by which
their effects would be considered large or small. Virtually none of the authors
compared the magnitude of their results with other studies, although information is
provided in most cases that would allow this comparison. Gottfredson and
colleagues (2003, evaluation sample) provide a notable exception. They compared
their estimate of the effect of drug treatment courts on recidivism to the average
effect reported in a prior meta-analysis. This simple comparison, usually absent in
other studies, allows the reader to assess the importance of the findings beyond
mere statistical significance. Repeatedly, authors claimed findings similar to prior
studies based solely on coefficients that were significant in the same direction.
Although this information is relevant, sign and significance represent a very coarse
description which could be made far more precise with a discussion of the effect
sizes in the two studies. Alternatively, theories could be developed more formally
to provide explicit predictions about the size of the coefficients.
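A comparison of the kind Gottfredson and colleagues made requires only a standardized effect size and a benchmark. A minimal sketch with hypothetical numbers (the outcome rates and meta-analytic average below are illustrative, not taken from the studies discussed):

```python
def cohens_d(mean_treatment, mean_control, sd_pooled):
    """Standardized mean difference (Cohen's d)."""
    return (mean_treatment - mean_control) / sd_pooled

# Hypothetical recidivism outcomes and a hypothetical meta-analytic benchmark:
study_d = cohens_d(mean_treatment=0.32, mean_control=0.45, sd_pooled=0.50)
benchmark_d = -0.20
print(f"study d = {study_d:.2f}; benchmark d = {benchmark_d:.2f}")
```

Reporting the study effect next to the benchmark lets readers judge whether the result is typical, unusually large, or unusually small, rather than merely "significant."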
Non-significant effects are problematic in the social sciences. Almost half of the
papers discuss the direction of non-significant effects ('sign econometrics') without
attending to the issue of size or adequately addressing the non-significant nature of
the effect. We found many cases where the authors adhered to strict NHST but
then focused on the sign of the coefficient. For example, in one Criminology article
the authors stated that the original bivariate relationship, although not significant,
was positive. Then, they report that when additional variables were included, the
sign of the relationship changed, "suggesting that the (initial) relationship was
spurious." But according to NHST, the relationship was spurious even without the
additional variables. In neither case do we know for certain that the relationship is
substantively spurious. Without a high level of statistical power, substantively
meaningful effects remain plausible even though the null hypothesis cannot be
rejected. Equivocation is difficult to avoid under strict NHST.
Augmenting NHST with confidence intervals can facilitate the interpretation of
null findings (Hoenig and Heisey 2001). In the articles examined, there was
virtually no consideration of Type II error (failing to reject a false null hypothesis).
Less than a third of the authors mentioned the issue of power, and only two out of
82 articles included a power test.5 Moreover, almost half of the evaluations from
the Farrington and Welsh review treated a failure to reject the null hypothesis as
equivalent to an effect size of zero without conducting a power test or examining the
confidence interval. None of the articles in the Justice Quarterly or Criminology
samples conducted a power test, and only one in Justice Quarterly and two in the
experimental sample provided a confidence interval. Authors simply do not
provide an analytical discussion of their ability to find differences that may be
meaningful.
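The scale of the problem is easy to see with a back-of-the-envelope power calculation. A sketch under a normal approximation (two-sample z-test, sigma = 1, two-sided alpha = 0.05; all numbers hypothetical):

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def power_two_sample(d, n_per_group, z_crit=1.96):
    """Approximate power to detect a standardized mean difference d
    between two independent groups of size n_per_group (sigma = 1)."""
    se = math.sqrt(2.0 / n_per_group)
    return normal_cdf(abs(d) / se - z_crit)

# A modest effect (d = 0.3) is more likely missed than found with a few
# dozen cases per group, but essentially certain to be detected at n = 2000.
print(round(power_two_sample(0.3, 30), 2))
print(round(power_two_sample(0.3, 2000), 2))
```

In the first case, failing to reject the null says almost nothing about whether a substantively meaningful effect exists.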
In general, the studies included in this sample did well at avoiding basic errors
in NHST. Roughly three-fourths of the studies avoided relying on statistical
significance for the selection of variables in multivariate models and most studies
correctly specified the null hypothesis. A particularly good example of the latter
was Steffensmeier and Demuth (2001). These authors used a population of all
people convicted of a felony or misdemeanor in Pennsylvania, with over 68,000
cases. First, they demonstrated that they understood that populations should not be
treated as samples by stating that, "because our data set is not a sample, but
contains all reported sentences with complete data, statistical tests of significance
do not apply in the conventional sense" (p. 160). They then argued, in a fairly
conventional way, for inclusion of the tests, but immediately showed that the
tests were not being applied unthinkingly. Specifically, they stated that
"because the number of cases included in our analysis is so large, many small
sentencing differences among groups or categories often turn out to be significant
in the statistical sense. Therefore, we place more emphasis on direction and
magnitude of the coefficients than on statistical significance levels . . ." (p. 160).
This is exactly the correct approach to take in their case. In contrast, researchers
conducting studies with small samples would be better served by focusing on
confidence intervals and interpreting the range of possible values for a population
parameter (e.g., from zero to a moderately large effect).
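A 95% confidence interval makes that range explicit. A minimal sketch (the coefficient and standard error below are hypothetical):

```python
def ci95(b, se):
    """Normal-approximation 95% confidence interval for a coefficient."""
    return (b - 1.96 * se, b + 1.96 * se)

# Small-sample estimate: b = 0.15 with se = 0.10.
lo, hi = ci95(0.15, 0.10)
# The interval runs from just below zero to a moderately large effect, so a
# non-significant result here is not evidence that the effect is zero.
print(round(lo, 3), round(hi, 3))
```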
Other examples of good practice involve the specification of a null hypothesis.
Over 90% of the articles received a positive rating on this question, even in cases
when the null was non-zero. For example, Koons-Witt (2002) wanted to explore
the impact of gender on sentencing decisions before and after the onset of
sentencing guidelines. Therefore, she correctly specified the null hypothesis as
equality between the coefficients on gender in separate models.6 Moreover,
researchers often specify models in which a mediating variable is included, and the
coefficient on the original variable is expected to decline. For example, Kleck and
Chiricos (2002) hypothesized that the coefficient on unemployment in a regression
of crime and unemployment would change when explicit measures of motivation
and opportunity were included. In this case, they correctly specified the null
Table 3. Results by item and source of study.

Columns: Criminology 2001–2002 (N = 32); Justice Quarterly 2001–2002 (N = 32); Farrington experiments (N = 18).

Presented statistical information necessary for a determination of the size of an effect
1. Were the units and descriptive statistics reported for all variables used in bivariate and multivariate analyses? 72.7% / 59.4% / 55.6%
2. Were coefficients reported in elasticity form or some other interpretable form relevant for the problem at hand, so that readers could discern the substantive impact of the regressors? 78.1% / 86.2% / 45.5%(a)
3. Did the paper eschew 'asterisk econometrics', defined as ranking the coefficients according to the absolute size of the t-statistics? 69.7% / 56.3% / 33.3%
4. Did the paper present confidence intervals to aid in the interpretation of the size of coefficients? 0.0% / 3.1% / 11.1%

Interpretation informed by the size of a coefficient, not merely its statistical significance
5. Did the paper discuss the size of the coefficients? 65.6% / 87.5% / 70.6%
6. Did the paper discuss the scientific conversation within which a coefficient would be judged 'large' or 'small'? 6.1% / 3.1% / 27.8%
7. After the first use, did the paper avoid using statistical significance as the only criterion of importance? 66.7% / 84.4% / 72.2%
8. In the conclusion section, did the authors avoid making statistical significance the primary means for evaluating the importance of key variables in the model? 35.7% / 38.7% / 66.7%
9. Did the paper avoid using the word "significance" in ambiguous ways, meaning "statistically significant" in one sentence and "large enough to matter for policy or science" in another? 24.2% / 37.5% / 33.3%

Correctly handled non-significant effects
10. Did the paper mention the power of a test? 30.3% / 28.1% / 33.3%
11. Did the paper make use of confidence intervals to aid in the interpretation of null findings? 0.0% / 0.0% / 0.0%
12. Did the paper eschew 'sign econometrics', meaning remarking on the sign but not the size or significance of the coefficients? 57.6% / 61.3% / 50.0%
13. In the conclusions, did the authors avoid interpreting a statistically non-significant effect with no power analysis as evidence of no relationship? – / – / 55.6%

Avoided basic errors in the application of statistical significance testing
14. Were the proper null hypotheses specified? 90.9% / 96.9% / 94.4%
15. Did the paper avoid choosing variables for inclusion solely on the basis of statistical significance? 75.8% / 71.9% / 88.9%

(a) Only 11 of the 18 experiments were coded on this question because the results were simple mean comparisons. Of the remaining seven, regressions were occasionally run to support the main finding, and several articles did not report these regressions in tabular form.
hypothesis as the coefficient in the original equation. Throughout the discussion,
Kleck and Chiricos explicitly focused on the change in the coefficient. However,
we did find examples in which researchers incorrectly concluded that the effect had
been mediated because the coefficient was no longer significantly different from
zero.
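For independent equations, the equality-of-coefficients null can be checked with the z-statistic form discussed by Brame et al. (1998), z = (b1 − b2) / sqrt(se1² + se2²). A sketch with hypothetical coefficients (e.g., a gender effect estimated before and after guidelines):

```python
import math

def coef_equality_z(b1, se1, b2, se2):
    """z-statistic for H0: b1 = b2 across two independent equations."""
    return (b1 - b2) / math.sqrt(se1 ** 2 + se2 ** 2)

# Hypothetical coefficients from the two models:
z = coef_equality_z(b1=-0.40, se1=0.12, b2=-0.15, se2=0.10)
print(round(z, 2))
```

The point is that the test is on the difference between coefficients, not on whether each coefficient individually clears the significance threshold.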
Table 3 provides the results by source. For the most part, our results did not
vary by source. The Criminology papers were more likely to provide descriptive
statistics and avoid asterisk econometrics than Justice Quarterly papers, but Justice
Quarterly papers were more likely to talk about size and avoid ambiguous usage of
significance. In no way could it be said that authors in Criminology scored
systematically better than the authors from the two other sources. The main
difference was that the articles in the Farrington and Welsh sample were almost
seven times more likely to discuss some standard of size. This makes sense
because the experimental studies often reported program effect sizes in a research
arena with prior results. Nonetheless, the vast majority of experiments also did not
discuss the standard by which magnitude could be evaluated. In general, the
pattern across the article sources was far more similar than different. Clearly, the
problems are fairly endemic across the field and are not solely the responsibility of
any one group of substantive researchers.
Discussion
NHST is the dominant approach to drawing inferences regarding research
hypotheses in quantitative criminology and criminal justice. Although alternatives
like Bayesian inference exist, the purpose of this article is not to convince the field
to abandon NHST. Rather, we would like to encourage more thoughtful
application. In essence, we want to put NHST in its place: as a tool to facilitate
the inferential process, not as the end game for quantitative research.
The fundamental limitation of NHST is that it does not provide information
about size. As the title of this article states, size matters. To state that there is an
effect raises the question: how big? This requires attention to size and a scholarly
discussion that addresses the substantive significance of findings. Therefore, it is
not surprising that we believe our single most troubling finding was the lack of a
serious attempt in most articles to place the magnitude of the effect in a context or
even to attend to the issue of size. A research study should be placed in the context
of past work or theoretical predictions. Simply reporting the coefficient without
any attempt to validate or otherwise establish the magnitude of the effect within
the literature or policy framework risks creating a large body of independent
research with no cumulative advance in knowledge. This is particularly evident
when studies of a common research hypothesis with different sample sizes arrive at
different conclusions based solely on NHST. Without attending to size, researchers
may conclude that the empirical research base has led to an equivocal conclusion
regarding the hypothesis. However, focusing on size may tell a more consistent
story, or at least a story that is not determined by the sample size of the studies but
rather by the size of the empirical relationships examined. It is the latter, after all,
in which we are truly interested. On the positive side, many of the key ingredients
for a substantive discussion of size, like descriptive statistics and standardized
coefficients, were reported in most of the studies in our review. All of the basic
tools are there for researchers to take the next step and compare effect sizes across
studies rather than simply concluding that they have similar findings solely on
the basis of a significant finding in the same direction as previous work.
The role of sample size in determining statistical significance is also under-
appreciated. This is evident in our study in the near-total absence of serious discussion of
statistical power. In a research world in which sample sizes range from a few
dozen to 68,000, this is particularly alarming, and strikes us as fundamentally
unwise. All research designs do not have equal ability to identify the same effect
size. A discussion about the relative power of a test to identify an effect and an
awareness of the confidence intervals around an effect seems to be both reasonable
and essential for a good evaluation of the value of the study.
One potential criticism of this type of discussion about NHST comes from the well-
known econometrician Edward Leamer (2004). In his criticism of McCloskey and
Ziliak, he states that it is ultimately not size, but models, that matter. Too much
emphasis on the minutiae of NHST threatens to take the attention away from the
important question of whether the model provides insight into the question of
interest. During our coding, we were often frustrated by the lack of discussion
about the source of causal identification in the regression models, and the general
lack of understanding about the limitations of observational studies with controls
for observables to identify causality. While the criminology and criminal justice
articles were not substantially different from the economics articles with respect to
the use of NHST, we feel confident in stating that the application of causal models
based on observational data is in fact substantially more thoughtful in economics.
We were also frustrated by the lack of attention to measurement issues in many
of the papers we reviewed. Often authors would raise key issues with respect to
measurement only to proceed without addressing them. One anonymous reviewer
made a compelling argument that this problem in criminology is far more pressing
than any discussion about NHST. We do not necessarily disagree, but we
nonetheless think that issues surrounding the appropriate use of NHST deserve
attention by criminologists.
The goal of NHST is admirable: to protect against the acceptance of a research
hypothesis when the observed data can be explained by sampling variation. This
simple goal, however, has taken on a hegemonic role in the practice of social
scientific research. Critical thinking about the meaningfulness of findings in
scientific and practical terms is often lacking. Size matters. Large effects have
different theoretical and practical importance than small effects. A binary accept/
reject approach to hypothesis testing advances our field far less than approaches
that explicitly assess whether observed effects are of a size consistent with
theoretical expectations or are large enough to matter in a practical or policy
context. This requires reasoned argumentation and scientific discourse, rather than
a reliance on an arbitrary and binary decision rule (i.e., p ≤ 0.05). The former
requires greater skill and scholarly effort but also promises greater advancement
for our field.
Acknowledgement
The authors wish to thank Emily Owens for help with the coding and University of
Maryland’s Program for Economics of Crime and Justice Policy for generous
financial support. We also wish to thank Michael Maltz and two anonymous
reviewers for helpful advice. All errors remain our own.
Notes

1. It should be noted that there is no controversy about what constitutes correct use; the problem is unambiguously in the application.

2. This issue has received less attention in criminology and criminal justice than in these other disciplines. Maltz's 1994 paper in the Journal of Research in Crime and Delinquency is the best-known paper on the problems with NHST in criminology.

3. Three coders were used in the analysis. All three coders have successfully completed at least four upper-level courses in econometrics in an economics department and are at least 4th-year PhD students. All three coders read the McCloskey and Ziliak (1996) piece and participated in pilot coding and reconciling as a group.

4. Almost every author who reported logit coefficients also reported the odds ratio. We suspect this is because statistics packages now provide this information easily. However, there was some confusion about the interpretation of the odds ratio. For example, we found one paper in Justice Quarterly in which the authors interpreted the coefficients as the odds ratio, even though the odds ratio was also provided in the table. We recommend that authors simply report the change in probability associated with each x. This is far more intuitive, and several statistics packages, including Stata, now provide this information with a simple command (dlogit).

5. Even the two papers which conducted explicit power tests did not provide satisfying discussions of power. Both papers were ambiguous about the effect size for which they estimated power. Furthermore, one of the papers suggested that the low power associated with their statistical tests minimized the probability of Type II error. In fact, by definition low power is associated with a higher probability of Type II error.

6. Koons-Witt cited Brame et al. (1998) to justify the test statistic used to test the null hypothesis. Citation of this paper was very common in our sample. However, Brame et al. stated that their test is appropriate only for OLS and count models. This test can be used with logit or probit models if the researcher assumes "that both the functional form and the dispersion of the residual term for the latent response variable are identical for the two groups being compared" (p. 259, fn 11). These are strong assumptions which are unlikely to be met in most cases. These assumptions were never justified, or even stated, in the papers that compared coefficients from logit or probit models.
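The change-in-probability interpretation recommended in note 4 amounts to the logit marginal effect, beta * p * (1 - p), evaluated at a baseline probability p. A sketch with a hypothetical coefficient:

```python
import math

def logistic(z):
    """Logistic (inverse-logit) function."""
    return 1.0 / (1.0 + math.exp(-z))

def marginal_effect(beta, p):
    """Approximate change in probability per unit change in x,
    evaluated at baseline probability p (logit model)."""
    return beta * p * (1.0 - p)

# A hypothetical logit coefficient of 0.5 (odds ratio exp(0.5), about 1.65)
# implies, at p = 0.5, roughly a 12.5 percentage-point change per unit of x:
print(round(marginal_effect(0.5, logistic(0.0)), 3))
```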
References (Asterisk indicates papers in sample.)

*Agnew, R. (2002). Experienced, vicarious, and anticipated strain: An exploratory study on physical victimization and delinquency. Justice Quarterly 19, 603–632.
*Agnew, R., Brezina, T., Wright, J. P. & Cullen, F. T. (2002). Strain, personality traits, and delinquency: Extending general strain theory. Criminology 40, 43–72.
*Alpert, G. P. & MacDonald, J. M. (2001). Police use of force: An analysis of organizational characteristics. Justice Quarterly 18, 393–409.
Anderson, D. R., Burnham, K. P. & Thompson, W. L. (2000). Null hypothesis testing: Problems, prevalence, and an alternative. Journal of Wildlife Management 64, 912–923.
APA (2001). Publication manual of the American Psychological Association (5th edition). Washington, DC: American Psychological Association.
APA Task Force on Statistical Inference (1996, December). Task Force on Statistical Inference initial report. Washington, DC: American Psychological Association. Available: http://www.apa.org/science/tfsi.html.
*Armstrong, T. A. (2003). The effect of moral reconation therapy on the recidivism of youthful offenders: A randomized experiment. Criminal Justice and Behavior 30, 668–687.
Arrow, K. J. (1959). Decision theory and the choice of a level of significance for the t-test. In I. Olkin, S. G. Ghurye, W. Hoeffding, W. G. Madow & H. B. Mann (Eds.), Contributions to probability and statistics: Essays in honor of Harold Hotelling (pp. 70–78). Stanford, CA: Stanford University Press.
*Baller, R. D., Anselin, L., Messner, S. F., Deane, G. & Hawkins, D. F. (2001). Structural covariates of U.S. county homicide rates: Incorporating spatial effects. Criminology 39, 561–590.
*Baumer, E. P. (2002). Neighborhood disadvantage and police notification by victims of violence. Criminology 40, 579–616.
Berkson, J. (1938). Some difficulties of interpretation encountered in the application of the chi-square test. Journal of the American Statistical Association 33, 526–536.
*Bernburg, J. G. & Thorlindsson, T. (2001). Routine activities in social context: A closer look at the role of opportunity in delinquent behavior. Justice Quarterly 18, 543–568.
*Borduin, C. M., Mann, B. J., Cone, L. T., Henggeler, S. W., Fucci, B. R., Blaske, D. M. & Williams, R. A. (1995). Multisystemic treatment of serious juvenile offenders: Long-term prevention of criminality and violence. Journal of Consulting and Clinical Psychology 63, 569–578.
Boring, E. G. (1919). Mathematical vs. scientific importance. Psychological Bulletin 16, 335–338.
*Braga, A. A., Weisburd, D. L., Waring, E. J., Mazerolle, L. G., Spelman, W. & Gajewski, F. (1999). Problem-oriented policing in violent crime places: A randomized controlled experiment. Criminology 37, 541–580.
Brame, R., Paternoster, R., Mazerolle, P. & Piquero, A. (1998). Testing for the equality of maximum-likelihood regression coefficients between two independent equations. Journal of Quantitative Criminology 14, 245–261.
*Broidy, L. M. (2001). A test of general strain theory. Criminology 39, 9–36.
*Burruss, G. M. Jr. & Kempf-Leonard, K. (2002). The questionable advantage of defense counsel in juvenile court. Justice Quarterly 19, 37–68.
*Campbell, F. A., Ramey, C. T., Pungello, E., Sparling, J. & Miller-Johnson, S. (2002). Early childhood education: Young adult outcomes from the Abecedarian project. Applied Developmental Science 6, 42–57.
*Cernkovich, S. A. & Giordano, P. C. (2001). Stability and change in antisocial behavior: The transition from adolescence to early adulthood. Criminology 39, 371–410.
*Chermak, S., McGarrell, E. F. & Weiss, A. (2001). Citizens' perceptions of aggressive traffic enforcement strategies. Justice Quarterly 18, 365–392.
Cook, T. D., Gruder, C. L., Hennigan, K. M. & Flay, B. R. (1979). The history of the sleeper effect: Some logical pitfalls in accepting the null hypothesis. Psychological Bulletin 86, 662–679.
*Copes, H., Kerley, K. R., Mason, K. A. & Van Wyk, J. (2001). Reporting behavior of fraud victims and Black's theory of law: An empirical assessment. Justice Quarterly 18, 343–364.
Cumming, G. & Finch, S. (2001). A primer on the understanding, use and calculation of confidence intervals that are based on central and non-central distributions. Educational and Psychological Measurement 61, 532–575.
*Curry, G. D., Decker, S. H. & Egley, A. Jr. (2002). Gang involvement and delinquency in a middle school population. Justice Quarterly 19, 275–292.
*Dawson, M. & Dinovitzer, R. (2001). Victim cooperation and the prosecution of domestic violence in a specialized court. Justice Quarterly 18, 593–622.
*DeJong, C., Mastrofski, S. D. & Parks, R. B. (2001). Patrol officers and problem solving: An application of expectancy theory. Justice Quarterly 18, 31–62.
*Dugan, J. R. & Everett, R. S. (1998). An experimental test of chemical dependency therapy for jail inmates. International Journal of Offender Therapy and Comparative Criminology 42, 360–368.
*Dunford, F. W. (2000). The San Diego Navy Experiment: An assessment of interventions for men who assault their wives. Journal of Consulting and Clinical Psychology 68, 468–476.
Elliott, G. & Granger, C. W. J. (2004). Evaluating significance: Comments on "size matters". The Journal of Socio-Economics 33, 547–550.
*Engel, R. S. & Silver, E. (2001). Policing mentally disordered suspects: A reexamination of the criminalization hypothesis. Criminology 39, 225–252.
Engen, R. L. & Gainey, R. R. (2000). Modeling the effects of legally relevant and extralegal factors under sentencing guidelines: The rules have changed. Criminology 38, 1207–1230.
*Exum, M. L. (2002). The application and robustness of the rational choice perspective in the study of intoxicated and angry intentions to aggress. Criminology 40, 933–966.
Farrington, D. P. & Welsh, B. C. (2005). Randomized experiments in criminology: What have we learned in the last two decades? Journal of Experimental Criminology 1, 9–38.
*Feder, L. & Dugan, L. (2002). A test of the efficacy of court-mandated counseling for domestic offenders: The Broward experiment. Justice Quarterly 19, 343–376.
*Felson, R. B. & Ackerman, J. (2001). Arrest for domestic and other assaults. Criminology 39, 655–676.
*Felson, R. B. & Haynie, D. L. (2002). Pubertal development, social factors, and delinquency among adolescent boys. Criminology 40, 967–988.
*Felson, R. B., Messner, S. F., Hoskin, A. W. & Deane, G. (2002). Reasons for reporting and not reporting domestic violence to the police. Criminology 40, 617–648.
Fidler, F. (2002). The fifth edition of the APA Publication Manual: Why its statistics recommendations are so controversial. Educational and Psychological Measurement 62, 749–770.
*Finn, M. A. & Muirhead-Steves, S. (2002). The effectiveness of electronic monitoring with violent male parolees. Justice Quarterly 19, 293–312.
Fisher, R. A. (1935). The design of experiments. Edinburgh, Scotland: Oliver and Boyd.
Freedman, D. A. (1983). A note on screening regression equations. The American Statistician 37, 152–155.
*Garner, J. H., Maxwell, C. D. & Heraux, C. G. (2002). Characteristics associated with the prevalence and severity of force used by the police. Justice Quarterly 19, 705–746.
Gigerenzer, G. (1987). Probabilistic thinking and the fight against subjectivity. In L. Kruger, G. Gigerenzer & M. S. Morgan (Eds.), The probabilistic revolution. Vol. II: Ideas in the sciences (pp. 11–33). Cambridge, MA: MIT Press.
*Golub, A., Johnson, B. D., Taylor, A. & Liberty, H. J. (2002). The validity of arrestees' self-reports: Variations across questions and persons. Justice Quarterly 19, 477–502.
*Gottfredson, D. C., Najaka, S. S. & Kearly, B. (2003). Effectiveness of drug treatment courts: Evidence from a randomized trial. Criminology and Public Policy 2, 171–196.
*Greenberg, D. F. & West, V. (2001). State prison populations and their growth, 1971–1991. Criminology 39, 615–654.
Greene, W. H. (2003). Econometric analysis (5th edition). Upper Saddle River, NJ: Prentice-Hall.
Harlow, L. L., Mulaik, S. A. & Steiger, J. H. (Eds.) (1997). What if there were no significance tests? Mahwah, NJ: Lawrence Erlbaum Associates.
*Harmon, T. R. (2001). Predictors of miscarriages of justice in capital cases. Justice Quarterly 18, 949–968.
*Hay, C. (2001). Parenting, self-control, and delinquency: A test of self-control theory. Criminology 39, 707–736.
*Henggeler, S. W., Melton, G. B., Brondino, M. J., Scherer, D. G. & Hanley, J. H. (1997). Multisystemic therapy with violent and chronic juvenile offenders and their families: The role of treatment fidelity in successful dissemination. Journal of Consulting and Clinical Psychology 65, 821–833.
*Hennigan, K. M., Maxson, C. L., Sloane, D. & Ranney, M. (2002). Community views on crime and policing: Survey mode effects on bias in community surveys. Justice Quarterly 19, 565–587.
Hoenig, J. M. & Heisey, D. M. (2001). The abuse of power: The pervasive fallacy of power calculations for data analysis. The American Statistician 55, 19–24.
*Inciardi, J. A., Martin, S. S., Butzin, C. A., Hopper, R. M. & Harrison, L. D. (1997). An effective model of prison-based treatment for drug-involved offenders. Journal of Drug Issues 27, 261–278.
Johnson, D. H. (1999). The insignificance of statistical significance testing. Journal of Wildlife Management 63, 763–772.
*Kaminski, R. J. & Marvell, T. B. (2002). A comparison of changes in police and general homicides: 1930–1998. Criminology 40, 171–190.
*Kautt, P. & Spohn, C. (2002). Cracking down on black drug offenders? Testing for interactions among offenders' race, drug type, and sentencing strategy in federal drug sentences. Justice Quarterly 19, 1–36.
*Kempf-Leonard, K., Tracy, P. E. & Howell, J. C. (2001). Serious, violent, and chronic juvenile offenders: The relationship of delinquency career types to adult criminality. Justice Quarterly 18, 449–478.
*Killias, M., Aebi, M. & Ribeaud, D. (2000). Does community service rehabilitate better than short-term imprisonment? Results of a controlled experiment. Howard Journal 39, 40–57.
*Kingsnorth, R. F., MacIntosh, R. C. & Sutherland, S. (2002). Criminal charge or probation
violation? Prosecutorial discretion and implications for research in criminal court
processing. Criminology 40, 553Y578.
*Kleck, G. & Chiricos, T. (2002). Unemployment and property crime: A target-specific
assessment of opportunity and motivation as mediating factors. Criminology 40, 649Y679.
*Koons-Witt, B. A. (2002). The effect of gender on the decision to incarcerate before and
after the introduction of sentencing guidelines. Criminology 40, 297Y328.
*Kramer, J. H. & Ulmer, J. T. (2002). Downward departures for serious violent offenders:
Local court Bcorrections[ to Pennsylvania sentencing guidelines. Criminology 40, 897Y932.
Leamer, E. E. (2004). Are the roads red? Comments on Bsize matters.^ The Journal of
Socio-Economics 33, 555Y557.
Lipsey, M. W. (1990). Design sensitivity: Statistical power for experimental research.
Newbury Park, CA: Sage.
Lipsey, M. W., Crosse, S., Dunkle, J., Pollard, J. & Stobart, G. (1985). Evaluation: The state
of the art and the sorry state of the science. In D. S. Cordray (Ed.), Utilizing prior
research in evaluation planning (New Directions for Program Evaluation, No. 27, pp.
7Y28). San Francisco: Jossey-Bass.
Lunt, P. (2004). The significance of the significance test controversy: comments on Fsize
matters._ The Journal of Socio-Economics 33, 559Y564.
*Maguire, E. R. & Katz, C. M. (2002). Community policing, loose coupling, and
sensemaking in American police agencies. Justice Quarterly 19, 503Y536.
Maltz, M. D. (1994). Deviating from the mean: The declining significance of significance.
Journal of Research in Crime and Delinquency 31, 434Y463.
Marks, H. M. (1997). The progress of experiment: Science and therapeutic reform in the
United States 1900Y1990. Cambridge, UK: Cambridge University Press.
*Marlowe, D. B., Festinger, D. S., Lee, P. A., Schepise, M. M., Hazzard, J. E. R., Merrill, J.
C., Mulvaney, F. D. & McLellan, A. T. (2003). Are judicial status hearings a key
component of drug court? During-treatment data from a randomized trial. Criminal
Justice and Behavior 30, 141–162.
*Marquart, J. W., Barnhill, M. B. & Balshaw-Biddle, K. (2001). Fatal attraction: An analysis
of employee boundary violations in a southern prison system, 1995–1998. Justice
Quarterly 18, 877–910.
*Mastrofski, S. D., Reisig, M. D. & McCluskey, J. D. (2002). Police disrespect toward the
public: An encounter-based analysis. Criminology 40, 519–552.
*McCarthy, B., Hagan, J. & Martin, M. J. (2002). In and out of harm’s way: Violent
victimization and the social capital of fictive street families. Criminology 40, 831–865.
McCloskey, D. N. & Ziliak, S. T. (1996). The standard error of regressions. Journal of
Economic Literature 34, 97–114.
*McNulty, T. L. (2001). Assessing the race–violence relationship at the macro level: The
assumption of racial invariance and the problem of restricted distributions. Criminology
39, 467–490.
*Meehan, A. J. & Ponder, M. C. (2002). Race & place: The ecology of racial profiling
African American motorists. Justice Quarterly 19, 399–430.
*Menard, S., Mihalic, S. & Huizinga, D. (2001). Drugs and crime revisited. Justice
Quarterly 18, 269–300.
*Mills, P. E., Cole, K. N., Jenkins, J. R. & Dale, P. S. (2002). Early exposure to direct
instruction and subsequent juvenile delinquency: A prospective examination. Exceptional
Children 69, 85–96.
*Ortmann, R. (2000). The effectiveness of social therapy in prison: A randomized
experiment. Crime and Delinquency 46, 214–232.
*Peterson, D., Miller, J. & Esbensen, F.-A. (2001). The impact of sex composition on gangs
and gang member delinquency. Criminology 39, 411–440.
Petrosino, A. (2005). From Martinson to meta-analysis: Research reviews and the US offender
treatment debate. Evidence & Policy: A Journal of Research, Debate and Practice 1, 149–172.
*Piquero, A. R. & Brezina, T. (2001). Testing Moffitt’s account of adolescent-limited
delinquency. Criminology 39, 353–370.
*Pogarsky, G. (2002). Identifying "deterrable" offenders: Implications for research on
deterrence. Justice Quarterly 19, 431–452.
*Rebellon, C. J. (2002). Reconsidering the broken homes/delinquency relationship and
exploring its mediating mechanism(s). Criminology 40, 103–136.
*Rhodes, W. & Gross, M. (1997). Case management reduces drug use and criminality
among drug-involved arrestees: An experimental study of an HIV prevention intervention.
Washington, DC: National Institute of Justice.
*Richards, H. J., Casey, J. O. & Lucente, S. W. (2003). Psychopathy and treatment response
in incarcerated female substance abusers. Criminal Justice and Behavior 30, 251–276.
Rosenthal, R. & Rubin, D. B. (1994). The counternull value of an effect size: A new
statistic. Psychological Science 5, 329–334.
Rozeboom, W. W. (1960). The fallacy of the null hypothesis significance test. Psychological
Bulletin 57, 416–428.
Rozeboom, W. W. (1997). Good science is abductive, not hypothetico-deductive. In L. L.
Harlow, S. A. Mulaik & J. H. Steiger (Eds.), What if there were no significance tests? (pp.
335–392). Mahwah, NJ: Lawrence Erlbaum Associates.
*Scheider, M. C. (2001). Deterrence and the base rate fallacy: An examination of perceived
certainty. Justice Quarterly 18, 63–86.
*Schnebly, S. M. (2002). An examination of the impact of victim, offender, and situational
attributes on the deterrent effect of defensive gun use: A research note. Justice Quarterly
19, 377–398.
*Schwartz, M. D., DeKeseredy, W. S., Tait, D. & Alvi, S. (2001). Male peer support and a
feminist routine activities theory: Understanding sexual assault on the college campus.
Justice Quarterly 18, 623–650.
Sherman, L. W., Gottfredson, D., MacKenzie, D., Eck, J., Reuter, P. & Bushway, S. (1997).
Preventing crime: What works, what doesn’t, what’s promising: A report to the United
States Congress. Washington, DC: National Institute of Justice.
*Silver, E. (2002). Mental disorder and violent victimization: The mediating role of
involvement in conflicted relationships. Criminology 40, 191–212.
*Simons, R. L., Stewart, E., Gordon, L. C., Conger, R. D. & Elder, G., Jr. (2002). A test of
life-course explanations for stability and change in antisocial behavior from adolescence
to young adulthood. Criminology 40, 401–434.
*Spohn, C. & Holleran, D. (2001). Prosecuting sexual assault: A comparison of charging
decisions in sexual assault cases involving strangers, acquaintances, and intimate partners.
Justice Quarterly 18, 651–688.
*Spohn, C. & Holleran, D. (2002). The effect of imprisonment on recidivism rates of felony
offenders: A focus on drug offenders. Criminology 40, 329–358.
*Steffensmeier, D. & Demuth, S. (2001). Ethnicity and judges' sentencing decisions:
Hispanic–Black–White comparisons. Criminology 39, 145–178.
*Stewart, E. A., Simons, R. L. & Conger, R. D. (2002). Assessing neighborhood and social
psychological influences on childhood violence in an African-American sample.
Criminology 40, 801–829.
*Swanson, J. W., Borum, R., Swartz, M. S., Hiday, V. A., Wagner, H. R. & Burns, B. J.
(2001). Can involuntary outpatient commitment reduce arrests among persons with severe
mental illness? Criminal Justice and Behavior 28, 156–189.
*Taylor, B. G., Davis, R. C. & Maxwell, C. D. (2001). The effects of a group batterer
treatment program: A randomized experiment in Brooklyn. Justice Quarterly 18, 171–201.
*Terrill, W. & Mastrofski, S. D. (2002). Situational and officer-based determinants of police
coercion. Justice Quarterly 19, 215–248.
Thompson, B. (2004). The "significance" crisis in psychology and education. The Journal of
Socio-Economics 33, 607–613.
*van Voorhis, P., Spruance, L. M., Ritchey, P. N., Listwan, S. J. & Seabrook, R. (2004). The
Georgia cognitive skills experiment: A replication of Reasoning and Rehabilitation.
Criminal Justice and Behavior 31, 282–305.
*Velez, M. B. (2001). The role of public social control in urban neighborhoods: A multi-
level analysis of victimization risk. Criminology 39, 837–864.
*Vogel, B. L. & Meeker, J. W. (2001). Perceptions of crime seriousness in eight African-
American communities: The influence of individual, environmental, and crime-based
factors. Justice Quarterly 18, 301–321.
Weisburd, D., Lum, C. M. & Yang, S.-M. (2003). When can we conclude that treatments or
programs "don't work"? The Annals of the American Academy of Political and Social
Science 574, 31–48.
*Weitzer, R. & Tuch, S. A. (2002). Perceptions of racial profiling: Race, class, and personal
experience. Criminology 40, 435–456.
Wellford, C. (1989). Towards an integrated theory of criminal behavior. In S. Messner, M. M.
Krohn & A. Liska (Eds.), Theoretical integration in the study of deviance and crime:
Problems and prospects (pp. 119–128). Albany, NY: State University of New York.
*Wells, L. E. & Weisheit, R. A. (2001). Gang problems in nonmetropolitan areas: A
longitudinal assessment. Justice Quarterly 18, 791–824.
*Welsh, W. N. (2001). Effects of student and school factors on five measures of school
disorder. Justice Quarterly 18, 911–948.
*Wexler, H. K., Melnick, G., Lowe, L. & Peters, J. (1999). Three-year reincarceration
outcomes for Amity in-prison therapeutic community and aftercare in California. Prison
Journal 79, 321–336.
Wilson, D. B. (2001). Meta-analytic methods for criminology. Annals of the American
Academy of Political and Social Science 578, 71–89.
Wooldridge, J. M. (2004). Statistical significance is okay, too: Comment on "size matters."
The Journal of Socio-Economics 33, 577–579.
*Wright, B. R. E., Caspi, A., Moffitt, T. E. & Silva, P. A. (2001). The effects of social ties
on crime vary by criminal propensity: A life-course model of interdependence.
Criminology 39, 321–352.
*Wright, J. P., Cullen, F. T., Agnew, R. S. & Brezina, T. (2001). "The root of all evil"? An
exploratory study of money and delinquent involvement. Justice Quarterly 18, 239–268.
Zellner, A. (2004). To test or not to test and if so, how? Comments on "size matters." The
Journal of Socio-Economics 33, 581–586.
Ziliak, S. T. & McCloskey, D. N. (2004). Size matters: The standard error of regressions in
the American Economic Review. The Journal of Socio-Economics 33, 527–546.