
The Stata Journal

Volume 13 Number 1 2013

A Stata Press publication
StataCorp LP
College Station, Texas


The Stata Journal

Editors

H. Joseph Newton

Department of Statistics

Texas A&M University

College Station, Texas

[email protected]

Nicholas J. Cox

Department of Geography

Durham University

Durham, UK

[email protected]

Associate Editors

Christopher F. Baum, Boston College

Nathaniel Beck, New York University

Rino Bellocco, Karolinska Institutet, Sweden, and University of Milano-Bicocca, Italy

Maarten L. Buis, WZB, Germany

A. Colin Cameron, University of California–Davis

Mario A. Cleves, University of Arkansas for Medical Sciences

William D. Dupont, Vanderbilt University

Philip Ender, University of California–Los Angeles

David Epstein, Columbia University

Allan Gregory, Queen’s University

James Hardin, University of South Carolina

Ben Jann, University of Bern, Switzerland

Stephen Jenkins, London School of Economics and Political Science

Ulrich Kohler, University of Potsdam, Germany

Frauke Kreuter, Univ. of Maryland–College Park

Peter A. Lachenbruch, Oregon State University

Jens Lauritsen, Odense University Hospital

Stanley Lemeshow, Ohio State University

J. Scott Long, Indiana University

Roger Newson, Imperial College, London

Austin Nichols, Urban Institute, Washington DC

Marcello Pagano, Harvard School of Public Health

Sophia Rabe-Hesketh, Univ. of California–Berkeley

J. Patrick Royston, MRC Clinical Trials Unit, London

Philip Ryan, University of Adelaide

Mark E. Schaffer, Heriot-Watt Univ., Edinburgh

Jeroen Weesie, Utrecht University

Nicholas J. G. Winter, University of Virginia

Jeffrey Wooldridge, Michigan State University

Stata Press Editorial Manager

Lisa Gilmore

Stata Press Copy Editors

David Culwell and Deirdre Skaggs

The Stata Journal publishes reviewed papers together with shorter notes or comments, regular columns, book reviews, and other material of interest to Stata users. Examples of the types of papers include 1) expository papers that link the use of Stata commands or programs to associated principles, such as those that will serve as tutorials for users first encountering a new field of statistics or a major new technique; 2) papers that go “beyond the Stata manual” in explaining key features or uses of Stata that are of interest to intermediate or advanced users of Stata; 3) papers that discuss new commands or Stata programs of interest either to a wide spectrum of users (e.g., in data management or graphics) or to some large segment of Stata users (e.g., in survey statistics, survival analysis, panel analysis, or limited dependent variable modeling); 4) papers analyzing the statistical properties of new or existing estimators and tests in Stata; 5) papers that could be of interest or usefulness to researchers, especially in fields that are of practical importance but are not often included in texts or other journals, such as the use of Stata in managing datasets, especially large datasets, with advice from hard-won experience; and 6) papers of interest to those who teach, including Stata with topics such as extended examples of techniques and interpretation of results, simulations of statistical concepts, and overviews of subject areas.

The Stata Journal is indexed and abstracted by CompuMath Citation Index, Current Contents/Social and Behavioral Sciences, RePEc: Research Papers in Economics, Science Citation Index Expanded (also known as SciSearch), Scopus, and Social Sciences Citation Index.

For more information on the Stata Journal, including information for authors, see the webpage

http://www.stata-journal.com


Subscriptions are available from StataCorp, 4905 Lakeway Drive, College Station, Texas 77845, telephone 979-696-4600 or 800-STATA-PC, fax 979-696-4601, or online at

http://www.stata.com/bookstore/sj.html

Subscription rates listed below include both a printed and an electronic copy unless otherwise mentioned.

                                              U.S. and Canada    Elsewhere
Printed & electronic
  1-year subscription                              $ 98             $138
  2-year subscription                              $165             $245
  3-year subscription                              $225             $345
  1-year student subscription                      $ 75             $ 99
  1-year university library subscription           $125             $165
  2-year university library subscription           $215             $295
  3-year university library subscription           $315             $435
  1-year institutional subscription                $245             $285
  2-year institutional subscription                $445             $525
  3-year institutional subscription                $645             $765
Electronic only
  1-year subscription                              $ 75             $ 75
  2-year subscription                              $125             $125
  3-year subscription                              $165             $165
  1-year student subscription                      $ 45             $ 45

Back issues of the Stata Journal may be ordered online at

http://www.stata.com/bookstore/sjj.html

Individual articles three or more years old may be accessed online without charge. More recent articles may

be ordered online.

http://www.stata-journal.com/archives.html

The Stata Journal is published quarterly by the Stata Press, College Station, Texas, USA.

Address changes should be sent to the Stata Journal, StataCorp, 4905 Lakeway Drive, College Station, TX 77845, USA, or emailed to [email protected].

Copyright © 2013 by StataCorp LP

Copyright Statement: The Stata Journal and the contents of the supporting files (programs, datasets, and help files) are copyright © by StataCorp LP. The contents of the supporting files (programs, datasets, and help files) may be copied or reproduced by any means whatsoever, in whole or in part, as long as any copy or reproduction includes attribution to both (1) the author and (2) the Stata Journal.

The articles appearing in the Stata Journal may be copied or reproduced as printed copies, in whole or in part, as long as any copy or reproduction includes attribution to both (1) the author and (2) the Stata Journal. Written permission must be obtained from StataCorp if you wish to make electronic copies of the insertions. This precludes placing electronic copies of the Stata Journal, in whole or in part, on publicly accessible websites, fileservers, or other locations where the copy may be accessed by anyone other than the subscriber.

Users of any of the software, ideas, data, or other materials published in the Stata Journal or the supporting files understand that such use is made without warranty of any kind, by either the Stata Journal, the author, or StataCorp. In particular, there is no warranty of fitness of purpose or merchantability, nor for special, incidental, or consequential damages such as loss of profits. The purpose of the Stata Journal is to promote free communication among Stata users.

The Stata Journal, electronic version (ISSN 1536-8734) is a publication of Stata Press. Stata, Stata Press, Mata, and NetCourse are registered trademarks of StataCorp LP.


Volume 13 Number 1 2013

The Stata Journal

Articles and Columns                                                                1

Announcement of the Stata Journal Editors’ Prize 2013  . . . . .  1
Stata as a numerical tool for scientific thought experiments: A tutorial with worked examples  . . . . .  T. Wimberley, E. Parner, and H. Støvring  3
Versatile sample-size calculation using simulation  . . . . .  R. Hooper  21
Sar: Automatic generation of statistical reports using Stata and Microsoft Word for Windows  . . . . .  G. L. Lo Magno  39
Within and between estimates in random-effects models: Advantages and drawbacks of correlated random effects and hybrid models  . . . . .  R. Schunck  65
Trial sequential boundaries for cumulative meta-analyses  . . . . .  B. Miladinovic, I. Hozo, and B. Djulbegovic  77
Regression anatomy, revealed  . . . . .  V. Filoso  92
kmlmap: A Stata command for producing Google’s Keyhole Markup Language  . . . . .  A. Okulicz-Kozaryn  107
A menu-driven facility for sample-size calculations in cluster randomized controlled trials  . . . . .  K. Hemming and J. Marsh  114
Estimating Geweke’s (1982) measure of instantaneous feedback  . . . . .  M. F. Dicle and J. Levendis  136
Two-stage nonparametric bootstrap sampling with shrinkage correction for clustered data  . . . . .  E. S.-W. Ng, R. Grieve, and J. R. Carpenter  141
Joint modeling of longitudinal and survival data  . . . . .  M. J. Crowther, K. R. Abrams, and P. C. Lambert  165
Doubly robust estimation in generalized linear models  . . . . .  N. Orsini, R. Bellocco, and A. Sjolander  185
Review of Data Analysis Using Stata, Third Edition, by Kohler and Kreuter  . . . . .  P. Schumm  206
Review of Flexible Parametric Survival Analysis Using Stata: Beyond the Cox Model by Patrick Royston and Paul C. Lambert  . . . . .  N. Orsini  212

Notes and Comments                                                                217

Stata tip 114: Expand paired dates to pairs of dates  . . . . .  N. J. Cox  217

Software Updates                                                                  220


The Stata Journal (2013) 13, Number 1, pp. 1–2

Announcement of the Stata Journal Editors’ Prize 2013

The editors of the Stata Journal are pleased to invite nominations for their 2013 prize in accordance with the following rules. Nominations should be sent as private email to [email protected] by July 31, 2013.

1. The Stata Journal Editors’ Prize is awarded annually to one or more authors of a specified paper or papers published in the Stata Journal in the previous three years.

2. The prize will consist of a framed certificate and an honorarium of U.S. $500, courtesy of the publisher of the Stata Journal. The prize may be awarded in person at a Stata Conference or Stata Users Group meeting of the recipient’s or recipients’ choice or as otherwise arranged.

3. Nominations for the prize in a given year will be requested in the Stata Journal in the first issue of each year and simultaneously through announcements on the Stata Journal website and on Statalist. Nominations should be sent to the editors by private email to [email protected] by July 31 in that year. The recipient(s) will be announced in the Stata Journal in the last issue of each year and simultaneously through announcements on the Stata Journal website and on Statalist.

4. Nominations should name the author(s) and one or more papers published in the Stata Journal in the previous three years and explain why the work concerned is worthy of the prize. The precise time limits will be the annual volumes of the Stata Journal, so that, for example, the prize for 2013 will be for work published in the annual volumes for 2010, 2011, or 2012. The rationale might include originality, depth, elegance, or unifying power of work; usefulness in cracking key problems or allowing important new methodologies to be widely implemented; and clarity or expository excellence of the work. Comments on the excellence of the software will also be appropriate when software was published with the paper(s). Nominations might include evidence of citations or downloads or of impact either within or outside the community of Stata users. These suggestions are indicative rather than exclusive, and any special or unusual merits of the work concerned may naturally be mentioned. Nominations may also mention, when relevant, any body of linked work published in other journals or previously in the Stata Journal or Stata Technical Bulletin. Work on any or all of statistical analysis, data management, statistical graphics, and Stata or Mata programming may be nominated.

© 2013 StataCorp LP gn0055


5. Nominations will be considered confidential both before and after award of the prize. Neither anonymous nor public nominations will be accepted. Authors may not nominate themselves and so doing will exclude those authors from consideration. The editors of the Stata Journal may not be nominated. Employees of StataCorp may not be nominated. Such exclusions apply to any person with such status at any time between January 1 of the year in question and the announcement of the prize. The associate editors of the Stata Journal may be nominated.

6. The recipient(s) of the award will be selected by the editors of the Stata Journal, who reserve the right to take advice in confidence from appropriate persons, subject to such persons not having been nominated themselves. The editors’ decision is final and not open to discussion.

The 2012 Prize, the first in this series, was awarded to David Roodman. For full details, please see Stata Journal 12: 571–574 (2012).

H. Joseph Newton and Nicholas J. Cox
Editors, Stata Journal


The Stata Journal (2013) 13, Number 1, pp. 3–20

Stata as a numerical tool for scientific thought experiments: A tutorial with worked examples

Theresa Wimberley
Department of Economics and Business
National Centre for Register-Based Research
Aarhus University
Aarhus, Denmark
[email protected]

Erik Parner
Department of Public Health, Biostatistics
Aarhus University
Aarhus, Denmark

Henrik Støvring
Department of Public Health, Biostatistics
Aarhus University
Aarhus, Denmark

Abstract. Thought experiments based on simulation can be used to explain the impact of the chosen study design, statistical analysis strategy, or the sensitivity of results to fellow researchers. In this article, we demonstrate with two examples how to implement quantitative thought experiments in Stata. The first example uses a large-sample approach to study the impact on the estimated effect size of dichotomizing an exposure variable at different values. The second example uses simulations of datasets of realistic size to illustrate the necessity of using sampling fractions as inverse probability weights in statistical analysis for protection against bias in a complex sampling design. We also give a brief outline of the general steps needed for implementing quantitative thought experiments in Stata. We demonstrate how Stata provides programming facilities for conveniently implementing such thought experiments, with the advantage of saving researchers time, speculation, and debate as well as improving communication in interdisciplinary research groups.

Keywords: st0281, quantitative thought experiments, simulations

© 2013 StataCorp LP st0281

1 Introduction

A primary obligation for applied statisticians working in larger, interdisciplinary research groups is to provide guidance on study design, choice of statistical model, and explanation of results. The impact of the chosen study design, statistical analysis strategy, or sensitivity of results to certain assumptions must often be explained to the entire research group, even when some members have no formal statistical training. While the statistical literature may often provide solid results for preferring one approach over another, it will typically be in the form of abstract, mathematical, or probabilistic reasoning, which is difficult to communicate in plain language. Consequently, the theoretical results risk being perceived as unconvincing—magical arguments originating from the black hat of a statistician. This perception is only reinforced if the result is counterintuitive or controversial.

In such situations, we have found that implementing quantitative thought experiments may serve as a valuable pedagogical instrument to better explain what the theory means. These numerical experiments can often be further updated to account for ensuing “what if?” questions, which can be crucial in making sure that all arguments put forward in the group have been heard and fairly evaluated. The main prerequisite for this to be of practical value, however, is the ability to easily implement the thought experiment in a flexible and convenient software program that allows numerical analysis. In this article, we will demonstrate with two examples that Stata provides such programming facilities.

Arguably, this is just another example of how to use the computational muscles of Stata to overcome analytical shortcomings or solve problems that are mathematically intractable. A well-known example of this is to use Stata for estimation of statistical power via stochastic simulation (Feiveson 2002), which is now a commonly used and often-cited strategy. Therefore, the main objective of this article is not to give a general and detailed account on stochastic simulation in Stata. Rather, our main objective is to provide two illustrative case studies where the thought-experiment approach allowed us to present convincing arguments to our fellow researchers—in this case, epidemiologists—even in situations where the statistical theory is complex.

Both examples presented here originate from our work within the Lifestyle During Pregnancy Study (LDPS). The LDPS is a large epidemiological study on the effect of low-to-moderate alcohol consumption during early pregnancy on a child’s neurodevelopment at age 5 (Kesmodel et al. 2010). The study was based on a complex stratified sample conducted within the Danish National Birth Cohort (Olsen et al. 2001), where the stratification was used to ensure adequate representation of women with higher exposures in terms of both average alcohol intake (weekly number of alcoholic drinks) and binge episodes (drinking at least five alcoholic drinks at a single occasion). Originally, the sampling was based on 20 different strata (Kesmodel et al. 2010) defined in a complex way; for pedagogical reasons, we here choose a more simple design consisting of 18 different strata, all defined by categories of average alcohol intake during pregnancy (0, 1–4, 5–8, 9+ drinks per week) and timing of binge drinking (no binge, occurrence in weeks 1–2, 3–4, 5–8, 9, or later), where the last two categories of timing of binge drinking were collapsed when the average alcohol intake was 5–8 and 9+ drinks per week. Note that average intake was defined such that it was possible for women to have an average intake of 0 (the typical intake) and yet have one or more binge episodes.

The first example in this article studies the impact on the estimated effect size due to dichotomizing an exposure variable at different cutoff values. The second example illustrates the necessity of using sampling fractions as inverse probability weights in the statistical analysis for protection against bias in this complex sampling design. After the two examples, we give a brief outline of the general steps needed for implementing quantitative thought experiments in Stata.


2 Example 1: Does dichotomizing an exposure variable at higher values always lead to larger effect sizes?

2.1 Scientific setting

In the LDPS, the cutoff for defining a binge drinking episode was set at five drinks at a single occasion. With this definition, the analysis of the data yielded a rather small binge effect on IQ, and so speculation naturally arose in the research group on the causes for this. In the literature, it is well known that dichotomization of a predictor variable may impair statistical efficiency and cutoffs should be chosen carefully (Senn and Julious 2009). In a previous article by Olsen (1994), a cutoff of eight drinks had been used and a larger effect estimate reported; thus one of the epidemiologists argued that using a higher cutoff value in the LDPS would likely have led to a larger estimated effect (in absolute value).

We wanted to investigate whether this was invariably true in realistic scenarios, that is, whether a higher cutoff value will always lead to a larger effect estimate (in absolute terms) when the effect on outcome is monotonically increasing with higher values of the continuous explanatory variable. To answer this, we set up a quantitative thought experiment.

2.2 Implementation

In general, the estimated effect size will depend on the distribution of the explanatory variable and the dose–response relationship; thus these two characteristics must be varied to create the relevant scenarios.

Assume that IQ is the outcome of interest and that binge drinking is the binary exposure (yes or no) defined from the actual number of drinks consumed at a single occasion based on a cutoff value. To mimic the actual setting, we consider using five and eight drinks as cutoff values. The distribution of the number of drinks can take many different forms, but here we will consider three simple forms. Either most women only have a few drinks, the women have a uniform distribution of drinks, or most women consume many drinks. Assume without loss of generality for all the settings that 14 drinks is the maximum number of drinks consumed at a single occasion. The uniformly distributed exposure variable can then be generated straightforwardly by taking integer values of uniform random numbers after multiplication with an appropriate factor, here 15.

When most women drink a small number of drinks, the exposure variable can conveniently be generated by raising a uniform random variate to a power larger than 1 before taking the integer value. Similarly, the situation with most women drinking high numbers of drinks can be obtained by choosing a power smaller than 1. Regardless of its distribution, the variable can subsequently be dichotomized into a binary binge variable. For example, the situation with most women having a small intake can be generated as follows:


. set obs 1000000
obs was 0, now 1000000

. generate ndrinks = int(runiform()^3*15)

. generate binge5 = ndrinks >=5

. generate binge8 = ndrinks >=8
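The other two distributions described above can be generated analogously. The following is a minimal sketch only; the variable names are ours, and the powers 1 and 0.33 are the same ones used in the loop in section 2.4:

. generate ndrinks_unif = int(runiform() * 15)

. generate ndrinks_many = int(runiform()^0.33 * 15)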

Notice that we chose to generate a very large dataset (n = 1,000,000) so as to virtually eliminate random error in subsequent results. The resulting distribution of exposure is shown in figure 1.

[Figure 1: density of the number of drinks (0–15) for the three scenarios (most consume few drinks; the number of drinks is uniformly distributed; most consume many drinks).]

Figure 1. The distribution of the number of drinks generated from 1,000,000 observations as a decreasing, uniform, and increasing integer function ranging from 0 to 14. The three different distributions are generated by raising a uniform random variate to the power of 3, 1, and 0.33, respectively.


2.3 The shape of the dose–response curve

When specifying the dose–response curve between IQ (the response) and the number of drinks (the dose), we consider three different types of dose–response curves, namely, concavely declining, linearly declining, or convexly declining. By definition, the IQ in a general population is defined to follow a normal distribution with a mean of 100 and a standard deviation of 15; so for the unexposed, we assume a higher IQ—say, 105—and we then subtract the effect of alcohol from the mean for higher intakes. Figure 2 shows the shapes of the three types of relationships we considered.

[Figure 2: mean IQ (85–105) plotted against the number of drinks (0–15) for the concave, linear, and convex dose–response curves.]

Figure 2. Three different shapes of the dose–response curve, where the mean IQ is plotted against the number of drinks. The vertical lines illustrate the values used for dichotomizing the exposure. Concave: IQ = 105 − (x/5)^3; linear: IQ = 105 − x; convex: IQ = 105 − (200x)^(1/3).

In Stata syntax, the concavely declining relationship is specified as

. generate IQ = rnormal() * 15 + 105 - (ndrinks/5)^3
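The linearly and convexly declining relationships can be specified in the same way. This sketch uses variable names of our own choosing; the expressions mirror the dr_linear and dr_convex definitions that appear in the loop in section 2.4:

. generate IQ_linear = rnormal() * 15 + 105 - ndrinks

. generate IQ_convex = rnormal() * 15 + 105 - (ndrinks * 200)^(1/3)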


2.4 Comparison of the effect sizes in different settings

For the actual implementation, let us first consider the single setting, where most women have a low intake and the shape of the dose–response curve is concavely declining. Because IQ is not, according to the above definition, normally distributed given binge status (mean IQ varies within binge categories), we use robust variance estimation to obtain standard error estimates, which implies that the model must be formulated as a linear regression with IQ as the response variable and binge drinking as a binary covariate. The following output shows the estimated effect sizes for each of the two cutoff values defining binge drinking:

. regress IQ i.binge5, vce(robust)

Linear regression                               Number of obs =  1000000
                                                F(  1,999998) = 45431.21
                                                Prob > F      =   0.0000
                                                R-squared     =   0.0463
                                                Root MSE      =   15.432

------------------------------------------------------------------------------
             |               Robust
          IQ |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    1.binge5 |  -7.376778    .034609  -213.15   0.000    -7.444611   -7.308946
       _cons |   104.9271   .0180216  5822.29   0.000     104.8918    104.9624
------------------------------------------------------------------------------

. regress IQ i.binge8, vce(robust)

Linear regression                               Number of obs =  1000000
                                                F(  1,999998) = 68834.50
                                                Prob > F      =   0.0000
                                                R-squared     =   0.0699
                                                Root MSE      =    15.24

------------------------------------------------------------------------------
             |               Robust
          IQ |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    1.binge8 |   -10.6854   .0407275  -262.36   0.000    -10.76522   -10.60557
       _cons |   104.6848    .016679  6276.45   0.000     104.6521    104.7175
------------------------------------------------------------------------------

In this setting, the effect size is largest in absolute value when binge exposure is defined from the high cutoff of eight drinks.


To efficiently estimate all effect sizes corresponding to the different definitions of the outcome and exposure, we wrap the code into two foreach loops:

. set seed 198598

. foreach power of numlist 3 1 0.33 {
  2.     set obs 1000000
  3.     generate ndrinks = int(runiform()^`power´ * 15)
  4.     generate dr_concave = - (ndrinks / 5)^3
  5.     generate dr_linear = - ndrinks
  6.     generate dr_convex = - (ndrinks * 200)^(1 / 3)
  7.     generate binge8 = ndrinks >= 8
  8.     generate binge5 = ndrinks >= 5
  9.     foreach dr_fct of varlist dr_concave dr_linear dr_convex {
 10.         generate IQ = rnormal() * 15 + 105 + `dr_fct´
 11.         regress IQ i.binge5, vce(robust) noheader
 12.         regress IQ i.binge8, vce(robust) noheader
 13.         drop IQ
 14.     }
 15.     drop _all
 16. }

(output omitted )

2.5 Results

The code above results in nine different linear regression analyses for each of the two different definitions of the exposure; these analyses are listed in table 1.

Table 1. Estimated effect size and robust standard errors from the linear regression of IQ on the dichotomized binge exposure on a large sample (n = 1,000,000) in different settings. For the concave scenario with decreasing distribution of exposure, the results differ slightly from those presented earlier, which is simply the consequence of using different random seeds.

                                          Distribution of exposure
Dose–response   Exposure cutoff    Decreasing        Equal            Increasing

Concave               5             −7.38 (0.03)     −8.57 (0.03)    −12.00 (0.08)
                      8            −10.75 (0.04)    −10.91 (0.03)    −12.19 (0.04)

Linear                5             −8.06 (0.03)     −7.49 (0.03)     −7.91 (0.08)
                      8             −9.23 (0.04)     −7.51 (0.03)     −6.18 (0.04)

Convex                5             −8.95 (0.03)     −6.03 (0.03)     −4.48 (0.08)
                      8             −8.71 (0.04)     −5.10 (0.03)     −3.12 (0.04)

As expected, both the shape of the dose–response curve and the distribution of the exposure variable determine which cutoff value yields the larger effect. When the shape of the dose–response is concave, a cutoff value of eight drinks results in the largest effect size in absolute value in this example, but this is reversed when the shape is convex. When the dose–response relationship is linear, the distribution of the exposure determines which is larger. The conclusion is thus that the magnitude of the estimated effect depends not only on the cutoff value but also on the distribution of the covariate and the shape of its association with the outcome. There is thus no guarantee that choosing a higher cutoff value would have led to a higher effect estimate. Note that the high number of observations (n = 1,000,000) has virtually eliminated the random variation (all standard errors are between 0.03 and 0.08), so any variation in effect sizes can be considered systematic effects.

3 Example 2: Can use of sampling weights be avoided in statistical analyses of complex sampling designs?

3.1 Scientific setting

In the LDPS, the sample was stratified in a complex manner to ensure adequate representation in every stratum. When one analyzes such complex sampling designs, the standard strategy according to the statistical literature is to calculate the sampling fractions for each stratum and use these as inverse probability weights in a weighted analysis with robust variance estimation (see [U] 20.22.3 Sampling weights). This strategy was followed in the LDPS, but weighting the analyses seemed to decrease precision, which in the research group raised the question of whether an appropriate analysis could be conducted omitting weights.

While the use of sampling weights may in some specific situations be abandoned without inducing bias (Winship and Radbill 1994), this is not true in general. In the LDPS, the focus was on estimating a marginal effect of average alcohol intake while accounting for the sampling categories defined by average and binge drinking but without considering an interaction term between the two. Therefore, the main objective for this example was to illustrate how large the cost could be in terms of bias and reduced coverage probabilities of confidence intervals by omitting weights in such an analysis. We again used a quantitative thought experiment implemented in Stata to answer this.

3.2 Implementation

To mimic the setup for the actual LDPS and yet keep the model simple, we imagine a study where the sample is stratified by both average alcohol intake (four categories: 0, 1–4, 5–8, 9+ drinks per week) and binge drinking (yes or no). Our aim is now to estimate the effect of maternal average alcohol intake on a child’s IQ by conducting a linear regression analysis. Note that just as in the real LDPS, the sampling design in our example is based on categorized average intake, but the actual average number of drinks consumed per week is recorded and can be used as a covariate in the statistical analysis.


To define the sample, we must specify the joint distribution of average alcohol intake and binge drinking. Suppose that average alcohol intake per week lies between 0 and 14 drinks with most women having a low intake; that is, the random variable describing average intake is generated as a declining integer function ranging from 0 to 14. Suppose that the probability of binge drinking during pregnancy increases with average alcohol intake such that among those with an average intake of 0, 20% will be categorized as binge drinkers, whereas for those with an average intake of 14, approximately 50% will be categorized as binge drinkers.

So that we can mimic the original setup where the LDPS is a subsample within the Danish National Birth Cohort, a dataset with 100,000 observations is first generated for reference, and from this, all the subsamples are then drawn.

. set seed 1508776

. set obs 100000
obs was 0, now 100000

. generate avalco = int(runiform()^3 * 15)

. generate binge = runiform() < (.2 + avalco/(14*2))

As in the LDPS, a higher fraction is sampled among those having a high alcohol intake, be it on average or as binging.

. recode avalco (0 = 1) (1/4 = 2) (5/8 = 3) (9/20 = 4), generate(alcocat)
(92637 differences between avalco and alcocat)

. generate sampfrac = (alcocat / 10 + binge / 2) / 20

. table alcocat binge, c(mean sampfrac) format(%5.3f)

----------------------------
RECODE of |      binge
   avalco |      0        1
----------+-----------------
        1 |  0.005    0.030
        2 |  0.010    0.035
        3 |  0.015    0.040
        4 |  0.020    0.045
----------------------------

Using the sampling weights, we select stratified samples from the cohort of 100,000 observations. In table 2, an example of one of these stratified samples is shown—the sample is more balanced across the strata than the full cohort. In a more realistic sample, it would be likely that more subjects have an average intake of 0, but because the average intake is defined by a simple decreasing function, the first group (0 drinks per week) contains fewer individuals than the second group (1–4 drinks per week) in this hypothetical setup. This is, however, of no consequence for the subsequent results.


Table 2. The distribution of subjects across the strata in the full cohort and in the sample

avalco   binge   Full cohort, n (%)   Stratified sample, n (%)

  1        0        32,401 (32.4)           160  (8.8)
           1         8,206  (8.2)           222 (12.2)
  2        0        20,770 (20.8)           205 (11.3)
           1         7,890  (7.9)           287 (15.8)
  3        0         8,621  (8.6)           133  (7.3)
           1         6,426  (6.4)           268 (14.7)
  4        0         6,171  (6.2)           113  (6.2)
           1         9,515  (9.5)           430 (23.7)

Total              100,000 (100)          1,741 (100)
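A single stratified sample such as the one summarized in table 2 can be drawn by comparing a uniform random number with each observation's sampling fraction. The following is a sketch only; the same selection rule is used inside the simulation programs shown below:

. preserve

. keep if runiform() < sampfrac

. tabulate alcocat binge

. restore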

For the outcome variable IQ, we assume that its mean decreases slightly with increasing alcohol intake and that it is normally distributed with a standard deviation of 15 and a mean of 105 for nondrinkers, just as in the previous example. In Stata, this becomes

. generate IQ = rnormal()*15 + 105 - (avalco/7)^3
>     - 4 * binge - .4 * (avalco/7)^3 * binge

. save sourcepop, replace
file sourcepop.dta saved

Note that the data defining the full cohort is saved in sourcepop.dta for later use.

3.3 Bias in unweighted versus weighted analyses

A simple linear regression of IQ on average alcohol intake in the full cohort of 100,000 observations yields an effect estimate of −0.618. We take this to be the true value with which the estimated coefficients in the smaller stratified samples are to be compared.

First, we again construct a simple program that saves the estimated regression coefficient and standard error for the unweighted and weighted analysis, respectively. This can then be fed to simulate. Second, the results are evaluated by calculating and summarizing the relative bias and the coverage probability from the 2,500 simulated regression coefficients and standard errors.


. * A program selecting a subsample with sample fraction sampfrac

. * and running an unweighted regression of IQ on average alcohol.

. use sourcepop

. program define alcononpw, eclass
  1.     preserve
  2.     keep if runiform() < sampfrac
  3.     regress IQ avalco
  4.     restore
  5. end

. simulate _b _se, reps(2500) saving(nonpwres, replace): alcononpw

      command:  alcononpw

Simulations (2500)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..................................................   100
(output omitted )
..................................................  2500

. * A program selecting a subsample with sample fraction sampfrac,

. * but running a weighted regression of IQ on average alcohol.

. use sourcepop

. program define alcopw, eclass
  1.     preserve
  2.     keep if runiform() < sampfrac
  3.     regress IQ avalco [pw = 1/sampfrac]
  4.     restore
  5. end

. simulate _b _se, reps(2500) saving(pwres, replace): alcopw

      command:  alcopw

Simulations (2500)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..................................................   100
(output omitted )
..................................................  2500

. * The estimated regression coefficient used as the true value.

. preserve

. use sourcepop

. quietly regress IQ avalco

. matrix truecoefs = e(b)

. local trueval = truecoefs[1, 1]

. display `trueval´
-.61761482

. restore


. ******************************************************************

. * Results

. ******************************************************************

. * Summarizing the results by calculating the relative bias

. * and coverage probability.

. foreach dataset in nonpwres pwres {
  2.     display "Dataset used `dataset´"
  3.     use `dataset´, clear
  4.     generate relbias = (_b_avalco - `trueval´) / `trueval´
  5.     ci _b_avalco
  6.     centile relbias _se_avalco
  7.     generate coverage = (_b_avalco - 1.96 * _se_avalco) <= `trueval´
>            & (_b_avalco + 1.96 * _se_avalco) >= `trueval´
  8.     ci coverage, bin
  9. }

Dataset used nonpwres
(simulate: alcononpw)

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+----------------------------------------------------------------
   _b_avalco |       2500   -.6621544    .0014757       -.6650481   -.6592607

                                                            Binom. Interp.
    Variable |        Obs  Percentile     Centile        [95% Conf. Interval]
-------------+----------------------------------------------------------------
     relbias |       2500          50    .0727582        .0672506    .0785631
  _se_avalco |       2500          50    .0761998        .0761099    .0762694

                                                            Binomial Exact
    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+----------------------------------------------------------------
    coverage |       2500       .9188    .0054628        .9073955    .9292116

Dataset used pwres
(simulate: alcopw)

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+----------------------------------------------------------------
   _b_avalco |       2500   -.6209438    .0019018        -.624673   -.6172146

                                                            Binom. Interp.
    Variable |        Obs  Percentile     Centile        [95% Conf. Interval]
-------------+----------------------------------------------------------------
     relbias |       2500          50    .0066028        .0001758    .0142346
  _se_avalco |       2500          50    .0967832        .0966078    .0970025

                                                            Binomial Exact
    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+----------------------------------------------------------------
    coverage |       2500       .9552    .0041373        .9463414    .9629713


From table 3, it is clear that the weighted analysis yields less bias and a coverage probability closer to the nominal value than the unweighted analysis.

Table 3. The mean estimate with its median standard error, median relative bias, and coverage probability based on 2,500 simulations. As a true value, we used the estimate −0.618 obtained in the linear regression in the entire cohort of 100,000.

              Mean                Relative bias (%)    Coverage probability
              (standard error)                         with 95% confidence interval

Unweighted    −0.662 (0.076)      7.3                  0.919 [0.907, 0.929]
Weighted      −0.621 (0.097)      0.7                  0.955 [0.946, 0.963]

We thus conclude that using sample fractions as inverse probability weights substantially reduces bias while maintaining the coverage probability close to the nominal value of 95% even when the model is misspecified. These features are not shared by the unweighted analysis. Although the use of sampling weights results in larger standard errors and less power, the protection against bias outweighs this, and when it is not possible in a study to gain both high power and unbiased estimates, a less precise but unbiased estimate is preferred. As a follow-up (not shown but available upon request), we found in another worked thought experiment that the power for detecting an interaction effect between average intake and binging was low, so a simple strategy for avoiding the use of weights does not exist.

4 Outline of the process for constructing a quantitative thought experiment

In this article, two different approaches have been used to construct quantitative thought experiments. One approach is to generate one very large dataset and use this to compare the estimated effect size in different situations, as done in example 1, with the objective of virtually eliminating random error. The other approach is to generate many datasets of a realistic size in simulations, analyze them separately, and summarize results across them with the objective of estimating bias, precision, or coverage probability, all in finite samples with random error, as illustrated in example 2. For both approaches, the recipe for constructing a quantitative thought experiment is rather similar, as outlined below.


4.1 Generating datasets

After a random-number seed is set to make the results reproducible, the chosen number of observations is generated; this number may either be very large—say, 1,000,000—or reflect realistic sample sizes actually available. Thus the following two lines are typical when initiating a thought experiment:

. set seed 2083675

. set obs 1000000
obs was 0, now 1000000

The thought experiment should be realistic, illustrative, and easy to follow. Therefore, it is important to consider how to build up a realistic scenario but still keep it simple: do not include more variables than necessary, keep distributions of the variables as simple as possible, etc.

Often, the only two variables that need to be defined are the exposure and the outcome. Before constructing the variables, one should define some characteristics for each of the variables. What should the range be? Is it categorical or continuous? What shape does its distribution have? Is it normally distributed or skewed, and what is the relation between the variables? Because the outcome typically depends on the exposure variable, the exposure variable must be defined first. The simplest way of generating a random variable with a given distribution is by inversion of its distribution function, because the procedure can then be based on generating uniform random variates. For example, an exposure with a range from 0 to 10 and with a decreasing distribution function can be defined as F(x) = {(1/10)x}^(1/2), which in Stata becomes

. generate exposure = runiform()^2*10

To check whether the distribution of the variable is as desired, we suggest plotting a histogram of the variable.
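For example, a quick visual check of the exposure just generated might be (one possible call, using the default bin choice):

. histogram exposure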

The outcome is then defined based on the expected association with exposure. For example, the outcome may be normally distributed with mean 50 and a standard deviation of 10 for the unexposed, X = 0, and linearly decreasing with exposure:

. generate outcome = rnormal() * 10 + 50 - exposure

The association can, of course, be more complicated, and the outcome may depend on more than one variable; see example 2.
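As an illustration of such an extension, the outcome could additionally depend on a binary group indicator and its interaction with exposure. This is a sketch only; the second covariate, its prevalence, and the coefficients are invented for this example:

. generate group = runiform() < .3

. generate outcome2 = rnormal() * 10 + 50 - exposure - 2 * group - .5 * exposure * group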

4.2 Estimation/simulation

When the scenario is established, the generated dataset can be used to investigate different expected properties of the data. For the first approach with a single large dataset, the only result of interest is the estimated effect size. This estimate can straightforwardly be found with a single estimation command, for example, a simple linear regression with robust standard errors, to account for departures from normality. By using a very large dataset, we ensure that random variation is virtually eliminated (see example 1):

. quietly regress outcome exposure, vce(robust)

For the other approach, where several datasets are generated and analyzed, more steps are needed to obtain and present results. For each of the characteristics, the following procedure should be followed:

1. Choose the number of simulations.

2. Define a program.

3. Run the simulations on the defined program.

4. Present the results.

In the following, we give a short description of the procedure for simulating the bias and the coverage probability. For both, it is essential to define the true value, and if this cannot be determined analytically, it could instead be taken as the estimate found in a very large simulated dataset, where random error is negligible.
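In the running example, that large-dataset estimate could be captured directly after fitting the model to the 1,000,000 observations; a sketch, mirroring how the true value is stored in example 2:

. quietly regress outcome exposure, vce(robust)

. local trueval = _b[exposure]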

Simulating the bias

As in example 2, the bias is estimated as the relative difference between the estimated effect and the true value. The true value is here taken to be the estimate in the generated dataset containing 1,000,000 observations. The number of simulations is not easy to determine beforehand but can be adjusted according to the resulting standard error of the bias. If we wish to get a standard error of the relative bias of 0.005, we could simply keep increasing the number of simulations until we reach this precision. Next the program used to actually estimate the parameter is defined. In this example program, the estimation is done on a simple 1% random sample of the observations, and the results are saved by specifying the eclass option. Strictly speaking, the observations are not independent, but given the large size of the source dataset, this is a problem that may be ignored. The program could look like this:

. program define pr_est, eclass
  1.     preserve
  2.     keep if runiform() < 0.01
  3.     regress outcome exposure, vce(robust)
  4.     restore
  5. end

Using the simulate command in Stata, the program is applied several times—say, 2,500—and the results of each simulation are saved in a new dataset.


. simulate _b, reps(2500) saving(datares, replace): pr_est

command: pr_est

Simulations (2500)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5

.................................................. 50

.................................................. 100

(output omitted )

.................................................. 2500

To evaluate the median relative bias, we calculate the relative bias from each of the 2,500 saved effect estimates. The median of these and its corresponding confidence interval are estimated and presented with the centile command.

. local trueval = -0.9944

. generate relbias = (_b_exposure -`trueval´)/`trueval´

. centile relbias

                                                            Binom. Interp.
    Variable |        Obs  Percentile     Centile        [95% Conf. Interval]
-------------+----------------------------------------------------------------
     relbias |       2500          50    .0007486        -.000921    .0025195

Simulating the coverage probability

Because the coverage probability is an estimated proportion with a nominal value of 95%, it can be shown from the formula for the standard error that if we choose the number of simulations to be 2,500, this will result in a standard error of less than 0.5%.
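For reference, the binomial standard error at the nominal 95% coverage with 2,500 replications is sqrt{0.95 × 0.05/2,500} ≈ 0.0044, below the stated 0.5%; this can be checked directly:

. display sqrt(.95 * .05 / 2500)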

To present the coverage probability, we must generate a variable that records whether the true value is within the computed 95% confidence interval. If we use the ci command with the binomial option on this variable, the coverage probability can be estimated and compared with the nominal value, say, 95%:

. simulate _b _se, reps(2500) saving(datares, replace): pr_est

command: pr_est

Simulations (2500)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5

.................................................. 50

.................................................. 100

(output omitted )

.................................................. 2500

. local trueval = -0.9944

. generate cp = (_b_exposure - 1.96 * _se_exposure) <= `trueval´
>     & (_b_exposure + 1.96 * _se_exposure) >= `trueval´

. ci cp, binomial

                                                            Binomial Exact
    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+----------------------------------------------------------------
          cp |       2500        .954    .0041897        .9450405    .9618754


5 Discussion

In this article, we presented two examples of how to use Stata for answering questions that could otherwise easily result in extensive speculation and debate. The main prerequisite for both examples is a rigorous specification of the scenario of interest. Invariably, we started by formulating a full statistical model for the outcome, its relationship with explanatory variables, and their distribution. Values for these are often naturally available from the actual application in which the discussions arise, so it is possible to accept arguments at face value, code them in Stata language, and simply observe what the results are.

There are, however, a few caveats to this seemingly limitless inventory of tools. The choice of contrasting scenarios often requires experience and intuition that allow one to select the features that need to be varied to obtain general results. If the scenarios do not span the relevant variation, it is easy to become misled by seemingly general results. A case in point is the choice of shapes for dose–response curves in example 1. Had the concave curve been omitted, one might have concluded that departures from a linear relationship always led to larger effect estimates with lower cutoff values.

The major advantage of using worked thought experiments is the ability to engage constructively in an ongoing development of thoughts within the study group. When, for example, the results of example 2 are presented—the complex stratified design leads to a lower precision than one would find if just running an ordinary analysis ignoring sampling weights—the following question may naturally arise: What precision would be anticipated in other studies by using the data from the LDPS, but with, say, different exposure variables from other register data? Such a question would lend itself directly to a new thought experiment where the observed sampling fractions are used to weight the anticipated analysis.

We thus hope the examples may serve as inspiration for applied statisticians who need to engage with subject matter specialists and provide intelligible guidance to them on the statistical planning and analyses on a given project. In our own practice, we have found that once we adopted this strategy, it quickly became compelling and widespread because it is very flexible, convenient, and applicable to a huge range of situations.

6 Acknowledgment

The authors want to thank the research group in the Lifestyle During Pregnancy Study for lively and engaging discussions on application of statistical methods, in particular Erik Lykke Mortensen and Ulrik Schiøler Kesmodel.


7 References

Feiveson, A. H. 2002. Power by simulation. Stata Journal 2: 107–124.

Kesmodel, U. S., M. Underbjerg, T. R. Kilburn, L. Bakketeig, E. L. Mortensen, N. I. Landrø, D. Schendel, J. Bertrand, J. Grove, S. Ebrahim, and P. Thorsen. 2010. Lifestyle during pregnancy: Neurodevelopmental effects at 5 years of age. The design and implementation of a prospective follow-up study. Scandinavian Journal of Public Health 38: 208–219.

Olsen, J. 1994. Effects of moderate alcohol consumption during pregnancy on child development at 18 and 42 months. Alcoholism: Clinical and Experimental Research 18: 1109–1113.

Olsen, J., M. Melbye, S. F. Olsen, T. I. Sørensen, P. Aaby, A.-M. N. Andersen, D. Taxbøl, K. D. Hansen, M. Juhl, T. B. Schow, H. T. Sørensen, J. Andresen, E. L. Mortensen, A. W. Olesen, and C. Søndergaard. 2001. The Danish National Birth Cohort—Its background, structure and aim. Scandinavian Journal of Public Health 29: 300–307.

Senn, S., and S. Julious. 2009. Measurement in clinical trials: A neglected issue for statisticians? Statistics in Medicine 28: 3189–3209.

Winship, C., and L. Radbill. 1994. Sampling weights and regression analysis. Sociological Methods and Research 23: 230–257.

About the authors

Theresa Wimberley has a master’s degree in statistics from the University of Aarhus. Between 2009 and 2011, she worked as a research assistant, particularly on the Lifestyle During Pregnancy Study, in the Department of Biostatistics at the University of Aarhus. Since 2012, she has been employed at the National Centre of Register-based Research as a PhD student, doing research within the field of pharmacoepidemiology.

Erik T. Parner has a PhD in statistics from the University of Aarhus. He is a professor of biostatistics at the University of Aarhus. His research fields are time-to-event analysis, statistical methods in epidemiology and genetics, and the etiology and changing prevalence of autism.

Henrik Støvring has a PhD in biostatistics from the University of Southern Denmark. He is an associate professor of biostatistics at the University of Aarhus. His research fields are statistical methods in pharmacoepidemiology, risk communication to patients, and the use of health care services. He has served as a senior statistician on the Lifestyle During Pregnancy Study.


The Stata Journal (2013) 13, Number 1, pp. 21–38

Versatile sample-size calculation using simulation

Richard Hooper
Centre for Primary Care and Public Health
Queen Mary, University of London
London, UK
[email protected]

Abstract. I present a new Stata command, simsam, that uses simulation to determine the sample size required to achieve a given statistical power for any hypothesis test under any probability model that can be programmed in Stata. simsam returns the smallest sample size (or smallest multiple of 5, 10, or some other user-specified increment) so that the estimated power exceeds the target. The user controls the precision of the power estimate, and power is reported with a confidence interval. The sample size returned is reliable to the extent that if simsam is repeated, it will, nearly every time, give a sample size no more than one increment away.

Keywords: st0282, simsam, sample size, power, simulation

1 Introduction

The statistical power of a research study is the probability that it will find evidence of an important effect. Power depends on what we mean by "important", on what counts as evidence, and on how the study is designed; but given these specifications, power depends on parameters of the design such as the sample size. In the life sciences, choosing a sample size on the basis of the power it achieves is an important ethical consideration when planning research (Newell 1978; Altman 1980).

Many methods for calculating sample size, including those for comparing independent samples using a t test or χ2 test, rely on an approximation for the relationship between sample size and power (Florey 1993; Campbell, Julious, and Altman 1995). Although the exact power achieved by a given sample size can be estimated for any hypothesis test using Monte Carlo simulation (Feiveson 2002; Eng 2004), using simulation to determine the sample size required to achieve a given power is slightly more complicated, necessitating that power be estimated at different sample sizes to find the one at which the target power is attained. This computational burden may be the reason simulation is not used more routinely as a tool for sample-size calculation. A general approach for determining sample size by simulation in Stata has been described by Feiveson (2002, 2009), and a sample-size calculator of this kind has been developed by Browne, Golalizadeh Lahi, and Parker (2009) for random-effects models in MLwiN and R, but practical and versatile software tools have not to date been widely available.



I describe a new Stata command, simsam, that uses a novel, iterative, simulation-based algorithm to determine the sample size required to achieve a given power. The user controls the precision with which power is estimated, but in early iterations, the algorithm uses less precision to make more rapid progress. simsam assumes that code for generating and analyzing a single dataset is provided in a separate program; thus simsam can calculate sample size for any method of analysis under any probability model that can be programmed in Stata. In fact, although the term "sample size" is used throughout this article, simsam can be used to determine any design parameter that, when increased (with all other parameters fixed), causes the power to increase. There could be more than one parameter, such as duration of recruitment and duration of follow-up or number of clusters and number of participants per cluster.

2 The simsam command

2.1 Syntax

simsam subcommand n_option_name, inc(#) prec(#) [power(#) alpha(#)
      detect(options) null(options) assuming(options) start(#) iter(#)
      notable pvalue(name) level(#)]

or

simsam continue [, inc(#) prec(#) null(options) iter(#) notable]

2.2 Options

inc(#) specifies the sample-size increment. simsam returns a sample size that is a multiple of the increment. This option is required except with simsam continue, where if inc() is not specified, it is assumed to be the same as in the previous simsam command. inc() must be an integer (simsam will only consider integer values for the sample size).

prec(#) specifies the precision of the final estimate of power. This option is required except with simsam continue, where if prec() is not specified, it is assumed to be the same as in the previous simsam command. The precision is the half-width of the confidence interval for power, whose level is set by the level() option. The actual confidence interval is calculated using an exact binomial method, so it will not always match the specified precision exactly.

power(#) specifies the target for statistical power as a decimal fraction. The default is power(0.9).

alpha(#) specifies the significance level. The default is alpha(0.05).


detect(options) lists the options to subcommand that specify the effect to be detected.

null(options) lists the options to subcommand that specify the null model.

assuming(options) lists any other options to subcommand that specify additional assumptions.

start(#) specifies the starting value for sample size in the iterative algorithm. The default is start(100). The algorithm can generally find its way to any required sample size, large or small, from a starting value of 100, but sometimes a judicious choice of starting value will ensure quicker convergence.

iter(#) specifies the maximum number of iterations. The default is iter(10), which will be sufficient in many applications. The largest number that can be specified is iter(99).

notable suppresses the output table containing results from each iteration.

pvalue(name) identifies the returned scalar containing the p-value. The default is pvalue(p), which means that after execution of subcommand, simsam looks for the p-value in r(p).

level(#) specifies the confidence level for estimates of power. The default is level(99) to ensure good coverage and to emphasize the distinction from sample-based estimates, which would usually be quoted with 95% confidence. The level can be set to any integer between 90 and 99, but precision will usually be controlled using the prec() option, so level() can be kept at its default.

2.3 Saved results

simsam saves the following scalars and macros in r(). Some of these are used by simsam continue to pick up where simsam left off; therefore, users should not use another rclass command after simsam if they intend to use simsam continue.

Scalars
    r(n)         required sample size
    r(p)         estimate of power at sample size n
    r(pl)        lower confidence limit for power
    r(pu)        upper confidence limit for power
    r(alpha)     significance level
    r(power)     target power
    r(inc)       increment
    r(prec)      precision
    r(reps)      number of replications used to estimate power
    r(tryn)      sample size to try next
    r(nullp)     estimate of power under null
    r(nullpl)    lower confidence limit for power under null
    r(nullpu)    upper confidence limit for power under null
    r(level)     confidence level for estimates of power
    r(prec_inc)  precision and increment ratio


Macros
    r(subcomm)   subcommand
    r(noption)   name of option for sample size
    r(pvalue)    where p-value is returned by subcommand
    r(detect)    specification of the effect to be detected
    r(null)      specification of the null model
    r(assuming)  additional options
    r(ntried)    sample sizes tried so far, at full precision
    r(exitcond)  exit condition
    r(phase)     phase of algorithm (heuristic or step-down)

3 The subcommand

In the simsam syntax, subcommand is the name of a program containing code to generate and analyze a single dataset. n_option_name is the name of an option to subcommand that controls the sample size. Essential features of the subcommand program are that

(a) options follow standard syntax, and there are no arguments before the comma;

(b) data in memory are cleared before the new dataset is generated; and

(c) it is an r-class command, returning the p-value as a scalar.

Where speed is of the essence, the program should also be as lean as possible. Requirement (b) also allows subcommand to be used with Stata's simulate command.

Example

The following program could be used as a subcommand for simsam to calculate sample size for a two-sample t test (though a faster method in this case would be to use Stata's sampsi command). The program generates normally distributed observations in two independent samples and then applies a t test. It has required options d() (mean difference), sd() (standard deviation in each group), and npergrp() (sample size per group).

. program define s_ttest, rclass
  1.         version 12.0
  2.         syntax , D(real) SD(real) NPERGRP(integer)
  3.         drop _all
  4.         set obs `=2*`npergrp''
  5.         gen group=mod(_n, 2)
  6.         gen x=rnormal(`d'*group, `sd')
  7.         capture noisily ttest x, by(group)
  8.         return scalar p=r(p)
  9. end

simsam assumes by default that the p-value will be returned in r(p), but an alternative can be specified with the pvalue() option. For example, pvalue(p_exact) tells simsam to look for the p-value in r(p_exact). This is useful if the user wants to choose from more than one returned p-value, for example, exact and approximate or one-tailed and two-tailed. simsam also assumes that if the p-value is missing—for example, if an error was captured during the analysis and a p-value was not returned—the result is to be considered nonsignificant. This allows for situations where legitimately arising data cannot be analyzed—for example, a logistic regression on a contingency table with a zero cell. Once a new program has been checked for syntax errors, errors in the analysis should be routinely captured so that they do not halt the progress of simsam. In particular, capture noisily should be used, as in the example above, so that running subcommand by itself produces noisy output.
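For instance, a hypothetical subcommand (my own sketch, not part of the simsam package) could compare two proportions with Fisher's exact test and return the p-value as r(p_exact); simsam would then be called with the option pvalue(p_exact). The option names p1(), p2(), and npergrp() are illustrative assumptions:

program define s_fisher, rclass
    version 12.0
    syntax , P1(real) P2(real) NPERGRP(integer)
    drop _all
    set obs `=2*`npergrp''
    gen group = mod(_n, 2)
    * binary outcome with success probability p1 in group 0 and p2 in group 1
    gen y = runiform() < cond(group, `p2', `p1')
    * tabulate with the exact option leaves Fisher's exact p-value in r(p_exact)
    capture noisily tabulate group y, exact
    return scalar p_exact = r(p_exact)
end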

4 Basic use

In basic use, there are two required options to simsam: inc(), which specifies the increment for sample size, and prec(), which sets the precision of the final estimate of power (see section 2.2). A certain degree of precision relative to the increment is necessary for simsam to converge reliably on a solution, though this will also depend on what the required sample size turns out to be; if during its execution simsam suspects that there is a problem with precision, it will halt and give further advice (this essentially involves checking whether successive iterations of sample size are determined to within one increment, subject to the uncertainty in the power estimate; see section 8.4).

Options for target power and significance and their default values are the same as for Stata's sampsi command. The other options likely to be specified are detect() and assuming(). Their use is best illustrated with an example.

Example

Suppose you want to use the s_ttest command defined above to calculate the sample size required to detect a mean difference of 0.5 between two independent groups using a t test, with 80% power at the 5% significance level, assuming a standard deviation in each group of 1.0. As will be seen in a later example, you can play with the increment and precision once an initial solution has been obtained; to start, you may prefer a fairly wide increment and precision—in this case an increment of 10 and a precision of 1%.

(Note: All examples in this article were run in Stata version 12. Because of changes to Stata's random-number generator, results can be version-dependent even with the same seed specified.)


. version 12.0

. set seed 20120301

. simsam s_ttest npergrp, power(0.8) alpha(0.05) detect(d(0.5))
> assuming(sd(1.0)) inc(10) prec(0.01)

  iteration    npergrp                   power  (99% CI)

          1        100  ...........     0.9200  (0.8239, 0.9737)
          2         70  ...........     0.8110  (0.7771, 0.8418)
          3         70  ...........     0.8368  (0.8274, 0.8459)
          4         60  ...........     0.7766  (0.7661, 0.7870)

npergrp = 70
achieves 83.68% power (99% CI 82.74, 84.59)
at the 5% significance level
to detect
d = 0.5
assuming
sd = 1.0

If continuing, use prec/inc < 2.8e-03

The final results show that a sample size of 70 per group achieves an estimated power of 83.68%. This is the smallest multiple of 10 for which estimated power exceeds 80%.

In the example, options listed inside detect() and assuming() are all passed as options to s_ttest, along with the option npergrp(n), where n is the working sample size at each iteration. In this situation, detect() and assuming() are essentially interchangeable (other than influencing how parameters are reported in the final output).

The output table updates as each iteration is completed. It can be suppressed with the notable option. The line of dots in each row marks the progress of the iteration. The interval between dots represents one tenth of the time required for the whole iteration. This interval gets longer in later iterations; that is, the dots appear more slowly as the iterations continue (this differs from the simulate command).

The algorithm used by simsam to determine sample size is discussed in more detail in the next section.

5 The algorithm

5.1 Background

To find the minimum sample size at which power exceeds a given threshold, we have to estimate power at more than one sample size. In the Stata examples given by Feiveson (2002, 2009) and in the MLPowSim software developed by Browne, Golalizadeh Lahi, and Parker (2009), the user specifies an upper and lower limit for sample size together with a sample-size increment; the algorithm considers every sample size in turn, starting at the lower limit and increasing by the given increment until the upper limit is reached.


Williams, Ebel, and Wagner (2007) suggested a more efficient binary search algorithm in which half the possible sample sizes are ruled out at each iteration. Starting with a range of possible sample sizes (n1, n2), this algorithm estimates the power at the midpoint of the range nmid and then repeats the process either on the range (n1, nmid) or on the range (nmid, n2), depending on whether the power at nmid is above or below the required power. Jung (2008) suggested a binary search in which the number of replications varies: it starts with 100 replications at each sample size and increases to 1,000 replications once the algorithm has narrowed down the search. simsam develops these ideas further using an algorithm that combines 1) an intelligent or heuristic search for sample size and 2) a progressively increasing number of replications.

5.2 Heuristic search

A binary search can rule out half the sample sizes at each iteration because the relationship between sample size and power can be assumed to be an increasing one. If we could assume something about the form of the relationship between sample size and power, it might be possible to home in more rapidly on the desired solution. If we knew the exact relationship, we could go directly to the solution; but even if we can only guess an approximate relationship, we should still be able to find a better guess at sample size and then apply the same process iteratively to converge on a solution.

simsam makes a guess based on a normal approximation: it assumes the hypothesis test is based on a normally distributed estimate of effect size with standard error decreasing as one over the square root of sample size. With this assumption, simsam can formulate the relationship between power and sample size from a single estimate of power at a single sample size and in this way jump to a new, improved guess at sample size (see section 8.1).

5.3 Increasing the number of replications

simsam starts with 100 replications in the first iteration and multiplies this by 10 at each iteration until the required maximum is reached, after which it continues using the maximum number of replications at each iteration. The maximum is calculated from the precision specified by the user for the power estimate (see section 8.2). Suppose that the maximum number of replications is 1,000,000. With this scheme, we can afford to spend the first four iterations converging on a solution. As long as we aim to have converged by the time we reach 1,000,000 replications, then the earlier iterations will only have consumed 111,100 replications in total—less than 1/9 of the 1,000,000 required to estimate the power at the final iteration. This is much more efficient than using the full number of replications to evaluate every different sample size.


5.4 Terminating the search

The algorithm requires a robust approach to decide when to stop. It proceeds in two phases: a heuristic phase as described above, followed by a step-down phase in which the strategy changes to an incremental search.

During the heuristic phase, once the maximum number of replications is reached, the algorithm maintains a list of the sample sizes it has already tried, a record of the smallest estimated power exceeding the target, and the sample size that achieved this (the best sample size so far). When the algorithm determines that the next sample size is one it has already tried, it reverts to the "best" sample size and switches over from the heuristic phase to the step-down phase.

In the step-down phase, the algorithm reduces the sample size by one increment at each iteration. It stops when 1) it reaches a sample size it has already tried (for which the estimated power must have been below the target); 2) it reaches a sample size of zero; or 3) it estimates the power to be below the target. The step-down phase ensures we have found the minimum required sample size by confirming that the power at a smaller sample size is below the target.

Example

The processes acting behind the scenes in the example above can now be explained. We are trying to estimate a power of 80% with a 99% confidence interval of ±1%. This requires 10,620 replications. The first iteration uses 100 replications of sample size 100 and estimates the power to be 92.00%. This results in a revised guess of 70 for the sample size. The second iteration uses 1,000 replications of sample size 70 and estimates the power to be 81.10%. This again results in a guess of 70 (rounded up to the nearest 10) for the required sample size. The third iteration uses the maximum number of replications (10,620—the algorithm skips 10,000 because this is almost the maximum anyway) with sample size 70 and estimates the power to be 83.68%.

The revised guess at the required sample size is 70 again (this is not shown in the table but can be inferred from what simsam does next). Because this is the second time a sample size of 70 has been suggested at the maximum number of replications, the algorithm switches over to the step-down phase and looks instead at a sample size of 60. Power in this case is estimated to be 77.66%. Because the estimated power at sample size 70 is 83.68% and at sample size 60 is 77.66%, simsam concludes that the smallest sample size to the nearest 10 for which power exceeds 80% is 70.


6 Continuing after a previous command

simsam may halt before a solution has been obtained either because it suspects a problem that might prevent convergence or because it has completed a prespecified number of iterations. In fact, simsam will cease iterating under one of five exit conditions:

1. the algorithm has converged according to the criteria defined in the previous section;

2. the number of iterations specified in the iter() option has been completed, the default being 10 (the largest number that can be specified is 99);

3. simsam suspects that given the precision relative to the increment specified, the sample size cannot be reliably determined to within one increment;

4. the estimated power is unexpectedly low (if the power is less than the significance level, this could indicate a problem with subcommand—in particular, a power of 0 may indicate the program is consistently failing to return a p-value); or

5. the working sample size is continually increasing, but the power is not being controlled as expected.

After any of these exit conditions, simsam may be restarted using the simsam continue command. simsam continue uses saved results to continue where simsam left off and does not have any required options, though it does allow the user to alter the increment and the precision before continuing. After exit conditions 1 and 3, simsam estimates the precision relative to the increment that is required if the user wants to continue.

Cases 4 and 5 are not considered errors because they could arise legitimately (for example, through sampling error or poor choice of starting value), but using simsam continue in these cases is unlikely to solve the underlying problem, and the program is liable simply to halt again. Continuing after simsam has converged, without changing the increment or precision, will just cause the previous result to be output again. In fact, simsam continue is likely to be most useful in three situations, which are illustrated with examples below: 1) stopping and restarting after a fixed number of iterations; 2) obtaining a rough solution before proceeding to a higher-precision one; and 3) amending the precision or increment to allow convergence.


Example

. version 12.0

. set seed 20120301

. simsam s_ttest npergrp, power(0.8) alpha(0.05) detect(d(0.5)) assuming(sd(1))
> inc(10) prec(0.01) iter(2)

  iteration    npergrp                   power  (99% CI)

          1        100  ...........     0.9200  (0.8239, 0.9737)
          2         70  ...........     0.8110  (0.7771, 0.8418)

Warning: did not converge within 2 iterations

. simsam continue, iter(2)

  iteration    npergrp                   power  (99% CI)

          1         70  ...........     0.8368  (0.8274, 0.8459)
          2         60  ...........     0.7766  (0.7661, 0.7870)

npergrp = 70
achieves 83.68% power (99% CI 82.74, 84.59)
at the 5% significance level
to detect
d = 0.5
assuming
sd = 1

If continuing, use prec/inc < 2.8e-03

Example

. version 12.0

. set seed 20120301

. simsam s_ttest npergrp, power(0.8) alpha(0.05) detect(d(0.5)) assuming(sd(1))
> inc(10) prec(0.01)

iteration npergrp power (99% CI)

          1        100  ...........     0.9200  (0.8239, 0.9737)
          2         70  ...........     0.8110  (0.7771, 0.8418)
          3         70  ...........     0.8368  (0.8274, 0.8459)
          4         60  ...........     0.7766  (0.7661, 0.7870)

npergrp = 70
achieves 83.68% power (99% CI 82.74, 84.59)
at the 5% significance level
to detect
d = 0.5
assuming

sd = 1

If continuing, use prec/inc < 2.8e-03


. simsam continue, inc(1) prec(0.001)

iteration npergrp power (99% CI)

          1         70  ...........     0.8383  (0.8352, 0.8412)
          2         64  ...........     0.8010  (0.8000, 0.8020)
          3         63  ...........     0.7949  (0.7939, 0.7959)

npergrp = 64
achieves 80.10% power (99% CI 80.00, 80.20)
at the 5% significance level
to detect
d = 0.5
assuming

sd = 1

If continuing, use prec/inc < 3.1e-03

Example

. version 12.0

. set seed 20120301

. simsam s_ttest npergrp, power(0.8) alpha(0.05) detect(d(0.5)) assuming(sd(1))
> inc(1) prec(0.01)

iteration npergrp power (99% CI)

1 100 ........... 0.9200 (0.8239, 0.9737)

Warning: npergrp not reliably determined to within one increment

If continuing, use prec/inc < 2.8e-03

. simsam continue, inc(10) prec(0.01)

iteration npergrp power (99% CI)

          1         70  ...........     0.8110  (0.7771, 0.8418)
          2         70  ...........     0.8368  (0.8274, 0.8459)
          3         60  ...........     0.7766  (0.7661, 0.7870)

npergrp = 70
achieves 83.68% power (99% CI 82.74, 84.59)
at the 5% significance level
to detect
d = 0.5
assuming

sd = 1

If continuing, use prec/inc < 2.8e-03


7 Validating a sample-size calculation

One of the advantages of simsam as a system for sample-size calculation is that it is relatively easy to share and to validate. This is useful, for example, if a sample-size calculation is to be included in a grant proposal.

7.1 Introduction to the example

Suppose there is a proposal for a randomized trial of a community intervention—the AARDVARK trial—that is to be randomized by household and delivered either to one adult in the household or to two if there is a married or cohabiting couple living there. The investigators assume the proportion of households with couples is 0.3; the mean difference in outcome between intervention and control groups is 0.5; the standard deviation within each group is 1.0; and the intracluster correlation is 0.5. They intend to analyze their data using random-effects regression, testing the group difference with a likelihood-ratio test. They report that a sample size of 58 households per group achieves 80% power at the 5% significance level and submit the following command, which they used with simsam to calculate sample size:

. program define aardvark, rclass
  1.         version 12.0
  2.         syntax, PCOUPLE(real) ICC(real) D(real) SD(real) NHOUSPERGRP(integer)
  3.         drop _all
  4.         set obs `=2*`nhouspergrp''
  5.         scalar sigmaa=sqrt(`icc')*`sd'
  6.         scalar sigmae=sqrt(1-`icc')*`sd'
  7.         gen group=mod(_n,2)
  8.         gen k=1+(runiform()<`pcouple')
  9.         gen ybar=rnormal(`d'*group, sigmaa)
 10.         gen housid=_n
 11.         expand k
 12.         gen y=rnormal(ybar, sigmae)
 13.         capture noisily {
 14.             xtreg y group, i(housid) mle
 15.             estimates store model1
 16.             xtreg y, i(housid) mle
 17.             estimates store model0
 18.             lrtest model1 model0
 19.         }
 20.         return scalar p=r(p)
 21. end


7.2 Repeating the calculation and estimating the power under the null hypothesis

Because simsam only converges on a solution if it can do so reliably, repeating the simsam command will nearly always give the same answer (or, at worst, a sample size one increment away). Thus one way for the funding panel to validate the investigators' calculation is simply to repeat it. simsam also allows the user to estimate the power under the null hypothesis at the required sample size. This can be specified as an additional option, null(), when repeating the sample-size calculation. The example below takes some time to run (over 24 hours, depending on system specification), but in the timescale of a typical grant application, this should be no great burden (and without the efficient algorithm used by simsam, it might take weeks to determine the sample size by simulation).

. version 12.0

. set seed 20120301

. simsam aardvark nhouspergrp, power(0.8) alpha(0.05) null(d(0)) detect(d(0.5))
> assuming(sd(1.0) icc(0.5) pcouple(0.3)) inc(1) prec(0.001)

iteration nhousp~p power (99% CI)

          1        100  ...........     0.9900  (0.9280, 0.9999)
          2         43  ...........     0.6400  (0.5998, 0.6788)
          3         63  ...........     0.8443  (0.8347, 0.8535)
          4         56  ...........     0.7919  (0.7885, 0.7952)
          5         58  ...........     0.8024  (0.8014, 0.8033)
          6         57  ...........     0.7954  (0.7944, 0.7964)

null 58 ........... 0.0523 (0.0512, 0.0533)

nhouspergrp = 58
achieves 80.24% power (99% CI 80.14, 80.33)
at the 5% significance level
to detect
d = 0.5
assuming
sd = 1.0
icc = 0.5

pcouple = 0.3

under null: 5.23% power (99% CI 5.12, 5.33)

If continuing, use prec/inc < 3.4e-03

This reproduces the investigators' solution of 58 households per group. The power under the null is estimated to be 5.23%. In general, we would expect power under the null to equal significance. Here we must consider the validity of the subcommand program (see section 7.3). If we are satisfied with this (and excluding sampling variation as an explanation), we must conclude that a likelihood-ratio test in this situation is slightly biased. If the sample-size calculation is repeated with a nominal significance level of 4.8% to bring the true significance level closer to 5%, then the required sample size becomes 59 households per group.


One note of caution: If power is defined as the probability of correctly rejecting the null hypothesis under a given alternative, then in a two-sided context, we need to be clear what we mean by "correctly rejecting". Power in the two-sided case is often calculated as the probability that the two-tailed p-value is significant and the observed effect is in the correct direction. For simsam to calculate this "directional" power, we need the subcommand to return the relevant one-sided p-value multiplied by two rather than the two-sided p-value. In this case, the power under the null would be half the significance level. In practical terms, this distinction turns out to be of little importance: when the directional power is 80%, the nondirectional power is only around 0.0001% greater using a normal approximation.
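As an illustration (a sketch of my own, not code from the article), the s_ttest program shown earlier could be modified to return a directional p-value when the anticipated effect is a higher mean in group 1. In Stata's ttest parameterization, diff is the mean of group 0 minus the mean of group 1, so the relevant one-sided p-value here is r(p_l):

program define s_ttest_dir, rclass
    version 12.0
    syntax , D(real) SD(real) NPERGRP(integer)
    drop _all
    set obs `=2*`npergrp''
    gen group = mod(_n, 2)
    gen x = rnormal(`d'*group, `sd')
    capture noisily ttest x, by(group)
    * double the one-sided p-value for Ha: diff < 0, capping at 1 so that an
    * effect in the wrong direction is never counted as a rejection
    return scalar p = min(1, 2*r(p_l))
end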

With the null() option in use, the distinction between detect() and assuming() is clearer. To calculate the required sample size, you pass all options in detect() and assuming() to subcommand. To calculate power under the null, you pass all options in null() and assuming() to subcommand. In other words, assuming() contains assumptions that hold under both the null and the alternative.

The null() option can also be used with simsam continue. If the inc() and prec() options are not set, then power under the null is calculated for the previous simsam solution.

7.3 Validating the subcommand

The command aardvark certainly appears to have face validity; that is, it seems to do what is intended. To investigate further what the command is doing, we could run it separately and follow up with additional checks:

. version 12.0

. set seed 20120301

. quietly aardvark, pcouple(0.3) icc(0.5) d(0.5) sd(1.0) nhouspergrp(58)

. table group, contents(mean y sd y)

group mean(y) sd(y)

0    -.1695986    .9529981
1     .4627651    1.021282


. bysort group: loneway y housid

-> group = 0

One-way Analysis of Variance for y:

Number of obs =      73
R-squared     =  0.9582

Source SS df MS F Prob > F

Between housid     62.658676     57    1.099275      6.04     0.0002
Within housid       2.732109     15    .1821406

Total 65.390785 72 .90820535

        Intraclass       Asy.
        correlation      S.E.       [95% Conf. Interval]

0.80037 0.07926 0.64501 0.95572

Estimated SD of housid effect          .8545366
Estimated SD within housid             .4267793
Est. reliability of a housid mean       0.83431

(evaluated at n=1.26)

-> group = 1

One-way Analysis of Variance for y:

Number of obs =      76
R-squared     =  0.9120

Source SS df MS F Prob > F

Between housid     71.341771     57     1.25161      3.27     0.0037
Within housid      6.8844844     18   .38247136

Total 78.226255 75 1.0430167

        Intraclass       Asy.
        correlation      S.E.       [95% Conf. Interval]

0.63477 0.12855 0.38282 0.88673

Estimated SD of housid effect          .8153182
Estimated SD within housid             .6184427
Est. reliability of a housid mean       0.69442

(evaluated at n=1.31)

We could also run simulations involving aardvark using the simulate command to investigate its behavior further.

7.4 Comparison with approximate methods

If there were no clustering by household, a conventional sample-size calculation would indicate that 63 participants per group were required (Campbell, Julious, and Altman 1995). If we use a standard adjustment for the "design effect" introduced by clustering (see Donner, Birkett, and Buck [1981]), calculated using the average cluster size of 1.3 and the intracluster correlation of 0.5, the required sample size per group becomes 73 participants, or 56 households. Variability in cluster size will reduce the power. An alternative adjustment that is based on the coefficient of variation of the distribution of cluster size and is conservative is discussed by Eldridge, Ashby, and Kerry (2006). Using this adjustment, we obtain a required sample size of 78 participants per group, or 60 households. From this, we would suspect that the true required sample size was somewhere between 56 and 60 households per group, which agrees with the solution obtained by simsam.

8 Methods and formula

8.1 Successive iterations of sample size in the heuristic phase

In the case where the null and alternative hypotheses differ by one degree of freedom, we can imagine the hypothesis test to be based on an unbiased estimate of the effect size δ, which is roughly normally distributed with standard error σ/√n, n being the sample size. Under this assumption of normality, the relationship between sample size n and type II error probability β is

    \frac{\delta}{\sigma/\sqrt{n}} = z_{1-\alpha/2} + z_{1-\beta}

where z_p is the pth percentile of the standard normal distribution, and α is the type I error probability; that is,

    n = (\sigma/\delta)^2 \left(z_{1-\alpha/2} + z_{1-\beta}\right)^2

or equivalently

    n \left(z_{1-\alpha/2} + z_{1-\beta}\right)^{-2} = \text{constant}

Suppose at iteration i we are considering sample size n_i, at which we estimate the power to be 1 − β̂_i. Then we can use the above equation to calculate a new sample size n_{i+1} aimed at achieving the target power 1 − β*:

    n_{i+1} = n_i \left(\frac{z_{1-\alpha/2} + z_{1-\beta^*}}{z_{1-\alpha/2} + z_{1-\hat{\beta}_i}}\right)^2 = n_i f(\hat{\beta}_i)

simsam rounds this up to the next multiple of the chosen increment.
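As a numerical check (my own arithmetic, not simsam output), this update reproduces the jump from a sample size of 100 to 70 per group seen in the first example of section 4, where the power at n = 100 was estimated as 0.92 with α = 0.05, a target power of 0.8, and an increment of 10:

. display ceil(100*((invnormal(0.975)+invnormal(0.8))/(invnormal(0.975)+invnormal(0.92)))^2/10)*10
70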


8.2 Number of replications

The maximum number of replications r is calculated from the target power 1 − β* (or the significance level if the power under the null is being calculated), the precision δ_β (the half-width of the confidence interval), and the confidence level 1 − λ using standard methods for estimating proportions:

    r = \beta^*(1-\beta^*)\left(\frac{z_{1-\lambda/2}}{\delta_\beta}\right)^2

simsam rounds this number up to the nearest multiple of 10.
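For example (again my own arithmetic), with a target power of 0.8, a precision of 0.01, and a 99% confidence level (the settings used in the example of section 4), the formula gives approximately 10,616 replications, which simsam rounds up to the 10,620 quoted in section 5.4:

. display 0.2*0.8*(invnormal(1-0.01/2)/0.01)^2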

8.3 Confidence intervals for power

Confidence intervals for power are calculated by the Stata command cii using an exact binomial method; see [R] ci.

8.4 Required ratio of precision to increment

After each iteration, simsam assesses whether the required sample size can be reliably determined to within one increment. To do this, it imagines the power to be estimated as 1 − β* (confidence interval [1 − β* − δ_β, 1 − β* + δ_β]) at the most up-to-date guess at required sample size n_{i+1} and looks at what the difference would be between the next iteration of sample size at the upper and lower confidence limits, respectively. This difference should be less than the increment m for there to be confidence that the next iteration of sample size does not change by more than one increment:

    \{f(\beta^* + \delta_\beta) - f(\beta^* - \delta_\beta)\}\, n_{i+1} < m

If this condition is not met, simsam halts. In this case, and also when it converges on a solution, simsam provides the user with a suggested precision-to-increment ratio. It does this by approximating the left-hand side of the above inequality using the derivative of f with respect to β, which is f′, and using its best guess at sample size n (either n_{i+1} if halted prematurely or the solution on which it has converged). Thus

    2\,\delta_\beta\, f'(\beta^*)\, n < m

that is,

    \frac{\delta_\beta}{m} < \frac{1}{2 n f'(\beta^*)} = \frac{\left(z_{1-\alpha/2} + z_{1-\beta^*}\right) e^{-z_{1-\beta^*}^2/2}}{4\sqrt{2\pi}\, n}
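Plugging in α = 0.05, β* = 0.2, and the solution n = 70 from the examples of sections 4 and 6 reproduces the suggested ratio of 2.8e-03 reported there (again a check of my own, not simsam output):

. display (invnormal(0.975)+invnormal(0.8))*exp(-invnormal(0.8)^2/2)/(4*sqrt(2*_pi)*70)

which evaluates to approximately 2.8e-03.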


9 References

Altman, D. G. 1980. Statistics and ethics in medical research: III How large a sample? British Medical Journal 281: 1336–1338.

Browne, W. J., M. Golalizadeh Lahi, and R. M. A. Parker. 2009. A guide to sample size calculations for random effect models via simulation and the MLPowSim software package. http://www.bristol.ac.uk/cmm/software/mlpowsim/mlpowsim-manual.pdf.

Campbell, M. J., S. A. Julious, and D. G. Altman. 1995. Estimating sample sizes for binary, ordered categorical, and continuous outcomes in two group comparisons. British Medical Journal 311: 1145–1148.

Donner, A., N. Birkett, and C. Buck. 1981. Randomization by cluster: Sample size requirements and analysis. American Journal of Epidemiology 114: 906–914.

Eldridge, S. M., D. Ashby, and S. Kerry. 2006. Sample size for cluster randomized trials: Effect of coefficient of variation of cluster size and analysis method. International Journal of Epidemiology 35: 1292–1300.

Eng, J. 2004. Sample size estimation: A glimpse beyond simple formulas. Radiology 230: 606–612.

Feiveson, A. H. 2002. Power by simulation. Stata Journal 2: 107–124.

———. 2009. FAQ: How can I use Stata to calculate power by simulation? http://www.stata.com/support/faqs/statistics/power-by-simulation/.

Florey, C. D. 1993. Sample size for beginners. British Medical Journal 306: 1181–1184.

Jung, S.-H. 2008. Sample size calculation for paired survival data: A simulation method. Journal of Statistical Computation and Simulation 78: 85–92.

Newell, D. J. 1978. Type II errors and ethics (letter). British Medical Journal 4: 1789.

Williams, M. S., E. D. Ebel, and B. A. Wagner. 2007. Monte Carlo approaches for determining power and sample size in low-prevalence applications. Preventive Veterinary Medicine 82: 151–158.

About the author

Richard Hooper is a senior lecturer in medical statistics at the Centre for Primary Care and Public Health in the Blizard Institute at Queen Mary, University of London, where he is an adviser for Research Design Service London, and he is a senior statistician with the Pragmatic Clinical Trials Unit.


The Stata Journal (2013) 13, Number 1, pp. 39–64

Sar: Automatic generation of statistical reports using Stata and Microsoft Word for Windows

Giovanni L. Lo Magno
Department of Agriculture and Forestry Sciences
University of Palermo
Palermo, Italy

[email protected]

Abstract. The output provided by most Stata commands is plain text not suitable to be presented or published. After the numerical and graphical outputs are obtained, the user has to copy them into a word processor to complete the editing process. Some Stata commands help you to obtain well-formatted output, especially tabulated results in LaTeX or other formats, but they are not a complete solution nor are they friendly tools. Stata automatic report (Sar) is an easy-to-use macro for Microsoft Word for Windows that allows a powerful integration between Stata and Word. With Sar, the user can retrieve numerical results and graphs from Stata and automatically insert them into a well-formatted Word document, exploiting all the functions of Word. This process is managed by Word while Stata is executed in the background. Sar requires Stata commands and some specific Sar commands to be written in ordinary Word comments. Thus the report is well documented, and this can encourage the sharing of the workflow of data analysis and the reproducibility of the research. With Sar, the user can create an automatic report, that is, a Word document that can be automatically updated if data have changed. Sar works only on Windows systems.

Keywords: pr0055, Sar, Stata Automation object, report automation, Microsoft Word, reproducible research, Automation, OLE

1 Introduction

Presenting results is an essential step in the workflow of data analysis (Long 2008); nevertheless, it is often not well integrated with the other steps of the process. Once the statistical analysis is completed, results must be copied manually from Stata to other software to create presentations, reports, articles, books, or webpages. An efficient workflow should be automated, allowing the user to focus on core issues and save time.

Several user-written Stata commands are available to produce output that is readily usable in a LaTeX environment (Knuth 1986; Lamport 1994) or in editing software like a word processor. The listtex command by Newson (2003) uses values from a Stata dataset to create rows of a table in several formats: LaTeX, HTML, or plain text with separators. The textab command by Hardin (1995) generates TeX code to format a table containing values from a Stata dataset, so it allows some options to customize the table. The estout command by Jann (2005) is a tool for creating tables from stored estimates. The esttab command by Jann (2007) simplifies the usage of the estout command. A command dedicated to the creation of a table of regression results is the outreg command by Gallup (2012).

Generally, the existing commands present the following limitations: 1) most of them are LaTeX oriented, and many users think working in LaTeX is difficult or boring; 2) their syntax is complex and not easy to remember; and 3) they do not adopt a "what you see is what you get" approach.

In this article, I present software called Stata automatic report (Sar), whose current version is 1.1. Sar is not a Stata command but a macro for Microsoft Word, written in the Visual Basic for Applications (VBA) programming language. In short, a macro is a program executed by Word itself, often used to automate repetitive tasks or to extend the functions of Word.

Sar is used to facilitate the creation of Word documents that present numerical results obtained through Stata. To do this, Sar directly retrieves data from Stata, exploiting Stata Automation (for details, see http://www.stata.com/automation/), and automatically inserts them in the Word document. This basic feature avoids the process of manually copying numerical output from Stata to Word.

Stata Automation is a framework allowing Windows applications to interact with Stata. To use Stata Automation, you have to install the Stata Automation object (see section 2). Because Sar exploits Stata Automation, it can run only on Windows machines.

Sar allows the creation of automatic reports, that is, Word documents that can update themselves if data change. Stata commands used to produce the statistical output and the Sar commands used to manage formatting have to be written in Word comments. These comments are associated with portions of text (for example, tables) that are updated with data retrieved from Stata according to the commands written in the comments. This way, you can obtain a well-documented and self-explanatory document. You also have the great advantage of using Word to edit the layout of the report in a "what you see is what you get" approach.

Only comments with the user initials "sar" (in lowercase) are processed by Sar, so you can continue to use ordinary comments in your Word document (see the Word documentation to learn how to change user initials).

This article is organized as follows. In section 2, I will explain how to install the Sar macro and how to enable the interprocess communication between Word and Stata. In section 3, I will show the use of Sar for the first time, with a simple but enlightening example. In section 4, I will explain how to set the format of numerical outputs obtained through Sar. In section 5, I will show how to create tables, with particular attention to tables of estimates from regression analysis. In section 6, I will explain how to automatically insert graphs—created in Stata—into the Word document. In section 7, I will present an advanced feature of Sar: the use of Sar programs. In section 8, I will describe how to use Sar in interactive mode, an alternative way to retrieve data from Stata. In section 9, I will explain the syntax of all the Sar commands and provide some brief descriptions of them. In section 10, I will describe some limitations of Sar, and in section 11, I will close the article with my conclusions.

2 What you need to use Sar

Sar works only on Microsoft Word for Windows. Both Stata and Microsoft Word must be installed on your computer.

To use Sar, you need to download the Word macro-enabled template file Stata automatic report 1.1.dotm from http://www.stata-journal.com/software/sj13-1/pr0055/Stata automatic report 1.1.dotm and copy it into the Word Startup folder. This file contains the Sar 1.1 macro. To find the Startup folder path, open Word, click on Customize Quick Access Toolbar > More Commands... > Advanced in the Word options dialog window, and then click on the File Locations... button.

You also need to install the Stata Automation object. See http://www.stata.com/automation/, section 3.1, for instructions. See that same URL for general information about the Stata Automation object.

Once the macro has been copied into the Word Startup folder, make sure the macro was found and loaded by Word (please see the Word documentation). To easily run the Sar macro, you should create a button on the customizable Quick Access Toolbar of Word, as in figure 1. Do the following steps: 1) click on the arrow to the right of the icons on the Quick Access Toolbar and choose More Commands...; 2) select Macros from the Choose commands from: list; 3) select the Stata Automatic Report macro, then click on Add; and 4) click on the Modify button to assign an icon to the button and change its name to Stata automatic report 1.1.


Figure 1. Customize the Quick Access Toolbar in Word

Optionally, you can create a keyboard shortcut to run the Sar macro. To create a keyboard shortcut, see the Word documentation.

3 A quick look at how Sar works

First of all, I will show you how Sar works with the help of a simple example. Suppose you want to create an automatic report where the mean of the price variable of auto.dta is discussed. You have to create a Word file where a portion of text, henceforth called "placeholder" text, is commented with a mix of Stata and Sar commands. I suggest you use the conventional character "X" as a general placeholder. After the Sar macro has been launched, the placeholder will be replaced by the numerical value obtained from processing the Stata and Sar commands. Before you execute Sar, the Word file should appear like that in figure 2.


Figure 2. A simple automatic report (before Sar is executed)

Notice that the comment in figure 2 is signed with the initials "sar" in lowercase (the 1 is automatically added by Word to designate this as the first comment by "sar"). Only comments with these initials will be processed by Sar, so you can continue to use ordinary comments when revising your document, provided that you use different initials for them. To change the user initials, go to File > Options (figure 3).

Figure 3. Setting user initials to work with Sar

The first two commands in the comment of figure 2 (sysuse and summarize) will be executed by Stata. They load auto.dta and calculate summary statistics for the price variable. The third command is the Sar @print command, which asks Word to replace the placeholder with the r(mean) value retrieved from the Stata environment. Telling the difference between Sar commands and Stata commands is easy: every Sar command begins with the @ symbol.


After Sar has been executed, the report will appear as the one in figure 4.

Figure 4. A simple automatic report (after Sar is executed)

The placeholder text was replaced by the value 6165.2568. If the value of the price variable changes, you can easily update your report simply by running Sar again. The comment will remember the statistical analysis you carried out, helping you to document your analysis and allowing other people to reproduce the results. Now your document is automated and self-explanatory.

4 Formatting numerical output

Of course, you would like to be able to format the numerical values retrieved from the Stata environment, for example, by choosing the number of digits after the decimal separator. You can do this by using the @format command. This command must be followed by a numeric format string as its unique argument, specified according to the same rules used in the Stata format command. For example, you can type @format %6.1f to set a numerical output with a period as the decimal separator and only one digit after the decimal separator. All numerical outputs will be formatted according to the numeric format specified in the command syntax, and the effect of the @format command is maintained until a new @format command specifies a different numeric format.


In the example in figure 5, the automatic report shows the average price of the cars in auto.dta, the number of observations, and the number of foreign cars.

Figure 5. Using the @format command

Notice that the first Stata commands are placed in a comment associated with the title of the report, making the document easier to read. Stata and Sar commands can be placed in different comments, and they are always executed sequentially.

No leading spaces are added to the numerical output even if they are expected according to the Stata formatting rules. For example, in the second comment, the effect of @format %6.1f would be the same as that of @format %21.1f. This simplifies the editing work in Word.

Try to change the @format command in the second comment to @format %6,1f to obtain a numerical output with a comma as the decimal separator.
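As a minimal sketch (the comments shown in the figures are not reproduced verbatim here), a Sar comment combining @format and @print might read as follows; the placeholder associated with the comment would then be replaced by the mean price with one digit after the decimal separator:

sysuse auto
summarize price
@format %6.1f
@print r(mean)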

5 Creating tables

Most reports have statistical tables, and statisticians know how boring it can be to create them. Sar helps you to easily create tables, filling them with numerical values and labels. The basic commands to do this are @filltable, @matrixrownames, and @matrixcolnames.

The @filltable command must always be inserted into a comment associated with a table. In its simplest syntax, the command requires three arguments (there are five in its complete syntax). The first is the data, retrieved from the Stata environment, that you want to insert into the table; the second and the third, respectively, are the table row and column from which the filling process starts.

A well-formatted table needs row and column labels, too. Results stored in a matrix with row and column names are often obtained in Stata; row and column names are natural candidates to use as labels for your table. You can automatically insert these names into your table with the @matrixrownames (for row names) and @matrixcolnames (for column names) commands. Similarly to the @filltable command, in the simplest syntax for @matrixrownames and @matrixcolnames, the first argument is a matrix (but never a scalar), and the second and the third arguments, respectively, are the row and the column of the first table cell that has to be filled with the names of the matrix.

In figure 6, we use the @filltable and @matrixrownames commands to create a table presenting results from a regression analysis. First of all, we obtain the beta matrix of estimated regression coefficients as the transposed matrix¹ of e(b) in the first Sar comment. The @filltable command in figure 6 has the column vector beta as its first argument, the number 2 as its second argument, and the number 2 as its third argument: the command prints the matrix beta starting from the table cell in the second row and second column. The @matrixrownames beta 2 1 command prints the row names of the matrix beta starting from the table cell in the second row and first column.

The Word table associated with the @filltable and @matrixrownames commands needs to be big enough to contain the vector that the commands insert into it—Sar will not automatically add the needed rows and columns to the table.

Figure 6. Using the @filltable and @matrixrownames commands
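The comments behind figure 6 might look roughly like the following sketch; the particular regression (price on mpg and weight from auto.dta) is my assumption, because the figure itself is not reproduced here:

sysuse auto
regress price mpg weight
matrix beta = e(b)'
@filltable beta 2 2
@matrixrownames beta 2 1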

Now we want to add standard errors in parentheses under each estimated coefficient. To do this, we have to replace our 4 × 5 table with a 7 × 5 table (we add three rows, one for each standard error). To obtain the final table (shown in figure 7), we need to alternate estimated coefficients and standard errors in the second column of our new table and alternate row names with white cells in the first column.

The @filltable command has a total of five arguments (the last two are optional). So far, I have explained the first three. The fourth and the fifth arguments are called, respectively, row step and column step: they set how many rows (columns) have to be skipped between a row (column) and the next row (column) when you write the matrix in the table. Their default values are (0, 0), for which no row or no column is skipped. In the first @filltable command in figure 7, we use a row step of 1 and a column step of 0 (this last has no effect in this example because beta is a column vector); so the coefficients are printed in the table, skipping one row between values. The same row step and column step are used for the second @filltable command, which prints the standard errors (the standard error vector sd is obtained in the first Sar comment by using two lines of Mata code). A row step of 1 is also specified as the fourth argument of the @matrixrownames command, which takes only four arguments (a column step argument would be useless).

1. Note that the apex used as the operator of transposition (') is not the one directly obtained by trying to type it in Word (`). If you want to get the correct apex after typing the wrong one in Word, you have to press Ctrl+Z.

Figure 7. A table of regression results with beta coefficients and standard errors²

In figure 7, standard errors are listed in parentheses, which we obtain using the @beginstring and @endstring commands. Each of these commands takes only one argument, given by a string delimited by two sharps (#). The @beginstring command sets the string you want to put before numerical output printed in tables, whereas the @endstring command sets the string you want to place after numerical output. In our example, we use an opening parenthesis as the opening string and a closing parenthesis as the closing string. The effect of these commands persists until the @resetstring command (with no arguments) is used.
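A rough sketch of the commands involved (assuming, as described above, that the standard errors in sd are written starting in the third row of the second column with a row step of 1) would be:

@beginstring #(#
@endstring #)#
@filltable sd 3 2 1 0
@resetstring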

2. Note that the quotation marks used in the first Sar comment are " and not “. If you try to type them from your keyboard, Word will automatically create the latter. To obtain the correct one, press Ctrl+Z after you have typed the wrong quotation marks in Word.


6 Graphics

Sar can retrieve not only numbers but also graphs created in Stata. For this purpose, you can use the @graph command, which allows you to insert into Word the last graph displayed in Stata. The @graph command is typically used immediately after a Stata graph command, such as graph twoway, graph matrix, or graph bar.

The @graph command must be associated with a placeholder, which can be text or an image (but not a table). Similarly to the @print command, the placeholder will be replaced by what is retrieved from Stata, which in the case of the @graph command is an image. From a technical point of view, the graph is exported by Stata into a temporary file in your system temporary folder, and then the image is inserted into Word (the name of the temporary file is sartemp, with the extension depending on the file format chosen in the @graph command). This data exchange process is hidden from the user.

In the example in figure 8, we create an ordinary graph in Stata by using the twoway command. This command is immediately followed by the Sar @graph command, which will retrieve the last displayed graph and insert it into Word. The @graph command requires three arguments. The first is the image file format used in the data exchange process between Stata and Word; the format must be one of the following: .eps, .png, .tif, .wmf, or .emf. In the example, the file format is .tif, a raster (not vector) file format. The choice of the format can affect the visual appearance of the image. The second and the third arguments, respectively, are the width and the height of the image as it will be displayed in Word.

Figure 8. A graph inserted into Word by Sar
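Figure 8 is not reproduced here. A minimal sketch of the code it contains (the dataset, the plotted variables, and the image size are illustrative; only the .tif format is fixed by the text above) is:

sysuse auto
twoway (scatter length weight)
@graph tif 100 100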


The @graph command also can be used after a Stata graph use command. For example, in the following Sar and Stata code,

twoway (scatter length weight)
graph save scatter1.gph, replace
twoway (scatter price mpg)
graph save scatter2.gph, replace
graph use scatter1.gph
@graph tif 100 100

the graph that will be inserted into Word is scatter1.gph and not the last-created scatter2.gph.

7 Sar programs

A Sar program is, roughly speaking, a list of Sar and Stata commands. The code of a Sar program is defined between the @program and the @end commands, that is,

@program myprog
    ...
    my commands
    ...
@end

The first argument of the @program command is used to specify the name of the Sar program. In the example above, the name of the Sar program is myprog. This program can be executed simply by using the @do command with the program name as its first argument:

@do myprog

A program can be defined in a Word comment or in an external plain text file, called a library in the Sar jargon. A library can contain many Sar programs. To load all the Sar programs defined in a library, you can use the @loadlibrary command with the file path of the library file in quotation marks as its unique argument. Here is an example:

@loadlibrary "C:\sar libraries\mylibrary.txt"

Programs can accept arguments. Arguments have to be specified in the @program command. For example, the following program,

@program outmatrix matrix
@matrixrownames §matrix§ 2 1
@matrixcolnames §matrix§ 1 2
@format %4.3f
@filltable §matrix§ 2 2
@end


has a unique argument labeled matrix and prints the user-specified matrix in a Word table. In the program code, each argument is written between two § symbols.² Before executing the program, Sar replaces the §matrix§ string with the argument provided by the user. We use the outmatrix Sar program in figure 9.

Figure 9. Using a Sar program and defining it in a Word comment

The @do command always has the Sar program name as its first argument. If the program requires some arguments, they must be included as arguments in the @do command. For example, in figure 9, the @do command calls the outmatrix Sar program with r(C) as an argument.
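A sketch of the comment behind figure 9, under the assumption that a correlation matrix has just been computed (the correlate command and the variables are illustrative), might be:

correlate price mpg weight
@do outmatrix r(C)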

If the outmatrix program is defined in a library, it will have to be loaded through the @loadlibrary command, as is done in figure 10.

Figure 10. Loading a Sar program from a library

When you use programs and libraries, the Sar code in the automatic report tends not to be verbose, and it is easy to understand what the code does. Moreover, a library can be reused and shared with colleagues.

2. To access the § symbol in Word, click Insert > Symbol > More Symbols.... In the Symbol dialog box, you will find the section sign symbol (§).


The following program produces a regression output with beta coefficients, standard errors, number of observations, and R-squared:

@program regressout
matrix beta = e(b)'
mata: V = st_matrix("e(V)")
mata: st_matrix("sd", sqrt(diagonal(V)))
@format %10.1f
@filltable beta 2 2 1 0
@matrixrownames beta 2 1 1
@beginstring #(#
@endstring #)#
@filltable sd 3 2 1 0
@resetstring
@format %3.0f
@filltable e(N) -2 2
@format %4.3f
@filltable e(r2) -1 2
@end

An example of its usage is given in figure 11. Because the @print command cannot be used to insert numerical values into a table, we print the number of observations and the R-squared by using the @filltable command. The last two @filltable commands in the regressout program have a negative number as the starting-row argument. Negative numbers are used to indicate row and column coordinates according to a different coordinate system, where -1 is the last row or column, -2 is the second-to-last row or column, and so on. We use negative numbers to make the regressout program flexible and easy to use. This way you can use the program without worrying about how many variables you use in the regression analysis: the number of observations and the R-squared will always be printed, respectively, in the second-to-last row and in the last row of the table.


Figure 11. Output of a regression analysis with the regressout program
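The comment that produces figure 11 is not reproduced here. Assuming the regressout program has already been defined in the document or loaded from a library, a sketch of its use (the dataset and the regression are illustrative) is:

sysuse auto
regress price mpg weight foreign
@do regressout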

If you like the regressout Sar program, you can save it in a library and use it in all your automatic reports simply by loading the library with the @loadlibrary command. You can also create other Sar programs to carry out specific formatting tasks.

A Sar program cannot contain the definition of another program. Nested definitions of programs like the following are forbidden:

@program alpha
    do something in alpha
    @program beta
    do something in beta
    @end
    do something in alpha
@end

A Sar program cannot call another program with the @do command or load programs through the @loadlibrary command.


8 Using Sar in interactive mode

Suppose you do not have any need to create a document that can update itself if data change, and you like to launch Stata commands directly from Stata because you consider it a more comfortable environment. In such a case, you could consider using the @interact command, which allows you to halt Sar, open a Stata session in which you can interactively launch your commands, and return to Sar to use the obtained results in your document.

To use Sar in interactive mode, place the @interact command in a Word comment (the command does not require arguments). The processing managed by Sar will halt when the @interact command is found. At that moment, Sar will open the Stata window, which remains at your disposal, allowing you to type your commands and interact with Stata. After you conclude your Stata session, remember not to manually close the Stata window: this will cause Sar to crash. To avoid this, you have to return to Word, where you will find a dialog window with a button to close Stata. If you do this correctly, the data objects created in the Stata session will be available for the Sar commands following the @interact command.

The @interact command can be useful when you are not sure about how to do a certain statistical analysis, so you need to interact with Stata to check intermediate results or maybe call help. Suppose that you have to obtain the well-known X′X matrix required in regression analysis from a dataset given by X (the last column is a column vector of ones). You do not mind documenting your statistical analysis in a Word comment, so you choose to use the @interact command as shown in figure 12.

Figure 12. Using Sar in interactive mode


When this example is launched and Sar finds the @interact command, the Stata window will open, and there you can type the following:

. sysuse auto
(1978 Automobile Data)
. * suppose you do not remember how to use the mkmat command
. help mkmat
. mkmat mpg weight, matrix(X)
. count
  74
. matrix one = J(74, 1, 1)
. matrix X = X, one
. matrix mymatrix = invsym(X' * X)

After typing these commands, remember to return to Word, where you will find a dialog window with a button named Close Stata that you must click. If you try to manually close Stata, Sar will crash, and you will have to retype all your commands.

The matrix mymatrix, created in the previous Stata session, is now available in the Sar environment. The @filltable command in figure 12 retrieves the values of this matrix and uses them to fill the final table. If you are no longer interested in the Word comment, you can then delete it.

9 Syntax and description of the Sar commands

This section gives the syntax and a short description of each Sar command. In the given syntax of a Sar command, arguments between brackets are optional. Examples are shown only for commands with arguments.

9.1 @beginstring and @endstring

Syntax

@beginstring #string#

@endstring #string#

Description

The @beginstring command sets the string of characters you want to place before the numerical output of the @filltable command. The @endstring command sets the string of characters you want to place after the numerical output of the @filltable command. The string must be specified between two sharps (#).


Examples

The following code ensures that every numerical value inserted into a table by the @filltable command will begin with an opening bracket and end with a closing bracket:

@beginstring #[#
@endstring #]#

The following code cancels the effect of the previous code:

@beginstring ##
@endstring ##

9.2 @cleartable

Syntax

@cleartable

Description

The @cleartable command clears the table associated with the comment where the command is written. It can be used only within Word comments associated with a single table.

The command has no arguments.

9.3 @do

Syntax

@do SarProgram [arg1 arg2 arg3 ... argN]

Description

The @do command executes a program previously loaded by the @loadlibrary command or defined in a Word comment through the @program and @end paradigm.

SarProgram specifies the program to be executed.

The optional arguments arg1, arg2, arg3, ..., argN specify the arguments to be passed to the program.

Example

Suppose you have defined a Sar program in a comment and you have called it myprogram. If you want to execute it, you can use the following code:

@do myprogram


9.4 @filltable

Syntax

@filltable StataData startingRow startingCol [rowStep colStep]

Description

The @filltable command inserts matrices, Stata results, scalars, and macros given by the StataData argument into a Word table. It can be used only in Word comments associated with a single table.

StataData is the data retrieved from the Stata environment used by the command to fill the table. It can be a matrix, a Stata result, a scalar, or a macro.

startingRow and startingCol indicate, respectively, the row and the column of the table cell from which StataData should begin to be printed. They have to be nonzero integers. If these values are negative, -1 means the last row or column, -2 means the second-to-last row or second-to-last column, and so on.

rowStep (colStep) indicates how many rows (columns) have to be skipped between a row (column) and the next row (column) while you are filling the table. When rowStep or colStep equals 0, no blank row or column is left between printed rows or columns. When rowStep or colStep equals n, then n blank rows or columns are left between printed rows or columns. These arguments are optional, and they have to be nonnegative integers.

Examples

The following code,

@filltable e(V) 4 6 3 1

fills a Word table with values taken from the matrix e(V) starting from the position in row 4 and column 6. The row step is 3 and the column step is 1, so the command will leave three blank rows between printed rows and one blank column between printed columns.

In the following code,

@filltable e(V) 4 6

the row step and the column step are not set, so the @filltable command will consider the default value of 0 for both arguments. No blank rows (columns) will be left between printed rows (columns).


9.5 @format

Syntax

@format %fmt

Description

The @format command sets the numerical format of the output obtained by the @print and @filltable commands. The set numerical format is preserved for the following @print and @filltable commands.

The %fmt argument has to be a numerical format written with the same rules used in the Stata format command (see [D] format).

Example

The following code sets a format with three decimal digits and a comma as the decimal separator:

@format %5,3f

9.6 @graph

Syntax

@graph graphicformat width height

Description

The @graph command retrieves the last graph displayed in Stata and inserts it into the Word document. The command must be associated with a placeholder, which can be text or an image but not a table. The placeholder will be replaced by the last graph displayed in Stata. The @graph command is typically used immediately after a Stata graph command, such as graph twoway, graph matrix, or graph bar.

The graphicformat argument is the graphic format of the image exported by Stata and imported by Sar during the data exchange process between the two programs. It can be one of the following: .eps, .png, .tif, .wmf, or .emf. The choice of the format can affect the visual appearance of the image. The width and height arguments represent the width and the height of the graph displayed in the Word document.

All arguments must be provided.


Example

The following code inserts a scatterplot into Word with a width of 100 and a height of 80, using the .png file format:

sysuse auto
twoway (scatter length weight)
@graph png 100 80

9.7 @interact

Syntax

@interact

Description

The @interact command halts the execution of Sar to put Stata at your disposal. You can use Stata, interact with it, and create data objects (such as scalars or matrices) that will be available in the Sar environment after your Stata session has been closed. Remember not to manually close the Stata window: this will cause Sar to crash. Instead, you have to return to Word, where you will find a dialog window with a button to close Stata.

The command has no arguments.

9.8 @loadlibrary

Syntax

@loadlibrary "LibraryFilePath"

Description

The @loadlibrary command loads programs defined in a Sar library file. The path of the Sar library file is specified in the LibraryFilePath argument.

Example

Suppose you wrote some Sar programs in a plain text file named utilities.txt in the C:\Sar folder. You can load all the Sar programs defined in the file by using the following code:

@loadlibrary "C:\Sar\utilities.txt"


9.9 @matrixcolnames and @matrixrownames

Syntax

@matrixcolnames StataMatrix startingRow startingCol [colStep]

@matrixrownames StataMatrix startingRow startingCol [rowStep]

Description

The @matrixcolnames and @matrixrownames commands fill a Word table with, respectively, the column names and row names of a Stata matrix. They can be used only in Word comments associated with a single table.

StataMatrix is the matrix retrieved from the Stata environment whose row names are printed by @matrixrownames and whose column names are printed by @matrixcolnames. This argument has to be a matrix.

startingRow and startingCol indicate, respectively, the row and the column of the table cell from which the row names or column names of StataMatrix begin to be printed. They have to be nonzero integers. If these values are negative, -1 will indicate the last row or column, -2 will indicate the second-to-last row or column, and so on.

colStep is an optional argument for @matrixcolnames. It indicates the column step according to which the table is filled. The default value is 0. It has to be a nonnegative integer.

rowStep is an optional argument for @matrixrownames. It indicates the row step according to which the table is filled. The default value is 0. It has to be a nonnegative integer.

Examples

The following code writes the row names of matrix e(b) in a table starting from row two and column three:

@matrixrownames e(b) 2 3

If you add the row step argument, the row names will be written leaving five blank rows between printed rows:

@matrixrownames e(b) 2 3 5


9.10 @print

Syntax

@print StataValue

Description

The @print command, launched from a Word comment associated with a portion of text (a temporary text placeholder in Sar jargon), replaces its placeholder with the value of a Stata result, scalar, or macro retrieved from the Stata environment. The @print command cannot be used in a Word comment associated with a table. The StataValue argument must be a Stata result, scalar, or macro.

Example

Suppose you have created a scalar named myresult in Stata. To retrieve the value of the scalar, you can type

@print myresult

in a comment associated with the text you want to replace with the value.

9.11 @program and @end

Syntax

@program programName [arg1 arg2 ... argN]
    ...
    Sar and Stata commands
    ...
@end

Description

The @program and @end paradigm is used to define a Sar program. This paradigm can be used in a Word comment or in a Sar library. A Sar program is, roughly speaking, a list of Sar and Stata commands. This list of commands is defined between the @program and the @end commands. After the commands are loaded in the Sar environment, they can be executed through the @do command.

The programName argument sets the name of the program.


The optional arguments arg1, arg2, ..., argN specify the arguments of the program defined by the @program and @end paradigm. When you want to use the values passed as arguments in your program, you have to use the §arg1§, §arg2§, ..., §argN§ callbacks inside your program code; before executing the program, Sar replaces every callback with the corresponding values of the arguments.

The @end command closes a program definition. It has no arguments.

The following commands cannot be used within a Sar program: @do, @loadlibrary, @interact, and the @program and @end paradigm.

Examples

Consider the following Sar program:

@program printTransposedMatrix matrix row col
matrix mymatrix = §matrix§'
@filltable mymatrix §row§ §col§
@end

Suppose you call it as follows:

@do printTransposedMatrix e(b) 2 2

The commands executed by Sar, after the call done by @do, will be

matrix mymatrix = e(b)'
@filltable mymatrix 2 2

9.12 @resetstring

Syntax

@resetstring

Description

The @resetstring command sets the strings of characters coming before and after the numerical output of the @print and @filltable commands to an empty string. When the @resetstring command is used, no characters are added before or after the numerical output. It is equivalent to the commands @beginstring ## and @endstring ##.

The command has no arguments.

See also the @beginstring and @endstring syntax and description.


9.13 @viewlog

Syntax

@viewlog

Description

The @viewlog command asks Sar to leave the Stata window open after the Sar macro is executed. This can be used to look at the log created by Stata computations. When @viewlog is used in a Word comment, a dialog window is opened after the execution of the Sar macro, allowing you to close the Stata window and definitively terminate the Sar macro.

The command has no arguments.

10 Some limitations of Sar

Sar works only in Windows.

Because Sar does not use Stata Automation in asynchronous mode (for details, see http://www.stata.com/automation/), you cannot use the following Stata commands: program define, while, forvalues, foreach, and input. These commands can still be used inside do-files or ado-files. The exit command is ignored.

Sar internally uses two auxiliary Stata local macros to retrieve results from Stata. The local macro stataAutomaticReportValue is used to retrieve Stata results, scalars, and macros. The local macro stataAutomaticReportMatrix is used to retrieve matrices. You have to avoid using these two auxiliary local macros to prevent conflicts between Sar and your Stata code.

Whenever a command written in a Sar comment modifies the Word document (for example, filling a table with the @filltable command), the comment is deleted and afterward rebuilt with the same text. This process is hidden from the user. There is no room here to explain the technical reason why this process is necessary, but you should keep it in mind to use Sar properly; otherwise, there could be unwanted consequences in the document.

The maximum number of programs that can be loaded in the Sar environment is 500. The maximum number of arguments of a Sar program is 50. These limits are not due to particular technical reasons; I set them simply because I think it unlikely that they will be exceeded.

Word comments referring to the same portion of text cause the Sar macro to crash and cause unwanted consequences in the Word document, so you have to avoid them.


The setting of global and local macros has no effect in Sar. So the following Stata commands will cause an error:

. global mypi "3.14"

. scalar mydoublepi = $mypi * 2

After the execution of the Sar macro, you cannot undo the changes made to the document by Sar with the Word undo function (Ctrl+Z). I therefore strongly recommend that you save the Word document before executing Sar, to prevent unwanted consequences to your document.

The following commands cannot be used within a Sar program: @do, @loadlibrary, @interact, and the @program and @end paradigm.

11 Conclusions

Sar makes preparing statistical reports in Microsoft Word for Windows easy. Thanks to Sar, you can exploit all the functions of Microsoft Word—including its “what you see is what you get” approach—and the power of Stata as a computational engine. Sar allows the creation of an automatic report, that is, a Word document in which numerical values generated by statistical analyses in Stata are automatically inserted into the document. Sar provides two advantages: the automatic report does not have to be modified if data change, and the report is integrated with the documentation about the statistical analysis that has been carried out. Sar allows you to easily share the workflow of data analysis and encourages reproducible research.

Automation and self-explanatory documents can also be obtained through the package Sweave (Leisch 2002) for the R statistical software. Sweave is a tool that allows you to embed R code in LaTeX documents, and it is part of every basic R installation. Unlike Sar, Sweave requires competence in the use of LaTeX and does not offer a graphical user interface for document editing; it essentially adopts a “what you see is what you mean” approach.

Another approach to the automatic generation of documents is described by Gini and Pasquini (2006). The authors give examples of how to generate LaTeX and HTML documents from Stata by using the file suite of commands (see [P] file). This approach is very flexible but, unlike Sar, requires writing LaTeX or HTML code.

Sar is a versatile tool. If you use it in interactive mode, you will quickly retrieve numerical results from Stata, and you will not need to manually copy them into the word processor. You can use Sar to create webpages or other types of documents managed by Word. The automation capabilities of Sar can be exploited to write homework, schoolwork, and tests with different random numerical values (all with a different but known seed of a pseudorandom generation process to allow quick marking of each test). Moreover, “quick parts” can be stored in Word to create reusable tables with associated Sar comments and inserted into the document when necessary; thus you avoid having to reinvent the wheel any time a new document has to be edited (see the Word documentation to find out how to create “quick parts”).


Sar is an extensible system. To solve complex formatting tasks, you can add new useful functions by defining Sar programs. By calling Sar programs, you can make your code less verbose and easier to read.

Like all software, Sar can be improved. Some of its limitations have been discussed in the previous section. If you want to modify the Sar macro, right-click on the Stata automatic report 1.1.dotm file, choose Open, and edit the macro in the VBA integrated development environment available in Word (see the Word documentation to find out how to use it). Mastery of the VBA language and the Stata Automation object is required.

12 References

Gallup, J. L. 2012. A programmer's command to build formatted statistical tables. Stata Journal 12: 655–673.

Gini, R., and J. Pasquini. 2006. Automatic generation of documents. Stata Journal 6: 22–39.

Hardin, J. 1995. dm29: Create TeX tables from data. Stata Technical Bulletin 25: 3–7. Reprinted in Stata Technical Bulletin Reprints, vol. 5, pp. 20–25. College Station, TX: Stata Press.

Jann, B. 2005. Making regression tables from stored estimates. Stata Journal 5: 288–308.

———. 2007. Making regression tables simplified. Stata Journal 7: 227–244.

Knuth, D. E. 1986. The TeXbook. Reading, MA: Addison–Wesley.

Lamport, L. 1994. LaTeX: A Document Preparation System. 2nd ed. Reading, MA: Addison–Wesley.

Leisch, F. 2002. Sweave: Dynamic generation of statistical reports using literate data analysis. In COMPSTAT 2002, ed. W. Härdle and B. Rönz, 575–580. Heidelberg: Physica-Verlag.

Long, J. S. 2008. The Workflow of Data Analysis Using Stata. College Station, TX: Stata Press.

Newson, R. 2003. Confidence intervals and p-values for delivery to the end user. Stata Journal 3: 245–269.

About the author

Giovanni L. Lo Magno is a research fellow in statistics at the University of Palermo in Italy. His research interests focus on gender differences in employment and statistical software. He is the author of Tabula, a graphical user interface front-end for Stata.


The Stata Journal (2013) 13, Number 1, pp. 65–76

Within and between estimates in random-effects models: Advantages and drawbacks of correlated random effects and hybrid models

Reinhard Schunck
Department of Sociology
University of Bielefeld
Bielefeld, Germany
[email protected]

Abstract. Correlated random-effects (Mundlak, 1978, Econometrica 46: 69–85; Wooldridge, 2010, Econometric Analysis of Cross Section and Panel Data [MIT Press]) and hybrid models (Allison, 2009, Fixed Effects Regression Models [Sage]) are attractive alternatives to standard random-effects and fixed-effects models because they provide within estimates of level 1 variables and allow for the inclusion of level 2 variables. I discuss these models, give estimation examples, and address some complications that arise when interaction effects are included.

Keywords: st0283, xtreg, xtmixed, multilevel data, panel data, fixed effects, random effects, correlated random effects, hybrid model

1 Introduction

It is widely recognized that fixed-effects models have an advantage over random-effects models when analyzing panel data because they control for all level 2 characteristics, measured or unmeasured (Allison 2009; Halaby 2004; Wooldridge 2010). This also applies in a multilevel framework. However, a major drawback of fixed-effects models is their inability to estimate the effect of any variable that does not vary within clusters, which holds for all level 2 variables. To circumvent this disadvantage, it has been proposed to estimate within effects in random-effects models (Allison 2009; Neuhaus and Kalbfleisch 1998; Rabe-Hesketh and Skrondal 2008; Raudenbush 1989a; Wooldridge 2010).

2 Models

In the linear case,¹ the random-intercept model is given by

y_it = β_0 + β_1 x_it + β_2 c_i + μ_i + ε_it    (1)

1. The strategy presented here also extends to nonlinear models (Allison 2009; Neuhaus and Kalbfleisch 1998; Wooldridge 2010).

© 2013 StataCorp LP st0283


where subscript i denotes level 2 (for example, subjects) and t denotes level 1 (for example, occasions). x_it is a level 1 variable that varies between and within clusters, and c_i is a level 2 variable that varies only between clusters. μ_i is the level 2 error and the random intercept, and ε_it is the level 1 error. Throughout the article, ε_it will be treated as white noise and not considered further.

The standard distributional assumption regarding the level 2 error is μ_i | x_it, c_i ∼ N(0, σ²_μ). The model provides consistent effect estimates if E(μ_i | x_it, c_i) = 0. Subtracting the between model

ȳ_i = β_0 + β_1 x̄_i + β_2 c_i + μ_i + ε̄_i    (2)

from (1) provides the fixed-effects model in its demeaned form:

(y_it − ȳ_i) = β_1 (x_it − x̄_i) + (ε_it − ε̄_i)    (3)

The subtraction removes the level 2 error (μ_i) from the equation. As a result, the model's estimate of β_1 is unbiased even if E(μ_i | x_it) ≠ 0. But this comes at a cost. The subtraction also removes all variables that do not vary at level 1. Fixed-effects models therefore cannot estimate the effect of level 2 variables. This may not be seen as a problem in panel-data analysis. But it is definitely a major drawback in multilevel analysis, where interest often lies particularly in estimating the effect of level 2 variables.

However, it is possible to estimate within effects in random-effects models by decomposing level 1 variables into a between component (x̄_i = n_i⁻¹ Σ_{t=1..n_i} x_it) and a within-cluster component (x_it − x̄_i). This hybrid model (Allison 2009) is given by

y_it = β_0 + β_1 (x_it − x̄_i) + β_2 c_i + β_3 x̄_i + μ_i + ε_it    (4)

Using (4) to estimate β_1 gives the within-effect estimate, that is, the fixed-effects estimate (Mundlak 1978; Neuhaus and Kalbfleisch 1998). Hence, the estimates of β_1 from (3) and (4) are identical. Because (4) is a random-effects model, we can use it to estimate effects of level 2 variables. However, for the estimate of β_2 to be unbiased, E(μ_i | x_it, c_i) = 0 and μ_i | x_it, c_i ∼ N(0, σ²_μ) still have to hold. Moreover, even though (4) is a random-effects model, its estimate of β_1 is not more efficient than the one obtained through estimating (3), because both estimates are solely based on within variation. β_3 estimates the between effect (Mundlak 1978; Neuhaus and Kalbfleisch 1998). While it is not necessary to include the cluster mean (x̄_i) to obtain the within estimate of β_1, its inclusion ensures that effect estimates of level 2 variables are corrected for between-cluster differences in x_it.

The idea to decompose between and within variation and to estimate the respective effects in a single model is not new (Kaufman 1993; Kreft, de Leeuw, and Aiken 1995; Neuhaus and Kalbfleisch 1998; Raudenbush 1989a), but it seems to have become increasingly popular in panel-data analysis (Burnett and Farkas 2009; Phillips 2006; Ousey and Wilcox 2007; Teachman 2011; Zhou 2011) as well as in multilevel analysis (Curran and Bauer 2011; Epstein et al. 2012; Landale, Gorman, and Oropesa 2006; Nomaguchi and Brown 2011; Park, Lee, and Epstein 2009; Schempf et al. 2011).


This approach offers several additional advantages. First, it allows us to test for equivalence of within and between estimates. This test, which is referred to as an augmented regression test (Jones et al. 2007, 217), can be used as an alternative to the Hausman specification test (Baltagi 2008, 73). If between and within effects are the same—that is, β_1 = β_3—then (4) collapses to (1), the random-intercept model. Second, a decomposition into between and within effects can be used with generalized estimating equations, which enables us to specify less restrictive within-cluster error structures. Third, this approach can be extended to include random slopes, allowing effects of level 1 variables to vary between clusters.
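In Stata, the augmented regression test amounts to fitting the hybrid model and testing the equality of the within and between coefficients with a Wald test. A rough sketch with generic placeholder names (dx is the cluster-mean-centered version of x, mx is its cluster mean; section 3 shows the concrete commands for the birth-weight data):

. xtreg y dx mx c, i(id) re
. test dx = mx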

A hybrid model that includes a random slope for (x_it − x̄_i) is given by

y_it = β_0 + (β_1 + μ_2i)(x_it − x̄_i) + β_2 c_i + β_3 x̄_i + μ_1i + ε_it    (5)

The hybrid model is closely related to the correlated random-effects model (Wooldridge 2010), first proposed by Mundlak (1978). The correlated random-effects model relaxes the assumption of zero correlation between the level 2 error and the level 1 variables. In particular, it introduces the assumption μ_i = π x̄_i + ν_i, so (1) becomes

y_it = β_0 + β_1 x_it + β_2 c_i + π x̄_i + ν_i + ε_it    (6)

The cluster mean of x_it picks up any correlation between this variable and the level 2 error. Including the cluster mean of a level 1 variable in a random-effects model is therefore an alternative to cluster mean centering (Halaby 2003, 519). Thus β_1 from (6) is the fixed-effects estimate (Mundlak 1978; Wooldridge 2010), and it is identical to the estimate obtained from (4). But the estimated effect of x̄_i will differ. In the hybrid model, this is the between effect. In the correlated random-effects model, this is the difference of the within and between effects (Mundlak 1978); that is, π = β_3 − β_1.

The relation between the hybrid model and the correlated random-effects model becomes obvious if we rewrite (4) as

y_it = β_0 + β_1 x_it + β_2 c_i + (β_3 − β_1) x̄_i + μ_i + ε_it    (7)
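The step from (4) to (7) is just an expansion of the within term,

β_1 (x_it − x̄_i) + β_3 x̄_i = β_1 x_it + (β_3 − β_1) x̄_i

so the coefficient on x̄_i in (7) corresponds to π = β_3 − β_1 in (6).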

The correlated random-effects model allows for the inclusion of level 2 variables, and it can be used with generalized estimating equations, just like the hybrid model. It is also possible to perform an augmented regression test. But the test takes a different form. Because π already estimates the difference of the within and between effects, the null hypothesis is π = 0 (Baltagi 2008, 133). In principle, the correlated random-effects model can also include random slopes. However, a correlated random-effects model with random slopes is not equivalent to the corresponding hybrid model with random slopes, and it is advisable to use the hybrid model (Raudenbush 1989a; 1989b).²

2. In particular, because π = β_3 − β_1, the random part of the slope, μ_2i, will appear in the estimated effect of x_it and x̄_i, which makes it hard to interpret (see Raudenbush [1989a, 12]).


3 Estimation

The example presented below uses infant birth weight data (Abrevaya 2006; Rabe-Hesketh and Skrondal 2008). The data comprise 8,604 infants clustered within 3,978 mothers. We will examine how the level 1 variables, mother's age (mage, continuous) and smoking behavior (smoke, binary), and the level 2 variable, race (black, binary), affect a child's birth weight (birwt, continuous). To estimate between and within effects in one model, we must first generate the cluster-specific mean of x_it. The second step is to create the deviation scores, which is also known as group mean centering. We have to ensure that the means are generated on the multivariate sample, that is, by using listwise deletion to handle missings. This is done using mark (Jann 2007b).

. use http://www.stata-press.com/data/mlmus2/smoking

. mark nonmiss

. markout nonmiss birwt smoke mage black

. egen msmoke = mean(smoke) if nonmiss==1, by(momid)

. generate dsmoke = smoke-msmoke

The cluster-specific means and the deviation scores can also be computed easily with the center command (Jann 2007a).

. by momid, sort: center mage if nonmiss==1, prefix(d) mean(m)

Once the variables have been created, xtreg, re can be used to fit the hybrid model.

. xtreg birwt dsmoke dmage msmoke mmage black, i(momid) re

. estimates store hybrid

The correlated random-effects model is fit in a similar way, but it includes uncentered versions of the level 1 variables.

. xtreg birwt smoke mage msmoke mmage black, i(momid) re

. estimates store corr_re

We also fit a random-intercept and a fixed-effects model so that we can compare their estimates with those of the hybrid model and the correlated random-effects model.

. xtreg birwt smoke mage black, i(momid) re

. estimates store re

. xtreg birwt smoke mage, i(momid) fe

. estimates store fe


Estimation results are shown in table 1. Model 1 presents the estimates from the random-intercept model, model 2 from the fixed-effects model, model 3 from the hybrid model, and model 4 from the correlated random-effects model. Comparing the random-intercept model (model 1) with the fixed-effects model (model 2), we see that there are considerable differences in effect estimates. The estimated effect on birth weight of smoking during pregnancy, for instance, is considerably smaller when estimated as a within-mother effect, that is, once all time-constant (observed and unobserved) differences between mothers are accounted for. The estimated effect of mother's age, on the other hand, is considerably larger in the fixed-effects model than in the random-effects model.

If we compare the estimates from the fixed-effects model (model 2) with those from the hybrid model (model 3) and the correlated random-effects model (model 4), we see that all three models estimate the same (within) effects of the level 1 variables.³ A comparison of between and within effects from the hybrid model (model 3) provides additional insight. The model, for instance, estimates the between-mother effect of smoking as −332.93. Accordingly, the average birth weight of children of a mother who smokes and of a mother who does not smoke will differ by 332.93 grams. The within-mother effect is estimated to be −105.70. Thus for a given mother, smoking decreases the birth weight of her children by 105.70 grams on average. A Wald test can be used to test for equivalence of within and between estimates.

. estimates restore hybrid

. test dsmoke=msmoke

. test dmage=mmage

. hausman fe re

In the present case, the test statistics suggest that the null hypothesis of equality for within and between estimates should be rejected [smoking: Wald χ²(1) = 38.68, age: Wald χ²(1) = 22.09], which can be considered evidence against the random-effects model. A Hausman test reaches the same conclusion [χ²(2) = 58.63]. How can we explain the differences of between and within effects? Smoking is likely to be correlated with other mother-specific unobserved variables (Abrevaya 2006) that adversely impact birth weight. Therefore, the between effect (and the estimate from the random-intercept model, which is a weighted average of the between and the within estimate) overestimates the effect of smoking. Certainly, this raises the question whether there is a meaningful interpretation of the between effect. The estimate is obviously biased because it is confounded with the level 2 error. However, a comparison with the within estimate can inform us how much of the observed relation between birth weight and a mother's smoking is due to unobserved heterogeneity between smoking and nonsmoking mothers, which is not accounted for by our model.

3. The estimated standard errors differ slightly because the data used here are unbalanced (Allison 2009, 27).


Table 1. Random-effects, fixed-effects, hybrid, and correlated random-effects linear regression models for birth weight data

                            Model 1:    Model 2:    Model 3:    Model 4:        Model 5:
                            random      fixed       hybrid      correlated re   hybrid,
                            effects     effects                                 random slope
smoke                       −249.06     −105.70                 −105.70
                            (17.42)     (29.53)                 (29.52)
mage                          10.70       23.12                   23.12
                             (1.20)      (3.05)                  (3.05)
dsmoke                                              −105.70                     −106.08
                                                    (29.52)                     (32.33)
dmage                                                 23.12                       23.10
                                                     (3.05)                      (3.05)
black                       −255.03                 −260.03     −260.03         −260.03
                            (26.66)                 (26.62)     (26.62)         (26.56)
msmoke                                              −332.93     −227.23         −332.92
                                                    (21.53)     (36.54)         (21.48)
mmage                                                  7.52      −15.59            7.52
                                                     (1.31)      (3.32)          (1.30)
Sqrt. Variance: dsmoke                                                            234.96
Sqrt. Variance: Level 2      341.66      442.92      341.71      341.66          341.32
Sqrt. Variance: Level 1      374.73      374.73      374.73      374.73          372.47
Level 2: Mothers               3978        3978        3978        3978            3978
Level 1: Infants               8604        8604        8604        8604            8604

Standard errors in parentheses. Constant omitted.


As argued above, the estimated effects of the cluster means will differ between the hybrid model and the correlated random-effects model. In the correlated random-effects model, this is π = β_3 − β_1, as shown in (7). Indeed, the estimated effect of −227.23 for the cluster mean from the correlated random-effects model (model 4) corresponds exactly to the difference of the estimated within-mother and between-mother effects from the hybrid model (model 3): −332.93 − (−105.70) = −227.23. The test of the null hypothesis that the difference of within and between estimates is equal to 0 provides the same results as for the hybrid model [smoking: Wald χ²(1) = 38.68, age: Wald χ²(1) = 22.09].

. estimates restore corr_re

. test msmoke

. test mmage

As columns 4 and 5 in table 1 show, the hybrid model and the correlated random-effects model also provide (identical) effect estimates of the level 2 variable black. The estimated effect of this variable is similar, albeit not identical, to the one obtained from the random-intercept model (model 1). This is because including x̄_i controls more encompassingly for between-cluster differences in x_it.

As pointed out above, a decomposition into between and within effects also allows us to incorporate random slopes. Let us assume we have reason to believe that the within effect of smoking varies across mothers. In that case, we would want to specify a hybrid model with a random slope for the variable dsmoke. This can be done with xtmixed.

. xtmixed birwt dsmoke dmage msmoke mmage black || momid: dsmoke

The results are shown in the last column of table 1 (model 5).

4 Interactions

There are pitfalls in the application of these models when including interactions. Let us say we are interested in including an interaction of smoking by mother's age. Thus our comparison model is the following fixed-effects model:

(birwt_it − \overline{birwt}_i) = β_1 (mage_it − \overline{mage}_i) + β_2 (smoke_it − \overline{smoke}_i)
    + β_3 (smoke_it × mage_it − \overline{smoke × mage}_i) + (ε_it − ε̄_i)    (8)

We can fit this easily by using the operator # to specify the interaction, the c. operator to indicate continuous variables, and i. to indicate factor variables.

. xtreg birwt i.smoke##c.mage, i(momid) fe

. estimates store fe_inter


However, specifying the hybrid model as⁴

. xtreg birwt c.dsmoke##c.dmage c.msmoke##c.mmage black, i(momid) re

. estimates store hybrid_inter_incorrect

is incorrect. Why is that? What we want to estimate is β_k(x_it z_it − \overline{xz}_i), where \overline{xz}_i is the cluster mean of the product. But if we specify the interaction in the hybrid model as above, we estimate β_k{(x_it − x̄_i)(z_it − z̄_i)} = β_k(x_it z_it − x_it z̄_i − x̄_i z_it + x̄_i z̄_i), which produces a completely different result. Therefore, we first have to generate the interaction term x_it z_it, cluster mean center the new variable, and then enter it into the model.

. generate smokeXmage = smoke*mage

. by momid, sort: center smokeXmage if nonmiss==1, prefix(d) mean(m)

. xtreg birwt dsmoke dmage dsmokeXmage msmoke mmage msmokeXmage black, i(momid) re

. estimates store hybrid_inter_correct

Because the correlated random-effects model does not include deviations from the cluster means but does include the variables in their uncentered form, it is possible to specify the interaction of these variables with the operator #. However, to obtain the correct within effect of the interaction, we still have to include the respective interaction terms of the cluster means. This, again, cannot be done with the operator #. We have to control for \overline{xz}_i, the cluster mean of the product, but using # instead yields x̄_i z̄_i, the product of the cluster means.

The following specification therefore results in incorrect estimates:

. xtreg birwt i.smoke##c.mage c.msmoke##c.mmage black, i(momid) re

. estimates store corr_re_inter_incorrect

The correct estimates are obtained through the following specification:

. xtreg birwt i.smoke##c.mage msmoke mmage msmokeXmage black, i(momid) re

. estimates store corr_re_inter_correct

An estimation example is provided in table 2. Model 1 shows the estimates from the standard fixed-effects model. This model estimates the interaction effect of smoking by mother's age as 3.79. Models 2 and 3 present estimates from hybrid models, and models 4 and 5 present estimates from correlated random-effects models. Models 2 and 4 include the incorrect interaction, and models 3 and 5 include the correct interactions. If we compare the estimated interaction effects of models 2 and 4 with the benchmark, model 1, we see that both are incorrect. Moreover, the estimated main effects of models 2 and 4 are, of course, also incorrect. Models 3 and 5, on the other hand, estimate the correct main and interaction effects.⁵

4. Note that group mean-centered variables are continuous.

5. The magnitude of the difference between the correct and the incorrect estimates is considerably larger in the hybrid model than in the correlated random-effects model. This is what we would expect, considering that the difference between the correct and the incorrect interaction terms in the hybrid model, that is, the difference between (x_it z_it − \overline{xz}_i) and {(x_it − x̄_i)(z_it − z̄_i)}, is larger than the difference in the correlated random-effects model, that is, the difference between \overline{xz}_i and x̄_i z̄_i.


Table 2. Fixed-effects, hybrid, and correlated random-effects linear regression models with interactions for birth weight data

                            Model 1:    Model 2:       Model 3:       Model 4:       Model 5:
                            fixed       hybrid,        hybrid,        corr re,       corr re,
                            effects     incorrect      correct        incorrect      correct
                                        interaction    interaction    interaction    interaction
1.smoke                     −205.66                                   −198.40        −205.66
                            (134.01)                                  (133.04)       (133.94)
mage                          22.75                                     22.78          22.75
                             (3.09)                                    (3.09)         (3.09)
dsmoke                                  −103.98        −205.66
                                        (29.54)        (133.94)
dmage                                     23.12          22.75
                                         (3.05)         (3.09)
1.smoke#c.mage                 3.79                                      3.52           3.79
                             (4.96)                                    (4.92)         (4.96)
c.dsmoke#c.dmage                         −43.82
                                        (31.12)
dsmokeXmage                                               3.79
                                                        (4.96)
black                                   −258.03        −258.43        −258.57        −258.43
                                        (26.73)        (26.73)        (26.73)        (26.73)
msmoke                                  −266.03        −265.73         −70.78         −60.07
                                        (104.91)       (104.70)       (169.16)       (170.00)
mmage                                      7.89           7.93         −14.86         −14.82
                                         (1.45)         (1.45)         (3.41)         (3.41)
c.msmoke#c.mmage                          −2.51                         −5.91
                                         (3.84)                        (6.24)
msmokeXmage                                              −2.52                         −6.31
                                                        (3.84)                        (6.27)
Sqrt. Variance: Level 2      443.37      341.83         341.72         341.75         341.69
Sqrt. Variance: Level 1      374.75      374.66         374.75         374.75         374.75
Level 2: Mothers               3978        3978           3978           3978           3978
Level 1: Infants               8604        8604           8604           8604           8604

Standard errors in parentheses. Constant omitted.


If we want to exploit the possibility of estimating within effects in random-effects models via the hybrid specification, we have to include interaction terms in the old-fashioned way by generating interaction variables. If we use the correlated random-effects setup, we still have to generate interactions of the cluster-mean variables by hand.

Unfortunately, in the case of the hybrid model, it becomes impossible to use postestimation commands such as margins in the usual manner. margins relies on factor-variable notation. If interactions are not specified via #, Stata will not automatically take into account both main and interaction effects.

Suppose we are interested in the marginal effect of mother's age, that is, in the partial derivative of (8) with respect to mother's age. Based on the fixed-effects model (model 1), this is 23.28. However, using margins after the hybrid model (model 3) gives 22.75, which is only the main effect. Using margins after the correlated random-effects model (model 5), on the other hand, gives the correct estimate of 23.28.⁶

. estimates restore fe_inter

. margins, dydx(mage)

. estimates restore hybrid_inter_correct

. margins, dydx(dmage)

. estimates restore corr_re_inter_correct

. margins, dydx(mage)

5 Conclusion

I discussed some advantages that correlated random-effects and hybrid models offer. In either case, a decomposition of within and between effects in a single model increases flexibility in model setup because it combines advantages of fixed- and random-effects models. It allows us to estimate the effect of level 2 variables while providing effect estimates of level 1 variables that are unbiased by a possible correlation with the level 2 error. Moreover, a comparison of within and between effects (or their difference) provides an assessment of the degree to which unobserved heterogeneity in level 2 characteristics is responsible for an observable relation between the outcome and a level 1 variable, which is not accounted for by our model. Importantly, these advantages apply when handling any clustered data, whether it is panel data or multilevel data.

Yet there are some aspects to consider. First, within-effect estimates obtained through random-effects models are not more efficient than those obtained from fixed-effects models. Second, the approach described here offers no remedy for a possible correlation of a level 2 variable and the level 2 error (this would require an instrumental-variables approach; see Hausman and Taylor [1981] when handling panel data). Third, handling interactions in these models may be cumbersome. Nevertheless, these models are useful extensions to the standard random-effects and fixed-effects approaches.

6. Note that depending on what margins is used for, predictions based on fixed-effects models may still differ from those based on correlated random-effects models because the estimated intercepts differ.


6 Acknowledgments

I would like to thank Silvia Maja Melzer, Klaus Pforr, Carsten Sauer, and Peter Valet for their comments and suggestions. I would also like to thank Julia Harand for her help in preparing the manuscript.

7 References

Abrevaya, J. 2006. Estimating the effect of smoking on birth outcomes using a matched panel data approach. Journal of Applied Econometrics 21: 489–519.

Allison, P. D. 2009. Fixed Effects Regression Models. Thousand Oaks, CA: Sage.

Baltagi, B. H. 2008. Econometric Analysis of Panel Data. 4th ed. New York: Wiley.

Burnett, K., and G. Farkas. 2009. Poverty and family structure effects on children's mathematics achievement: Estimates from random and fixed effects models. The Social Science Journal 46: 297–318.

Curran, P. J., and D. J. Bauer. 2011. The disaggregation of within-person and between-person effects in longitudinal models of change. Annual Review of Psychology 62: 583–619.

Epstein, A. J., J. D. Ketcham, S. S. Rathore, and P. W. Groeneveld. 2012. Variations in the use of an innovative technology by payer: The case of drug-eluting stents. Medical Care 50: 1–9.

Halaby, C. N. 2003. Panel models for the analysis of change and growth in life course studies. In Handbook of the Life Course, ed. J. T. Mortimer and M. J. Shanahan, 503–527. New York: Springer.

———. 2004. Panel models in sociological research: Theory into practice. Annual Review of Sociology 30: 507–544.

Hausman, J. A., and W. E. Taylor. 1981. A generalized specification test. Economics Letters 8: 239–245.

Jann, B. 2007a. center: Stata module to center (or standardize) variables. Statistical Software Components S444102, Department of Economics, Boston College. http://ideas.repec.org/c/boc/bocode/s444102.html.

———. 2007b. Stata tip 44: Get a handle on your sample. Stata Journal 7: 266–267.

Jones, A., N. Rice, T. B. d'Uva, and S. Balia. 2007. Applied Health Economics. Abingdon, UK: Routledge.

Kaufman, R. L. 1993. Decomposing longitudinal from cross-unit effects in panel and pooled cross-sectional designs. Sociological Methods and Research 21: 482–504.


Kreft, I. G. G., J. de Leeuw, and L. S. Aiken. 1995. The effect of different forms of centering in hierarchical linear models. Multivariate Behavioral Research 30: 1–21.

Landale, N. S., B. K. Gorman, and R. S. Oropesa. 2006. Selective migration and infant mortality among Puerto Ricans. Maternal and Child Health Journal 10: 351–360.

Mundlak, Y. 1978. On the pooling of time series and cross section data. Econometrica 46: 69–85.

Neuhaus, J. M., and J. D. Kalbfleisch. 1998. Between- and within-cluster covariate effects in the analysis of clustered data. Biometrics 54: 638–645.

Nomaguchi, K. M., and S. L. Brown. 2011. Parental strains and rewards among mothers: The role of education. Journal of Marriage and Family 73: 621–636.

Ousey, G. C., and P. Wilcox. 2007. The interaction of antisocial propensity and life-course varying predictors of delinquent behavior: Differences by method of estimation and implications for theory. Criminology 45: 313–354.

Park, C. Y., M. A. Lee, and A. J. Epstein. 2009. Variation in emergency department wait times for children by race/ethnicity and payment source. Health Services Research Journal 44: 2022–2039.

Phillips, J. A. 2006. Explaining discrepant findings in cross-sectional and longitudinal analyses: An application to U.S. homicide rates. Social Science Research 35: 948–974.

Rabe-Hesketh, S., and A. Skrondal. 2008. Multilevel and Longitudinal Modeling Using Stata. 2nd ed. College Station, TX: Stata Press.

Raudenbush, S. 1989a. "Centering" predictors in multilevel analysis: Choices and consequences. Multilevel Modelling Newsletter 1(2): 10–12.

———. 1989b. A response to Longford and Plewis. Multilevel Modelling Newsletter 1: 8–11.

Schempf, A. H., J. S. Kaufman, L. C. Messer, and P. Mendola. 2011. The neighborhood contribution to black–white perinatal disparities: An example from two North Carolina counties, 1999–2001. American Journal of Epidemiology 174: 744–752.

Teachman, J. 2011. Modeling repeatable events using discrete-time data: Predicting marital dissolution. Journal of Marriage and Family 73: 525–540.

Wooldridge, J. M. 2010. Econometric Analysis of Cross Section and Panel Data. 2nd ed. Cambridge, MA: MIT Press.

Zhou, M. 2011. Intensification of geo-cultural homophily in global trade: Evidence from the gravity model. Social Science Research 40: 193–209.

About the author

Reinhard Schunck is a research and teaching associate in the Department of Sociology at the University of Bielefeld in Germany. His research interests span multilevel and longitudinal data analysis. He works primarily in the field of social stratification and inequality.


The Stata Journal (2013) 13, Number 1, pp. 77–91

Trial sequential boundaries for cumulative meta-analyses

Branko Miladinovic
Center for Evidence-Based Medicine and Health Outcomes Research
University of South Florida
Tampa, FL
[email protected]

Iztok Hozo
Department of Mathematics
Indiana University Northwest
Gary, IN

Benjamin Djulbegovic
Center for Evidence-Based Medicine and Health Outcomes Research
University of South Florida
Tampa, FL

Abstract. We present a new command, metacumbounds, for the estimation of trial sequential monitoring boundaries in cumulative meta-analyses. The approach is based on the Lan–DeMets method for estimating group sequential boundaries in individual randomized controlled trials by using the package ldbounds in R statistical software. Through Stata's metan command, metacumbounds plots the Lan–DeMets bounds, z-values, and p-values obtained from both fixed- and random-effects cumulative meta-analyses. The analysis can be performed with count data or on the hazard scale for time-to-event data.

Keywords: st0284, metacumbounds, trial sequential analysis, cumulative meta-analysis, information size, Lan–DeMets bounds, monitoring boundary, cumulative z score, heterogeneity

1 Introduction

Randomized controlled trials (RCTs) are the gold standard for making causal inferences regarding treatment effects. Meta-analyses of RCTs increase both the power and the precision of estimated treatment effects. However, there is a risk that a meta-analysis may report false-positive results, that is, report a treatment effect when in reality there is none. This is especially true when the pooled estimates are updated with the publication of a new trial in cumulative meta-analyses. A small RCT may result in chance findings and overestimation. To avoid false conclusions, Pogue and Yusuf (1997, 1998) advocated constructing Lan–DeMets trial sequential monitoring boundaries for cumulative meta-analysis. This is analogous to constructing interim treatment sequential monitoring boundaries in a single RCT, where a trial would be terminated if the cumulative z curve crossed the discrete sequential boundary and a treatment effect larger than expected occurred. They calculated the optimal information size based on the assumption that participants originated from a single trial.

More recently, Wetterslev et al. (2008) adjusted the method for heterogeneity and labeled it trial sequential analysis (TSA). Their approach accounted for bias and observed heterogeneity in a retrospective cumulative meta-analysis. We implement TSA in Stata under the command metacumbounds and with the ldbounds package in open-source R statistical software, which calculates bounds by using the Lan–DeMets α spending function approach. metacumbounds is the first widely available package to construct monitoring bounds for cumulative meta-analysis for both count data and information in the form of hazard ratios for time-to-event data. Analyzing time-to-event data on the count scale leads to the loss of valuable information, decreases the power, and should be avoided. Tierney et al. (2007) discuss methods for extracting hazard ratios from published data. The option to construct monitoring bounds for cumulative meta-analysis on the hazard scale has not been available in the domain of public software and, to our knowledge, is presented here for the first time. In section 2, we discuss the methodology behind TSA. In section 3, we describe how to install R and the packages needed to implement metacumbounds. In section 4, we present the command metacumbounds, and in section 5, the command is illustrated with two examples from published literature.

2 Methods

Group sequential analysis for individual RCTs was introduced by Armitage (1969) and Pocock (1977). Gordon Lan and Demets (1983) made the methods for controlling the type I error when interim analyses are conducted more flexible by introducing the z curve and α spending function, which produce either the O'Brien–Fleming or the Pocock type boundaries. Under this method, the progress of a single RCT is measured over time, and the trial is terminated early if the cumulative z curve crosses a discrete sequential boundary. The boundary depends on the number of decision times and the rate at which the prespecified type I error α is spent, independent of the number of future decision times. The probability of terminating a trial early at time ti is calculated as the proportion of α that should be spent at ti minus the α already used in the past. We use five different spending functions (Demets and Gordon Lan 1994):

(i) O'Brien–Fleming spending function

    α(t) = 0 for t = 0;    α(t) = 2 − 2Φ(Z_{α/2}/√t) for 0 < t ≤ 1

(ii) Pocock spending function

    α(t) = 0 for t = 0;    α(t) = α ln{1 + (e − 1)t} for 0 < t ≤ 1

(iii) Alpha × time

    α(t) = 0 for t = 0;    α(t) = αt for 0 < t ≤ 1

(iv) Alpha × time^1.5

    α(t) = 0 for t = 0;    α(t) = αt^1.5 for 0 < t ≤ 1

(v) Alpha × time^2

    α(t) = 0 for t = 0;    α(t) = αt^2 for 0 < t ≤ 1
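To make these definitions concrete, the amount of α spent at a given information fraction can be computed directly in Stata. The following minimal do-file sketch (the values of α and t are arbitrary illustrations, not taken from the article) evaluates each spending function at t = 0.5:

    local alpha = 0.05
    local t = 0.5
    display "O'Brien-Fleming: " 2 - 2*normal(invnormal(1 - `alpha'/2)/sqrt(`t'))
    display "Pocock:          " `alpha'*ln(1 + (exp(1) - 1)*`t')
    display "alpha x t:       " `alpha'*`t'
    display "alpha x t^1.5:   " `alpha'*`t'^1.5
    display "alpha x t^2:     " `alpha'*`t'^2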

Pogue and Yusuf (1997) extended the methodology to cumulative meta-analysis, where its progress is monitored as the relevant information is accrued over time. The total number of observed patients in the cumulative meta-analysis is defined as the accrued information size (AIS). Assuming that the information size (that is, the sample size) needed is at least equal to the sample size required in an individual RCT, given the prespecified type I error α and power (1 − β), the required a priori anticipated information size (APIS) based on a prespecified intervention effect is defined as

    APIS = (4ν/μ²)(Z_{α/2} + Z_β)²

Here μ is the intervention effect and ν its variance, assuming equal size between the intervention and control groups. For count data and event rates pc and pe in the control and experimental groups, μ = pc − pe and ν = p∗(1 − p∗), where p∗ = (pc + pe)/2. The a priori relative risk reduction (RRR) is defined as RRR = 1 − pe/pc.
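As a worked illustration of this formula (a sketch only; the control-group event rate and RRR below are hypothetical values, not taken from the article), APIS for count data can be computed in a do-file as:

    local alpha = 0.05
    local beta  = 0.20
    local pc    = 0.25                // assumed control-group event rate (hypothetical)
    local rrr   = 0.15                // assumed a priori relative risk reduction (hypothetical)
    local pe    = `pc'*(1 - `rrr')    // implied experimental-group event rate
    local pstar = (`pc' + `pe')/2
    display "APIS = " 4*`pstar'*(1 - `pstar')/(`pc' - `pe')^2 * (invnormal(1 - `alpha'/2) + invnormal(1 - `beta'))^2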

If we use the results of Lachin and Foulkes (1986), the required APIS for time-to-event data and assumed hazard ratio HR0, expected censoring rate w (that is, loss to follow-up), and average survival rate across studies S is given by

    APIS = {(Z_{α/2} + Z_β)² / ((1 − w)(1 − S))} {(HR0 + 1)/(HR0 − 1)}²

Individual RCTs may be biased. It is well accepted that trials with a high risk of bias due to inadequate randomization sequence generation, intention-to-treat analysis, allocation concealment, masking, or reported incomplete outcome data may overestimate intervention effects. The RRR and the low-bias information size (LBIS) are thus calculated by applying the intervention effects from low-bias trials only. Combining trials as if participants came from one mega-trial may bias the results because of heterogeneity. To account for uncertainty induced by heterogeneity, we must adjust (multiply) the information size by 1/(1 − I²) to calculate the low-bias heterogeneity-adjusted information size (LBHIS). Note that I² is heterogeneity defined as

    I² = (Q − k + 1)/Q

and Q is Cochran's homogeneity statistic. Once the information size is calculated, the monitoring bounds can be updated over time as new trials are published and the meta-analysis is updated. Brok et al. (2008) present a set of examples of two-sided TSA for four different cumulative z curves (see figure 1).
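A corresponding do-file sketch for the time-to-event case (again with purely illustrative values for HR0, w, S, LBIS, and I², none of which come from the article) is:

    local alpha = 0.05
    local beta  = 0.20
    local hr0   = 0.85                // assumed a priori hazard ratio (hypothetical)
    local w     = 0.10                // assumed loss to follow-up (hypothetical)
    local S     = 0.40                // assumed average survival rate (hypothetical)
    display "APIS  = " (invnormal(1 - `alpha'/2) + invnormal(1 - `beta'))^2 / ((1 - `w')*(1 - `S')) * ((`hr0' + 1)/(`hr0' - 1))^2
    display "LBHIS = " 2000/(1 - 0.25)    // for example, LBIS = 2,000 with I-squared = 25% (hypothetical)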

Figure 1. Examples of the upper half of two-sided TSA

(A) Crossing of Z = 1.96 provides a significant result but a spurious effect because the z curve does not cross the monitoring boundary. This is a false positive result.

(B) Crossing of the monitoring boundary before reaching the information size provides firm evidence of an effect. This is a true positive result.

(C) The z curve not crossing Z = 1.96 indicates absence of evidence; that is, the meta-analysis included fewer patients than the required information size. This is a false negative result.

(D) Lack of the predefined effect even though the information size is reached. This is a true negative result.

The monitoring boundary typically moves right and down over time. However, it may move right and up if the event rate decreases, the intervention effect increases, or heterogeneity increases. In the context of LBIS and LBHIS, crossing of the monitoring bounds before the information size is reached indicates that high-bias risk trials find a larger intervention effect compared with low-bias risk trials.

3 R statistical software

R statistical software is an open-source package that may be downloaded free of charge at http://www.r-project.org. To use metacumbounds, after installing R, the user needs to install the R packages foreign (to read and write Stata data files) and ldbounds (to compute group sequential bounds by using the Lan–DeMets method with either the O'Brien–Fleming or the Pocock spending functions). The package ldbounds is based on the Fortran code ld98 by Reboussin et al. (2000). Statistical packages can be downloaded from the Comprehensive R Archive Network from a multitude of mirror websites within R. This is done by selecting Packages > Install package(s)... and then the mirror site closest to the user (figure 2 outlines the steps). The USA(MD) Comprehensive R Archive Network mirror highlighted in figure 2 is at the United States National Cancer Institute (http://watson.nci.nih.gov/cran_mirror/).

Figure 2. ldbounds installation description

Note that R does not have to be running when Stata is executing the metacumbounds command. The Stata program rsource is used to run R from inside Stata. It works by running the Rterm.exe program and may be downloaded from within Stata by typing ssc install rsource.
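For completeness, both user-written Stata dependencies can be obtained from the Statistical Software Components archive; a typical one-time setup (assuming an Internet connection; metacumbounds also builds on the metan command, as noted in section 4.1) is:

. ssc install metan
. ssc install rsource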

4 The metacumbounds command

4.1 Syntax for metacumbounds

Our command metacumbounds assumes that Stata's metan command (Harris et al. 2008) has been installed. Because of the complexity of the syntax and to facilitate its implementation, we have included the dialog-box file metacumbounds.dlg, which should be placed in the active Stata directory.


metacumbounds varlist [if] [in], data(count | loghr) effect(f | r) spending(string)
    rdir(string) is(ais | apis | lbis | lbhis) [id(strvar) surv(#) loss(#)
    lbid(varname) stat(rr | or | rd) wkdir(string) kprsrce(string) alpha(#)
    beta(#) graph rrr(#) listRout listRin keepR graph options]

where varlist contains either count data or log hazard-ratios, their standard errors, and trial sample size.

4.2 Options

data(count | loghr) specifies whether the analysis is done for count data or on the log-hazard scale for time-to-event outcomes. Under the data(count) option, the user can specify the effect size based on the risk ratio, odds ratio, or risk difference. For both data(count) and data(loghr), the output is on the natural scale. logrr or logor may equally be used under the loghr option in the unlikely event that the count data are unavailable, in which case the survival rate S and loss to follow-up are both equal to 0. data() is required.

effect(f | r) specifies whether fixed- or random-effects estimates are used in the output and graph. If the fixed-effects model is chosen and heterogeneity I² is greater than 30%, then a warning message is displayed. The pooling method used is the inverse-variance method (fixedi and randomi in metan). effect() is required.

spending(string) specifies the spending function that is calculated by ldbounds in R. spending(1) computes O'Brien–Fleming type bounds. spending(2) computes Pocock type bounds. spending(3) computes bounds of type αt. spending(4) computes bounds of type αt^1.5. spending(5) computes bounds of type αt^2. spending() is required.

rdir(string) lists the path of the directory where the binary files for R can be found. rdir() is required.

is(ais | apis | lbis | lbhis) specifies the method to be used for the information-size calculation. is() is required.

ais represents the simple accrued information size, that is, the fraction of the total number of participants in the meta-analysis used up to that point. The assumed a priori RRR (RRR = rrr()) is used to determine the power of the test for the given alpha and the given (actual) sample size.

apis represents the a priori information size and means that the total sample size will be calculated so that the trial has the a priori intervention effect (RRR = rrr()) on the incidence rate in the control group (which is calculated from the provided trial data). The incidence rate for the experimental group is calculated using this RRR. The RRR is given by the user, as are alpha and beta. These variables are then used to determine the sample size (APIS).


lbis represents the low-bias information size and means that the total sample size will be calculated using the incidence rate of only those trials for which the low-bias ID variable is greater than 0 (a value of 1 flags a low-bias trial; a value of 0 flags a high-bias trial). The intervention effect (RRR) is now calculated from the incidence rates of both control and experimental groups for only those trials for which the low-bias ID variable is greater than 0. For this RRR and for user-specified alpha and beta, we calculate the required sample size and call it LBIS.

lbhis (low-bias heterogeneity-adjusted information size) is the same as lbis except adjusted for heterogeneity; that is, LBHIS = LBIS/(1 − I²), where I² is the heterogeneity index of this group of trials for the given statistic.

id(strvar) is a character variable used to label the studies. If the data contain a labeled numeric variable, then the decode command can be used to create a character variable.

surv(#) for hazard-ratio data specifies the overall average survival rate and is defined on [0, 1).

loss(#) for hazard-ratio data specifies the proportion of patients lost to follow-up and is defined on [0, 1).

lbid(varname) specifies whether each study is low risk for bias (coded 1) or high risk for bias (coded 0) under is(lbis) or is(lbhis).

stat(rr | or | rd) for count data specifies the effect size (risk ratio, odds ratio, or risk difference) to be pooled.

wkdir(string) is the directory where all the files should be saved.

kprsrce(string) saves the R source file after the program is completed.

alpha(#) specifies the type I error. # must be between 0 and 1.

beta(#) specifies the type II error. # must be between 0 and 1.

graph requests a graph.

rrr(#) specifies the trial a priori intervention effect size (RRR) used to calculate APIS. For LBIS and LBHIS, rrr() is calculated from low-bias trials only.

listRout lists the R output on the Stata screen.

listRin lists the R source file on the Stata screen.

keepR keeps the R source file.

graph options are overall graph options. shwRRR and pos() allow for the addition and positioning of the RRR, α, and power on the graph; xtitle(string) and ytitle(string) add labels to the x and y axes; title(string) and subtitle(string) add the title and subtitle to the graph. The dialog box makes performing TSA easier.


5 Examples

5.1 Example 1: Effects of artery catheter tip position in the newborn

Wetterslev et al. (2008) performed TSA with data from a systematic review by Barrington (2000). One of the review's aims was to determine whether the position (high versus low) of the tip of an umbilical arterial catheter led to clinical vascular compromise. Out of five total trials, only one was found to have adequate allocation concealment and was considered low bias (table 1). The author reported that high-placed catheters were found to produce a significantly lower incidence of clinical vascular complications, with RRR = 47% (95% confidence interval [CI]: 37%–56%).

Table 1. High versus low catheter position for clinical vascular compromise

Study               High (n/N)   Low (n/N)   Low bias
Harris (1978)          3/18        12/18      no
Mokrohisky (1978)      9/33        26/40      no
Stork (1984)          12/85        25/97      no
Kempley (1992)        34/162       66/146     no
UACTSG (1992)         77/481      130/489     yes

For LBIS and LBHIS to be calculated, the low-bias ID variable needs to be specified. In their analysis, Wetterslev et al. (2008) assumed RRR = 15% based on clinical significance. Figure 3 provides a screenshot of the dialog box used to perform the TSA analysis, which confirms the results from the systematic review in figure 4(a)–(c). The figure also displays the actual power achieved given the information size. The trial sequential monitoring boundary (TSMB) for AIS and APIS detected three potentially spurious p-values; the TSMB for LBIS and LBHIS detected two potentially spurious levels.

Figure 3. Dialog box used to create figure 4


[Figure 4 comprises three panels, each plotting the cumulative z curve and the monitoring boundary against information size.]

(a) Results showing three potentially spurious p-values for AIS of 1,569 patients (RRR = 15%, alpha = 5%, power = 56%).

(b) Results showing three potentially spurious p-values for APIS of 2,743 patients (RRR = 15%, alpha = 5%, power = 80%).

(c) Results showing three potentially spurious p-values for LBIS of 470 patients (RRR = 40%, alpha = 5%, power = 80%). Note that because LBIS equals LBHIS, results for the latter are the same.

Figure 4. TSA on the effects of umbilical artery catheter position in newborns


. use example1

. metacumbounds a b c d, data(count) effect(f) id(study) alpha(0.05) beta(0.20)
>   is(AIS) stat(rr) graph spending(1) rrr(.15) kprsrce(StataRsource.R)
>   rdir(C:\Program Files\R\R-2.12.2\bin\i386) shwRRR pos(10)
>   xtitle(Information size)

Isquare = 0.00%

Cumulative fixed-effects meta-analysis of 5 studies with Lan-DeMets bounds
---------------------------------------------------------------
                     Cumulative
Trial              estimate(rr)       z   P val   partN      UB
Harris_1978               0.250   2.508   0.012      36   8.000
Mokrohisky_1978           0.371   3.691   0.000     109   8.000
Stork_1984                0.436   4.041   0.000     291   5.128
Kempley_1992              0.452   5.911   0.000     599   3.445
UACTSG_1992               0.525   6.936   0.000    1569   1.962

5.2 Example 2: Neoadjuvant chemotherapy for invasive bladder cancer

Advanced Bladder Cancer Meta-analysis Collaboration (2011) conducted an individual patient data meta-analysis to study whether neoadjuvant chemotherapy improves survival in patients with invasive bladder cancer. They concluded that the hazard ratio for all trials, including single-agent cisplatin, tended to favor neoadjuvant chemotherapy, with RRR = 11% (95% CI: 2%–19%); the results were reported on the hazard scale as HR = 0.89 (95% CI: 81%–98%). All 10 trials were found to have adequate allocation concealment and were considered low bias (see table 2). Because I² = 0%, fixed- and random-effects meta-analyses produce identical TSMBs, and LBIS equals LBHIS.

Table 2. Neoadjuvant chemotherapy for invasive bladder cancer

Study               Neoadjuvant (n/N)   Local (n/N)   HR [95% CI]         Low bias
Raghavan (1991)           34/41             37/55     1.43 [0.88, 2.31]    yes
Wallace (1991)            59/83             50/76     1.11 [0.76, 1.61]    yes
Martinez (1995)           43/62             38/59     1.02 [0.66, 1.57]    yes
Malmstrom (1996)          68/151            84/160    0.77 [0.56, 1.06]    yes
Cortesi (unpub)           43/82             41/71     0.91 [0.60, 1.40]    yes
Bassi (1999)              53/102            60/104    0.93 [0.64, 1.35]    yes
MRC/EORTC (1999)         275/491           301/485    0.85 [0.72, 1.00]    yes
Sherif (2002)             79/158            90/159    0.86 [0.64, 1.16]    yes
Sengelov (2002)           70/78             60/75     1.06 [0.75, 1.50]    yes
Grossman (2003)           98/158           108/159    0.77 [0.58, 1.01]    yes

Figure 5 provides a screenshot of the dialog box used to perform the analysis. Using the estimated average survival rate of S = 40% and assuming w = 0% loss to follow-up, we found that TSA confirms the results from the systematic review for AIS [figure 6(a)–(c)]. The TSMB crosses the z curve for AIS of 2,809 patients. The TSA confirms the results for the systematic review of APIS = 1,990 under assumed RRR = 15%, α = 0.05, and power (1 − β) = 0.8. However, the results of the systematic review do not hold under estimated LBIS = LBHIS = 4,418. There was one spurious p-value (Grossman trial) under LBIS and LBHIS estimates.

Figure 5. Dialog box used to create figure 6


[Figure 6 comprises three panels, each plotting the cumulative z curve and the monitoring boundary against information size.]

(a) Results showing that the TSMB crosses the z curve for AIS of 2,809 patients (RRR = 11%, alpha = 5%, power = 67%).

(b) Results for APIS of 1,990 patients (RRR = 15%, alpha = 5%, power = 80%).

(c) Results for LBIS of 4,418 patients (RRR = 10%, alpha = 5%, power = 80%). Note that because LBIS equals LBHIS, results for the latter are the same.

Figure 6. TSA on the effects of neoadjuvant chemotherapy for invasive bladder cancer


. use example2

. metacumbounds ln_hr se_ln_hr N, data(loghr) effect(r) id(study) surv(0.40)
>   loss(0.00) alpha(0.05) beta(0.20) is(APIS) graph spending(1) rrr(.15)
>   kprsrce(StataRsource.R) rdir(C:\Program Files\R\R-2.12.2\bin\i386\)
>   shwRRR pos(10) xtitle(Information size)

Isquare = 0.00%

Cumulative random-effects meta-analysis of 10 studies with Lan-DeMets bounds
----------------------------------------------------------------
                     Cumulative
Trial                estimate()       z   P val   partN      UB
Raghavan_1991             1.430   1.453   0.146      96   8.000
Wallace_1991              1.221   1.322   0.186     255   8.000
Martinez_1995             1.153   1.142   0.253     376   5.087
Malmstrom_1996            1.019   0.143   0.887     687   3.640
Cortesi_1997              0.989   0.106   0.915     840   3.281
Bassi_1999                0.971   0.360   0.719    1046   2.910
MRC_EORTC_1999            0.917   1.384   0.166    2022       .
Sherif_2002               0.903   1.876   0.061    2339       .
Sengelov_2002             0.915   1.696   0.090    2492       .
Grossman_2003b            0.897   2.229   0.026    2809       .

6 Discussion

We presented a command, metacumbounds, for the implementation of TSA in Stata, which we recommend to minimize the risk of random error when performing cumulative meta-analyses. This way, the risk of finding a difference in treatment effects where no difference exists is minimized. The command uses a package for constructing Lan–DeMets bounds in the open-source R statistical software.

metacumbounds can be implemented by using either fixed- or random-effects meta-analysis. It can incorporate heterogeneity in the calculation of boundaries. The method can be applied with count data or on the hazard scale for time-to-event data; TSA for both has not previously been available in publicly available software. In addition to subgroup analysis, funnel plots, and meta-regression, the plot of the cumulative z curve and monitoring boundaries, together with APIS and LBIS (or LBHIS in the presence of heterogeneity), should be a standard supplement to any meta-analysis.

7 Acknowledgment

The rsource program we used to run the R statistical software through Stata was developed by Roger Newson of Imperial College London.

8 References

Advanced Bladder Cancer Meta-analysis Collaboration. 2011. Neoadjuvant cisplatin for advanced bladder cancer. Cochrane Database of Systematic Reviews 6: CD001426.


Armitage, P. 1969. Sequential analysis in therapeutic trials. Annual Review of Medicine 20: 425–430.

Barrington, K. J. 2000. Umbilical artery catheters in the newborn: Effects of position of the catheter tip. Cochrane Database of Systematic Reviews 2: CD000505.

Brok, J., K. Thorlund, C. Gluud, and J. Wetterslev. 2008. Trial sequential analysis reveals insufficient information size and potentially false positive results in many meta-analyses. Journal of Clinical Epidemiology 61: 763–769.

Demets, D. L., and K. K. Gordon Lan. 1994. Interim analysis: The alpha spending function approach. Statistics in Medicine 13: 1341–1352.

Gordon Lan, K. K., and D. L. Demets. 1983. Discrete sequential boundaries for clinical trials. Biometrika 70: 659–663.

Harris, R. J., M. J. Bradburn, J. J. Deeks, R. M. Harbord, D. G. Altman, and J. A. C. Sterne. 2008. metan: Fixed- and random-effects meta-analysis. Stata Journal 8: 3–28.

Lachin, J. M., and M. A. Foulkes. 1986. Evaluation of sample size and power for analyses of survival with allowance for nonuniform patient entry, losses to follow-up, noncompliance, and stratification. Biometrics 42: 507–519.

Pocock, S. J. 1977. Group sequential methods in the design and analysis of clinical trials. Biometrika 64: 191–199.

Pogue, J. M., and S. Yusuf. 1997. Cumulating evidence from randomized trials: Utilizing sequential monitoring boundaries for cumulative meta-analysis. Controlled Clinical Trials 18: 580–593.

———. 1998. Overcoming the limitations of current meta-analysis of randomised controlled trials. Lancet 351: 47–52.

Reboussin, D. M., D. L. DeMets, K. Kim, and K. K. G. Lan. 2000. Computations for group sequential boundaries using the Lan–DeMets spending function method. Controlled Clinical Trials 21: 190–207.

Tierney, J. F., L. A. Stewart, D. Ghersi, S. Burdett, and M. R. Sydes. 2007. Practical methods for incorporating summary time-to-event data into meta-analysis. Trials 8: 16.

Wetterslev, J., K. Thorlund, J. Brok, and C. Gluud. 2008. Trial sequential analysis may establish when firm evidence is reached in cumulative meta-analysis. Journal of Clinical Epidemiology 61: 64–75.

About the authors

Branko Miladinovic is an assistant professor of biostatistics in the Center for Evidence-Based Medicine at the University of South Florida. His recent research has focused on meta-analysis and extreme value distributions in both frequentist and Bayesian settings.


Iztok Hozo is a professor of mathematics and actuarial sciences at Indiana University Northwest. His research interests include medical decision making, acceptable regret theory, and meta-analysis.

Benjamin Djulbegovic is a distinguished professor of medicine and oncology at the University of South Florida and the H. Lee Moffitt Cancer Center and Research Institute. He is also the Director of the Center for Evidence-Based Medicine and the Associate Dean for Clinical Research at the University of South Florida. His major academic interests lie in the areas of evidence-based medicine, decision analysis, clinical reasoning, systematic reviews and meta-analysis, comparative effectiveness research, the ethics of clinical trials, practice guidelines, outcomes research, the impact of clinical trials, and the role of uncertainty in medicine.


The Stata Journal (2013) 13, Number 1, pp. 92–106

Regression anatomy, revealed

Valerio Filoso
Department of Economics
University of Naples "Federico II"
Naples, Italy
[email protected]

Abstract. The regression anatomy theorem (Angrist and Pischke, 2009, Mostly Harmless Econometrics: An Empiricist's Companion [Princeton University Press]) is an alternative formulation of the Frisch–Waugh–Lovell theorem (Frisch and Waugh, 1933, Econometrica 1: 387–401; Lovell, 1963, Journal of the American Statistical Association 58: 993–1010), a key finding in the algebra of ordinary least-squares multiple regression models. In this article, I present a command, reganat, to implement graphically the method of regression anatomy. This addition complements the built-in Stata command avplot in the validation of linear models, producing bidimensional scatterplots and regression lines obtained by controlling for the other covariates, along with several fine-tuning options. Moreover, I provide 1) a fully worked-out proof of the regression anatomy theorem and 2) an explanation of how the regression anatomy and the Frisch–Waugh–Lovell theorems relate to partial and semipartial correlations, whose coefficients are informative when evaluating relevant variables in a linear regression model.

Keywords: st0285, reganat, regression anatomy, Frisch–Waugh–Lovell theorem, linear models, partial correlation, semipartial correlation

1 Inside the black box

In the case of a linear bivariate model of the type

yi = α + βxi + εi

the ordinary least-squares (OLS) estimator for β has the known simple expression

    β = ∑ᵢ (xi − x̄)(yi − ȳ) / ∑ᵢ (xi − x̄)² = Cov(yi, xi)/Var(xi)

In this framework, a bidimensional scatterplot can be a useful graphical device during model building to detect, for instance, the presence of nonlinearities or anomalous data.

When the model includes more than a single independent variable, there is no straightforward equivalent for the estimation of β, and the same bivariate scatterplot between the dependent variable and the independent variable of interest becomes potentially misleading because, in the general case, the independent variables are not orthogonal to one another. Consequently, most econometric textbooks limit themselves to providing the formula for the β vector of the type

    β = (X′X)⁻¹X′y


and drop altogether any graphical depiction of the relation of interest. Although compact and easy to remember, this formulation is a sort of black box because it hardly reveals anything about what really happens during the estimation of a multivariate OLS model. Furthermore, the link between the β and the moments of the data distribution disappears, buried in the intricacies of matrix algebra.

Luckily, an enlightening interpretation of the β's in the multivariate case exists and has relevant interpreting power. It was originally formulated more than 70 years ago by Frisch and Waugh (1933), revived by Lovell (1963), and implemented in applied econometrics by Angrist and Pischke (2009) under the catchy phrase "regression anatomy". According to this result, given a model with K independent variables, the coefficient β for the kth variable can be written as

    βk = Cov(yi, x̃ki) / Var(x̃ki)

where x̃ki is the residual obtained by regressing xki on all remaining K − 1 independent variables.

The result is striking because it establishes the possibility of breaking a multivariate model with K independent variables into K simpler bivariate models and also sheds light on the machinery of multivariate OLS. This property of OLS does not depend on the underlying data-generating process or on its causal interpretation: it is a purely numerical property of the estimator that holds because of the algebra behind it.

For example, the regression anatomy theorem makes transparent the case of the so-called problem of multicollinearity. In a multivariate model with two variables that are highly linearly related, the theorem implies that for a variable to have a statistically significant β, it must retain sufficient explicative power after the other independent variables have been partialled out. Obviously, this is not likely to happen in a highly multicollinear model because most of the variability is between the regressors and not between the residual variable x̃ki and the dependent variable y.
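A minimal simulated sketch (not part of the original article; all values below are arbitrary) makes the point explicit: when two regressors are nearly collinear, almost no residual variance is left in one of them after partialling out the other.

    clear
    set obs 200
    set seed 12345
    generate x1 = rnormal()
    generate x2 = x1 + 0.05*rnormal()      // x2 is nearly a copy of x1
    generate y  = 1 + x1 + x2 + rnormal()
    quietly regress x2 x1
    predict double x2tilde, residuals       // the partialled-out version of x2
    summarize x2 x2tilde                    // Var(x2tilde) is a tiny fraction of Var(x2)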

While this theorem is widely known as a standard result of the matrix algebra of the OLS model, its practical relevance in the modeling process has been overlooked, say Davidson and MacKinnon (1993), most probably because the original articles had a limited scope; the theorem nonetheless illuminates a very general property of the OLS estimator. Hopefully, the introduction of a Stata command that implements it will help to spread its use in econometric practice.

2 The Frisch–Waugh–Lovell theorem

The regression anatomy theorem is an application of the Frisch–Waugh–Lovell (FWL) theorem about the relationship between the OLS estimator and any vertical partitioning of the data matrix X. Originally, Frisch and Waugh (1933) tackled a confusing issue in time-series econometrics. Because many temporal series exhibit a common temporal trend, it was typical during the early days of econometrics to detrend these variables before entering them in a regression model. The rationale behind this two-stage methodology was to purify the variables from spurious temporal correlation and use only the residual variance in the regression model of interest.

In practice, when an analyst was faced with fitting a model of the type

yi = β0 + β1x1i + · · · + βkxki + · · · + βKxKi + ei (1)

with each variable possibly depending linearly on time, the analyst first estimated a set of K auxiliary regressions of the type

xki = ck + c1kt + eki

and an analogous regression for the dependent variable,

yi = c0y + c1yt + eyi

The analyst then used the residuals from these models to build an analogue to (1):

    ỹi = β′0 + β′1 x̃1i + · · · + β′k x̃ki + · · · + β′K x̃Ki + e′i

Alternatively, other analysts directly entered the time variable in (1) and fit the full model:

    yi = β∗0 + β∗1 x1i + · · · + β∗k xki + · · · + β∗K xKi + dt + e∗i

These two schools of econometric practice debated over the merits and the shortcomings of the respective methods until Frisch and Waugh quite surprisingly demonstrated that the two estimation methods are numerically equivalent; that is, they provide exactly the same results

    β′k = β∗k

and

    e′i = e∗i

In broader terms, the theorem applies to any regression model with two or more independent variables that can be partitioned into two groups:

    y = X′1β1 + X′2β2 + r    (2)

Consider the general OLS model y = X′β + e, with X of dimension N × K. Next partition the X matrix in the following way: let X1 be an N × K1 matrix and let X2 be an N × K2 matrix, with K = K1 + K2. It follows that X = (X1 X2). Let us now consider the model

    M1y = M1X2β2 + e    (3)

where M1 is the matrix projecting off the subspace spanned by the columns of X1. In this formulation, y and the K2 columns of X2 are regressed on X1; then the vector of residuals M1y is regressed on the matrix of residuals M1X2. The FWL theorem states that the β's calculated for (3) are identical to those calculated for (2). A complete proof can be found in advanced econometric textbooks such as those by Davidson and MacKinnon (1993, 19–24) and Ruud (2000, 54–60).
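The numerical equivalence is easy to verify in Stata. The following do-file sketch (it anticipates the auto data used in section 6 and is not part of the original exposition) partials mpg and weight out of both price and length and then regresses residuals on residuals; the slope matches the coefficient on length from the full model:

    sysuse auto, clear
    regress price length mpg weight          // full model: note the coefficient on length
    quietly regress price mpg weight
    predict double ytilde, residuals          // price purged of mpg and weight
    quietly regress length mpg weight
    predict double xtilde, residuals          // length purged of mpg and weight
    regress ytilde xtilde                     // FWL: same slope as in the full model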


3 The regression anatomy theorem

A straightforward implication of the FWL theorem states that the βk coefficient also can be estimated without partialling the remaining variables out of the dependent variable yi. This is exactly the regression anatomy (RA) theorem that Angrist and Pischke (2009) have advanced as a fundamental tool in applied econometrics. In this section, for the sake of simplicity and relevance to my Stata command reganat, I provide a proof restricted to the case in which X is N × K, K1 = 1, and K2 = K − 1, building on the indications provided in Angrist and Pischke (2009).

Theorem 3.1 (Regression anatomy). Given the regression model

yi = β0 + β1x1i + · · · + βkxki + · · · + βKxKi + ei (4)

and an auxiliary regression in which the variable xki is regressed on all the remaining independent variables,

    xki = γ0 + γ1x1i + · · · + γk−1xk−1i + γk+1xk+1i + · · · + γKxKi + fi    (5)

with x̃ki = xki − x̂ki being the residual for the auxiliary regression, the parameter βk can be written as

    βk = Cov(yi, x̃ki) / Var(x̃ki)    (6)

Proof. To prove the theorem, plug (4) and the residual x̃ki from (5) into the covariance Cov(yi, x̃ki) from (6) and obtain

    βk = Cov(β0 + β1x1i + · · · + βkxki + · · · + βKxKi + ei, x̃ki) / Var(x̃ki)
       = Cov(β0 + β1x1i + · · · + βkxki + · · · + βKxKi + ei, fi) / Var(fi)

1. Because by construction E(fi) = 0, it follows that the term β0E(fi) = 0.

2. Because fi is a linear combination of all the independent variables with the exception of xki, it must be that

β1E(fix1i) = · · · = βk−1E(fixk−1i) = βk+1E(fixk+1i) = · · · = βKE(fixKi) = 0

3. Consider now the term E(eifi). This can be written as

    E(eifi) = E(ei x̃ki)
            = E{ei (xki − x̂ki)}
            = E(ei xki) − E(ei x̂ki)


Because ei is uncorrelated with any independent variable, it is also uncorrelated with xki; accordingly, we have E(ei xki) = 0. With regard to the second term of the subtraction, substituting the predicted value from (5), we get

E {ei (γ0 + γ1x1i + · · · + γk−1xk−1i + γk+1xk+1i + · · · + γKxKi)}

Once again, because ei is uncorrelated with any independent variable, the expected value of the terms is equal to 0. Thus it follows that E(eifi) = 0.

4. The only remaining term is E(βk x̃ki xki). The term xki can be substituted by using a rewriting of (5) such that

    xki = E(xki|X−k) + x̃ki

This gives

    E(βk x̃ki xki) = βk E[x̃ki {E(xki|X−k) + x̃ki}]
                  = βk (E{x̃²ki} + E[{E(xki|X−k) x̃ki}])
                  = βk Var(x̃ki)

which follows directly from the orthogonality between E(xki|X−k) and x̃ki.

5. From previous derivations, we finally get

    Cov(yi, x̃ki) = βk Var(x̃ki)

which completes the proof.

4 A comparison between reganat and avplot

Let us sum up our results so far: the value of the coefficient βk can be obtained by the FWL theorem and the RA theorem. While the FWL theorem states that

    βk = Cov(ỹi, x̃ki) / Var(x̃ki)

the RA theorem states that

    βk = Cov(yi, x̃ki) / Var(x̃ki)

There are good reasons to use both formulations when building a multivariate model: both have advantages and shortcomings.


1. Variance of residuals

The OLS residuals obtained by the FWL theorem and the RA theorem are generally different. In particular, those obtained via the FWL theorem coincide with those obtained for the multivariate full OLS model and are valid for inferences about βk, while the residuals obtained via the RA theorem tend to be inflated because

    Var(yi) ≥ Var(ỹi)

This holds true because the variance of y can be written, in the simple case of a univariate model yi = α + βxi + εi, as

    σ²y = β²σ²x + σ²ε

where β²σ²x is the variance of ŷ.

2. Partial and semipartial correlations

In a regression model with just one independent variable, the OLS estimator can be written as

    β = Cov(yi, xi)/Var(xi) = ρyx (σy/σx)

where ρyx is the correlation coefficient between x and y. The same relation applied to a multivariate model provides two alternative expressions when using either the FWL method or the RA method. In the case of the FWL method, we have

    βk = Cov(ỹi, x̃ki)/Var(x̃ki) = ρỹx̃ (σỹ/σx̃)

while in the case of the RA theorem, we have

    βk = Cov(yi, x̃ki)/Var(x̃ki) = ρyx̃ (σy/σx̃)

The term ρỹx̃ is the partial correlation coefficient, while ρyx̃ is the semipartial correlation coefficient. Because the FWL and the RA methods provide the same estimate for βk, we can write the relation between the two types of correlation coefficients as

    ρyx̃ = (σỹ/σy) ρỹx̃

from which it is evident that ρyx̃ ≤ ρỹx̃ because the variance of y is larger than the variance of ỹ.

The advantage of using the semipartial coefficient over the partial coefficient is that the former is expressed in terms of σy units, whereas the latter's metrics depend on the independent variable under study. Thus using the semipartial coefficient allows for a comparison of the relative strength of different independent variables.


3. Semipartial correlations and R2

In a multivariate OLS model, each independent variable's variance can be split into three components:

a. Variance not associated with y

b. Variance associated with y and shared with other regressors

c. Variance associated with y and not shared with other regressors

When you construct an OLS model, the inclusion of a new regressor is valuable when the additional explaining power contained in it is not already fully captured by the other K regressors. Accordingly, the new variable must mainly provide the kind of variance denoted with (c).

A measure of the value of this informative variance for a new regressor is its semipartial correlation coefficient: this fact can be used to decompose the variance in a multivariate model. Under normal conditions, the sum of the squared semipartials can be subtracted from the overall R² for the complete OLS regression to get the value of common variance shared by the independent variables with y.

The squared semipartial coefficient can also be expressed as the gain to the R² due to the inclusion of the kth variable, weighted by the portion of unexplained variance. In formula, this is

    ρ²yx̃k = (R²with − R²without) / {(1 − R²with)(N − K − 1)}

Finally, a correspondence between the correlation coefficient and the R²'s from either the FWL regression or the RA regression can be established. In the case of the univariate model yi = α + βxi + εi, the coefficient of determination R² is defined as β²σ²x/σ²y and is equal to ρ²yx, that is, the squared simple correlation coefficient between y and x. In the same fashion, the R² from the FWL regression is equal to the squared partial correlation coefficient, while the R² from the RA regression is equal to the squared semipartial correlation coefficient; the short sketch after this list illustrates this correspondence.
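The following do-file sketch (using the auto data of section 6; not part of the original article) checks the last correspondence numerically. In up-to-date versions of Stata, pcorr reports both partial and semipartial correlations:

    sysuse auto, clear
    pcorr price length mpg weight             // partial and semipartial correlations of price
    quietly regress length mpg weight
    predict double xtilde, residuals           // length purged of mpg and weight
    quietly regress price xtilde               // the RA regression for length
    display "R2 of the RA regression = " e(r2) // equals the squared semipartial for length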

I must note that Stata includes the official command avplot, which graphs the variable x̃ki against ỹki (the residual of a regression of y on all variables except the kth). Though germane in scope and complementary in many walks of statistical life, reganat is more congruent than avplot with the quantitative interpretation of a multivariate linear model: the former permits an appreciation of the original metrics of yi, while the latter focuses on ỹki, whose metrics are less appealing to the general reader.

In the causal interpretation of the regression model (Angrist and Pischke 2009, chap. 1), the coefficient β is the size of the effect of a causing variable on a dependent variable, free of other competing factors. The same logic relies on the concept of ceteris paribus, that is, the evaluation of a cause with all other factors being equal. While the variable x̃ki is the statistical counterpart of the causing variable, the variable ỹki is less informative than the original yi because it is constrained to have a zero mean.


In applied statistical practice, for example in econometrics (Feyrer, Sacerdote, and Stern 2008), it is customary to present, early in an article, a bidimensional scatterplot of a dependent variable against an explanator of interest, even though the plot is potentially misleading because the variance shared by other potential confounders is not taken into account. Usually, in later pages, the main explanator is plugged into a set of other explanators to fit a regression model, but any scatterplot of the main relation of interest is seldom presented. This is unfortunate because the valuable graphical information derived from the FWL theorem gets lost. Nonetheless, to be worth the effort, the postestimation graph must resemble the original relation of interest. This is exactly the context in which reganat can enrich the visual apparatus available to the applied statistician while saving the original metrics of the variables involved as much as possible.

5 The command reganat

The estimation command reganat is written for Stata 10.1. It has not been tested on previous versions of the program.

5.1 Syntax

The command has the following syntax:

reganat depvar varlist [if] [in] [, dis(varlist) label(varname) biscat biline
    reg nolegend nocovlist fwl semip scheme(graphical scheme)]

Just like any other standard OLS model, a single dependent variable and an array of independent variables are required.

By default, when the user specifies K covariates, the command builds a multigraph made of K bidimensional subgraphs. In each of them, the x axis displays the value of each independent variable free of any correlation with the other variables, while the y axis displays the value of the dependent variable. Within each subgraph, the command displays the scatterplot and the corresponding regression line.

5.2 Options

dis(varlist) restricts the output to the variables in varlist and excludes the rest. Only the specified varlist will be graphed; nonetheless, the other regressors will be used in the background calculations.

label(varname) uses varname to label the observations in the scatterplot.

biscat adds to each subgraph the scatterplot between the dependent variable and the original regressor under study. The observations are displayed using a small triangle. Because E(x̃ki) = 0 by construction and because E(xki) is in general different from 0, the plotting of x̃ki and xki along the same axis requires the variable xki to be shifted by subtracting its mean.

biline adds to each subgraph a regression line calculated over the univariate model in which the dependent variable is regressed only on the regressor under study. To distinguish the two regression lines that appear on the same graph, biline uses a dashed pattern for the one for the univariate model.

reg displays the output of the regression command for the complete model.

nolegend prevents the legend from being displayed.

nocovlist prevents the list of covariates from being displayed.

fwl uses the FWL formulation in place of RA.

semip adds a table with a decomposition of the model’s variance.

scheme(graphical scheme) specifies the graphical scheme to be applied to the composite graph. The default is scheme(sj).

6 An example

Consider the following illustrative example of reganat, without any pretense of establishing a genuine causality model. Suppose that we are interested in the estimation of a simple hedonic model for the price of cars dependent on their technical characteristics. In particular, we want to estimate the effect, if any, of a car's length on its price.

First, we load the classic auto.dta and regress price on length, obtaining

. sysuse auto
(1978 Automobile Data)

. regress price length

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  1,    72) =   16.50
       Model |   118425867     1   118425867           Prob > F      =  0.0001
    Residual |   516639529    72  7175549.01           R-squared     =  0.1865
-------------+------------------------------           Adj R-squared =  0.1752
       Total |   635065396    73  8699525.97           Root MSE      =  2678.7

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      length |   57.20224   14.08047     4.06   0.000     29.13332    85.27115
       _cons |  -4584.899   2664.437    -1.72   0.090    -9896.357     726.559
------------------------------------------------------------------------------


The estimated β is positive. Then because other technical characteristics could influence the selling price, we include mpg (mileage) and weight as additional controls to get

. regress price length mpg weight

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  3,    70) =   12.98
       Model |   226957412     3  75652470.6           Prob > F      =  0.0000
    Residual |   408107984    70  5830114.06           R-squared     =  0.3574
-------------+------------------------------           Adj R-squared =  0.3298
       Total |   635065396    73  8699525.97           Root MSE      =  2414.6

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      length |  -104.8682   39.72154    -2.64   0.010    -184.0903   -25.64607
         mpg |  -86.78928   83.94335    -1.03   0.305     -254.209    80.63046
      weight |   4.364798   1.167455     3.74   0.000     2.036383    6.693213
       _cons |   14542.43   5890.632     2.47   0.016      2793.94    26290.93
------------------------------------------------------------------------------

With this new estimation, the sign of length has become negative. The RA theorem states that this last estimate of β for length also could be obtained in two stages, which is exactly the method deployed by the command.

In the first stage, we regress length on mpg and weight:

. regress length mpg weight

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  2,    71) =  312.22
       Model |  32497.5726     2  16248.7863           Prob > F      =  0.0000
    Residual |  3695.08956    71  52.0435149           R-squared     =  0.8979
-------------+------------------------------           Adj R-squared =  0.8950
       Total |  36192.6622    73  495.789893           Root MSE      =  7.2141

------------------------------------------------------------------------------
      length |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |  -.3554659   .2472287    -1.44   0.155    -.8484259     .137494
      weight |    .024967   .0018404    13.57   0.000     .0212973    .0286366
       _cons |   120.1162    10.3219    11.64   0.000     99.53492    140.6975
------------------------------------------------------------------------------

Here it becomes clear that length and weight are remarkably correlated. In the second stage, we get the residual value of length conditional on mpg and weight by using the model just estimated, and then we regress price on this residual, reslengthr.


. predict reslengthr, residuals

. regress price reslengthr

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  1,    72) =    4.92
       Model |  40636131.6     1  40636131.6           Prob > F      =  0.0297
    Residual |   594429265    72  8255962.01           R-squared     =  0.0640
-------------+------------------------------           Adj R-squared =  0.0510
       Total |   635065396    73  8699525.97           Root MSE      =  2873.3

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  reslengthr |  -104.8682   47.26845    -2.22   0.030    -199.0961   -10.64024
       _cons |   6165.257   334.0165    18.46   0.000     5499.407    6831.107
------------------------------------------------------------------------------

The value of the β from this bivariate regression coincides with that obtained from the multivariate model, although the standard errors are not equal because of the different degrees of freedom used in the calculation.

The command reganat uses the decomposability of the RA theorem to plot the relation between price and length on a bidimensional Cartesian graph, even though the model we are actually using is multivariate. In practice, the command plots price and reslengthr. The following command produces the graph in figure 1.

. reganat price length mpg weight, dis(length)

Regression Anatomy

Dependent variable ...... : price
Independent variables ... : length mpg weight
Plotting ................ : length

[Graph omitted. y axis: Price; x axis: Length (in.), transformed. Annotation: Multivariate slope: −104.868 (39.722). Semipartial rho2: 0.064. Covariates: Length (in.), Mileage (mpg), Weight (lbs.).]

Figure 1. Regression anatomy



The graph displays the variable length after partialling out the influence of mpg and weight. Remarkably, this variable now also assumes negative values, which did not happen in the original data. This happens because residuals have zero expected value by construction; accordingly, the original data have been scaled to have zero mean and are displayed on the x axis together with the residuals.

It is instructive to compare graphically the bivariate model and the multivariate model with the options biscat and biline. This command produces the graph of figure 2.

. reganat price length mpg weight, dis(length) biscat biline

Regression Anatomy

Dependent variable ...... : price
Independent variables ... : length mpg weight
Plotting ................ : length

[Graph omitted. y axis: Price; x axis: Length (in.). Annotations: Multivariate slope: −104.868 (39.722), semipartial rho2: 0.064; Bivariate slope: 57.202 (14.080). Regression lines: Solid = Multivariate, Dashed = Bivariate. Scatterplot: Dots = Transformed data, Triangles = Original data. Covariates: Length (in.), Mileage (mpg), Weight (lbs.).]

Figure 2. Regression anatomy: Original and transformed data

The graph also displays, for both models, the numerical value of β and its standard error (in parentheses). Furthermore, on the same line, the command displays the squared semipartial correlation coefficient. The calculation is obtained using Stata's built-in pcorr command.



The other variables of the model also can be plotted on the graph to check whether the inclusion of additional controls does influence their effect on the dependent variable. This produces the composite graph of figure 3.

. reganat price length mpg weight, dis(length weight) biscat biline

Regression Anatomy

Dependent variable ...... : price
Independent variables ... : length mpg weight
Plotting ................ : length weight

[Graph omitted, two panels. Left panel, Length (in.): Multivariate slope: −104.868 (39.722), semipartial rho2: 0.064; Bivariate slope: 57.202 (14.080). Right panel, Weight (lbs.): Multivariate slope: 4.365 (1.167), semipartial rho2: 0.128; Bivariate slope: 2.044 (0.377). Regression lines: Solid = Multivariate, Dashed = Bivariate. Scatterplot: Dots = Transformed data, Triangles = Original data. Covariates: Length (in.), Mileage (mpg), Weight (lbs.).]

Figure 3. Regression anatomy: Composite graph

The inclusion of additional controls also affects the β for weight; in the bivariate model, its value is less than half as much as in the multivariate model (as is clear from the observation of the different slopes in the right panel).



The command is also useful to decompose the model's variance to get an idea of both the idiosyncratic and the joint contributions of the independent variables. Using the option semip, we get an additional table with partial correlations, semipartial correlations, squared partial correlations, squared semipartial correlations, relevant significance values, and some summary statistics. The results are obtained using Stata's built-in pcorr command.

. reganat price length mpg weight, dis(length) semip

Regression Anatomy

Dependent variable ...... : price
Independent variables ... : length mpg weight
Plotting ................ : length
(obs=74)

Partial and semipartial correlations of price with

              Partial   Semipartial      Partial   Semipartial   Significance
  Variable      Corr.         Corr.      Corr.^2       Corr.^2          Value

    length    -0.3009       -0.2530       0.0906        0.0640         0.0102
       mpg    -0.1226       -0.0991       0.0150        0.0098         0.3047
    weight     0.4080        0.3582       0.1664        0.1283         0.0004

Model's variance decomposition                            Value     Perc.

Variance explained by the X's individually               0.2021    0.5656
Variance common to X's                                    0.1553    0.4344
Variance explained by the model (R-squared)               0.3574

The final table decomposes the model's variance: the vector of the three variables length, mpg, and weight explains 35.74% of price. This explained variance can be broken into the idiosyncratic contribution of each variable (6.4% + 0.98% + 12.83% = 20.21%) and the common variance (15.53%). In conclusion, around 57% of the model's explained variance can be attributed to the specific contribution of the independent variables, while these same variables share around 43% of price's explained variance.
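As a quick arithmetic check (not part of reganat; the figures are simply retyped from the table above), the shares follow directly:

. display 0.0640 + 0.0098 + 0.1283      // idiosyncratic contribution = 0.2021
. display 0.3574 - 0.2021               // common variance = 0.1553
. display 0.2021/0.3574                 // idiosyncratic share of explained variance, about 0.57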

7 Acknowledgments

The author gratefully acknowledges Joshua Angrist for the invaluable support and encouragement provided during the development of the reganat command and for suggesting the title of the article. An anonymous referee deserves thanks for providing several hints, which significantly expanded the scope of this work. The editor's competence and availability have proven crucial throughout the process of submission and revision. Thanks also to Tullio Jappelli, Riccardo Marselli, and Erasmo Papagni for useful suggestions. Any remaining errors are solely the author's responsibility.




About the author

Valerio Filoso is an assistant professor in public finance in the Department of Economics at the University of Naples "Federico II", where he teaches economic analysis of law. His research interests include family and labor economics, the effects of taxation on entrepreneurship, political economy, and monetary governance. He has been a visiting professor in macroeconomics at San Diego State University and a visiting researcher in the Eitan Berglas School of Economics at Tel Aviv University.


The Stata Journal (2013) 13, Number 1, pp. 107–113

kmlmap: A Stata command for producing Google's Keyhole Markup Language

Adam Okulicz-Kozaryn
Rutgers University

Camden, NJ

[email protected]

Abstract. kmlmap produces a Keyhole Markup Language file by using Stata data and geographical coordinates. This file produces a map when uploaded to Google Maps. The resulting map is a so-called choropleth or thematic map, where units of analysis are colored according to values of a variable. You can click on units of analysis on the map to display more information, to zoom, and to do all other things that Google Maps can do. The units of analysis can be points or polygons.

Keywords: gr0055, kmlmap, geographic information systems, Keyhole Markup Language, XML, map, Google Maps, thematic, choropleth

1 Introduction

It is usually useful to map data. By mapping, I mean displaying variables' values by using different colors on a map. For instance, you may produce a map of zip codes with different colors for different income deciles. Mapping is relevant for every researcher. No matter what you study, it always takes place somewhere, and place usually matters. When you map your data, you will observe things that you cannot observe with graphs, tables of summary statistics, or regression coefficients. For instance, values of variables usually cluster in space. This is called positive spatial autocorrelation. As an example, neighboring zip codes tend to have similar crime levels and similar income levels: zip codes in urban areas have high crime and low income, while zip codes in suburbs have the opposite. Despite these benefits, mapping is overlooked in the social sciences. One reason is the lack of good mapping commands in mainstream social science software such as Stata.

Producing a good map can take a lot of time: I do my analysis in Stata, then move data to some other software for mapping, and then move maps to yet another software for online viewing. kmlmap grew out of my frustration with this process—I want to do everything in Stata. Before I get to the details of kmlmap, let's briefly review the available options for mapping data in Stata.




Stata can produce maps by using the user-written tmap or spmap commands, but Stata's mapping capabilities are poor, and they are unlikely to get significantly better1—Stata is a great statistical software, but it is not a geographic information systems (GIS) software (that is, it is not made for mapping). Even R, which has better graphics than Stata, is not a good choice for mapping.

So what is wrong with tmap and spmap? You cannot point and click. With maps, pointing and clicking is helpful: you want to zoom in and out, show different snapshots, click on the features to display more information, and so on. There are many specialized tools for mapping: Qgis, Grass, etc. What is wrong with those? GIS software such as Qgis is a specialized software that requires time to learn and even then requires time to transfer data from Stata.

Another limitation of Stata's tmap and spmap commands as well as specialized GIS software is that they do not allow users to easily share maps.2 You want to show your maps to other people without the hassle of installing and learning any new software. The best way to share any piece of information, including maps, is to put it on a website. There are great web-based tools for map sharing,3 yet they too have familiar limitations—they take time to learn, and they take time to transfer your data.

Most people know how to use Google Maps, and the interface is user friendly. And Google Maps is extensible: you can produce your own maps and share them with others by using easy, well-known interfaces. To produce your own Google map, however, you need to produce Keyhole Markup Language (KML) first. This is where the kmlmap package comes in—it will write KML for you.

2 Google’s KML—how it works

Why KML? It is likely to become popular for academic work. Elsevier, for instance, recommends KML for articles.4 Google uses KML to share geographic information. KML is a form of Extensible Markup Language (XML), and XML is a markup language that uses tags to organize information.5 You can write a KML file that would, for instance, put a polygon on the top of your house and then just upload the file to Google Maps, where you can view it. Such a KML file is fairly simple and looks like this:6

1. It is not possible for Stata (or any other software that runs from a hard drive) to have all the details contained in Google Maps (roads, administrative boundaries, satellite and aerial imagery, and so forth).
2. Yes, there are some add-ons for GIS software, but they are clunky.
3. For instance, http://geocommons.com, http://www.plugandplaymaps.com, and http://www.click2map.com.
4. See http://www.elsevier.com/wps/find/intro.cws_home/googlemaps.
5. You can find out more at http://code.google.com/apis/kml/documentation/kmlreference.html.
6. Most of the KML file is made up of coordinates: latitude and longitude.


<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <name>KML Samples</name>
    <open>1</open>
    <description>My house</description>
    <Placemark>
      <Polygon>
        <outerBoundaryIs>
          <LinearRing>
            <coordinates>
              -112.3349463145932,36.14988705767721,100
              -112.3354019540677,36.14941108398372,100
              -112.3344428289146,36.14878490381308,100
              -112.3331289492913,36.14780840132443,100
              -112.3317019516947,36.14680755678357,100
              -112.331131440106,36.1474173426228,100
              -112.332616324338,36.14845453364654,100
              -112.3339876620524,36.14926570522069,100
              -112.3349463145932,36.14988705767721,100
            </coordinates>
          </LinearRing>
        </outerBoundaryIs>
      </Polygon>
    </Placemark>
  </Document>
</kml>

It is easy to read: there are some tags (for example, <Polygon>) and a set of coordinates (latitude, longitude).7 How can Stata produce KML files? Stata can write any text file, including KML, with the file command. This is the essence of kmlmap: it will take the coordinates and variables' values and write them as a KML file. It will also assign colors to quantiles of a variable that you want to map. Coordinates usually come as a shapefile (.shp); Stata's command shp2dta8 can convert a shapefile to Stata data format (.dta). kmlmap needs two files, one with coordinates and one with data—both in Stata format. You can easily find shapefiles matching your units of analysis online.9
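To show the mechanics, here is a minimal sketch (not taken from kmlmap itself; the file name, placemark name, and coordinates below are made up) of writing a tiny KML file with the file command:

. tempname fh
. file open `fh' using "example.kml", write text replace
. file write `fh' `"<?xml version="1.0" encoding="UTF-8"?>"' _n
. file write `fh' `"<kml xmlns="http://www.opengis.net/kml/2.2">"' _n
. file write `fh' `"  <Placemark><name>Example point</name>"' _n
. file write `fh' `"    <Point><coordinates>-75.12,39.95,0</coordinates></Point>"' _n
. file write `fh' `"  </Placemark>"' _n
. file write `fh' `"</kml>"' _n
. file close `fh'

kmlmap automates exactly this kind of line-by-line writing, adding styles and one placemark or polygon per observation.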

3 The kmlmap command

3.1 Description

kmlmap produces a KML file by using Stata data and geographical coordinates. This KML file produces a map when uploaded to Google Maps. The resulting map is a so-called choropleth or thematic map, where units of analysis are colored according to values of a variable. You can click on units of analysis on the map to display more information, to zoom, and to do all other things that Google Maps can do. The units of analysis can be points or polygons.

7. The resulting map can be viewed here: http://maps.google.com/maps/ms?ie=UTF&msa=0&msid=200156373929468762519.0004a3a45407504768897.
8. It seems that shp2dta does not work well with all shapefiles.
9. For instance, you can get administrative boundaries for many countries from http://www.gadm.org.



3.2 Syntax

kmlmap varlist [if] [in], coord(filename) [n_class(#) s(filename)
      place(varname) sq_siz(string) color(string) plot_var_fmt(%fmt)]

kmlmap will produce a KML file only. You need to upload it to Google Maps to see a map. For details, see section 5 below.

3.3 Options

coord(filename) coordinates file from shp2dta. coord() is required.

n_class(#) specifies the number of classes used to produce colors on the map—it must be between 4 and 10. This number is used to produce quantiles of the variables of interest, and these quantiles will produce colors for thematic mapping.

s(filename) specifies the path and name for the KML file. The default is s(kmlmap.kml) in the current directory.

place(varname) specifies the ID variable that identifies points or polygons, for instance, city name or country name. It is a good idea to make it descriptive because it will appear on the map.

sq_siz(string) specifies the size of a square. Placemarks in Google Maps cannot be colored with KML as of September 2012; hence, points are displayed as squares. The default is sq_siz(0.2), and this is enough to cover a large city.

color(string) specifies the color on the map. string can be red, green, or blue; the color ramp then ranges from black to the specified color. The default, and probably the most appealing ramp ID, is red_green, where color ranges from red (for low values) to green (for high values).

plot_var_fmt(%fmt) formats the display of plot variables; these numbers will appear in Google Maps on the menu and when you click on points or polygons. The default is plot_var_fmt(%9.1f).

4 Example: Mapping population in Oregon

This example illustrates how to map the population in counties in Oregon. First, get a file with geographic coordinates, usually a shapefile, and transform it into Stata format by using shp2dta. shp2dta will produce two files, one with data and one with coordinates. To make sure that the data match the coordinates, you must generate a unique ID variable, _ID (or other variable name specified in the place() option), that will be used to match data with coordinates when you run kmlmap. Also you must not add or delete any rows to make sure that data and coordinates match.



The following installs shp2dta, which produces Stata files from geographical data (so-called shapefiles):

. ssc install shp2dta

. mkdir kmlmap

. cd kmlmap

Get some geographical data with coordinates as a shapefile:

. copy http://geography.uoregon.edu/GeogR/data/shp/orcounty.shp ORpop.shp, replace

. copy http://geography.uoregon.edu/GeogR/data/shp/orcounty.dbf ORpop.dbf, replace

. shp2dta using ORpop.shp, data(dta) coord(coo) replace

. use dta, clear

An ID variable is used to merge your data with coordinates:

// gen _ID=_n // in this case _ID already exists

Now you can do any data-management manipulations on the variables; however, you have to retain all original rows and cannot add new rows:

. describe
. list

Produce labels so that map layers are nicely labeled:

. label variable POP1990 "population in 1990"

. label variable BLACK "Black population"

. label variable WHITE "White population"

Finally, you are ready to produce a KML file. Let's map the population in 1990, with 10 classes, save it as eg.kml, and label each polygon with a NAME variable. Note that you can also map two variables at the same time: there will be two layers on a map.

. kmlmap POP1990, coord(coo) n_class(10) s(eg.kml) place(NAME)
good, _ID exists
(note: file eg.kml not found)
alpha ff red ff green ff blue ff
kml file saved, now use Google Maps to display it -- see helpfile

. kmlmap WHITE BLACK, coord(coo) n_class(10) s(eg1.kml) place(NAME)
good, _ID exists
(note: file eg1.kml not found)
alpha ff red ff green ff blue ff
alpha ff red ff green ff blue ff
kml file saved, now use Google Maps to display it -- see helpfile

5 Quick KML viewing

Keep in mind that kmlmap produces the KML file only, and you need to put it online so that Google Maps can display it. This process can be automated to a point where Stata does everything. You need to set it up, however. This section explains how to do it. If you run your operating system using a command line, skip to KML viewing for power users; if you point and click, continue with KML viewing for regular users.

5.1 KML viewing for regular users

The trick is not to waste time uploading your file each time you produce it. You can use some software that will make a directory on your hard drive publicly available, for instance, http://www.dropbox.com/.10 Dropbox will even work on the Windows operating system.

Once Stata saves the KML file in the public folder,11 just log in to your Dropbox account and get the link for your KML file. Then go to http://maps.google.com and paste that link into the search box. You will then get the link for the map, which is a link to Google Maps that loads the KML file from your Dropbox. The full link will always start with http://maps.google.com/maps?f=q&source=s_q&hl=en&geocode=&q=. To get the full link to your map, just add the http link for your Dropbox KML file path at the end. If your Dropbox KML file path is http://dl.dropbox.com/u/29735658/eg1.kml and if you paste that into http://maps.google.com, then you will get http://maps.google.com/maps?f=q&source=s_q&hl=en&geocode=&q=http://dl.dropbox.com/u/29735658/eg1.kml. Of course, you can put your KML file onto any publicly accessible location and refer to that location from Google Maps.

5.2 KML viewing for power users

You can run shell commands from within Stata by using the ! command, and so you can scp12 your file to your server and fire up your web browser pointing to the location of your file (with the Google Maps prefix):

! scp file_name.kml <name>@<server>:~/public_html/file_name.kml
! chromium "http://maps.google.com/maps?f=q&source=s_q&hl=en&geocode=&q=http://<your_web_address>/file_name.kml" &

For instance, you would point your web browser13 to http://maps.google.com/maps?f=q&source=s_q&hl=en&geocode=&q=http://dl.dropbox.com/u/29735658/eg1.kml. If asked for a password when uploading files with scp, use ssh keys.14

10. You can read more about using Dropbox for sharing KML here: http://ogleearth.com/2009/01/dropbox-kml-file-hosting-synced-from-your-desktop/.
11. The files must be placed in the "Public" folder of Dropbox for Google Maps to access them.
12. scp is a Linux or Unix command but not a Windows command.
13. Note that chromium is a version of Google Chrome web browser; you can substitute it with any other decent browser, for example, firefox.
14. See http://www.thegeekstuff.com/2008/06/perform-ssh-and-scp-without-entering-password-on-openssh/.



6 Limitations and to-do list

kmlmap is simple and rough around the edges—it needs some extending and polishing. Please feel free to tinker with it and improve it. kmlmap does not work with Google Earth. This is not a big limitation: Google Maps is more useful because it can be easily shared over the Internet, and that's the major idea behind kmlmap.

As of September 2012, there is no way15 to turn off the visibility of layers in Google Maps, so they all will be turned on at the beginning, and you need to turn them off. It would be good to add an algorithm (for example, Ramer–Douglas–Peucker) for simplifying polygons; if the KML file is big, it loads slowly. I limited the number of points or polygons to 100,000 and the number of nodes per polygon to 2,500; if your data are bigger than that, you will need to simplify them first (an easy way to simplify data is to use http://www.mapshaper.org/). Another limitation is that the available colors are not very appealing. The available features also could be extended with, for instance, Google Fusion Tables.16 Time permitting, I will work on updating kmlmap.

7 Acknowledgment

I would like to thank an anonymous reviewer for detailed comments.

About the author

Adam Okulicz-Kozaryn is an assistant professor of public policy at Rutgers University, Camden Campus. He obtained his PhD from the University of Texas at Dallas in 2008. His work has focused on a variety of topics: income inequality, preferences for redistribution, urban and rural issues, cultural economics, values, religion, and happiness. He is also interested in information technology and computational social science: automation, data mining, data management, and text processing. He uses Linux, Python, and Stata.

15. See https://groups.google.com/group/kml-support-getting-started/browse_thread/thread/a7caeef04a0787df.
16. See http://googlegeodevelopers.blogspot.com/2010/05/map-your-data-with-maps-api-and-fusion.html.


The Stata Journal (2013) 13, Number 1, pp. 114–135

A menu-driven facility for sample-size calculations in cluster randomized controlled trials

Karla Hemming
University of Birmingham

Birmingham, UK

[email protected]

Jen Marsh
University of Birmingham

Birmingham, UK

[email protected]

Abstract. We introduce the Stata menu-driven command clustersampsi, which calculates sample sizes, detectable differences, and power for cluster randomized controlled trials. The command permits continuous, binary, and rate outcomes (with normal approximations) for comparisons of two-sided tests in two equal-sized arms. The command allows for specification of the number of clusters available, or the cluster size, or the average cluster size along with an estimate of the variation of cluster sizes. When the number of clusters available is insufficient to detect the required difference at the prespecified power, clustersampsi will return the minimum number of clusters required under the prespecified design along with the minimum detectable difference and maximum achievable power (both for the prespecified number of clusters). Cluster heterogeneity can be parameterized by using either the intracluster correlation or the coefficient of variation. The command is illustrated via examples.

Keywords: st0286, clustersampsi, sample size, cluster randomized controlled trials, minimum detectable difference, maximum achievable power

1 Introduction

Sample-size calculations are frequently undertaken for cluster randomized controlled trials (RCTs). This is usually done by prespecifying the average cluster size, obtaining the sample size required under individual randomization, and inflating by the design effect (DE), which is a simple function of the intracluster correlation (ICC) (Donner and Klar 2000). Alternatively, heterogeneity between clusters can be parameterized by the coefficient of variation (standard deviation divided by the mean) of the outcome, with similar two-step procedures (Hayes and Bennett 1999). However, these two-step procedures are sometimes not efficient (for example, when many calculations are required) and sometimes not quite so straightforward. The reasons are outlined below.

Cluster sample-size calculations are not completely straightforward in a number of situations. Complexity arises in cases when the user prespecifies the number of clusters available (as opposed to the average cluster size); when the user requires a power or detectable difference calculation (as opposed to a sample-size calculation); and particularly when the calculation involves binary outcomes. This is because the conventional inflation by the DE is only useful when the user specifies the cluster size and needs to obtain an estimate of the number of clusters needed. When the user specifies the number of clusters available and needs to obtain an estimate of the cluster size, the inflation over that which is required under individual randomization depends on the very quantity the user is trying to compute, the cluster size.

Additionally, because limited precision sets in as the cluster sizes increase, some designs will be infeasible (Guittet, Giraudeau, and Ravaud 2005). That is, irrespective of how large the clusters are made, a fixed number of available clusters might mean there is insufficient power to detect the required difference. When the objective is to calculate power or detectable difference under cluster RCT designs of fixed sample sizes for continuous outcomes, the user can use the simple relationships that exist between the power and detectable differences obtainable under individual randomization and those obtainable under cluster randomization (Hemming et al. 2011). To obtain an estimate of the detectable difference for binary outcomes where the variance depends on the proportion, the user must solve a quadratic equation. This is also the case for the computation of detectable differences for continuous outcomes when the cluster heterogeneity is parameterized by the coefficient of variation.

Currently, several options are available to Stata users planning a cluster RCT. The sampsi command may be used to estimate the required number of clusters (for both binary and continuous outcomes) via a two-step procedure that involves calculating the sample size under individual randomization and inflating this by a self-computed DE. To estimate power for continuous outcomes, the user could also use sampsi after inflating the estimated standard deviation by the square root of the DE. For cluster designs, sampsi cannot be used to estimate detectable differences, power for binary outcomes, or the number of clusters required.

Another two-step method consists of using the sampclus command (Garrett 2001), which again requires the user to calculate the sample size required under individual randomization immediately before implementing the command. With sampclus, the user is permitted to specify either the number of clusters available or the cluster size, and the command returns whichever is not specified. In cases where the number of clusters available is insufficient to detect the required difference at the prespecified power level, the user is alerted and informed of the minimum number of clusters required. sampclus does not compute power (to detect a prespecified difference for a fixed sample size) or detectable difference (at a prespecified power for a fixed sample size).

The command clsampsi (Batistatou and Roberts 2010) was developed primarily for designs with differential clustering between arms. Differential clustering occurs, for example, when the individuals in the intervention arm are grouped (say, group therapy) but there is no grouping in the control arm. While clsampsi does offer a single-step procedure that calculates both the power (for a prespecified difference for a fixed sample size) and the sample size (either the number of clusters or the cluster size), it does not compute the detectable difference and does not alert the user to infeasible designs.



Currently, none of these commands allows computation of the detectable difference, nor do they allow specification of heterogeneity parameterized by the coefficient of variation. In addition, none of these commands allows for varying cluster sizes, repeated measures, or adjustment for covariates. All three of these issues can have important implications for power and so should be considered at the design stage.

In summary, while estimation of sample-size variables for cluster RCTs is not excessively complex, it would be useful to directly compute these quantities in Stata. Currently, there are two options available for Stata users: using Stata's built-in commands in two-step routines where the user modifies either the sample size computed by Stata or modifies the input variables (say, standard deviation) to account for the clustering; or using a user-written Stata command (clsampsi), which is limited to a very specific study design. We have therefore developed a Stata command, clustersampsi, that we believe will be very practical for applied health care researchers involved in the design of cluster RCTs.

2 The clustersampsi command

The new Stata command clustersampsi computes power, sample size (both the number of clusters and the cluster size), and detectable difference (for both fixed and varying cluster sizes), and it alerts the user to infeasible designs (due to an insufficient number of clusters).

When the design is infeasible, clustersampsi computes the minimum number of clusters required (for the prespecified difference and power); the minimum detectable difference (for the prespecified number of clusters and power); and the maximum achievable power (for the prespecified number of clusters and the difference to be detected).

Binary, continuous, and rate outcomes are supported, with normal approximations made throughout. Between-cluster heterogeneity can be specified using either the ICC coefficient or the coefficient of variation of outcomes (cvclusters). An additional option is included to allow downward adjustment of the standard deviations, for example, when baseline measurements are taken.

We outline essential formulas (in the main text and appendix), but details have been presented elsewhere (Hemming et al. 2011; Hayes and Bennett 1999).

2.1 Background

Suppose a trial will test the null hypothesis H0: μ1 = μ2, where μ1 and μ2 represent the means of two populations, by using a two-sample t test and assuming that var(μ1) = σ1² and var(μ2) = σ2². Suppose further that an equal number of individuals will be randomized to both arms, letting d denote the difference to be detected such that d = μ1 − μ2, 1 − β denote the power, and α denote the significance level. Alternatively, we may be interested in comparing two proportions, p1 and p2, or two rates, λ1 and λ2. We limit our consideration to trials with two equal-sized parallel arms (two-sided t tests).



Then we assume normality of outcomes and approximate the variance of the difference of the two proportions or two rates (Hemming et al. 2011). The approximations made for binomial proportions (Armitage, Berry, and Matthews 2002) are slightly different from those made in the sampsi command (details in appendix).

2.2 Sample-size calculations

When a trial randomizes an intervention over a number of clusters each of size m, then by standard results (Murray 1998), the required sample size nC is that required under individual randomization (nI) inflated by the DE,

DE = 1 + (m − 1)ρ

where ρ is the ICC coefficient. This DE is modified for varying cluster sizes by a function that depends on the coefficient of variation of the cluster sizes, cvsizes (Eldridge, Ashby, and Kerry 2006; this term is not to be confused with the coefficient of variation of outcomes, cvclusters, described above).

From this total sample size, the number of clusters (k) required per arm can be calculated. We round up the number of clusters so that the total sample size is a multiple of the cluster size (using the ceiling function). Additionally, we add one extra cluster to each arm to allow for the use of the t distribution (Hayes and Bennett 1999). If the user instead specifies the number of clusters available and needs to determine the average cluster size required per arm (m), the formula can be rearranged to determine the cluster size as a function of the sample size required under individual randomization, the ICC, and the number of clusters available (and also cvsizes). More detailed mathematical formulas are provided in the appendix.
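For readers who want to reproduce the logic by hand, here is a minimal two-step sketch using sampsi (illustrative values only: proportions 0.4 and 0.5, a cluster size of 23, an ICC of 0.005, and equal cluster sizes; sampsi's approximation and clustersampsi's extra rounding mean the two commands will not agree exactly, as noted in section 2.1):

. sampsi 0.4 0.5, power(0.8) alpha(0.05)       // n per arm under individual randomization
. local ni = r(N_1)
. local de = 1 + (23 - 1)*0.005                // design effect for equal cluster sizes
. display "clusters per arm, before rounding = " `ni'*`de'/23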

The between-cluster heterogeneity may be parameterized using either the ICC coefficient or cvclusters; clustersampsi permits specification of either parameter. The sample-size formula for the cvclusters method is outlined below (Hayes and Bennett 1999). The number of clusters k required is

k = 1 + nI/m + CVIF    (1)

where the coefficient of variation inflation factor (CVIF) is

CVIF = cv²clusters (μ1² + μ2²)(zα/2 + zβ)² / d²

where zα/2 denotes the upper 100α/2 standard normal centile.

2.3 Power and detectable difference

Cluster RCTs of fixed size have both a fixed number of clusters, each with a fixed cluster size (but possibly varying between clusters), and a prespecified difference to detect. For such trials, it may be of interest to compute the available power. It turns out that when you parameterize the heterogeneity by using the ICC, the power for cluster RCTs is the power available under individual randomization for a standardized effect size that is deflated by the square root of the DE. Similarly, for cluster RCTs of fixed sample size and prespecified power, the detectable difference is that of a trial using individual randomization inflated by the square root of the DE (Hemming et al. 2011). When parameterizing the heterogeneity with cvclusters, the power available is obtained by a simple rearrangement of the sample-size formula [(1) above], whereas obtaining the detectable difference involves solving a quadratic formula.
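For a continuous outcome, the first of these relationships can be checked with sampsi alone (a sketch with made-up numbers, not output from clustersampsi: a 0.3 standard-deviation effect, 30 clusters of 20 per arm, and an ICC of 0.05):

. local de = 1 + (20 - 1)*0.05                 // design effect
. local dstar = 0.3/sqrt(`de')                 // standardized effect deflated by sqrt(DE)
. sampsi 0 `dstar', sd1(1) n1(600) n2(600)     // power with 30*20 = 600 individuals per arm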

2.4 Infeasible designs

A cluster RCT with a fixed number of clusters will be limited by an upper bound on the maximum available power or a lower bound on the detectable difference. These limits exist because of the diminishing return that sets in when the sample size of each cluster is increased (Donner and Klar 2000). These limiting values are referred to as the maximum achievable power or the minimum detectable difference.

For trials with a fixed number of equal-sized clusters k, the trial will be feasible provided that the number of clusters is greater than the product of the number of individuals required under individual randomization (nI) and the estimated ICC (ρ). So a simple rule is that the number of clusters k will be sufficient provided that

k > (nI × ρ) + 1

or for clusters of varying sizes,

k > {nI × ρ (cv²sizes + 1)} + 1

These formulas differ slightly from those reported elsewhere because of the addition of one more cluster in each arm (to allow for the use of the t distribution). When you parameterize the heterogeneity by the coefficient of variation, the following inequality must hold for the design to be feasible:

k > CVIF + 1

Where these inequalities do not hold, the clustersampsi command will determine the maximum available power to detect the prespecified difference, the minimum detectable difference under the prespecified value for power, and the minimum number of clusters required to detect the prespecified difference at the prespecified value of the power (Hemming et al. 2011).
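As a quick check of the first rule against example 1 below (385 individuals per arm under individual randomization and an ICC of 0.07; rounding up to a whole cluster is my own addition):

. display "clusters per arm must exceed " 385*0.07 + 1      // 27.95, so at least 28 clusters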

2.5 Baseline adjustment: Variance deflation

Baseline measurements and other covariate adjustments lead to increases in power and are useful to consider when designing studies. The implications that adjustment for baseline measurements and predictive covariates has on sample-size calculations can be formulated in a single framework by measuring or estimating the correlation r between either the baseline measurements or the predictive covariate and the outcome. For continuous outcomes, once an estimate of the correlation r is obtained, the variance of the estimate of the outcome is deflated by the factor 1 − r². For binary outcomes, this deflation factor has been shown to be a good approximation (Hernandez, Steyerberg, and Habbema 2004). To use this functionality, the user is therefore required to specify a value of the correlation between either the covariates and the outcome or the baseline values and the outcome.
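For example, with an assumed correlation of r = 0.5 between baseline and outcome, the deflation factor is:

. display 1 - 0.5^2        // variance deflation factor = 0.75, so roughly a 25% smaller sample size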

2.6 The dialog box

The clustersampsi command is designed to be used both through the Command window and through a dialog box. All the features available within the command have been programmed into the dialog box (a .dlg file), and the computations are carried out using the corresponding ado-file. The dialog box includes three tabs:

1. The Main tab allows users to specify whether the calculation to be performed is a sample-size calculation (default), a power calculation, or a detectable difference calculation and whether this calculation is for binary, rates, or continuous (default) outcomes. If users specify a sample-size calculation, then they must also specify whether they desire to prespecify the average cluster size (the default, in which case the command computes the number of clusters required) or whether they wish to prespecify the number of clusters available (in which case the command computes the average cluster size needed). On this tab, the user also specifies the estimated ICC coefficient or the coefficient of variation.

2. The Options tab allows the user to specify the significance level (default 0.05), the power (default 0.8), the number of clusters per arm, the cluster size (or average cluster size), and cvsizes (default 0, indicating all the clusters are the same size). Variables required to be specified on the Options tab are dependent on those specified on the Main tab, and the user will only be able to input the variables relevant to the calculation specified on the Main tab. For example, if the user specifies a power calculation on the Main tab, the power option on the Options tab will be shaded out. If the user specifies a sample-size calculation, then the user must also specify only one of either the number of clusters or the cluster sizes.

3. The Values tab allows the user to specify the proportion, rate, or mean (and standard deviation) values for the two arms, along with an estimate of correlation between any before-and-after measurements or the correlation between any covariates and the outcome (default value of 0). The command is limited to a maximum of one before and one after measurement (that is, it cannot accommodate additional repeated measurements). Once again, depending on the calculations requested on the Main tab (that is, sample size, power, detectable difference, and binary or continuous outcomes), those values not relevant are shaded out.



3 Examples

3.1 Example 1: Illustration of infeasible designs

In a real example, a cluster RCT will be designed to evaluate the effectiveness of support to promote breastfeeding. Randomization will be carried out at a single point in time, randomizing teams of midwives (the clusters) to either the intervention arm or the standard care arm. The trial will be carried out within a single primary care trust, so the number of clusters is limited to the 40 midwifery teams delivering care within the region. A clinically important difference to detect is an increase in the rate of breastfeeding from about 40% to 50%. Estimates of ICC range from 0.005 to 0.07 in similar trials (MacArthur et al. 2003; MacArthur et al. 2009). Using these values, we illustrate how clustersampsi can be used to determine the required cluster size.

Figure 1 shows a screenshot of the Main tab for this calculation to determine the sample size for a Two sample comparison of proportions with an ICC of 0.005 (the lower of the two ICC estimates).

Figure 1. Screenshot of clustersampsi dialog box: Main tab—set up for example 1



Figure 2 shows the corresponding Options tab specifying a Significance level of 0.05 and 80% power. On this Options tab, the Number of clusters per arm is set at 20. The Average cluster size is shaded out because this is a sample-size calculation specifying the number of clusters and obtaining an estimate of the average cluster size required. The Coefficient of variation of cluster sizes is left at the default value of 0 and so assumes the cluster sizes are equal.

Figure 2. Screenshot of clustersampsi dialog box: Options tab—set up for example 1



Figure 3 shows the Values tab for this calculation. Because this is a comparison of binary proportions, the mean, standard deviation, and rate values are shaded out. Proportion 1 is set at 0.4 and Proportion 2 at 0.5. The correlation between before-and-after measurements is set at 0 because no baseline measurements are anticipated in this cross-sectional study.

Figure 3. Screenshot of clustersampsi dialog box: Values tab—set up for example 1

The Stata output from the command is shown below. The output shows that under individual randomization, 385 individuals would be required per arm to detect a change in proportions from 0.4 to 0.5 at 80% power and a 5% significance level. Allowing for cluster randomization with 20 clusters per arm, a total of 23 individuals would be required per cluster, equating to a total sample size of 460 per arm.



. clustersampsi, binomial samplesize p1(0.4) p2(0.5) k(20) rho(0.005)
> size_cv(0) alpha(0.05) beta(0.8) base_correl(0)

Sample size calculation to determine number of observations required per cluster,
for a two sample comparison of proportions (using normal approximations)
without continuity correction.

For the user specified parameters:

p1: 0.4000
p2: 0.5000
significance level: 0.05
power: 0.80
baseline measures adjustment (correlation): 0.00
number of clusters available: 20
intra cluster correlation (ICC): 0.0050
coefficient of variation (of cluster sizes): 0.00

clustersampsi estimated parameters:

Firstly, assuming individual randomisation:
sample size per arm: 385

Then, allowing for cluster randomisation:
average cluster size required: 23
sample size per arm: 460

Note: sample size per arm required under cluster randomisation is rounded
up to a multiple of average cluster size.

In a variation of this example, the ICC is replaced by the higher of the two estimates of 0.07. The output for this computation is provided below. Under this estimate of the ICC, the design becomes infeasible; that is, however many individuals are recruited per cluster, it will not be possible to obtain 80% power to detect a difference between 0.4 and 0.5. In this scenario, the command alerts the user to this fact. The user is told that the minimum number of clusters required to detect a change from 0.4 to 0.5 at 80% power is 28 per arm. Alternatively, the user is told that because of the prespecified number of clusters (here, 20 per arm), the maximum achievable power would be in the region of 65% (that is, with 20 clusters per arm to detect a difference from 0.4 to 0.5, the study would have 65% power), and the minimum detectable difference is 0.12; that is, the design would have 80% power to detect a change from 0.4 to 0.52.



. clustersampsi, binomial samplesize p1(0.4) p2(0.5) k(20) rho(0.07) size_cv(0)
> alpha(0.05) beta(0.8) base_correl(0)

Sample size calculation to determine number of observations required per cluster,
for a two sample comparison of proportions (using normal approximations)
without continuity correction.

For the user specified parameters:

p1: 0.4000
p2: 0.5000
significance level: 0.05
power: 0.80
baseline measures adjustment (correlation): 0.00
number of clusters available: 20
intra cluster correlation (ICC): 0.0700
coefficient of variation (of cluster sizes): 0.00

clustersampsi estimated parameters:

The sample size required under individual randomisation is: 385
The specified design is infeasible under cluster randomisation.

You could consider one of the following three options:
(i) Increase the number of clusters per arm to more than: 28
(ii) Decrease the power to: 0.65
(iii) Increase the difference to be detected. So,
If, trying to detect an increasing outcome then:
decrease the difference to be detected to: 0.1190
with corresponding p2: 0.5190
If, trying to detect a decreasing outcome then:
decrease the difference to be detected to: 0.1134
with corresponding p2: 0.2866
r(198);

3.2 Example 2: Illustrating detectable differences

A cluster RCT in Iran to evaluate the effectiveness of a polypill (composed of aspirin, a statin, and a pill that lowers blood pressure) is to be nested within a longitudinal cohort study (Pourshams et al. 2010). The clustered nature of the trial is thought to be crucial because there is a real danger of contamination because of the sharing of medication. A subset of 5,696 individuals, spread over 258 villages, is eligible and has consented to participate in this study. Villages are to be randomized to an intervention arm or a standard care arm. The average size of each village is 22 (after allowing for potential dropout) with a cvsizes of 0.9; that is, there is considerable variation between the sizes of the clusters. The aim of the intervention is to reduce the composite event rate of stroke or myocardial infarction over five years. The event rate in the control group was estimated to be in the region of 0.077 over the five years. Two estimates of the ICC were obtained from previous, similar studies (0.038 and 0.018).



We illustrate how clustersampsi can be used to determine the effect sizes detectable at 80% power under both estimates for the ICC for the fixed sample size. Initially, we perform the calculations assuming the ICC is 0.018. The output for this calculation is provided below and illustrates the use of cvsizes. The detectable event rate under the intervention arm is 0.053 (assuming a decreasing event rate), which equates to a relative risk of 0.69, that is, a relative risk reduction of 31%.

. clustersampsi, binomial detectabledifference p1(0.077) m(22) k(129)
> rho(0.018) size_cv(0.9) alpha(0.05) beta(0.8) base_correl(0)
Detectable difference calculation for two sample comparison of proportions
> (using normal approximations)
without continuity correction.

For the user specified parameters:

p1: 0.08
significance level: 0.05
power: 0.80
baseline measures adjustment (correlation): 0.00
average cluster size: 22
number of clusters per arm: 129
coefficient of variation (of cluster sizes): 0.90
intra cluster correlation (ICC): 0.0180

clustersampsi estimated parameters:

Firstly, under individual randomisation:
If, trying to detect an increasing outcome then:
detectable difference: 0.02
with corresponding p2: 0.10
If, trying to detect a decreasing outcome then:
detectable difference: 0.02
with corresponding p2: 0.06

Then, allowing for cluster randomisation:
design effect: 1.70
If, trying to detect an increasing outcome then:
detectable difference: 0.03
with corresponding p2: 0.10
If, trying to detect a decreasing outcome then:
detectable difference: 0.02
with corresponding p2: 0.05

Because estimation of the ICC is subject to much uncertainty, we have also carried out the calculation assuming the ICC is 0.038. Again the output is provided below. Here the detectable event rate under the intervention arm is 0.049 (again assuming a decreasing event rate), which equates to a relative risk of 0.63, that is, a 37% relative risk reduction.



. clustersampsi, binomial detectabledifference p1(0.077) m(22) k(129)
> rho(0.038) size_cv(0.9) alpha(0.05) beta(0.8) base_correl(0)
Detectable difference calculation for two sample comparison of proportions
> (using normal approximations)
without continuity correction.

For the user specified parameters:

p1: 0.08
significance level: 0.05
power: 0.80
baseline measures adjustment (correlation): 0.00
average cluster size: 22
number of clusters per arm: 129
coefficient of variation (of cluster sizes): 0.90
intra cluster correlation (ICC): 0.0380

clustersampsi estimated parameters:

Firstly, under individual randomisation:
If, trying to detect an increasing outcome then:
detectable difference: 0.02
with corresponding p2: 0.10
If, trying to detect a decreasing outcome then:
detectable difference: 0.02
with corresponding p2: 0.06

Then, allowing for cluster randomisation:
design effect: 2.48
If, trying to detect an increasing outcome then:
detectable difference: 0.03
with corresponding p2: 0.11
If, trying to detect a decreasing outcome then:
detectable difference: 0.03
with corresponding p2: 0.05

Warning: Normal approximations used close to boundaries might result in
> proportions out of range

A clinically important relative risk is in the region of 0.65, which equates to an event rate in the treatment group of 0.05. If the ICC is as high as 0.038, then the trial will have less than 80% power to detect this difference. We illustrate how clustersampsi can be used to determine the power available to detect the clinically important relative risk, assuming the ICC is 0.038:



. clustersampsi, binomial power p1(0.077) p2(0.05) m(22) k(129) rho(0.038)
> size_cv(0.9) base_correl(0)
Power calculation for a two sample comparison of proportions (using normal
> approximations)
without continuity correction.

For the user specified parameters:

p1: 0.0770
p2: 0.0500
significance level: 0.05
baseline measures adjustment (correlation): 0.00
average cluster size: 22
number of clusters per arm: 129
coefficient of variation (of cluster sizes): 0.90
intra-cluster correlation (ICC): 0.0380

clustersampsi estimated parameters:

Firstly, assuming individual randomisation:
power: 0.99
Then, allowing for cluster randomisation:
design effect: 2.48
power: 0.75

The power available to detect this difference is 75%, close to 80%. Thus the trial will almost be sufficiently powered to detect this difference.

3.3 Example 3: Illustrating the coefficient of variation to measure heterogeneity

Hayes and Bennett (1999) show how the coefficient of variation can be used as an alternative to the ICC to describe the variation in outcomes between clusters. In their illustrative cases, they describe an example of a cluster sample-size calculation for a comparison of rates and for measuring cvclusters. We reproduce this example here and illustrate how clustersampsi could be used to perform this calculation. The objective is to determine the number of clusters required.

The study is designed to detect a difference between two rates, λ1 = 0.0148 and λ2 = 0.0104, at 80% power and 5% significance with approximately 424 person-years of observations in each cluster and with a cvclusters of 0.29. clustersampsi returns the value of 37 clusters per arm:


. clustersampsi, samplesize rates r1(0.0148) r2(0.0104) m(424) cluster_cv(0.29)
> size_cv(0) alpha(0.05) beta(0.8) base_correl(0)
Sample size calculation determining the number of clusters required,
for a two sample comparison of rates (using normal approximations).

For the user specified parameters:

rate 1: 0.0148
rate 2: 0.0104
significance level: 0.05
power: 0.80
baseline measures adjustment (correlation): 0.00
average person years per cluster: 424
cluster coefficient of variation (of outcomes): 0.29

clustersampsi estimated parameters:

Firstly, assuming individual randomisation:
sample size per arm: 10217

Then, allowing for cluster randomisation:
sample size per arm: 15688
number clusters per arm (m): 37

Note: sample size per arm required under cluster randomisation is rounded up
to a multiple of average cluster size and includes the addition
of one extra cluster per arm (to allow for t-distribution).
To understand sensitivity to these conservative allowances:
power with m clusters per arm: 0.81
power with m-1 clusters per arm: 0.80

This is very close to the 36.2 reported by Hayes and Bennett. In the trial, only 28 clusters were recruited. We can therefore use clustersampsi to evaluate the power that the trial would have had if limited to 28 clusters:

. clustersampsi, rates power r1(0.0148) r2(0.0104) m(424) k(28)
> cluster_cv(0.29) alpha(0.05)
Power calculation for a two sample comparison of rates (using normal
> approximations).

For the user specified parameters:

rate 1: 0.014800
rate 2: 0.010400
significance level: 0.05
baseline measures adjustment (correlation): 0.00
average person years per cluster: 424
number of clusters per arm: 28
cluster coefficient of variation (of outcomes): 0.29

clustersampsi estimated parameters:

Firstly, assuming individual randomisation:
power: 0.86
Then, allowing for cluster randomisation:
power: 0.69

clustersampsi estimates the power to be about 69%, again similar to that reported by Hayes and Bennett.


4 Conclusion

While cluster sample-size calculations are, for the most part, simple extensions of those required under individual randomization, specific commands in Stata for this class of problems should prove very useful. Some commands are currently available in Stata to perform these calculations, but one is very basic and requires a two-step approach, and the other is specifically designed for trials in which there is no clustering in the control arm.

The command outlined here, clustersampsi, allows not only for clustering but also for varying cluster sizes, for baseline measurements, or for adjustment for predictive covariates. It also incorporates calculations of sample sizes, power, and detectable differences. It will alert the user to infeasible designs and suggest possible options. The user can parameterize cluster heterogeneity by using either the ICC coefficient or the coefficient of variation. The dialog box for clustersampsi should allow straightforward implementation for the most common types of cluster RCTs.

When we compare the output of clustersampsi with that of sampclus, the estimates from clustersampsi tend to result in slightly higher sample sizes because it rounds up to a multiple of the average cluster size and because it adds one to the number of clusters. On the other hand, compared with the estimates from clsampsi, the estimates from clustersampsi tend to be more conservative (that is, a slightly lower estimated sample size or slightly higher estimated power) because of the noncentral F distribution used by clsampsi. These differences are more marked at the parameter boundaries (such as small proportions or few clusters).

We have used a number of approximations here. First, we have approximated the variance of proportions and rates, we have assumed normality, and we have not made continuity corrections. Continuity-corrected sample-size calculations are more conservative but are not considered optimal by everyone (Royston and Babiker 2002). More importantly, we have also approximated the variance reduction due to correlation between any baseline measurements for binary outcomes by using normality approximations. For continuous outcome measurements in RCTs, adjustment for baseline measurements will always lead to a reduction in the standard deviation by a factor that depends on the correlation between the before-and-after measurements (Robinson and Jewell 1991). For binary outcomes (as opposed to continuous outcomes), although adjustment for baseline measures will lead to an increase in power, this is not necessarily by the same factor. However, it has been shown by others to provide a good approximation (Hernandez, Steyerberg, and Habbema 2004).

5 Appendix: Formulas

The formulas follow those already published (Hemming et al. 2011; Hayes and Bennett 1999), with some minor modifications. When the heterogeneity between clusters is specified by the ICC, then the formulas in Hemming et al. (2011) are used but with the addition of one to the number of clusters in each arm to account for the t distribution rather than the normal distribution (as recommended by Hayes and Bennett [1999]). When the heterogeneity between clusters is specified by the coefficient of variation, then the formulas follow those in Hayes and Bennett (1999). The essential formulas for both methods are described below.

5.1 Formulas using the ICC

The required sample size per arm for a trial at prespecified power $1 - \beta$ to detect a prespecified difference of $d = \mu_1 - \mu_2$ is $n_I$, where

$$n_I = (\sigma_1^2 + \sigma_2^2)\left\{\frac{(z_{\alpha/2} + z_\beta)^2}{d^2}\right\}$$

Baseline adjustment or adjustment for other covariates will deflate the standard deviation by a factor we call $B = (1 - r^2)$. The formula above can be simply modified by replacing $\sigma_1^2$ with $B \times \sigma_1^2$ and similarly for $\sigma_2^2$. For clusters of average size $m$ with $\mathrm{cv}_{\mathrm{sizes}}$, the required number of clusters is $k$, where

$$k = 1 + \frac{n_I\,\mathrm{VIF}}{m} \qquad (2)$$

where the variance inflation factor (VIF) is

$$\mathrm{VIF} = 1 + \left\{\left(\mathrm{cv}_{\mathrm{sizes}}^2 + 1\right)m - 1\right\}\rho \qquad (3)$$

For clusters of equal size, this simplifies to

$$\mathrm{VIF} = 1 + (m - 1)\rho$$
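As a numerical check on (3), the design effect reported by clustersampsi in the runs shown earlier in this article ($m = 22$, $\mathrm{cv}_{\mathrm{sizes}} = 0.9$, $\rho = 0.038$) can be reproduced by hand:

$$\mathrm{VIF} = 1 + \left\{(0.9^2 + 1)\times 22 - 1\right\}\times 0.038 = 1 + 38.82 \times 0.038 \approx 2.48$$

which agrees with the design effect of 2.48 shown in that output.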

For binary variables $p_1$ and $p_2$, we approximate $\mathrm{sd}_1^2 = p_1(1 - p_1)$ and similarly for $\mathrm{sd}_2^2$. For rates $\lambda_1$ and $\lambda_2$, we approximate the variances $\mathrm{sd}_1^2 = \lambda_1$ and $\mathrm{sd}_2^2 = \lambda_2$.

The above formulas may be simply rearranged to compute power and detectable differences for mean values. For detectable differences for binary outcomes, it is necessary to solve the following quadratic to find the detectable difference $p_2$:

$$0 = a p_2^2 + b p_2 + c \qquad (4)$$

where

$$a = -1 - a_1, \qquad b = 1 + 2 a_1 p_1, \qquad c = p_1(1 - p_1) - a_1 p_1^2$$

and where

$$a_1 = \frac{(k - 1)\,m}{B \times \mathrm{VIF}\,(z_{\alpha/2} + z_\beta)^2}$$

This provides two values for p2 that correspond to increasing and decreasing values.
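A minimal Stata sketch (not the clustersampsi source code) of solving (4) is given below, using the inputs of the detectable-difference run shown earlier ($p_1 = 0.077$, $m = 22$, $k = 129$, $\rho = 0.038$, $\mathrm{cv}_{\mathrm{sizes}} = 0.9$, $\alpha = 0.05$, power 0.80, and $B = 1$, that is, no baseline adjustment). The two roots should be close to the $p_2$ values of about 0.11 and 0.05 reported by clustersampsi under cluster randomization.

* illustration only: solve quadratic (4) for the detectable p2
scalar p1  = 0.077
scalar m   = 22
scalar k   = 129
scalar rho = 0.038
scalar cv  = 0.9
scalar B   = 1
scalar z   = invnormal(1 - 0.05/2) + invnormal(0.80)   // z_alpha/2 + z_beta
scalar VIF = 1 + ((cv^2 + 1)*m - 1)*rho                // equation (3)
scalar a1  = (k - 1)*m/(B*VIF*z^2)
scalar qa  = -1 - a1
scalar qb  = 1 + 2*a1*p1
scalar qc  = p1*(1 - p1) - a1*p1^2
display as result (-qb - sqrt(qb^2 - 4*qa*qc))/(2*qa)  // increasing outcome (p2 > p1)
display as result (-qb + sqrt(qb^2 - 4*qa*qc))/(2*qa)  // decreasing outcome (p2 < p1)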


If the user is limited to a fixed number of clusters and needs to determine the number of observations per cluster, then (5) can be rearranged to give the number of observations required for each cluster. So, where the clusters are of fixed size, the number of observations per cluster is

$$m = \frac{n_I(1 - \rho)}{k - 1 - \rho n_I}$$

so that the number of clusters required to make this design feasible is greater than $\rho n_I + 1$. If the clusters are of varying size, then using the alternative VIF in (6) gives the number of observations required per cluster as

$$m = \frac{n_I(1 - \rho)}{k - 1 - \rho n_I(\mathrm{cv}_{\mathrm{sizes}}^2 + 1)}$$

and, in this case, the minimum number of clusters required to make this design feasible is $\rho(\mathrm{cv}_{\mathrm{sizes}}^2 + 1)n_I + 1$.

As well as computing the minimum number of clusters required under a design that is infeasible, clustersampsi computes the maximum power value and the minimum detectable difference available with the limited number of clusters. These values are obtained by finding the maximum value for $z_\beta$ or the minimum value for $d^2$, which would result in $k - 1 - n_I\rho(\mathrm{cv}_{\mathrm{sizes}}^2 + 1) > 0$. So for example, the maximum available power for fixed $m$ is

$$z_\beta = \sqrt{\frac{(k - 1)\,d^2}{\rho(\mathrm{cv}_{\mathrm{sizes}}^2 + 1)(\sigma_1^2 + \sigma_2^2)}} - z_{\alpha/2}$$

and the minimum detectable difference for continuous outcomes is

$$d = \sqrt{\frac{\rho(\mathrm{cv}_{\mathrm{sizes}}^2 + 1)(\sigma_1^2 + \sigma_2^2)(z_{\alpha/2} + z_\beta)^2}{k - 1}}$$

For binary outcomes, the minimum detectable difference is given by (4) except that $a_1$ is replaced by

$$a_1 = \frac{(k - 1)}{(z_{\alpha/2} + z_\beta)^2\,(B\,\mathrm{cv}_{\mathrm{sizes}}^2 + 1)\,\rho}$$

5.2 Formulas using the coefficient of variation

The required sample size per arm for a trial at prespecified power $1 - \beta$ to detect a prespecified difference of $d = \mu_1 - \mu_2$ is again $n_I$, where

$$n_I = (\sigma_1^2 + \sigma_2^2)\left\{\frac{(z_{\alpha/2} + z_\beta)^2}{d^2}\right\}$$

When each of the clusters is size $m$, the number of clusters required is $k$ so that

$$k = 1 + \frac{B n_I}{m} + B \times \mathrm{CVIF}$$


where the CVIF is

$$\mathrm{CVIF} = \frac{\mathrm{cv}_{\mathrm{clusters}}^2(\mu_1^2 + \mu_2^2)(z_{\alpha/2} + z_\beta)^2}{d^2}$$

and where $\mathrm{cv}_{\mathrm{clusters}}$ is the coefficient of variation of the outcome across the clusters.
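As a worked check against example 3 in section 3.3 ($\lambda_1 = 0.0148$, $\lambda_2 = 0.0104$, $m = 424$ person-years, $\mathrm{cv}_{\mathrm{clusters}} = 0.29$, 80% power, 5% significance, $B = 1$), the two formulas above give

$$n_I = (0.0148 + 0.0104)\,\frac{(1.960 + 0.842)^2}{0.0044^2} \approx 10217, \qquad \mathrm{CVIF} = \frac{0.29^2\,(0.0148^2 + 0.0104^2)(1.960 + 0.842)^2}{0.0044^2} \approx 11.2$$

so that $k \approx 1 + 10217/424 + 11.2 \approx 36.3$, in line with the 36.2 reported by Hayes and Bennett (1999) and rounded up to the 37 clusters per arm returned by clustersampsi.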

Power and detectable difference are simply obtained by rearranging the above formulas and solving the resulting quadratic where necessary. For proportions, this amounts to solving

$$0 = a p_2^2 + b p_2 + c$$

where

$$a = \mathrm{cv}_{\mathrm{clusters}}^2 - a_2 - \frac{1}{m}, \qquad b = \frac{1}{m} + 2 a_2 p_1, \qquad c = \frac{p_1}{m} - \frac{p_1^2}{m} + \mathrm{cv}_{\mathrm{clusters}}^2 p_1^2 - a_2 p_1^2$$

and where

$$a_2 = \frac{k - 1}{B(z_{\alpha/2} + z_\beta)^2}$$

For continuous outcomes, this is such that

$$0 = a \mu_2^2 + b \mu_2 + c$$

$$a = \mathrm{cv}^2 - a_2, \qquad b = 2 a_2 \mu_1, \qquad c = \frac{(\sigma_1^2 + \sigma_2^2)}{m} - a_2 \mu_1^2 + \mathrm{cv}^2 \mu_1^2$$

where a2 is as in the binary case above.

Again, if the user is limited to a prespecified number of clusters, then it is possible to determine the required average cluster size:

$$m = \frac{n_I}{k - 1 - \mathrm{CVIF}}$$

Certain designs will be infeasible; for a feasible design, the number of clusters required is greater than $\mathrm{CVIF} + 1$. Alternatively, limited to this number of clusters, the design will become feasible on either lowering the power or increasing the difference to be detected. The maximum available power and minimum detectable difference are obtained by determining the maximum value for $z_\beta$ or minimum value for $d^2$, which results in $k - 1 - \mathrm{CVIF} > 0$.

The maximum available power for both continuous and binary outcomes is

$$z_\beta = \sqrt{\frac{(k - 1)\,d^2}{B\,\mathrm{cv}_{\mathrm{clusters}}^2(\mu_1^2 + \mu_2^2)}} - z_{\alpha/2}$$


The minimum detectable difference for both continuous and binary outcomes again involves solving a quadratic whose coefficients are

$$a = 1 - a_3, \qquad b = 2 a_3 \mu_1, \qquad c = \mu_1^2 - a_3 \mu_1^2$$

and where

$$a_3 = \frac{(k - 1)}{B(z_{\alpha/2} + z_\beta)^2\,\mathrm{cv}_{\mathrm{clusters}}^2}$$

All functions use ceiling values throughout, so for example, if the number of clusters is estimated to be 7.1, this will be rounded up to 8.

clustersampsi will not give identical results to sampsi for the sample size under individual randomization with binary data (hence, any cluster sample sizes calculated via a two-step approach from results of sampsi will not tally with results from clustersampsi). This is due to an approximation in the case of equal allocation to treatment group: sampsi uses no approximation (equation 3.2 in Machin et al. [1997]) but clustersampsi does (equation 3.8 in Machin et al. [1997]). Practically speaking, the difference in sample sizes is only large (more than 10% of the exact sample size required) where small sample sizes (fewer than about 50) are called for. In such situations, the more pressing issue is the use of a cluster design with small samples rather than the precise size of said sample. Power will also differ for comparisons of proportions because of the use of this approximation. Generally, this difference is negligible but may be of concern when looking for particularly large effects.

6 Funding acknowledgment

Karla Hemming was partially funded by a National Institute for Health Research (NIHR) grant for Collaborations for Leadership in Applied Health Research and Care (CLAHRC) for the duration of this work. The views expressed in this publication are not necessarily those of the NIHR or the Department of Health.

7 References

Armitage, P., G. Berry, and J. N. S. Matthews. 2002. Statistical Methods in Medical Research. 4th ed. Oxford: Blackwell.

Batistatou, E., and C. Roberts. 2010. clsampsi Stata command. http://www.medicine.manchester.ac.uk/healthmethodology/research/biostatistics/data/clsampsi/.

Donner, A., and N. Klar. 2000. Design and Analysis of Cluster Randomization Trials in Health Research. London: Arnold.


Eldridge, S. M., D. Ashby, and S. Kerry. 2006. Sample size for cluster randomized trials: Effect of coefficient of variation of cluster size and analysis method. International Journal of Epidemiology 35: 1292–1300.

Garrett, J. M. 2001. sxd4: Sample size estimation for cluster designed samples. Stata Technical Bulletin 60: 41–45. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 387–393. College Station, TX: Stata Press.

Guittet, L., B. Giraudeau, and P. Ravaud. 2005. A priori postulated and real power in cluster randomized trials: Mind the gap. BMC Medical Research Methodology 5: 25.

Hayes, R. J., and S. Bennett. 1999. Simple sample size calculation for cluster-randomized trials. International Journal of Epidemiology 28: 319–326.

Hemming, K., A. J. Girling, A. J. Sitch, J. Marsh, and R. J. Lilford. 2011. Sample size calculations for cluster randomised controlled trials with a fixed number of clusters. BMC Medical Research Methodology 11: 102.

Hernandez, A. V., E. W. Steyerberg, and J. D. Habbema. 2004. Covariate adjustment in randomized controlled trials with dichotomous outcomes increases statistical power and reduces sample size requirements. Journal of Clinical Epidemiology 57: 454–460.

MacArthur, C., K. Jolly, L. Ingram, N. Freemantle, C. L. Dennis, R. Hamburger, J. Brown, J. Chambers, and K. Khan. 2009. Antenatal peer support workers and initiation of breast feeding: Cluster randomised controlled trial. British Medical Journal 338: b131.

MacArthur, C., H. R. Winter, D. E. Bick, R. J. Lilford, R. J. Lancashire, H. Knowles, D. A. Braunholtz, C. Henderson, C. Belfield, and H. Gee. 2003. Redesigning postnatal care: A randomised controlled trial of protocol-based midwifery-led care focused on individual women's physical and psychological health needs. Health Technology Assessment 7: 1–98.

Machin, D., M. J. Campbell, P. M. Fayers, and A. Pinol. 1997. Sample Size Tables for Clinical Studies. 2nd ed. Oxford: Blackwell Science.

Murray, D. M. 1998. Design and Analysis of Group-Randomized Trials. New York: Oxford University Press.

Pourshams, A., H. Khademi, A. F. Malekshah, F. Islami, M. Nouraei, A. R. Sadjadi, E. Jafari, N. Rakhshani, R. Salahi, S. Semnani, F. Kamangar, C. C. Abnet, B. Ponder, N. Day, S. M. Dawsey, P. Boffetta, and R. Malekzadeh. 2010. Cohort Profile: The Golestan Cohort Study—A prospective study of oesophageal cancer in northern Iran. International Journal of Epidemiology 39: 52–59.

Robinson, L. D., and N. P. Jewell. 1991. Some surprising results about covariate adjustment in logistic regression models. International Statistical Review 58: 227–240.


Royston, P., and A. Babiker. 2002. A menu-driven facility for complex sample size calculation in randomized controlled trials with a survival or a binary outcome. Stata Journal 2: 151–163.

About the authors

Karla Hemming and Jen Marsh are both lecturers at the University of Birmingham in the Department of Public Health, Epidemiology and Biostatistics.


The Stata Journal (2013) 13, Number 1, pp. 136–140

Estimating Geweke's (1982) measure of instantaneous feedback

Mehmet F. Dicle
Loyola University New Orleans
New Orleans, LA
[email protected]

John Levendis
Loyola University New Orleans
New Orleans, LA
[email protected]

Abstract. In this article, we describe the gwke82 command, which implements a measure of instantaneous feedback for two time series following Geweke (1982, Journal of the American Statistical Association 77: 304–313).

Keywords: st0287, gwke82, Geweke, Granger, causality, vector autoregression,time series, instantaneous feedback

1 Introduction

Tests of statistical causality are really tests of whether lags of one variable can be used to predict current values of another variable. If data are measured frequently, and the causality is not instantaneous, then this is adequate. However, if data are measured infrequently (for example, most macroeconomic data are measured yearly), then standard Granger (1969) causality tests may miss much of the contemporaneous correlation between variables. Geweke (1982) proposed a measure of instantaneous correlation, calculated from the residuals of standard Granger causality tests, that captures "instantaneous feedback". Our command, gwke82, quickly estimates such Geweke-type instantaneous feedback between pairs of variables.

Granger-type causality is quite common in financial economics research as well as other disciplines. Geweke-type causality, however, has not received this kind of research attention.1 We think that this is due to the lack of econometric software coverage of Geweke-type causality. Our suggested command, gwke82, is intended to fill this void.

1. Interestingly, Geweke-type causality has received attention from neuroscience; for example, see Zhang et al. (2010).

© 2013 StataCorp LP st0287


While the Geweke (1982) method applies to any vector-valued linear function, the intuition behind the technique can be more easily seen for a system of two random variables that can be estimated using the standard vector autoregression methodology:2

$$x_t = \sum_{s=1}^{p} E_{1s} x_{t-s} + u_{1t}, \qquad \mathrm{Var}(u_{1t}) = \Sigma_1 \qquad (1)$$

$$y_t = \sum_{s=1}^{p} G_{1s} y_{t-s} + v_{1t}, \qquad \mathrm{Var}(v_{1t}) = T_1 \qquad (2)$$

$$x_t = \sum_{s=1}^{p} E_{2s} x_{t-s} + \sum_{s=1}^{p} F_{2s} y_{t-s} + u_{2t}, \qquad \mathrm{Var}(u_{2t}) = \Sigma_2 \qquad (3)$$

$$y_t = \sum_{s=1}^{p} G_{2s} y_{t-s} + \sum_{s=1}^{p} H_{2s} x_{t-s} + v_{2t}, \qquad \mathrm{Var}(v_{2t}) = T_2 \qquad (4)$$

$$x_t = \sum_{s=1}^{p} E_{3s} x_{t-s} + \sum_{s=0}^{p} F_{3s} y_{t-s} + u_{3t}, \qquad \mathrm{Var}(u_{3t}) = \Sigma_3 \qquad (5)$$

$$y_t = \sum_{s=1}^{p} G_{3s} y_{t-s} + \sum_{s=0}^{p} H_{3s} x_{t-s} + v_{3t}, \qquad \mathrm{Var}(v_{3t}) = T_3 \qquad (6)$$

If all the coefficients for the lags of x ($H_2$) are statistically significant, then it is said that "x Granger-causes y". Such estimation, however, potentially leaves a lot of correlation between x and y unexploited. Specifically, if $y_t$ is correlated with $x_t$ after controlling for their lags, then there is instantaneous correlation left between them. This is the basis of the Geweke (1982) measure of instantaneous feedback. Geweke proposed that the variance–covariance matrix of residuals from the vector autoregression estimation be used to estimate the linear feedback between y to x and x to y, and the instantaneous linear feedback between x and y.

If x does not Granger-cause y, then (4) can be rewritten as (2). If y does not Granger-cause x, then (3) can be rewritten as (1). Comparing (1) and (3), then, gives us an estimate of the impact of y on x. Specifically, Geweke (1982) proposed the following as measures of linear feedback:

$$n \times F_{X \to Y} = n \times \ln(T_1/T_2) \sim \chi^2_p$$
$$n \times F_{Y \to X} = n \times \ln(\Sigma_1/\Sigma_2) \sim \chi^2_p$$
$$n \times F_{X \times Y} = n \times \ln(T_2 \times \Sigma_2/|\Upsilon|) \qquad (7)$$
$$\phantom{n \times F_{X \times Y}} = n \times \ln(\Sigma_2/\Sigma_3) \qquad (8)$$
$$\phantom{n \times F_{X \times Y}} = n \times \ln(T_2/T_3) \sim \chi^2_1 \qquad (9)$$
$$n \times F_{X,Y} = n \times \ln(\Sigma_1 \times T_1/|\Upsilon|) \sim \chi^2_{(2p+1)}$$

2. We follow the notations in Geweke (1982).


where $|\Upsilon| = \begin{vmatrix}\Sigma_2 & C \\ C & T_2\end{vmatrix}$ and $C = \mathrm{cov}(u_{2t}, v_{2t})$. $F_{X \to Y}$ and $F_{Y \to X}$ are the Granger-type causation F statistics. n is the number of observations for the unrestricted estimations. $F_{X \times Y}$ is the measure of instantaneous causation (instantaneous feedback). $F_{X,Y}$ is the measure of total feedback between x and y. $F_{X,Y}$ also equals $F_{X \to Y} + F_{Y \to X} + F_{X \times Y}$. Geweke showed that the measures above are asymptotically distributed as F distributions. He also proved that (7), (8), and (9) are equal, implying that instantaneous causality can be verified directly by comparing (5) with (3) and (6) with (4), or indirectly by using the variance–covariance matrix of (3) and (4). Computationally, the gwke82 command uses (7) for instantaneous causality.

Geweke generalized the above results to include vector-valued functions (so that the measures are asymptotically chi-squared) and allowed for more than two endogenous variables and also for the inclusion of exogenous variables. The gwke82 command does not allow for vector-valued functions, but it does allow for exogenous variables and more than two endogenous variables.

2 The gwke82 command

2.1 Syntax

gwke82 varlist [if] [, m(integer) exog(varlist) detail]

2.2 Options

m(integer) specifies the number of lags for both variables to be included within all estimations. The default is m(2).

exog(varlist) specifies the exogenous variables to be included within all estimations.

detail stores all estimation results.


3 Computing Geweke's measure of instantaneous causality

. use http://www.stata-press.com/data/r12/lutkepohl2
(Quarterly SA West German macro data, Bil DM, from Lutkepohl 1993 Table E.1)

. tsset
        time variable:  qtr, 1960q1 to 1982q4
                delta:  1 quarter

. gwke82 dln_inv dln_inc dln_consump if qtr<=tq(1978q4), detail

Granger Causation                    Chi2   df   P-value

dln_inv    ->  dln_inc             3.7061    2    0.1568
dln_inv    ->  dln_cons~p          2.0599    2    0.3570
dln_inc    ->  dln_inv             0.1042    2    0.9492
dln_inc    ->  dln_cons~p         12.1271    2    0.0023
dln_cons~p ->  dln_inv             3.1568    2    0.2063
dln_cons~p ->  dln_inc             3.6042    2    0.1650

Instantaneous feedback               Chi2   df   P-value

dln_inv    <-> dln_inc             1.2561    1    0.2624
dln_inv    <-> dln_cons~p          5.9162    1    0.0150
dln_inc    <-> dln_cons~p         26.1724    1    0.0000

Total correlation                    Chi2   df   P-value

dln_inv    ,   dln_inc             5.0664    5    0.4078
dln_inv    ,   dln_cons~p         11.1330    5    0.0488
dln_inc    ,   dln_cons~p         41.9037    5    0.0000

The estimation reveals that dln_inc Granger-causes dln_cons~p. There is evidence of instantaneous feedback between dln_inv and dln_cons~p, and between dln_inc and dln_cons~p. The total correlation between dln_inc and dln_cons~p is statistically significant.

The causality statistics and corresponding p-values reported after the gwke82 command are also saved as return matrices.

. return list

matrices:
             r(total_P) :  3 x 3
               r(total) :  3 x 3
            r(geweke_P) :  3 x 3
              r(geweke) :  3 x 3
           r(granger_P) :  3 x 3
             r(granger) :  3 x 3

The detail option stores all estimation results. These stored estimations can be accessed with the estimates dir command.


. estimates dir

          name   command      depvar         npar  title
  unrestrict~2   sureg        mult. depvar     14
  restricted_1   regress      dln_inv           5
  restricted_2   regress      dln_inc           5
  unrestrict~3   sureg        mult. depvar     14
  restricted_3   regress      dln_consump       5
  unrestrict~3   sureg        mult. depvar     14

. estimates restore unrestricted_1_2
(results unrestricted_1_2 are active now)

. ereturn list

scalars:
                     e(ic) =  0
                   e(k_eq) =  2
               e(dfk2_adj) =  73
                     e(ll) =  349.9775201862861
                   e(rank) =  14
  (output omitted )
                      e(k) =  14
                      e(N) =  73

macros:
             e(properties) : "b V"
                e(eqnames) : "dln_inv dln_inc"
                 e(depvar) : "dln_inv dln_inc"
  (output omitted )
                    e(cmd) : "sureg"
       e(_estimates_name) : "unrestricted_1_2"

matrices:
                      e(b) :  1 x 14
                      e(V) :  14 x 14
                  e(Sigma) :  2 x 2
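As a hedged cross-check tying these stored results back to (7): e(Sigma) from the restored sureg fit is the 2 x 2 cross-equation residual covariance matrix, so the instantaneous-feedback statistic for this pair can be recomputed directly from it. Up to estimation conventions, the displayed value should be close to the 1.2561 reported for dln_inv <-> dln_inc above.

matrix S = e(Sigma)
* n x ln(Sigma2*T2/|Upsilon|) using the residual covariance matrix from sureg
display as result e(N)*ln(S[1,1]*S[2,2]/det(S))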

4 References

Geweke, J. 1982. Measurement of linear dependence and feedback between multiple time series. Journal of the American Statistical Association 77: 304–313.

Granger, C. W. J. 1969. Investigating causal relations by econometric models and cross-spectral methods. Econometrica 37: 424–438.

Zhang, L., G. Zhong, Y. Wu, M. G. Vangel, B. Jiang, and J. Kong. 2010. Using Granger–Geweke causality model to evaluate the effective connectivity of primary motor cortex, supplementary motor area and cerebellum. Journal of Biomedical Science and Engineering 3: 848–860.

About the authors

Mehmet F. Dicle is an assistant professor of finance at Loyola University New Orleans.

John Levendis is an associate professor of economics at Loyola University New Orleans.


The Stata Journal (2013) 13, Number 1, pp. 141–164

Two-stage nonparametric bootstrap sampling with shrinkage correction for clustered data

Edmond S.-W. Ng
Department of Health Services Research and Policy
London School of Hygiene and Tropical Medicine
London, UK
[email protected]

Richard Grieve
Department of Health Services Research and Policy
London School of Hygiene and Tropical Medicine
London, UK

James R. Carpenter
Department of Medical Statistics
London School of Hygiene and Tropical Medicine
London, UK

Abstract. This article describes a new Stata command, tsb, for performing a stratified two-stage nonparametric bootstrap resampling procedure for clustered data. Estimates for uncertainty around the point estimate, such as standard error and confidence intervals, are derived from the resultant bootstrap samples. A shrinkage estimator proposed for correcting possible overestimation due to second-stage sampling is implemented as default. Although this command is written with cost effectiveness analyses alongside cluster trials in mind, it is applicable to the analysis of continuous endpoints in cluster trials more generally. The use of this command is exemplified with a case study of a cost effectiveness analysis undertaken alongside a cluster randomized trial. We also report bootstrap confidence interval coverage by using data from a published simulation study.

Keywords: st0288, tsb, tsbceprob, two-stage nonparametric bootstrap, shrinkagecorrection, clustered data, cost effectiveness, health economics

1 Introduction

The bootstrap method can be used for estimating uncertainty around the point estimate of a statistic of interest (Efron and Tibshirani 1993; Davison and Hinkley 1997). It provides an alternative to statistical methods that rely on normality when such an assumption is implausible and transformation of the original data (to approximate normality) is either problematic or undesirable. A prime example is cost data, which tend to be right skewed. Here models that assume normality can provide inefficient estimates of the mean cost. One approach is to transform costs (for example, by log transformation), but simply back transforming the resultant estimates does not provide the estimate of interest, the effect of treatment on the arithmetic mean cost (Manning 1998; Briggs et al. 2005; Faddy, Graves, and Pettitt 2009). Nonparametric bootstrap methods are attractive in this context because they avoid making distributional assumptions.

© 2013 StataCorp LP st0288

Many health economic evaluations are undertaken together with cluster randomized trials (CRTs), for example, because the intervention (for instance, vaccination programs) is delivered at the level of the group or cluster (Sullivan et al. 2005; Colvin et al. 2006; Wolters et al. 2006; Bachmann et al. 2007; Gomes et al. 2012a). Here the unit of randomization is the cluster (for example, school or hospital) rather than the individual (pupil or patient). Clusters are randomized to one of a number of alternative interventions; individuals within the same cluster all receive the same intervention. Such study design also helps in minimizing the chance of treatment contamination and helps in those cases where individual randomization may not be feasible or may be considered unethical (Donner and Klar 2000). However, the potentially dependent nature of the CRT data may violate the independently and identically distributed assumption on which many standard statistical methods, including the bootstrap, rely (Flynn and Peters 2004; Nixon, Wonderling, and Grieve 2010).

Davison and Hinkley (1997) proposed an extension to the standard bootstrap procedure that recognizes clustering by resampling clusters and then individuals within clusters. We shall refer to this algorithm as the two-stage bootstrap (TSB). This bootstrap algorithm naturally extends to allowing for any correlation between two or more endpoints (such as costs and health outcomes). While this TSB has been applied previously (Bachmann et al. 2007; Flynn and Peters 2005), there are no routinely available commands for implementing the routine. Further, previous studies do not appear to have followed Davison and Hinkley's (1997) original suggestion of using a shrinkage correction for correcting the potential overestimation of variance due to resampling at the second stage, and the original algorithm was only proposed for studies with equal numbers per cluster (balanced designs). The aim of this article is to provide Stata commands for implementing the TSB and to extend the original algorithm to CRTs with unequal numbers per cluster. The commands are of central relevance to cost effectiveness analyses together with CRTs, but the flexibility of the package allows it to extend more generally to CRTs with other continuous endpoints.

Section 2 gives an overview of the extended bootstrap resampling for clustered data with and without a shrinkage correction. Section 3 describes the new suite of commands and their options. Section 4 illustrates the TSB commands by applying them to a cost effectiveness analysis alongside a CRT. Section 5 reports bootstrap confidence interval coverage of TSB by using data from a previously published simulation study. Finally, we finish with a discussion in section 6.


2 TSB resampling and shrinkage correction

2.1 The standard nonparametric bootstrap

In brief, the standard nonparametric bootstrap approach assumes the observed data are a sample, $s$, drawn from a population with distribution $F$. A statistic of interest, $R$, estimated from the observed sample, is given by $\widehat{R}$. Bootstrap samples are generated by resampling with replacement, from the observed sample, $B$ times, resulting in $B$ bootstrap samples. For each resample, the statistic of interest is calculated and denoted by $R^*_b$ for $b = 1$ to $B$.1 The bootstrap estimates $R(s)^*_1, \ldots, R(s)^*_B$ provide an empirical distribution of the statistic of interest that can be used to approximate the sampling distribution of the statistic. Measures of uncertainty around the point estimate, such as standard error and confidence intervals (CIs), are then constructed from this empirical distribution. See Efron and Tibshirani (1993) for a comprehensive discussion of bootstrap methods.

For data collected from clusters such as schools, workplaces, and general practices, the independently and identically distributed assumption required for many standard statistical methods, including the bootstrap, may not hold (Liu 1988). The dependent nature of the data can be accounted for by extending the resampling strategy to mimic the way in which the data are sampled from the population so as to preserve the structure of the original data. We now consider alternative ways of modifying the bootstrap routine to address this.

2.2 The “cluster bootstrap”

One way that the bootstrap routine can be modified to recognize the clustering is to resample clusters rather than individuals (Davison and Hinkley 1997). The routine is otherwise as for the standard bootstrap and is readily available to Stata users through the bootstrap command with the cluster() option. However, the cluster bootstrap approach has been found to perform poorly with CI coverage levels below the nominal level (Flynn and Peters 2005).
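As a hedged one-line illustration (not part of the tsb package), the cluster bootstrap of a mean cost could be run with official Stata commands; the variable names cost and cluster anticipate the PoNDER case study used in section 4.

* resample whole clusters with replacement and bootstrap the mean cost
bootstrap mean_cost=r(mean), reps(1000) seed(101) cluster(cluster): summarize cost, meanonly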

2.3 The TSB for clustered and correlated data, with particular application to cost effectiveness analysis

A different strategy proposed by Davison and Hinkley (1997) requires the resampling to be performed in two stages (both with replacement). Here clusters are resampled in the first stage, and individuals within the chosen clusters are resampled in the second stage (with replacement in both stages) to obtain a bootstrap sample. The statistic of interest is calculated for the bootstrap sample. This process is repeated B times to form the empirical distribution of the statistic of interest. In the TSB, any correlation

1. General recommendations suggest that at least 1,000 replicates are required for bootstrap confidence intervals (Davison and Hinkley 1997; Campbell and Torgerson 1999; Nixon, Wonderling, and Grieve 2010).


between the endpoints can be recognized by resampling jointly those variables that are required for calculating the statistic of interest.

In health economic evaluations, a common statistic of interest is the incremental net monetary benefit (INB), which reports the relative value for money of alternative health care programs (Stinnett and Mullahy 1998). The INB is calculated by estimating the difference between the treatment alternatives in the mean health outcomes, valuing this by the threshold willingness to pay (WTP) for a unit of health gain, and subtracting from this the incremental costs. Methods guidance recommends that measures of uncertainty for statistics such as the INB should recognize the potential correlation between the endpoints, cost and health outcomes (Willan and Briggs 2006).2

The statistical uncertainty in the estimated cost effectiveness can be reported by estimating 95% CIs around the INB. Hence, under the TSB approach, the INB can be calculated in each bootstrap replicate, and the 95% CIs estimated from the resultant empirical distribution. Another recommended metric for summarizing the statistical uncertainty surrounding the cost-effectiveness (CE) measure is the cost-effectiveness acceptability curve (CEAC) (Van Hout et al. 1994). The CEAC presents the probability that an intervention is cost effective, given the data, at alternative threshold levels of WTP for a unit of health outcome, λ. See the description section of the new command tsbceprob for further details of the estimation of CEACs with TSB.

Either measure can allow for the correlation between costs and health outcomes within the TSB routine by resampling pairs of cost and effect taken from individuals at the second stage. In the new command tsb, we implement joint resampling of endpoint variables from the original data. Where stratified sampling is appropriate (for example, data from a CRT comparing two interventions), resampling is performed independently within each stratum.

2.4 TSB with shrinkage correction

Davison and Hinkley (1997) noted that unless the number of clusters and individuals per cluster are both large, this method may overestimate the variance due to resampling at the second stage. Resampling at the second stage is likely to double count the within-cluster variance because the estimates of the cluster means resampled from stage 1 already incorporate both within- and between-cluster variability (Davison and Hinkley 1997; Flynn and Peters 2004; Gomes et al. 2012b). Davison and Hinkley (1997) therefore proposed a shrinkage correction to avoid overestimating the variance. This approach differs from that described above in that rather than resampling clusters and then individuals within the chosen clusters, the algorithm resamples estimates for two distributions directly, namely, $F_x$ for cluster means and $F_z$ for individual deviations or residuals from the cluster means, denoted as $z_{ij}$, for the ith individual in cluster j.

2. $\mathrm{INB} = \lambda \times \Delta_e - \Delta_c$, where $\Delta_c$ and $\Delta_e$ are the incremental cost and health outcome ($\Delta_e = e_1 - e_0$ and $\Delta_c = c_1 - c_0$, with the intervention arm denoted by subscript 1 and 0 for control), and $\lambda$ the maximum WTP per unit of health outcomes. The variance of the INB is given by $V(\mathrm{INB}) = \lambda^2 V(\Delta_e) + V(\Delta_c) - 2\lambda\,\mathrm{cov}(\Delta_e, \Delta_c)$.


Details of this approach are elaborated in the algorithm below. It includes modifications to the original proposal to allow for unbalanced cluster sizes, which makes the procedure more applicable to real data.

2.5 Algorithm for TSB with shrinkage correction:

1. For k = 1 (k = 1, ..., K intervention arms; for simplicity's sake, subscript k is omitted in the following steps).

2. Calculate shrunken cluster means, $x_j = c\bar{y}_{..} + (1 - c)\bar{y}_j$, for $j = 1, \ldots, N_c$, where $N_c$ is the number of clusters in the stratum. See footnote 3 for derivation of the constant c.

3. Calculate the standardized individual-level residuals (from the estimated cluster means), $z_{ij} = (y_{ij} - \bar{y}_j)/\sqrt{1 - n_*^{-1}}$, where $n_*$ is the cluster size. For unbalanced cluster sizes, $n_*$ is replaced by different measures of "average" cluster size in tsb (see options for tsb).

4. Randomly sample (with replacement) from $x_1, \ldots, x_{N_c}$ to obtain $x^*_{j'}$ for $j' = 1, \ldots, N_c$. $j'$ is a new index for the chosen clusters.

5. Randomly sample (with replacement) from $z_{11}, \ldots, z_{n_{N_c} N_c}$ to obtain $z^*_{i'j'}$ for $i' = 1, \ldots, n_{j'}$, where $n_{j'}$ is the size of the jth cluster chosen in step 4. The index $n_{N_c} N_c$ denotes the last individual in the last cluster in the original data. The number of units to be sampled in this step is dependent on the sum of the sizes of the chosen clusters in step 4. Note also that the cluster membership of the original sample is ignored here.

6. Reconstruct the sample by creating a "synthetic" sample by $y^*_{ij} = x^*_{j'} + z^*_{i'j'}$.

7. Repeat steps 2 to 6 for the remaining K − 1 intervention arms; then stack the K synthetic samples to form the bth bootstrap sample.

8. Calculate the statistic of interest R for the bth bootstrap sample to obtain $R_b(y^*_{ij})$.

9. Repeat steps 1 to 8 B times to obtain the bootstrap estimates, $R_b(y^*_{ij})$, for $b = 1, \ldots, B$.

3. The constant c is given by $(1 - c)^2 = \{N_{ck}/(N_{ck} - 1)\} - \mathrm{SS}_w/\{n_j(n_j - 1)\mathrm{SS}_b\}$ or set to 1 if the right-hand side of the expression is negative. $\mathrm{SS}_w = \sum_{j=1}^{N_c}\sum_{i=1}^{n_j}(y_{ij} - \bar{y}_j)^2$, $\mathrm{SS}_b = \sum_{j=1}^{N_c}(\bar{y}_j - \bar{y}_{..})^2$, and the grand mean is given by $\bar{y}_{..} = (\sum_{j=1}^{N_c} n_j)^{-1}\sum_{j=1}^{N_c}\sum_{i=1}^{n_j} y_{ij}$.


The distribution of the $R_b(y^*_{ij})$'s approximates the empirical distribution of R and can be used for estimating the bootstrap standard error and confidence intervals. Where two or more variables from the original data are required to calculate the statistic of interest, the above resampling is performed jointly for the variables involved. For health economic evaluations, individual costs and effects are resampled jointly: $x_j$ and $z_{ij}$ become matrices of dimensions $(1 \times m)$ and $(n_j \times m)$, where m is the number of variables or endpoints (needed to calculate the statistic of interest), and $n_j$ is the number of individuals in cluster j. For example, m = 2 for individual costs and health outcomes for calculating the INB. The algorithm above assumes that stratification between the treatment arms is required, but if that is not the case, then steps 1 and 7 should be omitted.

The original proposal assumed constant cluster size, $n_*$, across clusters. In our implementation of TSB with shrinkage correction, it has been generalized to allow for unequal cluster sizes by replacing $n_*$ by an "average" cluster size (within the stratum) in step 3 and by acknowledging the variable stratum sizes in step 5, where the second stage resampling takes place. $n_*$ is replaced by three different measures of "average" cluster size in tsb: dk (see page 9 of Donner and Klar [2000]), median, and mean. See the next section on the tsb command for explanation.

3 Commands

3.1 tsb

Syntax

tsb varlist [if] [in], stats(&f()) cluster(varname) [strata(varname) reps(#) seed(#) unbal(string) lambda(real) noshrink level(#) nodots]

Description

tsb performs a two-stage bootstrap sampling procedure on a user-supplied statistic for clustered data. It is implemented as an ado-file, which serves as a wrapper function for invoking a number of Mata functions, to perform TSB with or without shrinkage correction. The uncertainty around the point estimate calculated from the original sample is quantified by the bootstrap standard error and confidence intervals. These estimates are given by applying the Stata programmer's command bstat on the bootstrap replicates of the statistic of interest. A detailed discussion of the derivation of bootstrap confidence intervals is beyond the remit of this article. Readers are referred to, for example, Carpenter and Bithell (2000) for a comprehensive review.

Among the nonparametric confidence intervals reported by bstat and bootstrap, the bias-corrected and accelerated (BCa) confidence interval requires the estimation of an acceleration parameter that adjusts for the skewness in the sampling distribution


(Briggs, Wonderling, and Mooney 1997; Carpenter and Bithell 2000). In our Stata implementation of TSB, the acceleration parameter is calculated by using a stand-alone Mata function before the result is passed to bstat for constructing the BCa confidence interval.4

Options

stats(&f()) specifies a user-supplied Mata function, f(), for calculating the statistic for bootstrapping. See appendix for examples of such functions. Note that the ampersand (&) before f() is required as part of the syntax. stats() is required.

cluster(varname) specifies the variable that identifies clusters. cluster() is required.

strata(varname) specifies the variable that identifies strata. For an example of a cluster randomized trial, the strata would be levels of a cluster-level treatment or intervention variable. The default is strata(constant).

reps(#) specifies the number of bootstrap replications to be performed. For estimation of the confidence interval, 1,000 or more replications are generally recommended (Efron and Tibshirani 1993; Davison and Hinkley 1997). The default is reps(1000).

seed(#) specifies the random number seed.

unbal(string) specifies the "average" cluster size, $n_*$. string can be dk (page 9 of Donner and Klar [2000]), median, or mean. The default is unbal(dk). These are calculated independently for each stratum specified in strata. For the option unbal(dk), $n_* = \bar{n}_. - \sum_{j=1}^{N_c}(n_j - \bar{n}_.)^2/\{(N_c - 1)M\}$, where $\bar{n}_. = M/N_c$ (arithmetic mean cluster size) and $M = \sum_{j=1}^{N_c} n_j$ (total number of individuals in the $N_c$ clusters). A short computational sketch of this measure is given after the list of options.

lambda(real) is relevant for cost-effectiveness analysis (CEA) and is the threshold WTP for a unit of health outcome; the user specifies an optional value, real, that can be called from within the user-supplied function f(), if required.

noshrink specifies that the two-stage bootstrap resampling is performed without shrinkage correction. If this option is chosen, instead of cluster means, whole clusters are resampled with replacement in stage 1. In stage 2, individuals within the chosen clusters are then resampled also with replacement. Cluster membership in the original data is respected in this case.

4. Our early attempts of implementing TSB through bootstrap with the bca option suggested that second-stage resampling caused instability in the estimation and on occasions caused Stata to shut down. The instability was due partly to the jackknife sampling procedure used for estimating the acceleration parameter in bootstrap (Stata Technical Support 2010). As a result, we decided to write our own computer code for implementing TSB.


level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is level(95) or as set by set level.

nodots suppresses display of the replication dots. One dot character is displayed for each successful replication.
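For concreteness, the dk measure described under unbal() above could be computed by hand as in the following sketch (an illustration following the formula as reproduced in that option, not the tsb source code; it assumes a cluster identifier named cluster and works on the data in memory, so it would be run separately within each stratum).

preserve
contract cluster, freq(nj)                  // one row per cluster, with its size nj
quietly summarize nj
local Nc   = r(N)                           // number of clusters
local M    = r(sum)                         // total number of individuals
local nbar = `M'/`Nc'                       // arithmetic mean cluster size
generate double dev2 = (nj - `nbar')^2
quietly summarize dev2
display as result `nbar' - r(sum)/((`Nc' - 1)*`M')
restore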

Saved results

In addition to the standard output given by bstat shown in the Results window, a number of useful results are stored in Stata matrices e() and r() postestimation. The statistic of interest calculated from the original sample is stored in e(b). The bootstrap replicates of the statistic of interest are stored in r(tsb_sam) (with its mean stored in e(b_bs)).

3.2 tsbceprob

Syntax

tsbceprob varlist [if] [in], stats(&nb()) cluster(varname) strata(varname) lambda(real) [reps(#) seed(#) unbal(string) noshrink level(#) nodots]

Description

tsbceprob is designed for calculating CE probabilities and has to be used in conjunction with the Mata function nb(). nb() calculates the net monetary benefit (NMB) for each level of the cluster-level interventions (or comparators) defined by the strata variable. $\mathrm{NMB}_k$ is calculated as $\lambda e_k - c_k$, where $e_k$ is the arithmetic mean of health outcomes (second variable in varlist), $c_k$ the mean of costs (first variable in varlist) for the kth comparator, and $\lambda$ is the WTP threshold.

For a two-way comparison, the intervention is defined as the most cost effective if it has a positive INB versus the comparator. For an n-way comparison, the most cost-effective alternative is that with the highest NMB, where the NMB is calculated for each comparator by valuing the absolute level of health outcomes by λ. CEACs can be calculated by estimating the NMB in each bootstrap replicate and reporting the probability that each intervention is the most cost effective as the proportion of replicates in which each intervention has a positive INB versus the comparator (two-way comparison) or the highest NMB (n-way comparison).

In tsbceprob, when exactly the same NMB value is calculated for two or more comparators in the same replicate (that is, resulting in ties), these comparators are considered equally cost effective. For example, in a replicate where the first of three comparators yielded the highest NMB, this would result in a row vector of (1, 0, 0) for the replicate; if the second and third comparators yielded the same highest NMB, the


indicator vector would become (0, 0.5, 0.5). The same principle applies for two- or n-way comparison. The CE probabilities are then estimated by the column means of a matrix consisting of the row vectors for all replicates.
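To make the comparison concrete, the following sketch uses purely hypothetical mean costs and effects (made-up numbers, not PoNDER results) for three comparators at $\lambda = 20{,}000$: the second comparator has the largest $\mathrm{NMB}_k = \lambda e_k - c_k$, so in that replicate it would receive the indicator 1 and the others 0.

matrix cst = (310, 250, 260)            // hypothetical mean costs per comparator
matrix eff = (0.010, 0.012, 0.012)      // hypothetical mean health effects
matrix nmb = 20000*eff - cst            // NMB_k = lambda*e_k - c_k
matrix list nmb                         // -110, -10, -20: comparator 2 is most cost effective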

Options

The options are the same as those for tsb except that strata() and lambda() have to be specified and that tsbceprob is designed to be used with the NMB function nb() only.

Output

tsbceprob returns estimates of the CE probabilities at the given WTP value in the matrix r(tsb_ceprob). The content in r(tsb_ceprob) is shown in the Results window. r(tsb_ceprob) is of dimension $\{1 \times (K + 1)\}$, where elements in the first K columns are the CE probabilities for the K comparators (defined in the strata variable), and the last element is the corresponding λ WTP value used for the estimation. These probabilities can then be used to plot the CEACs against a range of threshold values, λ. An example based on data of a published CRT is given in the next section. See appendix for examples of some user-supplied functions in Mata.

4 Illustrative examples

In the following examples, we use cost-effectiveness data from a published CRT for evaluating alternative interventions for preventing postnatal depression. The PoNDER (psychological interventions for postnatal depression-randomized controlled trial and economic evaluation) study is a UK health technology assessment of alternative interventions compared with usual care for preventing postnatal depression (Morrell et al. 2009). The participating general practices were randomized to usual care (control arm) or one of two interventions, a person-centered approach (PCA) or cognitive-behavioral approach (CBA). We use cost and health-related, quality-of-life data reported at six months for 1,732 patients (70 general practices) with complete information for our illustration. Health effects are calculated by the change (six months from baseline) in the health-related, quality-of-life measure.


4.1 Example 1: Mean cost (mept())

In our first example, we grouped the two interventions (PCA and CBA) into a single intervention group. Individual-level health care costs are typically highly right skewed, as in the case for PoNDER (figure 1).

Figure 1. Distributions of individual-level costs (in British pounds) by intervention (PoNDER)

TSB can be used here for estimating the uncertainties around the two point estimates for mean costs for the control and intervention arms. Here we show how the bootstrap standard error and different confidence intervals can be estimated by using tsb in conjunction with the user-supplied Mata function mept(). mept() calculates the mean of the single variable specified in varlist for tsb. The variables used in the example are cost for individual-level costs (measured in British pounds, GBP), int1 for cluster-level treatment (0 for control, 1 for intervention), and cluster for cluster identifier.
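The Mata functions supplied to stats(), including mept(), are listed in the article's appendix and are not reproduced here. Purely as a generic illustration of the kind of Mata code involved (the exact signature expected by tsb may differ), a function returning the mean of the first column of a data matrix could look like the following.

mata:
// generic illustration only: mean of the first column of X (not the packaged mept())
real scalar colmean1(real matrix X)
{
        return(mean(X[., 1]))
}
end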


Figure 2. Distributions of bootstrap replicates of mean costs (in British pounds) by intervention arm (PoNDER)

Note: Histograms overlaid with corresponding normal densities.

Despite the high level of skewness in the observed costs, the distributions of the bootstrap replicates of mean costs appear symmetric in both arms (figure 2). As a result, the differences in the limits of the normal approximation and percentile-based confidence intervals are small. For the control arm, the 95% normal CIs are estimated to be 278.8 to 341.0 GBP and 279.9 to 341.8 GBP for the BCa CIs. The TSB sampling took less than 10 seconds to complete for both arms (see Elapsed time in the output).5

5. All examples were performed with Stata/IC 11.2 for Windows (32-bit) on a Dell PC with Xeon(R) 2.93 GHz CPU and 12 GB RAM.


. tsb cost if int1==0, stats(&mept()) cluster(cluster) seed(101)

*** User-supplied settings ***
Cluster variable: cluster
Statistic (function): &mept()
Strata variable: not supplied (assumed constant)

Two-stage bootstrap WITH shrinkage estimator
.................................................. 50
.................................................. 100
  (output omitted )
.................................................. 1000

Elapsed time (mins) = .133

Bootstrap results                        Number of obs   =       495
                                         Replications    =      1000

            Observed    Bootstrap
               Coef.         Bias   Std. Err.  [95% Conf. Interval]

    bsam    309.8899    -.2805662   15.871883   278.7816   340.9982  (N)
                                                278.7852    340.61   (P)
                                                279.0978   341.143   (BC)
                                                279.9131   341.812   (BCa)

(N)   normal confidence interval
(P)   percentile confidence interval
(BC)  bias-corrected confidence interval
(BCa) bias-corrected and accelerated confidence interval

Mean of TSB sample of statistic of interest = 309.60933

. tsb cost if int1==1, stats(&mept()) cluster(cluster) seed(101) nodots

*** User-supplied settings ***
Cluster variable: cluster
Statistic (function): &mept()
Strata variable: not supplied (assumed constant)

Two-stage bootstrap WITH shrinkage estimator

Elapsed time (mins) = .15

Bootstrap results                        Number of obs   =      1237
                                         Replications    =      1000

            Observed    Bootstrap
               Coef.         Bias   Std. Err.  [95% Conf. Interval]

    bsam    246.4919     .0702566   6.2435131   234.2548    258.729  (N)
                                                234.5496   258.2897  (P)
                                                234.4035    258.169  (BC)
                                                234.5774   258.3279  (BCa)

(N)   normal confidence interval
(P)   percentile confidence interval
(BC)  bias-corrected confidence interval
(BCa) bias-corrected and accelerated confidence interval

Mean of TSB sample of statistic of interest = 246.56216


4.2 Example 2: Incremental net benefit (inb())

At a societal WTP threshold value of 20,000 GBP per unit of effect, the INB calculated from the observed sample was 98.4 GBP (see Observed Coef. in the output). The uncertainty around this point estimate is quantified by the bootstrap standard error and confidence intervals. Where the symmetry of the sampling distribution of the statistic of interest is questionable because of, for example, skewed data, the normal-approximation-based confidence interval may be inappropriate, as indicated in Campbell and Torgerson (1999). Gomes et al. (2012b) showed that the BCa confidence intervals based on TSB with shrinkage correction provide good confidence interval coverage (close to nominal level) over a range of challenging data scenarios, including few clusters, unbalanced cluster sizes, and skewed costs, in their simulation study. Here the BCa confidence interval suggests there is a 95% chance that the true INB lies between 33.6 and 173.2 GBP. The TSB sampling estimation took 34 seconds to complete.

. tsb cost qalygain, stats(&inb()) cluster(cluster) strata(int1) seed(101)> lambda(20000)

*** User-supplied settings ***Cluster variable: clusterStatistic (function): &inb()Strata variable: int1Lambda = 20000

Two-stage bootstrap WITH shrinkage estimator
.................................................. 50

(output omitted )

.................................................. 1000

Elapsed time (mins) = .567

Bootstrap results                     Number of obs   =   1732
                                      Replications    =   1000

             Observed                Bootstrap
                Coef.        Bias    Std. Err.   [95% Conf. Interval]

    bsam    98.395852   -1.580874    35.236647    29.33329   167.4584   (N)
                                                  28.34896   166.2569   (P)
                                                  32.92921   172.3587   (BC)
                                                  33.58042   173.1515   (BCa)

(N)   normal confidence interval
(P)   percentile confidence interval
(BC)  bias-corrected confidence interval
(BCa) bias-corrected and accelerated confidence interval

Mean of TSB sample of statistic of interest = 96.814978


4.3 Example 3: Cost-effectiveness probabilities (tsbceprob()) and CEACs

In example 3, we use int2 as the cluster-level treatment variable. The interventions defined in this variable are control, PCA, and CBA. The following syntax shows how tsbceprob can be used to estimate CE probabilities by embedding it in a foreach loop. The resulting CE probabilities, estimated for a range of WTP values from 0 to 60,000 GBP (in steps of 5,000), are stored in the Stata matrix ceprob_mat. The contents of the matrix are then exported into the current Stata dataset by svmat for plotting the CEACs (figure 3). Here the treatment variable is int2 with 0 for control, 1 for CBA, and 2 for PCA. A seed value is used for reproducible results.

. capture matrix drop ceprob_mat

. foreach num of numlist 0(5000)60000 {
  2.     tsbceprob cost qalygain, stats(&nb()) cluster(cluster) strata(int2)
>          reps(1000) unbal(dk) nodots seed(101) lambda(`num')
  3.     matrix ceprob_mat = (nullmat(ceprob_mat)\r(tsb_ceprob))
  4. }

*** User-supplied settings ***
Cluster variable: cluster
Statistic (function): &nb()
Strata variable: int2
Average cluster size: dk
Lambda = 0

Two-stage bootstrap WITH shrinkage estimator

Elapsed time (mins) = .533

Cost-effective probabilities and WTP value
0 .92 .08 0

*** User-supplied settings ***
Cluster variable: cluster
Statistic (function): &nb()
Strata variable: int2
Average cluster size: dk
Lambda = 5000

Two-stage bootstrap WITH shrinkage estimator

Elapsed time (mins) = .533

Cost-effective probabilities and WTP value
0 .917 .083 5000

(output omitted )


. matrix list ceprob_mat

ceprob_mat[13,4]
          c1      c2      c3      c4
r1         0     .92     .08       0
r1         0    .917    .083    5000
r1         0    .872    .128   10000
r1         0    .817    .183   15000
r1         0    .783    .217   20000
r1      .002    .752    .246   25000
r1      .004    .734    .262   30000
r1      .008    .718    .274   35000
r1      .012    .704    .284   40000
r1      .013    .695    .292   45000
r1      .014     .68    .306   50000
r1      .016    .672    .312   55000
r1      .018    .662     .32   60000

. svmat ceprob_mat, names(tsb_ceprob)

. rename tsb_ceprob1 tx_control

. rename tsb_ceprob2 tx_1

. rename tsb_ceprob3 tx_2

. rename tsb_ceprob4 lval

. label variable tx_control "Control"

. label variable tx_1 "CBA"

. label variable tx_2 "PCA"

. label variable lval "Willingness-to-pay threshold (£)"

. scatter tx_control tx_1 tx_2 lval, connect(l l l)
>     msize(small small small)
>     ytitle("Probability cost effective")
>     yscale(range(0 1)) ylabel(0(0.2)1)
>     xlabel(0(10000)60000)


CBA is shown to be the most cost-effective intervention over the entire range of WTP considered, according to the CEACs in figure 3. This is followed by PCA and control.

[Figure 3 here: the CEACs plot the probability of being cost effective (y axis, 0 to 1) against the willingness-to-pay threshold in £ (x axis, 0 to 60,000) for Control, CBA, and PCA.]

Figure 3. Cost-effectiveness acceptability curves at 6 months (PoNDER)

Note: y axis shows the proportions of bootstrap samples with the highest net benefit value for the corresponding intervention.
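Equivalently, the CE probability plotted for intervention k at threshold \lambda is the proportion of the B = 1,000 TSB replicates in which that intervention has the highest net benefit (a sketch, in the notation of the nb() function given in the appendix):

\widehat{p}_k(\lambda) = \frac{1}{B} \sum_{b=1}^{B} 1\!\left\{ \mathrm{NB}_k^{(b)}(\lambda) \ge \mathrm{NB}_j^{(b)}(\lambda) \text{ for all } j \right\}, \qquad \mathrm{NB}_k^{(b)}(\lambda) = \lambda\,\bar{e}_k^{(b)} - \bar{c}_k^{(b)}

where \bar{c}_k^{(b)} and \bar{e}_k^{(b)} are the mean cost and effect of intervention k in bootstrap replicate b.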

5 Simulation study

A recent extensive simulation study reported that TSB with shrinkage correction performed as well as mixed models and outperformed methods such as seemingly unrelated regression and generalized estimating equations, both with robust variance estimators (Gomes et al. 2012b). Here we apply the new Stata command tsb to some of the scenarios from the simulation study. We report the CI coverage of the four different CIs given by bstat, together with the mean width of the CIs and their lower and upper tail error rates (table 1).


Table 1. Confidence interval coverage of 95% confidence intervals reported by tsb with and without shrinkage correction on simulated data (true INB = 1,000 GBP)

                                              Scenario(I)
                                      Base case              Challenging
                                      Shrinkage correction
Confidence interval                   Without     With      Without      With

Normal            Coverage(III)         0.981    0.944        0.965     0.966
                  Mean width            533.8    427.1      3,217.0   3,280.0
                  Lower tail error rate 0.009    0.029        0.016     0.016
                  Upper tail error rate 0.011    0.028        0.020     0.019

Percentile        Coverage(III)         0.979    0.942        0.960     0.958
                  Mean width            533.6    426.5      3,320.0   3,366.0
                  Lower tail error rate 0.010    0.028        0.018     0.017
                  Upper tail error rate 0.011    0.031        0.023     0.025

Bias-corrected    Coverage(III)         0.983    0.939        0.959     0.956
                  Mean width            533.5    426.7      3,185.0   3,251.0
                  Lower tail error rate 0.009    0.029        0.018     0.019
                  Upper tail error rate 0.009    0.033        0.024     0.026

Bias-corrected    Coverage(III)         0.983    0.939        0.954     0.950
and accelerated   Mean width            533.5    426.7      3,266.0   3,353.0
                  Lower tail error rate 0.009    0.029        0.021     0.022
                  Upper tail error rate 0.009    0.033        0.026     0.029

Notes: I. Base case: 20 clusters per arm, 50 individuals per cluster, cluster size imbalance (cvimb = 0), intracluster correlation coefficient for costs = 0.01 (effects 0.01), cost skewness (cvcost = 0.2 implies no skewness for Gamma distributed costs), individual- (cluster-) level correlation of costs and effects = 0.2 (0); challenging: same as base case but with the following differences: 3 clusters per arm, cvimb = 1, and cvcost = 3.
II. All seed values were set at 101, with 1,000 bootstrap replications throughout. Two thousand datasets were simulated for each scenario.
III. The coverage probabilities are each based on 2,000 replicate samples, implying a typical confidence interval width for the coverage probabilities of 2 × 1.96 × √{0.95 × (1 − 0.95)/2000} = 0.0191 (approximately 1.9 percentage points).

In brief, cost and health outcome data were generated from a CRT design assumed to have two randomized arms (intervention and control). Gomes et al. (2012b) simulated the effect of the intervention on mean costs and health outcomes according to a bivariate linear additive model. Each of the two simultaneous equations for costs and health outcomes included a cluster-level mean, a cluster-level incremental effect (for the intervention arm), and an individual deviation from the cluster mean. Individual- and cluster-level costs and health outcomes were allowed to be correlated and, for the base case, followed a bivariate normal distribution. The base case also assumed a balanced design with 50 individuals within each cluster and 20 clusters within each treatment arm.
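To fix ideas, the base-case design can be sketched in Stata roughly as follows. This is an illustrative sketch only: the means, standard deviations, and incremental effects shown are placeholder values (chosen so that the true INB at λ = 20,000 GBP equals 1,000 GBP and the intracluster correlations are approximately 0.01), not the exact parameter values used by Gomes et al. (2012b).

. * Sketch of a base-case data-generating process (illustrative parameter values)
. clear
. set seed 101
. set obs 40                                   // 2 arms x 20 clusters
. generate cluster = _n
. generate int1 = cluster > 20                 // 0 = control, 1 = intervention
. * independent cluster-level deviations for costs and effects (ICC approx 0.01)
. generate uc = rnormal(0, 20)
. generate ue = rnormal(0, 0.005)
. expand 50                                    // 50 individuals per cluster
. * individual-level deviations with correlation 0.2 (illustrative SDs)
. generate zc = rnormal()
. generate ze = 0.2*zc + sqrt(1 - 0.2^2)*rnormal()
. * bivariate linear additive model: mean + incremental effect + deviations
. generate cost     = 1000 + 150*int1    + uc + 200*zc
. generate qalygain = 0.30 + 0.0575*int1 + ue + 0.05*ze

A dataset generated along these lines could then be analyzed exactly as in example 2, that is, with tsb cost qalygain, stats(&inb()) cluster(cluster) strata(int1).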


In a more challenging scenario, Gomes et al. (2012b) assumed that there were only three clusters per arm and that the size of the clusters followed a Gamma distribution with a mean and a coefficient of variation (cvimb) of 1, which resulted in imbalanced cluster sizes. cvimb is obtained by dividing the standard deviation of cluster size by its mean. Individual costs were also assumed to follow a Gamma distribution with a varying level of skewness, as defined by a cvcost of 3. All simulations used the same true value of 1,000 GBP for the metric of interest, the INB. Two thousand datasets were simulated for each scenario (for more details, see Gomes et al. [2012b]).

Here we find that under both these scenarios, the CIs constructed by using TSB with shrinkage correction all gave CI coverage close to the nominal rate of 0.95. As was anticipated, when the TSB was applied without the shrinkage correction, the CIs were too wide, and the CI coverage exceeded nominal levels. Under the more challenging scenario with few clusters, variable cluster sizes, and skewed costs, the bias-corrected and accelerated CI, after application of the shrinkage correction, yielded the coverage rate closest to the nominal level.

6 Discussion

In this article, we have provided new Stata commands for implementing and extending the TSB algorithm proposed by Davison and Hinkley (1997). Unlike their original algorithm, our implementation can be applied to the common setting of a CRT with unequal numbers per cluster. We envisage that the suite of commands will be particularly useful in a CEA alongside a cluster trial. In this setting, statistical methods are required to allow for the clustered nature of the data, recognizing that costs and health outcomes may be correlated and that cost data are highly skewed. In general, though, such CEAs fail to address these issues; indeed, a recent review found that around 95% of published CEAs alongside CRTs used inappropriate methods (Gomes et al. 2012a). The TSB tools we provide can help analysts address the challenges of handling clustered data with highly skewed costs that are correlated with health outcomes.

The two-stage nonparametric bootstrap method with shrinkage correction was reported to perform favorably in an extensive simulation study designed to compare the appropriateness of a number of commonly used statistical methods for CEA of CRTs (Gomes et al. 2012b). However, this procedure has not been widely applied in practice among health economists, mainly because of a lack of implementation in mainstream and user-friendly software. We hope that making this method available through Stata will help translate research findings into practice in the community.


The TSB approach relies on asymptotic assumptions that may be invalid for small samples (in particular, few cluster-level units for clustered data). Although Gomes et al. (2012b) showed that TSB with shrinkage correction gives good confidence interval coverage even with as few as three clusters per arm, such a result was based on a simulation study with a known data-generating mechanism. For real data with few clusters, the data would have been generated by numerous other factors that would not have been captured by any given data-generating mechanism. Hence, analysts should exercise caution when interpreting results based on small samples.

Our implementation of TSB generalizes Davison and Hinkley's (1997) original proposal by allowing unbalanced cluster sizes in the original data. This should make our implementation applicable to more realistic data settings, where completely balanced clusters may be rare. Finally, although our illustrative examples focus on health economic evaluations, the TSB method is applicable to clustered study designs more generally. It should be noted, though, that our implementation applies only to designs with two levels in the data hierarchy. It does not extend to dependencies that may arise in multicenter randomized controlled trials where, within a center, individuals are randomized to alternative treatments. Nor does it extend to a CRT with three or more levels in the data hierarchy (for example, when repeated measures are nested within patients and within clusters).

7 Acknowledgments

Edmond Ng and Richard Grieve received financial support from the Medical Research Council, UK. We thank Jane Morrell (PI) and Simon Dixon for permission to use, and for providing access to, the PoNDER data. We also thank Manuel Gomes and Mark Pennington for their comments on an early draft and assistance with program testing, and the reviewer for his helpful comments.

8 Funding

This work was supported by the Medical Research Council (grant number G0802321/1).

9 References

Bachmann, M. O., L. Fairall, A. Clark, and M. Mugford. 2007. Methods for analyzing cost effectiveness data from cluster randomized trials. Cost Effectiveness and Resource Allocation 5: 12.

Briggs, A., R. Nixon, S. Dixon, and S. Thompson. 2005. Parametric modelling of cost data: Some simulation evidence. Health Economics 14: 421–428.

Briggs, A. H., D. E. Wonderling, and C. Z. Mooney. 1997. Pulling cost-effectiveness analysis up by its bootstraps: A non-parametric approach to confidence interval estimation. Health Economics 6: 327–340.


Campbell, M. K., and D. J. Torgerson. 1999. Bootstrapping: Estimating confidence intervals for cost-effectiveness ratios. QJM: An International Journal of Medicine 92: 177–182.

Carpenter, J., and J. Bithell. 2000. Bootstrap confidence intervals: When, which, what? A practical guide for medical statisticians. Statistics in Medicine 19: 1141–1164.

Colvin, M., M. O. Bachmann, R. K. Homan, D. Nsibande, N. M. Nkwanyana, C. Connolly, and E. B. Reuben. 2006. Effectiveness and cost effectiveness of syndromic sexually transmitted infection packages in South African primary care: Cluster randomised trial. Sexually Transmitted Infections 82: 290–294.

Davison, A. C., and D. V. Hinkley. 1997. Bootstrap Methods and Their Application. Cambridge: Cambridge University Press.

Donner, A., and N. Klar. 2000. Design and Analysis of Cluster Randomization Trials in Health Research. London: Arnold.

Efron, B., and R. J. Tibshirani. 1993. An Introduction to the Bootstrap. New York: Chapman & Hall/CRC.

Faddy, M., N. Graves, and A. Pettitt. 2009. Modeling length of stay in hospital and other right skewed data: Comparison of phase-type, gamma and log-normal distributions. Value in Health 12: 309–314.

Flynn, T. N., and T. J. Peters. 2004. Use of the bootstrap in analysing cost data from cluster randomised trials: Some simulation results. BMC Health Services Research 4: 33.

———. 2005. Cluster randomized trials: Another problem for cost-effectiveness ratios. International Journal of Technology Assessment in Health Care 21: 403–409.

Gomes, M., R. Grieve, R. Nixon, and W. J. Edmunds. 2012a. Statistical methods for cost-effectiveness analyses that use data from cluster randomized trials: A systematic review and checklist for critical appraisal. Medical Decision Making 32: 209–220.

Gomes, M., E. S. Ng, R. Grieve, R. Nixon, J. Carpenter, and S. G. Thompson. 2012b. Developing appropriate methods for cost-effectiveness analysis of cluster randomized trials. Medical Decision Making 32: 350–361.

Liu, R. Y. 1988. Bootstrap procedures under some non-I.I.D. models. Annals of Statistics 16: 1696–1708.

Manning, W. G. 1998. The logged dependent variable, heteroscedasticity, and the retransformation problem. Journal of Health Economics 17: 283–295.

Morrell, C. J., R. Warner, P. Slade, S. Dixon, S. Walters, G. Paley, and T. Brugha. 2009. Psychological interventions for postnatal depression: Cluster randomised trial and economic evaluation. The PoNDER trial. Health Technology Assessment 13: 1–176.


Nixon, R. M., D. Wonderling, and R. D. Grieve. 2010. Non-parametric methods for cost-effectiveness analysis: The central limit theorem and the bootstrap compared. Health Economics 19: 316–333.

Stata Technical Support. 2010. Personal communication.

Stinnett, A., and J. Mullahy. 1998. Net health benefits: A new framework for the analysis of uncertainty in cost-effectiveness analysis. Medical Decision Making 18: S68–S80.

Sullivan, S. D., T. A. Lee, D. K. Blough, J. A. Finkelstein, P. Lozano, T. S. Inui, A. L. Fuhlbrigge, V. J. Carey, E. Wagner, and K. B. Weiss. 2005. A multisite randomized trial of the effects of physician education and organizational change in chronic asthma care: Cost-effectiveness analysis of the Pediatric Asthma Care Patient Outcomes Research Team II (PAC-PORT II). Archives of Pediatrics and Adolescent Medicine 159: 428–434.

Van Hout, B. A., M. J. Al, G. S. Gordon, and F. F. H. Rutten. 1994. Costs, effects and C/E-ratios alongside a clinical trial. Health Economics 3: 309–319.

Willan, A. R., and A. H. Briggs. 2006. Statistical Analysis of Cost-Effectiveness Data. Chichester, UK: Wiley.

Wolters, R., R. Grol, T. Schermer, R. Akkermans, R. Hermens, and M. Wensing. 2006. Improving initial management of lower urinary tract symptoms in primary care: Costs and patient outcomes. Scandinavian Journal of Urology and Nephrology 40: 300–306.

About the authors

Edmond Ng was a lecturer in medical statistics at the London School of Hygiene and Tropical Medicine, London, UK. He is now a research statistician in the Clinical Practice Research Datalink group at the Medicines and Healthcare products Regulatory Agency. He has been programming in Mata for over a year and used it for writing the tsb command suite.

Richard Grieve is a reader in health economics at the London School of Hygiene and Tropical Medicine.

James Carpenter is a professor in medical statistics at the London School of Hygiene and Tropical Medicine.

Appendix. Sample Mata functions for use with TSB

The three Mata functions used in our examples are shown below, namely, mept(), inb(), and nb(). These child functions take an input data object, data, from their parent function tsb when they are called by the latter. The data object is of dimension {Nk × (nept + 2)}, where Nk is the total number of observations in stratum k (the subscript k is omitted when resampling is performed without stratification), and nept is the number of endpoints to be resampled jointly. For CEAs where two endpoints, individual costs and effects, are to be resampled jointly, nept equals 2. The last two columns in data are reserved for the cluster and strata variables (see the technical notes below).


Technical notes on TSB

The two rightmost columns of the data object are reserved for the cluster identifier (second column from the right) and strata (rightmost column) variables. For example, inb() and nb() both assume that the rightmost column (fourth column) in data is the strata variable (the object treat is used in these functions for the strata variable). Analysts wishing to use tsb on other statistics of interest must bear this in mind when writing their own child Mata functions.

The first nept columns in the data object are the variables (endpoints) specified to be resampled jointly. Where nept > 1, the ordering of the variables in the child Mata function must match those given in varlist for tsb. For example, the function inb() takes the first column of the data object as costs and the second column as health effects. Therefore, these two endpoints must appear in the same order in the tsb command (for example, tsb costs effects, stats(&inb()) . . . ).

When the following three sets of syntax are issued in Stata, the corresponding Mata functions will be saved in one's (Stata's) personal directory. If one is unsure where one's personal directory is, it will be displayed when the command personal is issued in Stata. Replace "local directory" in the following syntax with a path that matches one's local setting so that the Mata functions are stored in the appropriate directory for access.
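For example, if personal reports that the personal ado-directory is C:\ado\personal (a hypothetical location used purely for illustration), the final saving line of each function below would read

mata mosave mept(), dir("C:\ado\personal") replace

and similarly for inb() and nb().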

A.1. mept()—mean endpoint

mata:
real scalar function mept(transmorphic matrix data)
{
        version 11.2
        /* DECLARATIONS */
        real matrix cost
        real scalar mept
        /* Extract data from "data" for calculation */
        cost = data[.,1]
        /* Calculate mean endpoint */
        mept = mean(cost)
        return(mept)
}
mata mosave mept(), dir("local directory:\ado\personal") replace
end


A.2. inb()—incremental net benefit

mata:
real scalar function inb(transmorphic matrix data, | real scalar lambda)
{
        version 11.2
        /* DECLARATIONS */
        real matrix cost, effect, treat
        real scalar mc_ctl, mc_tmt, me_ctl, me_tmt
        /* DATA CHECKING */
        if (cols(data)!=4) {
                _error("Function requires an input data matrix of 4 columns wide.")
        }
        cost   = data[.,1]
        effect = data[.,2]
        treat  = data[.,4]
        /* SET UNSPECIFIED PARAM VALUES */
        if (lambda==.) lambda = 20000
        /* DATA CHECKING */
        /* 1. Stop if treatment var has anything other than 2 levels */
        if (rows(uniqrows(treat))<2) {
                _error("Treatment variable has <2 unique values.")
        }
        else if (rows(uniqrows(treat))>2) {
                _error("Treatment variable has >2 unique values.")
        }
        /* Calculate INB */
        mc_ctl = mean(select(cost,treat:==uniqrows(treat)[1]))   /* mean cost for ctl */
        mc_tmt = mean(select(cost,treat:==uniqrows(treat)[2]))   /* mean cost for tmt */
        me_ctl = mean(select(effect,treat:==uniqrows(treat)[1])) /* mean eff for ctl */
        me_tmt = mean(select(effect,treat:==uniqrows(treat)[2])) /* mean eff for tmt */
        inb = (me_tmt-me_ctl)*lambda - (mc_tmt-mc_ctl)
        return(inb)
}
mata mosave inb(), dir("local directory:\ado\personal") replace
end


A.3. nb()—net benefit (for calculating cost-effectiveness probabilities)

mata:
real matrix function nb(transmorphic matrix data, | real scalar lambda)
{
        version 11.2
        /* DECLARATIONS */
        real matrix cost, effect, treat
        real scalar mc_ctl, mc_tmt, me_ctl, me_tmt
        /* DATA CHECKING */
        if (cols(data)!=4) {
                _error("Function requires an input data matrix of 4 columns wide.")
        }
        cost   = data[.,1]
        effect = data[.,2]
        treat  = data[.,4]
        /* SET UNSPECIFIED PARAM VALUES */
        if (lambda==.) lambda = 20000
        nstrata = rows(uniqrows(treat))
        nb = J(1,nstrata,.)
        for (i=1; i<=nstrata; i++) {
                nb[1,i] = mean(select(effect,treat:==uniqrows(treat)[i]))*lambda -
                          mean(select(cost,treat:==uniqrows(treat)[i]))
        }
        return(nb)
}
mata mosave nb(), dir("local directory:\ado\personal") replace
end


The Stata Journal (2013) 13, Number 1, pp. 165–184

Joint modeling of longitudinal and survival data

Michael J. Crowther
Department of Health Sciences
University of Leicester
Leicester, UK
[email protected]

Keith R. Abrams
Department of Health Sciences
University of Leicester
Leicester, UK

Paul C. Lambert
Department of Health Sciences
University of Leicester
Leicester, UK
and
Department of Medical Epidemiology and Biostatistics
Karolinska Institutet
Stockholm, Sweden

Abstract. The joint modeling of longitudinal and survival data has received remarkable attention in the methodological literature over the past decade; however, the availability of software to implement the methods lags behind. The most common form of joint model assumes that the association between the survival and the longitudinal processes is underpinned by shared random effects. As a result, computationally intensive numerical integration techniques such as adaptive Gauss–Hermite quadrature are required to evaluate the likelihood. We describe a new user-written command, stjm, that allows the user to jointly model a continuous longitudinal response and the time to an event of interest. We assume a linear mixed-effects model for the longitudinal submodel, allowing flexibility through the use of fixed or random fractional polynomials of time. Four choices are available for the survival submodel: the exponential, Weibull or Gompertz proportional hazard models, and the flexible parametric model (stpm2). Flexible parametric models are fit on the log cumulative-hazard scale, which has direct computational benefits because it avoids the use of numerical integration to evaluate the cumulative hazard. We describe the features of stjm through application to a dataset investigating the effect of serum bilirubin level on time to death from any cause in 312 patients with primary biliary cirrhosis.

Keywords: st0289, stjm, stjmgraph, stjm postestimation, joint modeling, mixed effects, survival analysis, longitudinal data, adaptive Gauss–Hermite quadrature

1 Introduction

A joint model of longitudinal and time-to-event data can effectively assess the impact that a longitudinal covariate, measured with error, has on the time to an event of interest, providing a framework to assess the predictive ability of a biomarker on survival. Wulfsohn and Tsiatis (1997) and Henderson, Diggle, and Dobson (2000) have



shown that by undertaking a joint model that evaluates both the longitudinal and the survival data simultaneously, we can reduce biases and improve precision over simpler approaches. Such approaches include the separate modeling of each form of data by using standard tools such as xtmixed and streg or a two-stage approach whereby fitted values, including empirical Bayes estimates of the longitudinal model, are used as a time-varying covariate in a survival model. Conversely, joint models can also be viewed from the perspective of adjusting for informative drop-out in a longitudinal study (for example, if one finds when modeling quality of life over time in patients with cancer that patients with lower quality of life are more likely to die, resulting in nonignorable drop-out, as described in Billingham and Abrams [2002]).

The most widely used form of joint model assumes that the longitudinal and survival processes are underpinned by shared random effects. This results in a joint likelihood that cannot be evaluated analytically. Consequently, computationally demanding numerical integration techniques such as adaptive Gauss–Hermite quadrature (see Pinheiro and Bates [1995]) must be used to evaluate both the cumulative hazard and the overall joint likelihood.

The implementation of joint modeling in Stata is somewhat limited. The extensive gllamm suite (see Rabe-Hesketh, Skrondal, and Pickles [2002]) can fit shared parameter models but can assume only a piecewise exponential form for the survival submodel. The newly implemented jmre1 command (see Pantazis and Touloumi [2010]) approaches analyses from the point of view of adjusting for informative drop-out in a longitudinal study, assuming the longitudinal and survival components are multivariate normal.

We present the stjm command, which allows the user to jointly model a continuous longitudinal response and the time to an event of interest. We assume a linear mixed-effects model for the longitudinal submodel, allowing flexibility through the use of fixed or random fractional polynomials of time. Four choices are available for the survival submodel, including the exponential, Weibull (Guo and Carlin 2004), and Gompertz proportional hazards models. We believe this is the first implementation of the Gompertz survival model within a joint modeling context. Furthermore, we implement the joint model of Crowther, Abrams, and Lambert (2012), which incorporates the flexible parametric survival model, stpm2 (see Royston and Parmar [2002] and Lambert and Royston [2009]). Flexible parametric survival models are fit on the log cumulative-hazard scale, which has direct computational benefits because it avoids the need for numerical integration to evaluate the cumulative hazard. The models are fit by using maximum likelihood, with both simple and adaptive Gauss–Hermite quadrature available.


We illustrate the command by using a dataset of 312 patients with primary biliary cirrhosis (see Murtaugh et al. [1994] for further details). Of the 312, 158 were randomized to receive D-penicillamine, and 154 were assigned a placebo. Serum bilirubin was measured repeatedly at intermittent time points. We investigate the effect of treatment after adjusting for the relationship between serum bilirubin levels and time to death. There may be other areas of application; however, in this article, we concentrate on the biostatistical aspect.

2 Joint modeling of longitudinal and survival data

Consider a clinical trial where we observe a continuous longitudinal biomarker, measured intermittently and with error, and the time to an event of interest. Baseline covariates are also recorded. Let S_i be the survival time of the ith patient, where i = 1, ..., n, and T_i = min(S_i, C_i) the observed survival time, with C_i the censoring time. Define an event indicator d_i, which takes the value of 1 if S_i ≤ C_i and 0 otherwise. Let y_ij = {y_i(t_ij), j = 1, ..., m_i} denote the longitudinal response measurements of the continuous biomarker for the ith patient taken at times t_ij. Furthermore, we define shared random effects, b_i, which underpin the survival and longitudinal processes. Each submodel can be dependent on a set of baseline covariates, U_i, which can potentially differ between submodels. We impose the common assumptions that both censoring and time of measurements are noninformative.

2.1 Longitudinal submodel

We specify for the longitudinal submodel a linear mixed-effects model where time can be modeled by using a combination of fixed or random fractional polynomials. This should provide a highly flexible framework to capture a variety of longitudinal trajectories (see Royston and Altman [1994]). Therefore, we observe

y_i(t_{ij}) = W_i(t_{ij}) + e_{ij}, \qquad e_{ij} \sim N(0, \sigma_e^2)

W_i(t_{ij}) = x_i'(t_{ij})\beta + z_i'(t_{ij})b_i + u_i\delta    (1)

with design matrices X_i and Z_i for the fixed (β) and random (b_i) effects, respectively, consisting of fractional polynomial time variables. Furthermore, we also have a vector of covariates (possibly time dependent), u_i ∈ U_i, and corresponding regression coefficients, δ. We assume that the measurement error, e_ij, is independent from the random effects and that cov(e_ij, e_ik) = 0 (where j ≠ k). W_i(t_ij) now represents the "true" underlying biomarker trajectory.


2.2 Survival submodel

Exponential, Weibull, and Gompertz

Standard parametric distributions have been implemented for the survival submodel. We define the proportional hazards submodel

h(t \mid b_i, v_i) = h_0(t) \exp\{\alpha W_i(t_{ij}) + v_i\phi\}

where h_0(t) is the baseline hazard function (see [ST] streg for more details), α denotes the association parameter, and φ is a set of regression coefficients associated with a set of covariates (again possibly time dependent), v_i ∈ U_i. In this formulation, we assume the association is based on the current value of the longitudinal response. In other words, the value of the biomarker, as estimated by the longitudinal submodel, is included in the survival linear predictor as a time-varying covariate.

If covariates are included in both submodels, then we can obtain overall effects on survival by combining the direct effect on the longitudinal marker, multiplied by the association parameter, with the direct effect on survival. This concept is explained further in the example below and in Ibrahim, Chu, and Chen (2010).
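In the notation of (1) and the survival submodel above, and under the current value association, this combination for a binary covariate such as treatment that appears in both submodels can be sketched as

\text{overall log hazard-ratio for treatment} = \alpha\,\delta_{\mathrm{trt}} + \phi_{\mathrm{trt}}

where \delta_{\mathrm{trt}} is the treatment coefficient in the longitudinal submodel and \phi_{\mathrm{trt}} is its direct coefficient in the survival submodel.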

Flexible parametric model

We define the proportional cumulative hazards time-to-event submodel

\log\{H(t \mid b_i, v_i)\} = \log\{H_0(t)\} + \alpha W_i(t_{ij}) + v_i\phi    (2)

where H_0(t) is the cumulative baseline hazard function. The remaining parameters are as defined in "Exponential, Weibull, and Gompertz".

The spline basis for this specification is derived from the log cumulative-hazard function of a Weibull proportional hazards model. The linear relationship between the baseline log cumulative hazard and log time is extended by using restricted cubic splines, which impose the restriction that the fitted function be linear before the first knot and after the final knot. Further details can be found in Durrleman and Simon (1989), Royston and Parmar (2002), and Lambert and Royston (2009). We can therefore write a restricted cubic spline function of log(t), with knots k_0, as s{log(t) | γ, k_0}. This is now substituted for the log cumulative baseline hazard in (2):

\log\{H(t \mid b_i, v_i)\} = \eta_i = s\{\log(t) \mid \gamma, k_0\} + \alpha W_i(t_{ij}) + v_i\phi

Transforming to the hazard and survival scales, we obtain

h(t \mid b_i, v_i) = \left[ \frac{1}{t}\,\frac{ds\{\log(t) \mid \gamma, k_0\}}{d\log(t)} + \alpha\,\frac{dW_i(t)}{dt} \right] \exp(\eta_i), \qquad S(t \mid b_i, v_i) = \exp\{-\exp(\eta_i)\}

Again, this formulation is specific to the current value parameterization. We discuss the various forms of association in section 2.4.
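For reference, the restricted cubic spline function used above can be sketched in its usual Royston–Parmar form (a sketch only; the exact basis and the orthogonalization used by stpm2 are described in the references cited above). Writing z = \log(t), with interior knots k_1 < \dots < k_m and boundary knots k_{\min} and k_{\max},

s\{z \mid \gamma, k_0\} = \gamma_0 + \gamma_1 z + \gamma_2 v_1(z) + \dots + \gamma_{m+1} v_m(z)

v_j(z) = (z - k_j)_+^3 - \lambda_j (z - k_{\min})_+^3 - (1 - \lambda_j)(z - k_{\max})_+^3, \qquad \lambda_j = \frac{k_{\max} - k_j}{k_{\max} - k_{\min}}

where (u)_+ = \max(0, u).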


2.3 Joint likelihood

Constructing the full likelihood for the joint model, we obtain

\prod_{i=1}^{n} \left( \int_{-\infty}^{\infty} \left[ \prod_{j=1}^{m_i} f\{y_i(t_{ij}) \mid b_i, \theta\} \right] f(b_i \mid \theta)\, f(T_i, d_i \mid b_i, \theta)\, db_i \right)    (3)

where

f\{y_i(t_{ij}) \mid b_i, \theta\} = (2\pi\sigma_e^2)^{-1/2} \exp\left[ -\frac{\{y_i(t_{ij}) - W_i(t_{ij})\}^2}{2\sigma_e^2} \right]

and

f(b_i \mid \theta) = (2\pi|V|)^{-1/2} \exp\left( -\frac{b_i' V^{-1} b_i}{2} \right)

The survival likelihood component under an exponential, Weibull, or Gompertz submodel can be expressed as

f(T_i, d_i \mid b_i, \theta) = \left[ h_0(T_i) \exp\{\alpha W_i(T_i) + \phi v_i\} \right]^{d_i} \exp\left[ -\int_0^{T_i} h_0(u) \exp\{\alpha W_i(u) + \phi v_i\}\, du \right]

Under the flexible parametric modeling approach, the survival likelihood component is written as

f(T_i, d_i \mid b_i, \theta) = \left( \left[ \frac{1}{T_i}\,\frac{ds\{\log(T_i) \mid \gamma, k_0\}}{d\log(T_i)} + \alpha\,\frac{dW_i(T_i)}{dT_i} \right] \exp(\eta_i) \right)^{d_i} \exp\{-\exp(\eta_i)\}

Evaluating (3) is a computationally demanding task, the details of which are discussed in section 2.5.

2.4 Association structure

There are a variety of ways to link the longitudinal and survival components by using the trajectory function defined in (1). The most commonly used form, called the current value parameterization (described above), includes the trajectory function as a time-dependent covariate in the linear predictor of the survival submodel. As in (2), we assess the strength of the association through α.

Alternatively, we may be interested in the effect that the slope or rate of change of the biomarker has on survival. This can be achieved by including αW_i'(t_ij) in the linear predictor of the survival submodel.

Finally, we could link the component models through a time-independent association structure, α(β_k + b_ik), linking the subject-specific deviation from the mean of the kth random effect. A special case of this links the subject-specific random intercept and its effect on survival.

The value of α is simply the log hazard-ratio for a one-unit increase in the longitudinal component included in the survival submodel. Note that if α is estimated to


be 0, that is, no association is present, then the joint model reduces to the two standard separate models. Any combination of the three association structures can be used in the same model: for example, in some settings, both the subject-specific baseline and the current value may be predictive of survival. Choice of association structure should be guided by the clinical question under investigation.

2.5 Maximization

Using Stata's default Newton–Raphson method (see Gould, Pitblado, and Poi [2010]), stjm uses a d0 evaluator program to maximize the likelihood. The joint likelihood in (3) contains an analytically intractable integral where we wish to integrate out the random effects. This can be achieved by using numerical techniques such as simple Gauss–Hermite quadrature (Pinheiro and Bates 1995). Essentially, we can approximate the integral by a weighted summation of the function evaluated at a set of m points, where the m points are the roots of an mth-degree Hermite polynomial. Increasing m increases the accuracy of the approximation; however, computation time also increases. Extension to multivariate integrals (random effects) follows naturally; however, computation time will grow exponentially. For example, a model with only a random intercept evaluated with 5-point quadrature evaluates the likelihood at 5 specified points. If this is extended to a random intercept and slope model, with 5-point quadrature for each random effect, then the likelihood is evaluated at 5 × 5 = 25 points.
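Written out for a single random effect, the Gauss–Hermite rule approximates integrals of the form

\int_{-\infty}^{\infty} e^{-x^2} g(x)\, dx \approx \sum_{k=1}^{m} w_k\, g(x_k)

where x_1, \dots, x_m are the roots of the mth-degree Hermite polynomial and w_1, \dots, w_m are the associated weights; a change of variables puts the integral over the normal random-effects density in (3) into this form.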

In addition to the full joint likelihood, under an exponential, Weibull, or Gompertz survival submodel, we must use Gauss–Kronrod quadrature to calculate the cumulative hazard. This can be done by using 7- or 15-point quadrature in stjm. This is not required when using a flexible parametric survival submodel, because we model on the log cumulative-hazard scale, providing computational benefits.

Crowther, Abrams, and Lambert (2012) note that the use of simple Gauss–Hermite quadrature in the joint model setting can drastically underestimate the standard errors of the parameters in the longitudinal submodel unless a sufficiently high number of quadrature nodes is used. This substantially increases computation time, which grows exponentially with the addition of more random effects. A more complex but accurate extension is to use adaptive Gauss–Hermite quadrature. The implementation of this in Stata in the mixed-model context has been described in Rabe-Hesketh, Skrondal, and Pickles (2002). At the beginning of each full Newton–Raphson iteration, we can center and scale the quadrature node locations for each individual panel, positioning the node matrix in the most appropriate area. This is achieved by using the empirical Bayes estimates and associated standard errors of the random effects for each panel. The use of adaptive quadrature means that a much-reduced number of nodes is required for each random-effects dimension, resulting in substantial computational benefits and much greater accuracy in the estimation.
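In one dimension, this adaptive rescaling amounts to evaluating, for panel i with empirical Bayes estimate \hat{b}_i and associated standard error \hat{\tau}_i (a standard sketch of adaptive Gauss–Hermite quadrature),

\int_{-\infty}^{\infty} g(b_i)\, db_i \approx \sqrt{2}\,\hat{\tau}_i \sum_{k=1}^{m} w_k\, e^{x_k^2}\, g\!\left(\hat{b}_i + \sqrt{2}\,\hat{\tau}_i\, x_k\right)

so that the nodes are centered and scaled to the region where the integrand has most of its mass.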

We caution the user that these models are complex, and sometimes the default estimation algorithm may lead to a model that does not converge. As in all random-effects models, one should be cautious about overmodeling, particularly the number of


random-effects parameters. The majority of previous work on joint models has only considered up to two random effects, that is, intercept and slope (Wulfsohn and Tsiatis 1997). stjm can have up to five random effects; however, with a limited data size, it is not feasible to have too complex a model.

2.6 Delayed entry and time-varying covariates

stjm has been developed to be entirely consistent with the setup of multiple-record st data. We can therefore use _t0 to denote the measurement times defined as t_ij in section 2. This allows both for delayed entry models, which, for example, let age be used as the time scale, and for inclusion of further time-varying covariates within both submodels, assuming they vary at the times of measurement; that is, they are allowed to change at times _t0 but are constant within intervals [_t0, _t).

3 The stjm command

3.1 Syntax

stjm depvar [indepvars] [if] [in], panel(varname) survmodel(survsubmodel)
     [ffp(numlist) rfp(numlist) timeinteraction(varlist) covariance(vartype)
     survcov(varlist) df(#) knots(numlist) noorthog nocurrent
     derivassociation intassociation association(numlist)
     assoccovariates(varlist) gh(#) gk(#) adaptit(#) noshowadapt atol(#)
     nonadapt fulldata nullassoc maximize_options showinitial variance
     showcons keepcons level(#)]

You must stset the data into enter and exit times before using stjm; see [ST] stset. depvar is the longitudinal response, and indepvars are covariates in the longitudinal submodel. stjm uses _t0 as the measurement times and each patient's final row of _t as the survival time.

3.2 Options

Required

panel(varname) contains the panel identification variable. Each panel should be identified by a unique integer. panel() is required.

survmodel(survsubmodel) specifies the survival submodel to be fit. survmodel() is required. survsubmodel can be one of the following:


survmodel(fpm) fits a flexible parametric survival submodel. This is a highly flexible, fully parametric alternative to the Cox model, modeled on the log cumulative-hazard scale by using restricted cubic splines. For more details, see stpm2.

survmodel(exponential) fits an exponential survival submodel.

survmodel(weibull) fits a Weibull survival submodel.

survmodel(gompertz) fits a Gompertz survival submodel.

Longitudinal submodel

ffp(numlist) specifies power transformations of the time variable to be included in the longitudinal submodel as fixed effects. _t0 is used as the time of measurements. Values must be in {-5, -4, -3, -2, -1, -0.5, 0, 0.5, 1, 2, 3, 4, 5}.

rfp(numlist) specifies power transformations of the time variable to be included in the longitudinal submodel as fixed and random effects. _t0 is used as the time of measurements. Values must be in {-5, -4, -3, -2, -1, -0.5, 0, 0.5, 1, 2, 3, 4, 5}.

timeinteraction(varlist) specifies covariates to interact with the fixed fractional polynomials of measurement time.

covariance(vartype) specifies the variance–covariance structure of the random effects. vartype can be one of the following:

covariance(independent) specifies a distinct variance for each random effect, with all covariances equal to 0.

covariance(exchangeable) specifies equal variances for all random effects and one common pairwise covariance.

covariance(identity) specifies equal variances for all random effects, with all covariances equal to 0.

covariance(unstructured) specifies that all variances and covariances are distinctly estimated. This is the default.

Survival submodel

survcov(varlist) specifies covariates to be included in the survival submodel.

df(#) specifies the degrees of freedom for the restricted cubic spline function used for the baseline cumulative hazard under a flexible parametric survival submodel. # must be between 1 and 10, but usually, a value between 1 and 5 is sufficient.

knots(numlist) specifies knot locations for the baseline distribution function under a flexible parametric survival submodel, as opposed to the default locations set by df(). Note that the locations of the knots are placed on the standard time scale. However, the scale used by the restricted cubic spline function is always log time. Default knot positions are determined by the df() option.


noorthog suppresses orthogonal transformation of spline variables under a flexible parametric survival submodel.

Association

nocurrent specifies that the association between the survival and the longitudinal submodels is not based on the current value. The default association is based on the current value of the longitudinal response. If nocurrent is invoked, at least one of intassociation, association(), and derivassociation must be specified.

derivassociation specifies that the association between the survival and the longitudinal submodels is based on the first derivative of the longitudinal submodel.

intassociation specifies that the association between the survival and the longitudinal submodels is based on the random intercept of the longitudinal submodel.

association(numlist) specifies that the association between the survival and the longitudinal submodels is based on a random coefficient of the time fractional polynomials specified in rfp().

assoccovariates(varlist) specifies covariates to be included in the linear predictor of the association parameters. Under the default current value association, this corresponds to interacting the longitudinal submodel with covariates.

Maximization

gh(#) specifies the number of quadrature points for the simple or adaptive Gauss–Hermite quadrature used to evaluate the joint likelihood. The minimum number of quadrature points is two. The default is gh(5) or gh(15) under adaptive or simple quadrature, respectively.

gk(#) specifies the number of quadrature points for the Gauss–Kronrod quadrature used to evaluate the cumulative hazard under an exponential, Weibull, or Gompertz survival submodel. Two choices are available, either 7 or 15. The default is gk(15).

adaptit(#) defines the number of iterations of adaptive Gauss–Hermite quadrature to use in the maximization process. The default is adaptit(5). Adaptive quadrature is implemented at the beginning of each full Newton–Raphson iteration.

noshowadapt suppresses the display of the log-likelihood values under the subiterations used to assess convergence of the adaptive quadrature implemented at the beginning of each full Newton–Raphson iteration.

atol(#) specifies the tolerance for the log likelihood under adaptive quadrature subiterations. The default is atol(1.0E-05).


nonadapt uses nonadaptive Gauss–Hermite quadrature to evaluate the joint likelihood. This will generally require a much higher number of nodes, gh(), to ensure accurate estimates and standard errors, resulting in much greater computation time.

fulldata forces stjm to use all rows of data in the survival component of the likelihood. By default, stjm assesses whether all covariates specified in survcov() are constant within panels; if they are, stjm only needs to use the first row of _t0 and the final row of _t in the maximization process, providing considerable advantages in speed.

nullassoc sets the initial value for association parameters to 0. Use of the default initial values may in rare situations cause stjm to display initial values not feasible. Using this option solves this; however, convergence time is generally longer.

maximize_options: difficult, technique(algorithm_spec), iterate(#), [no]log, trace, gradient, showstep, hessian, shownrtolerance, tolerance(#), ltolerance(#), gtolerance(#), nrtolerance(#), nonrtolerance, and from(init_specs); see [R] maximize. These options are seldom used, but the difficult option may be useful if there are convergence problems.

Reporting

showinitial displays the output from the xtmixed and stpm2 or streg models fit to obtain initial values.

variance shows random-effects parameter estimates as variances–covariances.

showcons displays the constraints used by stpm2 and stjm for the derivatives of the spline function. This option is only valid under a flexible parametric survival submodel.

keepcons prevents the constraints imposed by stjm on the derivatives of the spline function when fitting delayed entry models from being dropped. By default, the constraints are dropped. This option is only valid under a flexible parametric survival submodel.

level(#) specifies the confidence level, as a percentage, for confidence intervals (CIs). The default is level(95) or as set by set level.

4 The stjm postestimation command

4.1 Syntax for obtaining best linear unbiased predictions (BLUPs) of random effects or the standard errors of BLUPs

predict {stub* | newvarlist}, {reffects | reses}


4.2 Syntax for obtaining other predictions

predict newvar [if] [in] [, longitudinal residuals rstandard hazard
     survival cumhazard martingale deviance reffects reses xb fitted m(#)
     at(varname # [varname # ...]) ci timevar(varname) meastime
     survtime zeros]

4.3 Options

Longitudinal submodel

longitudinal predicts the fitted values for the longitudinal submodel. If xb is specified (the default), then only contributions from the fixed portion of the model are included. If fitted is specified, then estimates of the random effects are also included.

residuals calculates residuals for the longitudinal submodel, equal to the responses minus the fitted values. By default, the fitted values take into account the random effects.

rstandard calculates standardized residuals, equal to the residuals multiplied by the inverse square root of the estimated error covariance matrix.

Survival submodel

hazard calculates the predicted hazard. The default prediction, xb, is the average of the fixed portion of the model plus m() random draws from the estimated variance–covariance matrix of the random-effects distribution. If fitted is specified, then individual-specific estimates of the random effects are included with the fixed portion of the model.

survival calculates each observation's predicted survival probability. The default prediction, xb, is the average of the fixed portion of the model plus m() random draws from the estimated variance–covariance matrix of the random-effects distribution. If fitted is specified, then individual-specific estimates of the random effects are included with the fixed portion of the model.

cumhazard calculates the predicted cumulative hazard. The default prediction, xb, is the average of the fixed portion of the model plus m() random draws from the estimated variance–covariance matrix of the random-effects distribution. If fitted is specified, then individual-specific estimates of the random effects are included with the fixed portion of the model.

martingale calculates martingale-like residuals. The default includes contributions from the random effects.

deviance calculates the deviance residuals.


Random effects

reffects calculates BLUPs of the random effects. You must specify q new variables, where q is the number of random-effects terms in the model (or level). However, it is much easier to just specify stub* and let Stata name the variables stub1, ..., stubq for you.

reses calculates the standard errors of the BLUPs of the random effects. You must specify q new variables, where q is the number of random-effects terms in the model (or level). However, it is much easier to just specify stub* and let Stata name the variables stub1, ..., stubq for you.

Subsidiary

xb specifies predictions based on the fixed portion of the model when a longitudinal option is specified. When the prediction option is hazard, cumhazard, or survival, the predictions are based on the average of the fixed portion plus m() draws from the estimated random-effects variance–covariance matrix.

fitted specifies the linear predictor of the fixed portion plus contributions based on predicted random effects.

m(#) specifies, when xb is chosen, the number of draws from the estimated random-effects variance–covariance matrix in survival submodel predictions.

at(varname # [varname # ...]) requests that the covariates specified by the listed varnames be set to the listed # values. For example, at(x1 1 x3 50) would evaluate predictions at x1 = 1 and x3 = 50. This is a useful way to obtain out-of-sample predictions. Note that if at() is used together with zeros, all covariates not listed in at() are set to 0. If at() is used without zeros, then all covariates not listed in at() are set to their sample values. See also zeros.

ci calculates a CI for the requested statistic and stores the confidence limits in newvar_lci and newvar_uci.

timevar(varname) defines the variable used as time in the predictions. This is useful for large datasets where, for plotting purposes, predictions are only needed for, say, 200 observations. Note that you should take some caution when using this option because predictions may be made at whatever covariate values are in the first 200 rows of data. This can be avoided by using the at() option or the zeros option to define the covariate patterns for which you require the predictions.

meastime evaluates predictions at measurement times, that is, _t0. This is the default for longitudinal submodel predictions.

survtime evaluates predictions at survival times, that is, _t. This is the default for survival submodel predictions.

zeros sets all covariates to 0 (baseline prediction). For example, predict s0, survival zeros calculates the baseline survival function. See also at().
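For illustration, a few typical predict calls after fitting an stjm model might look as follows; the new variable names are placeholders, and the options are those described above:

. predict b*, reffects                     // BLUPs of the random effects (b1, b2, ...)
. predict longfit, longitudinal fitted     // subject-specific longitudinal fit
. predict survfit, survival fitted ci      // survival probability with confidence limits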


5 The stjmgraph command

A subsidiary command, stjmgraph, is available. This creates a longitudinal trajectory plot whereby the time scale is adjusted by taking away each patient's event or censoring time. This form of graph can be useful to display joint longitudinal and survival data, giving an indication of any association between the two processes. A separate plot is created for patients who were censored and for patients who experienced the event of interest. They are then combined by using graph combine.

5.1 Syntax

stjmgraph depvar [if] [in], panel(varname) [censgraphopts(string)
     eventgraphopts(string) combineopts(string) draw lowess]

The dataset must be stset, as described for stjm.

5.2 Options

panel(varname) defines the panel identification variable. panel() is required.

censgraphopts(string) passes options to the twoway graph of censored observations; see [G-3] twoway_options.

eventgraphopts(string) passes options to the twoway graph of observations that experienced the event of interest; see [G-3] twoway_options.

combineopts(string) passes options to the final graph combine; see [G-2] graph combine.

draw displays the intermediate twoway plots used to create the final graph.

lowess overlays a lowess smoother to each graph to aid interpretation.

6 Example

We illustrate stjm through application to a dataset of 312 patients with primary biliary cirrhosis (see Murtaugh et al. [1994]). Of the 312, 158 were randomized to receive D-penicillamine, and 154 were assigned a placebo. Serum bilirubin was measured repeatedly at intermittent time points. We investigate the effect of treatment after adjusting for the relationship between serum bilirubin levels and time to death. Because of right skewness, in all analyses, we work with log(serum bilirubin).

The dataset must be correctly stset for use with stjm through the use of start and stop times. This allows stjm to use _t0 as the measurement times and the final row of _t as the survival times. We illustrate the data structure below:


. use fullpbc

. stset stop, enter(start) f(event=1) id(id)

                id:  id
     failure event:  event == 1
obs. time interval:  (stop[_n-1], stop]
 enter on or after:  time start
 exit on or before:  failure

       1945  total obs.
          0  exclusions

       1945  obs. remaining, representing
        312  subjects
        140  failures in single failure-per-subject data
   2000.307  total analysis time at risk, at risk from t = 0
                             earliest observed entry t = 0
                                  last observed exit t = 14.30566

. list id logb drug _t0 _t _d if id==3 | id==5, noobs sepby(id)

    id       logb        drug         _t0          _t   _d

     3   .3364722   D-penicil           0   .48187494    0
     3   .0953102   D-penicil   .48187494   .99660498    0
     3   .4054651   D-penicil   .99660498   2.0342789    0
     3   .5877866   D-penicil   2.0342789   2.7707808    1

     5   1.223776     placebo           0   .54484725    0
     5   .6418539     placebo   .54484725    1.070529    0
     5   .9162908     placebo    1.070529   2.1054649    0
     5   1.740466     placebo   2.1054649   3.0062425    0
     5   1.648659     placebo   3.0062425   3.9836819    0
     5   2.944439     placebo   3.9836819   4.1205783    0

Here we have two patients with four and six measurements of log(serum bilirubin), respectively. The data have been stset, allowing _t0 to be used to denote the time that measurements were taken and the final row (for each patient) of _t to denote the survival time. We can explore the joint data by using stjmgraph. We use the lowess option to aid interpretation.


. stjmgraph logb, panel(id) lowess

[Figure 1 appears here: two panels plotting the longitudinal response against time before censoring (left panel, labeled "Censored") and against time before event (right panel, labeled "Event").]

Figure 1. Longitudinal profiles of log(serum bilirubin) for patients who were censored or who died. The time scale is adjusted by subtracting each patient's survival time.

Figure 1 displays all patients' longitudinal trajectories against time, separately for patients who died and patients who were censored, with the time scale adjusted by subtracting each patient's survival or censoring time. We could restrict the plotted sample by using the if or in qualifier. There appears to be a generally increasing trend that is much sharper in patients who died than in those who were censored. This is indicative of a positive association between the longitudinal response and time to death, whereby a higher level of the biomarker appears to be associated with a shorter time to death. We now investigate this formally by using stjm.

We model the longitudinal process by using a linear trajectory model with random intercept and slope, adjusting for treatment group. We model the survival process by using a Weibull proportional hazards survival submodel, also adjusting for treatment group. We use the default current value association and the default unstructured form for the random-effects variance–covariance matrix.


. stjm logb trt, panel(id) survm(weibull) rfp(1) survcov(trt)
-> gen double _time_1 = X^(1)
   (where X = _t0)

Obtaining initial values:

Fitting full model:

-> Conducting adaptive Gauss-Hermite quadrature

-- Iteration 0:   Adapted log likelihood = -1920.5096
-- Iteration 1:   Adapted log likelihood = -1923.2378
-- Iteration 2:   Adapted log likelihood = -1923.2206
-- Iteration 3:   Adapted log likelihood = -1923.2214

(output omitted )

Joint model estimates                          Number of obs.    =       1945
Panel variable: id                             Number of panels  =        312

Number of failures = 140

Log-likelihood = -1918.5172

                        Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

Longitudinal
     _time_1         .1848437   .0132919    13.91   0.000     .1587921    .2108953
         trt        -.1313587   .1120029    -1.17   0.241    -.3508803    .0881629
       _cons         .5591394   .0812295     6.88   0.000     .3999324    .7183463

Survival
assoc:value
       _cons         1.240947   .0931014    13.33   0.000     1.058471    1.423422
ln_lambda
         trt         .0389711   .1790989     0.22   0.828    -.3120563    .3899985
       _cons        -4.408948   .2738691   -16.10   0.000    -4.945722   -3.872175
ln_gamma
       _cons         .0189773   .0827617     0.23   0.819    -.1432327    .1811874

  Random effects Parameters        Estimate   Std. Err.     [95% Conf. Interval]

id: Unstructured
          sd(_time_1)              .1805185   .0123477      .1578695    .2064167
            sd(_cons)               1.00034   .0425768      .9202769    1.087369
  corr(_time_1,_cons)              .4247242   .0727761      .2723586    .5563106

         sd(Residual)              .3471654   .0066731      .3343297    .3604939

Longitudinal submodel: Linear mixed effects model
    Survival submodel: Weibull proportional hazards model
   Integration method: Adaptive Gauss-Hermite quadrature using 5 nodes
    Cumulative hazard: Gauss-Kronrod quadrature using 15 nodes

We observe a statistically nonsignificant direct treatment effect on log(serum bilirubin) of −0.131 (95% CI: [−0.351, 0.088]). A statistically nonsignificant direct treatment effect on survival is observed of 0.039 (95% CI: [−0.312, 0.390]). However, a strongly positive and statistically significant association of 1.241 (95% CI: [1.058, 1.423]) is observed, indicating that a higher value of log(serum bilirubin) increases the risk of death. This corresponds to a hazard ratio of 3.459 (95% CI: [2.881, 4.150]) for a one-unit increase in the value of the time-dependent biomarker. This is consistent with figure 1.


Because we have adjusted for treatment in both submodels, we can calculate an overall treatment effect on survival. For example, we have α = 1.241, δ = −0.131, and φ = 0.039. The overall log hazard-ratio for the effect of treatment is therefore αδ + φ. This can be calculated as follows:

. nlcom [alpha_1][_cons]*[Longitudinal][trt] + [ln_lambda][trt]

_nl_1: [alpha_1][_cons]*[Longitudinal][trt] + [ln_lambda][trt]

Coef. Std. Err. z P>|z| [95% Conf. Interval]

_nl_1 -.124038 .2293071 -0.54 0.589 -.5734717 .3253957

This shows a statistically nonsignificant log hazard-ratio due to treatment of −0.124 (95% CI: [−0.573, 0.325]). Standard predictions can be obtained following an stjm fit. Fitted values and standardized residuals can be plotted against each other to evaluate model fit.

. predict longfitvals, fitted longitudinal

. predict stresids, rstandard

. scatter stresids longfitvals, yline(0) ytitle("Standardized residuals")
>     xtitle("Fitted values") title("Fitted values vs. residuals")

[Figure 2 appears here: scatterplot of standardized residuals against fitted values, titled "Fitted values vs. residuals", with a horizontal reference line at 0.]

Figure 2. Fitted values versus standardized residuals to assess model fit

Note that the longitudinal residuals described in this article must be interpreted with caution because of the inherent missing-data process underpinning the longitudinal process. A form of multiple-imputation-based residuals has been proposed by Rizopoulos, Verbeke, and Molenberghs (2010).


We can also compare predicted values of the survival function with the Kaplan–Meier estimate.

. predict survfit, xb survival
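The overlay itself can be drawn in several ways. One sketch (ours, not necessarily how figure 3 was produced) uses sts generate to obtain the Kaplan–Meier estimate from the stset data and plots it against the stored prediction:

. sts generate km = s
. twoway (line survfit _t, sort) (line km _t, sort connect(stairstep)),
>     ytitle("Survival probability") xtitle("Follow-up time")
>     legend(order(1 "Marginal survival" 2 "Kaplan-Meier curve"))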

[Figure 3 appears here: predicted marginal survival with its 95% confidence interval overlaid on the Kaplan–Meier curve; survival probability is plotted against follow-up (months).]

Figure 3. Predicted survival function for patients in the treatment group

One of the benefits of fitting joint models within a shared parameter framework is the ability to tailor predictions at the individual level. The set of fitted predictions described above is not exhaustive and does not include conditional survival predictions, whereby we wish to predict a patient's survival conditional on a set of observed longitudinal measurements. A Monte Carlo scheme has been proposed by Rizopoulos (2011) to fully account for variability in parameter estimates and in empirical Bayes estimates of the random effects. This proposal is currently being implemented in Stata.

7 Discussion

The new stjm command implements shared parameter joint modeling of longitudinal and survival data within Stata. It provides a highly flexible framework for both the longitudinal submodel, through the use of fractional polynomials, and the survival submodel, through the four choices of submodel. Through the implementation of adaptive Gauss–Hermite quadrature, accurate estimates of effect can be obtained by using a much-reduced number of quadrature nodes, resulting in substantial computational benefits.

The software is being constantly updated and improved, and we aim to write further articles for the Stata Journal covering extensions to competing risks, the inclusion of a cure proportion, and categorical longitudinal responses.


8 Acknowledgments

We thank an associate editor for constructive comments that greatly improved the article. Part of this work was conducted when Michael Crowther undertook an internship at StataCorp. He would like to thank all the people at StataCorp for their hospitality, in particular, Yulia Marchenko, Jeff Pitblado, Alan Riley, and Vince Wiggins.

Michael Crowther was funded by a National Institute for Health Research Methods Fellowship (RP-PG-0407-10314).

9 References

Billingham, L. J., and K. R. Abrams. 2002. Simultaneous analysis of quality of life and survival data. Statistical Methods in Medical Research 11: 25–48.

Crowther, M. J., K. R. Abrams, and P. C. Lambert. 2012. Flexible parametric joint modelling of longitudinal and survival data. Statistics in Medicine 31: 4456–4471.

Durrleman, S., and R. Simon. 1989. Flexible regression models with cubic splines. Statistics in Medicine 8: 551–561.

Gould, W., J. Pitblado, and B. Poi. 2010. Maximum Likelihood Estimation with Stata. 4th ed. College Station, TX: Stata Press.

Guo, X., and B. P. Carlin. 2004. Separate and joint modeling of longitudinal and event time data using standard computer packages. American Statistician 58: 16–24.

Henderson, R., P. Diggle, and A. Dobson. 2000. Joint modelling of longitudinal measurements and event time data. Biostatistics 1: 465–480.

Ibrahim, J. G., H. Chu, and L. M. Chen. 2010. Basic concepts and methods for joint models of longitudinal and survival data. Journal of Clinical Oncology 28: 2796–2801.

Lambert, P. C., and P. Royston. 2009. Further development of flexible parametric models for survival analysis. Stata Journal 9: 265–290.

Murtaugh, P. A., E. R. Dickson, G. M. V. Dam, M. Malinchoc, P. M. Grambsch, A. L. Langworthy, and C. H. Gips. 1994. Primary biliary cirrhosis: Prediction of short-term survival based on repeated patient visits. Hepatology 20: 126–134.

Pantazis, N., and G. Touloumi. 2010. Analyzing longitudinal data in the presence of informative drop-out: The jmre1 command. Stata Journal 10: 226–251.

Pinheiro, J. C., and D. M. Bates. 1995. Approximations to the log-likelihood function in the nonlinear mixed-effects model. Journal of Computational and Graphical Statistics 4: 12–35.

Rabe-Hesketh, S., A. Skrondal, and A. Pickles. 2002. Reliable estimation of generalized linear mixed models using adaptive quadrature. Stata Journal 2: 1–21.


Rizopoulos, D. 2011. Dynamic predictions and prospective accuracy in joint models for longitudinal and time-to-event data. Biometrics 67: 819–829.

Rizopoulos, D., G. Verbeke, and G. Molenberghs. 2010. Multiple-imputation-based residuals and diagnostic plots for joint models of longitudinal and survival outcomes. Biometrics 66: 20–29.

Royston, P., and D. G. Altman. 1994. Regression using fractional polynomials of continuous covariates: Parsimonious parametric modelling (with discussion). Journal of the Royal Statistical Society, Series C 43: 429–467.

Royston, P., and M. K. B. Parmar. 2002. Flexible parametric proportional-hazards and proportional-odds models for censored survival data, with application to prognostic modelling and estimation of treatment effects. Statistics in Medicine 21: 2175–2197.

Wulfsohn, M. S., and A. A. Tsiatis. 1997. A joint model for survival and longitudinal data measured with error. Biometrics 53: 330–339.

About the authors

Michael Crowther is a research associate in medical statistics. His main interest is the development and application of joint models for longitudinal and survival data.

Keith Abrams is a professor of medical statistics who maintains an active research interest in the joint modeling of longitudinal and survival data.

Paul Lambert is a reader in medical statistics. His main interest is in the development and application of methods in population-based cancer research.


The Stata Journal (2013) 13, Number 1, pp. 185–205

Doubly robust estimation in generalized linear models

Nicola Orsini
Unit of Biostatistics and Unit of Nutritional Epidemiology
Institute of Environmental Medicine
Karolinska Institutet
Stockholm, Sweden
[email protected]

Rino Bellocco
Department of Statistics and Quantitative Methods
University of Milano–Bicocca
Milan, Italy
and
Department of Medical Epidemiology and Biostatistics
Karolinska Institutet
Stockholm, Sweden
[email protected]

Arvid Sjolander
Department of Medical Epidemiology and Biostatistics
Karolinska Institutet
Stockholm, Sweden
[email protected]

Abstract. A common aim of epidemiological research is to assess the association between a particular exposure and a particular outcome, controlling for a set of additional covariates. This is often done by using a regression model for the outcome, conditional on exposure and covariates. A commonly used class of models is the generalized linear models. The model parameters are typically estimated through maximum likelihood. If the model is correct, then the maximum likelihood estimator is consistent but may otherwise be inconsistent. Recently, a new class of estimators known as doubly robust estimators has been proposed. These estimators use two regression models, one for the outcome and one for the exposure, and are consistent if either model is correct, not necessarily both. Thus doubly robust estimators give the analyst two chances instead of only one to make valid inference. In this article, we describe a new Stata command, drglm, that implements the most common doubly robust estimators for generalized linear models.

Keywords: st0290, drglm, doubly robust, generalized linear model

© 2013 StataCorp LP   st0290


1 Introduction

A common aim of epidemiological research is to assess the association between a particular exposure and a particular outcome, controlling for a set of additional covariates. This is often done by fitting a regression model for the outcome, conditional on exposure and covariates. A commonly used class of models is the generalized linear models (GLMs). The model parameters are typically estimated through maximum likelihood (ML). If the model is correct, then the ML estimator is consistent but may otherwise be inconsistent.

When the mechanisms that bring about the outcome are well understood, the outcome is a natural target for regression modeling. Sometimes, the researcher may have a better understanding of the exposure mechanisms, in which case the exposure may be a more natural target. For example, this could be the case when the exposure is a treatment or a medical drug, which are typically assigned to patients according to reasonably well-defined protocols. Robins, Mark, and Newey (1992) showed that exposure regression models, like outcome regression models, can be used to estimate the conditional exposure–outcome association, given covariates.

Often the researcher may not have a strong preference for either modeling strategy, in which case a doubly robust (DR) estimator is attractive. A DR estimator requires one model for the outcome and one model for the exposure but is consistent if either model is correct, not necessarily both. Thus a DR estimator gives the researcher two chances instead of only one to make valid inference. Over the last decade, DR estimators have been developed for various parameters (see Bang and Robins [2005] and the references therein).

In this article, we describe a new Stata command, drglm, that implements DR estimators for GLMs. The article is organized as follows: In section 2, we establish notation and definitions and define the target estimand. In section 3, we review estimators that use outcome regression models, estimators that use exposure regression models, and DR estimators. The DR estimators that we review in section 3 are special cases of more general estimators developed in Robins (2000) and Tchetgen Tchetgen and Robins (2010). In section 4, we present the drglm command with syntax and options. In section 5, we carry out a simulation study to investigate the performance of the DR estimators, and in section 6, we describe a practical example.

2 Target parameter

Let A and Y denote the exposure and outcome of interest, respectively. Let L denote a vector of covariates that we wish to control for. We use p(·) generically for both population probabilities and densities, and we assume that data consist of n independent and identically distributed observations from p(Y, A, L). We use E(·) for population means and Ê(·) for sample means; that is, E(R) = ∫ r p(r) dr and Ê(R) = (1/n) Σ_{i=1}^{n} R_i for any random variable R.


A standard way to assess the conditional association between A and Y, given L, is to use a GLM of the form

g{E(Y|A, L; β, γ)} = βA + γ^T L    (1)

where β quantifies the conditional A-Y association, given L, and g(·) is a suitable link function. Typical link functions are the identity link (for continuous Y), the log link (for "counts"), and the logit link (for binary Y), for which β is a mean difference, a log risk-ratio, and a log odds-ratio, respectively. Typically, a constant term ("intercept") is included in the model. This can be achieved without changing notation by defining the first component of L to be the constant 1. The model in (1) has no interaction term between A and L; thus it assumes a constant strength of A-Y association on the scale defined by g(·) across levels of L. To allow for interactions between A and L and between separate components of L, we consider GLMs of the form

g{E(Y|A, L; β, γ)} = β^T AX + γ^T V    (2)

where X is a (p × 1)-dimensional function of L, and V is a (q × 1)-dimensional function of L. For instance, if L = (L1, L2), X = (1, L1), and V = (1, L1, L2, L1L2), then (2) reduces to

g{E(Y|A, L; β, γ)} = β0 A + β1 AL1 + γ0 + γ1 L1 + γ2 L2 + γ12 L1L2

The model in (2) consists of two parts. The part

m(A, L; β) = g{E(Y|A, L)} − g{E(Y|A = 0, L)} = β^T AX    (3)

quantifies the conditional A-Y association, given L, and is typically of main interest; we refer to it as the "main model". The parameter β in the main model (3) is our target parameter. The part

g{E(Y|A = 0, L; γ)} = γ^T V    (4)

is primarily included to control for L; we refer to it as the “outcome nuisance model”.

3 Estimators

3.1 Estimators that use the nuisance model for the outcome

We first consider an estimator of β that uses the outcome nuisance model for E(Y|A = 0, L) in (4). This estimator is obtained by solving the estimating equation

\hat{E}\left[ \begin{pmatrix} AX \\ V \end{pmatrix} \{Y - E(Y \mid A, L; \beta, \gamma)\} \right] = 0 \qquad (5)


for (β^T, γ^T)^T. We use βOBE to denote the first p elements of the solution to (5), where OBE stands for outcome-based estimation. Using the law of iterated expectations, we have that

E\left[ \begin{pmatrix} AX \\ V \end{pmatrix} \{Y - E(Y \mid A, L; \beta, \gamma)\} \right]
= E\left[ \begin{pmatrix} AX \\ V \end{pmatrix} E\{Y - E(Y \mid A, L; \beta, \gamma) \mid A, L\} \right]

which equals 0, so the estimating equation in (5) is unbiased when both (3) and (4) are correct. It follows from standard theory (Newey and McFadden 1994) that βOBE is consistent and asymptotically normal (CAN) when both (3) and (4) are correct.

In the standard use of GLMs, Y is assumed to follow a distribution in the exponential family, conditional on A and L. If g(·) is the canonical link function (for example, the identity link in the normal distribution, the log link in the Poisson distribution, and the logit link in the Bernoulli distribution), then βOBE is an ML estimator. βOBE is the default estimator produced by the glm command. We emphasize that βOBE is CAN even when it is not an ML estimator. The default standard errors produced by the glm command are consistent under the distributional assumption, but are generally inconsistent when the distributional assumption is incorrect. Consistent standard errors that do not rely on any distributional assumptions can be obtained through the "sandwich" formula by specifying the vce(robust) option in the glm command.
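As a concrete illustration (ours, with hypothetical variable names y, a, l1, and l2), the outcome-based estimate for an identity-link main model with outcome nuisance model γ0 + γ1L1 + γ2L2 can be obtained directly from glm:

. glm y a l1 l2, family(gaussian) link(identity) vce(robust)
    // the coefficient on a is the OBE estimate of beta; vce(robust) requests
    // sandwich standard errors that do not rely on the normality assumption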

3.2 Estimators that use the nuisance model for the exposure

We next consider estimators of β that use the nuisance model for the exposure. We first give a heuristic argument for the case when g(·) is the identity link. Suppose that the true value of β were known. We could then construct residuals of the form Y − m(A, L; β). These residuals unbiasedly predict E(Y|A = 0, L). Conditionally on L, E(Y|A = 0, L) is a constant and therefore uncorrelated with A. This argument suggests the following estimation strategy: find the value of β for which the residual Y − m(A, L; β) becomes conditionally uncorrelated with A, given L, in the sample. In terms of an estimating equation, we find the value of β that solves

Ê[X{A − E(A|L)}{Y − m(A, L; β)}] = 0    (6)

Equation (6) involves E(A|L), which typically is unknown. Therefore, we predict E(A|L) by using the exposure nuisance model of the form

h{E(A|L; α)} = α^T Z    (7)


where h(·) is a smooth link function not necessarily equal to g(·) used in the main model (3) and in the outcome model (4). Z is an (r × 1)-dimensional function of L, with the first element typically being the constant 1. We will allow for the identity link, the log link, and the logit link in the exposure nuisance model (7). We fit the model in (7) by solving the unbiased estimating equation for α,

Ê[Z{A − E(A|L; α)}] = 0

and we replace the true value of E(A|L) in (6) with the model-based prediction.

Combining these steps into one estimating equation for (β^T, α^T)^T gives

\hat{E}\left[ \begin{pmatrix} X\{A - E(A \mid L; \alpha)\}\{Y - m(A, L; \beta)\} \\ Z\{A - E(A \mid L; \alpha)\} \end{pmatrix} \right] = 0 \qquad (8)

We use βEBE to denote the first p elements of the solution to (8), where EBE stands for exposure-based estimation. Using the law of iterated expectations, we have that

E\left[ \begin{pmatrix} X\{A - E(A \mid L; \alpha)\}\{Y - m(A, L; \beta)\} \\ Z\{A - E(A \mid L; \alpha)\} \end{pmatrix} \right]
= E\left[ \begin{pmatrix} X \, E\{A - E(A \mid L; \alpha) \mid L\} \, E(Y \mid A = 0, L) \\ Z \, E\{A - E(A \mid L; \alpha) \mid L\} \end{pmatrix} \right] \qquad (9)

if (3) with the identity link is correct. If (7) is also correct, then the right-hand side of (9) equals 0, so the estimating equation in (8) is unbiased when both (3) with the identity link and (7) are correct. Thus βEBE is CAN when both (3) with the identity link and (7) are correct.

A minor modification is required when g(·) in (3) is the log link. For this link function, we replace Y − m(A, L; β) on the first p rows in (8) with Y e^{−m(A, L; β)}. Using the law of iterated expectations, we can easily show that this modified estimating equation is unbiased when both (3) with the log link and (7) are correct.

We now consider the case when g(·) is the logit link. For this link, we assume that both A and Y are binary (0/1). We use the nuisance model of the form

logit{E(A|Y = 0, L; δ)} = δ^T W    (10)

where W is an (s × 1)-dimensional function of L, with the first element typically being the constant 1. Because of the symmetry of the odds ratio, (3) with the logit link and (10) together define the joint model

logit{E(A|Y, L; β, δ)} = β^T YX + δ^T W

Under (3) with the logit link and (10), an ML estimator of (β^T, δ^T)^T is obtained by solving the estimating equation

\hat{E}\left[ \begin{pmatrix} YX \\ W \end{pmatrix} \{A - E(A \mid Y, L; \beta, \delta)\} \right] = 0 \qquad (11)

Using the law of iterated expectations, we can show that the estimating equation in (11) is unbiased when both (3) with the logit link and (10) are correct. For simplicity, we use βEBE to denote the first p elements of the solution to either (8) or (11).
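In the simplest logit–logit case with no exposure–covariate interactions in the main model (X = 1) and W = (1, L)^T, (11) is just the score equation of a logistic regression of A on Y and L. A sketch with hypothetical variable names, anticipating the worked example in section 6:

. logit a y l1 l2, vce(robust)
    // the coefficient on y is the exposure-based estimate of beta,
    // exploiting the symmetry of the odds ratio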


3.3 DR estimators

We finally consider DR estimators of β. We first consider the case when g(·) is the identity link. For this case, a DR estimator of β can be obtained by "combining" the estimating equations (5) and (8) into

\hat{E}\left[ \begin{pmatrix} X\{A - E(A \mid L; \alpha)\}\{Y - E(Y \mid A, L; \beta, \gamma)\} \\ \begin{pmatrix} AX \\ V \end{pmatrix} \{Y - E(Y \mid A, L; \beta^{\dagger}, \gamma)\} \\ Z\{A - E(A \mid L; \alpha)\} \end{pmatrix} \right] = 0 \qquad (12)

and solving for (β^T, β†^T, γ^T, α^T)^T. We use βDR to denote the first p elements of the solution to (12). It follows from a more general result in Robins (2000) that the estimating equation in (12) is unbiased if either (4) with the identity link or (7) is correct, together with the main model (3) with the identity link.¹ Thus βDR is CAN if either of the nuisance models is correct, not necessarily both.

A minor modification is required when g(·) is the log link. For this link function, we replace Y − E(Y|A, L; β, γ) = Y − m(A, L; β) − E(Y|A = 0, L; γ) on rows 1 through p in (12) with Y e^{−m(A, L; β)} − E(Y|A = 0, L; γ), and we replace Y − E(Y|A, L; β†, γ) = Y − m(A, L; β†) − E(Y|A = 0, L; γ) on rows p + q + 1 through 2p + q + 1 in (12) with Y e^{−m(A, L; β†)} − E(Y|A = 0, L; γ). Following Robins (2000), we can show that this modified estimating equation system is unbiased if either (4) with the log link or (7) is correct, together with the main model (3) with the log link.

We now consider the case when g(·) is the logit link. For this case, a DR estimator of β can be obtained by solving the estimating equation

\hat{E}\left[ \begin{pmatrix} X\{A - E^{*}(A \mid L; \beta, \gamma, \delta)\}\{Y - E(Y \mid A, L; \beta, \gamma)\} \\ \begin{pmatrix} AX \\ V \end{pmatrix} \{Y - E(Y \mid A, L; \beta^{\dagger}, \gamma)\} \\ \begin{pmatrix} YX \\ W \end{pmatrix} \{A - E(A \mid Y, L; \beta^{\ddagger}, \delta)\} \end{pmatrix} \right] = 0 \qquad (13)

for (β^T, β†^T, γ^T, β‡^T, δ^T)^T, where

E^{*}(A \mid L; \beta, \gamma, \delta) = \left[ 1 + \frac{\{1 - E(A \mid Y = 0, L; \delta)\}\, E(Y \mid A = 0, L; \gamma)}{E(A \mid Y = 0, L; \delta)\, E(Y \mid A = 1, L; \beta, \gamma)} \right]^{-1}

For simplicity, we use βDR to denote the first p elements of the solution to either (12) or (13). It follows from a more general result in Tchetgen Tchetgen and Robins (2010) that the estimating equation in (13) is unbiased if either (4) with the logit link or (10) is correct, together with the main model (3) with the logit link.²

1. Here we define (β†^T, γ^T, α^T)^T as the asymptotic solution to the last p + q + r rows in (12) whether (4) and (7) are misspecified or not. It follows that the last p + q + r rows in (12) are unbiased by definition.

2. Here we define (β†^T, γ^T, β‡^T, δ^T)^T as the asymptotic solution to the last p + q + p + s rows in (13) whether (4) and (10) are misspecified or not. It follows that the last p + q + p + s rows in (13) are unbiased by definition.


3.4 Standard errors

All estimators of β that we have considered in section 3 are generalized method of moments estimators, also referred to as Z-estimators (van der Vaart 1998). Specifically, they are the first p elements of the solution to an unbiased estimating equation of the form Ê{U(θ)} = 0, where θ = (β^T, η^T)^T and η is a nuisance parameter. It follows from general results on generalized method of moments estimators (Newey and McFadden 1994) that n^{1/2}(θ̂ − θ) is asymptotically normal with mean 0 and variance–covariance matrix

\Sigma = \left[ E\left\{ \frac{\partial U(\theta)}{\partial \theta^{T}} \right\} \right]^{-1} \mathrm{Var}\{U(\theta)\} \left( \left[ E\left\{ \frac{\partial U(\theta)}{\partial \theta^{T}} \right\} \right]^{-1} \right)^{T} \qquad (14)

A consistent estimator of Σ is obtained by replacing θ in (14) with the estimator θ̂ and the population moments in (14) with their sample counterparts.
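Because these are ordinary method-of-moments estimators, simple cases can be cross-checked with official Stata's gmm command, which reports sandwich standard errors by default. For example, the identity-link outcome-based estimating equation (5) with X = 1 and V = (1, L1, L2)^T could be written as follows (hypothetical variable names):

. gmm (y - {b}*a - {g0} - {g1}*l1 - {g2}*l2), instruments(a l1 l2)
    // the instrument list (plus the constant added by default) reproduces the
    // moment conditions in (5); {b} is the target parameter beta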

3.5 A note on the possible combinations of link functions

The DR estimators that we have considered in section 3.3 apply only to main models of the parametric form in (3) and to the combinations of link functions listed in table 1. In principle, it would be desirable to implement DR estimators that do not suffer from this limitation. In practice, though, such DR estimators typically require stronger modeling assumptions, or they may not even exist. For instance, when the outcome is binary and the exposure is continuous, it would be desirable to have a DR estimator that uses a logit link for the outcome and an identity link for the exposure. However, such an estimator requires not only a mean model for the exposure but also a fully specified model for the exposure distribution (Tchetgen Tchetgen and Robins 2010). This makes the estimator less robust and more computationally intensive. For binary outcomes and exposures, it would also be desirable to implement a DR estimator that uses probit links. However, to the best of our knowledge, no such DR estimator exists.

Table 1. Possible combinations of link functions

    main/outcome link    exposure link

        identity            identity
        identity            log
        identity            logit
        log                 identity
        log                 log
        log                 logit
        logit               logit


4 The drglm command

drglm provides DR estimates for the main model (3) in GLMs.

4.1 Syntax

drglm depvar expvar [if] [in] [, main(varlist) outcome(varlist) exposure(varlist)
        olink(linkname) elink(linkname) level(#) obe ebe eform vce(vcetype)]

The expvar (exposure, treatment, predictor, or covariate) must be numerical. After drglm estimation, one can use postestimation commands such as test, testparm, lincom, and predictnl.

Options

main(varlist) determines which variables are used in the main model part of the estimator. The constant 1 is always added to main(varlist). Then each variable in main(varlist) is multiplied by expvar and saved in the current dataset.

outcome(varlist) determines which variables are used in the outcome model part of the estimator. The constant 1 is always added to outcome(varlist).

exposure(varlist) determines which variables are used in the exposure model part of the estimator. The constant 1 is always added to exposure(varlist).

olink(linkname) specifies the link function of the outcome model (identity, logit, log). The default is olink(identity). If olink(logit) is specified, expvar can take on only two values (either 0 or 1).

elink(linkname) specifies the link function of the exposure model (identity, logit, log). The default is elink(identity).

level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is level(95) or as set by set level.

obe specifies the outcome-based estimation.

ebe specifies the exposure-based estimation.

eform reports coefficient estimates as exp(b) rather than as b.

vce(vcetype) specifies the type of standard error reported. vcetype may be robust, cluster clustvar, bootstrap, or jackknife. The default is vce(robust).
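To fix ideas, a minimal DR call with a hypothetical outcome y, exposure a, and covariates l1 and l2 might look as follows; this is an illustration of the option set above, not output from the article.

. drglm y a, main(l1) outcome(l1 l2) exposure(l1 l2) olink(logit) elink(logit)
    // main(l1) adds the product of a and l1 to the main model (a constant term
    // is always included); specifying both outcome() and exposure() requests
    // the doubly robust estimator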


Saved results

drglm saves the following in e():

Scalars
    e(N)             number of observations
    e(rank)          rank of e(V)

Macros
    e(cmd)           drglm
    e(cmdline)       command as typed
    e(depvar)        name of dependent variable
    e(vcetype)       title used to label Std. Err.
    e(properties)    b V
    e(olink)         link function of the outcome model
    e(elink)         link function of the exposure model
    e(estimator)     type of estimator (dr, obe, or ebe)

Matrices
    e(b)             coefficient vector
    e(V)             variance–covariance matrix of the estimators

Functions
    e(sample)        marks estimation sample

5 Simulation study

To demonstrate the double robustness of the implemented estimators, we present the results from two simulation studies.

5.1 Simulation 1

We generated 1,000 samples of 500 observations each from the model

L = (L1, L2),  L1 ⊥ L2
L1 ~ N(0, 1)
L2 ~ N(0, 1)
A | L ~ N{E(A|L), 1}
Y | A, L ~ N{E(Y|A, L), 1}
E(A|L) = α0 + α1 L1 + α2 L2 + α12 L1 L2                        (exposure nuisance model)
E(Y|A = 0, L) = γ0 + γ1 L1 + γ2 L2 + γ12 L1 L2                 (outcome nuisance model)
m(A, L) = E(Y|A, L) − E(Y|A = 0, L) = β0 A + β1 A L1           (main model)

with nuisance parameter η = (α0, α1, α2, α12, γ0, γ1, γ2, γ12) = (0, 1, 1, −1.5, −1, −1, −1, 1.5) and target parameter β = (β0, β1) = (1.5, 1). For each sample, we calculated βOBE, βEBE, and βDR by using correct models for E(A|L), E(Y|A = 0, L), and m(A, L). We calculated the mean estimates (over the 1,000 samples), the mean theoretical standard errors (as obtained from the sandwich formula), the empirical standard errors, and the empirical coverage probabilities of the corresponding 95% Wald confidence intervals (CIs).


This procedure was repeated twice: we first used correct models for E(Y|A = 0, L) and m(A, L) but the incorrect model E(A|L) = α0 + α1 L1 + α2 L2; we then used correct models for E(A|L) and m(A, L) but the incorrect model E(Y|A = 0, L) = γ0 + γ1 L1 + γ2 L2. Table 2 shows the results. All three estimators work well under correct model specifications. The mean estimates are close to the true value of β; the mean theoretical standard errors are close to the mean empirical standard errors; and the coverage probabilities of the CIs are very close to the nominal level of 95%. When the model for E(A|L) is misspecified, βEBE is biased. Similarly, when the model for E(Y|A = 0, L) is misspecified, βOBE is biased. βDR is unbiased even if either of these models is misspecified. The differences in empirical standard error for the three estimators are minor.
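For concreteness, a single sample from the design above could be generated as follows; this is our sketch of the stated data-generating model, not the authors' simulation code.

. clear
. set obs 500
. generate double L1 = rnormal()
. generate double L2 = rnormal()
. generate double A = 0 + 1*L1 + 1*L2 - 1.5*L1*L2 + rnormal()
    // E(A|L) with alpha = (0, 1, 1, -1.5) plus a standard normal error
. generate double Y = (1.5 + 1*L1)*A + (-1 - 1*L1 - 1*L2 + 1.5*L1*L2) + rnormal()
    // main model with beta = (1.5, 1), outcome nuisance model with
    // gamma = (-1, -1, -1, 1.5), plus a standard normal error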

Table 2. Simulation results for the estimates of β0 and β1. I: Correct models for E(A|L), E(Y|A = 0, L), and E(Y|A, L) − E(Y|A = 0, L); II: Correct models for E(Y|A = 0, L) and E(Y|A, L) − E(Y|A = 0, L) and incorrect model for E(A|L); III: Correct models for E(A|L) and E(Y|A, L) − E(Y|A = 0, L) and incorrect model for E(Y|A = 0, L).

               mean       mean theoretical     empirical      coverage
             estimate      standard error    standard error  probability
I
  β0,OBE       1.50             0.04              0.05            94
  β1,OBE       1.00             0.02              0.02            93
  β0,EBE       1.52             0.06              0.06            96
  β1,EBE       0.99             0.14              0.14            97
  β0,DR        1.50             0.05              0.05            94
  β1,DR        1.00             0.05              0.05            95
II
  β0,OBE       1.50             0.04              0.05            94
  β1,OBE       1.00             0.02              0.02            93
  β0,EBE       0.85             0.05              0.05             0
  β1,EBE       1.07             0.03              0.03            42
  β0,DR        1.50             0.04              0.05            94
  β1,DR        1.00             0.02              0.02            93
III
  β0,OBE       0.84             0.04              0.05             0
  β1,OBE       1.06             0.03              0.03            40
  β0,EBE       1.52             0.06              0.06            96
  β1,EBE       0.99             0.14              0.14            97
  β0,DR        1.51             0.06              0.06            96
  β1,DR        0.98             0.13              0.13            96


5.2 Simulation 2

We generated 1,000 samples of 500 observations each from the model

L = (L1, L2),  L1 ⊥ L2
L1 ~ N(0, 1)
L2 ~ N(0, 1)
A, Y ∈ {0, 1}
logit{E(A|Y = 0, L)} = α0 + α1 L1 + α2 L2 + α12 L1 L2                  (exposure nuisance model)
logit{E(Y|A = 0, L)} = γ0 + γ1 L1 + γ2 L2 + γ12 L1 L2                  (outcome nuisance model)
m(A, L) = logit{E(Y|A, L)} − logit{E(Y|A = 0, L)} = β0 A + β1 A L1     (main model)

with nuisance parameter η = (α0, α1, α2, α12, γ0, γ1, γ2, γ12) = (−1, 1, 1, −1.5, −1, −1, −1, 1.5) and target parameter β = (β0, β1) = (1.5, 1). For each sample, we calculated βOBE, βEBE, and βDR by using correct models for logit{E(A|Y = 0, L)}, logit{E(Y|A = 0, L)}, and m(A, L). We calculated the same summary measures as in simulation 1. This procedure was repeated twice: we first used correct models for logit{E(Y|A = 0, L)} and m(A, L) but the incorrect model logit{E(A|Y = 0, L)} = α0 + α1 L1 + α2 L2; we then used correct models for logit{E(A|Y = 0, L)} and m(A, L) but the incorrect model logit{E(Y|A = 0, L)} = γ0 + γ1 L1 + γ2 L2. Table 3 shows the results. All three estimators work well under correct model specifications. The mean estimates are close to the true value of β; the mean theoretical standard errors are close to the mean empirical standard errors; and the coverage probabilities of the CIs are very close to the nominal level of 95%. When the exposure nuisance model is misspecified, βEBE is biased. Similarly, when the outcome nuisance model is misspecified, βOBE is biased. βDR is unbiased even if either of these models is misspecified. The differences in empirical standard error for the three estimators are minor.


Table 3. Simulation results for the estimates of β0 and β1. I: Correct models for logit{E(A|Y = 0, L)}, logit{E(Y|A = 0, L)}, and logit{E(Y|A, L)} − logit{E(Y|A = 0, L)}; II: Correct models for logit{E(Y|A = 0, L)} and logit{E(Y|A, L)} − logit{E(Y|A = 0, L)} and incorrect model for logit{E(A|Y = 0, L)}; III: Correct models for logit{E(A|Y = 0, L)} and logit{E(Y|A, L)} − logit{E(Y|A = 0, L)} and incorrect model for logit{E(Y|A = 0, L)}.

               mean       mean theoretical     empirical      coverage
             estimate      standard error    standard error  probability
I
  β0,OBE       1.53             0.27              0.26            96
  β1,OBE       1.03             0.30              0.28            95
  β0,EBE       1.53             0.28              0.27            96
  β1,EBE       1.04             0.35              0.33            96
  β0,DR        1.54             0.28              0.28            95
  β1,DR        1.05             0.41              0.39            94
II
  β0,OBE       1.53             0.27              0.26            96
  β1,OBE       1.03             0.30              0.28            95
  β0,EBE       0.73             0.25              0.25            13
  β1,EBE       1.51             0.37              0.34            71
  β0,DR        1.53             0.28              0.27            96
  β1,DR        1.06             0.41              0.38            96
III
  β0,OBE       0.78             0.24              0.25            17
  β1,OBE       1.28             0.26              0.25            80
  β0,EBE       1.53             0.28              0.27            96
  β1,EBE       1.04             0.35              0.33            96
  β0,DR        1.54             0.28              0.27            96
  β1,DR        1.06             0.40              0.37            94

6 Example

Sjolander and Vansteelandt (2011) used data from the National March Cohort (NMC) (Bellocco et al. 2010) to illustrate the use of DR estimators of attributable fractions. We use the same dataset to illustrate the use of the drglm command. The NMC was established in 1997, when 300,000 Swedes participated in a national fund-raising event organized by the Swedish Cancer Society.


Every participant was asked to fill out a questionnaire that included items on known or suspected risk factors for cardiovascular disease (CVD). Using the Swedish patient registry, the NMC followed participants until 2006, and each CVD event was recorded. Sjolander and Vansteelandt (2011) considered a binary outcome cvd, with cvd = 1 if a subject developed CVD before the end of follow-up, and cvd = 0 otherwise. They considered a binary exposure bmi, with bmi = 0 for those subjects with baseline body mass index (BMI), defined as body weight in kilograms divided by height squared in meters, between 18.5 and 25 kg/m² and bmi = 1 for subjects with baseline BMI outside this range. The range 18.5 < BMI < 25 kg/m² is considered normal weight by the World Health Organization (World Health Organization 1995). Based on self-reported history of physical activity, Sjolander and Vansteelandt (2011) constructed a continuous measure. They controlled for both age at baseline (age) and the constructed measure of physical activity (pa). The dataset nmc_sj of 41,295 individuals is a sample that can be requested from the authors; it can be used only to reproduce the current analysis.
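For readers constructing such an exposure themselves, the dichotomization could be coded along the following lines, where bmicont is a hypothetical raw BMI variable (the distributed dataset already contains the derived bmi indicator):

. generate byte bmi = !(bmicont > 18.5 & bmicont < 25) if !missing(bmicont)
    // 1 if baseline BMI lies outside the WHO normal-weight range, 0 otherwise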

A standard way to assess the association between bmi and cvd, controlling for age and pa, is to use the logistic regression model logit{E(cvd|bmi, age, pa)} = βbmi + γ0 + γ1age + γ2pa. Fitting this model with the logit command gives the output below. The option vce(robust) is used to allow a comparison of the standard errors with the drglm command.

. use nmc_sj
(National Match Cohort - SJ version)

. logit cvd bmi age pa, vce(robust) nolog

Logistic regression                               Number of obs   =      41295
                                                  Wald chi2(3)    =    1345.18
                                                  Prob > chi2     =     0.0000
Log pseudolikelihood = -27190.223                 Pseudo R2       =     0.0253

                              Robust
         cvd        Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

         bmi     .1464322     .044115     3.32   0.001     .0599684    .2328959
         age     .0173548    .0006421    27.03   0.000     .0160964    .0186133
          pa    -.1348361    .0067156   -20.08   0.000    -.1479983   -.1216738
       _cons    -.7620794    .0434613   -17.53   0.000    -.8472621   -.6768968


If both the main model logit{E(cvd|bmi, age, pa)} − logit{E(cvd|bmi = 0, age, pa)} = βbmi and the outcome nuisance model logit{E(cvd|bmi = 0, age, pa)} = γ0 + γ1age + γ2pa are correct, then the estimate of β is consistent. An identical analysis is performed by using the drglm command with the option obe (outcome-based estimator).

. drglm cvd bmi, outcome(age pa) olink(logit) elink(logit) obe

Generalized Linear Models                         Number of obs   =      41295
Estimator: Outcome Based
Link functions: Outcome[logit] Exposure[logit]

                              Robust
         cvd        Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

main
         bmi     .1464322     .044115     3.32   0.001     .0599684    .2328959

As argued in section 3.2, a consistent estimate of β can also be obtained through the model logit{E(bmi|cvd, age, pa)} = βcvd + α0 + α1age + α2pa. Fitting this model gives the output below.

. logit bmi cvd age pa, vce(robust) nolog

Logistic regression                               Number of obs   =      41295
                                                  Wald chi2(3)    =    2618.37
                                                  Prob > chi2     =     0.0000
Log pseudolikelihood = -7552.0003                 Pseudo R2       =     0.1837

                              Robust
         bmi        Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

         cvd     .3012316    .0445681     6.76   0.000     .2138798    .3885834
         age     .0975369    .0019346    50.42   0.000     .0937451    .1013287
          pa    -.0545103    .0166798    -3.27   0.001    -.0872021   -.0218186
       _cons     -8.53162    .1471068   -58.00   0.000    -8.819944   -8.243296

If both the main model logit{E(bmi|cvd, age, pa)} − logit{E(bmi|cvd = 0, age, pa)} = βcvd and the exposure nuisance model logit{E(bmi|cvd = 0, age, pa)} = α0 + α1age + α2pa are correct, then the estimate of β is consistent. An identical analysis is performed by using the drglm command with the option ebe (exposure-based estimator).

. drglm cvd bmi, exposure(age pa) olink(logit) elink(logit) ebe

Generalized Linear Models                         Number of obs   =      41295
Estimator: Exposure Based
Link functions: Outcome[logit] Exposure[logit]

                              Robust
         cvd        Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

main
         bmi     .3012316    .0445681     6.76   0.000     .2138798    .3885834


A DR estimate of β that uses both nuisance models is obtained as follows:

. drglm cvd bmi, outcome(age pa) exposure(age pa) olink(logit) elink(logit)

Generalized Linear Models                         Number of obs   =      41295
Estimator: Double Robust
Link functions: Outcome[logit] Exposure[logit]

                              Robust
         cvd        Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

main
         bmi     .2991997    .0442286     6.76   0.000     .2125133    .3858861

By not specifying the option main(), the main model becomes equal to

logit {E(cvd|bmi, pa, age)} − logit {E(cvd|bmi = 0, pa, age)} = βbmi

Interpretation of the regression coefficient is usually done on an exponential scale (odds ratios rather than log odds-ratios). One can use either drglm's eform option or the postestimation command lincom. Compared with subjects with 18.5 < BMI < 25 kg/m², the odds of CVD for subjects with BMI < 18.5 or BMI > 25 were 31% higher (95% CI: [1.20, 1.43]).

. lincom bmi, eform

( 1) [main]bmi = 0

cvd exp(b) Std. Err. z P>|z| [95% Conf. Interval]

(1) 1.309507 .0579729 6.09 0.000 1.200672 1.428207

We observe that the DR estimate of β is very close to the estimate obtained through the exposure nuisance model (option ebe) but less close to the estimate obtained through the outcome nuisance model (option obe). This indicates that the exposure nuisance model may be reasonably correct, whereas the outcome nuisance model may suffer from more severe misspecifications.


We refined the nuisance models by taking into account nonlinearities for both age and pa. We modeled both quantitative covariates by using restricted cubic splines with three knots at fixed percentiles of the distribution.

. mkspline pas = pa, nk(3) cubic

. mkspline ages = age, nk(3) cubic

. drglm cvd bmi, outcome(ages1 ages2 pas1 pas2) exposure(ages1 ages2 pas1 pas2)
>     olink(logit) elink(logit)

Generalized Linear Models                         Number of obs   =      41295
Estimator: Double Robust
Link functions: Outcome[logit] Exposure[logit]

                              Robust
         cvd        Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

main
         bmi     .2696505    .0442708     6.09   0.000     .1828814    .3564196

With the refined outcome and exposure nuisance models, we obtained βOBE = 0.25 and βEBE = 0.27, respectively. Whereas the refinement resulted in a change in βOBE of (0.15 − 0.25)/0.15 = −67%, it resulted in a change in βEBE of only (0.30 − 0.27)/0.30 = 10%. This further indicates that the misspecification in the simple outcome nuisance model was more severe than the misspecification in the simple exposure nuisance model.

We next considered the hypothesis that the association between BMI and CVD may vary with physical activity. Therefore, we specify the main model of the form below by specifying the main(pa) option.

logit {E(cvd|bmi, pa, age)} − logit {E(cvd|bmi = 0, pa, age)} = β0bmi + β1bmipa

. drglm cvd bmi, main(pa) outcome(ages1 ages2 pas1 pas2)
>     exposure(ages1 ages2 pas1 pas2) olink(logit) elink(logit)

Generalized Linear Models                         Number of obs   =      41295
Estimator: Double Robust
Link functions: Outcome[logit] Exposure[logit]

                              Robust
         cvd        Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

main
         bmi     .2316385    .1228567     1.89   0.059    -.0091562    .4724332
       bmipa     .0106238    .0319545     0.33   0.740    -.0520059    .0732535


The variable bmipa is the product of bmi and pa, created internally by the drglm command. The coefficient of the interaction term, bmipa, is not statistically significant (p = 0.740). A test of no overall association between BMI and CVD is obtained with the postestimation command testparm.

. testparm bmi bmipa

 ( 1)  [main]bmi = 0
 ( 2)  [main]bmipa = 0

           chi2(  2) =   37.31
         Prob > chi2 =    0.0000

Because of the interaction between BMI and physical activity in the main model, to quantify the association between BMI (1 versus 0) and CVD, we need to consider a specific value for physical activity. The coefficient of BMI depends on physical activity via (β0 + β1pa). For example, the odds ratios of BMI for the minimal (0), median (4), and maximal (8) physical activity levels are calculated as follows:

. lincom _b[bmi] + _b[bmipa]*0, eform

( 1) [main]bmi = 0

cvd exp(b) Std. Err. z P>|z| [95% Conf. Interval]

(1) 1.260664 .154881 1.89 0.059 .9908856 1.603892

. lincom _b[bmi] + _b[bmipa]*4, eform

( 1) [main]bmi + 4*[main]bmipa = 0

cvd exp(b) Std. Err. z P>|z| [95% Conf. Interval]

(1) 1.315391 .0607068 5.94 0.000 1.20163 1.439921

. lincom _b[bmi] + _b[bmipa]*8, eform

( 1) [main]bmi + 8*[main]bmipa = 0

cvd exp(b) Std. Err. z P>|z| [95% Conf. Interval]

(1) 1.372493 .2028368 2.14 0.032 1.027339 1.833609


To present graphically how the odds ratio for CVD associated with BMI varies with physical activity (figure 1), we can use the convenient postestimation command predictnl.

. predictnl logor = _b[bmi] + _b[bmipa]*pa, ci(lo hi)
note: Confidence intervals calculated using Z critical values

. generate or = exp(logor)

. generate lb = exp(lo)

. generate ub = exp(hi)

. by pa, sort: generate flag = (_n == 1)

. twoway (line or lb ub pa, sort lp(l - -) lc(black black black)) if flag,
>     yscale(log) ytitle("Odds Ratio of BMI") xtitle("Physical activity")
>     legend(off) scheme(sj) ylabel(1(.2)1.8, angle(horiz) format(%3.2fc))

[Figure appears here: odds ratio of BMI (log scale, 1.00 to 1.80) plotted against physical activity (0 to 8), with dashed lines for the 95% confidence limits.]

Figure 1. Odds ratio for CVD associated with BMI as a function of physical activity

Although the logit link is by far the most common link for binary exposures and outcomes, all combinations listed in table 1 are possible. In table 4, we present βOBE, βEBE, and βDR together with the corresponding 95% CIs, obtained by using the main model g{E(cvd|bmi, age, pa)} − g{E(cvd|bmi = 0, age, pa)} = β, the outcome nuisance model g{E(cvd|bmi = 0, age, pa)} = γ0 + γ1age + γ2pa, and the exposure nuisance model h{E(bmi|age, pa)} = α0 + α1age + α2pa for each of the first six link-function combinations in table 1. We remind the reader that the interpretation of β depends on the choice of link function in the main model.
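The six identity- and log-link rows of table 4 can be produced by cycling over the link combinations; a do-file sketch (ours) is

foreach ol in identity log {
    foreach el in identity log logit {
        // DR estimate for this link combination; add the obe or ebe option
        // (and drop the unused nuisance model) for the other two columns
        drglm cvd bmi, outcome(age pa) exposure(age pa) olink(`ol') elink(`el')
    }
}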


Table 4. Estimated values of β using three estimators (outcome based, exposure based, and DR) and various combinations of link functions

  main/outcome   exposure
      link         link      β̂OBE    95% CI         β̂EBE    95% CI         β̂DR     95% CI

    identity     identity    0.04   [0.02, 0.06]    0.04   [0.02, 0.06]    0.04   [0.02, 0.06]
    identity     log         0.04   [0.02, 0.06]    0.08   [0.06, 0.10]    0.08   [0.06, 0.10]
    identity     logit       0.04   [0.02, 0.06]    0.07   [0.05, 0.10]    0.07   [0.05, 0.10]
    log          identity    0.06   [0.02, 0.11]    0.08   [0.03, 0.12]    0.06   [0.02, 0.10]
    log          log         0.06   [0.02, 0.11]    0.16   [0.12, 0.20]    0.16   [0.12, 0.21]
    log          logit       0.06   [0.02, 0.11]    0.15   [0.11, 0.19]    0.15   [0.11, 0.20]

Let us consider two alternative DR measures of association with logit as the exposure link. When the outcome link is identity, the regression coefficient is a difference in mean outcome.

. drglm cvd bmi, outcome(age pa) exposure(age pa) olink(identity) elink(logit)

Generalized Linear Models                         Number of obs   =      41295
Estimator: Double Robust
Link functions: Outcome[identity] Exposure[logit]

                              Robust
         cvd        Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

main
         bmi     .0738476    .0109156     6.77   0.000     .0524533    .0952418

The CVD risk difference comparing subjects with BMI < 18.5 or BMI > 25 with subjects with 18.5 < BMI < 25 kg/m² was 7% (95% CI: [5%, 10%]). If the outcome link instead is log, the regression coefficient is a log risk-ratio.

. drglm cvd bmi, outcome(age pa) exposure(age pa) olink(log) elink(logit) eform

Generalized Linear Models                         Number of obs   =      41295
Estimator: Double Robust
Link functions: Outcome[log] Exposure[logit]

                              Robust
         cvd       exp(b)   Std. Err.      z    P>|z|     [95% Conf. Interval]

main
         bmi     1.165457    .0248518     7.18   0.000     1.117752    1.215198

Compared with subjects with 18.5 < BMI < 25 kg/m², the risk of CVD for subjects with BMI < 18.5 or BMI > 25 was 17% higher (95% CI: [1.12, 1.22]).


7 Discussion

In this article, we have presented the new Stata command drglm, which carries out DR estimation in GLMs. The DR estimators use two regression models and are consistent if either model is correct, not necessarily both. In our simulated scenarios, the DR estimators were almost as efficient as the more "standard" estimators, which used only one regression model. Furthermore, in our simulated scenarios, the estimators that used only one regression model were severely biased whenever the model was incorrect. These results speak in favor of the DR estimators.

The target parameter β is a subpopulation parameter; it quantifies the conditional A-Y association, given covariates L (that is, the association in each subpopulation defined by a distinct level of L). In the special case when g(·) is the identity link or the log link and there are no interactions between A and L in the main model, β may be interpreted as a population parameter because of the collapsibility of mean differences and log risk-ratios. In the general case (that is, for a link function other than the identity link and the log link, or with interactions between A and L), it is possible to construct DR estimators for population parameters through inverse probability weighting. These methods have been implemented in Stata by Emsley et al. (2008).

In practice, it is unlikely for any model to be exactly correct. Several authors have investigated the performance of DR estimators in various contexts when both working models are misspecified (Bang and Robins 2005; Davidian, Tsiatis, and Leon 2005; Kang and Schafer 2007). These authors have drawn somewhat different conclusions. Bang and Robins (2005) state: "In our opinion, a DR estimator has the following advantage that argues for its routine use: if either the [outcome] model or the [exposure] model is nearly correct, then the bias of a DR estimator . . . will be small". In contrast, Kang and Schafer (2007) provided a simulated example where DR estimators were outperformed by estimators that rely on only one regression model, all involved models being moderately misspecified. They concluded that "two wrong models are not necessarily better than one".

8 Acknowledgments

Nicola Orsini was partly supported by a Young Scholar Award from the Karolinska Institutet's Strategic Program in Epidemiology. Arvid Sjolander acknowledges financial support from the Swedish Research Council (2008-5375). Rino Bellocco acknowledges financial support from the Italian Ministry of University and Research (PRIN 2009 X8YCBN).

9 References

Bang, H., and J. M. Robins. 2005. Doubly robust estimation in missing data and causal inference models. Biometrics 61: 962–973.


Bellocco, R., C. Jia, W. Ye, and Y. T. Lagerros. 2010. Effects of physical activity, bodymass index, waist-to-hip ratio and waist circumference on total mortality risk in theSwedish National March Cohort. European Journal of Epidemiology 25: 777–788.

Davidian, M., A. A. Tsiatis, and S. Leon. 2005. Semiparametric estimation of treatmenteffect in a pretest–posttest study with missing data. Statistical Science 20: 261–301.

Emsley, R., M. Lunt, A. Pickles, and G. Dunn. 2008. Implementing double-robustestimators of causal effects. Stata Journal 8: 334–353.

Kang, J. D. Y., and J. L. Schafer. 2007. Demystifying double robustness: A compari-son of alternative strategies for estimating a population mean from incomplete data.Statistical Science 22: 523–539.

Newey, W. K., and D. McFadden. 1994. Large sample estimation and hypothesis testing. In Handbook of Econometrics, ed. R. F. Engle and D. L. McFadden, vol. 4, 2111–2245. Amsterdam: Elsevier.

Robins, J. M. 2000. Robust estimation in sequentially ignorable missing data and causal inference models. In Proceedings of the American Statistical Association Section on Bayesian Statistical Science 1999, 6–10. Alexandria, VA: American Statistical Association.

Robins, J. M., S. D. Mark, and W. K. Newey. 1992. Estimating exposure effects by modelling the expectation of exposure conditional on confounders. Biometrics 48: 479–495.

Sjolander, A., and S. Vansteelandt. 2011. Doubly robust estimation of attributable fractions. Biostatistics 12: 112–121.

Tchetgen Tchetgen, E. J., and J. M. Robins. 2010. On doubly robust estimation in a semiparametric odds ratio model. Biometrika 97: 171–180.

van der Vaart, A. W. 1998. Asymptotic Statistics. Cambridge: Cambridge University Press.

World Health Organization. 1995. Physical status: The use and interpretation of anthropometry. Report of a WHO Expert Committee. Technical Report Series 854. Geneva: World Health Organization.

About the authors

Nicola Orsini is an associate professor of medical statistics and an assistant professor of epidemiology in the Unit of Biostatistics and Unit of Nutritional Epidemiology at the Institute of Environmental Medicine, Karolinska Institutet, Sweden.

Rino Bellocco is an associate professor of biostatistics at the Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Sweden, and at the Department of Statistics and Quantitative Methods, University of Milano–Bicocca, Italy.

Arvid Sjolander is a postdoc at the Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Sweden.


The Stata Journal (2013) 13, Number 1, pp. 206–211

Review of Data Analysis Using Stata, Third Edition, by Kohler and Kreuter

L. Philip Schumm
Department of Health Studies
University of Chicago
[email protected]

Abstract. This article reviews Data Analysis Using Stata, Third Edition, by Ulrich Kohler and Frauke Kreuter (2012 [Stata Press]).

Keywords: gn0056, data analysis, introductory, teaching, German Socio-Economic Panel

1 Introduction

In the latest edition of their introductory text, Kohler and Kreuter (2012) offer a substantial update covering new statistical material, new features introduced in Stata 11 and 12, and updated datasets used throughout the examples and exercises. I reviewed the first American edition of this book (Schumm 2005), and much of what I wrote in that review applies equally to this new edition. Yet rather than repeat my original comments, I shall highlight the new material and try to make some fresh remarks.

For those unfamiliar with the book, it is organized around the premise that data analysis is an activity best learned by doing—a premise few (if any) statisticians or data analysts would question. The authors have spent considerable effort developing a package of real datasets and do-files to accompany the book (these are easily installed from within Stata by using net get or downloaded by using a web browser), and much of the book is written as a tutorial assuming that the reader is following along on his or her computer. Despite this, most of the book may also be read productively without following along, as might be done by someone who already has some experience with Stata or data analysis but is looking to pick up some new skills.

The data used in the book are drawn primarily from the 2009 German Socio-Economic Panel, with a few exceptions including the well-known dataset on survival among passengers of the Titanic. Although the examples are intrinsically interesting (for example, an exploration of the wage gap between men and women), the choice of datasets and material to cover (for example, complex surveys and svy) gives the book a distinctly social science feel. Thus, while most of the Stata-related and statistical content is universally applicable, students from other fields may need to work a bit harder to apply what they are learning to their own data and analytic questions.

© 2013 StataCorp LP gn0056


Many who use Stata regularly do so because they believe that it offers unique advantages over other software for analyzing data. Although this book covers many of these features, it does so without identifying them as such and without any comparisons to other software. Thus teachers who assign the book may wish to explain up front why they chose Stata and what its strengths are to help motivate the students. Readers new to Stata who are considering this book for self-study can find such information summarized on StataCorp’s website.

2 Overview

This book covers three distinct, albeit related, topics: an introduction to Stata, an introduction to the practice of data management and analysis, and an introduction to statistical inference and modeling. These topics are interwoven such that the book may be worked through linearly by readers initially unfamiliar with all three. At the same time, those wishing to concentrate on a subset of these topics can do so easily by skipping some of the chapters and referring to them only as necessary (when a discussion depends on material from another chapter, this is usually well marked). In the preface, the authors provide suggestions for using the book to teach introductory courses in data analysis, regression, and the analysis of categorical data.

Introducing Stata receives the most comprehensive treatment. Chapter 1 (appropriately titled “The first time”) takes a new Stata user by the hand and dives right into Stata’s command-line interface (the command-line interface is used throughout the book to facilitate the discussion of examples and the use of do-files). This involves working through a Stata session involving loading and looking at a dataset; making some simple changes (recoding or labeling a variable); generating a few familiar summaries (mean, range, and standard deviation), tables, and a graph; and even fitting a linear regression model. This chapter does a good job of orienting a new Stata user and is written in a way that makes it accessible even to those with modest computer skills (as is true of the entire book).

Seven additional chapters are devoted primarily to instruction in the use of Stata. Chapter 3 provides an in-depth explanation of Stata’s command syntax, including (but not limited to) the use of different types of weights, Stata expressions, operators, and functions, and even the somewhat more advanced topics of the by prefix and the use of foreach loops. Chapter 4 provides an overview of how Stata’s statistical and estimation commands work and, in particular, how the results from such commands may be accessed programmatically for subsequent use. Chapter 5 covers the major issues in manipulating variables, including handling dates and times, missing values, and a brief discussion of storage types and precision. Of particular note, this chapter also shows how to use the underscore variables _n and _N together with by, which is one of Stata’s more powerful features. Chapter 6 provides a self-contained introduction to Stata’s graphics, covering several of the most commonly used graph types and general techniques for modifying and manipulating graphs (including use of the graph editor). Chapter 12 briefly covers the more advanced topics of macro usage and how to write Stata programs (both those defined in do-files and those defined in ado-files). Finally, chapter 13 provides information on keeping Stata current and on the various online resources available for learning more about Stata and for obtaining user-written commands (the Stata Journal, Statalist, and Statistical Software Components). In sum, a novice user who masters the material in these chapters will have attained sufficient proficiency to work effectively in Stata.
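
As a small illustration of the _n and _N mechanism under by: noted in the description of chapter 5 above (this is a generic sketch of my own with hypothetical variable names, not an excerpt from the book), one might type

    . by id (visitdate), sort: generate first = _n == 1
    . by id: generate nvisits = _N

The first line tags the first record within each id, sorting by visitdate; the second stores the number of records per id in every observation for that id.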

Data management and analysis are primarily covered in chapter 2 (“Working with do-files”), chapters 5 and 6 (described above), chapter 7 (“Describing and comparing distributions”), and chapter 11 (“Reading and writing data”). These chapters, perhaps with occasional reference to chapters 3 and 4, could be used by someone with some experience in Stata who wants to increase his or her skills in manipulating and summarizing data. Chapter 2 sets the tone perfectly by emphasizing the use of do-files (scripts containing Stata code) and providing guidance on organizing them within a project. These do-files then become the official record of one’s work and may be rerun at any time to recreate intermediate datasets or analytic results. Although written for Stata users, the ideas in this chapter are relevant for anyone who wants to make his or her data management and analyses more efficient and reproducible.

Chapter 7 uses some of the tools presented in previous chapters to examine the distributions of both discrete and continuous variables. The emphasis is on exploratory techniques without reference to formal statistical notation or analyses. Included are several ways to generate tables, the effective use of dot charts, box plots, histograms, kernel density estimates, quantile plots, and Q–Q plots. By encouraging the reader to look at and think about his or her data before moving to a more formal analysis, the book teaches that data analysis is a back-and-forth, investigative process as opposed to the mechanical application of a fixed set of techniques (this emphasis is then carried through the next three chapters on statistical inference and modeling).

Few analysts will be lucky enough to deal exclusively with Stata-format files. In chapter 11, the authors show how to read data from the three main types of text files (spreadsheet, free format, and fixed format), how to read data from various binary formats (Excel, SAS), and how to enter data by hand. This chapter also deals with combining data, including merging and appending, and with handling large datasets. Once again, although written for Stata, several of the ideas in this chapter are general concepts that would apply in some form regardless of the software package being used.

Lastly, chapters 8, 9, and 10 provide a nonmathematical introduction to statistical inference and modeling. Each chapter is relatively self-contained and can be read on its own. Chapter 8, titled “Statistical inference”, is new in this edition and is discussed below. Chapter 9 is a long chapter (87 pages) covering linear regression. This chapter begins with an intuitive example of the regression principle (modeling home size as a function of income) and then goes through the results available following a simple linear regression (coefficients, ANOVA table, F test, and R²). A section on multiple regression comes next, which includes an explanation of standardized coefficients and of the effects of adding a covariate to the model based on an added-variable plot. This is followed by sections on regression diagnostics, model-building techniques such as the use of categorical covariates, interaction terms and transformations, and methods for exploring and reporting regression results. The chapter ends with a whirlwind tour of median regression and regression models for panel data (including fixed-effects, random-effects, and population-averaged models).

Chapter 10 deals with regression models for discrete variables, focusing primarily on logistic regression. It begins with an exploration of the linear probability model as a way to motivate the logistic model and spends ample time explaining how to interpret the coefficients from a logistic regression. The chapter also introduces the maximum likelihood principle for fitting a model and the likelihood-ratio test for comparing nested models. Similarly to chapter 9, methods for assessing model fit and for checking model assumptions are discussed, as are building models with transformed covariates and interaction terms. Finally, the alternative probit model, multinomial logistic regression, and models for ordinal data (the stereotype and proportional odds models) are briefly introduced.

Throughout chapters 9 and 10, the emphasis is on model specification (that is, which covariates to include and in what way), checking model assumptions, and exploring the response surface as a function of the covariates. This is entirely appropriate for students first learning to analyze data. Teachers who wish to increase the emphasis on hypothesis testing and model-based inference can easily do so by using the ideas presented in chapter 8 or additional materials of their own.

3 New material

Among the new Stata features covered by the book, four are especially noteworthy. Factor variables (added in Stata 11) are introduced and demonstrated repeatedly, including their use in constructing quadratic and complicated interaction terms. Reading data directly from Excel files (added in Stata 12) is also demonstrated, which will be of particular interest to researchers in biological fields where raw data are often provided in this format. Multiple imputation using the mi suite of commands (first added in Stata 11 and substantially enhanced in Stata 12) features prominently in chapter 8 (see below). Finally, margins (added in Stata 11) and marginsplot (added in Stata 12) are now used as one of the main tools for interpreting and presenting the results of regression models. By making it easy to construct conditional-effects plots even in the presence of interaction or nonlinear terms, these commands encourage researchers to spend time exploring the models they fit (as opposed simply to looking at the significance of the coefficients). Because this is exactly the approach emphasized throughout the text, the addition of these commands strengthens the book considerably.
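
To give readers who have not yet seen these commands a flavor of what is involved (this is a generic sketch with hypothetical variable names, not an example taken from the book), a conditional-effects plot for a model with a quadratic term and an interaction takes only three commands:

    . regress income c.age##c.age i.sex##c.educ
    . margins sex, at(age=(25(10)65))
    . marginsplot

Here the factor-variable operators build the quadratic and interaction terms, margins computes predicted means by sex at selected ages, and marginsplot graphs them.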


The other major addition is a new chapter (chapter 8) devoted to statistical inference. All the discussion of inference (that is, standard errors and confidence intervals) has been relocated to this chapter, and readers are referred back to it from chapters 9 and 10 when discussing the estimated coefficients from those models. In addition to presenting the idea of a sampling distribution, the chapter includes a discussion of inference from complex samples (using Stata’s svy command), the use of poststratification and multiple imputation for handling nonresponse, and a brief discussion of causal inference.

In keeping with the rest of the book, random variables and their distributions are presented nonmathematically via the use of simulation. This is, of course, an excellent device for giving an intuitive sense of what bias is, of how the sampling distribution of a statistic changes as the sample size increases, and of how the central limit theorem works. It is also an excellent way of explaining confidence intervals, and the chapter includes an informative demonstration in this regard. Importantly, the authors do not merely present simulation results but provide the reader with the tools necessary to perform his or her own simulations (via the use of Stata’s random number functions and techniques for drawing repeated subsamples from an existing dataset).
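
The flavor of this approach can be conveyed by a short sketch of my own (not taken from the book) that approximates the sampling distribution of a sample mean by repeatedly drawing subsamples from an artificial population:

    clear
    set seed 1234
    set obs 10000
    generate y = rbinomial(1, 0.3)        // artificial population of 0/1 outcomes
    generate meanhat = .
    quietly forvalues i = 1/500 {
        preserve
        sample 100, count                 // draw a simple random sample of 100
        summarize y, meanonly
        local m = r(mean)
        restore
        replace meanhat = `m' in `i'
    }
    summarize meanhat                     // approximate sampling distribution

Plotting meanhat with histogram shows the roughly normal shape predicted by the central limit theorem.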

At the same time, I believe that the authors may have overreached a bit with this chapter. It is organized around the distinction between descriptive inference (making inferences about a fixed population from which you have obtained a sample) and causal inference (making inferences about an underlying data-generating mechanism) and even introduces the concept of counterfactuals. Yet although this is undoubtedly an important distinction, I’m afraid that the relatively short treatment provided here may raise more questions than it answers, especially among students who may not be working with survey data (for example, data from a scientific experiment or randomized clinical trial). In fairness, the authors themselves acknowledge that they had the same concern and have addressed this by providing numerous excellent references for those who wish to explore these issues further. Those teaching with the book may also want to provide additional materials and support for this chapter.

4 Final thoughts

With each new release of Stata, those who have been using it for a long time relish the new features; at the same time, the task faced by new users seems ever more daunting. Thus books like this that guide a new user through the initial learning process are arguably becoming even more valuable. Data Analysis Using Stata provides a broad introduction to the Stata software—one that does not assume any prior experience with statistical software or programming.

Selecting an introductory book for a technical subject can be difficult because many contain information that is misleading or in some cases downright incorrect. With books on data analysis and statistics, this often takes the form of reducing the subject to a series of cookbook-like steps, discouraging readers from exploring their own data and giving them a false sense of confidence. This book avoids these pitfalls and instead provides an accurate picture of how real data analysis should be done. In fact, by choosing to cover a broad range of material while still including many of the technical details, the authors open up the subject and encourage the motivated reader to pursue further study (the text is rich with carefully selected references to help with this).

5 References

Kohler, U., and F. Kreuter. 2012. Data Analysis Using Stata. 3rd ed. College Station, TX: Stata Press.

Schumm, L. P. 2005. Review of Data Analysis Using Stata by Kohler and Kreuter. Stata Journal 5: 594–600.

About the author

Phil Schumm is a statistical consultant, an assistant director of the Biostatistics Consulting Laboratory, and the director of the Research Computing Group in the Department of Health Studies at the University of Chicago.


The Stata Journal (2013) 13, Number 1, pp. 212–216

Review of Flexible Parametric Survival Analysis Using Stata: Beyond the Cox Model by Patrick Royston and Paul C. Lambert

Nicola Orsini
Unit of Nutritional Epidemiology
and
Unit of Biostatistics
Institute of Environmental Medicine, Karolinska Institutet
Stockholm, Sweden
[email protected]

Abstract. In this article, I review Flexible Parametric Survival Analysis Using Stata: Beyond the Cox Model, by Patrick Royston and Paul C. Lambert (2011 [Stata Press]).

Keywords: gn0057, flexible parametric survival models, survival analysis

1 Introduction

Flexible Parametric Survival Analysis Using Stata: Beyond the Cox Model, by Patrick Royston and Paul C. Lambert, is a welcome complement to the existing documentation on survival analysis using Stata ([ST] and Cleves et al. [2010]). The authors clearly define the goal of the book—to describe and illustrate the use and applications of flexible parametric survival models—and most certainly succeed in achieving it. The commands used in this book have been written by the authors and their collaborators, and it is clear that the authors bring considerable experience in the development, application, and teaching of statistical methods to this book. With a clear pedagogical and practical approach, the key features of flexible parametric survival models are explained, and the context of application, with its challenges and problems, is carefully described. Implementation of the user-friendly estimation and postestimation commands is clearly illustrated. Throughout the book, a rich set of worked examples and graphical representations of different quantities of interest in survival analysis are used to aid readers. At every stage, the similarities and differences between flexible parametric models and more traditional parametric (Poisson, Weibull), semiparametric (Cox), and nonparametric (Kaplan–Meier) models are highlighted. This book provides a thorough introduction to flexible parametric survival models and is suitable for data analysts regardless of their background or experience with Stata. It is also suitable for use as a textbook in a course on survival analysis.

© 2013 StataCorp LP gn0057


2 Overview of the book

Chapter 1 establishes the goals of the book, describes what the authors mean by going beyond the Cox model, and explains why one should be interested in knowing how to do that. After a brief review of the Cox model and its proportionality assumption, the authors illustrate how to obtain, represent graphically, and interpret the baseline hazard. The advantages listed in a concise paragraph are smoothed baseline hazard and survival functions, time-dependent associations, modeling on different scales, relative survival, out-of-sample prediction, and modeling on multiple time scales.

Chapter 2 presents two important and helpful Stata commands, stset and stsplit. This chapter can be very useful for readers not yet familiar with the suite of st commands. The authors explain some stset options commonly used in survival analysis, such as defining the units of time (months, years), restricting the follow-up time (a maximum of five years), and defining the time scale (time since entry, time since diagnosis, time from birth). The stsplit command is demonstrated with associations (time-dependent regression coefficients) or predictors (time-varying covariates) that change with the time scale. Several worked examples illustrating the use of the stsplit command, followed by estimation of Poisson, Cox, and Royston–Parmar models, can be found in chapters 4 and 7.

Chapter 3 describes the main motivating examples used in the book. Briefly, the outcomes are time from surgery to recurrence of or death from breast cancer (Rotterdam breast cancer data), time from breast cancer diagnosis to death (England and Wales breast cancer data), and time from birth to hip fractures (orchiectomy data). All of them are relatively large observational studies and are used to explain the advantages offered by flexible parametric models and their predictions in the analysis of such outcomes.

Chapter 4 shows how to use Poisson regression to flexibly model rates of an event. This is achieved by splitting the time scale in different ways. The authors start with a simple model, assuming that the baseline rate is constant throughout the time scale (exponential model). Next they allow the baseline rate to vary over the time scale by splitting the time scale into a number of intervals and then modeling changes of the rate over time with indicator variables (piecewise exponential model). By finely splitting the time scale, they can model the baseline rate like any other quantitative covariate by using fractional polynomials or splines. The relation between Poisson and Cox regression is illustrated using both examples and formulas. The chapter concludes with a useful discussion of possible advantages and disadvantages of using Poisson regression to model survival data.

Chapter 5 introduces Royston–Parmar regression, which generalizes the standard parametric regression models (Weibull, loglogistic, lognormal). These models, implemented in the stpm2 command, play a central role in this book. The main advantage of this approach over Poisson regression is that it uses one row for each individual (which avoids the data augmentation implied by stsplit) and therefore makes postestimation commands easier to use. Royston–Parmar models are motivated by showing the lack of fit of standard parametric models (exponential, Weibull) in an attempt to model years from surgery in the Rotterdam breast cancer data. The flexibility is obtained by modeling the log cumulative-hazard function as a smooth function of the log of time. A set of covariates is then added to the linear predictor for the log cumulative hazard, so the proportionality assumption for the covariates (regression coefficients that do not change with the time scale) continues to hold with Royston–Parmar regression. Nonproportionality, however, can be modeled, as shown later in chapter 7. Because flexibility is achieved by using spline transformations, the authors describe in detail how those spline transformations—more specifically, restricted cubic splines—are generated and how the choice of the spline model may affect the results. It is noteworthy that the findings based on Royston–Parmar models appear to be fairly insensitive to the number and, particularly, the location of the knots. The chapter continues with a more concise generalization of the loglogistic and lognormal (probit) models, demonstrating that results from standard parametric models can be obtained by choosing 1 degree of freedom for spline transformations. The Aranda–Ordaz family of link functions for survival models, of which cumulative hazards and cumulative odds are special cases, is described in section 5.6 with some useful advice on how to choose between them.
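
As an indication of how compact this is in practice, a minimal sketch (mine, not the book’s; the variable names are hypothetical, and the user-written stpm2 command must be installed, for example, with ssc install stpm2) might read

    . stset time, failure(died = 1) scale(12)
    . stpm2 treat age, df(4) scale(hazard) eform
    . predict h, hazard
    . predict s, survival

Here stset declares the survival data with time measured in years, stpm2 fits a proportional-hazards model with a 4 df restricted cubic spline for the baseline log cumulative hazard, and predict recovers smooth fitted hazard and survival functions.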

Chapter 6 focuses on prognostic models, essentially multivariable regression models used to predict the occurrence of future outcomes. It discusses the different aspects involved in developing and reporting a prognostic model. The various steps required to build a multivariable model, such as the choice of a suitable scale (cumulative hazard versus cumulative odds) and spline transformations, the selection of covariates, and the functional form for quantitative predictors, are illustrated with the Rotterdam breast cancer data as the motivating example. Once a final model is obtained, the authors emphasize the benefits provided by using flexible parametric Royston–Parmar models. Those benefits include being able to easily estimate a variety of interpretable quantities—unconditional and conditional survival probabilities, survival probabilities at specified centiles of the prognostic index, survival probabilities at given covariate values, differences between survival probabilities, centiles of the predicted survival distribution—by using the helpful predict postestimation command. The chapter then describes how to assess the goodness of fit of the multivariable model based on residuals and some summary measure of discrimination and explained variation. The final two sections are dedicated to out-of-sample predictions (to interpolate and extrapolate beyond the observed time points or to validate a prognostic model) and imputations of censored survival times (to visualize the relation between survival time and the prognostic index).

Chapter 7 illustrates how to estimate and present survival models when one or more regression coefficients vary with the time scale; this is known as time-dependent effects or nonproportionality. The nonproportionality is modeled within different frameworks: Cox, Poisson, and Royston–Parmar regression. Each regression model can be extended to model the fact that a regression coefficient is not constant over the time scale. It is clear that the differences across approaches are in terms of flexibility when modeling the interaction between the covariate and time, the amount of data handling, and the possibility to easily predict different quantities of interest such as survival probabilities, hazard rates, hazard ratios, and hazard differences. Thanks to powerful estimation and postestimation commands, the flexible parametric Royston–Parmar models perform the best in all aspects. Controlling separately the complexity of the spline transformations for the baseline distribution function and for each of the time-dependent regression coefficients could not be simpler with the stpm2 command. The family of Royston–Parmar models can offer an additional way of handling nonproportionality, that is, of changing the scale. Several times, the chapter emphasizes that proportionality of the regression coefficient is scale dependent. For instance, if cumulative odds are proportional, the hazard rates are unlikely to be proportional, and the hazard ratio will converge toward the null, that is, 1, as the follow-up time increases. The analysis of the orchiectomy data provides an excellent example of how multiple time scales can be defined (attained age and time since diagnosis).
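
As a rough sketch of the kind of syntax involved (again with hypothetical variable names, and not code taken from the book), a time-dependent coefficient for a binary covariate and the implied time-varying hazard ratio might be obtained with

    . stpm2 treat, df(4) scale(hazard) tvc(treat) dftvc(2)
    . predict hr, hrnumerator(treat 1) ci
    . twoway line hr _t, sort

where tvc() names the covariate whose coefficient varies with time, dftvc() sets the spline complexity for that variation, and the predicted hr traces the hazard ratio over the time scale.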

Chapter 8 describes relative survival for the analysis of population-based cancer studies. The idea is to compare the observed survival experience of patients with the expected survival of the general populations as provided by national registries. The related concepts of excess mortality and relative survival are explained and illustrated with the England and Wales breast cancer data. The chapter begins with a review of the traditional life-table approach for estimating relative survival based on the user-written strs command. Next the Poisson regression model is extended to include expected mortality. This is accomplished in the generalized linear model framework, which allows user-written link functions. The Royston–Parmar models are also extended to incorporate information on expected mortality, providing a unique and easy-to-use framework that works seamlessly for all-cause, cause-specific, and relative survival. Modeling issues in relative survival are similar to the standard survival analysis illustrated in previous chapters. As nonproportional excess hazard is common in cancer studies, the advantages offered by Royston–Parmar models in terms of reduced data handling, easy model specification, and a rich set of predictions (excess mortality rates, excess mortality rate ratios or differences, relative survival, and relative survival difference) are once again evident.

Chapter 9 describes how flexible parametric models can be useful to estimate relevant quantities in a variety of contexts. In clinical trials, the number needed to treat as a function of time can be estimated by the inverse of the difference in predicted survival probabilities (the option stdiff of predict). When one presents survival curves after multivariable models, the predicted survival probabilities as a function of time can be obtained with the mean covariate method (the options survival and at() of predict) or with the direct method (the options meansurv and at() of predict). When one models a continuous outcome with no censoring and with a distribution that varies with a continuous covariate, an outcome-dependent association can be handled with an outcome-varying regression coefficient (the options tvc() and dftvc() of stpm2). In the case of multiple events, different kinds of marginal models that take into account dependent outcome data are estimated within the framework of Royston–Parmar models. These allow for delayed entry, robust cluster standard errors, and event-specific baseline hazards. To help those considering a Bayesian approach in survival analysis, the chapter illustrates step by step how to fit a Royston–Parmar model in WinBUGS from Stata. In the presence of competing risks, modeling a cause-specific hazard is done with Royston–Parmar models, from which one can easily predict hazard and survival and obtain the cumulative incidence function by numerical integration. In the analysis of population-based cancer studies, current period analysis estimates of patient survival and crude probability of death can be computed once again by using a flexible parametric approach.

3 Conclusion

This insightful book clearly explains what flexible parametric survival models are and what they can offer compared with traditional methods. The main strength of this book is that it gently introduces the reader to the significant advantages of a flexible parametric approach in survival analysis. These advantages can be summarized in one word: predictions. The authors’ deep knowledge of flexible parametric survival models is evident in the clear way that model specification, prediction, and presentation are described. The Stata code and reproducible examples available on the book’s website, http://www.stata-press.com/data/fpsaus.html, can greatly facilitate the application of flexible parametric models in different areas, especially medical research. I highly recommend the methods presented in the book to any professional data analyst dealing with time-to-event, possibly censored, outcomes. Because this is an area of active research, I am confident that the authors will continue expanding the set of tools and range of applications presented in this book.

4 References

Cleves, M., W. Gould, R. G. Gutierrez, and Y. V. Marchenko. 2010. An Introduction to Survival Analysis Using Stata. 3rd ed. College Station, TX: Stata Press.

Royston, P., and P. C. Lambert. 2011. Flexible Parametric Survival Analysis Using Stata: Beyond the Cox Model. College Station, TX: Stata Press.

About the author

Nicola Orsini is an associate professor of medical statistics and an assistant professor of epidemiology in the Unit of Biostatistics and Unit of Nutritional Epidemiology at the Institute of Environmental Medicine, Karolinska Institutet, Sweden.


The Stata Journal (2013) 13, Number 1, pp. 217–219

Stata tip 114: Expand paired dates to pairs of dates

Nicholas J. Cox
Department of Geography
Durham University
Durham, UK

[email protected]

Often dates for transactions or events come in pairs, for example, start and finish or open and close. People arrive for an appointment at a clinic and later leave; people order goods, which are later delivered; people start work for an employer and later leave. In all cases, when we take a snapshot, we may find people still in the clinic, goods not yet delivered, or (it is hoped) people still working for the employer.

With such events, it is natural that each pair of events is often recorded as an observation (case, row, or record) in a dataset. Such a structure makes some calculations easy. For example, the differences between times of arrivals and departures or of orders and deliveries are key to system performance, although the ideal will be short delays for seeing patients or selling goods and long delays for periods of employment or lifetimes. In Stata, with two variables named, say, arrival and depart, the delay is just

. generate lapse = depart - arrival

Precisely how we record time will depend on the problem, but here I am imagining a date or date–time or time variable.

If the closing events have yet to happen, then depart may need to be recorded as missing. If so, lapse will in turn be missing. Notice, by the way, a simple data quality check that the time lapse can never be negative. Time lapses recorded as zeros might also need checking in some situations: Was the system really that fast and efficient (or ruthless)?
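
A minimal version of those checks, using the variable names above, is

. assert lapse >= 0
. list arrival depart if lapse == 0

Missing values of lapse do not trip the first line, because Stata treats missing as larger than any number; the second line lists the zero lapses for inspection.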

Such a simple data structure—one observation for each time interval—may also be awkward, which leads to the main reason for this tip. Experience with data structures in Stata might lead readers to suggest a reshape long, which could be a good idea, but there is an easier alternative, to use expand.

We need first a unique or distinct identifier for each interval, which may already exist. The command isid allows a simple check of whether an identifier variable indeed matches its purpose. If the identifier variable is broken or nonexistent, then something like

. generate long id = _n

creates a new identifier fit for the purpose. Specifying a long variable type allows over two billion distinct positive identifiers, should they be needed. Otherwise, we use the existing identifier. Then

. expand 2

© 2013 StataCorp LP dm0068


is the command needed. We turn each observation into two. The new observations are added at the end of the dataset, so we need to sort them before we can create two new variables that are the keys to other calculations. The first new variable combines the time information:

. by id, sort: generate time = cond(_n == 1, arrival, depart)

By virtue of the expand command, each distinct value of the identifier now occurs precisely twice. We can therefore use the framework provided by the by: prefix; see Cox (2002) for a tutorial if desired. Under by:, the observation number _n is interpreted within groups (here all pairs), and we assign time arrival to the first observation of two and time depart to the second.

Let’s imagine a small section of a toy dataset and apply our expansion method.

. list

id arrival depart

  1.    1     1000     1100
  2.    2     1100     1300
  3.    3     1200     1400

. expand 2
(3 observations created)

. by id, sort: generate time = cond(_n == 1, arrival, depart)

The second new variable flags whether each time is an arrival or departure. The usual kind of indicator or dummy variable with values 1 or 0 would serve, but values of 1 and −1 are even better, given that each arrival is an addition and each departure a subtraction.

. by id: generate inout = cond(_n == 1, 1, -1)

Typing (_n == 1) - (_n == 2) is another way to code this.

Now all we need to do is sort and calculate the net results of all events:

. sort time

. list, separator(0)

id arrival depart time inout

  1.    1     1000     1100     1000       1
  2.    1     1000     1100     1100      -1
  3.    2     1100     1300     1100       1
  4.    3     1200     1400     1200       1
  5.    2     1100     1300     1300      -1
  6.    3     1200     1400     1400      -1

Because the flag variable inout records additions and subtractions, so also its cumulative or running sum keeps track of the number inside the system. In a jargon common in economics, flows are used to calculate stocks, the assumption being that any stock from before the start of records would need to be added.

. generate present = sum(inout)

. list, separator(0)

id arrival depart time inout present

  1.    1     1000     1100     1000       1        1
  2.    1     1000     1100     1100      -1        0
  3.    2     1100     1300     1100       1        1
  4.    3     1200     1400     1200       1        2
  5.    2     1100     1300     1300      -1        1
  6.    3     1200     1400     1400      -1        0

This is only one trick, and others will depend on your problem. For example, if a clinic is only open daily, the number present should drop to zero at the end of each day. More generally, stocks cannot be negative. The logic of how your system operates provides a logic for your code and checks on data quality. If that logic implies separate accounting for each panel of panel or longitudinal data, then that merely implies a different sort order and operations under the aegis of by: (Cox 2002).
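
For example, with a panel identifier (say, a hypothetical variable clinic), sorting within panels and restarting the running sum is one line, because sum() restarts within by-groups:

. by clinic (time), sort: generate present = sum(inout)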

For a different problem arising with paired data, see Cox (2008).

References

Cox, N. J. 2002. Speaking Stata: How to move step by: step. Stata Journal 2: 86–102.

———. 2008. Stata tip 71: The problem of split identity, or how to group dyads. Stata Journal 8: 588–591.


The Stata Journal (2013) 13, Number 1, p. 220

Software Updates

st0236_2: Comparing coefficients of nested nonlinear probability models. U. Kohler, K. B. Karlson, and A. Holm. Stata Journal 11: 420–438.

The help file previously listed unpublished articles about the KHB method. These articles have now been accepted for publication or have already been published. These references have been updated in the help file.

© 2013 StataCorp LP up0039

