Post on 16-Aug-2021

Efficient Estimation for Staggered Rollout Designs

Jonathan Roth


Pedro H. C. Sant’Anna

Microsoft and Vanderbilt University

July 2021


• In many settings, treatments are rolled out to different units at different points in time.

• Social policies may be introduced in different locations at different times.

• Companies may introduce a new feature or marketing campaign to customers at different times.

• Researchers can often ensure that treatment timing is random by design.

• Alternatively, researchers might argue that treatment timing is quasi-random or as-good-as-random (which is somewhat common in Economics).

• This paper studies efficient estimation of causal effects in such (quasi-)randomized

staggered rollout designs


What is the status-quo method in these cases?


Current Practice

• Two-way fixed effects (TWFE) methods are commonly used, but we now know they may not recover interpretable treatment effect parameters (Borusyak and Jaravel, 2017; Athey and Imbens, 2021; Goodman-Bacon, 2018; de Chaisemartin and D’Haultfœuille, 2020; Sun and Abraham, 2020)

• Alternative DiD estimators that give more sensible estimands under heterogeneity have been proposed (Callaway and Sant’Anna, 2020; Sun and Abraham, 2020; de Chaisemartin and D’Haultfœuille, 2020)


• All of these procedures exploit different parallel trends assumptions, not random timing.

• Although technically weaker, parallel trends assumptions are routinely motivated by arguing that

treatment timing is “quasi-random”.

• In the absence of random treatment timing, DiD estimators may be sensitive to functional form restrictions (Roth and Sant’Anna, 2021).


What we do in this paper

• We introduce a design-based framework formalizing the notion of random treatment

timing (Imbens and Rubin, 2015; Athey and Imbens, 2021; Abadie, Athey, Imbens and Wooldridge, 2020;

Bojinov, Rambachan and Shephard, 2020; Rambachan and Roth, 2020; Xu, 2021)

• Consider estimation of a large class of causal parameters that aggregate average effects

across periods and cohorts

• Solve for the efficient estimator in a class of estimators that nests existing approaches

    • Sample analog plus linear adjustment for pre-treatment differences in outcome

• In our setting, pre-treatment outcomes play a similar role to fixed covariates in a

randomized experiment (Freedman, 2008b,a; Lin, 2013; Li and Ding, 2017; Imbens and Rubin, 2015)

• We provide both t-based and permutation-test based methods for randomization inference.


Does this matter in practice?


Application - Background

• Reducing police misconduct and use of force is an important policy objective.

• Wood, Tyler and Papachristos (2020a, PNAS) studied a randomized roll out of a procedural justice training program for police officers

    • Emphasized respect, neutrality, and transparency in the exercise of authority

• Original study found large & significant reductions in complaints/use of force

• But we discovered a statistical error in the original analysis!

    • They compared cohorts without normalizing by cohort size

• In Wood, Tyler, Papachristos, Roth and Sant’Anna (2020b), we re-analyzed data using the method of Callaway and Sant’Anna (2020)

    • No significant impacts on complaints; borderline significant effects on force; but CIs for all outcomes were wide



Table 1: Estimates and 95% CIs as a Percentage of Pre-treatment Means

Note: This table shows the pre-treatment means for the three outcomes. It also displays the estimates and 95% CIs in Figure ?? as percentages of these means, as well as the p-value from a Fisher Randomization Test (FRT). The final column shows the ratio of the CI length using the CS estimator relative to the plug-in efficient estimator.





• Finite population of units: $i = 1, \dots, N$

• $T$ periods: $t = 1, \dots, T$

• Unit $i$ is first treated at period $G_i \in \mathcal{G} \subset \{1, \dots, T\} \cup \{\infty\}$

    • $G_i = \infty$ denotes never treated.

• Treatment is an “absorbing state.”

• Potential outcomes: $Y_{i,t}(g)$ = $i$’s outcome in $t$ if first treated at $g$

• We observe $Y_{i,t} = \sum_g 1[G_i = g] \, Y_{i,t}(g)$

• Adopt a design-based framework: $Y_{i,t}(\cdot)$ and $N_g = \sum_i 1[G_i = g]$ are treated as fixed, and treatment timing $G_i$ is random



Two Key Assumptions

Assumption 1 (Random treatment timing)

Let $\mathbf{G} = (G_1, \dots, G_N)$. Then $P(\mathbf{G} = \mathbf{g}) = \left(\prod_{g \in \mathcal{G}} N_g!\right) / N!$ if $\sum_i 1[g_i = g] = N_g$ for all $g$, and zero otherwise.

• Any permutation of treatment timing that preserves group size is equally likely

Assumption 2 (No anticipation)

For all $i$, $t$ and $g, g' > t$, $Y_{i,t}(g) = Y_{i,t}(g')$.

• No Anticipation may fail if treatment timing is announced in advance (Malani and Reif, 2015)

• Note that No Anticipation allows for arbitrary treatment effect dynamics once treatment

has occurred.
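The random-timing assumption and the observation rule can be illustrated with a minimal sketch. It assumes a toy setup in which `math.inf` encodes the never-treated cohort; the function names `draw_timing` and `observed_outcomes` are hypothetical and not part of the paper.

```python
import math
import random

def draw_timing(cohort_sizes, rng):
    """Draw (G_1, ..., G_N) under Assumption 1.

    cohort_sizes maps each cohort label g to N_g. Shuffling a list that holds
    label g exactly N_g times makes every size-preserving assignment equally
    likely. math.inf encodes "never treated" (an encoding chosen for this sketch).
    """
    labels = [g for g, n_g in cohort_sizes.items() for _ in range(n_g)]
    rng.shuffle(labels)
    return labels

def observed_outcomes(potential, G):
    """Y_{i,t} = sum_g 1[G_i = g] * Y_{i,t}(g).

    potential[g][i] is unit i's outcome path (Y_{i,1}(g), ..., Y_{i,T}(g)).
    """
    return [list(potential[g][i]) for i, g in enumerate(G)]

# Toy population: 8 units, 2 periods, effect of 1 in period 2 if treated at g = 2.
G = draw_timing({2: 3, math.inf: 5}, random.Random(0))
potential = {2: [[0.0, 1.0]] * 8, math.inf: [[0.0, 0.0]] * 8}
Y = observed_outcomes(potential, G)
```

Because period-1 outcomes coincide across cohorts in this toy population, it also satisfies No Anticipation.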


Special Case: 2-Period Model

• Suppose $T = 2$ and $\mathcal{G} = \{2, \infty\}$, so some units are treated in period 2 and some are never treated

• Under Randomization and No Anticipation, this is analogous to a cross-sectional random experiment with $Y_{i,t=2}$ the outcome and $Y_{i,t=1}$ playing the role of a fixed covariate.

    • $Y_i = Y_{i,t=2}$;

    • $X_i = Y_{i,t=1} \equiv Y_{i,t=1}(\infty)$

    • $D_i = 1[G_i = 2]$


Causal Parameters of Interest



• With staggered treatment timing, there are many possible causal estimands.

Consider a flexible class of possible aggregations.

• Building block: Following Athey and Imbens (2021), let $\tau_{t,gg'}$ be the average effect on the outcome in period $t$ of switching treatment from $g'$ to $g$:

\[ \tau_{t,gg'} = \frac{1}{N} \sum_i \left[ Y_{i,t}(g) - Y_{i,t}(g') \right]. \]

• Consider a (scalar) estimand that aggregates these building blocks:

\[ \theta = \sum_{t,g,g'} a_{t,gg'} \, \tau_{t,gg'}. \]


Estimands in the Staggered Case

• In the staggered case, there are many possible ways of aggregating effects across cohorts

and time periods.

• One useful parameter is $ATE(t, g)$, the average effect at time $t$ of being treated at $g$ relative to being never treated:

\[ ATE(t, g) = \frac{1}{N} \sum_i \left[ Y_{i,t}(g) - Y_{i,t}(\infty) \right]. \]

• Following Callaway and Sant’Anna (2020), one might also be interested in summary

parameters that are weighted averages of ATE (t, g) along different dimensions.

• All these aggregations fit into our setup
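As one illustration of such an aggregation, an event-study style parameter at lag $l$ could average $ATE(g + l, g)$ over the cohorts observed $l$ periods after treatment, weighting by cohort size. The symbol $\theta^{ES}_l$ and the cohort-size weights are assumptions made for this sketch rather than definitions taken from the slides.

```latex
\theta^{ES}_l
  = \frac{1}{\sum_{g:\, g + l \le T} N_g}
    \sum_{g:\, g + l \le T} N_g \, ATE(g + l, g)
```

Since $ATE(t, g) = \tau_{t,g\infty}$, this is a special case of $\theta = \sum_{t,g,g'} a_{t,gg'} \tau_{t,gg'}$ with $a_{t,g\infty} = N_g / \sum_{h:\, h + l \le T} N_h$ when $t = g + l \le T$ and all other weights equal to zero.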


Class of estimators we consider



• Define $\hat\theta_0$ to be the sample plug-in estimator for $\theta$:

\[ \hat\theta_0 = \sum_{t,g,g'} a_{t,gg'} \, \hat\tau_{t,gg'}, \]

where $\hat\tau_{t,gg'} = \bar Y_{tg} - \bar Y_{tg'}$ and $\bar Y_{tg}$ is the sample mean of $Y_{i,t}$ for cohort $g$.

• We will consider the class of estimators of the form

\[ \hat\theta_\beta = \hat\theta_0 - \hat X'\beta, \]

where $\hat X$ is a vector guaranteed to be mean-zero by No Anticipation.

• Formally, each element of $\hat X$ aggregates differences between groups before either is treated:

\[ \hat X_j = \sum_{(t,g,g'):\, g,g' > t} b^j_{t,gg'} \, \hat\tau_{t,gg'}. \]


Example: 2 period Example

• Set-up: Two periods (T = 2). Units treated in period 2 or never (G = {2,∞})

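• Target parameter: Average treatment effect (ATE) in period 2:

\[ \theta = \tau_{2,2\infty} = \frac{1}{N} \sum_i \left[ Y_{i,t=2}(2) - Y_{i,t=2}(\infty) \right] \]

• Class of estimators:

    • $\hat\theta_0$ is the simple difference in means at $t = 2$: $\bar Y_{t=2,g=2} - \bar Y_{t=2,g=\infty}$

    • $\hat X$ is the pre-treatment difference in means: $\bar Y_{t=1,g=2} - \bar Y_{t=1,g=\infty}$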


Example: 2 period Example

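• Our proposed estimator is of the form

\[ \hat\theta_\beta = \hat\theta_0 - \hat X'\beta. \]

• The difference-in-differences estimator is

\[ \hat\theta_1 = \hat\theta_0 - \hat X = (\bar Y_{t=2,g=2} - \bar Y_{t=2,g=\infty}) - (\bar Y_{t=1,g=2} - \bar Y_{t=1,g=\infty}), \]

which corresponds to $\hat\theta_\beta$ with $\beta = 1$.

• In this simple 2x2 case, $\hat\theta_\beta = \hat\theta_0 - \hat X'\beta$ is isomorphic to the class of regression-adjusted estimators in Freedman (2008b,a); Lin (2013)

• They consider $\hat\tau(\beta_1, \beta_0) = \hat\theta_0 - \beta_1' \hat X_1 + \beta_0' \hat X_0 = \hat\theta_\beta$, for $\beta = \frac{N_1}{N} \beta_0 + \frac{N_0}{N} \beta_1$.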
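The two-period class of estimators can be written out in a few lines. This is a minimal sketch under the slide's two-period setup; the function name `theta_beta` and the list-based inputs are assumptions made for illustration.

```python
from statistics import mean

def theta_beta(Y1, Y2, D, beta):
    """theta_hat_beta = theta_hat_0 - beta * X_hat in the two-period case.

    Y1, Y2: outcomes in periods 1 and 2; D[i] is truthy if unit i is treated
    in period 2. beta = 0 gives the simple difference in means and beta = 1
    gives difference-in-differences.
    """
    treated = [i for i, d in enumerate(D) if d]
    control = [i for i, d in enumerate(D) if not d]
    # Post-period difference in means (theta_hat_0).
    theta0 = mean(Y2[i] for i in treated) - mean(Y2[i] for i in control)
    # Pre-period difference in means (X_hat), mean-zero under random timing.
    X_hat = mean(Y1[i] for i in treated) - mean(Y1[i] for i in control)
    return theta0 - beta * X_hat

Y1, Y2, D = [1, 2, 3, 4], [3, 4, 3, 4], [1, 1, 0, 0]
diff_in_means = theta_beta(Y1, Y2, D, beta=0)   # 0.0
diff_in_diff = theta_beta(Y1, Y2, D, beta=1)    # 2.0
```

In this toy data the treated cohort happens to start from a lower baseline, so the two choices of β give different answers; the efficient choice of β is the one that minimizes the randomization variance.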


Estimators in this class

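• Several previously proposed estimators correspond with $\hat\theta_1 = \hat\theta_0 - \hat X$ for an appropriately specified $\hat\theta_0$ and $\hat X$.

• Callaway and Sant’Anna (2020) consider estimators that aggregate 2x2 diff-in-diff:

\[ \hat\tau^{CS}_w = \sum_{t,g} w_{t,g} \Big[ \underbrace{(\bar Y_{t,g} - \bar Y_{t,\infty})}_{\text{Diff in period } t} - \underbrace{(\bar Y_{g-1,g} - \bar Y_{g-1,\infty})}_{\text{Diff in period } g-1} \Big]. \]

• This can be viewed as an estimator of the form $\hat\theta_0 - \hat X$, where

\[ \hat\theta_0 = \sum_{t,g} w_{t,g} \, \hat\tau_{t,g\infty} \quad \text{and} \quad \hat X = \sum_{t,g} w_{t,g} \, \hat\tau_{g-1,g\infty}. \]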
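The aggregation of 2x2 comparisons can be sketched as follows. This is an illustrative toy implementation of the formula above with never-treated units as the comparison group, not the authors' or any package's code; the function names, the `math.inf` encoding of never-treated units, and the weights are assumptions.

```python
import math
from statistics import mean

def cs_2x2(Y, G, t, g):
    """(Ybar_{t,g} - Ybar_{t,inf}) - (Ybar_{g-1,g} - Ybar_{g-1,inf}).

    Y[i][t - 1] is unit i's outcome in period t (periods are 1-indexed);
    G[i] is unit i's first treatment period, math.inf if never treated.
    """
    def cohort_mean(period, cohort):
        return mean(Y[i][period - 1] for i, g_i in enumerate(G) if g_i == cohort)

    post = cohort_mean(t, g) - cohort_mean(t, math.inf)          # diff in period t
    pre = cohort_mean(g - 1, g) - cohort_mean(g - 1, math.inf)   # diff in period g-1
    return post - pre

def cs_aggregate(Y, G, weights):
    """Weighted sum of 2x2 comparisons; weights maps (t, g) to w_{t,g}."""
    return sum(w * cs_2x2(Y, G, t, g) for (t, g), w in weights.items())

Y = [[1, 3, 4], [1, 3, 4], [0, 1, 1], [0, 1, 1]]
G = [2, 2, math.inf, math.inf]
estimate = cs_aggregate(Y, G, {(2, 2): 0.5, (3, 2): 0.5})   # 1.5
```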


Related Staggered Estimators

• Several variants of the Callaway and Sant’Anna (2020) estimator have been proposed that can likewise be cast into this class

• Callaway and Sant’Anna (2020) propose an alternative estimator using not-yet-treated instead of never-treated as the comparison

• Sun and Abraham (2020) propose a similar estimator using last-to-be-treated as the comparison

• de Chaisemartin and D’Haultfœuille (2020)’s estimator is equivalent to the Callaway and Sant’Anna (2020) estimator for a particular choice of weights, corresponding with the event-study at lag 0


The Efficient Estimator



Proposition 1 (Unbiasedness)

The estimator $\hat\theta_\beta = \hat\theta_0 - \hat X'\beta$ is unbiased over the randomization distribution for any $\beta$:

\[ E[\hat\theta_\beta] = \theta \quad \text{for all } \beta. \]
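Unbiasedness over the randomization distribution can be checked exactly in a tiny two-period population by enumerating every equally likely assignment. The sketch below is only a numerical illustration under that toy setup; the function name `randomization_mean` and the example numbers are assumptions.

```python
from itertools import combinations
from statistics import mean

def randomization_mean(Y1, Y2_treated, Y2_never, n_treated, beta):
    """Exact mean of theta_hat_beta over all equally likely assignments.

    Y1: period-1 outcomes (the same under either timing, by No Anticipation).
    Y2_treated / Y2_never: period-2 potential outcomes Y_{i,2}(2), Y_{i,2}(inf).
    """
    N = len(Y1)
    estimates = []
    for treated in combinations(range(N), n_treated):
        control = [i for i in range(N) if i not in treated]
        theta0 = (mean(Y2_treated[i] for i in treated)
                  - mean(Y2_never[i] for i in control))
        X_hat = mean(Y1[i] for i in treated) - mean(Y1[i] for i in control)
        estimates.append(theta0 - beta * X_hat)
    return mean(estimates)

Y1 = [1.0, 4.0, 2.0, 7.0]
Y2_treated = [3.0, 5.0, 9.0, 8.0]
Y2_never = [1.0, 2.0, 6.0, 4.0]
theta = mean(a - b for a, b in zip(Y2_treated, Y2_never))   # 3.0
for beta in (0.0, 0.7, -2.0):
    # The randomization mean matches theta up to floating-point error.
    assert abs(randomization_mean(Y1, Y2_treated, Y2_never, 2, beta) - theta) < 1e-9
```

The choice of β changes the spread of the estimates across assignments but not their mean, which is what leaves room to pick β for efficiency.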


Efficient Estimator

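Proposition 2

The variance of $\hat\theta_\beta = \hat\theta_0 - \hat X'\beta$ is uniquely minimized at

\[ \beta^* = \Big( \underbrace{\mathrm{Var}[\hat X]}_{= V_X} \Big)^{-1} \underbrace{\mathrm{Cov}[\hat X, \hat\theta_0]}_{= V_{X,\theta_0}} \]

if $V_X$ is positive definite.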
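The reasoning behind this result is a standard quadratic minimization. In the sketch below, $V_{\theta_0} = \mathrm{Var}[\hat\theta_0]$ is shorthand introduced here for illustration.

```latex
\begin{aligned}
\mathrm{Var}[\hat\theta_\beta]
  &= \mathrm{Var}[\hat\theta_0 - \hat X'\beta]
   = V_{\theta_0} - 2\beta' V_{X,\theta_0} + \beta' V_X \beta, \\
\frac{\partial}{\partial \beta} \mathrm{Var}[\hat\theta_\beta]
  &= -2 V_{X,\theta_0} + 2 V_X \beta = 0
  \quad\Longrightarrow\quad
  \beta^* = V_X^{-1} V_{X,\theta_0}, \\
\mathrm{Var}[\hat\theta_{\beta^*}]
  &= V_{\theta_0} - V_{X,\theta_0}' V_X^{-1} V_{X,\theta_0}.
\end{aligned}
```

The Hessian $2 V_X$ is positive definite by assumption, so the objective is strictly convex and the minimizer is unique. The last line shows that the efficient estimator's variance is never larger than that of $\hat\theta_0$.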


Solving for the Variance

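Recall that $\hat\theta_0$ and $\hat X$ are both linear functions of cohort sample means $\bar Y_g$.

Can write them as:

\[ \hat\theta_0 = \sum_g A_{\theta,g} \bar Y_g \quad \text{and} \quad \hat X = \sum_g A_{0,g} \bar Y_g. \]

Can apply Li and Ding (2017)’s results for experiments with multiple outcomes/treatments:

Proposition 3

\[
\mathrm{Var}\begin{pmatrix} \hat\theta_0 \\ \hat X \end{pmatrix}
=
\begin{pmatrix}
\sum_g N_g^{-1} A_{\theta,g} S_g A_{\theta,g}' - N^{-1} S_\theta & \sum_g N_g^{-1} A_{\theta,g} S_g A_{0,g}' \\
\sum_g N_g^{-1} A_{0,g} S_g A_{\theta,g}' & \sum_g N_g^{-1} A_{0,g} S_g A_{0,g}'
\end{pmatrix},
\]

where $S_g = \mathrm{Var}_f[Y_i(g)]$ and $S_\theta = \mathrm{Var}_f\left[\sum_g A_{\theta,g} Y_i(g)\right]$.

• Depends on estimable variances of potential outcomes ($S_g$), and non-estimable variances of treatment effects $S_\theta$.

• But $\beta^*$ depends only on estimable quantities, not on heterogeneous treatment effects.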

Properties of the Plug-In Efficient Estimator


The Plug-In Estimator

• So far we have solved for the efficient $\beta^*$, but it depends on the variances of potential outcomes $S_g$, which are typically not known ex ante.

• Consider the feasible plug-in efficient estimator based on $\hat\beta^*$, which replaces $S_g$ with a sample analog $\hat S_g$ in the expression for $\beta^*$.

    • $\hat S_g = \frac{1}{N_g - 1} \sum_i 1[G_i = g] (Y_i - \bar Y_g)(Y_i - \bar Y_g)'$.

• Will show that in large populations the plug-in estimator $\hat\theta_{\hat\beta^*}$ has similar properties to the “oracle” estimator $\hat\theta_{\beta^*}$.
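For the two-period special case, the plug-in coefficient reduces to a ratio of two scalars built from within-cohort sample (co)variances. The sketch below works under that assumption; the function names are hypothetical, and the cohort-level sums follow the variance formula with $A_{\theta,g}$ picking out the period-2 outcome and $A_{0,g}$ the period-1 outcome.

```python
from statistics import mean

def _cov(a, b):
    """Sample covariance with the 1 / (n - 1) normalization used for S_hat_g."""
    a_bar, b_bar = mean(a), mean(b)
    return sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b)) / (len(a) - 1)

def plug_in_efficient_2period(Y1, Y2, D):
    """Return (theta_hat at beta_hat_star, beta_hat_star) for T = 2.

    Var[X_hat]            is estimated by sum_g S_hat_g[Y1, Y1] / N_g
    Cov[X_hat, theta_hat_0] is estimated by sum_g S_hat_g[Y1, Y2] / N_g
    """
    var_X = cov_X_theta0 = 0.0
    means = {}
    for d in (True, False):
        y1 = [y for y, di in zip(Y1, D) if bool(di) == d]
        y2 = [y for y, di in zip(Y2, D) if bool(di) == d]
        var_X += _cov(y1, y1) / len(y1)
        cov_X_theta0 += _cov(y1, y2) / len(y1)
        means[d] = (mean(y1), mean(y2))
    beta_hat = cov_X_theta0 / var_X
    theta0 = means[True][1] - means[False][1]   # post-period difference in means
    X_hat = means[True][0] - means[False][0]    # pre-period difference in means
    return theta0 - beta_hat * X_hat, beta_hat

# Toy data where Y2 = Y1 + 2 * D exactly, so beta_hat is 1 and the estimate is 2.
Y1 = [1, 2, 3, 4, 5, 6]
D = [1, 1, 1, 0, 0, 0]
Y2 = [y + 2 * d for y, d in zip(Y1, D)]
estimate, beta_hat = plug_in_efficient_2period(Y1, Y2, D)
```

When the pre-period outcome is only weakly correlated with the post-period outcome, the same formula shrinks `beta_hat` toward 0, moving the estimator from difference-in-differences toward the simple difference in means.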


Large population asymptotics

• Consider a sequence of populations in which Ng grows large for all g , satisfying certain

regularity conditions

Assumption 3

(i) Cohort shares converge to a constant:

• For all g ∈ G, Ng/N → pg ∈ (0, 1).

(ii) Variances of potential outcomes converge to a constant:

• For all $g, g'$, $S_g$ and $S_{gg'}$ have limiting values denoted $S^*_g$ and $S^*_{gg'}$, respectively, with $S^*_g$ positive definite.

(iii) No individual dominates the variance of potential outcomes (Lindeberg-type condition):

• $\max_{i,g} \| Y_i(g) - \bar Y(g) \|^2 / N \to 0$.


Asymptotic Properties of the Plug-In Estimator

• Under the given asymptotic conditions, the plug-in efficient estimator is asymptotically

normally distributed with the same variance as the “oracle” efficient estimator.

Proposition 4

Under the given asymptotic conditions,

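\[ \sqrt{N} \, (\hat\theta_{\hat\beta^*} - \theta) \to_d N(0, \sigma_*^2), \quad \text{where } \sigma_*^2 = \lim_{N \to \infty} N \, \mathrm{Var}[\hat\theta_{\beta^*}]. \]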


Variance Estimation


Covariance Estimation

• As is common in finite-population settings, the variance of $\hat\theta_{\beta^*}$ can only be estimated conservatively.

• The issue is that the variance of $\hat\theta_{\beta^*}$ contains the term $-S_\theta = -\mathrm{Var}_f\left[\sum_g A_{\theta,g} Y_i(g)\right]$. This is not consistently estimable since it depends on covariances of potential outcomes that are never observed together.

• A natural conservative approach is the Neyman-style variance estimate, which ignores $S_\theta$ and replaces $S_g$ with $\hat S_g$ in the variance formula.

• In the paper, we show that a less conservative variance estimator can be obtained by estimating the part of $S_\theta$ explained by $\hat X$.

    • Mirrors the use of pre-treatment covariates in Lin (2013); Abadie et al. (2020)
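In the two-period case the Neyman-style estimate has a simple form: with β held fixed, the estimator is a difference in cohort means of the adjusted outcome $Y_{i,2} - \beta Y_{i,1}$, so the conservative variance is the sum of within-cohort sample variances of that adjusted outcome divided by cohort size. The sketch below illustrates only this conservative version (not the refinement mentioned in the last bullet); the function name is hypothetical.

```python
from statistics import variance

def neyman_variance_2period(Y1, Y2, D, beta):
    """Conservative variance estimate for theta_hat_beta when T = 2.

    Drops the non-estimable S_theta term and plugs in sample variances:
    sum over cohorts of Var_hat(Y2 - beta * Y1 | cohort) / N_g.
    beta is treated as fixed; this sketch does not account for its estimation.
    """
    total = 0.0
    for d in (True, False):
        adjusted = [y2 - beta * y1
                    for y1, y2, di in zip(Y1, Y2, D) if bool(di) == d]
        total += variance(adjusted) / len(adjusted)   # sample variance, n - 1
    return total

Y1 = [1, 2, 3, 4, 5, 6]
Y2 = [3, 4, 5, 4, 5, 6]
D = [1, 1, 1, 0, 0, 0]
v_unadjusted = neyman_variance_2period(Y1, Y2, D, beta=0)   # 1/3 + 1/3
v_did = neyman_variance_2period(Y1, Y2, D, beta=1)          # 0 in this toy data
```

A standard error is the square root of this quantity, and a t-based interval uses it in the usual way.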


What about Fisher Randomization Tests?


Fisher Randomization Tests

• An alternative approach to inference uses Fisher Randomization Tests (FRTs)

• We show that an FRT using a studentized version of the efficient estimator has dual advantages:

    1. it has exact size under the sharp null of no treatment effects for all units;

    2. it is asymptotically valid for the weak null that θ = 0.

• Studentization is key!

• In general, (unstudentized) FRTs may not have correct size for such weak null hypotheses, even asymptotically (Wu and Ding, 2020).

• We build on Wu and Ding (2020) and Zhao and Ding (2020) to show that studentization bypasses this problem: the FRT is asymptotically equivalent to testing whether 0 falls within the t-based confidence interval CI∗∗
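A studentized FRT can be sketched for the two-period case as follows. This is a minimal Monte Carlo illustration, not the paper's implementation: it uses the conservative Neyman-style standard error, recomputes the plug-in coefficient under each permuted assignment, and approximates the permutation distribution with `n_perm` random shuffles. All function names and defaults are assumptions.

```python
import random
from statistics import mean, variance

def _cov(a, b):
    """Sample covariance with the 1 / (n - 1) normalization."""
    a_bar, b_bar = mean(a), mean(b)
    return sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b)) / (len(a) - 1)

def t_stat(Y1, Y2, D):
    """Plug-in efficient estimate divided by its Neyman-style standard error."""
    groups = {d: [(y1, y2) for y1, y2, di in zip(Y1, Y2, D) if bool(di) == d]
              for d in (True, False)}
    var_X = cov_X_theta0 = 0.0
    for obs in groups.values():
        y1, y2 = zip(*obs)
        var_X += _cov(y1, y1) / len(obs)
        cov_X_theta0 += _cov(y1, y2) / len(obs)
    beta = cov_X_theta0 / var_X
    # Adjusted outcomes Y2 - beta * Y1; their difference in means is theta_hat_beta.
    adj = {d: [y2 - beta * y1 for y1, y2 in obs] for d, obs in groups.items()}
    estimate = mean(adj[True]) - mean(adj[False])
    se = sum(variance(r) / len(r) for r in adj.values()) ** 0.5
    return estimate / se

def studentized_frt(Y1, Y2, D, n_perm=999, seed=0):
    """Monte Carlo p-value of the FRT based on |t| under permuted timing."""
    rng = random.Random(seed)
    t_obs = abs(t_stat(Y1, Y2, D))
    D_perm = list(D)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(D_perm)   # size-preserving reassignment of treatment
        exceed += abs(t_stat(Y1, Y2, D_perm)) >= t_obs
    return (1 + exceed) / (1 + n_perm)
```

Calling `studentized_frt(Y1, Y2, D)` on observed two-period data returns a p-value in (0, 1]. Adding 1 to the numerator and denominator counts the observed assignment as one of the draws, a common convention for Monte Carlo permutation p-values.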


Fisher Randomization Tests

The following regularity condition imposes that the means of the potential outcomes have

limits, and that their fourth moment is bounded.

Assumption 4

Suppose that for all $g$, $\lim_{N \to \infty} E_f[Y_i(g)] = \mu_g < \infty$, and there exists $L < \infty$ such that $\frac{1}{N} \sum_i \| Y_i(g) - E_f[Y_i(g)] \|^4 < L$ for all $N$.


Fisher Randomization Tests

With this assumption in hand, we can make precise the sense in which the FRT is

asymptotically valid under the weak null.

Proposition 5

Suppose Assumptions 1-4 hold. Let $t_\pi = (\hat\theta^* / \widehat{se})_\pi$ be the studentized statistic under permutation $\pi$. Then $t_\pi \to_d N(0, 1)$, $P_G$-almost surely. Hence, if $p_{FRT}$ is the p-value from the FRT associated with $|t_\pi|$, then under $H_0 : \theta = 0$,

\[ \limsup_{N \to \infty} P(p_{FRT} \le \alpha) \le \alpha, \]

$P_G$-almost surely, with equality if and only if $S^*_\theta = 0$.





• We study staggered rollout designs, in which units randomly receive treatment at different times.

• Estimation in these settings is often done using generalized difference-in-differences methods.

• We solve for the efficient estimator in a class that nests common procedures.

Our proofs draw on parallels to the literature on covariate adjustment in experiments.

• We provide both t-based and permutation-test based methods for randomization inference.

• The plug-in efficient estimator offers substantial precision gains relative to existing approaches.


