Optimal Multilevel Matching in Clustered Observational Studies: A Case Study of the School Voucher System in Chile∗
José R. Zubizarreta† Luke Keele‡
March 16, 2015
Abstract
A distinctive feature of a clustered observational study is its multilevel or nested data structure arising from the assignment of treatment, in a non-random manner, to groups or clusters of units or individuals. Examples are ubiquitous in the health and social sciences including patients in hospitals, employees in firms, and students in schools. What is the optimal matching strategy in a clustered observational study? At first thought, one might start by matching clusters of individuals and then, within matched clusters, continue by matching individuals. But, as we discuss in this paper, the optimal strategy is the opposite: first match individuals and, once all possible combinations of matched individuals are known, then match clusters. In this paper we use dynamic and integer programming to implement this strategy and extend optimal matching methods to hierarchical and multilevel settings. In particular, our method attempts to replicate a paired clustered randomized study by finding the largest sample of matched pairs of treated and control individuals within matched pairs of treated and control clusters that is balanced according to specifications given by the user. Our method directly balances covariates both at the cluster and individual levels and does not require estimating the propensity score, although the propensity score can be balanced as an additional covariate. We illustrate our method on a case study of the comparative effectiveness of public versus private voucher schools in Chile, a question of intense policy debate in the country at the present.
Keywords: Causal Inference; Group Randomization; Hierarchical/Multilevel Data; Observational Study; Optimal Matching
∗For comments and suggestions, we thank Cinar Kilcioglu, Sam Pimentel and Paul Rosenbaum, and seminar participants at Johns Hopkins University and the University of Pennsylvania.
†Assistant Professor, Division of Decision, Risk and Operations, and Statistics Department, Columbia University, 3022 Broadway, 417 Uris Hall, New York, NY 10027, Email: [email protected].
‡Associate Professor, Department of Political Science, 211 Pond Lab, Penn State University, University Park, PA 16802 Phone: 814-863-1592, Email: [email protected].
1 Introduction
1.1 Clustered Observational Studies with Multilevel Data
Clustered observational studies are ubiquitous in the health and social sciences. Examples include
patients receiving similar treatments in hospitals, employees facing a policy change inside firms,
and students following a particular learning program within schools. Clustered observational
studies have a nested or multilevel data structure with observed and unobserved covariates both
at the cluster and unit levels. In this context, research interest typically lies in the effect of
the cluster level treatment on unit level outcomes; however, this effect may be confounded by
differences in the distributions of covariates across the treatment groups at both the cluster
and unit levels. Therefore, an important question in clustered observational studies is how to
adjust for observed covariates while taking into account the multilevel structure. In the interest
of transparency, these adjustments should ideally balance covariates at both levels and facilitate
sensitivity analyses to hidden biases due to unobserved covariates (Rosenbaum 2010).
Educational settings are perhaps the best-known example of a multilevel structure, with unit level
measures such as the student’s score on a standardized test and cluster level covariates such as school
enrollment (Lee and Bryk 1989). Both covariates may act as confounders when evaluating, for
instance, the impact of a study program or administration regime targeted at improving learning.
A conventional approach to adjust for cluster and unit level covariates is hierarchical or multilevel
regression modeling. A hierarchical regression model allows the researcher to fit a model for
the mean outcome using unit level covariates while accounting for unexplained variation among
clusters. The cluster level predictors are often referred to as “contextual effects,” and may
be interpreted as causal effects under certain assumptions (Gelman 2006; Feller and Gelman
2015).
Nonparametric alternatives to multilevel regression modeling often rely on propensity scores (Hong
and Raudenbush 2006; Arpino and Mealli 2011; Li et al. 2013). For example, Hong and Raudenbush
(2006) stratify on a multilevel propensity score to approximate a two-stage experiment
where schools and students are randomly assigned to treatment within blocks. Matching methods
are extensively used in observational studies (Stuart 2010; Lu et al. 2011); however, they do not
typically account for multilevel data structures.
In this paper, we develop an optimal matching method for multilevel data structures. Contrary
to intuition, our method first matches pairs of units and then clusters to create matched pairs
of clusters with pairs of units within these cluster pairs. Our method is optimal in the sense
that it maximizes the size of the matched sample and it directly balances the observed covariates
in a specific way. For example, our method not only allows the researcher to balance central
moments of distributions but also the entire distribution of the observed covariates. Our method
does not require estimating the propensity score, although the propensity score can be used as an
additional balancing covariate. Although we illustrate our method in a study with two levels—
students within schools—it can readily be extended to settings with three or more levels such as
students within schools within districts. In particular, we illustrate our method on a case study
of the comparative effectiveness of public versus private voucher schools in standardized tests in
Chile. This is a question of intense policy debate in the country at present, and we describe it
below.
1.2 Vouchers and School Choice
Governments often enact policy reforms to improve educational outcomes. One educational
reform is the use of vouchers to create education markets. In a voucher system, parents receive
a voucher to choose among a number of competing schools, both public and private. While
voucher systems are relatively rare in the United States, many other countries have
adopted universal voucher programs.1 Countries with universal voucher programs include Chile,
Denmark, the Netherlands, South Korea and Sweden (Lara et al. 2011). Chile was the first country
to adopt a universal voucher system in the 1980s. Public and private voucher schools receive a
direct payment from the government on a per-student basis. This voucher system fueled the growth
of a parallel private education system. By 2006, nearly 5,000 private voucher schools existed,
and these schools enrolled nearly 44% of students, while 48% attended public schools (Lara et al.
2011). By 2013, there were over 6,000 private voucher schools enrolling around 53% of students,
while 38% attended public schools (MINEDUC 2013).
1 As of 2007, only 16% of U.S. students lived in areas with an active voucher system.
Are private voucher schools more effective than public schools? Current evidence on the voucher
system in Chile is mixed. A number of studies have found that private voucher schools increase
test scores by at least 15% to 20% of a standard deviation (Mizala and Romaguera 2001; Anand
et al. 2009), and other studies have found larger effects (Sapelli and Vial 2002, 2005). However,
other work has found effects that are either not statistically detectable (Hsieh and Urquiola 2006;
McEwan 2001) or much smaller (Lara et al. 2011). We conduct an observational study of
whether private voucher schools produce students with higher test scores than public schools.
These data have a multilevel structure in that we observe student level covariates such as gender
and socio-economic status as well as school level covariates such as enrollment and whether the
school is in an urban or rural area.
Our study also provides a basic template for the design of observational studies of school level
interventions. We demonstrate how temporal ordering is critical to selecting covariates for the
removal of overt biases. In an observational study, use of longitudinal data is necessary to avoid
conditioning on a post-treatment covariate. If data are carefully collected over time as events
occur, then the temporal order of events is clear, and the distinction between covariates and
outcomes is clear as well. In contrast, if data are collected from subjects at a single time, as in
a cross-sectional study, such temporal order is unclear and one might mistakenly condition on an
outcome. In this study, we use the transition from primary school to secondary school to clearly
delineate the temporal order of pre- and post-treatment covariates. The Chilean data form a full
panel over time which allows us to carefully separate both school and student level covariates
from outcomes.
This article is organized as follows. Section 2 describes the Chilean school system, the longitudinal
census data, and the design that we use in our study. Section 3 reviews cardinality matching for
finding the largest matched sample that is balanced, explains the multilevel matching method,
and presents the method more generally. Section 4 shows the resulting matches. Section 5
analyzes the comparative effectiveness of public and private voucher schools in Chile. Section 6
concludes with a summary and a discussion.
2 Schools in Chile; Educational Census Data; Study Design
2.1 The Chilean School System and Restricted Secondary School Choice
In Chile there are three basic types of schools: (i) private non-subsidized schools, (ii) private
subsidized schools, and (iii) public schools, also called municipal schools. Private non-subsidized
schools are generally elite schools that do not receive any funds from the state and are funded
entirely through private tuition. Approximately 7% of the students in Chile attend this type of
school. Private subsidized schools receive funding from the state on a per pupil basis, and some
of these schools charge an additional monthly tuition fee to parents (these are called “shared
funding” private subsidized schools or “financiamiento compartido” in Spanish). These schools
are a mix of for-profit and not-for-profit organizations. Approximately 53% of the students in
Chile attend this type of school. Finally, approximately 40% of students attend public schools.2
In Chile, the voucher system is based on a direct payment to the schools as a function of daily
attendance. In our
2 The exception is 60 schools (called “Delegated Administration”) that are publicly funded on a non-voucher
basis, i.e., they receive a fixed amount of money from the state regardless of how many students they enroll.
We seek to test whether private voucher schools are more effective than public schools. Pre-treatment
differences between these two groups of students may be measurable,
and thus form overt biases, or unmeasured, and thus form hidden biases. In an observational
study, analysts use pretreatment covariates and a statistical adjustment strategy to remove overt
biases in the hopes of consistently estimating treatment effects. We face two challenges in our
observational study if we simply compare the test scores of students in these two types of schools
and remove overt biases via matching.
First, data are not collected on students until they are in the fourth grade. Without measurements
that precede the treatment, we may mistakenly adjust for an outcome rather than a covariate and
bias the study (Rosenbaum 1984). Second, the choice to attend a private voucher school is most
likely highly confounded by not only observed but unobserved factors. To solve both problems,
we exploit an opportunity: an unusual setting in which we believe confounding from unobserved
covariates is lessened (Rosenbaum 2010, section 5.1).
The opportunity we exploit is related to the concept of differential effects, which are used to reduce
bias in observational studies by the study of parallel treatments (Rosenbaum 2006). Consider the
following example of a differential effect. It is thought that nonsteroidal anti-inflammatory drugs
(NSAIDs) reduce the risk of Alzheimer disease. To test this theory one might compare subjects
that regularly use NSAIDs to those that do not. However, there are a number of observed and
unobserved reasons why these two populations may differ other than the use of NSAIDs. As an
alternative design, one might instead compare regular NSAID users to regular users of other pain
medications such as acetaminophen. The differential effect of acetaminophen versus NSAIDs may
be less subject to bias from unmeasured confounders than the effect of NSAIDs versus no use
of pain medication. As such, a differential effect is an effort to reduce sensitivity to bias through
the comparison of parallel treatments.
The Chilean school system provides us with an opportunity that is parallel to a differential effect.
Chilean schools are divided into primary and secondary schools. Primary schools comprise
Grades 1–8 and secondary schools encompass Grades 9–12. Since 2003, students have been required
to attend both primary and secondary schools. However, primary and secondary schools need not
be separate institutions. It is not uncommon for primary and secondary schools to be fused into
a single school, such that students can attend Grades 1–12 at the same institution. Students
that attend a primary only school have to select a secondary school once they reach Grade 8. In
2004, 76.4% of public schools and 52% of private voucher schools were primary only schools, and
thus about 56% of students in the eighth grade had to select a new secondary school to attend
(Lara et al. 2011). We use this opportunity in our design by restricting the analysis to students
from public primary schools that had to select a secondary school since they attended a primary
only school. Under this design, treated students are students that switch from a public primary
school to a private voucher secondary school. We compare these treated students to students in
public primary only schools that choose to attend public secondary schools.
Exploiting this aspect of the Chilean school system has two advantages. First, it allows us to
clearly delineate the temporal ordering of the treatment. Since the treatment is a private voucher
secondary school, we may safely condition on covariates collected while students were in primary
school. In this way we avoid biases from adjusting for a concomitant outcome (Rosenbaum
1984). Second, we suspect that a student who switches from a public to a private voucher
school when not required to do so may switch for a number of reasons, many of which are
unobservable. Here, we restrict the analysis to students who attend their local public school but
must switch due to school structure. This may be a more directly observable treatment selection
mechanism.
2.2 Longitudinal Census of Students and Schools
In 1988, Chile introduced a national student assessment system known as the Sistema Nacional
de Medición de la Calidad de la Educación or SIMCE. The SIMCE is an “educational census.”
That is, in the SIMCE, the Ministry of Education collects data to evaluate all students in fourth,
eighth, tenth and eleventh grades in language, mathematics and sciences, roughly every two years.
SIMCE data are collected from four different sources. First, data are collected from students,
which include test scores that are complemented with other student covariates such as gender.
Second, both parents and teachers complete questionnaires. Finally, for schools, student test
scores are aggregated, and a few additional covariates are collected. Students are given unique
identifiers, which allow us to form a true panel over a two-year period. Student records can also
be linked to teacher, parent, and school level covariates.
In our study, we use SIMCE data from 2003, 2004, and 2006. The SIMCE from 2003 only
collected data from secondary schools and students enrolled in secondary schools. For 2004 and
2006, SIMCE collected data from both primary and secondary schools and students. We use test
scores on language and mathematics administered in 2006 when students are in the tenth grade
as our outcome measures.
2.3 Data Structure and Study Design
In our study, the data structure and design are intricately linked. We now outline how we
constructed the match to fit the data structure. One advantage of our approach is that we
can tailor the statistical adjustment to exactly fit the multilevel structure of the data, which
is important since we have student, parent, teacher, and school level data. We perform two
matches: one for students and one for schools. Next, we describe the covariates that form the
student level match.
For each student with test scores observed in 2006, we match on student, parent, teacher, and
primary school covariates from the SIMCE data collection in 2004. For the student match, we
first list student level covariates. The key covariate, here, is student test scores from the 8th
grade. In 8th grade students are tested on four topics: language, mathematics, social sciences,
and natural sciences. The student level data also measures gender. For the student match, we
also include three covariates from parents: income measured in six categories, father’s education,
and mother’s education. We also link students to primary school level measures, and we match
on primary school covariates in the student level match. At the primary school level, we match
on a five category socio-economic status indicator for each school that is created by the Chilean
Ministry of Education. This five category indicator is constructed from questions based on
parental education, family incomes in the school and an index of school vulnerability. We also
use school level measures that are aggregates of data observed at other levels. As such, we
match on average test scores for each primary school, the number of teachers and the number
of enrolled students. Finally, in the teacher survey, teachers are asked what level of education
they expect the majority of their students to achieve. Teachers responded using a five category
scale that records responses from 8th grade to a college degree. We aggregate this measure,
record the median for each primary school, and use it in the student level match. To reiterate,
while we observe covariates measured at different levels in 2004, we treat all these measures as
pre-treatment student level covariates in the match.
The school match is based on secondary school data from 2003. Since the SIMCE forms a panel,
we can match on characteristics of the secondary schools before any student is exposed to the
treatment. That is, we match on the schools the students will attend using data from before
they attend that school. For the school match, we match on enrollment, school level math and
language test score averages, the percentage of female students in the school, average student
income, urban versus rural status, and the same five category socio-economic status indicator
for each school that is recorded for primary schools.3 Some of these covariates are aggregates
that we created from either student, teacher, or parent level data in 2003. We did not match on
several other covariates that are also observed in the 2003 data. These measures include whether
teachers are allowed class preparation time, the proportion of teachers with a post-graduate
diploma, the average teacher experience, and the number of hours teachers worked per week. We
do not match on these covariates since they are plausibly part of the school level treatment.
Matching on such covariates would remove their effect on students from the final outcomes and
thus could potentially attenuate the treatment effect.
3 For students in secondary schools, the SIMCE only collects test scores on language and math. At the primary
school level, we have test scores for language, mathematics, social sciences, and natural sciences.
We describe the matching algorithm in greater detail in Section 3. However, the match is based on
integer programming which allows us to enforce different forms of balance for different covariates
(Zubizarreta 2012; Zubizarreta et al. 2014). This is relevant since we tailored the constraints for
each covariate. Here, we describe the different balance constraints we applied to each covariate.
For the student level covariates, we applied a mean balance constraint to primary school test score
measures, primary school enrollment, the number of teachers in the primary school, the average
expected level of educational attainment, and the proportion of female students in the primary
school. For student level test score measures, we enforced a constraint on the entire distribution
via the Kolmogorov-Smirnov test statistic which is the maximum discrepancy in the empirical
cumulative distribution functions. For the school level match, we enforced a mean balance
constraint on secondary school test scores, missingness indicators for test scores, secondary school
enrollment, income category, SES category, urban or rural status, and the proportion of female
students in the secondary school.
For discrete student and school level covariates, we used a fine balance constraint. Under fine
balance, we exactly balance covariates without exactly matching. Fine balance is achieved for
discrete covariates by balancing the marginal distributions of covariates exactly in aggregate but
without constraining who is matched to whom. We applied fine balance to student sex, father’s
and mother’s education levels, parental income categories, and primary school SES categories. We
now describe the notation and the optimal matching algorithm. See Rosenbaum et al. (2007) for
a discussion of fine balance and Rosenbaum (2010, Part II) for a discussion of different forms of
covariate balance.
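The three forms of balance described above (mean balance, balance of the whole distribution via the Kolmogorov-Smirnov statistic, and fine balance) can be checked on a matched sample with a few lines of code. The following is a minimal sketch in Python, not the code used in the study; the function names and the toy data are hypothetical, and the standardized difference uses one common convention (pooled sample standard deviation) that we assume here for illustration.

```python
from collections import Counter
from statistics import mean, variance


def std_mean_diff(treated, control):
    """Absolute difference in means in units of the pooled standard deviation."""
    pooled_sd = ((variance(treated) + variance(control)) / 2) ** 0.5
    return abs(mean(treated) - mean(control)) / pooled_sd


def ks_statistic(treated, control):
    """Maximum discrepancy between the two empirical CDFs."""
    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)

    points = sorted(set(treated) | set(control))
    return max(abs(ecdf(treated, p) - ecdf(control, p)) for p in points)


def is_finely_balanced(treated, control):
    """True if every category has the same count in both groups."""
    return Counter(treated) == Counter(control)


# Hypothetical matched sample of four pairs (toy numbers, not SIMCE data).
scores_t = [250, 260, 270, 280]
scores_c = [255, 258, 272, 279]
sex_t = ["F", "M", "F", "M"]
sex_c = ["M", "M", "F", "F"]

print(std_mean_diff(scores_t, scores_c) < 0.1)   # mean balance within 0.1 SD
print(ks_statistic(scores_t, scores_c))          # 0.25
print(is_finely_balanced(sex_t, sex_c))          # True
```

In the toy data the first and fourth pairs differ on sex, yet the marginal counts agree, which is what distinguishes fine balance from exact matching.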
3 Dynamic and Integer Programming for Multilevel Matching
The goal of our multilevel matching method is to find the largest sample of matched pairs of
treated and control units within matched pairs of treated and control clusters that is balanced
on the observed covariates. For assessing the sensitivity of results to the influence of unobserved
covariates we use the methods for sensitivity analysis proposed by Rosenbaum (1987, 2002) and
tailored to clustered treatment assignments by Hansen et al. (2014) (see subsection 5.4 of the
paper). In our case study, units are students and clusters are schools, and, importantly, because
results can be confounded both by student and school level covariates, we match pairs of students
and schools to balance covariates at both levels. The basic tool that we use in our multilevel
matching method is cardinality matching which we describe subsequently.
3.1 Review of Cardinality Matching
Common matching methods attempt to achieve covariate balance indirectly, by finding treated
and control units that are close on a summary measure of the covariates such as the Mahalanobis
distance or the propensity score (see Stuart 2010 and Lu et al. 2011 for reviews). Unlike these
matching methods, cardinality matching uses the original covariates to match units and directly
balance their covariate distributions (Zubizarreta et al. 2014). Specifically, by solving an integer
programming problem, cardinality matching finds the largest matched sample that satisfies the
researcher’s specifications for covariate balance. Following Zubizarreta (2012), these specifications
for covariate balance may not only require mean balance, but perhaps also other forms of
distributional balance such as fine balance (Rosenbaum et al. 2007), x-fine balance (Zubizarreta
et al. 2011), and strength-k matching (Hsu et al. 2015). For example, cardinality matching will
find the largest sample of matched pairs in which all the covariates have differences in means
smaller than one tenth of a standard deviation and the marginal distributions of nominal covariates
of greater prognostic importance are perfectly balanced (fine balance). In this manner,
with cardinality matching, subject matter knowledge about the research question at hand enters
the matching problem through the specifications for covariate balance, and the method finds the
largest matched sample that satisfies them.
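To make the idea concrete, the following toy sketch finds the largest pair-matched sample whose difference in means on a single covariate is within a tolerance. It is only an illustration under simplifying assumptions: it searches exhaustively rather than solving an integer program, it handles one covariate and one mean balance constraint, and the function name `cardinality_match` and the data are hypothetical.

```python
from itertools import combinations


def cardinality_match(treated, control, tol):
    """Largest pair-matched sample with |mean(treated) - mean(control)| <= tol.

    Exhaustive search, workable only for toy inputs; realistic problems
    are solved with integer programming.
    """
    for m in range(min(len(treated), len(control)), 0, -1):
        for t_idx in combinations(range(len(treated)), m):
            t_mean = sum(treated[i] for i in t_idx) / m
            for c_idx in combinations(range(len(control)), m):
                c_mean = sum(control[j] for j in c_idx) / m
                if abs(t_mean - c_mean) <= tol:
                    # Mean balance depends only on which units are kept,
                    # so any pairing of the kept units is admissible here.
                    return list(zip(t_idx, c_idx))
    return []


pairs = cardinality_match([10, 12, 20], [9, 11, 13, 30], tol=1.0)
print(pairs)  # [(0, 0), (1, 1)]
```

With the tighter tolerance no sample of three pairs is balanced, so the search settles for two pairs; loosening the tolerance to 5.0 admits all three treated units.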
As we describe in the next subsection, our multilevel matching method uses cardinality matching
to match treated and control students across all the possible combinations of treated and control
schools, and then uses a modified version of cardinality matching to match schools with the
largest number of matched students.
3.2 A Multistage Decision Method for Multilevel Matching
Let $k_t \in \mathcal{K}_t = \{1, \ldots, K_t\}$ index the treated clusters and $k_c \in \mathcal{K}_c = \{1, \ldots, K_c\}$ index the
control clusters. Let $j_{k_t}$ be treated unit $j$ in treated cluster $k_t$, with $j_{k_t} \in \mathcal{J}_{k_t} = \{1, \ldots, J_{k_t}\}$,
and let $j_{k_c}$ stand for control unit $j$ in control cluster $k_c$, with $j_{k_c} \in \mathcal{J}_{k_c} = \{1, \ldots, J_{k_c}\}$. Put $\mathbf{x}_{k_t}$ for
the vector of observed covariates of treated cluster $k_t$, and similarly write $\mathbf{x}_{j_{k_t}}$ for the observed
covariates of treated unit $j_{k_t}$; analogous notation applies for control clusters and units. Based
on the unit-level covariates, calculate a distance $\delta_{j_{k_t}, j_{k_c}}$ between treated unit $j_{k_t}$ and control unit
$j_{k_c}$ (for instance, this distance may be the robust Mahalanobis distance specified in section 8.3
of Rosenbaum 2010). Define $\mathcal{A}$ and $\mathcal{B}_a$ as the sets of feasible solutions for the cluster- and
unit-level matches within matched clusters (hence the subindex $a$ in $\mathcal{B}_a$). In practice, $\mathcal{A}$ and
$\mathcal{B}_a$ are implemented as linear inequality constraints in an integer program and they enforce the
researcher’s requirements for covariate balance and matching structures at the cluster and unit
levels respectively (for instance, $\mathcal{A}$ may require the means of the cluster covariates to be balanced
and the matched groups to form pairs of clusters, and $\mathcal{B}_a$ may require the marginal distributions
of the unit covariates to be balanced and the matched groups to form pairs of units). Importantly,
since the requirements in $\mathcal{A}$ refer to clusters and those in $\mathcal{B}_a$ refer to units, $\mathcal{A}$ and $\mathcal{B}_a$ are disjoint.
Let $\mathcal{J}^{(m)}_{k_t}$ be the set of treated units matched in treated cluster $k_t$ and
$\mathcal{J}^{(m)}_t = \bigcup_{k_t \in \mathcal{K}_t} \mathcal{J}^{(m)}_{k_t}$
be the set of treated units matched across all treated clusters. Finally, let $\mathcal{K}^{(m)}_t$ be the set of
matched treated clusters.
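As an illustration of how a balance requirement can be written as linear inequalities, consider mean balance on a single unit-level covariate $x$ within one pair of clusters $(k_t, k_c)$. The display below is a sketch in the spirit of the cardinality matching formulation of Zubizarreta et al. (2014), not a transcription of it: the binary indicator $b_{j_{k_t}, j_{k_c}}$ (equal to 1 if the two units are paired) and the tolerance $\varepsilon$ are assumed here for illustration.

```latex
\begin{aligned}
\Bigl|\, \sum_{j_{k_t} \in \mathcal{J}_{k_t}} \sum_{j_{k_c} \in \mathcal{J}_{k_c}}
   b_{j_{k_t}, j_{k_c}} \bigl( x_{j_{k_t}} - x_{j_{k_c}} \bigr) \Bigr|
  &\le \varepsilon \sum_{j_{k_t} \in \mathcal{J}_{k_t}} \sum_{j_{k_c} \in \mathcal{J}_{k_c}} b_{j_{k_t}, j_{k_c}}, \\
\sum_{j_{k_c} \in \mathcal{J}_{k_c}} b_{j_{k_t}, j_{k_c}} &\le 1 \quad \text{for each } j_{k_t} \in \mathcal{J}_{k_t}, \\
\sum_{j_{k_t} \in \mathcal{J}_{k_t}} b_{j_{k_t}, j_{k_c}} &\le 1 \quad \text{for each } j_{k_c} \in \mathcal{J}_{k_c}.
\end{aligned}
```

Dividing the first line by the number of matched pairs shows that it bounds the absolute difference in means by $\varepsilon$; the absolute value splits into two inequalities that are linear in $b$, and the last two lines force the matched groups to form pairs.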
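Building upon the framework of Rosenbaum (2012a), an optimal cardinality matching of units
within clusters can be characterized by the quadruple $(\mathcal{K}^{(m)}_t, \alpha, \mathcal{J}^{(m)}_t, \beta)$ of assignments of clusters
$\alpha : \mathcal{K}^{(m)}_t \to \mathcal{K}_c$ and units $\beta : \mathcal{J}^{(m)}_{k_t} \to \mathcal{J}_{k_c}$ that maximize the cardinality of the set of
matched units within matched clusters subject to the constraints in $\mathcal{A}$ and $\mathcal{B}_a$, respectively.
If there are two cardinality matchings that satisfy the requirements in $\mathcal{A}$ and $\mathcal{B}_a$, then we prefer
one matching over the other if it has a larger cardinality, or, if they both have
the same cardinality, if it has a smaller sum of total distances between matched units. Formally,
we prefer the cardinality matching $(\mathcal{K}^{(m)}_t, \alpha, \mathcal{J}^{(m)}_t, \beta)$ to $(\tilde{\mathcal{K}}^{(m)}_t, \tilde{\alpha}, \tilde{\mathcal{J}}^{(m)}_t, \tilde{\beta})$, denoted by
$$(\mathcal{K}^{(m)}_t, \alpha, \mathcal{J}^{(m)}_t, \beta) \succ (\tilde{\mathcal{K}}^{(m)}_t, \tilde{\alpha}, \tilde{\mathcal{J}}^{(m)}_t, \tilde{\beta}),$$
if $|\mathcal{J}^{(m)}_t| > |\tilde{\mathcal{J}}^{(m)}_t|$, or alternatively if $|\mathcal{J}^{(m)}_t| = |\tilde{\mathcal{J}}^{(m)}_t|$ and
$$\sum_{j_{k_t} \in \mathcal{J}^{(m)}_t} \delta_{j_{k_t}, \beta(j_{k_t})} < \sum_{j_{k_t} \in \tilde{\mathcal{J}}^{(m)}_t} \delta_{j_{k_t}, \tilde{\beta}(j_{k_t})}.$$
If $|\mathcal{J}^{(m)}_t| = |\tilde{\mathcal{J}}^{(m)}_t|$ and
$$\sum_{j_{k_t} \in \mathcal{J}^{(m)}_t} \delta_{j_{k_t}, \beta(j_{k_t})} = \sum_{j_{k_t} \in \tilde{\mathcal{J}}^{(m)}_t} \delta_{j_{k_t}, \tilde{\beta}(j_{k_t})},$$
then we are indifferent between the two cardinality matchings and write
$(\mathcal{K}^{(m)}_t, \alpha, \mathcal{J}^{(m)}_t, \beta) \sim (\tilde{\mathcal{K}}^{(m)}_t, \tilde{\alpha}, \tilde{\mathcal{J}}^{(m)}_t, \tilde{\beta})$. If we have either
$(\mathcal{K}^{(m)}_t, \alpha, \mathcal{J}^{(m)}_t, \beta) \succ (\tilde{\mathcal{K}}^{(m)}_t, \tilde{\alpha}, \tilde{\mathcal{J}}^{(m)}_t, \tilde{\beta})$
or $(\mathcal{K}^{(m)}_t, \alpha, \mathcal{J}^{(m)}_t, \beta) \sim (\tilde{\mathcal{K}}^{(m)}_t, \tilde{\alpha}, \tilde{\mathcal{J}}^{(m)}_t, \tilde{\beta})$, we write
$(\mathcal{K}^{(m)}_t, \alpha, \mathcal{J}^{(m)}_t, \beta) \succsim (\tilde{\mathcal{K}}^{(m)}_t, \tilde{\alpha}, \tilde{\mathcal{J}}^{(m)}_t, \tilde{\beta})$.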
Our optimal multilevel matching problem is the following.
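Problem 3.1. For given sets of cluster-level constraints $\mathcal{A}$ and unit-level constraints $\mathcal{B}_a$, find a
matching $(\mathcal{K}^{(m)}_t, \alpha, \mathcal{J}^{(m)}_t, \beta)$ that satisfies $\mathcal{A}$ and $\mathcal{B}_a$ such that, for any other matching
$(\tilde{\mathcal{K}}^{(m)}_t, \tilde{\alpha}, \tilde{\mathcal{J}}^{(m)}_t, \tilde{\beta})$ that also satisfies $\mathcal{A}$ and $\mathcal{B}_a$,
$(\mathcal{K}^{(m)}_t, \alpha, \mathcal{J}^{(m)}_t, \beta) \succsim (\tilde{\mathcal{K}}^{(m)}_t, \tilde{\alpha}, \tilde{\mathcal{J}}^{(m)}_t, \tilde{\beta})$.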
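The preference ordering over feasible matchings is lexicographic: cardinality first, total distance second. A minimal sketch of this ordering as a Python comparison key follows. For brevity it represents a matching only by its list of (treated unit, control unit) pairs and ignores the cluster assignment; the names `preference_key` and `weakly_prefers` and the distance matrix are hypothetical.

```python
def preference_key(pairs, distance):
    """Sort key: larger cardinality first, then smaller total distance."""
    total_distance = sum(distance[t][c] for t, c in pairs)
    return (-len(pairs), total_distance)


def weakly_prefers(first, second, distance):
    """True if `first` is preferred to, or indifferent with, `second`."""
    return preference_key(first, distance) <= preference_key(second, distance)


# Toy distances between two treated units (rows) and two control units (columns).
distance = [[1.0, 4.0],
            [2.0, 0.5]]
m1 = [(0, 0), (1, 1)]   # two pairs, total distance 1.5
m2 = [(0, 1), (1, 0)]   # two pairs, total distance 6.0
m3 = [(1, 1)]           # one pair,  total distance 0.5

print(weakly_prefers(m1, m2, distance))  # True: same size, smaller distance
print(weakly_prefers(m1, m3, distance))  # True: larger size wins despite larger distance
```

Because Python compares tuples element by element, taking the minimum of a collection of feasible matchings under `preference_key` returns one that is weakly preferred to all the others.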
Intuition may suggest that the best way to solve Problem 3.1 and match with multilevel data
is first to match clusters and then, within matched clusters, to match units. In our case study, this
would require first pairing schools and then, within pairs of schools, pairing students. However,
this strategy will not always find the largest matched sample that is balanced: two schools
that are paired on their school level characteristics may have different student compositions, so
that pairing their students may result in a smaller sample size than optimal. For
this reason, the optimal matching strategy needs to consider what is optimal at both the
student and school levels simultaneously. Applying Bellman’s (1957) principle of optimality, the
optimal matching strategy is first to match students, under the assumption that schools have been
matched optimally, and then, considering these optimal student matches, to match schools.
In abstract terms, the following algorithm and proposition state that the optimal
strategy is first to match units across all the possible combinations of pairs of treated and control
clusters and, once all possible combinations of matched units are known, then to match clusters.
To implement the optimal assignments $\alpha$ and $\beta$, let $a_{k_t,k_c} = 1$ if treated cluster $k_t$ is paired to
control cluster $k_c$ and $a_{k_t,k_c} = 0$ otherwise; similarly, let $b_{j_{k_t}, j_{k_c}} = 1$ if treated unit $j$ in treated
cluster $k_t$ is paired to control unit $j$ in control cluster $k_c$, and $b_{j_{k_t}, j_{k_c}} = 0$ otherwise.
Algorithm 3.2. For each of the possible $K_t \times K_c$ pairs of treated and control clusters, find
the optimal cardinality matching of units that satisfies $\mathcal{B}_a$. That is, for each $k_t \in \mathcal{K}_t$ and each
$k_c \in \mathcal{K}_c$, find $m_{k_t,k_c} = \max_b \sum_{j_{k_t} \in \mathcal{J}_{k_t}} \sum_{j_{k_c} \in \mathcal{J}_{k_c}} b_{j_{k_t}, j_{k_c}}$ subject to $b \in \mathcal{B}_a$. Then find the optimal
cardinality cluster matching that solves $\max_a \sum_{k_t \in \mathcal{K}_t} \sum_{k_c \in \mathcal{K}_c} m_{k_t,k_c} a_{k_t,k_c}$ subject to $a \in \mathcal{A}$.
Proposition 3.3. Algorithm 3.2 solves the optimal multilevel cardinality matching Problem 3.1.
Proof. Let $f(a, b)$ be the total number of pairs of treated and control units matched by $b$
within pairs of treated and control clusters matched by $a$. In the abstract, in Problem 3.1 we want to
maximize the function $f(a, b)$ subject to the constraints $\mathcal{A}$ and $\mathcal{B}_a$. That is, find $a$ and $b$ to solve
$$\max_{a, b} f(a, b) \quad \text{subject to } a \in \mathcal{A},\ b \in \mathcal{B}_a. \tag{1}$$
In a trivial way, we may solve (1) by first solving
$$g(a) = \max_{b} f(a, b) \quad \text{subject to } b \in \mathcal{B}_a \tag{2}$$
for each $a \in \mathcal{A}$, and then solving
$$\max_{a} g(a) \quad \text{subject to } a \in \mathcal{A}. \tag{3}$$
While (2) seems hard in general (because there are many possible choices of $b$), the nested
structure of the units-in-clusters problem makes it easier: $f(a, b)$ separates into a sum
of parts for cluster pairs because the constraint sets $\mathcal{A}$ and $\mathcal{B}_a$ are disjoint. Algorithm 3.2 does
exactly this.
In our case study, for each pairing of schools $a$, we find the best pairing of students $b$ within
those schools (2), and then pick the best pairing of schools with the associated best pairing of
students for that pairing of schools (3). Again, while (2) seems hard in general (because there
are many possible student matches $b$), the nested structure of the students-in-schools problem
makes it easier because $f(a, b)$ separates into a sum of parts for school pairs. For example,
if treated school $k_t$ is paired to control school $k_c$, then the contribution of schools $k_t$ and $k_c$ is
the same number of pairs regardless of how the other schools are paired.
With Algorithm 3.2, the multilevel cardinality matching problem can be solved optimally by
breaking it into simpler matching subproblems and recursively finding the optimal match. This is
an application of dynamic programming to matching in observational studies that takes advantage
of the multilevel structure of the data (see Bertsekas 2005 for an extensive exposition of dynamic
programming).
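The two-stage logic of Algorithm 3.2 can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are ours and not part of the method above: the balance requirements $B_a$ are replaced by a caliper on a single covariate (so that each student-level problem reduces to a maximum bipartite matching rather than an integer program), the cluster constraint set $A$ is taken to be one-to-one pairing, and the function names and toy data are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import maximum_bipartite_matching


def max_student_pairs(x_treated, x_control, caliper):
    """Stage 1 for one school pair: size of the largest one-to-one student
    matching in which paired students differ by at most `caliper`.
    The caliper is a simplified stand-in for the balance constraints B_a."""
    allowed = np.abs(np.subtract.outer(x_treated, x_control)) <= caliper
    graph = csr_matrix(allowed.astype(np.int8))
    matched_column = maximum_bipartite_matching(graph, perm_type="column")
    return int(np.sum(matched_column >= 0))


def multilevel_match(treated_schools, control_schools, caliper):
    """Stage 1 for every school pair, then stage 2: pair schools one-to-one
    so that the total number of matched student pairs is as large as possible."""
    m = np.array([[max_student_pairs(xt, xc, caliper) for xc in control_schools]
                  for xt in treated_schools])
    rows, cols = linear_sum_assignment(m, maximize=True)
    school_pairs = [(int(r), int(c)) for r, c in zip(rows, cols) if m[r, c] > 0]
    total = int(sum(m[r, c] for r, c in school_pairs))
    return school_pairs, total, m


# Hypothetical baseline scores: 2 treated schools, 3 control schools.
treated_schools = [np.array([10.0, 12.0, 30.0]), np.array([50.0, 52.0])]
control_schools = [np.array([11.0, 29.0, 31.0, 51.0]),
                   np.array([49.0, 53.0, 12.0]),
                   np.array([100.0])]
school_pairs, total, m = multilevel_match(treated_schools, control_schools, caliper=2.0)
print(m)             # matrix of m_{kt,kc}
print(school_pairs)  # chosen school pairs
print(total)         # total matched student pairs
```

The matrix `m` plays the role of $m_{k_t,k_c}$: it is computed once, independently for each school pair, and only then are the schools paired.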
3.3 Extensions and Computation
Note that if we had three or more levels (such as students within schools within districts), then
the multilevel matching procedure would extend naturally. With l levels, the procedure would
first match the lowest level l under the assumption that levels l − 1, l − 2, ..., 1 have
been matched optimally, then (once the matches at level l are completed) match level l − 1
under the assumption that levels l − 2, l − 3, ..., 1 have been matched optimally, and so on.
Note that our multilevel matching method maximizes the size of the matched sample, but it can
also be formulated to minimize a covariate distance between students. If this were the case and if
each of the student level matching problems were solved using optimal matching as in Rosenbaum
(1989) and Hansen (2007), then a trivial worst-case time bound for the multilevel matching
method would be of order $O(J^3 K^2 + K^3)$, where $J = \max\{J_{k_t=1}, \ldots, J_{k_t=K_t}, J_{k_c=1}, \ldots, J_{k_c=K_c}\}$
and $K = \max\{K_t, K_c\}$. In general it is possible to find worst-case time bounds based on the
component problems. In our presentation above, each of the component problems is a cardinality
matching problem and, while at present there is no known polynomial time algorithm for cardinality
matching, in practice many instances with data sets of reasonable size run in time comparable
to that of optimal matching. Furthermore, a useful feature of problem (1) is that the student
level matches can be found in parallel by separating all the possible pairs of treated and control
schools into smaller mutually exclusive and exhaustive subsets of pairs of treated and control schools. In
practice, we found the matches using the package mipmatch for R (Zubizarreta 2012).
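The worst-case bound quoted in this subsection can be unpacked as follows. We assume here, as is standard for assignment algorithms although not stated explicitly above, that one optimal matching problem with at most $n$ units on each side can be solved in $O(n^3)$ time.

```latex
\underbrace{K_t K_c}_{\le K^2 \text{ school pairs}}
\times
\underbrace{O(J^3)}_{\text{one student-level match}}
\;+\;
\underbrace{O(K^3)}_{\text{one school-level match}}
\;=\; O(J^3 K^2 + K^3).
```

The first term dominates whenever schools are large relative to the number of schools, which is why solving the student-level problems in parallel is attractive.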
4 Covariate Balance in the Matched Sample
After applying basic exclusion criteria, there are 64245 students in 517 schools, 150 subsidized
and 367 public schools (henceforth treated and control schools respectively). Out of the 64245
students, 15682 students are from treated schools and 48563 are from control schools. Using
our multilevel matching method, we matched in two stages within groups of similar regions of the
country (namely, regions I-III, IV-V, VI-VII, VIII, IX, X-XII and the Metropolitan region).
At the student level, we used cardinality matching to find the largest balanced sample of pairs
of students across all the possible combinations of pairs of schools within the groups of regions.
In each of these matches we required mean balance for 19 covariates (including student test
scores, school test scores, and indicators for socioeconomic status and expected educational
achievement; see Table 1 for details), fine balance for 4 covariates (sex, mother and father
education, and household income; see Table 2) and distributional balance for the sum of the
test scores in language and mathematics at baseline. Figure 1 shows not only that the marginal
distributions of the baseline test scores are very closely balanced after matching but also their
joint distribution. As a matter of fact, the 95% bivariate normal density contours are almost
indistinguishable after matching.
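As a rough illustration of the mean balance criterion, a standardized difference in means can be computed as sketched below. The pooling convention (square root of the average of the two group variances) and the function names are assumptions made for this example; the text does not state which convention was used.

```python
import numpy as np


def standardized_difference(x_treated, x_control):
    """Difference in means divided by a pooled standard deviation.
    Pooling by the average of the two sample variances is an assumption."""
    x_treated = np.asarray(x_treated, dtype=float)
    x_control = np.asarray(x_control, dtype=float)
    pooled_sd = np.sqrt((x_treated.var(ddof=1) + x_control.var(ddof=1)) / 2.0)
    return (x_treated.mean() - x_control.mean()) / pooled_sd


def is_mean_balanced(x_treated, x_control, tolerance=0.05):
    """True if the absolute standardized difference is within the tolerance."""
    return abs(standardized_difference(x_treated, x_control)) <= tolerance


# Hypothetical baseline scores for three matched treated and control students.
treated_scores = [240.0, 250.0, 260.0]
control_scores = [241.0, 250.0, 262.0]
d = standardized_difference(treated_scores, control_scores)
print(round(d, 4))
print(is_mean_balanced(treated_scores, control_scores, tolerance=0.05))
print(is_mean_balanced(treated_scores, control_scores, tolerance=0.15))
```

In cardinality matching such tolerances enter as constraints of the optimization rather than as checks made after the fact; the function above only verifies them on a given matched sample.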
At the school level, we used the modification of cardinality matching in the second stage of
Algorithm 3.2 and mean balanced 16 other covariates: percentage female, total enrollment, language
and math scores (plus indicators for missing values), urban area, parental income categories
(1-5), and socioeconomic groups (A-D). Again, covariates were exact matched for the 7 region
Table 1: Covariate balance at the student level after matching. All the covariates are measured in 2004.

Covariate                                                Mean          Mean      Std.
                                                         Subsidized    Public    dif.
Language score                                           243.24        243.68    -0.01
Mathematics score                                        243.43        243.04     0.01
Natural science score                                    246.19        247.00    -0.02
Social science score                                     243.01        243.91    -0.02
School language score                                    237.57        237.02     0.03
School mathematics score                                 238.37        237.89     0.03
School female proportion                                   0.51          0.50     0.05
School number of students                                 82.34         82.34     0.00
School teacher to student ratio                            8.09          8.03     0.03
Urban area                                                 0.83          0.83     0.00
Socioeconomic status A                                     0.13          0.12     0.01
Socioeconomic status B                                     0.61          0.61     0.01
Socioeconomic status C                                     0.25          0.26    -0.02
Socioeconomic status D                                     0.01          0.01     0.01
Expected education: primary                                0.01          0.02    -0.04
Expected education: secondary, technical-professional      0.77          0.77    -0.00
Expected education: secondary, scientific-humanities       0.13          0.12     0.02
Expected education: technical-professional                 0.09          0.10    -0.01
Expected education: college                                0.00          0.00     0.01
Table 2: Balance for nominal covariates at the student level. All the covariates are measured in 2004. Fine balance constraints balanced sex, type of education of the mother and father, and household income category. The tabulated values are counts of the number of students in each category. In addition, matching was exact for groups of counties (not shown here).

Covariate                                      Subsidized    Public
Sex
  Male                                         2084          2084
  Female                                       1981          1981
Mother education
  Primary school                               1886          1886
  Secondary school                             1152          1152
  Technical                                      56            56
  College or higher                              17            17
  Missing                                       954           954
Father education
  Primary school                               1647          1647
  Secondary school                             1255          1255
  Technical                                      56            56
  College or higher                              17            17
  Missing                                       954           954
Household income category (in 1000 pesos)
  [0, 100)                                     1647          1647
  [100, 200]                                   1255          1255
  (200, 400]                                   1643          1643
  (400, 600]                                    446           446
  (600, 1400]                                   124           124
  > 1400                                         84            84
  Missing                                       183           183
Figure 1: Distribution of student test scores at baseline after matching. The baseline test scores are measured in 2004. The ellipses trace the 95% bivariate normal density contours of the joint distributions of test scores for the matched treated and control units. The contours are almost identical, showing that not only the marginal distributions of the test scores are very closely balanced but also their joint distribution.
[Figure: "Distribution of student-level test scores after matching". Horizontal axis: test scores in language at baseline; vertical axis: test scores in mathematics at baseline; both axes run from 100 to 400. Groups shown: Public (Pub.) and Subsidized (Sub.).]
groups. We balanced all covariates with and without weighting for the size of the school; see
Table 3. Note that after matching all the differences in means are smaller than 0.05 standard
deviations. In this way, we matched 8130 students in 4065 pairs, and 166 schools in 83 pairs.
In this match, 7 out of the 13 regions of the country are represented in both the treatment and
control groups.
Table 3: Covariate balance at the school level after matching. Both means and standardized differences are weighted by the number of students in each school.

Covariate                   Mean          Mean      Std.
                            Subsidized    Public    dif.
Female proportion             0.49          0.49    -0.00
Number of students          262.33        262.67    -0.00
Language score              241.36        241.43    -0.00
Mathematics score           230.20        229.19     0.04
Language score missing        0.00          0.00     0.00
Math score missing            0.00          0.00     0.00
Urban area                    0.98          0.99    -0.03
Income category 1             0.08          0.09    -0.02
Income category 2             0.80          0.79     0.01
Income category 3             0.12          0.12     0.00
Income category 4             0.00          0.00     0.00
Income category 5             0.00          0.00     0.00
Socioeconomic status A        0.18          0.18     0.01
Socioeconomic status B        0.71          0.71     0.01
Socioeconomic status C        0.10          0.11    -0.02
Socioeconomic status D        0.00          0.00     0.00
Thus, while our match yields highly comparable treated and control groups, geographic coverage
is somewhat poor. We therefore also implemented a second match designed to increase geographic
representation. Starting from the same student matches, we found a more externally
valid school match in which all the differences in means are smaller than 0.15 standard deviations, and
where 10776 students are matched in 5388 pairs and 210 schools are matched in 105 pairs. In
this match, 12 out of the 13 regions in Chile are represented. We deem the first match a match
with greater internal validity, and the second match a match with greater external validity. Table
4 compares the results of these two matches. In the next section we compare treatment effect
estimates for these two different matched samples.
Table 4: Comparison of the two school level matches. In the internally valid match the largest standardized difference in means is smaller than 0.05, whereas in the externally valid match this difference is smaller than 0.15. Regions represented are the regions present in both the treatment and control groups.

                       Internal validity match    External validity match
Matched students       8130                       10776
Matched schools        166                        210
Regions represented    4, 5, 6, 7, 9, 10, 13      1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13
5 Outcome Analyses
Having found the optimal match, we now estimate the voucher effect, test its significance, and
assess the robustness of these conclusions with sensitivity analysis methods to hidden bias. We
follow the notation and methods for clustered experiments by Small et al. (2008), and use the
sensitivity analysis methods for clustered observational studies by Hansen et al. (2014).
5.1 Notation: Treatment Effects for Students in Schools
There are S matched pairs of clusters, s = 1, . . . , S, with two schools, j = 1, 2, one treated and
one control for 2S total units. The ordered pair sj thus identifies a unique cluster. Each cluster
sj contains nsj > 1 individuals, i = 1, . . . , nsj. Each pair is matched for observed, pretreatment
covariates, so xs11 = xs22 for each j and i, where xsji represents the observed covariates on
which we matched. A student i in school sj is described by both observed covariates and an
unobserved covariate usji. The set (xsji, usji) may describe either the student sji or the school
sj containing this student. In our study, treatment assignment occurs at the school level as whole
schools are assigned to treatment or control. If the jth school in pair s receives the treatment,
write Zsj = 1, whereas if this school receives the control, write Zsj = 0, so Zs1 + Zs2 = 1, for
each s as each pair contains one treated school and one control school. If nsj = 1 for all sj then
the clusters are individuals, and we have unclustered treatment assignment.
Each student has two potential responses; one response that is observed under treatment Zsj = 1
and the other observed under control Zsj = 0 (Neyman 1923; Rubin 1974). We denote these
responses with (yTsji, yCsji), where yTsji is observed from the ith subject in pair s under Zsj = 1,
and yCsji is observed from this subject under Zsj = 0. In our application yTsji is the test score
that student sji would exhibit if he or she switched from a public school to a private voucher
school and yCsji is the test score this same student would exhibit if he or she remained in a public
school. Under this notation, we allow for interference among students in the same school but
not across schools. In this context, yTsji denotes the response of student sji if all students in
school sj receive the treatment, while yCsji denotes the response of student sji if all students
in school sj receive the control. Therefore, we do not assume that we would observe the same
response from student sji if the treatment were assigned to some but not all of the students in
school sj.
For each student, the unobservable effect of treatment is yTsji − yCsji, which is the change in
test scores induced by attending a private voucher school. We do not observe both potential
outcomes, but we do observe responses: Ysji = ZsjyTsji + (1−Zsj)yCsji. Under this framework,
the observed response Ysji varies with Zsj but the potential outcomes do not vary with treatment
assignment. Write $R = (R_{111}, \ldots, R_{S2,n_{S2}})^T$ for the $N = \sum_{s,j} n_{sj}$ dimensional vector of observed
responses, with the same notation for $y_c$, the vector of potential responses under control. Below
we test the sharp null hypothesis of no treatment effect on (yTsji, yCsji) which stipulates that
H0 : yTsji = yCsji for all sji (Fisher 1935). This hypothesis asserts that changing the treatment
assigned to school sj would leave the response of student sji unchanged.
5.2 Randomization Inference When Treatment is Assigned at the School Level
In our analysis, we initially assume that treatment is as-if randomly assigned to
clusters conditional on the matches. In short, we assume it is as if the toss of a fair coin were used
to allocate private voucher status within matched school pairs. We collect in the set Ω the $2^S$
possible treatment assignments for all $2S$ clusters, $Z = (Z_{11}, Z_{12}, \ldots, Z_{S2})^T$. In a matched-pair group
randomized experiment, one treatment assignment $z \in \Omega$ would be picked at random, and each
assignment would therefore have probability $\Pr(Z = z) = 2^{-S}$, which yields the randomization
distribution. In an observational study, if the probability of receiving treatment is equal for both
schools in each pair, then the conditional distribution of Z given that there is exactly one treated
unit in each pair equals this randomization distribution, and $\Pr(Z_{sj} = 1) = 1/2$ for each unit j
in pair s (see Rosenbaum 2002 for details). However, in an observational study it may not be
true that $\Pr(Z_{sj} = 1) = 1/2$ for each unit j in pair s, due to an unobserved covariate $u_{sji}$. We explore
this possibility through a sensitivity analysis described below.
To test Fisher’s sharp null hypothesis of no treatment effect, we define a test statistic T which is
a function of Z and R, where $T = t(Z, R)$. Under the sharp null hypothesis $R = y_c$, and therefore
$T = t(Z, y_c)$. If the model for treatment assignment above were true, then the randomization
distribution for T would be
$$\Pr\{t(Z, R) \geq w \mid y_{Tsji}, y_{Csji}, x_{sji}, u_{sji}, \Omega\} = \Pr\{t(Z, y_c) \geq w \mid y_{Tsji}, y_{Csji}, x_{sji}, u_{sji}, \Omega\}$$
since $y_c$ is fixed by conditioning on $y_{Tsji}, y_{Csji}, x_{sji}, u_{sji}$ and $\Pr(Z = z \mid y_{Tsji}, y_{Csji}, x_{sji}, u_{sji}, \Omega) = 1/|\Omega|$
for each $z \in \Omega$. We use a test statistic from Hansen et al. (2014) that provides inferences for units in
clusters.
For this test statistic, $q_{sji}$ is a score or rank given to $Y_{sji}$, so that under the null hypothesis, the
$q_{sji}$ are functions of the $y_{Csji}$ and $x_{sji}$, and they do not vary with $Z_{sj}$. To make $q_{sji}$ resistant to
outliers, we use the ranks of the residuals when $Y_{sji}$ is regressed on the student level covariates
using Huber’s method of m-estimation (Small et al. 2008). We regressed the outcome, test
scores recorded in 2006, on student level test scores recorded in 2004 when the student was still
in primary school. The test statistic T is a weighted sum of the mean ranks in the treated school
minus the mean ranks in the control school. Formally the test statistic is
$$T = \sum_{s=1}^{S} B_s Q_s$$
where
$$B_s = 2Z_{s1} - 1 = \pm 1, \qquad Q_s = \frac{w_s}{n_{s1}} \sum_{i=1}^{n_{s1}} q_{s1i} - \frac{w_s}{n_{s2}} \sum_{i=1}^{n_{s2}} q_{s2i}.$$
Hansen et al. (2014) show that T is the sum of S independent random variables each taking the
value $\pm Q_s$ with probability 1/2, so $E(T) = 0$ and $\mathrm{var}(T) = \sum_{s=1}^{S} Q_s^2$. The central limit theorem
implies that as $S \to \infty$, $T/\sqrt{\mathrm{var}(T)}$ converges in distribution to the standard Normal
distribution. In the above equation, $w_s$ defines the weights, which are a function of $n_{sj}$.
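To make the construction of T concrete, the sketch below computes the pair contrasts, the statistic, its null variance, and one-sided p-values on hypothetical data. The scores are taken as given (the analysis above uses ranks of Huber residuals, which are not reproduced here), the weights are constant, and the function names are ours. With so few pairs the Normal approximation is crude, so the sketch also enumerates all $2^S$ sign vectors to obtain the exact randomization p-value.

```python
import itertools
import math

import numpy as np
from scipy.stats import norm


def cluster_pair_contrasts(q_first, q_second, weights):
    """Q_s = w_s * (mean score in school s1 - mean score in school s2)."""
    return np.array([w * (np.mean(a) - np.mean(b))
                     for a, b, w in zip(q_first, q_second, weights)])


def randomization_test(Q, z_first):
    """Observed T, its null variance, and upper one-sided p-values."""
    B = 2 * np.asarray(z_first) - 1            # +1 if school s1 is the treated one
    T = float(np.sum(B * Q))
    var_T = float(np.sum(Q ** 2))
    p_normal = float(norm.sf(T / math.sqrt(var_T)))
    # Exact p-value: every sign vector in {-1, +1}^S is equally likely under H0.
    null_values = np.array([np.dot(signs, Q)
                            for signs in itertools.product((-1, 1), repeat=len(Q))])
    p_exact = float(np.mean(null_values >= T - 1e-12))
    return T, var_T, p_normal, p_exact


# Hypothetical scores for S = 3 matched school pairs.
q_first = [[3.0, 5.0], [1.0, 2.0, 3.0], [8.0]]
q_second = [[1.0, 1.0], [4.0, 6.0], [2.0, 4.0]]
z_first = [1, 0, 1]          # which pairs have school s1 treated
Q = cluster_pair_contrasts(q_first, q_second, weights=[1.0, 1.0, 1.0])
T, var_T, p_normal, p_exact = randomization_test(Q, z_first)
print(T, var_T, p_normal, p_exact)
```

Enumeration is only feasible for small S; for the numbers of school pairs in our matches the Normal approximation is the practical choice.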
The choice of weights ws has important implications in our application. Hansen et al. (2014)
discuss three possible choices for ws. One possibility is to use constant weights, ws ∝ 1. Another
possibility is to use weights that are proportional to the total number of students in a matched
cluster pair: $w_s \propto n_{s1} + n_{s2}$, or $w_s = (n_{s1} + n_{s2}) / \sum_{l=1}^{S} (n_{l1} + n_{l2})$. These proportional weights are
particularly useful if we believe that the private school voucher effect varies with cluster size. This
would be true if, for example, the private school effect was larger in smaller schools. However,
if we suspect that the private voucher school effect is constant, we could select the weights to
minimize the variance of T . For example, ws ∝ ns1ns2/(ns1 + ns2) will minimize the variance of
T if cross cluster variability is low, while constant weights will minimize the variance if there is
little variance within schools. Hansen et al. (2014) note that for testing the null hypothesis each
set of weights is valid. Given that our cluster sizes exhibit considerable variation, and it is fully
possible that the treatment effect varies with cluster size, we use all three sets of weights to test
the sharp null hypothesis. Below we discuss how we incorporate the different weights into the
sensitivity analysis.
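The three weighting schemes can be written down directly. In the sketch below the scheme labels and the normalization of the weights to sum to one are our own conventions; the weights are only defined up to proportionality in the text.

```python
import numpy as np


def pair_weights(n_first, n_second, scheme):
    """Weights w_s for matched school pairs with sizes n_s1 and n_s2."""
    n1 = np.asarray(n_first, dtype=float)
    n2 = np.asarray(n_second, dtype=float)
    if scheme == "constant":          # w_s proportional to 1
        w = np.ones_like(n1)
    elif scheme == "proportional":    # w_s proportional to n_s1 + n_s2
        w = n1 + n2
    elif scheme == "harmonic":        # w_s proportional to n_s1 n_s2 / (n_s1 + n_s2)
        w = n1 * n2 / (n1 + n2)
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return w / w.sum()                # normalization is a convention, not required


# Hypothetical sizes for two matched pairs of schools.
n_first, n_second = [10, 40], [10, 60]
w_constant = pair_weights(n_first, n_second, "constant")
w_proportional = pair_weights(n_first, n_second, "proportional")
w_harmonic = pair_weights(n_first, n_second, "harmonic")
print(w_constant, w_proportional, w_harmonic)
```

When school sizes vary a great deal, the three schemes can give very different emphasis to the same pair, which is why we report results under all of them.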
If we test the hypothesis of a shift effect instead of the hypothesis of no effect, we can apply the
method of Hodges and Lehmann (1963) to estimate the voucher school effect. The Hodges and
Lehmann (HL) estimate of τ is the value of τ0 that, when subtracted from the responses of students
in treated schools, makes T as close as possible to its null expectation. Intuitively, the point estimate τ̂ is the value of τ0 such
that $T_{\tau_0}$ equals 0 when $T_{\tau_0}$ is computed from $Y_{sji} - Z_{sj}\tau_0$. Using constant effects is convenient,
but this assumption can be relaxed; see Rosenbaum (2003). If the treatment has an additive
effect, $y_{Tsji} = y_{Csji} + \tau$, then a 95% confidence interval for the additive treatment effect is formed
by testing a series of hypotheses H0 : τ = τ0 and retaining the set of values of τ0 not rejected at
the 5% level.
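The Hodges and Lehmann idea can be sketched as a search over candidate shifts. The version below is deliberately simplified and rests on assumptions of ours: the scores are plain ranks of the adjusted responses (not ranks of Huber residuals), the weights are constant, the search is over a user-supplied grid, and the data are hypothetical.

```python
import numpy as np
from scipy.stats import rankdata


def statistic_after_shift(y_first, y_second, z_first, tau0):
    """T computed from Y - Z * tau0, with constant weights and plain ranks."""
    adj_first = [np.asarray(y, dtype=float) - z * tau0
                 for y, z in zip(y_first, z_first)]
    adj_second = [np.asarray(y, dtype=float) - (1 - z) * tau0
                  for y, z in zip(y_second, z_first)]
    schools = adj_first + adj_second
    ranks = rankdata(np.concatenate(schools))       # ties receive average ranks
    sizes = [len(a) for a in schools]
    chunks = np.split(ranks, np.cumsum(sizes)[:-1])
    S = len(adj_first)
    T = 0.0
    for s in range(S):
        Q_s = chunks[s].mean() - chunks[S + s].mean()
        T += (2 * z_first[s] - 1) * Q_s
    return T


def hodges_lehmann(y_first, y_second, z_first, grid):
    """Grid value of tau0 at which T is closest to its null expectation, 0."""
    values = np.array([statistic_after_shift(y_first, y_second, z_first, t)
                       for t in grid])
    return float(grid[np.argmin(np.abs(values))])


# Hypothetical data in which treated responses exceed controls by exactly 5.
y_first = [[10.0, 20.0], [35.0, 45.0]]
y_second = [[15.0, 25.0], [30.0, 40.0]]
z_first = [0, 1]            # pair 1: second school treated; pair 2: first school treated
tau_hat = hodges_lehmann(y_first, y_second, z_first, grid=np.arange(0.0, 10.5, 0.5))
print(tau_hat)
```

Because T is a step function of τ0, a grid search returns one point of the interval on which T is closest to zero; a finer treatment would average the endpoints of that interval.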
5.3 Comparative Effectiveness of Public Versus Private Voucher Schools
We now test the hypothesis of no effect for private voucher schools. We test this hypothesis in
both matches. For the match with greater external validity, with constant weights ws ∝ 1, the
approximate one-sided p-value is 0.256. Thus we are unable to reject the null hypothesis that voucher
schools are completely without effect. We also found that the other two sets of weights, ws ∝ ns1 + ns2
and ws ∝ ns1ns2/(ns1 + ns2), lead to identical p-values of 0.292. In the absence of bias from
hidden confounders, the point estimate is τ̂ = 2.81 with a 95% confidence interval from -5.68 to
11.34.
For the match with greater internal validity with constant weights, the approximate one-sided p-
value is 0.492. Using non-constant weights, we find the approximate one-sided p-value is 0.633. If
there are no hidden confounders, the point estimate for the match with greater internal validity is
τ̂ = 0.0743 with a 95% confidence interval from -8.58 to 9.36. Thus for both matches, we cannot
reject the hypothesis that attending a private voucher school has no effect on test scores. We
next explore the possibility that bias from a hidden confounder masks a treatment effect.
5.4 Test of Equivalence and Sensitivity Analysis
In an observational study, one concern is that bias from a hidden covariate can give the impression
that a treatment effect exists when in fact no effect is present. Bias from hidden confounders can
also mask an actual treatment effect leaving the analyst to conclude there is no effect when in
fact such an effect exists. We explore this possibility using a test of equivalence and a sensitivity
analysis (Rosenbaum 2008; Rosenbaum and Silber 2009; Rosenbaum 2010).
Above we were unable to reject the null hypothesis that τ = 0 for all students. Next, we apply
a test of equivalence to test the hypothesis that τ is not small. Under a test of equivalence, we
test the null hypothesis $H^{(\delta)}_{\neq}: |\tau| > \delta$. Rejecting $H^{(\delta)}_{\neq}$ provides a basis for asserting
with confidence that $|\tau| < \delta$. $H^{(\delta)}_{\neq}$ is the union of two exclusive hypotheses,
$\overleftarrow{H}^{(\delta)}_0: \tau \leq -\delta$ and $\overrightarrow{H}^{(\delta)}_0: \tau \geq \delta$, and $H^{(\delta)}_{\neq}$ is rejected if both
$\overleftarrow{H}^{(\delta)}_0$ and $\overrightarrow{H}^{(\delta)}_0$ are rejected (Rosenbaum and
Silber 2009). We can apply the two tests without correction for multiple testing since we test
two mutually exclusive hypotheses. Thus we can test whether the estimate from our study is
different from other possible treatment effects which are represented by δ.
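The logic of rejecting both one-sided hypotheses can be sketched as follows. This is not the randomization-based test used in our analysis: as a stand-in we assume a point estimate with an approximately Normal sampling distribution and a known standard error, and all numbers are hypothetical.

```python
from scipy.stats import norm


def equivalence_test(estimate, std_error, delta, alpha=0.05):
    """Two one-sided tests of |tau| > delta under a Normal approximation."""
    # Left hypothesis tau <= -delta: evidence against it is a large estimate + delta.
    p_left = float(norm.sf((estimate + delta) / std_error))
    # Right hypothesis tau >= delta: evidence against it is a small estimate - delta.
    p_right = float(norm.cdf((estimate - delta) / std_error))
    # The union is rejected only if both parts are, so the larger p-value governs.
    p_equivalence = max(p_left, p_right)
    return p_left, p_right, p_equivalence, p_equivalence <= alpha


# Hypothetical estimate of 1.0 with standard error 2.0.
p_left_wide, p_right_wide, p_wide, reject_wide = equivalence_test(1.0, 2.0, delta=5.0)
p_left_tight, p_right_tight, p_tight, reject_tight = equivalence_test(1.0, 2.0, delta=3.0)
print(p_wide, reject_wide)    # wide margin: both one-sided hypotheses rejected
print(p_tight, reject_tight)  # tight margin: the right-hand hypothesis is not rejected
```

Taking the larger of the two p-values is what makes a separate multiplicity correction unnecessary: at most one of the two hypotheses can be true.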
With a test of equivalence, it is not possible to demonstrate a total absence of effect, but if
this were a randomized trial we could safely test that our estimated effect is not as large as δ.
That is, we may be able to reject $H^{(\delta)}_{\neq}: |\tau| > \delta$. In an observational study, however, there are
additional complications. Since the treatment was not randomly assigned, it may be the case
that we reject the null hypothesis of equivalence due to hidden confounding. However, using a
sensitivity analysis we may find evidence that the test of equivalence is insensitive to biases from
nonrandom treatment assignment.
In a sensitivity analysis, we quantify the degree to which a key assumption must be violated in
order for our inference to be reversed. Our model of treatment assignment assumes that within
matched pairs, receipt of the treatment is effectively random conditional on the matches. We
consider how sensitive our conclusions are to violations of this assumption using a model of
sensitivity analysis discussed in Rosenbaum (2002, ch. 4).
In our study, matching on observed covariates $x_{sji}$ made students more similar in their chances
of being exposed to the treatment. However, we may have failed to match on an important
unobserved covariate $u_{sji}$ such that $x_{sji} = x_{sji'}$ ∀ s, j, i, i′, but possibly $u_{sji} \neq u_{sji'}$. If true, the
probability of being exposed to treatment may not be constant within matched pairs. To explore
this possibility, we use a sensitivity analysis that imagines that before matching, student i in pair
s had a probability, $\pi_s$, of being exposed to the voucher school treatment. For two matched
students in pair s, say i and i′, because they have the same observed covariates $x_{sji} = x_{sji'}$ it
may be true that $\pi_s = \pi_{s'}$. However, if these two students differ in an unobserved covariate,
$u_{sji} \neq u_{sji'}$, then these two students may differ in their odds of being exposed to the voucher
school treatment by at most a factor of $\Gamma \geq 1$ such that
$$\frac{1}{\Gamma} \leq \frac{\pi_s/(1-\pi_s)}{\pi_{s'}/(1-\pi_{s'})} \leq \Gamma, \quad \forall\, s, s', \text{ with } x_{sji} = x_{sji'}\ \forall\, j, i, i'. \tag{4}$$
If Γ = 1, then πs = πs′ , and the randomization distribution for T is valid. If Γ > 1, then quantities
such as p-values and point estimates are unknown but are bounded by a known interval. In a
sensitivity analysis, we use several values of Γ to compute bounds on the p-value for the test
of equivalence. We then observe at which value of Γ the upper bound on the p-value exceeds
0.05. If the value of Γ is large, we can be confident that it would take a large bias from a hidden
confounder to reverse the conclusions of the study. The derivation for the sensitivity analysis as
applied to our test statistic T is in Hansen et al. (2014).
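For intuition, the upper bound on a one-sided p-value at a given Γ can be sketched with the large-sample approximation that is standard for paired sign-score statistics (Rosenbaum 2002, ch. 4). We assume here that this bound applies to T with the pair contrasts $Q_s$ held fixed; the exact derivation for clustered assignment is in Hansen et al. (2014) and is not reproduced. Under the bounding distribution each pair contributes $+|Q_s|$ with probability Γ/(1 + Γ) and $-|Q_s|$ otherwise. The data and function name are hypothetical.

```python
import math

import numpy as np
from scipy.stats import norm


def sensitivity_upper_p(Q, T_observed, gamma):
    """Approximate upper bound on the one-sided p-value for a bias of gamma."""
    abs_Q = np.abs(np.asarray(Q, dtype=float))
    p_plus = gamma / (1.0 + gamma)                 # largest chance of a positive sign
    mean = (2.0 * p_plus - 1.0) * abs_Q.sum()      # equals (gamma-1)/(gamma+1) * sum|Q_s|
    var = 4.0 * p_plus * (1.0 - p_plus) * np.sum(abs_Q ** 2)
    return float(norm.sf((T_observed - mean) / math.sqrt(var)))


# Hypothetical pair contrasts and observed statistic.
Q = [3.0, -3.0, 5.0]
T_observed = 11.0
p_gamma_1 = sensitivity_upper_p(Q, T_observed, gamma=1.0)
p_gamma_2 = sensitivity_upper_p(Q, T_observed, gamma=2.0)
print(p_gamma_1, p_gamma_2)

# Smallest gamma on a grid at which the bound exceeds 0.05.
grid = np.arange(1.0, 3.01, 0.01)
gamma_star = float(next(g for g in grid
                        if sensitivity_upper_p(Q, T_observed, g) > 0.05))
print(gamma_star)
```

At Γ = 1 the bound coincides with the Normal approximation to the randomization p-value, and it increases with Γ.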
Under a test of equivalence, we may be able to reject $H^{(\delta)}_{\neq}: |\tau| > \delta$ if the p-value from the test
is low. Rejecting this null allows us to infer that the estimated treatment effect is not as large
as δ. We then apply the sensitivity analysis to understand whether this inference is sensitive to
biases from nonrandom treatment assignment. In the analysis, we observe at what value of Γ the
p-value exceeds the conventional 0.05 threshold for each test. If this Γ value is relatively large,
we can be confident that the test of equivalence is not sensitive to hidden bias from nonrandom
treatment assignment.
Hansen et al. (2014) note that sensitivity to hidden bias may vary with the choice of weights ws.
To understand whether different weights lead to different sensitivities to hidden confounders, we
can conduct a different sensitivity analysis for each set of weights and correct these tests using
a Bonferroni correction. However, Rosenbaum (2012b) shows that the Bonferroni correction is
overly conservative when applied to sensitivity analysis. He develops an alternative multiple testing
correction based on correlations among the test statistics. Under this correction, for a given
value of Γ we conduct a sensitivity analysis using each set of weights. We then apply the multiple
testing correction from Rosenbaum (2012b) which produces a single corrected p-value for that
value of Γ.
5.5 How Much Bias Would Need to be Present to Mask a Positive Effect of Private Voucher Schools?
We now apply the test of equivalence to both matches. In this test, the null hypothesis asserts
$H^{(\delta)}_{\neq}: |\tau| > \delta$ for some specified δ > 0. Rejection of this null hypothesis provides evidence
that the effect of attending a private voucher school on test scores is less than δ. What values
should we select for δ? A number of studies in the literature have found that private voucher
schools increase test score achievement. The smallest effect size in the extant literature is 0.15
of a standard deviation (Sapelli and Vial 2002). However, among low income students the effects
may be as large as 0.5 of a standard deviation, and Sapelli and Vial (2005) find an effect size of 0.6
standard deviations. These results suggest a range of possible effects from 0.15 to 0.6 standard
deviations. To that end, we use three values for δ of 0.15, 0.30 and 0.6 standard deviations.
This allows us to test whether the point estimates in our study are equivalent to small, medium
or large voucher effects. Thus we define three values δ1, δ2, and δ3 to correspond to these three
different possible effect sizes.
We first ask whether the point estimate from the match with greater external validity is large
relative to effect sizes in the literature. Table 5 contains a summary of the test of equivalence
and sensitivity analysis for the match with greater external validity. We first assume that there
is no hidden bias, so that Γ = 1. We test $\overleftarrow{H}^{(\delta_1)}_0$ and find that the one-sided p-value
from this test is 0.008. We then test $\overrightarrow{H}^{(\delta_1)}_0$ and find that the one-sided p-value is 0.089.
Therefore we are unable to reject $H^{(\delta_1)}_{\neq}$ for the match with greater external validity. Thus our
point estimate from this match may be consistent with a small effect. For a larger effect size
of 0.30 standard deviations, however, we can reject $H^{(\delta_2)}_{\neq}$ with a p-value of 0.003. Thus the
estimated treatment effect is not consistent with a moderate effect size. Is this inference sensitive
to bias from a hidden confounder? We find that for Γ = 1.85, the p-value is 0.049. A bias of magnitude
Γ = 1.85 means that two matched students would have to differ in terms of an unobserved $u_{sji}$ such
that one student is almost twice as likely as the other to attend a private voucher school before
our conclusions would be altered. Finally, we test whether the point estimate is consistent with a
large effect size of 0.60 standard deviations. Again, we can reject $H^{(\delta_3)}_{\neq}$ with p < .001. With a
bias of magnitude Γ = 5.71 the p-value is 0.049. Therefore, it would take a very large bias for
our conclusions about a large treatment effect to be altered.
Table 5: Sensitivity Analysis Results With Different Weights and Corrections for Multiple Testing for the Externally Valid Match

Γ       H0 : |τ0| > δ1    H0 : |τ0| > δ2    H0 : |τ0| > δ3
1       0.089             0.003
In sum, for both matches, we either cannot reject that the estimated effect is as large as the smallest
effects found in previous studies, or the association could be easily explained by unobserved
confounding. Bias from an unobserved covariate would need to double the odds of selecting a
private voucher school to mask a moderate size effect of 0.30 standard deviations. To mask a
large effect size of 0.60 standard deviations, the bias from the unobserved confounders would have
to nearly quintuple the odds of differential treatment assignment in both matches.
Table 6: Sensitivity Analysis Results With Different Weights and Corrections for Multiple Testing for the Internally Valid Match

Γ       H0 : |τ0| > δ1    H0 : |τ0| > δ2    H0 : |τ0| > δ3
1       0.036             0.003             0.0015
1.1     0.048             0.005
for covariate balance may not only require mean balance, but also other forms of distributional
balance such as fine balance, x-fine balance, and strength-k matching. In practice, this method
facilitates sensitivity analyses to hidden biases due to unobserved covariates, and it readily extends
to clustered observational studies with three or more levels of data. To our knowledge, this method
is the first application of dynamic and integer programming to observational studies.
References
Anand, P., Mizala, A., and Repetto, A. (2009), “Using School Scholarships to Estimate the Effect
of Government Subsidized Private Education on Academic Achievement in Chile,” Economics
of Education Review, 28, 370–381.
Arpino, B. and Mealli, F. (2011), “The specification of the propensity score in multilevel obser-
vational studies,” Computational Statistics & Data Analysis, 55, 1770–1780.
Bellman, R. (1957), Dynamic Programming, Princeton, NJ: Princeton University Press.
Bertsekas, D. P. (2005), Dynamic Programming and Optimal Control, Vol. I, Belmont, MA:
Athena Scientific.
Feller, A. and Gelman, A. (2015), “Hierarchical Models for Causal Effects,” Working Paper.
Fisher, R. A. (1935), The Design of Experiments, London: Oliver and Boyd.
Gelman, A. (2006), “Multilevel (hierarchical) modeling: what it can and cannot do,” Techno-
metrics, 48, 432–435.
Greevy, R., Lu, B., Silber, J. H., and Rosenbaum, P. R. (2004), “Optimal Multivariate Matching
Before Randomization,” Biostatistics, 5, 263–275.
Hansen, B. B. (2007), “Flexible, Optimal Matching for Observational Studies,” R News, 7, 18–24.
Hansen, B. B., Rosenbaum, P. R., and Small, D. S. (2014), “Clustered Treatment Assignments
and Sensitivity to Unmeasured Biases in Observational Studies,” Journal of the American
Statistical Association, 109, 133–144.
Hodges, J. L. and Lehmann, E. (1963), “Estimates of Location Based on Ranks,” The Annals of
Mathematical Statistics, 34, 598–611.
Hong, G. and Raudenbush, S. W. (2006), “Evaluating Kindergarten Retention Policy: A Case
Study of Causal Inference for Multilevel Data,” Journal of the American Statistical Association,
101, 901–910.
Hsieh, C.-T. and Urquiola, M. (2006), “The Effects of Generalized School Choice on Achievement
and Stratification: Evidence from Chile’s voucher program,” Journal of Public Economics, 90,
1477–1503.
Hsu, J. Y., Zubizarreta, J. R., Small, D. S., and Rosenbaum, P. R. (2015), “Strong Control
of the Family-Wise Error Rate in Observational Studies that Discover Effect Modification by
Exploratory Methods,” Working Paper.
Lara, B., Mizala, A., and Repetto, A. (2011), “The Effectiveness of Private Voucher Education:
Evidence From Structural School Switches,” Educational Evaluation and Policy Analysis, 33,
119–137.
Lee, V. E. and Bryk, A. S. (1989), “A Multilevel Model of The Social Distribution of High School
Achievement,” Sociology of Education, 62, 172–192.
Li, F., Zaslavsky, A. M., and Landrum, M. B. (2013), “Propensity score weighting with multilevel
data,” Statistics in medicine, 32, 3373–3387.
Lu, B., Greevy, R., Xu, X., and Beck, C. (2011), “Optimal Nonbipartite Matching and its Statis-
tical Applications,” The American Statistician, 65, 21–30.
McEwan, P. J. (2001), “The Effectiveness of Public, Catholic, and Non-Religious Private Schools
in Chile’s Voucher System,” Education Economics, 9, 183–219.
MINEDUC (2013), “Estadísticas de la Educación,”
http://www.ministeriodesarrollosocial.gob.cl/encuesta-post-terremoto/index.html.
Mizala, A. and Romaguera, P. (2001), “Factors Explaining Secondary Education Outcomes in
Chile,” El Trimestre Economico, 272, 515–549.
Neyman, J. (1923), “On the Application of Probability Theory to Agricultural Experiments. Essay
on Principles. Section 9.” Statistical Science, 5, 465–472. Trans. Dorota M. Dabrowska and
Terence P. Speed (1990).
Rosenbaum, P. R. (1984), “The Consequences of Adjusting For a Concomitant Variable That
Has Been Affected By The Treatment,” Journal of The Royal Statistical Society Series A, 147,
656–666.
— (1987), “Sensitivity Analysis for Certain Permutation Inferences in Matched Observational
Studies,” Biometrika, 74, 13–26.
— (1989), “Optimal Matching for Observational Studies,” Journal of the American Statistical
Association, 84, 1024–1032.
— (2002), Observational Studies, New York, NY: Springer, 2nd ed.
— (2003), “Exact Confidence Intervals for Nonconstant Effects by Inverting the Signed Rank
Test,” The American Statistician, 57, 132–138.
— (2006), “Differential effects and generic biases in observational studies,” Biometrika, 93,
573–586.
— (2008), “Testing hypotheses in order,” Biometrika, 95, 248–252.
— (2010), Design of Observational Studies, New York: Springer-Verlag.
— (2012a), “Optimal Matching of an Optimally Chosen Subset in Observational Studies,” Journal
of Computational and Graphical Statistics, 21, 57–71.
— (2012b), “Testing One Hypothesis Twice in Observational Studies,” Biometrika, 99, 763–774.
Rosenbaum, P. R., Ross, R. N., and Silber, J. H. (2007), “Minimum Distance Matched Sampling
with Fine Balance in an Observational Study of Treatment for Ovarian Cancer,” Journal of the
American Statistical Association, 102, 75–83.
Rosenbaum, P. R. and Silber, J. H. (2009), “Sensitivity Analysis for Equivalence and Difference in
an Observational Study of Neonatal Intensive Care Units,” Journal of the American Statistical
Association, 104, 501–511.
Rubin, D. B. (1974), “Estimating Causal Effects of Treatments in Randomized and Nonrandomized
Studies,” Journal of Educational Psychology, 66, 688–701.
Sapelli, C. and Vial, B. (2002), “The Performance of Private and Public Schools in the Chilean
Voucher System,” Cuadernos De Economia, 39.
— (2005), “Private vs public voucher schools in Chile: New evidence on efficiency and peer
effects,” Working Paper 289, Catholic University of Chile, Instituto de Economia.
Small, D. S., Ten Have, T. R., and Rosenbaum, P. R. (2008), “Randomization Inference in a
Group-Randomized Trial of Treatments for Depression: Covariate Adjustment, Noncompliance, and
Quantile Effects,” Journal of the American Statistical Association, 103, 271–279.
Stuart, E. A. (2010), “Matching Methods for Causal Inference: A Review and a Look Forward,”
Statistical Science, 25, 1–21.
Zubizarreta, J. R. (2012), “Using Mixed Integer Programming for Matching in an Observational
Study of Kidney Failure after Surgery,” Journal of the American Statistical Association, 107,
1360–1371.
Zubizarreta, J. R., Paredes, R. D., and Rosenbaum, P. R. (2014), “Matching for Balance, Pairing
for Heterogeneity in an Observational Study of the Effectiveness of For-profit and Not-for-profit
High Schools in Chile,” Annals of Applied Statistics, 8, 204–231.
Zubizarreta, J. R., Reinke, C. E., Kelz, R. R., Silber, J. H., and Rosenbaum, P. R. (2011), “Match-
ing for Several Sparse Nominal Variables in a Case-Control Study of Readmission Following
Surgery,” The American Statistician, 65, 229–238.