Problems with and Solutions for Two-Dimensional Models of Continuous Dependent Variables
Ben Goodrich1
November 9, 2004
1Harvard University, Department of Government, Littauer Center (North Yard), 1875 Cambridge St., Cambridge, MA, 02163; email: [email protected]
Abstract
This paper addresses hierarchical models with continuous dependent variables, such as time-
series cross-section models. Building on the argument in Zorn (2001), the main point of this
paper is that the pooled OLS estimator is deeply flawed – especially for time-series cross-
section data – but for reasons that have not explicitly been raised in previous papers. The
pooled OLS estimator, the within estimator, the between estimator, and the random effects
estimator can be seen as special cases of the fractionally pooled estimator presented in Bartels
(1996), which allows all of these estimators to be evaluated in a common framework. On
both bias and efficiency grounds, using both the within estimator and the between estimator
is probably the best estimation strategy for almost all applications in political science.
1 Introduction
How should we specify linear regression models when the data vary across two dimensions,
such as space and time? It is impossible to say what should be done in all situations, but it
is easier to determine what should not be done. This paper identifies current practices that
should be avoided, and offers some alternatives that can be used.
The paper is purely methodological but is directed toward substantive researchers, refer-
ees, and editors. Although there is a lot of algebra and subtle notation, I tend to cite rather
than reproduce proofs and tend to use the “intuitive” technique of showing that familiar esti-
mators are in fact special cases of estimators that seem unreasonable. In essence, this paper
is just a synthesis of the ideas in Zorn (2001) and Bartels (1996) that formalizes some of the
methodological reservations many researchers “feel” when analyzing two-dimensional data.
I replicate some results from Green, Kim and Yoon (2001) in order to bolster its conclusion
given the criticisms from the other papers in a recent International Organization symposium
(see Oneal and Russett, 2001; Beck and Katz, 2001; King, 2001), but the implications reach
far beyond this particular example and this subfield of political science.
The stakes in this debate are great. As data have become easier to collect, models for
two-dimensional data have become more prevalent. The paradigmatic examples come from
comparative and international political economy, where each country (or country-pair) is ob-
served over time (usually years). But scholars of American politics utilize two-dimensional
models when counties are nested within states, justices within circuit courts, etc. Two-
dimensional models are also used in other disciplines – economic models where firms are
nested within industries, policy models where students are nested within schools, and soci-
ology models where individuals are nested within families to name a few.
I make a modification to the estimator described in Zorn (2001) that improves the stan-
dard errors without affecting the coefficient estimates. From there, placing weights on the
data and restrictions on the coefficients yields the fractionally pooled estimator developed in
Bartels (1996). Depending on how the data are weighted, the between estimator, the within
estimator, the random effects estimator, and the pooled OLS estimator can all be seen as
special cases of the fractionally pooled estimator. Thus, all of these estimators, plus a few
more, can be analyzed in a common framework.
The fractionally pooled estimator always requires a correction to the standard errors that
computers produce. Since all of the above estimators are special cases, they also require
corrections to the standard errors. These corrections are already familiar for the within
estimator and the between estimator, but the corrections to the standard errors of the random
effects estimator and pooled OLS estimator are novel, non-trivial, and strictly increasing.
The conclusion of this paper is that researchers should use within estimators and between
estimators when the data are two-dimensional. The restrictions that other estimators place
on the coefficients can cause bias, as demonstrated in Zorn (2001) and Green, Kim and Yoon
(2001). Given that the efficiency advantages of the random effects estimator and the pooled
OLS estimator are overstated because the uncorrected standard errors are understated, it is
unlikely that the true efficiency gains are worth the cost of bias. This finding is particularly
true when the data vary over time: Papers following the recommendations in Beck and Katz
(1996) are critically flawed unless a fixed effects specification is used.
2 Problems in the General Case
This section discusses problems with two-dimensional models of continuous dependent vari-
ables. The same problems, but not necessarily the same solutions, apply when the dependent
variable is discrete. Since the dependent variable is continuous, I focus on least squares es-
timators, but the same points apply if maximum likelihood or Markov Chain Monte Carlo
is used to estimate the model. After establishing some notation, this section demonstrates
how the common pooled OLS estimator can be constructed from component models. Seen
in this light, it is clear that alternative estimators are superior to pooled OLS.
Let i = 1, 2, . . . N index one dimension of the data and let j = 1, 2, . . . J index the other
dimension. For example, i could index states while j indexes counties or i could index
countries while j indexes time, but i and j can be anything that uniquely identifies a data
point. I will refer to the N “units of observation” in which the J “observations” are nested.
For convenience, I assume that the data are balanced, which implies that each of the N
units of observation has a complete set of J observations on all the independent variables
and the dependent variable. In most cases, the math could be modified slightly to account
for the problem of missing data, which is the norm in social science. Any unbalanced dataset
can be balanced using multiple imputation (see King et al., 2001; Little and Rubin, 2002).
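A sample mean ($\bar{x}$) and a unit mean ($\bar{x}_i$),
\[ \bar{x} = \frac{1}{NJ} \sum_{i=1}^{N} \sum_{j=1}^{J} x_{ij}, \tag{1} \]
\[ \bar{x}_i = \frac{1}{J} \sum_{j=1}^{J} x_{ij}, \tag{2} \]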
are different quantities. A unit mean can be calculated for each of the N units in the dataset
for any or all of the K covariates. In this paper, a column vector is indicated by lowercase
boldface lettering without subscripts. Thus, when we collect all the unit means on a covariate
into a vector, $\bar{\mathbf{x}}$, the length of this vector is often NJ rather than N because each unit mean
is copied J times. The context should indicate whether the length of $\bar{\mathbf{x}}$ is NJ or N.
For example, I define a “demeaned” variable to be the deviations from the unit means.
Thus, $\mathbf{x} - \bar{\mathbf{x}}$ is a demeaned variable, but in order to make the “meaned” variable, $\bar{\mathbf{x}}$,
conformable for subtraction with the two-dimensional variable, $\mathbf{x}$, each unit mean must be copied
J times so that the length of $\bar{\mathbf{x}}$ is NJ. I often use the tilde notation to indicate demeaned
vectors such that $\tilde{\mathbf{y}} = \mathbf{y} - \bar{\mathbf{y}}$ and similarly for matrices, as in $\tilde{\mathbf{X}} = \mathbf{X} - \bar{\mathbf{X}}$.
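The copying and demeaning steps can be illustrated with a minimal NumPy sketch. The array names (`x_bar_units`, `x_bar_long`, `x_tilde`) and the simulated data are assumptions made for illustration; they do not come from the paper.

```python
import numpy as np

N, J = 4, 3                       # N units, J observations per unit (balanced)
rng = np.random.default_rng(0)
x = rng.normal(size=(N, J))       # row i holds the J observations on unit i

grand_mean = x.mean()             # sample mean, equation 1
x_bar_units = x.mean(axis=1)      # unit means, equation 2 (length N)

# Copy each unit mean J times so it is conformable with the NJ-length data.
x_bar_long = np.repeat(x_bar_units, J)   # length NJ
x_long = x.reshape(N * J)                # unit 1's observations first, then unit 2's, ...

# Demeaned ("tilde") variable: deviations from the unit means.
x_tilde = x_long - x_bar_long
```

With balanced data the mean of the unit means equals the sample mean, and the demeaned values sum to zero within each unit.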
The least squares dummy variable estimator (LSDVE),
\[ y_{ij} = \alpha_i + \mathbf{x}_{ij}\beta_w + \varepsilon_{ij}, \tag{3} \]
simply includes a separate intercept (αi) for each of the N units of observation. The row
vector of observations on the K covariates is denoted by xij (lowercase boldface lettering with
subscripts). One important characteristic of a LSDVE is that the error term does not have
a unit-specific component, because the omitted variables that would otherwise constitute a
unit-specific error component are perfectly captured by the unit-specific intercept. There
are NJ data points in the LSDVE since there are J observations on each of N units. For
the rest of this paper, I assume the objective is to estimate causal effects. However, if the
goal is merely to predict the dependent variable, the LSDVE is optimal for this purpose.
Aside from a degrees of freedom correction discussed in section 3, the LSDVE is equivalent
to the other type of “fixed effects estimator”, the within estimator (W-E),
\[ (y_{ij} - \bar{y}_i) = 0 + (\mathbf{x}_{ij} - \bar{\mathbf{x}}_i)\beta_w + \varepsilon_{ij}, \tag{4} \]
which eliminates the unit-specific error component by demeaning all the variables. A variety
of assumptions are possible regarding the distribution of the error term in fixed effects models.
However, nothing important in this paper turns on what those assumptions are. βw from
the LSDVE is identical to βw from the W-E, and the w subscript denotes a “within” effect.1
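The equality of the two fixed effects point estimates can be checked numerically. The following is a sketch on simulated balanced data; the data-generating values and helper names (`D`, `demean`) are illustrative assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
N, J, K = 6, 5, 2
unit = np.repeat(np.arange(N), J)                       # unit index for each of the NJ rows
D = (unit[:, None] == np.arange(N)).astype(float)       # NJ x N matrix of unit dummies

alpha = rng.normal(size=N)                              # unit-specific intercepts
X = rng.normal(size=(N * J, K)) + alpha[unit][:, None]  # covariates correlated with alpha
y = alpha[unit] + X @ np.array([1.0, -0.5]) + rng.normal(scale=0.3, size=N * J)

# LSDVE (equation 3): one dummy per unit plus the K covariates.
coef_lsdv = np.linalg.lstsq(np.hstack([D, X]), y, rcond=None)[0]
beta_lsdv = coef_lsdv[N:]

# W-E (equation 4): demean everything by unit, then regress without an intercept.
def demean(a):
    return a - D @ (D.T @ a / J)

beta_within = np.linalg.lstsq(demean(X), demean(y), rcond=None)[0]
```

The two coefficient vectors agree to numerical precision; only the degrees of freedom behind the standard errors differ.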
The counterpart to the W-E is the between estimator (B-E),
\[ \bar{y}_i = \beta^{[0]} + \bar{\mathbf{x}}_i\beta_b + \varepsilon_i, \tag{5} \]
1For simplicity, I assume that all the estimated parameters are single points. Nothing in this paper (except space) precludes specifying a more complicated model with variable coefficients.
which posits that a meaned dependent variable is a linear function of an intercept ($\beta^{[0]}$) and
the meaned independent variables. The B-E has a total of N observations since there are
only N unit means. The b subscript on βb denotes a “between” effect.
It is more convenient to express the W-E and B-E in matrix form:
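\[ \underset{NJ \times 1}{\tilde{\mathbf{y}}} = \underset{NJ \times 1}{\mathbf{0}} + \underset{NJ \times K}{\tilde{\mathbf{X}}}\ \underset{K \times 1}{\beta_w} + \underset{NJ \times 1}{\boldsymbol{\varepsilon}}, \tag{6} \]
\[ \underset{N \times 1}{\bar{\mathbf{y}}} = \underset{N \times 1}{\beta^{[0]}} + \underset{N \times K}{\bar{\mathbf{X}}}\ \underset{K \times 1}{\beta_b} + \underset{N \times 1}{\bar{\boldsymbol{\varepsilon}}}. \tag{7} \]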
Since the number of rows for the W-E in equation 6 is NJ , and the number of rows for the B-E
in equation 7 is N , the two equations are not naturally conformable for addition. Most of
the criticisms in this paper can be traced back to this step where the meaned data are copied
J times in order to add equation 6 to equation 7:
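\[ \underset{NJ \times 1}{\bar{\mathbf{y}}} + \underset{NJ \times 1}{\tilde{\mathbf{y}}} = \underset{NJ \times 1}{\beta^{[0]}} + \underset{NJ \times 1}{\mathbf{0}} + \underset{NJ \times K}{\bar{\mathbf{X}}}\ \underset{K \times 1}{\beta_b} + \underset{NJ \times K}{\tilde{\mathbf{X}}}\ \underset{K \times 1}{\beta_w} + \underset{NJ \times 1}{\bar{\boldsymbol{\varepsilon}}} + \underset{NJ \times 1}{\boldsymbol{\varepsilon}}, \tag{8} \]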
which results in an equation that can be expressed at the observation level as:
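\[ \bar{y}_i + (y_{ij} - \bar{y}_i) = \beta^{[0]} + 0 + \bar{\mathbf{x}}_i\beta_b + (\mathbf{x}_{ij} - \bar{\mathbf{x}}_i)\beta_w + (\varepsilon_i + \varepsilon_{ij}), \tag{9a} \]
\[ y_{ij} = \beta^{[0]} + \bar{\mathbf{x}}_i\beta_b + (\mathbf{x}_{ij} - \bar{\mathbf{x}}_i)\beta_w + (\varepsilon_i + \varepsilon_{ij}). \tag{9b} \]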
Equation 9b is a special case of a general model that was introduced to political science by
Zorn (2001), although it has been derived in other disciplines (see Neuhaus and Kalbfleisch,
1998; Gould, 2001). I call equation 9b a simultaneous parsed model (SPM) – parsed be-
cause the effects of the covariates are split into their between and within components and
simultaneous because the between and within estimates are obtained at the same time.
As Zorn (2001) notes, $\bar{\mathbf{x}}_i$ is uncorrelated with $(\mathbf{x}_{ij} - \bar{\mathbf{x}}_i)$ in the SPM.2 Intuitively, meaned
variables only have between variance and demeaned variables only have within variance, so
the two vectors cannot covary. More formally, Rice (1995, p.180) proves that meaned and
demeaned variables are independent and thus orthogonal. Although meaned (demeaned)
variables covary with other meaned (demeaned) variables, the “cross-dimensional” cells of
the SPM’s variance-covariance matrix are zero, as can be seen from section 5’s example.
If two covariates are orthogonal, excluding one from the model does not affect the point
estimate of the other. Both the W-E and the B-E exclude variables from the SPM, but
the variables that the W-E and the B-E exclude are orthogonal to the variables that each
includes. Thus, β[0], βb, and βw from the SPM are identical to the point estimates that would
be obtained from applying the B-E and W-E separately, but the standard errors differ.3
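The orthogonality argument can be verified on simulated balanced data. This is a sketch under assumed data-generating values; the helper `ols` and all variable names are hypothetical and chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
N, J, K = 8, 6, 2
unit = np.repeat(np.arange(N), J)
D = (unit[:, None] == np.arange(N)).astype(float)      # NJ x N unit dummies

X = rng.normal(size=(N * J, K)) + rng.normal(size=(N, K))[unit]
y = 2.0 + X @ np.array([1.0, -0.5]) + rng.normal(size=N)[unit] + rng.normal(size=N * J)

X_bar, y_bar = D.T @ X / J, D.T @ y / J                # unit means (N rows)
X_bar_long = D @ X_bar                                 # unit means copied J times (NJ rows)
X_tilde, y_tilde = X - X_bar_long, y - D @ y_bar       # demeaned data (NJ rows)

def ols(A, b):
    """Least squares coefficients for b regressed on the columns of A."""
    return np.linalg.lstsq(A, b, rcond=None)[0]

b_between = ols(np.column_stack([np.ones(N), X_bar]), y_bar)   # B-E: intercept, beta_b
b_within = ols(X_tilde, y_tilde)                               # W-E: beta_w
b_spm = ols(np.column_stack([np.ones(N * J), X_bar_long, X_tilde]), y)  # SPM, equation 9b

cross = X_bar_long.T @ X_tilde   # "cross-dimensional" cross-products, all zero
```

The first K + 1 elements of `b_spm` match the B-E and the last K match the W-E, because `cross` is a matrix of zeros up to rounding error.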
Least squares is reasonable for the W-E and the B-E. Although the error term in the
W-E and/or the B-E may not be spherical, there are many well-known post hoc corrections
to the standard errors that are compatible with least squares point estimation. In particular,
panel-corrected standard errors (PCSEs, see Beck and Katz, 1995) and clustered standard
errors (see Arellano, 1987; Kristensen and Wawro, 2003) are popular fixes to the standard
errors of the W-E while White standard errors are popular in cross-sectional models, such
as the B-E, when the form of the heteroskedasticity is unknown.
Unfortunately, OLS standard errors are unreasonable for the SPM, and there is no at-
tractive fix. All J errors for each of the N units of observation have a common component,
εi, violating the independence assumption. Thus, there is “within-unit” correlation in the
errors, although this correlation is conceptually distinct from “autocorrelation”, since
autocorrelation implies that $E[\varepsilon_{ij} \times \varepsilon_{ij'}] \neq 0$ when $j \neq j'$ where j indexes time. But one result
from the time-series literature is applicable here: The damage autocorrelation does to the
2It is also true that εi is orthogonal to (εij − εi). Although the error term is unobserved, both εi and (εij − εi) are linear functions of variables that have no cross-dimensional covariance.
3The exactness of this claim depends on the data being balanced.
estimated standard errors depends on the degree of persistence in the independent variables.
Recall that in order to construct the SPM, the meaned observations needed to be copied J
times each. Since the meaned variables in a SPM do not vary within units, the “persistence”
of meaned variables is perfect, and the common component to the error term has a pernicious
impact on the estimated standard errors of between estimates. Although the point estimates
are the same, the standard errors from the B-E exceed the standard errors of the between
estimates in the SPM by a factor that is no less than $\sqrt{J}$. Thus, if J = 36, a t statistic of
about 12 would be necessary to make a between estimate in a SPM statistically significant.
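As a rough check on the figure of 12, assume a conventional two-sided 5% critical value of about 1.96 (an assumption; the text does not state which threshold is meant) and take the inflation factor at its stated lower bound of $\sqrt{J}$:
\[ 1.96 \times \sqrt{36} = 1.96 \times 6 = 11.76 \approx 12. \]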
PCSEs do not fix this problem with the SPM. PCSEs do fix unit heteroskedasticity and
correlation across units of observation, but only if there is no within-unit correlation in the
errors – a condition that does not hold if εi exists (see Kristensen and Wawro, 2003). What
Stata calls “robust clustered Huber/White/sandwich” standard errors “fix” this problem in
a very conservative way by creating one “super error” for each of the N units of observation.4
Thus, this correction sacrifices all but N degrees of freedom and, unlike PCSEs, does not
address the problem of correlation in the error term across units of observation.
Regardless, if one places the restrictions that βb = βw on the SPM, the result,
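\[ y_{ij} = \beta^{[0]} + \bar{\mathbf{x}}_i\beta_b + (\mathbf{x}_{ij} - \bar{\mathbf{x}}_i)\beta_w + (\varepsilon_i + \varepsilon_{ij}) = \beta^{[0]} + \mathbf{x}_{ij}\beta_{POLSE} + (\varepsilon_i + \varepsilon_{ij}), \tag{10} \]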
is the familiar pooled OLS estimator (POLSE), which is the most common estimator for
two-dimensional data in political science. However, the POLSE is nested within the SPM,
and the primary point in Zorn (2001) is that anyone who intends to use a pooled model
should first verify whether the restrictions that βb = βw are valid using a SPM. What is not
mentioned in Zorn (2001) is that the OLS estimate for σ2 is biased in the SPM due to the
4Robust clustered Huber/White/sandwich standard errors are distinct from what Kristensen and Wawro (2003) calls clustered standard errors following Arellano (1987). The cluster() option following the reg command in Stata produces the former while the cluster() option following areg produces the latter.
presence of εi. Thus, the F test and other ways to check whether βb = βw in the SPM are
biased against the restrictions. This problem with the SPM is overcome in section 3.
The existence of εi in a POLSE is what Green, Kim and Yoon (2001) calls “unit hetero-
geneity”, which is also discussed in Wilson and Butler (2003). One way of addressing the
problem of unit heterogeneity that is discussed briefly in those two papers is the random
effects estimator (REE). The REE,
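\[ y^{*}_{ij} = \beta^{[0]*} + \mathbf{x}^{*}_{ij}\beta_{REE} + (\varepsilon_i + \varepsilon_{ij})^{*}, \tag{11} \]
is just a POLSE with variables that have undergone a GLS transformation. In equation 11,
$y^{*}_{ij} = y_{ij} - \theta\bar{y}_i$, $\mathbf{x}^{*}_{ij} = \mathbf{x}_{ij} - \theta\bar{\mathbf{x}}_i$, $\beta^{[0]*} = (1 - \theta)\beta^{[0]}$, and $(\varepsilon_i + \varepsilon_{ij})^{*} = (\varepsilon_i + \varepsilon_{ij}) - \theta\varepsilon_i$, where
\[ \theta = 1 - \sqrt{\frac{\hat{\sigma}^2_{W\text{-}E}}{J\hat{\sigma}^2_{B\text{-}E}}}. \tag{12} \]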
The quasi-differencing parameter θ is a function of the estimated error variance in the
W-E and B-E. Clearly, if θ = 1, the REE reduces to a W-E, and if θ = 0, the REE reduces
to a POLSE. Greene (2000, p.569) notes that “[t]o the extent that [θ > 0], we see that
the inefficiency of least squares will follow from an inefficient weighting of the [between and
within] estimators. Compared with generalized least squares, ordinary least squares places
too much weight on the between-units variation. It includes it all in the variation in X,
rather than apportioning some of it to random variation across groups.”
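The role of θ can be sketched in a few lines. The function names `theta` and `quasi_demean` are hypothetical, and the variance inputs are assumed to be the estimated error variances from a W-E and a B-E fitted beforehand.

```python
import numpy as np

def theta(sigma2_within, sigma2_between, J):
    """Quasi-differencing parameter of equation 12."""
    return 1.0 - np.sqrt(sigma2_within / (J * sigma2_between))

def quasi_demean(v, v_bar_long, th):
    """GLS transformation of equation 11: subtract th times the copied unit means."""
    return v - th * v_bar_long

# Illustrative values only: one unit with J = 2 observations.
v = np.array([1.0, 3.0])
v_bar_long = np.array([2.0, 2.0])

pooled = quasi_demean(v, v_bar_long, 0.0)   # theta = 0 leaves the data as in a POLSE
within = quasi_demean(v, v_bar_long, 1.0)   # theta = 1 fully demeans, as in a W-E
```

Intermediate values of θ remove only part of each unit mean, which is how the REE sits between the two extremes.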
Greene’s claim will be given more force in section 3, but it should be kept in mind that the
REE, like the POLSE, imposes the restrictions that βb = βw. If these restrictions are invalid,
the REE and the POLSE yield biased estimates because the parameter constancy assumption
is violated. The obvious question is, “Why would the between effect of explanatory variable
k not be equal to the within effect of k?”. However, a better question is “Why not estimate
a more general model and check?” Zorn (2001) provides several examples where one might
expect between and within effects to differ, either in sign or magnitude. To me, the most
intuitive example comes from Gould (2001): Suppose the dependent variable is a sample of
Americans’ wages over time, and the independent variables are regional dummies with the
northeast excluded. Wages in southern states are lower than in the northeast, on average,
and the between-unit effect of the SOUTH dummy variable is expected to be negative.
However, if people move to the south from the northeast, they are likely taking better-
paying jobs. Thus, the within-unit effect of the SOUTH dummy variable is expected to be
positive. In other cases, the expected sign of the between and within effects is the same but
the magnitudes may be significantly different.
In many cases, theory will suggest that βb should equal βw, but theory is not an excuse
for failing to verify their equality. Although it is neither possible nor necessary to gather
data on all relevant variables, we should always admit the possibility that a specification
error could drive a wedge between βb and βw for at least one of the explanatory variables,
which may be the key causal variable(s) or a control variable that is correlated with the key
causal variable(s). Although the SPM does not permit a fair test of whether βb = βw, I
modify the SPM in section 3 to avoid the problem of duplicative error components and to
facilitate the evaluation of these restrictions.
3 The Fractionally Pooled Estimator
To reemphasize, although the standard errors from the SPM are wrong unless εi = 0 ∀i,
the POLSE is just a SPM with the restrictions that βb = βw. Hence, the POLSE generally
has the wrong standard errors and adds the additional risk that βb may not equal βw.
The root of the problem with the standard errors is that the meaned data – for both the
dependent variable and the independent variables – are copied in the SPM, which creates
the common components in the error term. Thus, instead of adding the demeaned data
to the meaned data as in equation 8, we could stack the NJ demeaned observations and
the N meaned observations using properties of partitioned matrices. This move is a two-
dimensional extension of Bartels (1996), which addresses the “pooling” of data generally. In
Bartels’ language, there are two “regimes” or data-generating processes, one “within-unit
regime” (with NJ data points) and one “between-unit regime” (with N data points).
The main point in Bartels (1996) is that the regimes can be preferentially weighted before
imposing pooling restrictions on the coefficients. Let w and b be known scalars between zero
and one inclusive that serve as weights for the within and between regimes respectively. Let
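\[
\mathbf{Y} = \begin{bmatrix} w\,\underset{NJ \times 1}{\tilde{\mathbf{y}}} \\ b\,\underset{N \times 1}{\bar{\mathbf{y}}} \end{bmatrix},\quad
\mathbf{A} = \begin{bmatrix} \underset{NJ \times 1}{\mathbf{0}} \\ b\,\underset{N \times 1}{\mathbf{1}} \end{bmatrix},\quad
\mathbf{B} = \begin{bmatrix} \underset{NJ \times K}{\mathbf{0}} \\ b\,\underset{N \times K}{\bar{\mathbf{X}}} \end{bmatrix},\quad
\mathbf{W} = \begin{bmatrix} w\,\underset{NJ \times K}{\tilde{\mathbf{X}}} \\ \underset{N \times K}{\mathbf{0}} \end{bmatrix},\quad
\mathbf{E} = \begin{bmatrix} w\,\underset{NJ \times 1}{\boldsymbol{\varepsilon}} \\ b\,\underset{N \times 1}{\bar{\boldsymbol{\varepsilon}}} \end{bmatrix},
\]
and we can consider the properties of the following two regression models:
\[ \mathbf{Y} = \beta^{[0]}\mathbf{A} + \mathbf{B}\beta_b + \mathbf{W}\beta_w + \mathbf{E}, \tag{13} \]
\[ \mathbf{Y} = \beta^{[0]}\mathbf{A} + (\mathbf{B} + \mathbf{W})\beta_{FPE} + \mathbf{E}. \tag{14} \]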
Equation 13 is also a simultaneous parsed model, which I call the SPM2 to distinguish it
from the SPM given in equation 9b. The SPM2 produces point estimates that are identical
to those of the SPM but standard errors that are more appropriate. In the SPM2, each
of the meaned data points appears only once, so there are no common components in the
error term and no duplicative data. Thus, the SPM as formulated in Zorn (2001) should not
be used when the dependent variable is continuous, but all the conceptual points in Zorn
(2001) still apply to the SPM2. Equation 14 imposes the pooling restrictions that βb = βw
on equation 13 and is an example of the “fractionally pooled estimator” (FPE) developed in
Bartels (1996) since w and b are allowed to take any values between zero and one inclusive.
In this formulation, the number of demeaned variables (K) equals the number of meaned
variables, but this need not be the case. Often, there will be some variables that vary across
units but not within units. Such variables can be included in the “between” matrix (B) but
not in the “within” matrix (W). Less often (and only when the data are balanced), there
may be some covariates that vary within units but do not vary across units, which can be
included in W but not in B. Any problems with matrix conformability in the FPE can be
avoided by adding columns of zeroes to W or B as appropriate.
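The stacked construction in equations 13 and 14 can be sketched as follows on simulated balanced data. The function names `fpe` and `spm2`, the helper `ols`, and the data-generating values are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
N, J, K = 10, 4, 2
unit = np.repeat(np.arange(N), J)
D = (unit[:, None] == np.arange(N)).astype(float)
X = rng.normal(size=(N * J, K)) + rng.normal(size=(N, K))[unit]
y = 1.0 + X @ np.array([0.8, -0.3]) + rng.normal(size=N)[unit] + rng.normal(size=N * J)

X_bar, y_bar = D.T @ X / J, D.T @ y / J            # N meaned observations
X_tilde, y_tilde = X - D @ X_bar, y - D @ y_bar    # NJ demeaned observations

def ols(A, b):
    return np.linalg.lstsq(A, b, rcond=None)[0]

def stack(w, b):
    """Build Y, A, B, W by stacking the within regime on top of the between regime."""
    Y = np.concatenate([w * y_tilde, b * y_bar])
    A = np.concatenate([np.zeros(N * J), b * np.ones(N)])
    B = np.vstack([np.zeros((N * J, K)), b * X_bar])
    W = np.vstack([w * X_tilde, np.zeros((N, K))])
    return Y, A, B, W

def spm2(w=1.0, b=1.0):
    Y, A, B, W = stack(w, b)
    return ols(np.column_stack([A, B, W]), Y)      # equation 13: intercept, beta_b, beta_w

def fpe(w, b):
    Y, A, B, W = stack(w, b)
    return ols(np.column_stack([A, B + W]), Y)     # equation 14: intercept, beta_FPE

beta_pols = ols(np.column_stack([np.ones(N * J), X]), y)   # pooled OLS on the raw data
beta_fpe = fpe(w=J ** -0.5, b=1.0)                         # weights discussed later for the POLSE
```

In this sketch, `fpe(J ** -0.5, 1.0)` reproduces the pooled OLS coefficients on balanced data, and `spm2()` returns the B-E estimates followed by the W-E estimates. Only point estimates are computed here; the standard error corrections are a separate step.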
There are two important points regarding the standard errors of the SPM and the FPE.
First, as Bartels (1996, note 4) and others have pointed out, even if the point estimates
are allowed to differ across regimes, the SPM2 still makes the assumption that the error
variance is the same in the within regime as in the between regime, which is dubious. This
homoskedasticity assumption can be relaxed in the SPM2 by weighting the regimes appro-
priately or by using GLS. It is not immediately obvious how something like PCSEs would
work for a SPM2 because different fixes to the error term are needed for the two regimes.
Second, Bartels (1996) warns that if the FPE is used, the standard errors the computer
reports need to be scaled by a correction factor. Intuitively, if an observation is down-
weighted, it cannot count as a whole degree of freedom. For all least squares estimators, we
make degrees of freedom corrections to the standard errors by multiplying the standard errors
by $\sqrt{\frac{\text{Wrong \# DF}}{\text{Right \# DF}}}$. The correction factor in the two-dimensional case is $\sqrt{\frac{NJ+N-K}{wNJ+bN-wN-K}}$,
which differs slightly from the correction factor in Bartels (1996).5 Depending on the values
of w and b, the magnitude of this degrees of freedom correction will change but will always
increase the standard errors. One main result of this paper is that, depending on the values
of w and b, all the estimators discussed in this paper relate to the FPE, as shown in table 1.
5The correction factor given in Bartels (1996) is essentially $\sqrt{\frac{NJ+N-K}{\lambda NJ+(1)N-K}}$, where λ corresponds to the weight between zero and one inclusive. I generalize this correction factor to allow either the within or between regime to be downweighted, which would merely restate Bartels’ correction factor as $\sqrt{\frac{NJ+N-K}{wNJ+bN-K}}$. However, in the econometrics literature, N degrees of freedom are subtracted when using a W-E, but if the within regime is downweighted this correction should be mitigated. Thus, the correction factor used in this paper is $\sqrt{\frac{NJ+N-K}{wNJ+bN-wN-K}}$, which behaves properly in the extreme cases that w = 0 or b = 0.
Table 1: Special cases and generalizations of the fractionally pooled estimator (FPE)

| Weight w | Weight b | Special Cases of the FPE | Correction Factor for σFPE | Unrestricted Analogue to the FPE |
| 1 | 0 | Within Estimator | $\sqrt{\frac{NJ+N-K}{NJ-N-K}}$ | Unrestricted because b = 0 |
| 0 | 1 | Between Estimator | $\sqrt{\frac{NJ+N-K}{N-K}}$ | Unrestricted because w = 0 |
| ? | ? | Random Effects Estimator | $\sqrt{\frac{NJ+N-K}{wNJ+bN-wN-K}}$ | Consecutive Parsed Estimator |
| $\frac{1}{\sqrt{J}}$ | 1 | Pooled OLS Estimator | $\sqrt{\frac{NJ+N-K}{NJ^{1/2}+N\left(1-J^{-1/2}\right)-K}}$ | Simultaneous Parsed Model |
| 1 | 1 | Pooled OLS Estimator 2 | $\sqrt{\frac{NJ+N-K}{NJ-K}}$ | Simultaneous Parsed Model 2 |

Notes: The general form of the correction factor for σFPE (and thus for the standard errors) is $\sqrt{\frac{NJ+N-K}{wNJ+bN-wN-K}}$. The weights for the pooled OLS estimator assume the data are balanced. The consecutive parsed estimator (CPE) and the pooled OLS estimator 2 (POLSE2) are discussed below.
If all the weight is placed on the within regime by specifying that w = 1 and b = 0, the
FPE reduces to a W-E, and $\sqrt{\frac{NJ+N-K}{wNJ+bN-wN-K}}$ reduces to $\sqrt{\frac{NJ+N-K}{NJ-N-K}}$, where the denominator
reflects the textbook degrees of freedom correction for a W-E (see, for example, Greene, 2000,
p.562). If w = 0 and b = 1, all the weight is placed on the between regime, and the FPE
reduces to a B-E while $\sqrt{\frac{NJ+N-K}{wNJ+bN-wN-K}}$ reduces to $\sqrt{\frac{NJ+N-K}{N-K}}$, which again produces the
correct number of degrees of freedom for a B-E because there are only N unit means.
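The general correction factor and its special cases can be computed directly. The function name `fpe_correction` and the example values N = 20, J = 36, K = 3 are illustrative assumptions.

```python
import math

def fpe_correction(N, J, K, w, b):
    """Multiplier for reported standard errors: sqrt(wrong DF / right DF)."""
    wrong_df = N * J + N - K                    # DF the software assumes for NJ + N stacked rows
    right_df = w * N * J + b * N - w * N - K    # DF after accounting for the weights
    if right_df <= 0:
        raise ValueError("weights leave no degrees of freedom")
    return math.sqrt(wrong_df / right_df)

N, J, K = 20, 36, 3
within = fpe_correction(N, J, K, w=1.0, b=0.0)           # denominator NJ - N - K
between = fpe_correction(N, J, K, w=0.0, b=1.0)          # denominator N - K
polse2 = fpe_correction(N, J, K, w=1.0, b=1.0)           # denominator NJ - K
pooled = fpe_correction(N, J, K, w=J ** -0.5, b=1.0)     # weights that emulate pooled OLS
```

Each special case reproduces the corresponding denominator in table 1, and every factor exceeds one for these values.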
The FPE also encompasses the REE but apparently not in a closed form fashion. Recall
that in the REE, $\theta = 1 - \sqrt{\frac{\hat{\sigma}^2_{W\text{-}E}}{J\hat{\sigma}^2_{B\text{-}E}}}$, which allows the REE to compromise between a W-E and
a POLSE. In the extreme case that θ = 1, the REE reduces to a W-E, so the corresponding
weights for the FPE are w = 1 and b = 0. In the other extreme case that θ = 0, the REE
reduces to a POLSE, which corresponds to a FPE with weights of $w = J^{-1/2}$ and b = 1. For
intermediate values of θ, there should be unique values of w and b that produce the same
point estimates as a REE, but to find them, a computer would need to search numerically
over the intervals $w \in \left[\frac{1}{\sqrt{J}}, 1\right]$ and $b \in [0, 1]$ to minimize $\sum_{k=0}^{K} \mathrm{abs}\left(\beta^{[k]}_{REE} - \beta^{[k]}_{FPE} \mid w, b\right)$.
Perhaps the main significance of this finding is that, while the GLS standard errors
produced by the REE are only valid asymptotically, the FPE emulation would lend itself to
a finite sample correction if the critical values of w and b can be found. This finding has
intuition behind it: The REE is a compromise between the W-E and the POLSE; the W-E
has fewer degrees of freedom than a POLSE; so how is the usual practice of using NJ −K
degrees of freedom for both a REE and a POLSE justified if θ > 0?
Moreover, the POLSE is a compromise between the W-E and the B-E; both the W-E
and the B-E have fewer than NJ −K degrees of freedom; so why should the POLSE have
NJ − K degrees of freedom? The FPE formalizes this intuition because it reduces to a
POLSE if $w = J^{-1/2}$ and b = 1. The exactness of this conclusion depends on the data being
balanced, but when the data are unbalanced, it is possible to approximate a POLSE by
minimizing $\sum_{k=0}^{K} \mathrm{abs}\left(\beta^{[k]}_{POLSE} - \beta^{[k]}_{FPE} \mid w, b\right)$. The correction factor for the standard errors
of the FPE is $\sqrt{\frac{NJ+N-K}{NJ^{1/2}+N\left(1-J^{-1/2}\right)-K}}$. Hence, the correct standard errors in this FPE can be
much larger than the standard errors that the computer produces for the POLSE. We should
be suspicious of any published result that utilizes a POLSE if the t statistic fails to exceed
conventional levels when divided by $\sqrt{\frac{NJ-K}{NJ^{1/2}+N\left(1-J^{-1/2}\right)-K}}$ (assuming balanced data).
Also, the FPE exposes the fact that the POLSE implicitly weights the data in a non-substantive
fashion, which could have been anticipated given the derivation in equation 10. The POLSE
is a SPM with the restrictions that βb = βw, and the SPM copies the meaned data J times.
Thus, the POLSE weights the between variance relative to the within variance using a $1 : J^{-1}$
scheme. In the equivalent FPE, we use $w = J^{-1/2}$ rather than $w = J^{-1}$ because the w term
is squared when least squares is utilized (see Bartels, 1996, equation 20), which implies that
the weighting of the between variance relative to the within variance in this FPE ultimately
follows a $1 : J^{-1}$ scheme as in the POLSE. Thus, we should also be suspicious that published
results from a POLSE are driven by the arbitrary implicit weights. If βb ≠ βw, then the
implicit weights have a major effect on βPOLSE, which is a weighted average of βb and βw.
The SPM is the unrestricted version of the POLSE, but these estimators are strictly
dominated by the SPM2 and the POLSE2 respectively. The POLSE2 is a special case of
the FPE where w = b = 1, which is what the POLSE masquerades as. The correction
factor for the POLSE2 reduces to $\sqrt{\frac{NJ+N-K}{NJ-K}}$, whose denominator reflects the usual degrees
of freedom for a POLSE. Hence, the usual degrees of freedom for a POLSE assume non-
preferential weighting even though the POLSE implicitly downweights the within variance.
The best way to evaluate the restrictions that βb = βw is to find the smaller of the two
Bayesian Information Criteria (BIC) between the SPM2 and the POLSE2.6 There are many
different formulations of the BIC. In Raftery (1995, equation 26),
\[ \mathrm{BIC}' = (NJ + N)\ln\left(1 - R^2\right) + (K - 1)\ln(NJ + N), \tag{15} \]
and $R^2$ is the proportion of explained variance. This BIC can be used to approximate the
odds in favor of the SPM2 over the POLSE2 using the formula $\exp\left(\frac{\mathrm{BIC}'_{POLSE2} - \mathrm{BIC}'_{SPM2}}{2}\right)$.
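Equation 15 and the odds formula translate into two short functions. This is a sketch only: the function names are hypothetical, `k` stands for whatever value of K enters equation 15 for the model in question, and the R-squared values in the usage lines are made-up inputs rather than results.

```python
import math

def bic_prime(r_squared, n_rows, k):
    """BIC' as written in equation 15, with n_rows = NJ + N stacked data points."""
    return n_rows * math.log(1.0 - r_squared) + (k - 1) * math.log(n_rows)

def odds_spm2_over_polse2(bic_polse2, bic_spm2):
    """Approximate odds in favor of the SPM2 over the POLSE2."""
    return math.exp((bic_polse2 - bic_spm2) / 2.0)

# Hypothetical inputs for illustration: N = 20, J = 36, so NJ + N = 740 rows.
n_rows = 20 * 36 + 20
bic_spm2 = bic_prime(r_squared=0.42, n_rows=n_rows, k=6)
bic_polse2 = bic_prime(r_squared=0.40, n_rows=n_rows, k=3)
odds = odds_spm2_over_polse2(bic_polse2, bic_spm2)
```

Odds above one favor the SPM2 (the restrictions look doubtful), odds below one favor the POLSE2, and equal BIC values give odds of exactly one.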
Another important conclusion is that using the BIC to choose between a POLSE and a
LSDVE is flawed. First, the POLSE has arbitrary implicit weights that cannot be defended
on substantive grounds. Second, equation 15 assumes the errors are normally distributed,
which does not hold if there is unit-heterogeneity in the POLSE. Third, Bartels (1996, note 9)
claims that the R2 of the FPE is too large unless the preferential weighting is accounted for,
which implies that the BIC for the POLSE is biased. The POLSE2 does not preferentially
weight the data and can properly be compared with the SPM2 using the BIC.
The only argument for the POLSE over the LSDVE is that the POLSE exploits all the
variation in the data, while the LSDVE really only uses the within variation. This point can
be nullified by separately utilizing a W-E and a B-E, which I collectively call the consecutive
parsed estimator (CPE). Thus, the CPE makes use of all the within variance in the data
6Bartels (1996) has valid criticisms of testing restrictions, which can be partially mitigated by using the BIC instead of a hypothesis test. Furthermore, Gould (2001) notes that the Hausman test to discriminate between fixed and random effects is asymptotically equivalent to a F test of restrictions that βb = βw in a SPM. But the F test is biased in the SPM due to the duplicative error components.
and all of the between variance in the data, but does not attempt to synthesize the results
statistically (but does not preclude the researcher from synthesizing the results analytically).7
The point estimates from the SPM2 are identical to those produced by the CPE, and the
CPE produces better standard errors than the SPM2. Thus, if the SPM2 casts doubt on the
restrictions that βb = βw, the CPE should be used for the following reasons.
The CPE automatically relaxes the homoskedasticity assumption in the SPM2 that the
error variance for the within-unit regime is equal to the error variance for the between-
unit regime. The W-E and the B-E may suffer from other violations of the spherical error
assumption, but it is easy to calculate PCSEs or clustered standard errors following the W-E
and equally easy to calculate White standard errors following the B-E. As yet, we do not
know how to make such corrections for a SPM2.
Misspecification along the between (within) dimension in a SPM2 increases the standard
errors of all estimates. If the B-E is estimated separately from the W-E, omitting a relevant
variable from the B-E will affect the standard errors of the between-unit estimates only (and
vice versa). Finally, the W-E and the B-E individually estimate fewer parameters than does
the SPM2. For these reasons, the CPE has a small efficiency advantage over the SPM2,
although this efficiency advantage might not be apparent from the output if the standard
errors in the SPM2 reflect the dubious regime-homoskedasticity assumption.
The conclusion that researchers should use the CPE when the data are two-dimensional
may seem too orthodox. Many believe that only a POLSE or a REE can estimate the effect
of variables that do not vary within units of observation. However, the B-E half of the
CPE is optimal for this purpose. A B-E can produce good standard errors, and its point
estimates are not biased by the potentially invalid equality restrictions the POLSE and the
REE impose on the coefficients of two-dimensional variables. Variables that do not change
within units of observation can only explain between-unit variance in the dependent variable,
so nothing is “lost” when the B-E is used to estimate the effects of such variables. What
appears to be lost is some precision in the standard errors, but the apparent precision of the
POLSE and the REE is merely a reflection of incorrect standard errors.
7In table 1, I claimed that the CPE was an unrestricted version of the REE. This claim is somewhat tortuous. The REE does use a CPE to calculate θ and, after transforming the data, imposes the restrictions that βb = βw. The GLS transformation muddles the connection between the CPE and the REE.
But granted that the CPE approach is nevertheless somewhat orthodox, there are three
possibilities for compromise, although I am not especially well-disposed toward any of them.
First, Bartels (1996) gives a Bayesian justification for fractional pooling where the unbiased
regime (the within regime in this case) receives a weight of unity and the other regime is
downweighted. The catch is that an explicit and substantive justification of the weighting
scheme must be given, but setting w = 1 and b ∈ (0, 1) is a potentially plausible course of
action when the amount of within variation is small. Ironically, many researchers – claiming
that there is “too little within variation in the data” – resort to a POLSE, which implicitly
downweights the scarce within variation. Moreover, the reason that the POLSE produces
small standard errors in this situation is that the standard errors are not corrected for
unit heterogeneity and implicit weighting.
A second compromise is to impose the restrictions that the between effect of a covariate
is equal to the corresponding within effect for some, but not all, of the covariates. Zorn
(2001) contemplates this “partial pooling” compromise, although the SPM2 should be used
to evaluate the restrictions rather than the SPM. It would be difficult to justify restricting
the between and within effects of a control variable to be equal, because doing so would risk
bias to the key causal variable while only saving one degree of freedom for each restriction
imposed. However, restricting the between and within effects of the key causal variable to
be equal would increase the precision for the key estimate and, depending on the evidence
from the SPM2, may not cause too much bias. Fortunately, the matter can (and should)
always be resolved in a data-driven way rather than assumed.
At present, the partial pooling route unfortunately does not lend itself to PCSEs or
other post hoc corrections to the standard errors. The third possibility for compromise is
to average the within effect of the key causal variable (k) with the corresponding between
effect using the CPE and the textbook formulas for the sum of two random variables:
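$$\beta^{[k]} = \frac{1}{2} \times \left( \beta^{[k]}_{w} + \beta^{[k]}_{b} \right), \tag{16}$$
$$\mathrm{SE}\left(\beta^{[k]}\right) = \frac{1}{2} \times \sqrt{\left[\mathrm{SE}\left(\beta^{[k]}_{w}\right)\right]^{2} + \left[\mathrm{SE}\left(\beta^{[k]}_{b}\right)\right]^{2} + 2 \times \mathrm{Cov}\left(\beta^{[k]}_{w}, \beta^{[k]}_{b}\right)}. \tag{17}$$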
Since a demeaned variable has no covariance with a meaned variable, the last term under the
radical in equation 17 drops out and information from non-nested models can be averaged.
One virtue of this approach is that the standard errors of the within and between estimates
can be fixed with any of the well-known post hoc corrections before equation 17 is used.
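As a rough illustration of equations 16 and 17, the following Python sketch averages a within estimate and a between estimate. The function name and its arguments are hypothetical, and the inputs are simply the GDP estimates and standard errors from the W-E and B-E columns of table 2, used here only as convenient numbers. The covariance term defaults to zero, which is the case described above.

```python
import math


def average_within_between(beta_w, se_w, beta_b, se_b, cov_wb=0.0):
    """Equal-weight average of a within and a between estimate.

    Follows equations 16 and 17: the point estimate is the simple mean,
    and the standard error uses the textbook variance of a sum.
    cov_wb is the covariance between the two estimates (zero when the
    demeaned and meaned regressors are orthogonal).
    """
    beta_avg = 0.5 * (beta_w + beta_b)
    se_avg = 0.5 * math.sqrt(se_w ** 2 + se_b ** 2 + 2.0 * cov_wb)
    return beta_avg, se_avg


# Illustrative inputs only: GDP row, columns (8) and (9) of table 2.
beta_avg, se_avg = average_within_between(0.342, 0.013, 1.534, 0.050)
print(round(beta_avg, 3), round(se_avg, 4))
```

Any post hoc correction to the two input standard errors would be applied before calling the function, since the formula only combines whatever standard errors it is given.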
However, how should we interpret an averaged estimate? All estimators that are used
for two-dimensional data either yield between estimates, within estimates, or some matrix-
weighted average of between and within estimates. Between and within estimates have
clear, albeit different, interpretations. A between effect reflects the expected difference in
the dependent variable when two units of observation only differ on one independent variable.
A within effect reflects the expected change in a unit’s dependent variable when one of its
independent variables changes. But there is no substantive interpretation for a matrix-
weighted average of between and within estimates except in the limiting case that βb = βw,
making both interpretations valid. This point applies to equation 16 just as much as it
applies to the POLSE, REE, and POLSE2.
This section has provided a framework to answer three questions that should always be
asked when analyzing two-dimensional data. First, are the restrictions on the coefficients,
if any, valid? Second, given the restrictions, are the implicit or explicit weights on the data
appropriate? Third, given the weights, are the degrees of freedom correctly calculated? The
answer to the first question is often negative, implying that a CPE should be used.
4 The Time-series-Cross-section Case
The previous sections discussed issues that arise with all two-dimensional datasets. This sec-
tion focuses on additional problems that occur in the special case where the second dimension
is time, which is the most common type of two-dimensional dataset in political science. Let
i = 1, 2, . . . N continue to index the units of observation, which are usually countries or pairs
of countries in the political economy literature. However, now the second dimension is time,
which is usually years in the political economy literature. Thus, j = t = 1, 2, . . . T indexes
the temporal dimension of the data. For concreteness, I assume that T is relatively large
so that the data can be considered “time-series-cross-section” (TSCS) data, but all of my
claims also apply to “panel” data where T is relatively small.
Beck and Katz (1996) recommends a two-dimensional version of the “auto-regressive
distributed lag” (ARDL) model, as a starting point from which to “test down”. For example,
yit = α + φyit−1 + xitβ + xit−1γ + (νi + νit) ; |φ| < 1. (18)
The first number in ARDL(1,1) notation indicates that the right-hand side includes one
lag of the dependent variable (yit−1). The second number indicates that the right-hand
side includes one lag of the exogenous variables (xit−1). This particular ARDL model also
includes the contemporaneous exogenous variables (xit) and a single intercept (α).
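To make the ARDL(1,1) layout concrete, here is a minimal Python sketch of how the rows of equation 18 could be assembled from a panel. The data structure (a dictionary of per-unit time series) and the function name are assumptions made for illustration. The point of the sketch is that lags are taken within each unit, so the first period of every unit is lost and no row ever pairs one unit's outcome with another unit's lag.

```python
def build_ardl_rows(panel):
    """Build ARDL(1,1) rows from a panel.

    panel maps a unit id to a list of (y, x) pairs ordered by time.
    Each returned row is (unit, y_t, y_lag, x_t, x_lag). The first
    period of each unit has no lag and is therefore dropped.
    """
    rows = []
    for unit, series in panel.items():
        for t in range(1, len(series)):
            y_t, x_t = series[t]
            y_lag, x_lag = series[t - 1]
            rows.append((unit, y_t, y_lag, x_t, x_lag))
    return rows


# Hypothetical two-unit panel with three periods each.
panel = {
    "A": [(1.0, 0.5), (1.2, 0.7), (1.5, 0.6)],
    "B": [(3.0, 2.0), (2.8, 2.1), (3.1, 2.4)],
}
rows = build_ardl_rows(panel)
for row in rows:
    print(row)
```

With three periods per unit, each unit contributes two usable rows rather than three.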
Beck and Katz (2001, p.493) elaborates that fixed effects are “never ideal” but should
be included when they are necessary – provided that no time invariant variables are of
substantive interest – and that the BIC, rather than an F test, should be used to judge
necessity. The problems with the comparison between a POLSE and a LSDVE were discussed
in section 3, and the proper BIC comparison is between a POLSE2 and a SPM2. But if one
were to follow the recommendation in Beck and Katz (2001), it is likely that a specification
with a single intercept would be adopted, so I focus on the problems that arise in that case.
The ARDL model has a long history in the econometrics literature for single time-series,
but is not immune from criticism. I make no claim that the ARDL model is appropriate,
even for the example given in section 5. For consistency, the ARDL model requires the error
term to be uncorrelated with current values of xit, past values of xit, and future values of xit.
This is a very strong assumption, but the consequences of it have not been explored much
in the political science literature. Also, Wilson and Butler (2003) urges us to think more
carefully about lag structures. Nevertheless, I focus on the ARDL(1,1) model because it has
been specifically recommended for political scientists and use it as an example of what can
go wrong with TSCS data. If the ARDL model were eschewed in favor of a different model,
all the points in the previous sections would continue to hold and some of the points in this
section would probably apply as well.
Wilson and Butler (2003, table 1) claims that, as of May 31, 2003, 135 papers published in
political science have used linear TSCS models and have cited Beck and Katz (1995) or Beck
and Katz (1996). All of these papers use an ARDL model, but they likely do so by placing
restrictions on equation 18. Thus, it is important to determine if the two-dimensional ARDL model
is sound. Beck and Katz are always careful to allow for the possibility of fixed effects,
but the fact that only 47 of those papers report fixed effects estimates indicates that most
ignore this possibility. My impression is that pooled specifications are usually given more
emphasis even when fixed effects estimates are reported. Many reviewers insist that fixed
effects estimates be reported in a footnote as a “robustness check” on the POLSE, but this
thinking is backwards because there is no reason to believe that the POLSE is sound. The
B-E should be used as a robustness check on the W-E when the data vary over time.
In order to turn the ARDL(1,1) model into a SPM, it is first necessary to discuss the
difference between the unit mean of a lagged variable and the unit mean of a contemporaneous
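variable. When lagged variables are used, data are lost. Thus, the unit mean of a lagged
covariate (x̄i[t−1]) is calculated over a slightly different sample than the unit mean of a
contemporaneous covariate (x̄i):
$$\bar{x}_{i[t-1]} = \frac{1}{T} \sum_{t=0}^{T-1} x_{it} \neq \frac{1}{T} \sum_{t=1}^{T} x_{it} = \bar{x}_{i}, \tag{19}$$
$$\bar{y}_{i[t-1]} = \frac{1}{T} \sum_{t=0}^{T-1} y_{it} \neq \frac{1}{T} \sum_{t=1}^{T} y_{it} = \bar{y}_{i}. \tag{20}$$
When demeaning the lagged covariate, we should use x̄i[t−1] rather than x̄i. Thus, x̃it−1 =
xit−1 − x̄i[t−1] while x̃it = xit − x̄i. One can then see that the ARDL(1,1) model is a SPM,
$$\begin{aligned} y_{it} = \bar{y}_{i} + \tilde{y}_{it} = {} & \alpha + \phi_{b} \bar{y}_{i[t-1]} + \bar{x}_{i} \beta_{b} + \bar{x}_{i[t-1]} \gamma_{b} + \nu_{i} \\ & + \phi_{w} \tilde{y}_{it-1} + \tilde{x}_{it} \beta_{w} + \tilde{x}_{it-1} \gamma_{w} + \nu_{it}, \end{aligned} \tag{21}$$
with the restrictions that φb = φw, βb = βw, and γb = γw.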
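The distinction drawn in equations 19 and 20 can be seen in a few lines of Python. This is only a sketch with made-up numbers and hypothetical function names. It computes the unit mean of a contemporaneous series over t = 1, ..., T and the unit mean of its lag over t = 0, ..., T−1, and then demeans each series with its own mean.

```python
def mean(values):
    return sum(values) / len(values)


def demean_current_and_lag(series):
    """Split one unit's series x_0, ..., x_T into current and lagged parts.

    The current values are x_1, ..., x_T and the lagged values are
    x_0, ..., x_{T-1}. Each part is demeaned with its own unit mean,
    as equations 19 and 20 require.
    """
    current = series[1:]
    lagged = series[:-1]
    mean_current = mean(current)
    mean_lagged = mean(lagged)
    current_dm = [v - mean_current for v in current]
    lagged_dm = [v - mean_lagged for v in lagged]
    return mean_current, mean_lagged, current_dm, lagged_dm


# Hypothetical trending series for a single unit (T = 4 usable periods).
x = [2.0, 4.0, 6.0, 8.0, 10.0]
mean_current, mean_lagged, current_dm, lagged_dm = demean_current_and_lag(x)

# Demeaning the lag with the wrong (contemporaneous) mean leaves a
# nonzero average in the supposedly demeaned variable.
wrong_lagged_dm = [v - mean_current for v in x[:-1]]
print(mean_current, mean_lagged, sum(lagged_dm), sum(wrong_lagged_dm))
```

In this example the two unit means differ by 2 because the series trends upward, and the incorrectly demeaned lag no longer sums to zero within the unit.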
Of course, it is possible to avoid the problems with the standard errors of between esti-
mates inherent in a SPM with a SPM2 or CPE, but I want to elaborate why the restrictions
that φb = φw, βb = βw, and γb = γw are especially inappropriate. In my opinion, these
criticisms invalidate the ARDL(1,1) model with a single intercept regardless of the substance
of the research question or what the results of a (biased) BIC comparison imply.
Equation 21, like any SPM, is the sum of a B-E and a W-E, but no textbook includes
meaned lagged variables in the B-E for good reason. The meaned lagged variables should
be excluded from a B-E and should be included in a SPM2 only to determine whether the
equality restrictions are valid, which I will now demonstrate is virtually impossible.
First, when yit−1 is included in the SPM, βw represents the short-term effects of the
exogenous variables. There is no such thing as a “short-term cross-sectional effect”, so there
can never be a theoretical reason to impose the restrictions that βb = βw in an ARDL
specification. I could imagine a scenario where it could be sensible to constrain the long-term
temporal effects – which can be calculated using the formula (βw + γw)/(1 − φw) – to equal the
cross-sectional effects, but that is not what the ARDL model does.
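For readers who want to compute the long-term temporal effect, a minimal Python sketch of the formula (βw + γw)/(1 − φw) follows. The function name is hypothetical. The example inputs are the within estimates for GDP and the lagged dependent variable from table 2, with γw set to zero on the assumption that the specification reported there contains no lagged covariates.

```python
def long_run_effect(beta_w, gamma_w, phi_w):
    """Long-term temporal effect implied by within estimates.

    Computes (beta_w + gamma_w) / (1 - phi_w). The formula is only
    meaningful for a stationary process, so |phi_w| must be below one.
    """
    if abs(phi_w) >= 1.0:
        raise ValueError("long-run effect is undefined when |phi_w| >= 1")
    return (beta_w + gamma_w) / (1.0 - phi_w)


# Illustrative inputs only: GDP and lagged trade, W-E column of table 2.
lr_gdp = long_run_effect(beta_w=0.342, gamma_w=0.0, phi_w=0.533)
print(round(lr_gdp, 3))
```

The guard against |φw| ≥ 1 reflects the stationarity condition in equation 18 and is a design choice of this sketch.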
Second, νi is highly correlated with ȳi[t−1] unless νi = 0 ∀i. It is impossible for the
variance of νi to be eliminated unless a fixed effects model is estimated. But in theory, if
νi = 0 ∀i without using fixed effects, then φb ∈ {0, 1}, and neither estimate is promising if
one intends to constrain φb to equal φw since φw generally falls within the [0, 1] interval.
What about the restriction that φb = φw when νi ≠ 0 ∀i? Recall that the point estimate
for φb in the SPM is the same as in the B-E. Using this fact, we could agree that if ȳi were
regressed on ȳi[t−1] and a constant, φb ≈ 1 because ȳi and ȳi[t−1] are calculated using almost
the same sample as shown in equation 20. But adding x̄i and x̄i[t−1] to the right-hand side
of the B-E would not affect φb very much at all because ȳi[t−1] is a consequence of x̄i and
x̄i[t−1]. Thus, x̄i and x̄i[t−1] have virtually no net effect on ȳi, conditional on the effect they
have on ȳi[t−1]. Since x̄i and x̄i[t−1] have virtually no net explanatory power, ȳi[t−1] is left to
do all the explaining of the cross-sectional variance in the dependent variable, and φb ≈ 1 in
both the B-E and the SPM.
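The first step of this argument, that regressing the unit mean of the dependent variable on the unit mean of its lag gives a slope near one, can be checked with a small simulation. The Python sketch below is an illustration under assumptions that are mine rather than the paper's: there are no covariates, the true autoregressive parameter is 0.5, the unit effects and shocks are standard normal, and each series is run through a burn-in period. All names are hypothetical.

```python
import random


def slope(x, y):
    """OLS slope from regressing y on x and a constant."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def simulate(n_units=200, n_periods=30, phi=0.5, burn_in=50, seed=1):
    """Return (between, within, pooled) slopes on the lagged outcome."""
    rng = random.Random(seed)
    ybar, ybar_lag = [], []      # unit means: between regime
    y_dm, ylag_dm = [], []       # demeaned values: within regime
    y_all, ylag_all = [], []     # raw values: pooled regression
    for _unit in range(n_units):
        nu = rng.gauss(0.0, 1.0)           # unit effect
        y = nu / (1.0 - phi)               # start at the unit's long-run mean
        for _step in range(burn_in):
            y = nu + phi * y + rng.gauss(0.0, 1.0)
        path = [y]                         # y_0
        for _step in range(n_periods):     # y_1, ..., y_T
            path.append(nu + phi * path[-1] + rng.gauss(0.0, 1.0))
        cur, lag = path[1:], path[:-1]
        m_cur = sum(cur) / n_periods
        m_lag = sum(lag) / n_periods
        ybar.append(m_cur)
        ybar_lag.append(m_lag)
        y_dm.extend(v - m_cur for v in cur)
        ylag_dm.extend(v - m_lag for v in lag)
        y_all.extend(cur)
        ylag_all.extend(lag)
    return slope(ybar_lag, ybar), slope(ylag_dm, y_dm), slope(ylag_all, y_all)


phi_b, phi_w, phi_pooled = simulate()
print(round(phi_b, 3), round(phi_w, 3), round(phi_pooled, 3))
```

Under these assumptions the between slope should sit very close to one, the within slope should fall below it, and the pooled slope should land in between, which is the pattern the restriction φb = φw cannot accommodate.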
Given that φb = 1, the restriction the ARDL model imposes on the SPM that φb = φw is
valid only if φw = 1. Thus, the ARDL model makes the assumption that φb = φw = 1 and
the stationarity assumption that |φ| < 1, which are mutually exclusive. If the restriction
that φb = φw = 1 were valid, the model would be explosive, the long-run effects of the
exogenous variables would be infinite, and the distribution of the test statistics would be wrong.
In short, finding that φb = φw = 1 is pretty much the worst thing that could ever happen
in a regression, but the ARDL model imposes this restriction. In general, the restriction
is invalid because φw < 1, which implies that φ is biased in the ARDL model.8 The point
estimates for the effects of the exogenous variables are biased as well due to their correlation
with the lagged dependent variable.
8Both Green, Kim and Yoon (2001, p.453) and Kristensen and Wawro (2003, note 18), among others, recognize in passing that the pooled estimate of a lagged dependent variable is biased upward and blame heterogeneity in the units. However, the same phenomenon could theoretically occur with homogeneous units that all experience transitory shocks in the error term.
Third, it does not make sense to think about the effects of x̄i conditional on x̄i[t−1]
(and vice versa) because the two vectors are conceptually the same and are almost perfectly
collinear. However, it does make sense to think about x̃it conditional on x̃it−1 if lagged effects
are possible. Thus, imposing the restrictions that γw = γb and that βw = βb when γb and
βb are nonsensical undermines the estimates for γw and βw. Many papers simply exclude
xit−1, making the ARDL(1,1) model into an ARDL(1,0) model, which is called the “partial
adjustment model” or the “lagged dependent variable model” by Beck and Katz. Excluding
xit−1 avoids the collinearity problem, but does not change the fact that the restrictions that
φb = φw and βb = βw are deeply problematic.
Fourth, the reason Beck and Katz (1996) recommends the ARDL(1,1) is to test hypothe-
ses about γ in order to capture possible efficiency advantages. If γ = 0, one should estimate
the partial adjustment model, which is more parsimonious than the ARDL(1,1) model. If
γ = −φβ, imposing these restrictions yields an efficient GLS model. However, neither of
these tests is fruitful in a two-dimensional context because all the estimates are biased to
the extent that the restrictions that φb = φw, βb = βw, and γb = γw do not hold. The SPM2,
or better yet the W-E, permits alternative tests of whether γw = 0 or γw = −φwβw, and
these restrictions can possibly be imposed on the W-E to increase precision.
Fifth, the ARDL(1,1) incorrectly assumes that the “unit effects” (νi) do not exist. The
existence of the unit heterogeneity implies that each residual is highly correlated with every
other residual for that unit, which undermines the consistency of PCSEs. Beck and Katz
(1996) recommends – but does not derive – a Lagrange Multiplier (LM) test to verify that
there is no autocorrelation in the residuals (νit). This LM test takes the form of an auxiliary
regression of the residuals on their lags and every independent variable in the original model:
νit = α + φνit−1 + κyit−1 + xitβ + xit−1γ + ηit. (22)
Beck and Katz (1996) recommends looking primarily at the magnitude of φ in equation
22 to determine if there is any autocorrelation left in the residuals, but a pooled auxiliary
regression has all the same problems that plague the original pooled model. In particular, φ
is biased unless φb = φw = 1 (which would be very bad). When a lagged dependent variable
is included in the original model, there is so little cross-sectional variation in νit that φ is
essentially a within estimate. However, there is some cross-sectional variation in νit and
the fact that φb ≈ 1 would be easy to verify if an auxiliary SPM, SPM2, or B-E were used
instead of an auxiliary POLSE. We do not exactly know how PCSEs fare when the small
cross-sectional component of the residuals is almost perfectly predictable, but the Monte
Carlo evidence in Kristensen and Wawro (2003) is not particularly encouraging for PCSEs.
Of course, PCSEs following a W-E should work well because νi = 0 ∀i by construction.
Estimating a W-E prior to a B-E affords an opportunity to model the cross-unit corre-
lation in the errors of the B-E. In order to make corrections to the W-E’s standard errors,
PCSEs calculate Σ, which is an N × N variance-covariance matrix of the residuals where the
off-diagonal elements are estimates of the covariance in the residuals for two units (see Beck
and Katz, 1995). Thus, one can create a covariate, c, that is a weighted sum of the unit
means of the dependent variable where the weights are given by the elements of Σ:
$$\mathbf{c}_{N \times 1} = \left[ \bar{\mathbf{y}}'_{1 \times N} \, \Sigma_{N \times N} \right]'. \tag{23}$$
Including c in the B-E is consistent with the advice in Franzese and Hays (2004), but
future Monte Carlo experiments are needed. First, the diagonal elements of Σ should probably
be replaced by zeroes to preclude c from being influenced by a unit’s own dependent
variable. Second, it is possible that Σ should be rescaled to a correlation matrix before using
equation 23 to construct the new covariate. Third, the W-E may include a lagged dependent
variable and unique intercepts for each time period, both of which will reduce the contemporaneous
correlation in the residuals of the W-E. It is possible that an underspecified W-E
should be estimated (after the properly specified W-E) to create a Σ that is not affected by
any variables that are included in the W-E but not the B-E. However, the CPE is still the
best estimation strategy discussed in this paper even if c is omitted from the B-E.
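A minimal Python sketch of the covariate in equation 23 is given below. It treats the unit means as a list and Σ as a nested list, and it offers the zeroed diagonal as an option, following the first caveat above. The function name, the argument names, and the three-unit example are hypothetical, and the sketch does not attempt the rescaling to a correlation matrix mentioned in the second caveat.

```python
def cross_unit_covariate(ybar, sigma, zero_diagonal=True):
    """Weighted sum of unit means with weights taken from sigma.

    Element i of the result is sum_j ybar[j] * sigma[j][i], which is
    element i of the transposed product in equation 23. When
    zero_diagonal is True, a unit's own mean is excluded from its sum.
    """
    n = len(ybar)
    covariate = []
    for i in range(n):
        total = 0.0
        for j in range(n):
            if zero_diagonal and i == j:
                continue
            total += ybar[j] * sigma[j][i]
        covariate.append(total)
    return covariate


# Hypothetical unit means and residual covariance matrix for three units.
ybar = [1.0, 2.0, 3.0]
sigma = [
    [1.0, 0.5, 0.2],
    [0.5, 1.0, 0.3],
    [0.2, 0.3, 1.0],
]
c_zeroed = cross_unit_covariate(ybar, sigma)
c_full = cross_unit_covariate(ybar, sigma, zero_diagonal=False)
print(c_zeroed, c_full)
```

Comparing the two outputs shows how much of each element would otherwise come from the unit's own dependent variable.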
Finally, some comments on the two-dimensional error correction model (ECM),
∆yit = α + ∆xitβ + φ (yit−1 + xit−1γ) + ζit, (24)
which is also recommended in Beck and Katz (1996) in some circumstances. An ECM and an
ARDL model are closely related, so it should come as no surprise that the problems with the
ARDL model carry over to the ECM. In particular, unless α is replaced with αi in equation
24, the point estimates will reflect excessive weight on the cross-sectional dimension and the
error term (ζit) will have common components. Differencing reduces but does not eliminate
the cross-sectional variation in the data, necessitating the use of fixed effects to obtain pure
within estimates of the coefficients and draw upon ECM theory from econometrics. My
impression is that most political scientists who use ECMs include fixed effects. However, I
have not seen a paper where the B-E is used to check the results of an ECM.
5 Empirical Example
This section replicates part of Green, Kim and Yoon (2001) in order to illustrate the problems
and solutions identified in previous sections. The model in Green, Kim and Yoon (2001) is
a dyadic TSCS model of bilateral trade from 1952 to 1992 called a “gravity model”. Let
a and b indicate the two states in the dyad. The logarithm of distance between a and b
is included on the right-hand side in addition to the logarithm of the product of the two
countries’ gross domestic products and the logarithm of the product of the two countries’
populations. The two covariates of interest are a dummy variable indicating the presence of
a military alliance between a and b and the minimum level of democracy between a and b.
The data are not balanced (N = 3079; T = 28.89), but I ignore this and a number of other
important, but tangential, methodological issues in order to present an exact replication.
Column 1 of table 2 is an exact replication of the POLSE reported in column 3 of table
2 of Green, Kim and Yoon (2001). All estimates are statistically significant given the OLS
assumptions. The FPE in column 2 approximates the POLSE by weighting the meaned
data by b = 0.9713 and weighting the demeaned data by w = 0.2059, weights which were
calculated to minimize the discrepancy between the POLSE and the FPE that arises when
the data are unbalanced. If all 3079 units had the maximum of 41 years of complete data, w
would equal 41^(−1/2) or 0.1562. The standard errors for this FPE reflect a correction factor of
2.11, which is not trivial but does not affect any inferences. However, none of the standard
errors in table 2 have much credibility because they assume a spherical error term.
Column 3 of table 2 is a FPE with w = b = 1 or a POLSE2. Since there are 88,946 demeaned
observations and only 3079 meaned observations, the absence of preferential weighting
implies that the results are essentially within estimates. All the coefficients retain the
same signs, but their magnitudes change somewhat. In particular, the estimated effect of
the lagged dependent variable decreases by almost 0.2, which is a large change for an au-
toregressive parameter. Also, the effects of alliances and democracy are insignificant.
The SPM2 presents pure within and between estimates in columns 4 and 5, respectively.
The within effects for the population and alliance variables change signs and become
statistically significant. This SPM2 is a bad specification, at least along the cross-sectional
dimension, because the unit means of the lagged dependent variable are included in the
model. The estimated “effect” of the meaned lagged variable is 0.981 and insignificantly dif-
ferent from unity. As a result, the effects of all the other meaned variables are insignificant
because the meaned lagged variable is a consequence of them.
Table 2: Comparison of various least squares estimators for a two-dimensional gravity model of bilateral trade

Estimator:                  POLSE       FPE         POLSE2      Bad SPM2    Bad SPM2    Good SPM2   Good SPM2   CPE         CPE
Column number:              (1)         (2)         (3)         (4)         (5)         (6)         (7)         (8)         (9)
Variable ↓ / Info. →        Replication w=0.2059    w=1         Within      Between     Within      Between     W-E         B-E
                                        b=0.9713    b=1

Intercept                   −3.046      −3.044      −5.071                  −0.197                  −24.375                 −24.375
                            (0.177)     (0.406)     (0.689)                 (1.185)                 (1.149)                 (1.306)
ln(Distance[ab])i           −0.328      −0.384      −0.524                  0.002                   −1.507                  −1.507
                            (0.012)     (0.028)     (0.055)                 (0.063)                 (0.060)                 (0.068)
ln(GDP[a]                   0.250       0.254       0.363       0.342       0.028       0.342       1.534       0.342       1.534
× GDP[b])it                 (0.006)     (0.012)     (0.007)     (0.013)     (0.049)     (0.013)     (0.044)     (0.013)     (0.050)
ln(Population[a]            −0.059      −0.046      −0.034      0.143       −0.020      0.143       −0.648      0.143       −0.648
× Population[b])it          (0.006)     (0.014)     (0.023)     (0.067)     (0.045)     (0.068)     (0.044)     (0.068)     (0.051)
Alliance                    −0.247      −0.341      −0.017      0.419       −0.037      0.419       −0.787      0.419       −0.787
Dummy[ab]it                 (0.027)     (0.065)     (0.092)     (0.119)     (0.148)     (0.122)     (0.150)     (0.121)     (0.171)
min(Democracy[a],           0.022       0.022       −0.003      −0.009      −0.005      −0.009      0.079       −0.009      0.079
Democracy[b])it             (0.001)     (0.003)     (0.002)     (0.002)     (0.009)     (0.002)     (0.009)     (0.002)     (0.010)
ln(Trade[ab])it−1           0.736       0.722       0.549       0.533       0.981       0.533                   0.533
                            (0.002)     (0.004)     (0.003)     (0.003)     (0.015)     (0.003)                 (0.003)

Notes: The dependent variable is ln(Trade[ab])it or some transformation thereof. The two states in the dyad are denoted by a and b. For details on the dataset, see Green, Kim, and Yoon (2001). Standard errors are in parentheses and are “OLS standard errors” with the following modifications. The standard errors in the FPE are multiplied by 2.11, which reflects the correction in table 1. The standard errors in the POLSE2, the Bad SPM2, the Good SPM2, and the within estimator component of the CPE reflect a deduction of N degrees of freedom only.
The only purpose of estimating an overspecified SPM2 is to evaluate the restrictions that
the POLSE2 places on the SPM2, which are clearly invalid in this case, especially for the
lagged dependent variable. The odds of this SPM2 relative to the POLSE2 are approximately
e^500 to one based on the BIC. Also, note that Beck and Katz (2001) found evidence in favor
of the POLSE relative to the W-E using the BIC, but this is not a meaningful comparison
because the POLSE is flawed.
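The step from a BIC comparison to odds can be sketched as follows. The sketch assumes the common approximation, associated with Raftery (1995), that twice the log of the odds between two models is roughly the difference in their BIC values, with lower BIC being better. Whether the odds reported above were computed in exactly this way is an assumption, and the BIC values in the example are hypothetical numbers chosen only so that the gap equals 1000.

```python
import math


def log_odds_from_bic(bic_restricted, bic_unrestricted):
    """Approximate log odds in favor of the unrestricted model.

    Uses the approximation 2 * log(odds) ~ BIC difference, where a
    lower BIC indicates the preferred model. Working on the log scale
    avoids overflow when the difference is large.
    """
    return (bic_restricted - bic_unrestricted) / 2.0


# Hypothetical BIC values with a gap of 1000.
log_odds = log_odds_from_bic(bic_restricted=101000.0, bic_unrestricted=100000.0)
print(log_odds)  # the odds themselves are exp(log_odds) to one
```

A BIC gap of about 1000 therefore corresponds to odds of roughly e^500 to one under this approximation.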
For reference, I also include the results of a well-specified SPM2 in columns 6 and 7
that excludes the meaned lagged dependent variable. The within estimates do not change
because the demeaned variables are orthogonal to the excluded variable. However, the
between estimates of the exogenous variables become significant by a wide margin, as would
be expected because the included variable bias is eliminated.
Columns 8 and 9 present a W-E and a B-E respectively. The results of the W-E are
identical to column 4 in table 2 of Green, Kim and Yoon (2001), and the point estimates are
identical to those in column 6 in this paper due to orthogonality. Although table 2 presents
standard errors that incorrectly assume a spherical error term, it would be easy to calculate
PCSEs or clustered standard errors for the W-E.9 The point estimates from the B-E are also
identical to the between estimates in the well-specified SPM2, again due to orthogonality.
And it would be easy to use White standard errors to correct the B-E for heteroskedasticity.
We should not overlook the fact that the estimates in the CPE are all significant, but
the within and the between estimates have opposite signs in every case except the GDP
variable (where the magnitudes are very different, even in the long-run). In the absence of a
theoretical explanation for the differing signs, we should probably conclude that important
variables are missing. If the missing variables only vary across dyads, the results from the
W-E are unbiased. However, if any of the missing variables are two-dimensional, all bets are off
concerning the unbiasedness of the CPE. Of course, the true answer is unknowable.
9Although there is evidence that a second lag of the dependent variable would be needed in the W-E to eliminate the autocorrelation.
It is also useful to note the results for the distance variable, which is the only time
invariant covariate. The estimated effect goes up in magnitude from the POLSE to the
POLSE2 but the estimate is an artifact of the invalid pooling constraints on the other
covariates. When these restrictions are relaxed in the overspecified SPM2, the effect of
distance is nil conditional on the meaned lagged variable. When this variable is excluded, the
effect of distance rebounds to a level that is consistent with the literature and is significant.
The main point to take away is that the B-E, rather than the POLSE, POLSE2 or REE
(not shown), is the best way to estimate the effects of time-invariant variables.
6 Conclusions
The message of this paper is fairly simple: Think about the restrictions put on the coef-
ficients, think about the weights put on the data, and think about how the weights affect
the degrees of freedom. Using a CPE is safe; reviewers and editors should enforce tough
standards if an author attempts to justify another estimation technique when the data are
two-dimensional. Conversely, researchers have plenty of opportunity to revisit previous stud-
ies. To summarize:
1. The SPM in Zorn (2001) produces unbiased point estimates but the wrong standard
errors for the between estimates, and the well-known post hoc corrections to the stan-
dard errors do not solve this problem adequately. The SPM2 produces the same point
estimates as the SPM but more reasonable standard errors. Thus, the SPM2 rather
than the SPM should be used to evaluate pooling restrictions on the coefficients.
2. The POLSE2, which restricts all the coefficients, is a special case of the FPE where
the within and between regimes are weighted equally. By specifying different weights
for the two regimes, the FPE can produce the same point estimates as the W-E, B-E,
REE, and POLSE. In the case of the POLSE, it is virtually impossible to substantively
justify a weight of J^(−1/2) for the within regime.
3. The FPE requires that the standard errors be multiplied by
$\sqrt{\frac{NJ + N - K}{wNJ + bN - wN - K}}$, which
implies that the reported standard errors of the REE and POLSE are too small.
4. If the pooling restrictions appear invalid when the BIC of the SPM2 is compared to
the BIC of the POLSE2, it is best to employ a W-E and then a B-E. Doing so allows
more flexibility to relax the assumption that the error term is spherical.
5. If time is one of the dimensions and a lagged dependent variable is included on the
right-hand side, it is virtually impossible to justify the pooling restrictions. The rec-
ommendation in Beck and Katz (1996) is sound only if fixed effects are used.
6. The B-E can be used as a robustness check for the results of a W-E or ECM and is
appropriate for estimating the effects of variables that do not change over time.
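The standard error correction in point 3 and the POLSE weight in point 2 can be checked against the numbers reported in section 5. The Python sketch below reads the correction as the square root of (NJ + N − K) divided by (wNJ + bN − wN − K), uses the 88,946 demeaned observations in place of NJ because the data are unbalanced, and takes K = 7 by counting the intercept and six slopes in column 2 of table 2. The reading of the formula, the value of K, and the function name are assumptions of this sketch.

```python
import math


def fpe_se_correction(n_units, n_obs, k, w, b):
    """Multiplier for FPE standard errors.

    Computes sqrt((NJ + N - K) / (w*NJ + b*N - w*N - K)), with n_obs
    standing in for NJ and n_units for N.
    """
    numerator = n_obs + n_units - k
    denominator = w * n_obs + b * n_units - w * n_units - k
    return math.sqrt(numerator / denominator)


# Values reported for column 2 of table 2; K = 7 is an assumption.
factor = fpe_se_correction(n_units=3079, n_obs=88946, k=7, w=0.2059, b=0.9713)

# Within-regime weight implied by the POLSE with 41 complete years.
polse_weight_balanced = 41 ** -0.5

print(round(factor, 2), round(polse_weight_balanced, 4))
```

With these inputs the multiplier comes out close to the 2.11 reported in the notes to table 2, and the balanced-panel weight matches the 0.1562 given in section 5.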
References
Arellano, Manuel. 1987. “Computing Robust Standard Errors for Within-Groups Estimators.” Oxford Bulletin of Economics and Statistics 49(4):431–34.
Baltagi, Badi H. 2001. Econometric Analysis of Panel Data. Second ed. New York: John Wiley & Sons, LTD.
Bartels, Larry M. 1996. “Pooling Disparate Observations.” American Journal of Political Science 40(3):905–942.
Beck, Nathaniel and Jonathan N. Katz. 1995. “What to Do (and Not to Do) with Time-Series–Cross-Section Data.” American Political Science Review 89(3):634–647.
Beck, Nathaniel and Jonathan N. Katz. 1996. “Nuisance vs. Substance: Specifying and Estimating Time-Series–Cross-Section Models.” Political Analysis 8(3):1–36.
Beck, Nathaniel and Jonathan N. Katz. 2001. “Throwing the Baby Out with the Bathwater: A Comment on Green, Kim, and Yoon.” International Organization 55(2):487–495.
Franzese, Robert J. and Jude C. Hays. 2004. “Empirical Modeling Strategies for Spatial Interdependence: Omitted-Variable vs. Simultaneity Biases.” Paper presented at the 2004 Political Methodology Conference. Available from http://sitemaker.umich.edu/jchays/files/franzesehays 1 .polmeth.2004.pdf.
Gould, William. 2001. “What is the Between Estimator?” STATA FAQ: http://www.stata.com/support/faqs/stat/xt.html.
Green, Donald P., Soo Yeon H. Kim and David Yoon. 2001. “Dirty Pool.” International Organization 55(2):441–468.
Greene, William H. 2000. Econometric Analysis. Fourth ed. Upper Saddle River, NJ: Prentice Hall.
King, Gary. 2001. “Proper Nouns and Methodological Propriety: Pooling Dyads in International Relations Data.” International Organization 55(2):497–507.
King, Gary, James Honaker, Anne Joseph and Kenneth Scheve. 2001. “Analyzing Incomplete Political Science Data.” American Political Science Review 95(1):49–69.
Kristensen, Ida Pagter and Gregory Wawro. 2003. “Lagging the Dog? The Robustness of Panel Corrected Standard Errors in the Presence of Serial Correlation and Observation Specific Effects.” Paper presented at the 2003 Political Methodology Conference. Preliminary version available from: http://polmeth.wustl.edu/papers/03/krist03.pdf.
Little, Roderick J.A. and Donald B. Rubin. 2002. Statistical Analysis with Missing Data. Second ed. Hoboken, New Jersey: John Wiley & Sons, Inc.
Neuhaus, J. M. and J. D. Kalbfleisch. 1998. “Between- and Within-Cluster Covariate Effects in the Analysis of Clustered Data.” Biometrics 54:638–645.
Oneal, John R. and Bruce Russett. 2001. “Clear and Clean: The Fixed Effects of the Liberal Peace.” International Organization 55(2):469–485.
Raftery, Adrian E. 1995. “Bayesian Model Selection in Social Research.” Sociological Methodology 25:111–163.
Rice, John A. 1995. Mathematical Statistics and Data Analysis. Second ed. International Thomson Publishing.
Wilson, Sven E. and Daniel M. Butler. 2003. “Too Good to Be True? The Promise and Peril of Panel Data in Political Science.” Working Paper. Preliminary version available from http://fhss.byu.edu/POLSCI/Wilson/papers/.
Zorn, Christopher. 2001. “Estimating Between- and Within-Cluster Covariate Effects, with an Application to Models of International Disputes.” International Interactions 27(4):433–45.