Effects of Intrinsic Sources of Spatial Autocorrelation on Spatial Regression Modelling
Shuqing N. Teng1*, Chi Xu2, Brody Sandel1 and Jens-Christian Svenning1
1 Section for Ecoinformatics & Biodiversity, Department of Bioscience, Aarhus University, Ny Munkegade 114, DK-8000 Aarhus C, Denmark
2 School of Life Sciences, Nanjing University, 163 Xianlin Road, 210023, Nanjing, PR China
*Correspondence author: E-mail: [email protected]
Running title: Intrinsic autocorrelation in regression modelling
Abstract
1. Detecting and dealing with spatial autocorrelation (SA) are indispensable steps in analyses of
geospatial data. Despite consensus that SA originates from both extrinsic factors and intrinsic
interactions, previous studies on regression analysis of spatially autocorrelated data have rarely controlled
for intrinsic sources in addition to extrinsic ones to assess ceteris paribus (i.e. causal) effects of interest,
with the strict exogeneity assumption that errors containing unexplained variance are uncorrelated with
explanatory variables for all observations. This assumption becomes invalid when intrinsic SA is not an
external process modeled as errors and needs to be controlled for.
2. Here, we aimed to assess the extent to which not controlling for intrinsic SA negatively affects model
performance, specifically in terms of type I error rate and unbiasedness of coefficient estimates, and to
identify models that are able to handle these problems. To this end, we applied two categories of
regression models that do (intrinsic category) or do not (extrinsic category) explicitly control for intrinsic
SA to artificial data generated with both SA sources. These models included the extended spatial Durbin
model (ESDM) and its nested models. Four analytic scenarios simulated realistic modelling conditions,
with realism increasing with the complexity of variables omitted during modelling. The two more realistic
scenarios involved additional violations of strict exogeneity.
3. We found that intrinsic – just as extrinsic – SA can produce incorrect type I error rates, if not explicitly
controlled for. Failing to control for intrinsic SA also generated bias in estimates of ceteris paribus
effects. However, ESDM from the intrinsic category exhibited consistently good performance in dealing
with intrinsic SA across all the scenarios, but suffered from other violations of the strict exogeneity
assumption.
4. Overall, model specification should control for both extrinsic and intrinsic processes generating SA in
spatial data to provide reliable type I error rates and unbiased estimates of ceteris paribus effects. Given
the likely widespread occurrence in observational spatial data of unknown or unmeasurable processes,
ESDM should be a generally preferred starting point to explore the optimal model specification for
estimating ceteris paribus effects, with due caution to other violations of strict exogeneity.
Key-words
Autoregressive models, ceteris paribus effects, biotic interactions, endogenous variables, spillover,
omitted variable bias, strict exogeneity
Introduction
Over the past decades, it has become generally recognized among ecologists that spatial autocorrelation
may lead to violation of the basic assumption of independent errors in standard statistical models,
resulting in incorrect inferences (e.g. underestimated confidence intervals and inflated type I error rates in
the presence of positive spatial autocorrelation) (Legendre 1993; Lennon 2000; Beale et al. 2010).
Currently, the most common way of dealing with this issue is applying spatially explicit modelling
techniques that are designed for modelling spatially autocorrelated data, although outputs from these
spatial models may not always be consistent (Keitt et al. 2002; Dormann et al. 2007). Meanwhile, there
are opponents who are skeptical about the underlying assumptions of such spatial models, arguing that
they lack realistic ecological meaning and that we should caution against their limitations (Hawkins
2012). Overall, there is yet no generally accepted solution for handling spatial autocorrelation in
modelling. In contrast, there is consensus on the origin of spatial autocorrelation in both intrinsic and
extrinsic processes (Cliff & Ord 1981; Legendre 1993; Fortin & Dale 2005). The generation of spatial
autocorrelation by intrinsic processes can be interpreted as the response of a variable at one position to its
values at neighboring positions, e.g., via dispersal. Hence, spatial structure is generated internally by the
focal variable. Alternatively, spatial autocorrelation may result from responses to extrinsic factors
(including unknown factors modeled as errors), for example, environmental gradients, i.e., with spatial
structuring generated by external forcing.
In terms of regression modelling, one important distinction between the two processes is whether
they violate the strict exogeneity assumption (see below), which is crucial for the unbiasedness of the
ordinary least squares estimator for standard linear regression applied to estimate ceteris paribus (also
referred to as partial or causal) effects (Hayashi 2000; Wooldridge 2012). The standard linear regression
model y = Xβ + ε assumes two types of independence. Firstly, elements in the error vector ε are
independent of each other (i.e. the spherical error assumption: the variance-covariance matrix of ε is an
identity matrix premultiplied by a scalar). Secondly, every element in the error vector ε is uncorrelated
with every element of the design matrix X that contains the explanatory variables (i.e. the strict
exogeneity assumption: the covariance matrix between a column vector in X and ε is a zero matrix,
which follows from the usual statement E(ε|X) = 0). If the first assumption is violated, the ordinary least
squares estimate of β will be inefficient (i.e. its variance becomes larger than that of the generalized least
squares estimate, the best linear unbiased estimator; see Cressie (1993) p. 20-21), but still unbiased (i.e. the
estimate’s expectation equals the true value). If the second assumption is violated, the estimate by
ordinary least squares for β will be biased. Specifically, the strict exogeneity assumption is violated when
the following situations occur (Wooldridge 2012): measurement for explanatory variables has errors;
omitted variables left in the error term are correlated with explanatory variables; one observation of
the response variable itself has an effect on its temporal and/or spatial neighbors; explanatory variables are
simultaneously influenced by the response variable. Explanatory variables that meet the strict exogeneity
assumption are said to be exogenous, while those that do not are referred to as endogenous. Hence, in
linear regression models, explanatory variables representing intrinsic processes are endogenous, while
those representing extrinsic processes can be either exogenous or endogenous, depending on specific
contexts. In cases where the response variable that responds to extrinsic processes yields no feedback,
variables of the extrinsic processes are exogenous; otherwise, they are endogenous, determined inside of
the model jointly with the response variable due to the feedback. In this sense, extrinsic factors like
environmental conditions (e.g. climate and topography determined by natural forces at scales larger than
the modelling scale) and some types of biotic interactions (e.g. commensalism, amensalism) can be
exogenous, while some biotic interactions (e.g. mutualism, parasitism) should be viewed as endogenous.
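The different consequences of the two assumptions can be read off the standard textbook decomposition of the ordinary least squares estimator. The sketch below uses the notation of y = Xβ + ε; the hat notation for the estimator is introduced here for illustration only.

```latex
\begin{aligned}
\hat{\beta} &= (X^{\top}X)^{-1}X^{\top}y = \beta + (X^{\top}X)^{-1}X^{\top}\varepsilon, \\
\mathrm{E}(\hat{\beta} \mid X) &= \beta + (X^{\top}X)^{-1}X^{\top}\,\mathrm{E}(\varepsilon \mid X).
\end{aligned}
```

Under strict exogeneity the second term vanishes whatever the variance-covariance matrix of ε, so non-spherical errors affect only efficiency and the usual standard errors. When E(ε|X) ≠ 0, the second term is in general non-zero and the estimator is biased.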
Ideally, if a statistical model takes into account all processes generating variation in the response
variable, there should be no signatures of autocorrelation in the errors. In other words, incomplete
representation of these processes is the cause of spatially autocorrelated errors in regression analysis.
Unfortunately, this situation is common in ecological analyses due to either poor knowledge (often as a
result of the complexity of ecological systems) or simply lack of data. For example, it may not be clear
which and how environmental factors or interspecific interactions shape the distribution of a species, or
there may simply not be any data on one potential explanatory variable. When the missing explanatory
variables are autocorrelated and independent of the explanatory variables in a model, the spherical error
assumption will be violated; when the missing explanatory variables are cross-correlated with explanatory
variables in a model, the errors will be correlated with explanatory variables in the model, thereby
also violating the strict exogeneity assumption.
Previous studies on spatial autocorrelation in ecological modelling have shown that spatial
regression models with distinct purposes, scale issues and conditions of fixed effects will generate
coefficient shifts (also known as spatial confounding) and inconsistent interpretations (Diniz-Filho, Bini
& Hawkins 2003; Hawkins et al. 2007; Hodges & Reich 2010; Paciorek 2010). For regression models
with a purpose of quantifying causal effects of some factors, intrinsic processes are usually not the focus
and incorporated into the error term with an implicit assumption of strict exogeneity, leaving causal
effects of intrinsic processes uncontrolled. In practice, however, extrinsic and intrinsic processes can
jointly generate spatial patterns in both response and explanatory variables (e.g., abundance patterns
driven by intraspecific, interspecific and species-environment interactions) (Fortin & Dale 2005), thus
making it possible and even likely to have study cases where effects of intrinsic processes need to be
controlled for. For instance, in the context of explaining spatial patterns in ecology using regression
models, we might be interested in the extent to which intrinsic processes (e.g. dispersal, population
dynamics) of the focal organisms contribute to the pattern of interest, with all other factors controlled. In
this sense, a linear model y = f(y_neighbor) + Xβ + ε (where X is exogenous) will tell us something about the
causal effects of the neighbors on one observation, at the expense of the validity of ordinary least squares.
In contrast, a common linear model y = Xβ + ε (where X is exogenous) will not provide such information;
even worse, the effect of f(y_neighbor) will in this case be left in ε, so that ε will be correlated with X and the
model will no longer be valid. Another example is that X sometimes should theoretically include variables
representing biotic interactions, including not only direct ones, but also biotic modifiers or modulators
(Linder et al. 2012). Such biotic variables are usually driven by not only their own neighbors, but also
abiotic covariates already in the model. However, they are more likely to be omitted in ecological
modelling than abiotic ones, as the latter are often easily monitored and accessed, resulting in correlation
between the errors and the explanatory variables and thus violation of strict exogeneity.
The questions that we want to answer here are: How do intrinsic sources of autocorrelation in either
response or explanatory variables, given their likely ubiquity in ecology, affect the performance of spatial
regression models that assume strict exogeneity when applied to investigate spatial patterns? What
models are capable of handling both sources of spatial autocorrelation in ecological data for the purpose
of assessing ceteris paribus effects?
Materials and methods
Artificial data
Spatial structure generated by extrinsic processes can be modeled as a function of covariates, while
structure derived from intrinsic processes is usually modeled as a function of the response variable itself
(Anselin 1988; LeSage & Pace 2009). In simultaneous autoregressive models, intrinsic processes are
modelled by the term ρWy, where ρ is the parameter indicating the average strength of spatial
autocorrelation across all observations and |ρ| is less than one; W is the row stochastic (i.e. sum of each
row equals one) weight matrix which identifies eligible neighbors (Anselin 1988; LeSage & Pace 2009)
and y is the response variable vector. In this study, we assumed all extrinsic processes to be exogenous
and ρ positive (see Figure S6 for results with negative ρ).
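As an illustration of the term ρWy, the following Python sketch builds a row-stochastic weight matrix for a regular grid and computes the spatial lag Wy. It is not the authors' R code: the rook (edge-sharing) neighbourhood and the function names `rook_weights` and `spatial_lag` are assumptions made for this example, since the neighbourhood definition is not stated here.

```python
def rook_weights(nrow, ncol):
    """Row-stochastic weight matrix W for a regular grid.

    Assumption: neighbours are the cells sharing an edge (rook contiguity).
    Each row is divided by its number of neighbours so that it sums to one.
    """
    n = nrow * ncol
    W = [[0.0] * n for _ in range(n)]
    for r in range(nrow):
        for c in range(ncol):
            i = r * ncol + c
            nbrs = [(rr, cc)
                    for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                    if 0 <= rr < nrow and 0 <= cc < ncol]
            for rr, cc in nbrs:
                W[i][rr * ncol + cc] = 1.0 / len(nbrs)
    return W


def spatial_lag(W, y):
    """Wy: for each site, the weighted mean of y over its neighbours."""
    return [sum(w * v for w, v in zip(row, y)) for row in W]


W = rook_weights(20, 20)          # 400 sites, matching a 20 x 20 grid
lag_of_constant = spatial_lag(W, [3.0] * 400)
```

Because every row of W sums to one, the lag of a constant vector is the same constant, which is a quick check that the matrix is row-stochastic.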
We generated artificial data vectors whose spatial structures were shaped by extrinsic and/or
intrinsic processes. To interpret the artificial dataset ecologically, imagine that a population density vector
(y) for a species A is perfectly explained by three environmental variable vectors (x1, x2 and x4,
representing, e.g., temperature, productivity, and available soil water, respectively), positive or negative
interaction with a population density vector (x3) for a species B, migration between neighbor locations
(Wy) and an independent and identically distributed (i.i.d.) error vector (ε) of normal distribution. x3 is a
linear combination of two cross-correlated environmental variables x1 and x2, plus other unknown
independent factors (represented by spatially structured errors). To increase the complexity and variability
of the simulated data, we used four different spatial structures (exponential, Gaussian, spherical and
rational quadratic; see Cressie 1993) to generate the explanatory variables, with strong first-lag
autocorrelation strength roughly equal to 0.8. For simplicity, all relationships among the response variable
and the explanatory variables were set to be linear and stationary; the intrinsic process in y used the same
weight matrix W as that in x3 (see Figure S5 for results based on mis-specified weight matrix). The
mathematical forms of the data vectors are as follows:
y = ρ_1 W y + μ_0 α + β_1 x_1 + β_2 x_2 + β_3 x_3 + β_4 x_4 + ε_0, ε_0 ~ N(0, σ^2 I), eqn 1
x_1 = μ_1 α + ε_1, ε_1 ~ N(0, V_1), eqn 2
x_2 = μ_2 α + γ_1 x_1 + ε_2, ε_2 ~ N(0, V_2), eqn 3
x_3 = ρ_2 W x_3 + μ_3 α + φ_1 x_1 + φ_2 x_2 + ε_3, ε_3 ~ N(0, V_3), eqn 4
x_4 = μ_4 α + ε_4, ε_4 ~ N(0, V_4), eqn 5
where β_1, β_2, β_3, β_4, γ_1, φ_1, φ_2, μ_0, μ_1, μ_2, μ_3, μ_4 are scalars as coefficients; α is a vector of ones; I is an
identity matrix; V_1, V_2, V_3, V_4 are variance-covariance matrices of differing spatial structures –
exponential, Gaussian, spherical and rational quadratic, respectively. See Figure S4 for results based on
x_1, x_2, x_3 and x_4 with an i.i.d. normal distribution.
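The data-generating equations can be sketched in Python as follows. This is a simplified illustration, not the study's R code. It makes three assumptions: all error vectors are i.i.d. normal (the simplified variant mentioned above, rather than the four spatial covariance structures), the neighbourhood is rook contiguity, and the parameter values are arbitrary. The two autoregressive equations are solved by fixed-point iteration, which converges because |ρ| < 1 and W is row-stochastic.

```python
import random


def neighbours(nrow, ncol):
    """Rook neighbours of each grid cell (an assumed neighbourhood definition)."""
    out = []
    for r in range(nrow):
        for c in range(ncol):
            out.append([rr * ncol + cc
                        for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                        if 0 <= rr < nrow and 0 <= cc < ncol])
    return out


def lag(nb, v):
    """Row-standardised spatial lag Wv: the mean of v over each cell's neighbours."""
    return [sum(v[j] for j in js) / len(js) for js in nb]


def solve_lag(nb, rho, b, tol=1e-12, max_iter=10000):
    """Solve v = rho * W v + b by fixed-point iteration (requires |rho| < 1)."""
    v = list(b)
    for _ in range(max_iter):
        new = [rho * lv + bi for lv, bi in zip(lag(nb, v), b)]
        if max(abs(p - q) for p, q in zip(new, v)) < tol:
            return new
        v = new
    raise RuntimeError("fixed-point iteration did not converge")


rng = random.Random(1)
nb = neighbours(20, 20)
n = len(nb)


def noise():
    """i.i.d. standard normal errors (a simplification of V_1..V_4)."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


# Illustrative parameter values only, not those used in the study.
rho1, rho2 = 0.6, 0.5
mu0, mu1, mu2, mu3, mu4 = 1.0, 0.5, 0.5, 0.5, 0.5
b1, b2, b3, b4 = 2.0, 1.5, 1.0, 0.5
gamma1, phi1, phi2 = 0.7, 0.4, 0.3

x1 = [mu1 + e for e in noise()]                                   # eqn 2
x2 = [mu2 + gamma1 * a + e for a, e in zip(x1, noise())]          # eqn 3
rhs3 = [mu3 + phi1 * a + phi2 * c + e for a, c, e in zip(x1, x2, noise())]
x3 = solve_lag(nb, rho2, rhs3)                                    # eqn 4
x4 = [mu4 + e for e in noise()]                                   # eqn 5
rhs = [mu0 + b1 * a + b2 * c + b3 * d + b4 * f + e
       for a, c, d, f, e in zip(x1, x2, x3, x4, noise())]
y = solve_lag(nb, rho1, rhs)                                      # eqn 1
```

Solving v = ρWv + b is equivalent to computing (I – ρW)^{-1}b, so y and x_3 carry the intrinsic spatial structure even though the inputs here are non-spatial.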
Given that the spdep package (Bivand et al. 2014) is the most convenient tool to perform spatial
autoregressive modelling and that this package assumes i.i.d. errors of normal distribution, response
variable data with normally distributed errors was our choice for this study. The ecological interpretation
suggested above of the artificial dataset is intended to serve as an intuitive example to illustrate the
underlying logic behind the formulas for data generation. For rigorous treatment of binary or count
response variables in empirical studies, see Rue & Held (2005) and LeSage & Pace (2009).
The data generations were conducted on a 20×20 grid, with 50 random sets (from the uniform
distribution with a range between 0.01 and 100) of coefficients for x1, x2, x3 and x4, with 100 replicates for
each set, thus 5000 replicates in total.
Models
We assessed eight types of regression models, namely standard linear regression model estimated by
ordinary least squares (we used CLM as an abbreviation for such a classical linear model), generalized
additive mixed model (GAMM), conditional autoregressive model (CAR), simultaneous autoregressive
error model (ERR), simultaneous autoregressive lag model (LAG), LAG with autoregressive errors
(SAC), simultaneous autoregressive mixed model (MIX, equivalent to the spatial Durbin model), and the
extended spatial Durbin model (ESDM). As a brief introduction, GAMM (y = Xβ + f_smooth + Zu + ε, ε ~
N(0, σ^2 I)) is a combination of the generalized additive model and the linear mixed model, capable of
accounting for additional spatial structures by the smooth function f_smooth and the random effects Zu + ε;
CAR (y = Xβ + u, u ~ N(0, (I – ρW)^{-1}D), where W is symmetric and D is diagonal) and ERR (y = Xβ + u, u
~ N(0, (I – λW)^{-1}D((I – λW)^{-1})^T), where W need not be symmetric) incorporate additional spatial structures
in their error terms; LAG (y = ρWy + Xβ + ε, ε ~ N(0, σ^2 I)) uses a spatial term ρWy to account for intrinsic
processes, while MIX (y = ρWy + Xβ + WXγ + ε, ε ~ N(0, σ^2 I)) uses two terms ρWy and WXγ to account
for additional spatial structures. For more details, see Cressie (1993), Banerjee, Carlin & Gelfand (2004),
Wood (2006), Gelfand et al. (2010), and West, Welch & Gałecki (2015). However, SAC and ESDM need
further elaboration.
The SAC model takes the following form (Anselin 1988; LeSage & Pace 2009),
y = ρW_1 y + Xβ + λW_2 u + ε, eqn 6
where W_1 and W_2 are either identical or non-identical weight matrices; ρ and λ are autocorrelation
parameters; X is a matrix of regressors; β is a vector of coefficients; u is a vector of spatially structured
residuals; ε is an i.i.d. vector.
The ESDM model takes the following form (LeSage & Pace 2009),
y = ρW_1 y + Xβ + W_1 Xγ + λW_2 u + ε, eqn 7
where γ is another vector of coefficients and W_1 Xγ represents effects from neighboring explanatory
variables as in MIX, with the rest as above.
GAMM, CAR, and ERR assume strict exogeneity for their design matrix X, as does CLM.
GAMM’s random effects u and ε are independent of its fixed effects (Wood 2006; Monohan 2008). CAR
and ERR also assume that the error term u is independent of the term Xβ (Cressie 1993). The distinction
between CAR and ERR can be seen more clearly by setting Xβ = 0, without losing generality. In this case
where E(y) = 0, CAR and ERR simplify to similar autoregressive forms y = u = ρWy + ε and y = u = (I –
λW)^{-1}ε = λWy + ε, respectively. The covariance for CAR is Cov(ε, y) = D, which means that ε_i is
uncorrelated with y_j (j ≠ i). In contrast, the covariance for ERR is Cov(ε, y) = D((I – λW)^{-1})^T, which is not
diagonal and indicates that ε_i is correlated with y_j (j ≠ i). Therefore, in such an autoregressive form, both
CAR and ERR violate strict exogeneity and their explanatory variable Wy is endogenous. More
specifically, the Wy in CAR corresponds to the spatial case of sequential exogeneity in time series
(Wooldridge 2012), a special case of being endogenous where the neighbors of each observation are
conditionally assumed exogenous. From an interpretive perspective, CAR indicates a local,
one-directional spatial process: for one site, given all other sites, only its neighbor(s) have an effect on its value;
given its neighbors, changes at the site have no effect on the remaining sites, as its neighbors are
exogenously determined and the remaining sites are conditionally independent of the site (Rue & Held 2005;
Gelfand et al. 2010). ERR indicates a global, two-directional spatial process: one site can be influenced
by its neighbor(s) as well as the remaining sites, and changes at the site simultaneously have an effect on
the other sites (Anselin 2003; Gelfand et al. 2010). ERR in the form y = λWy + ε can be viewed as LAG,
which normally differs from the former in the explicit inclusion of endogenous Wy as an explanatory
variable to investigate the effects of intrinsic processes. Meanwhile, given that W is a row stochastic
matrix and |ρ| is less than one, LAG (y = ρWy + Xβ + ε) can be written as y = (I – ρW)^{-1}(Xβ + ε) = lim_{n→∞}
(I + ρW + ρ^2 W^2 + … + ρ^n W^n)(Xβ + ε) (Horn & Johnson 2013), where W represents neighbors, and W^2
represents neighbors of neighbors, and so on for higher powers of W. The interpretation for the expansion
is that, for example, the observation at one site is influenced not only by the local climatic condition, but
also by its surrounding climatic condition, as responses at surrounding sites to surrounding condition can
spread via dispersal, and can spill over across all the sites that are connected by a neighbor network, with
decaying strength along increasing neighboring distance. Therefore, LAG permits the possibility that
whatever changes in y at site i due to changes in either X or ε will impact on y at another site j (j ≠ i). In
contrast, the usual forms of ERR and CAR assume that changes in X at site i have no partial effects on y
at site j. Such autocorrelation of spillover is referred to as intrinsic spillover autocorrelation (ISA) in this
study. By denoting lim_{n→∞} (ρW + ρ^2 W^2 + … + ρ^n W^n) = (I – ρW)^{-1} – I by A, LAG can be viewed as an ERR
y = (X + AX)β + (I – ρW)^{-1}ε that alternatively allows for non-zero partial derivative of y at site i with
respect to X at site j and accounts for intrinsic autocorrelation without violating strict exogeneity. In this
sense, the intrinsic autocorrelation caused by neighbors can, under certain conditions, be decomposed into
two parts. One is driven by X, while the other is independent of X. The error term in ERR or CAR only
considers the independent part of intrinsic autocorrelation.
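The spillover implied by the series expansion of (I – ρW)^{-1} can be illustrated with a toy Python sketch. The setting is hypothetical: nine sites arranged on a ring, each with its two adjacent sites as equally weighted neighbours, and illustrative values ρ = 0.6 and β = 1. The function accumulates the terms β e, ρWβ e, ρ^2 W^2 β e, and so on, where e is a unit change in X at one site, which gives the change in y at every site.

```python
def ring_lag(v):
    """Wv on a ring: each site's two adjacent sites get weight 1/2 (toy assumption)."""
    n = len(v)
    return [0.5 * (v[(i - 1) % n] + v[(i + 1) % n]) for i in range(n)]


def spillover(n_sites, rho, beta, site=0, n_terms=200):
    """Change in y at every site after a unit increase in X at `site`.

    Uses the truncated series (I + rho W + rho^2 W^2 + ...) applied to beta * e_site.
    """
    term = [0.0] * n_sites
    term[site] = beta                 # zeroth-order term: the direct effect
    total = list(term)
    for _ in range(n_terms):
        term = [rho * t for t in ring_lag(term)]   # next power of rho * W
        total = [a + b for a, b in zip(total, term)]
    return total


effects = spillover(n_sites=9, rho=0.6, beta=1.0)
for distance in range(5):
    print(distance, round(effects[distance], 4))
```

In this toy case the effect at the focal site exceeds β because of feedback through its neighbours, the effects decay with neighbour distance, and they sum to β/(1 – ρ) across sites. An ERR or CAR model with the same X would assign zero effect to every site other than the focal one.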
Although under certain conditions (normal distribution, symmetric weight matrix) the variance-
covariance matrix in CAR and ERR can shift between each other (Cressie 1993), the weight matrix will
change, and so will the implied ecological process. Similarly, although any valid variance-covariance
matrix (e.g. for ERR with an asymmetric weight matrix, the linear mixed model or the geo-statistical
model) can be equivalent to that in CAR (Cressie 1993), the interpretation of weight matrix and data-
generating processes can differ from the original one. Another implication is that having X controlled,
identical observation of the spatial structure of response variable could suggest alternative spatial
processes whose interpretations differ.
MIX has multiple interpretations, including omitted endogenous variables and uncertainty among
CLM, ERR and LAG (LeSage & Pace 2009). Although implementations of SAC and ESDM are
readily available, their empirical interpretations can be complex and may depend on both theoretical and
substantive conditions (Anselin 2003). Generally, SAC represents more elaborate error structures, while
ESDM additionally represents more heterogeneous spillover effects.
We categorized the models into two groups, Group E and Group I. The former differs from the
latter in the absence of an endogenous term in model formulas. Therefore, Group E explicitly accounts for
extrinsic processes that are assumed exogenous, while Group I additionally accounts for intrinsic
processes that are endogenous (Table 1). Group E includes CLM, GAMM, CAR and ERR, the latter three
of which have been reported to perform well when dealing with spatial autocorrelation deriving from
exogenous variables (Beale et al. 2010). CLM was included for the purpose of contrast. Group I includes
models explicitly formulated to account for spatial autocorrelation of intrinsic sources, namely LAG,
SAC, MIX and ESDM.
From the equation y = ρWy + Xβ + ε = (I – ρW)^{-1}(Xβ + ε), we can see that the whole spatial
structure in y can be partitioned into two independent parts, (I – ρW)^{-1}Xβ and (I – ρW)^{-1}ε, as X and ε are
assumed to be independent. The part (I – ρW)^{-1}Xβ consists of Xβ plus AXβ. Since AX is clearly dependent
on X, leaving AX in the error term will make X endogenous. Only in the case where models from Group E
capture the entire (I – ρW)^{-1}Xβ by their explanatory variables will they yield unbiased estimates of β, if the
bias from Group E is indeed caused by intrinsic autocorrelation. To test this hypothesis, numerical values
from 0.02 to 0.98 by a step of 0.03 (balancing computational efficiency and sampling density) were
assigned to ρ and then X* = AX = [(I – ρW)^{-1} – I]X was used as extra explanatory variables in addition to X
to obtain the estimates from Group E for β. We implemented such an additional test in a context (i.e.
Scenario MN described below) where all anomalies can be attributed to intrinsic autocorrelation.
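The construction of the extra covariate X* = AX over the grid of ρ values can be sketched as follows. This Python example is illustrative only: the five-site line, the covariate values and the function names are hypothetical, and the matrix A is applied through its series ρW + ρ^2 W^2 + … rather than by explicit inversion.

```python
def rho_grid(start=0.02, stop=0.98, step=0.03):
    """Candidate values of rho; integer stepping avoids floating-point drift."""
    n_steps = round((stop - start) / step)
    return [round(start + k * step, 2) for k in range(n_steps + 1)]


def lag(nb, v):
    """Row-standardised spatial lag Wv: the mean of v over each site's neighbours."""
    return [sum(v[j] for j in js) / len(js) for js in nb]


def spillover_covariate(nb, rho, x, tol=1e-12, max_iter=100000):
    """x* = A x = [(I - rho W)^-1 - I] x, summed as rho W x + rho^2 W^2 x + ..."""
    term, total = list(x), [0.0] * len(x)
    for _ in range(max_iter):
        term = [rho * t for t in lag(nb, term)]
        total = [a + b for a, b in zip(total, term)]
        if max(abs(t) for t in term) < tol:
            return total
    raise RuntimeError("series did not converge")


# Hypothetical toy example: five sites on a line, neighbours are adjacent sites.
nb = [[1], [0, 2], [1, 3], [2, 4], [3]]
x = [1.0, 2.0, 4.0, 2.0, 1.0]

grid = rho_grid()                                   # 0.02, 0.05, ..., 0.98
x_star = {rho: spillover_covariate(nb, rho, x) for rho in grid}
```

Each entry of `x_star` would then be added alongside X as an extra regressor in a Group E model, giving one fit per candidate ρ.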
In terms of GAMM specification, since we assumed linear relationships between the response
variable and the explanatory variables, we did not include smooth functions of the explanatory variables
when specifying the additive part of GAMM. Instead, the geographical coordinates were used to capture
extra spatial structure (Beale et al. 2010). We chose the smooth function of the alternative tensor product
(Wood 2014), as it performed best in an additional test (results not shown here) where the fitting results
using different smooth functions were compared in terms of both likelihood ratio tests and AIC values
(Chapter 5 & 6 in Wood 2006). We set the basis dimension (i.e. the k) to be the limit imposed by both the
numbers of observations and parameters so that it would not be restrictive. GAMM includes by default a
random effects component for the smooth function. Due to the assumed stationarity in data generation, we
did not additionally consider effects of random slope.
In addition, we tested the performance of the eigenvector-based approach of Moran’s eigenvector
maps (Dray, Legendre & Peres-Neto 2006). See results in Fig. S7.
Scenarios
Based on the idea that correlation structure in the residuals of a model usually indicates model
misspecification where some important information about the variable in question has been omitted, we
simulated scenarios where different extrinsic explanatory variables were omitted. Particularly, we
assessed the performance of these models in four analytic scenarios, the settings of which changed
gradually from being ideal to realistic (Table 2). Our main interest was in how Group E performs with
missing intrinsic processes and how Group I performs with missing extrinsic processes.
Scenario MN (Missing None of the extrinsic variables): all the extrinsic explanatory variables are
included in the models from Group E and Group I, with additional ρWy in Group I by default. In this ideal
context, a comparison of results between Group E and Group I will reveal the pure impacts of ISA on
coefficient estimation and type I error rate, because all extrinsic processes have been accounted for.
Scenario MU (Missing an Uncorrelated variable): In reality, it is usually impossible to include all
the extrinsic variables; it is highly likely that we might omit some unnoticeable or unmeasurable
variables. To assess the effect of ISA in the context of missing uncorrelated extrinsic factors, we omit the
variable x4 that is uncorrelated with other variables.
Scenario MC (also Missing a Correlated variable): In addition to the uncorrelated extrinsic
variable x4, we might also omit one more variable (i.e. x2 in this study) that is cross-correlated with
variables included in the model. Due to the correlation of x2 with x1 and x3, an extra source of endogeneity
(i.e. omitted variable bias) emerges and none of the models is capable of eliminating it unless relevant
information is incorporated. Since the relationships among x1, x2 and x3 are known, we corrected the
omitted variable bias in the way that the expected true coefficients of x1 and x3 in this scenario are the sum
of the true pre-set value and the coefficient of the regression of x2 on x1 and x3, respectively (Hayashi
2000), with a purpose of exclusively showing the bias caused by ISA. Results without such correction
were also shown.
Scenario MB (also Missing a Biotic variable): In a more complicated and realistic situation, it is
not uncommon to omit a correlated variable involving ISA per se (referred to as the biotic variable here),
i.e. x3 in this study, the population density for species B. Again, omitted variable bias occurs due to the
correlation of x3 with x1 and x2. The omitted variable bias was corrected in the way that the expected true
coefficients of x1 and x2 are the sum of the preset true value and the coefficient of the regression of x3 on
x1 and x2, respectively. Results without such correction were also shown.
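For reference, the textbook form of omitted variable bias (Hayashi 2000) can be written for the two scenarios as below. The auxiliary coefficients δ and π, the residuals v and w, and the tilde notation for the expected coefficients are symbols introduced here for illustration. The sketch assumes that the corrections described above follow this standard form, in which the auxiliary regression coefficient is scaled by the coefficient of the omitted variable. It describes only the omitted-variable component and ignores the spatial lag term.

```latex
\begin{aligned}
\text{Scenario MC:}\quad x_2 &= \delta_0\alpha + \delta_1 x_1 + \delta_3 x_3 + v, \\
\tilde{\beta}_1 &= \beta_1 + \beta_2\,\delta_1, \qquad \tilde{\beta}_3 = \beta_3 + \beta_2\,\delta_3; \\
\text{Scenario MB:}\quad x_3 &= \pi_0\alpha + \pi_1 x_1 + \pi_2 x_2 + w, \\
\tilde{\beta}_1 &= \beta_1 + \beta_3\,\pi_1, \qquad \tilde{\beta}_2 = \beta_2 + \beta_3\,\pi_2.
\end{aligned}
```

Because x_4 is uncorrelated with the other variables, its omission adds no term to these expressions.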
In all the four scenarios, we assessed the type I error rate by adding a spatially-structured random
variable x5 that is independent of y, x1, x2, x3 and x4 to the model formula and checking the corresponding
p-value, with a comparison using its non-spatial counterpart as a control. In Scenario MN, an inflation of
type I error rate would be exclusively caused by intrinsic autocorrelation, while in the other three
scenarios inflations of type I error rate may be caused by intrinsic, extrinsic, or both forms of
autocorrelation.
Comparisons
To compare model performance in coefficient estimation, we calculated “relative imprecision” as
|(estimated value – true value)/true value| and “relative bias” as (mean of estimated values – true value)/true value for each
covariate taken into account by models in the different scenarios. It is noteworthy that imprecision is not
equivalent to bias. Imprecision is the spread between every estimated value and the true value for a
parameter (i.e. a visual reflection of the variance of estimates), while bias is the difference between the
mean of estimated values and the true value for a parameter. In every scenario, the type I error rate at
α=0.05 level of each model was displayed.
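The three comparison measures can be written out as a short Python sketch. The function names and the example numbers are hypothetical; only the formulas follow the definitions above.

```python
def relative_imprecision(estimates, true_value):
    """|(estimate - true) / true| for every replicate estimate."""
    return [abs((e - true_value) / true_value) for e in estimates]


def relative_bias(estimates, true_value):
    """(mean of estimates - true) / true, one number per coefficient."""
    mean = sum(estimates) / len(estimates)
    return (mean - true_value) / true_value


def type1_error_rate(p_values, alpha=0.05):
    """Share of replicates in which an irrelevant variable is declared significant."""
    return sum(p < alpha for p in p_values) / len(p_values)


# Hypothetical replicate estimates of a coefficient whose true value is 2.0.
estimates = [1.8, 2.2, 1.9, 2.1]
imprecision = relative_imprecision(estimates, 2.0)   # spread of individual estimates
bias = relative_bias(estimates, 2.0)                 # close to zero for this example

# Hypothetical p-values for the irrelevant variable x5 across four replicates.
rate = type1_error_rate([0.01, 0.20, 0.04, 0.50])
```

The example shows why the two measures differ: the estimates scatter around the true value, so imprecision is non-zero for every replicate while the bias is essentially zero.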
All the simulations and assessments were implemented in R 3.1.2 (R Core Team 2014) using
the spdep package (Bivand 2014).
Results
Pure intrinsic spillover autocorrelation (ISA) (Fig. 1a) inflated the type I error rate of Group E models
except for the ERR model. In more realistic scenarios (Fig. 1b-d), Group E and Group I fairly consistently
exhibited an inflated type I error rate, with the ERR, MIX and ESDM models performing relatively well.
CAR and ERR even showed a deflated type I error rate for irrelevant non-spatial variables in the presence
of ISA (Fig. 1, S1).
In relation to coefficient estimation, pure ISA caused noticeable imprecision and bias of estimates
from Group E (Fig. 2a, 3a), which across four scenarios generally suffered more in comparison to Group
I. Moreover, imprecision and bias increased with scenario realism (Fig. 2, 3). In Scenario MC and MB, all
models suffered extra omitted variable bias (Fig. 3, S3). In Scenario MN, MU and MC, SAC performed
almost as well as ESDM. In Scenario MB with correction, however, ESDM, unlike the other models, yielded
correct estimates, and more so when the biotic variable had a larger effect (Fig. 3, S3).
For Group E, exogenous variables that appropriately accounted for the information of ISA
yielded unbiased coefficient estimates (Fig. 4). Any ISA left in the error term always led to bias in
coefficient estimates, even when the explanatory variables were non-spatial, with bias magnitude
associated with structures of the explanatory variables (Fig. 4, S4).
Discussion
Our results clearly show that omitting information on intrinsic spillover autocorrelation (ISA) generally
inflates type I error rates of the explicitly extrinsic models (Fig. 1). However, ERR, along with MIX and
ESDM, produced nearly correct error rates, depending on the realism in the simulation. In contrast to
some researchers who favored not reporting p-values at all from regression models applied to empirical
data (Jetz & Rahbek 2001; Rahbek & Graves 2001; Hawkins 2012), we show that some spatial models
(e.g. ERR, MIX and ESDM) produce generally reliable significance test results when the null hypothesis
βk = 0 is true, even when they suffer from the endogeneity problem. The deflated type I error rates that CAR
and ERR show in some cases for irrelevant non-spatial variables arise because their algorithms overestimate the
standard errors of the coefficient estimates for such variables (Table S1, S2). Caution is needed when
reporting significance tests from CLM, GAMM, CAR and LAG, which are prone to falsely reporting
significant relations for spatially structured variables in the presence of autocorrelated residuals of either
source.
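The link between a mis-computed standard error and the type I error rate can be illustrated with a short calculation. The sketch below is not part of the original analysis (which was run in R); it assumes a two-sided Wald z-test and an unbiased, normally distributed estimator whose true coefficient is zero. The function name actual_type1_rate is hypothetical; the input values are read from Table S1 (Scenario MN).

```python
from statistics import NormalDist

def actual_type1_rate(cse, ase, alpha=0.05):
    """Rejection rate of a two-sided z-test when the model reports a standard
    error `cse` but the estimator's actual sampling SD is `ase`.
    Assumes a normal, unbiased estimator and a true coefficient of zero."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # The test rejects when |estimate| / cse > z_crit,
    # i.e. when |estimate| / ase > z_crit * cse / ase.
    return 2 * (1 - NormalDist().cdf(z_crit * cse / ase))

# Values read from Table S1, Scenario MN
print(actual_type1_rate(0.097, 0.075))  # CAR, non-spatial x5: CSE > ASE, rate below 0.05
print(actual_type1_rate(0.189, 0.577))  # CLM, spatial x5: CSE < ASE, rate above 0.05
```

Under these assumptions, a computed standard error larger than the actual one pushes the rejection rate below the nominal level, and a smaller one pushes it above, which is the pattern summarised in Tables S1 and S2.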
With regard to unbiasedness of coefficient estimates, models from Group E are incapable of
dealing with ISA, unless their exogenous variables appropriately represent such autocorrelation. For
Group I, SAC appears to be a competitive alternative to ESDM except, importantly, in
Scenario MB, where one omitted variable involves both self-correlation and cross-correlation.
Omitting such variables is one motivation for MIX (LeSage & Pace 2009). Theoretically, MIX has
previously been argued to be the optimal model for balancing model complexity and precision when dealing
with autocorrelation of two origins (LeSage & Pace 2009). Given that ESDM is a general
extension of MIX, our finding that ESDM performs well overall is largely consistent with this
theoretical argument. On the other hand, ESDM only partly addresses the bias of spillover effects present
in Scenarios MC and MB. It is important to bear in mind that none of the models tested here, including
ESDM, can rectify omitted variable bias to recover the true values, because the extra information needed is
carried by the omitted variables.
One potential drawback of ESDM should also be noted. The need to specify two
weight matrices in ESDM increases the risk of specifying an inappropriate one, as an appropriate weight
matrix is usually assumed to correspond to the realistic network of interactions (Anselin 1988). While
Beale et al. (2010) noted that mis-specified spatial structures had a relatively trivial effect on precision
compared to no spatial structure, it has been reported that invalid weight matrices (Bardos,
Guillera-Arroita & Wintle 2015) and different definitions of neighbors and weights (Anselin 1988) could lead to
contrasting results. Given the spillover effect of (I – ρW)⁻¹, whereby a change at one site spills over to
all the sites that are connected by a neighbor network, a priori knowledge about the interaction network may
play an important role in specifying weight matrices and in coefficient estimation (Fig. S5). The exact effect
of weight matrix specification on regression modelling remains to be explored further.
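The spillover effect of (I – ρW)⁻¹ can be made concrete with a minimal sketch. It is an illustration only, not the simulation code of this study: it assumes a ring of 11 sites with row-standardised weights (each site has two neighbours) and ρ = 0.8, and it approximates the inverse by a fixed-point iteration. The function names ring_weights and spillover are hypothetical.

```python
def ring_weights(n):
    """Row-standardised weights on a ring: two neighbours per site, weight 0.5 each."""
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        W[i][(i - 1) % n] = 0.5
        W[i][(i + 1) % n] = 0.5
    return W

def spillover(W, x, rho, n_iter=500):
    """Approximate (I - rho*W)^-1 x by iterating y <- x + rho*W y.

    This is the truncated Neumann series x + rho*W x + rho^2*W^2 x + ...,
    which converges for |rho| < 1 when W is row-standardised.
    """
    n = len(x)
    y = list(x)
    for _ in range(n_iter):
        y = [x[i] + rho * sum(W[i][j] * y[j] for j in range(n)) for i in range(n)]
    return y

n, rho = 11, 0.8
impulse = [0.0] * n
impulse[0] = 1.0  # unit change at site 0 only
effect = spillover(ring_weights(n), impulse, rho)
print([round(v, 3) for v in effect])  # every site responds; the response decays with distance
```

In this toy network a change at a single site reaches all connected sites, so a different neighbour definition would redistribute the whole response rather than only its local part.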
It is important to note that CLM’s weak performance in the present study in terms of
unbiasedness only indicates its limited ability to deal with endogenous variables, not an inherent
inferiority to spatially explicit models. When information responsible for endogeneity bias is added into
the model (e.g. adding X* in Fig. 4), the coefficient estimates from CLM are unbiased, regardless of the
spherical error assumption. Therefore, we stress the necessity of exploring and accounting for important
processes underlying observed patterns, rather than assuming absolute superiority of a certain model.
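The role of X* = [(I – ρW)⁻¹ – I]X can be sketched with a deliberately simplified, noise-free example. The assumptions are ours and differ from the simulations reported here: a ring of 20 sites with row-standardised two-neighbour weights, ρ = 0.8, a single deterministic covariate x, a true coefficient of 1 and no error term. The helper names solve_lag and dot are hypothetical.

```python
import math

def solve_lag(x, rho, n_iter=500):
    """(I - rho*W)^-1 x on a ring with row-standardised two-neighbour weights,
    computed by the fixed-point iteration y <- x + rho*W y."""
    n = len(x)
    y = list(x)
    for _ in range(n_iter):
        y = [x[i] + rho * 0.5 * (y[i - 1] + y[(i + 1) % n]) for i in range(n)]
    return y

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

n, rho, beta = 20, 0.8, 1.0
# Deterministic, spatially structured covariate with zero mean (illustrative choice)
x = [math.sin(2 * math.pi * i / n) + math.sin(4 * math.pi * i / n) for i in range(n)]
y = solve_lag([beta * v for v in x], rho)                # noise-free response with ISA
x_star = [m - v for m, v in zip(solve_lag(x, rho), x)]   # X* = [(I - rho*W)^-1 - I] X

# Least squares without intercept, y ~ x: the spillover is absorbed into the slope
b_naive = dot(x, y) / dot(x, x)

# Least squares without intercept, y ~ x + x_star, via the 2x2 normal equations
sxx, sxs, sss = dot(x, x), dot(x, x_star), dot(x_star, x_star)
sxy, ssy = dot(x, y), dot(x_star, y)
det = sxx * sss - sxs ** 2
b_x = (sss * sxy - sxs * ssy) / det
b_star = (sxx * ssy - sxs * sxy) / det
print(b_naive, b_x, b_star)
```

Because y = βx + βx* holds exactly in this construction, the regression that includes x* recovers β for x, whereas the regression on x alone attributes the spillover to x and overstates its coefficient.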
The strict exogeneity assumption matters only when the ceteris paribus effects of explanatory
variables are the focus of a study (Hayashi 2000; Wooldridge 2012). Beyond causal explanation,
prediction is another important application of regression modeling (James et al. 2013). A thorough
account of how linear regression models can be used to assess causality, and of how
modeling strategies differ between these two purposes, is beyond the scope of this paper. Simply
put, the basic idea of estimating causal effects in a linear model is to control for theoretically relevant
variables as one would in an experiment. Hence, including the term ρWy in a model makes it
possible to quantify the effect of intrinsic processes conditional on other variables. In contrast,
incorporating intrinsic processes into the error term does not allow assessment of their ceteris paribus
effect, although methods have been developed to separate distinct sources of variation (Freckleton & Jetz 2009;
Diniz-Filho et al. 2012). A further distinction between the two approaches to modeling intrinsic processes
is whether the intrinsic process of interest is considered to be independent of the explanatory variables in the
model. For example, the phylogenetic generalized least squares (PGLS) model assumes that self-correlation
in trait evolution captured in the error term is generated by an independent Brownian motion (Martins &
Hansen 1997) or an Ornstein-Uhlenbeck process with a constant adaptation rate parameter (Hansen
1997). However, some intrinsic processes (e.g. dispersal, population dynamics) depend on other
variables: environmental change or landscape alteration at one location can influence species responses at
other locations (Holt 2008; Gorzelak et al. 2015), which is the important spillover feature of ISA. Hence,
the importance of strict exogeneity and the choice of processes to be controlled for depend on the
purpose of regression modeling, the question being asked in a specific study and the
understanding of the specific intrinsic process.
Scale also plays a role in dealing with ISA. When a study is performed at large scales (e.g.
continental scale), ISA may be confined within so few local neighbors that it is reasonable to believe that
the spillover effects and the resulting issues of endogeneity are negligible and that other factors play a
more important role (Diniz-Filho et al. 2003). In contrast, ISA that extends across a large part of the
study space (e.g. at landscape or regional scales) necessitates applying models in Group I to eliminate the
effect of endogeneity on other explanatory variables.
Although ESDM is the most general model in the simultaneous autoregressive model family and
generally performs well, we note that in some cases ESDM may yield marginally less reliable
coefficients than other models when both extrinsic and intrinsic processes have been appropriately accounted
for. We therefore do not propose ESDM as the single model of choice for addressing ISA, but as a good
starting point from which to explore the final model specification. In practice, we advocate simplifying the
general ESDM, if possible, to a more specific and less complex model that retains the ability to account for
all autocorrelation-generating processes. Likelihood ratio tests, Wald tests and Hausman tests are useful
tools for examining model specification (LeSage & Pace 2009; Bivand et al. 2014). Moreover, the
simulation framework in this study assumes perfect knowledge, linear and stationary
relationships, normally distributed error terms and simple weight matrix structures, and thus under-represents
the complexities of real ecological data. We therefore suggest caution in extrapolating the findings here, as well
as consideration of other useful modelling techniques for empirical studies.
Conclusion
Our simulation-based study has shown that omitting intrinsic spillover autocorrelation leads to incorrect
type I error rates and biased coefficient estimates in regression modelling. Our results also indicate that
significance tests from some spatial regression models produce generally reliable results even when their
coefficient estimates are biased. Importantly, for producing efficient and unbiased estimates of ceteris
paribus effects, specification of regression models should account for both extrinsic and intrinsic
processes that generate spatial autocorrelation in spatial data. The extended spatial Durbin model (ESDM)
emerges as the most promising technique for addressing intrinsic spillover autocorrelation across the four
scenarios, but still suffers from bias due to other sources of endogeneity.
Authors’ contributions
SNT conceived the ideas and designed methodology; SNT collected the data; SNT, CX, BS and JCS
analyzed the data; SNT and JCS led the writing of the manuscript. All authors contributed critically to the
drafts and gave final approval for publication.
Acknowledgements
We thank Robert B. O’Hara, J. Alexandre F. Diniz-Filho and an anonymous reviewer for their valuable
comments and constructive criticisms on the earlier versions of the manuscript. We thank W. Daniel
Kissling and Gudrun Carl for kindly providing the R code to simulate data in their study. SNT was
supported by the China Scholarship Council (201406190179). JCS was supported by the European Research
Council (ERC-2012-StG-310886-HISTFUNC), and also considers this work a contribution to his
VILLUM Investigator project “Biodiversity Dynamics in a Changing World” (BIOCHANGE) funded by
VILLUM FONDEN. CX was supported by the National Natural Science Foundation of China (41271197).
Data accessibility
The R scripts needed to perform simulations and comparisons are available in Appendix S1.
References
Anselin, L. (1988) Spatial Econometrics: Methods and Models. Kluwer, Dordrecht.
Anselin, L. (2003) Spatial externalities, spatial multipliers, and spatial econometrics. International
Regional Review, 26, 153-166.
Banerjee, S., Carlin, B.P. & Gelfand, A.E. (2004) Hierarchical Modeling and Analysis for Spatial Data.
CRC, Boca Raton.
Bardos, D.C., Guillera-Arroita, G. & Wintle, B.A. (2015) Valid auto-models for spatially autocorrelated
occupancy and abundance data. Methods in Ecology and Evolution, 6, 1137-1149.
Beale, C.M., Lennon, J.J., Yearsley, J.M., Brewer, M.J. & Elston, D.A. (2010) Regression analysis of
spatial data. Ecology Letters, 13, 246-264.
Bivand, R. (2014) spdep: Spatial dependence: weighting schemes, statistics and models. R package
version 0.5-77. http://CRAN.R-project.org/package=spdep
Cliff, A.D. & Ord, J.K. (1981) Spatial Processes: Models and Applications. Pion, London.
Cressie, N. (1993) Statistics for Spatial Data. John Wiley & Sons, Chichester, UK.
Diniz-Filho, J.A.F., Bini, L.M. & Hawkins, B.A. (2003) Spatial autocorrelation and red herrings in
geographical ecology. Global Ecology and Biogeography, 12, 53-64.
Diniz-Filho, J.A.F., Siqueira, T., Padial, A.A., Rangel, T.F., Landeiro, V.L. & Bini, L.M. (2012) Spatial
autocorrelation analysis allows disentangling the balance between neutral and niche processes in
metacommunities. Oikos, 121, 201-210.
Dormann, C.F., McPherson, J.M., Araujo, M.B., Bivand, R., Bolliger, J., Carl, G., Davies, R.G., Hirzel, A.,
Jetz, W., Kissling, W.D., Kühn, I., Ohlemüller, R., Peres-Neto, P.R., Reineking, B., Schröder, B.,
Schurr, F.M. & Wilson, R. (2007) Methods to account for spatial autocorrelation in the analysis
of species distributional data: a review. Ecography, 30, 609-628.
Dray, S., Legendre, P. & Peres-Neto, P.R. (2006) Spatial modelling: a comprehensive framework for
principal coordinate analysis of neighbor matrices (PCNM). Ecological Modelling, 196, 483-493.
Fortin, M.-J. & Dale, M. (2005) Spatial Analysis: guide for ecologists. Cambridge University Press,
Cambridge, UK.
Freckleton, R.P. & Jetz, W. (2009) Space versus phylogeny: disentangling phylogenetic and spatial
signals in comparative data. Proceedings of the Royal Society Series B, 276, 21-30.
Gelfand, A.E., Diggle, P.J., Fuentes, M. & Guttorp, P. eds. (2010) Handbook of Spatial Statistics. CRC,
Boca Raton.
Gorzelak, M.A., Asay, A.K., Pickles, B.J. & Simard, S.W. (2015) Inter-plant communication through
mycorrhizal networks mediates complex adaptive behavior in plant communities. Annals of
Botany Plants, 7, plv050.
Hansen, T.F. (1997) Stabilizing selection and the comparative analysis of adaptation. Evolution, 51, 1341-
1351.
Hawkins, B.A. (2012) Eight (and a half) deadly sins of spatial analysis. Journal of Biogeography, 39, 1-9.
Hawkins, B.A., Diniz-Filho, J.A.F., Bini, L.M., De Marco, P. & Blackburn, T.M. (2007) Red herrings
revisited: spatial autocorrelation and parameter estimation in geographical ecology. Ecography,
30, 375-384.
Hayashi, F. (2000) Econometrics. Princeton University Press.
Hodges, J.S. & Reich, B.J. (2010) Adding spatially-correlated errors can mess up the fixed effect you
love. The American Statistician, 64, 325-334.
Holt, R.D. (2008) Theoretical perspectives on resource pulses. Ecology, 89, 671-681.
Horn, R.A. & Johnson, C.R. (2013) Matrix Analysis, 2nd Edn. Cambridge University Press, New York,
USA.
James, G., Witten, D., Hastie, T. & Tibshirani, R. (2013) An Introduction to Statistical Learning: with
Applications in R. Springer, New York.
Jetz, W. & Rahbek, C. (2001) Geometric constraints explain much of the species richness pattern in
African birds. Proceedings of the National Academy of Sciences, 98, 5661-5666.
Keitt, T.H., Bjørnstad, O.N., Dixon, P.M. & Citron-Pousty, S. (2002) Accounting for spatial pattern when
modeling organism-environment interactions. Ecography, 25, 616-625.
Legendre, P. (1993) Spatial autocorrelation – trouble or new paradigm. Ecology, 74, 1659-1673.
Lennon, J.J. (2000) Red-shifts and red herrings in geographical ecology. Ecography, 23, 101-113.
LeSage, J. & Pace, R.K. (2009) Introduction to Spatial Econometrics. CRC Press, Boca Raton.
Linder, H.P., Bykova, O., Dyke, J., Etienne, R.S., Hickler, T., Kühn, I., Marion, G., Ohlemüller, R.,
Schymanski, S.J. & Singer, A. (2012) Biotic modifiers, environmental modulation and species
distribution models. Journal of Biogeography, 39, 2179-2190.
Martins, E.P. & Hansen, T.F. (1997) Phylogenies and the comparative method: a general approach to
incorporating phylogenetic information into the analysis of interspecific data. The American
Naturalist, 149, 646-667.
Monahan, J.F. (2008) A Primer on Linear Models. CRC, Boca Raton.
Paciorek, C.J. (2010) The importance of scale for spatial confounding bias and precision of spatial
regression estimators. Statistical Science, 25, 107-125.
R Core Team (2014) R: A Language and Environment for Statistical Computing. R Foundation for
Statistical Computing, Vienna. URL http://www.R-project.org [accessed 12 December 2014]
Rahbek, C. & Graves, G.R. (2001) Multiscale assessment of patterns of avian species richness. Proceedings
of the National Academy of Sciences, 98, 4534-4539.
Rue, H. & Held, L. (2005) Gaussian Markov Random Fields: Theory and Applications. CRC, Boca
Raton.
West, B.T., Welch, K.B. & Gałecki, A.T. (2015) Linear Mixed Models: A Practical Guide Using
Statistical Software, 2nd Edn. CRC Press, Boca Raton.
Wood, S.N. (2006) Generalized Additive Models: an introduction with R. CRC, Boca Raton.
Wood, S. (2014) mgcv: Mixed GAM Computation Vehicle with GCV/AIC/REML Smoothness Estimation.
R package version 1.8-3. http://CRAN.R-project.org/package=mgcv
Wooldridge, J.M. (2012) Introductory Econometrics: A Modern Approach, 5th Edn. South-Western
College Publishing, Mason, USA.
Supporting Information
Additional Supporting Information may be found in the online version of this article.
Appendix S1. R scripts for artificial data generation and model assessments.
Table S1. Comparison between the computed standard error by a model’s algorithm and the actual
standard error of the model’s coefficient estimate for the irrelevant variable x5 as in Fig. 1.
Table S2. Comparison between the computed standard error by a model’s algorithm and the actual
standard error of the model’s coefficient estimate for the irrelevant variable x5 as in Fig. S1.
Figure S1. Type I error rate from the eight regression models applied to data generated without intrinsic
spillover autocorrelation in y.
Figure S2. Relative imprecision of coefficient estimates from the eight regression models with
coefficients that are on average identical.
Figure S3. Relative bias of coefficient estimates from the eight regression models with coefficients that
are on average identical.
Figure S4. Relative bias of coefficient estimates from the eight regression models applied to data
generated with four explanatory variables free of spatial structure.
Figure S5. Relative bias of coefficient estimates from the eight regression models with mis-specified
weight matrix for SAC and ESDM.
Figure S6. Relative bias of coefficient estimates from the eight regression models with negative intrinsic
autocorrelation.
Figure S7. Relative bias of coefficient estimates from the eight regression models including the method
of Moran’s eigenvector maps.
Table 1. Overview of eight regression models’ abilities to account for two sources of spatially
autocorrelated residuals.
Model   Extrinsic   Intrinsic   Group
CLM     NO          NO          E
GAMM    YES         NO          E
CAR     YES         NO          E
ERR     YES         NO          E
LAG     NO          YES         I
SAC     YES         YES         I
MIX*    YES/NO      NO/YES      I
ESDM    YES         YES         I
*The dual ability of the MIX model to alternatively account for extrinsic or intrinsic autocorrelation derives
from its formula (Anselin 1988; LeSage & Pace 2009).
Table 2. Description of the four simulated scenarios.
Scenario   Description
MN         Capturing all extrinsic variables
MU         Omitting an uncorrelated variable (x4)
MC         Omitting also a cross-correlated variable not involving intrinsic processes (x2 & x4)
MB         Omitting also a cross-correlated variable involving intrinsic processes (x3 & x4)
Fig. 1. Type I error rate at 5% significance level of the eight regression models in a) Scenario MN
(Missing None of the extrinsic variables), b) Scenario MU (Missing an Uncorrelated variable), c)
Scenario MC (also Missing a Correlated variable) and d) Scenario MB (also Missing the Biotic variable).
An error rate above 0.05 (dashed line) represents inflation. Error bar shows the 95% confidence interval
for the mean type I error rate from 50 rounds of 100 replicates of significance testing, with one type I
error rate for one round. Dark grey (or light grey) color indicates an irrelevant variable with (or without)
spatial structure.
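The summary behind such error bars can be sketched as follows. This is an illustration under stated assumptions rather than the original R code: it treats the confidence interval as a normal approximation (mean ± 1.96 standard errors over rounds), and the p-values are hypothetical placeholders, constructed so that every round of 100 replicates contains exactly five rejections. The function name type1_summary is hypothetical.

```python
from statistics import mean, stdev

def type1_summary(p_values, round_size=100, alpha=0.05):
    """Mean type I error rate over rounds, with a normal-approximation 95% CI."""
    rounds = [p_values[i:i + round_size] for i in range(0, len(p_values), round_size)]
    # One type I error rate per round: the share of replicates with p < alpha
    rates = [sum(p < alpha for p in r) / len(r) for r in rounds]
    centre = mean(rates)
    half_width = 1.96 * stdev(rates) / len(rates) ** 0.5
    return centre, (centre - half_width, centre + half_width)

# Hypothetical p-values for an irrelevant variable: 50 rounds x 100 replicates,
# with exactly five rejections in every round
p_values = ([0.01] * 5 + [0.50] * 95) * 50
print(type1_summary(p_values))
```

With real simulation output the per-round rates vary, and an interval lying entirely above 0.05 would indicate inflation in the sense used in the figure.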
Fig. 2. Relative imprecision in percentage of coefficient estimates from the eight regression models
under strong intrinsic autocorrelation (ρ = 0.8) in a) Scenario MN (Missing None of the extrinsic
variables), b) Scenario MU (Missing an Uncorrelated variable), c) Scenario MC (also Missing a
Correlated variable) and d) Scenario MB (also Missing the Biotic variable) across 5000 replicates of
assessment per model. Explanatory variables are indicated by color (same in all panels). The coefficient
for the biotic variable x3 is on average larger than those for other covariates. The middle line in a box
represents the median; lower and upper box edges correspond to the first and third quartiles; notches in
the box give a roughly 95% confidence interval for the median; whiskers extend in both directions to the
farthest value that is within 1.5 times the inter-quartile range; outliers are excluded.
Fig. 3. Relative bias in percentage of coefficient estimates from the eight regression models under strong
intrinsic autocorrelation (ρ = 0.8) in a) Scenario MN (Missing None of the extrinsic variables), b) Scenario
MU (Missing an Uncorrelated variable), c) Scenario MC (also Missing a Correlated variable) and d)
Scenario MB (also Missing the Biotic variable) across 5000 replicates of assessment per model. The
coefficient for the biotic variable x3 is on average larger than those for other covariates. The orange line for a
box represents the bias after correction for omitted variable bias in Scenarios MC and MB, while the blue line
for a box represents the bias without such correction, with all other elements as in Fig. 2.
Fig. 4. Effect on coefficient estimates of adding an extra explanatory variable X* = [(I – ρW)⁻¹ – I]X into
models from the extrinsic group in Scenario MN (Missing None of the extrinsic variables). Values
from 0.02 to 0.98 in steps of 0.03 were assigned to ρ. For each ρ, 5000 replicates of assessment
per model were implemented. The vertical dashed line indicates the situation where X* accounts for
exactly all X-related intrinsic autocorrelation with strength equal to 0.8. Note that the bias for the
four explanatory variables (in color) falls to zero (the horizontal dashed line) at the point ρ = 0.8.
Table S1. Comparison between the computed standard error (CSE) by a model’s algorithm and the actual standard error (ASE) of the model’s
coefficient estimate for the irrelevant variable x5 as in Fig. 1. A CSE value larger than its ASE value signals deflation of type I error rate; a CSE
value smaller than its ASE value signals inflation of type I error rate, while no difference between CSE and ASE signals correct type I error rate.
       Scenario MN                 Scenario MU                 Scenario MC                 Scenario MB
       Spatial       Non-spatial   Spatial       Non-spatial   Spatial       Non-spatial   Spatial       Non-spatial
Model  CSE    ASE    CSE    ASE    CSE    ASE    CSE    ASE    CSE    ASE    CSE    ASE    CSE    ASE    CSE    ASE
CLM    0.189  0.577  0.180  0.184  0.253  0.800  0.245  0.248  0.255  0.821  0.250  0.254  0.373  1.167  0.369  0.369
GAMM   0.043  0.064  0.013  0.014  0.103  0.134  0.037  0.037  0.109  0.142  0.039  0.040  0.108  0.139  0.037  0.038
CAR    0.165  0.348  0.097  0.075  0.204  0.425  0.121  0.088  0.205  0.426  0.122  0.089  0.318  0.687  0.190  0.144
ERR    0.080  0.086  0.034  0.024  0.119  0.140  0.051  0.040  0.121  0.144  0.052  0.041  0.139  0.159  0.060  0.041
LAG    0.007  0.007  0.007  0.007  0.060  0.136  0.056  0.057  0.073  0.167  0.070  0.071  0.059  0.139  0.057  0.059
SAC    0.007  0.007  0.007  0.007  0.080  0.092  0.037  0.038  0.090  0.108  0.040  0.040  0.079  0.091  0.038  0.038
MIX    0.015  0.016  0.007  0.007  0.109  0.119  0.048  0.053  0.119  0.130  0.053  0.057  0.116  0.127  0.052  0.056
ESDM   0.015  0.016  0.007  0.007  0.089  0.098  0.048  0.050  0.096  0.108  0.053  0.056  0.090  0.099  0.048  0.049
Table S2. Comparison between the computed standard error (CSE) by a model’s algorithm and the actual standard error (ASE) of the model’s
coefficient estimate for the irrelevant variable x5 as in Fig. S1. A CSE value larger than its ASE value signals deflation of type I error rate; a CSE
value smaller than its ASE value signals inflation of type I error rate, while no difference between CSE and ASE signals correct type I error rate.
       Scenario MN                 Scenario MU                 Scenario MC                 Scenario MB
       Spatial       Non-spatial   Spatial       Non-spatial   Spatial       Non-spatial   Spatial       Non-spatial
Model  CSE    ASE    CSE    ASE    CSE    ASE    CSE    ASE    CSE    ASE    CSE    ASE    CSE    ASE    CSE    ASE
CLM    0.007  0.007  0.007  0.007  0.067  0.136  0.065  0.065  0.075  0.158  0.074  0.075  0.075  0.162  0.074  0.074
GAMM   0.007  0.007  0.007  0.007  0.054  0.061  0.037  0.038  0.058  0.066  0.040  0.042  0.059  0.067  0.041  0.042
CAR    0.007  0.007  0.007  0.007  0.058  0.070  0.044  0.040  0.065  0.079  0.049  0.044  0.064  0.079  0.048  0.044
ERR    0.007  0.007  0.007  0.007  0.055  0.057  0.037  0.038  0.061  0.062  0.041  0.041  0.061  0.063  0.041  0.041
LAG    0.007  0.007  0.007  0.007  0.059  0.110  0.058  0.058  0.065  0.120  0.064  0.065  0.064  0.121  0.063  0.064
SAC    0.007  0.007  0.007  0.007  0.055  0.057  0.037  0.038  0.060  0.062  0.041  0.041  0.061  0.063  0.041  0.041
MIX    0.009  0.010  0.007  0.007  0.056  0.057  0.040  0.041  0.061  0.061  0.043  0.045  0.062  0.063  0.044  0.045
ESDM   0.010  0.010  0.007  0.007  0.056  0.058  0.045  0.046  0.061  0.062  0.051  0.053  0.062  0.064  0.049  0.051
Fig. S1. Type I error rate at 5% significance level of the eight regression models in a) Scenario MN
(Missing None of the extrinsic variables), b) Scenario MU (Missing an Uncorrelated variable), c)
Scenario MC (also Missing a Correlated variable) and d) Scenario MB (also Missing the Biotic variable),
based on data generated without intrinsic spillover autocorrelation in y. An error rate above 0.05 (dashed
line) represents inflation. Error bar shows the 95% confidence interval for the mean type I error rate from
50 rounds of 100 replicates of significance testing, with one type I error rate for one round. Dark grey (or
light grey) color indicates an irrelevant variable with (or without) spatial structure.
Fig. S2. Relative imprecision in percentage of coefficient estimates from the eight regression models
under strong intrinsic autocorrelation (ρ = 0.8) in a) Scenario MN (Missing None of the extrinsic
variables), b) Scenario MU (Missing an Uncorrelated variable), c) Scenario MC (also Missing a
Correlated variable) and d) Scenario MB (also Missing the Biotic variable) across 5000 replicates of
assessment per model. Explanatory variables are indicated by color (same in all panels). The coefficients
for the four covariates are on average identical. The middle line in a box represents the median; lower and
upper box edges correspond to the first and third quartiles; notches in the box give a roughly 95% confidence
interval for the median; whiskers extend in both directions to the farthest value that is within 1.5 times the
inter-quartile range; outliers are excluded.
Fig. S3. Relative bias in percentage of coefficient estimates from the eight regression models under
strong intrinsic autocorrelation (ρ = 0.8) in a) Scenario MN (Missing None of the extrinsic variables), b)
Scenario MU (Missing an Uncorrelated variable), c) Scenario MC (also Missing a Correlated variable)
and d) Scenario MB (also Missing the Biotic variable) across 5000 replicates of assessment per model.
The coefficients for the four covariates are on average identical. The orange line for a box represents the bias
after correction for omitted variable bias in Scenarios MC and MB, while the blue line for a box represents the
bias without such correction, with all other elements as in Fig. S2.
Fig. S4. Relative bias in percentage of coefficient estimates from the eight regression models under
strong intrinsic autocorrelation (ρ=0.8) in a) Scenario MN (Missing None of the extrinsic variables), b)
Scenario MU (Missing an Uncorrelated variable), c) Scenario MC (also Missing a Correlated variable)
and d) Scenario MB (also Missing the Biotic variable) across 5000 replicates of assessment per model,
with four covariates free of spatial structure. All other elements are as in Fig. S3.
Fig. S5. Relative bias in percentage of coefficient estimates from the eight regression models under
strong intrinsic autocorrelation (ρ = 0.8) in a) Scenario MN (Missing None of the extrinsic variables), b)
Scenario MU (Missing an Uncorrelated variable), c) Scenario MC (also Missing a Correlated variable)
and d) Scenario MB (also Missing the Biotic variable) across 5000 replicates of assessment per model.
The true weight matrix for y uses a King’s move structure, while that for x3 uses a first-order
Bishop’s move structure. The two weight matrices required by SAC and ESDM both incorrectly use the
King’s move structure. All other elements are as in Fig. S3.
Fig. S6. Relative bias in percentage of coefficient estimates from the eight regression models under
negative intrinsic autocorrelation (ρ = −0.8) in a) Scenario MN (Missing None of the extrinsic variables),
b) Scenario MU (Missing an Uncorrelated variable), c) Scenario MC (also Missing a Correlated variable)
and d) Scenario MB (also Missing the Biotic variable) across 5000 replicates of assessment per model.
All other elements are as in Fig. S3.
Fig. S7. Relative bias in percentage of coefficient estimates from the eight regression models including
the method of Moran’s eigenvector maps (MEM) under strong intrinsic autocorrelation (ρ = 0.8) in a)
Scenario MN (Missing None of the extrinsic variables), b) Scenario MU (Missing an Uncorrelated
variable), c) Scenario MC (also Missing a Correlated variable) and d) Scenario MB (also Missing the
Biotic variable) across 5000 replicates of assessment per model. All other elements are as in Fig. S3.