Effects of Intrinsic Sources of Spatial Autocorrelation on Spatial Regression Modelling

Shuqing N. Teng1*, Chi Xu2, Brody Sandel1 and Jens-Christian Svenning1

1 Section for Ecoinformatics & Biodiversity, Department of Bioscience, Aarhus University, Ny Munkegade 114, DK-8000 Aarhus C, Denmark

2 School of Life Sciences, Nanjing University, 163 Xianlin Road, 210023, Nanjing, PR China

*Correspondence author: E-mail: [email protected]

Running title: Intrinsic autocorrelation in regression modelling

Abstract

1. Detecting and dealing with spatial autocorrelation (SA) are indispensable steps in analyses of

geospatial data. Despite consensus that SA originates from both extrinsic factors and intrinsic

interactions, previous studies on regression analysis of spatially autocorrelated data have rarely controlled

for intrinsic sources in addition to extrinsic ones to assess ceteris paribus (i.e. causal) effects of interest,

with the strict exogeneity assumption that errors containing unexplained variance are uncorrelated with

explanatory variables for all observations. This assumption becomes invalid when intrinsic SA is not an

external process modeled as errors and needs to be controlled for.

2. Here, we aimed to assess the extent to which not controlling for intrinsic SA negatively affects model

performance, specifically in terms of type I error rate and unbiasedness of coefficient estimates, and to

identify models that are able to handle these problems. To this end, we applied two categories of

regression models that do (intrinsic category) or do not (extrinsic category) explicitly control for intrinsic

SA to artificial data generated with both SA sources. These models included the extended spatial Durbin

model (ESDM) and its nested models. Four analytic scenarios simulated realistic modelling conditions,

with realism increasing with the complexity of variables omitted during modelling. The two more realistic

scenarios involved additional violations of strict exogeneity.

3. We found that intrinsic – just as extrinsic – SA can produce incorrect type I error rates, if not explicitly

controlled for. Failing to control for intrinsic SA also generated bias in estimates of ceteris paribus

effects. However, ESDM from the intrinsic category exhibited consistently good performance in dealing

with intrinsic SA across all the scenarios, but suffered from other violations of the strict exogeneity

assumption.

4. Overall, model specification should control for both extrinsic and intrinsic processes generating SA in

spatial data to provide reliable type I error rates and unbiased estimates of ceteris paribus effects. Given

the likely widespread occurrence in observational spatial data of unknown or unmeasurable processes,

ESDM should be a generally preferred starting point to explore the optimal model specification for

estimating ceteris paribus effects, with due caution to other violations of strict exogeneity.

Key-words

Autoregressive models, ceteris paribus effects, biotic interactions, endogenous variables, spillover,

omitted variable bias, strict exogeneity

Introduction

Over the past decades, it has become generally recognized among ecologists that spatial autocorrelation

may lead to violation of the basic assumption of independent errors in standard statistical models,

resulting in incorrect inferences (e.g. underestimated confidence intervals and inflated type I error rates in

the presence of positive spatial autocorrelation) (Legendre 1993; Lennon 2000; Beale et al. 2010).

Currently, the most common way of dealing with this issue is applying spatially explicit modelling

techniques that are designed for modelling spatially autocorrelated data, although outputs from these

spatial models may not always be consistent (Keitt et al. 2002; Dormann et al. 2007). Meanwhile, there

are opponents who are skeptical about the underlying assumptions of such spatial models, arguing that

they lack realistic ecological meaning and that we should caution against their limitations (Hawkins

2012). Overall, there is as yet no generally accepted solution for handling spatial autocorrelation in

modelling. In contrast, there is consensus on the origin of spatial autocorrelation in both intrinsic and

extrinsic processes (Cliff & Ord 1981; Legendre 1993; Fortin & Dale 2005). The generation of spatial

autocorrelation by intrinsic processes can be interpreted as the response of a variable at one position to its

values at neighboring positions, e.g., via dispersal. Hence, spatial structure is generated internally by the

focal variable. Alternatively, spatial autocorrelation may result from responses to extrinsic factors

(including unknown factors modeled as errors), for example, environmental gradients, i.e., with spatial

structuring generated by external forcing.

In terms of regression modelling, one important distinction between the two processes is whether

they violate the strict exogeneity assumption (see below), which is crucial for the unbiasedness of the

ordinary least squares estimator for standard linear regression applied to estimate ceteris paribus (also

referred to as partial or causal) effects (Hayashi 2000; Wooldridge 2012). The standard linear regression

model y = Xβ + ε assumes two types of independence. Firstly, elements in the error vector ε are

independent of each other (i.e. the spherical error assumption: the variance-covariance matrix of ε is an

identity matrix premultiplied by a scalar). Secondly, every element in the error vector ε is uncorrelated

with every element of the design matrix X that contains the explanatory variables (i.e. the strict

exogeneity assumption: the covariance matrix between a column vector in X and ε is a zero matrix;

this is implied by the conditional-mean form E(ε|X) = 0). If the first assumption is violated, the estimate by ordinary least

squares for β will be inefficient (i.e. the estimate’s variance becomes larger than that of generalized least

squares, the best linear unbiased estimator; see Cressie (1993) pp. 20-21), but still unbiased (i.e. the

estimate’s expectation equals the true value). If the second assumption is violated, the estimate by

ordinary least squares for β will be biased. Specifically, the strict exogeneity assumption is violated when

the following situations occur (Wooldridge 2012): explanatory variables are measured with error;

omitted variables left in the error term are correlated with explanatory variables; an observation of

the response variable itself has an effect on its temporal and/or spatial neighbors; explanatory variables are

simultaneously influenced by the response variable. Explanatory variables that meet the strict exogeneity

assumption are said to be exogenous, while those that do not are referred to as endogenous. Hence, in

linear regression models, explanatory variables representing intrinsic processes are endogenous, while

those representing extrinsic processes can be either exogenous or endogenous, depending on specific

contexts. Where the response variable responds to extrinsic processes without feeding back on them, the

variables representing those processes are exogenous; otherwise, they are endogenous, determined inside

the model jointly with the response variable due to the feedback. In this sense, extrinsic factors like

environmental conditions (e.g. climate and topography determined by natural forces at scales larger than

the modelling scale) and some types of biotic interactions (e.g. commensalism, amensalism) can be

exogenous, while some biotic interactions (e.g. mutualism, parasitism) should be viewed as endogenous.
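The two consequences described above can be made explicit with the standard textbook decomposition of the ordinary least squares estimator. This is a sketch in the notation of this section; Ω, a general error covariance matrix, is a symbol introduced here for illustration only.

```latex
\begin{aligned}
\hat{\beta} &= (X^{\top}X)^{-1}X^{\top}y = \beta + (X^{\top}X)^{-1}X^{\top}\varepsilon, \\
\mathrm{E}(\hat{\beta}\mid X) &= \beta + (X^{\top}X)^{-1}X^{\top}\,\mathrm{E}(\varepsilon\mid X), \\
\mathrm{Var}(\hat{\beta}\mid X) &= (X^{\top}X)^{-1}X^{\top}\,\Omega\,X(X^{\top}X)^{-1},
\qquad \Omega = \mathrm{Var}(\varepsilon\mid X).
\end{aligned}
```

If E(ε|X) = 0, the second line reduces to β whatever the form of Ω, so non-spherical errors affect only the variance of the estimate. If E(ε|X) ≠ 0, the second term does not vanish and the estimate is biased.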

Ideally, if a statistical model takes into account all processes generating variation in the response

variable, there should be no signatures of autocorrelation in the errors. In other words, incomplete

representation of these processes is the cause of spatially autocorrelated errors in regression analysis.

Unfortunately, this situation is common in ecological analyses due to either poor knowledge (often as a

result of the complexity of ecological systems) or simply lack of data. For example, it may not be clear

which environmental factors or interspecific interactions shape the distribution of a species and how, or

there may simply not be any data on one potential explanatory variable. When the missing explanatory

variables are autocorrelated and independent of the explanatory variables in a model, the spherical error

assumption will be violated; when the missing explanatory variables are cross-correlated with explanatory

variables in a model, the errors will be correlated with explanatory variables in the model, so that

the strict exogeneity assumption is violated as well.

Previous studies on spatial autocorrelation in ecological modelling have shown that spatial

regression models with distinct purposes, scale issues and conditions of fixed effects will generate

coefficient shifts (also known as spatial confounding) and inconsistent interpretations (Diniz-Filho, Bini

& Hawkins 2003; Hawkins et al. 2007; Hodges & Reich 2010; Paciorek 2010). For regression models

with a purpose of quantifying causal effects of some factors, intrinsic processes are usually not the focus

and are incorporated into the error term with an implicit assumption of strict exogeneity, leaving causal

effects of intrinsic processes uncontrolled. In practice, however, extrinsic and intrinsic processes can

jointly generate spatial patterns in both response and explanatory variables (e.g., abundance patterns

driven by intraspecific, interspecific and species-environment interactions) (Fortin & Dale 2005), thus

making it possible and even likely to have study cases where effects of intrinsic processes need to be

controlled for. For instance, in the context of explaining spatial patterns in ecology using regression

models, we might be interested in the extent to which intrinsic processes (e.g. dispersal, population

dynamics) of the focal organisms contribute to the pattern of interest, with all other factors controlled. In

this sense, a linear model y = f(y_neighbor) + Xβ + ε (where X is exogenous) will tell us something about the

causal effects of the neighbors on one observation, at the expense of the validity of ordinary least squares.

In contrast, a common linear model y = Xβ + ε (where X is exogenous) will not provide such information;

even worse, the effect of f(y_neighbor) will in this case be left in ε, so that ε will be correlated with X and the

model is no longer valid. Another example is that X sometimes should theoretically include variables

representing biotic interactions, including not only direct ones, but also biotic modifiers or modulators

(Linder et al. 2012). Such biotic variables are usually driven by not only their own neighbors, but also

abiotic covariates already in the model. However, they are more likely to be omitted in ecological

modelling than abiotic ones, as the latter are often easily monitored and accessed, resulting in correlation

between the errors and the explanatory variables and thus violation of strict exogeneity.

The questions that we want to answer here are: How do intrinsic sources of autocorrelation in either

response or explanatory variables, given their likely ubiquity in ecology, affect the performance of spatial

regression models that assume strict exogeneity when applied to investigate spatial patterns? What

models are capable of handling both sources of spatial autocorrelation in ecological data for the purpose

of assessing ceteris paribus effects?

Materials and methods

Artificial data

Spatial structure generated by extrinsic processes can be modeled as a function of covariates, while

structure derived from intrinsic processes is usually modeled as a function of the response variable itself

(Anselin 1988; LeSage & Pace 2009). In simultaneous autoregressive models, intrinsic processes are

modelled by the term ρWy, where ρ is the parameter indicating the average strength of spatial

autocorrelation across all observations and |ρ| is less than one; W is the row stochastic (i.e. sum of each

row equals one) weight matrix which identifies eligible neighbors (Anselin 1988; LeSage & Pace 2009)

and y is the response variable vector. In this study, we assumed all extrinsic processes to be exogenous

and ρ positive (see Figure S6 for results with negative ρ).
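As an illustration of the term ρWy, the following minimal Python sketch builds a row-stochastic weight matrix and the corresponding spatial lag. It assumes rook contiguity (shared edges) with equal weights on a small regular grid; the neighbor definition, the function names and the value of ρ are illustrative assumptions, not the settings used in this study.

```python
def rook_weights(nrow, ncol):
    """Row-stochastic weight matrix W for rook contiguity on an nrow x ncol grid."""
    n = nrow * ncol
    W = [[0.0] * n for _ in range(n)]
    for r in range(nrow):
        for c in range(ncol):
            i = r * ncol + c
            candidates = ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
            nbrs = [rr * ncol + cc for rr, cc in candidates
                    if 0 <= rr < nrow and 0 <= cc < ncol]
            for j in nbrs:
                W[i][j] = 1.0 / len(nbrs)  # equal weights, so each row sums to one
    return W


def spatial_lag(W, y):
    """Return Wy: for each site, the weighted mean of y over its neighbors."""
    return [sum(w * v for w, v in zip(row, y)) for row in W]


W = rook_weights(3, 3)
y = [float(v) for v in range(9)]   # toy response on a 3 x 3 grid
Wy = spatial_lag(W, y)
rho = 0.6                          # illustrative value with |rho| < 1
intrinsic_term = [rho * v for v in Wy]
```

Because W is row stochastic, each element of Wy is simply the mean of the neighboring values: the corner site 0 has neighbors 1 and 3, giving Wy[0] = 2.0, and the central site 4 has four neighbors, giving Wy[4] = 4.0.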

We generated artificial data vectors whose spatial structures were shaped by extrinsic and/or

intrinsic processes. To interpret the artificial dataset ecologically, imagine that a population density vector

(y) for a species A is perfectly explained by three environmental variable vectors (x1, x2 and x4,

representing, e.g., temperature, productivity, and available soil water, respectively), positive or negative

interaction with a population density vector (x3) for a species B, migration between neighbor locations

(Wy) and an independent and identically distributed (i.i.d.) error vector (ε) of normal distribution. x3 is a

linear combination of two cross-correlated environmental variables x1 and x2, plus other unknown

independent factors (represented by spatially structured errors). To increase the complexity and variability

of the simulated data, we used four different spatial structures (exponential, Gaussian, spherical and

rational quadratic; see Cressie 1993) to generate the explanatory variables, with strong first-lag

autocorrelation strength roughly equal to 0.8. For simplicity, all relationships among the response variable

and the explanatory variables were set to be linear and stationary; the intrinsic process in y used the same

weight matrix W as that in x3 (see Figure S5 for results based on mis-specified weight matrix). The

mathematical forms of the data vectors are as follows:

y = ρ_1 W y + μ_0 α + β_1 x_1 + β_2 x_2 + β_3 x_3 + β_4 x_4 + ε_0, ε_0 ~ N(0, σ^2 I), eqn 1

x_1 = μ_1 α + ε_1, ε_1 ~ N(0, V_1), eqn 2

x_2 = μ_2 α + γ_1 x_1 + ε_2, ε_2 ~ N(0, V_2), eqn 3

x_3 = ρ_2 W x_3 + μ_3 α + φ_1 x_1 + φ_2 x_2 + ε_3, ε_3 ~ N(0, V_3), eqn 4

x_4 = μ_4 α + ε_4, ε_4 ~ N(0, V_4), eqn 5

where β1, β2, β3, β4, γ1, φ1, φ2, μ0, μ1, μ2, μ3, μ4 are scalar coefficients; α is a vector of ones; I is an

identity matrix; V1, V2, V3, V4 are variance-covariance matrices of differing spatial structures –

exponential, Gaussian, spherical and rational quadratic, respectively. See Figure S4 for results based on

x1, x2, x3 and x4 with an i.i.d. normal distribution.
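Because y appears on both sides of eqn 1, a realization of y has to be obtained from the reduced form y = (I – ρW)^{-1}(µα + Xβ + ε). The Python sketch below does this by fixed-point iteration for a simplified version of eqn 1 with a single covariate. The rook neighborhood, the parameter values, the helper names and the use of an i.i.d. covariate in place of a spatially structured one are all assumptions made to keep the example short.

```python
import random


def rook_neighbors(nrow, ncol):
    """Neighbor index lists for rook contiguity on an nrow x ncol grid."""
    nbrs = []
    for r in range(nrow):
        for c in range(ncol):
            cand = ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
            nbrs.append([rr * ncol + cc for rr, cc in cand
                         if 0 <= rr < nrow and 0 <= cc < ncol])
    return nbrs


def lag(nbrs, v):
    """W v for a row-stochastic W: the mean of v over each site's neighbors."""
    return [sum(v[j] for j in js) / len(js) for js in nbrs]


def solve_lag(nbrs, rho, b, tol=1e-12, max_iter=10000):
    """Solve y = rho * W y + b by fixed-point iteration (a contraction for |rho| < 1)."""
    y = list(b)
    for _ in range(max_iter):
        y_new = [rho * l + bi for l, bi in zip(lag(nbrs, y), b)]
        if max(abs(a - c) for a, c in zip(y_new, y)) < tol:
            return y_new
        y = y_new
    raise RuntimeError("fixed-point iteration did not converge")


rng = random.Random(1)                        # fixed seed for reproducibility
nbrs = rook_neighbors(20, 20)                 # 20 x 20 grid, as in the study
n = len(nbrs)
rho1, mu0, beta1, sigma = 0.6, 1.0, 2.0, 1.0  # illustrative values only
x1 = [rng.gauss(0.0, 1.0) for _ in range(n)]  # i.i.d. stand-in for a structured x1
eps0 = [rng.gauss(0.0, sigma) for _ in range(n)]
b = [mu0 + beta1 * xi + e for xi, e in zip(x1, eps0)]
y = solve_lag(nbrs, rho1, b)                  # y = (I - rho1 W)^{-1} b
```

The iteration is equivalent to summing the series b + ρWb + ρ^2 W^2 b + …, so it avoids forming or inverting the 400 × 400 matrix I – ρW explicitly.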

Given that the spdep package (Bivand et al. 2014) is the most convenient tool to perform spatial

autoregressive modelling and that this package assumes i.i.d. errors of normal distribution, response

variable data with normally distributed errors was our choice for this study. The ecological interpretation

suggested above of the artificial dataset is intended to serve as an intuitive example to illustrate the

underlying logic behind the formulas for data generation. For rigorous treatment of binary or count

response variables in empirical studies, see Rue & Held (2005) and LeSage & Pace (2009).

Data generation was conducted on a 20×20 grid, with 50 random sets (from the uniform

distribution with a range between 0.01 and 100) of coefficients for x1, x2, x3 and x4, with 100 replicates for

each set, thus 5000 replicates in total.

Models

We assessed eight types of regression models, namely standard linear regression model estimated by

ordinary least squares (we used CLM as an abbreviation for such a classical linear model), generalized

additive mixed model (GAMM), conditional autoregressive model (CAR), simultaneous autoregressive

error model (ERR), simultaneous autoregressive lag model (LAG), LAG with autoregressive errors

(SAC), simultaneous autoregressive mixed model (MIX, equivalent to the spatial Durbin model), and the

extended spatial Durbin model (ESDM). As a brief introduction, GAMM (y = Xβ + f_smooth + Zu + ε, ε ~ N(0, σ^2 I))

is a combination of the generalized additive model and the linear mixed model, capable of

accounting for additional spatial structures by the smooth function f_smooth and the random effects Zu + ε;

CAR (y = Xβ + u, u ~ N(0, (I – ρW)^{-1} D), where W is symmetric and D is diagonal) and ERR (y = Xβ + u,

u ~ N(0, (I – λW)^{-1} D ((I – λW)^{-1})^T), where W need not be symmetric) incorporate additional spatial structures

in their error terms; LAG (y = ρWy + Xβ + ε, ε ~ N(0, σ^2 I)) uses a spatial term ρWy to account for intrinsic

processes, while MIX (y = ρWy + Xβ + WXγ + ε, ε ~ N(0, σ^2 I)) uses two terms ρWy and WXγ to account

for additional spatial structures. For more details, see Cressie (1993), Banerjee, Carlin & Gelfand (2004),

Wood (2006), Gelfand et al. (2010), and West, Welch & Gałecki (2015). However, SAC and ESDM need

further elaboration.

The SAC model takes the following form (Anselin 1988; LeSage & Pace 2009),

y = ρ W_1 y + Xβ + λ W_2 u + ε, eqn 6

where W1 and W2 are either identical or non-identical weight matrices; ρ and λ are autocorrelation

parameters; X is a matrix of regressors; β is a vector of coefficients; u is a vector of spatially structured

residuals; ε is an i.i.d. vector.

The ESDM model takes the following form (LeSage & Pace 2009),

y = ρ W_1 y + Xβ + W_1 Xγ + λ W_2 u + ε, eqn 7

where γ is another vector of coefficients and W1Xγ represents effects from neighboring explanatory

variables as in MIX, with the rest as above.

GAMM, CAR, and ERR assume strict exogeneity for their design matrix X, as does CLM.

GAMM’s random effects u and ε are independent of its fixed effects (Wood 2006; Monohan 2008). CAR

and ERR also assume that the error term u is independent of the term Xβ (Cressie 1993). The distinction

between CAR and ERR can be seen more clearly by setting Xβ = 0, without loss of generality. In this case

where E(y) = 0, CAR and ERR simplify to similar autoregressive forms y = u = ρWy + ε and

y = u = (I – λW)^{-1} ε = λWy + ε, respectively. The covariance for CAR is Cov(ε, y) = D, which means that ε_i is

uncorrelated with y_j (j ≠ i). In contrast, the covariance for ERR is Cov(ε, y) = D((I – λW)^{-1})^T, which is not

diagonal and indicates that ε_i is correlated with y_j (j ≠ i). Therefore, in such an autoregressive form, both

CAR and ERR violate strict exogeneity and their explanatory variable Wy is endogenous. More

specifically, the Wy in CAR corresponds to the spatial case of sequential exogeneity in time series

(Wooldridge 2012), a special case of being endogenous where the neighbors of each observation are

conditionally assumed exogenous. From an interpretive perspective, CAR indicates a local, one-directional

spatial process: for one site, given the remaining sites, only its neighbor(s) have an effect on its value;

given its neighbors, changes at the site have no effect on the remaining sites, as its neighbors are

exogenously determined and the remaining sites are conditionally independent of the site (Rue & Held 2005;

Gelfand et al. 2010). ERR indicates a global, two-directional spatial process: one site can be influenced

by its neighbor(s) as well as the remaining sites, and changes at the site simultaneously have an effect on

the other sites (Anselin 2003; Gelfand et al. 2010). ERR in the form y = λWy + ε can be viewed as LAG,

which normally differs from the former in the explicit inclusion of endogenous Wy as an explanatory

variable to investigate the effects of intrinsic processes. Meanwhile, given that W is a row stochastic

matrix and |ρ| is less than one, LAG (y = ρWy + Xβ + ε) can be written as y = (I – ρW)^{-1} (Xβ + ε) =

lim_{n→∞} (I + ρW + ρ^2 W^2 + … + ρ^n W^n)(Xβ + ε) (Horn & Johnson 2013), where W represents neighbors, W^2

represents neighbors of neighbors, and so on for higher powers of W. The interpretation of the expansion

is that, for example, the observation at one site is influenced not only by the local climatic condition, but

also by its surrounding climatic condition, as responses at surrounding sites to surrounding condition can

spread via dispersal, and can spill over across all the sites that are connected by a neighbor network, with

decaying strength as neighbor distance increases. Therefore, LAG permits the possibility that

any change in y at site i due to changes in either X or ε will affect y at another site j (j ≠ i). In

contrast, the usual forms of ERR and CAR assume that changes in X at site i have no partial effects on y

at site j. Such autocorrelation of spillover is referred to as intrinsic spillover autocorrelation (ISA) in this

study. By denoting lim_{n→∞} (ρW + ρ^2 W^2 + … + ρ^n W^n) = (I – ρW)^{-1} – I by A, LAG can be viewed as an ERR

y = (X + AX)β + (I – ρW)^{-1} ε that alternatively allows for a non-zero partial derivative of y at site i with

respect to X at site j and accounts for intrinsic autocorrelation without violating strict exogeneity. In this

sense, the intrinsic autocorrelation caused by neighbors can, under certain conditions, be decomposed into

two parts. One is driven by X, while the other is independent of X. The error term in ERR or CAR only

considers the independent part of intrinsic autocorrelation.
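The spillover implied by the series expansion can be traced numerically. The Python sketch below computes the partial effects of a unit change in X at one site on y at every site under LAG, using the truncated series β(I + ρW + ρ^2 W^2 + …) applied to a unit vector. The one-dimensional chain of 11 sites, the parameter values and the function names are illustrative assumptions rather than settings from this study.

```python
def chain_neighbors(n):
    """Neighbor lists for n sites arranged on a line (adjacent sites are neighbors)."""
    return [[j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)]


def lag(nbrs, v):
    """W v for a row-stochastic W: the mean of v over each site's neighbors."""
    return [sum(v[j] for j in js) / len(js) for js in nbrs]


def spillover(nbrs, rho, beta, site, n_terms=200):
    """Effects on y at all sites of a unit change in x at `site` under LAG.

    Accumulates beta * sum_k rho^k W^k e_site, truncated after n_terms powers.
    """
    term = [0.0] * len(nbrs)
    term[site] = beta                 # k = 0 term: the direct local effect
    total = list(term)
    for _ in range(n_terms):
        term = [rho * l for l in lag(nbrs, term)]   # next power of rho * W
        total = [t + s for t, s in zip(total, term)]
    return total


effects = spillover(chain_neighbors(11), rho=0.5, beta=1.0, site=5)
```

In this toy setting the effect at the focal site exceeds β because of feedback through its neighbors, and the effects at other sites are non-zero but shrink with distance from the focal site. Under the usual forms of ERR and CAR the corresponding vector would be β at the focal site and zero elsewhere.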

Although under certain conditions (normal distribution, symmetric weight matrix) the variance-covariance

matrices of CAR and ERR can be converted into each other (Cressie 1993), the weight matrix will

change, and so will the implied ecological process. Similarly, although any valid variance-covariance

matrix (e.g. for ERR with an asymmetric weight matrix, the linear mixed model or the geo-statistical

model) can be equivalent to that in CAR (Cressie 1993), the interpretation of the weight matrix and

data-generating processes can differ from the original one. Another implication is that, with X controlled,

an identical observed spatial structure of the response variable could suggest alternative spatial

processes whose interpretations differ.

MIX has multiple interpretations, including omitted endogenous variables and uncertainty among

CLM, ERR and LAG (LeSage & Pace 2009). Although implementations of SAC and ESDM are

readily available, their empirical interpretations can be complex and may depend on both theoretical and

substantive conditions (Anselin 2003). Generally, SAC represents more elaborate error structures, while

ESDM additionally represents more heterogeneous spillover effects.

We categorized the models into two groups, Group E and Group I. The former differs from the

latter in the absence of an endogenous term in model formulas. Therefore, Group E explicitly accounts for

extrinsic processes that are assumed exogenous, while Group I additionally accounts for intrinsic

processes that are endogenous (Table 1). Group E includes CLM, GAMM, CAR and ERR, the latter three

of which have been reported to perform well when dealing with spatial autocorrelation deriving from

exogenous variables (Beale et al. 2010). CLM was included for the purpose of contrast. Group I includes

models explicitly formulated to account for spatial autocorrelation of intrinsic sources, namely LAG,

SAC, MIX and ESDM.

From the equation y = ρWy + Xβ + ε = (I – ρW)^{-1}(Xβ + ε), we can see that the whole spatial

structure in y can be partitioned into two independent parts, (I – ρW)^{-1} Xβ and (I – ρW)^{-1} ε, as X and ε are

assumed to be independent. The part (I – ρW)^{-1} Xβ consists of Xβ plus AXβ. Since AX is clearly dependent

on X, leaving AX in the error term will make X endogenous. Only in the case where models from Group E

capture the entire (I – ρW)^{-1} Xβ by their explanatory variables will they yield unbiased estimates for β, if the

bias from Group E is indeed caused by intrinsic autocorrelation. To test this hypothesis, numerical values

from 0.02 to 0.98 by a step of 0.03 (balancing computational efficiency and sampling density) were

assigned to ρ and then X* = AX = [(I – ρW)^{-1} – I]X was used as extra explanatory variables in addition to X

to obtain the estimates from Group E for β. We implemented such an additional test in a context (i.e.

Scenario MN described below) where all anomalies can be attributed to intrinsic autocorrelation.
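The logic of this additional test can be illustrated with a deliberately stripped-down Python sketch: a noise-free LAG response on a short chain of sites, fitted by ordinary least squares without an intercept. The chain layout, the covariate values, ρ = 0.5, β = 2 and all function names are assumptions chosen for illustration; the study itself applied the Group E models to the simulated grid data.

```python
def chain_neighbors(n):
    """Neighbor lists for n sites arranged on a line."""
    return [[j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)]


def lag(nbrs, v):
    """W v for a row-stochastic W: the mean of v over each site's neighbors."""
    return [sum(v[j] for j in js) / len(js) for js in nbrs]


def lag_inverse_apply(nbrs, rho, b, n_iter=500):
    """Return (I - rho W)^{-1} b via the fixed-point iteration z <- rho W z + b."""
    z = list(b)
    for _ in range(n_iter):
        z = [rho * l + bi for l, bi in zip(lag(nbrs, z), b)]
    return z


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def ols_one(x, y):
    """No-intercept OLS slope of y on a single regressor x."""
    return dot(x, y) / dot(x, x)


def ols_two(x1, x2, y):
    """No-intercept OLS of y on two regressors via the 2 x 2 normal equations."""
    a, b, c = dot(x1, x1), dot(x1, x2), dot(x2, x2)
    d, e = dot(x1, y), dot(x2, y)
    det = a * c - b * b
    return (c * d - b * e) / det, (a * e - b * d) / det


rho, beta = 0.5, 2.0                             # illustrative true values
x = [1.0, 4.0, 2.0, 7.0, 3.0, 8.0, 5.0, 6.0]     # toy covariate on 8 sites
nbrs = chain_neighbors(len(x))
z = lag_inverse_apply(nbrs, rho, x)              # (I - rho W)^{-1} x
x_star = [zi - xi for zi, xi in zip(z, x)]       # X* = A x
y = [beta * zi for zi in z]                      # noise-free LAG response

naive = ols_one(x, y)                            # AX left out of the model
b_x, b_xstar = ols_two(x, x_star, y)             # X and X* both included
```

With X* left out, the slope on x absorbs part of the spillover and exceeds β; with X* included, the coefficient on x returns to β. This is the pattern the test is designed to detect, shown here under idealized noise-free conditions.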

In terms of GAMM specification, since we assumed linear relationships between the response

variable and the explanatory variables, we did not include smooth functions of the explanatory variables

when specifying the additive part of GAMM. Instead, the geographical coordinates were used to capture

extra spatial structure (Beale et al. 2010). We chose the smooth function of the alternative tensor product

(Wood 2014), as it performed best in an additional test (results not shown here) where the fitting results

using different smooth functions were compared in terms of both likelihood ratio tests and AIC values

(Chapters 5 & 6 in Wood 2006). We set the basis dimension (i.e. k) to the limit imposed by both the

numbers of observations and parameters so that it would not be restrictive. GAMM includes by default a

random effects component for the smooth function. Due to the assumed stationarity in data generation, we

did not additionally consider effects of random slope.

In addition, we tested the performance of the eigenvector-based approach of Moran’s eigenvector

maps (Dray, Legendre & Peres-Neto 2006). See results in Fig. S7.

Scenarios

Based on the idea that correlation structure in the residuals of a model usually indicates model

misspecification where some important information about the variable in question has been omitted, we

simulated scenarios where different extrinsic explanatory variables were omitted. Particularly, we

assessed the performance of these models in four analytic scenarios, the settings of which changed

gradually from being ideal to realistic (Table 2). Our main interest was in how Group E performs with

missing intrinsic processes and how Group I performs with missing extrinsic processes.

Scenario MN (Missing None of the extrinsic variables): all the extrinsic explanatory variables are

included in the models from Group E and Group I, with additional ρWy in Group I by default. In this ideal

context, a comparison of results between Group E and Group I will reveal the pure impacts of ISA on

coefficient estimation and type I error rate, because all extrinsic processes have been accounted for.

Scenario MU (Missing an Uncorrelated variable): In reality, it is usually impossible to include all

the extrinsic variables; it is highly likely that we might omit some unnoticeable or unmeasurable

variables. To assess the effect of ISA in the context of missing uncorrelated extrinsic factors, we omit the

variable x4 that is uncorrelated with other variables.

Scenario MC (also Missing a Correlated variable): In addition to the uncorrelated extrinsic

variable x4, we might also omit one more variable (i.e. x2 in this study) that is cross-correlated with

variables included in the model. Due to the correlation of x2 with x1 and x3, an extra source of endogeneity

(i.e. omitted variable bias) emerges and none of the models is capable of eliminating it unless relevant

information is incorporated. Since the relationships among x1, x2 and x3 are known, we corrected the

omitted variable bias in the way that the expected true coefficients of x1 and x3 in this scenario are the sum

of the true pre-set value and the coefficient of the regression of x2 on x1 and x3, respectively (Hayashi

2000), with the purpose of isolating the bias caused by ISA. Results without such correction

are also shown.
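In generic form, this correction corresponds to the textbook omitted variable bias formula. The sketch below ignores the spatial lag term for clarity; δ_1, δ_3 and v are symbols introduced here for the auxiliary regression of x2 on x1 and x3, and the scaling by β_2 is the standard form, which is assumed here to be what the correction described above amounts to.

```latex
\begin{aligned}
x_2 &= \delta_1 x_1 + \delta_3 x_3 + v, \\
\mathrm{E}(\hat{\beta}_1 \mid x_2 \text{ omitted}) &= \beta_1 + \beta_2\,\delta_1, \\
\mathrm{E}(\hat{\beta}_3 \mid x_2 \text{ omitted}) &= \beta_3 + \beta_2\,\delta_3.
\end{aligned}
```

Here v is the part of x2 that is uncorrelated with x1 and x3. Comparing estimates against these shifted targets, rather than against β_1 and β_3, removes the known omitted variable bias from the comparison.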

Scenario MB (also Missing a Biotic variable): In a more complicated and realistic situation, it is

not uncommon to omit a correlated variable involving ISA per se (referred to as the biotic variable here),

i.e. x3 in this study, the population density for species B. Again, omitted variable bias occurs due to the

correlation of x3 with x1 and x2. The omitted variable bias was corrected in the way that the expected true

coefficients of x1 and x2 are the sum of the preset true value and the coefficient of the regression of x3 on

x1 and x2, respectively. Results without such correction are also shown.

In all the four scenarios, we assessed the type I error rate by adding a spatially-structured random

variable x5 that is independent of y, x1, x2, x3 and x4 to the model formula and checking the corresponding

p-value, with a comparison using its non-spatial counterpart as a control. In Scenario MN, an inflation of

type I error rate would be exclusively caused by intrinsic autocorrelation, while in the other three

scenarios inflations of type I error rate may be caused by intrinsic, extrinsic, or both forms of

autocorrelation.

Comparisons

To compare model performance in coefficient estimation, we calculated “relative imprecision” as

|(estimated value – true value)/true value| and “relative bias” as (mean of estimated values – true value)/true value for each

covariate taken into account by models in the different scenarios. It is noteworthy that imprecision is not

equivalent to bias. Imprecision is the spread between every estimated value and the true value for a

parameter (i.e. a visual reflection of the variance of estimates), while bias is the difference between the

mean of estimated values and the true value for a parameter. In every scenario, the type I error rate at

α=0.05 level of each model was displayed.
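The three summary measures can be written compactly as follows. This is a minimal Python sketch of the definitions given above, with hypothetical function names and made-up input values; it is not the code used in this study, which was written in R.

```python
from statistics import mean


def relative_imprecision(estimates, true_value):
    """|(estimate - true) / true| for every replicate estimate."""
    return [abs((est - true_value) / true_value) for est in estimates]


def relative_bias(estimates, true_value):
    """(mean of estimates - true) / true across replicates."""
    return (mean(estimates) - true_value) / true_value


def type_i_error_rate(p_values, alpha=0.05):
    """Share of replicates in which an irrelevant variable is declared significant."""
    return sum(p < alpha for p in p_values) / len(p_values)


# Made-up example: four replicate estimates of a coefficient whose true value is 2
estimates = [1.8, 2.2, 2.4, 1.6]
imprecision = relative_imprecision(estimates, 2.0)
bias = relative_bias(estimates, 2.0)

# Made-up p-values for the irrelevant variable x5 in five replicates
rate = type_i_error_rate([0.01, 0.20, 0.04, 0.60, 0.90], alpha=0.05)
```

In this toy example every estimate misses the true value by 10 to 20 percent, yet the estimates average to the true value, so relative imprecision is clearly non-zero while relative bias is numerically zero. Two of the five p-values fall below 0.05, giving a rate of 0.4.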

All the simulations and assessments were implemented in R 3.1.2 (R Core Team 2014) using

the spdep package (Bivand 2014).

Results

Pure intrinsic spillover autocorrelation (ISA) (Fig. 1a) inflated the type I error rate of Group E models

except for the ERR model. In more realistic scenarios (Fig. 1b-d), Group E and Group I fairly consistently

exhibited an inflated type I error rate, with the ERR, MIX and ESDM models performing relatively well.

CAR and ERR even showed a deflated type I error rate for irrelevant non-spatial variables in the presence

of ISA (Fig. 1, S1).

In relation to coefficient estimation, pure ISA caused noticeable imprecision and bias of estimates

from Group E (Fig. 2a, 3a), which across four scenarios generally suffered more in comparison to Group

I. Moreover, imprecision and bias increased with scenario realism (Fig. 2, 3). In Scenarios MC and MB, all

models suffered extra omitted variable bias (Fig. 3, S3). In Scenarios MN, MU and MC, SAC performed

almost as well as ESDM. In Scenario MB with correction, however, ESDM, unlike the other models,

yielded correct estimates, and more so when the biotic variable had a larger effect (Fig. 3, S3).

For Group E, exogenous variables that appropriately accounted for the information of ISA

yielded unbiased coefficient estimates (Fig. 4). Any ISA left in the error term always led to bias in

coefficient estimates, even when the explanatory variables were non-spatial, with bias magnitude

associated with the structure of the explanatory variables (Fig. 4, S4).

Discussion

Our results clearly show that omitting information on intrinsic spillover autocorrelation (ISA) generally

inflates type I error rates of the explicitly extrinsic models (Fig. 1). However, ERR, along with MIX and

ESDM, produced nearly correct error rates, depending on the realism in the simulation. In contrast to

some researchers who favored not reporting p-values at all from regression models applied to empirical

data (Jetz & Rahbek 2001; Rahbek & Graves 2001; Hawkins 2012), we show that some spatial models

(e.g. ERR, MIX and ESDM) produce generally reliable significance test results when the null hypothesis

β_k = 0 is true, even when they suffer from the endogeneity problem. CAR’s and ERR’s deflation of type I error

rates in some cases for irrelevant non-spatial variables is due to their algorithms, which overestimate the

standard errors of the coefficient estimates for such variables (Table S1, S2). Caution must be taken when

reporting significance tests from CLM, GAMM, CAR and LAG, which are prone to falsely report

significant relations for spatially-structured variables in the presence of autocorrelated residuals of either

source.
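The link between mis-estimated standard errors and the realised type I error rate can be sketched as follows. The sketch is illustrative Python, not the authors' code, and it rests on a simplifying assumption: under the null hypothesis the coefficient estimate is normally distributed around zero with the actual standard error (ASE), while the Wald z-test divides it by the computed standard error (CSE). The two CSE/ASE pairs are taken from Table S1 (Scenario MN); the function name is hypothetical.

```python
import math

Z_CRIT = 1.959964  # two-sided 5% critical value of the standard normal


def realised_type1_rate(cse, ase, z_crit=Z_CRIT):
    """Rejection probability of a Wald z-test when the null hypothesis is true.

    Assumes the estimate is N(0, ase^2) but is standardised by cse,
    so the test rejects when |Z| > z_crit * cse / ase for standard normal Z.
    """
    return math.erfc(z_crit * cse / ase / math.sqrt(2.0))


nominal = realised_type1_rate(1.0, 1.0)               # CSE equals ASE
car_nonspatial_mn = realised_type1_rate(0.097, 0.075)  # CSE > ASE (CAR)
clm_spatial_mn = realised_type1_rate(0.189, 0.577)     # CSE < ASE (CLM)
```

Under these assumptions a CSE larger than the ASE pushes the realised rate below the nominal 5% (deflation), and a CSE smaller than the ASE pushes it above (inflation), which is the reading of Tables S1 and S2 used in the text.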

With regard to unbiasedness of coefficient estimates, models from Group E are incapable of

dealing with ISA, unless their exogenous variables appropriately represent such autocorrelation. For

Group I, SAC appears to be a competitive alternative to ESDM except, importantly, when it

comes to Scenario MB where one omitted variable involves both self-correlation and cross-correlation.

Omitting such variables is one motivation for MIX (LeSage & Pace 2009). Theoretically, MIX has

previously been argued to be the optimal model balancing model complexity and precision when dealing

with autocorrelation of two origins (LeSage & Pace 2009). Given that ESDM is a general

extension of MIX, our support for ESDM’s overall good performance is largely consistent with this

theoretical argument. On the other hand, ESDM only partly addresses the bias of spillover effects present

in Scenarios MC and MB. It is important to bear in mind that none of the models tested here, including

ESDM, can rectify omitted variable bias to produce true values because the extra information needed is

carried by the omitted variables.

In addition, one potential drawback of ESDM should be noted. The need to specify two

weight matrices in ESDM means higher risk of inappropriate weight matrices, as an appropriate weight

matrix is usually assumed to correspond to the realistic network of interactions (Anselin 1988). While

Beale et al. (2010) noted that mis-specified spatial structures had a relatively trivial effect on precision,

compared to no spatial structure, it has been reported that invalid weight matrices (Bardos, Guillera-

Arroita & Wintle 2015) and different definitions of neighbors and weights (Anselin 1988) could lead to

contrasting results. Given the spillover effect of (I – ρW)⁻¹, where a change at one site will spill over across

all the sites that are connected by a neighbor network, a priori knowledge about the interaction network may

play an important role in specifying weight matrices and coefficient estimation (Fig. S5). The exact effect

of weight matrix specification on regression modelling remains to be further explored.
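The spillover implied by (I – ρW)⁻¹ can be made concrete with a small Python sketch. It is an illustration under stated assumptions rather than the simulation design of this study: five sites on a line, a row-standardised weight matrix in which each site's neighbours are its adjacent sites, and ρ = 0.8. The function names are hypothetical. Instead of inverting the matrix, the sketch solves (I – ρW)y = shock by fixed-point iteration, which converges when |ρ| < 1 and W is row-standardised.

```python
def row_standardised_chain(n):
    """Weight matrix for n sites on a line; neighbours are the adjacent sites."""
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < n]
        for j in neighbours:
            W[i][j] = 1.0 / len(neighbours)  # each row sums to 1
    return W


def spillover_response(W, rho, shock, tol=1e-12, max_iter=10_000):
    """Solve (I - rho*W) y = shock by iterating y <- rho*W*y + shock."""
    y = list(shock)
    for _ in range(max_iter):
        Wy = [sum(w * v for w, v in zip(row, y)) for row in W]
        y_new = [rho * a + s for a, s in zip(Wy, shock)]
        if max(abs(a - b) for a, b in zip(y_new, y)) < tol:
            return y_new
        y = y_new
    raise RuntimeError("iteration did not converge; check that |rho| < 1")


W = row_standardised_chain(5)
shock = [1.0, 0.0, 0.0, 0.0, 0.0]  # unit change at the first site only
response = spillover_response(W, 0.8, shock)
no_spillover = spillover_response(W, 0.0, shock)
```

With ρ = 0.8 the unit change at the first site produces a non-zero response at every connected site, decaying with distance along the chain; with ρ = 0 the response stays confined to the shocked site. A different neighbour network changes which sites receive the spillover, which is why the choice of W matters.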

It is important to note that CLM’s weak performance in the present study in terms of

unbiasedness only indicates its limited ability to deal with endogenous variables, but not inherent

inferiority to spatially explicit models. When information responsible for endogeneity bias is added into

the model (e.g. adding X* in Fig. 4), the coefficient estimates from CLM are unbiased, regardless of the

spherical error assumption. Therefore, we stress the necessity of exploring and accounting for important

processes underlying observed patterns, rather than assuming absolute superiority of a certain model.
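Why adding X* can remove the endogeneity bias is easiest to see in a worked decomposition. The derivation below is a sketch that assumes, for illustration only, a pure spatial lag data-generating process with exogenous X and an error ε independent of X; the data-generating process used in the simulations may include further terms.

```latex
\begin{aligned}
y &= \rho W y + X\beta + \varepsilon \\
  &= (I - \rho W)^{-1} X\beta + (I - \rho W)^{-1}\varepsilon \\
  &= X\beta + \left[(I - \rho W)^{-1} - I\right] X\beta + (I - \rho W)^{-1}\varepsilon \\
  &= X\beta + X^{*}\beta + u,
  \qquad X^{*} = \left[(I - \rho W)^{-1} - I\right] X,
  \quad u = (I - \rho W)^{-1}\varepsilon .
\end{aligned}
```

Under this assumption, a model that regresses y on X alone leaves X*β in its error term, and because X* is built from X the error is correlated with the explanatory variables. Including X* moves that component out of the error; the remaining error u is autocorrelated (non-spherical) but has zero mean given X and X*, which is consistent with unbiased coefficient estimates from CLM.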

The strict exogeneity assumption matters only when the ceteris paribus effects of explanatory

variables are the focus of a study (Hayashi 2000; Wooldridge 2012). Beyond causal explanation,

prediction is another important application field for regression modeling (James et al. 2013). Thorough

elaboration on mechanisms of how linear regression models can be used to assess causality and how

modeling strategies differ between the two distinct purposes is far beyond the scope of this paper. Simply

put, the basic idea of estimating causal effects in a linear model is to control for theoretically relevant

variables as one would have done in an experiment. Hence, including the term ρWy in a model allows for

the possibility of quantifying the effect of intrinsic processes conditional on other variables. In contrast,

incorporating intrinsic processes into the error term does not allow their ceteris paribus effects to be

assessed, although methods have been developed to separate distinct sources of variation (Freckleton & Jetz 2009;

Diniz-Filho et al. 2012). A further distinction between the two approaches to modeling intrinsic processes

is whether the intrinsic process of interest is considered to be independent of explanatory variables in the

model. For example, the phylogenetic generalized least squares (PGLS) model assumes that self-correlation

in trait evolution captured in the error term is generated by an independent Brownian motion (Martins &

Hansen 1997) or an Ornstein-Uhlenbeck process with a constant parameter of adaptation rate (Hansen

1997). However, some intrinsic processes (e.g. dispersal, population dynamics) are dependent on other

variables: environmental change or landscape alteration at one location can influence species responses at

other locations (Holt 2008; Gorzelak et al. 2015), which is the important spillover feature of ISA. Hence,

the importance of strict exogeneity and choice of processes to be controlled for depend on both the

purpose of regression modeling and the question being asked in a specific study as well as the

understanding of a specific intrinsic process.

Scale also plays a role in dealing with ISA. When a study is performed at large scales (e.g.

continental scale), ISA may be confined within so few local neighbors that it is reasonable to believe that

the spillover effects and the resulting issues of endogeneity are negligible and that other factors play a

more important role (Diniz-Filho et al. 2003). Alternatively, ISA that occurs across a large extent of the

study space (e.g. at landscape or regional scales) necessitates applying models in Group I to eliminate the

effect of endogeneity on other explanatory variables.

Although ESDM is the most general model in the simultaneous autoregressive family and

performs generally well, we note that in some cases ESDM may yield marginally less reliable

coefficients than other models when both extrinsic and intrinsic processes have been appropriately accounted

for. We therefore do not propose ESDM as the single model of choice for addressing ISA, but as a good

starting point to explore the final model specification. In practice, we advocate simplifying the general

model ESDM, if possible, to a more specific, but less complex model, retaining the ability to account for

all autocorrelation-generating processes. Likelihood ratio tests, Wald tests and Hausman tests are useful

tools for examining model specification (LeSage & Pace 2009; Bivand et al. 2014). Moreover, because the

simulation framework in this study assumes perfect knowledge, linear and stationary relationships, normally

distributed error terms and simple weight matrix structures, it under-represents the complexities of real

ecological data. We therefore suggest caution in extrapolating the findings here, as well as consideration of

other useful modelling techniques for empirical studies.
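A likelihood ratio comparison of a general model against a nested simplification can be sketched in Python as follows. The log-likelihood values and the number of restrictions are hypothetical placeholders, not results from this study; in practice they would come from the fitted general and nested models. The chi-square tail probability is computed from the standard series for the regularised lower incomplete gamma function so that the sketch needs only the standard library.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with df degrees of freedom."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    # Series: P(a, z) = z^a e^{-z} / Gamma(a + 1) * sum_n z^n / ((a+1)...(a+n))
    term, total, n = 1.0, 1.0, 0
    while term > 1e-16 * total:
        n += 1
        term *= z / (a + n)
        total += term
    lower = math.exp(a * math.log(z) - z - math.lgamma(a + 1.0)) * total
    return max(0.0, 1.0 - lower)


def likelihood_ratio_test(loglik_general, loglik_nested, n_restrictions):
    """Return the LR statistic 2 * (ll_general - ll_nested) and its p-value."""
    stat = 2.0 * (loglik_general - loglik_nested)
    return stat, chi2_sf(stat, n_restrictions)


# Hypothetical log-likelihoods: a general model and a nested model
# obtained by setting two parameters of the general model to zero.
stat, p_value = likelihood_ratio_test(-512.3, -514.9, 2)
```

A large p-value would indicate that the nested model fits about as well as the general one, supporting the simplification; a small p-value would argue for retaining the more general specification.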

Conclusion

Our simulation-based study has shown that omitting intrinsic spillover autocorrelation leads to incorrect

type I error rates and biased coefficient estimates in regression modelling. Our results also indicate that

significance tests from some spatial regression models produce generally reliable results even when their

coefficient estimates are biased. Importantly, for producing efficient and unbiased estimates of ceteris

paribus effects, specification of regression models should account for both extrinsic and intrinsic

processes that generate spatial autocorrelation in spatial data. The extended spatial Durbin model (ESDM)

emerges as the most promising technique for addressing intrinsic spillover autocorrelation across four

scenarios, but still suffers from bias arising from other sources of endogeneity.

Authors’ contributions

SNT conceived the ideas and designed methodology; SNT collected the data; SNT, CX, BS and JCS

analyzed the data; SNT and JCS led the writing of the manuscript. All authors contributed critically to the

drafts and gave final approval for publication.

Acknowledgements

We thank Robert B. O’Hara, J. Alexandre F. Diniz-Filho and an anonymous reviewer for their valuable

comments and constructive criticisms on the earlier versions of the manuscript. We thank W. Daniel

Kissling and Gudrun Carl for kindly providing the R code to simulate data in their study. SNT was

supported by China Scholarship Council (201406190179). JCS was supported by the European Research

Council (ERC-2012-StG-310886-HISTFUNC), and also considers this work a contribution to his

VILLUM Investigator project “Biodiversity Dynamics in a Changing World” (BIOCHANGE) funded by

VILLUM FONDEN. CX was supported by National Natural Science Foundation of China (41271197).

Data accessibility

The R scripts needed to perform simulations and comparisons are available in Appendix S1.

References

Anselin, L. (1988) Spatial Econometrics: methods and models. Kluwer, Dordrecht.

Anselin, L. (2003) Spatial externalities, spatial multipliers, and spatial econometrics. International

Regional Review, 26, 153-166.

Banerjee, S., Carlin, B.P. & Gelfand, A.E. (2004) Hierarchical Modeling and Analysis for Spatial Data.

CRC, Boca Raton.

Bardos, D.C., Guillera-Arroita, G. & Wintle, B.A. (2015) Valid auto-models for spatially autocorrelated

occupancy and abundance data. Methods in Ecology and Evolution, 6, 1137-1149.

Beale, C.M., Lennon, J.J., Yearsley, J.M., Brewer, M.J. & Elston, D.A. (2010) Regression analysis of

spatial data. Ecology Letters, 13, 246-264.

Bivand, R. (2014) spdep: Spatial dependence: weighting schemes, statistics and models. R package

version 0.5-77. http://CRAN.R-project.org/package=spdep

Cliff, A.D. & Ord, J.K. (1981) Spatial Processes: Models and Applications. Pion, London.

Cressie, N. (1993) Statistics for Spatial Data. John Wiley & Sons, Chichester, UK.

Diniz-Filho, J.A.F., Bini, L.M. & Hawkins, B.A. (2003) Spatial autocorrelation and red herrings in

geographical ecology. Global Ecology and Biogeography, 12, 53-64.

Diniz-Filho, J.A.F., Siqueira, T., Padial, A.A., Rangel, T.F., Landeiro, V.L. & Bini, L.M. (2012) Spatial

autocorrelation analysis allows disentangling the balance between neutral and niche processes in

metacommunities. Oikos, 121, 201-210.

Dormann, C.F., McPherson, J.M., Araújo, M.B., Bivand, R., Bolliger, J., Carl, G., Davies, R.G., Hirzel, A.,

Jetz, W., Kissling, W.D., Kühn, I., Ohlemüller, R., Peres-Neto, P.R., Reineking, B., Schröder, B.,

Schurr, F.M. & Wilson, R. (2007) Methods to account for spatial autocorrelation in the analysis

of species distributional data: a review. Ecography, 30, 609-628.

Dray, S., Legendre, P. & Peres-Neto, P.R. (2006) Spatial modelling: a comprehensive framework for

principal coordinate analysis of neighbor matrices (PCNM). Ecological Modelling, 196, 483-493.

Freckleton, R.P. & Jetz, W. (2009) Space versus phylogeny: disentangling phylogenetic and spatial

signals in comparative data. Proceedings of the Royal Society Series B, 276, 21-30.

Fortin, M.-J. & Dale, M. (2005) Spatial Analysis: guide for ecologists. Cambridge University Press,

Cambridge, UK.

Gelfand, A.E., Diggle, P.J., Fuentes, M. & Guttorp, P. eds. (2010) Handbook of Spatial Statistics. CRC,

Boca Raton.

Gorzelak, M.A., Asay, A.K., Pickles, B.J. & Simard, S.W. (2015) Inter-plant communication through

mycorrhizal networks mediates complex adaptive behavior in plant communities. Annals of

Botany Plants, 7, plv050.

Hansen, T.F. (1997) Stabilizing selection and the comparative analysis of adaptation. Evolution, 51, 1341-

1351.

Hawkins, B.A. (2012) Eight (and a half) deadly sins of spatial analysis. Journal of Biogeography, 39, 1-9.

Hawkins, B.A., Diniz-Filho, J.A.F., Bini, L.M., De Marco, P. & Blackburn, T.M. (2007) Red herrings

revisited: spatial autocorrelation and parameter estimation in geographical ecology. Ecography,

30, 375-384.

Hayashi, F. (2000) Econometrics. Princeton University Press.

Hodges, J.S. & Reich, B.J. (2010) Adding spatially-correlated errors can mess up the fixed effect you

love. The American Statistician, 64, 325-334.

Horn, R.A. & Johnson, C.R. (2013) Matrix Analysis, 2nd Edn. Cambridge University Press, New York,

USA.

Holt, R.D. (2008) Theoretical perspectives on resource pulses. Ecology, 89, 671-681.

James, G., Witten, D., Hastie, T. & Tibshirani, R. (2013) An Introduction to Statistical Learning: with

Applications in R. Springer, New York.

Jetz, W. & Rahbek, C. (2001) Geometric constraints explain much of the species richness pattern in

African birds. Proceedings of the National Academy of Sciences, 98, 5661-5666.

Keitt, T.H., Bjørnstad, O.N., Dixon, P.M. & Citron-Pousty, S. (2002) Accounting for spatial pattern when

modeling organism-environment interactions. Ecography, 25, 616-625.

Legendre, P. (1993) Spatial autocorrelation – trouble or new paradigm. Ecology, 74, 1659-1673.

Lennon, J.J. (2000) Red-shifts and red herrings in geographical ecology. Ecography, 23, 101-113.

LeSage, J. & Pace, R.K. (2009) Introduction to Spatial Econometrics. CRC Press, Boca Raton.

Linder, H.P., Bykova, O., Dyke, J., Etienne, R.S., Hickler, T., Kühn, I., Marion, G., Ohlemüller, R.,

Schymanski, S.J. & Singer, A. (2012) Biotic modifiers, environmental modulation and species

distribution models. Journal of Biogeography, 39, 2179-2190.

Martins, E.P. & Hansen, T.F. (1997) Phylogenies and the comparative method: a general approach to

incorporating phylogenetic information into the analysis of interspecific data. The American

Naturalist, 149, 646-667.

Monahan, J.F. (2008) A Primer on Linear Models. CRC, Boca Raton.

Paciorek, C.J. (2010) The importance of scale for spatial confounding bias and precision of spatial

regression estimators. Statistical Science, 25, 107-125.

R Core Team (2014) R: A Language and Environment for Statistical Computing. R Foundation for

Statistical Computing, Vienna. URL http://www.R-project.org [accessed 12 December 2014]

Rahbek, C. & Graves, G.R. (2001) Multiscale assessment of patterns of avian species richness. Proceedings

of the National Academy of Sciences, 98, 4534-4539.

Rue, H. & Held, L. (2005) Gaussian Markov Random Fields: Theory and Applications. CRC, Boca

Raton.

West, B.T., Welch, K.B. & Gałecki, A.T. (2015) Linear Mixed Models: A Practical Guide Using

Statistical Software, 2nd Edn. CRC Press, Boca Raton.

Wood, S.N. (2006) Generalized Additive Models: an introduction with R. CRC, Boca Raton.

Wood, S. (2014) mgcv: Mixed GAM Computation Vehicle with GCV/AIC/REML Smoothness Estimation.

R package version 1.8-3. http://CRAN.R-project.org/package=mgcv

Wooldridge, J.M. (2012) Introductory Econometrics: A Modern Approach, 5th Edn. South-Western

College Publishing, Mason, USA.

Supporting Information

Additional Supporting Information may be found in the online version of this article.

Appendix S1. R scripts for artificial data generation and model assessments.

Table S1. Comparison between the computed standard error by a model’s algorithm and the actual

standard error of the model’s coefficient estimate for the irrelevant variable x5 as in Fig. 1.

Table S2. Comparison between the computed standard error by a model’s algorithm and the actual

standard error of the model’s coefficient estimate for the irrelevant variable x5 as in Fig. S1.

Figure S1. Type I error rate from the eight regression models applied to data generated without intrinsic

spillover autocorrelation in y.

Figure S2. Relative imprecision of coefficient estimates from the eight regression models with

coefficients that are on average identical.

Figure S3. Relative bias of coefficient estimates from the eight regression models with coefficients that

are on average identical.

Figure S4. Relative bias of coefficient estimates from the eight regression models applied to data

generated with four explanatory variables free of spatial structure.

Figure S5. Relative bias of coefficient estimates from the eight regression models with a mis-specified

weight matrix for SAC and ESDM.

Figure S6. Relative bias of coefficient estimates from the eight regression models with negative intrinsic

autocorrelation.

Figure S7. Relative bias of coefficient estimates from the eight regression models including the method

of Moran’s eigenvector maps.

Table 1. Overview of eight regression models’ abilities to account for two sources of spatially

autocorrelated residuals.

Model   Extrinsic   Intrinsic   Group
CLM     NO          NO          E
GAMM    YES         NO          E
CAR     YES         NO          E
ERR     YES         NO          E
LAG     NO          YES         I
SAC     YES         YES         I
MIX*    YES/NO      NO/YES      I
ESDM    YES         YES         I

*The dual ability of MIX model to alternatively account for extrinsic or intrinsic autocorrelation derives from its formula

(Anselin, 1988; LeSage & Pace, 2009)

Table 2. Description of four simulated scenarios

Scenario   Description
MN         Capturing all extrinsic variables
MU         Omitting an uncorrelated variable (x4)
MC         Omitting also a cross-correlated variable not involving intrinsic processes (x2 & x4)
MB         Omitting also a cross-correlated variable involving intrinsic processes (x3 & x4)

Fig. 1. Type I error rate at 5% significance level of the eight regression models in a) Scenario MN

(Missing None of the extrinsic variables), b) Scenario MU (Missing an Uncorrelated variable), c)

Scenario MC (also Missing a Correlated variable) and d) Scenario MB (also Missing the Biotic variable).

An error rate above 0.05 (dashed line) represents inflation. Error bars show the 95% confidence interval

for the mean type I error rate from 50 rounds of 100 replicates of significance testing, with one type I

error rate for one round. Dark grey (or light grey) color indicates an irrelevant variable with (or without)

spatial structure.

Fig. 2. Relative imprecision in percentage of coefficient estimates from the eight regression models

under strong intrinsic autocorrelation (ρ=0.8) in a) Scenario MN (Missing None of the extrinsic

variables), b) Scenario MU (Missing an Uncorrelated variable), c) Scenario MC (also Missing a

Correlated variable) and d) Scenario MB (also Missing the Biotic variable) across 5000 replicates of

assessment per model. Explanatory variables are indicated by color (same in all panels). The coefficient

for the biotic variable x3 is on average larger than those for other covariates. Middle line in a box

represents the median; lower and upper box edges correspond to the first and third quartiles; notches in

box give a roughly 95% confidence interval for the median; whiskers extend bidirectionally to the farthest

value that is within 1.5 times the inter-quartile range; outliers are excluded.

Fig. 3. Relative bias in percentage of coefficient estimates from the eight regression models under strong

intrinsic autocorrelation (ρ=0.8) in a) Scenario MN (Missing None of the extrinsic variables), b) Scenario

MU (Missing an Uncorrelated variable), c) Scenario MC (also Missing a Correlated variable) and d)

Scenario MB (also Missing the Biotic variable) across 5000 replicates of assessment per model. The

coefficient for the biotic variable x3 is on average larger than those for other covariates. Orange line for a

box represents the bias after correction for omitted variable bias in Scenario MC and MB, while blue line

for a box represents the bias without such correction, with all other elements as in Fig. 2.

Fig. 4. Effect on coefficient estimates of adding an extra explanatory variable X* = [(I – ρW)⁻¹ – I]X into

models from the extrinsic group in Scenario MN (Missing None of the extrinsic variables). Numerical

values from 0.02 to 0.98 by a step of 0.03 were assigned to ρ. For each ρ, 5000 replicates of assessment

per model were implemented. The vertical dashed line indicates the situation where X* accounts for

exactly all X-related intrinsic autocorrelation with strength equal to 0.8. Please note how the bias for the

four explanatory variables (in color) reduces to zero (the horizontal dashed line) at the point ρ = 0.8.

Table S1. Comparison between the computed standard error (CSE) by a model’s algorithm and the actual standard error (ASE) of the model’s

coefficient estimate for the irrelevant variable x5 as in Fig. 1. A CSE value larger than its ASE value signals deflation of type I error rate; a CSE

value smaller than its ASE value signals inflation of type I error rate, while no difference between CSE and ASE signals correct type I error rate.

      Scenario MN                 Scenario MU                 Scenario MC                 Scenario MB
      Spatial       Non-spatial   Spatial       Non-spatial   Spatial       Non-spatial   Spatial       Non-spatial
Model CSE    ASE    CSE    ASE    CSE    ASE    CSE    ASE    CSE    ASE    CSE    ASE    CSE    ASE    CSE    ASE
CLM   0.189  0.577  0.180  0.184  0.253  0.800  0.245  0.248  0.255  0.821  0.250  0.254  0.373  1.167  0.369  0.369
GAMM  0.043  0.064  0.013  0.014  0.103  0.134  0.037  0.037  0.109  0.142  0.039  0.040  0.108  0.139  0.037  0.038
CAR   0.165  0.348  0.097  0.075  0.204  0.425  0.121  0.088  0.205  0.426  0.122  0.089  0.318  0.687  0.190  0.144
ERR   0.080  0.086  0.034  0.024  0.119  0.140  0.051  0.040  0.121  0.144  0.052  0.041  0.139  0.159  0.060  0.041
LAG   0.007  0.007  0.007  0.007  0.060  0.136  0.056  0.057  0.073  0.167  0.070  0.071  0.059  0.139  0.057  0.059
SAC   0.007  0.007  0.007  0.007  0.080  0.092  0.037  0.038  0.090  0.108  0.040  0.040  0.079  0.091  0.038  0.038
MIX   0.015  0.016  0.007  0.007  0.109  0.119  0.048  0.053  0.119  0.130  0.053  0.057  0.116  0.127  0.052  0.056
ESDM  0.015  0.016  0.007  0.007  0.089  0.098  0.048  0.050  0.096  0.108  0.053  0.056  0.090  0.099  0.048  0.049

Table S2. Comparison between the computed standard error (CSE) by a model’s algorithm and the actual standard error (ASE) of the model’s

coefficient estimate for the irrelevant variable x5 as in Fig. S1. A CSE value larger than its ASE value signals deflation of type I error rate; a CSE

value smaller than its ASE value signals inflation of type I error rate, while no difference between CSE and ASE signals correct type I error rate.

      Scenario MN                 Scenario MU                 Scenario MC                 Scenario MB
      Spatial       Non-spatial   Spatial       Non-spatial   Spatial       Non-spatial   Spatial       Non-spatial
Model CSE    ASE    CSE    ASE    CSE    ASE    CSE    ASE    CSE    ASE    CSE    ASE    CSE    ASE    CSE    ASE
CLM   0.007  0.007  0.007  0.007  0.067  0.136  0.065  0.065  0.075  0.158  0.074  0.075  0.075  0.162  0.074  0.074
GAMM  0.007  0.007  0.007  0.007  0.054  0.061  0.037  0.038  0.058  0.066  0.040  0.042  0.059  0.067  0.041  0.042
CAR   0.007  0.007  0.007  0.007  0.058  0.070  0.044  0.040  0.065  0.079  0.049  0.044  0.064  0.079  0.048  0.044
ERR   0.007  0.007  0.007  0.007  0.055  0.057  0.037  0.038  0.061  0.062  0.041  0.041  0.061  0.063  0.041  0.041
LAG   0.007  0.007  0.007  0.007  0.059  0.110  0.058  0.058  0.065  0.120  0.064  0.065  0.064  0.121  0.063  0.064
SAC   0.007  0.007  0.007  0.007  0.055  0.057  0.037  0.038  0.060  0.062  0.041  0.041  0.061  0.063  0.041  0.041
MIX   0.009  0.010  0.007  0.007  0.056  0.057  0.040  0.041  0.061  0.061  0.043  0.045  0.062  0.063  0.044  0.045
ESDM  0.010  0.010  0.007  0.007  0.056  0.058  0.045  0.046  0.061  0.062  0.051  0.053  0.062  0.064  0.049  0.051

Fig. S1. Type I error rate at 5% significance level of the eight regression models in a) Scenario MN

(Missing None of the extrinsic variables), b) Scenario MU (Missing an Uncorrelated variable), c)

Scenario MC (also Missing a Correlated variable) and d) Scenario MB (also Missing the Biotic variable),

based on data generated without intrinsic spillover autocorrelation in y. An error rate above 0.05 (dashed

line) represents inflation. Error bars show the 95% confidence interval for the mean type I error rate from

50 rounds of 100 replicates of significance testing, with one type I error rate for one round. Dark grey (or

light grey) color indicates an irrelevant variable with (or without) spatial structure.

Fig. S2. Relative imprecision in percentage of coefficient estimates from the eight regression models

under strong intrinsic autocorrelation (ρ=0.8) in a) Scenario MN (Missing None of the extrinsic

variables), b) Scenario MU (Missing an Uncorrelated variable), c) Scenario MC (also Missing a

Correlated variable) and d) Scenario MB (also Missing the Biotic variable) across 5000 replicates of

assessment per model. Explanatory variables are indicated by color (same in all panels). The coefficients

for four covariates are on average identical. Middle line in a box represents the median; lower and upper

box edges correspond to the first and third quartiles; notches in box give a roughly 95% confidence

interval for the median; whiskers extend bidirectionally to the farthest value that is within 1.5 times the

inter-quartile range; outliers are excluded.

Fig. S3. Relative bias in percentage of coefficient estimates from the eight regression models under

strong intrinsic autocorrelation (ρ=0.8) in a) Scenario MN (Missing None of the extrinsic variables), b)

Scenario MU (Missing an Uncorrelated variable), c) Scenario MC (also Missing a Correlated variable)

and d) Scenario MB (also Missing the Biotic variable) across 5000 replicates of assessment per model.

The coefficients for four covariates are on average identical. Orange line for a box represents the bias

after correction for omitted variable bias in Scenario MC and MB, while blue line for a box represents the

bias without such correction, with all other elements as in Fig. S2.

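Fig. S4. Relative bias of coefficient estimates from the eight regression models in a) Scenario MN (Missing

None of the extrinsic variables), b) Scenario MU (Missing an Uncorrelated variable), c) Scenario MC (also

Missing a Correlated variable) and d) Scenario MB (also Missing the Biotic variable) across 5000 replicates

of assessment per model, with four covariates free of spatial structure. All other elements are as in Fig. S3.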

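Fig. S5. Relative bias of coefficient estimates from the eight regression models with a mis-specified weight

matrix for SAC and ESDM. The true weight matrix for y uses a King’s move structure, while that for x3 uses a

first-order Bishop’s move structure. The two weight matrices required by SAC and ESDM both use the King’s

move structure, which is incorrect for x3. All other elements are as in Fig. S3.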
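The two neighbour definitions can be sketched in Python as follows. This is an illustration under assumptions, not the authors' R code: sites are taken to lie on a regular grid, a King’s move is taken to link a cell to its eight surrounding cells, a first-order Bishop’s move to its four diagonal cells, and the weights are row-standardised. The names `KING`, `BISHOP` and `grid_weights` are hypothetical.

```python
# Row and column offsets that define each neighbourhood on a regular grid.
KING = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
BISHOP = [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def grid_weights(n_rows, n_cols, moves):
    """Row-standardised weight matrix for an n_rows x n_cols grid of sites."""
    n = n_rows * n_cols
    W = [[0.0] * n for _ in range(n)]
    for r in range(n_rows):
        for c in range(n_cols):
            neighbours = [
                (r + dr, c + dc)
                for dr, dc in moves
                if 0 <= r + dr < n_rows and 0 <= c + dc < n_cols
            ]
            for rr, cc in neighbours:
                # Equal weight for every neighbour, so each row sums to 1.
                W[r * n_cols + c][rr * n_cols + cc] = 1.0 / len(neighbours)
    return W


W_king = grid_weights(3, 3, KING)
W_bishop = grid_weights(3, 3, BISHOP)
centre = 4  # index of the middle cell of the 3 x 3 grid
king_neighbours = sum(1 for w in W_king[centre] if w > 0)
bishop_neighbours = sum(1 for w in W_bishop[centre] if w > 0)
```

On this 3 x 3 grid the centre cell has eight neighbours under the King’s move and four under the Bishop’s move, so using the King’s move structure for a process that actually follows the Bishop’s move assigns weight to sites that do not interact.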

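Fig. S6. Relative bias of coefficient estimates from the eight regression models under

negative intrinsic autocorrelation (ρ = -0.8) in a) Scenario MN (Missing None of the extrinsic variables),

b) Scenario MU (Missing an Uncorrelated variable), c) Scenario MC (also Missing a Correlated variable)

and d) Scenario MB (also Missing the Biotic variable). All other elements are as in Fig. S3.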

Fig. S7. Relative bias in percentage of coefficient estimates from the eight regression models including

the method of Moran’s eigenvector maps (MEM) under strong intrinsic autocorrelation (ρ=0.8) in a)

Scenario MN (Missing None of the extrinsic variables), b) Scenario MU (Missing an Uncorrelated

variable), c) Scenario MC (also Missing a Correlated variable) and d) Scenario MB (also Missing the

Biotic variable) across 5000 replicates of assessment per model. All other elements are as in Fig. S3.
