Geostatistical and hierarchical modeling of spatial data
Veronica Berrocal
Department of Biostatistics
University of Michigan
1 / 94
Topics
• Spatial processes
• Stationarity
• Kriging
• Hierarchical modeling of spatial data
• Gaussian spatial data
• Non-Gaussian spatial data
2 / 94
Spatial statistics
• Spatial data, that is, geographically referenced data, are encountered in several disciplines: atmospheric sciences, ecology, geology, epidemiology, forestry, economics, etc.
• Following Gelfand et al. (2010), we identify three different basic types of spatial data:
• point-referenced data: observations of random variables Y(s), where s varies continuously over a fixed spatial domain D of R^d;
• areal data: observations of random variables associated with a finite partition of the fixed spatial domain D into areas with well-defined boundaries;
• point pattern data: the observations are the locations at which random events occur. The set D is now itself random and Y(s) = 1 for all s ∈ D. Possibly, Y(s) might carry additional information on the event (marked point process).
3 / 94
Point-referenced data
Example: ozone concentration over the Eastern United States on August 31, 2001, measured in parts per billion (ppb).
[Figure: map of ozone concentration on 08/31/2001 over the Eastern United States; axes longitude and latitude, color scale 20–100 ppb]
4 / 94
Spatial process
• A spatial stochastic process {Y(s) : s ∈ D ⊂ R^d} is a collection of random variables indexed by s ∈ D.
• If we consider a finite set of locations s1, s2, ..., sn ∈ D, then (Y(s1), Y(s2), ..., Y(sn))′ is an n-dimensional random vector whose distribution should reflect the spatial dependence among the variables.
• If at sites s1, ..., sn we observe data y = (y1, ..., yn)′, then y = (y1, ..., yn)′ is a realization of the random vector (Y(s1), ..., Y(sn))′.
5 / 94
Spatial process
• We specify the distribution of the spatial process through the finite-dimensional distributions:
F_{s1,...,sn}(y1, ..., yn) = P(Y(s1) ≤ y1, ..., Y(sn) ≤ yn)   (1)
for each n ≥ 1 and for each s1, ..., sn ∈ D ⊂ R^d.
• The finite-dimensional distributions in (1) define a valid distribution for the spatial stochastic process {Y(s) : s ∈ D} if they satisfy Kolmogorov's compatibility conditions:
1. They are invariant under permutation.
2. They are consistent under marginalization:
F_{s1,...,sn,s(n+1)}(y1, ..., yn, ∞) = P(Y(s1) ≤ y1, ..., Y(sn) ≤ yn, Y(s(n+1)) ≤ ∞) = F_{s1,...,sn}(y1, ..., yn)
6 / 94
Gaussian process
• Defining a distribution for a spatial process through finite-dimensional distributions that satisfy Kolmogorov's compatibility conditions is usually difficult.
• An example of a spatial process is the Gaussian process.
• {Y(s) : s ∈ D} is a Gaussian process if for each n ≥ 1 and for each s1, ..., sn ∈ D ⊂ R^d, the finite-dimensional distribution of (Y(s1), ..., Y(sn))′ is multivariate normal.
• To specify a Gaussian process it is sufficient to specify the mean and covariance matrix of each finite-dimensional distribution.
• Most models for geostatistical data use Gaussian processes, transformations of Gaussian processes, or mixtures of Gaussian processes.
7 / 94
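Since every finite-dimensional distribution of a Gaussian process is multivariate normal, a realization at finitely many sites can be drawn directly from that normal distribution. A minimal Python sketch (the exponential covariance, the one-dimensional domain, and all parameter values are illustrative assumptions, not from the slides):

```python
import numpy as np

def exp_cov(locs, sigma2=1.0, phi=1.0):
    """Exponential covariance matrix: C(si, sj) = sigma2 * exp(-|si - sj| / phi)."""
    d = np.abs(locs[:, None] - locs[None, :])   # pairwise distances, 1-D sites
    return sigma2 * np.exp(-d / phi)

rng = np.random.default_rng(0)
s = np.linspace(0.0, 5.0, 50)                   # 50 locations in D = [0, 5]
Sigma = exp_cov(s, sigma2=0.9, phi=2.0)         # covariance of (Y(s1), ..., Y(s50))'
y = rng.multivariate_normal(np.zeros(len(s)), Sigma)  # one realization of the GP
```

Each call to `multivariate_normal` produces one realization of the process restricted to the chosen sites; the spatial dependence enters entirely through `Sigma`.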
Goals of geostatistics
• Given observations y1, ..., yn of a spatial process Y(s) at a finite number of locations s1, ..., sn, the main goals of a statistical analysis of geostatistical data are:
• to infer upon the process Y(s), s ∈ D;
• to predict (a functional of) the spatial process at a new location s0.
• Note that we ONLY have a sample y = (y1, ..., yn)′ of size 1! ⇒ we specify association through structured dependence.
8 / 94
Ozone data
Example: ozone concentration over the Eastern United States on August 31, 2001, measured in parts per billion (ppb).
[Figure: same ozone map as on slide 4]
9 / 94
Intrinsic stationarity and variograms
• Spatial process Y(s), s ∈ D ⊂ R^d. Y(s) is called intrinsically stationary if:
1. for each s, s + h ∈ D, h ∈ R^d: E(Y(s + h) − Y(s)) = 0;
2. for each s, s + h ∈ D, h ∈ R^d: Var(Y(s + h) − Y(s)) = E(Y(s + h) − Y(s))² = 2γ(h).
• The function 2γ(h), where h is a vector in R^d, is called the variogram (γ(h) the semi-variogram) and is a crucial parameter in geostatistics.
• If γ(h) depends only on ‖h‖ ≡ (h1² + h2² + ... + hd²)^{1/2}, where h = (h1, ..., hd), then the variogram is said to be isotropic. Otherwise it is said to be anisotropic.
• A variogram is called geometrically anisotropic if 2γ(h) = 2γ*(‖Ah‖) for some invertible d × d matrix A.
10 / 94
Parametric variogram models
• Parametric isotropic variogram models allow us to express spatial dependence as a function of a few parameters.
• Writing ‖h‖ = d, some of the most common parametric variogram models are:
• Spherical:
γ(d) = 0 for d = 0;
γ(d) = τ² + σ²·(3d/(2φ) − d³/(2φ³)) for 0 < d ≤ φ;
γ(d) = τ² + σ² for d > φ,
with φ > 0.
• Power:
γ(d) = 0 for d = 0; γ(d) = τ² + σ²·d^α for d > 0,
with α ∈ [0, 2).
• Exponential:
γ(d) = 0 for d = 0; γ(d) = τ² + σ²·[1 − exp(−d/φ)] for d > 0,
with φ > 0.
11 / 94
Parametric variogram models
• Gaussian:
γ(d) = 0 for d = 0; γ(d) = τ² + σ²·[1 − exp(−d²/φ²)] for d > 0,
with φ > 0.
• Powered exponential:
γ(d) = 0 for d = 0; γ(d) = τ² + σ²·[1 − exp(−(d/φ)^α)] for d > 0,
with φ > 0, α ∈ (0, 2].
• Matérn:
γ(d) = 0 for d = 0; γ(d) = τ² + σ²·[1 − (1/(2^{ν−1}Γ(ν)))·(d/φ)^ν·Kν(d/φ)] for d > 0,
with Kν the modified Bessel function of order ν, and φ, ν > 0.
12 / 94
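The exponential and Matérn semi-variogram models above can be sketched in Python as follows (an illustrative sketch; the function and argument names simply mirror the slides' notation):

```python
import numpy as np
from scipy.special import gamma, kv

def exp_semivariogram(d, tau2, sigma2, phi):
    """Exponential model: 0 at d = 0, tau2 + sigma2 * (1 - exp(-d/phi)) for d > 0."""
    d = np.asarray(d, dtype=float)
    g = tau2 + sigma2 * (1.0 - np.exp(-d / phi))
    return np.where(d == 0, 0.0, g)             # gamma(0) = 0 by definition

def matern_semivariogram(d, tau2, sigma2, phi, nu):
    """Matern model with smoothness nu; kv is the modified Bessel function K_nu."""
    d = np.asarray(d, dtype=float)
    u = d / phi
    with np.errstate(invalid="ignore"):          # kv is singular at u = 0
        corr = (u ** nu) * kv(nu, u) / (2 ** (nu - 1) * gamma(nu))
    g = tau2 + sigma2 * (1.0 - corr)
    return np.where(d == 0, 0.0, g)
```

For ν = 0.5 the Matérn model reduces to the exponential model, which gives a quick numerical check of the implementation.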
Parametric variogram models
[Figure: three panels of γ(d) against d — spherical semi-variogram with σ²=0.9, τ²=0.1 (φ = 2, 4, 8); power semi-variogram with σ²=0.9, τ²=0.3 (α = 0.5, 1, 1.5); exponential semi-variogram with σ²=0.9, τ²=0.1 (φ = 1, 2, 3)]
13 / 94
Parametric variogram models
[Figure: two panels of γ(d) against d — Matérn semi-variogram with σ²=0.9, τ²=0.1, φ=1 (ν = 0.5, 2, 10); Matérn semi-variogram with σ²=0.9, τ²=0.1, ν=1 (φ = 1, 2, 4)]
14 / 94
Weak and strong stationarity
• Intrinsic stationarity defines only the first and second moments of the differences Y(s + h) − Y(s). It does not provide a likelihood for Y(s1), ..., Y(sn)!
• A spatial process Y(s), s ∈ D ⊂ R^d, such that E(Y(s)) exists and Var(Y(s)) is finite for all s, is called weakly stationary or second-order stationary if
1. E(Y(s)) = µ(s) ≡ µ for all s ∈ D;
2. Cov(Y(s), Y(s + h)) = C(s, s + h) ≡ C(h) for all h ∈ R^d such that s, s + h ∈ D.
• A stationary covariance function C(h) is said to be isotropic if C(h) = C(‖h‖) for each h ∈ R^d; otherwise it is called anisotropic.
• An analogous definition of geometric anisotropy holds for covariance functions as for variograms.
15 / 94
Strong stationarity
• A spatial process Y(s) is said to be strictly stationary if the joint distribution of (Y(s1), ..., Y(sk)) is the same as that of (Y(s1 + h), ..., Y(sk + h)) for any k spatial points s1, ..., sk ∈ D ⊂ R^d and any vector h ∈ R^d, provided that s1 + h, ..., sk + h ∈ D ⊂ R^d.
• Strict stationarity is a stronger property than second-order stationarity.
• A Gaussian process that is second-order stationary is also strictly stationary.
16 / 94
Variograms and covariance functions
• Second-order stationarity is a stronger property than intrinsic stationarity.
• Suppose Y(s) is a second-order stationary spatial process; then:
2γ(h) = Var(Y(s + h) − Y(s)) = 2(C(0) − C(h))   (2)
• Given the covariance function C(h) of a second-order stationary process, we can recover the variogram function 2γ(h).
• The converse is not true in general. However, if lim‖h‖→∞ γ(h) exists, then C(h) is well-defined:
C(h) = lim‖u‖→∞ γ(u) − γ(h)
and intrinsic stationarity implies second-order stationarity.
17 / 94
Stationarity
[Diagram: nested stationarity classes — strongly stationary processes ⊂ weakly stationary processes ⊂ intrinsically stationary processes]
18 / 94
Parametric models for covariance functions
• As for variograms, we can write parametric models for covariance functions.
• A valid covariance function C(s, s′) has to satisfy the following properties:
1. It is symmetric: for each s, s′ ∈ D, C(s, s′) = C(s′, s).
2. It is positive definite: for each s1, ..., sN ∈ D ⊂ R^d and a1, ..., aN ∈ R:
∑_{i=1}^N ∑_{j=1}^N ai aj C(si, sj) ≥ 0
• We will consider mainly parametric models for isotropic covariance functions C(h) = C(‖h‖) = C(d), where d = ‖h‖, that have one or more of the following properties:
1. C(d) decreases as d increases.
2. C(d) → 0 as d increases.
3. C(d) ≥ 0 for all d.
19 / 94
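The positive-definiteness requirement can be checked numerically for a specific model: with C built from the exponential covariance, every quadratic form ∑i ∑j ai aj C(si, sj) should be non-negative. A sketch with made-up sites and parameter values:

```python
import numpy as np

rng = np.random.default_rng(1)
sites = rng.uniform(0, 10, size=(30, 2))                       # 30 sites in the plane
D = np.linalg.norm(sites[:, None, :] - sites[None, :, :], axis=-1)
C = 0.9 * np.exp(-D / 2.0)                                     # sigma2 = 0.9, phi = 2

# Quadratic forms a' C a for 200 random coefficient vectors a
quad_forms = [a @ C @ a for a in rng.normal(size=(200, 30))]
```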
Parametric models for covariance functions
• Since C(h) = lim‖u‖→∞ γ(u) − γ(h), if lim‖h‖→∞ γ(h) exists, we can derive corresponding parametric models for isotropic covariance functions.
• Spherical semi-variogram:
γ(d) = 0 for d = 0;
γ(d) = τ² + σ²·(3d/(2φ) − d³/(2φ³)) for 0 < d ≤ φ;
γ(d) = τ² + σ² for d > φ.
• The corresponding spherical covariance function has equation:
C(d) = τ² + σ² for d = 0;
C(d) = σ²·[1 − (3d/(2φ) − d³/(2φ³))] for 0 < d ≤ φ;
C(d) = 0 for d > φ.
• The spherical covariance function has compact support.
20 / 94
Spherical covariance function
[Figure: spherical semi-variogram (left) and spherical covariance function (right) with σ²=0.9, τ²=0.1, for φ = 2, 4, 8]
21 / 94
Parametric models for covariance functions
• Analogously, we can derive the exponential covariance function, the powered exponential covariance function, the Gaussian covariance function, and the Matérn covariance function.
• Matérn covariance function:
C(d) = τ² + σ² for d = 0;
C(d) = σ²·(1/(2^{ν−1}Γ(ν)))·(d/φ)^ν·Kν(d/φ) for d > 0.
• The Matérn covariance function is the most flexible covariance function; it admits a smoothness parameter ν > 0 that controls the smoothness of the realizations.
22 / 94
Matern covariance function
[Figure: Matérn semi-variogram (left) and Matérn covariance function (right) with σ²=0.9, τ²=0.1, φ=1, for ν = 0.5, 2, 10]
23 / 94
Attributes of a covariance function
• Let C(d) be an isotropic covariance function; then we can define the sill, range, nugget effect, and partial sill.
• Sill: the value C(0).
• Range: the smallest value of d at which the covariance function C(d) equals 0. The range might not exist!
• Nugget effect: the difference, if not equal to 0,
Nugget effect = C(0) − lim_{d→0+} C(d) = Sill − lim_{d→0+} C(d)
The nugget effect corresponds to a discontinuity at the origin in the covariance function C(d); it can be interpreted as microscale variability and/or measurement error.
• The difference between the sill and the nugget effect is called the partial sill.
24 / 94
Attributes of a covariance function
[Figure: spherical covariance function C(d) plotted against d]
C(d) = τ² + σ² for d = 0;
C(d) = σ²·[1 − (3d/(2φ) − d³/(2φ³))] for 0 < d ≤ φ;
C(d) = 0 for d > φ.
• What are the nugget effect, sill and partial sill?
• Does the range exist?
25 / 94
Attributes of a covariance function
[Figure: spherical covariance function annotated with the nugget effect (τ²), partial sill (σ²), sill (τ² + σ²), and range (φ)]
C(d) = τ² + σ² for d = 0;
C(d) = σ²·[1 − (3d/(2φ) − d³/(2φ³))] for 0 < d ≤ φ;
C(d) = 0 for d > φ.
• What are the nugget effect, sill and partial sill?
• Does the range exist?
26 / 94
Nugget effect
• If Y(s) is a weakly stationary spatial process with isotropic covariance function CY(d) and non-null nugget effect τ² > 0, then Y(s) can be written as:
Y(s) = w(s) + ε(s)
where
• w(s) is a weakly stationary spatial process with isotropic covariance function Cw(d) and no nugget effect;
• ε(s) is a pure error process, e.g. ε(s) ~iid N(0, τ²);
• w(s) and ε(s) are independent for any s ∈ D.
27 / 94
Correlation function and effective range
• If Y(s) is a weakly stationary spatial process with isotropic covariance function C(d) and no nugget effect, then for any d ≥ 0
C(d) = C(0)·ρ(d)
where ρ(d) is the correlation function.
• The effective range is the smallest value of d at which the correlation function ρ(d) equals 0.05.
• The behavior of the correlation function determines the smoothness of the realizations.
28 / 94
Effective range
• Consider the exponential correlation function ρ(d) = exp(−d/φ) for d > 0: what is the effective range of this correlation function?
[Figure: exponential correlation function ρ(d) with φ = 2]
29 / 94
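The answer can be worked out in closed form: setting exp(−d/φ) = 0.05 gives d = φ·ln(20) ≈ 3φ. A quick check in Python for the value φ = 2 used in the figure:

```python
import math

phi = 2.0                        # value used in the slide's figure
eff_range = phi * math.log(20)   # solves exp(-d/phi) = 0.05, i.e. -phi*ln(0.05)
rho = math.exp(-eff_range / phi) # correlation at the effective range
```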
Effective range
• Consider the exponential correlation function ρ(d) = exp(−d/φ) for d > 0: what is the effective range of this correlation function?
[Figure: exponential correlation function with φ = 2, with the effective range marked]
30 / 94
Generating covariance functions
• If C1(s, s′) and C2(s, s′) are two covariance functions and for each s, s′ ∈ D we define C(s, s′) := C1(s, s′) + C2(s, s′), then C(s, s′) is a valid covariance function.
• If C1(s, s′) is a covariance function, b > 0, and for each s, s′ ∈ D we define C(s, s′) := b·C1(s, s′), then C(s, s′) is a valid covariance function.
• If C1(s, s′) and C2(s, s′) are two covariance functions and for each s, s′ ∈ D we define C(s, s′) := C1(s, s′)·C2(s, s′), then C(s, s′) is a valid covariance function.
• Similarly, mixing and convolving valid covariance functions yields a valid covariance function.
• Bochner's theorem provides a characterization of stationary covariance functions.
31 / 94
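The sum and product closure rules can be illustrated numerically: both the elementwise sum and the elementwise (Schur) product of two valid covariance matrices remain positive semi-definite. A sketch with arbitrary sites and parameter values:

```python
import numpy as np

rng = np.random.default_rng(2)
sites = rng.uniform(0, 10, size=(25, 1))
D = np.abs(sites - sites.T)                  # pairwise distances, 1-D sites

C1 = 1.0 * np.exp(-D / 1.5)                  # exponential covariance
C2 = 0.5 * np.exp(-(D / 3.0) ** 2)           # Gaussian covariance

min_eig_sum = np.linalg.eigvalsh(C1 + C2).min()   # sum rule
min_eig_prod = np.linalg.eigvalsh(C1 * C2).min()  # product (Schur) rule
```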
Generating covariance functions
• Let h ∈ R^d be decomposed as h = (h1, h2) with h1 ∈ R^m and h2 ∈ R^{d−m}. If C1(h1) is a stationary covariance function on R^m and C2(h2) is a stationary covariance function on R^{d−m}, then
C(h) = C1(h1)·C2(h2)
is a separable covariance function on R^d.
• If C*(d) is an isotropic covariance function on R^d and A is a d × d invertible matrix, then
C(h) := C*(‖Ah‖)
is a geometrically anisotropic covariance function on R^d.
• Other constructions of non-stationary covariance functions from stationary covariance functions are possible (Sampson and Guttorp (1992), Fuentes (2001), Nott and Dunsmuir (2002), Higdon (1998), Schmidt and O'Hagan (2003), Paciorek and Schervish (2006), etc.)
32 / 94
Statistical analysis of geostatistical data
Example: ozone concentration over the Eastern United States on August 31, 2001, measured in parts per billion (ppb).
[Figure: same ozone map as on slide 4]
33 / 94
First law of geostatistics
• The observed data y = (y1, ..., yn)′ at sites s1, ..., sn is a partial realization of a spatial process Y(s), s ∈ D.
• The first law of geostatistics decomposes Y(s), s ∈ D, as:
Y(s) = µ(s) + η(s)
where
• µ(s) is the mean of Y(s). It is a deterministic function of s, called the spatial trend, and accounts for the large-scale variability in the spatial process Y(s). It is often expressed as a function of covariates X(s), e.g. µ(s) = X(s)β.
• η(s) accounts for the small-scale variability in the spatial process Y(s). It has mean zero at each s, and accounts for the spatial dependence in Y(s).
34 / 94
Example of first law of geostatistics
Example: a spatial process.
[Figure: simulated spatial process on [0, 5] × [0, 5]; values roughly −2 to 8]
35 / 94
Example of first law of geostatistics
[Figure: the spatial process decomposed as the sum of two surfaces: spatial process = global trend + small-scale variation]
36 / 94
Statistical inference
• Statistical inference for geostatistical data can be carried out in different ways:
1. Classical geostatistics:
• Usually no distributional assumption on η(s).
• Assume η(s) is a mean-zero, intrinsically stationary process.
• Estimate β provisionally via OLS.
• Derive residuals; compute the empirical semi-variogram of yi − X(si)β̂.
• Fit a parametric isotropic variogram γ(d) to the empirical semi-variogram via weighted least squares.
• Potentially re-estimate β via EGLS if the isotropic variogram has a sill (i.e. lim_{d→∞} γ(d) exists).
• R packages: geoR, gstat.
37 / 94
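The empirical semi-variogram step of the classical workflow can be sketched as follows (this is the standard Matheron estimator on binned pairwise distances; the simulated white-noise residuals and the bin edges are assumptions for illustration):

```python
import numpy as np

def empirical_semivariogram(sites, resid, bins):
    """Average of 0.5*(e_i - e_j)^2 over pairs whose distance falls in each bin."""
    D = np.linalg.norm(sites[:, None, :] - sites[None, :, :], axis=-1)
    sq = 0.5 * (resid[:, None] - resid[None, :]) ** 2
    iu = np.triu_indices(len(resid), k=1)        # count each pair once
    d, g = D[iu], sq[iu]
    which = np.digitize(d, bins)                 # bin index for each pair
    return np.array([g[which == b].mean() if np.any(which == b) else np.nan
                     for b in range(1, len(bins))])

rng = np.random.default_rng(3)
sites = rng.uniform(0, 10, size=(100, 2))
resid = rng.normal(size=100)                     # white-noise residuals, variance 1
gamma_hat = empirical_semivariogram(sites, resid, np.linspace(0, 10, 11))
```

For iid residuals with variance 1 the empirical semi-variogram should be roughly flat at 1 at all distances, i.e. pure nugget.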
Statistical inference
2. Likelihood-based inference:
• Assume η(s) is a mean-zero Gaussian process.
• Choose a parametric model for the covariance function Cη(s, s′). Derive the corresponding n × n positive definite covariance matrix Ση with Ση,ij = Cη(si, sj).
• Maximize the multivariate normal log-likelihood via numerical methods. Alternatively, derive parameter estimates via restricted maximum likelihood (REML).
• Note that if Cη(s, s′) is modeled as an isotropic covariance function with nugget effect τ² and partial sill σ², then η(s) = w(s) + ε(s) and:
Ση = σ²·Rw + τ²·I
where Rw is the correlation matrix corresponding to the covariance function of w(s).
• R package: geoR.
38 / 94
Example: log of zinc concentration
Data: observed log of zinc concentration.
[Figure: map of log zinc concentration (coordinates X, Y); values roughly 5.0–7.5]
39 / 94
Example: log of zinc concentration
Interpolated log of zinc concentration and estimated spatial trend.
[Figure: left panel, interpolated log zinc concentration with contour lines; right panel, estimated spatial trend]
Modeled µ(s) as a second-order polynomial in sx and sy.
40 / 94
Example: log of zinc concentration
Empirical semi-variogram and fitted Matérn semi-variogram.
[Figure: empirical semi-variogram (semivariance against distance) with fitted curve]
• Used the following initial values for θ = (τ², σ², φ, ν): σ² = 0.3, φ = 100, ν = 0.3, τ² = 0.1.
• WLS estimates of θ = (τ², σ², φ, ν): σ² = 0.38, φ = 548.7, ν = 0.3, τ² = 0.02.
41 / 94
Example: log of zinc concentration
• Modeled η(s) as a mean-zero weakly stationary Gaussian process.
• We parametrized the covariance function using the Matérn covariance function with a nugget effect.
• We used the WLS estimates of σ², τ², φ, ν as initial values for the covariance parameters.
• We specified a second-degree polynomial in sx and sy for µ(s).
• Estimates of the β coefficients were very similar across the estimation methods.
Parameter WLS MLE REML
σ2 0.38 0.39 0.55
φ 548.7 548.7 548.7
τ2 0.02 0.0 0.10
ν 0.3 0.25 0.63
42 / 94
Statistical analysis of geostatistical data
Example: ozone concentration over the Eastern United States on August 31, 2001, measured in parts per billion (ppb).
[Figure: same ozone map as on slide 4]
43 / 94
Kriging
• One of the main statistical objectives in geostatistics is spatial prediction.
• Let {Y(s), s ∈ R^d} be a spatial process for which data y at n points s1, s2, ..., sn are available. Spatial prediction consists in predicting g(Y(s)) given the observed data y = (y1, ..., yn)′.
• Kriging is a minimum mean-squared prediction error method of spatial prediction, named after D. G. Krige, who in the 1950s developed empirical methods for predicting the distribution of ore grade.
• In Krige's case, the interest was in predicting:
Y(B) = g(Y(s)) = (1/|B|)∫_B Y(u)du
44 / 94
Example: log zinc concentration
Observed log zinc concentration.
[Figure: map of log zinc concentration (coordinates X, Y); values roughly 5.0–7.5]
45 / 94
Example: log zinc concentration
What is the log zinc concentration at the point marked with a star?
[Figure: map of log zinc concentration with the prediction location marked with a star]
46 / 94
Example: log zinc concentration
What is the log zinc concentration in the marked area?
[Figure: map of log zinc concentration with the prediction area marked]
47 / 94
Main characteristics of kriging
• The main features of kriging are:
1. The kriging equations reflect the notion that observations should be weighted differently: observations at sites that are more correlated with the prediction site should receive larger weight.
2. Spatial dependence contributes to the kriging equations (via variograms or covariance functions).
48 / 94
A loss-function approach to kriging
• {Y(s), s ∈ D ⊂ R^d} is a spatial process on D with realized value y = (y1, ..., yn)′ at locations s1, ..., sn.
• We are interested in predicting g(Y(s)). We will illustrate the simple case of g(Y(s)) = Y(s0) for some point s0 ∈ D.
• Let p(Y; s0) be the prediction of Y(s0) and let
L[Y(s0), p(Y; s0)]
be the loss incurred when Y(s0) is predicted using p(Y; s0).
• An optimal predictor is one that minimizes the expected loss, or Bayes risk:
E(L[Y(s0), p(Y; s0)] | y)
49 / 94
A loss-function approach to kriging
• Thus, the optimal predictor of Y(s0) is obtained conditionally on y.
• In many prediction problems, the loss function L(·, ·) is the squared loss, i.e.
L[Y(s0), p(Y; s0)] = (Y(s0) − p(Y; s0))²
• In this case, the optimal predictor is: p(Y; s0) = E[Y(s0) | y].
• If Y(s), s ∈ D ⊂ R^d, is a Gaussian process, E[Y(s0) | y] is linear in y and it involves:
• µ = (µ(s1), ..., µ(sn))′ = (E(Y(s1)), ..., E(Y(sn)))′;
• c0 = (C(s0, s1), ..., C(s0, sn))′, where C(s0, si) = Cov(Y(s0), Y(si)); and
• Σ = (Σij)_{i,j=1,...,n} = (C(si, sj))_{i,j=1,...,n}.
50 / 94
Kriging
• In general, E[Y(s0) | y] is not linear in y.
• Kriging seeks the best linear unbiased predictor of Y(s0) given y,
p(Y; s0) = ∑_{i=1}^n λi yi
such that E[(Y(s0) − p(Y; s0))²] is minimized.
• There are three different types of kriging:
• simple kriging: E(Y(s)) = µ(s) is known for all s ∈ D;
• ordinary kriging: µ(s) ≡ µ for all s ∈ D, with µ unknown;
• universal kriging: µ(s) = X(s)β, with β unknown.
• Kriging formulas can be obtained in terms of variograms or covariance functions.
51 / 94
Simple kriging
• Set-up: y = (y1, ..., yn)′ are observations at sites s1, ..., sn of a spatial process Y(s), s ∈ D, that we write as:
Y(s) = µ(s) + η(s)
where µ(s) is assumed to be known.
• Notation: y = (y1, ..., yn)′; Σ = (Σij) with Σij = Cov(Y(si), Y(sj)) = C(si, sj).
• We want to find the best linear predictor of Y(s0) of the form
p(Y; s0) = ∑_{i=1}^n λi yi + k
such that E[(Y(s0) − ∑_{i=1}^n λi yi − k)²] is minimized.
• This happens if: k = µ(s0) − ∑_{i=1}^n λi µ(si) and
λ′ = (λ1, ..., λn)′ = c′Σ⁻¹
where c = (C(s0, s1), ..., C(s0, sn))′.
52 / 94
Simple kriging
• The simple kriging optimal linear predictor is:
p(Y; s0) = µ(s0) + c′Σ⁻¹(y − µ)
where µ = (µ(s1), ..., µ(sn))′.
• The minimized mean-squared prediction error is:
σ²sk(s0) = E[(Y(s0) − p(Y; s0))²] = Var(Y(s0)) − c′Σ⁻¹c
• σ²sk(s0) is interpreted as the kriging variance and can be used to construct simple kriging prediction intervals.
• Note that the simple kriging predictor p(Y; s0) is equal to E[Y(s0) | y] if η(s) (and thus Y(s)) is a Gaussian process.
53 / 94
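The simple kriging formulas can be sketched directly in Python (exponential covariance, constant known mean; all sites, data and parameter values are made up for illustration):

```python
import numpy as np

def simple_krige(sites, y, s0, mu, sigma2, phi):
    """Simple kriging with known constant mean mu and exponential covariance.

    Returns p(Y; s0) = mu + c' Sigma^{-1} (y - mu) and the kriging variance
    sigma2_sk(s0) = Var(Y(s0)) - c' Sigma^{-1} c.
    """
    D = np.linalg.norm(sites[:, None, :] - sites[None, :, :], axis=-1)
    Sigma = sigma2 * np.exp(-D / phi)                      # Cov(Y(si), Y(sj))
    c = sigma2 * np.exp(-np.linalg.norm(sites - s0, axis=1) / phi)
    w = np.linalg.solve(Sigma, c)                          # Sigma^{-1} c
    pred = mu + w @ (y - mu)
    var = sigma2 - c @ w
    return pred, var

rng = np.random.default_rng(4)
sites = rng.uniform(0, 10, size=(40, 2))
y = rng.normal(5.8, 0.7, size=40)                          # fake observations
pred, var = simple_krige(sites, y, np.array([5.0, 5.0]), 5.8, 0.5, 2.0)
```

Predicting at an observed site returns the observed value with zero kriging variance, illustrating that kriging without a nugget effect is an exact interpolator.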
Example: log of zinc concentration
Simple kriging predictor: p(Y; s0) = µ(s0) + c′Σ⁻¹(y − µ)
[Figure: map of log zinc concentration with the prediction location marked]
• Suppose we know from geological considerations that µ(s) is constant on D and equal to 5.8. Then: µ(s0) = 5.8.
• Suppose also that we model the covariance function of Y(s) using a Matérn covariance function with parameters: σ² = 0.55, φ = 548.7, τ² = 0.10 and ν = 0.63.
• The simple kriging predictor of the log of zinc concentration at s0 is: p(Y; s0) = 5.13.
• The simple kriging variance is: 0.23.
54 / 94
Ordinary and universal kriging
• In ordinary kriging, Y(s) = µ(s) + η(s) = µ + η(s). Given the observations y, we want to find the best linear predictor of Y(s0)
p(Y; s0) = ∑_{i=1}^n λi yi  with  ∑_{i=1}^n λi = 1
such that E[(Y(s0) − p(Y; s0))²] is minimized.
• In universal kriging, Y(s) = µ(s) + η(s) = X(s)β + η(s), and we want to find the best linear predictor of Y(s0)
p(Y; s0) = ∑_{i=1}^n λi yi  with  λ′X = X(s0)
that minimizes the mean squared error.
• Minimization in both cases is subject to constraints.
• The formulas for the kriging variance account for the additional uncertainty due to not knowing the mean structure.
55 / 94
Block kriging
• Suppose we want to predict g(Y(s)) = Y(B) = (1/|B|)∫_B Y(u)du, where B is a subset of D of positive volume. We want to find the simple kriging predictor of Y(B).
• Then the simple kriging equations are modified to:
p(Y; B) = µ(B) + c′(B)Σ⁻¹(y − µ)
where
1. µ(B) = (1/|B|)∫_B µ(u)du;
2. c(B) = (C(B, s1), ..., C(B, sn))′ with C(B, sj) = (1/|B|)∫_B C(u, sj)du, j = 1, ..., n;
3. Σ is the same as before.
• Additionally, σ²sk(B) = C(B, B) − c′(B)Σ⁻¹c(B) with
C(B, B) = (1/|B|²)∫_B∫_B C(u, v)du dv.
56 / 94
Notes on kriging
• Kriging equations can be derived using expressions involving either variograms or covariance functions.
• The parameters of the spatial dependence (e.g. of the variogram or covariance function) are assumed known −→ underestimation of prediction uncertainty.
• Kriging is relatively unbiased.
• If the covariance function (or variogram) does not have a nugget effect, then kriging is an exact interpolator.
• Kriging approaches such as indicator kriging and trans-Gaussian kriging have been developed to predict probabilities and to handle data that can be transformed to Gaussian; they are not held in much confidence.
57 / 94
Hierarchical modeling of spatial data
• Classical geostatistical approaches are limited in scope:
• problematic to handle complex spatial dependencies;
• problematic to handle spatial prediction of non-Gaussian data.
• Hierarchical modeling:
• handles complicated dependence structures via relatively simple conditional probability specifications;
• allows multiple data sources to be combined;
• handles non-Gaussian spatial data;
• handles multivariate spatial data;
• accounts for all sources of uncertainty.
• We will consider mostly Bayesian hierarchical models.
58 / 94
Classical geostatistical approach
• Let Y(s) be a spatial process: Y(s) = µ(s) + η(s)
• µ(s) = X(s)β models the mean structure of Y(s);
• η(s) is a zero-mean spatial process that models the covariance structure of Y(s): Cov(Y(s), Y(s′)) = CY(s, s′) = Cov(η(s), η(s′)) = Cη(s, s′).
• Let η(s) be a stationary spatial process with covariance function Cη(h) and θ the vector of covariance parameters. If Cη(h) has a nugget effect τ² ≠ 0, then we can think of η(s) as the sum of two processes, η(s) = w(s) + ε(s). Hence:
Y(s) = X(s)β + w(s) + ε(s)
• w(s) is a spatial process with covariance function Cw(s, s′) and vector of covariance parameters θ;
• ε(s) is a collection of iid random variables with mean 0 and variance τ²;
• w(s) and ε(s) are independent.
59 / 94
Bayesian hierarchical approach
• In a Bayesian hierarchical modeling framework, we write the spatial process Y(s) as
Y(s) = X(s)β + w(s) + ε(s)
However, the model is formulated in stages.
• Suppose we have observed data y = (y1, y2, ..., yn)′. Then a Bayesian hierarchical model for y specifies:
1. Data model, [Data | Process, Parameters]: the model for the data y given the process w(s) and all the model parameters.
2. Process model, [Process | Parameters]: the model for w(s) given the parameters.
3. Parameter model, [Parameters]: the model for the parameters.
60 / 94
Data model for Gaussian spatial data
• For Gaussian spatial data, the most common approach is to assume that
• w(s) is a mean-zero spatial process that accounts for the spatial dependence in the data −→ spatial random effects;
• ε(s) is a collection of iid normal random variables with mean 0 and variance τ² −→ accounts for non-spatial variability.
• Given data y from n sites, we have
y = Xβ + w + ε
where w = (w(s1), ..., w(sn))′ and ε = (ε(s1), ..., ε(sn))′, which implies that:
y | w, β, τ² ∼ MVNn(Xβ + w, τ²In)
with In the identity matrix of dimension n × n.
61 / 94
Process model
• In this stage we specify the model for the spatial process w(s).
• For Gaussian spatial data, w(s) is modeled as a mean-zero Gaussian process with covariance function Cw(s, s′) that depends on a parameter vector θ.
• We can use one of the stationary isotropic covariance functions C(d) introduced earlier (exponential, Gaussian, Matérn, ...) ⇒ this introduces covariance parameters σ², φ (possibly ν).
• We can model the covariance function Cw(s, s′) using a non-stationary covariance function.
• We can model Cw(s, s′) to be geometrically anisotropic by choosing a parametric isotropic covariance function C*w(d) and introducing a d × d invertible matrix A with Cw(s, s′) = C*w(‖A(s − s′)‖).
62 / 94
Process model
• Ignoring the case of geometric anisotropy, given observations at sites s1, ..., sn, the process model adopted for w is often:
w | θ ∼ MVNn(0, Σw(θ))
with Σw(θ) the n × n covariance matrix induced by the covariance function Cw(s, s′) with parameter vector θ.
• If the covariance function Cw(s, s′; θ) is isotropic, then:
Cw(s, s′) = Cw(d) = Cw(0)·ρw(d)
where d = ‖s − s′‖ and ρw(d) is a correlation function with parameter vector θ*. This implies that Σw(θ) = Cw(0)·Rw(θ*), with Rw(θ*) a correlation matrix.
• Geometric anisotropy: if Cw(s, s′) is geometrically anisotropic with covariance parameter vector θ and transformation matrix A, then the n × n covariance matrix is Σw(θ, A).
63 / 94
Parameter model
• In this stage, we specify prior distributions for the model parameters.
• Usually, parameters are assumed to be independent a priori, and priors on individual parameters or groups of parameters are specified.
• Usual model parameters and usual choices of priors are:
• β: in most cases there are no restrictions on these parameters. Usual choices of priors include:
• an (improper) flat prior: p(β) ∝ 1;
• a multivariate normal prior: MVNp(m, V).
• For parameters representing variances (e.g. τ², σ²):
• Inverse Gamma(α, γ) distribution;
• Uniform(0, γ) distribution.
• For a parameter related to the range or effective range (e.g. φ):
• Gamma(α, γ) distribution;
• Uniform(0, γ) or Discrete uniform(S).
64 / 94
Parameter model
• Placing improper priors on the variance parameters, that is, on σ² and τ², often leads to improper posterior distributions!
• If we are modeling the covariance function Cw(s, s′) of the spatial process w(s) as geometrically anisotropic, using an isotropic covariance function C*w(d) and an invertible d × d transformation matrix A, that is:
Cw(s, s′) = C*w(‖A·(s − s′)‖)
then we need to specify a prior for A.
• A choice for A would then be a Wishart distribution, Wishart(B, df):
p(A; B, df) ∝ |A|^{(df−d−1)/2} exp(−trace(B⁻¹A)/2)
where B is a positive-definite matrix.
• If A ∼ Wishart(B, df), then E(A) = df·B. The parameter df is called the degrees-of-freedom parameter (df > d − 1) and is related to the precision: the larger df, the larger the variance of the distribution.
65 / 94
Fitting a Bayesian hierarchical model
• Fitting a Bayesian hierarchical model for spatial data proceeds as for any other Bayesian model.
• We are interested in deriving the posterior distribution p(w, β, θ, τ² | y), where θ represents the vector of covariance parameters (i.e. θ = (σ², φ), possibly including ν; for simplicity we assume no geometric anisotropy).
• From Bayes' theorem:
p(w, β, θ, τ² | y) ∝ f(y | w, β, θ, τ²)·p(w | θ)·p(β, θ, τ²)
• The joint posterior distribution is not available in closed form; therefore we use Markov chain Monte Carlo methods, in particular a Gibbs sampling algorithm with Metropolis-Hastings steps for parameters whose full conditionals are not available in closed form.
66 / 94
A digression on MCMC
• In many cases in Bayesian statistics, computing the joint posterior distribution analytically is not possible. In these cases, the joint posterior distribution is approximated via Monte Carlo methods, by drawing samples from it. This approach replaces numerical integration with Monte Carlo integration, and inference is then sampling-based.
• The key is to simulate from an aperiodic, irreducible Markov chain; such a chain admits a stationary distribution.
• A Markov chain Monte Carlo algorithm ensures that this stationary distribution is the joint posterior distribution.
• Posterior inference is then based on the ergodic theorem.
67 / 94
The Gibbs sampler
• Suppose Ω = (Ω1, Ω2) is the parameter in our data model f(y | Ω), for which we have specified a prior distribution p(Ω).
• We want to compute the posterior distribution p(Ω1, Ω2 | y). Suppose we know the full conditional distributions p(Ω1 | Ω2, y) and p(Ω2 | Ω1, y).
• The Gibbs sampler samples from the posterior distribution p(Ω1, Ω2 | y) using the following sampling scheme:
1. Set starting values Ω(0) = (Ω1(0), Ω2(0)).
2. For i = 1, ..., G:
1. Draw Ω1(i) ∼ p(Ω1 | Ω2(i−1), y)
2. Draw Ω2(i) ∼ p(Ω2 | Ω1(i), y)
• This constructs a Markov chain and, after an initial burn-in period of G0 iterations, the algorithm guarantees that {(Ω1(i), Ω2(i))}, i = G0+1, ..., G, are samples from p(Ω1, Ω2 | y).
• Additionally, {Ω1(i)}, i = G0+1, ..., G, are samples from the marginal posterior distribution p(Ω1 | y), and similarly for Ω2.
68 / 94
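The two-block scheme above can be sketched in a few lines. As an illustration only (not part of the slides' R/spBayes workflow), here is a toy Gibbs sampler in Python targeting a standard bivariate normal with correlation ρ, chosen because both full conditionals are exact normals:

```python
import random
import math

def gibbs_bivariate_normal(rho, G=20000, G0=2000, seed=7):
    """Toy Gibbs sampler targeting a standard bivariate normal with
    correlation rho. Both full conditionals are known:
        X1 | X2 = x2  ~  N(rho * x2, 1 - rho^2)
        X2 | X1 = x1  ~  N(rho * x1, 1 - rho^2)
    Mirrors the two-block scheme (Omega_1, Omega_2) on the slide."""
    rng = random.Random(seed)
    sd = math.sqrt(1.0 - rho ** 2)
    x1, x2 = 0.0, 0.0                  # starting values Omega^(0)
    draws = []
    for i in range(G):
        x1 = rng.gauss(rho * x2, sd)   # draw Omega_1^(i) | Omega_2^(i-1), y
        x2 = rng.gauss(rho * x1, sd)   # draw Omega_2^(i) | Omega_1^(i), y
        if i >= G0:                    # discard the burn-in period
            draws.append((x1, x2))
    return draws

samples = gibbs_bivariate_normal(rho=0.8)
mean1 = sum(x for x, _ in samples) / len(samples)        # estimates E(X1) = 0
corr = sum(x * y for x, y in samples) / len(samples)     # estimates E(X1 X2) = rho
```

The post-burn-in pairs approximate the joint target, and each coordinate separately approximates its marginal, exactly as stated on the slide.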
Metropolis-Hastings
• In principle, the Gibbs sampler works well. In practice, the problem is sampling from the full conditionals!
  • They might not be amenable to easy sampling
  • They might not be in closed form
• In this case, we can use a Metropolis-Hastings (MH) algorithm. In a Metropolis-Hastings algorithm, samples are drawn from a proposal distribution and are either accepted or rejected.
69 / 94
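A minimal sketch of the MH idea, assuming a symmetric random-walk proposal (so the Hastings ratio reduces to the target ratio). The target here is a hypothetical unnormalized N(2, 1) density, standing in for a full conditional known only up to a constant:

```python
import random
import math

def metropolis(log_target, x0, step, G=30000, G0=3000, seed=1):
    """Random-walk Metropolis sampler. Only an *unnormalized* target
    log-density is needed, which is exactly the situation of a full
    conditional that is not available in closed form."""
    rng = random.Random(seed)
    x = x0
    out = []
    for i in range(G):
        prop = x + rng.gauss(0.0, step)            # symmetric proposal draw
        log_alpha = log_target(prop) - log_target(x)
        if math.log(rng.random()) < log_alpha:     # accept with prob min(1, ratio)
            x = prop                               # ...otherwise keep current x
        if i >= G0:
            out.append(x)
    return out

# unnormalized N(2, 1): log pi(x) = -(x - 2)^2 / 2 + const
draws = metropolis(lambda x: -0.5 * (x - 2.0) ** 2, x0=0.0, step=1.5)
mean = sum(draws) / len(draws)
```

Note that a rejected proposal repeats the current value; dropping those repeats would bias the chain.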
Marginalized model
• Let's return to the geostatistical model underlying the Bayesian hierarchical model formulation:
Y(s) = µ(s) + w(s) + ε(s) = X′(s)β + w(s) + ε(s)
with w(s) a Gaussian process with mean zero and covariance function Cw(s, s′) with covariance parameter θ, ε(s) iid random variables with mean 0 and variance τ², and w(s) and ε(s) independent.
• Then: CY(s, s′) = Cov(Y(s), Y(s′)) = Cov(w(s) + ε(s), w(s′) + ε(s′)) = Cw(s, s′; θ) + τ² δ_{s=s′}
• If we have data y = (y1, . . . , yn)′ at n sites, then:
y = Xβ + w + ε
i.e. marginally (conditioning ONLY on the parameters)
y|β, θ, τ² ∼ MVNn(Xβ, Σw(θ) + τ²In)
with Σw(θ) the covariance matrix induced by the covariance function Cw(s, s′).
70 / 94
Marginalized model
• Hence, a marginalized Bayesian hierarchical model for spatial data y specifies:
  1 Data model: f(y|β, θ, τ²) = MVNn(y; Xβ, Σw(θ) + τ²In)
  2 Parameter model: p(β, θ, τ²)
• If the covariance function Cw(s, s′) is isotropic, then Σw(θ) = Cw(0) · Rw(θ*) = σ² · Rw(θ*), with Rw(θ*) a correlation matrix and σ² = Cw(0).
In this case, the marginalized Bayesian hierarchical model for y has the following likelihood specification:
f(y|β, θ, τ²) = MVNn(y; Xβ, σ² · Rw(θ*) + τ²In)
with, at the second stage, specification of the prior distributions for the parameters.
71 / 94
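The marginal covariance σ²·Rw(θ*) + τ²In is easy to build directly. A pure-Python sketch, assuming an exponential correlation Rw[i][j] = exp(−φ d_ij) (one of several common isotropic choices) and Euclidean distances between made-up coordinates:

```python
import math

def exp_cov_matrix(coords, sigma2, phi, tau2):
    """Marginal covariance of y under the marginalized model:
    sigma^2 * R_w(phi) + tau^2 * I, with exponential correlation
    R_w[i][j] = exp(-phi * d_ij). coords is a list of (x, y) pairs."""
    n = len(coords)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d = math.hypot(coords[i][0] - coords[j][0],
                           coords[i][1] - coords[j][1])
            C[i][j] = sigma2 * math.exp(-phi * d)  # sigma^2 * R_w entry
            if i == j:
                C[i][j] += tau2                    # nugget on the diagonal only
    return C

# two sites 100 m apart; parameter values are illustrative only
C = exp_cov_matrix([(0.0, 0.0), (100.0, 0.0)],
                   sigma2=0.15, phi=0.004, tau2=0.36)
```

The key structural point is visible in the output: the nugget τ² enters only the diagonal, so Var(Y(s)) = σ² + τ² while Cov(Y(s), Y(s′)) = σ² exp(−φ d) for s ≠ s′.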
Marginalized model vs conditional model
• Conditional model
  • The posterior distribution is proportional to f(y|w, β, θ, τ²) · p(w|θ) · p(β, θ, τ²), where θ = (σ², φ) (possibly (σ², φ, ν)).
  • With the choice of conjugate priors, full conditionals are available in closed form =⇒ it is easier to program.
• Marginalized model
  • The posterior distribution is proportional to f(y|β, θ, τ²) · p(β, θ, τ²).
  • Only the full conditional of β is available in closed form. None of the other parameters has a closed-form full conditional.
  • The model has fewer parameters =⇒ it converges faster.
  • If the covariance function Cw(s, s′) is assumed to be isotropic, the covariance matrix σ² · Rw(θ*) + τ²In of y is more stable than the covariance matrix σ² · Rw(θ*) of w in the conditional model.
72 / 94
The spatial process in the marginalized model
• In the marginalized model for the data y, the spatial process w(s) is marginalized out.
• If we are interested in estimating the process w(s), we can recover it at the locations s1, s2, . . . , sn where we have observations.
• Let Ω denote the entire vector of parameters, i.e. Ω = (θ, β, τ²). Then we can recover w = (w(s1), . . . , w(sn))′ from:
p(w|y) = ∫ p(w, Ω|y) dΩ = ∫ p(w|Ω, y) · p(Ω|y) dΩ
using posterior samples of Ω.
• If Ω^(1), . . . , Ω^(G) are G samples of Ω from the posterior distribution p(Ω|y), then the integral above can be approximated via
p(w|y) ≈ (1/G) Σ_{g=1}^{G} p(w|Ω^(g), y)
73 / 94
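This mixture approximation is implemented by composition sampling: for each posterior draw Ω^(g), draw w from the conditional p(w|Ω^(g), y). A scalar toy sketch, where the conditional is a hypothetical N(Ω, 0.5²) and the "posterior draws" of Ω are simulated stand-ins:

```python
import random

def sample_w_by_composition(omega_draws, cond_sampler, seed=3):
    """Composition sampling: for each posterior draw Omega^(g),
    draw w ~ p(w | Omega^(g), y). The resulting draws are samples
    from p(w|y) = integral of p(w|Omega, y) p(Omega|y) dOmega.
    cond_sampler(rng, omega) encodes the conditional p(w|Omega, y);
    it is a made-up scalar normal here, purely for illustration."""
    rng = random.Random(seed)
    return [cond_sampler(rng, om) for om in omega_draws]

# stand-in posterior draws Omega^(g) ~ N(1.0, 0.2^2)
rng0 = random.Random(0)
omega_draws = [rng0.gauss(1.0, 0.2) for _ in range(20000)]

# toy conditional: w | Omega, y ~ N(Omega, 0.5^2)
w_draws = sample_w_by_composition(omega_draws,
                                  lambda r, om: r.gauss(om, 0.5))
w_mean = sum(w_draws) / len(w_draws)
```

The sample mean of the composed draws matches the posterior mean of w, here E(Ω|y) = 1, because uncertainty in Ω is averaged over rather than plugged in.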
Spatial prediction or Bayesian kriging
• Set up: We have observed data y = (y1, . . . , yn)′ at sites s1, s2, . . . , sn and we want to predict Y(s) at a new location s0.
• Suppose we know the value x0 of the covariates X(s0) at s0.
• Then, in the marginalized model, we generate predictions of Y(s) at s0 by sampling from the posterior predictive distribution
p(Y(s0)|y, X, x0) = ∫ p(Y(s0), Ω|y, X, x0) dΩ = ∫ p(Y(s0)|Ω, y, X, x0) · p(Ω|y, X, x0) dΩ
where X is the matrix X = (X′(s1) . . . X′(sn))′.
• If Ω^(1), . . . , Ω^(G) are G samples of Ω from the posterior distribution p(Ω|y, X, x0), then the integral above can be approximated via
p(Y(s0)|y, X, x0) ≈ (1/G) Σ_{g=1}^{G} p(Y(s0)|Ω^(g), y, X, x0).
74 / 94
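Bayesian kriging is again composition sampling: for each posterior draw Ω^(g), sample Y(s0) from its conditional normal given the data. To keep the algebra scalar, this sketch assumes a single observed site s1 at distance d from s0, exponential covariance plus nugget, and fixed stand-in values mu1, mu0 for X′(s1)β and X′(s0)β; the "posterior draws" are made up:

```python
import random
import math

def bayesian_krige_one_site(y1, d, mu1, mu0, post_draws, seed=11):
    """Toy posterior-predictive (Bayesian kriging) sketch with one
    observed site. post_draws is a list of hypothetical posterior
    samples (sigma2, tau2, phi). For each draw we sample from the
    conditional normal p(Y(s0) | Omega^(g), y)."""
    rng = random.Random(seed)
    out = []
    for sigma2, tau2, phi in post_draws:
        v = sigma2 + tau2                    # Var(Y(s1)) = Var(Y(s0))
        c = sigma2 * math.exp(-phi * d)      # Cov(Y(s0), Y(s1)); no nugget off-diagonal
        mean0 = mu0 + (c / v) * (y1 - mu1)   # conditional (kriging) mean
        var0 = v - c * c / v                 # conditional (kriging) variance
        out.append(rng.gauss(mean0, math.sqrt(var0)))
    return out

# degenerate "posterior": 5000 identical draws, values illustrative only
post = [(0.15, 0.36, 0.004)] * 5000
preds = bayesian_krige_one_site(y1=1.0, d=100.0, mu1=0.0, mu0=0.0,
                                post_draws=post)
pred_mean = sum(preds) / len(preds)
```

With genuinely varying posterior draws, the predictive draws automatically propagate parameter uncertainty, which classical plug-in kriging ignores.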
Example: Bartlett Experimental Forest
• The Bartlett Experimental Forest is a field laboratory in the White Mountain National Forest in New Hampshire.
• The forest was established in 1931 and has a size of 2,600 acres. The site was chosen because it represented conditions (soils, elevation, climate, tree species composition) typical of many forested areas throughout New England and Northern New York.
• In 1931-1932 the forest was gridded with 500 permanent 0.1-ha square plots spaced 200 by 100 meters apart.
• All woody stems larger than 1 inch in diameter were measured in the majority of the plots (441). Plants have been remeasured over time to observe changes in tree species, sizes and regeneration.
75 / 94
Bartlett Experimental Forest
76 / 94
Bartlett Experimental Forest
• Currently a disease called beech scale is decimating American beech stands, changing forest composition across New England. Scientific interest lies in mapping and understanding the spatial distribution of current beech density.
• Additional information available: elevation and three satellite-derived variables that provide information on the canopy.
• Plot of the observed log density of beech trees in Spring 2002:
[Figure: map of the observed log density (roughly −4 to −1) over the plot grid; axes X and Y in forest coordinates.]
77 / 94
Hierarchical Bayesian model
• Let y = (y1, . . . , yn)′ denote the observations of log basal area of beech trees at plots centered at s1, . . . , sn.
• Let X be the n×5 design matrix with an intercept and the covariates
  1 elevation
  2 total canopy, first satellite variable
  3 total canopy, second satellite variable
  4 total canopy, third satellite variable
all evaluated at s1, . . . , sn.
78 / 94
Hierarchical Bayesian model
We analyze the data using the following conditional model:

Data model: y|β = (β0, . . . , β4)′, w, τ² ∼ MVNn(Xβ + w, τ²In)
Process model: w ∼ MVNn(0, σ²Rw(φ))
Parameter model: p(β0) ∝ 1, . . . , p(β4) ∝ 1
                 σ² ∼ Inverse Gamma(ασ², γσ²)
                 τ² ∼ Inverse Gamma(ατ², γτ²)
                 φ ∼ Uniform(a, b)

We specify an exponential covariance function for w(s) and fit the model in R using spBayes. Inference is based on the corresponding marginalized model.
79 / 94
Trace plots
Trace plots and density estimates of β0, β1, β2.
[Figure: MCMC trace plots (2,500 iterations) and kernel density estimates for the intercept, elevation (ELEV) and the first canopy variable (SPR_02_TC1).]
80 / 94
Trace plots
Trace plots and density estimates of β3, β4.
[Figure: MCMC trace plots (2,500 iterations) and kernel density estimates for the second and third canopy variables (SPR_02_TC2, SPR_02_TC3).]
81 / 94
Trace plots
Trace plots and density estimates of σ², τ², φ.
[Figure: MCMC trace plots (5,000 iterations) and kernel density estimates for sigma.sq, tau.sq and phi.]
82 / 94
Bayesian estimation
Posterior medians and 95% credible intervals for the regression coefficients and the covariance parameters (burn-in: 1,000 iterations).

Model term        Parameter   Posterior estimate (95% credible interval)
Intercept         β0          −0.79 (−5.1; 3.6)
Elevation         β1          0.002 (0.0004; 0.004)
Total canopy 1    β2          −0.03 (−0.08; 0.01)
Total canopy 2    β3          0.06 (0.03; 0.10)
Total canopy 3    β4          −0.07 (−0.11; −0.03)
Partial sill      σ²          0.15 (0.07; 0.34)
Nugget effect     τ²          0.36 (0.25; 0.47)
Range parameter   φ           0.004 (0.002; 0.005)

Note that a posterior median of 0.004 for φ corresponds to a posterior median of 250 (= 1/0.004) meters for the range parameter.
83 / 94
Estimate of spatial random effects
Posterior median of w(s) (interpolated surface).
[Figure: interpolated posterior median of the spatial process over the forest, values roughly from −1.4 to −0.2; axes X and Y in forest coordinates.]
84 / 94
Non-Gaussian spatial data
• The hierarchical modeling approach is particularly convenient for non-Gaussian spatial data.
• Example: photon emission counts.
[Figure: two maps of photon emission counts (roughly 5,000 to 20,000) over Rongelap Island; axes X and Y.]
Data collected as part of investigations into the extent of residual contamination from the U.S. nuclear weapons testing program in the Marshall Islands in the South Pacific. The data are the number of photon emission counts attributable to radioactive caesium on Rongelap Island.
85 / 94
Generalized linear models
• Generalized linear models extend the linear model approach to observations y = (y1, . . . , yn)′ of independent random variables Y1, . . . , Yn whose distribution belongs to the exponential family, e.g.
f(yi|γ) = h(yi, γ) · exp[γ(yiηi − ψ(ηi))]   i = 1, . . . , n
where γ is a dispersion parameter.
• In a generalized linear model, a linear model is fitted to a function of the mean, that is:
E(Yi) = µi,   ηi = g(µi) = X′iβ
  1 X′i is a 1×p vector of covariates relative to observation yi
  2 β is a p×1 vector of parameters
  3 g is called the canonical link function
• Commonly used link functions are: identity (normal data), logit (binomial data), log (Poisson data), etc.
86 / 94
Generalized spatial linear models
• We can adapt this framework to spatial data. This is what Diggle, Tawn and Moyeed (1998) called model-based geostatistics.
• We adopt a hierarchical model formulation. Consider a realization y = (y1, . . . , yn)′ of a spatial process Y(s) at sites s1, . . . , sn.
  • We assume that w(s) is a Gaussian process with mean zero and stationary covariance function Cw(h) with covariance parameter θ and no nugget effect.
  • Conditional on w = (w(s1), . . . , w(sn))′, the y1, . . . , yn are independent with a distribution in the exponential family (with dispersion parameter γ) and
E(Y(si)|w(si)) = µ(si),   g(µ(si)) = X′(si)β + w(si)
87 / 94
Bayesian estimation
• We can fit a generalized spatial linear model using a Bayesian approach.
• In this case, given data y = (y1, . . . , yn)′, we have:

Data model: yi|w ind∼ f(yi|wi; β), i = 1, . . . , n, with
            E(Y(si)|w) = µ(si),   g(µ(si)) = X′(si)β + w(si)
Process model: w|θ ∼ MVNn(0, Σw(θ))
Parameter model: β, θ|Hyperparam ∼ p(β, θ; Hyperparam.)

• Σw(θ) is the covariance matrix of w induced by the covariance function Cw(s, s′) of w(s).
88 / 94
Bayesian estimation
• As in every Bayesian model, inference is carried out by deriving the posterior distribution p(β, θ, w|y).
• We use Gibbs sampling and draw samples from the full conditionals:
  • p(β|θ, w, y) = p(β|w, y)
  • p(θ|β, w, y) = p(θ|w)
  • p(w|β, θ, y)
• Regardless of the prior specification, none of these full conditionals is available in closed form =⇒ we draw samples from them using the Metropolis-Hastings algorithm.
• In the spBayes package, generalized spatial linear models are fit using the spGLM function.
89 / 94
No marginalized model!
• Differently from the Gaussian case, it is not possible to marginalize out the spatial process w(s) analytically.
• Integrating out the spatial random effects w(s1), . . . , w(sn) would require the n-dimensional integral:
f(y|β, θ) = ∫_{R^n} ∏_{i=1}^{n} f(yi|w, β, θ) p(w|θ) dw
where p(w|θ) is the joint distribution of the vector w = (w(s1), . . . , w(sn))′ of spatial random effects.
• Caution in interpreting β: the coefficients report conditional associations, not marginal associations!
90 / 94
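The conditional-vs-marginal caveat can be made concrete with a log link. Conditional on w(s) = 0 the mean is exp(β0), but marginalizing over w ~ N(0, σ²) gives E[exp(β0 + w)] = exp(β0 + σ²/2), which is strictly larger (Jensen's inequality / the lognormal mean identity). A Monte Carlo check of this identity, with illustrative parameter values:

```python
import random
import math

def marginal_vs_conditional_mean(beta0, sigma, n=200000, seed=9):
    """Compares the conditional mean exp(beta0) (at w = 0) with the
    marginal mean E[exp(beta0 + w)] for w ~ N(0, sigma^2), estimated
    by Monte Carlo and computed exactly as exp(beta0 + sigma^2/2)."""
    rng = random.Random(seed)
    mc = sum(math.exp(beta0 + rng.gauss(0.0, sigma))
             for _ in range(n)) / n                 # Monte Carlo estimate
    cond = math.exp(beta0)                          # mean at w = 0
    marg = math.exp(beta0 + sigma ** 2 / 2)         # exact marginal mean
    return mc, cond, marg

mc, cond, marg = marginal_vs_conditional_mean(beta0=0.0, sigma=0.5)
```

So exponentiating a fitted β in the spatial GLM describes the mean at a fixed value of the random effect, not the population-averaged mean.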
Generalized spatial linear models: an example
Consider again the Rongelap photon emission counts Y (s).
[Figure: the two maps of photon emission counts (roughly 5,000 to 20,000) over Rongelap Island shown earlier; axes X and Y.]
Since each Y (si ) is a count, we should model each Y (si ) as a Poissonrandom variable.
91 / 94
Generalized spatial linear models: an example
• Therefore, a model for y = (y1, . . . , yn)′ could be:
yi|w ind∼ Poisson(µ(si)),   µ(si) = t(si) · exp[β0 + w(si)]
where w(s) is a stationary mean-zero Gaussian process with covariance function Cw(h).
• t(si) denotes the duration of the observation at si.
• Then λ(s) = exp[β0 + w(s)] is considered the intensity of the radioactivity.
92 / 94
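The generative side of this model is easy to sketch. Assuming pre-simulated GP values w(si) and made-up observation durations t(si) (the standard library has no Poisson sampler, so one is included via Knuth's algorithm):

```python
import random
import math

def rpois(rng, lam):
    """Knuth's Poisson sampler: multiply uniforms until the product
    drops below exp(-lam). Fine for moderate lam."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def simulate_rongelap_counts(t, w, beta0, seed=13):
    """Generative side of the model above: conditional on the GP
    values w(s_i), counts are independent Poisson with mean
    mu(s_i) = t(s_i) * exp(beta0 + w(s_i)), t being the observation
    durations (offsets). All inputs here are illustrative stand-ins."""
    rng = random.Random(seed)
    return [rpois(rng, ti * math.exp(beta0 + wi))
            for ti, wi in zip(t, w)]

y = simulate_rongelap_counts(t=[10.0, 20.0, 10.0],
                             w=[0.0, 0.3, -0.3], beta0=1.0)
```

Because the duration t(si) enters the mean multiplicatively, it plays the role of an offset on the log scale: log µ(si) = log t(si) + β0 + w(si), leaving λ(s) = exp[β0 + w(s)] as the per-unit-time intensity.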
References (1 of 2)
• S. Banerjee, B.P. Carlin and A.E. Gelfand (2004). Hierarchical modeling and analysis for spatial data. CRC Press, Boca Raton (FL).
• N. Cressie (1993). Statistics for spatial data. Wiley, New York. 2nd edition.
• N. Cressie and C.K. Wikle (2011). Statistics for spatio-temporal data. Wiley, New York.
• P.J. Diggle, J.A. Tawn and R.A. Moyeed (1998). Model-based geostatistics. Journal of the Royal Statistical Society, Series C, 47, 299-350.
• M. Fuentes (2001). A new high frequency approach for nonstationary environmental processes. Environmetrics, 12, 469-483.
• A.E. Gelfand, P.J. Diggle, M. Fuentes and P. Guttorp (2010). Handbook of spatial statistics. CRC Press, Boca Raton (FL).
93 / 94
References (2 of 2)
• D. Higdon (1998). A process-convolution approach to modeling temperatures in the North Atlantic Ocean. Journal of Environmental and Ecological Statistics, 5, 173-190.
• D.J. Nott and W.T.M. Dunsmuir (2002). Estimation of a nonstationary covariance structure. Biometrika, 89, 819-829.
• C.J. Paciorek and M.J. Schervish (2006). Spatial modeling using a new class of nonstationary covariance functions. Environmetrics, 17, 483-506.
• P.D. Sampson and P. Guttorp (1992). Nonparametric estimation of nonstationary spatial covariance structure. Journal of the American Statistical Association, 87, 108-119.
• A. Schmidt and A. O'Hagan (2003). Bayesian inference for nonstationary spatial covariance structure via spatial deformation. Journal of the Royal Statistical Society, Series B, 65, 743-758.
94 / 94