Lecture #5: MAPS WITH GAPS -- Small geographic area estimation, kriging, and kernel smoothing
Spatial statistics in practice
Center for Tropical Ecology and Biodiversity, Tunghai University & Fushan Botanical Garden
Topics for today’s lecture
• The E-M algorithm
• The spatial E-M algorithm
• Kriging in ArcGIS
• Geographically weighted regression (GWR)
• Approaches to map smoothing
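THEOREM 1
When missing values occur only in a response variable, Y, the iterative solution to the EM algorithm produces the regression coefficients calculated with only the complete data.
PF: Let b denote the vector of regression coefficients that is converged upon. Then, if \hat{Y}_m = X_m b,
\[
b = \left[\begin{pmatrix} X_o \\ X_m \end{pmatrix}^{T} \begin{pmatrix} X_o \\ X_m \end{pmatrix}\right]^{-1} \begin{pmatrix} X_o \\ X_m \end{pmatrix}^{T} \begin{pmatrix} Y_o \\ X_m b \end{pmatrix}
  = (X_o^{T} X_o + X_m^{T} X_m)^{-1} (X_o^{T} Y_o + X_m^{T} X_m b)
\]
Premultiplying both sides by (X_o^{T} X_o + X_m^{T} X_m) and cancelling X_m^{T} X_m b gives X_o^{T} X_o b = X_o^{T} Y_o, so
\[
b = (X_o^{T} X_o)^{-1} X_o^{T} Y_o
\]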
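Theorem 1 can be checked numerically. The following is a minimal sketch, assuming simulated data (the sample size, coefficients, and random seed are illustrative choices, not values from the lecture): the missing responses are repeatedly replaced by their current fitted values, and the converged coefficients are compared with OLS on the complete cases.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_missing = 30, 5                      # illustrative sizes (assumption)
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([2.0, 1.5]) + rng.normal(scale=0.5, size=n)
missing = np.zeros(n, dtype=bool)
missing[:n_missing] = True                # pretend these responses are missing

# OLS using only the complete observations
b_complete, *_ = np.linalg.lstsq(X[~missing], y[~missing], rcond=None)

# EM iteration: impute with current fitted values, then refit on all rows
y_fill = y.copy()
y_fill[missing] = y[~missing].mean()      # starting values
for _ in range(500):
    b_em, *_ = np.linalg.lstsq(X, y_fill, rcond=None)
    y_new = X[missing] @ b_em
    if np.max(np.abs(y_new - y_fill[missing])) < 1e-12:
        break
    y_fill[missing] = y_new

print(b_complete, b_em)                   # the two vectors should agree
```

The iteration is a contraction here because only a small share of the rows is missing; with a larger share it should still converge, only more slowly.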
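THEOREM 2
When missing values occur only in a response variable, Y, then by replacing the missing values with zeroes and introducing a binary 0/-1 indicator variable covariate -I_m for each missing value m, such that I_m is 0 for all but missing value observation m and 1 for missing value observation m, the estimated regression coefficient b_m is equivalent to the point estimate for a new observation, and hence furnishes EM algorithm imputations.
PF: Let b_m denote the vector of regression coefficients for the missing values, and partition the data matrices such that
\[
Y = \begin{pmatrix} Y_o \\ 0_m \end{pmatrix}, \quad \text{and} \quad
X = \begin{pmatrix} X_o & 0_{om} \\ X_m & -I_{mm} \end{pmatrix}.
\]
Then
\[
\begin{pmatrix} b_o \\ b_m \end{pmatrix}
= \left[\begin{pmatrix} X_o & 0_{om} \\ X_m & -I_{mm} \end{pmatrix}^{T} \begin{pmatrix} X_o & 0_{om} \\ X_m & -I_{mm} \end{pmatrix}\right]^{-1}
  \begin{pmatrix} X_o & 0_{om} \\ X_m & -I_{mm} \end{pmatrix}^{T} \begin{pmatrix} Y_o \\ 0_m \end{pmatrix}
\]
\[
= \begin{pmatrix} (X_o^{T} X_o)^{-1} & (X_o^{T} X_o)^{-1} X_m^{T} \\ X_m (X_o^{T} X_o)^{-1} & I_{mm} + X_m (X_o^{T} X_o)^{-1} X_m^{T} \end{pmatrix}
  \begin{pmatrix} X_o^{T} Y_o \\ 0_m \end{pmatrix}
\]
\[
\Rightarrow \quad b_o = (X_o^{T} X_o)^{-1} X_o^{T} Y_o, \qquad b_m = X_m b_o
\]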
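The indicator-variable device in Theorem 2 can be sketched as follows, assuming simulated data (sizes, coefficients, and seed are illustrative). A single OLS fit on the augmented design [X_o 0; X_m -I] returns both the complete-data coefficients and the imputations.

```python
import numpy as np

rng = np.random.default_rng(1)
n_obs, n_mis = 20, 3                      # illustrative sizes (assumption)
X_o = np.column_stack([np.ones(n_obs), rng.normal(size=n_obs)])
X_m = np.column_stack([np.ones(n_mis), rng.normal(size=n_mis)])
Y_o = X_o @ np.array([1.0, -2.0]) + rng.normal(scale=0.3, size=n_obs)

# Augmented design [X_o 0; X_m -I] with response [Y_o; 0]
X_aug = np.block([[X_o, np.zeros((n_obs, n_mis))],
                  [X_m, -np.eye(n_mis)]])
Y_aug = np.concatenate([Y_o, np.zeros(n_mis)])
coef, *_ = np.linalg.lstsq(X_aug, Y_aug, rcond=None)
b_o_aug, b_m = coef[:2], coef[2:]

# Reference: OLS on the observed data, then predict at X_m
b_o, *_ = np.linalg.lstsq(X_o, Y_o, rcond=None)
print(b_m, X_m @ b_o)                     # imputations from both routes
```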
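The EM algorithm solution
\[
Y = X\beta - \sum_{m=1}^{M} y_m I_m + \varepsilon
\]
where:
the missing values are replaced by 0 in Y, and
I_m is an indicator variable for missing value m that contains n - 1 zeros and a single 1.
In partitioned form:
\[
\begin{pmatrix} Y_o \\ 0_m \end{pmatrix}
= \begin{pmatrix} 1_o & X_o \\ 1_m & X_m \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix}
- \begin{pmatrix} 0_{o,m} \\ I_{m,m} \end{pmatrix} y_m
+ \begin{pmatrix} \varepsilon_o \\ 0_m \end{pmatrix}
\]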
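THEOREM 3
For imputations computed based upon Theorem 2, each standard error of the estimated regression coefficients b_m is equivalent to the conventional standard deviation used to construct a prediction interval for a new observation, and as such furnishes the corresponding EM algorithm imputation standard error.
PF:
\[
\left[\begin{pmatrix} X_o & 0_{om} \\ X_m & -I_{mm} \end{pmatrix}^{T} \begin{pmatrix} X_o & 0_{om} \\ X_m & -I_{mm} \end{pmatrix}\right]^{-1} \sigma_{\varepsilon}^{2}
= \begin{pmatrix} (X_o^{T} X_o)^{-1} & (X_o^{T} X_o)^{-1} X_m^{T} \\ X_m (X_o^{T} X_o)^{-1} & I_{mm} + X_m (X_o^{T} X_o)^{-1} X_m^{T} \end{pmatrix} \sigma_{\varepsilon}^{2}
\]
\[
s_{b_m} = \sqrt{\operatorname{diag}\left[I_{mm} + X_m (X_o^{T} X_o)^{-1} X_m^{T}\right] \sigma_{\varepsilon}^{2}}
\]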
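As a hedged numerical illustration of Theorem 3 (simulated data; all sizes and the seed are assumptions), the standard errors reported for the indicator coefficients can be compared with the textbook prediction standard deviation s * sqrt(1 + x (X_o' X_o)^{-1} x').

```python
import numpy as np

rng = np.random.default_rng(2)
n_obs, n_mis, p = 25, 2, 2                # illustrative sizes (assumption)
X_o = np.column_stack([np.ones(n_obs), rng.normal(size=n_obs)])
X_m = np.column_stack([np.ones(n_mis), rng.normal(size=n_mis)])
Y_o = X_o @ np.array([3.0, 0.7]) + rng.normal(size=n_obs)

# Augmented regression of Theorem 2
X_aug = np.block([[X_o, np.zeros((n_obs, n_mis))],
                  [X_m, -np.eye(n_mis)]])
Y_aug = np.concatenate([Y_o, np.zeros(n_mis)])
XtX_inv = np.linalg.inv(X_aug.T @ X_aug)
coef = XtX_inv @ X_aug.T @ Y_aug
resid = Y_aug - X_aug @ coef
df = X_aug.shape[0] - X_aug.shape[1]      # equals n_obs - p
s2 = resid @ resid / df
se_bm = np.sqrt(s2 * np.diag(XtX_inv)[p:])

# Conventional prediction standard deviation for new observations at X_m
b_o = np.linalg.solve(X_o.T @ X_o, X_o.T @ Y_o)
res_o = Y_o - X_o @ b_o
s2_o = res_o @ res_o / (n_obs - p)
h = np.einsum("ij,jk,ik->i", X_m, np.linalg.inv(X_o.T @ X_o), X_m)
se_pred = np.sqrt(s2_o * (1.0 + h))
print(se_bm, se_pred)
```

The degrees of freedom match because each indicator column absorbs exactly one of the added rows.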
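What is the set of equations for the following case?
10    7
7     y4 = ?
\[
Y = \alpha 1 - \sum_{m=1}^{M} y_m I_m + \varepsilon
\]
\[
\begin{aligned}
10 &= 8 + 0 + 2 \\
 7 &= 8 + 0 - 1 \\
 7 &= 8 + 0 - 1 \\
 0 &= 8 - y_4 + 0
\end{aligned}
\]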
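The four equations of this small case can be solved directly. A minimal sketch using the slide's values (10, 7, 7, and a missing y4 coded as 0), with an intercept column and a single 0/-1 indicator column:

```python
import numpy as np

Y = np.array([10.0, 7.0, 7.0, 0.0])       # y4 is missing and replaced by 0
ones = np.ones(4)
neg_I4 = np.array([0.0, 0.0, 0.0, -1.0])  # 0/-1 indicator for observation 4
design = np.column_stack([ones, neg_I4])

(alpha_hat, y4_hat), *_ = np.linalg.lstsq(design, Y, rcond=None)
residuals = Y - design @ np.array([alpha_hat, y4_hat])
print(alpha_hat, y4_hat, residuals)
```

With only an intercept in the model, the imputation is the mean of the three observed values.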
Some preliminary assessments
simulations
Calculations from ANCOVA regression and the EM algorithm
Data source             Quantity                  Reported value   OLS/NLS estimate
Schafer (1997) p. 43    ( )                       48.1000          48.10000
                        ( )                       59.4260          59.42600
Schafer (1997) p. 195   y_{2,3}, average (n=5)    226.2            228.0 (se = 32.86)
                        y_{4,3}, average (n=5)    146.8            146.2 (se = 38.37)
                        y_{5,3}, average (n=5)    190.8            192.5 (se = 34.11)
                        y_{10,3}, average (n=5)   250.2            271.7 (se = 36.20)
                        y_{13,3}, average (n=5)   234.2            241.3 (se = 35.18)
                        y_{16,3}, average (n=5)   269.2            269.9 (se = 34.53)
                        y_{18,3}, average (n=5)   192.4            201.9 (se = 32.91)
                        y_{23,3}, average (n=5)   215.6            207.4 (se = 33.09)
                        y_{25,3}, average (n=5)   250.0            255.7 (se = 33.39)
Simulated imputations
\[
Y_{reported} = 0.044 + 0.987\, Y_{imputed}; \qquad R^{2} = 0.99
\]
EM algorithm solution for aggregated georeferenced data: vandalized turnip plots

MTB > regress c4 8 c7-c14

Regression Analysis: C4 versus C7, C8, C9, C10, C11, C12, C13, C14

The regression equation is
C4 = 28.9 - 6.32 C7 - 18.2 C8 - 1.10 C9 - 11.4 C10 - 10.1 C11 + 28.9 C12 + 18.8 C13 + 27.8 C14

Predictor            Coef   SE Coef       T      P
Constant           28.900     2.404   12.02  0.000
C7  [I1-I6]        -6.317     3.254   -1.94  0.063
C8  [I2-I6]       -18.200     3.254   -5.59  0.000
C9  [I3-I6]        -1.100     3.399   -0.32  0.749
C10 [I4-I6]       -11.400     3.254   -3.50  0.002
C11 [I5-I6]       -10.100     3.399   -2.97  0.006
C12 [plot(6,5)]    28.900     5.887    4.91  0.000
C13 [plot(5,6)]    18.800     5.887    3.19  0.004
C14 [plot(6,6)]    27.800     5.887    4.72  0.000

Analysis of Variance for C4
Source   DF      SS     MS     F      P
C5        5  1289.0  257.8  8.92  0.000
Error    27   779.9   28.9
Total    32  2068.9

Individual 95% CIs For Mean Based on Pooled StDev
Level  N    Mean  StDev
1      5  28.900  4.407   (-----*-----)
2      6  22.583  6.391   (----*-----)
3      6  10.700  2.585   (----*-----)
4      5  27.800  5.082   (-----*-----)
5      6  17.500  6.648   (-----*-----)
6      5  18.800  5.922   (------*-----)
Pooled StDev = 5.375
Axis ticks: 8.0  16.0  24.0  32.0
Residual spatial autocorrelation
What does this mean?
SAR-based missing data estimation
\[
Y = \rho W Y + (I - \rho W) X\beta - \sum_{m=1}^{M} y_m (I_m - \rho W^{*}_{om}) + \varepsilon
\]
where y_m is a missing value (replaced by 0 in Y),
I_m is an indicator variable for y_m, and
W^{*}_{om} is the mth column of geographic weights matrix W
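The Jacobian term
\[
\det \begin{pmatrix} V_{oo} & V_{om} \\ V_{mo} & V_{mm} \end{pmatrix}^{-1}
\]
\[
J = -\frac{2}{n - n_m} \left[ \sum_{i=1}^{n} \mathrm{LN}(1 - \rho\lambda_i) - \sum_{k=1}^{n_m} \mathrm{LN}(1 - \rho\omega_k) \right]
\]
NOTE: denominator becomes (n - n_m)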
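What is the set of equations for the following case?
7    Y2 = ?    10
\[
Y = \rho W Y + (I - \rho W) 1 \mu - \sum_{m=1}^{M} y_m (I_m - \rho W^{*}_{om}) + \varepsilon
\]
\[
\begin{aligned}
 7 &= \rho \cdot 0 + \mu(1 - \rho) + \rho y_2 + e_1 \\
 0 &= \rho \frac{y_1 + y_3}{2} + \mu(1 - \rho) - y_2 + e_2 \\
10 &= \rho \cdot 0 + \mu(1 - \rho) + \rho y_2 + e_3
\end{aligned}
\]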
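spatial autoregressive (AR)
\[
\begin{pmatrix} Y_o \\ 0_m \end{pmatrix}
= \rho \begin{pmatrix} W_{oo} & W_{om} \\ W_{mo} & W_{mm} \end{pmatrix} \begin{pmatrix} Y_o \\ Y_m \end{pmatrix}
- \begin{pmatrix} 0 \\ I_{mm} \end{pmatrix} Y_m
+ \begin{pmatrix} 1 & X_o \\ 1 & X_m \end{pmatrix} \begin{pmatrix} (1 - \rho)\alpha \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}
+ \begin{pmatrix} \varepsilon_o \\ 0_m \end{pmatrix}
\]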
kriging
\[
\hat{Y}_m = X_m \hat{\beta} + \hat{\Sigma}_{mo} \hat{\Sigma}_{oo}^{-1} (Y_o - X_o \hat{\beta})
\]
estimate with semivariogram model
fit semivariogram model with
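The kriging predictor has the form X_m b + S_mo S_oo^{-1} (Y_o - X_o b). The sketch below evaluates it on simulated locations. The exponential covariance function with known sill and range, the GLS estimate of the coefficients, and all names (`exp_cov`, `coords`, `obs`) are assumptions made for illustration; in practice the covariance would come from a fitted semivariogram model.

```python
import numpy as np

def exp_cov(d, sill=1.0, range_=2.0):
    """Exponential covariance (hypothetical parameters, assumed known)."""
    return sill * np.exp(-d / range_)

rng = np.random.default_rng(3)
n = 12
coords = rng.uniform(0, 10, size=(n, 2))
obs = np.arange(n) < 9                    # first 9 observed, last 3 missing
D = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
Sigma = exp_cov(D)

X = np.column_stack([np.ones(n), coords[:, 0]])
Y = X @ np.array([5.0, 0.3]) + np.linalg.cholesky(Sigma) @ rng.normal(size=n)

X_o, X_m, Y_o = X[obs], X[~obs], Y[obs]
S_oo = Sigma[np.ix_(obs, obs)]
S_mo = Sigma[np.ix_(~obs, obs)]

# GLS estimate of beta (an assumption; the slide does not say how beta is estimated)
S_oo_inv_X = np.linalg.solve(S_oo, X_o)
beta_hat = np.linalg.solve(X_o.T @ S_oo_inv_X, S_oo_inv_X.T @ Y_o)

# Kriging predictor for the missing locations
Y_m_hat = X_m @ beta_hat + S_mo @ np.linalg.solve(S_oo, Y_o - X_o @ beta_hat)
print(Y_m_hat)
```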
The pure spatial autocorrelation CAR model
\[
\hat{Y}_m = \beta_0 1_m + \rho (I - \rho C_{mm})^{-1} C_{mo} (Y_o - \beta_0 1_o)
\]
NOTE: exactly the same algebraic structure as the kriging equation
Dispersed missing values:
\[
\hat{Y}_m = \beta_0 1_m + \rho C_{mo} (Y_o - \beta_0 1_o)
\]
Imputation = the observed mean plus a weighted average of the surrounding residuals
Employing rook’s adjacency and a CAR model, what is the equation for the following imputation?
10    3         7
6     y5 = ?    4
9     5         5
\[
y_5 = b_0 + \rho \left[ (3 - b_0) + (6 - b_0) + (4 - b_0) + (5 - b_0) \right]
\]
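The rook's-adjacency imputation can be evaluated once values for b_0 and rho are available. In this sketch both are assumptions: b_0 is set to the mean of the eight observed cells and rho to 0.2, neither of which is given on the slide, and `car_impute` is a hypothetical helper name.

```python
import numpy as np

grid = np.array([[10.0, 3.0, 7.0],
                 [6.0, np.nan, 4.0],
                 [9.0, 5.0, 5.0]])
rho = 0.2                   # hypothetical autocorrelation parameter
b0 = np.nanmean(grid)       # stand-in for the estimated mean (assumption)

def car_impute(grid, row, col, b0, rho):
    """b0 + rho * sum of rook-neighbor residuals (single isolated missing cell)."""
    total = 0.0
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + dr, col + dc
        if 0 <= r < grid.shape[0] and 0 <= c < grid.shape[1]:
            total += grid[r, c] - b0
    return b0 + rho * total

y5_hat = car_impute(grid, 1, 1, b0, rho)
print(b0, y5_hat)
```

The helper assumes the missing cell has no missing neighbors, which corresponds to the dispersed-missing-values case.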
The spatial filter EM algorithm solution
\[
Y = X\beta_X - \sum_{m=1}^{M} y_m I_m + E_k \beta_{E_k} + \varepsilon
\]
where:
the missing values are replaced by 0 in Y, and
I_m is an indicator variable for missing value m that contains n - 1 zeros and a single 1
Imputation of turnip production in 3 vandalized field plots
Field plot   Conventional EM estimate   Spatial SAR-EM estimate (ρ_SAR = 0.443)   Spatial filter: 3 selected eigenvectors
(6,5)        28.9                       29.99                                     24.31
(5,6)        18.8                       17.66                                     13.62
(6,6)        27.8                       28.26                                     23.93
Cressie’s PA coal ash
Model            Estimate
Cressie          10.27%
Spherical        10.62%
Gaussian         10.18%
Exponential      10.12%
SAR              10.17%
Spatial filter   10.71%

min     mean    max
7.00    9.78    17.61
Missing 1992 georeferenced density of milk production in Puerto Rico: constrained (total = 1918)
Predicted from 1991 DMILK   Predicted from spatial filter   Predicted from both
235                         70                              385
1,339                       1,848                           1,065
344                         0                               468
[Figure panels: predictions; Moran scatterplot]
USDA-NASS estimation of Pennsylvania crop production
[Figure annotations: covariate; total constraints; map gaps]
USDA-NASS estimation of Michigan crop production
If this is 2% milk, how much am I paying for the other 98%?
Michigan imputations
[Figure annotation: different response variable specifications]
USDA-NASS estimation of Tennessee crop production
Tennessee imputations
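An EM specification when some data for both Y and the Xs are missing
\[
\begin{pmatrix} Y_o \\ Y_{m,x} \\ 0_{m,y} \end{pmatrix}
= \alpha_Y \begin{pmatrix} 1_o \\ 1_{m,x} \\ 1_{m,y} \end{pmatrix}
+ \beta_Y \begin{pmatrix} X_o \\ 0_{m,x} \\ X_{m,y} \end{pmatrix}
+ \begin{pmatrix} 0_{m,x,o} \\ I_{m,x} \\ 0_{m,x,y} \end{pmatrix} \beta_Y X_{m,x}
- \begin{pmatrix} 0_{m,y,o} \\ 0_{m,y,x} \\ I_{m,y} \end{pmatrix} Y_{m,y}
+ \varepsilon_Y
\]
\[
\begin{pmatrix} X_o \\ 0_{m,x} \\ X_{m,y} \end{pmatrix}
= \alpha_X \begin{pmatrix} 1_o \\ 1_{m,x} \\ 1_{m,y} \end{pmatrix}
+ \beta_X \begin{pmatrix} Y_o \\ Y_{m,x} \\ 0_{m,y} \end{pmatrix}
- \begin{pmatrix} 0_{m,x,o} \\ I_{m,x} \\ 0_{m,x,y} \end{pmatrix} X_{m,x}
+ \begin{pmatrix} 0_{m,y,o} \\ 0_{m,y,x} \\ I_{m,y} \end{pmatrix} \beta_X Y_{m,y}
+ \varepsilon_X
\]
where o indexes the complete observations, (m,x) those with X missing, and (m,y) those with Y missing.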
Concatenation results:
[Stacked two-equation system; the matrix layout is not recoverable from the extracted slide. Surviving labels: Y_eq and X_eq; partition subscripts (o,y), (m,y), (o,x), (m,x).]
The spatial model
[Equations not recoverable from the extracted slides. Three specifications are shown, for yield (y), acres/area (a), and production/area (p); each is annotated with the components: power transformation, spatial autocorrelation, totals constraints, covariate.]
[Figure: residual plots for yield, acres, and production.]
Imputation of turnip production in 3 vandalized field plots
Field plot   Spatial filter: 3 selected eigenvectors
(6,5)        24.31
(5,6)        13.62
(6,6)        23.93
Cross-validation of spatial filter for observed turnip data
The accompanying table contains a test set of sixteen random samples (#17-32) used to evaluate three maps. The “Actual” column lists the measured values for the test locations identified by “Col, Row” coordinates. The differences between these values and those predicted by the three interpolation techniques form the residuals shown in parentheses. The “Average” column compares the whole-field arithmetic mean of 23 (guess 23 everywhere) with the actual value at each test location.
Kriging: best linear unbiased spatial interpolator (i.e., predictor)
ArcGIS: Geostatistical Wizard
Density of German workers: anisotropy check
Cross-validation check of kriged values
This is one use of the missing spatial data imputation methods.
Unclipped kriged surface
[Figure panels: kriged (mean response) surface; prediction error surface. Exponential semivariogram model; values increase with darkness of brown; extrapolation.]
Clipped kriged surface
[Figure panels: kriged (mean response) surface; prediction error surface. Values increase with darkness of brown.]
Detrended population density across China: anisotropy check
Cross-validation check of kriged values
This is one use of the missing spatial data imputation methods.
Unclipped kriged surface
[Figure panels: kriged (mean response) surface; prediction error surface. Exponential semivariogram model; values increase with darkness of brown; extrapolation.]
Clipped kriged surface
[Figure panels: kriged (mean response) surface; prediction error surface. Values increase with darkness of brown.]
THEOREM 4
The maximum likelihood estimate for missing georeferenced values described by a spatial autoregressive model specification is equivalent to the best linear unbiased predictor kriging equation of geostatistics.
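Theorem 4 can be illustrated numerically for fixed parameter values. The sketch assumes a six-unit chain with row-standardized weights, known rho and mu, and two missing values (all hypothetical choices). Maximizing the SAR likelihood over the missing values gives mu - V_mm^{-1} V_mo (y_o - mu), where V = (I - rho W)'(I - rho W); the kriging form uses the implied covariance matrix instead.

```python
import numpy as np

n = 6
C = np.zeros((n, n))
for i in range(n - 1):                    # chain adjacency (hypothetical layout)
    C[i, i + 1] = C[i + 1, i] = 1.0
W = C / C.sum(axis=1, keepdims=True)      # row-standardized weights
rho, mu = 0.5, 10.0                       # assumed known for this illustration

A = np.eye(n) - rho * W
V = A.T @ A                               # inverse covariance, up to sigma^2
Sigma = np.linalg.inv(V)                  # implied covariance, up to sigma^2

y = np.array([9.0, 12.0, 0.0, 11.0, 0.0, 8.0])   # zeros mark missing values
mis = np.array([False, False, True, False, True, False])
obs = ~mis
r_o = y[obs] - mu

# ML imputation: minimize (y - mu)' V (y - mu) over the missing entries
ml = mu - np.linalg.solve(V[np.ix_(mis, mis)], V[np.ix_(mis, obs)] @ r_o)

# Kriging (best linear unbiased predictor) with the implied covariance
krig = mu + Sigma[np.ix_(mis, obs)] @ np.linalg.solve(Sigma[np.ix_(obs, obs)], r_o)
print(ml, krig)
```

The agreement follows from the block-inverse identity -V_mm^{-1} V_mo = S_mo S_oo^{-1}; the Jacobian term does not involve the missing values, so it plays no role once rho is fixed.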
Geographically weighted regression: GWR
Spatial filtering enables easier implementation of GWR, as well as proper assessment of its degrees of freedom
• Step #1: compute the eigenvectors of a geographic connectivity matrix, say C
• Step #2: compute all of the interaction terms X_j E_k for the P covariates times the K candidate eigenvectors (e.g., with MC > 0.25)
• Step #3: select from the total set, including the individual eigenvectors, with stepwise regression
• Step #4: the geographically varying intercept term is given by:
\[
a_i = a + \sum_{k=1}^{K} E_{i,k}\, b_{E_{i,k}}
\]
• Step #5: the geographically varying covariate coefficient is given by factoring X_j out of its appropriate selected interaction terms:
\[
b_{i,j} X_j = \left( b_j + \sum_{k=1}^{K} E_{i,k}\, b_{X_j E_{i,k}} \right) X_j
\]
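The five steps can be sketched on simulated data. Several choices below are assumptions rather than the lecture's procedure: a 6-by-6 rook grid, eigenvectors taken from the doubly centered connectivity matrix, and the three leading eigenvectors kept in place of the MC > 0.25 screen and stepwise selection.

```python
import numpy as np

rng = np.random.default_rng(4)
side = 6
n = side * side

# Rook adjacency on a regular grid (hypothetical study area)
C = np.zeros((n, n))
for r in range(side):
    for c in range(side):
        i = r * side + c
        if c + 1 < side:
            C[i, i + 1] = C[i + 1, i] = 1.0
        if r + 1 < side:
            C[i, i + side] = C[i + side, i] = 1.0

# Step 1: eigenvectors of the centered connectivity matrix (assumption)
M = np.eye(n) - np.ones((n, n)) / n
eigval, eigvec = np.linalg.eigh(M @ C @ M)
order = np.argsort(eigval)[::-1]
K = 3                                     # stand-in for stepwise selection
E = eigvec[:, order[:K]]

# Simulated covariate and response with spatially varying coefficients
x = rng.normal(size=n)
y = (2.0 + 3.0 * E[:, 0]) + (1.0 + 2.0 * E[:, 1]) * x + rng.normal(scale=0.1, size=n)

# Steps 2-3: eigenvectors plus interaction terms x * E_k, fitted jointly
design = np.column_stack([np.ones(n), E, x, E * x[:, None]])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
a, b_E, b_x, b_xE = coef[0], coef[1:1 + K], coef[1 + K], coef[2 + K:]

# Step 4: geographically varying intercept
a_i = a + E @ b_E
# Step 5: geographically varying slope (x factored out of the interactions)
b_i = b_x + E @ b_xE
print(a_i[:3], b_i[:3])
```

Because every term is an ordinary regressor, the degrees of freedom are simply the number of selected columns, which is the point made on the slide.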
A Puerto Rico DEM example
Mean elevation (Y) is a function of: standard deviation of elevation (X), eigenvectors E1-E18, and 18 interaction terms (XE)
Results
intercept: 1, E2, E5-E7, E9, E11-E13, E15, E18
slope: 1, E4, E6, E9, E10
R² increases from 0.576 (with X only) to 0.911 (with geographically varying coefficients)
P(S-W) = 0.52 for the final model
GWR-spatial filter intercept (MC = 0.692)
GWR-spatial filter slope (MC = 0.721)
Spatial moving averages
Local smoothing of attribute values
\[
\mu_i = \frac{\sum_{j=1}^{n} w_{ij}\, y_j}{\sum_{j=1}^{n} w_{ij}}, \qquad i = 1, 2, \ldots, n
\]
where:
w_ij is an element of the spatial weights matrix
y_j is the attribute value for areal unit j
n is the number of areal units
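A minimal sketch of the spatial moving average follows. The three-unit weights matrix, which includes each unit as its own neighbor, and the function name `spatial_moving_average` are illustrative assumptions.

```python
import numpy as np

def spatial_moving_average(W, y):
    """Weighted local mean: mu_i = sum_j w_ij y_j / sum_j w_ij."""
    W = np.asarray(W, dtype=float)
    y = np.asarray(y, dtype=float)
    row_sums = W.sum(axis=1)
    if np.any(row_sums == 0):
        raise ValueError("every areal unit needs at least one nonzero weight")
    return (W @ y) / row_sums

# Three units in a row; each unit counts itself and its adjacent units
W = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]], dtype=float)
y = np.array([7.0, 1.0, 10.0])
mu = spatial_moving_average(W, y)
print(mu)   # [4.  6.  5.5]
```

Setting the diagonal of W to zero instead would smooth each unit using only its neighbors.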
A summary: what have we learned during the 5 lectures?
Lecture #1
• The nature of data and its information content.
• What is spatial autocorrelation?
• Visualizing spatial autocorrelation: Moran scatterplots, semivariogram plots, and maps.
• Defining and articulating spatial structure: topology and distance perspectives; contagion and hierarchy concepts.
• Necessary concepts from multivariate statistics.
• An example of the elusive negative spatial autocorrelation.
• Some comments about spatial sampling.
• Implications about space-time data structure.
Lecture #2
• Multivariate grouping, and location-allocation modeling.
• Going from the global to the local: variability and heterogeneity.
• Impacts of spatial autocorrelation on histograms.
• The LISA and Getis-Ord statistics.
• Cluster analysis: multivariate analysis, cluster detection, and spider diagrams.
  – An overview of geographic and space-time clusters.
• Regression diagnostics and geographic clusters.
Lecture #3
• Autoregressive specifications and normal curve theory (PROC NLIN).
• Auto-binomial and auto-Poisson models: the need for MCMC.
• Relationships between spatial autoregressive and geostatistical models.
• Spatial filtering specifications and linear and generalized linear models (PROC GENMOD).
• Autoregressive specifications and linear mixed models (PROC MIXED).
• Implications for space-time datasets (PROC NLMIXED).
Lecture #4
• Frequentist versus Bayesian perspectives.
• Implementing random effects models in GeoBUGS.
• Spatially structured and unstructured random effects: the CAR, the ICAR, and the spatial filter specifications.
Lecture #5
• The E-M algorithm
• The spatial E-M algorithm
• Kriging in ArcGIS
• Approaches to map smoothing