Lecture #5: MAPS WITH GAPS -- Small geographic area estimation, kriging, and kernel smoothing

Spatial statistics in practice

Center for Tropical Ecology and Biodiversity, Tunghai University & Fushan Botanical Garden

Topics for today’s lecture

• The E-M algorithm

• The spatial E-M algorithm

• Kriging in ArcGIS

• Geographically weighted regression (GWR)

• Approaches to map smoothing

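THEOREM 1 When missing values occur only in a response variable, Y, the iterative solution to the EM algorithm produces the regression coefficients calculated with only the complete data.

PF: Let b denote the vector of regression coefficients that is converged upon. Then if $\hat{Y}_m = X_m b$,

$$
b = \left[\begin{pmatrix} X_o \\ X_m \end{pmatrix}^{T}\begin{pmatrix} X_o \\ X_m \end{pmatrix}\right]^{-1}\begin{pmatrix} X_o \\ X_m \end{pmatrix}^{T}\begin{pmatrix} Y_o \\ X_m b \end{pmatrix}
= (X_o^{T}X_o + X_m^{T}X_m)^{-1}(X_o^{T}Y_o + X_m^{T}X_m b)
$$

$$
\Rightarrow\; (X_o^{T}X_o)\,b = X_o^{T}Y_o \;\Rightarrow\; b = (X_o^{T}X_o)^{-1}X_o^{T}Y_o
$$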
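A minimal numerical check of Theorem 1, assuming simulated data and NumPy. The function name `em_impute_response`, the sample size, and the positions of the missing responses are illustrative choices, not part of the lecture. The sketch iterates only the coefficient part of the EM algorithm (fit, then refill the missing responses) and compares the result with the complete-case fit.

```python
import numpy as np

def em_impute_response(X, y, missing, n_iter=500, tol=1e-12):
    """Alternate OLS fitting and refilling of missing responses.

    X includes the intercept column; `missing` is a boolean mask over y.
    """
    y_work = y.copy()
    y_work[missing] = y[~missing].mean()      # arbitrary starting values
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        b_new, *_ = np.linalg.lstsq(X, y_work, rcond=None)  # refit on filled-in data
        y_work[missing] = X[missing] @ b_new                # refill missing responses
        converged = np.max(np.abs(b_new - b)) < tol
        b = b_new
        if converged:
            break
    return b, y_work[missing]

rng = np.random.default_rng(0)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([2.0, 1.5]) + rng.normal(scale=0.5, size=n)
missing = np.zeros(n, dtype=bool)
missing[[3, 11, 27]] = True                   # hypothetical missing responses
y[missing] = np.nan

b_em, y_imputed = em_impute_response(X, y, missing)
b_complete, *_ = np.linalg.lstsq(X[~missing], y[~missing], rcond=None)
print(b_em, b_complete)
```

Under these assumptions the two coefficient vectors agree to numerical precision, and the imputations are simply `X[missing] @ b_complete`.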

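THEOREM 2 When missing values occur only in a response variable, Y, then by replacing the missing values with zeroes and introducing a binary 0/-1 indicator variable covariate $-I_m$ for each missing value m, such that $I_m$ is 0 for all but missing value observation m and 1 for missing value observation m, the estimated regression coefficient $b_m$ is equivalent to the point estimate for a new observation, and hence furnishes EM algorithm imputations.

PF: Let $b_m$ denote the vector of regression coefficients for the missing values, and partition the data matrices such that

$$
\begin{pmatrix} b_o \\ b_m \end{pmatrix}
= \left[\begin{pmatrix} X_o & 0_{om} \\ X_m & -I_{mm} \end{pmatrix}^{T}\begin{pmatrix} X_o & 0_{om} \\ X_m & -I_{mm} \end{pmatrix}\right]^{-1}\begin{pmatrix} X_o & 0_{om} \\ X_m & -I_{mm} \end{pmatrix}^{T}\begin{pmatrix} Y_o \\ 0_m \end{pmatrix}
$$

$$
= \begin{pmatrix} (X_o^{T}X_o)^{-1} & (X_o^{T}X_o)^{-1}X_m^{T} \\ X_m(X_o^{T}X_o)^{-1} & I_{mm} + X_m(X_o^{T}X_o)^{-1}X_m^{T} \end{pmatrix}\begin{pmatrix} X_o^{T}Y_o \\ 0 \end{pmatrix},
$$

and

$$
b_o = (X_o^{T}X_o)^{-1}X_o^{T}Y_o, \qquad b_m = X_m b_o .
$$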
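The indicator-variable device of Theorem 2 can be checked numerically. The sketch below assumes simulated data and NumPy; the sample size, coefficients, and the indices in `mis` are hypothetical. It replaces the "missing" responses by zero, appends one 0/-1 indicator column per missing value, and runs a single OLS fit.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 12
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.3, size=n)
mis = np.array([4, 9])                        # hypothetical missing observations
obs = np.setdiff1d(np.arange(n), mis)

y0 = y.copy()
y0[mis] = 0.0                                 # missing responses replaced by zero
neg_I = np.zeros((n, mis.size))
neg_I[mis, np.arange(mis.size)] = -1.0        # one 0/-1 indicator column per missing value

A = np.hstack([X, neg_I])
coef, *_ = np.linalg.lstsq(A, y0, rcond=None)
b_o, b_m = coef[:X.shape[1]], coef[X.shape[1]:]

# Complete-case fit for comparison
b_complete, *_ = np.linalg.lstsq(X[obs], y[obs], rcond=None)
print(b_m, X[mis] @ b_complete)
```

In this setup `b_o` reproduces the complete-case coefficients and `b_m` equals the predictions `X[mis] @ b_complete`, which is the content of the theorem.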

The EM algorithm solution

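$$
Y = X\beta - \sum_{m=1}^{M} y_m I_m + \varepsilon
$$

where:

the missing values are replaced by 0 in Y, and

$I_m$ is an indicator variable for missing value m that contains n − 1 0s and a single 1

$$
\begin{pmatrix} Y_o \\ 0_m \end{pmatrix}
= \begin{pmatrix} 1_o & X_o \\ 1_m & X_m \end{pmatrix}\begin{pmatrix} \alpha \\ \beta \end{pmatrix}
- \begin{pmatrix} 0_{o,m} \\ I_{m,m} \end{pmatrix} y_m
+ \begin{pmatrix} \varepsilon_o \\ 0_m \end{pmatrix}
$$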

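THEOREM 3 For imputations computed based upon Theorem 2, each standard error of the estimated regression coefficients $b_m$ is equivalent to the conventional standard deviation used to construct a prediction interval for a new observation, and as such furnishes the corresponding EM algorithm imputation standard error.

PF:

$$
\left[\begin{pmatrix} X_o & 0_{om} \\ X_m & -I_{mm} \end{pmatrix}^{T}\begin{pmatrix} X_o & 0_{om} \\ X_m & -I_{mm} \end{pmatrix}\right]^{-1}\sigma_\varepsilon^{2}
= \begin{pmatrix} (X_o^{T}X_o)^{-1} & (X_o^{T}X_o)^{-1}X_m^{T} \\ X_m(X_o^{T}X_o)^{-1} & I_{mm} + X_m(X_o^{T}X_o)^{-1}X_m^{T} \end{pmatrix}\sigma_\varepsilon^{2}
$$

$$
\Rightarrow\; s_{b_m} = \sqrt{\operatorname{diag}\!\left[I_{mm} + X_m(X_o^{T}X_o)^{-1}X_m^{T}\right]\sigma_\varepsilon^{2}}
$$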
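Theorem 3 can be checked in the same way. This sketch assumes simulated data and NumPy, with hypothetical missing indices. It compares the OLS standard errors of the indicator coefficients with the usual prediction standard error $s\sqrt{1 + x_m^{T}(X_o^{T}X_o)^{-1}x_m}$ computed from the complete cases.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 15, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.5, 1.0]) + rng.normal(scale=0.4, size=n)
mis = np.array([2, 7, 13])                    # hypothetical missing observations
obs = np.setdiff1d(np.arange(n), mis)

# Augmented regression: zeros for missing y, one 0/-1 indicator per missing value
y0 = y.copy()
y0[mis] = 0.0
neg_I = np.zeros((n, mis.size))
neg_I[mis, np.arange(mis.size)] = -1.0
A = np.hstack([X, neg_I])
coef, *_ = np.linalg.lstsq(A, y0, rcond=None)
resid = y0 - A @ coef
s2 = resid @ resid / (n - A.shape[1])         # error variance estimate
se_bm = np.sqrt(s2 * np.diag(np.linalg.inv(A.T @ A)))[p:]

# Conventional prediction standard error from the complete cases only
Xo, Xm = X[obs], X[mis]
bo, *_ = np.linalg.lstsq(Xo, y[obs], rcond=None)
ro = y[obs] - Xo @ bo
s2_o = ro @ ro / (obs.size - p)
se_pred = np.sqrt(s2_o * (1.0 + np.diag(Xm @ np.linalg.inv(Xo.T @ Xo) @ Xm.T)))
print(se_bm, se_pred)
```

Both regressions have the same residual sum of squares and the same error degrees of freedom (each indicator absorbs one observation), so the two sets of standard errors coincide.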

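What is the set of equations for the following case?

10    7
 7    y4 = ?

$$
\begin{aligned}
10 &= 8 - 0 + 2 \\
7 &= 8 - 0 - 1 \\
7 &= 8 - 0 - 1 \\
0 &= 8 - y_4 + 0
\end{aligned}
$$

$$
Y = \alpha\mathbf{1} - \sum_{m=1}^{M} y_m I_m + \varepsilon
$$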
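The four-cell case can be solved directly as a small least-squares problem. This is a minimal sketch assuming NumPy, with an intercept-only model plus one 0/-1 indicator column for the missing cell.

```python
import numpy as np

# Observed values 10, 7, 7; the missing fourth value is replaced by 0
y = np.array([10.0, 7.0, 7.0, 0.0])

# Columns: intercept, and the 0/-1 indicator for the missing observation
A = np.array([[1.0,  0.0],
              [1.0,  0.0],
              [1.0,  0.0],
              [1.0, -1.0]])

(alpha_hat, y4_hat), *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - A @ np.array([alpha_hat, y4_hat])
print(alpha_hat, y4_hat, residuals)
```

With no covariates, the intercept is the mean of the three observed values and the imputation equals that mean.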

Some preliminary assessments

simulations

Calculations from ANCOVA regression and the EM algorithm

Data source             Quantity   Reported value           OLS/NLS estimate
Schafer (1997) p. 43    ( )        48.1000                  48.10000
                        ( )        59.4260                  59.42600
p. 195                  y(2,3)     average (n=5) 226.2      228.0 (se = 32.86)
                        y(4,3)     average (n=5) 146.8      146.2 (se = 38.37)
                        y(5,3)     average (n=5) 190.8      192.5 (se = 34.11)
                        y(10,3)    average (n=5) 250.2      271.7 (se = 36.20)
                        y(13,3)    average (n=5) 234.2      241.3 (se = 35.18)
                        y(16,3)    average (n=5) 269.2      269.9 (se = 34.53)
                        y(18,3)    average (n=5) 192.4      201.9 (se = 32.91)
                        y(23,3)    average (n=5) 215.6      207.4 (se = 33.09)
                        y(25,3)    average (n=5) 250.0      255.7 (se = 33.39)

simulated imputations

$$
\hat{Y}_{reported} = 0.044 + 0.987\,Y_{imputed}; \qquad R^{2} = 0.99
$$

EM algorithm solution for aggregated georeferenced data: vandalized turnip plots

MTB > regress c4 8 c7-c14

Regression Analysis: C4 versus C7, C8, C9, C10, C11, C12, C13, C14

The regression equation is
C4 = 28.9 - 6.32 C7 - 18.2 C8 - 1.10 C9 - 11.4 C10 - 10.1 C11 + 28.9 C12 + 18.8 C13 + 27.8 C14

Predictor           Coef   SE Coef       T      P
Constant          28.900     2.404   12.02  0.000
C7 [I1-I6]        -6.317     3.254   -1.94  0.063
C8 [I2-I6]       -18.200     3.254   -5.59  0.000
C9 [I3-I6]        -1.100     3.399   -0.32  0.749
C10 [I4-I6]      -11.400     3.254   -3.50  0.002
C11 [I5-I6]      -10.100     3.399   -2.97  0.006
C12 [plot(6,5)]   28.900     5.887    4.91  0.000
C13 [plot(5,6)]   18.800     5.887    3.19  0.004
C14 [plot(6,6)]   27.800     5.887    4.72  0.000

Analysis of Variance for C4
Source   DF       SS      MS     F      P
C5        5   1289.0   257.8  8.92  0.000
Error    27    779.9    28.9
Total    32   2068.9

Individual 95% CIs For Mean Based on Pooled StDev
Level   N     Mean   StDev
1       5   28.900   4.407
2       6   22.583   6.391
3       6   10.700   2.585
4       5   27.800   5.082
5       6   17.500   6.648
6       5   18.800   5.922
Pooled StDev = 5.375

Residual spatial autocorrelation

What does this mean?

SAR-based missing data estimation

$$
Y = \rho WY + (I - \rho W)X\beta - \sum_{m=1}^{M} y_m\,(I_m - \rho W^{*}_{om}) + \varepsilon
$$

where $y_m$ is a missing value (replaced by 0 in Y),

$I_m$ is an indicator variable for $y_m$, and

$W^{*}_{om}$ is the mth column of geographic weights matrix W

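The Jacobian term

[Equation: the Jacobian term $J_2$, written with the determinant of the partitioned matrix $\begin{pmatrix} V_{oo} & V_{om} \\ V_{mo} & V_{mm} \end{pmatrix}$ and evaluated from the sums $\sum_{i=1}^{n} LN(1 - \rho\lambda_i)$ and $\sum_{k=1}^{n_m} LN(1 - \rho\omega_k)$, scaled by $2/(n - n_m)$; the signs and exponents are not recoverable from the slide text.]

NOTE: the denominator becomes $(n - n_m)$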

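What is the set of equations for the following case?

7    y2 = ?    10

$$
\begin{aligned}
7 &= \rho\,0 + \mu(1-\rho) + \rho y_2 + e_1 \\
0 &= \rho\,\frac{y_1 + y_3}{2} + \mu(1-\rho) - y_2 + e_2 \\
10 &= \rho\,0 + \mu(1-\rho) + \rho y_2 + e_3
\end{aligned}
$$

$$
Y = \rho WY + (I - \rho W)\mathbf{1}\mu - \sum_{m=1}^{M} y_m\,(I_m - \rho W^{*}_{om}) + \varepsilon
$$

$$
\begin{pmatrix} Y_o \\ 0_m \end{pmatrix}
= \rho\begin{pmatrix} W_{oo} & W_{om} \\ W_{mo} & W_{mm} \end{pmatrix}\begin{pmatrix} Y_o \\ Y_m \end{pmatrix}
- \begin{pmatrix} 0_{om} \\ I_{mm} \end{pmatrix} Y_m
+ (1-\rho)\alpha\begin{pmatrix} 1_o \\ 1_m \end{pmatrix}
+ \begin{pmatrix} X_o \\ X_m \end{pmatrix}\begin{pmatrix} \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}
+ \varepsilon
$$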

spatial autoregressive (AR)

$$
\hat{Y}_m = X_m\hat{\beta} + \hat{\Sigma}_{mo}\hat{\Sigma}_{oo}^{-1}(Y_o - X_o\hat{\beta})
$$

kriging

estimate with semivariogram model

fit semivariogram model with
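The predictor that the autoregressive and kriging approaches share, the mean at the missing locations plus $\Sigma_{mo}\Sigma_{oo}^{-1}$ times the observed residuals, can be sketched as follows. This is an illustration only: the function name `blup`, the exponential covariance, and the `sill` and `range_` values are assumptions, and in practice the covariance would come from a fitted semivariogram or autoregressive model.

```python
import numpy as np

def blup(coords_o, y_o, coords_m, X_o, X_m, sill=1.0, range_=2.0):
    """X_m b + S_mo S_oo^{-1} (y_o - X_o b) with an assumed exponential covariance."""
    def cov(a, b):
        d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
        return sill * np.exp(-d / range_)

    S_oo = cov(coords_o, coords_o)
    S_mo = cov(coords_m, coords_o)
    # Generalized least squares estimate of the mean coefficients
    Si_X = np.linalg.solve(S_oo, X_o)
    Si_y = np.linalg.solve(S_oo, y_o)
    b = np.linalg.solve(X_o.T @ Si_X, X_o.T @ Si_y)
    resid = y_o - X_o @ b
    return X_m @ b + S_mo @ np.linalg.solve(S_oo, resid)

# A 3-by-3 grid with a missing centre value (data from the CAR example in this lecture)
coords = np.array([(r, c) for r in range(3) for c in range(3)], dtype=float)
values = np.array([10, 3, 7, 6, np.nan, 4, 9, 5, 5])
obs = ~np.isnan(values)
X = np.ones((9, 1))                           # constant mean only

y5_hat = blup(coords[obs], values[obs], coords[~obs], X[obs], X[~obs])
# Sanity check: with no nugget, predicting at an observed site returns its value
check = blup(coords[obs], values[obs], coords[obs][:1], X[obs], X[obs][:1])
print(y5_hat, check)
```

The sanity check reflects the exact-interpolation property of this predictor when the covariance model has no nugget effect.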

The pure spatial autocorrelation CAR model

$$
\hat{Y}_m = \beta_0\mathbf{1}_m + \rho\,(I_{mm} - \rho C_{mm})^{-1}C_{mo}(Y_o - \beta_0\mathbf{1}_o)
$$

Dispersed missing values:

$$
\hat{Y}_m = \beta_0\mathbf{1}_m + \rho\,C_{mo}(Y_o - \beta_0\mathbf{1}_o)
$$

NOTE: exactly the same algebraic structure as the kriging equation

Imputation = the observed mean plus a weighted average of the surrounding residuals

Employing rook’s adjacency and a CAR model, what is the equation for the following imputation?

10    3         7
 6    y5 = ?    4
 9    5         5

$$
\hat{y}_5 = b_0 + \rho\,[(3 - b_0) + (6 - b_0) + (4 - b_0) + (5 - b_0)]
$$
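The rook's-adjacency imputation can be evaluated numerically once values for $b_0$ and $\rho$ are available. In the sketch below, $b_0$ is taken as the mean of the eight observed cells and $\rho = 0.2$ is a hypothetical value chosen only for illustration; in practice both would be estimated from the data.

```python
grid = [[10, 3, 7],
        [6, None, 4],
        [9, 5, 5]]

rho = 0.2                                     # hypothetical autocorrelation parameter
observed = [v for row in grid for v in row if v is not None]
b0 = sum(observed) / len(observed)            # observed mean, used as b0 (assumption)

# Rook's adjacency: cells sharing an edge with the missing cell at (1, 1)
r, c = 1, 1
neighbours = [grid[r - 1][c], grid[r + 1][c], grid[r][c - 1], grid[r][c + 1]]

y5_hat = b0 + rho * sum(v - b0 for v in neighbours)
print(b0, neighbours, y5_hat)
```

The four rook neighbours are 3, 5, 6, and 4; the corner cells do not enter the imputation directly, only through the mean.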

The spatial filter EM algorithm solution

$$
Y = X\beta_X + E_k\beta_{E_k} - \sum_{m=1}^{M} y_m I_m + \varepsilon
$$

where:

the missing values are replaced by 0 in Y, and

$I_m$ is an indicator variable for missing value m that contains n − 1 0s and a single 1

Imputation of turnip production in 3 vandalized field plots

Field plot   Conventional EM estimate   Spatial SAR-EM estimate (ρ_SAR = 0.443)   Spatial filter: 3 selected eigenvectors
(6,5)        28.9                       29.99                                     24.31
(5,6)        18.8                       17.66                                     13.62
(6,6)        27.8                       28.26                                     23.93

Cressie’s PA coal ash

Model            Estimate
Cressie          10.27%
Spherical        10.62%
Gaussian         10.18%
Exponential      10.12%
SAR              10.17%
Spatial filter   10.71%

Min     Mean    Max
7.00    9.78    17.61

Missing 1992 georeferenced density of milk production in Puerto Rico: constrained (total = 1918)

Predicted from 1991 DMILK   Predicted from spatial filter   Predicted from both
235                         70                              385
1,339                       1,848                           1,065
344                         0                               468

predictions; Moran scatterplot

USDA-NASS estimation of Pennsylvania crop production

covariate; total constraints; map gaps

USDA-NASS estimation of Michigan crop production

If this is 2% milk, how much am I paying for the other 98%?

Michigan imputations: different response variable specifications

USDA-NASS estimation of Tennessee crop production

Tennessee imputations

An EM specification when some data for both Y and the Xs are missing

[Equation: the regression with Y as the response, with the observations partitioned into complete cases (o), cases with missing X (m,x), and cases with missing Y (m,y); missing entries are replaced by 0 and paired with indicator blocks I.]

[Equation: the companion specification with X as the response, using the same partition and indicator blocks.]

Concatenation results:

[Equation: the two specifications stacked into a single system, with terms labeled Y_eq and X_eq.]

The spatial model

[Equations: three linked specifications, one each for yield (y), acres (a), and production (p), each with its own residual term. The slide annotations identify the components of each specification: power transformation, spatial autocorrelation, totals constraints, and covariate.]

Imputation of turnip production in 3 vandalized field plots

Field plot   Spatial filter: 3 selected eigenvectors
(6,5)        24.31
(5,6)        13.62
(6,6)        23.93

Cross-validation of spatial filter for observed turnip data

The accompanying table contains a test set of sixteen random samples (#17-32) used to evaluate three maps. The “Actual” column lists the measured values for the test locations identified by “Col, Row” coordinates. The differences between these values and those predicted by the three interpolation techniques form the residuals shown in parentheses. The “Average” column compares the whole-field arithmetic mean of 23 (guess 23 everywhere) with the actual value at each test location.

Kriging: best linear unbiased spatial interpolator (i.e., predictor)

ArcGIS: Geostatistical Wizard

anisotropy check

density of German workers

Cross-validation check of kriged values

This is one use of the missing spatial data imputation methods.

Unclipped kriged surface

kriged (mean response) surface; prediction error surface

exponential semivariogram model; values increase with darkness of brown

extrapolation

Clipped kriged surface

kriged (mean response) surface; prediction error surface

values increase with darkness of brown

Detrended population density across China

anisotropy check

Cross-validation check of kriged values

This is one use of the missing spatial data imputation methods.

Unclipped kriged surface

kriged (mean response) surface; prediction error surface

exponential semivariogram model; values increase with darkness of brown

extrapolation

Clipped kriged surface

kriged (mean response) surface; prediction error surface

values increase with darkness of brown

THEOREM 4

The maximum likelihood estimate for missing georeferenced values described by a spatial autoregressive model specification is equivalent to the best linear unbiased predictor kriging equation of geostatistics.

Geographically weighted regression: GWR

Spatial filtering enables easier implementation of GWR, as well as proper assessment of its degrees of freedom

• Step #1: compute the eigenvectors of a geographic connectivity matrix, say C

• Step #2: compute all of the interaction terms $X_jE_k$ for the P covariates times the K candidate eigenvectors (e.g., with MC > 0.25)

• Step #3: select from the total set, including the individual eigenvectors, with stepwise regression

• Step #4: the geographically varying intercept term is given by:

$$
a_i = a + \sum_{k=1}^{K} E_{i,k}\,b_{E_k}
$$

• Step #5: the geographically varying covariate coefficient is given by factoring $X_j$ out of its appropriate selected interaction terms:

$$
b_{i,j}X_j = \left(b_j + \sum_{k=1}^{K} E_{i,k}\,b_{X_jE_k}\right)X_j
$$
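The five steps can be sketched on simulated data. Several simplifications here are assumptions rather than the lecture's procedure: the eigenvectors come from the doubly centered matrix $(I - \mathbf{11}^{T}/n)\,C\,(I - \mathbf{11}^{T}/n)$ of a rook connectivity matrix on a regular grid, the K = 3 eigenvectors with the largest eigenvalues stand in for candidates screened by MC, and the stepwise selection of Step #3 is skipped, so every term is kept.

```python
import numpy as np

def rook_connectivity(rows, cols):
    """Binary rook adjacency matrix for a rows-by-cols regular grid."""
    n = rows * cols
    C = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if r + 1 < rows:
                C[i, i + cols] = C[i + cols, i] = 1.0
            if c + 1 < cols:
                C[i, i + 1] = C[i + 1, i] = 1.0
    return C

rows = cols = 6
n = rows * cols
C = rook_connectivity(rows, cols)

# Step 1: eigenvectors of the centered connectivity matrix (assumed form)
M = np.eye(n) - np.ones((n, n)) / n
eigvals, eigvecs = np.linalg.eigh(M @ C @ M)
K = 3
E = eigvecs[:, np.argsort(eigvals)[::-1][:K]]  # K largest eigenvalues

# Simulated covariate and response with a spatially varying slope
rng = np.random.default_rng(2)
x = rng.normal(size=n)
y = 1.0 + (2.0 + 3.0 * E[:, 0]) * x + rng.normal(scale=0.1, size=n)

# Steps 2-3: interaction terms x * E_k; no stepwise selection in this sketch
design = np.column_stack([np.ones(n), E, x, E * x[:, None]])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
a, b_E, b, b_XE = coef[0], coef[1:1 + K], coef[1 + K], coef[2 + K:]

# Steps 4-5: geographically varying intercept and slope
a_i = a + E @ b_E
b_i = b + E @ b_XE
fitted = a_i + b_i * x
print(a_i[:3], b_i[:3])
```

Mapping `a_i` and `b_i` over the grid gives the geographically varying coefficient surfaces; the fitted values from these coefficients are identical to those of the underlying linear model.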

A Puerto Rico DEM example

Mean elevation (Y) is a function of: standard deviation of elevation (X), eigenvectors E1-E18, and 18 interaction terms (XE)

Results

intercept: 1, E2, E5-E7, E9, E11-E13, E15, E18

slope: 1, E4, E6, E9, E10

R² increases from 0.576 (with X only) to 0.911 (with geographically varying coefficients)

P(S-W) = 0.52 for the final model

GWR-spatial filter intercept (MC = 0.692)

GWR-spatial filter slope (MC = 0.721)

Spatial moving averages

Local smoothing of attribute values

$$
\mu_i = \frac{\sum_{j=1}^{n} w_{ij}\,y_j}{\sum_{j=1}^{n} w_{ij}}, \qquad i = 1, 2, \ldots, n
$$

where:

$w_{ij}$ is an element of the spatial weights matrix

$y_j$ is the attribute value for areal unit j

n is the number of areal units
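A minimal sketch of the spatial moving average, assuming NumPy. The function name `spatial_moving_average`, the three-unit weights matrix, and the attribute values are hypothetical; whether a unit's own value enters its average depends on whether the diagonal of the weights matrix is nonzero, and here it is set to 1.

```python
import numpy as np

def spatial_moving_average(W, y):
    """Weighted local mean: sum_j w_ij * y_j divided by sum_j w_ij, for each unit i."""
    W = np.asarray(W, dtype=float)
    y = np.asarray(y, dtype=float)
    row_sums = W.sum(axis=1)
    if np.any(row_sums == 0):
        raise ValueError("every areal unit needs at least one nonzero weight")
    return (W @ y) / row_sums

# Three areal units in a line; each unit is averaged with itself and its neighbours
W = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]
y = [7, 4, 10]

mu = spatial_moving_average(W, y)
print(mu)
```

Setting the diagonal of `W` to zero instead would replace each value by the average of its neighbours only.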

A summary: what have we learned during the 5 lectures?

Lecture #1
• The nature of data and its information content.
• What is spatial autocorrelation?
• Visualizing spatial autocorrelation: Moran scatterplots, semivariogram plots, and maps.
• Defining and articulating spatial structure: topology and distance perspectives; contagion and hierarchy concepts.
• Necessary concepts from multivariate statistics.
• An example of the elusive negative spatial autocorrelation.
• Some comments about spatial sampling.
• Implications about space-time data structure.

Lecture #2
• Multivariate grouping, and location-allocation modeling.
• Going from the global to the local: variability and heterogeneity.
• Impacts of spatial autocorrelation on histograms.
• The LISA and Getis-Ord statistics.
• Cluster analysis: multivariate analysis, cluster detection, and spider diagrams.
  – An overview of geographic and space-time clusters.
• Regression diagnostics and geographic clusters

Lecture #3
• Autoregressive specifications and normal curve theory (PROC NLIN).
• Auto-binomial and auto-Poisson models: the need for MCMC.
• Relationships between spatial autoregressive and geostatistical models
• Spatial filtering specifications and linear and generalized linear models (PROC GENMOD).
• Autoregressive specifications and linear mixed models (PROC MIXED).
• Implications for space-time datasets (PROC NLMIXED)

Lecture #4
• Frequentist versus Bayesian perspectives.
• Implementing random effects models in GeoBUGS.
• Spatially structured and unstructured random effects: the CAR, the ICAR, and the spatial filter specifications

Lecture #5
• The E-M algorithm
• The spatial E-M algorithm
• Kriging in ArcGIS
• Approaches to map smoothing