
G606 Geostatistics and Spatial Modeling

Lecture Notes

Spring Semester
Idaho State University

Department of Geosciences

© 2004 John Welhan
Idaho Geological Survey


Contents

Syllabus
    General Information
    Course Content

Basic Statistics Review
    Self-Evaluation Quiz
    Univariate vs. Multivariate Data and Geostatistics

1. Introduction, Definitions, Software, Basic Statistics Review
    S.1.1. Generalized geostatistical analysis and modeling sequence

2. Lecture Notes: Parametric Tests, Nonparametric Tests
    Problem Set I. Review: Statistical summarization
    S.2.1. Summary of hypothesis testing
    S.2.2. Interpreting p-values returned by a statistical test
    S.2.3. The power of an hypothesis test
    S.2.4. Classification and measures of classification performance

3. Bivariate Data, Regionalized Variables, EDA
    Problem Set II. Exploratory Data Analysis
    S.3.1. Hypothesis testing of regression parameters

4. Autocorrelation and Spatial Continuity
    Problem Set III. Spatial Correlation 1 - Experimental Variography

5. Identifying Spatial Autocorrelation Structure
    Problem Set IV. Spatial Correlation 2 - Variogram Modeling
    S.5.1. Words of Wisdom on Experimental Variogram Analysis
    S.5.2. Variogram Analysis Software

6. Modeling Spatial Autocorrelation Structure
    Problem Set V. Introduction to ArcGIS: EDA and Variogram Modeling
    S.6.1. Words of Wisdom on Variogram Modeling

7. Kriging Concepts, Introduction to the Kriging System of Equations
    Problem Set VI. Kriging Estimation: the Neighborhood Search Process

8. Cross-autocorrelation and Cross-variograms
    Problem Set VII. Cross-variogram Analysis with ArcGIS

9. Practical Implementation of Kriging
    Problem Set VIII. Cross-Validation and Validation of Kriging Models

10. Assessing Kriging Uncertainty; Probability Mapping
    Problem Set IX. Introduction to Indicator Variables and Indicator Kriging

11. Introduction to Stochastic Simulation
    Problem Set X. Introduction to Sequential Gaussian Simulation
    S.11.1. Evaluating bivariate normality

12. Indicator Variables, Multiple Indicator Kriging and Simulation
    Problem Set XI. Introduction to Sequential Indicator Simulation

13. Other Simulation Techniques

14. Change of Support

Geostatistics on the Internet


G 606 - Geostatistics and Spatial Modeling 4 Credits Spring Semester

Instructor: John Welhan, Idaho Geological Survey
Textbook: Isaaks and Srivastava (1989)
Meeting: Tue/Thur. 8:30-10:00 am, plus a 2-hour lab, schedule TBA

An introduction to the description, analysis, and modeling of geospatial data and of the resulting uncertainty in the models. Theory and its correct application will be integrated with the use of various software tools (including GIS) and appropriate examples to emphasize the cross-disciplinary applicability of geostatistical analysis and modeling.

Prerequisites are an introductory applied statistics course, familiarity with the Windows operating environment, and basic spreadsheet-based data manipulation. Knowledge of ArcView or other GIS software is strongly encouraged. All software will be on a Windows/Intel platform.

Readings will be assigned to stimulate in-class discussion, and all students will research, present, and lead a class discussion on published articles of their choosing that focus on concepts and applications of geostatistical analysis and modeling.

Office hours: one hour after each class and by appointment.

Grading:
40% - weekly computer lab tutorial / problem sets
40% - project, oral presentation, final report
      **Students must bring a suitable spatial data set to be analyzed as a term project**
10% - analysis & presentation of published literature
10% - discussion of readings, class participation

Assigned Readings taken from: (* Oboler Library reserve; + instructor check out):

General Statistics References:
* Till, Roger (1974) Statistical Methods for the Earth Scientist; Wiley, NY
  Davis, J.C. (2002) Statistics and Data Analysis in Geology (3rd ed.); Wiley & Sons, NY

Geostatistics (G), Kriging (K), Stochastic Simulation (S), and Software (W) References:
* Isaaks and Srivastava (1989) Introduction to Applied Geostatistics, Oxford Univ. Press (G,K)
+ Deutsch and Journel (1997) Geostatistical Software Library and User's Guide, Oxford (W)
+ Deutsch (2003) Reservoir Modeling, Oxford (G,S)
+ Goovaerts, P. (1997) Geostatistics for Natural Resources Evaluation, Oxford (G,K,S)
* Houlding, S.W. (1999) Practical Geostatistics, Springer (G,K,S -- although geology-oriented)
* Clark, I. (1979) Practical Geostatistics, Applied Science Publishers (G,K with a mining focus)
* Yarus, J.M. and Chambers, R.L. (1994) Stochastic Modeling and Geostatistics, AAPG (G,K,S)
+ Pannatier, Y. (1996) VarioWin: Software for Spatial Data Analysis in 2D, Springer (W)

GIS-Remote Sensing Applications:
+ Heuvelink (1998) Error Propagation in Environmental Modelling, Taylor & Francis, Bristol, PA
+ Stein et al. (1999) Spatial Statistics for Remote Sensing, Kluwer Academic Publ., Boston, MA
+ Johnston, K. et al. (2001) Using ArcGIS Geostatistical Analyst, GIS by ESRI, Redlands, CA


Course Content

Week 1. Overview, Course Topics and Case Study:
Overview of applications and techniques to be covered: spatial continuity (autocorrelation) analysis; statistical modeling vs. data modeling; estimation; simulation; prediction uncertainty.

Week 1-3. Statistics Review / Exploratory Data Analysis:
Statistical summarization, analysis, and modeling; representing spatial data, continuous vs. categorical data; frequency distributions; correlation and conditional correlation; transformations (logarithmic, normal-score, indicator, rank-order); evaluating classification performance; software applications.

Week 4-6. Analysis and Quantification of Spatial Continuity:
Statistical measures of autocorrelation; experimental variograms; autocorrelation function models, modeling anisotropy and nested functions; indicator variograms; cross-autocorrelation (co-spatial variability of multiple variables).

Week 7-9. Best Linear Unbiased Spatial Estimation:
Techniques of spatial estimation, limitations of biased estimators; kriging as a 'best', linear unbiased estimator; the kriging system of equations; use and misuse of the kriging variance; sensitivity of kriging to variogram and search strategy decisions; cross-validation and validation as measures of kriging performance; cokriging.

Week 10. Spring Break

Week 11-14. Limitations of Kriging / Stochastic Simulation / Indicator Estimation:
Simulation vs. kriging, differences, philosophy, applications; adaptation of the kriging system of equations to simulation; multiple indicator kriging, advantages; basic Gaussian and indicator simulation algorithms; other simulation approaches.

Week 15. Change of Support:
Data measurement scale; impacts on modeling and choice of scales; regularization (changing scales); numerical techniques for addressing scaling issues (block kriging, averaging techniques).

Week 16. Analysis of Uncertainty:
Probability mapping; threshold exceedance; estimation vs. simulation approaches to uncertainty analysis; error propagation.


Statistics Review and Relevant Background

You should be familiar with the statistical terminology below, including how to calculate, use, and describe these most basic statistical concepts. If you have trouble with the Self-Evaluation Quiz (p.4), your stress level in this course will be high. If you are not already familiar with the non-underlined terminology, you will need to be by the end of the review phase of this course. The review material covered in Chapters 1-3, below, includes applied statistical material that you will need to be familiar with and on which you will be tested in Week 3. If you do poorly on Week 3's test, a one-on-one appointment will be arranged to evaluate your level of preparedness for this course.

Basic statistical terminology:

Classical statistics - the analysis and description of variability ("modeling") in order to estimate the likelihood of a future outcome ("predicting"), based on that model. For example, fitting a normal probability distribution model to a histogram creates a statistical model of variability, from which a prediction can be made of the probability that a specified value will be exceeded in future sampling. Classical statistics is predicated on the assumption that all outcomes in a sample are independent of one another (i.e., measurements made at one location or time have no bearing on other measurements made nearby).

Populations - parent, sample; frequency distributions, histogram, probability distribution function, cumulative distribution function, homogeneity (unimodal / multi-modal), ergodicity, homoscedasticity, stationarity

Measures of central tendency - mode / median / mean / expected value

Measures of dispersion - variance / standard deviation / inter-quartile range / skewness / kurtosis

Parametric vs. non-parametric statistics - the Gaussian distribution / Gaussian probability tables

Bivariate correlation - regression, covariance, statistical tests of regression

Hypothesis testing - level of significance, p-values, tests of normality and population similarity, the Student's t-test, the χ² statistic, the Kolmogorov-Smirnov (K-S), Mann-Whitney, and other non-parametric tests

Data transformations - the standard normal deviate / the lognormal transformation

You are responsible for the prerequisite knowledge required in this course. Use the review chapters and suggested reference material or your own reference material to brush up on these statistical concepts and applications--they are vital to the subsequent development of the concepts and application of spatial statistics!


Self-Evaluation Quiz:
Note: if you cannot answer questions (a) through (h), this course may present difficulties!

a) What is the name commonly used for the expected value of Gaussian probability distributions?

b) What measure of dispersion is used to characterize a bell-shaped probability distribution?

c) What values of skewness and kurtosis would a normal probability distribution have?

d) What is a log-normal probability distribution? Which of the following means most closely approximates the mode of a log-normal distribution: arithmetic mean, geometric mean, harmonic mean?

e) What percentage of outcomes fall between the first and third quartiles of a sample population?

f) What statistical test could be applied to determine if two sets of measurements were drawn from Gaussian populations having similar variances but different means?

g) Consider a histogram that has two peaks; does this indicate a statistically homogeneous population? Could such a distribution arise in a stationary population?

h) The following frequency tables describe two histograms of surface temperature measured in two small, adjacent lakes. Fill in the required information for the samples' descriptive statistics.

Lake A:                               Lake B:
T, oC   Frequency                     T, oC   Frequency
12      1     n = 20                  12      1     n = 20
13      2     mode = ?                13      5     mode = ?
14      2     mean = ?                14      5     mean = ?
15      4     variance = ?            15      4     variance = ?
16      5                             16      2
17      4                             17      2
18      1                             18      1
19      1                             19      0

i) Apply a suitable (quantitative) statistical test to evaluate the hypothesis that both sets of temperatures are drawn from the same population.

j) In fact, both lakes overlie the same aquifer and are fed by the same ground water source (at a uniform 12 oC); they are the surface expression of the same ground water table. How is it possible that the temperature data portray such a different picture between the two samples?
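
For readers who want to check their work with software (this is not part of the original notes), a minimal Python/numpy/scipy sketch of the descriptive statistics in (h) and one possible nonparametric test for (i) follows; the array and variable names are arbitrary.

    # Sketch (not in the original notes): descriptive statistics from the frequency
    # tables above, plus one possible quantitative test for question (i).
    import numpy as np
    from scipy import stats

    temps = np.arange(12, 20)                      # 12..19 oC
    freq_A = np.array([1, 2, 2, 4, 5, 4, 1, 1])    # Lake A frequencies (n = 20)
    freq_B = np.array([1, 5, 5, 4, 2, 2, 1, 0])    # Lake B frequencies (n = 20)

    def describe(t, f):
        x = np.repeat(t, f)                        # expand the frequency table into raw data
        return {"n": x.size,
                "mode": t[np.argmax(f)],
                "mean": x.mean(),
                "variance": x.var(ddof=1)}         # sample variance (n-1 denominator)

    print(describe(temps, freq_A), describe(temps, freq_B))

    # A nonparametric Mann-Whitney U test, which does not assume normality:
    u, p = stats.mannwhitneyu(np.repeat(temps, freq_A), np.repeat(temps, freq_B))
    print("Mann-Whitney U =", u, " p-value =", p)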


Univariate vs. Multivariate Data and Geostatistics

Univariate observations - a single dependent variable measured in a sample drawn from a population (e.g. gold assays in drill cores) and analyzed without regard to position.

Multivariate observations - a single dependent variable measured with respect to the independent variables of position; OR multiple dependent variables measured with or without accompanying position information (e.g. gold assays in rock samples referenced to x, y, z position; water levels in a well referenced to temporal "position"; gold, silver, and sulfur assays in each of multiple ore samples).

e.g. all GIS (spatial) data are multivariate. An attribute measured at spatial coordinates at different times is a multivariate variable where space and time are the independent variables. Thus, water level, Zi, in multiple wells, i, measured at three different times, tn, can be viewed as three multivariate variables with respect to space, Zi[xi, yi]t1, Zi[xi, yi]t2, Zi[xi, yi]t3, or as a single multivariate variable Z[x, y, t].

Similarly, time-series measurements of water level at a single location are bivariate, with "position" represented by time as the independent variable.

Geospatial data are a type of multivariate data. There may be only one variable of interest (the dependent variable) but its values are related to position (independent variables of location and/or time).

Note that multivariate data can be analyzed as separate, univariate distributions by considering one variable at a time. However, the interrelationships between independent variables in space can often be exploited to learn more of the relevant physical process(es) and spatial statistical structure; such relationships can be analyzed using a variety of multivariate statistical methods (e.g. multiple linear regression, generalized analysis of variance, discriminant analysis, factor analysis, canonical correlation, etc.) which are not within the scope of this course.

Geostatistics - the analysis and modeling of spatial (or 'geospatial') data that are distributed in a coordinate system of space and/or time; a.k.a. 'Spatial Statistics'. Geostatistics is a class of statistical methods that considers the interrelationship and spatial dependencies among one or more correlated dependent variables and the independent variables of position (x, y, z, or time).

This course does not deal with other multivariate methods such as multiple linear regression, discriminant analysis, or factor analysis. See Koch and Link (1971), Volume 2, Chapter 10, for an excellent summary of the concepts in (non-spatial) multivariate data analysis and good introductions to these multivariate methods of analysis.


1. Review-I - Introduction, Definitions, Software, Basic Statistics Review

1.1. What is Geostatistics?

A class of statistical tools that are used to analyze and model spatial variability, to better understand physical process, and to quantify prediction uncertainty in spatial models.

- geostatistical description and prediction does not model a physical or biological process and is of little use in extrapolative prediction (prediction beyond the spatial bounds of available measurements); it is best suited for interpolative prediction.

- geostatistics does not replace deterministic modeling techniques where the process model, its input variables, and the parameter values that describe spatial variability are sufficiently well constrained to construct an accurate quantitative prediction. However, even with a good process model, the random or non-trend component of spatial variability is best modeled with a geostatistical approach.

- geostatistics is most valuable in the analysis of attribute values which are distributed--and physically correlated--in space and/or time (as are most GIS data).

1.2. Five Steps to Geostatistical Modeling
- the analysis of the statistical characteristics of spatial variability is called "structural analysis" or "variogram modeling." From this structural analysis, predictions of attribute values at unsampled locations can be made using two broad classes of modeling techniques known as "kriging" and "stochastic simulation."

- some steps, like (1), (2), and (3) below, are mandatory and may have to be repeated iteratively before appropriate decisions can be made; (4) and (5) are optional depending on the goals of the study and the types of predictions and uncertainty analyses required (see Section S.1.1.):

(1) EDA-I (Exploratory Data Analysis): classical descriptive statistics and analysis of stationarity and population homogeneity, population outliers, basic statistical hypothesis testing, and correlations among attributes;

(2) EDA-II: understanding the spatial nature of variability, spatial data density and sample availability, data clustering, spatial trends and discontinuities; identifying possible population regroupings and data and coordinate transformations; repeat (1) if necessary;

(3) Variography (EDA-III): spatial autocorrelation analysis, identification and treatment of spatial outliers, structural insights into process, population regroupings, data transformations; repeat (1) and/or (2) where necessary;

(4) Decision time: what can be done with the data? is spatial autocorrelation present? is it strongly expressed? can it be described confidently? will geostatistical prediction be of use? which prediction method will be used and for what reasons?

(5) Prediction:
The BLUE method -- two maps based on linear, weighted averaging, one to predict large-scale spatial variability and the other, a statistical measure of prediction uncertainty;
The Stochastic method -- an unlimited number of maps based on BLUE prediction or other methods to represent spatial variability at all scales and to assess prediction uncertainty better than BLUE methods can.


1.3. Kriging vs Simulation:
- kriging is a statistical weighting procedure that produces a best linear unbiased estimate (B.L.U.E.) and the variance of the estimate
- stochastic simulation is a probabilistic depiction of local variability superimposed on the regional spatial variability described by kriging
- kriging is a smoothing interpolator that predicts large-scale spatial variability
  - honors global statistics and local data
  - produces a smoothed representation of large-scale spatial variability
  - quantifies statistical uncertainty at each location where an estimate is made
  - used for best estimation of expected values
- simulation is a probabilistic representation of local variability
  - honors global and local statistics and local data
  - reproduces local variability
  - best for representing local variability and local uncertainty
- both methods can incorporate and honor different types of hard data, but only simulation can incorporate soft information and other constraints (eg: physical geometry)

1.4. Statistics Review -- Some Basic Definitions:
- probability: the expectation of an outcome of a random event or measurement
- stochastic: synonymous with probabilistic
- dependent variable: a qualitative or quantitative measure of a physical attribute
- sampling item (or event): a single outcome or measurement
- (global) population: the set of all possible outcomes of a process that can be sampled
- sample (population): a set of measurements or outcomes drawn from a global population. A specimen is an event; a sample is an experiment in which multiple specimens are collected, from which we attempt to estimate the statistical characteristics of the underlying population
- regionalized variable: a variable whose value is dependent on spatial and/or temporal location

1.5. Additional Definitions:
- variance: the variation of a single variable about its mean
- covariance [as in bivariate regression]: the joint variation of two correlated variables about their common mean
- (auto)Covariance [as in geostatistics]: the variation of a single regionalized variable
- cross-Covariance [as in geostatistics]: joint variation of two correlated regionalized variables
- correlation structure: a statistical description of the Covariance of a regionalized variable

1.6. Reference Sources
- Basic Statistics and Review: Till (1974) and Davis (2002) are highly recommended for their clear presentation of concepts and methods, using numerous (albeit geological) examples. An excellent review of statistical techniques developed by the Biological Sciences Department at Manchester Metropolitan Univ. is available on the web. You may find it helpful in scraping off the rust: http://asio.jde.aca.mmu.ac.uk/teaching.htm (click on M.Sc. Research Methods)


- Introductory Geostatistics: Isaaks and Srivastava (1989) is an excellent text and reference source for the beginner or experienced user. It covers the fundamentals of geostatistics through variography, kriging, and cokriging. More advanced treatments are found in Cressie (1993) and in Journel and Huijbregts (1978).

- Advanced Geostatistics and Simulation: Although not on library reserve, Deutsch and Journel (1992, 1998) and Goovaerts (1997) are written as companion texts and are highly recommended for the most comprehensive software applications and theory covering many facets of geostatistical analysis, estimation, and simulation (intermediate level). Another reference source not available on library reserve is the ESRI documentation for ArcGIS's Geostatistical Analyst extension; it is both a review of statistical concepts and a tutorial for ArcGIS's geostatistical capabilities.

1.7. Software Overview
- Software used in this course is a combination of ArcGIS, third-party proprietary, and third-party public domain software. ArcGIS now offers an excellent, user-friendly interface to a goodly subset of exploratory statistical analysis methods, variogram and cross-variogram analysis, various kriging methods, cokriging, and cross-validation analysis as well as several deterministic (non-statistical) interpolation methods. However, its general statistical capabilities are limited, and its variogram and cross-variogram analysis capabilities (though robust and powerful) are overly automated (designed for users who "can't wait to krige" but can't be bothered with understanding what the automation is doing or not doing). Furthermore, only 2-dimensional data analysis is permitted, and stochastic methods of modeling and uncertainty analysis are not currently available.

- In order to introduce the various geostatistical concepts--particularly in exploratory data analysis (general statistical treatment of data) and variography--as well as to introduce stochastic simulation concepts which ArcGIS does not currently support and 3-dimensional geostatistical analysis, other software packages will be introduced throughout this course.

1.8. Data Measurement Scales (Till, p.3-5):
- measurements can be made on four different typologic scales, differing in their information content from qualitative to highly quantitative
- nominal scale: categorization into arbitrary classes or categories; eg: colors, species, rock types
- ordinal scale: ranking into a sequence of classes whose sizes may be arbitrary or constant; e.g., the mineral hardness scale; high, medium and low-valued categories
- interval scale: continuous, numerical measurements relative to an arbitrary zero (eg: oC, oF temperature scales)
- ratio scale: continuous, numerical measurements relative to a true zero (eg: oK temperature scale; permeability; nutrient abundance)

- the ratio scale of measurement has the highest information content, but it is not always possible or desirable to make such measurements; the nominal scale is the basis on which the natural sciences are founded (classification and description) and is still extremely useful in quantitative analysis; ie. you choose the data type that makes most sense for your problem--even if it means transforming the data from a higher to a lower typology; e.g., classification of ordinal and interval data into bins or classes, or transforming to a binary indicator variable (above and below some threshold)

- an extremely important consideration in collecting and analyzing data of any type is the 'support' or size of the 1-, 2-, or 3-dimensional space over which a measurement is made; e.g. the choice of pixel size for spectral images determines how ground-based measurements will be collected and analyzed; an ore assay on a chip sample would not provide as good an estimate of the average ore grade of a mine stope as a properly homogenized sample from a 10-ton truckload from the stope; topographic contours estimated from 10-meter pixel imagery will have a different accuracy level than 1000-meter pixel estimates

1.9. Sampling Design: (Till, p.51 - "sampling is like religion: all are for it, no one practices it")
- the purpose of sampling is to make statistical inferences about the underlying population as efficiently and as accurately as possible
- systematic vs. random sampling is dictated by practicality (availability) and the nature of what is being sampled. For example, if a process is known to produce a patchy spatial variation of high and low values, then a regular grid won't be efficient or even able to capture the statistical properties of that patchiness

- steps in designing a sampling strategy:
  1. develop a conceptual framework: the purpose of the sampling campaign, expected populations to be encountered, types of variables (continuous, categorical), expected sources of variability
  2. form a working statistical model based on the conceptual framework (eg: a normal pop'n)
  3. choose a sampling plan based on the model that will achieve the stated purpose
  4. decide on the number of samples to collect to achieve the accepted levels of precision vs. accuracy (repeatability vs. truth)

- types of sampling: regular (gridded or geometric); random; biased (historical or available)
- sampling goals will differ depending on project goals: scales of variability, areas representative of the study region, number and spacing constraints of samples, sample support size, etc.

- Note: the goals of sampling are as varied as the earth is complex; different sampling campaigns within the same project can have very different objectives, and their impacts on spatial data analysis can be enormous! For example, obtaining the best global estimates of environmental contamination may require spatially random (unbiased) sampling, whereas the most efficient sampling plan for locating zones of high-grade ore and their size may involve progressively biasing the collection of data towards high-valued areas. There is nothing to prevent a sampling campaign from evolving from a random or gridded (unbiased) scheme to a localized, "hot-spot" (biased) campaign

- sampling campaigns that target the high or low end of the data distribution produce spatially clustered data: that is, the mean, variance and frequency distribution of the sample data are biased by the inclusion of a disproportionate number of samples of high or low values. Declustering of such a data set is unnecessary for kriging, which automatically accounts for data clustering, but is important for simulation because the underlying population distribution must be accurately estimated to make predictions where neighborhood data are unavailable. The technique known as cell-declustering superimposes regular grids of various cell sizes over the data domain, and assigns a declustering weight to clusters of samples that fall within each cell that is proportional to the inverse of the number of samples within the cell. The global declustered mean for a given cell size is then defined as

    m_declus = (1/N) Σ δi xi ,

where N is the number of samples and δi is the cell-declustering weight of sample xi. Optimum declustering weights are chosen for the cell size at which the global declustered mean is a minimum (or maximum, if low values are preferentially clustered)

1.10. Probability: (Till, Chapters 2, 3)
- a sample of a population is the outcome of a statistical experiment (a sampling campaign)
- if multiple outcomes are possible for each experiment, there is an uncertainty associated with each outcome, i (eg: if the population consists of 50 red and black marbles in a bag and the sample size is 50, then only one possible outcome exists; there is no uncertainty in the sample or the inferences made from the sample)

- the probability of an outcome, pi, is the proportion of outcome i to all possible outcomes: pi = ni/N; N = Σni

- for any population, there is an unknowable "true" probability of an outcome (a population statistic), which cannot be known, only estimated (a sample statistic)

- a process (or a sample drawn from the population created by the process) is considered random when the nth outcome is independent of the (n-1)th outcome (an outcome has no "memory" of preceding outcomes)

- a Markov process is a random process but the probability of outcome i, pi, depends on a preceding outcome j, pj (e.g., prograding fluvial vs. deltaic vs. lacustrine depositional environments; the temporal vegetation succession following a forest fire)

- a Markov chain is a series of possible states, with the probability of transition from state i to state j defined for all possible transitions

- 1st-order Markov chain: j = i+1; nth-order: j = i+n
- a stationary process exists where the transition probabilities are constant in time (space)
- example: see the cyclothem example in Till, p.11-14 for transition probabilities
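
To make the idea concrete, here is a small sketch of a first-order Markov chain with invented facies states and transition probabilities (not Till's cyclothem example):

    # Sketch with invented numbers: a 1st-order Markov chain over three depositional
    # states, where the probability of the next state depends only on the current one.
    import numpy as np

    states = ["fluvial", "deltaic", "lacustrine"]
    P = np.array([[0.6, 0.3, 0.1],      # transition probabilities from "fluvial"
                  [0.2, 0.5, 0.3],      # from "deltaic"
                  [0.1, 0.3, 0.6]])     # from "lacustrine"

    rng = np.random.default_rng(0)
    chain, current = [], 0              # start in the "fluvial" state
    for _ in range(20):
        chain.append(states[current])
        current = rng.choice(3, p=P[current])   # next state drawn from the current row
    print(chain)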

- the concept of Markov processes is related to the concept of spatial correlation which is central to geostatistics; ie. values of a spatial variable are not randomly distributed but depend on their spatial context

- a key concept in geostatistics is that of the random function model. This is the hypothetical underlying probabilistic model with which we will describe the statistical properties of a variable and all possible spatial arrangements of its values; it is a purely theoretical concept because only one possible spatial arrangement is ever available to us for sampling (i.e. the Earth as it exists at the time of sampling). The random function (or R.F.) model allows for an infinite number of possible spatial arrangements or so-called realizations of a regionalized variable. For example, the bag of 50 marbles constitutes a "population" in classical statistics, because it is the only one we have to sample; but if it is viewed as just one realization of a random function, then there are many possible arrangements and proportions of red and black marbles in the bag. In geostatistics we are not concerned with the other possible arrangements, only in inferring the statistical properties of the R.F. so we can use it to make better predictions of what's in the bag.

1.11. Describing a Sample Distribution: (see Cheeney, p.13; Till, p.90-91)
- frequency is the number of specimens/events in a sample
- a histogram is a probability frequency distribution (pfd); note that it is the area (not the height) of the bars that is proportional to frequency; i.e., the bin sizes (class intervals) of the bars can vary. The appearance of a histogram can be greatly altered by the choice of class (bin) sizes!

- probability density function (pdf): where the number of data values is large, the number of histogram classes can be made arbitrarily large to more accurately reflect the shape of the underlying population histogram; in the limit, as histogram class size approaches zero, the histogram smooths out and approaches a continuous curve: this curve is the probability density function (pdf) and is the basis for predicting the probability that a variable lies within a specified range, or the probability that a variable lies above or below a specified threshold. The most commonly encountered pdf's are the normal (gaussian) and lognormal forms

- the cumulative frequency distribution (cfd) is the cumulative analog of the histogram; the cumulative distribution function (cdf) is the cumulative analog of the pdf
- the height of the cdf at a threshold, z, is equivalent to the area, Φ, beneath its pdf to the left of z; tabulated values, Φ(z), for the standard normal distribution are available to define the probability (area under the pdf) that a variable's value is less than z (see Section 1.13 below)

- the q-quantile (or quantile) on a cdf is the height, q, of the cdf at a given value of the variable; a Q-Q plot therefore compares the shapes of two cdfs (or cfds)
- the p-quantile (or probability) on a cdf is the value of the variable that a proportion, p, of the data does not exceed; thus, a P-P plot compares cumulative probabilities of two cdfs (or cfds)

- common measures of central tendency: mode (highest frequency class); median; mean (arithmetic, geometric, harmonic):

    m_arithmetic = (1/n) Σ xi ;   m_geometric = [Π xi]^(1/n) ;   m_harmonic = n / Σ(1/xi)

- measures of dispersion (shape) about the central tendency: range (max, min, interquartile); variance / std. dev.; skewness (approximately equal to 3*[mean-median]/std.dev. = -1 to +1)
- common statistical "moments": 1st moment (mean) = (1/n) Σ xi , 2nd (variance) = (1/n) Σ (xi − µ)² , 3rd (skewness) = (1/n) Σ [(xi − µ)/σ]³ , 4th (kurtosis) = f[Σ (xi − µ)⁴]
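
A short sketch (not part of the notes) computing these measures with numpy/scipy for an arbitrary positive-valued sample:

    # Sketch: central-tendency measures and standardized moments for a small sample.
    import numpy as np
    from scipy import stats

    x = np.array([2.1, 3.5, 3.8, 4.0, 4.4, 5.2, 7.9, 12.3])

    m_arith = x.mean()
    m_geom  = stats.gmean(x)                     # [Π xi]^(1/n)
    m_harm  = stats.hmean(x)                     # n / Σ(1/xi)

    variance = x.var()                           # 2nd moment about the mean, (1/n) Σ (xi − m)²
    skewness = stats.skew(x)                     # 3rd standardized moment
    kurtosis = stats.kurtosis(x, fisher=False)   # 4th standardized moment
    print(m_arith, m_geom, m_harm, variance, skewness, kurtosis)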

- Note: the pdf and cdf exist only for continuous measurement data, but a discontinuous type of "cfd" can be plotted for any categorical variable like any histogram can (using constant or varying class intervals) for frequency analysis of categorical data. Also, the concept of a cdf is often used interchangeably with that of a "cfd", even where the cdf is unknown--so be aware of its specific usage in a particular context.

1.12. Degrees of Freedom: (Till, p.56-57)
- in general, the D.F. of a statistic is equal to the number of outcomes in the sample (n) less the number of statistical parameter estimates that are required to calculate the statistic
- for example, to calculate the sample mean, we only need the n outcomes (no other statistical information), so the D.F. for the mean is simply n. In calculating the variance, however, we require an estimate of the mean, so the variance's D.F. is n - 1.

- when calculating a population statistic, by definition we have all the information about the population and its statistics, so estimates are not required. That is, in calculating the variance of a global population, the exact mean is known, so no estimate is required

e.g.: in a bag of 50 red and black marbles the population N is 50; if sample size n < 50, then calculating the variance requires an estimate of the mean; hence, the variance's D.F. = n-1. However, if the sample size is 50 (all of the population is represented in the sample), then the mean no longer needs to be estimated (it is known with certainty), and so the variance's D.F. is N.

1.13. Normal Probability Distribution Function (pdf):

    (1.1)    y = [1/(σ√(2π))] exp(−(x − µ)²/2σ²)

or (see Till, p.33)

    y = [1/(σ√(2π))] exp(−z²/2), where z = (x − µ)/σ is the "standard normal deviate"

- the standard normal pdf has µ = 0, σ = 1, so that y = [1/√(2π)] exp(−z²/2) (see Till, p.37 for µ ± 3σ)

- the total area under the standard normal curve is 1.0 (ie. the probability that −∞ ≤ x, z ≤ ∞ is 1)
- the area under the curve between a and b is equal to the cumulative distribution function (cdf) and is the probability that x lies between a and b:

    (1.2)    F(a, b) = p[a ≤ x ≤ b] = ∫_a^b [1/(σ√(2π))] exp(−(x − µ)²/2σ²) dx

- tabulated values of the standard normal curve list the inverse cdf area, 1 − Φ(z)
- since Φ(z) is the probability (the area of the pdf from −∞ to z) that the standardized random variable is less than z, then 1 − Φ(z) is the probability that it is greater than z:

    (1.3)    1 − Φ(z) = p[z ≤ (x − µ)/σ ≤ ∞] = ∫_z^∞ [1/√(2π)] exp(−x²/2) dx

[Figure: standard normal pdf showing the area Φ to the left of z and the tail area 1 − Φ to the right]

- tabulated values of 1 − Φ(z) (Till, p.34) range from 0.50 at z = 0 to very small values as z increases

- note that since the normal distribution is symmetrical about a mean of zero, only positive z values are tabulated; for negative z values, Φ(−z) = 1 − Φ(|z|)

Note: the tabulated z values correspond to numbers of standard deviations, σ; e.g. at z = 1.96, 1 − Φ(z) = 0.025. That is, the sum of the two tails is 0.05, so the probability that values fall within -1.96σ and +1.96σ of the mean is 1 − 2[1 − Φ(z)] = 95%


Example: if porosity (in percent) is normally distributed with mean µ = 20 and σ = 2, the probability that porosity lies between 17.5 and 23 can be found as follows:
- let the variable x represent porosity, so its standardized form is z:

    x1 = 17.5, so z1 = (x1 − µ)/σ = (17.5 − 20)/2 = −1.25
    x2 = 23,   so z2 = (x2 − µ)/σ = (23 − 20)/2 = +1.5

- the tabulated value for 1 − Φ(1.5) = 0.0668; therefore, area A2 = 1 − [1 − Φ(1.5)] = Φ(1.5) = 0.9332 = probability that x < 23 = p[(x − µ)/σ ≤ z2]
- for z = −1.25, look up the value for 1 − Φ(1.25) = 0.1056; and since the curve is symmetric about zero, A1 = 0.1056 = p[(x − µ)/σ ≤ z1]

[Figure: standard normal pdf showing the area A2 (p = 1 − 0.0668) below z = 1.5 and the lower-tail area A1 (p = 0.1056) below z = −1.25]

- therefore the desired probability = A2 − A1 = 0.8276, ie. the probability that the porosity lies between 17.5 and 23% is 82.76%
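
The same calculation can be checked with software instead of printed tables; a minimal sketch using scipy's normal cdf (not part of the original example):

    # Sketch: probability that porosity lies between 17.5 and 23% for N(µ=20, σ=2).
    from scipy.stats import norm

    mu, sigma = 20.0, 2.0
    p = norm.cdf(23, mu, sigma) - norm.cdf(17.5, mu, sigma)
    print(round(p, 4))    # ≈ 0.8275, i.e. about an 82.8% chance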

- because of the definition of the standard normal deviate, the value of z is equivalent to the number of standard deviations, σ, away from the mean (see Till, Fig. 3.13)

1.14. The Normal-Score Transform:
- any non-normal distribution (including lognormal ones) can be transformed into a standard normal distribution by a numerical or graphical method known as a normal-score transform, which can be treated as a perfect normal distribution, then back-transformed by an inverse procedure; see Isaaks and Srivastava, p.469-470, Hohn, p.171-172, p.175-185; note that this procedure is always safer than log-transformation, because back-transformation of log-estimated values introduces a systematic bias in the estimates (Deutsch and Journel, p.93)

- normalization of the data distribution is not necessary for kriging per se but is essential for stochastic simulation of continuous variables based on gaussian-type simulation algorithms. It also circumvents the dual problems of choosing an appropriate (and usually arbitrary) transformation algorithm for an irregular frequency distribution and of appropriately interpreting the back-transform of the linear estimate (as is the case for log-transforms). It makes correlation structure in variogram analysis easier to identify and conceptualize; and, finally, it minimizes the chance of numerical instability in solving the kriging matrices
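
A minimal sketch of a rank-based normal-score transform and its back-transform (illustrative only; it is not the GSLIB NSCORE/BACKTR implementation and it ignores tied values):

    # Sketch: map values to standard normal quantiles of their ranks, and back.
    import numpy as np
    from scipy.stats import norm

    def nscore(z):
        """Normal-score transform: each value gets the normal quantile of its rank."""
        n = len(z)
        ranks = np.argsort(np.argsort(z))          # 0 .. n-1
        p = (ranks + 0.5) / n                      # plotting position in (0, 1)
        return norm.ppf(p)

    def back_transform(y, z_sorted):
        """Invert the transform by mapping normal scores back onto the sorted data."""
        n = len(z_sorted)
        p = (np.arange(n) + 0.5) / n
        return np.interp(norm.cdf(y), p, z_sorted)

    z = np.random.default_rng(1).lognormal(mean=1.0, sigma=0.8, size=200)   # skewed sample
    y = nscore(z)                                  # y is (approximately) standard normal
    z_back = back_transform(y, np.sort(z))         # recovers the original values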


1.15. Other Transforms
- transforms are useful for changing any distribution into one of a different form
- the simplest transform is a linear shift of values (e.g., multiplying fractional values by 100 to generate values in per cent); the normal-score transform is an example of a non-linear transform
- the logarithmic transform (w = ln[z]) is a commonly applied transform used to convert a positively-skewed distribution into a more symmetric, normal-like distribution; although the simple back-transform (z' = exp[w]) of any value returns the original value, the simple back-transform of a statistical estimate derived from the normally-distributed transformed values will be biased, requiring a special transform (Deutsch and Journel, 1997, p.75-76)

- values of back-transformed confidence limits, mean, and std. dev'n are derived from their log-transformed counterparts as:

    (1.4)    x = exp(ln[x])                              for the non-log value of x
    (1.5)    µ_x = exp(µ_lnx + ½ σ²_lnx)                 for the estimate of the (non-log) mean of x
    (1.6)    σ²_x = µ_x² [exp(σ²_lnx) − 1]               for the estimate of the (non-log) std. dev'n of x

1.16. The Indicator Transform:
- a very useful non-linear transform that is widely used in geostatistics; indicator transformation is the basis for indicator kriging and indicator simulation. Ordinary kriging and Gaussian simulation make estimates on the basis of an assumed Gaussian form of the global cdf. Where such an assumption is inapplicable, indicator geostatistics are used

- an indicator transform value, IZc, is assigned on the basis of a decision tree, for a threshold value zc:

    IZc = K        if z < zc
    IZc = NOT{K}   if z ≥ zc

- indicator transforms take on values of K = 0 or 1, with the sense determined by the application; where the transform is applied to estimate a probability of exceeding the zc threshold, K is set to 0 and NOT{K}, to 1 (for estimating the probability of not exceeding zc, the sense of K would be the opposite). However, where indicator transforms are used to estimate a cdf (see the following section), K is always assigned a value of 1 and NOT{K}, a value of 0.

- for example, to estimate the probability of exceeding zc = 10, the indicator transform of values less than 10 would be assigned a value of 0 and values of 10 or greater would have an indicator value of 1. Ordinary kriging of the transformed variable produces estimates of the probability that zc is exceeded.

- in general, indicator transforms are useful in the following circumstances:

1) to estimate the distribution of values above or below one or more specified threshold(s), zc(i) (e.g., the proportions of measurements classified as high- and low-permeability);

2) to represent the cdf of a categorical variable (nominal or ordinal data), by assigning K = 1 to indicate Presence of a particular category, 0 to indicate Absence;

3) to model a continuous variable with a non-Gaussian cdf in a "non-parametric" form, by estimating the cdf with multiple indicator transforms
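
A minimal sketch of indicator coding at several cutoffs (the data values and thresholds are invented); the proportion of 1's below each cutoff estimates the cdf at that threshold:

    # Sketch: indicator transforms of a continuous variable at three cutoffs.
    import numpy as np

    z = np.array([3.1, 7.4, 9.9, 10.2, 12.5, 15.0, 18.3, 22.7, 31.0, 44.8])
    cutoffs = [10.0, 20.0, 30.0]

    # K = 1 below the cutoff, 0 above: the mean of each indicator is an estimate of F(zc)
    indicators = {zc: (z < zc).astype(int) for zc in cutoffs}
    cdf_estimate = {zc: ind.mean() for zc, ind in indicators.items()}
    print(cdf_estimate)              # {10.0: 0.3, 20.0: 0.7, 30.0: 0.8}

    # For an exceedance-probability map (probability that z >= 10), the sense is reversed:
    exceed_10 = (z >= 10.0).astype(int)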


1.17. Estimating a Distribution with Indicator (non-parametric) Statistics:

- where the form of a cdf is unknown or not well defined by available data, it can be estimated with indicator transformation around multiple thresholds, zc(i); in essence, ratio- or interval-scale data are converted into a few ordinal-scale classes around two or more thresholds known as indicator cutoffs; from the relative proportions of outcomes falling below each threshold, the form of the cdf can be estimated, as in the following example:

[Figure: four panels illustrating the procedure -- Raw Data (noisy), shown as a pfd and cfd; Choice of Indicator Cutoffs (pdf thresholds) C1, C2, C3 defining indicators I1, I2, I3; Approximate Population cdf with Indicator Probabilities p(I1)...p(I3); Reconstruct Population pdf]

- see the spreadsheet "IndicatorCDF.wk4" for an example of how the cdf of a continuous distribution can be approximated by indicator transforms.

1.18. Introduction to Problem Sets: The Walker Lake Data Set
- refer to data files and details given in class


S.1.1. Generalized Sequence in Geostatistical Analysis and Modeling

- this sequence of steps is intended to provide a general idea of the process of data manipulation, analysis, and evaluation in a geostatistical modeling project. It is not a "laundry list" to be followed in strict sequence; rather, all or most of these steps need to be addressed in the order that is appropriate for a particular problem and data set

- software that can be used at each step is also indicated ("GsA" stands for ArcMap's Geostatistical Analyst)

EDA-I and EDA-II
1. Evaluate data distribution                            spreadsheet, StatMost, HISTPLT, PROBPLT, SCATPLT
   - global statistical character (normal/skewed, outliers, univariate / bivariate summary)
   - GsA has limited capabilities, although the interface is convenient
2. Create plot of sample locations (a post plot)         GsA
   - location errors, visual examination (clustering, hi/lo values, etc.)
3. Data and coordinate manipulations (if necessary)      spreadsheet, ROTCOORD
   - remove trends in raw data; indicator transformation (if necessary); rotate or transform spatial coordinates to remove non-orthogonal spatial arrangements
4. Trend analysis                                        GsA
   - identify possible trends in raw data
5. Compute declustering weights                          DECLUS
   - for estimating unbiased statistics; for input to simulations
6. Normal-score transformation                           spreadsheet, NSCORE, GsA
   - for improving variogram analysis and kriging results; necessary for Gaussian simulation

Decision points: spatial discontinuities, segregate and/or regroup populations, apply data transformations; repeat 1-6 where necessary

Variography (EDA-III):
7. Construct experimental variograms                     GsA, VarioWin
   - identify overall autocorrelation structure, optimal lag classes, anisotropy, data outliers
   - steps in variogram analysis (VarioWin):
     - construct isotropic variogram (if any), choose optimal bin parameters
     - construct anisotropic variograms, identify principal orientations (if any)
     - look for internal consistency in alternative measures of autocorrelation

Decision points: treatment of spatial outliers, structural insights into process, population regroupings, data transformations; is spatial autocorrelation present? will geostatistical prediction be of use? which prediction method will be used and for what reasons? repeat 1-7 where necessary


Variogram Modeling:
8. Model the variograms' autocorrelation structures      GsA, VarioWin
   - identify and fit appropriate variogram model(s) depending on whether kriging or simulation will be performed, whether raw or normal-score data are modeled, etc.

Prediction - Kriging
9. Perform kriging and/or indicator kriging              GsA, KT3D, IK3D
10. Statistically evaluate the prediction process        GsA, KT3D, IK3D
   - perform cross-validation to evaluate kriging errors
11. Spatially evaluate the prediction process            (various software)
   - look for systematic spatial bias and trends in estimated values and kriging errors
12. If applicable, back-transform estimated variable(s)  BACKTR
   - reproduce original (detrended) variable's range and values

Prediction - Simulation
9. Perform sequential simulation                         SGSIM, SISIM
   - estimate local and global uncertainty and the spatial character of variability
10. Post-process multiple simulations                    POSTSIM
   - calculate expected values, variances, exceedance probabilities from n simulations
11. If necessary, back-transform estimated variable(s)   BACKTR
   - reproduce original (detrended) variable's range and values
12. If necessary, post-process indicator simulations     POSTIK
   - perform corrections and other final adjustments to the simulation results

Post-Prediction Analysis
13. If applicable, evaluate prediction performance       GsA
   - compare estimates with a subset of the data that was held back for validation purposes
   - compare prediction performance of alternative variogram/search parameter choices
14. If applicable, restore the trend surface             (various software)
   - reproduce the original range of values and spatial trends in the data
15. Evaluate overall performance                         (various software)
   - compare original data values with estimates; check reproduction of global cdfs, global variograms, bivariate correlations, etc.

Decision Points: has the prediction process produced satisfactory results? could performance be improved by regrouping, alternative variogram model and/or prediction search strategies? repeat 8-15 as necessary


2. Review-II Parametric Tests, Nonparametric Tests

2.1. Statistical Tests
- one of the most fundamental applications of statistics is in deciding whether a result (be it a confidence interval, a regression slope, two or more sample distributions, or estimates of those distributions' statistical characteristics) is meaningful in a probabilistic sense. For example, is the population mean estimated from sample #1 statistically different from that estimated from sample #2? Do the distributions in sample #1 and sample #2 represent a normal population? If variable y is correlated with variable x according to a calculated regression slope, b, is the slope statistically meaningful?

- statistical tests are conducted by formulating a testable hypothesis. A null hypothesis, Ho, is formulated (e.g., 'the mean of population 1 = mean of population 2'; or 'the regression slope = zero') and tested statistically. The alternative hypothesis, Ha, is the antithesis of Ho. The result of the statistical test of the hypothesis is to accept either the null hypothesis, the alternative hypothesis, or to conclude that both could be true, depending on the choice of the statistical level of significance.

2.2. Parametric Tests
- are applied to statistical data that are known or assumed to be derived from distributions of a particular form (e.g., a normal or Gaussian distribution, a lognormal distribution, etc.). Parametric tests are useful for comparing the means or variances of two populations, for determining whether two samples were drawn from the same or different populations, and for quantifying the confidence or probability that the mean of a sample falls within or outside of specified thresholds. Parametric tests are applicable only to interval or ratio-scale measurements.

2.3. Student's t-test: (Till, p.56-61)
- the t-test is a method to compare two sample means or to determine whether a population was drawn from a normal population of the same mean, either for known or unknown variances - but always assuming a normally-distributed population! A similar comparative test of multiple sample distributions is performed by the Analysis of Variance (ANOVA) test (see Till, p.106).

- given a population with mean µ and variance σ², draw a sample of size n, whose sample mean is m and sample variance is s²
- draw all possible samples of size n and calculate t = (m − µ)/(s/√n) for each sample, where s/√n can be thought of as the standard deviation standardized by the sample size
- plot the pdf of the t-statistic, which defines Student's t distribution (Till, p.57)
- the level of significance, α, is defined as the probability of obtaining a value further from the mean than the specified value |t|
- a one-tail test is used if the test is formulated as to whether a statistic is greater than or less than a given threshold (ie. when the probability refers to only one side (tail) of the pdf); in that case, look up the tabulated t-value for the α level of significance


- a two-tailed test is used for a confidence interval (C.I.) or a test of difference or, in general, if the probability being tested refers to the entire pdf regardless of whether the difference is more or less than a specified value; in that case, look up the tabulated t-value for the α/2 level of significance (eg: if your test is two-tailed at the 95% level, look up the t-value for α/2 = 0.025)

- Note: the t-statistic has (n-1) D.F. since we can compute m and s from the data, but we need to estimate µ

Example: a C.I. estimate based on 16 measurements (D.F. = 15) defines m = 9.26, s = 2.66; from a table of Student's t critical values, for n-1 = 15, α/2 = 0.025, tα/2,15 = 2.131. Therefore:

    (2.1)    −2.131 ≤ (m − µ)/(s/√16) ≤ +2.131

or, rearranging:

    (2.2)    m − 2.131 s/√16 ≤ µ ≤ m + 2.131 s/√16

substituting m and s: C.I.95 = 7.8 ≤ µ ≤ 10.7, ie. it is 95% likely that µ is within this range
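
The same interval can be reproduced with software; a minimal scipy sketch (not part of the original example):

    # Sketch: 95% confidence interval for the mean from n = 16 measurements.
    import numpy as np
    from scipy import stats

    n, m, s = 16, 9.26, 2.66
    t_crit = stats.t.ppf(1 - 0.025, df=n - 1)     # 2.131 for 15 degrees of freedom
    half_width = t_crit * s / np.sqrt(n)
    print(m - half_width, m + half_width)         # ≈ 7.8 to 10.7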

Example: To compare two sample means (Till, p.62; Cheeney, p.68), use a pooled t-statistic and a pooled variance (based on the combined sizes of both sample populations); then, if the computed t-statistic is less than the tabulated tα/2 value, the means are 1−α % likely to be from the same population

- Types of t-tests:
  General t-test: samples are drawn from populations of equal variance; are the means different?
  Unpaired t-test: samples are drawn from populations with different variances; are the means different?
  Paired t-test: are two sets of outcomes drawn from the same population? (e.g. are the means of duplicate sets of analyses the same?)

- The power of an hypothesis test: (see Section S.2.3.; also Till, p.63-65) The level of significance, α, is the risk of rejecting Ho when it should be accepted (Type I error), whereas the risk of accepting Ho when it is in fact false is β (Type II error). The "power" of a test is 1-β; the higher the power, the better the test; but increasing the level of significance (α) always reduces the power of a test. A common point of compromise is at α = 0.05

- in general, non-parametric tests require fewer assumptions about a population but have a lower power and hence a greater risk of Type II error than a comparable parametric test

2.4. The F-test for Comparison of Variances: (Till, p.66)
- sample two normal populations for all possible sample sizes n1, n2; define F = s1²/s2² for all possible combinations of n1, n2 (ie. an infinite family of F distributions)
- D.F. = n1-1, n2-1; the F-test is a one-tailed test
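
A minimal sketch of the F-test on two invented samples (not from the notes); by convention the larger variance goes in the numerator:

    # Sketch: one-tailed F-test of equal variances.
    import numpy as np
    from scipy import stats

    a = np.array([10.1, 11.3, 9.8, 10.7, 12.0, 11.1])
    b = np.array([10.4, 10.6, 10.5, 10.2, 10.8, 10.3])

    s1, s2 = a.var(ddof=1), b.var(ddof=1)
    F = max(s1, s2) / min(s1, s2)
    df1, df2 = len(a) - 1, len(b) - 1
    p = stats.f.sf(F, df1, df2)          # probability of an F this large by chance
    print(F, p)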

2.5. The χχ2-test: Goodness-of-Fit - used to test how well a sample distribution fits a theoretical distribution (Till, p.69); this is a goodness-of-fit test. However, it is also useful for nominal data, in a non-parametric analysis

19

Page 23: Geostatistics and Spatial Modeling Lecture Notes 2004

of occurrence frequency in contingency tables (preferably for n > 40: Till, p.121, 124)- from a repetitive sampling of a normal distribution, calculate z = (x-µ)/σ for each member of the sample pop’n and define for all samples of size n: this is the distributionx2 = S i=1

n (z2) x2

- transform the sample data to standard normal deviates, group the z-values into r classes (eachwith at least 5 measurements), and compute the test statistic:

(2.3) X2 = S i=1r (0bservedValue[ith]Class−ExpectedValue[ith]Class)2

ExpectedValue[ith]Class = S (O−E)2

E

- D.F. is defined as r−k−1, where k is the number of parameters to be compared against the theoretical distribution (eg: if m, s are to be compared with µ, σ of the normal distribution, then k = 2), and r is the number of classes used in the comparison
- Note: the test is sensitive to the number of classes used (Ho is more likely to be rejected if a large number of classes are used; if the data are grouped into too few classes, the power of the test decreases, i.e., Ho may be falsely accepted, just as binning a histogram into too few classes may result in too simplistic a visual comparison of relative frequencies). Furthermore, there should be at least five observations within each class.

- see Till, p.69-70 for an example of a χ2 parametric test of goodness-of-fit to a theoretical distribution
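A minimal sketch of the χ2 goodness-of-fit procedure described above, written in Python/SciPy with synthetic data (the bin edges and sample size are arbitrary choices, not from Till):

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(50.0, 5.0, 200)                  # synthetic sample to be tested for normality

z = (x - x.mean()) / x.std(ddof=1)              # standard normal deviates (m and s estimated)
edges = np.array([-np.inf, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, np.inf])   # r = 8 classes
observed, _ = np.histogram(z, bins=edges)
expected = len(z) * np.diff(stats.norm.cdf(edges))      # expected counts under N(0,1)
expected *= observed.sum() / expected.sum()             # force identical totals

# ddof = 2 because m and s were estimated from the data: D.F. = r - k - 1 = 8 - 2 - 1 = 5
chi2_stat, p_value = stats.chisquare(observed, expected, ddof=2)
print(f"X2 = {chi2_stat:.2f}, p = {p_value:.3f}")        # reject normality if p < alpha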

2.6. Non-Parametric Tests:
- used for testing distributions of unknown form, for populations of unequal or unknown variances, and for nominal or ordinal-scale measurements

2.7. The χ2-test: Contingency Table Analysis (Till, p.121-124)
- used for comparing sample populations where the distributions are non-normal or unknown
- useful for testing sample populations that are measured on a nominal scale (regardless of the type of distribution); the data are grouped and counted in a contingency table
- e.g., a number of high-, med- and low-permeability measurements (lognormally distributed) are made in two different rock types: do the two rock types have the same permeability distributions? (for a numerical example, see Table 7.4 in Till, p.121)

                         Number of Measurements in:
  Categories    Gravelly Sediment    Sandy Sediment    Total Number
  High                  a                   d               a+d
  Medium                b                   e               b+e
  Low                   c                   f               c+f
  Totals              a+b+c               d+e+f              n

- note that the data have been transformed into a nominal scale (high, medium, low classes), so that the only statistical analysis possible is counting of occurrences within/between classes
- set up the table, with i = 3 rows and j = 2 columns, and with marginal row and column totals, Ti and Tj


- the expected probability of finding measurements of a given permeability class in either type of sediment is the expected value, E:   Ei = Ti / n

- that is, by chance alone, we would expect that the probability of measuring a low-permeability value in either sediment type would be (c+f)/n

- the expected probability of finding values of a given permeability class in a given rock type is the joint probability:   Eij = (Ti · Tj) / n

- eg: if the two sedimentary types were hydraulically identical, a purely random distribution of permeability values should exist between, as well as within, the sediment groupings; therefore, the number of measurements in the high-permeability class in gravel alone that is expected by chance is:   E1,1 = (a+d)·(a+b+c) / n

- define the null hypothesis, Ho: no significant difference exists in the distribution of permeability between gravelly and sandy sediments
- calculate the test statistic:

(2.4)   X2 = Σ_{i=1}^{r} Σ_{j=1}^{k} (Oij − Eij)² / Eij

where r, k = number of rows, columns in the contingency table
- D.F. is defined as (r−1)·(k−1); for the above contingency table, D.F. = 2
- from χ2 tables, look up the significance level, α, and D.F. = 2, to find the critical value of χ2; e.g., for α = 0.05, D.F. = 2, χ2(0.05, 2) = 5.99

- compare the test statistic, X2, with the critical χ2 value; if X2 < critical value, the null hypothesis is accepted (i.e., at a level of significance of 0.05, the variations of a, b, c vs. d, e, f in the table will occur due to chance 95 times out of 100); conversely, if X2 > critical value, the null hypothesis is rejected (that is, in only 5 out of 100 times will the observed permeability differences between sediment types arise by chance)

- in terms of the p value: when statistical analysis software is used to analyze a contingency table, the calculated p value would be the probability that the observed differences could arise if the null hypothesis were true.

Note: for use of the χ2 test of RxC tables in StatMost, see p.274-276 in the user's guide
- use Statistics | Contingency Table | RxC Table to use StatMost's built-in chi-square contingency table analysis; note that the contingency table data are entered without marginal totals
- rather than requiring a confidence level to be specified, StatMost computes the effective p-value corresponding to the computed X2 statistic
- e.g., for Till's p.121 example, the null hypothesis is rejected for significance levels greater than about 0.02, i.e., the computed p-value is about 0.02 (see "chi-sq RxC example.dmd" for this worked example)

2.8. The Kolmogorov-Smirnov Goodness-of-Fit Test for Normality: (Cheeney, p.62-64)
- this is a non-parametric test that is used to determine if a subsequent parametric test is justified
- the sample distribution is normalized to an appropriate form (e.g., if the test is whether the distribution could be Gaussian, the sample data would be transformed according to the standard normal deviate, thus "normalizing" the distribution around a mean of 0.0 and a standard deviation of 1.0)
- the D test-statistic for the normalized sample cfd is defined as the maximum class-interval departure from the theoretical cdf (in this example, the standard normal distribution)
- the test D-statistic is compared to a critical D-statistic
- e.g., for a one-sample test and n > 15, the critical D-statistic is A/√n (for α = 0.1, A = 1.07; α = 0.05, A = 1.22; α = 0.01, A = 1.51; α = 0.005, A = 1.63) (Cheeney, p.46; Rock, p.96)
- if the test D-statistic > the critical D-statistic, the sample population is not normally distributed
- Note: The K-S test is designed to be useful to test goodness-of-fit to any distribution. For this reason, StatMost does not automatically transform a sample distribution prior to applying the K-S normality test, so the sample data must first be normalized! The Lilliefors normality test is a modification of the K-S method that does not require the data to be normalized (note that StatMost's implementation uses an estimation method to determine the critical statistic, so that the calculated Lilliefors p-value will be slightly different from that calculated in the K-S test; see Davis, p.109)

2.9. A Non-Parametric Test of Similarity: The Kolmogorov-Smirnov Test (see Cheeney, p.45-46; Till, p.125-130 for details and example)
- use the K-S test to compare any two (normal or non-normal) sample populations, or to compare an unknown sample distribution to a normal distribution
- the test compares the forms of two distributions; statistical dissimilarity between them is identified regardless of whether it arises from differences in the mean, variance, or skewness; the test does not identify why the distributions differ. Because of this, it loses its statistical power at small values of n, and it is always less powerful (in a statistical sense) than a comparable parametric test (such as a t-test).
- define the null hypothesis as m1 = m2 and define the alternate hypothesis as m1 ≠ m2 (two-tail) or m1 > m2 (one-tail), where mi = sample mean (Till, p.128)
- the critical D-statistic is determined from the sample sizes of both populations and whether a one- or two-tail test is made; e.g., for different sample sizes, the critical D value for a two-tailed test is A·√((n1 + n2)/(n1·n2)) (where A = 1.36 for α = 0.05; A = 1.63 for α = 0.01)
- Note: StatMost only provides the option of a two-tailed test; it computes a test D-statistic, but does not report the critical D-statistic. Instead, it calculates a p-value which provides an estimate of the minimum confidence level or probability at which the null hypothesis could be accepted; thus, the computed probability gives more information about the test's level of significance (i.e., how close the test is to a borderline rejection) than a manual comparison of D-statistics; see the "Till p.127 K-S example.dmd" data file for worked examples of K-S test comparisons
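Both K-S applications (the one-sample normality test of section 2.8 and the two-sample comparison of section 2.9) can be sketched in Python/SciPy; the data below are synthetic, and the approximate critical-D value quoted above is printed for comparison:

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.lognormal(mean=1.0, sigma=0.5, size=60)    # skewed synthetic sample
y = rng.normal(loc=3.2, scale=1.8, size=45)        # second synthetic sample

# one-sample test against N(0,1): data must first be normalized to z-scores (section 2.8)
z = (x - x.mean()) / x.std(ddof=1)
D1, p1 = stats.kstest(z, "norm")
print(f"one-sample K-S vs. N(0,1): D = {D1:.3f}, p = {p1:.4f}")
print(f"approximate critical D (alpha = 0.05, n > 15): {1.22 / np.sqrt(len(z)):.3f}")

# two-sample test of similarity between two arbitrary distributions (section 2.9)
D2, p2 = stats.ks_2samp(x, y)
print(f"two-sample K-S: D = {D2:.3f}, p = {p2:.4f}")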

2.10. Other Non-parametric Tests (useful for ordinal data and classification)
- see Cheeney (Ch.6-7), Till (Ch.7) for examples
- Mann-Whitney U-test; Kruskal-Wallis; Spearman's rank correlation coefficient; Kendall's-τ
- commonly available tests in StatMost, SPSS, and other standard statistical packages
- before applying any test, be familiar with the test procedure and assumptions, its nomenclature, and perform a dummy test on a known data distribution to ensure that you know how to interpret the results correctly

2.11. Problem Set I. Statistical Summarization: Exploratory Data Analysis-1
- introduction to the use of software, using the demo data set (Walker Lake): univariate summary statistics, box plots, frequency and cumulative frequency distributions (histograms, cfds), K-S tests of normality

Readings:

Isaaks and Srivastava (1989) Introduction to Applied Geostatistics (on library reserve)
  p. 4-6, Ch. 6 - the Walker Lake data set
  Chapter 3 - bivariate correlation, q-q plots, conditional expectation
  p. 40-55 - spatial description, the proportional effect, skewed data, h-scatterplots


S.2.1. Summary of Hypothesis Testing:

Rationale Behind Statistical Tests - from the size, the shape and dispersion of the sample data, compare the sample distribution to a known distribution or another distribution to determine similarities or differences at a specified level of confidence (probability of being wrong).

Parametric vs. Non-Parametric Tests - if a normal distribution is known or inferred, a parametric t-test is the most powerful; if the parent population distribution is not normal or is not known, a parametric test cannot be applied and non-parametric comparisons have to be applied (eg: K-S).

Types of Tests:

One-Sample Tests -
  One-Tailed Tests: 1. is the mean < or > a specified value? (t-test)
  Two-Tailed Tests: 2. is the sample mean equal to a specified value? (t-test)
  Goodness-of-Fit Tests: 4. is the sample population Gaussian? (a two-tailed test) (K-S)

Two-Sample Tests -
  Two-Tailed Tests: 3. are two sample sets drawn from equivalent populations? (t-test or K-S; t-tests: general, paired, unpaired)

Define the Null Hypothesis:

Null Hypothesis -
  case 1. m < specified value or m > specified value
  case 2. m = specified value
  case 3. m(sample population 1) = m(sample population 2)
  case 4. sample's normalized z-score distribution = standard normal distribution

Specify the Confidence Level: For one-tailed tests, the confidence level of the test-statistic is the same as the specified confidence level of the test; eg: a 0.95 confidence level is desired, so the confidence level applied to the test-statistic is also 0.95 (and the significance level, α, is 0.05).

For two-tailed tests, the test-statistic refers equally to both tails of the outcome, so at a specified confidence level (eg: 0.95), the probability of rejecting the null hypothesis is equally shared by both tails, and the significance level applied to each tail is half of α (eg: 0.025).

Specify the Degrees of Freedom: case 1., 2. (t-test) D.F. = n - 1 ; case 3. (t-test) D.F. = n1 + n2 - 2

Note: for a Kolmogorov-Smirnov test, the critical D-statistics are defined as
  case 3. (n > 40)  Dcrit = 1.36·√((n1 + n2)/(n1·n2)) for α = .05;  = 1.63·√((n1 + n2)/(n1·n2)) for α = .01
  case 4. (n > 15)  Dcrit = 1.22/√n for α = .05;  = 1.51/√n for α = .01


S.2.2. Interpreting P-Values Returned by a Statistical Test (modified from: http://www.graphpad.com/articles/interpret/principles/p_values.htm )

What is a p-value?

Observing different sample means is not enough to conclude that they represent populations with different means. It is possible that the samples represent the same population and that the difference you observed is simply a coincidence. There is no way you can ever be sure if the difference you observed reflects a true difference or if it is just a coincidence of random sampling. All you can do is calculate probabilities.

Statistical calculations can answer this question: If the populations really have the same mean, what is the probability of observing as large (or larger) a difference between sample means in an experiment of this size? The answer to this question is called the p value.

The p-value is a probability, with a value ranging from zero to one. If the p-value is small, the difference between sample means is unlikely to be a coincidence.

The null hypothesis:

In general, the null hypothesis states that there is no difference between what is being compared. The p-value is the probability that the observed difference could have arisen by chance if the null hypothesis were true. For example, consider this output from StatMost's t-test:

              Sample A     Sample B
Sample Size        9            9
Mean            1.7778       4.7778      Difference = -3.0000
Variance        1.1944       0.9444      Ratio = 1.2647

            t-Value     Probability    DF    Critical t-Value
General     -6.1539     1.387 E-005    16        2.1199
UnPaired    -6.1539     1.387 E-005    16        2.1199

If the mean of sample A actually were the same as sample B's, then the probability is less than 0.002% that two sample means would differ this much by chance. That is, in repeated samplings of these populations, we would correctly conclude that the null hypothesis is false 99.998% of the time; that is, the null hypothesis could safely be rejected at the 99% confidence level. On the other hand, if the observed difference in sample means were quite small, the calculated p value would be large, indicating that a difference that small could easily arise by chance.

Common misinterpretation of the p-value:

If a p-value is reported as 0.03, it would be incorrect to say that there is a 97% probability that the observed difference reflects the actual difference between populations and a 3% probability that it does not. Rather, it means that there is a 3% chance of obtaining a difference as large as the observed difference if the two samples were drawn from one population. That is, 97% of the time, random samplings of the same population would produce a difference smaller than the observed difference, and only 3% of the time could it be as large or larger.


S.2.3. The Power of an Hypothesis Test (or Minimizing the Risk of Falsely Accepting Ho): (see Till, p. 63-66; also http://asio.jde.aca.mmu.ac.uk/rd/power.htm)

Consider two samples with very different sample means for which a t-test rejects the null hypothesis at the 95% confidence level. We conclude that the samples were not drawn from the same population. But what if the difference in sample means was the result of a chance draw of extreme values? We would be wrong to reject Ho. The risk of such an error at the 95% confidence level is 5%, or 0.05. This is the test's level of significance, α; it is the risk of committing a Type-I error--the probability of rejecting Ho when it is in fact true.

Consider the example of temperature measurements taken from two lakes (such as the data set on the first-day quiz). A t-test returns a p-value of 0.068, indicating there is a 6.8% chance that the difference in sample means could arise purely by chance in the process of sampling a single underlying population; thus, 93.2% of the time we would expect to be correct in accepting the null hypothesis. The risk of making a Type-I error is 6.8%.

A second type of error can also occur. This is known as a Type-II error, β, the risk of accepting Ho when it is actually false. The power of an hypothesis test is defined as 1−β. Determining the power of an hypothesis test is an involved process; it is typically most important in the analysis of trends. See http://www.mp1-pwrc.usgs.gov/powcase/steps.html for a discussion of the procedure and an example of power analysis software.

Perhaps the concept can be more easily envisioned in light of the sample t-statistic and the critical t-statistic (StatMost prints out tcritical and tsample when it reports the p-value for the t-test). In our lake example, at the 95% confidence level and 38 D.F., tcritical is 2.02; the calculated tsample is 1.88. Since |tsample| < |tcritical|, we accept Ho. If we repeat the test at the 90% confidence level, the critical t-statistic is 1.68; since |tsample| > |tcritical|, we would be forced to reject Ho at this confidence level. In this example, we know that Ho is true. So, in rejecting it at the 90% confidence level, we'd be making a Type-I error. On the other hand, by rejecting Ho we have completely eliminated the possibility of making a Type-II error! That is, at α = 0.1, the value of β has dropped from some finite value (0 < β < 1) when α was specified as 0.05, to β = 0; that is, the power of the test has become 1−β, or 100%. If we always rejected Ho, we'd always be assured of the highest possible power for the test, but it would be counter-productive (we'd be 100% sure of avoiding Type-II errors but unable to ever determine whether Ho were true). (to determine a test's power, see Till, p.64-65, and http://www.mp1-pwrc.usgs.gov/powcase/steps.html)

It should be apparent that the goal in choosing an appropriate α level is minimizing the risk of Type-I errors (by setting α as low as possible) while maximizing the power of the test. Because the power of the test also decreases as α decreases, a common point of compromise is to choose α = 0.05. Algorithms that report p values provide more information to assist in choosing an optimum α. For example, comparing only deep-water temperature measurements from both lakes (those least affected by solar heating), the difference in sample means is only 0.6°, and the t-test returns a p value of 0.22; i.e., we could reject Ho only if we specified a confidence level of 78% (α = 0.22), but the power of the test would be the highest. If lowering the risk of a Type-II error were important, we might be willing to accept a confidence level of, say, 85% or 90%, instead.


S.2.4. Classification and Measures of Classification Performance:

Classification into categorical outcomes can be essentially error-free (e.g., a gold analysis of a rock sample determines if that particular sample is ore grade or not in an economic sense). More often, however, some degree of classification error is involved because classifications are based on a decision threshold (e.g., classifying land cover based on a remotely sensed vegetation index carries uncertainty; using vegetation type to predict the presence of an animal known to inhabit a particular vegetation type carries uncertainty; can an ore deposit be termed economic or not, based on the average gold content of 300 samples?)

The simplest type of classification is a dichotomous (or binary) state: presence/absence, yes/no, high/low, uranium-bearing/not uranium-bearing, etc. Whether binary or multiple classification categories are used, we need a quantitative measure of how well a classification scheme performs: that is, we need a measure of relative classification performance. For example, is land use type predicted significantly better with a combination of derived remotely sensed measures than by a single, accurate vegetation index measure? Is one classification scheme more accurate / less inaccurate than another? To minimize the error of classifying a rock as uranium-bearing when it is not, is a decision threshold of 20 cps better than 10 cps?

Methods of evaluating classification performance can be grouped into two types: threshold-dependent and threshold-independent. A threshold-dependent statistical measure summarizes the proportions of correct and incorrect classifications and presents the result as a summary statistic. That is, once a threshold has been decided and a classification produced (e.g., uranium-bearing if >20 cps, not uranium-bearing if <20 cps), the classification results can be grouped into a contingency table to summarize classification performance. The simplest summary table is for a binary outcome, known as a 2x2 confusion or error matrix:

                      Actually Present    Actually Absent
  Predicted Present          a                   b
  Predicted Absent           c                   d

where a, b, c, d represent frequencies of occurrence of possible outcomes from N (= a + b + c + d) total outcomes. Those outcomes (a, d) which are predicted correctly are known as True Positive and True Negative outcomes, respectively; the proportions of misclassified outcomes are known as False Positives (b) and False Negatives (c). Note that classification performance can be evaluated using this formalism for any number of categories, not just the binary case.

A variety of different measures of classification performance can be defined from the information presented in an error matrix. A few of these are:

  Sensitivity                  a/(a + c)
  Specificity                  d/(b + d)
  Positive Predictive Power    a/(a + b)
  Classification Rate          (a + d)/N
  Misclassification Rate       (b + c)/N


All of these measures have different characteristics, and some are overly sensitive to sampling bias, as reflected in the prevalence ratio ([a + c]/N = the proportion of positive cases in the sample data set). For a comparison of the effect of prevalence on predictive power of various measures of classification performance, see http://asio.jde.aca.mmu.ac.uk/resdesgn/presabs.htm.

One of the most useful statistical measures of classification performance is the κ statistic. For a 2x2 error matrix, it can be defined as:

  κ = [(a + d) − ((a + c)(a + b) + (b + d)(c + d))/N] / [N − ((a + c)(a + b) + (b + d)(c + d))/N]

The κ statistic represents the proportion of specific agreement among correct and incorrect classifications. Unlike other measures of classification performance, κ makes use of all of the information in the error matrix.

The Kappa Statistic

As stated above, κ is a measure of agreement. Although the χ2 test is also a measure of agreement, it provides no direct information about how good or poor the agreement is, only if it is statistically significant or not. In contrast, the κ statistic quantifies more of the information in the error matrix so that it can be used to compare relative classification performance.

Both the χ2 and the κ statistics are calculated from RxC contingency tables that summarize frequencies of responses (simple counts). In contrast to the χ2 statistic, however, κ is only defined for a square contingency table (R = C). For example, the relative performance of a classifier that produces three categorical states can be compared between two classification outcomes; such a 3x3 contingency table would look like:

                              Outcome-2
  Outcome-1     Category1      Category2      Category3      Totals
  Category1      a = 88         b = 10         c = 2        a+b+c = 100
  Category2      d = 14         e = 40         f = 6        d+e+f = 60
  Category3      g = 18         h = 10         i = 12       g+h+i = 40
  Totals       a+d+g = 120    b+e+h = 60     c+f+i = 20     a+b+...+i = 200

where a - i designate occurrence frequencies.

The κ statistic can be used as an index of agreement between two outcomes, expressing the percentage of times that the outcomes agree in each category. In the above example, agreements are shown in the diagonal cells (cells with counts of a, e, and i). So, 88+40+12 = 140 out of 200 total comparisons agree; this is the observed rate of agreement, or probability of agreement, Po = 0.7. We don't know if this is good or not, because we don't know what the level of agreement would be by pure chance alone (that is, the expected probability, Pe). The chance level of agreement is given by the expected counts for the same three cells. The expected counts are found in the same manner that we found expected frequencies for the χ2 test; the expected probabilities are expressed as Pe = (row total × column total)/N².

Thus, the expected probabilities in the above example are:

                              Outcome-2
  Outcome-1     Category1      Category2      Category3      Totals
  Category1      60/200         30/200         10/200        100/200
  Category2      36/200         18/200          6/200         60/200
  Category3      24/200         12/200          4/200         40/200
  Totals        120/200         60/200         20/200        200/200

The sum of the expected counts in the diagonal cells (cells a, e, i) gives us the expected frequency of agreement (60+18+4 = 82) for an expected probability of agreement of 82/200, or Pe = 0.41.

Kappa compares the observed levels of agreement with the levels of agreement expected in a purely random classification outcome. The kappa statistic is defined for a generalized n x n error matrix as:

κ = (Po - Pe)/(1 - Pe) .

In the above example, then, κ = (0.7 − 0.41)/(1 − 0.41) = 0.49, which represents the proportion of agreements after chance agreement has been excluded. Its upper limit is +1.00 (total agreement).

If two outcomes agree purely at a chance level, κ = 0.0.
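A minimal Python sketch (not part of the original notes) of the kappa calculation for a square error matrix, checked against the 3x3 worked example above (Po = 0.7, Pe = 0.41):

import numpy as np

def kappa(table):
    """Cohen's kappa for a square (R = C) contingency table of counts."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    p_o = np.trace(table) / n                                       # observed agreement, Po
    p_e = (table.sum(axis=1) * table.sum(axis=0)).sum() / n ** 2    # chance agreement, Pe
    return (p_o - p_e) / (1.0 - p_e)

example = [[88, 10,  2],
           [14, 40,  6],
           [18, 10, 12]]
print(f"kappa = {kappa(example):.2f}")    # about 0.49, as in the text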

The value of κ can be used in a quantitative sense to compare classification performance among two or more different classification outcomes. A rule of thumb for interpreting the kappa statistic is:

  κ = 1.0     perfect agreement (in a non-spatial statistical sense)
  κ > 0.75    excellent agreement
  κ > 0.4     good agreement
  κ < 0.4     poor or marginal agreement
  κ = 0.0     indistinguishable from random agreement


Example 1: Evaluate whether an outcome is significantly better than a chance outcome (category counts are randomly distributed across all categories). A simple example:

10 events classified into 2 categories
Outcome 1 = result of a supervised classification scheme
Outcome 2 = random assignment (equal category counts, random location)

Outcome Results:              Outcome 1 compared to Outcome 2:
  1  2
  A  B                        Number of correspondences in Category A = 3
  A  A                        Number of correspondences in Category B = 2
  B  A                        Number of False Positives = 3
  B  B                        Number of False Negatives = 2
  B  A                        Total Comparisons = 10
  B  B
  A  A                        RxC table:          Outcome 2:
  A  A                                             A    B
  A  B                        Outcome 1:    A      3    3
  A  B                                      B      2    2

The result, κ = 0.0, indicates that Outcome 1 is no more similar to Outcome 2 (a random classification outcome) than expected by chance, so it is an improvement over a random classification.

Example 2: Compare two different classification schemes to evaluate their relative performance. A κ value of 1.0 would indicate the two classification schemes perform equally well (though not necessarily identically, particularly in a spatial sense), whereas a κ value of 0.0 would indicate that the correspondence between the two outcomes is purely random:

10 events classified into 2 categories
Outcome 1 = result of a supervised classification scheme
Outcome 2 = result of a different classification scheme

Outcome Results:              Outcome 1 compared to Outcome 2:
  1  2
  A  B                        Number of correspondences in Category A = 1
  A  B                        Number of correspondences in Category B = 0
  A  B                        Number of False Positives = 5
  A  B                        Number of False Negatives = 4
  A  B                        Total Comparisons = 10
  A  A
  B  A                        RxC table:          Outcome 2:
  B  A                                             A    B
  B  A                        Outcome 1:    A      1    5
  B  A                                      B      4    0

and the result, κ = - 0.80, indicates that Outcome 1 is almost the exact antithesis of Outcome 2.


This example demonstrates that κ can measure antithetical as well as direct correspondence; i.e., a κ value of -1.0 indicates that two classification outcomes are mirror images of one another (but only in a non-spatial statistical sense; the two classification outcomes may still have very different spatial patterns).

Note: StatMost reports all κ values as |κ|; therefore, StatMost's reported Po and Pe values must be examined in order to determine whether κ < 0.

Threshold-Independent Measures: the Receiver Operating Characteristic Curve:

In classifying results into categories on the basis of a decision threshold, the classification results will differ according to the value of the threshold. Consider a classification threshold that is used to segregate two populations of measurements or probabilities (e.g., those indicating Presence and those indicating Absence) into two classes: Predicted Present and Predicted Absent.

The proportions of misclassified and correctly classified measurements are indicated as True and False Negative and Positive outcomes (TN, FP, etc.). In a binary error matrix, overall classification performance is represented as:

                      Actually Present    Actually Absent
  Predicted Present       a (TP)              b (FP)
  Predicted Absent        c (FN)              d (TN)

If a different threshold is applied, for example, to minimize False Negative classifications, then other classification rates will be affected:


That is, by lowering the decision threshold, the False Negative misclassification rate is much lower but at the expense of higher False Positive misclassification as well as higher True Positive and lower True Negative classification rates.

A threshold-independent measure of classification performance summarizes classification performance over all possible decision thresholds. It is therefore a more powerful measure of performance and can also be used to guide the optimal choice of a threshold to meet specified classification performance criteria.

The Receiver Operating Characteristic curve or ROC curve is one such threshold-independent measure of classification performance for dichotomous classification outcomes. The ROC curve is defined by calculating the True Positive (TPr) and False Positive (FPr) classification rates at all possible decision thresholds, z, that span the measurement (or probability) range used to make the dichotomous classification:

  TPr(z) = TP(z) / [TP(z) + FN(z)]

  FPr(z) = FP(z) / [FP(z) + TN(z)]

where TP, FP, etc. refer to the proportions of True Positive, False Positive, etc. classifications that result from classifying the measurement (or probability) into one of two possible outcomes for a particular decision threshold (z). A plot of TPr vs. FPr defines the ROC curve:

The area under the curve can vary between 0.0 and 1.0; an area of 0.5 (or a curve representing the diagonal line) indicates classification performance no better than by chance. Analogous to a negative kappa value, a curve that lies below the diagonal represents some degree of antithetical correlation. The ROC curve represents the ability of a measurement or probability variable to correctly classify an outcome. Thus, different classification schemes can be compared and ranked on the basis of the areas under their ROC curves.
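In the spirit of the "ROC_calculation.xls" spreadsheet, the sweep of decision thresholds can also be sketched in Python; the measurement values below are synthetic and the number of thresholds is an arbitrary choice:

import numpy as np

rng = np.random.default_rng(3)
present = rng.normal(25.0, 6.0, 80)     # synthetic measurements where the condition is present
absent = rng.normal(15.0, 6.0, 120)     # synthetic measurements where the condition is absent

thresholds = np.linspace(min(absent.min(), present.min()),
                         max(absent.max(), present.max()), 50)
tpr, fpr = [], []
for z in thresholds:                    # classify "present" if measurement >= z
    tp, fn = np.sum(present >= z), np.sum(present < z)
    fp, tn = np.sum(absent >= z), np.sum(absent < z)
    tpr.append(tp / (tp + fn))
    fpr.append(fp / (fp + tn))

# sort by FPr and integrate (trapezoid rule) to get the area under the ROC curve
order = np.argsort(fpr)
fpr_s, tpr_s = np.array(fpr)[order], np.array(tpr)[order]
auc = np.sum(np.diff(fpr_s) * (tpr_s[1:] + tpr_s[:-1]) / 2.0)
print(f"area under the ROC curve = {auc:.3f}")    # 0.5 = chance performance, 1.0 = perfect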

Note that, like the kappa statistic, this type of performance measure is strictly applicable only in a non-spatial statistical sense, meaning that it should not be used as the sole determiner of classification performance for spatial data. To illustrate this, consider a situation where a particular condition is present only at several known sites (o) and nowhere else:


  - - - -
  - - - -        ("-" represents locations that were not sampled
  - - - o         as well as where the condition truly does not occur)
  o - - o

Two classification schemes based on the same data and probability information produce two different predictions of likely occurrences (p) at unsampled locations:

  x p x x        x x x x
  p p x o        x x p o
  x x x x        x p p x        (x = predicted non-occurrence)
  o x x o        o x x o

These two classifications would be indistinguishable in a non-spatial statistical sense (with identical ROC curves or kappa statistics), but the prediction on the right is obviously more accurate in a spatial sense.

ROC analysis is available in SPSS and other statistical packages. "ROC_calculation.xls" is a spreadsheet that demonstrates a simple method for calculating ROC curves for any data set; it can be used as is or as a template for designing a custom application. Rather than calculating classification performance rates over a continuous range of decision thresholds, the spreadsheet calculates classification performance at ten decision thresholds corresponding to histogram classes defined by the user. The ROC curve in the figure example above was calculated with this spreadsheet from a training data set.

If specific classification performance criteria can be defined (for example, on the basis of relative cost or relative risk), the ROC curve can also be used to assist in choosing a decision threshold that best meets the specified criteria. See http://asio.jde.aca.mmu.ac.uk/resdesgn/roc.htm and the spreadsheet "ROC_calculation.xls" for more information.


Additional Information on the Kappa statistic:

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46.

Kraemer, H. C. (1982). Kappa coefficient. In S. Kotz and N. L. Johnson (Eds.), Encyclopedia of Statistical Sciences. New York: John Wiley & Sons.

Fielding, A. H. and Bell, J. F. (1997). A review of methods for the assessment of prediction error in conservation presence/absence models. Environmental Conservation, 24: 38-49.

Manchester Metropolitan University Dept. of Biological Sciences http://asio.jde.aca.mmu.ac.uk/resdesgn/presabs.htm

Additional Information on ROC Analysis:

Beck, J.R. and Schultz, E.K. (1986). The use of relative operating characteristic (ROC) curves in test performance evaluation: Archives of Pathological Laboratory Medicine, 110, p.13-20.

Zweig, M.H., & Campbell, G. (1993). Receiver-Operating Characteristic (ROC) Plots: A Fundamental Evaluation Tool in Clinical Medicine. Clin. Chem., 39 (4), pp. 561-577.

Simplified primer: http://www.mieur.nl/mihandbook/r_3_3/booktext/booktext_15_04_01_02o.htm

Excellent, clear, and straightforward description of how ROC curves are calculated: http://www.anaesthetist.com/mnm/stats/roc/#make


3. Correlation, Regionalized Variables, Exploratory Data Analysis

3.1. Definitions:
Geostatistics (spatial statistics):
A branch of applied statistics focusing on the characterization of the geospatial dependence of one or more attributes whose values vary over space (in 1-D, 2-D, or 3-D); and the use of that spatial dependence to predict (model) values at unsampled locations.

(time-series analysis--hydrographs, the stock market--is a close, 1-D relative of spatial statistics)

Prediction (estimation, interpolation, modeling) methods:
Any of a number of methods to produce estimates of a variable at unsampled locations based on values at discrete points. Examples include: tessellation (Thiessen polygons, triangular irregular network, Delaunay triangulation, etc.), moving average, inverse distance weighting, spline functions, trend surfaces. The geostatistical equivalent is kriging, a statistically unbiased linear estimator.

Spatial dependence (autocorrelation):
Most physical processes generate spatial variability such that two data values sampled close together tend to be more similar than two values sampled far apart. Where a strong spatial dependence exists, spatial statistical tools can be used to predict (model) values at unsampled locations better than other interpolation procedures.

Bivariate and multivariate dependence (crosscorrelation):
A physical process produces correlated variability in the values of two or more attributes, whose correlation can be used to understand the process and/or to make predictions.

3.2. Bivariate Correlation:
- an analysis of variation of two variables za, zb drawn from different but related populations
- analysis of bivariate regression is potentially important in spatial data analysis: if an extensively sampled secondary variable (eg: topographic elevation) is correlated to the primary variable we wish to estimate (eg: water table elevation), the spatial cross-correlation between the two variables can greatly help in estimating the primary variable by exploiting the spatial correlation information in the correlated secondary variable if its values are known at other locations where the primary variable is unsampled
- bivariate regression analysis is predicated on the assumption that the distributions of both za and zb are Gaussian; in some cases, this condition can be relaxed, e.g. the independent variable can take on discrete values
- theoretically, linear regression analysis is strictly valid only under the following conditions:
  - both variables are measured with no (or negligible) error
  - both variables are normally distributed
  - the variables are linearly correlated
  - the X values are independent
  - prediction error is homoscedastic (constant variance) and Gaussian
  - prediction errors are independent (not autocorrelated)


- So, obviously, just about any real set of data shouldn't be regressed! In practice, though, any of the above requirements can be relaxed and/or ignored. In fact, all the above assumptions can be thrown out IF the sole purpose of linear regression is to predict U from V, and some argue that regression coefficients should only be analyzed with a related technique (linear function analysis) because of the measurement errors present in both variables (Use and Abuse of Statistical Methods in the Earth Sciences, W.B. Size, ed., 1987, Oxford Univ Press; p. 78).

- in a bivariate relationship, the overall variance of the two variables can be thought of as composed of three variance components: 1) variance of variable A, 2) variance of variable B and 3) the variance arising from the correlation of A vs. B

- this latter variance is known as the bivariate covariance and is defined by:

(3.1)   Cov(za, zb) = [1/(n−1)] Σ_{i=1}^{n} (za,i − ma)(zb,i − mb)

- and the correlation coefficient is:

(3.2)   r = Cov(za, zb) / (sza · szb)

- the value of the square of the correlation coefficient (r²) represents the fraction of the total variance of za and zb that is due to their linear correlation

- where the population is not bivariate normal, use a non-parametric correlation method on the ranked (ordinal) form of the data, such as Kendall's-τ or Spearman's rank correlation coefficient (see Till, pp. 131-134; Cheeney, Ch. 6), which are based on the difference in rank position of all observations between two samples or between x and y values

3.3. Hypothesis Test of the Significance of Correlation
- from the definition of the correlation coefficient in equations (3.1) and (3.2), it is apparent that if values of the dependent variable, zb, are close to their mean (i.e. there is little variation) and the standard deviation of one or both variables, sza or szb, is relatively large, then the value of r can be small; in other words, by itself the value of r may be a poor estimator of the degree of correlation

- r is only a relative measure; one cannot compare different data sets on the basis of r values
- What constitutes a significant correlation? This is entirely subjective; it depends on the data, the variables examined, and the purposes of the analysis
- r does not indicate anything of the statistical significance of a correlation
- furthermore, if the assumption of bivariate normality is violated (as in a conditional variation, e.g., X is classified into categories such as hi, med, lo), then the r statistic is meaningless
- in other words, a more robust test is required to determine statistical significance of a bivariate correlation
- if the two variables, za, zb are normally distributed, then various types of t-tests can be applied to determine if the bivariate relationship between them is statistically significant (Till, p.86)
- for example, to test whether the value of the correlation coefficient represents a statistically significant bivariate correlation, a null hypothesis is set up to represent the situation where the population correlation coefficient, ρ, is zero (ie. the variables are not correlated); ie. Ho(ρ = 0) and the alternative hypothesis is Ha(ρ ≠ 0); this is an example of a two-tailed test, and the t-statistic is defined as:

(3.3)   t = r·√((n−2)/(1−r²))   with D.F. = (n−2)

where n is the number of sample data plotted in the regression

- therefore, if the value of |t| calculated from eq'n (3.3) is larger than t(n−2), α/2, we can reject Ho and conclude that a significant bivariate correlation (ρ ≠ 0) exists at a confidence level of 100(1−α)%

- for more on hypothesis testing of regression hypotheses, see Section S.3.1.
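A minimal sketch of the significance test in equation (3.3), using synthetic data; SciPy's pearsonr p-value is printed for comparison (neither the data nor the package is part of the course materials):

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
za = rng.normal(0.0, 1.0, 30)
zb = 0.6 * za + rng.normal(0.0, 1.0, 30)         # synthetic correlated secondary variable

n = len(za)
r = np.corrcoef(za, zb)[0, 1]
t_stat = r * np.sqrt((n - 2) / (1.0 - r ** 2))   # equation (3.3)
t_crit = stats.t.ppf(0.975, df=n - 2)            # two-tailed, alpha = 0.05

print(f"r = {r:.3f}, t = {t_stat:.2f}, critical t = {t_crit:.2f}")
print("reject Ho (rho = 0):", abs(t_stat) > t_crit)
print("pearsonr p-value:", stats.pearsonr(za, zb)[1])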

3.4. Conditional Expectation
- where a non-linear correlation is apparent, an alternative regression technique is to specify the means of U that correspond to different classes of V. This produces a conditional expectation: an expected value of U is defined within each V class; the expected values of U are conditional because they depend on the V class that is specified.

3.5. Regionalized Variables
- unlike a random variable which results from a purely random process (eg: the results of throwing dice), a regionalized variable (or r.v.) is distributed in space (and/or time) with location information attached to each measurement; in other words, the measurement variable (z) no longer represents a statistically independent univariate population, but is part of a multivariate population (x, y, zxy) in which the values of zxy may no longer be strictly independent in a statistical sense; that is, zxy may be correlated to x and y because of the physical process which generated it
- in general, any measurement which is associated with spatial or temporal coordinates is a r.v.
- denote this variable as z(x), where x designates spatial coordinates (in 1-D, x is x; in 2-D, x is x, y; etc.); e.g. if the regionalized variable is rainfall amount, each data point is denoted with coordinates x = (x, y), and z(x) is rainfall amount
- key concept: the variability of any regionalized measurement over/in the earth can be viewed as but one possible realization or outcome of a hypothetical random process (a God who throws dice?) which has distributed values of z(x) in just one of an infinite number of possible ways
- key concept: in geostatistics, the r.v. is assumed to be the outcome of a physical process (or multiple processes) whose spatial form represents a combination of a structured aspect (eg: lead concentration in contaminated soil due to the contamination process/history) and a random, unstructured aspect (eg: the natural lead content in, and proportions of, feldspar and limestone detritus in the soil); local 'trends' such as contaminant hot spots can be handled within the geostatistical modeling process, but significant regional trends are removed prior to analysis and modeling to ensure that z(x) represents a stationary r.v.; ie. the analysis, estimation and simulation of variation is done with the trend subtracted from the raw data, then the trend is added back in the final estimation (use Surfer or other software to model the trend, then remove it from the raw data); in this course, we will not focus on trend removal (see Koch and Link, 1971, chapt. 9); what constitutes a 'significant' trend is usually a matter of judgement and can be identified in exploratory data analysis or during variogram analysis

3.6. The Walker Lake Data set
- see Isaaks and Srivastava, 1989
- be able to outline the phases / steps in a geostatistical analysis
- the purpose of exploratory analysis: get to know your data, identify sampling patterns, sampling history and possible sampling biases, patterns of variability, bivariate correlations among multiple r.v.'s, etc.
- examine the exhaustive Walker Lake data set, its main features of spatial variability, evidence of heteroscedasticity, correlations of V, U, T
- compare the features of the sample data set; plot the sample data, get to know it, identify patterns, sampling bias, etc. (note that in a real analysis, you will not have access to the exhaustive data set; only God knows what the real situation looks like and you are trying to reconstruct that situation from a few paltry measurements)

- key concept: different populations of the r.v.'s may be present (e.g. the T variable may represent rock type); the ability to segregate V, U values into two possible classes (T = 0 or 1) raises the question of when population splitting should occur. There are no hard rules, but some guidelines are:

1) Is the distinction physically meaningful? One should have good reasons for splitting, even if the segregation is done subjectively;

2) After splitting into subpopulations, do sufficient data remain within all the subpopulations to justify statistical inference based on the numbers of data points? If some subpopulations have too few data to justify meaningful statistical measures, their segregation may not be useful; and

3) What is the goal of the study? Does population splitting contribute to the goal? For example, estimating the spatial distribution of species proportions from fifty calibration locales in an area with two very different types of land cover: if the distribution of species counts in the calibration sites is statistically indistinguishable between the two land cover types, splitting into subpopulations is unnecessary; if statistically distinct, then splitting must be performed prior to geostatistical analysis.

3.7. Problem Set II. Exploratory Data Analysis
- preliminary spatial analysis (the Walker Lake data file, sample locations, clustering, spatial sampling bias, contour maps, spatial trends, coordinate outliers, attribute value outliers, interval [hi/lo] maps, indicator maps)
- clean up the raw Walker Lake data file, calculate declustering weights, and calculate the normal-score transform of the V data using NSCORE


S.3.1. Hypothesis Testing of Regression Parameters:

A hypothesis test is often used to evaluate the significance of a linear regression fitted to a scatter plot of x, y data. To make a parametric test, the x and y variables are assumed (or known) to be normally distributed.

An alternative t-test to the one discussed in Equation (3.3) for the correlation coefficient involves a test of the significance of the calculated regression slope. The Null Hypothesis is defined as b = 0, where b is the calculated slope of the regression line; in other words, Ho posits that y varies independently of x and that the x and y values are not correlated. This is a two-tailed test because the Alternate Hypothesis is that b is not zero - ie. b could be either greater than or less than zero. If Ho were rejected, the Alternate Hypothesis, Ha, would be accepted: that x and y are correlated at the level of significance of the hypothesis test.

In this test, the t-statistic is defined by the formula:

(S.3.1)   t = (b − bo)·sx·√n / se

where bo is the specified slope (zero in this case), sx is the standard deviation of the independent variable, and

(S.3.2)   se = √[ n²·sy²·(1−r²) / (n(n−2)) ]   is the standard error of the correlated data

The significance level defining the critical t-statistic is α/2 (two-tailed test) and the test has n−2 degrees of freedom. The Null Hypothesis is rejected if the absolute value of the t-statistic calculated in Equation (S.3.1) exceeds the tabulated critical t-statistic, t(n−2), α/2, and the regression slope is said to be significant at the 100(1−α)% level.

Once the regression has been deemed significant, a confidence interval about the least-squares value of the slope can also be determined with the calculated t-statistic. In this case, we wish to determine the values of bo in Equation (S.3.1) that produce a calculated value of t that is equal to t(n−2), α/2. In other words:

(S.3.3)   C.I. for slope = b ± t(n−2), α/2 · se / (sx·√n)   at the 100(1−α)% level

where b is the calculated least-squares slope; any specified slope, bo, lying outside this interval would be rejected by the test.
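A minimal sketch of the slope test and confidence interval of equations (S.3.1)-(S.3.3), using synthetic data; sx and sy are taken as sample standard deviations, an assumption not spelled out in the notes:

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.normal(10.0, 3.0, 25)
y = 1.5 * x + rng.normal(0.0, 4.0, 25)        # synthetic linearly related data

n = len(x)
b = np.polyfit(x, y, 1)[0]                    # least-squares slope
r = np.corrcoef(x, y)[0, 1]
sx, sy = x.std(ddof=1), y.std(ddof=1)

se = np.sqrt(n ** 2 * sy ** 2 * (1 - r ** 2) / (n * (n - 2)))   # (S.3.2)
t_stat = (b - 0.0) * sx * np.sqrt(n) / se                       # (S.3.1) with bo = 0
t_crit = stats.t.ppf(0.975, df=n - 2)
half = t_crit * se / (sx * np.sqrt(n))                          # (S.3.3)

print(f"b = {b:.3f}, t = {t_stat:.2f}, critical t = {t_crit:.2f}")
print(f"95% C.I. for the slope: {b - half:.3f} to {b + half:.3f}")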


4. Autocorrelation and Spatial Continuity

4.1. One-dimensional Autocorrelation:
- the correlation function defined in equation (3.2) for bivariate data measures the degree of correlation between two related variables (not necessarily regionalized variables)
- within a single 1-D series of spatial measurements, an autocorrelation function can be defined that is a measure of the internal correlation between successive measurements
- for a 1-D series of measurements, the concept of autocorrelation is analogous to the covariance of bivariate data, where the variable zb,i becomes za,i+L, where L is the offset from position i in the data series; thus, the autocorrelation function is defined as:

(4.1)   rL = cov(z, z+L)/sz² = [ (1/(n−1)) Σ_{i=1}^{n} (zi − mi)(zi+L − mi+L) ] / sz²

where m represents the mean of the values defined for zero offset and for an offset of L (note that the definition of covariance contained in the numerator of equation 4.1 equals the population variance of z when L = 0)

- this function is calculated for various offsets or separations, L, called "lags" and the value of rL is plotted at each value of L to form the correlogram (equivalent to the standardized spatial covariance function defined below); a short computational sketch is given at the end of this section
- conceptually, equation 4.1 compares the degree of correlation or similarity between the time-series and its copy, where the copy is shifted by L units and a standardized covariance for the region of overlap is computed; note that at L = 0, the covariance term equals the sample variance and rL equals 1.0; as L increases, the amount of overlap decreases until the length of record compared is too small to produce reliable estimates of rL; n is therefore the number of common data values in the overlapped portion of the data series and its shifted copy

- since the degree of correlation is symmetric for positive and negative lag shifts, only the absolute value of L is plotted

- see Davis, p.235 for examples of different autocorrelative behavior
- a cross-correlation function can be defined for the comparison of two different 1-D time-series using the cross-covariance; the equation is identical, with the appropriate superscripts (za and zb denoting the two different variables) added to the zi and zi+L terms in equation (4.1) (see Davis, p.240-243)
- Note: for nominal data (e.g. sediment types in stratigraphic sequences), use cross-association (eg: correlation between two sequences of rock types) and a non-parametric test (such as a χ2-test) for significance of match (see Davis, p. 247-250)
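As referenced above, a short computational sketch of the correlogram of equation (4.1) for a synthetic 1-D series (a 1/n normalization of the overlap covariance is used for simplicity):

import numpy as np

rng = np.random.default_rng(6)
t = np.arange(200)
z = np.sin(t / 15.0) + 0.3 * rng.normal(size=t.size)    # synthetic autocorrelated series

def correlogram(z, max_lag):
    s2 = z.var()                                         # sample variance, C(0)
    r = []
    for L in range(max_lag + 1):
        head, tail = z[: len(z) - L], z[L:]              # overlapping portions of the series
        cov = np.mean((head - head.mean()) * (tail - tail.mean()))
        r.append(cov / s2)
    return np.array(r)

rL = correlogram(z, max_lag=40)
for L in (0, 5, 10, 20, 40):
    print(f"lag {L:3d}: r = {rL[L]:+.2f}")               # r = 1.0 at lag 0, decaying with lag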

4.2. Regionalized Variable vs. Random Function
- aside from random sampling and analytical errors, a geospatial measurement (the regionalized variable, e.g. copper content) is considered to be essentially deterministic (non-random), ie: there exists a single value of porosity, or one possible copper concentration at a point, or a unique water level at any given time in a given water well


- key concept: in order to develop spatial correlation statistics from such a variable from which to make geostatistical estimates at unsampled locations, the r.v. is assumed to represent one statistical sample drawn from an infinite number of possible samples all having identical statistical characteristics; for example, a hydrograph of water level vs. time represents only one possible statistical sample of the distribution of water levels vs. time drawn from an infinite number of possible distributions with the same statistical characteristics

- the fictitious domain of all such possible distributions is known as a Random Function (R.F.) and a single sample of the regionalized distribution of possible porosities or copper contents drawn from it is called a realization of the R.F.

- a R.F. is a function from which values can be drawn which have a variance about a mean, together with skewness, kurtosis, etc.; each of these statistical measures also depends on spatial position; therefore, to fully specify a R.F.'s statistical properties is a theoretical nightmare and so in practice many simplifications are used to represent the R.F.

- our task in geostatistics is to infer the nature of the R.F. controlling the spatial distribution of the regionalized variable (eg: the available water level or porosity measurements) so that this function can be used to estimate values of the variable of interest at unsampled locations or times

- this is analogous to the task of estimating a univariate population distribution (analogous to the R.F.) from a number of statistical samples (analogous to the realizations of the R.F.) drawn from the population; the key difference in geostatistical analysis is that we do not have multiple samples of the R.F. from which to infer its characteristics, only a single sample (the geospatial data representing the regionalized variable) consisting of a limited number of points

- for Star Trek fans: to use a crude analogy, if we could sample water level vs. time at a particular point in a river in a number of parallel universes, we'd be better able to estimate the underlying R.F. that describes water level fluctuations (in a manner analogous to the statistician who can draw red and black marbles from a bag many times to estimate the true proportion of marbles in the bag)

- key concept: the mean state of all possible samples of a r.v. would be equal to the average value of the r.v. within a single sample; this would be an example of ergodicity (from the Greek for "wandering"); in other words, a single sample would reflect the statistical character of the R.F.

- the assumption of ergodicity is therefore a crucial one; it is required to infer the properties of the R.F. from a single realization (the measured copper values or porosities); however, non-ergodic behavior is common in the real world, but the criteria for recognizing it are subjective (see discussion under kriging section, Week 7)

- note that the requirement of ergodicity is not unique to geostatistics: it is an implicit assumption in all inferential statistics: from estimating the proportion of red and black marbles in a bag, to inferring a population distribution from a histogram, to estimating a population mean from the sample mean


- key concept: because we only have one realization to work with in geostatistics, one more key concept must be introduced if we hope to use statistics to make estimates at unsampled locations: if homogeneous physical process(es) produced the variability in an r.v. over some area of interest, then the r.v. will demonstrate the same kind of variability over the entire area as it does within smaller subareas; in other words, the R.F. from which the r.v. is drawn is stationary, and statistical homogeneity (stationarity) can be assumed over the entire area; this is equivalent to saying that if we divide the study area into smaller parts, each part could be considered a different realization of the same R.F.; if so, then we can generate statistical estimates from each of the smaller parts as a kind of surrogate for drawing multiple samples from the R.F. (see Pannatier, p.77 and Fig.A.2)

- note that ergodicity requires stationarity, but that stationarity does not imply ergodicity

- there are various degrees of stationarity of the underlying process responsible for generating the regionalized variable; the type of stationarity assumed determines the kind of statistical inference that is permitted:
  - strict stationarity, in which all the random function's parameters are invariant from point to point, is rarely assumed because of the formidable challenge in describing all its parameters
  - second-order stationarity exists if the R.F.'s mean and variance are independent of location and the covariance depends only on separation or lag between measured values of the regionalized variable
  - intrinsic stationarity (or the intrinsic hypothesis) is the weakest assumption; certain physical processes (eg: Brownian motion) do not have a definable variance or covariance, but the variance of their increments does exist (see definition of the semivariogram below), in which case the semivariogram can be defined but other measures of spatial correlation (e.g. covariance) cannot

4.3. Spatial Statistical Moments (not to be confused with special romantic moments)
- the expected value of a Random Function Z(x) at any location x is equal to its mean:
  m(x) = E{Z(x)}, assumed constant for a stationary R.F. (in other words, the R.F. is assumed to have Gaussian pdf characteristics)

- in linear (two-point) geostatistics, there exist three second-order moments:

  variance:       Var{Z(x)} = E{[Z(x) − m(x)]²} = (1/n) Σ [Z(x) − m(x)]²

  covariance:     C(x, x+h) = E{[Z(x) − m(x)][Z(x+h) − m(x+h)]} = E{Z(x)·Z(x+h)} − m(x)·m(x+h)
                            = (1/n(h)) Σ [Z(x)·Z(x+h)] − m(−h)·m(+h)

  semivariogram:  γ(x, x+h) = 1/2 E{[Z(x) − Z(x+h)]²} = (1/(2n(h))) Σ [Z(x) − Z(x+h)]²   (valid only if no trend exists)


- Note: these second-order moments are not a function of location but are only dependent on lag separation, h

4.4. Practical Definition of Spatial Correlation Structure for a Regionalized Variable:
- based on second-order moments
- the term "variogram" is used here as a generic term for a spatial correlation estimator statistic
- specific second-order statistics are defined differently, and are used to better summarize non-normally distributed data or data with extreme-valued outliers
- the experimental measure of spatial covariance is defined by:

(4.2)   cov(zi, zi+h) = C(h) = (1/nh) Σ(i=1..nh) (zi − z̄i)(zi+h − z̄i+h)

                             = (1/nh) Σ(i=1..nh) (zi · zi+h)  −  z̄i · z̄i+h

                             = (1/nh) Σ(i=1..nh) (zi · zi+h)  −  z̄− · z̄+

where zi = ith data value at location (xi, yi), h = lag, nh is the number of data pairs separated by lag h, and the overbars represent the means of the two endpoints of the lag pairs (also often expressed as z̄− and z̄+, the tail and head means)

- the covariance is equal to the sample variance when h = 0, i.e. at zero lag offset the values of zi and zi+h are equal and equation (4.2) reduces to the definition of variance, thus C(0) = σ²; at large values of h, the values of zi and zi+h are poorly correlated and C(h) -> 0
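- as an illustration, a minimal NumPy sketch of equation (4.2) for a single omnidirectional lag bin follows; the function name, the lag-tolerance convention, and the choice to count each pair once are assumptions of this sketch, not prescriptions from these notes

```python
import numpy as np

def spatial_covariance(x, y, z, lag, tol):
    """Experimental spatial covariance C(h) of eq. (4.2) for pairs whose
    separation distance falls in [lag - tol, lag + tol]."""
    coords = np.column_stack((x, y))
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    i, j = np.nonzero((dist >= lag - tol) & (dist <= lag + tol))
    keep = i < j                           # count each data pair once
    tail, head = z[i[keep]], z[j[keep]]
    nh = tail.size
    # C(h) = (1/nh) * sum(z_i * z_{i+h})  -  mean(tail) * mean(head)
    c_h = np.mean(tail * head) - tail.mean() * head.mean()
    return c_h, nh
```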

- graphically, the theoretical covariance function looks like this:

[Figure: theoretical covariance function, C(h) plotted against lag h; C(h) starts at the sample variance σ² at h = 0 and decays toward 0 at large lags]

- typically, if the experimental values of cov(h) level off at large lags, the underlying random function is assumed to be second-order stationary (i.e. its mean and variance are independent of location, and covariance depends only on lag separation, h)

- the spatial covariance can be expressed as the inverted covariance (sometimes referred to as the non-ergodic covariance, because it does not assume that z̄i = z̄i+h in equation 4.2):

(4.3)   C'(h) = C(0) − C(h) , or equivalently C'(h) = σ² − C(h)

- note that in the presence of a trend, the covariance does not level off and can take on negative values; in that case, second-order stationarity does not exist



- the semivariogram definition is graphically derived from an h-scatterplot (Hohn, p.91-92)
- plot all values of zi+h vs. zi that are separated by a given value of h (in practical terms, a range of h values)
- the moment of inertia, Im, of the cloud of points about the 45° line is defined as:

Im = (1/nh) Σ(i=1..nh) di²

[Figure: h-scatterplot of z(x+h) vs. z(x) with the 1:1 line; d is the perpendicular distance of a point from the 1:1 line, related to the difference z(x) − z(x+h)]

- since the moment of inertia is defined about a 1:1 relationship, a right triangle defines the relationship between d and [z(x) - z(x+h)]

- the perpendicular distance from a point to the 1:1 line is d = |zi − zi+h|/√2; therefore 2di² = (zi − zi+h)², and the semivariogram is defined as

(4.4)   γ(h) = Im = [1/(2nh)] Σ(i=1..nh) (zi − zi+h)²
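- a companion sketch of equation (4.4), using the same hypothetical pair-selection conventions as the covariance sketch above:

```python
import numpy as np

def semivariogram(x, y, z, lag, tol):
    """Experimental semivariogram gamma(h) of eq. (4.4): half the mean squared
    difference of data pairs separated by a distance within [lag - tol, lag + tol]."""
    coords = np.column_stack((x, y))
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    i, j = np.nonzero((dist >= lag - tol) & (dist <= lag + tol))
    keep = i < j                           # count each data pair once
    diffs = z[i[keep]] - z[j[keep]]
    gamma_h = 0.5 * np.mean(diffs ** 2)
    return gamma_h, keep.sum()
```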

- the standardized semivariogram is defined as γs(h) = γ(h)/C(0), where C(0) is the sample variance

- similarly, the correlogram is the spatial covariance standardized (divided) by the sample variance, and is exactly analogous to the autocorrelation function under second-order stationarity:

ρ(h) = C(h)/C(0)

- it takes on a similar form to the other variogram measures when it is expressed as the inverted correlogram:

ρ'(h) = 1 − C(h)/C(0)

- the madogram is defined as the mean (sometimes median) absolute difference:

mad(h) = (1/nh) Σ(i=1..nh) |zi − zi+h|
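- minimal sketches of the correlogram and madogram for one lag bin follow the same pattern (the helper function and its conventions are hypothetical, not taken from any of the packages discussed below):

```python
import numpy as np

def lag_pairs(x, y, z, lag, tol):
    """Tail and head values of all data pairs separated by roughly the given lag."""
    coords = np.column_stack((x, y))
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    i, j = np.nonzero((dist >= lag - tol) & (dist <= lag + tol))
    keep = i < j
    return z[i[keep]], z[j[keep]]

def correlogram(x, y, z, lag, tol):
    """rho(h) = C(h)/C(0): spatial covariance standardized by the sample variance."""
    tail, head = lag_pairs(x, y, z, lag, tol)
    c_h = np.mean(tail * head) - tail.mean() * head.mean()
    return c_h / np.var(z)

def madogram(x, y, z, lag, tol):
    """mad(h): mean absolute difference of the lag pairs."""
    tail, head = lag_pairs(x, y, z, lag, tol)
    return np.mean(np.abs(tail - head))
```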

(for computational definitions of these various autocorrelation statistics, see Deutsch and Journel, p. 45 and Isaaks and Srivastava, p. 59)

- these various estimators of spatial correlation are all defined differently but all can be used to inform the parameters of spatial correlation structure during modeling
- except for the madogram, these various estimators are exactly equivalent representations of the underlying R.F. if second-order stationarity exists:



(4.5) ρ'(h) = C'(h)/C(0) = 1 - C(h)/C(0) = γ(h)/C(0) = γs(h)

- i.e. if the various variogram estimators level off beyond a certain lag then second-order stationarity can be assumed and the inverted covariance, the semivariogram, the standardized semivariogram, and the correlogram are all equivalent estimators of spatial continuity

- if the semivariogram does not level off but does not rise faster than the square of h, the random function is not second-order stationary and is said to obey the “intrinsic hypothesis”; in that case, only the semivariogram is a valid estimator of spatial continuity and the other estimators cannot be fitted with a model random function

- if experimental semivariogram values increase as fast as or faster than h², then the intrinsic hypothesis is invalid and the presence of a regional trend is indicated; in order to proceed with variogram analysis, the trend would have to be removed and variogram analysis and modeling performed on the residuals

- become familiar with the following nomenclature: nugget; sill; transition region; range of influence; types of variogram shapes: linear, parabolic, spherical, exponential, Gaussian
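- for reference while learning the nomenclature, here is a sketch of three common transition-model shapes written in terms of a nugget c0, a sill contribution c, and a range a; the factor of 3 in the exponential and Gaussian models is the common “practical range” convention (e.g. as used in GSLIB), so treat the exact parameterization as an assumption

```python
import numpy as np

# Note: strictly gamma(0) = 0; the nugget c0 appears as the limit as h -> 0+.

def spherical(h, c0, c, a):
    """Spherical model: reaches the sill c0 + c exactly at the range a."""
    h = np.asarray(h, dtype=float)
    inside = c0 + c * (1.5 * h / a - 0.5 * (h / a) ** 3)
    return np.where(h < a, inside, c0 + c)

def exponential(h, c0, c, a):
    """Exponential model: approaches the sill asymptotically (practical range a)."""
    h = np.asarray(h, dtype=float)
    return c0 + c * (1.0 - np.exp(-3.0 * h / a))

def gaussian(h, c0, c, a):
    """Gaussian model: parabolic near the origin, asymptotic sill (practical range a)."""
    h = np.asarray(h, dtype=float)
    return c0 + c * (1.0 - np.exp(-3.0 * (h / a) ** 2))
```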

- Note: in order to analyze variogram structure and to utilize the resulting correlation structure in subsequent kriging or simulation of a geometrically-deformed body (such as a stratified and folded reservoir or ore body, or a stratified aquifer of variable thickness, or a dipping geologic formation), or to avoid numerical instability problems associated with matrices built from coordinate data of different number size (such as x,y in millions of meters vs z in tens of meters), coordinate transformations are applied

- one common type of transformation, utilized for folded or variable-thickness geologic bodies, represents the original x,y,z coordinates in stratigraphic coordinates of an equivalent, simpler tabular body:

(4.6)   z'(x,y) = [top(x,y) − z(x,y)] / thickness(x,y)

where z and z' are the original and stratigraphic coordinates, respectively, at location (x,y); top(x,y) is the elevation of the top of the original stratified body; and thickness(x,y) is the thickness of the geologic body at location (x,y). This transformation "straightens out" a contorted or variable-thickness geologic body and represents it, for purposes of correlation structure analysis and modeling, as an equivalent tabular body.
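- a minimal sketch of transformation (4.6); the function and example numbers are hypothetical, and it assumes top(x,y) and thickness(x,y) are already known (or have been interpolated) at each data location

```python
import numpy as np

def stratigraphic_coordinate(z, top, thickness):
    """Relative vertical position within the stratified body (eq. 4.6):
    0 at the top of the unit, 1 at its base, regardless of folding or thickness."""
    return (top - z) / thickness

# e.g. a sample at 2950 ft elevation in a unit whose top is at 3000 ft and which is 200 ft thick
z_prime = stratigraphic_coordinate(np.array([2950.0]), np.array([3000.0]), np.array([200.0]))
print(z_prime)   # [0.25] -> one quarter of the way down from the top
```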

- another common transformation is to change the number size of x,y coordinates to match the number size of z coordinates; for example, if the range of z is from 2500 to 3000 ft above sea level, but x,y are in state plane feet with values of the order of 500,000, transform the x,y values as:

(4.7)   x' = x − xmin        y' = y − ymin

where x,y and x',y' are the original and transformed coordinates, respectively, and xmin, ymin are the minimum values of the x,y ranges; for 2D data, transformation (4.7) may be necessary for programs such as VarioWin which cannot handle x,y values larger than 5 digits
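- transformation (4.7) is a one-liner in practice; a sketch with hypothetical state plane coordinates:

```python
import numpy as np

x = np.array([502_310.0, 505_880.0, 509_140.0])        # hypothetical eastings, state plane ft
y = np.array([1_210_450.0, 1_214_020.0, 1_217_300.0])  # hypothetical northings, state plane ft

x_prime = x - x.min()   # eq. (4.7): shift the origin so the x,y values
y_prime = y - y.min()   # have a number size comparable to the z values
print(x_prime, y_prime)
```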



4.5. Computation of Experimental Variograms
- see reading handout (VarioWin's Chapter 2 short tutorial)
- concepts: lag bins, mean lags, overlapping bins, variable bins, directional search parameters
- rules of thumb: minimum data pairs per lag bin ca. 20-30; max. lag ca. 1/2 of max. separation
- see hand-out of variography process
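- a minimal sketch of the pairs-per-bin rule of thumb: tabulate how many data pairs fall in each lag bin before computing any variogram statistic (the bin width, maximum lag, and coordinates here are arbitrary, hypothetical choices)

```python
import numpy as np

def pairs_per_lag_bin(x, y, bin_width, max_lag):
    """Count data pairs per lag bin; bins with fewer than ~20-30 pairs give
    unreliable variogram estimates and should be widened or merged."""
    coords = np.column_stack((x, y))
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    dist = dist[np.triu_indices_from(dist, k=1)]   # each pair once, no self-pairs
    dist = dist[dist <= max_lag]                   # conventionally <= 1/2 of max separation
    edges = np.arange(0.0, max_lag + bin_width, bin_width)
    counts, edges = np.histogram(dist, bins=edges)
    return counts, edges

# usage with hypothetical coordinates
rng = np.random.default_rng(1)
counts, edges = pairs_per_lag_bin(rng.uniform(0, 100, 200), rng.uniform(0, 100, 200),
                                  bin_width=5.0, max_lag=70.0)
print(counts)
```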

4.6. Software Comparison
- various variogram programs differ slightly in the manner in which lags are represented for each lag interval, and in the flexibility with which lag intervals can be specified; programs which plot lags as their weighted means are preferable because the effects of clustered data on variogram shape can be visually identified

- GeoEas plots variogram statistics at the weighted mean lags and allows specification of unequal lag intervals

- VarioWin plots weighted mean lags but allows only for specification of equally-spaced lag intervals

- GSLIB's GamV3 also plots weighted mean lags and equal lag intervals
- commercial software such as GeoPack plots centered lags, in which the contribution of clustered data cannot be discerned from the correlation function plots
- variogram plots in ArcGIS's Geostatistical Analyst are a cross between a conventional variogram and a partial variogram cloud, showing the spread of variogram statistics in the cardinal directions within each lag bin

4.7. Problem Set III. Spatial Correlation 1 - Experimental Variography


