
Post on 17-Jan-2016


Lecture 18: Data Quality Issues

Ch. 14


Introduction

• Spatial data and analysis standards are important because of the range of organizations producing and using spatial data, and the amount of data transferred between these organizations.

• There are several types of standards:
  – Data standards
  – Interoperability standards
  – Analysis standards
  – Professional and certification standards

Introduction (continued)

• National and international standards organizations are important in defining and maintaining geospatial standards:
  – Federal Geographic Data Committee (FGDC), which focuses on the national spatial data infrastructure (www.fgdc.gov)
  – International Spatial Data Standards Commission, which is a clearinghouse and gateway for international standards
  – Open Geospatial Consortium (OGC), which is developing interoperability standards. The Web Map Service (WMS) standard is an example.

GIS Certification

• What kind of certification is available?

• Two primary options:
  – Geographic Information Systems Professional (GISP) certification is based on your work and volunteering experience.
  – ESRI Technical Certifications are test based.
• A third option is a university-based certification.

The Geospatial Competency Model

GIS Professional Certification

URISA is the founding member of the GIS Certification Institute, the organization that administers professional certification for the field and is dedicated to advancing the industry.

The minimum number of points needed to become a certified GIS Professional, as detailed in the three point schedules, is 150 points. Thus, all applicants are expected to document achievements valued at a minimum of 150 points. To ensure that applicants have a broad foundation, specific minimums in each of the three achievement categories must be met or exceeded. These minimums are as follows:

• Education: 30 Points
• Experience: 60 Points
• Contributions: 8 Points

The additional 52 points can be counted from any of the three categories.
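As a rough illustration of the point rules on this slide, the sketch below checks a hypothetical applicant's totals against the three category minimums and the 150-point overall requirement. The function and variable names are invented for this example and are not part of any official GISCI tool.

```python
# Category minimums and overall total as stated on the slide.
MINIMUMS = {"education": 30, "experience": 60, "contributions": 8}
TOTAL_REQUIRED = 150


def meets_gisp_points(points):
    """Return True if every category minimum and the overall total are met.

    `points` maps a category name to documented points (hypothetical input).
    """
    category_ok = all(points.get(cat, 0) >= need for cat, need in MINIMUMS.items())
    total_ok = sum(points.get(cat, 0) for cat in MINIMUMS) >= TOTAL_REQUIRED
    return category_ok and total_ok


# The minimums alone add up to 98 points, so 52 more may come from any category.
flexible_points = TOTAL_REQUIRED - sum(MINIMUMS.values())
print(flexible_points)

# Meets all minimums and the total.
print(meets_gisp_points({"education": 40, "experience": 100, "contributions": 10}))
# Meets the minimums but falls short of 150 overall.
print(meets_gisp_points({"education": 30, "experience": 60, "contributions": 8}))
```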


A Sample of University Certificates

• UMM – undergraduate

• USM – undergraduate/graduate

• UM – graduate

• Penn State

• University of Denver

• University of Southern California

• George Mason University


Spatial Data Standards

• Data – measurements and observations

• Data quality – a measure of the fitness for use of data for a particular task (Chrisman, 1994).

• It is the responsibility of the user to ensure that the data are fit for the task.

• Metadata – data about the data


Spatial Data Standards

• Spatial Data Standards – methods for structuring, describing and delivering spatially-referenced data.

• Media Standards – the physical form of the data (CD/download etc).

• Format Standards – specify data file components and structures. These standards aid in data transfer.

• Spatial Data Accuracy Standards – document the quality of the positional and attribute accuracy.

• Document Standards – define how we describe spatial data.

GIS Is Not Perfect

A GIS cannot perfectly represent the world for many reasons, including:
• The world is too complex and detailed.
• The data structures or models (raster, vector, or TIN) used by a GIS to represent the world are not discriminating or flexible enough.
• We make decisions (how to categorize data, how to define zones) that are not always fully informed or justified, and are always biased.
• It is impossible to make a perfect representation of the world, so uncertainty is inevitable.
• Uncertainty degrades the quality of a spatial representation.

Concepts Related to Data Quality

• Related to individual data sets:
  – Errors – flaws in data.
  – Accuracy – the extent to which an estimated value approaches the true value.
  – Precision – the recorded level of detail of your data.
  – Bias – the systematic variation of the data from reality.
    • Personal bias
    • Instrument bias
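The difference between accuracy, precision, and bias can be made concrete with a small calculation. The sketch below uses invented elevation readings against an assumed known benchmark: bias is taken as the mean error, accuracy is summarized by the root-mean-square error (RMSE), and precision is the number of decimals recorded. The data and variable names are hypothetical.

```python
import math

true_elevation = 100.0  # hypothetical known benchmark elevation, in metres
# Hypothetical repeated readings recorded to millimetre precision.
readings = [102.113, 101.872, 102.045, 101.958, 102.012]

errors = [r - true_elevation for r in readings]
bias = sum(errors) / len(errors)  # systematic offset from the true value
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))  # closeness to the true value

print(f"bias = {bias:.3f} m, RMSE = {rmse:.3f} m")

# Recording the same readings to the nearest metre lowers the precision,
# but the error relative to the benchmark stays at about the same size.
coarse = [round(r, 0) for r in readings]
coarse_rmse = math.sqrt(sum((c - true_elevation) ** 2 for c in coarse) / len(coarse))
print(f"RMSE at 1 m precision = {coarse_rmse:.3f} m")
```

In this made-up case the readings are very precise (three decimals) yet inaccurate, because an instrument bias of about two metres dominates the error.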


Concepts Related to Data Quality

• Related to source data:
  – Resolution – the smallest feature in the data set that can be displayed.
  – Generalization – simplification of objects in the real world to produce scale models and maps.

Resolution and generalization of raster datasets


Figure 10.3 Scale-related generalization


Data Sets Used for Analysis

Must be:
  – Complete – spatially and temporally
  – Compatible – same scale, units of measure, measurement level
  – Consistent – both within and between data sets
  – And applicable for the analysis being performed

Sources of Error (Uncertainty) in GIS


A Conceptual View of Uncertainty

Real World → Conception → Source Data, Measurements & Representation → Data Conversion and Analysis → Result

Error propagation occurs at every step.

Uncertainty in The Conception of Geographic Phenomena

Many spatial objects are not well defined or their definition is to some extent arbitrary, so that people can reasonably disagree about whether a particular object is x or not. There are at least four types of conceptual uncertainty:

• Spatial uncertainty
• Vagueness
• Ambiguity
• Regionalization problems

Spatial uncertainty

• Spatial uncertainty occurs when objects do not have a discrete, well-defined extent.
• They may have indistinct boundaries.
• They may have impacts that extend beyond their boundaries.
• They may simply be statistical entities.
• The attributes ascribed to spatial objects may also be subjective.

Vagueness (obscureness)

• Vagueness occurs when the criteria that define an object as x are not explicit or rigorous.
• For example:
  – In a land cover analysis, how many oaks (or what proportion of oaks) must be found in a tract of land to qualify it as oak woodland?
  – What incidence of crime (or resident criminals) defines a high crime neighborhood?

Ambiguity

Ambiguity occurs when y is used as a substitute, or indicator, for x because x is not available.

• The link between direct indicators and the phenomena for which they substitute is straightforward and fairly unambiguous.

• Indirect indicators tend to be more ambiguous and opaque.

• Of course, indicators are not simply direct or indirect; they occupy a continuum. The more indirect they are, the greater the ambiguity.


Regionalization problems

• Regional geography is largely founded on the creation of a mosaic of zones that make it easy to portray spatial data distributions.
• A uniform zone is defined by the extent of a common characteristic, such as climate, landform, or soil type.
• Functional zones are areas that delimit the extent of influence of a facility or feature, for example, how far people travel to a shopping center or the geographic extent of support for a football team.
• Regionalization problems occur because zones are artificial.

Uncertainty in the measurement of geographic phenomena

Error occurs in physical measurement of objects. This error creates further uncertainty about the true nature of spatial objects.

• Physical measurement error
• Digitizing error
• Error caused by combining data sets with different lineages

Physical measurement error

Instruments and procedures used to make physical measurements are not perfectly accurate. For example, a survey of Mount Everest might find its height to be 8,850 meters, with an accuracy of plus or minus 5 meters.

• In addition, the earth is not a perfectly stable platform from which to make measurements. Seismic motion, continental drift, and the wobbling of the earth's axis cause physical measurements to be inexact. (Examples: GPS error, remote sensing error.)

Digitizing error

• A great deal of spatial data has been digitized from paper maps.

• Digitizing, or the electronic tracing of paper maps, is prone to human error.
  – Lines may be drawn too far, not far enough, or missed entirely. Errors caused by digitizing mistakes can be partially, but not completely, fixed by software.
  – Additional error occurs because adjacent data digitized from different maps may not align correctly. This problem can also be partially corrected through a software technique called rubbersheeting.

Digitizing Error

Any digitized map requires:
• Considerable post-processing
  – Check for missing features
  – Connect lines
  – Remove spurious polygons
• Some of these steps can be automated

Error caused by combining data sets with different lineages

• Data sets produced by different agencies or vendors may not match because different processes were used to capture or automate the data.
  – For example, buildings in one data set may appear on the opposite side of the street in another data set.
  – Error may also be caused by combining sample and population data or by using sample estimates that are not robust at fine scales.
  – "Lifestyle" data are derived from shopping surveys and provide business and service planners with up-to-date socioeconomic data not found in traditional data sources like the census. Yet the methods by which lifestyle data are gathered and aggregated to zones or are compared to census data may not be scientifically rigorous.

Uncertainty in the representation of geographic phenomena

• Representation is closely related to measurement.
• Representation is not just an input to analysis, but sometimes also the outcome of it. For this reason, we consider representation separately from measurement.
  – The world is infinitely complex, but computer systems are finite.
  – Representation is all about the choices that are made in capturing knowledge about the world.
• Uncertainty in the earth model: ellipsoid models, datum, projection types
• Uncertainty in the raster data model (structure)
• Uncertainty in the vector data model (structure)

Uncertainty in the raster data structure

• The raster structure partitions space into square cells of equal size (also called pixels).
• Spatial objects x, y, and z emerge from cell classification, in which Cell A1 is classified as x, Cell A2 as y, Cell A3 as z, and so on, until all cells are evaluated.
• A spatial object x can be defined as a set of contiguous cells classified as x.
• Commonly, a cell is not purely one thing or another, but might contain some x, some y, and maybe a bit of z within its area.
• These impure cells are termed mixed pixels or "mixels."
• Because a cell can hold only one value, a mixel must be classified as if it were all one thing or another. Therefore, the raster structure may distort the shape of spatial objects.

Error in raster

• Because of the distortions due to flattening, cells in a raster can never be perfectly equal in size on the Earth's surface.
• When information is represented in raster form, all detail about variation within cells is lost, and instead the cell is given a single value.
• Rules for assigning that single value: largest share, central point (e.g., USGS DEM), and mean value (e.g., remote sensing imagery).

Mean value examples for cells that contain the values 8 and 6:
• 8 × (1/6) + 6 × (5/6) = 6.33
• 8 × (3/4) + 6 × (1/4) = 7.5
• 8 × (1/7) + 6 × (6/7) = 6.29
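The three ways of assigning a single value to a mixed cell can be sketched in a few lines. This is a minimal illustration, assuming each cell is described either by (value, area fraction) pairs or by a small square grid of sample values; the function names and the sample grid are invented for the example. The three area-fraction cases reproduce the mean-value arithmetic from the slide.

```python
def area_weighted_mean(parts):
    """Mean-value rule: parts is a list of (value, area_fraction) pairs."""
    return sum(value * fraction for value, fraction in parts)


def largest_share(parts):
    """Largest-share rule: the value that covers most of the cell."""
    return max(parts, key=lambda part: part[1])[0]


def central_point(samples):
    """Central-point rule: the value found at the middle of the cell.

    samples is a hypothetical odd-sized square grid of values inside one cell.
    """
    mid = len(samples) // 2
    return samples[mid][mid]


# The three mixed cells used in the mean-value arithmetic on the slide.
cells = [
    [(8, 1 / 6), (6, 5 / 6)],
    [(8, 3 / 4), (6, 1 / 4)],
    [(8, 1 / 7), (6, 6 / 7)],
]
for parts in cells:
    print(round(area_weighted_mean(parts), 2), largest_share(parts))

# A made-up 3 x 3 sampling of one cell; the centre sample decides the value.
print(central_point([[8, 8, 6], [8, 6, 6], [6, 6, 6]]))
```

Note that the same cell can receive different values under different rules, which is one source of representation error in raster data.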

Figure 10.8 Problems with remotely sensed imagery: (left) example of a satellite image with cloud cover (A), shadows from topography (B), and shadows from cloud cover (C); (right) an urban area showing a building leaning away from the camera. Source: Ian Bishop (left) and Google UK (right)

Uncertainty in the vector data structure

• Socioeconomic data (facts about people, houses, and households) are often best represented as points.
• For various reasons (to protect privacy, to limit data volume), data are usually aggregated and reported at a zonal level, such as census tracts or ZIP Codes.
• This distorts the data in two ways:
  – First, it gives them a spatially inappropriate representation (polygons instead of points);
  – Second, it forces the data into zones whose boundaries may not respect natural distribution patterns.

Map representation error

Map scale        Ground distance, accuracy, or resolution (corresponding to 0.5 mm map distance)
1:1,250          0.625 m
1:2,500          1.25 m
1:5,000          2.5 m
1:10,000         5 m
1:24,000         12 m
1:50,000         25 m
1:100,000        50 m
1:250,000        125 m
1:1,000,000      500 m
1:10,000,000     5 km
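Each row of the table follows from multiplying the 0.5 mm map distance by the scale denominator. A minimal sketch of that conversion is given below; the function name and its default argument are chosen for this example only.

```python
def ground_distance_m(scale_denominator, map_distance_mm=0.5):
    """Ground distance in metres for a map distance at scale 1:scale_denominator."""
    return scale_denominator * map_distance_mm / 1000.0


for denominator in (1_250, 24_000, 1_000_000, 10_000_000):
    print(f"1:{denominator:,} -> {ground_distance_m(denominator):g} m")
```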


Uncertainty in the data conversion and analysis of geographic phenomena

Uncertainties in data lead to uncertainties in the results of analysis. Data conversion and spatial analysis methods can create further uncertainty:

• Data conversion error
• Georeferencing and resampling
• Projection and datum conversions
• The ecological fallacy
• The modifiable areal unit problem (MAUP)
• Classification errors

• The ecological fallacy
  The ecological fallacy is the mistake of assuming that an overall characteristic of a zone is also a characteristic of any location or individual within the zone.
• The Modifiable Areal Unit Problem (MAUP)
  The results of data analysis are influenced by the number and sizes of the zones used to organize the data. The Modifiable Areal Unit Problem has at least three aspects:
  1. The number, sizes, and shapes of zones affect the results of analysis.
  2. The number of ways in which fine-scale zones can be aggregated into larger units is often great.
  3. There are usually no objective criteria for choosing one zoning scheme over another.
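Both problems can be seen in a toy calculation. The sketch below aggregates the same eight fine-scale values (invented for this example) under two different zoning schemes. The zone means change with the scheme (MAUP), and a zone mean may describe none of the units inside it (ecological fallacy). The helper name and the data are hypothetical.

```python
# Hypothetical values (e.g. income in thousands) for 8 fine-scale units in a row.
incomes = [10, 12, 11, 90, 95, 13, 12, 88]


def zone_means(values, starts):
    """Average values within zones; starts lists the first index of each zone."""
    edges = list(starts) + [len(values)]
    return [sum(values[a:b]) / (b - a) for a, b in zip(edges, edges[1:])]


scheme_a = zone_means(incomes, [0, 4])     # two zones of four units each
scheme_b = zone_means(incomes, [0, 3, 5])  # three zones of 3, 2 and 3 units

print("Scheme A zone means:", scheme_a)
print("Scheme B zone means:", scheme_b)

# Ecological fallacy check: how far is each unit in zone 1 of scheme A
# from that zone's mean?
zone_1_units = incomes[0:4]
print([round(abs(v - scheme_a[0]), 2) for v in zone_1_units])
```

Under scheme A the first zone looks moderately well off, while scheme B splits the same units into a low-value zone and a high-value zone; neither picture is more "correct" than the other.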

Ecological Fallacy Example

Source: http://www.gistutor.com/concepts/24-intermediate-concept-tutorials/57-ecological-fallacy-in-gis.html

MAUP Example

Source: http://www.indiana.edu/~gisci/courses/g438/lectures/gis_census.html

Classification error and quality check

Selecting ROIs

ROI classes: Alfalfa, Cotton, Grass, Fallow

Classification Result

Background: ETM+, 7/15/01
Top image: IKONOS, Oct. 2000

Confusion Matrix

Rows are classification results; columns are ground truth.

                      Grass  Alfalfa  Cotton  Chili  Fallow (corn)  Total  User accuracy (%)
Grass                   110       22       0      0              0    132               83.3
Alfalfa                   5      105       0      0              0    110               95.5
Cotton                    0        0     945      5              0    950               99.5
Chili                     0        0      50     42              0     92               45.7
Fallow                    0        0       0      0            484    484                100
Total                   115      127     995     47            484   1768
Producer accuracy (%)  95.7     82.7    95.0   89.4            100

Sum of the diagonal (correctly classified pixels): 1686

$$\text{Overall accuracy} = \frac{1686}{1768} = 95.4\%$$

$$\text{Kappa index} = \frac{1686 - (132 \times 115 + 110 \times 127 + 950 \times 995 + 92 \times 47 + 484 \times 484)/1768}{1768 - (132 \times 115 + 110 \times 127 + 950 \times 995 + 92 \times 47 + 484 \times 484)/1768} = 92.4\%$$
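The accuracy measures derived from a confusion matrix can be sketched as follows. The counts are the ones in the table; the variable names are chosen for this example, and the kappa computation follows the standard Cohen's kappa definition (observed agreement corrected for the agreement expected by chance from the row and column totals).

```python
classes = ["Grass", "Alfalfa", "Cotton", "Chili", "Fallow"]
# Rows: classification results; columns: ground truth (counts from the table).
matrix = [
    [110, 22, 0, 0, 0],
    [5, 105, 0, 0, 0],
    [0, 0, 945, 5, 0],
    [0, 0, 50, 42, 0],
    [0, 0, 0, 0, 484],
]

n = sum(sum(row) for row in matrix)                     # all pixels checked
diagonal = [matrix[i][i] for i in range(len(classes))]  # correctly classified
row_totals = [sum(row) for row in matrix]               # pixels labeled as each class
col_totals = [sum(col) for col in zip(*matrix)]         # ground-truth pixels per class

# User accuracy: correct / labeled as the class (row total).
user_accuracy = [d / r for d, r in zip(diagonal, row_totals)]
# Producer accuracy: correct / ground-truth pixels of the class (column total).
producer_accuracy = [d / c for d, c in zip(diagonal, col_totals)]
overall_accuracy = sum(diagonal) / n

# Kappa: agreement beyond what the row and column totals would give by chance.
chance = sum(r * c for r, c in zip(row_totals, col_totals)) / n
kappa = (sum(diagonal) - chance) / (n - chance)

for name, ua, pa in zip(classes, user_accuracy, producer_accuracy):
    print(f"{name:8s} user {ua:6.1%}  producer {pa:6.1%}")
print(f"overall accuracy {overall_accuracy:.1%}, kappa {kappa:.3f}")
```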

Bases of Confusion Matrix

• Producer accuracy is the probability that the classifier has labeled an image pixel as Class A given that the ground truth is Class A.
• User accuracy is the probability that a pixel is Class A given that the classifier has labeled the pixel as Class A.
• Overall accuracy is the total classification accuracy.
• Kappa index (another measure of overall accuracy) is a more useful index for evaluating accuracy because it discounts the agreement expected by chance.
  – Errors of commission represent pixels that belong to another class but are labeled as belonging to the class.
  – Errors of omission represent pixels that belong to the ground truth class but that the classification technique has failed to classify into the proper class.

Error Propagation

Real World → Conception → Measurement & Representation → Data Conversion and Analysis → Result

• The errors in the input will propagate to the output of the operation.
• Error propagation measures the impacts of error (uncertainty) in data on the results of GIS operations.

Finding and Modeling Errors

• Checking for errors
  – Visual inspection during data editing and cleaning.
  – Attributes can be checked by using annotation, line colors and patterns.
  – Double digitizing
  – Statistical analysis may identify extreme values of attributes.
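One simple way to flag extreme attribute values statistically is an interquartile-range (IQR) screen. The sketch below is only one possible check, not a method prescribed by the lecture; the function name, the multiplier k = 1.5, and the sample elevations are assumptions made for illustration.

```python
import statistics


def flag_extremes(values, k=1.5):
    """Return the values that fall outside the quartiles -/+ k * IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical spot elevations in metres; 2140 looks like a keying error for 214.
elevations = [212, 215, 209, 2140, 211, 214, 210, 213]
print(flag_extremes(elevations))
```

A flagged value is only a candidate error; it still has to be checked against the source, since genuine extreme values do occur.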


Finding and Modeling Errors

• Error modeling
  – 1. Epsilon modeling
    • Based on a method of line generalization, and adapted by Blakemore.
    • It places an error band around a digitized line, describing the probable distribution of error.
    • Error distribution is subject to debate:
      – Normal curve
      – Piecewise quartile distribution
      – Bimodal
    • The epsilon band can be used in analyses to improve the confidence of the user in the result.
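The basic idea of an epsilon band can be sketched as a distance test: a point closer than epsilon to a digitized line lies inside the band, so its position relative to the line is uncertain. The code below is a simplified illustration only, assuming planar coordinates and a fixed band width; it does not implement Blakemore's full set of containment categories, and the function names are invented.

```python
import math


def point_segment_distance(p, a, b):
    """Shortest planar distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:  # degenerate segment: a and b coincide
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its end points.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def within_epsilon_band(point, line, epsilon):
    """True if the point lies within epsilon of any segment of the polyline."""
    return any(
        point_segment_distance(point, a, b) <= epsilon
        for a, b in zip(line, line[1:])
    )


boundary = [(0, 0), (10, 0), (10, 10)]  # hypothetical digitized line
print(within_epsilon_band((5, 0.4), boundary, epsilon=0.5))  # inside the band
print(within_epsilon_band((5, 3.0), boundary, epsilon=0.5))  # clear of the band
```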

Figure 10.17 Point-in-polygon categories of containment. Source: Blakemore (1984)

Finding and Modeling Errors

• Error modeling
  – 2. Monte Carlo simulation – used in overlays.
    • Simulates input data error by adding random noise to the line coordinates of the map data.
    • Each input is assumed to be characterized by an estimate of positional error.
    • This changes the shape of the line.
    • The process is repeated multiple times and the randomized data put through the GIS analyses.
    • Output:
      – A number
      – A map
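The Monte Carlo steps listed above can be sketched for a single polygon. In this minimal example the "GIS analysis" is just an area calculation, the positional error is assumed to be independent Gaussian noise with standard deviation sigma on every coordinate, and the output is a number (mean area and its spread). The parcel, sigma, run count, and function names are all assumptions made for illustration.

```python
import random
import statistics


def polygon_area(vertices):
    """Shoelace formula for a simple polygon given as (x, y) vertices."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(vertices, vertices[1:] + vertices[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def simulate_area(vertices, sigma, runs=1000, seed=42):
    """Perturb every vertex with Gaussian noise and recompute the area each run."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    areas = []
    for _ in range(runs):
        noisy = [(x + rng.gauss(0, sigma), y + rng.gauss(0, sigma)) for x, y in vertices]
        areas.append(polygon_area(noisy))
    return statistics.mean(areas), statistics.stdev(areas)


parcel = [(0, 0), (100, 0), (100, 100), (0, 100)]  # hypothetical 100 m square
mean_area, sd_area = simulate_area(parcel, sigma=1.0)
print(f"nominal area {polygon_area(parcel):.0f} m2")
print(f"simulated mean {mean_area:.0f} m2, standard deviation {sd_area:.0f} m2")
```

The standard deviation of the simulated areas gives a rough measure of how much the assumed positional error in the input matters for the result.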


Figure 10.18 Simulating effects of DEM error and algorithm uncertainty on derived stream networks


Managing GIS Error

• To manage errors we must track and document them.

• The concepts introduced earlier provide a checklist of quality indicators:
  – Accuracy, Precision, Resolution, Generalization, Bias, Compatibility, Completeness and Consistency
• These should be documented for each data layer.

Managing GIS Error

• Data quality information can be used to create a data lineage.
• A data lineage is a record of the data history that presents essential information about the development of the data.
• This becomes the metadata.

Living with uncertainty

• Uncertainty is inevitable and easier to find.
• Use metadata to document the uncertainty.
• Use sensitivity analysis to find the impacts of input uncertainty on output.
• Rely on multiple sources of data.
• Be honest and informative in reporting the results of GIS analysis.
• The US Federal Geographic Data Committee lists five components of data quality: attribute accuracy, positional accuracy, logical consistency, completeness, and lineage (for details see www.fgdc.gov).

Basics of FGDC

• Federal Geographic Data Committee (FGDC) metadata answers the who, what, where, when, how and why questions of geospatial data.

• The data structure and elements defined for FGDC metadata are described fully in the “Content Standard for Digital Geospatial Metadata” (CSDGM).


SEVEN SECTIONS OF FGDC

The Federal Geographic Data Committee (FGDC) Content Standard for Digital Geospatial Metadata (CSDGM) organizes a metadata record into seven main sections:
  – Identification Information
  – Data Quality Information
  – Spatial Data Organization Information
  – Spatial Reference Information
  – Entity and Attribute Information
  – Distribution Information
  – Metadata Reference Information
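The seven sections can serve as a simple checklist when reviewing a metadata record. The sketch below is an illustration only: it treats a record as a plain Python dictionary keyed by section title, which is an assumption made for this example and not the CSDGM file format, and the sample record and its fields are invented.

```python
CSDGM_SECTIONS = [
    "Identification Information",
    "Data Quality Information",
    "Spatial Data Organization Information",
    "Spatial Reference Information",
    "Entity and Attribute Information",
    "Distribution Information",
    "Metadata Reference Information",
]


def missing_sections(record):
    """Return the section titles that are absent or empty in a record (a dict)."""
    return [name for name in CSDGM_SECTIONS if not record.get(name)]


# Hypothetical, partially completed record.
record = {
    "Identification Information": {"title": "Town roads", "purpose": "Base mapping"},
    "Data Quality Information": {"positional_accuracy": "+/- 5 m"},
    "Metadata Reference Information": {"contact": "GIS office"},
}

for name in missing_sections(record):
    print("Not yet documented:", name)
```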

Identification Information

• What is the name of the dataset?
• What is the subject or theme of the information included?
• What is the scale of the dataset?
• What are the attributes of the dataset?
• Where is the geographic location of the dataset?
• Who developed the dataset?
• Who provided the source material for the dataset?
• Who will publish the dataset?
• When were the features of the dataset identified?
• How are the features of the dataset depicted?
• Why was the data set created?
• Are there restrictions on accessing or using the data?
• Are external files available that are related to the dataset?

Source: http://www.maine.gov/megis/policies/megisfgdc.rtf

Data Quality Information

• How reliable are the data?
• What are their limitations or inconsistencies?
• What is the positional and attribute accuracy?
• Is the dataset complete?
• Were the consistency and content of the data verified?
• Where can the sources of the data be located?
• What processes were applied to these sources and by whom?

Spatial Data Organization

• What spatial data model was used to encode the spatial data?
• How many and what kind of spatial objects are included in the dataset?
• Are methods other than coordinates, such as street addresses, used to encode locations?

Spatial Reference

• Are coordinate locations encoded using longitude and latitude?
• What map projection is used?
• What horizontal datum and/or vertical datum are used?
• What parameters should be used to convert the data to another coordinate system?

Entity and Attribute Information

• What geographic information (roads, houses, elevation, temperature, etc.) is described?
• How is this information coded?
• What do the codes mean?
• What source was used for defining the attributes or codes (e.g., the Cowardin classification)?


Distribution

• From whom can the data be obtained?

• What formats are available?

• What media are available?

• Are the data available online?

• What is the price of the data?


Metadata Reference

• When were the metadata compiled, and by whom?

• When was the metadata record created?

• Who is the responsible party?

• When were the metadata last updated?