Data Quality Management

Date post: 05-Jan-2016
Upload: cale
Transcript
Page 1: Data Quality Management

CS 128/ES 228 - Lecture 14a 1

Data Quality Management

http://www.brownsmarina.com/fun.html

Geospatial errors can cause real-life problems!

Page 2: Data Quality Management

One management strategy …

Page 3: Data Quality Management

Murphy’s Law

Ignoring data quality issues usually doesn’t work very well

Page 4: Data Quality Management

Some geospatial goofs

Page 5: Data Quality Management

This one’s worse…

Mars Climate Orbiter (MCO) was lost on 23 Sep 1999 when it failed to enter orbit around Mars and instead crashed into the planet, destroying the $125 million craft, which was part of a $328 million mission.

The root cause of the failure was a computer program that was supposed to provide its output in newton-seconds (N·s) but instead provided it in pound-force seconds (lbf·s).

http://lamar.colostate.edu/~hillger/unit-mixups.html#mco

http://www.boeing.com/companyoffices/gallery/images/space/d2_mars_climate_orbiter_01.htm
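The unit mix-up described above is the kind of error that software can catch if every value carries its unit. The following is a minimal sketch, not the mission software: the `Impulse` class, the unit strings, and the sample value of 100 are all made up for illustration. Only the conversion factor (1 lbf = 4.4482216152605 N) is a fixed definition.

```python
from dataclasses import dataclass

LBF_TO_N = 4.4482216152605  # newtons per pound-force (exact by definition)


@dataclass(frozen=True)
class Impulse:
    """A hypothetical impulse value that always travels with its unit."""
    value: float
    unit: str  # "N*s" or "lbf*s"

    def to_newton_seconds(self) -> float:
        # Convert explicitly; refuse to guess when the unit is unknown.
        if self.unit == "N*s":
            return self.value
        if self.unit == "lbf*s":
            return self.value * LBF_TO_N
        raise ValueError(f"unknown impulse unit: {self.unit!r}")


reported = Impulse(100.0, "lbf*s")       # made-up number for illustration
print(reported.to_newton_seconds())      # roughly 4.45 times larger in N*s
```

Reading the bare number 100 as newton-seconds would understate the impulse by a factor of about 4.45, which is why the unit is stored with the value rather than assumed.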

Page 6: Data Quality Management

And these are really bad!

Just a 'map error'? The China Daily website carries a cartoon of the damaged US plane at Hainan Island's airbase and asks sarcastically if Sunday's collision "might be due to another map error" - a reference to the US bombing of the Chinese embassy in Belgrade in 1999. "Last time it's due to a map error, and this time another map error? What about the next?"

http://news.bbc.co.uk/1/hi/world/monitoring/media_reports/1260185.stm

It might be due to another map error

China Daily

Page 7: Data Quality Management

What is error?

“Error is the physical difference between the real world and the GIS facsimile”

-Heywood, Cornelius, & Carver, p. 178

Errors are impossible to avoid, but can be managed

Page 8: Data Quality Management

A Data Management Model

Data acquisition

Data representation & analysis

Data outputs

Page 9: Data Quality Management

Data acquisition errors

Scientists use the term “error” for two very different concepts:

natural variability

actual mistakes

Page 10: Data Quality Management

Take a sidewalk …

What’s its width? 1.77, 1.82, 1.69 … meters

a. “Error” (natural variability): mean width = 1.76 m, range 1.69 - 1.82 m

b. “Error” (actual mistake): mean = 1.67 ft
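The sidewalk numbers above can be reproduced in a few lines. This is a sketch under stated assumptions: the three widths come from the slide, but the plausibility bounds of 1.0 to 3.0 m and the helper `looks_like_mistake` are invented here to show how a value reported in feet might be caught.

```python
from statistics import mean

widths_m = [1.77, 1.82, 1.69]   # the three measurements from the slide, in meters

# (a) Natural variability: summarize with a mean and a range.
print(f"mean = {mean(widths_m):.2f} m, "
      f"range {min(widths_m):.2f} - {max(widths_m):.2f} m")

# (b) Actual mistake: a hypothetical plausibility check (bounds are assumed).
PLAUSIBLE_M = (1.0, 3.0)
FT_TO_M = 0.3048                # meters per foot (exact)


def looks_like_mistake(width_m: float) -> bool:
    """Flag widths that fall outside the assumed plausible range."""
    low, high = PLAUSIBLE_M
    return not (low <= width_m <= high)


# 1.67 is plausible as meters, but 1.67 ft is only about half a meter.
print(looks_like_mistake(1.67))            # False
print(looks_like_mistake(1.67 * FT_TO_M))  # True
```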

Page 11: Data Quality Management

Accuracy vs. Precision

Figure 10.1, An Introduction to Geographic Information Systems by Heywood, Cornelius, and Carver

Page 12: Data Quality Management

Random error vs. Bias
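One way to separate the two ideas numerically, offered as a sketch rather than the lecture's own method: bias is how far the average reading sits from the true value, while random error is the scatter of the readings around their own average. The "true" width and the four readings below are made up.

```python
from statistics import mean, stdev

true_value = 1.76                      # assumed true width, in meters
readings = [1.86, 1.88, 1.85, 1.87]    # made-up readings from a miscalibrated tape

bias = mean(readings) - true_value     # systematic offset: hurts accuracy
spread = stdev(readings)               # random scatter: limits precision

print(f"bias = {bias:+.3f} m, spread = {spread:.3f} m")
```

In this invented case the readings agree closely with each other (precise) but all sit about 10 cm too high (biased), so taking more readings would not fix the problem.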

Page 13: Data Quality Management

Where does lack of precision come from?

Natural variability

Poor input assumptions

Imprecise equipment

Sloppy measurement

Accumulated error

Page 14: Data Quality Management

Random error is often “normal”

mean

Standard deviation

Page 15: Data Quality Management

About 95% of observations fall within ±2 s.d. of the mean

mean

Mean - 2 s.d.

Mean + 2 s.d.
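The 95% figure can be checked directly for a normal distribution. A minimal sketch using only the standard library; the function name `fraction_within` is chosen here for illustration.

```python
import math


def fraction_within(k: float) -> float:
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


print(f"{fraction_within(2):.4f}")     # 0.9545
print(f"{fraction_within(1.96):.4f}")  # 0.9500
```

So "±2 s.d." is a convenient rounding: exactly 95% corresponds to about ±1.96 s.d.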

Page 16: Data Quality Management

Means have smaller variability than single measurements

S.E. (mean) = standard deviation / √n

If n = 4, √n = ?
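A short sketch of the standard error formula, answering the question on the slide. The standard deviation of 0.06 m is a made-up value.

```python
import math


def standard_error(sd: float, n: int) -> float:
    """Standard error of a mean computed from n independent measurements."""
    return sd / math.sqrt(n)


sd = 0.06                          # made-up standard deviation, in meters
for n in (1, 4, 16):
    print(n, math.sqrt(n), standard_error(sd, n))
```

With n = 4, √n = 2, so the mean of four measurements varies half as much as a single measurement. Halving it again takes 16 measurements, not 8.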

Page 17: Data Quality Management

Where does lack of accuracy come from?

Dubious source data

Incompatible source data

Data collected at different times through different methods, possibly in different formats

Bias

Page 18: Data Quality Management

How can we fix it?

Benchmarks

ex. National Geodetic Survey maintains a database of survey “monuments” at http://www.ngs.noaa.gov/cgi-bin/datasheet.prl

Otherwise – just measure variability

http://upload.wikimedia.org/wikipedia/commons/thumb/6/66/USCGS-E134.jpg/617px-USCGS-E134.jpg

Page 19: Data Quality Management

Data representation errors

Transference error

Data storage errors

Analysis errors

Page 20: Data Quality Management

Where does transference error come from?

Typos, etc.

Less likely with automated data collection and transformation

Can be prevented through diligence and software “sanity” checks

Format conversion

Many inter-format conversions cause loss/corruption of data/information
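A software “sanity” check of the kind mentioned above can be very simple. The sketch below is one hypothetical example, not a standard API: the function `coordinate_problems` and the sample coordinates are invented, and the only facts relied on are the valid ranges of latitude and longitude.

```python
def coordinate_problems(lat: float, lon: float) -> list[str]:
    """Return a list of range violations for a latitude/longitude pair in degrees."""
    problems = []
    if not -90.0 <= lat <= 90.0:
        problems.append(f"latitude {lat} outside [-90, 90]")
    if not -180.0 <= lon <= 180.0:
        problems.append(f"longitude {lon} outside [-180, 180]")
    return problems


print(coordinate_problems(34.05, -118.24))   # [] : plausible point
print(coordinate_problems(-118.24, 34.05))   # latitude flagged: fields swapped
print(coordinate_problems(340.5, -118.24))   # latitude flagged: misplaced decimal
```

A range check only catches values that are impossible. A typo that produces a valid but wrong coordinate still needs diligence, or a comparison against an independent source.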

Page 21: Data Quality Management

Something got lost in the translation

“geographic information systems is an interesting course”

“지리적인 정보 시스템은 재미있는 과정이다”

“The geography information system is the process which is fun”

Thanks to http://babelfish.altavista.com/babelfish/tr

Page 22: Data Quality Management

Raster ↔ Vector conversions

Aliasing is an intrinsic problem of GISs
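Aliasing can be seen by rasterizing a simple vector shape and measuring what survives. The sketch below assumes a cell-center rule (a cell belongs to the shape if its center lies inside) and a grid aligned with the circle's bounding square. The function name, the radius of 10, and the cell sizes are illustrative choices.

```python
import math


def raster_area(radius: float, cell: float) -> float:
    """Area of a circle after rasterizing it with a cell-center rule.

    Assumes 2 * radius / cell is a whole number so the grid fits the bounding square.
    """
    n = round(2 * radius / cell)            # cells per side of the bounding square
    count = 0
    for i in range(n):
        for j in range(n):
            x = -radius + (i + 0.5) * cell  # coordinates of the cell center
            y = -radius + (j + 0.5) * cell
            if x * x + y * y <= radius * radius:
                count += 1
    return count * cell * cell


true_area = math.pi * 10 ** 2
for cell in (5, 1):
    print(f"cell {cell}: raster area {raster_area(10, cell)}, true area {true_area:.2f}")
```

With 5-unit cells the circle becomes 12 square cells (area 300); with 1-unit cells it becomes 316 cells (area 316). Neither matches the true area of about 314.16, and a finer grid shrinks the stair-step error without removing it.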

Page 23: Data Quality Management

Digitization errors

Page 24: Data Quality Management

Topology errors

Figure 10.5, An Introduction to Geographic Information Systems by Heywood, Cornelius, and Carver

Page 25: Data Quality Management

Data storage/retrieval errors

Hardware failure

Hardware Limitations

Page 26: Data Quality Management

What is a hardware limitation?

Numbers in a computer are stored in a finite number of bits.

Using too few bits can cause round-off error.

Box 9.2, Principles of Geographic Information Systems by Burrough and McDonnell
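Round-off from too few bits is easy to demonstrate by forcing a coordinate through 32-bit storage. A minimal sketch: the northing value is made up, and `as_float32` is a helper written here with the standard `struct` module.

```python
import struct


def as_float32(x: float) -> float:
    """Round-trip a number through 32-bit floating-point storage."""
    return struct.unpack("f", struct.pack("f", x))[0]


northing = 4651234.567            # made-up UTM-style northing, in meters
stored = as_float32(northing)

print(stored)                     # 4651234.5
print(northing - stored)          # about 0.067 m lost to round-off
```

A 32-bit float carries about 7 significant decimal digits, so at this magnitude it can only represent steps of 0.5 m. Storing the same value in 64 bits keeps the millimeters.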

Page 27: Data Quality Management

Where do errors of data rot come from?

Link rot

Not Found

The requested URL /cs/dlevine/ was not found on this server.

Apache/1.3.27 Server at www.xxx.edu Port 80

Poor “style”

E.g. “Employees may appeal to Sr. Carney” as opposed to “Employees may appeal to the President of the University”

Page 28: Data Quality Management

Where do errors of analysis come from?

How long do you have? …

Mistaken queries

Analyzing layers with different datums or coordinate systems

Comparing attributes with incompatible units

Page 29: Data Quality Management

More errors of analysis …

Inappropriate resolution

Combining rasters/vectors with different resolutions

Using exact/abrupt surface fits when approx./gradual is appropriate (or vice versa)

Page 30: Data Quality Management

Data output errors

Maps

Reports

Page 31: Data Quality Management

Junket at taxpayers’ expense?

Did a politician misuse federal funds to visit Alaska on the way to official business in Japan?

Muehrcke. Map Use, 2nd ed. p. 395

Page 32: Data Quality Management

No - Intentional map error*

*More like lying with maps!

Muehrcke. Map Use, 2nd ed. p. 395

Page 33: Data Quality Management

Should maps be as accurate as possible?

Map simplification

Features are omitted

Area features become lines or points

Exaggeration

Features’ apparent size is “increased” (e.g. hydrants)

Features’ separation is increased on the map for visibility

Must Mapquest be accurate?

Page 34: Data Quality Management

Reporting significance of findings

Hypothesis testing

What does the term “significant” mean to scientists?

Page 35: Data Quality Management

Are two means really different?

These two normal distributions have a very large overlap. The means of the two populations are not significantly different, because the overlap is > 5% of the area under the curves. t would be very small.

http://www.steve.gb.com/science/statistics.html#t

Page 36: Data Quality Management

What about these two means?

http://www.steve.gb.com/science/statistics.html#t

Page 37: Data Quality Management

These means are also significantly different - why?

http://www.steve.gb.com/science/statistics.html#t

Page 38: Data Quality Management

How do we actually test for statistical differences?

Student’s t-test

t = (difference in means) / (measure of variability)
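The ratio above can be computed by hand. The sketch below uses one common choice for the "measure of variability", the unequal-variance (Welch) standard error of the difference. That choice is an assumption, since the slide does not say which form is intended, and the two samples are made up.

```python
import math
from statistics import mean, variance


def welch_t(a, b) -> float:
    """t statistic: difference in means over the standard error of that difference."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


site_a = [1.77, 1.82, 1.69, 1.75]   # made-up widths, in meters
site_b = [1.95, 2.01, 1.90, 1.98]   # made-up widths, in meters

print(f"t = {welch_t(site_a, site_b):.2f}")
```

A large |t| means the difference in means is large relative to the scatter. Deciding whether it is significant still requires comparing t against the t distribution for the appropriate degrees of freedom, which this sketch leaves out.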

Page 39: Data Quality Management

Three Commandments of Data Reporting

Thou Shalt Not …

I. Report insignificant digits (or omit significant trailing zeros)

II. Report means without also reporting sample sizes and variability

III. Report results as “significant” (or even worth talking about) without doing the appropriate statistical tests.
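Commandments I and II can be followed mechanically when formatting a result. A minimal sketch: the `report` helper is invented here, the widths are the sidewalk measurements from earlier in the lecture, and the choice of two decimal places reflects measurements recorded to the nearest centimeter.

```python
from statistics import mean, stdev


def report(values, unit: str, decimals: int) -> str:
    """Format a mean together with its variability and sample size."""
    return (f"mean = {mean(values):.{decimals}f} {unit}, "
            f"s.d. = {stdev(values):.{decimals}f} {unit}, "
            f"n = {len(values)}")


print(report([1.77, 1.82, 1.69], "m", 2))
# mean = 1.76 m, s.d. = 0.07 m, n = 3
```

Fixed-decimal formatting drops digits the measurements cannot support and keeps a significant trailing zero, so a mean of 1.8 would print as 1.80.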

Page 40: Data Quality Management

How do we minimize (NOT avoid) error?

“CONSTANT VIGILANCE”

-- “Mad Eye” Moody, Defense Against the Dark Arts Instructor, Hogwarts School of Witchcraft and Wizardry

http://news.bbc.co.uk/1/shared/spl/hi/pop_ups/05/entertainment_goblet_of_fire/html/3.stm

