Moon Landing or Safari? A Study of Systematic Errors and their Causes in Geographic Linked Data

Date post: 19-Feb-2017
Upload: blake-regalia
Moon Landing or Safari?

A Study of Systematic Errors and their Causes in Geographic Linked Data

Krzysztof Janowicz¹, Yingjie Hu¹, Grant McKenzie², Song Gao¹, Blake Regalia¹, Gengchen Mai¹, Rui Zhu¹, Benjamin Adams³, and Kerry Taylor⁴

2016/10/01

¹ STKO Lab, University of California, Santa Barbara, USA

² Department of Geographical Sciences, University of Maryland, USA

³ Centre for eResearch, The University of Auckland, New Zealand

⁴ Australian National University, Australia

[email protected]

Linked Data

Linked Data represents data as collections of intra- and inter-linked graphs. The nodes and edges of these graphs are identified by Internationalized Resource Identifiers (IRIs). Linked Data builds on the Resource Description Framework (RDF), which enables Web documents and services to share structured data about anything.

Linked Data Significance

Linked Data is already in very wide use; it powers many ‘smart’ query services.

It is revolutionizing data publishing and retrieval.

Linked Data Significance

The Linked Data cloud grows every year, but it suffers from: data quality

issues, limited availability, and lack of data persistence. Data quality and

maintenance are known to be the most difficult issues facing data publishers.

Geographic Linked Data

Geographic data is one of the primary nexuses for structured data on the

world-wide web.

Data Scientists

As Geographic Information Scientists, it is our responsibility to:

• assess the quality of structured geo-data on the web

• discover systematic errors

• identify their root causes

• and publish our recommendations for best practices

Our motivation is only to improve data quality, not to criticize others for falling

victim to these errors.

Most of these errors are common. They tend to arise from easily overlooked

qualities of geographic information.

Errors

We have broken down systematic errors into the following categories:

1. Triplification and Extraction

2. Improper use of ontologies / Limited understanding of domain

3. Designing new ontologies / Oversimplified conceptual models

4. Data accuracy / Lack of ‘uncertainty’ framework

Triplification

(1) “Triplification” typically refers to the transformation of flat data into RDF.
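
As a rough illustration of triplification, the sketch below turns the rows of a flat table into N-Triples lines. The `http://example.org/place/` namespace, the column names, the `xsd:double` datatype, and the `triplify` helper are assumptions made for this example; only the W3C Basic Geo property IRIs are existing vocabulary terms.

```python
import csv
import io

GEO = "http://www.w3.org/2003/01/geo/wgs84_pos#"
XSD = "http://www.w3.org/2001/XMLSchema#"
BASE = "http://example.org/place/"  # hypothetical namespace for this sketch


def triplify(csv_text):
    """Convert flat CSV rows (name, lat, long) into N-Triples lines."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}{row['name'].replace(' ', '_')}>"
        for column in ("lat", "long"):
            value = float(row[column])  # fails loudly on malformed numbers
            triples.append(
                f'{subject} <{GEO}{column}> "{value}"^^<{XSD}double> .'
            )
    return triples


flat = "name,lat,long\nSanta Barbara,34.4258,-119.7142\n"
lines = triplify(flat)
for line in lines:
    print(line)
```

Converting each value with `float` before writing it out is a deliberate choice here: a row that cannot be parsed stops the conversion instead of silently producing a misplaced point.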

Natural Language Processing

(2) Extraction of semantically rich data from semi-structured or unstructured sources using natural language processing and machine learning, e.g., DBpedia and FRED¹.

Example sentence: Anakin Skywalker was a male human born on Tatooine who became a Jedi Knight, and later served the Galactic Empire as Darth Vader.

¹ http://wit.istc.cnr.it/stlab-tools/fred/demo

Triplification Errors

Both (1) and (2) are prone to the same types of errors, which arise while extracting and converting the source data from its original format.

Time to look for errors! How does one begin searching for systematic errors in worldwide geographic data? By using a map!

World Map Image

(Regarding the map on the previous slide):

The image has no base map, yet it is clearly recognizable as a map of the world. The Linked Data cloud has high spatial coverage!

The large “X” in the center of the map can be blamed on a parsing error. It appears when one of a coordinate’s two decimal values is used for both latitude and longitude, which places the point at (X, X) or (Y, Y).
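
One cheap sanity check for this kind of error is to flag points whose latitude and longitude are identical. The sketch below is only an illustration: the `suspicious_diagonal` helper, the tolerance, and the sample values are assumptions, and a genuine place can happen to satisfy the condition, so hits are candidates for review rather than confirmed errors.

```python
def suspicious_diagonal(points, tolerance=1e-9):
    """Return the (lat, lon) pairs whose two values are (nearly) identical."""
    return [(lat, lon) for lat, lon in points if abs(lat - lon) <= tolerance]


sample = [
    (34.4258, -119.7142),  # plausible pair, not flagged
    (48.8566, 48.8566),    # same value used twice
    (-33.87, -33.87),      # same value used twice
]
flagged = suspicious_diagonal(sample)
print(flagged)
```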

World Map Image

(Regarding the map on the previous slide):

Notice the grid-like structure in Russia, where points snap to regularly spaced positions. This is the result of decimal truncation, a process that forces a floating-point value into an integer.
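
The effect of truncation is easy to reproduce. In the minimal sketch below, the `truncate` and `looks_truncated` helpers and the sample coordinates are hypothetical; the point is that every truncated coordinate lands on a whole-degree grid intersection, and one degree of latitude spans roughly 111 km.

```python
def truncate(value):
    """Drop the fractional part, as a faulty conversion to integer would."""
    return float(int(value))


def looks_truncated(lat, lon):
    """True when both coordinates are whole numbers of degrees."""
    return lat == int(lat) and lon == int(lon)


original = (61.52401, 105.318756)
snapped = tuple(truncate(v) for v in original)
print(snapped, looks_truncated(*snapped))
```

A point with whole-degree coordinates is not necessarily wrong, but a dataset in which a large share of points passes `looks_truncated` deserves a closer look.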

Lastly, see those ghostly images of land masses where there shouldn’t be any? They are reflections of New Zealand and Australia mirrored about the Equator; we also found evidence of horizontal mirroring. There are two explanations: (1) negative signs, or a lack thereof; (2) improper parsing of quadrant identifiers. For example, “Oeste” (Spanish for West) starts with an ‘O’; if the parser discards the unrecognized letter, the longitude gets flipped onto the other side of the globe.
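
A parser can avoid this kind of mirroring by mapping every hemisphere letter it accepts to an explicit sign and rejecting anything else. The sketch below assumes input strings of the form `"<degrees> <letter>"` and a Spanish-language source; the `parse_coordinate` helper and the lookup table are illustrative, not taken from any particular triplification tool.

```python
# Hemisphere letters mapped to a sign. "O" is read as Oeste (West), which
# only holds for sources in languages such as Spanish; in German, "O"
# stands for Ost (East), so the table must match the source language.
HEMISPHERE_SIGN = {"N": 1, "S": -1, "E": 1, "W": -1, "O": -1}


def parse_coordinate(text):
    """Parse strings such as '58.38 O' or '34.6 S' into signed degrees."""
    value, letter = text.split()
    letter = letter.upper()
    if letter not in HEMISPHERE_SIGN:
        # Refuse to guess: dropping the letter would silently mirror the point.
        raise ValueError(f"unknown hemisphere identifier: {letter!r}")
    return HEMISPHERE_SIGN[letter] * float(value)


print(parse_coordinate("58.38 O"))
print(parse_coordinate("34.6 S"))
```

Raising an error on an unknown identifier trades convenience for safety: the record is lost until someone fixes the table, but it never shows up on the wrong side of the globe.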

Problem Essentials

Lessons Learned:

If triplification software does not account for the full range of variations in its input, unexpected geometries may occur.

Coordinate discrepancy rectangularization²

² http://dbpedia.org/page/Solar_Star

Ontology Use & Domain Errors

Ontology Fertility

Apparently, the Moon Landing took place in central Africa. So what’s the deal? Was it a Moon Landing or a Safari?

dbr:Tranquility_Base geo:lat 0.713889 ; geo:long 23.7078 .

The W3C Basic Geo specification declares WGS84 as the coordinate reference system, but this is not enforced through axiomatization. Nothing therefore prevents geo:lat and geo:long from being used to represent locations on any celestial body, not just Earth: the Moon, Mars, Tatooine, etc.

Oversimplifying vocabularies or schemas to make publication easier can lead to incorrect usage of an ontology.
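
One way to picture the missing constraint is a data model that records which body a position refers to and checks it before the point is treated as an Earth location. The sketch below is hypothetical throughout: the `Position` class, its `body` field, and the `plottable_on_earth` check have no counterpart in the W3C Basic Geo vocabulary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    lat: float
    lon: float
    body: str = "Earth"  # hypothetical field; Basic Geo has no equivalent


def plottable_on_earth(position):
    """Accept only positions that refer to Earth and lie within valid ranges."""
    return (
        position.body == "Earth"
        and -90.0 <= position.lat <= 90.0
        and -180.0 <= position.lon <= 180.0
    )


tranquility_base = Position(0.713889, 23.7078, body="Moon")
santa_barbara = Position(34.4258, -119.7142)
print(plottable_on_earth(tranquility_base), plottable_on_earth(santa_barbara))
```

The range check alone would not have caught the Moon Landing case, since its values are valid terrestrial coordinates; only the explicit reference body does.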

Domain Error

Let’s perform a simple, typical spatial query using Linked Data:

How many people live around the Gulf of Guinea?

Population = 7.6 billion

According to our query results, the Gulf of Guinea has the highest population density in the world. How can this be? We did not expect planet Earth to be located in its own reference system! Earth has a population value, so it gets counted in our results as if it were just another populated place.
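
The query failure can be imitated with a few toy records. In the sketch below, the city populations, the `kind` labels, the bounding box, and the placement of Earth at (0, 0) are all assumptions made for illustration; only the 7.6 billion figure comes from the slide.

```python
# Toy records; city figures and "kind" labels are made up for illustration.
places = [
    {"name": "Earth", "kind": "Planet", "lat": 0.0, "lon": 0.0,
     "population": 7_600_000_000},
    {"name": "Lagos", "kind": "City", "lat": 6.45, "lon": 3.39,
     "population": 9_000_000},
    {"name": "Accra", "kind": "City", "lat": 5.56, "lon": -0.20,
     "population": 2_000_000},
]


def population_in_box(records, south, west, north, east, kinds=None):
    """Sum the population of records inside a bounding box.

    When `kinds` is given, only records of those kinds are counted.
    """
    total = 0
    for r in records:
        inside = south <= r["lat"] <= north and west <= r["lon"] <= east
        if inside and (kinds is None or r["kind"] in kinds):
            total += r["population"]
    return total


box = dict(south=-5.0, west=-10.0, north=10.0, east=12.0)
naive = population_in_box(places, **box)
filtered = population_in_box(places, **box, kinds={"City"})
print(naive, filtered)
```

Restricting the query to an expected type is only a workaround; it relies on the types themselves being assigned correctly.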

Data Quality via Ontology Tradeoff

Lessons Learned:

It is critical for data publishers to fully understand an ontology’s intended uses

when selecting one to construct their Linked Data.

Lifting data is not trivial; it needs to involve both domain experts and

experienced Linked Data developers.

All spatial data should have a CRS, but this imposes another hurdle-to-entry

for data publishers. Too little restriction threatens data quality; too much

deters data publishers.

Discrepancies among data sources and a lack of provenance information are toxic to researchers, who cannot ascertain the data’s reliability.

Modeling Errors

DBpedia shows 1.8k 0-degree persons, 371k 1-degree persons, and 31k 2-degree persons. Higher-degree persons may result from a lack of information about their birth or death place, or may be fictitious characters typed as Person. 0-degree persons indicate modeling errors.
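
Assuming that a 0-degree person is a resource typed as Person that carries coordinates directly, as the Terry Fox example suggests, such resources can be found by intersecting two sets of subjects. The triples, the `ex:` prefix, and the `zero_degree_persons` helper below are illustrative and do not reproduce the authors' actual query.

```python
# Toy triples; the resources and the "ex:" prefix are illustrative only.
triples = [
    ("ex:Person_A", "rdf:type", "dbo:Person"),
    ("ex:Person_A", "geo:lat", "52.9"),
    ("ex:Person_B", "rdf:type", "dbo:Person"),
    ("ex:Person_B", "dbo:birthPlace", "ex:Town_C"),
    ("ex:Town_C", "geo:lat", "49.2"),
]


def zero_degree_persons(triples):
    """Return persons whose own resource carries a latitude."""
    persons = {s for s, p, o in triples
               if p == "rdf:type" and o == "dbo:Person"}
    located = {s for s, p, o in triples if p == "geo:lat"}
    return sorted(persons & located)


print(zero_degree_persons(triples))
```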

Terry Fox

“Terry Fox” is one of those 0-degree persons: his resource includes spatial coordinates. It looks like the person Terry Fox was accidentally matched to the statue of Terry Fox.

Terry Fox

Plotting the coordinates on a map reveals a place called “Mt. Terry Fox

Provincial Park”. This clearly demonstrates the consequences of a modeling

error.

Data Accuracy and an Uncertainty Framework

Accuracy

There are 136,964 combinations of geometries³ among places with cardinal direction relations on DBpedia. According to our analysis, which uses 8 equal divisions of the compass rose (π/4 each), nearly 1/3 of these relations are inaccurate.

³ Formatted in Well-Known Text: geographic coordinates.
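
A check of this kind can be sketched by computing the direction between two point geometries and naming the 45-degree sector it falls in. The version below rests on several assumptions that the slides do not confirm: sectors are centered on the compass directions, the bearing is computed on a flat plane, and the `cardinal_direction` helper and its sector names are invented for the example.

```python
import math

SECTORS = ["north", "northeast", "east", "southeast",
           "south", "southwest", "west", "northwest"]


def cardinal_direction(from_point, to_point):
    """Name the 45-degree sector in which to_point lies, seen from from_point.

    Points are (lon, lat) pairs, matching WKT axis order. The bearing is
    computed on a flat plane, which is a simplification.
    """
    d_lon = to_point[0] - from_point[0]
    d_lat = to_point[1] - from_point[1]
    # atan2(x, y) gives the angle measured clockwise from north.
    bearing = math.degrees(math.atan2(d_lon, d_lat)) % 360.0
    # Shift by half a sector so that "north" covers 337.5 to 22.5 degrees.
    return SECTORS[int(((bearing + 22.5) % 360.0) // 45.0)]


print(cardinal_direction((0.0, 0.0), (1.0, 1.0)))
```

A stated relation would then count as inaccurate when the computed sector differs from the direction asserted in the data.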

Accuracy

Part of the blame for inaccurate cardinal direction relations can be placed on using point geometries for regions, which makes the relation true in only a portion of the cases.

Uncertainty

Decimal and coordinate values can be misleading; their precision implies

accuracy to the degree of the least significant digit; e.g., the centroid of Santa

Barbara is accurate to 1.1 microns:

POINT(-119.71416473389 34.425834655762)

Also, it has an area of 108.69662101458125 km², which is accurate to a few hundred femtometers (10e−13).

Clearly, there is a need for an uncertainty framework when it comes to providing

measurement data.
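
The micron figure for the centroid follows from simple arithmetic: the longitude has 11 decimal places, and one degree spans roughly 111 km at the equator. The sketch below makes that calculation explicit; the `implied_resolution_m` helper and the rounded constant are assumptions, and a degree of longitude is shorter away from the equator, so this is an upper bound for that axis.

```python
METERS_PER_DEGREE = 111_320.0  # approximate length of one degree at the equator


def implied_resolution_m(coordinate_text):
    """Ground distance represented by the last decimal digit of a coordinate."""
    _, _, fraction = coordinate_text.partition(".")
    return METERS_PER_DEGREE * 10.0 ** -len(fraction)


resolution = implied_resolution_m("-119.71416473389")
print(f"{resolution * 1e6:.2f} microns")
```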

Conclusion

Conclusions:

Geographic Information plays a key role in interlinking structured data on the

Web. Improving geo-data quality is pivotal to improving the functionality and

reliability of Linked Data for science, research, applications, etc.

We identified systematic errors in geographic Linked Data, discussed their

causes, and suggested ways to improve its quality and reliability.

Striking the balance between (a) keeping models simple and easy to use, so that they enable streamlined data publishing processes, and (b) avoiding hazardous oversimplifications remains a major challenge to be addressed in future work.
