Metadata challenges: providing stronger assessments of data quality

PUBLIC PROTECTION AND ETHICAL GEOSPATIAL DATA DISSEMINATION

AN INITIATIVE OF GEOIDE (PROJECT IV-23)

Dr Lex Comber


Acknowledgements

• The ideas in this presentation are the result of an ongoing collaboration with Mark Gahegan

• This is a work in progress...

Aims

• To expand on current notions of metadata for spatial data
• To explore metadata objectives, content and roles
• To consider possible metadata developments
• To propose an agenda for evolving metadata

Statement

“Data quality can only be determined in light of its intended use: quality is not absolute but is relative to its use”

• Data is frequently (mostly?) used for purposes other than its original use
• 3rd party data; more users; greater access, e.g. SDIs, INSPIRE, GRID etc
• Users need to understand the uncertainties when they use the data
• A dataset will have different ‘quality’ for different users (and uses)

Outline

• Introduction
  – Spatial data variability: semantics, measurement & abstraction
  – Examples
• Context
  – Users, prototypes & semiotic triangles
  – Standards

• A research agenda for more nuanced metadata

Introduction: spatial data variability

• Many different ways of conceptualising the world
  – Grounded in semantics and meaning
  – Different meanings and understandings
  – Sometimes called an ‘ontology’
• Geographic representation
  – Real world infinitely complex
  – Representation involves
    • Abstraction, Aggregation, Simplification etc
• Examples

Grainger, A (2007). The influence of end-users on the temporal consistency of an international statistical process: the case of tropical forest statistics. Journal of Official Statistics, 23(4): 553-592

Example: UN FRA

Spatial characterization can change

Example: sea level

Fact: A bridge collapsed!
Where: Laufenburg on the river Rhine
Why: The already completed bridge on the Swiss side has a difference in altitude (level) of 0.54 metres compared to the German counterpart
How: The two neighbouring countries use varying (different) measuring methods

Source: http://www.laufenburg.ch

Differences in sea level (cm)

Example: what is a forest?

From Comber, A.J., Fisher, P.F., Wadsworth, R.A. (2005). What is land cover? Environment and Planning B: Planning and Design, 32: 199-209

Does not include species, area, strip width

[Figure: countries and organisations plotted by Canopy Cover (%), x-axis 0 to 90, against Tree Height (m), y-axis 0 to 16. Labelled points: Portugal, Turkey, Kyrgyzstan, Switzerland, Estonia, Denmark, Kenya, UNESCO, Sudan, Tanzania, Ethiopia, South Africa, Jamaica, Zimbabwe, Gambia, Mexico, Israel, United States, Belgium, Luxembourg, Malaysia, PNG, United Nations - FRA 2000, Namibia, Somalia, Netherlands, New Zealand, Mozambique, Australia, Cambodia, Japan, Morocco, SADC]

Data source: http://home.comcast.net/~gyde/DEFpaper.htm
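
The question "what is a forest?" can be made concrete with a small sketch. It assumes a forest definition can be reduced to two thresholds, canopy cover and tree height, and it deliberately ignores species, area and strip width, as the slide does. The three definitions and their threshold values are hypothetical placeholders, not values taken from any national or international standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ForestDefinition:
    """A forest definition reduced to two thresholds (hypothetical values)."""
    name: str
    min_canopy_cover_pct: float
    min_tree_height_m: float

    def is_forest(self, canopy_cover_pct: float, tree_height_m: float) -> bool:
        # A stand counts as forest only if it meets both thresholds.
        return (canopy_cover_pct >= self.min_canopy_cover_pct
                and tree_height_m >= self.min_tree_height_m)


# Hypothetical definitions for illustration only.
DEFINITIONS = [
    ForestDefinition("definition_a", 10.0, 5.0),
    ForestDefinition("definition_b", 30.0, 2.0),
    ForestDefinition("definition_c", 60.0, 10.0),
]


def classify(canopy_cover_pct: float, tree_height_m: float) -> dict:
    """Return, per definition, whether the stand is classed as forest."""
    return {d.name: d.is_forest(canopy_cover_pct, tree_height_m)
            for d in DEFINITIONS}


# The same stand of trees is forest under one definition and not under the others.
stand = classify(canopy_cover_pct=25.0, tree_height_m=6.0)
print(stand)
```

A dataset labelled "forest" therefore carries a different meaning depending on which definition produced it, which is the mismatch a user needs metadata to reveal.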

Introduction: spatial data variability

• Much variation in the representation of the world
• Choices about representation vary depending on
  – Commissioning, scientific & policy context (who paid for it?)
  – Observer (what did you see?)
  – Institution (why do you see it that way?)
  – Measurement (how did you record it?)
• So… almost everything in Geography is a matter of interpretation
  – The same processes may be recorded (represented) in different ways
  → Variation in representation & concepts

Context

• Now: many more users of spatial data
• Obtaining data is easy and quick
  – Web, INSPIRE, SDIs (click through download)
  – No gatekeeper, no negotiation
• Users may assume that data about ‘forest’ or ‘height above sea level’ etc matches their concept, their understanding
  – Prototypes in cognitive science

Context

• Semiotic triangle
  – Real world
  – GI conceptualisations
  – User prototypes
• GI is interpreted from personal & group conceptualizations of the world
• Geographical data are mapped into those conceptualizations
• Then provided to users

[Diagram labels: ‘real world’, GI, User, measurement]

Context

• How does the user
  – Understand the data-to-real world link?
  – Avoid mis-matches with their Prototype, Conceptual model, Analytical objectives, or Existing data?
  – Determine data quality?
  – Ensure robust analysis?

[Diagram labels: ‘real world’, GI, User, metadata, analysis, measurement]

• Users might expect metadata to support their activity…

• The meta-descriptions of metadata in standards support that view…

Context

• Geo-spatial data quality and metadata standards:
  – Positional Accuracy, Attribute Accuracy, Lineage, Logical Consistency, Completeness
  – In many early standards: DCDSTF, 1988; FGDC, 1998; ANZLIC, 2001; ISO, 2003; OGC; INSPIRE
  – Distilled into the Dublin Core
• Dublin Core Metadata Element Set identifies 15 components
  – Contributor, Coverage, Creator, Date, Description, Format, Identifier, Language, Publisher, Relation, Rights, Source, Subject, Title, Type
• Relate to mainstream information sources
  – Books, web pages
  – Based on IP, cataloguing, retrieval & discovery
  – How to document information
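
As a minimal sketch of what the 15 Dublin Core elements do and do not cover, the snippet below checks a record for missing elements. The element names come from the list above; the record contents and the helper `missing_elements` are hypothetical and are not part of any Dublin Core tooling.

```python
# The 15 elements listed on the slide, lower-cased for use as dictionary keys.
DUBLIN_CORE_ELEMENTS = (
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
)


def missing_elements(record: dict) -> list:
    """Return the Dublin Core elements that are absent or empty in a record."""
    missing = []
    for element in DUBLIN_CORE_ELEMENTS:
        value = record.get(element)
        if value is None or not str(value).strip():
            missing.append(element)
    return missing


# Hypothetical record for an imaginary dataset.
record = {
    "title": "Example land cover map",
    "creator": "Example Mapping Agency",
    "date": "2007",
    "format": "GeoTIFF",
    "coverage": "Example region",
    "subject": "land cover",
}

gaps = missing_elements(record)
print(f"{len(gaps)} of {len(DUBLIN_CORE_ELEMENTS)} elements missing: {gaps}")
```

Even a record with all 15 elements filled in would only describe cataloguing and discovery. None of the elements states how a class such as ‘forest’ was defined or whether the data suit a particular use.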

Context

• Metadata objectives:
  – “Data about data or a service. Metadata is the documentation of data. In human-readable form, it has primarily been used as information to enable the manager or user to understand, compare and interchange the content of the described data set” (ISO, 2003a)
• BUT standards reflect the process of data production
  – Lineage from methods and data sources
  – Accuracy, Consistency and Completeness from assessment of results
• Little focus on use
• Little focus on assessments of data quality

Context

• Currently metadata does NOT close the semiotic loop

• In part this is the nature of standards...
  ... in theory they provide a common language
  ... but their specification (content) is always a compromise and lags behind research & practice
  – E.g. a recent book on spatial data standards took 10 years from inception to being published.

Research Agenda

• Can users make sense of the metadata provided?
  – Does it meet their needs?
• Are the various MD fields relevant in this new context?
  – Are there important omissions?
• Are there opportunities for further richness provided by recent innovations in information science?
• Will data producers be able to keep up with metadata production at ever-increasing data rates?

• In short: does metadata need to be re-envisioned for these new technologies and use-cases?

Research Agenda to support user evaluations of data quality

1. Metadata for what purpose, what roles?
  – Currently based on Archive, Discovery, Citations and Browsing.
  – Is this complete?
  – What about data quality assessments? Semantics?

2. Metadata for what kinds of resources (not just data)?
  – Just datasets? Too shortsighted? What about:
    • Methods? Workflows? Research Questions? Researchers?
  – There are syntactic and semantic issues for each of the above: e.g. Methods can be described by syntactic signatures, but that does not tell the user what they do…

3. Actionable metadata?
  – Today’s information systems are poor consumers of metadata…
    • Do the tools we use make effective use of metadata?
    • E.g. the GIS community have spent much time and effort on uncertainty metadata, even though the systems cannot analyze and propagate uncertainty during analysis
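
To illustrate what "actionable" uncertainty metadata could look like, here is a minimal sketch in which an accuracy value stored in metadata is carried through a simple analysis. It assumes the errors in the two datasets are independent and that each dataset's metadata exposes a vertical RMSE; the field name `vertical_rmse_m` and all values are hypothetical. The root-sum-of-squares rule is a standard textbook approach for independent errors, not something the slides prescribe.

```python
import math


def height_difference(height_a_m: float, meta_a: dict,
                      height_b_m: float, meta_b: dict) -> tuple:
    """Difference of two heights plus its propagated standard uncertainty.

    Assumes independent errors, so the uncertainties combine as the
    square root of the sum of squares.
    """
    diff = height_a_m - height_b_m
    sigma = math.hypot(meta_a["vertical_rmse_m"], meta_b["vertical_rmse_m"])
    return diff, sigma


# Hypothetical metadata for two elevation datasets.
meta_a = {"vertical_rmse_m": 0.3}
meta_b = {"vertical_rmse_m": 0.4}

diff, sigma = height_difference(102.0, meta_a, 101.5, meta_b)

# A crude test: is the difference larger than twice its uncertainty?
is_distinguishable = abs(diff) > 2 * sigma
print(f"difference = {diff:.2f} m, uncertainty = {sigma:.2f} m, "
      f"distinguishable = {is_distinguishable}")
```

Here the metadata changes the conclusion of the analysis: the apparent height difference is no larger than its own uncertainty, which a system that ignores the accuracy field would never report.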


4. Does the role of Standards need to change?
  – Many metadata standards, and for a variety of purposes.
  – Re-invented by different disciplines/groups
  – Who gets to make the standards?
  – Should standards cover all metadata needs for science communities?

5. Cost and time for creating metadata standards?
  – How long does it take (examples from EU and ISO)?
  – What is the typical cost?
  – What does the metadata standards development process look like?
  – Do communities always accept them?

6. The burden of metadata production?
  – Often an ‘unfunded mandate’
  – Documenting standards ignored to various degrees
  – Are metadata standards failing? (e.g. NSDI)
  – Are we sure we are collecting information that is useful?


7. Conveying understanding: Capturing and representing domain semantics?
  – There are many realities… each user of some resource brings a different understanding and potentially different metadata needs
  – Representing data semantics: (i) for users, (ii) for foreign systems
    • Using meta-models, where some domain semantics are first defined, then used to construct information schemas (e.g. NADM: North American Data Model for Geological Mapping)
    • Using ontologies for knowledge domains and tasks (e.g. NASA’s SWEET ontology of Earth processes and regions)

8. Mining situational metadata from use-cases (provenance)?
  – User ranking and feedback:
    • What works? What is missing? What is known? What is unknown?
  – Use-case logging: monitor use via a web portal / library, warehouse…
    • Use counts by web domains: differentiate user communities
  – Use-case mining and analysis
    • Discover significant usage patterns, use these to infer relevance, e.g. recommender systems
  – Genesis, derivation, workflows
    • By exposing, analyzing and documenting the means by which the dataset was produced
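
The use-case logging idea above, counting use by web domain to tell user communities apart, can be sketched as follows. This is only an illustration: the log entries, dataset names and host names are hypothetical, and treating the last part of a host name as a proxy for a user community is a crude assumption.

```python
from collections import Counter

# Hypothetical download log from a data portal: (dataset, requesting host).
ACCESS_LOG = [
    ("landcover", "gis.example.edu"),
    ("landcover", "lab.example.edu"),
    ("landcover", "planning.example.gov"),
    ("elevation", "maps.example.org"),
    ("elevation", "gis.example.edu"),
]


def domain_suffix(host: str, levels: int = 1) -> str:
    """Return the last `levels` dot-separated parts of a host name."""
    return ".".join(host.lower().split(".")[-levels:])


def use_counts(log, levels: int = 1) -> dict:
    """Count downloads per dataset, grouped by requesting domain suffix."""
    counts = {}
    for dataset, host in log:
        counts.setdefault(dataset, Counter())[domain_suffix(host, levels)] += 1
    return counts


counts = use_counts(ACCESS_LOG)
for dataset, by_domain in counts.items():
    print(dataset, dict(by_domain))
```

Counts of this kind could be attached to a dataset as situational metadata, giving a new user a first hint about which communities already rely on it.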


9. Mining semantic metadata from resources and schemas?
  – Ontology mining
    • Inferred from schema (metadata): mappings built from exposed data schema
    • Inferred from data in some cases: schema and data used to construct ontology

10. Evolving metadata?
  – The way we describe the world keeps changing
    • …and we learn more about how things are used
  – The way we think about metadata now has evolved considerably over the last 20 years
    • We should expect that to continue.
    • Metadata schemas need to be designed for expansion and replacement as science evolves.
  – Meta-models help a lot, but are they flexible enough? Will emergent use patterns lead to new insights?

Final remarks

• Assertion 1: Current attempts to gather and utilize metadata for data quality assessments are failing...

• Assertion 2: The burden of tagging existing and future data with user-relevant metadata to do this is overwhelming
  – We cannot realistically expect data producers to carry this burden alone

• Many different approaches to metadata creation are open to us
  – Some are new, facilitated by ‘grid’ and web service ‘brokered’ access to e-resources
  – We need to try some of these on a large scale.

• These research ideas are intended to augment the ongoing work of INSPIRE / ESDIN, etc (not a critique)

• The stakes are high: our success in sharing data – of which data quality assessments are a key part – will have big repercussions for research and policy for years to come
