PUBLIC PROTECTION AND ETHICAL GEOSPATIAL DATA DISSEMINATION
AN INITIATIVE OF GEOIDE (PROJECT IV-23)
Metadata challenges: providing stronger assessments of data quality
Dr Lex Comber
Acknowledgements
• The ideas in this presentation are the result of an ongoing collaboration with Mark Gahegan
• This is a work in progress...
Aims
• To expand on current notions of metadata for spatial data
• To explore metadata objectives, content and roles
• To consider possible metadata developments
• To propose an agenda for evolving metadata
Statement
“Data quality can only be determined in light of its intended use: quality is not absolute but relative to its use”
• Data is frequently (mostly?) used for purposes other than its original one
• 3rd party data; more users; greater access, e.g. SDIs, INSPIRE, GRID etc.
• Users need to understand the uncertainties when they use the data
• A dataset will have different ‘quality’ for different users (and uses)
Outline
• Introduction
  – Spatial data variability: semantics, measurement & abstraction
  – Examples
• Context
  – Users, prototypes & semiotic triangles
  – Standards
• A research agenda for more nuanced metadata
Introduction: spatial data variability
• Many different ways of conceptualising the world
  – Grounded in semantics and meaning
  – Different meanings and understandings
  – Sometimes called an ‘ontology’
• Geographic representation
  – Real world infinitely complex
  – Representation involves
    • Abstraction, Aggregation, Simplification etc.
• Examples
Grainger, A. (2007). The influence of end-users on the temporal consistency of an international statistical process: the case of tropical forest statistics. Journal of Official Statistics, 23(4): 553-592
Example: UN FRA
Spatial characterization can change
Example: sea level
Fact: A bridge collapsed!
Where: Laufenburg on the river Rhine
Why: The already completed bridge on the Swiss side has a difference in altitude (level) of 0.54 metres compared to the German counterpart
How: The two neighbouring countries use different measuring methods
Source: http://www.laufenburg.ch
Differences in sea level (cm)
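The arithmetic behind a mismatch like this can be sketched in a few lines. The slide gives only the 0.54 m result; the 0.27 m datum offset below is an assumption taken from the widely reported account of the incident (a correction applied with the wrong sign), not from the slide, and the function name is hypothetical.

```python
# ASSUMPTION: offset between the two national height datums, in metres.
# The slide only reports the final 0.54 m mismatch; 0.27 m is not in the source.
ASSUMED_DATUM_OFFSET_M = 0.27


def residual_mismatch(applied_correction_m: float,
                      true_offset_m: float = ASSUMED_DATUM_OFFSET_M) -> float:
    """Height mismatch left over after applying a (possibly wrong) correction."""
    return abs(true_offset_m - applied_correction_m)


correct_sign = residual_mismatch(0.27)   # correction applied properly
no_correction = residual_mismatch(0.0)   # datums assumed identical
wrong_sign = residual_mismatch(-0.27)    # correction applied in the wrong direction

print(f"correct sign:  {correct_sign:.2f} m")
print(f"no correction: {no_correction:.2f} m")
print(f"wrong sign:    {wrong_sign:.2f} m")
```

Under that assumption, a sign error doubles the offset rather than removing it, which is consistent with the 0.54 m figure quoted above.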
Example: what is a forest?
From Comber, A.J., Fisher, P.F., Wadsworth, R.A. (2005). What is land cover? Environment and Planning B: Planning and Design, 32: 199-209
Does not include species, area, strip width
[Figure: definitions of ‘forest’ plotted by Canopy Cover (%) on the x-axis (0 to 90) against Tree Height (m) on the y-axis (0 to 16). Labelled definitions: Portugal, Turkey, Kyrgyzstan, Switzerland, Estonia, Denmark, Kenya, UNESCO, Sudan, Tanzania, Ethiopia, South Africa, Jamaica, Zimbabwe, Gambia, Mexico, Israel, United States, Belgium, Luxembourg, Malaysia, PNG, United Nations - FRA 2000, Namibia, Somalia, Netherlands, New Zealand, Mozambique, Australia, Cambodia, Japan, Morocco, SADC]
Data source: http://home.comcast.net/~gyde/DEFpaper.htm
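A minimal sketch of why this matters to a data user: the same stand of trees can be ‘forest’ under one definition and not under another. It assumes each definition can be reduced to a minimum canopy cover and a minimum tree height, which is a simplification. The three definitions and their thresholds are hypothetical and are not read off the figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ForestDefinition:
    """A forest definition reduced to two thresholds (a simplification)."""
    name: str
    min_canopy_cover_pct: float
    min_tree_height_m: float

    def is_forest(self, canopy_cover_pct: float, tree_height_m: float) -> bool:
        return (canopy_cover_pct >= self.min_canopy_cover_pct
                and tree_height_m >= self.min_tree_height_m)


# HYPOTHETICAL thresholds, chosen only to illustrate the spread of definitions.
DEFINITIONS = [
    ForestDefinition("definition A (hypothetical)", 10.0, 5.0),
    ForestDefinition("definition B (hypothetical)", 30.0, 2.0),
    ForestDefinition("definition C (hypothetical)", 60.0, 10.0),
]


def classify(canopy_cover_pct: float, tree_height_m: float) -> dict:
    """Return, per definition, whether the stand counts as forest."""
    return {d.name: d.is_forest(canopy_cover_pct, tree_height_m)
            for d in DEFINITIONS}


# One stand: 25 % canopy cover, 6 m trees.
stand = classify(canopy_cover_pct=25.0, tree_height_m=6.0)
for name, result in stand.items():
    print(f"{name}: {'forest' if result else 'not forest'}")
```

A user who merges datasets built on different definitions inherits this disagreement unless the metadata exposes the thresholds.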
Introduction: spatial data variability
• Much variation in how the world is represented
• Choices about representation vary depending on
  – Commissioning, scientific & policy context (who paid for it?)
  – Observer (what did you see?)
  – Institution (why do you see it that way?)
  – Measurement (how did you record it?)
• So… almost everything in Geography is a matter of interpretation
  – The same processes may be recorded (represented) in different ways
→ Variation in representation & concepts
Context
• Now: many more users of spatial data
• Obtaining data is easy and quick
  – Web, INSPIRE, SDIs (click-through download)
  – No gatekeeper, no negotiation
• Users may assume that data about ‘forest’ or ‘height above sea level’ etc. matches their concept, their understanding
  – Prototypes in cognitive science
Context
• Semiotic triangle
  – Real world
  – GI conceptualisations
  – User prototypes
• GI is interpreted from personal & group conceptualizations of the world
[Diagram: semiotic triangle with vertices ‘real world’, GI and User, and a link labelled ‘measurement’]
• Geographical data are mapped into those conceptualizations
• Then provided to users
Context
• How does the user
  – Understand the data-to-real-world link?
  – Avoid mismatches with their Prototype, Conceptual model, Analytical objectives, or Existing data?
  – Determine data quality?
  – Ensure robust analysis?
[Diagram: semiotic triangle with vertices ‘real world’, GI and User, with links labelled ‘measurement’, ‘metadata’ and ‘analysis’]
• Users might expect metadata to support their activity…
• The meta-descriptions of metadata in standards support that view…
Context
• Geo-spatial data quality and metadata standards:
  – Positional Accuracy, Attribute Accuracy, Lineage, Logical Consistency, Completeness
  – In many early standards: DCDSTF, 1988; FGDC, 1998; ANZLIC, 2001; ISO, 2003; OGC; INSPIRE
  – Distilled into the Dublin Core
• Dublin Core Metadata Element Set identifies 15 components
  – Contributor, Coverage, Creator, Date, Description, Format, Identifier, Language, Publisher, Relation, Rights, Source, Subject, Title, Type
• Relate to mainstream information sources
  – Books, web pages
  – Based on IP, cataloguing, retrieval & discovery
  – How to document information
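As a rough illustration, the 15 elements listed above can be treated as a simple checklist against a dataset record. The record contents and the `missing_elements` helper are hypothetical, and this is a sketch, not an implementation of any particular metadata tool.

```python
# The 15 Dublin Core elements as listed on the slide (lower-cased).
DUBLIN_CORE_ELEMENTS = (
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
)


def missing_elements(record: dict) -> list:
    """Return the Dublin Core elements that are absent or empty in a record."""
    return sorted(e for e in DUBLIN_CORE_ELEMENTS if not record.get(e))


# HYPOTHETICAL record for an imaginary land cover dataset.
record = {
    "title": "Land cover map (hypothetical)",
    "creator": "Example Mapping Agency",
    "date": "2007",
    "format": "GeoTIFF",
    "coverage": "Example region",
}

gaps = missing_elements(record)
print(f"{len(gaps)} of {len(DUBLIN_CORE_ELEMENTS)} elements missing: {gaps}")
```

Note that even a record with all 15 elements filled in would describe how to find and cite the dataset, not whether it suits a particular use.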
Context
• Metadata objectives:
  – “Data about data or a service. Metadata is the documentation of data. In human-readable form, it has primarily been used as information to enable the manager or user to understand, compare and interchange the content of the described data set” (ISO, 2003a)
• BUT standards reflect the process of data production
  – Lineage from methods and data sources
  – Accuracy, Consistency and Completeness from assessment of results
• Little focus on use
• Little focus on assessments of data quality
Context
• Currently metadata does NOT close the semiotic loop
• In part this is the nature of standards...
  – ... in theory they provide a common language ...
  – But their specification (content) is always a compromise and lags behind research & practice
  – E.g. a recent book on spatial data standards took 10 years from inception to being published.
Research Agenda
• Can users make sense of the metadata provided?
  – Does it meet their needs?
• Are the various metadata fields relevant in this new context?
  – Are there important omissions?
• Are there opportunities for further richness provided by recent innovations in information science?
• Will data producers be able to keep up with metadata production at ever-increasing data rates?
• In short: does metadata need to be re-envisioned for these new technologies and use-cases?
Research Agenda to support user evaluations of data quality
1. Metadata for what purpose, what roles?
  – Currently based on Archive, Discovery, Citations and Browsing.
  – Is this complete?
  – What about data quality assessments? Semantics?
2. Metadata for what kinds of resources (not just data)?
  – Just datasets? Too shortsighted? What about:
    • Methods? Workflows? Research Questions? Researchers?
  – There are syntactic and semantic issues for each of the above: e.g. Methods can be described by syntactic signatures, but that does not tell the user what they do…
3. Actionable metadata?
  – Today’s information systems are poor consumers of metadata…
    • Do the tools we use make effective use of metadata?
    • E.g. the GIS community has spent much time and effort on uncertainty metadata, even though the systems cannot analyze and propagate uncertainty during analysis
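One way to picture ‘actionable’ uncertainty metadata is a tool that reads a positional accuracy value and carries it through an analysis. The sketch below uses a simple Monte Carlo simulation of a distance calculation. It assumes the metadata supplies a standard deviation in metres and that errors are independent and Gaussian; the function name and all values are hypothetical.

```python
import math
import random
import statistics


def simulate_distance(p, q, sigma_m, runs=5000, seed=42):
    """Propagate positional uncertainty into a distance by Monte Carlo.

    p, q     -- (x, y) coordinates in metres
    sigma_m  -- ASSUMED positional standard deviation taken from metadata
    Returns (mean distance, standard deviation of distance).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    samples = []
    for _ in range(runs):
        # Perturb each coordinate independently with Gaussian noise.
        px = p[0] + rng.gauss(0.0, sigma_m)
        py = p[1] + rng.gauss(0.0, sigma_m)
        qx = q[0] + rng.gauss(0.0, sigma_m)
        qy = q[1] + rng.gauss(0.0, sigma_m)
        samples.append(math.hypot(qx - px, qy - py))
    return statistics.mean(samples), statistics.stdev(samples)


# Two points nominally 100 m apart, with a hypothetical 5 m positional accuracy.
mean_d, sd_d = simulate_distance((0.0, 0.0), (100.0, 0.0), sigma_m=5.0)
print(f"distance: {mean_d:.1f} m, spread: {sd_d:.1f} m")
```

The point is not the particular method but that the accuracy figure is consumed by the analysis instead of sitting unread in a metadata record.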
Research Agenda to support user evaluations of data quality
4. Does the role of Standards need to change?
  – Many metadata standards, and for a variety of purposes.
  – Re-invented by different disciplines/groups
  – Who gets to make the standards?
  – Should standards cover all metadata needs for science communities?
5. Cost and time for creating metadata standards?
  – How long does it take (examples from EU and ISO)?
  – What is the typical cost?
  – What does the metadata standards development process look like?
  – Do communities always accept them?
6. The burden of metadata production?
  – Often an ‘unfunded mandate’
  – Documenting standards ignored to various degrees
  – Are metadata standards failing? (e.g. NSDI)
  – Are we sure we are collecting information that is useful?
Research Agenda to support user evaluations of data quality
7. Conveying understanding: Capturing and representing domain semantics?
  – There are many realities… each user of some resource brings a different understanding and potentially different metadata needs
  – Representing data semantics: (i) for users, (ii) for foreign systems
    • Using meta-models, where some domain semantics are first defined, then used to construct information schemas (e.g. NADM: North American Data Model for Geological Mapping)
    • Using ontologies for knowledge domains and tasks (e.g. NASA’s SWEET ontology of Earth processes and regions)
8. Mining situational metadata from use-cases (provenance)?
  – User ranking and feedback:
    • What works? What is missing? What is known? What is unknown?
  – Use-case logging: monitor use via a web portal / library, warehouse…
    • Use counts by web domains: differentiate user communities
  – Use-case mining and analysis
    • Discover significant usage patterns, use these to infer relevance, e.g. recommender systems
  – Genesis, derivation, workflows
    • By exposing, analyzing and documenting the means by which the dataset was produced
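The ‘use counts by web domains’ idea can be sketched as a small log summary. Everything here is hypothetical: the log entries, the dataset identifiers, and the crude rule that maps a host name to a user community.

```python
from collections import Counter


def community_of(host: str) -> str:
    """Crude, HYPOTHETICAL mapping from a host name to a user community."""
    host = host.lower().rstrip(".")
    if host.endswith((".ac.uk", ".edu")):
        return "academic"
    if host.endswith((".gov", ".gov.uk")):
        return "government"
    return "other"


def use_counts(log):
    """Count downloads per (dataset, community) from (dataset_id, host) pairs."""
    counts = Counter()
    for dataset_id, host in log:
        counts[(dataset_id, community_of(host))] += 1
    return counts


# HYPOTHETICAL download log from a data portal.
LOG = [
    ("landcover-2000", "gis.example.ac.uk"),
    ("landcover-2000", "lab.example.edu"),
    ("landcover-2000", "planning.example.gov.uk"),
    ("elevation-v1", "maps.example.com"),
    ("landcover-2000", "geo.example.ac.uk"),
]

counts = use_counts(LOG)
for (dataset_id, community), n in sorted(counts.items()):
    print(f"{dataset_id:15s} {community:10s} {n}")
```

Counts like these could be attached to a dataset as situational metadata, giving a new user some indication of who has found the data usable.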
Research Agenda to support user evaluations of data quality
9. Mining semantic metadata from resources and schemas?
  – Ontology mining
    • inferred from schema (metadata) - mappings built from exposed data schema
    • inferred from data in some cases - schema and data to construct ontology
10. Evolving metadata?
  – The way we describe the world keeps changing
    • …and we learn more about how things are used
  – The way we think about metadata now has evolved considerably over the last 20 years
    • We should expect that to continue.
    • Metadata schemas need to be designed for expansion and replacement as science evolves.
  – Meta-models help a lot, but are they flexible enough? Will emergent use patterns lead to new insights?
Final remarks
• Assertion 1: Current attempts to gather and utilize metadata for data quality assessments are failing...
• Assertion 2: The burden of tagging existing and future data with user-relevant metadata to do this is overwhelming
  – We cannot realistically expect data producers to carry this burden alone
• Many different approaches to metadata creation are open to us
  – Some are new, facilitated by ‘grid’ and web service ‘brokered’ access to e-resources
  – We need to try some of these on a large scale.
• These research ideas are intended to augment the ongoing work of INSPIRE / ESDIN, etc (not a critique)
• The stakes are high: our success in sharing data – of which data quality assessments are a key part – will have big repercussions for research and policy for years to come