Trustworthiness of open government data
an analysis of the requirements for open government data
from the perspective of authoritativeness and sustainable accessibility
10 July 2020
Jan Koers
Student id 11933984
Table of contents

1. Introduction
2. Trust as a basic requirement for our information society
   2.1 Introduction
   2.2 What defines trust and how is authoritativeness of information related
   2.3 The difference between data and records from the perspective of trust
   2.4 Recordkeeping principles that support trust
   2.5 Conclusions
3. Assessing current open government data on trustworthiness
   3.1 Why open data
   3.2 Research questions and expected outcome
4. Recordkeeping principles supporting trustworthiness of open data
   4.1 Key components of trust and supporting recordkeeping principles
   4.2 Focus group about valorisation of key components of trust
   4.3 Form to assess the open data
5. Incorporation of recordkeeping principles in Dutch open data
   5.1 Introduction
   5.2 The Dutch base registry of buildings and addresses (BAG)
   5.3 Air quality measurements by Dutch institutions
   5.4 Crowd sourced air quality data
   5.5 Analysis of the research data
   5.6 Assessment of the assessment framework
6. Recommendations to promote trust of open data environments
   6.1 Translation of the research findings to recommendations
   6.2 Standardized and inseparable minimum metadata set for individual information objects
   6.3 Source information system as archiving and publication platform
7. Conclusions
Bibliography
List of figures

Figure 5.1 Formalisation continuum of data creation environments
Figure 5.2 Visualisation of the building and address information using the BAG viewer
Figure 5.3 Infographic on the re-use of BAG data
Figure 5.4 Visualisation of actual measurements of fine dust (PM10) on 6 June 2020, 14:55
Figure 5.5 Report of historical measurements of fine dust (PM10) on 6 August 2018
Figure 5.6 Data acquisition, processing and publication of sniffer bike data
Figure 5.7 Visualisation of sniffer bike open data with a specific web application
Figure 6.1 Possible information architecture with integral recordkeeping functionality

List of tables

Table 3.1 Research questions and proposed research methodologies
Table 4.1 Relation of trust components (3 perspectives), trust type and recordkeeping principles
Table 4.2 Assessment form for assessing recordkeeping principles in open data sets
Table 5.1 Assessment form for BAG data
Table 5.2 Assessment form for air quality measurements by Dutch institutions
Table 5.3 Assessment form for crowd sourced air quality data
Table 5.4 Classification of presence of recordkeeping principles in 3 assessed open data sets
1. Introduction
For thousands of years, information has been preserved by passing on stories and by recording them on physical media such as paper. In recent decades, digitization started with the conversion of written or typed documents into digital representations by scanning, with no major changes in how the information was essentially used.

With technology now developing rapidly, a real digital transformation is taking place. Formal and informal communication and interaction are becoming more and more digital, as can be concluded from the wide range of official digital government programs and the massive use of social media platforms.

These technological changes induce societal changes: from changing democracy through platforms for e-democracy on the one hand, to the informational isolation of individuals and groups in information bubbles on the other. There is ongoing debate on how to manage these technological and societal changes: promoting transparency and openness of government (like the future Dutch legislation "Wet Open Overheid"), protecting against the abuse of personal information (like the European General Data Protection Regulation), and dealing with the growing use of data and black box algorithms in automated decision making processes without knowing their exact quality and functioning.
All these technological and societal developments imply challenges for information management: how can we deal with the fluidity, temporality and changing context of information? How can we support transparency and a sound interpretation of data and algorithms? How can we support the trustworthiness of digital data?

This research investigates the trustworthiness of data as a basic requirement for the current trend of datafication in our society. It elaborates on the difference between data and records and on the recordkeeping principles that can support trust in data. Subsequently, several Dutch open data sets are assessed on their level of trustworthiness. Finally, recommendations are given to integrate recordkeeping principles in data governance.
2. Trust as a basic requirement for our information society
2.1 Introduction
Current developments in our information society stress the importance of the concept of trust in information. In recent research our time has been referred to as an "era of post-truth and disinformation" (Duranti, 2018). A complete research program in archival science focusses on this aspect (the InterPARES Trust program1). Upward et al. (2018) elaborate on "authoritative information resource management" as crucial for our networked information society.

This raises the question what actually defines trust and how authoritative information resource management is related to trust. Is there a difference between the concept of data and the concept of records from the perspective of trust? And what (recordkeeping) principles are key to support the trustworthiness of data?
2.2 What defines trust and how is authoritativeness of information related
Webster's dictionary defines trust as "Assured resting of the mind on the integrity, veracity, justice, friendship, or other sound principle, of another person". Trust is therefore related to the perception of persons on the one hand, and to sound principles of the object to be trusted on the other. In this case the object is information, for which integrity and veracity would be the applicable principles.
In archival science, Donaldson & Conway (2015) studied user conceptions of trustworthiness for digital archival documents. They concluded that the formulation of trustworthiness as a concept in Kelton, Fleischmann, and Wallace's (2008) Integrated Model of Trust in Information can be questioned. Besides the aspects of accuracy, believability, coverage, currency, objectivity, stability and validity in the model, Donaldson & Conway revealed that aspects such as authenticity, inaccurate/trustworthy, first-hand/primary, legibility/readability and form are perceived as relevant as well.
Wang et al. (2011), in their study on trust of machine-created data, concluded that trust is related to the reputation of the creator and/or publisher of data. That reputation is in turn related to the supposed quality of the data created and/or published, where the quality indication is based on assessment by experts, on reputation rules of computational systems, or on the re-users of the information. Ceolin et al. (2016) combine user reputation and provenance analysis, using provenance to estimate trust in an automated way. They designed a series of algorithms to extract relevant provenance features, generate stereotypes of user behaviour from those features, and estimate the reputation of both stereotypes and users. Based on that, they estimate the trustworthiness of "artefacts".
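The pipeline Ceolin et al. describe can be sketched roughly as follows. This is a minimal illustration only: the provenance features (`n_edits`, `cited_sources`), the stereotype threshold and the acceptance signal are assumptions made for this sketch, not details taken from their paper.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Contribution:
    user: str
    artefact: str
    n_edits: int          # assumed provenance feature: revisions made
    cited_sources: bool   # assumed provenance feature: sources cited
    accepted: bool        # whether the contribution was later validated

def stereotype(c: Contribution) -> str:
    """Derive a coarse behaviour stereotype from provenance features."""
    return "careful" if c.cited_sources and c.n_edits >= 3 else "casual"

def estimate_trust(contributions):
    """Estimate user and stereotype reputation as acceptance rates, then
    score each artefact by the mean reputation of its contributors."""
    user_stats = defaultdict(lambda: [0, 0])     # user -> [accepted, total]
    stereo_stats = defaultdict(lambda: [0, 0])   # stereotype -> [accepted, total]
    for c in contributions:
        for stats in (user_stats[c.user], stereo_stats[stereotype(c)]):
            stats[1] += 1
            if c.accepted:
                stats[0] += 1
    user_rep = {u: a / t for u, (a, t) in user_stats.items()}
    stereo_rep = {s: a / t for s, (a, t) in stereo_stats.items()}
    by_artefact = defaultdict(list)
    for c in contributions:
        by_artefact[c.artefact].append(user_rep[c.user])
    artefact_trust = {a: sum(r) / len(r) for a, r in by_artefact.items()}
    return artefact_trust, user_rep, stereo_rep
```

In the actual work of Ceolin et al., stereotype and user reputation are combined into a single estimate; here they are returned side by side to keep the sketch readable.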
Upward et al. (2018) use, in their publication "Recordkeeping Informatics for a Networked Age", the term authoritative information resource management and relate it to Giddens' social theory. They elaborate that our societal values determine whether authoritative information is important for the functioning of society and what information will be considered authoritative. Giddens' theory stresses that social structures are based on relations and shows how social structures survive in space and time. In parallel, the authors describe that records are structured to reflect the relations in society and are therefore based on relations as well. Consequently, the survival of information in space and time is related to how social structures survive. They conclude that in current information resource management the authoritative approach has been lost from view, caused by cultures that merely value the here and now.

1 https://www.interparestrust.org
Gillian (2018) concludes that many socio-cognitive factors are in play and that a reliable source alone does not consistently determine trustworthiness for users. Yoon (2014) concludes, on the trust of data in repositories, that "trust in data itself plays a distinctive and important role for users to reuse data, which may or may not be related to the trust in repositories". Yoon concludes that users' trust in data is another important area to be investigated further, creating a bridge to the data governance domain.

Based on the literature related to trust in information, two types of trustworthiness will be taken into account in this research: the trustworthiness of the source of the information, which will be referred to as social based trust, and the trustworthiness of the information itself, which will be referred to as content based trust.
2.3 The difference between data and records from the perspective of trust
In the previous paragraphs, three different terms have been used without defining them yet:
data, record and information. This paragraph introduces definitions of these terms within the
context of this research and looks at the difference between data and records from the
perspective of trust.
The distinction between the terms data and record is blurred. Classical records are supposed to reflect subjective, context dependent and human induced information, which is subject to interpretation and is associated with authority and actions. Data (and the devices and algorithms that create and/or process them) are considered objective, accurate, true and therefore neutral, and thus able to be objectively analysed to derive patterns and/or knowledge and to project future behaviour or developments.

But data are not neutral at all: the design of sensors, the modelling of data models, the classification of information, the design of algorithms, the results of analysis: all are human activities that start from an interpreted and/or subjective theory or design, which influences the way the data can be used and interpreted. Additionally, current technologies enable us to convert classical records to data, by extracting data from for instance textual or audiovisual information. The other way around, classical records can be generated from data.
Borglund et al. (2014) state that the terms data, records and information can be used interchangeably, but that their use may carry different connotations. They argue that the term record is often used when the legal context or meaning is important, which coincides with the ISO 15489 definition of records as "information created, received, and maintained as evidence and information by an organization or person, in pursuance of legal obligations or in the transaction of business". The term information is used when there is a more customer-oriented perspective, with usage, accessibility and benefits as its core focus. The term data mostly refers to easily processable (tabular or structured) data, sometimes referred to as "raw material". The growing generation and use of this type of data explains the rising use of the term data. Borglund et al. conclude that different stakeholders have different perspectives and stress different aspects, which is why different terminology is being used.
This research is related to the creation and use of government information. This information is almost always directly or indirectly part of a process that has legal or transactional characteristics. Still, different terms are used, such as data, records and information. This research focusses on the role of recordkeeping principles, identified in archival science to support the trustworthiness of information, in analysing trust aspects of open government data. The basic terminology has therefore been based on the descriptions of terminology used in the reference project Preservation as a Service for Trust (PaaST) from the InterPARES Trust Program (2018)2. This reference document has at its core the definition of an "intellectual entity":
• Intellectual entity: artefacts that are intended to communicate information; this encompasses human readable entities like texts and photographs and machine readable entities like databases and software.
• Data: objects that are the ingredients of intellectual entities; this includes objects that are directly or indirectly created by humans, such as data that capture details of human interactions with social media or online systems, data generated by environmental sensors and the outputs of artificial intelligence systems.
• Record: a type of intellectual entity that was made or received in the course of a practical activity as an instrument or a by-product of such activity, and set aside for action or reference.
• Information: the communication result of intellectual artefacts; this encompasses for instance the visualisation of environmental data on a map to understand air pollution, or a building permit used for starting a construction process.
Because all data are made or received in the course of a practical activity, the key characteristic of a record is therefore that it is "set aside" for action or reference, and can therefore serve as evidence.

2 https://interparestrust.org/terminology/term/Preservation as a Service for Trust (PaaST)
Within this research, "set aside" means that an intellectual entity has been defined beforehand and is created for reuse (within new actions or as a reference in new actions), and should therefore have the characteristics that make such reuse possible: it should be trustworthy and it should be accessible, readable and usable.

The characteristics that underpin trustworthiness can be achieved and/or supported by the use of recordkeeping principles in the design, creation and publishing of records. Which recordkeeping principles can support this trust is elaborated in the next paragraph.
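The taxonomy above can be sketched as a tiny type hierarchy. The class and attribute names below are illustrative assumptions for this sketch, not part of the PaaST reference model itself:

```python
from dataclasses import dataclass, field

@dataclass
class IntellectualEntity:
    """An artefact intended to communicate information
    (a text, a photograph, a database, ...)."""
    identifier: str
    data: dict = field(default_factory=dict)  # the objects that are its ingredients

@dataclass
class Record(IntellectualEntity):
    """An intellectual entity made or received in the course of a practical
    activity and set aside for action or reference."""
    set_aside: bool = True
    activity: str = ""  # the practical activity it was made or received in

def can_serve_as_evidence(entity: IntellectualEntity) -> bool:
    """Only entities set aside for action or reference qualify as records
    and can therefore serve as evidence."""
    return isinstance(entity, Record) and entity.set_aside
```

The point of the sketch is the distinguishing attribute: every data object stems from some practical activity, but only the "set aside" ones are records.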
2.4 Recordkeeping principles that support trust
As mentioned in paragraph 2.2, Upward et al. (2018) describe the challenges we face nowadays in the field of what they call authoritative information resource management. Based on the social theory of Giddens, the authors stress that at its basis the (information) culture of a social structure will define the importance of recordkeeping and recordkeeping principles. Our societal values determine whether authoritative information is important for the functioning of society and what information will be trusted or considered authoritative.

The authors' framework is directly related to actor network theory. This theory serves as a basis to define and implement the relations between agents (users / roles by means of applications), records (intellectual entities set aside for action or reference) and actions (business processes). Within current technological and networked information environments, mostly based on data processing platforms, direct recording and archiving of information is already, or will become, the standard. The challenge is to guarantee the authoritativeness of the information.
The authors distinguish two recordkeeping building blocks to support authoritative information resource management in the current digital world:

• the record continuum model: in the digital age, technology enables and supports the (re)use of records; this implies a focus on the creation, capture and maintenance of records in a way that guarantees sustainable accessibility for reuse; this can be translated to the availability of repositories that guarantee access for (re)use, guarantee the integrity of the information and maintain the readability and usability of information for future use by means of preservation technologies

• recordkeeping metadata: recordkeeping metadata can support the accountability of an organization and support the traceability, authenticity and sustainability of records: it can establish what can be done with records (taking into account the objective for use, and aspects of privacy and secrecy of records) and how long to retain records; this can be translated to the practice of creating, updating and using relevant metadata in every process that manages records; in this way metadata serve "to tell the story of the information" and are the basis for the (re)construction of (trans)actions; with the relevant metadata, records can be used as evidence
These building blocks will be the basis for this research on trustworthiness of open
government data.
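To make the second building block concrete, a minimal recordkeeping metadata set for a single information object might look as follows. The field names and values are illustrative assumptions for this sketch, not a standard prescribed by the sources cited here:

```python
# Hypothetical minimum recordkeeping metadata for one information object.
# All field names and values are assumptions made for this illustration.
record_metadata = {
    "identifier": "example-record-001",          # persistent identifier
    "creator": "Example municipality",           # who created the record
    "created": "2020-06-06T14:55:00+02:00",      # when it was created
    "business_process": "building permit registration",  # context of creation
    "actions": [                                 # audit trail of what was done
        {"action": "created", "agent": "registration system", "date": "2020-06-06"},
        {"action": "published as open data", "agent": "open data portal", "date": "2020-06-07"},
    ],
    "retention": "permanent",                    # how long to retain the record
    "access": "public",                          # privacy / secrecy classification
    "format": "application/json",                # supports future readability
}

def can_reconstruct_context(md: dict) -> bool:
    """A record supports content based trust when its metadata answer who
    created it, when, in what context, and what has been done with it."""
    required = {"creator", "created", "business_process", "actions"}
    return required.issubset(md) and bool(md["actions"])
```

The check illustrates the "story of the information" idea: without the who/when/context/actions fields, a re-user cannot reconstruct the (trans)actions behind the data.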
2.5 Conclusions
For the functioning of a democratic society it is essential that there is trust between its members, be they government bodies, citizens, firms or educational institutions. Trust is supported by the creation, distribution and use of information that is considered trustworthy. Trustworthy information, in turn, is supported by authoritative information resource management.

Aspects that support trust are the trustworthiness of the provenance of the information (social based trust) and the trustworthiness of the information itself (content based trust). The trustworthiness of the information is determined by its users and can be supported by the use of metadata that describe who, where and how the data has been created, in what context, and what actions have been performed on it. When data and the relevant metadata are accessible, they can be verified and assessed on accuracy, reliability and authenticity. This supports content based trust and, consequently, social based trust.

These concepts of trust in information will be further investigated in this research. Do the identified building blocks support the trustworthiness of information, and if so, how do they support it? The next chapter details why open government data has been chosen as the context for this research, which research questions will be investigated, and with what methodologies.
3. Assessing current open government data on trustworthiness
3.1 Why open data
Open data are a good representation of current developments in the information society. Governments want to be transparent and publish their data as open data when this does not infringe privacy or copyright legislation. Innovative information architectures like the municipal Common Ground movement rely on (open) data repositories for business transactions, making use of technologies like linked data, services and application programming interfaces (APIs). More and more open data is published to be reused by parties other than the creators. These open data can be considered records, because they are "intellectual entities, set aside for action or reference", and should therefore be trustworthy. Nevertheless, open data face challenges related to trust, specifically related to data quality and the difficulty of assessing data quality (Jaakkola et al., 2014). In the InterPARES Trust programme, a study specifically focussed on the "Implications of Open Government, Open Data, and Big Data on the Management of Digital Records in an Online Environment" (Suderman & Timms, 2017). The study finds trust issues in open data and open government initiatives. These trust issues are related to gaps in the "recordkeeping infrastructure" and a focus on "participatory or collaborative governance" to increase trust. The overall objective of the study was to identify records-related issues in order to support the establishment of appropriate InterPARES Trust research projects to address them.
The open data landscape is changing rapidly, and so is legislation. New Dutch legislation is in preparation (the "Wet Open Overheid", related to the new European open data directive 2019/1024). The directive also sets general quality requirements related to the use of metadata, standardization and accessibility.

This research addresses trust issues of open data and their relation to recordkeeping principles. Its objective is to find the relevant recordkeeping infrastructure components needed in the Dutch data governance and open data landscape to support the trustworthiness of open data. The next paragraph poses the research questions and presents the research methodology.
3.2 Research questions and expected outcome
In current data governance environments there is a lack of attention to the capturing, processing, recording, reuse and publication of information in such a way that this information can be trusted and used in an authoritative way. Current information creation is so fast and so complex that it is difficult to assess whether information is authoritative and whether it is true or has been changed. Casellas et al. (2012) indicate that from a records management perspective data quality is one of the most relevant problems in (open) data projects. Lemieux et al. (2014) showed in a study on the use of big data for visual analytics that the lack of recordkeeping standards for linking data to the context of their creation, needed for proper interpretation, caused knowledge gaps. The growth of amateur-based content creation, combined with direct archiving, raises the question whether we are still able to trust (sources of) information. Duranti (2018) states that we live in a post-truth era and that "objective facts are less influential in shaping public opinion than appeals to emotion and personal belief".

3 https://eur-lex.europa.eu/legal-content/NL/TXT/PDF/?uri=CELEX:32019L1024&from=EN
Within the InterPARES Trust research program, the implications of open government, open data and big data on the management of digital records have been studied by Suderman & Timms (2017). They conclude that open government and open data initiatives and structures have not only accountability as a guiding principle, but also citizen participation, technology and innovation, and transparency. They identify trust issues with respect to open government and open data initiatives, underpinned by observed gaps in recordkeeping infrastructure and operations.

The objective of this study is to find the relevant recordkeeping infrastructure components needed in the Dutch data governance and open data landscape to support the trustworthiness of open data.

The expectation of this research is that a different view (still) exists between trust from an archival perspective (more focussed on the trustworthiness of the provenance and the repository itself: social based trust) and trust from a data governance & -science perspective (more focussed on the trustworthiness of the content, the data quality: content based trust). It is expected as well that the incorporation of recordkeeping principles (as part of authoritative information resource management) in data governance can lead to more trust in open data.

To reach the objective of this research, the following research questions have been formulated and will be answered using the proposed research methodologies:
Table 3.1 Research questions and proposed research methodologies

1. What are the key components of trust of open data and what recordkeeping principles support them?
   Methodology: literature study of both the archival science domain and the data governance & -science domain to extract the key components and supporting recordkeeping principles.

2. What are the differences in valorisation of the key components for trust from the archival perspective and the data governance & -science perspective?
   Methodology: focus group with archival and data governance experts to rate the relevance of the key components found in research question 1 and, if applicable, to add new ones. Experts currently working on the (municipal) information architecture for data and records, including the open data arena, will be approached to review the key components for trust.

3. How do current Dutch open data sources promote trust by incorporating recordkeeping principles that support trust?
   Methodology: investigation of 3 open data sources:
   • one based on legislation, in this case the official Dutch base registry for buildings and addresses (BAG)
   • one not based on legislation but created by a government institution with specific directives for the registry, in this case official air quality data created by RIVM
   • one based on a crowd sourced registry which is re-used by government, in this case air quality data acquired by volunteers using sensors on bikes
   The investigation consists of determining whether the key components of trust are supported in the open data sources by the use of recordkeeping principles. The investigation will be based on information available for public use; to identify the relevance of the transparency and presence of all relevant information for public use, additional information may be solicited from the owner and/or publisher of the information when not publicly available.

4. What recommendations can be made for the Dutch data governance and open data landscape to promote trust of open data (environments)?
   Methodology: analysis of the results of research questions 1, 2 and 3 and elaboration of recommendations, where applicable related to current developments in the field.
4. Recordkeeping principles supporting trustworthiness of open data
4.1 Key components of trust and supporting recordkeeping principles
The first research question aims to identify the key components of authoritative information resource management that support the level of trust in open data. Literature of both the archival science domain and the data governance & -science domain has been studied, and the concepts of both social based trust and content based trust have been taken into account.

From an archival perspective, Donaldson & Conway (2015) studied user conceptions of trustworthiness for digital archival documents. They conclude that the way to create and guarantee trust is to be able to preserve the identity and integrity of digital records. Donaldson & Conway did a qualitative study on information in the form of images to determine the components of trustworthiness and their relative importance. The following components have been identified and confirmed to be relevant for trust from the archival perspective, related to the Integrated Model of Trust in Information from Kelton et al. (2008):
• Accuracy: believed to be free of error
• Believability: the extent to which the information appears to be plausible
• Coverage: completeness of the information
• Currency: the degree to which the information is up-to-date
• Objectivity: balance of content
• Stability: the persistence of information, both its presence and contents
• Validity: the use of responsible and accepted practices such as the soundness of the
methods used, the inclusion of verifiable data, and the appropriate citation of sources
Donaldson & Conway also uncovered the following emergent themes that were not identified
within the Integrated Model of Trust in Information:
• Perceived authenticity: Is it fake?
• Inaccurate information: conceptualizing documents as being trustworthy despite
containing inaccurate information
• Primary or first-hand evidence: the extent to which the document is primary or first-
hand
• Document’s legibility or readability
• Document's perceived proper form (this relates to "coverage", "stability", "validity" and "readability")
From a data governance & -science perspective, Wang et al. (2011), in their study on trust of machine-created data, conclude that trust is related to the reputation of the creator and/or publisher of data. The reputation of the creator and/or publisher is related to the supposed quality of the data created and/or published. The quality indication is based on assessment by experts, reputation rules of computational systems, or the re-users of the information. The reputation is based on the transparency and consistency of the execution of business rules, which determine part of the quality of the data. Data quality assurance is therefore crucial and directly related to authoritative information resource management. Strong et al. (1997) and Mazon et al. (2012) recognize "fitness for use" as the key factor of data quality (based on the components topicality, completeness, correctness and precision) and an important criterion for data analytics and business. The two key components identified for the trustworthiness of data are thus the reputation of the creator and/or publisher, and the quality of the data / its fitness for use.
From an open data perspective, the Sunlight Foundation defined and published the Open Data Principles (Sunlight Foundation, 2010). These principles are a combination of good practices and requirements to promote and support the (re)use of open data. They are supposed to support trust and to lower the barriers for (re)using open data. The principles are:
1) Completeness, including release of descriptive metadata, with the highest possible
level of granularity, which will not lead to personally identifiable information;
2) Primacy, collected at the source, including information on how and where data were
collected to allow verification by users;
3) Timeliness, data should be released as quickly as possible;
4) Ease of physical and electronic access;
5) Machine-readable, in formats that allow machine-processing;
6) Non-discrimination, available to anyone with no requirement of identification or
justification;
7) Use of commonly-owned or open formats;
8) Licensing, no imposition of attribution requirements and preferably labelled as part of
the public domain;
9) Permanence, data should remain online with appropriate version-tracking and
archiving;
10) Usage costs, data available preferably free of charge.
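The ten principles above can be operationalized as a simple checklist. The sketch below is illustrative only: the principle keys and the example dataset with its boolean scores are assumptions introduced for this illustration, not part of the Sunlight Foundation's text.

```python
# Illustrative checklist for the ten Open Data Principles listed above.
# The shorthand keys and the example dataset are hypothetical.
PRINCIPLES = [
    "completeness", "primacy", "timeliness", "ease_of_access",
    "machine_readable", "non_discrimination", "open_formats",
    "licensing", "permanence", "usage_costs",
]

def assess(dataset: dict) -> tuple[int, list[str]]:
    """Return the number of principles met and the list of unmet ones."""
    unmet = [p for p in PRINCIPLES if not dataset.get(p, False)]
    return len(PRINCIPLES) - len(unmet), unmet

# Hypothetical example: a dataset that meets every principle except
# primacy and permanence.
example = {p: True for p in PRINCIPLES}
example["primacy"] = False
example["permanence"] = False

score, missing = assess(example)
print(score, missing)  # 8 ['primacy', 'permanence']
```

A checklist like this only records whether a principle is met at all; the assessment form elaborated later in this chapter asks more detailed questions per principle.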
The key components of the Sunlight Foundation principles that relate to the trustworthiness of
the data itself are completeness, primacy and timeliness. Components 4 to 10 relate more to
the sustainable accessibility of open data. For this research, accessibility is largely taken for
granted, because the topic of research is open (i.e. accessible) data. However, two aspects that
relate to the sustainability of that accessibility are taken into account: machine-readability (5)
and permanence (9).
In paragraph 2.4 the recordkeeping principles that support trust have been identified:
descriptive metadata and repositories that guarantee integrity and preservation. In order to
verify whether these recordkeeping principles support all the identified trust components, an
integration matrix has been made. Starting from the trust components from the archival
perspective, which are the most granular, the corresponding trust components identified in the
data governance & data science domain and in the Sunlight Foundation principles are
connected (placed in the same row) based on their definitions. Each row is then classified as
relating to social based trust and/or content based trust. Finally, the identified recordkeeping
principles are related to each combination of trust components.
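The row-joining step described above can be sketched as a small data structure. The sketch below is illustrative only: it uses a subset of the rows and shorthand principle names, both assumptions derived from the full matrix.

```python
# Illustrative subset of the integration matrix: each row links corresponding
# trust components from the three perspectives to the trust types it relates
# to and the recordkeeping principle that supports it. Row contents are a
# shorthand rendering, not the complete matrix.
ROWS = [
    ("Accuracy", "Fitness for use", "Completeness",
     {"content"}, {"descriptive metadata"}),
    ("Stability", "Reputation of creator/publisher", "Permanence",
     {"social", "content"}, {"repository (integrity & preservation)"}),
    ("Currency", "Fitness for use", "Timeliness",
     {"content"}, {"descriptive metadata"}),
]

def principles_for(trust_type: str) -> set[str]:
    """Collect all recordkeeping principles supporting a given trust type."""
    out = set()
    for _archival, _data, _open, types, principles in ROWS:
        if trust_type in types:
            out |= principles
    return out

print(sorted(principles_for("content")))
# ['descriptive metadata', 'repository (integrity & preservation)']
```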
Table 4.1 Relation of trust components (three perspectives), trust type and recordkeeping principles

Archival user perspective | Data science perspective | Open data perspective | Trust type | Supporting recordkeeping principle
Accuracy | Fitness for use (4) | Completeness | Content based trust | Descriptive metadata (5)
Believability | Reputation of creator and/or publisher; fitness for use | Primacy; completeness | Social and content based trust | Descriptive metadata; repository that guarantees integrity and preservation
Coverage | Fitness for use | Completeness | Content based trust | Descriptive metadata
Currency | Fitness for use | Timeliness; permanence | Content based trust | Descriptive metadata
Objectivity | Reputation of creator and/or publisher | Primacy | Social and content based trust | Descriptive metadata
Stability | Reputation of creator and/or publisher | Permanence | Social and content based trust | Repository that guarantees integrity and preservation
Validity | Reputation of creator and/or publisher | Primacy | Social and content based trust | Descriptive metadata
Perceived authenticity (6) | Reputation of creator and/or publisher | Primacy | Social and content based trust | Repository that guarantees integrity and preservation
Inaccurate information | Fitness for use | Primacy | Social and content based trust | Descriptive metadata
Primary or first-hand evidence | Reputation of creator and/or publisher | Primacy | Social based trust | Repository that guarantees integrity and preservation
Document legibility or readability | Fitness for use | Machine-readable; commonly-owned or open formats | Social and content based trust | Repository that guarantees preservation
Document’s perceived proper form | Fitness for use | Primacy | Social and content based trust | Descriptive metadata
(4) Fitness for use is the main component of data quality: the extent to which data are suitable for the purpose for which they have been created and used. Aspects related to data quality are topicality, completeness, correctness and precision; a good description also helps to assess whether the data will be suitable for reuse for another purpose.
(5) Descriptive metadata refers to all recordkeeping metadata that allow unique identification of the record, allow verification of the provenance (including the juridical and administrative context, i.e. actors, and the procedural context, i.e. processes, use of standards etc.) and describe the context of the record in support of its reliability and usability (including accuracy, coverage and currency).
(6) In this research authenticity is defined as the combination of verifiable identity and verifiable unchanged original content.
Summarizing the integration matrix, it can be concluded that:
• although there are differences in terminology and granularity, all identified trust
components from the three perspectives are present, can be related to one another and
can be supported by the use of recordkeeping principles
• most combinations of trust components relate to both content based and social based
trust; this confirms the interrelation between the two forms of trust: social based trust
induces content based trust and vice versa
• from an open data perspective most components are not incorporated in the integration
matrix because they are related to (sustainable) access (the components ease of
physical and electronic access, non-discrimination, licensing, usage costs); the
components related to access and legibility are incorporated (machine-readable and
use of commonly-owned or open formats)
In the above approach it is assumed that information is available and accessible. From the
perspective of this research this assumption is logical, because open data is what is being
assessed. However, the availability and accessibility of information in general is not
self-evident. Public bodies are encouraged, or in some cases obliged, to publish their
information subject to minimal requirements on the use of (standard) metadata and open
standards for accessibility (such as linked data, standard APIs and standard services).
Because of the importance of this aspect of accessibility in relation to legibility
(machine-readability and the use of commonly-owned or open formats), it will be assessed in
this research as well.
In this paragraph the key components in authoritative information resource management that
support the level of trust of open data have been identified, including the recordkeeping
principles that support them. This identification is based on literature and has resulted in the
elaborated integration matrix. It presents the corresponding trust components from three
perspectives, their relation to social based and content based trust, and the recordkeeping
principles that support them.
In order to verify and enrich this result, the integration matrix has been analysed by a small
focus group of experts who are working on new architectures for the sustainable access of
digital information, including data. The results of this process are presented in the next paragraph.
4.2 Focus group about valorisation of key components of trust
Literature has shown a broad approach to the trustworthiness of information. The integration
matrix is the result of a literature review with a focus on open government data and presents
the key trust components from both the archival perspective and the data governance & -
science perspective. In order to verify and enrich this result, several experts were invited to
participate in a focus group. Only two were able to do so: Paul Groth of the University of
Amsterdam and Erik Saaman of the National Archives of the Netherlands.
Looking at the integration matrix with the identified key components for trust, the experts
indicated that the matrix needed a better justification and/or explanation concerning the
corresponding terminology between the archival, the open data, and the data governance &
data science perspectives. The objective of this research, however, is to find the
recordkeeping principles supporting the trust components and to assess current open data on
that basis; an in-depth terminological comparison is therefore beyond the scope of the
research. To address this feedback, the integration matrix has been converted into a relational
diagram illustrating the relation of the recordkeeping principles to the trust components of
each particular domain perspective (see figure 4.1).
The archival perspective focuses on information that has to be trustworthy over time, the data
perspective on information that has to be trustworthy in content, and the open data
perspective on the accessibility of information. Recordkeeping principles support the
trustworthiness of information regardless of its form and use. The figure shows that all trust
components are covered by the three recordkeeping principles.
Secondly, the difference between trustworthiness based on provenance or trustworthiness
based on the information itself was discussed.
As elaborated earlier, trust can be based on social structures. It reflects the trust in a social
system with known identities of the actors. Provenance describes and confirms where
information comes from and how it was created and/or obtained. This social trust is based on
trust in the creator and/or provider of information. It is often based on clear and verifiable
procedures for the creation and maintenance of the data, which support the assessment of the
quality of the data. If trust in the social system is absent or declining, alternatives arise, such
as systems based on blockchain technology. These initiatives claim to support trust through
complete transparency and the consensus of a community, supporting the auditability of the
provenance of information. Key aspects are verifiable identity and verifiable transparency.
Figure 4.1 Relations of the trust components of each perspective with the recordkeeping
principles
Trust can be based on the (quality of the) information itself as well. This trust can be obtained
by assessing the content of the information based on its characteristics. This can be done
independent from the social trust but it can also be one of the backbones of the social trust
(“this creator of data always produces good data”). To assess the content of the data,
however, supporting structures with contextual information are needed to understand the data
(Borgman, 2015). This process of sensemaking of data has been an object of study. Faniel et
al. (2019) have created a typology of the contextual information needed to support data
evaluation across three disciplinary domains, finding that information about data production,
data repositories and data usage is key in making decisions about reusing data.
It was concluded that it is important to distinguish between these two different perspectives of
trust, but the terminology used to distinguish between trustworthiness based on provenance or
trustworthiness based on the information was not that clear. Therefore clearer terminology
was suggested: social based trust and content based trust. This terminology has been
incorporated retroactively in this research.
Thirdly, a focus was put on the metadata that is most relevant for trust in data. In “Talking
datasets – understanding data sensemaking behaviours”, Koesten et al. (2020) elaborate on a
study focussing on the user perception of datasets and identified necessary data attributes that
support trust because they support sensemaking activities like inspecting data, engaging with
content, and placing data within broader contexts. The study concludes with documentation
practices that can be used to facilitate sensemaking and subsequent data reuse, and that
therefore support trust and the long-term preservation of meaning.
The recommended attributes describe the following basic characteristics: format, provenance
(including research field and methods used), purpose, topic, location, quality and uncertainty,
and time. Nevertheless, the authors plead for a more expanded and robust approach than
current standardized metadata and reporting conventions to structure data and support
sensemaking.
The above-mentioned attributes have been compared with the metadata standard proposed by
the National Archives of the Netherlands in the design for MDTO (Metagegevens voor
Duurzaam Toegankelijke Overheidsinformatie7). This minimal set of descriptive metadata is
being defined by the National Archives in cooperation with the Dutch Association of
Municipalities and other governmental bodies. It is supposed to cover the trust components,
although its objective is to support the process of “overbrenging” (i.e. transfer) of digital
information to archival repositories. Nevertheless, it should be equally applicable to data that
are made available as records from their source information system or in specific open data
repositories.
The requirement of a minimal set of descriptive metadata will be used as one of the key
recordkeeping principles on which the open data sets are assessed in this research. The
assessment form that is elaborated in the next chapter has been detailed with those metadata
elements found important for trust that coincide with equal or similar metadata elements
within the MDTO. An additional focus will be on candidates for expansion of current
standardized metadata, if applicable to the data to be researched, such as certain types of
information structures attached or linked to the dataset, records and/or columns that provide
context and facilitate particular sensemaking patterns over others.
7 https://www.nationaalarchief.nl/sites/default/files/field-file/OD08_TMLO.pdf
Finally, a reference and comparison was made with the FAIR principles8. The acronym
stands for findable, accessible, interoperable and reusable. The FAIR guiding principles
promote optimizing datasets for reuse by both humans and machines. This implies sufficient
human-readable and/or machine-readable metadata to describe the datasets and their
components. Of these principles, the following two are most relevant to this research:
“(Meta)data are associated with detailed provenance” and “(Meta)data meet domain-relevant
community standards”.
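The two FAIR principles quoted above can be illustrated with a minimal machine-readable metadata record and a completeness check. The sketch below is illustrative only: the field names are loosely inspired by DCAT but are assumptions for this example, not a formal DCAT serialization, and the example values are hypothetical.

```python
# A minimal, DCAT-inspired metadata record for a dataset. Field names and
# values are illustrative assumptions, not a formal DCAT serialization.
record = {
    "title": "Example air quality measurements",
    "publisher": "Example agency",
    "licence": "https://creativecommons.org/publicdomain/zero/1.0/",
    "provenance": {"creator": "Example agency", "method": "sensor network"},
    "conforms_to": "https://www.w3.org/TR/vocab-dcat-2/",
}

# The two FAIR principles singled out in the text: detailed provenance and
# domain-relevant community standards (plus a licence for reusability).
REQUIRED_FOR_FAIR_REUSE = ("licence", "provenance", "conforms_to")

def missing_fields(md: dict) -> list[str]:
    """Return the required fields that are absent or empty in a record."""
    return [f for f in REQUIRED_FOR_FAIR_REUSE if not md.get(f)]

print(missing_fields(record))  # []
```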
While the FAIR principles refer to (meta)data, a related initiative has appeared: the TRUST
principles for FAIR.9 These describe the requirements for data repositories for managing and
disseminating data over a long period of time. The components most relevant to this research
are the requirements related to reliable and secure operations (related to the integration
matrix recordkeeping principle of a repository that guarantees the integrity of the
information) and the support of long-term data and knowledge preservation (related to the
integration matrix recordkeeping principle of a repository that guarantees preservation).
The results of the focus group have shown that there are no fundamental differences in
valorisation of the key components for trust from the archival perspective and the data
governance & - science perspective. Both perspectives recognize the importance of the key
components. The feedback of the focus group has been incorporated in the research: the
integration matrix was reviewed by the two experts, each from his own perspective (the
archival perspective and the data governance & data science perspective). The investigation
of this research consists of determining whether the key components of trust are supported in
the open data sources by the use of recordkeeping principles. To that end an assessment form
has been elaborated for assessing the open data sets. This form is based on the literature
review, the integration matrix and the feedback of the experts, and is presented in the next
paragraph.
8 https://en.wikipedia.org/wiki/FAIR_data
9 https://www.slideshare.net/daweilin/trust-principles-2019rda
4.3 Form to assess the open data
The integration matrix and the related relational diagram (figure 4.1) have been transformed
into an assessment form (Table 4.2).
Table 4.2 Assessment form for assessing recordkeeping principles in open data sets
Descriptive metadata
Minimal available descriptive metadata: can the juridical and administrative context (i.e. actors) and the
procedural context (i.e. processes, use of standards) be derived? Is this descriptive metadata available at
dataset level, record level or attribute level?
Is the format of the data described?
Provenance, general: can the identity of the creator be identified and verified?
Provenance, specific: can the process / methods be identified that have been used to create or derive the data?
Are the data primary (no selection, no processing (except processes of dissociation of personal data) and no
copying (accessing the source))?
Topic: what do the data represent; do the data have a defined taxonomy or ontology?
Purpose: for what purpose have the data been created?
Location: to what location are the data related?
Time: to what time frame are the data related? What is the actuality of the data and metadata? Do historical
data remain online with appropriate version-tracking and archiving? Is continuous availability guaranteed by a
legal or policy commitment?
Quality and uncertainty: is a description of correctness, completeness and precision/granularity present?
What other non-domain-specific metadata are available that support trust components?
Are the metadata comprehensive; do the metadata comply with metadata standards (international standards,
domain standards)?
Repository
Repository accessibility
Are the data findable and accessible: how can the data be found (portal, search engine), how can the data be
accessed (API, linked data/URI, (download) portal), and which commonly-owned or open (including
machine-readable) formats are used?
Are the data accessible from their source information system or a recognized repository for publishing?
Repository integrity
Are technologies being used that guarantee the integrity of the source information system or repository,
securing that the information has been “frozen” at the right moment and therefore cannot be changed
anymore?
Repository preservation
Does the information system or repository incorporate preservation technology to guarantee readability and
usability in the future? Are the records (metadata and data) available in a sustainable format according to the
5-star model of Tim Berners-Lee (http://5stardata.info/en/): open licence (any format) / machine-readable
structured data / non-proprietary format / open standards from W3C / linked open data?
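The 5-star model referenced in the preservation question is cumulative: each star presupposes the previous ones. It can be sketched as a small rating function; the parameter names below are assumptions introduced for this illustration.

```python
def five_star_rating(open_licence: bool, structured: bool,
                     non_proprietary: bool, uses_uris: bool,
                     linked_to_other_data: bool) -> int:
    """Return the cumulative 5-star open data rating (Berners-Lee scheme)."""
    if not open_licence:
        return 0
    stars = 1                          # 1 star: open licence, any format
    for step in (structured,           # 2 stars: machine-readable structured data
                 non_proprietary,      # 3 stars: non-proprietary format
                 uses_uris,            # 4 stars: W3C open standards (URIs, RDF)
                 linked_to_other_data):  # 5 stars: linked open data
        if not step:
            break
        stars += 1
    return stars

# Hypothetical example: an openly licensed CSV on a download portal
# (structured, non-proprietary, but without URIs or links).
print(five_star_rating(True, True, True, False, False))  # 3
```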
The form is the principal component of the analytic framework for the assessment of the open
data sets. It makes it possible to assess how current Dutch open data sources support the key
components of trust by incorporating recordkeeping principles.
The form consists of questions regarding the presence and/or incorporation of the identified
recordkeeping principles. Each recordkeeping principle has been translated into relevant
questions that indicate whether the principle is present or whether its objectives are met.
These detailed questions are based on the literature study and the feedback from the experts
of the focus group.
As indicated earlier, the essence of open data is that it is available and accessible. However, it
is not self-evident that this is done using (open) standards for accessibility (such as linked
data, standard APIs and standard services). Therefore these accessibility-related components
are assessed within the assessment form as well.
The presented form is used to assess each of three open datasets within the Dutch open data
landscape on the level of incorporation of recordkeeping principles. The results of the
investigation are presented in the next chapter.
5. Incorporation of recordkeeping principles in Dutch open data
5.1 Introduction
The third research question addresses how current Dutch open data sources promote trust by
incorporating recordkeeping principles. To assess the level of incorporation of recordkeeping
principles in the Dutch open data landscape, three different datasets have been analysed using
the assessment form presented in the previous chapter.
The investigation has primarily been based on the information available for public use; to
identify the relevance of the transparency and presence of all relevant information for public
use, additional information has been solicited from the owner and/or publisher of the
information when it was not publicly available.
The selection of the three datasets has been based on the supposed level of formalisation of
the environment in which the data were created. This has been done by analogy with the
continuum of land rights. Within the land administration domain the concept of a continuum
of land rights exists10: a concept or metaphor for understanding the diversity in tenure rights,
varying from informal land rights, like customary rights, to formal land rights, like registered
freehold. An analogy can be made with the level of formalisation of data creation. Data can
be created through formal processes defined by legislation and executed by government
bodies or contractors. Data can also be created in a much less formal way, for instance
through initiatives of citizens to provide governments with data they think might support
decision making or policy making, or challenge government decisions. These are the two
extremes of a continuum of data creation environments which vary in formalisation; see the
figure below.
Figure 5.1 Formalisation continuum of data creation environments
The selection of datasets based on supposed differences in the level of formalisation allows
for a comparison, across creation environments, of how trust components are supported by
the incorporation of recordkeeping principles.
The first dataset is supposed to be well formalised. It is an official governmental dataset,
regulated by law: the official Dutch base registry for buildings and addresses (BAG).
10 https://unhabitat.org/secure-land-rights-for-all
The second dataset is supposed to sit somewhere in the middle of the formalisation
continuum. It is a dataset created by several governmental bodies and comprises air quality
measurements. This registration is not regulated by legislation but has certain formalisation
components, like strict domain requirements on how air quality should be measured.
The third dataset is supposed to be an informal dataset. It also comprises air quality
measurements, but has been initiated through crowd initiatives and is supported by official
institutions like RIVM. The information is published by a private platform, the City
Innovation Platform, which is used by government bodies as well.
The data itself, the corresponding metadata and the references to other relevant information
related to the data have been assessed based on the information available as open data on the
internet. For each dataset the following aspects are described:
• how and where the data and additional information has been accessed for this research
• a description of the objective and content of the data: the source data, the published
part of the source data, a visualization of these data, who created the data and for what
objective
• the assessment of the incorporation and/or presence of the recordkeeping principles
identified in the previous chapter, based on the form which has been elaborated in the
same chapter
• a reflection on the use of the data and the relevance of recordkeeping principles to
support trust, as far as this can be identified
Because not every aspect could be assessed based on the publicly available information, a
request has been sent to the three providers of the open data to answer the following
questions:
1. Has the metadata that is available for the dataset been provided by the owner of the
data, or is this metadata generated at the moment the data is incorporated in the
publication platform?
2. Is the metadata valid for the dataset as a whole, or also for every individual
information object?
3. Is additional information available about the processing (selection, conversion,
calculations etc.) that has been done in transferring data from the source to the
publication platform?
4. Is the history of the information (and related processing) available?
5. Is the information secured in such a way that it cannot be changed anymore, neither
by users accessing the information from the open data platform or services nor by the
providers of the open data (i.e. has the information been “frozen”)?
6. How do you guarantee that the information will be legible and usable in the future: is
this based on, for instance, conversions of the data, or on other methods / techniques
and/or agreements?
The answers to these questions by the providers of the information help to identify the
relevance of transparency about the way data has been created, processed and published. The
answers have been incorporated in italics, both in the paragraphs that describe each dataset
and in the assessment forms. The next three paragraphs describe the aspects of each of the
three datasets that are part of this investigation. Subsequently the three datasets are compared
and analysed with respect to the presence and/or incorporation of trust components. Finally,
the usefulness of the assessment form itself is evaluated.
5.2 The Dutch base registry of buildings and addresses (BAG)
How the data has been accessed for this research and based on what information
To start the investigation, the platform data.overheid.nl has been accessed first. This platform
is a registry of available open data generated by Dutch governmental institutions.
Registration of open data sets is not mandatory, so the registry does not contain all available
open government data.
The platform provides the basic information about the base registry BAG. On the platform a
general description is given of the data. References are made to other websites for more in-
depth information. The creation, maintenance and use of this information is defined by law11.
The minimal quality requirements are described and are audited on a yearly basis.12
A complete catalogue description of the BAG data is available13. This catalogue describes the
requirements for data quality and details about the information objects, their attributes and
their relations.
The actual data that are available as open data can be accessed in several ways:
• visualization with the specific app bagviewer14
• download from the website of the Dutch Cadastre15: this is a paid service for private
companies and private persons. Two versions can be downloaded: a version that
contains the data that were current on a given date, or a version that contains all data,
including the historical data of the life cycle of the information objects
• use of specific services like WFS16
• use of REST APIs17, including a REST API for the use of linked BAG data
• using SPARQL18
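To give an impression of the last access route, the sketch below builds a SPARQL query of the kind that could be posed to the PDOK SPARQL endpoint mentioned above. It is illustrative only: the property path (rdfs:label) and the filter are assumptions and have not been verified against the live BAG linked data vocabulary; the query is constructed but not sent.

```python
from textwrap import dedent

def build_bag_query(search_term: str, limit: int = 25) -> str:
    """Build a SPARQL query selecting resources whose label contains a term.

    Illustrative sketch only; the property used is an assumption about the
    BAG linked data vocabulary.
    """
    return dedent(f"""\
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?resource ?label WHERE {{
          ?resource rdfs:label ?label .
          FILTER(CONTAINS(STR(?label), "{search_term}"))
        }} LIMIT {limit}""")

query = build_bag_query("Haarlem")
print(query.splitlines()[0])  # PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
```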
11 https://www.geobasisregistraties.nl/basisregistraties/adressen-en-gebouwen/bag-wet-en-regelgeving
12 https://www.geobasisregistraties.nl/basisregistraties/documenten/publicatie/2019/06/21/kwaliteitskader-bag-2019
13 https://www.geobasisregistraties.nl/binaries/basisregistraties-ienm/documenten/publicatie/2018/03/12/catalogus-2018/Catalogus-BAG-2018.pdf
14 https://bagviewer.kadaster.nl/lvbag/bag-viewer/index.html# geometry.x=160000&geometry.y=455000&zoomlevel=0
15 https://zakelijk.kadaster.nl/-/bag-extract
16 https://www.nationaalgeoregister.nl/geonetwork/srv/dut/catalog.search#/metadata/1c0dcc64-91aa-4d44-a9e3-54355556f5e7
17 https://data.pdok.nl/bag/api/v1/
18 https://data.pdok.nl/sparql#
Description of the objective and content of the data
The Dutch base registry of buildings and addressed is part of a framework of 13 base
registries in the Netherlands19. These base registries form the core of the Dutch governmental
information landscape and are fundamental for both governmental bodies and other parties
that use basic information about buildings and addresses. Within Dutch government, the use
of the information of this registration is obligatory.
The BAG registry incorporates both current information about addresses and buildings and
historical information (addresses that no longer exist, buildings whose geographical
description has changed over time and whose old descriptions are still available).
Dutch municipalities are responsible for maintaining this information. The information can
be published as open data by every municipality itself (for instance, the municipality of
Haarlem publishes this information on its own open data platform20). However, the official
distribution and publication of this information is done by the “Dienst voor het kadaster en de
openbare registers”, the Netherlands’ Cadastre, Land Registry and Mapping Agency. It
receives a daily copy of all changes made by the Dutch municipalities. These changes are
processed into a national information system (called LVBAG, Landelijke Voorziening BAG),
which is used for “private” distribution, for reuse of this complete information by
institutional users like government bodies.
For the publication of this information as open data, a subset is selected and published on the
platform PDOK (“Publieke Dienstverlening Op de Kaart”). Based on the questions sent to the
Kadaster for this research, the following was indicated:
the national information system for BAG (LVBAG) is responsible for the metadata of
the dataset as a whole. From the LVBAG a set of mutations is generated on a daily
basis, which is processed within the publication platform PDOK. This is an
automated process. History is not maintained in the PDOK platform but only in the
LVBAG. Only the municipalities can make changes within their source data, based
on the established mechanisms. Applied changes result in new “versions” of objects,
maintaining the history (earlier “versions” of objects). Changes are made only by
authorised users with certificates, in accordance with information security protocols.
Changes in the platforms (LVBAG and/or PDOK) are discussed, prepared and
executed together with all stakeholders to guarantee the sustainable accessibility and
usability of the data.
The following image gives an impression of the data available via the PDOK platform. These
data have been visualized using the BAG viewer application provided by the Dutch Cadastre.
19 https://www.digitaleoverheid.nl/overzicht-van-alle-onderwerpen/basisregistraties-en-
stelselafspraken/inhoud-basisregistraties/
20 https://www.haarlem.nl/opendata/
Figure 5.2 Visualisation of the building and address information using the BAG viewer
Assessment of the presence of recordkeeping principles
Based on the publicly available documentation as mentioned in the description of the dataset
and an analysis of the available open data itself, the presence and/or incorporation of
recordkeeping principles has been assessed using the assessment form elaborated for this
research.
Table 5.1 Assessment form for BAG data
Minimal available descriptive metadata: can the juridical and administrative context (i.e.
actors) and the procedural context (i.e. processes, use of standards) be derived?
Is the format of the data described? The format of the data is described partially: the
format of a download is not available online; the
format of the web service that delivers the data, the
WFS (web feature service)21, is described.
Provenance, general: can the identity of the creator
be identified and verified?
The identity of the formal creator is defined as the
attribute “Bronhouder” of each object. It can also be
derived from the geographical position of the object
within a municipal jurisdiction combined with the
legal responsibility that every municipality has to
create and maintain this data.
Provenance Specific: can the process / methods be
identified that have been used to create or derive the
data?
The process / methods can partially be identified:
every municipality has its own specific process for
creating and updating BAG objects, but the minimal
national requirements create a common base that
defines the fundamental subprocesses required
(some of the information of these subprocesses is
incorporated as source documents, which comprise
the results of the minimally required
procedures/subprocesses such as a building
permission, an address allocation or a surveying
process)
Are the data primary (no selection, no processing
except dissociation of personal data, and no copying,
i.e. the source itself is accessed)?
The data are not primary: the data are a copied subset
of the data which are delivered by each municipality
to the LVBAG; the LVBAG has additional data and each
municipality can have more additional data available
in its own source data (for instance the municipality of
Haarlem22).
Topic: what does the data represent; do the data have
a defined taxonomy or ontology?
The data do have a defined taxonomy. The data
model description, with the significance of each
component, is available for all classes and each of
their characteristics23
21 https://www.nationaalgeoregister.nl/geonetwork/srv/dut/catalog.search#/metadata/1c0dcc64-91aa-4d44-a9e3-54355556f5e7
22 opendata.haarlem.nl
23 https://bag.basisregistraties.overheid.nl/datamodel
Purpose: for what purpose has it been created? The purpose of this data can be found in the
legislation related to the base registry. The purpose in
this case is to serve as unique base information for
several kinds of processes that depend on address
and/or building information (for instance supporting
the processes of registration of change of residential
address of citizens, the process of emission of building
permits or the taxation of real estate).
Location: to what location is the data related? The BAG data are geodata. This means that location is
incorporated in each information object. The relevant
classes within the data set (building, address, public
space) have a geographic description incorporated as
an attribute / characteristic of each individual
instance of the class.
Time: To what time frame is the data related? What’s
the actuality of the data and metadata; does historical
data remain online with appropriate version-tracking
and archiving; is continuous availability guaranteed by
a legal or policy commitment?
The open data source only contains data that is
current. Historical data is available in the LVBAG and
in the municipal source registries. The relevant
classes (entity and graph) have an indication of
BeginGeldigheid (start date of validity) and
EindGeldigheid (end date of validity). Version tracking
is incorporated in the mechanism of validity dates:
each change of an information object causes the old
object to be archived and a new object to be created.
Because all these data have to be available over time,
the Dutch Cadastre guarantees the “archiving” of the
BAG data by continuous conversion of the LVBAG
information system without loss of information.
Quality and uncertainty: is a description of
correctness, completeness, and precision/granularity
present?
There is no direct description of the correctness,
completeness or precision/granularity within the
metadata. The data is supposed to comply with the
regulations regarding correctness, completeness and
precision that are defined by law; compliance is
audited once a year within each individual
municipality based on the official quality
requirements. These regulations, in combination with
auditing reports24, can be considered metadata
related to quality and uncertainty.
What other non-domain specific metadata are
available that support trust components?
There are no other specific metadata identified that
support trust components
24 The auditing reports are not published as open data but can be consulted at the Ministry of Interior
Are the metadata comprehensive; do the metadata
comply with metadata standards (international
standards, domain standards)?
The metadata to describe the dataset depend on the
way you access the data; in this research the WFS has
been used; the metadata description of the BAG
service25 is compliant with the European INSPIRE
standards / ISO 19119 and this is validated based on
the ISO standard 1913926; the metadata within the
data itself are domain specific for this base registry
and can be derived from the BAG data model
Repository: accessibility, integrity and preservation
Are the data findable and accessible: How can the
data be found (portal, search engine), how can the
data be accessed (API, linked data/URI, (download)
portal), which commonly owned or open formats
(including machine readable) are used
The data can be found using data.overheid.nl,
georegister.nl and pdok.nl (as well as in the open data
portals of individual municipalities, which may
incorporate richer information than the data
published at national level). The data can be accessed
using standardized, machine-readable services and
APIs, and can be downloaded as well in open formats
like XML.
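As an illustration of the standardized access described above, an OGC WFS 2.0 GetFeature request can be constructed as follows. This is a minimal sketch: the endpoint URL and the feature type name are placeholders, not the actual service values (those are listed in the service metadata on nationaalgeoregister.nl); the query parameters themselves are fixed by the WFS standard.

```python
from urllib.parse import urlencode

# Sketch of a standard OGC WFS 2.0 GetFeature request against a BAG
# web feature service. The endpoint URL and the feature type name are
# placeholders; the real values come from the service metadata.

BASE_URL = "https://example.org/bag/wfs"   # placeholder endpoint

def getfeature_url(type_name, max_features=10):
    params = {
        "service": "WFS",          # fixed by the OGC standard
        "version": "2.0.0",
        "request": "GetFeature",
        "typeNames": type_name,    # e.g. a "pand" (building) layer
        "count": max_features,     # limit the response size
        "outputFormat": "application/gml+xml; version=3.2",
    }
    return BASE_URL + "?" + urlencode(params)

url = getfeature_url("bag:pand")
# The URL can then be fetched with any HTTP client; the response is GML.
```

No request is performed here; the point is that the access interface is an open, documented standard rather than a proprietary one.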
Is the data accessible from its source information
system or a recognized repository for publishing?
The PDOK environment is a recognized repository for
publishing. It contains copies of data that reside in
their own, closed source information systems. The
copies are a subset of the data in the source
information systems.
Are technologies being used that guarantee the
integrity of the source information system or
repository: securing that the information at the right
moment has been “frozen” and therefore cannot be
changed anymore?
Changes to the repository are executed by automated
mechanisms that process the changes identified in the
centralized source information system LVBAG.
Changes in the LVBAG can only be done by
municipalities, using their specific software and
protocols that update the LVBAG repository based on
the changes of the information in their local
information system. Changes to information always
lead to the archiving of the old object and the
creation of a new object (in the municipal information
system and the LVBAG). In this way changes to the
information are tracked and stored, but these versions
are not available in the open data environment.
25 https://www.nationaalgeoregister.nl/geonetwork/srv/dut/catalog.search#/metadata/1c0dcc64-91aa-4d44-a9e3-54355556f5e7?tab=inspire
26 ISO 19139 provides the XML implementation schema for ISO 19115 specifying the metadata record format
and may be used to describe, validate, and exchange geospatial metadata prepared in XML
Does the information system or repository
incorporate technology of preservation to guarantee
readability and usability in the future? Are the records
(metadata and data) available in a sustainable format
(5 star model of Tim Berners-Lee
(http://5stardata.info/en/): open licence /
machine-readable structured data / non-proprietary
format / open standards from W3C / linked open
data)?
The repository guarantees readability and usability by
using conversion of the data when changes are made
to the technical platform guaranteeing that no data is
lost.
Data remains readable and legible because of the use
of services that comply with international standards
like WFS and, recently, the incorporation of Linked
Data. Because of the use of WFS and linked data, the
data is machine readable.
The license is open: CC-0 (1.0)
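The versioning mechanism assessed above, in which a change closes the old version with an EindGeldigheid date and creates a new version with a fresh BeginGeldigheid date, can be sketched as follows. The record layout is illustrative and not the actual BAG schema; only the two validity-date attribute names are taken from the assessment.

```python
from datetime import date

# Sketch of archiving-by-versioning: a change closes the current version
# by setting its end-date of validity (EindGeldigheid) and appends a new
# version with a fresh start-date (BeginGeldigheid). The record layout
# is illustrative, not the actual BAG schema.

def change_object(versions, new_attributes, change_date):
    """Archive the current version and append the new one."""
    if versions:
        versions[-1]["EindGeldigheid"] = change_date   # close old version
    versions.append({"BeginGeldigheid": change_date,
                     "EindGeldigheid": None,           # current version
                     **new_attributes})
    return versions

history = []
change_object(history, {"status": "Pand in gebruik"}, date(2015, 3, 1))
change_object(history, {"status": "Pand gesloopt"}, date(2020, 6, 1))
# history now holds both versions; only the last one is "current"
# (its EindGeldigheid is still open).
```

Nothing is ever overwritten in this scheme, which is what makes the earlier versions recoverable from the LVBAG even though the open-data copy exposes only the current one.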
A reflection on the use of the data and the relevance of recordkeeping principles for trust.
The first reflection is related to the principal question whether the open data has a level of
trustworthiness to be able to serve in legal contexts. This is the case for the BAG data.
Because of its legal foundation and an ample availability of metadata this information is used
as such (although for instance the actual creation and processing of the information of the
source is not verifiable because the actual processes within the municipalities are not
described).
The (re)use of BAG data is obligatory for Dutch government bodies. The use of the official
distribution mechanism LVBAG and of the PDOK information requests is monitored and
evaluated. Figure 5.3 shows an infographic that presents the
evaluation of the use of BAG information in 2018 by the different types of users of BAG
information. Of the respondents 61% indicate that their own products become more reliable
with the (re)use of BAG information27. This is an indication of the trust users have in this
data.
The report and infographics show details about content based trust, reporting how users rate
the quality of the provided data. There are mechanisms in place to report supposedly incorrect
content of the BAG data. These reports are investigated and the information is corrected if
applicable. This mechanism helps to support the trust in the content of the (open)
BAG data. Special platforms exist for users to share information about the use of the BAG.28
These platforms serve as input for future changes to the data structure, information systems
and information services. This participation of stakeholders supports the trustworthiness of
the BAG ecosystem as well.
27 https://www.geobasisregistraties.nl/binaries/basisregistraties-ienm/documenten/rapport/2018/08/06/rapportage-bag-tevredenheidsonderzoek-afnemers-2018/Rapportage+Afnemersonderzoek+BAG+2018+2.0.pdf
28 https://imbag.github.io/praktijkhandleiding/
Figure 5.3 Infographic on the re-use of BAG data
5.3 Air quality measurements by Dutch institutions
How the data has been accessed for this research and based on what information
The basic information about the registry of air quality data can be found and accessed using
the platform www.luchtmeetnet.nl. On the platform a general description is given of the open
information that can be accessed and about the different data-providers of air quality
measurements (the Dutch institute for public health (RIVM), the ministry of infrastructure and
water-management (I&W) and 5 regional institutions that measure air quality as well).
References are made to the individual websites of each institution for more in-depth
information. The creation, maintenance and use of this information itself is not defined by
Dutch legislation. Nevertheless, the Netherlands has to comply with European requirements
for air quality and agreements on lowering contamination. To monitor this compliance and
verify that actions taken have sufficient impact on air quality this network of air quality
measurement stations has been established. These measurements have to comply with
European regulations, the European air quality directive 2008/50/EG29. The regulations
indicate how many stations for measurements are required (related to population density and
level of air quality) and how the quality should be measured.
29 https://eur-lex.europa.eu/legal-content/NL/TXT/PDF/?uri=CELEX:32008L0050&from=en
The actual data that are available as open data can be accessed in several ways:
• visualization of current measurements, using the web platform30
• creation and downloads of reports of (historical) data, using the same web platform31
• use of specific services, including mapping32
• use of a specific API33
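As a sketch of the API access route listed above, a measurements request could be constructed as follows. The endpoint path and parameter names are assumptions based on the public API documentation (api-docs.luchtmeetnet.nl) and should be verified there; the station number is purely illustrative, and no network request is performed here.

```python
from urllib.parse import urlencode

# Sketch of retrieving hourly averages from the luchtmeetnet API.
# The endpoint path and parameter names are assumptions based on the
# public API documentation; verify them there before use.

API_BASE = "https://api.luchtmeetnet.nl/open_api"

def measurements_url(station, formula, start, end):
    query = urlencode({
        "station_number": station,   # assumed parameter name
        "formula": formula,          # air component, e.g. PM10
        "start": start,              # ISO 8601 timestamps
        "end": end,
    })
    return f"{API_BASE}/measurements?{query}"

# Illustrative station number; real numbers are listed on the platform.
url = measurements_url("NL10404", "PM10",
                       "2020-06-06T00:00:00Z", "2020-06-06T23:00:00Z")
# The JSON response could then be fetched with any HTTP client.
```

Because the API is freely accessible without registration (as noted later in this section), such a request needs no credentials, which also explains why the data providers cannot tell who is (re)using the data.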
Description of the objective and content of the data
Within the European Union countries are obliged to maintain their air quality within certain
limits. To support this compliance with air quality regulation, different networks of stations
exist that measure air quality. The information about air quality is generated by sensors
located within the Netherlands which continuously measure the levels of various air
components like O2, CO2, carbon, etc. The network consists of 95 measuring stations (44 are
national stations managed by the RIVM, 51 are regional stations). The registration itself is not
mandatory; determining air quality is, however, as is compliance with the prescribed methods
for measuring air quality.
The objective of this information is:
• to advise the general public when air quality is getting worse (a short-term goal)
• long-term analysis of air quality, to see whether policies work and which relations exist
with public health and other relevant aspects related to air quality
• to comply with European regulations by reporting on compliance with air quality
norms
Since 2014 the Dutch institute for public health and the environment (RIVM) has provided a
platform34 to publish the air quality in the Netherlands. The purpose of publishing this
information is on the one hand to inform the general public on the air quality over time, and
on the other hand to give experts access to (re)use the information for data analysis.
The measurements are published at luchtmeetnet.nl. There are three different types of data
available on the platform:
• measured values of air components of the current year that are not validated (the
status35 of this information is not definitive)
• measured values of air components of the current year that are validated (the status of
this information is not definitive)
• measured values of air components of past years that are validated (the status of this
information is definitive)
30 https://www.luchtmeetnet.nl/meetpunten
31 https://www.luchtmeetnet.nl/rapportages
32 https://geodata.rivm.nl/geoserver/wms?
33 https://api.luchtmeetnet.nl
34 https://www.luchtmeetnet.nl
35 This indication of status definitive or not definitive is not clearly indicated on the web platform but has
been provided by RIVM after consulting them directly
A clear statement is made about the validity of this information. The measurements published
at luchtmeetnet.nl are done with official reference devices. These devices are scientifically
viable and are calibrated. Nowadays a lot of cheaper sensors are available and are being used
to measure air quality (see the example in the next chapter of this research and the reference
website for crowd-sourced measurement of air quality36). The results of these sensors are not
comparable to the reference measurements.
The current measurements (measurements that are executed and registered in the current
year) are measured with reference devices but are not validated yet. They are published
instantly and are available via the map interface on luchtmeetnet.nl, or as non-validated
averages per hour via a specific API37. Raw data (each individual measurement) is not
available via luchtmeetnet.nl.
Data is also available as “reports”. It is stated that these data are validated reference
measurements, which means that measured values are post-processed to get the information
that is being published. Therefore this information can be used for quantitative analyses of
concentrations of air components.
After contacting RIVM, they provided some additional information about the metadata
registration and the post-processing of measurements. The platform doesn’t provide metadata
about the sensors used, although they are registered38. The metadata on how the post-
processing of measurements takes place is limited to a general description of the 4 principal
steps:
1. a direct control by a computer, which verifies whether the sensor has been functioning
correctly (this is related to temperature and pressure requirements); all measurements
published on the website have passed this direct control
2. a monthly control by humans; detailed values of stations are compared with values of
other stations, and values of different components of the same station are compared;
humans can interpret “strange” combinations that might indicate measurements that
are correct in themselves but taken under incorrect conditions, which might lead to
rejection of the measurements
3. a control of the monthly averages, to get insight at a more generic level into possible
(minor) systematic changes; these can be noticed because random deviations are
cancelled out by averaging
4. a yearly control: a comparison of the yearly averages of the components with the
averages of earlier years to find irregularities (i.e. bigger changes than expected);
control of the calibration standards for the components in order to define correction
parameters for the sensors or the software
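The first, fully automated step can be illustrated with a minimal sketch. The threshold values below are invented placeholders; the real operating requirements follow from the European measurement regulations and are not published in detail.

```python
# Sketch of the first validation step described above: an automated check
# that the sensor was operating within required conditions before a value
# is published. The threshold values are invented placeholders; the real
# requirements follow the European measurement regulations.

STATION_LIMITS = {
    "temperature": (15.0, 25.0),   # placeholder operating range, deg C
    "pressure": (980.0, 1040.0),   # placeholder operating range, hPa
}

def passes_direct_control(measurement):
    """Return True if the station's conditions were all within range."""
    for condition, (low, high) in STATION_LIMITS.items():
        if not low <= measurement[condition] <= high:
            return False   # value would be withheld from publication
    return True

ok = passes_direct_control({"value": 21.4, "temperature": 20.1,
                            "pressure": 1013.0})
bad = passes_direct_control({"value": 21.4, "temperature": 40.0,
                             "pressure": 1013.0})
```

The later steps (monthly human review, monthly and yearly averaging) cannot be automated this simply, which is consistent with the description of validation as a combined automated and human-revised process.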
36 www.samenmetenaanluchtkwaliteit.nl.
37 https://api.luchtmeetnet.nl/open_api.
38 When consulted, RIVM indicated that the national network for measuring air quality maintains an ample
registry system for registering additional (meta)data that contribute to verifying, validating and publishing the
measurement values, but this information is not accessible via luchtmeetnet.nl
For this research RIVM was additionally consulted on whether more information is available
about the post-processing and about the metadata that is being registered. The following
description was given by RIVM:
“considering the network for measuring air quality which is responsibility of RIVM the
following metadata is being acquired, stored and related to the individual measurement
values:
• measurement station: there are regular visits to revise the state of a measurement
station and its direct environment. If there are relevant changes that might influence
the measurements and that therefore have to be taken into account with the post
processing, this is registered, reported and stored in the information systems to be
used in the validation process of measurements (changes that might affect
measurements are for instance changes in vegetation, building activities near stations,
road closures).
• Sensors within the measurement stations: the national network for measuring air
quality (Landelijk Meetnet Luchtkwaliteit, LML) measures air quality conform or
equivalent with the methodologies which are established within European directives39;
all sensors have unique identification numbers; maintenance, repair, control, testing
and calibration40 of sensors is being registered within the information systems to be
taken into account for post-processing.
• Measured values: every measured value is related to a sensor and to the
corresponding station. Every measured value has a starting time and an end time.
These raw values (every minute, every 15 minutes) are aggregated within the station
to an average value per hour. However, the original values are being stored (since
2012). This allows for post processing based on the original values. Besides the actual
value of an air component, additional metadata is being measured and stored related
to for instance the quality of the measurement itself, the temperature within the station
and other relevant parameters. This information is used within the validation process
of the individual measurements and can lead to declining or marking the measurements.
This is used as a filter as well, to not publish values on luchtmeetnet.nl that are definitely
wrong because of, for instance, an error of the sensor. The measurement of some
components requires calibration afterwards (for instance fine dust). This calibration
can lead to recalculation of values. Original data stays unchanged.
• Validation of measured values: validation is a complex automated and human revised
process in which already marked values (because of erroneous sensors for instance)
and newly marked values (because of extreme values, for instance) are used to
compare with surrounding stations, additional information about the measurement
39 2008/50/EC and 2004/107/EC
40 Sensors are calibrated regularly. This is mostly an automatic process. Information about the calibration is
stored within the information systems. If a calibration leads to doubts about the measured values, these values will be marked and/or declined. Calibration failures can often be diagnosed and repaired.
station, the environment of the station, information about the sensors and the expert
judgement; values can then be approved or declined; the reason of approval or
decline is being stored and related to the corresponding value(s). The validated status is
being registered and communicated on the platform
• Changing status from not definitive to definitive: validated data gets the status
definitive after the 1st of January of the following year (until then data has the status
preliminary); the reason is the application of calibration a posteriori (a comparison of
the used measurement methodologies with a reference methodology); this can only be
done after all measurements of one year have been received and validated.
Corrections are shared with the public after the status has been changed to definitive.
• Corrections of definitive data: because of new insights it is sometimes necessary to
recalculate definitive data. Because this concerns changes to official data (which
has been communicated to the EU as part of the compliance with air quality
regulations) this is referred to as an infringement procedure. It is registered when,
why and how the corrections have been executed. Calculated data will be
recalculated, normally original values are not changed (technically these data are
visible by means of views with the new parameters), however in extreme cases this
might be needed. If this is the case, the original value is archived and a second version
of the value is created. Corrections of definitive data are communicated as well via
the platform”
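Two mechanisms from this RIVM description, the aggregation of raw values to hourly averages while the originals are kept, and the correction of a value by creating a second version rather than overwriting the original, can be sketched as follows. The data layout is illustrative and not RIVM's actual storage model.

```python
# Sketch of two mechanisms from the RIVM description above: raw values
# are aggregated to an hourly average while the originals are kept, and
# a correction of a definitive value archives the original and creates a
# second version rather than overwriting it. Layout is illustrative.

def hourly_average(raw_values):
    """Aggregate raw (e.g. per-minute) values; the originals are kept."""
    return sum(raw_values) / len(raw_values)

def correct_value(record, new_value, reason):
    """Append a new version of a value; earlier versions stay archived."""
    record["versions"].append({"value": new_value, "reason": reason})
    return record

record = {"raw": [10.0, 12.0, 14.0], "versions": []}
record["versions"].append({"value": hourly_average(record["raw"]),
                           "reason": "initial aggregation"})
# A later recalibration produces a second version; the raw values and
# the first version remain untouched.
correct_value(record, 11.8, "recalibration a posteriori")
```

Keeping the raw values since 2012 is what makes such recalculation possible at all; the version list makes the correction traceable, which is the recordkeeping property at stake here.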
This detailed additional information by RIVM indicates that a large amount of metadata is
being acquired and registered to support the quality of the measurements and therefore the
trustworthiness of the information. It is also clear that this complex mechanism of post-
processing of data that is or might be corrected implies that reports concerning the same
information generated today may be different from reports that were generated a year ago.
The historical data comprises the period from 2009 until 2020. Data older than that does not
seem to be available as open data. Querying the provided API to select data from the
publishing platform does not indicate the availability of a broader set of information in time.
The next figures show a visualization of the actual measurements and reports with validated
data.
Figure 5.4 Visualization of actual measurements of fine dust (PM10) at June 6th 2020, 14:55
hours
Figure 5.5 Report of historical measurements of fine dust (PM10) on August 6th 2018
Assessment of the presence of recordkeeping principles
Based on the publicly available documentation as mentioned in the description of the dataset
and an analysis of the available open data itself, the presence and/or incorporation of
recordkeeping principles has been assessed using the assessment form elaborated for this
research.
Table 5.2 Assessment form for air quality measurements by Dutch institutions
Minimal available descriptive metadata: can the
juridical and administrative context (i.e. actors) and
the procedural context (i.e. processes, use of
standards) be derived?
Is the format of the data described? The format of the data can be selected/defined when
a downloadable report is requested. There is also a
web mapping service available41, but this returns
information formatted as images. The format of the
data within the publishing platform can be partly
derived from the API definition42 (only for the data
that is available by means of the API).
Provenance General: can the identity of the creator
be identified and verified?
The identity of the creator can be identified based on
the unique identification of each sensor. Additionally
the organisation that is responsible for that station is
available. Nevertheless it cannot be validated that the
information presented as being generated by a sensor
has indeed been generated by that sensor. That
evidence is stored in the source information systems
of the organisations that provide the data to the
publishing platform.
Provenance Specific: can the process / methods be
identified that have been used to create or derive the
data?
The process / methods cannot be identified based on
the publicly available information on luchtmeetnet.nl.
However, internally a description exists of the
characteristics of the sensors (which are available),
the calibration of the sensors (which does not seem
to be available as open data) and the mechanisms for
pre- and post-processing that lead to the published
values (these are not published, but have been shared
for this research by RIVM in a more general way).
Are the data primary (no selection, no processing
except dissociation of personal data, and no copying,
i.e. the source itself is accessed)?
Whether the data can be considered primary or not
depends on your point of view: the raw data are not
published but averages per hour in the current year
(not validated and validated) are, and from the 1st of
January also definitive data of all previous years (in
the form of reports). The validated data can be
considered as primary because of the authority of the
organisations which are measuring and publishing
them.
41 https://geodata.rivm.nl/geoserver/wms?
42 https://api-docs.luchtmeetnet.nl/?version=latest#release-notes
Topic: what does the data represent; do the data have
a defined taxonomy or ontology?
Yes, the data do have a defined taxonomy. At a
certain point in space and time, several physical and
chemical measurements take place. Location, time,
and raw values are registered and post-processed to
reflect values that are the basis to indicate air quality.
Purpose: for what purpose has it been created? The purpose is to have a scientific basis consisting of
measurements of the presence of air components
that are an indication of the air quality. From this
basis the three types of use described earlier are derived.
Location: to what location is the data related? All individual measurements are related to a specific
location of the measuring station.
Time: To what time frame is the data related? What’s
the actuality of the data and metadata; does historical
data remain online with appropriate version-tracking
and archiving; is continuous availability guaranteed by
a legal or policy commitment?
All individual measurements are related to one
specific moment in time (with a start time and end
time of each individual measurement); historical
published data remain online; consulting RIVM reveals
that in the source information system version tracking
and archiving is done related to pre- and post-
processing. Continuous availability is not guaranteed.
Quality and uncertainty: is a description of
correctness, completeness, and precision/granularity
present?
There is no explicit description of correctness or
completeness of the information; measured values
can be changed based on a validation process (pre-
and post-processing procedures) which is described in
a general way but exact mechanisms and algorithms
are not provided. The precision/granularity can be
derived from the different domains that have been
defined for every attribute that is registered which is
also related to the European directive on air quality
measurements. Transparency on the validation
mechanisms can support an assessment on quality
and uncertainty.
What other non-domain specific metadata are
available that support trust components?
There are no other non-domain specific metadata
identified that support trust components; domain
specific metadata are described in the data model.
The procedures for measuring and post processing are
not provided in detail.
Are the metadata comprehensive; do the metadata
comply with metadata standards (international
standards, domain standards)
There is a general description of the dataset at the
publishing platform. The metadata within the data
itself are domain specific.
The vast majority of the metadata is automatically
generated and can therefore be considered metadata
by design.
Repository: accessibility, integrity and preservation
Are the data findable and accessible: How can the
data be found (portal, search engine), how can the
data be accessed (API, linked data/URI, (download)
portal), which commonly owned or open formats
(including machine readable) are used
The data can be found using the general platform 43 or
a specific API44. Parts of the data can be found as well
using the facilities provided by the individual
organisations that are responsible for the
measurements45. The data can be accessed using the
website form to select and download reports (in CSV
format).
It is explicitly stated that availability of the data is not
guaranteed: datasets can be changed, datasets can be
removed, or access to the platform can be removed
without notification. The API can be used based on a
Fair Use Policy, but continuous use is not guaranteed:
it can be blocked or removed.
Is the data accessible from its source information
system or a recognized repository for publishing?
Based on the luchtmeetnet platform it is not clear
whether the data is accessed from its source
information system or from the publication platform.
The publication platform is an environment provided
by recognized governmental organizations in the air
quality domain. Consultation of RIVM has revealed that
these environments are different. In the source
information systems raw values are archived as are all
calculated measurements that are recalculated based
on new parameters.
It is clear that the accessible data is a subset of the
data in the source information system: the data on
luchtmeetnet.nl have passed through extensive
quality control (except the current values, which are
published “as-is” before validation unless they are
flagged as erroneous directly after measurement).
The metadata linked to this validated data are
numerous, and many of them require domain
knowledge to interpret.
43 https://www.luchtmeetnet.nl
44 https://api.luchtmeetnet.nl/open_api.
45 For instance Amsterdam: https://api.data.amsterdam.nl/dcatd/datasets/Jlag-G3UBN4sHA
Are technologies being used that guarantee the
integrity of the source information system or
repository: securing that the information at the right
moment has been “frozen” and therefore cannot be
changed anymore?
The data in the publishing repository cannot be
changed from the outside (by the users of the
platform). From the inside (by the organisations that
publish the data) it can: raw data is processed and
new validated data is derived from it. Post-processing
is not done once but several times, based on newly
measured data (following the 4 steps of machine
correction, human verification and correction,
monthly averaging and correction, and yearly
averaging and correction). This post-processing might
lead to new measured values. Old values are deleted
in the publication platform and new values are
published. Consultation of RIVM has revealed that
the previous versions of information objects are
being archived in the source information system, but
these versions are not available as open data.
Does the information system or repository
incorporate technology of preservation to guarantee
readability and usability in the future? Are the records
(metadata and data) available in a sustainable format
(5 star model of Tim Berners-Lee
(http://5stardata.info/en/): open licence /
machine-readable structured data / non-proprietary
format / open standards from W3C / linked open
data)?
It is not clear whether the repository has preservation technology. It is not clear for how long data is archived, nor whether the platform will stay available in the future.
Consultation with RIVM shows that all stakeholders strive to be consistent over time concerning the data model, measuring methodologies, post-processing and publishing, which should help guarantee usability in the future. Conversions of the publishing platform are prepared and executed together with the stakeholders to guarantee readability in the future.
The data available on www.luchtmeetnet.nl is provided under the licence CC BY-ND 4.0.
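The 5-star model referenced in the form can be read as a cumulative checklist: each star presupposes the previous one. The logic can be sketched as a small helper function; the function name and boolean flags are my own, and whether CC BY-ND 4.0 fully qualifies as an “open licence” is itself debatable.

```python
def five_star_rating(open_licence, machine_readable, non_proprietary,
                     uses_w3c_open_standards, linked_to_other_data):
    """Score a dataset against Tim Berners-Lee's 5-star open data model.

    The stars are cumulative: each criterion presupposes the previous
    one, so scoring stops at the first criterion that is not met.
    """
    criteria = [open_licence, machine_readable, non_proprietary,
                uses_w3c_open_standards, linked_to_other_data]
    stars = 0
    for met in criteria:
        if not met:
            break
        stars += 1
    return stars

# Illustrative flags for the luchtmeetnet.nl data: an open(-ish) licence,
# machine-readable JSON via the API, a non-proprietary format, but no
# W3C linked-data publication.
print(five_star_rating(True, True, True, False, False))  # → 3
```

A dataset that is machine-readable but lacks an open licence would still score zero stars, which is exactly the cumulative behaviour the model intends.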
A reflection on the use of the data and the relevance of recordkeeping principles for trust.
The official distribution mechanism is the platform that has been used for this research as well: luchtmeetnet.nl and the related API. No statistical data has been found on the (re)use of this information. The measurement information is used by government for verifying compliance with international air quality norms and for monitoring trends in air quality. It is therefore important that the measurements are of high quality. There are mechanisms in place to support the quality of the information presented. However, details are not given on the publication platform, which leaves this processing as a black box for the user of the open data.
After consulting RIVM it has become clear that the data that has been made definitive (data of 2019 and older) can be used for legal purposes (for instance in legal proceedings about emission levels of air components that do not comply with legal limits). The measurements of the current year (2020) cannot be used as such, because they are still preliminary, although validated, and might still change.
It is not clear who is (re)using the measurement data. When consulted, RIVM indicated that they do not know who is (re)using the information: the API and download functionality are freely available, without registration. It is therefore also not clear for what purpose data is downloaded. The only indication they have are page counts of web URLs.
The final reflection concerns the principal question whether the open data has a sufficient level of trustworthiness to be able to serve in legal contexts. This is the case for these air quality measurements, because of the legal role of RIVM.
Detailed additional information provided by RIVM indicates that a large amount of metadata is acquired and registered to support the quality of the measurements and therefore the trustworthiness of the information. However, this metadata is not provided in detail and therefore verification is not possible. This undermines both social-based and content-based trust. Therefore private initiatives have started measuring air quality, sound levels and other relevant environmental values that are regulated by law, to verify these official measurements.
It is also clear that the complex mechanism of post-processing data that is or might be corrected implies that reports generated today about the same information may differ from the reports that were generated a year ago. Only transparency on all relevant data and metadata can support the social-based trust in RIVM and the content-based trust in the air quality measurements.
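Given that published values can be silently replaced during post-processing, a data consumer can at least detect such changes by fingerprinting downloaded snapshots. A hedged sketch of this idea follows; the record fields and values are illustrative, not actual RIVM data.

```python
import hashlib
import json

def fingerprint(records):
    """Return a SHA-256 fingerprint over a list of measurement records.

    The records are serialised canonically (sorted keys, fixed
    separators) so identical content always yields an identical hash.
    """
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Snapshots of the same record downloaded at two moments; after a later
# post-processing round the published value has changed.
snapshot_jan = [{"station": "NL10636", "timestamp": "2020-01-01T00:00:00Z",
                 "component": "PM2.5", "value": 12.4}]
snapshot_feb = [{"station": "NL10636", "timestamp": "2020-01-01T00:00:00Z",
                 "component": "PM2.5", "value": 11.9}]

# Differing fingerprints reveal that the published data was altered.
print(fingerprint(snapshot_jan) != fingerprint(snapshot_feb))  # → True
```

Such a fingerprint does not restore the deleted old values, but it does make the otherwise invisible mutation of published data detectable by the user.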
In this case it is clear that the lack of recordkeeping standards for the published part of the data, which would link the data to the context of its creation for proper interpretation, causes knowledge gaps and therefore affects the trustworthiness.
5.4 Crowd sourced air quality data
How the data has been accessed for this research and based on what information
On the governmental platform data.overheid.nl, data about air quality are published that are not measured by recognized institutions or governmental bodies but that are crowd sourced. For this research the raw air quality data captured by bike riders with air quality sensors (sniffer bike) have been assessed46.
On several public sites information is given about this initiative:
• the website of the initiative itself47
• the website of Civity, the organisation that manages the data that is generated48
• the website of RIVM, which recently started to post-process the information49
The data is being stored, managed and published on the City Innovation Platform, hosted by
the Dutch organisation Civity. The data can be accessed by using a specific application or
46 https://data.overheid.nl/dataset/raw-snifferbike-snuffelfiets-data
47 https://www.snuffelfiets.nl
48 https://civity.nl/products-solutions/snuffelfiets/
49 https://www.rivm.nl/nieuws/500-fietsen-gaan-luchtkwaliteit-meten
dashboard50 or by downloading the data from data.overheid.nl. A definition of the data (data
dictionary) is available on the platform.51
Besides the publication of this raw data, the Dutch institute of public health RIVM recently
started to publish a post-processed version of this information. This information can be
accessed by downloading it from data.overheid.nl52.
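Both the raw and the corrected datasets are registered in CKAN-based catalogues; CKAN's package_show API returns a JSON envelope that lists the downloadable resources of a dataset. A sketch of parsing such a response offline follows; the payload below is trimmed and illustrative, and the resource URL is a placeholder, not an actual response.

```python
import json

# A trimmed, illustrative example of the JSON envelope returned by the
# CKAN package_show action (the structure follows the CKAN API; the
# resource URL is a placeholder).
payload = json.loads("""
{
  "success": true,
  "result": {
    "name": "rivm-corrected-snifferbike-snuffelfiets-data",
    "resources": [
      {"format": "CSV", "url": "https://example.org/week-24.csv"}
    ]
  }
}
""")

def list_downloads(package):
    """Extract (format, url) pairs for every resource in a CKAN package."""
    if not package.get("success"):
        raise ValueError("CKAN call did not succeed")
    return [(r["format"], r["url"]) for r in package["result"]["resources"]]

print(list_downloads(payload))  # → [('CSV', 'https://example.org/week-24.csv')]
```

Because the envelope is standardized across CKAN installations, the same parsing logic works for data.overheid.nl and for the Civity dataplatform.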
Description of the objective and content of the data
The sniffer bike initiative is an experiment to obtain more knowledge, data and information about cycling, with the objective of better incorporating cycling in mobility policies. Until recently much information was available about car mobility but far less about mobility by bicycle. The sniffer bike experiment starts from the idea that cyclists can contribute to solving several societal problems related to traffic congestion, air quality, public health, participation and sense of happiness. New technologies cause changes in the mobility ecosystem related to cycling, such as the rise of electric bicycles and the use of media devices connected to bicycles. The province of Utrecht has the ambition to be the frontrunner in acquiring new knowledge about these changes.
In this experiment the focus is on using bicycles and cyclists to execute measurements while cycling. It is designed to acquire mobile data at large scale from citizens who volunteer to participate in the experiment. It is a public-private partnership in which the IoT company SODAQ (production of the mobile sensors) and Civity (data management) also participate. Recently RIVM started to support the experiment by validating the acquired data.
The project started about a year ago and is executed in phases, to be able to learn about and anticipate the uncertainties inherent in this kind of innovative experiment (which depends on new technologies, human behaviour, weather conditions, etc.). The experiment started with a small number of volunteers measuring a limited number of air components (those that were topical at that moment, especially the level of fine dust in relation to air quality for cyclists and public health). Based on the measurements, “green” (healthier) cycling routes can be identified for cyclists to choose from. Governments will have the basic information needed to take measures to improve routes that are considered unhealthy.
After a period of testing in 2018, the activities started in 2019 with the participation of 500 cyclists. With the sensor made by SODAQ, real-time data is transmitted via narrowband (later there will be a switch to LTE-M) directly to the data platform of Civity (see figure 5.6).
50 https://dashboard.dataplatform.nl/sodaq/v2/groene_fietsroutes.html
51 https://ckan.dataplatform.nl/dataset/8b04f4f3-666c-4448-91fc-234d5a75e6c4/resource/1d914417-3a80-
4120-9e2e-e4c918ee67a5/download/datawoordenboek_snuffelfiets.csv
52 https://data.overheid.nl/dataset/rivm-corrected-snifferbike-snuffelfiets-data
Figure 5.6 Data acquisition, processing and publication of sniffer bike data 53
Technically, Sniffer Bike54 is a mobile solution mounted on the bicycle handlebar. The published data comprises the raw data measured by “sniffer bikers”. Each Monday, a file with the data collected in the previous week is published in this dataset. This data has not been validated by the RIVM. To protect the privacy of cyclists, the reference to the IMEI number of the sensors (the unique identification of the sensor) has been removed. The figure below shows the visualisation of the open data within the application.
Figure 5.7 Visualisation of sniffer bike open data with specific web application
During this research project, post-processed data became available separately55. The post-processing is done by RIVM; the dataset contains the RIVM-corrected measurement data collected by the sniffer bikes. Every week the data collected in the previous week is compiled and published there. However, verifying the content of these data (verification date: 14 June 2020) resulted in empty datasets.
53 https://civity.nl/products-solutions/snuffelfiets
54 https://civity.nl/en/sniffer-bike/
55 https://ckan.dataplatform.nl/dataset/rivm-corrected-snifferbike-snuffelfiets-data
The province of Utrecht finances the sensors used by the volunteers and Civity provides the data management. Besides the measurements themselves, the experiment has the additional objective of defining the minimum number of measuring cyclists needed to obtain viable results, and what conclusions can be drawn from those results. The experiment is ongoing in 2020 and will end in December 2020. On the 2nd of June 2020 a webinar was organised to present the results obtained until June 202056. In essence, three conclusions were drawn:
• measuring fine dust (using the Sensirion SPS30 sensor) works well for PM2.5 but does not work for PM10
• the sensors need frequent calibration
• whether there are real differences in air quality between routes cannot be concluded yet (one of the reasons is that on many days there are too few measurements to make significant statements for the whole of the province of Utrecht)
Some additional information was solicited from Civity about how metadata is provided and how data is managed within the CIP platform. Civity has indicated that the metadata related to the content of the data is provided by the owner (in this case the volunteers who measure the data, by means of metadata automatically generated by the sensor); the metadata related to the publication of the data (date of publication, date of updating, number of data sources, etc.) are generated and/or updated automatically by the publication platform. Before the measured data is published, simple post-processing takes place that consists of converting the raw sensor data to corresponding interpretable values of the measured air components. Additionally, data that might infringe privacy regulations is removed. The CIP platform keeps historical data; however, the information has not been locked or “frozen” and can be changed from the outside and/or the inside. The repository contains the raw data to prevent any loss of information if conversions of the repository are needed. Changes in technologies and/or methodologies therefore should not have any effect on the readability and usability of the data.
Assessment of the presence of recordkeeping principles
Based on the available documentation as mentioned in the description of the dataset and an
analysis of the available open data itself, the presence and/or application of recordkeeping
principles has been assessed using the framework elaborated for this research.
56 https://snuffelfiets.nl/wp-content/uploads/2020/06/Webinar-2-juni-2020.pdf
Table 5.3 Assessment form for crowd sourced air quality data
Minimal available descriptive metadata: can the juridical and administrative context (i.e. actors) and the procedural context (i.e. processes, use of standards) be derived?
Is the format of the data described? The format of the data is described and explained: for every measurement the following is emitted and stored: IMEI (identifier of the sensor), unique identifier of the measurement, trip sequence, values for air components (PM10, PM2.5, PM1.0, volatile organic compounds), atmospheric conditions (temperature, humidity, barometric pressure), GPS location (latitude and longitude), date and time, accelerometer (indicates irregularities in the road)
Provenance General: can the identity of the creator
be identified and verified?
The identity of the creator cannot be identified because the unique identification of the sensor that generates the information is removed for compliance with the GDPR (although the publication of the geographic location in combination with time-stamps might make unique identification of persons possible57)
Provenance Specific: can the processes/methods be identified that have been used to create or derive the data?
The processes/methods can partially be identified: a description is given of how data is captured, transmitted and stored within the environment from which it is also published58; Civity has provided some general information about the post-processing of data, but not all details were given.
Are the data primary (no selection, no processing (except dissociation of personal data) and no copying (accessing the source))?
The data can be considered primary, although some minimal processing takes place: original data is stored in the sensor and copied/transmitted to the CIP platform. It is stated that the data are raw data, but the identification of the sensor and the information on the starting part of the route are removed for privacy reasons. Raw values are converted to interpretable values of air components.
Topic: what does the data represent; do the data have
a defined taxonomy or ontology?
The taxonomy is described in the available explanation of how to interpret the sniffer bike data59;
57 The location and time-stamp of every measurement can be combined with for instance available mobile
phone data (which records location and time-stamps as well and might have the identity of the user
connected to it)
58 https://civity.nl/products-solutions/snuffelfiets/
59 https://ckan.dataplatform.nl/dataset/8b04f4f3-666c-4448-91fc-234d5a75e6c4/resource/1d914417-3a80-
4120-9e2e-e4c918ee67a5/download/datawoordenboek_snuffelfiets.csv
Purpose: for what purpose has it been created? The purpose is described in the general description of
the dataset that can be found in the data.overheid.nl
environment60. Its objective is to provide insight into
air quality, cycling routes and urban heat islands.
Location: to what location is the data related? All individual measurements are related to a specific location at a specific moment in time, by registering the geo-position based on GPS data.
Time: To what time frame is the data related? What’s
the actuality of the data and metadata; does historical
data remain online with appropriate version-tracking
and archiving; is continuous availability guaranteed by
a legal or policy commitment?
All individual measurements are related to one specific moment in time.
Quality and uncertainty: is a description of
correctness, completeness, and precision/granularity
present?
There is no description of correctness, completeness or precision/granularity within the metadata. This depends on the quality of the sensors and eventual pre- and post-processing procedures. The sensors that have been used can be identified, but calibration reports do not seem to be available. Post-processing procedures are not described in sufficient detail to derive quality and uncertainty. However, the validation mechanisms of RIVM can support an assessment of quality and uncertainty; these were communicated in the webinar of 2 June 2020.
What other non-domain specific metadata are
available that supports trust components?
There are no other specific metadata identified that
support trust components.
Are the metadata comprehensive; do the metadata
comply with metadata standards (international
standards, domain standards)
The metadata of the CIP platform are based on the DCAT standard; therefore this information is automatically available in the national registry of open data (data.overheid.nl) and the EU portal.
The metadata within the data itself are domain specific for this registry. They are automatically generated and can therefore be considered metadata by design.
Repository: accessibility, integrity and preservation
Are the data findable and accessible: how can the data be found (portal, search engine), how can the data be accessed (API, linked data/URI, (download) portal), and which commonly owned or open formats (including machine-readable) are used?
The data can be found using data.overheid.nl and is accessible from the CIP platform. The data can be accessed using a specific app, services and APIs, and can be downloaded as well.
60 https://data.overheid.nl/dataset/raw-snifferbike-snuffelfiets-data
Is the data accessible from its source information system or a recognized repository for publishing?
The data is directly accessed from its source
information system, the City Innovation Platform,
assuming that the individual sensor is not considered
a source information system.
Are technologies being used that guarantee the integrity of the source information system or repository: ensuring that the information has been “frozen” at the right moment and therefore can no longer be changed?
The users of the City Innovation Platform are supposed to have different levels of authorisation that permit them to add, change and publish information. Civity indicates that information can be changed after it has been published as open data.
Does the information system or repository incorporate preservation technology to guarantee readability and usability in the future? Are the records (metadata and data) available in a sustainable format according to the 5-star model of Tim Berners-Lee (http://5stardata.info/en/): open licence regardless of format / machine-readable structured data / non-proprietary format / open standards from W3C / linked open data?
It is not clear whether the repository has preservation technology. It is not clear for how long data is archived, nor whether the platform will stay available in the future.
The data is provided under the licence CC-0 (1.0)
The data can be downloaded in CSV format.
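Reading such a CSV download can be sketched as follows. The column names are assumptions based on the field list in the assessment form above; the authoritative names are defined in the published data dictionary, and the single data row is invented for illustration.

```python
import csv
import io

# Column names below are assumptions based on the field list in the
# assessment form; the authoritative names are in the published data
# dictionary. The single data row is invented for illustration.
sample = io.StringIO(
    "sensor_id,trip_sequence,pm10,pm2_5,pm1_0,voc,temperature,humidity,"
    "pressure,latitude,longitude,timestamp\n"
    "ANON-001,17,21.5,12.3,8.1,40,14.2,81,1013.2,52.0907,5.1214,"
    "2020-06-08T07:31:00Z\n"
)

reader = csv.DictReader(sample)
for row in reader:
    # Convert the air-quality values to floats for further processing.
    pm2_5 = float(row["pm2_5"])
    print(row["sensor_id"], pm2_5)  # prints "ANON-001 12.3"
```

Since every value in a CSV arrives as a string, the explicit conversion step is where a reuser first confronts the data dictionary: without it, the meaning and unit of each column cannot be established.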
A reflection on the use of the data and the relevance of recordkeeping principles for trust.
The first reflection concerns the principal question whether the open data has a sufficient level of trustworthiness to be able to serve in legal contexts. This seems not to be the case for the sniffer bike data, even though the data are post-processed by RIVM. Although the questions related to this were not answered by CIP, it can be derived from the context of the pilot project that, given the quality of the sensors, they can be considered helpful for additional measuring at a higher density than the official measurement stations. At this moment this mainly helps to define or detail policies, to provide additional input for air quality models, and to support scientific research on the technologies and methodologies used.
Statistical information about the (re)use of the data could not be found. Based on the webinar held on the 2nd of June it can be concluded that the data is at least being (re)used by the stakeholders of the experimental project itself. Civity provided some additional information indicating that the number of consultations of the CKAN webpage is counted, but that is not related to consulting and/or downloading the data itself. The same holds for the use of the API: it does not monitor the (re)use of data.
The cyclists (re)use the information to assess its quality and raise questions about its usability; the transparency about the process from creation to publication helps to assess the usability of the information (sensor characteristics and processing determine content, quality and trustworthiness), but this has to be combined with specific domain knowledge about air quality (measurements). They might also use the information to decide whether they want to cycle at a specific moment, considering the levels of fine dust, and which routes they choose.
The participating provinces reuse the information for compliance with their programs to improve air quality (measurement with sensors by citizens is part of the program “Uitvoeringsprogramma Schone Lucht” of the province of Utrecht61). They want to use the information to develop green(er) bicycle routes, but also in a more general way to develop policies for the design of the living environment and to increase the involvement of citizens with these policies. Transparency and trustworthiness are key for this involvement.
The raw sniffer bike data is (re)used by the RIVM to produce new, validated information. It is not clear exactly what corrections are made, but it is assumed that correction is based on comparison with other reference measurements, which should lead to higher trust in the validated data. After validation and corresponding correction, these values are used by RIVM as an addition to the official measurements done within their own network of six stations in Utrecht, to support scientific research determining the variation of air quality depending on geography (city, village, rural area), to determine and fine-tune calculation models, and to provide additional information for the provincial report on air quality62. The validation by RIVM is an important element in improving the trustworthiness of the information in terms of content-based trust.
5.5 Analysis of the research data
In the previous three sections several data sources have been assessed on their incorporation of recordkeeping principles. The three data sources were chosen based on their difference in legal context:
1. a data source that is created by governmental bodies and is maintained and published based on legislation on the registry itself
2. a data source that is created, maintained and published by a recognized governmental organisation, where the creation of the dataset is not based on legislation, although part of the content does have to comply with European directives
3. a data source that is created by private persons and published by a private party on a platform that is supported and used by governmental institutions63
In general, the assessment of the three datasets shows that descriptive metadata vary and lack elements that support specific trust components. For instance, there is a lack of metadata that indicates the uncertainty of data (which is related to content-based trust). Domain (meta)data models vary and are not always available. Data collection and data processing standards vary
61 This program is a co-creation between the province of Utrecht and the stakeholders (citizens, NMU, the Dutch lung fund, EBU, research institutes and municipalities).
62 https://www.provincie-utrecht.nl/sites/default/files/2020-
03/rapportage_luchtkwaliteit_provincie_utrecht_2018.pdf
63 During the research project an alternative, post-processed version of this information also became available. Post-processing is done by the Dutch institute for public health.
and are not always available. With respect to the repositories it can be concluded that two of the three repositories seem to lack guarantees to secure integrity and accessibility over time.
Below is a classification of the assessment results for these three data sources regarding the level of incorporation of the recordkeeping principles as defined in the assessment form. Every recordkeeping principle is classified based on the following criteria:
• Green: the dataset and its accompanying information comply fully with the
recordkeeping principle
• Orange: the dataset and its accompanying information comply partly with the
recordkeeping principle
• Red: the dataset and its accompanying information do not comply with the
recordkeeping principle
• Grey: the level of compliance could not be confirmed because of lack of information
The classification is based primarily on the publicly available information and secondarily on
the information provided by Kadaster, RIVM and Civity.
Table 5.4 Classification of presence of recordkeeping principles in 3 assessed open data sets
Recordkeeping Principle | Legislation: BAG | Recognized bodies but no legislation: air quality measurements | Private parties: air quality measurements
Format described
Known creator
Known process
Primary data
Known taxonomy
Known purpose
Known location
Known time
Known uncertainty
Additional trust metadata
Compliance metadata standards
Findable and accessible
Source information system or recognized publishing platform
Integrity guarantee
Preservation guarantee
The overview of the incorporation of recordkeeping principles shows that the BAG registry applies the majority of the recordkeeping principles. The formal air quality measurements of RIVM and its partners and the informal sensor data of private parties in general apply fewer of the principles, specifically regarding the trust components related to available metadata and the uncertainty about the incorporation of the repository-related principles. In general this seems to indicate that legislation might help to define and secure (by design) a level of trustworthiness based on the application of recordkeeping principles.
Considering the available general metadata:
• All datasets are described generally with relevant metadata, mostly incorporated in explanations on the platforms where the data can be found and in related additional documentation. Two of the three datasets are described in a standardized way in data.overheid.nl. Considering the metadata of individual information objects, standardized metadata are lacking. Depending on the data models, some attributes are incorporated that can be considered metadata (for instance location and time), but they are registry- and/or domain-dependent. It would be of great interest to investigate whether a standardized minimum set of metadata for each individual information object can be implemented to support the trustworthiness of the reuse of individual information objects. This is elaborated in the next chapter.
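The DCAT standard mentioned in the assessments gives one concrete shape to such a standardized set of descriptive metadata. A minimal, illustrative JSON-LD sketch follows; the title, publisher and URLs are examples, not copied from any actual registry.

```python
import json

# A minimal dataset description in JSON-LD using the W3C DCAT vocabulary.
# The title, publisher and URLs are illustrative, not copied from the
# actual CIP registry.
dataset = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dct": "http://purl.org/dc/terms/",
    },
    "@type": "dcat:Dataset",
    "dct:title": "Raw snifferbike data",
    "dct:publisher": "Civity",
    "dct:license": "https://creativecommons.org/publicdomain/zero/1.0/",
    "dcat:distribution": {
        "@type": "dcat:Distribution",
        "dcat:mediaType": "text/csv",
        "dcat:downloadURL": "https://example.org/snifferbike.csv",
    },
}

# Serialising with a shared vocabulary is what lets harvesters such as
# data.overheid.nl and the EU portal pick up the description automatically.
print(json.dumps(dataset, indent=2))
```

Because every property comes from a shared vocabulary, harvesting portals can interpret the description without any registry-specific mapping, which is exactly why the CIP metadata flows automatically into the national and EU catalogues.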
Considering some specific metadata:
• Metadata that have the objective of giving detailed information about provenance should clearly indicate who the creator of the information is; as this research shows, in the case of the private air quality measurements, publication of this information as open data can conflict with European privacy regulation. However, the facts that the procedures are transparent, that data is emitted directly from the source to the processing and publishing platform, that metadata is generated automatically, and that this information can be saved (for verification) within the sensor itself, support the (derived) trustworthiness of the information.
• Primary data: the provenance metadata are directly related to the recordkeeping principle of primary data and to access to the source information system or a recognized publishing platform. The principle of primary data minimizes the risk that changes are made to the provenance metadata or to the data itself between creation and publication, which would affect the trustworthiness of the information. Primary data can be supported by direct access to the source information system.
• Metadata that provide information about the processes that generated the information have proven to be very important for creating a combination of social-based trust (based on the transparency of an organisation in publishing all process details) and content-based trust (verification of (the quality of) the processing and the data based on the available metadata).
• Known uncertainty and additional trust metadata: in all three datasets detailed information is lacking for assessing the quality of the information. Some information is publicly available and some was given in more detail, but a complete detailed description of processes, parameters, algorithms and/or human decisions has not been found or provided. This definitely affects the trustworthiness of the information. On the other hand, in all three cases this is compensated by several protocols for giving feedback on the quality of the content and/or the possibility of public/private co-creation of data.
For all of these metadata reflections, one disclaimer is very important: none of the metadata is inseparably connected to the individual information objects (with the exception of the metadata that is incorporated at the level of each individual information object because of its domain data model). This means that data can be accessed and reused without taking into account its original context and significance. This allows for inappropriate reuse, involuntary or voluntary, which affects the trustworthiness of the reused information. When metadata are inseparable from the information object, misinterpretation and manipulation can be countered or perhaps even prevented.
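One possible way to make metadata inseparable from an information object is to publish both inside an envelope whose digest covers them jointly; a production system would rather use digital signatures, but a hash-based sketch illustrates the idea (all field names and values are illustrative):

```python
import hashlib
import json

def seal(data, metadata):
    """Bind metadata inseparably to a data payload.

    The envelope carries data, metadata, and a SHA-256 digest over the
    canonical serialisation of both; changing either one invalidates
    the digest, so metadata cannot be silently stripped or altered.
    """
    body = json.dumps({"data": data, "metadata": metadata},
                      sort_keys=True, separators=(",", ":"))
    return {"data": data, "metadata": metadata,
            "digest": hashlib.sha256(body.encode("utf-8")).hexdigest()}

def verify(envelope):
    """Check that data and metadata still match the envelope's digest."""
    body = json.dumps({"data": envelope["data"],
                       "metadata": envelope["metadata"]},
                      sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(body.encode("utf-8")).hexdigest() == envelope["digest"]

env = seal({"pm2_5": 12.3}, {"creator": "sensor ANON-001",
                             "captured": "2020-06-08T07:31:00Z"})
print(verify(env))          # → True: data and metadata are intact
env["metadata"]["creator"] = "someone else"
print(verify(env))          # → False: tampering is detected
```

A plain hash only detects tampering; to also prove who sealed the envelope, the digest would have to be signed with the publisher's private key.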
Considering the recordkeeping principles of the repository:
• Integrity guarantee: the integrity of the data is difficult to determine. When using apps and downloads it is impossible to guarantee that the source data is served without changes. APIs and services can help to secure this, provided that a complete data model of all available information within the publishing platform is available. Even then it is still possible that the information is not the same as in the source database, nor can it be guaranteed that the information cannot be changed. The trustworthiness can be improved by diminishing the “distance” between the source information system and the repositories for the (re)use of the open data, the ideal situation being direct access to the information at the source without copying (parts of) the information. This can be facilitated by making use of the principles of linked data and by incorporating additional functionality in source databases. This is elaborated in the next chapter.
• Preservation guarantee: two of the three repository environments (both related to the air quality measurements) gave no indication of guaranteed preservation of information. Mechanisms for guarding historical information exist, but the readability and usability of this information in the future is not explicitly guaranteed. In the case of the legislated BAG registry, however, this is obligatory. In all three cases the organisations strive to maintain readability and usability in the future by updating the repository environments together with the stakeholders.
These findings on compliance with recordkeeping principles are directly related to the assessment form that was elaborated and used for this research. During the application of the form several flaws were encountered. These are briefly described in the next section.
5.6 Assessment of the assessment framework
This research has been an attempt to assess current open data sources within a governmental context. The assessment has been executed with an assessment form based on literature.
The assessment form has served to analyse the open data in a structured way and to look at the implementation of recordkeeping principles. However, it has been difficult to assess all aspects with sufficient depth, mainly because of the limited information available online. Initially, the different institutions (creators, processors and/or publishers) were deliberately not contacted, to guarantee that the assessment of trustworthiness would be based on information that is publicly available to the user. However, since consultation was expected to yield additional information valuable for the research, Kadaster, RIVM and Civity were subsequently consulted with some questions related to the assessment form. This has definitely resulted in valuable additional information.
It has been possible to analyse the different data sources and data platforms by reviewing
the details that were available online (legislation, manuals, services, APIs and downloaded
data), and to draw conclusions about whether, and to what level, the recordkeeping
principles are present. The subdivision into dataset level, record level and attribute level
proved impractical: documentation mostly applied to the dataset or data model as a whole,
while metadata at lower levels were domain dependent.
Two aspects were not present in the assessment form but proved very relevant to the
objective of this thesis on the trustworthiness of information:
• can the available information be (re)used to serve as evidence (in legal contexts),
which is an indication of the trustworthiness of the information
• what can be said about the current (re)use of the different datasets, which also
indicates a level of trustworthiness
Both aspects can and should be added to the assessment form to complete the assessment
from a user-oriented point of view.
The findings of this research have been translated into recommendations that can help
promote trust in open data environments and the preceding data governance environments.
These are described in the next chapter.
6. Recommendations to promote trust of open data environments
6.1 Translation of the research findings to recommendations
This research started from two basic assumptions:
• recordkeeping metadata can support the accountability of an organisation and the
traceability, authenticity and sustainability of records; it provides information
about the context of the creation of records, needed for proper interpretation and
(re)use
• the records continuum model reflects the (continuous and limitless) reuse of records;
digital technology enables and supports this reuse; therefore a focus is needed on
the creation, capture and maintenance of records in a way that guarantees
sustainable accessibility. This implies the use of repositories that guarantee
integrity and sustainable accessibility
This research has shown that the availability of recordkeeping metadata can help (re)users
of open data to assess the usability of the open data for their goals: in what context was
the data created, by whom, what processing has taken place, and what is finally published?
The research has shown that metadata is mostly available at dataset level. However, not
all metadata is publicly available: details about the processes used to create and/or
process the data are mostly missing. Most metadata is also not available in a form that
supports automated (re)use of data in which the metadata can be taken into account. Some
metadata is available for individual information objects as well, but this mostly concerns
domain-related metadata. Metadata are not inseparably connected to each individual
information object. This means that data can be accessed and (re)used without taking into
account its original context and significance, which allows for inappropriate reuse,
whether involuntary or deliberate. Such inappropriate reuse affects the trustworthiness of
the (re)used information. Metadata that are inseparable from the information object can
prevent misinterpretation and manipulation. To support the trustworthiness of open data it
is therefore recommended to investigate whether a standardized minimum set of
inseparable metadata for each individual information object can be implemented, to
support the trustworthiness of (automated) reuse of individual information objects. This is
elaborated in chapter 6.2.
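To illustrate what metadata in a machine-processable form could enable, the following Python sketch checks a dataset description for the presence of trust-relevant fields. The description itself and the field names (`publisher`, `processing`, `methodology`) are illustrative assumptions loosely modelled on DCAT-style vocabularies, not an actual published record of the assessed datasets.

```python
import json

# Hypothetical dataset-level description, loosely modelled on DCAT-style fields.
dataset_description = json.loads("""
{
  "title": "Air quality measurements",
  "publisher": "RIVM",
  "issued": "2020-01-01",
  "license": "CC0",
  "processing": null,
  "methodology": null
}
""")

# Fields that, following the trust components in this research, a re-user
# would want to see present before relying on the data.
TRUST_FIELDS = ["publisher", "issued", "license", "processing", "methodology"]

# Collect the trust-relevant fields that are absent or empty.
missing = [f for f in TRUST_FIELDS if not dataset_description.get(f)]
print("Missing trust-relevant metadata:", missing)
# The description above lacks details about processing and methodology,
# the same gap observed for the assessed datasets.
```

A check of this kind only becomes possible once metadata is published in a standardized, machine-readable form; that is precisely what the recommendation aims at.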
This research found that the sustainability of two of the repositories is not guaranteed. It
has also shown that the characteristics of the repository, the processes to publish
information in the repository and the way this information can be accessed determine the
trustworthiness of the data. All of these aspects affect whether the integrity of the data can
be guaranteed. Is the source data served without changes? And if so, who can guarantee
that, once information objects start to “wander”, they stay unchanged and are used
correctly? All repositories contain copies of the source data, and all repositories allow for
reuse by copying information, thereby separating it from the repository. The
trustworthiness of records can also be improved by diminishing the “distance” between
the source information system and the repositories for the (re)use of the open data, the
ideal situation being direct access to the information at the source without copying (parts
of) the information. This can be done, for instance, by making use of the principles of
linked data for access to records and by incorporating additional functionality in source
databases to guarantee integrity and preservation. This is elaborated in chapter 6.3.
6.2 Standardized and inseparable minimum metadata set for individual
information objects
The research findings indicate that the trustworthiness of open data is affected by the lack
of inseparable (standardized) metadata at the level of individual information objects. A
standardized minimum set of inseparable and unchangeable metadata for each individual
information object can help strengthen this trustworthiness.
If by-design principles are applied, a standardized set of metadata could be attached to
every individual information object when it is created and/or changed. The standardized
set could consist of different groups of metadata that support the trustworthiness of the
data and the sensemaking of the data:
• a group of metadata that uniquely identifies the information object: a unique
identifier, the creator, the type of information object or its classification (based on a
taxonomy), the creation date, the date of fixation (after which changes are no longer
possible), the period for which the data is valid, etc.
• a group of metadata that helps to comply with legislation: registration of retention
periods, confidentiality of the information object, and the purpose of creation to
determine allowed (re)use (in the case of open data, retention periods are only
significant when the data is not provided physically but by means of linked data;
confidentiality is only relevant in the source information system, to determine whether
the information can be published or linked to)
• a group of metadata that details the context of creation and the related quality of
the information object: information about the process that created and/or changed
the object, the methodologies used, information about quality and uncertainty, and
quality control and corrections
• a group of metadata that specifically supports further sensemaking (in addition to the
metadata already mentioned that support sensemaking, such as creator, type /
classification, purpose, context of creation and methodologies) and may be more
domain dependent: format, topic, taxonomy, location, time and other metadata that
support the sensemaking of the data.
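As a minimal sketch of how such a standardized set could be enforced by design, the following Python fragment models the four groups above as a single immutable structure. All field names are illustrative assumptions for this sketch, not the MDTO standard under development.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass(frozen=True)   # frozen: fields cannot be changed after creation
class RecordMetadata:
    # identification group
    identifier: str                      # unique identifier
    creator: str                         # creating organisation or process
    object_type: str                     # type / taxonomy-based classification
    created: str                         # creation date (ISO 8601)
    fixated: Optional[str] = None        # date of fixation; no changes afterwards
    valid_period: Optional[str] = None   # period for which the data is valid
    # legislation group
    retention_period: Optional[str] = None  # e.g. "P10Y" (ISO 8601 duration)
    confidentiality: str = "public"
    purpose: Optional[str] = None        # purpose of creation, bounds allowed (re)use
    # context-of-creation / quality group
    process: Optional[str] = None        # process that created/changed the object
    methodology: Optional[str] = None
    quality: Optional[str] = None        # quality, uncertainty, corrections
    # domain-dependent sensemaking group
    domain: dict = field(default_factory=dict)  # e.g. location, time, topic

meta = RecordMetadata(
    identifier="obj-0001",
    creator="Example municipality",
    object_type="address",
    created="2020-01-15",
    domain={"location": "Rotterdam"},
)
```

An immutable structure of this kind, serialized together with each information object, approximates the "inseparable and unchangeable" requirement at the application level; true inseparability would still need support from the storing repository.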
This standardized minimum set of inseparable and unchangeable metadata for each
individual information object could be taken into account in the design of every new
information system. It can be based on, and integrated with, the new metadata standard
under development for the sustainable accessibility of governmental information objects
(MDTO).
Standardizing the domain-independent metadata supports the interoperability of data in
integrated analyses and the automation of processes such as retention, publication and
verification of allowed (re)use.
Per-domain standardization of the domain-dependent metadata will support the
sensemaking of the data, help verify the (quality of the) content, and enable automated
processing and analysis (including the use of artificial intelligence). This supports the
trustworthiness of open data, strengthening both social-based and content-based trust.
6.3 Source information system as archiving and publication platform
The research findings indicate that if information objects are not managed and maintained
in their original source information system or in a recognized and managed publication
environment, and copying is allowed, they may start to “wander”. When that happens (as
is the case with the majority of open data) it is impossible to guarantee that the
information stays unchanged and is used correctly. Nor is it possible to guarantee
compliance with legislation regarding retention periods, privacy regulation and
purpose-bound (re)use of records.
All three datasets investigated in this research concern open data that do not reside in the
information system where they were created. There can be several reasons for this.
Separating the data to be published as open data from the other, private data may be
needed to provide the open data in an environment specifically made for its use.
Separation may also be needed because of differing governance rules between the source
information system and the publication information system. The downside of this
separation is that the information might no longer be the same as in the source database,
because of changes made while copying the information, or because lighter regulation of
the publication environment allows for manipulation of the data.
The trustworthiness of open data can be improved by diminishing the “distance” between
the source information system and the information system used for the (re)use of the open
data, the ideal situation being direct access to the information at the source without
extracting or copying it. This can be done using services and APIs that directly access the
source database, and can be further supported by the principles of linked data. At the same
time it is relevant to incorporate functionality in the source information system that
guarantees the integrity and preservation of the information. This would lead to an
information infrastructure in which all data is managed in its source information system
during its complete lifecycle, from creation to archiving or deletion. Figure 6.1 shows a
possible architecture with management of metadata by design and source repositories
with recordkeeping functionality, from which data can be reused based on linked data
principles.
Figure 6.1 Possible information architecture with integral recordkeeping functionality
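One way the integrity guarantee in the source repository could be realized is sketched below, under the assumption of simple content hashing (the sources do not describe the investigated organisations' actual mechanisms): a fingerprint that binds an information object to its metadata, so that a copy that has "wandered" can still be verified against the source. The record and metadata values are hypothetical.

```python
import hashlib
import json

def seal(content: bytes, metadata: dict) -> str:
    """Fingerprint binding an information object to its metadata.

    Serializing the metadata with sorted keys makes the fingerprint
    deterministic; any later change to content or metadata changes it,
    so a re-user can detect that a copy was altered after leaving the source.
    """
    canonical = json.dumps(metadata, sort_keys=True).encode("utf-8")
    return hashlib.sha256(content + b"\x00" + canonical).hexdigest()

# Hypothetical measurement record and its (illustrative) metadata.
record = b"PM2.5;2020-07-10T12:00;18.4"
metadata = {"identifier": "meas-001", "creator": "RIVM", "created": "2020-07-10"}

fingerprint = seal(record, metadata)   # published by the source repository

# A re-user verifies an obtained copy against the published fingerprint:
assert seal(record, metadata) == fingerprint                          # intact copy
assert seal(b"PM2.5;2020-07-10T12:00;10.0", metadata) != fingerprint  # tampered copy
```

The fingerprint only shifts the trust question to the channel through which it is published; it therefore complements, rather than replaces, a managed source repository.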
This concept of publishing information from its source information system (by means of,
for instance, linked data) helps to better protect the information, to better manage
compliance with legal requirements (in combination with metadata related to the GDPR,
archival law, confidentiality of data, open data legislation and “purpose of use”
regulations) and to prevent physical “wandering” of information. However, current
technical limitations of information systems impede low-level reuse of large amounts of
data that are not copied to the processing platform: consultation, referencing or processing
of individual information objects is possible, but doing so for complete datasets with
millions of information objects, as is done within data science environments for instance,
is more difficult. This is therefore a model to grow towards in the coming years.
The “architectural” approach to the design of data models, information processing,
information systems and architectures allows for the incorporation of recordkeeping
principles that strengthen the trustworthiness of information. Both standardized,
inseparable and unchangeable metadata and source information systems with integrity
and preservation guarantees help to reach these goals. This is stressed again in the
concluding chapter of this research.
7. Conclusions
The concept of trust has been the key topic of this research, with a focus on the data that
governments generate and open up for reuse. Government bodies are responsible for the
execution of many tasks for which trustworthy information is key. For the execution itself,
for accountability over that execution, and for the reuse of information, information is
created, maintained and published that is considered trustworthy.
That Dutch government bodies are considered trustworthy, and therefore supposedly
produce accurate and reliable information, is part of our social culture and related to the
concept of social trust. However, opening up government data, in combination with new
individual ways of creating and/or verifying data, makes it possible to test this
social-based trust by assessing the quality of the content of the data. The resulting
content-based trust can support already existing social-based trust, or it can undermine it
when the quality is poor or cannot be assessed because of lacking transparency. This has
happened in recent years in several cases, from environmental measurements by
government bodies to fraud profiling and taxation issues. Gillian (2018) concluded that
many socio-cognitive factors are in play and that a reliable source alone does not
consistently determine trustworthiness for users. Yoon (2014) concludes on the trust of
data in repositories that “trust in data itself plays a distinctive and important role for users
to reuse data, which may or may not be related to the trust in repositories”.
The objective of this research was to determine whether open government data can be
trusted, based on recordkeeping principles that support trustworthiness. The research has
shown that available open data do incorporate recordkeeping principles, but at different
levels and in different ways. The current Dutch open data landscape is diverse.
Governments are opening up, but do not yet have everything in place to support the
trustworthiness of information. The assessment of three datasets has shown that
descriptive metadata vary and lack the metadata that support specific trust components,
that domain (meta)data models vary and are not always available, that data collection and
data processing standards vary and are not always available, and that (source) repositories
seem to lack guarantees to secure integrity and accessibility over time. The availability of
metadata and of adequate repositories that support the trustworthiness and sensemaking
of data can be strengthened.
To strengthen social-based and content-based trust, this research has recommended the
further incorporation of recordkeeping principles by creating standards for descriptive
metadata at the level of individual information objects that remain valid during the
complete life cycle of these objects and can be assessed by the re-users of the information.
At the same time, the information can be protected while remaining transparent, by
promoting that the data reside in (source) repositories that guarantee integrity,
accessibility and usability over time.
It has to be kept in mind continuously that data are not neutral: the design of sensors, the
modelling of data models, the classification of information, the design of algorithms and
the results of analysis are all human activities that start from an interpreted and/or
subjective theory or design, which influences the way the data can be used and interpreted.
Strengthening social-based and content-based trust by incorporating recordkeeping
principles in the design of data models, data processing and information systems will
further foster confidence and trust. Co-creation and participation of creators and (re)users
of data help to establish an environment in which data truly fulfils its function of
supporting the daily processes of policy making, policy application and the execution of
the many governmental processes.
What has not been incorporated in this research, and could be a topic for further research,
is whether the absence of the identified recordkeeping principles, or the inability to prove
their presence, actually influences whether or not the information is used. An attempt to
reflect on this has been incorporated in the assessment of the three datasets and reveals
the necessity and relevance of investigating this (considering, for instance, that the (re)use
of the air quality data was not known).
Trust and authoritativeness in the digital world have to be created by combining many
different key components: trustworthy digital identities, authentication mechanisms to
verify those identities, and rights attached to information objects determining which
identities may create, update or read them. To guarantee the robustness of an information
ecosystem with respect to trust and authoritativeness, the ecosystem has to be created by
design, incorporating these key components. This is already difficult for government
bodies, and seems almost impossible for the diverse landscape of information systems in
the connected world.
Bibliography
Acker, Amelia (2017), “When is a record?”, in Research in the Archival Multiverse, edited by Anne J. Gilliland, Sue McKemmish and Andrew J. Lau. Monash University Publishing, pp. 288-323.
Bearman, David (1993), “Record-Keeping Systems”, Archivaria, 36 (1993), pp. 16-36.
Borglund, E. and Engvall, T. (2014), “Open data?: Data, information, document or record?”, Records Management Journal, Vol. 24 No. 2, pp. 163-180.
Borgman, Christine L. (2015), Big Data, Little Data, No Data: Scholarship in the Networked World. Cambridge, MA: MIT Press.
Boyd, Danah and Kate Crawford (2012), “Critical Questions for Big Data”, Information, Communication & Society, 15:5, pp. 662-679.
Casellas, L.E., Oliveras, S. and Reixach, M. (2012), “The authenticity of Data-Centric Systems at the Girona City Council”, InterPARES 3 Project, Catalonia TEAM Case Study. Available at: www.interpares.org/ip3/display_file.cfm?docip3_catalonia_cs02_final_report_EN.pdf
Ceolin, D. et al. (2016), “Combining User Reputation and Provenance Analysis for Trust Assessment”, Journal of Data and Information Quality (JDIQ), Vol. 7(1-2), pp. 1-28.
Donaldson, D.R. and Conway, P. (2015), “User conceptions of trustworthiness for digital archival documents”, Journal of the Association for Information Science and Technology, Vol. 66(12), pp. 2427-2444.
Duranti, L. (2018), “Whose truth? Records and archives as evidence in the era of post-truth and disinformation”, in Brown, Caroline (ed.), Archival Futures. London: Facet Publishing.
Faniel, I.M., Frank, R.D. and Yakel, E. (2019), “Context from the data reuser’s point of view”, Journal of Documentation, Vol. 75(6), pp. 1274-1297.
Floridi, L. (2011), The Philosophy of Information. Oxford University Press, reprinted 2014.
Gillian O. (2018), AA02 User perspectives of trust, InterPARES Trust Project. Available at: https://interparestrust.org/assets/public/dissemination/AA02FinalReport.pdf
Gilliland, A.J. (2016), “Setting the stage”, in Introduction to Metadata: Pathways to Digital Information, Third Edition, Murtha Baca (ed.). Available at: http://www.getty.edu/publications/intrometadata/setting-the-stage
Heersink, H., Bohmer, R. and Giesen, S. (2017), “Verkenning raakvlakken GDI in relatie tot digitale archivering en duurzame toegankelijkheid”, Eindrapport ICTU C867.
Hurley, Grant, Valerie Léveillé and John McDonald (2016), Managing Records of Citizen Engagement Initiatives: A Primer. InterPARES Trust Project.
International Organization for Standardization (ISO) (2001), TC 46/SC 11. ISO 15489-1:2001 Information and Documentation. Records Management. Part 1: General, 1st ed., International Organization for Standardization, Geneva.
Jaakkola, H., Makinen, T. and Etelaaho, A. (2014), “Open data: opportunities and challenges”, Proceedings of the 15th International Conference on Computer Systems and Technologies, Ruse, Bulgaria.
Janssen, M., Charalabidis, Y. and Zuiderwijk, A. (2012), “Benefits, adoption barriers and myths of open data and open government”, Information Systems Management, Vol. 29 No. 4, pp. 258-268.
Kelton, K., Fleischmann, K.R. and Wallace, W.A. (2008), “Trust in digital information”, Journal of the American Society for Information Science and Technology, 59(3), pp. 363-374.
Koesten, L., Gregory, K., Groth, P. and Simperl, E. (2020), “Talking datasets: Understanding data sensemaking behaviours”. Available at: https://arxiv.org/abs/1911.09041
Kool, L., J. Timmer, L. Royakkers and R. van Est (2017), Opwaarderen – Borgen van publieke waarden in de digitale samenleving. Den Haag: Rathenau Instituut.
Lemieux, V.L., Gormly, B. and Rowledge, L. (2014), “Meeting Big Data challenges with visual analytics: The role of records management”, Records Management Journal, Vol. 24 Issue 2, pp. 122-141. Available at: https://doi.org/10.1108/RMJ-01-2014-0009
Lemieux, V.L. (2017), “A Typology of Blockchain Recordkeeping Solutions and Some Reflections on their Implications for the Future of Archival Preservation”. Available at: http://dcicblog.umd.edu/cas/wp-content/uploads/sites/13/2017/06/Lemieux.pdf
Léveillé, V. and Timms, K. (2015), “Through a Records Management Lens: Creating a Framework for Trust in Open Government and Open Government Information”, Canadian Journal of Information and Library Science, 39, No. 2, pp. 154-190.
Mazon, J.N., Zubcoff, J.J., Garrig, I., Espinosa, R. and Rodríguez, R. (2012), “Open business intelligence: on the importance of data quality awareness in user-friendly data mining”, Proceedings of the 2012 Joint EDBT/ICDT Workshops, Berlin, Germany.
Mosley, M. (2008), “DAMA-DMBOK Functional Framework”. DAMA International.
Nationaal Archief (2016e), “DUTO: Belangen in balans: Handreiking voor waardering en selectie van archiefbescheiden in de digitale tijd”. Available at: https://www.nationaalarchief.nl/archiveren/kennisbank/handreiking-waardering-en-selectie
Reigeluth, Tyler (2014), “Why data is not enough: digital traces as control of self and self-control”, Surveillance & Society, 12(2), pp. 243-254.
Richards, M. (2015), Software Architecture Patterns: Understanding Common Architecture Patterns and When to Use Them. O’Reilly Media.
Richards, Neil M. and Jonathan H. King (2013), “Three paradoxes of big data”, Stanford Law Review, September 2013. Available at: https://review.law.stanford.edu/wp-content/uploads/sites/3/2016/08/66_StanLRevOnline_41_RichardsKing.pdf
Serra, L.E.C. (2014), “The mapping, selecting and opening of data: the records management contribution to the open data project in Girona city council”, Records Management Journal, Vol. 24 No. 2, pp. 87-98.
Strong, D.M., Lee, Y.W. and Wang, R.Y. (1997), “10 potholes in the road to information quality”, IEEE Computer, 30(8), pp. 38-46.
Suderman, J. and Timms, K. (2017), NA08 Open Data, Open Government and Big Data: Implications for the Management of Records in an Online Environment. Final Report. Available at: https://interparestrust.org/assets/public/dissemination/IPT_NA08_FinalReport_1Oct2016_fordistribution_.pdf
Sunlight Foundation (2010), Ten Principles for Opening Up Government Information. Available at: https://sunlightfoundation.com/policy/documents/ten-open-data-principles/
Tennis, J.T. (2019), “Evidence of Authenticity through ‘Metadata’ and its Sources in Records Preservation”, NA16 Metadata: Mutatis mutandis – Design Requirements for Authenticity in the Cloud and Across Contexts. Available at: https://interparestrust.org/assets/public/dissemination/Tennis.pdf
Ubaldi, B. (2013), “Open government data: towards empirical analysis of open government data initiatives”, OECD Working Papers on Public Governance, No. 22, OECD Publishing. Available at: http://dx.doi.org/10.1787/5k46bj4f03s7-en
Upward, F., Reed, B., Oliver, G. and Evans, J. (2018), Recordkeeping Informatics for a Networked Age. Monash University Publishing, Clayton, Victoria.
Wang, X., Govindan, K. and Mohapatra, P. (2011), “Collusion-resilient quality of information evaluation based on information provenance”, IEEE 8th Communications Society Conference on Sensor, Mesh and Ad Hoc Communications and Networks.
Welle Donker, Frederika and Bastiaan van Loenen (2017), “How to assess the success of the open data ecosystem?”, International Journal of Digital Earth, 10:3, pp. 284-306. Available at: http://dx.doi.org/10.1080/17538947.2016.1224938
Yeo, Geoffrey (2018), Record, Information and Data: Exploring the Role of Record-Keeping in an Information Culture. University College London.
Yeo, Geoffrey (2017), “Information, records, and the philosophy of speech acts”, in Frans Smit, Arnoud Glaudemans and Rienk Jonker (eds), Archives in Liquid Times. ’s-Gravenhage, pp. 93-118. Available at: http://www.oapen.org/search?identifier=641001
Yoon, A. (2014), “End users’ trust in data repositories: definition and influences on trust development”, Archival Science, 14, pp. 17-34.
Zuiderwijk, A., Janssen, M. and Dwivedi, Y.K. (2015), “Acceptance and use predictors of open data technologies: drawing upon the unified theory of acceptance and use of technology”, Government Information Quarterly, Vol. 32 No. 4, pp. 429-440.